One of the great things about WordPress is its portability and its popularity. It is extremely easy for a WordPress blogger to move their entire blog, comments and all, between different hosting providers without the use of any complex database languages such as SQL.
Every WordPress installation provides the option to import and export data between WordPress blogs. This is not limited to the blog entries themselves but can also include each post’s categories, tags, comments, drafts and even spam! It does all this using the WordPress eXtended RSS document format, WXR.
The WXR format is based on the Really Simple Syndication (RSS) specification, which is a very popular dialect of XML. RSS was designed as a syndication format for websites that wish to share and serialise some of their data. http://www.rssboard.org/
A web syndication specification might seem an odd choice for a blog exporting tool, but the popularity of RSS on today’s Internet, its simplicity and its extensibility through 3rd party extensions make it a great choice. Being an XML dialect also means you can open the file in any text editor and have complete access to all blog data in a human-readable mark-up format (in a layout not too different from an HTML file).
To create a WXR export file of your own you need to log in to your WordPress Dashboard, scroll down to Tools and select Export. A filter option allows you to drill down to specific data to trim your export file size. If you are exporting the complete site I’d recommend changing the Statuses filter to ‘Published’. If left as ‘All Statuses’, all of the blog’s redundant auto-saved entries will be included, which needlessly duplicate the published articles.
Once you have pressed the Download Export File button and the download has finished, you should have an XML document named wordpress-[yyyy]-[mm]-[dd].xml. You can open this with any text editor, even Windows Notepad. However, it is preferable to use a text editor with XML syntax colouring as it makes the document much easier to read. Notepad++ http://notepad-plus-plus.org/ is a good choice for Windows users while TextMate http://macromates.com/ is probably the best choice for Mac OS X.
As the title suggests, in this post I will attempt to decode the content of the WordPress eXtended RSS document. This means I will list, in the order they appear, the RSS elements contained within a standard export and briefly describe their purpose.
This will not be a tutorial on XML or RSS and I will assume you have some understanding of either. However, if this is not the case, things should not be too hard to follow, especially for those familiar with HTML documents.
<!-- This is a WordPress eXtended RSS file generated by WordPress as an export of your blog. -->
At the top of the WXR file there is a large commented section explaining the purpose of the document and, in case you have forgotten, instructions on how to import the file into a WordPress blog.
Beyond the comments is the required <rss> element containing 5 namespace extensions as well as the RSS version as a numeric value. The extensions listed include the RDF Site Summary content module, the Well-Formed Web comment API, the Dublin Core metadata element set and 2 WordPress extensions. If this isn’t making too much sense then don’t worry, as it is not really important unless you are developing an RSS parser.
The namespaces listed are unique, with each serving specific functions that the base RSS specification does not cover. Each XML namespace declaration starts with xmlns: and is followed by an abbreviated title (the prefix) for the namespace, which is usually an acronym. The URL that follows each prefix is required and uniquely identifies the namespace; by convention it points to a webpage that provides further information on the namespace.
xmlns:dc="http://purl.org/dc/elements/1.1/" is an example of the Dublin Core element set namespace.
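If you would rather list the namespace declarations programmatically than read them off the <rss> element, something like the following Python sketch should do it. The sample header is a trimmed-down stand-in that I have made up for illustration; a real export declares more namespaces, and the version number in the wp namespace URL is an assumption that varies between WordPress releases.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down WXR header used only for illustration.
SAMPLE = """<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:wp="http://wordpress.org/export/1.0/">
  <channel><title>Example blog</title></channel>
</rss>"""


def list_namespaces(xml_text):
    """Return a {prefix: url} dict of the namespace declarations in the document."""
    found = {}
    source = io.BytesIO(xml_text.encode("utf-8"))
    # The "start-ns" event fires once per xmlns: declaration as it is parsed.
    for _event, (prefix, url) in ET.iterparse(source, events=("start-ns",)):
        found[prefix] = url
    return found


namespaces = list_namespaces(SAMPLE)
for prefix, url in namespaces.items():
    print(f"xmlns:{prefix} = {url}")
```

To run it against your own export, read the file's contents into a string and pass that to list_namespaces() in place of SAMPLE.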
Below the <rss> element is the <channel> container element. This holds all the child elements and data related to the WordPress blog. You can find the closing </channel> and </rss> elements at the bottom of the RSS document. At the top of the <channel> we have the elements that are associated with the WordPress blog metadata.
<title> Contains the site title of the blog.
<link> Is the URL of the blog as determined by WordPress.
<description> Is a tagline that can be modified in the Dashboard under General Settings.
<pubDate> Is the time and date that the WXR document was created. It is in the RFC-822 format http://asg.web.cmu.edu/rfc/rfc822.html as required by the RSS standard. The format should be self-explanatory except for the last numeric value, which represents the local offset from GMT using a +/-hhmm format. Plus 2 hours from GMT would be represented as +0200. The WordPress time zone can be changed in the Dashboard under General Settings, Timezone.
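As a quick illustration of the +/-hhmm offset, Python's standard library can parse an RFC-822 style date directly. The date string below is a made-up example rather than one taken from a real export.

```python
from email.utils import parsedate_to_datetime

# Hypothetical <pubDate> value with an offset of plus 2 hours from GMT.
raw_pub_date = "Tue, 02 Mar 2010 14:30:00 +0200"

pub_date = parsedate_to_datetime(raw_pub_date)

print(pub_date.isoformat())   # local time with its offset attached
print(pub_date.utcoffset())   # the +0200 part as a timedelta
```

The resulting datetime object carries the offset with it, so it can be converted to GMT or any other time zone without guessing where the blog was hosted.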
<generator> Is the name of, or a URL pointing to the homepage of, the application that was used to create the RSS document.
<language> Is the primary language the blog is written in as determined by General Settings, Language in the WordPress Dashboard. A list of valid codes used to represent the language can be found at http://www.rssboard.org/rss-language-codes.
<wp:wxr_version> This is our first example of an extended RSS element. We can recognise that it does not belong to the RSS specification as the element name contains a colon. Left of the colon is the namespace prefix of the extension while right of it is the element name. wp:wxr_version is the version number of the WordPress eXtended RSS format.
<wp:base_site_url> Is the root URL of the WordPress hosting provider.
<wp:base_blog_url> Is the root URL of the WordPress blog.
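The blog metadata elements described so far could be read with a short Python sketch along these lines. The sample channel is invented for illustration, and the wp namespace URL (including its 1.0 version number) is an assumption; check the xmlns:wp declaration in your own export and adjust it to match.

```python
import xml.etree.ElementTree as ET

# Assumption: this URL matches the xmlns:wp declaration in your export file.
NS = {"wp": "http://wordpress.org/export/1.0/"}

# Hypothetical sample data standing in for a real export file.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.0/">
  <channel>
    <title>Example blog</title>
    <link>http://example.wordpress.com</link>
    <description>Just another weblog</description>
    <language>en</language>
    <wp:wxr_version>1.0</wp:wxr_version>
    <wp:base_site_url>http://wordpress.com/</wp:base_site_url>
    <wp:base_blog_url>http://example.wordpress.com</wp:base_blog_url>
  </channel>
</rss>"""

channel = ET.fromstring(SAMPLE).find("channel")

# Plain RSS elements are looked up by name alone; extended elements
# (the ones with a colon) also need the prefix-to-URL map.
metadata = {
    "title": channel.findtext("title"),
    "link": channel.findtext("link"),
    "language": channel.findtext("language"),
    "wxr_version": channel.findtext("wp:wxr_version", namespaces=NS),
    "base_blog_url": channel.findtext("wp:base_blog_url", namespaces=NS),
}

for key, value in metadata.items():
    print(f"{key}: {value}")
```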
<wp:category> Together these elements form the complete collection of categories associated with the blog. You can view and edit the list within the Dashboard under Posts, Categories. Each category is given its own <wp:category> element, which contains the following 3 child elements.
- <wp:category_nicename> Is the category name in a URL friendly format.
- <wp:category_parent> If the category belongs to a hierarchy then the parent category is listed.
- <wp:cat_name> The original name of the category contained within a <![CDATA[ ]]> section. The CDATA, or character data, enclosure tells the XML/RSS parser not to process the text contained within as mark-up. This is a safety measure in case the text contains any illegal characters that could generate errors. http://www.w3schools.com/xml/xml_cdata.asp
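To see the character data enclosure at work, here is a small Python sketch that reads the category elements. The two categories are made up, with names that deliberately contain characters (& and < >) that would break the XML outside a CDATA section, and the wp namespace URL is again an assumption.

```python
import xml.etree.ElementTree as ET

# Assumption: this URL matches the xmlns:wp declaration in your export file.
NS = {"wp": "http://wordpress.org/export/1.0/"}

# Hypothetical categories used only for illustration.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.0/">
  <channel>
    <wp:category>
      <wp:category_nicename>news-views</wp:category_nicename>
      <wp:category_parent></wp:category_parent>
      <wp:cat_name><![CDATA[News & Views]]></wp:cat_name>
    </wp:category>
    <wp:category>
      <wp:category_nicename>site-updates</wp:category_nicename>
      <wp:category_parent>news-views</wp:category_parent>
      <wp:cat_name><![CDATA[Site <Updates>]]></wp:cat_name>
    </wp:category>
  </channel>
</rss>"""

channel = ET.fromstring(SAMPLE).find("channel")

categories = [
    {
        "nicename": cat.findtext("wp:category_nicename", namespaces=NS),
        # An empty parent element means the category sits at the top level.
        "parent": cat.findtext("wp:category_parent", namespaces=NS) or None,
        # The parser hands back the CDATA text untouched, & and < included.
        "name": cat.findtext("wp:cat_name", namespaces=NS),
    }
    for cat in channel.findall("wp:category", NS)
]

for category in categories:
    print(category)
```

The parser strips the <![CDATA[ ]]> wrapper for you, so the code never has to deal with it directly.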
<wp:tag> Together these elements form the complete collection of the blog’s post tags. You can view and edit the post tags within the Dashboard under Posts, Post Tags. Each <wp:tag> element contains the following 2 child elements.
- <wp:tag_slug> Is the URL friendly name of the tag.
- <wp:tag_name> Is the original name of the tag contained within a character data enclosure.
<cloud> Is a pointer to the rssCloud API, which is a blog monitoring service supported by WordPress.com. It enables a supporting client to receive instant notification when the blog is updated. http://www.rssboard.org/rsscloud-interface
<image> Is a logo belonging to the site that can be displayed by RSS clients. You can modify the logo under the General Settings, Blog Picture / Icon dialog in the Dashboard. There are strict image size and format requirements imposed by the RSS standard. http://www.rssboard.org/rss-specification#ltimagegtSubelementOfLtchannelgt
<atom:link rel="search"> Is a URL pointing to the OpenSearch description document supplied by WordPress. It gives supporting RSS clients and web browsers an easy means to submit search terms to the blog and receive results in a standardised XML format. http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_description_document
<atom:link rel="hub"> Is a URL pointing to the Google-designed PubSubHubbub notification service that is supported by WordPress. In my opinion this is easier to implement and use than the alternative <cloud> service that offers similar functionality. http://code.google.com/p/pubsubhubbub/
That is the end of the RSS metadata related elements. Below is the list of child elements contained within the <item></item> elements. Items are repeated multiple times as each item holds a single blog post, article or page.
<title> Title of the post.
<link> URL to the post.
<pubDate> Time and date that the post was posted online.
<dc:creator> Lists the author of the post. The element is a Dublin Core RSS extension as the RSS specification doesn’t contain any suitable elements for this role.
<category> Each category associated with the post is given 2 category elements. The first element contains just the category name, while the second element contains both the category name and the URL friendly nicename attribute.
<guid> Is the globally unique identifier used for the identification of the blog post by RSS and WordPress clients. The isPermaLink="false" attribute just means that this identifier should not be assumed to be a URL that can be opened in a web browser.
<description> In RSS documents this element contains the synopsis of the item but in WXR it is left blank.
<content:encoded> Is the replacement for the restrictive RSS <description> element. Enclosed within a character data enclosure is the complete WordPress formatted blog post, HTML tags and all.
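Putting the item elements covered so far together, a minimal Python sketch for pulling out each post might look like this. The single sample item is invented for illustration and only includes the elements described above; a real export will carry more.

```python
import xml.etree.ElementTree as ET

NS = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical item used only for illustration.
SAMPLE = """<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>Hello world!</title>
      <link>http://example.wordpress.com/2010/03/02/hello-world/</link>
      <pubDate>Tue, 02 Mar 2010 14:30:00 +0000</pubDate>
      <dc:creator><![CDATA[admin]]></dc:creator>
      <category><![CDATA[Uncategorized]]></category>
      <category domain="category" nicename="uncategorized"><![CDATA[Uncategorized]]></category>
      <guid isPermaLink="false">http://example.wordpress.com/?p=1</guid>
      <description></description>
      <content:encoded><![CDATA[<p>Welcome to <strong>my blog</strong>.</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""

posts = []
for item in ET.fromstring(SAMPLE).iter("item"):
    posts.append({
        "title": item.findtext("title"),
        "author": item.findtext("dc:creator", namespaces=NS),
        # Keep only the <category> elements carrying a nicename attribute,
        # so each category is counted once rather than twice.
        "categories": [
            c.get("nicename") for c in item.findall("category") if c.get("nicename")
        ],
        # The full post body, HTML tags and all.
        "html": item.findtext("content:encoded", namespaces=NS),
    })

for post in posts:
    print(post["title"], "by", post["author"], post["categories"])
```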
This is an unknown element.
<wp:post_id> This is an auto-incremental, numeric, unique identification number given to each post, article or page.
<wp:post_date> Time and date that the post was published.
<wp:post_date_gmt> Time and date in GMT that the post was published.
<wp:comment_status> A value stating whether public access for posting comments is opened or closed.
<wp:post_name> Is a unique, URL friendly nicename based on the post title.
<wp:status> Publish status of the post, with the options ‘publish’, ‘draft’, ‘pending’ and ‘private’.
<wp:post_parent> The numeric identification number of the post’s parent. This, I think, is applicable to WordPress pages, which can be nested within each other.
<wp:menu_order> I assume is related to menu navigation of nested pages.
<wp:post_type> Post type, either ‘post’, ‘page’ or ‘attachment’ (a media item).
<wp:post_password> A non-encrypted password used by WordPress to restrict reading access to the post.
<wp:is_sticky> A numeric Boolean value (0 = false, 1 = true) that determines whether the post is sticky. A sticky post will always be displayed at the top of any list of posts.
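The <wp:status> and <wp:post_type> values make it straightforward to pick out just the entries you care about. Below is a rough Python sketch that keeps only published posts; the three sample items are invented, select_items() is a helper name of my own, and the wp namespace URL is an assumption to be checked against your export.

```python
import xml.etree.ElementTree as ET

# Assumption: this URL matches the xmlns:wp declaration in your export file.
NS = {"wp": "http://wordpress.org/export/1.0/"}

# Hypothetical items: a published post, a draft post and a published page.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.0/">
  <channel>
    <item>
      <title>Hello world!</title>
      <wp:post_id>1</wp:post_id>
      <wp:status>publish</wp:status>
      <wp:post_type>post</wp:post_type>
      <wp:is_sticky>1</wp:is_sticky>
    </item>
    <item>
      <title>Unfinished thoughts</title>
      <wp:post_id>2</wp:post_id>
      <wp:status>draft</wp:status>
      <wp:post_type>post</wp:post_type>
      <wp:is_sticky>0</wp:is_sticky>
    </item>
    <item>
      <title>About</title>
      <wp:post_id>3</wp:post_id>
      <wp:status>publish</wp:status>
      <wp:post_type>page</wp:post_type>
      <wp:is_sticky>0</wp:is_sticky>
    </item>
  </channel>
</rss>"""


def select_items(root, status="publish", post_type="post"):
    """Return the items whose wp:status and wp:post_type match the arguments."""
    selected = []
    for item in root.iter("item"):
        if (item.findtext("wp:status", namespaces=NS) == status
                and item.findtext("wp:post_type", namespaces=NS) == post_type):
            selected.append({
                "id": int(item.findtext("wp:post_id", namespaces=NS)),
                "title": item.findtext("title"),
                # Convert the numeric Boolean (0 or 1) into a real Boolean.
                "sticky": item.findtext("wp:is_sticky", namespaces=NS) == "1",
            })
    return selected


root = ET.fromstring(SAMPLE)
published = select_items(root)
pages = select_items(root, post_type="page")
print(published)
print(pages)
```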
<wp:postmeta> Are containers for newer additions to the WXR document format that were introduced after the original WXR specification. Each <wp:postmeta> element contains 2 child elements.
- <wp:meta_key> Is the reference key for the meta data element.
- <wp:meta_value> Is the value for the meta data element contained within a character data enclosure.
Below is a list of the <wp:meta_key> references currently used by WXR.
delicious; is data related to the Delicious social bookmarking web service. http://www.delicious.com/
geo_latitude; is the positioning location of the author when they submitted the post. The value is the latitude in degrees using the World Geodetic System 1984 (WGS84) datum. It seems to be based on the Google Gears Geolocation API. http://code.google.com/apis/gears/api_geolocation.html
geo_longitude; is the positioning location of the author when they submitted the post. The value is the longitude coordinates.
geo_accuracy; is the horizontal accuracy of the above positioning values in metres.
geo_address; is the address determined by the above geolocation data.
geo_public; is a Boolean numeric value that determines if the geolocation data should be displayed in the post.
email_notification; is an unknown value related to the email notification service for posting comments.
_wpas_done_yup; is an unknown numeric Boolean value.
_wpas_done_twitter; is an unknown numeric Boolean value related to Twitter.
reddit; is data related to the reddit social news web service. http://www.reddit.com/
_edit_last; is an unknown reference.
_edit_lock; is an unknown reference.
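Because every <wp:postmeta> element is just a key and value pair, a post's metadata collapses neatly into a dictionary. Here is a small Python sketch of the idea; the sample values are made up, postmeta_dict() is a helper name of my own, and the wp namespace URL is an assumption.

```python
import xml.etree.ElementTree as ET

# Assumption: this URL matches the xmlns:wp declaration in your export file.
NS = {"wp": "http://wordpress.org/export/1.0/"}

# Hypothetical item with made-up metadata values.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.0/">
  <channel>
    <item>
      <title>Hello world!</title>
      <wp:postmeta>
        <wp:meta_key>geo_latitude</wp:meta_key>
        <wp:meta_value><![CDATA[51.5007]]></wp:meta_value>
      </wp:postmeta>
      <wp:postmeta>
        <wp:meta_key>geo_longitude</wp:meta_key>
        <wp:meta_value><![CDATA[-0.1246]]></wp:meta_value>
      </wp:postmeta>
      <wp:postmeta>
        <wp:meta_key>geo_public</wp:meta_key>
        <wp:meta_value><![CDATA[1]]></wp:meta_value>
      </wp:postmeta>
    </item>
  </channel>
</rss>"""


def postmeta_dict(item):
    """Collect an item's wp:postmeta key/value pairs into a dict of strings."""
    return {
        meta.findtext("wp:meta_key", namespaces=NS):
            meta.findtext("wp:meta_value", namespaces=NS)
        for meta in item.findall("wp:postmeta", NS)
    }


item = ET.fromstring(SAMPLE).find("channel/item")
meta = postmeta_dict(item)

# Values arrive as text, so convert them before doing any arithmetic.
if meta.get("geo_public") == "1":
    print(float(meta["geo_latitude"]), float(meta["geo_longitude"]))
```

If a key appears more than once on the same post, this simple version keeps only the last value.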
<wp:comment> Is a child element of the post item that contains the 12 sub-elements listed below. Each <wp:comment> element set describes a single post comment.
- <wp:comment_id> This is an auto-incremental, numeric, unique identification number given to each comment.
- <wp:comment_author> The name of the author who submitted the comment. The name value is contained within a character data enclosure.
- <wp:comment_author_email> An e-mail address provided by the author of the comment.
- <wp:comment_author_url> The URL of the author’s website provided by the author of the comment.
- <wp:comment_author_IP> The IP address belonging to the author of the comment. The IP address is automatically recorded by WordPress.
- <wp:comment_date> The date and time local to the blog that the comment was posted.
- <wp:comment_date_gmt> The date and time at GMT that the comment was posted.
- <wp:comment_content> The comment text enclosed within a character data enclosure.
- <wp:comment_approved> A numeric Boolean value to determine if the comment is displayed.
- <wp:comment_type> The type of comment. If left blank it is classed as a normal comment; otherwise a value of ‘pingback’ means it is a notification sent when another site links to the post. http://en.wikipedia.org/wiki/Pingback
- <wp:comment_parent> The numeric identification of the parent comment used when the comment is a response to a pre-existing comment.
- <wp:comment_user_id> A numeric identification belonging to the author if they were logged in when they submitted the comment.
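As a final illustration, the comment sub-elements above can be read and grouped into threads using <wp:comment_parent>. This Python sketch uses three invented comments and only a handful of the 12 sub-elements; the wp namespace URL is an assumption, and I have assumed a parent value of 0 marks a top-level comment.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: this URL matches the xmlns:wp declaration in your export file.
NS = {"wp": "http://wordpress.org/export/1.0/"}

# Hypothetical comments: a normal comment, a reply to it and a hidden pingback.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.0/">
  <channel>
    <item>
      <title>Hello world!</title>
      <wp:comment>
        <wp:comment_id>1</wp:comment_id>
        <wp:comment_author><![CDATA[Alice]]></wp:comment_author>
        <wp:comment_content><![CDATA[Great post!]]></wp:comment_content>
        <wp:comment_approved>1</wp:comment_approved>
        <wp:comment_type></wp:comment_type>
        <wp:comment_parent>0</wp:comment_parent>
      </wp:comment>
      <wp:comment>
        <wp:comment_id>2</wp:comment_id>
        <wp:comment_author><![CDATA[Bob]]></wp:comment_author>
        <wp:comment_content><![CDATA[I agree with Alice.]]></wp:comment_content>
        <wp:comment_approved>1</wp:comment_approved>
        <wp:comment_type></wp:comment_type>
        <wp:comment_parent>1</wp:comment_parent>
      </wp:comment>
      <wp:comment>
        <wp:comment_id>3</wp:comment_id>
        <wp:comment_author><![CDATA[Another blog]]></wp:comment_author>
        <wp:comment_content><![CDATA[Linked from elsewhere]]></wp:comment_content>
        <wp:comment_approved>0</wp:comment_approved>
        <wp:comment_type>pingback</wp:comment_type>
        <wp:comment_parent>0</wp:comment_parent>
      </wp:comment>
    </item>
  </channel>
</rss>"""

item = ET.fromstring(SAMPLE).find("channel/item")

comments = []
for node in item.findall("wp:comment", NS):
    comments.append({
        "id": int(node.findtext("wp:comment_id", namespaces=NS)),
        "author": node.findtext("wp:comment_author", namespaces=NS),
        "content": node.findtext("wp:comment_content", namespaces=NS),
        "approved": node.findtext("wp:comment_approved", namespaces=NS) == "1",
        # A blank type is treated as a normal comment.
        "type": node.findtext("wp:comment_type", namespaces=NS) or "comment",
        "parent": int(node.findtext("wp:comment_parent", namespaces=NS)),
    })

# Group the visible comments by the comment they reply to (0 = top level).
replies = defaultdict(list)
for comment in comments:
    if comment["approved"]:
        replies[comment["parent"]].append(comment["id"])

print(dict(replies))
```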
Hopefully that extensive list helps you out. It should be current, with all the main elements in a standard WordPress eXtended RSS document having been covered. If you find any mistakes or know the purpose of any of the unknown elements, please leave a comment.