{{Wikipedia how-to|H:EX}}
{{Linking and page manipulation|linking and diffs}}
Wiki pages can be exported in a special [[w:XML|XML]] format to [[Help:Import|import]] into another MediaWiki installation, or used elsewhere, for instance to analyse the content. See also [[m:Syndication feeds]] for exporting information other than pages, and [[Help:Import]] for importing pages.
== How to export ==
There are at least six ways to export pages:
* Paste the names of the articles into the box on [[Special:Export]], or use {{canonicalurl:Special:Export/FULLPAGENAME}}.
* Use <code>action=raw</code>. For example: https://en.wikipedia.org/w/index.php?title=Wikipedia&action=raw. It is important to use <code>/w/index.php?title=PAGENAME&action=raw</code> and not <code>/wiki/PAGENAME?action=raw</code> (see [https://phabricator.wikimedia.org/T126183 Phab T126183]).
* Use the API to fetch data in XML or JSON packaging; a sketch covering this and the previous method follows this list.
* The backup script {{tt|[https://doc.wikimedia.org/mediawiki-core/master/php/dumpBackup_8php_source.html dumpBackup.php]}} dumps all wiki pages into an XML file. {{tt|dumpBackup.php}} only works on MediaWiki 1.5 or newer, and you need direct access to the server to run it. Dumps of Wikimedia projects are (more or less) regularly made available at http://download.wikipedia.org. More help is at http://www.mediawiki.org/wiki/Manual:DumpBackup.php.
* There is an [[OAI-PMH]] interface to regularly fetch pages that have been modified since a specific time. For Wikimedia projects this interface is not publicly available. OAI-PMH wraps the exported articles in an additional container format.
* Use the [http://pywikipediabot.sourceforge.net/ Python Wikipedia Robot Framework]. This won't be explained here.
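The two URL-based methods above can be scripted. Below is a minimal sketch in Python using the third-party <code>requests</code> library; the endpoints shown are the standard <code>index.php</code> and <code>api.php</code> entry points, and the host should be adjusted to your wiki.
<source lang="python">
# Minimal sketch of the action=raw and API export methods.
import requests

# 1. action=raw returns the plain wikitext of the current revision.
wikitext = requests.get(
    'https://en.wikipedia.org/w/index.php',
    params={'title': 'Wikipedia', 'action': 'raw'},
).text

# 2. The API can wrap the same page in the XML export format.
xml = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={'action': 'query', 'titles': 'Wikipedia',
            'export': 1, 'exportnowrap': 1},
).text
</source>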
By default only the current version of a page is included. Optionally you can get all versions with date, time, user name and edit summary.
Additionally, you can copy the SQL database. This is how dumps of the database were made available before MediaWiki 1.5, and it won't be explained further here.
===Using 'Special:Export'===
For example, to export '''all pages of a namespace''':
====1. Get the names of pages to export====
* Go to [[Special:Allpages]] and choose the desired namespace.
* Copy the list of page names to a text editor.
* Put all page names on separate lines.
* Prefix the namespace to the page names (e.g. 'Help:Contents'), unless the selected namespace is the main namespace.
====2. Perform the export====
* Go to [[Special:Export]] and paste all your page names into the textbox, making sure there are no empty lines.
* Click 'Submit query'.
* Save the resulting XML to a file using your browser's save facility.
Finally:
* Open the XML file in a text editor and scroll to the bottom to '''check for error messages'''.
Now you can use this XML file to [[Help:Import|perform an import]].
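The same export can also be driven from a script instead of the browser form. The sketch below is hypothetical in its details: the form field names (<code>pages</code>, <code>curonly</code>) are taken from the Special:Export HTML form and may differ between MediaWiki versions.
<source lang="python">
# Hypothetical sketch: submit the Special:Export form from a script.
import requests

page_names = ['Help:Contents', 'Help:Editing']  # one title per line, no blank lines

resp = requests.post(
    'https://en.wikipedia.org/wiki/Special:Export',
    data={
        'pages': '\n'.join(page_names),
        'curonly': '1',  # current revisions only; omit to request full history
    },
)
resp.raise_for_status()
with open('export.xml', 'wb') as f:
    f.write(resp.content)
</source>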
====Exporting the full history====
A checkbox in the [[Special:Export]] interface selects whether to export the full history (all versions of an article) or only the most recent version. A maximum of 1000 revisions is returned per request; further revisions can be requested as detailed in [[MW:Parameters to Special:Export]] and as sketched below.
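For histories longer than 1000 revisions, the export has to be fetched in batches. The following is a sketch only, assuming the <code>offset</code> and <code>limit</code> parameters described in [[MW:Parameters to Special:Export]]; exact behaviour varies between MediaWiki versions.
<source lang="python">
# Sketch: page through a long history in batches of up to 1000 revisions,
# using the offset/limit parameters of Special:Export (assumed semantics).
import re
import requests

EXPORT_URL = 'https://en.wikipedia.org/wiki/Special:Export'

def fetch_full_history(title, batch=1000):
    """Yield raw XML batches until the history is exhausted."""
    offset = '1'  # '1' requests revisions from the beginning of time
    while True:
        xml = requests.post(EXPORT_URL, data={
            'pages': title,
            'offset': offset,
            'limit': str(batch),
        }).text
        stamps = re.findall(r'<timestamp>([^<]+)</timestamp>', xml)
        if not stamps:
            break                 # no revisions returned: done
        yield xml
        if len(stamps) < batch:
            break                 # short batch: this was the last one
        offset = stamps[-1]       # resume after the last revision received
</source>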
== Export format ==
The format of the XML file you receive is the same regardless of which export method you use. It is codified in [[w:XML Schema|XML Schema]] at http://www.mediawiki.org/xml/export-0.6.xsd. The format is not intended for viewing in a web browser, though some browsers show pretty-printed XML with "+" and "-" links to expand or collapse selected parts. Alternatively, the XML source can be viewed using the browser's "view source" feature, or, after saving the file locally, with a program of your choice. If you read the XML source directly, it won't be difficult to find the actual wikitext. Unless you use a special XML editor, <nowiki>"<" and ">" appear as &lt; and &gt; to avoid a conflict with XML tags, and "&" is coded as "&amp;" to avoid ambiguity.</nowiki>
In the current version the export format does not contain an XML replacement of wiki markup (see [[Wikipedia DTD]] for an older proposal, or [[Wikipedia:WML|Wiki Markup Language]]). You only get the wikitext, as you would when editing the article. (After export you can use [http://www.mediawiki.org/wiki/Alternative_parsers alternative parsers] to convert the wikitext to other formats.)
=== Example ===
<source lang="xml">
<mediawiki xml:lang="en">
  <page>
    <title>Page title</title>
    <!-- page namespace code -->
    <ns>0</ns>
    <id>2</id>
    <!-- If the page is a redirect, the "redirect" element gives the title of the target page -->
    <redirect title="Redirect page title" />
    <restrictions>edit=sysop:move=sysop</restrictions>
    <revision>
      <timestamp>2001-01-15T13:15:00Z</timestamp>
      <contributor>
        <username>Foobar</username>
        <id>65536</id>
      </contributor>
      <comment>I have just one thing to say!</comment>
      <text>A bunch of [[text]] here.</text>
      <minor />
    </revision>
    <revision>
      <timestamp>2001-01-15T13:10:27Z</timestamp>
      <contributor><ip>10.0.0.2</ip></contributor>
      <comment>new!</comment>
      <text>An earlier [[revision]].</text>
    </revision>
    <revision>
      <!-- deleted revision example -->
      <id>4557485</id>
      <parentid>1243372</parentid>
      <timestamp>2010-06-24T02:40:22Z</timestamp>
      <contributor deleted="deleted" />
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text deleted="deleted" />
      <sha1/>
    </revision>
  </page>
  <page>
    <title>Talk:Page title</title>
    <revision>
      <timestamp>2001-01-15T14:03:00Z</timestamp>
      <contributor><ip>10.0.0.2</ip></contributor>
      <comment>hey</comment>
      <text>WHYD YOU LOCK PAGE??!!! i was editing that jerk</text>
    </revision>
  </page>
</mediawiki>
</source>
=== DTD ===
Here is an unofficial, short [[w:Document Type Definition|Document Type Definition]] version of the format (it describes the older 0.3 version; the schema linked above is newer). If you don't know what a DTD is, just ignore it.
<source lang="html4strict">
<!ELEMENT mediawiki (siteinfo?,page*)>
<!-- version contains the version number of the format (0.3 for this DTD) -->
<!ATTLIST mediawiki
  version CDATA #REQUIRED
  xmlns CDATA #FIXED "http://www.mediawiki.org/xml/export-0.3/"
  xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation CDATA #FIXED
    "http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd"
>
<!ELEMENT siteinfo (sitename,base,generator,case,namespaces)>
<!ELEMENT sitename (#PCDATA)> <!-- name of the wiki -->
<!ELEMENT base (#PCDATA)> <!-- url of the main page -->
<!ELEMENT generator (#PCDATA)> <!-- MediaWiki version string -->
<!ELEMENT case (#PCDATA)> <!-- how cases in page names are handled -->
<!-- possible values: 'first-letter' | 'case-sensitive'
     ('case-insensitive' is reserved for future use) -->
<!ELEMENT namespaces (namespace+)> <!-- list of namespaces and prefixes -->
<!ELEMENT namespace (#PCDATA)> <!-- contains namespace prefix -->
<!ATTLIST namespace key CDATA #REQUIRED> <!-- internal namespace number -->
<!ELEMENT page (title,id?,restrictions?,(revision|upload)*)>
<!ELEMENT title (#PCDATA)> <!-- Title with namespace prefix -->
<!ELEMENT id (#PCDATA)>
<!ELEMENT restrictions (#PCDATA)> <!-- optional page restrictions -->
<!ELEMENT revision (id?,timestamp,contributor,minor?,comment,text)>
<!ELEMENT timestamp (#PCDATA)> <!-- according to ISO8601 -->
<!ELEMENT minor EMPTY> <!-- minor flag -->
<!ELEMENT comment (#PCDATA)>
<!ELEMENT text (#PCDATA)> <!-- Wikisyntax -->
<!ATTLIST text xml:space CDATA #FIXED "preserve">
<!ELEMENT contributor ((username,id) | ip)>
<!ELEMENT username (#PCDATA)>
<!ELEMENT ip (#PCDATA)>
<!ELEMENT upload (timestamp,contributor,comment?,filename,src,size)>
<!ELEMENT filename (#PCDATA)>
<!ELEMENT src (#PCDATA)>
<!ELEMENT size (#PCDATA)>
</source>
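To check an export file against the official schema (the XSD linked above, not this DTD), an XML validator can be used. A minimal sketch with the third-party <code>lxml</code> library, assuming you have downloaded the XSD locally:
<source lang="python">
# Sketch: validate an export file against the published XSD using lxml.
from lxml import etree

schema = etree.XMLSchema(etree.parse('export-0.6.xsd'))  # downloaded XSD
doc = etree.parse('export.xml')
print(schema.validate(doc))  # True if the file conforms to the schema
</source>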
=== Processing XML export ===
Many tools can process the exported XML. If you process a large number of pages (for instance a whole dump), you probably won't be able to fit the document in main memory, so you will need a parser based on [[w:Simple API for XML|SAX]] or other event-driven methods (see the sketch below).
You can also use regular expressions to process parts of the XML directly. These run fast but are difficult to maintain.
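For example, Python's built-in <code>xml.etree.ElementTree.iterparse</code> provides such event-driven parsing. A minimal sketch; the namespace URI must match the export version of your dump:
<source lang="python">
# Sketch: stream pages out of a large export without holding it in memory.
import xml.etree.ElementTree as ET

NS = '{http://www.mediawiki.org/xml/export-0.6/}'  # match your dump's version

def iter_pages(path):
    """Yield (title, wikitext of the first listed revision) per page."""
    for event, elem in ET.iterparse(path, events=('end',)):
        if elem.tag == NS + 'page':
            title = elem.findtext(NS + 'title')
            text = elem.findtext(NS + 'revision/' + NS + 'text') or ''
            yield title, text
            elem.clear()  # free the subtree we have just processed

for title, text in iter_pages('export.xml'):
    print(title, len(text))
</source>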
Please list methods and tools for processing XML export here:
* [[w:Wikipedia:Computer help desk/ParseMediaWikiDump|Parse::MediaWikiDump]] is a Perl module for processing the XML dump file.
* [[m:Processing MediaWiki XML with STX]] - stream-based XML transformation
=== Details and practical advice ===
* To determine the namespace of a page, match its title against the prefixes defined in {{tt|/mediawiki/siteinfo/namespaces/namespace}} (see the sketch after this list).
* Possible restrictions are
** {{tt|sysop}} (protected pages)
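A hypothetical sketch of the namespace lookup in Python; the <code>namespaces</code> mapping would be built from the prefixes and <code>key</code> attributes in the dump's {{tt|siteinfo}} block:
<source lang="python">
# Hypothetical helper: map a page title to its namespace number using the
# prefixes from /mediawiki/siteinfo/namespaces/namespace.
def page_namespace(title, namespaces):
    """namespaces maps prefix -> numeric key, e.g. {'Talk': 1, 'Help': 12}."""
    prefix, sep, _rest = title.partition(':')
    if sep and prefix in namespaces:
        return namespaces[prefix]
    return 0  # the main namespace has an empty prefix

print(page_namespace('Help:Contents', {'Talk': 1, 'Help': 12}))  # -> 12
</source>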
==See also==
*[[mw:Help:How to move a wiki to another server]]
*[[mw:Manual:Moving_a_wiki]]
==Wikipedia-specific help==
<!-- {{MediaWiki:Exportnohistory}} -->
*[[Wikipedia:WikiProject Transwiki/exporting]] - instructions on how to export the entire history of a Wikipedia article.