The **Special Export** tool fetches specific pages with their raw content (*wikitext*) in real time, without downloading the entire dataset. The content is delivered in XML format.

[TOC]

## Using the **Special Export** Tool

You can use **Special Export** to retrieve pages from *any* wiki site. On the German Wiktionary the tool is labelled **Spezial:Exportieren**, but it works the same way.
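The export URL follows the same pattern on every MediaWiki site; only the domain and the localized name of the special page change. As a quick sketch (the helper name `export_url` is just for illustration):

```python
def export_url(domain, special_page, title):
    # Build the Special Export URL for a given wiki domain and page title.
    return f'https://{domain}/wiki/{special_page}/{title}'

# English Wikipedia uses the canonical name "Special:Export" ...
print(export_url('en.wikipedia.org', 'Special:Export', 'Austria'))
# ... while the German Wiktionary labels the same page "Spezial:Exportieren".
print(export_url('de.wiktionary.org', 'Spezial:Exportieren', 'schön'))
```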

### Examples

**Exporting Pages from Any Wiki Site**

To access the XML content of the page titled "Austria" on the English Wikipedia, you can use the following Python code. When you press `run`, it opens the export link in your default browser:

```pyodide session="webbrowser"
import webbrowser

title = 'Austria'
domain = 'en.wikipedia.org'
url = f'https://{domain}/wiki/Special:Export/{title}'
webbrowser.open_new_tab(url)
```

**Exporting Pages from the German Wiktionary**

On the German Wiktionary, the export tool is reached via `Spezial:Exportieren` instead of `Special:Export`. You can use similar Python code to open the export link for the page titled "schön" (German for "beautiful"):

```pyodide session="webbrowser"
title = 'schön'
domain = 'de.wiktionary.org'
url = f'https://{domain}/wiki/Spezial:Exportieren/{title}'
webbrowser.open_new_tab(url)
```

## Using the `requests` Library

To fetch XML content programmatically, you can use Python's `requests` library. The following example builds the export URL for a given page title, sends a request, and returns the XML content of the page.
| 38 | + |
| 39 | +```python exec="true" source="above" session="requests" |
| 40 | +import requests |
| 41 | + |
| 42 | +def fetch(title): |
| 43 | + # Construct the URL for the XML export of the given page title |
| 44 | + url = f'https://de.wiktionary.org/wiki/Spezial:Exportieren/{title}' |
| 45 | + |
| 46 | + # Send a GET request |
| 47 | + resp = requests.get(url) |
| 48 | + |
| 49 | + # Check if the request was successful, and raise an error if not |
| 50 | + resp.raise_for_status() |
| 51 | + |
| 52 | + # Return the XML content of the requested page |
| 53 | + return resp.content |
| 54 | +``` |
| 55 | + |
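One detail the simple f-string glosses over: page titles containing spaces or other special characters should be percent-encoded before being embedded in a URL path. The standard library's `urllib.parse.quote` handles this (a sketch; the `fetch` function above omits this step for brevity):

```python
from urllib.parse import quote

title = 'New York'
# Percent-encode the title so it is safe to embed in a URL path.
safe_title = quote(title)
print(safe_title)

url = f'https://en.wikipedia.org/wiki/Special:Export/{safe_title}'
print(url)
```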
Next, let's retrieve the XML content for the page titled "hoch" and print the first 500 bytes; the output appears in the `Result` tab.

```python exec="true" source="tabbed-left" result="pycon" session="requests"
page = fetch('hoch')
print(page[:500])
```

<!-- Which will return
```xml
b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.11/ http://www.mediawiki.org/xml/export-0.11.xsd" version="0.11" xml:lang="de">\n  <siteinfo>\n    <sitename>Wiktionary</sitename'
```
-->
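The bytes returned by `fetch` are MediaWiki export XML, which can be parsed with the standard library's `xml.etree.ElementTree`. Here is a minimal sketch using an abbreviated inline sample instead of a live response; note that real exports carry the namespace and schema version shown above, which may change over time:

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for a real export response; the XML namespace
# must be included when searching for elements.
sample = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">'
    '<page><title>hoch</title>'
    '<revision><text>{{Wortart|Adjektiv|Deutsch}}</text></revision>'
    '</page></mediawiki>'
)

NS = '{http://www.mediawiki.org/xml/export-0.11/}'
root = ET.fromstring(sample)
title = root.find(f'{NS}page/{NS}title').text
wikitext = root.find(f'{NS}page/{NS}revision/{NS}text').text
print(title)
print(wikitext)
```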
We will continue to use the `fetch` function throughout this tutorial.