The **Special Export** tool fetches specific pages with their raw content (*wikitext*) in real time, without downloading the entire dataset. The content is delivered in XML format.

[TOC]

## Using the **Special Export** Tool

You can use **Special Export** to retrieve pages from *any* wiki site. On the German Wiktionary the tool is labelled **Spezial:Exportieren**, but it works the same way.
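The export URL follows the same pattern on every MediaWiki site; only the domain and the localized name of the special page change. As a quick sketch (the helper name `export_url` is just for illustration):

```python
def export_url(domain, special_page, title):
    # Build the Special Export URL for a given wiki domain and page title.
    return f'https://{domain}/wiki/{special_page}/{title}'

# English Wikipedia uses the canonical name "Special:Export" ...
print(export_url('en.wikipedia.org', 'Special:Export', 'Austria'))
# ... while the German Wiktionary labels the same page "Spezial:Exportieren".
print(export_url('de.wiktionary.org', 'Spezial:Exportieren', 'schön'))
```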

### Examples

**Exporting Pages from Any Wiki Site**

To access the XML content of the page titled "Austria" on the English Wikipedia, you can use the following Python code. When you press `run`, it opens the export link in your default browser:

```pyodide session="webbrowser"
import webbrowser

title = 'Austria'
domain = 'en.wikipedia.org'
url = f'https://{domain}/wiki/Special:Export/{title}'
webbrowser.open_new_tab(url)
```

**Exporting Pages from the German Wiktionary**

On the German Wiktionary, the export tool is reached via `Spezial:Exportieren` instead of `Special:Export`. You can use similar Python code to open the export link for the page titled "schön" (German for "beautiful"):

```pyodide session="webbrowser"
title = 'schön'
domain = 'de.wiktionary.org'
url = f'https://{domain}/wiki/Spezial:Exportieren/{title}'
webbrowser.open_new_tab(url)
```

## Using the `requests` Library

To fetch XML content programmatically, you can use Python's `requests` library. The following example builds the export URL for a given page title, sends a request, and returns the XML content of the page.
| 38 | + |
| 39 | +```python exec="true" source="above" session="requests" |
| 40 | +import requests |
| 41 | + |
| 42 | +def fetch(title): |
| 43 | + # Construct the URL for the XML export of the given page title |
| 44 | + url = f'https://de.wiktionary.org/wiki/Spezial:Exportieren/{title}' |
| 45 | + |
| 46 | + # Send a GET request |
| 47 | + resp = requests.get(url) |
| 48 | + |
| 49 | + # Check if the request was successful, and raise an error if not |
| 50 | + resp.raise_for_status() |
| 51 | + |
| 52 | + # Return the XML content of the requested page |
| 53 | + return resp.content |
| 54 | +``` |
| 55 | + |
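One detail the simple f-string glosses over: page titles containing spaces or other special characters should be percent-encoded before being embedded in a URL path. The standard library's `urllib.parse.quote` handles this (a sketch; the `fetch` function above omits this step for brevity):

```python
from urllib.parse import quote

title = 'New York'
# Percent-encode the title so it is safe to embed in a URL path.
safe_title = quote(title)
print(safe_title)

url = f'https://en.wikipedia.org/wiki/Special:Export/{safe_title}'
print(url)
```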
Next, let's retrieve the XML content for the page titled "hoch" and print the first 500 bytes; the output appears in the `Result` tab.

```python exec="true" source="tabbed-left" result="pycon" session="requests"
page = fetch('hoch')
print(page[:500])
```

<!-- Which will return
```xml
b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.11/ http://www.mediawiki.org/xml/export-0.11.xsd" version="0.11" xml:lang="de">\n  <siteinfo>\n    <sitename>Wiktionary</sitename'
```
-->
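The bytes returned by `fetch` are MediaWiki export XML, which can be parsed with the standard library's `xml.etree.ElementTree`. Here is a minimal sketch using an abbreviated inline sample instead of a live response; note that real exports carry the namespace and schema version shown above, which may change over time:

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for a real export response; the XML namespace
# must be included when searching for elements.
sample = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">'
    '<page><title>hoch</title>'
    '<revision><text>{{Wortart|Adjektiv|Deutsch}}</text></revision>'
    '</page></mediawiki>'
)

NS = '{http://www.mediawiki.org/xml/export-0.11/}'
root = ET.fromstring(sample)
title = root.find(f'{NS}page/{NS}title').text
wikitext = root.find(f'{NS}page/{NS}revision/{NS}text').text
print(title)
print(wikitext)
```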
We will continue to use the `fetch` function throughout this tutorial.