A Python library for extracting clean text from Wikipedia articles. This is a refactored and modularized version of the original WikiExtractor tool, designed to be more maintainable and easier to integrate into other projects.
- Clean text extraction from Wikipedia markup
- Template expansion support
- Multiple output formats: Plain text, JSON, Markdown
- Configurable processing options
- Modular architecture for easy customization
- Support for multiple Wikipedia language editions
- HTML entity handling and cleanup
git clone https://github.com/Phongng26/wiki-extractor.git
cd wiki-extractor
pip install -r requirements.txt
pip install wiki-extractor
"""
Basic usage example for WikiExtractor.
Simple demonstration using raw Wikipedia markup.
"""
from wiki_extractor.extractor import Extractor
# Example raw Wikipedia markup (usually fetched via the Wikipedia API)
raw_text = """
{{Short description|Quantum algorithm}}
'''Shor's algorithm''' is a [[quantum algorithm]] for integer factorization...
"""
# Initialize extractor
extractor = Extractor(
id="1",
revid="101",
urlbase="https://en.wikipedia.org/wiki",
title="Shor's algorithm",
page=raw_text
)
# Extract clean text (list of paragraphs)
result = extractor.clean_text(raw_text)
print("Number of paragraphs:", len(result))
print("First paragraph:", result[0])
The library provides several configuration options:
- keepLinks: Preserve internal links in output
- keepSections: Keep section structure
- HtmlFormatting: Enable HTML formatting
- markdown: Output in Markdown format
- language: Target language code
- discardSections: Set of section titles to discard
- discardTemplates: Set of template names to discard
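To make the effect of options such as keepLinks and discardSections concrete, the standalone sketch below mimics them on a small piece of markup. It does not use the library and is not its implementation: the helper names `render_links` and `drop_sections`, their parameters, and the HTML anchor format used for kept links are all assumptions chosen for illustration.

```python
import re
from urllib.parse import quote

# Matches internal links of the form [[target]] and [[target|label]].
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")
# Matches level-2 headings such as "== History ==" (not "=== Sub ===").
HEADING_RE = re.compile(r"^==(?!=)\s*(.+?)\s*==\s*$")


def render_links(text, keep_links=False):
    """Hypothetical helper: replace each internal link by its label,
    or by an HTML anchor when keep_links is True (assumed format)."""
    def repl(match):
        target = match.group(1).strip()
        label = (match.group(2) or target).strip()
        if keep_links:
            return '<a href="{}">{}</a>'.format(quote(target), label)
        return label
    return LINK_RE.sub(repl, text)


def drop_sections(text, discard_sections=frozenset()):
    """Hypothetical helper: remove level-2 sections whose title
    appears in discard_sections, including their body lines."""
    kept, skipping = [], False
    for line in text.splitlines():
        heading = HEADING_RE.match(line)
        if heading:
            # A new level-2 heading decides whether we skip from here on.
            skipping = heading.group(1) in discard_sections
        if not skipping:
            kept.append(line)
    return "\n".join(kept)


markup = (
    "Shor's algorithm is a [[quantum algorithm]] for "
    "[[integer factorization|factoring]].\n"
    "== History ==\n"
    "Published in 1994.\n"
    "== See also ==\n"
    "[[Grover's algorithm]]"
)

# Links reduced to labels, "See also" section discarded.
plain = render_links(drop_sections(markup, {"See also"}))
# Links preserved as anchors, all sections kept.
linked = render_links(markup, keep_links=True)

print(plain)
print(linked)
```

How these options are actually set on the Extractor (constructor arguments, class attributes, or a separate configuration object) is not shown here; check the package source for the supported mechanism.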
- Python 3.10+
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
# Run tests
python -m pytest tests/
# Run tests with coverage
python -m pytest tests/ --cov=wiki_extractor
This project is licensed under the MIT License. See the LICENSE file for details.
- Based on the original WikiExtractor by Giuseppe Attardi
- Inspired by the MediaWiki markup processing community
See CHANGELOG.md for a detailed history of changes.