wikipedia_extractor

A tool and library crate for parsing Wikipedia XML dumps.

Building

cargo build --release

Usage (binary)

Extract articles from a Wikipedia dump

Usage: wiki_extractor [OPTIONS] --path <PATH> --output-path <OUTPUT_PATH> --output-format <OUTPUT_FORMAT>

Options:
  -p, --path <PATH>
          Path to the xml file containing the dump
  -n, --number-of-articles <NUMBER_OF_ARTICLES>
          Number of articles to extract. -1 will extract all articles [default: -1]
  -o, --output-path <OUTPUT_PATH>
          Where the output should be written to
              files: Writes all files to the output_path directory
              single-file-with-index: Generates two files <output_path>.txt and <output_path>.csv
      --output-format <OUTPUT_FORMAT>
          Specify the format of the output:
              files: Generates one file per article
              single-file-with-index: Generates a single file containing all article texts and a CSV file.
                                      Every value in the CSV file represents the offset to an article. [possible values: files, single-file-with-index]
  -h, --help
          Print help
  -V, --version
          Print version
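
For the single-file-with-index format, the index can be used to seek straight to one article in the text file. The following is a rough sketch, not part of the tool itself; it assumes the CSV values are byte offsets into the .txt file, that each article runs from its offset to the start of the next one, and that the output prefix is ./out as in the example below.

use std::fs::{self, File};
use std::io::{Read, Seek, SeekFrom};

// Read the i-th article from <output_path>.txt using the offsets from <output_path>.csv.
fn read_article(txt: &mut File, offsets: &[u64], i: usize) -> std::io::Result<String> {
    let start = offsets[i];
    // Assumption: an article ends where the next one starts (or at end of file).
    let end = if i + 1 < offsets.len() { offsets[i + 1] } else { txt.metadata()?.len() };
    txt.seek(SeekFrom::Start(start))?;
    let mut buf = vec![0u8; (end - start) as usize];
    txt.read_exact(&mut buf)?;
    Ok(String::from_utf8_lossy(&buf).into_owned())
}

fn main() -> std::io::Result<()> {
    // Assumption: the index holds plain integer offsets separated by commas or newlines.
    let offsets: Vec<u64> = fs::read_to_string("./out.csv")?
        .split(|c: char| c == ',' || c == '\n')
        .filter_map(|v| v.trim().parse().ok())
        .collect();
    let mut txt = File::open("./out.txt")?;
    println!("{}", read_article(&mut txt, &offsets, 0)?);
    Ok(())
}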

Example

Generate out.txt and out.csv for the first 1000 articles

cargo run --release -- -p ./wiki.xml -n 1000 -o ./out --output-format=single-file-with-index

Generate one file per article in out/ for the first 1000 articles

cargo run --release -- -p ./wiki.xml -n 1000 -o ./out --output-format=files

Usage (library)

use wiki_extractor::WikipediaIterator;

// `?` assumes the error returned by `new` converts into Box<dyn Error>.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let article_iter = WikipediaIterator::new("./wiki.xml")?;
    for article in article_iter {
        println!("title: {}\n content: {}", article.title, article.content);
    }
    Ok(())
}
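
The binary's -n option can likely be reproduced with standard iterator adapters. A minimal sketch, assuming WikipediaIterator implements Iterator (the for loop above suggests it is at least IntoIterator):

use wiki_extractor::WikipediaIterator;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Only process the first 1000 articles, analogous to `-n 1000` on the binary.
    // Assumption: WikipediaIterator implements Iterator, so `take` is available.
    for article in WikipediaIterator::new("./wiki.xml")?.take(1000) {
        println!("title: {}", article.title);
    }
    Ok(())
}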
