Web Pipeline

This folder contains the code for the web pipeline. Before running it, follow the instructions in the download folder to get the paths of all available WARC files.

Stage 1: Download and Extract

This stage downloads the WARC files from Common Crawl and extracts the text and HTML content. During extraction, it also performs language identification and math text filtering using fastText models.

python stage1_download_and_extract.py
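The filtering step can be pictured roughly as follows. This is a minimal sketch, not the code in stage1_download_and_extract.py: the label names (`__label__en`, `__label__math`), the thresholds, and the helper names `top_label` and `keep_document` are all assumptions made for illustration. The only thing it relies on is the shape of fastText's `predict(text, k)` call, which returns a pair of (labels, probabilities). A stub model stands in for the real fastText models so the example runs without any model files.

```python
from typing import Sequence, Tuple


class StubModel:
    """Stand-in for a fastText model (hypothetical, for illustration only)."""

    def __init__(self, label: str, prob: float) -> None:
        self.label = label
        self.prob = prob

    def predict(self, text: str, k: int = 1) -> Tuple[Sequence[str], Sequence[float]]:
        # Real fastText models also reject input that contains newlines.
        if "\n" in text:
            raise ValueError("predict processes one line at a time")
        return (self.label,), (self.prob,)


def top_label(model, text: str) -> Tuple[str, float]:
    """Return the best label and its probability for a document."""
    # Collapse newlines first, since predict() expects a single line.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0], float(probs[0])


def keep_document(
    text: str,
    lang_model,
    math_model,
    lang_label: str = "__label__en",    # assumed label name
    math_label: str = "__label__math",  # assumed label name
    lang_min: float = 0.5,              # assumed threshold
    math_min: float = 0.5,              # assumed threshold
) -> bool:
    """Keep a document only if it passes both the language and math filters."""
    lang, lang_prob = top_label(lang_model, text)
    if lang != lang_label or lang_prob < lang_min:
        return False
    math, math_prob = top_label(math_model, text)
    return math == math_label and math_prob >= math_min


if __name__ == "__main__":
    doc = "Let x be a real number.\nThen x^2 >= 0."
    print(keep_document(doc, StubModel("__label__en", 0.97), StubModel("__label__math", 0.88)))
```

With real models, the two stubs would be replaced by objects returned from `fasttext.load_model(...)`; which model files and labels the pipeline uses is not stated here.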

Stage 2: Deduplication

We mainly follow DataTrove's example to perform deduplication; please refer to the example code in datatrove for details. Most of the code is the same, but we use a different bucket size and number of hash functions (11 and 10, respectively).
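To get a feel for what these two numbers do, the sketch below computes the standard MinHash LSH collision probability. It assumes the pair (11, 10) means 11 buckets with 10 hash functions per bucket, which is one reading of the sentence above; in DataTrove these would presumably map to the `num_buckets` and `hashes_per_bucket` fields of its MinHash configuration, but check this against the installed version. The function names are hypothetical.

```python
def candidate_probability(
    similarity: float,
    num_buckets: int = 11,        # assumed meaning of the first number
    hashes_per_bucket: int = 10,  # assumed meaning of the second number
) -> float:
    """Probability that two documents share at least one identical bucket.

    Each MinHash value matches with probability equal to the Jaccard
    similarity s, so a bucket of r hashes matches with probability s**r,
    and at least one of b buckets matches with probability 1 - (1 - s**r)**b.
    """
    return 1.0 - (1.0 - similarity ** hashes_per_bucket) ** num_buckets


def approximate_threshold(num_buckets: int = 11, hashes_per_bucket: int = 10) -> float:
    """Common rule of thumb for where the S-curve rises: (1/b)**(1/r)."""
    return (1.0 / num_buckets) ** (1.0 / hashes_per_bucket)


if __name__ == "__main__":
    for s in (0.5, 0.7, 0.8, 0.9):
        print(f"similarity={s:.1f} -> P(candidate)={candidate_probability(s):.3f}")
    print(f"approximate threshold: {approximate_threshold():.3f}")
```

Under this reading, the curve rises around a Jaccard similarity of roughly 0.79, so pairs well below that are rarely flagged as duplicates and pairs well above it almost always are.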

Stage 3: Re-extraction

TODO
