This folder contains the code for the web pipeline. First, please follow the instructions in the `download` folder to obtain all the available WARC file paths.
The next stage downloads the WARC files from Common Crawl and extracts the text and HTML content. During extraction, we also perform language identification and math-text filtering using fastText models:
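As a rough sketch of the filtering logic, the decision for each page might look like the following. This is an illustration only: the label names, thresholds, and the combined keep/drop rule are assumptions, not the pipeline's actual settings.

```python
def keep_page(text, lid_model, math_model,
              lang_threshold=0.65, math_threshold=0.5):
    """Keep a page only if it is confidently English and classified as math.

    `lid_model` and `math_model` are fastText models exposing `predict`;
    the label names and thresholds here are illustrative assumptions.
    """
    line = text.replace("\n", " ")  # fastText predicts on a single line
    (lang_label,), (lang_score,) = lid_model.predict(line)
    if lang_label != "__label__en" or lang_score < lang_threshold:
        return False
    (math_label,), (math_score,) = math_model.predict(line)
    return math_label == "__label__math" and math_score >= math_threshold

# Usage sketch (model file names are assumptions):
# import fasttext
# lid = fasttext.load_model("lid.176.bin")       # public language-ID model
# math_clf = fasttext.load_model("math_filter.bin")
# keep_page(page_text, lid, math_clf)
```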
```shell
python stage1_download_and_extract.py
```

For deduplication, we mainly follow DataTrove's example; please refer to the example code in DataTrove for more details. The bulk of the code is the same, but we use a different number of buckets and hash functions (11 and 10, respectively).
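To make the (11, 10) setting concrete, here is a self-contained MinHash/LSH sketch in plain Python: 11 buckets, each keyed by a tuple of 10 minhashes, so two documents become dedup candidates when they share any bucket key. This is a simplified stand-in for DataTrove's implementation; the shingle size and the MD5-based hashing are arbitrary choices for illustration.

```python
import hashlib
import struct

NUM_BUCKETS = 11        # matches the 11 in the README
HASHES_PER_BUCKET = 10  # matches the 10 in the README

def _hash(token, seed):
    # Seeded 64-bit hash derived from MD5 (illustrative, not DataTrove's scheme).
    digest = hashlib.md5(f"{seed}:{token}".encode()).digest()
    return struct.unpack("<Q", digest[:8])[0]

def shingles(text, n=3):
    # Word n-gram shingles; falls back to the whole text when it is short.
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def bucket_keys(text):
    """Return NUM_BUCKETS keys, each a tuple of HASHES_PER_BUCKET minhashes.

    Documents sharing at least one bucket key are candidate duplicates.
    """
    sh = shingles(text)
    keys = []
    for b in range(NUM_BUCKETS):
        key = tuple(
            min(_hash(s, b * HASHES_PER_BUCKET + i) for s in sh)
            for i in range(HASHES_PER_BUCKET)
        )
        keys.append(key)
    return keys
```

Identical documents produce identical keys in every bucket, while unrelated documents almost never collide; near-duplicates tend to match in at least one bucket.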
TODO