Typo in error message and many hash mismatches #3

@joelb-git

After downloading the data as described, I ran:

$ python fix_document_order.py --hc4_file ./data/zho/hc4_docs.jsonl \
                             --id_file ./resources/hc4/zho/ids*.jsonl.gz \
                             --check_hash
...
Traceback (most recent call last):
  File "/mnt/ssd/expts/joelb/HC4/fix_document_order.py", line 71, in <module>
    main(args)
  File "/mnt/ssd/expts/joelb/HC4/fix_document_order.py", line 40, in main
    assert len(ordered_ids) == len(docs_pos), \
AssertionError: Downloaded 646268 unique documents but id file(s) have 646268 unique ids.

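There's a typo in the error message: it prints len(docs_pos) for both counts, so the two numbers always look identical even when the assertion fails. Here is a diff to fix it: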

$ git diff
diff --git a/fix_document_order.py b/fix_document_order.py
index caa76de..f1d5e0f 100644
--- a/fix_document_order.py
+++ b/fix_document_order.py
@@ -38,7 +38,7 @@ def main(args):
             pbar.update()

     assert len(ordered_ids) == len(docs_pos), \
-           f"Downloaded {len(docs_pos)} unique documents but id file(s) have {len(docs_pos)} unique ids."
+           f"Downloaded {len(docs_pos)} unique documents but id file(s) have {len(ordered_ids)} unique ids."

     output_file = args.hc4_file.with_name(f"{args.hc4_file.name}.sorted")

@@ -68,4 +68,4 @@ if __name__ == '__main__':
     if len(args.id_file) > 1:
         args.id_file = sorted(args.id_file, key=lambda x: int(x.name.split(".")[1]))

-    main(args)
\ No newline at end of file
+    main(args)
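
Even with the message fixed, the assertion only reports the two counts, not which documents are missing. A minimal sketch of a set-difference check that could help here, assuming ordered_ids is an iterable of document id strings and docs_pos is a dict-like object keyed by document id (the names come from the script, but their exact types are my assumption, and missing_ids is a hypothetical helper, not something in the repo):

```python
def missing_ids(ordered_ids, docs_pos):
    """Return ids listed in the id file(s) but absent from the downloaded docs.

    Assumes ordered_ids is an iterable of id strings and docs_pos is a
    dict-like object keyed by id; both shapes are assumptions.
    """
    downloaded = set(docs_pos)
    seen = set()
    missing = []
    for doc_id in ordered_ids:
        # Keep id-file order and report each missing id only once.
        if doc_id not in downloaded and doc_id not in seen:
            missing.append(doc_id)
            seen.add(doc_id)
    return missing


if __name__ == "__main__":
    # Toy data for illustration only, not taken from HC4.
    ordered_ids = ["a", "b", "c", "d"]
    docs_pos = {"a": 0, "c": 17}
    print(missing_ids(ordered_ids, docs_pos))
```

Printing the first few missing ids next to the counts would make it clearer whether a --resume run is needed.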

I then re-ran the download with --resume:

$ python download_documents.py --storage ./data/ \
                             --zho ./resources/hc4/zho/ids.jsonl.gz \
                             --fas ./resources/hc4/fas/ids.jsonl.gz \
                             --rus ./resources/hc4/rus/ids.*.jsonl.gz \
                             --jobs 4 --resume
...
Looking for 478 documents in 1 cc_files
...
Found all needed docs in crawl-data/CC-NEWS/2019/03/CC-NEWS-20190305130425-00520.warc.gz, early stopping
done-cc-file:crawl-data/CC-NEWS/2019/03/CC-NEWS-20190305130425-00520.warc.gz

Re-running fix_document_order.py then showed that all docs had been
downloaded, but I still got many hash mismatches. For example, for
rus:

$ python fix_document_order.py --hc4_file ./data/rus/hc4_docs.jsonl \
  --id_file ./resources/hc4/rus/ids*.jsonl.gz \
  --check_hash
...
Doc 81f3aa7d-ab14-4dea-be41-6b3474249953 hash mismatch -- should be d4ca468d21616841a2144d0dad123eb4 but got 8a1dc724e4164c6532ed36a7946bd981
Doc 86b099ba-1511-4326-828e-e0d5e1c0f90a hash mismatch -- should be 1a50b2a9ba514dfc68810fc4632ab97a but got 6914a2420218286b33955723d126d8e9
Reading downloaded file: 100%|██████████████████████████████████████████████████████████▉| 4719506/4721064 [07:08<00:00, 11117.84it/s]Doc 589b3401-1e65-40f0-a905-fd6b6fc1e04a hash mismatch -- should be 771d69c411674f627e0be95f7b0ce98d but got a81ca88d6e61cf64f594957001045f84
Doc b99a9558-d946-4732-b09a-e9d1600cdafa hash mismatch -- should be 0309f37937d498a4bfaeeb3366934c07 but got da68945322ddb025ed1c0430a8aba8cb
Doc e55d8cae-58af-43af-bf04-19f3628f4273 hash mismatch -- should be 27cbce31eeb04088f5cf529f56d498b9 but got 9828b6d5d53a28dc0cf7876509920bff
Doc 4bbe64ca-11b7-479f-8e02-83861d470e53 hash mismatch -- should be ab6f0ff18ba464da2f2ac78f3a7a69e4 but got dd0380def9f1d217ae8601ba47aec389
Doc 382f3ed2-6155-4415-adc3-993a287f129b hash mismatch -- should be 64b44761217e3d8a924c769baeaf8b3d but got 26bf5ed5298ab99c5072772fff4d506f
Reading downloaded file: 100%|██████████████████████████████████████████████████████████▉| 4720621/4721064 [07:08<00:00, 10955.39it/s]Doc 5d195cd8-e402-4714-8bf7-0c95807f96ff hash mismatch -- should be 662f2f0ca832a52bb2630c1749c74276 but got a0fb8bb09d0b134fef9ccb143ba6a7a0
Doc 3b6ec592-48c4-4d95-a692-62fd7b9f1529 hash mismatch -- should be 42bcb22637bf3d61669a62781c79a496 but got 373c61c948ef7088bfd25a6234c3a557
Doc a155ba8a-d5bb-4d5d-b532-4accf699a3b4 hash mismatch -- should be 5aa723febfa6be837456cbc0637c24db but got d493913737cd51554e61f7e34c353261
Reading downloaded file: 100%|███████████████████████████████████████████████████████████| 4721064/4721064 [07:08<00:00, 11006.19it/s]
Writing sorted docuements: 100%|█████████████████████████████████████████████████████████| 4721064/4721064 [02:19<00:00, 33754.77it/s]
Backing up the original file...
Done
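
To quantify the mismatches, the output above can be captured to a file and the mismatch lines tallied. A rough sketch, assuming the line format is exactly what is shown above (a UUID-style doc id followed by two 32-character hex digests); parse_mismatches is a hypothetical helper of mine, not part of the repo:

```python
import re

# Matches the mismatch lines as they appear in the output above. search() is
# used instead of match() because a progress bar can share the same line.
MISMATCH_RE = re.compile(
    r"Doc (?P<doc_id>[0-9a-f-]{36}) hash mismatch -- "
    r"should be (?P<expected>[0-9a-f]{32}) but got (?P<actual>[0-9a-f]{32})"
)


def parse_mismatches(lines):
    """Return (doc_id, expected_hash, actual_hash) tuples found in log lines."""
    found = []
    for line in lines:
        m = MISMATCH_RE.search(line)
        if m:
            found.append((m["doc_id"], m["expected"], m["actual"]))
    return found
```

For example, with the output saved to a log file (the file name is up to you), len(parse_mismatches(open(path, encoding="utf-8"))) gives the number of mismatching docs, which can be compared against the 4721064 total for rus.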
