Typo in error message and many hash mismatches #3
Open
Description
After downloading the data as described, I ran:
$ python fix_document_order.py --hc4_file ./data/zho/hc4_docs.jsonl \
--id_file ./resources/hc4/zho/ids*.jsonl.gz \
--check_hash
...
Traceback (most recent call last):
File "/mnt/ssd/expts/joelb/HC4/fix_document_order.py", line 71, in <module>
main(args)
File "/mnt/ssd/expts/joelb/HC4/fix_document_order.py", line 40, in main
assert len(ordered_ids) == len(docs_pos), \
AssertionError: Downloaded 646268 unique documents but id file(s) have 646268 unique ids.
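There's a typo in the error message: both placeholders print len(docs_pos), so the two counts always look identical even when the assertion fails. Here is a diff to fix it: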
$ git diff
diff --git a/fix_document_order.py b/fix_document_order.py
index caa76de..f1d5e0f 100644
--- a/fix_document_order.py
+++ b/fix_document_order.py
@@ -38,7 +38,7 @@ def main(args):
pbar.update()
assert len(ordered_ids) == len(docs_pos), \
- f"Downloaded {len(docs_pos)} unique documents but id file(s) have {len(docs_pos)} unique ids."
+ f"Downloaded {len(docs_pos)} unique documents but id file(s) have {len(ordered_ids)} unique ids."
output_file = args.hc4_file.with_name(f"{args.hc4_file.name}.sorted")
@@ -68,4 +68,4 @@ if __name__ == '__main__':
if len(args.id_file) > 1:
args.id_file = sorted(args.id_file, key=lambda x: int(x.name.split(".")[1]))
- main(args)
\ No newline at end of file
+ main(args)
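To illustrate why the original message hides the real discrepancy, here is a minimal sketch using small made-up stand-ins for `ordered_ids` and `docs_pos`. The helper name `count_message` and the toy data are hypothetical, and treating `docs_pos` as a dict is an assumption; in the script the real values come from the id files and the downloaded JSONL.

```python
def count_message(ordered_ids, docs_pos, fixed=True):
    """Build the assertion message, with or without the fix from the diff."""
    n_docs = len(docs_pos)
    # The original f-string used len(docs_pos) for both counts.
    n_ids = len(ordered_ids) if fixed else len(docs_pos)
    return (f"Downloaded {n_docs} unique documents "
            f"but id file(s) have {n_ids} unique ids.")


# Toy data (hypothetical): three expected ids, only two downloaded documents.
ordered_ids = ["a", "b", "c"]
docs_pos = {"a": 0, "b": 1}

buggy_message = count_message(ordered_ids, docs_pos, fixed=False)
fixed_message = count_message(ordered_ids, docs_pos, fixed=True)
print(buggy_message)
print(fixed_message)
```

With the fix applied, the message reports how many ids are actually missing instead of repeating the document count twice.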
I then re-ran the download with --resume:
$ python download_documents.py --storage ./data/ \
--zho ./resources/hc4/zho/ids.jsonl.gz \
--fas ./resources/hc4/fas/ids.jsonl.gz \
--rus ./resources/hc4/rus/ids.*.jsonl.gz \
--jobs 4 --resume
...
Looking for 478 documents in 1 cc_files
...
Found all needed docs in crawl-data/CC-NEWS/2019/03/CC-NEWS-20190305130425-00520.warc.gz, early stopping
done-cc-file:crawl-data/CC-NEWS/2019/03/CC-NEWS-20190305130425-00520.warc.gz
Re-running fix_document_order.py then showed that all docs had been
downloaded, but I still got many hash mismatches. For example, for
rus:
$ python fix_document_order.py --hc4_file ./data/rus/hc4_docs.jsonl \
--id_file ./resources/hc4/rus/ids*.jsonl.gz \
--check_hash
...
Doc 81f3aa7d-ab14-4dea-be41-6b3474249953 hash mismatch -- should be d4ca468d21616841a2144d0dad123eb4 but got 8a1dc724e4164c6532ed36a7946bd981
Doc 86b099ba-1511-4326-828e-e0d5e1c0f90a hash mismatch -- should be 1a50b2a9ba514dfc68810fc4632ab97a but got 6914a2420218286b33955723d126d8e9
Reading downloaded file: 100%|██████████████████████████████████████████████████████████▉| 4719506/4721064 [07:08<00:00, 11117.84it/s]Doc 589b3401-1e65-40f0-a905-fd6b6fc1e04a hash mismatch -- should be 771d69c411674f627e0be95f7b0ce98d but got a81ca88d6e61cf64f594957001045f84
Doc b99a9558-d946-4732-b09a-e9d1600cdafa hash mismatch -- should be 0309f37937d498a4bfaeeb3366934c07 but got da68945322ddb025ed1c0430a8aba8cb
Doc e55d8cae-58af-43af-bf04-19f3628f4273 hash mismatch -- should be 27cbce31eeb04088f5cf529f56d498b9 but got 9828b6d5d53a28dc0cf7876509920bff
Doc 4bbe64ca-11b7-479f-8e02-83861d470e53 hash mismatch -- should be ab6f0ff18ba464da2f2ac78f3a7a69e4 but got dd0380def9f1d217ae8601ba47aec389
Doc 382f3ed2-6155-4415-adc3-993a287f129b hash mismatch -- should be 64b44761217e3d8a924c769baeaf8b3d but got 26bf5ed5298ab99c5072772fff4d506f
Reading downloaded file: 100%|██████████████████████████████████████████████████████████▉| 4720621/4721064 [07:08<00:00, 10955.39it/s]Doc 5d195cd8-e402-4714-8bf7-0c95807f96ff hash mismatch -- should be 662f2f0ca832a52bb2630c1749c74276 but got a0fb8bb09d0b134fef9ccb143ba6a7a0
Doc 3b6ec592-48c4-4d95-a692-62fd7b9f1529 hash mismatch -- should be 42bcb22637bf3d61669a62781c79a496 but got 373c61c948ef7088bfd25a6234c3a557
Doc a155ba8a-d5bb-4d5d-b532-4accf699a3b4 hash mismatch -- should be 5aa723febfa6be837456cbc0637c24db but got d493913737cd51554e61f7e34c353261
Reading downloaded file: 100%|███████████████████████████████████████████████████████████| 4721064/4721064 [07:08<00:00, 11006.19it/s]
Writing sorted docuements: 100%|█████████████████████████████████████████████████████████| 4721064/4721064 [02:19<00:00, 33754.77it/s]
Backing up the original file...
Done
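To get a total count of mismatches and the affected doc ids, a saved copy of the output above could be parsed with something like the sketch below. It relies only on the "Doc ... hash mismatch -- should be ... but got ..." line format shown in the log; the function name `find_mismatches` and the sample text (with the progress bar shortened) are my own, not part of the repository.

```python
import re

# Matches the mismatch lines shown above, even when a progress bar
# is printed on the same line just before "Doc".
MISMATCH_RE = re.compile(
    r"Doc (?P<doc_id>[0-9a-f-]{36}) hash mismatch -- "
    r"should be (?P<expected>[0-9a-f]{32}) but got (?P<actual>[0-9a-f]{32})"
)


def find_mismatches(log_text):
    """Return (doc_id, expected, actual) tuples found anywhere in the log."""
    return [m.group("doc_id", "expected", "actual")
            for m in MISMATCH_RE.finditer(log_text)]


# Two lines taken from the output above (progress bar shortened).
sample_log = (
    "Doc 81f3aa7d-ab14-4dea-be41-6b3474249953 hash mismatch -- should be "
    "d4ca468d21616841a2144d0dad123eb4 but got 8a1dc724e4164c6532ed36a7946bd981\n"
    "Reading downloaded file: 100%|###| 4719506/4721064 [07:08<00:00, 11117.84it/s]"
    "Doc 589b3401-1e65-40f0-a905-fd6b6fc1e04a hash mismatch -- should be "
    "771d69c411674f627e0be95f7b0ce98d but got a81ca88d6e61cf64f594957001045f84\n"
)

mismatches = find_mismatches(sample_log)
print(f"{len(mismatches)} mismatched docs")
for doc_id, expected, actual in mismatches:
    print(doc_id, expected, actual)
```

Using `finditer` over the whole text rather than matching line by line means the entries that share a line with the progress bar are still picked up.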