When trying to work with these data via Dataflow, I noticed a few things:
- The ID field key is inconsistent between files: it is `id` in minhash and signals, but `doc_id` in duplicates.
- IDs are not present as an explicit field in documents. They must be reconstructed from the file path and line number.
This creates a lot of unnecessary friction when working with big data pipelines, since the line number is usually not available to a distributed reader. I've ended up writing a custom reader (sort of a bummer if you've ever had to do it).
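For anyone hitting the same thing, here is a minimal sketch of the kind of workaround involved, in plain Python rather than Dataflow. The helper names are hypothetical, and the `<file path>/<line number>` ID format with zero-based line numbers is an assumption for illustration only; check it against the IDs actually used in the minhash, signals, and duplicates files before relying on it.

```python
import json
from typing import Iterable, Iterator


def with_reconstructed_ids(lines: Iterable[str], file_path: str) -> Iterator[dict]:
    """Attach an ID built from the file path and line number to each JSONL record.

    ASSUMPTION: IDs look like "<file_path>/<zero-based line number>".
    """
    for line_number, line in enumerate(lines):
        if not line.strip():
            # Skip blank lines; enumerate still counts them, so later
            # records keep their original line numbers.
            continue
        record = json.loads(line)
        record["id"] = f"{file_path}/{line_number}"
        yield record


def normalize_id_key(record: dict) -> dict:
    """Return a copy of the record with `doc_id` renamed to `id` if needed."""
    out = dict(record)
    if "doc_id" in out and "id" not in out:
        out["id"] = out.pop("doc_id")
    return out
```

With both helpers applied, every file group exposes the same `id` key and can be joined on it. The catch is that the first helper needs each file read sequentially from the top, which is exactly what splittable text sources in distributed runners don't guarantee.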
For future data releases, please consider embedding a consistent key between all file groups for easier joining at scale. Just a UUID would be fine.