docs: clarify temporal deduplication strategies and document types (Issue #267) #276
Fixes #267
Summary
This PR improves the Dolma documentation around temporal deduplication and document types, addressing conceptual questions raised in allenai/dolma#267. It explains how temporal behaviour emerges from the existing deduplication and mixer pipeline, and how different document categories can be handled in practice.
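To illustrate the idea, here is a minimal sketch of how processing order combined with shared deduplication state produces a temporal policy. It is not Dolma's implementation: the function name `dedupe_in_order` is hypothetical, a plain Python `set` stands in for the Bloom filter (which is probabilistic and can report false positives, unlike a set), and the paragraph strings are made-up sample data. Only the snapshot labels come from this PR.

```python
from typing import Dict, Iterable, List, Tuple


def dedupe_in_order(
    snapshots: Iterable[Tuple[str, List[str]]]
) -> Dict[str, List[str]]:
    """Keep the first occurrence of each paragraph across ordered snapshots.

    Hypothetical helper for illustration only; not part of Dolma.
    """
    # Assumption: an exact set stands in for a Bloom filter file that is
    # reused from one snapshot to the next.
    seen = set()
    kept: Dict[str, List[str]] = {}
    for name, paragraphs in snapshots:
        kept[name] = []
        for text in paragraphs:
            if text in seen:
                # Already observed in an earlier-processed snapshot: drop it.
                continue
            seen.add(text)
            kept[name].append(text)
    return kept


# Made-up sample data; only the snapshot labels appear in the PR text.
snapshots = [
    ("2019-08", ["alpha", "beta"]),
    ("2019-09", ["beta", "gamma"]),
    ("2019-10", ["alpha", "delta"]),
]

# Oldest snapshot processed first: the earliest copy of each paragraph survives.
oldest_first = dedupe_in_order(snapshots)

# Newest snapshot processed first: the latest copy survives instead.
newest_first = dedupe_in_order(reversed(snapshots))
```

With the sample data, `oldest_first` keeps both paragraphs of `2019-08` and only the new paragraph in each later snapshot, while `newest_first` leaves `2019-08` empty. The deduplication logic is identical in both runs; only the processing order changes which copy is retained.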
Motivation
In issue #267, users asked conceptual questions about temporal deduplication and document types.
The existing documentation briefly covers deduplication and the document format, but it does not make the temporal aspects or the role of document types explicit. This PR fills that gap using the current implementation behaviour, without changing any code.
What this PR changes
docs/deduplication.md
Adds a new section “Temporal deduplication and document types” that:
- […] 2019-08, 2019-09, 2019-10) are processed.
- […] source (top-level) for high-level source/category.
- […] added/created (where present) for acquisition and creation times.
- […] metadata fields for more fine-grained document-type information.
- […] documents list order encodes the temporal policy.
- […] bloom_filter.file across snapshots leads to temporal deduplication.
Adds a short reference at the top of the file pointing to data-format.md for background on the Dolma document structure, so the new section has a clear foundation.
docs/README.md
- […] Deduplication (including temporal strategies and document types)
Implementation notes
- […] bff_duplicate_paragraph_spans and Bloom filter parameters similar to the current docs).
Testing
- […] docs/deduplication.md and docs/README.md in a Markdown preview to check:
- […] data-format.md, mixer.md) render correctly.
No automated tests are affected, as this is a documentation-only change.