docs: clarify temporal deduplication strategies and document types (Issue #267) #276

ada-ggf25 · 2025-11-27T14:33:02Z

docs: clarify temporal deduplication strategies and document types

Fixes #267

Summary

This PR improves the Dolma documentation around temporal deduplication and document types, addressing conceptual questions raised in allenai/dolma#267. It explains how temporal behaviour emerges from the existing deduplication and mixer pipeline, and how different document categories can be handled in practice.

Motivation

In issue #267, users asked:

- Whether Dolma performs deduplication across time (for example, multiple crawls of the same URL or near-identical content in different years).
- How to reason about which version is retained (earliest, latest, or highest quality).
- How to express and control document types / content categories (such as web pages, academic papers, news, textbooks) within Dolma.

The existing documentation briefly covers deduplication and the document format, but it does not make the temporal aspects or the role of document types explicit. This PR fills that gap using the current implementation behaviour, without changing any code.

What this PR changes

docs/deduplication.md
- Adds a new section “Temporal deduplication and document types” that:
  - Explains that Dolma does not have a separate “temporal deduper”; instead, temporal behaviour is achieved by:
    - Reusing the same Bloom filter across multiple runs, and
    - Controlling the order in which snapshots (for example, 2019-08, 2019-09, 2019-10) are processed.
  - Clarifies how timestamps and document types appear in the standard Dolma document format:
    - source (top-level) for high-level source/category.
    - added / created (where present) for acquisition and creation times.
    - metadata fields for more fine-grained document-type information.
  - Describes two common temporal strategies:
    - “Keep newest”: process snapshots from newest → oldest with a shared Bloom filter, so the latest copy is seen first and treated as canonical; matching content in older snapshots is flagged as duplicate.
    - “Keep oldest”: process snapshots from oldest → newest with a shared Bloom filter, so the first-seen copy is canonical; matching content in newer snapshots is flagged as duplicate.
  - Discusses document-type-aware deduplication:
    - Using a single Bloom filter to deduplicate across types (for example, web vs academic) when that is desired.
    - Using separate Bloom filters per type or source when cross-type deduplication is not desired.
  - Provides a concrete JSON configuration example for a temporal paragraph-level deduplication strategy over multiple monthly web snapshots, highlighting:
    - How the documents list order encodes the temporal policy.
    - How sharing a bloom_filter.file across snapshots leads to temporal deduplication.
    - How this interacts with downstream mixer configuration for further filtering or weighting.
- Adds a short reference at the top of the file pointing to data-format.md for background on the Dolma document structure, so the new section has a clear foundation.
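
The effect of processing order on which copy survives can be illustrated with a minimal sketch. This is not Dolma code: a plain Python set stands in for the shared Bloom filter (so there are no false positives), and the function name `dedupe_snapshots` and the sample paragraphs are hypothetical.

```python
from typing import Iterable


def dedupe_snapshots(
    snapshots: dict[str, list[str]], order: Iterable[str]
) -> dict[str, list[str]]:
    """Keep the first-seen copy of each paragraph across snapshots.

    Assumption: a set replaces the Bloom filter for clarity. A real Bloom
    filter is probabilistic and may occasionally report false positives.
    """
    seen: set[str] = set()  # shared across all snapshots, like a reused filter
    kept: dict[str, list[str]] = {}
    for name in order:
        kept[name] = []
        for paragraph in snapshots[name]:
            if paragraph not in seen:
                seen.add(paragraph)
                kept[name].append(paragraph)
    return kept


# Hypothetical monthly snapshots; labels sort chronologically as strings.
snapshots = {
    "2019-08": ["shared text", "only in august"],
    "2019-09": ["shared text", "only in september"],
    "2019-10": ["shared text"],
}

# "Keep oldest": oldest snapshot is processed first.
keep_oldest = dedupe_snapshots(snapshots, sorted(snapshots))
# "Keep newest": newest snapshot is processed first.
keep_newest = dedupe_snapshots(snapshots, sorted(snapshots, reverse=True))

print(keep_oldest)
print(keep_newest)
```

In this toy run, "shared text" survives only in 2019-08 under "keep oldest" and only in 2019-10 under "keep newest"; nothing else about the deduplication step changes.
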
docs/README.md
- Updates the index entry for deduplication to make the new content discoverable:
  - Changes the bullet to:
    Deduplication (including temporal strategies and document types)
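
For the document-type handling described under docs/deduplication.md above, the following sketch shows one way the fields could be read and how the shared versus per-type Bloom filter choice might be expressed. The record values, the `document_type` metadata key, both helper functions, and the `/tmp/bloom` directory are all hypothetical; only the field names `id`, `text`, `source`, `added`, `created`, and `metadata` come from the Dolma document format.

```python
from typing import Any

# Example record using the standard field names; every value is made up.
example_doc: dict[str, Any] = {
    "id": "web-2019-09-000001",
    "text": "Example paragraph.\nAnother paragraph.",
    "source": "web",
    "added": "2019-09-15T00:00:00Z",
    "created": "2019-08-30T00:00:00Z",
    "metadata": {"document_type": "news"},  # hypothetical fine-grained type
}


def document_type(doc: dict[str, Any]) -> str:
    """Prefer a fine-grained type from metadata, else fall back to `source`."""
    metadata = doc.get("metadata") or {}
    return metadata.get("document_type", doc["source"])


def bloom_filter_path(
    doc_type: str, cross_type: bool, directory: str = "/tmp/bloom"
) -> str:
    """Pick one shared filter file, or one file per document type."""
    name = "shared" if cross_type else doc_type
    return f"{directory}/{name}.bin"


print(document_type(example_doc))
print(bloom_filter_path("news", cross_type=True))
print(bloom_filter_path("news", cross_type=False))
```

With `cross_type=True` every type points at the same filter file, so a web page can be flagged as a duplicate of an academic paper; with `cross_type=False` each type keeps its own file and is deduplicated in isolation.
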

Implementation notes

- No code or CLI behaviour is changed in this PR; it is documentation-only.
- The example config is intentionally minimal and consistent with existing dedupe examples (for example, use of bff_duplicate_paragraph_spans and Bloom filter parameters similar to the current docs).
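
As a rough illustration of the kind of config described here, the sketch below builds a paragraph-level dedupe configuration in Python and prints it as JSON. It is an assumption-laden example, not the config added by this PR: the helper `temporal_dedupe_config`, the snapshot paths, the Bloom filter path, the dedupe name, and all numeric values are hypothetical, and the key names are meant to mirror the existing dedupe examples and should be checked against docs/deduplication.md.

```python
import json


def temporal_dedupe_config(
    snapshots: dict[str, str], bloom_filter_file: str, keep: str = "oldest"
) -> dict:
    """Order snapshot globs so that the copy to keep is processed first."""
    if keep not in ("oldest", "newest"):
        raise ValueError("keep must be 'oldest' or 'newest'")
    # YYYY-MM labels sort chronologically; reverse them for "keep newest".
    labels = sorted(snapshots, reverse=(keep == "newest"))
    return {
        # The list order encodes the temporal policy.
        "documents": [snapshots[label] for label in labels],
        "dedupe": {
            "name": "temporal_paragraphs",  # hypothetical name
            "paragraphs": {"attribute_name": "bff_duplicate_paragraph_spans"},
            "skip_empty": True,
        },
        "bloom_filter": {
            # One file shared by every snapshot gives cross-snapshot dedup.
            "file": bloom_filter_file,
            "read_only": False,
            "estimated_doc_count": 6_000_000,  # illustrative value
            "desired_false_positive_rate": 0.0001,  # illustrative value
        },
        "processes": 4,  # illustrative value
    }


# Hypothetical snapshot locations.
snapshots = {
    "2019-08": "/data/web/2019-08/documents/*.gz",
    "2019-09": "/data/web/2019-09/documents/*.gz",
    "2019-10": "/data/web/2019-10/documents/*.gz",
}

config = temporal_dedupe_config(
    snapshots, "/tmp/web_paragraphs_bloom.bin", keep="newest"
)
print(json.dumps(config, indent=2))
```

Switching `keep` between "oldest" and "newest" only reverses the `documents` list; the Bloom filter settings stay the same, which matches the PR's point that temporal behaviour comes from ordering plus a shared filter rather than from a dedicated deduper.
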

Testing

Manually reviewed docs/deduplication.md and docs/README.md in a Markdown preview to check:
- Internal links (for example, data-format.md, mixer.md) render correctly.
- Tables and code blocks render as expected.
- New section reads coherently alongside the existing deduplication documentation.

No automated tests are affected, as this is a documentation-only change.

Describe temporal deduplication behaviour when sequencing dedupe runs and reusing Bloom filters. Clarify the role of document structure, timestamps and document types, and add an example temporal paragraph-level deduplication configuration with key points.

Update the main documentation index to signal that the deduplication page now covers temporal strategies and document type handling.

docs: clarify temporal deduplication strategies and document types

ada-ggf25 added 3 commits November 27, 2025 14:10

docs: highlight temporal deduplication in index

4a211a0

Merge pull request #1 from ada-ggf25:Guilherme_Grancho

b07934b

docs: clarify temporal deduplication strategies and document types

ada-ggf25 mentioned this pull request Nov 27, 2025

Questions on Temporal Deduplication Strategy and Document Types in Dolma #267

Open
