@ada-ggf25 commented Nov 27, 2025

docs: clarify temporal deduplication strategies and document types

Fixes #267


Summary

This PR improves the Dolma documentation around temporal deduplication and document types, addressing conceptual questions raised in allenai/dolma#267. It explains how temporal behaviour emerges from the existing deduplication and mixer pipeline, and how different document categories can be handled in practice.


Motivation

In issue #267, users asked:

  • Whether Dolma performs deduplication across time (for example, multiple crawls of the same URL or near-identical content in different years).
  • How to reason about which version is retained (earliest, latest, or highest quality).
  • How to express and control document types / content categories (such as web pages, academic papers, news, textbooks) within Dolma.

The existing documentation briefly covers deduplication and the document format, but it does not make the temporal aspects or the role of document types explicit. This PR fills that gap using the current implementation behaviour, without changing any code.


What this PR changes

  1. docs/deduplication.md

    • Adds a new section “Temporal deduplication and document types” that:

      • Explains that Dolma does not have a separate “temporal deduper”; instead, temporal behaviour is achieved by:
        • Reusing the same Bloom filter across multiple runs, and
        • Controlling the order in which snapshots (for example, 2019-08, 2019-09, 2019-10) are processed.
      • Clarifies how timestamps and document types appear in the standard Dolma document format:
        • source (top-level) for high-level source/category.
        • added / created (where present) for acquisition and creation times.
        • metadata fields for more fine-grained document-type information.
      • Describes two common temporal strategies:
        • “Keep newest”: process snapshots from newest → oldest with a shared Bloom filter, so the latest copy is treated as canonical.
        • “Keep oldest”: process snapshots from oldest → newest with a shared Bloom filter, so the first-seen copy is canonical.
      • Discusses document-type-aware deduplication:
        • Using a single Bloom filter to deduplicate across types (for example, web vs academic) when that is desired.
        • Using separate Bloom filters per type or source when cross-type deduplication is not desired.
      • Provides a concrete JSON configuration example for a temporal paragraph-level deduplication strategy over multiple monthly web snapshots, highlighting:
        • How the documents list order encodes the temporal policy.
        • How sharing a bloom_filter.file across snapshots leads to temporal deduplication.
        • How this interacts with downstream mixer configuration for further filtering or weighting.
    • Adds a short reference at the top of the file pointing to data-format.md for background on the Dolma document structure, so the new section has a clear foundation.

  2. docs/README.md

    • Updates the index entry for deduplication to make the new content discoverable:
      • Changes the bullet to:
        Deduplication (including temporal strategies and document types)
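
The ordering behaviour described in item 1 can be illustrated with a small simulation. This is a sketch under stated assumptions, not Dolma's implementation: an exact-match Python set stands in for the Bloom filter (so there are no false positives), duplicates are dropped rather than tagged with an attribute, and the function name `dedupe_in_order`, the `filter_key` parameter, and the sample documents are all hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Doc = Dict[str, str]
Snapshot = Tuple[str, List[Doc]]


def dedupe_in_order(
    snapshots: List[Snapshot],
    filter_key: Callable[[Doc], str] = lambda doc: "shared",
) -> List[Tuple[str, str]]:
    """Keep the first-seen copy of each text, in processing order.

    `filter_key` chooses which seen-set a document is checked against.
    The default maps every document to one shared set, which mimics
    reusing a single Bloom filter across all snapshots.
    """
    seen: Dict[str, Set[str]] = {}
    kept: List[Tuple[str, str]] = []
    for snapshot_name, docs in snapshots:
        for doc in docs:
            bucket = seen.setdefault(filter_key(doc), set())
            if doc["text"] in bucket:
                continue  # already seen under this filter: treated as a duplicate
            bucket.add(doc["text"])
            kept.append((snapshot_name, doc["id"]))
    return kept


# Hypothetical data: the same text crawled twice, plus an academic copy.
snapshots: List[Snapshot] = [
    ("2019-08", [{"id": "a-08", "source": "web", "text": "hello world"}]),
    ("2019-09", [
        {"id": "a-09", "source": "web", "text": "hello world"},
        {"id": "p-09", "source": "academic", "text": "hello world"},
    ]),
]

# "Keep oldest": process oldest -> newest with one shared filter.
keep_oldest = dedupe_in_order(snapshots)

# "Keep newest": process newest -> oldest with one shared filter.
keep_newest = dedupe_in_order(list(reversed(snapshots)))

# Separate filter per source: no cross-type deduplication.
per_source = dedupe_in_order(snapshots, filter_key=lambda doc: doc["source"])

print(keep_oldest)
print(keep_newest)
print(per_source)
```

With the shared filter, only the first copy encountered survives, so reversing the snapshot order flips which crawl is treated as canonical. With one filter per source, the web and academic copies are both retained.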

Implementation notes

  • No code or CLI behaviour is changed in this PR; it is documentation-only.
  • The example config is intentionally minimal and consistent with existing dedupe examples (for example, use of bff_duplicate_paragraph_spans and Bloom filter parameters similar to the current docs).
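
For readers without the docs open, the shape of such a config can be sketched as follows. Only the `documents` list, `bloom_filter.file`, and `bff_duplicate_paragraph_spans` are named in this PR; the remaining keys and values (`dedupe.name`, `paragraphs.attribute_name`, `read_only`, `estimated_doc_count`, `desired_false_positive_rate`) are assumptions modelled on the existing dedupe examples, and the helper name and all paths are hypothetical. Check the field names against docs/deduplication.md before relying on them.

```python
import json
from typing import Dict, List


def build_temporal_dedupe_config(
    snapshots: List[str], newest_first: bool, bloom_file: str
) -> Dict:
    """Build a dedupe config whose `documents` order encodes the temporal policy."""
    # YYYY-MM labels sort chronologically as plain strings.
    ordered = sorted(snapshots, reverse=newest_first)
    return {
        # Hypothetical paths; the order of this list decides which copy wins.
        "documents": [f"data/web/{s}/documents/*.jsonl.gz" for s in ordered],
        "dedupe": {
            "name": "dedupe_paragraphs",  # assumed key and value
            "paragraphs": {"attribute_name": "bff_duplicate_paragraph_spans"},
        },
        "bloom_filter": {
            # One file shared by every snapshot gives temporal deduplication.
            "file": bloom_file,
            "read_only": False,  # assumed parameter
            "estimated_doc_count": 1_000_000,  # assumed parameter and value
            "desired_false_positive_rate": 0.001,  # assumed parameter and value
        },
    }


config = build_temporal_dedupe_config(
    ["2019-09", "2019-08", "2019-10"],
    newest_first=True,  # "keep newest": newest snapshot is processed first
    bloom_file="/tmp/web_temporal_bloom.bin",
)
print(json.dumps(config, indent=2))
```

Passing `newest_first=False` yields the "keep oldest" ordering with the same shared filter file.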

Testing

  • Manually reviewed docs/deduplication.md and docs/README.md in a Markdown preview to check:
    • Internal links (for example, data-format.md, mixer.md) render correctly.
    • Tables and code blocks render as expected.
    • New section reads coherently alongside the existing deduplication documentation.

No automated tests are affected, as this is a documentation-only change.

Describe temporal deduplication behaviour when sequencing dedupe runs and reusing Bloom filters.

Clarify the role of document structure, timestamps and document types, and add an example temporal paragraph-level deduplication configuration with key points.
Update the main documentation index to signal that the deduplication page now covers temporal strategies and document type handling.

Development

Successfully merging this pull request may close these issues.

Questions on Temporal Deduplication Strategy and Document Types in Dolma
