Added a datatrove based pipeline for filtering tokenized data using scores. #235
base: master
Conversation
BlueCrescent commented on Jul 25, 2025
- Included an example configuration file.
- Added datatrove and pydantic-settings to requirements.
- Note that modalities is also required for the pipeline to work, but it is not included in the requirements file.
Pull Request Overview
This PR implements a data filtering pipeline using datatrove for filtering tokenized data based on scores. The pipeline processes JSONL files containing scores for data samples and filters corresponding tokenized datasets based on configurable thresholds.
- Adds a complete datatrove-based filtering pipeline with score parsing and data filtering components
- Introduces configuration management using pydantic-settings for both local and Slurm execution environments
- Updates dependencies to include datatrove and pydantic-settings
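For orientation, here is a minimal sketch of how such a datatrove pipeline could be wired up. The constructor arguments of ScoresParser and DataFiltering are assumptions for illustration, not the PR's actual signatures; only the datatrove executor API is taken as given.

```python
# Sketch only: ScoresParser and DataFiltering arguments are assumed for
# illustration; the real classes live under
# src/ml_filter/data_processing/score_based_filtering/.
from datatrove.executor.local import LocalPipelineExecutor

from ml_filter.data_processing.score_based_filtering.step_score_parsing import ScoresParser
from ml_filter.data_processing.score_based_filtering.step_data_filtering import DataFiltering

pipeline = [
    ScoresParser(data_folder="scores/"),    # reads JSONL score files (assumed argument)
    DataFiltering(
        output_folder="filtered/",          # where the filtered datasets are written
        thresholds={"score_A": 0.5},        # per-score thresholds (assumed semantics)
    ),
]

executor = LocalPipelineExecutor(pipeline=pipeline, tasks=1, logging_dir="logs/")
executor.run()
```

For Slurm execution, datatrove's SlurmPipelineExecutor would take the place of the local executor with the same pipeline steps.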
Reviewed Changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/ml_filter/data_processing/score_based_filtering/step_score_parsing.py | Implements ScoresParser class for reading JSONL score files and mapping to tokenized data |
| src/ml_filter/data_processing/score_based_filtering/step_data_filtering.py | Implements DataFiltering class for filtering datasets based on score thresholds |
| src/ml_filter/data_processing/score_based_filtering/filter_pipeline.py | Main pipeline orchestration with configuration management and execution settings |
| pyproject.toml | Adds datatrove and pydantic-settings dependencies |
| configs/data_processing/example_filter_pipeline_config.yaml | Example configuration file for the filtering pipeline |
Comments suppressed due to low confidence (1)
src/ml_filter/data_processing/score_based_filtering/filter_pipeline.py:241
- [nitpick] The error message could be more helpful by providing an example of how to use the FilterPipelineBuilder class directly or where to find documentation.
"and use the FilterPipelineBuilder class directly."
Resolved review threads:
- src/ml_filter/data_processing/score_based_filtering/step_data_filtering.py (2 resolved)
- src/ml_filter/data_processing/score_based_filtering/step_score_parsing.py (2 resolved, 1 outdated)
tasks: int = 1
time: str = "00:15:00"
partition: str = "default"
account: str | None = None  # FIXME is this supported?
Copilot (AI) commented on Jul 25, 2025
The FIXME comment indicates uncertainty about whether the 'account' parameter is supported. This should be resolved or documented properly rather than left as a FIXME in production code.
Suggested change:
- account: str | None = None  # FIXME is this supported?
+ account: str | None = None  # The Slurm account to charge for the job. Optional.
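For reference, a minimal pydantic-settings sketch of these Slurm fields with the suggested comment applied; the class name is an assumption, the four fields mirror the snippet quoted above.

```python
from pydantic_settings import BaseSettings


class SlurmSettings(BaseSettings):  # hypothetical name for illustration
    tasks: int = 1
    time: str = "00:15:00"
    partition: str = "default"
    account: str | None = None  # The Slurm account to charge for the job. Optional.
```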
Resolved review thread (outdated): src/ml_filter/data_processing/score_based_filtering/filter_pipeline.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…g pipeline and adapted the codebase for new changes from main
… execution settings
…dle duplicates in score parsing
document = self.get_document_from_dict(doc_content, filepath, 0)
return [document]

def _parse_scores_jsonl_file(self, filepath: str) -> tuple[str, list[dict[str, float]]]:
The scores are emitted in lexicographic order of the document IDs. IDs such as sample1, sample2, sample10 will be reordered to sample1, sample10, sample2, so the thresholds get applied to the wrong rows in the packed dataset. Please preserve the original file order (e.g. rely on insertion order or track the original line index when deduplicating).
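A small self-contained illustration of the problem and of an order-preserving deduplication (the IDs are hypothetical):

```python
ids = ["sample1", "sample2", "sample10"]

# Lexicographic sorting reorders numeric suffixes, so scores no longer line up
# with the rows of the packed dataset:
print(sorted(ids))  # ['sample1', 'sample10', 'sample2']

# Counting duplicates in a plain dict preserves the original file order,
# because Python dicts keep insertion order:
duplicate_counts: dict[str, int] = {}
for doc_id in ids:
    duplicate_counts[doc_id] = duplicate_counts.get(doc_id, 0) + 1
print(list(duplicate_counts))  # ['sample1', 'sample2', 'sample10']
```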
Returns: generator of Document
"""
base_file_path_or_name, scores_as_list = self._parse_scores_jsonl_file(filepath)
This returns the reader-supplied JSONL filename and never uses the document IDs it just disambiguated, so _map_to_tokenized_data_path always strips only the bare filename. Any base_file_prefix configured in YAML is ignored, and two shards with the same filename in different folders will map to the same .pbin file.
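One possible direction, sketched with an assumed helper name: derive the .pbin path from the shard's relative path (and honour base_file_prefix) rather than from the bare filename alone.

```python
from pathlib import Path


def map_to_tokenized_data_path(scores_path: str) -> Path:
    """Hypothetical sketch: keeping the shard's relative directory means two shards
    with the same filename in different folders map to different .pbin files."""
    relative = Path(scores_path)           # e.g. "shard_a/part_0.jsonl"
    return relative.with_suffix(".pbin")   # -> "shard_a/part_0.pbin"


print(map_to_tokenized_data_path("shard_a/part_0.jsonl"))  # shard_a/part_0.pbin
print(map_to_tokenized_data_path("shard_b/part_0.jsonl"))  # shard_b/part_0.pbin
```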
duplicate_counts: dict[str, int] = {}  # track counts per original document_id
processed_count = 0

with self.data_folder.open(filepath, "r", compression=self._compression) as f:
Not sure how critical this is, but the parser scans the entire scores_for_document_idx list for every line to check whether an ID was seen before, even though a duplicate_counts dict is already maintained. Large score shards (millions of documents) will result in quadratic behavior and long runtimes. Track seen IDs in a set/dict and avoid the repeated linear scans.
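A sketch of the constant-time lookup this suggests, reusing the already-maintained counts dict instead of rescanning the accumulated score list (field names are assumptions):

```python
import json


def parse_scores(lines: list[str]) -> list[dict[str, float]]:
    """Sketch: deduplicate via an O(1) dict lookup per line instead of scanning
    the accumulated score list for every document id."""
    duplicate_counts: dict[str, int] = {}   # counts per original document_id
    scores_for_document_idx: list[dict[str, float]] = []
    for line in lines:
        entry = json.loads(line)
        doc_id = entry["document_id"]       # assumed field name
        duplicate_counts[doc_id] = duplicate_counts.get(doc_id, 0) + 1
        if duplicate_counts[doc_id] == 1:   # keep only the first occurrence (assumed policy)
            scores_for_document_idx.append(entry["scores"])  # assumed field name
    return scores_for_document_idx
```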
| {"score_A": 2.0}, | ||
| {"score_A": 10.0}, | ||
| ] | ||
| self.assertEqual(score_entries, expected_scores) |
You’re only comparing the list of scores, not the (document_id, score) pairs. If two items had the same score_A, a reordered list could still pass.
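A self-contained sketch of a stricter assertion that pairs each document ID with its score, so a reordered result with coincidentally equal scores can no longer pass (all names are assumptions):

```python
import unittest


class ScoreParsingOrderTest(unittest.TestCase):
    """Sketch only: document_ids and score_entries stand in for the parser's output."""

    def test_scores_keep_document_order(self):
        document_ids = ["sample1", "sample2"]
        score_entries = [{"score_A": 2.0}, {"score_A": 10.0}]

        expected = [("sample1", {"score_A": 2.0}), ("sample2", {"score_A": 10.0})]
        # Comparing (id, score) pairs catches reordering even when scores collide.
        self.assertEqual(list(zip(document_ids, score_entries)), expected)
```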
output_folder (Path): The folder where the filtered datasets will be saved.
thresholds (dict[str, float]): A dictionary where keys are score names and values are the
    thresholds to filter samples.
hash_to_base_file_mapping_csv (Path): A CSV file mapping base file hashes to their corresponding paths.
The hash_to_base_file_mapping_csv entry seems like a leftover artifact.