Added a datatrove based pipeline for filtering tokenized data using scores. #235
base: master
Conversation
BlueCrescent commented on Jul 25, 2025
- Included an example configuration file.
- Added datatrove and pydantic-settings to requirements.
- Note that modalities is also required for the pipeline to work, but it is not included in the requirements file.
Pull Request Overview
This PR implements a data filtering pipeline using datatrove for filtering tokenized data based on scores. The pipeline processes JSONL files containing scores for data samples and filters corresponding tokenized datasets based on configurable thresholds.
- Adds a complete datatrove-based filtering pipeline with score parsing and data filtering components
- Introduces configuration management using pydantic-settings for both local and Slurm execution environments
- Updates dependencies to include datatrove and pydantic-settings
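For orientation, here is a minimal sketch of how such a datatrove pipeline could be wired up. The constructor arguments of ScoresParser and DataFiltering are assumptions for illustration, not the PR's actual signatures; only the datatrove executor API is taken as given.

```python
# Sketch only: ScoresParser and DataFiltering arguments are assumed for
# illustration; the real classes live under
# src/ml_filter/data_processing/score_based_filtering/.
from datatrove.executor.local import LocalPipelineExecutor

from ml_filter.data_processing.score_based_filtering.step_score_parsing import ScoresParser
from ml_filter.data_processing.score_based_filtering.step_data_filtering import DataFiltering

pipeline = [
    ScoresParser(data_folder="scores/"),    # reads JSONL score files (assumed argument)
    DataFiltering(
        output_folder="filtered/",          # where the filtered datasets are written
        thresholds={"score_A": 0.5},        # per-score thresholds (assumed semantics)
    ),
]

executor = LocalPipelineExecutor(pipeline=pipeline, tasks=1, logging_dir="logs/")
executor.run()
```

For Slurm execution, datatrove's SlurmPipelineExecutor would take the place of the local executor with the same pipeline steps.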
Reviewed Changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/ml_filter/data_processing/score_based_filtering/step_score_parsing.py | Implements ScoresParser class for reading JSONL score files and mapping to tokenized data |
| src/ml_filter/data_processing/score_based_filtering/step_data_filtering.py | Implements DataFiltering class for filtering datasets based on score thresholds |
| src/ml_filter/data_processing/score_based_filtering/filter_pipeline.py | Main pipeline orchestration with configuration management and execution settings |
| pyproject.toml | Adds datatrove and pydantic-settings dependencies |
| configs/data_processing/example_filter_pipeline_config.yaml | Example configuration file for the filtering pipeline |
Comments suppressed due to low confidence (1)
src/ml_filter/data_processing/score_based_filtering/filter_pipeline.py:241
- [nitpick] The error message could be more helpful by providing an example of how to use the FilterPipelineBuilder class directly or where to find documentation.
"and use the FilterPipelineBuilder class directly."
Resolved review threads:
- src/ml_filter/data_processing/score_based_filtering/step_data_filtering.py (2 resolved)
- src/ml_filter/data_processing/score_based_filtering/step_score_parsing.py (2 resolved, 1 outdated)
tasks: int = 1
time: str = "00:15:00"
partition: str = "default"
account: str | None = None  # FIXME is this supported?
Copilot (AI) commented on Jul 25, 2025
The FIXME comment indicates uncertainty about whether the 'account' parameter is supported. This should be resolved or documented properly rather than left as a FIXME in production code.
Suggested change:
- account: str | None = None  # FIXME is this supported?
+ account: str | None = None  # The Slurm account to charge for the job. Optional.
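For reference, a minimal pydantic-settings sketch of these Slurm fields with the suggested comment applied; the class name is an assumption, the four fields mirror the snippet quoted above.

```python
from pydantic_settings import BaseSettings


class SlurmSettings(BaseSettings):  # hypothetical name for illustration
    tasks: int = 1
    time: str = "00:15:00"
    partition: str = "default"
    account: str | None = None  # The Slurm account to charge for the job. Optional.
```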
Resolved review thread (outdated): src/ml_filter/data_processing/score_based_filtering/filter_pipeline.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…g pipeline and adapted the codebase for new changes from main
… execution settings
…dle duplicates in score parsing
document = self.get_document_from_dict(doc_content, filepath, 0)
return [document]

def _parse_scores_jsonl_file(self, filepath: str) -> tuple[str, list[dict[str, float]]]:
The scores are emitted in lexicographic order of the document IDs. IDs such as sample1, sample2, sample10 will be reordered to sample1, sample10, sample2, so the thresholds get applied to the wrong rows in the packed dataset. Please preserve the original file order (e.g. rely on insertion order or track the original line index when deduplicating).
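A small self-contained illustration of the problem and of an order-preserving deduplication (the IDs are hypothetical):

```python
ids = ["sample1", "sample2", "sample10"]

# Lexicographic sorting reorders numeric suffixes, so scores no longer line up
# with the rows of the packed dataset:
print(sorted(ids))  # ['sample1', 'sample10', 'sample2']

# Counting duplicates in a plain dict preserves the original file order,
# because Python dicts keep insertion order:
duplicate_counts: dict[str, int] = {}
for doc_id in ids:
    duplicate_counts[doc_id] = duplicate_counts.get(doc_id, 0) + 1
print(list(duplicate_counts))  # ['sample1', 'sample2', 'sample10']
```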
Returns: generator of Document
"""
base_file_path_or_name, scores_as_list = self._parse_scores_jsonl_file(filepath)
This returns the reader-supplied JSONL filename and never uses the document IDs it just disambiguated, so _map_to_tokenized_data_path always strips only the bare filename. Any base_file_prefix configured in YAML is ignored, and two shards with the same filename in different folders will map to the same .pbin file.
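One possible direction, sketched with an assumed helper name: derive the .pbin path from the shard's relative path (and honour base_file_prefix) rather than from the bare filename alone.

```python
from pathlib import Path


def map_to_tokenized_data_path(scores_path: str) -> Path:
    """Hypothetical sketch: keeping the shard's relative directory means two shards
    with the same filename in different folders map to different .pbin files."""
    relative = Path(scores_path)           # e.g. "shard_a/part_0.jsonl"
    return relative.with_suffix(".pbin")   # -> "shard_a/part_0.pbin"


print(map_to_tokenized_data_path("shard_a/part_0.jsonl"))  # shard_a/part_0.pbin
print(map_to_tokenized_data_path("shard_b/part_0.jsonl"))  # shard_b/part_0.pbin
```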
duplicate_counts: dict[str, int] = {}  # track counts per original document_id
processed_count = 0

with self.data_folder.open(filepath, "r", compression=self._compression) as f:
Not sure how critical this is, but the parser scans the entire scores_for_document_idx list for every line to check whether an ID was seen before, even though a duplicate_counts dict is already maintained. Large score shards (millions of documents) will result in quadratic behavior and long runtimes. Track seen IDs in a set/dict and avoid the repeated linear scans.
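A sketch of the constant-time lookup this suggests, reusing the already-maintained counts dict instead of rescanning the accumulated score list (field names are assumptions):

```python
import json


def parse_scores(lines: list[str]) -> list[dict[str, float]]:
    """Sketch: deduplicate via an O(1) dict lookup per line instead of scanning
    the accumulated score list for every document id."""
    duplicate_counts: dict[str, int] = {}   # counts per original document_id
    scores_for_document_idx: list[dict[str, float]] = []
    for line in lines:
        entry = json.loads(line)
        doc_id = entry["document_id"]       # assumed field name
        duplicate_counts[doc_id] = duplicate_counts.get(doc_id, 0) + 1
        if duplicate_counts[doc_id] == 1:   # keep only the first occurrence (assumed policy)
            scores_for_document_idx.append(entry["scores"])  # assumed field name
    return scores_for_document_idx
```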
| {"score_A": 2.0}, | ||
| {"score_A": 10.0}, | ||
| ] | ||
| self.assertEqual(score_entries, expected_scores) |
You’re only comparing the list of scores, not the (document_id, score) pairs. If two items had the same score_A, a reordered list could still pass.
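A self-contained sketch of a stricter assertion that pairs each document ID with its score, so a reordered result with coincidentally equal scores can no longer pass (all names are assumptions):

```python
import unittest


class ScoreParsingOrderTest(unittest.TestCase):
    """Sketch only: document_ids and score_entries stand in for the parser's output."""

    def test_scores_keep_document_order(self):
        document_ids = ["sample1", "sample2"]
        score_entries = [{"score_A": 2.0}, {"score_A": 10.0}]

        expected = [("sample1", {"score_A": 2.0}), ("sample2", {"score_A": 10.0})]
        # Comparing (id, score) pairs catches reordering even when scores collide.
        self.assertEqual(list(zip(document_ids, score_entries)), expected)
```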
output_folder (Path): The folder where the filtered datasets will be saved.
thresholds (dict[str, float]): A dictionary where keys are score names and values are the
    thresholds to filter samples.
hash_to_base_file_mapping_csv (Path): A CSV file mapping base file hashes to their corresponding paths.
The hash_to_base_file_mapping_csv entry seems like a leftover artifact.