
Conversation

@BlueCrescent (Collaborator)

Added a pipeline for filtering tokenized data using scores.

- Included an example configuration file.
- Added datatrove and pydantic-settings to requirements.
- Note that modalities is also required for the pipeline to work, but it is not included in the requirements file.
@BlueCrescent requested a review from Copilot on July 25, 2025 at 08:39

Copilot AI left a comment


Pull Request Overview

This PR implements a data filtering pipeline using datatrove for filtering tokenized data based on scores. The pipeline processes JSONL files containing scores for data samples and filters corresponding tokenized datasets based on configurable thresholds.

  • Adds a complete datatrove-based filtering pipeline with score parsing and data filtering components
  • Introduces configuration management using pydantic-settings for both local and Slurm execution environments
  • Updates dependencies to include datatrove and pydantic-settings
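
To make the thresholding concrete, here is a minimal sketch of the idea. The key names (`id`, `score_A`) and the keep-if-score-meets-threshold rule are assumptions for illustration, not the pipeline's actual schema or API:

```python
# Minimal sketch (assumed schema): each JSONL line carries a document id and
# one or more scores; a sample is kept only if every configured score meets
# its threshold.
import json

thresholds = {"score_A": 5.0}  # score name -> threshold, as configured in the YAML

def keep_sample(jsonl_line: str, thresholds: dict[str, float]) -> bool:
    scores = json.loads(jsonl_line)
    return all(scores.get(name, float("-inf")) >= threshold
               for name, threshold in thresholds.items())

print(keep_sample('{"id": "sample1", "score_A": 2.0}', thresholds))   # False
print(keep_sample('{"id": "sample2", "score_A": 10.0}', thresholds))  # True
```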

Reviewed Changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| src/ml_filter/data_processing/score_based_filtering/step_score_parsing.py | Implements the ScoresParser class for reading JSONL score files and mapping them to tokenized data |
| src/ml_filter/data_processing/score_based_filtering/step_data_filtering.py | Implements the DataFiltering class for filtering datasets based on score thresholds |
| src/ml_filter/data_processing/score_based_filtering/filter_pipeline.py | Main pipeline orchestration with configuration management and execution settings |
| pyproject.toml | Adds datatrove and pydantic-settings dependencies |
| configs/data_processing/example_filter_pipeline_config.yaml | Example configuration file for the filtering pipeline |

Comments suppressed due to low confidence (1)

src/ml_filter/data_processing/score_based_filtering/filter_pipeline.py:241

  • [nitpick] The error message could be more helpful by providing an example of how to use the FilterPipelineBuilder class directly or where to find documentation.
            "and use the FilterPipelineBuilder class directly."

tasks: int = 1
time: str = "00:15:00"
partition: str = "default"
account: str | None = None # FIXME is this supported?

Copilot AI Jul 25, 2025


The FIXME comment indicates uncertainty about whether the 'account' parameter is supported. This should be resolved or documented properly rather than left as a FIXME in production code.

Suggested change:
- account: str | None = None # FIXME is this supported?
+ account: str | None = None # The Slurm account to charge for the job. Optional.

@ajude2s requested a review from AbasKhan on October 29, 2025 at 20:47
@ajude2s self-assigned this on Nov 2, 2025
document = self.get_document_from_dict(doc_content, filepath, 0)
return [document]

def _parse_scores_jsonl_file(self, filepath: str) -> tuple[str, list[dict[str, float]]]:
Collaborator


The scores are emitted in lexicographic order of the document IDs. IDs such as sample1, sample2, sample10 will be reordered to sample1, sample10, sample2, so the thresholds get applied to the wrong rows in the packed dataset. Please preserve the original file order (e.g., rely on insertion order or track the original line index when deduplicating).
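
For illustration, a minimal sketch of what preserving file order could look like (the `id` key name is an assumption and this is not the actual ScoresParser code):

```python
# Sketch: collect scores in the order the lines appear in the JSONL file and
# never sort by document ID, so row i of the scores stays aligned with row i
# of the packed dataset.
import json

def parse_scores_in_file_order(lines: list[str]) -> list[dict[str, float]]:
    scores_in_order: list[dict[str, float]] = []
    for line in lines:
        entry = json.loads(line)
        entry.pop("id", None)          # drop the ID, keep only the scores
        scores_in_order.append(entry)  # appended in file order, never sorted
    return scores_in_order

lines = [
    '{"id": "sample1", "score_A": 2.0}',
    '{"id": "sample10", "score_A": 1.0}',
    '{"id": "sample2", "score_A": 10.0}',
]
print(parse_scores_in_file_order(lines))
# [{'score_A': 2.0}, {'score_A': 1.0}, {'score_A': 10.0}] -- sample1, sample10, sample2 order preserved
```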


Returns: generator of Document
"""
base_file_path_or_name, scores_as_list = self._parse_scores_jsonl_file(filepath)
Collaborator


This returns the reader-supplied JSONL filename and never uses the document IDs it just disambiguated, so _map_to_tokenized_data_path always strips only the bare filename. Any base_file_prefix configured in the YAML is ignored, and two shards with the same filename in different folders will map to the same .pbin file.
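
One possible shape of a fix, sketched with hypothetical names (the actual _map_to_tokenized_data_path signature and the intended semantics of base_file_prefix are not shown in this view): derive the .pbin path from the shard's path relative to the scores root, so identical filenames in different folders stay distinct.

```python
# Hypothetical sketch only: map a scores shard to its tokenized .pbin file via
# the path relative to the scores root instead of the bare filename, so
# "shard_a/part0.jsonl" and "shard_b/part0.jsonl" no longer collide.
from pathlib import Path

def map_to_tokenized_data_path(scores_file: Path, scores_root: Path,
                               tokenized_root: Path) -> Path:
    relative = scores_file.relative_to(scores_root).with_suffix(".pbin")
    return tokenized_root / relative

print(map_to_tokenized_data_path(Path("/scores/shard_a/part0.jsonl"),
                                 Path("/scores"), Path("/tokenized")))
# /tokenized/shard_a/part0.pbin
print(map_to_tokenized_data_path(Path("/scores/shard_b/part0.jsonl"),
                                 Path("/scores"), Path("/tokenized")))
# /tokenized/shard_b/part0.pbin -- distinct, unlike the filename-only mapping
```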

duplicate_counts: dict[str, int] = {} # track counts per original document_id
processed_count = 0

with self.data_folder.open(filepath, "r", compression=self._compression) as f:
Collaborator


Not sure if this is very critical, but the parser scans the entire scores_for_document_idx list for every line to check whether an ID was seen before, even though a duplicate_counts dict is already maintained. Large score shards (millions of documents) will result in quadratic behavior and long runtimes. Track seen IDs in a set/dict instead and avoid the repeated linear scans.
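
A sketch of the suggested constant-time duplicate check (the names, sample data, and suffix scheme are illustrative; the real parser reads its entries from the JSONL file):

```python
# Sketch: track how often each document ID has been seen in a dict, so the
# duplicate check is an O(1) lookup instead of a scan over all parsed entries.
parsed_lines = [
    ("sample1", {"score_A": 2.0}),
    ("sample1", {"score_A": 3.0}),   # duplicate ID
    ("sample2", {"score_A": 10.0}),
]

duplicate_counts: dict[str, int] = {}  # document_id -> occurrences seen so far
scores_for_document_idx: list[tuple[str, dict[str, float]]] = []

for document_id, scores in parsed_lines:
    n_seen = duplicate_counts.get(document_id, 0)  # O(1), no list scan
    duplicate_counts[document_id] = n_seen + 1
    unique_id = document_id if n_seen == 0 else f"{document_id}_{n_seen}"
    scores_for_document_idx.append((unique_id, scores))

print(scores_for_document_idx)
```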

{"score_A": 2.0},
{"score_A": 10.0},
]
self.assertEqual(score_entries, expected_scores)
Collaborator


You’re only comparing the list of scores, not the (document_id, score) pairs. If two items had the same score_A, a reordered list could still pass.
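
A sketch of the stricter assertion (the data and names here are placeholders, not the actual test fixtures): pair each score with its document ID so a reordered list with equal scores cannot pass.

```python
import unittest

class TestScoresStayPairedWithIds(unittest.TestCase):
    # Placeholder data; the point is the shape of the assertion: compare
    # (document_id, scores) pairs rather than the scores list alone.
    def test_pairs(self):
        document_ids = ["sample1", "sample2"]
        score_entries = [{"score_A": 2.0}, {"score_A": 10.0}]
        expected_pairs = [
            ("sample1", {"score_A": 2.0}),
            ("sample2", {"score_A": 10.0}),
        ]
        self.assertEqual(list(zip(document_ids, score_entries)), expected_pairs)

if __name__ == "__main__":
    unittest.main()
```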

output_folder (Path): The folder where the filtered datasets will be saved.
thresholds (dict[str, float]): A dictionary where keys are score names and values are the
thresholds to filter samples.
hash_to_base_file_mapping_csv (Path): A CSV file mapping base file hashes to their corresponding paths.
Collaborator


Seems like an artifact
