Merged
dalloliogm pushed a commit to dalloliogm/PyHealth that referenced this pull request on Nov 26, 2025
* update to add tsv support for BaseDataset
* init test cases

---------

Co-authored-by: John Wu <johnwu3@sunlab-serv-03.cs.illinois.edu>
This pull request adds support for loading TSV and TSV.GZ files, in addition to CSV and CSV.GZ files, in the BaseDataset class, and introduces tests to verify correct TSV file handling. The main changes extend the file scanning logic to detect and process TSV files, update internal calls to use the new logic, and add a dedicated test suite that verifies TSV dataset loading and joining.

TSV and CSV file loading enhancements:
- Renamed scan_csv_gz_or_csv to scan_csv_gz_or_csv_tsv, which now supports .tsv, .tsv.gz, .csv, and .csv.gz files, automatically detecting the correct separator and handling extension fallback.
- Updated BaseDataset (load_table) to use the new scan_csv_gz_or_csv_tsv for both main and joined tables, so that TSV files are loaded correctly. [1] [2]

Testing improvements:
- Added tests/core/test_tsv_load.py, a test suite for TSV file support covering single and multiple table loading, column detection, joining, and dev mode functionality.
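
To illustrate the separator detection and extension fallback described above, here is a minimal standard-library sketch. It is not the PyHealth implementation of scan_csv_gz_or_csv_tsv: the helper names resolve_table_path and read_table are hypothetical, and the fallback rule shown (try the requested path, then the same name with .gz added or removed) is an assumption about what "extension fallback" could look like.

```python
import csv
import gzip
from pathlib import Path


def resolve_table_path(path):
    """Return (existing_path, separator) for a .csv/.tsv file, plain or gzipped.

    Hypothetical helper: the separator is inferred from the extension, and if
    the requested file is missing, the gzipped/plain counterpart is tried.
    """
    p = Path(path)
    is_gz = p.name.endswith(".gz")
    base = p.name[:-3] if is_gz else p.name  # file name without a trailing .gz

    if base.endswith(".tsv"):
        sep = "\t"
    elif base.endswith(".csv"):
        sep = ","
    else:
        raise ValueError(f"Unsupported table extension: {p.name}")

    # Assumed fallback: toggle the .gz suffix when the requested file is absent.
    alternate = p.with_name(base if is_gz else base + ".gz")
    for candidate in (p, alternate):
        if candidate.exists():
            return candidate, sep
    raise FileNotFoundError(f"Neither {p} nor {alternate} exists")


def read_table(path):
    """Read a CSV/TSV table (optionally gzipped) into a list of row dicts."""
    resolved, sep = resolve_table_path(path)
    opener = gzip.open if resolved.name.endswith(".gz") else open
    with opener(resolved, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=sep))
```

With this sketch, calling read_table("patients.tsv.gz") when only patients.tsv exists on disk would read the plain file with a tab separator. A lazy reader such as the one used by BaseDataset would pass the detected separator to its own scanning call instead of csv.DictReader.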