Add/load tsv #542

Merged
jhnwu3 merged 2 commits into master from add/load_tsv
Sep 10, 2025
@jhnwu3 (Collaborator) commented Sep 10, 2025

This pull request adds support for loading TSV and TSV.GZ files to the BaseDataset class, alongside the existing CSV and CSV.GZ support, and adds tests for TSV handling. The main changes extend the file scanning logic to detect and process TSV files, update the internal call sites to use the new function, and add a dedicated test suite that verifies TSV dataset loading and joining.

TSV and CSV file loading enhancements:

  • Renamed scan_csv_gz_or_csv to scan_csv_gz_or_csv_tsv. The function now supports .tsv, .tsv.gz, .csv, and .csv.gz files, automatically selects the correct separator, and handles extension fallback.
  • Updated the calls to the file scanning function in BaseDataset.load_table to use scan_csv_gz_or_csv_tsv for both the main table and joined tables, so that TSV files are loaded correctly.
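
The behavior described in the two bullets above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code from this PR: read_delimited_table is a hypothetical stand-in for scan_csv_gz_or_csv_tsv, it returns a list of row dicts rather than whatever the real function returns, and the fallback rule (toggle the .gz suffix when the requested path is missing) is an assumption about what "extension fallback" means.

```python
import csv
import gzip
from pathlib import Path


def _separator_for(path: Path) -> str:
    """Pick the delimiter from the file name: tab for .tsv/.tsv.gz, comma otherwise."""
    name = path.name.lower()
    return "\t" if name.endswith((".tsv", ".tsv.gz")) else ","


def _fallback_path(path: Path) -> Path:
    """Toggle the .gz suffix (assumed fallback rule, not confirmed by the PR)."""
    if path.name.lower().endswith(".gz"):
        return path.with_name(path.name[:-3])
    return path.with_name(path.name + ".gz")


def read_delimited_table(path_str: str) -> list[dict[str, str]]:
    """Hypothetical stand-in for scan_csv_gz_or_csv_tsv; returns rows as dicts."""
    path = Path(path_str)
    if not path.exists():
        # Try the compressed/uncompressed sibling before giving up.
        alt = _fallback_path(path)
        if not alt.exists():
            raise FileNotFoundError(f"Neither {path} nor {alt} exists")
        path = alt
    # Choose the opener and delimiter from the path that was actually found.
    opener = gzip.open if path.name.lower().endswith(".gz") else open
    with opener(path, mode="rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=_separator_for(path)))
```

With this sketch, asking for patients.tsv.gz when only patients.tsv exists on disk reads the uncompressed file with a tab separator, and the reverse case works the same way.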

Testing improvements:

  • Added tests/core/test_tsv_load.py, a test suite for TSV file support that covers single and multiple table loading, column detection, joining, and dev mode functionality.
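
The contents of tests/core/test_tsv_load.py are not reproduced on this page. As a rough illustration of the kind of checks listed above (column detection and joining), here is a standalone unittest sketch. It does not use BaseDataset: read_tsv, TsvLoadSketch, and the patients/visits tables are all hypothetical names and data made up for the example.

```python
import csv
import tempfile
import unittest
from pathlib import Path


def read_tsv(path: Path) -> list[dict[str, str]]:
    """Read a tab-separated file into a list of row dicts (stdlib stand-in)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


class TsvLoadSketch(unittest.TestCase):
    def setUp(self):
        # Write two small TSV tables into a temporary directory.
        self.tmp = tempfile.TemporaryDirectory()
        root = Path(self.tmp.name)
        self.patients = root / "patients.tsv"
        self.visits = root / "visits.tsv"
        self.patients.write_text("patient_id\tsex\nP1\tF\nP2\tM\n", encoding="utf-8")
        self.visits.write_text("visit_id\tpatient_id\nV1\tP1\nV2\tP1\n", encoding="utf-8")

    def tearDown(self):
        self.tmp.cleanup()

    def test_columns_detected(self):
        rows = read_tsv(self.patients)
        self.assertEqual(list(rows[0].keys()), ["patient_id", "sex"])

    def test_join_on_patient_id(self):
        # Left-join visits to patients on the shared patient_id column.
        sex_by_patient = {r["patient_id"]: r["sex"] for r in read_tsv(self.patients)}
        joined = [{**v, "sex": sex_by_patient[v["patient_id"]]} for v in read_tsv(self.visits)]
        self.assertEqual([j["sex"] for j in joined], ["F", "F"])
```

Saved as a test module, this can be run with python -m unittest.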

@jhnwu3 jhnwu3 merged commit 7c75907 into master Sep 10, 2025
1 check passed
dalloliogm pushed a commit to dalloliogm/PyHealth that referenced this pull request Nov 26, 2025
* update to add tsv support for BaseDataset

* init test cases

---------

Co-authored-by: John Wu <johnwu3@sunlab-serv-03.cs.illinois.edu>
@jhnwu3 jhnwu3 deleted the add/load_tsv branch January 19, 2026 22:51