Skip to content

Conversation

@satvshr
Copy link
Collaborator

@satvshr satvshr commented Sep 23, 2025

Fixes #149
Fixes #154

from datasets import load_dataset


def hf_to_dataset(path, keep_in_memory=True, **kwargs):
Copy link
Contributor

@fkiraly fkiraly Sep 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function feels unnecessary, it simply aliases a simple call of huggingface load_dataset.

Why do we need this? I would simply remove it.

Maybe you could explain?

Side remark: The try/expect structure looks problematic. Generally, we should avoid using try/except as an if/else, instead we should use if/else with the correct condition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there not a condition that we can replace this in, in an if/else?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there was a defined variable which hugging face used to list all their recognisable file formats I would say yes, but as of now I dont think it is possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still doubt the reasoning behind having this function.

Can you please give two examples where we currently need this, one example for each branch of the code?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try block: We want to load a file from hugging face which is in one of the known file formats supported by hugging face, so something like the imbd dataset, which is a csv file (known file format).

except block: We want to load a file from hugging face which is not one of the known file formats (say a fasta/pdb/mmcif file), like the fasta file which Jakob provided. If it was used in the above `try1 block, it would throw an error. This block will also work if any local files are supplied to the loader and would return a Dataset object.

Copy link
Contributor

@fkiraly fkiraly Oct 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still do not get what scenario we would need the try/except in.

Does the user not always know what format the file is in that they are attempting to load? And in cases where we hardcode the dataset, do we not know?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the user not always know what format the file is in that they are attempting to load?

The function has to be used differently depending on whether or not the file is a recognisable file format. This information is something I discovered after I spent a bit of time looking into it, the function is just making life easier and saving users from looking into it themselves.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please give an example of a scenario where this would be useful?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try block: We want to load a file from hugging face which is in one of the known file formats supported by hugging face, so something like the imbd dataset, which is a csv file (known file format).

except block: We want to load a file from hugging face which is not one of the known file formats (say a fasta/pdb/mmcif file), like the fasta file which Jakob provided. If it was used in the above `try1 block, it would throw an error. This block will also work if any local files are supplied to the loader and would return a Dataset object.

The try case would need a call like load_dataset("imbd") whereas in the except case it would need a call like load_dataset("text", data_files="https://huggingface.co/datasets/gcos/HoloRBP4_round8_trimmed")

@satvshr
Copy link
Collaborator Author

satvshr commented Sep 26, 2025

The initial ImportError was a masked error caused by naming a file the same as the function, so Python thought I was importing the file instead of the function. After renaming, the real problem showed up, a "circular import error" from python. Pytest starts by importing utils; inside init.py it first pulls in fasta_to_aaseq, which in turn tries to import hf_to_dataset from utils; but hf_to_dataset hasn’t been imported yet (it’s the next line in init.py), so Python throws the importerror. The fix is to either reorder imports so hf_to_dataset is defined first, or better, import it directly (which is what i did) to fix the problem.

@fkiraly fkiraly changed the title Added hugging face and FASTA loaders [ENH] hugging face and FASTA loaders Oct 2, 2025
Copy link
Contributor

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left comments above.

@satvshr satvshr requested a review from fkiraly October 2, 2025 10:54
@satvshr
Copy link
Collaborator Author

satvshr commented Oct 28, 2025

May I suggest a modification so we do not download every time the function is called?

We are not downloading anything unless the keep_in_memory argument is False.

Looking back at the design I used I do not like it either and this is what I propose instead:

  1. hf_to_dataset remains the same.
  2. Remove fasta_to_aaseq entirely and replace it with a transformation to convert any file format supported by BioPython.SeqIO to String, a name for the function will be appreciated. There will be no hf_to_dataset call inside it.

@satvshr satvshr requested a review from fkiraly October 28, 2025 07:54
@fkiraly
Copy link
Contributor

fkiraly commented Oct 28, 2025

you seem to be disregarding my suggestion - if you do not like it, I would appreciate a discussion at least.

@satvshr
Copy link
Collaborator Author

satvshr commented Oct 28, 2025

you seem to be disregarding my suggestion - if you do not like it, I would appreciate a discussion at least.

Apologies, I felt that your entire design is built on top of the argument that we download every time the loader is called, which as I mentioned above is not the case. Regardless, I think my approach involves lesser functions, more arguments, and a transformation class to convert any file format supported by SeqIO to a String. Hence making it easier to use.

If the specified format is not supported or parsing fails.
"""
file_path = self._check_X(X)
records = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very nice - but I think it should be integrated into MoleculeLoader and not be a transformer.

  • _determine_type based on file ending, for the formats for which file ending is known.
  • if the type is determined to be one supported by SeqIO, then call a catch-all function _load_seqio(self, path, format), were format is passed to SeqIO.parse (second argument)

Copy link
Contributor

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments above:

  • on "to asseq" feature, this should move to MoleculeLoader
  • the hf loader is ok
  • I would also add another dispatch path to MoleculeLoader, where a path hf://gcos/Holo etc is admissible. Separate PR perhaps.

@satvshr
Copy link
Collaborator Author

satvshr commented Nov 15, 2025

@fkiraly should we assume that each pdb file is only going to return one sequence? That is what you are doing in _load_pdb_seq of MolecularLoader by only returning one string right?

@satvshr satvshr requested a review from fkiraly November 15, 2025 11:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[DISCUSSION] Hugging face loaders [ENH] FASTA to String converter

3 participants