-
Notifications
You must be signed in to change notification settings - Fork 5
[ENH] hugging face and FASTA loaders #155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
pyaptamer/utils/_hf_to_dataset.py
Outdated
| from datasets import load_dataset | ||
|
|
||
|
|
||
| def hf_to_dataset(path, keep_in_memory=True, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function feels unnecessary, it simply aliases a simple call of huggingface load_dataset.
Why do we need this? I would simply remove it.
Maybe you could explain?
Side remark: The try/expect structure looks problematic. Generally, we should avoid using try/except as an if/else, instead we should use if/else with the correct condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there not a condition that we can replace this in, in an if/else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there was a defined variable which hugging face used to list all their recognisable file formats I would say yes, but as of now I dont think it is possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still doubt the reasoning behind having this function.
Can you please give two examples where we currently need this, one example for each branch of the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try block: We want to load a file from hugging face which is in one of the known file formats supported by hugging face, so something like the imbd dataset, which is a csv file (known file format).
except block: We want to load a file from hugging face which is not one of the known file formats (say a fasta/pdb/mmcif file), like the fasta file which Jakob provided. If it was used in the above `try1 block, it would throw an error. This block will also work if any local files are supplied to the loader and would return a Dataset object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still do not get what scenario we would need the try/except in.
Does the user not always know what format the file is in that they are attempting to load? And in cases where we hardcode the dataset, do we not know?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the user not always know what format the file is in that they are attempting to load?
The function has to be used differently depending on whether or not the file is a recognisable file format. This information is something I discovered after I spent a bit of time looking into it, the function is just making life easier and saving users from looking into it themselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please give an example of a scenario where this would be useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tryblock: We want to load a file from hugging face which is in one of the known file formats supported by hugging face, so something like the imbd dataset, which is a csv file (known file format).
exceptblock: We want to load a file from hugging face which is not one of the known file formats (say a fasta/pdb/mmcif file), like the fasta file which Jakob provided. If it was used in the above `try1 block, it would throw an error. This block will also work if any local files are supplied to the loader and would return a Dataset object.
The try case would need a call like load_dataset("imbd") whereas in the except case it would need a call like load_dataset("text", data_files="https://huggingface.co/datasets/gcos/HoloRBP4_round8_trimmed")
|
The initial ImportError was a masked error caused by naming a file the same as the function, so Python thought I was importing the file instead of the function. After renaming, the real problem showed up, a "circular import error" from python. Pytest starts by importing utils; inside init.py it first pulls in fasta_to_aaseq, which in turn tries to import hf_to_dataset from utils; but hf_to_dataset hasn’t been imported yet (it’s the next line in init.py), so Python throws the importerror. The fix is to either reorder imports so hf_to_dataset is defined first, or better, import it directly (which is what i did) to fix the problem. |
fkiraly
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left comments above.
We are not downloading anything unless the Looking back at the design I used I do not like it either and this is what I propose instead:
|
|
you seem to be disregarding my suggestion - if you do not like it, I would appreciate a discussion at least. |
Apologies, I felt that your entire design is built on top of the argument that we download every time the loader is called, which as I mentioned above is not the case. Regardless, I think my approach involves lesser functions, more arguments, and a transformation class to convert any file format supported by SeqIO to a String. Hence making it easier to use. |
| If the specified format is not supported or parsing fails. | ||
| """ | ||
| file_path = self._check_X(X) | ||
| records = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is very nice - but I think it should be integrated into MoleculeLoader and not be a transformer.
_determine_typebased on file ending, for the formats for which file ending is known.- if the type is determined to be one supported by
SeqIO, then call a catch-all function_load_seqio(self, path, format), wereformatis passed toSeqIO.parse(second argument)
fkiraly
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments above:
- on "to asseq" feature, this should move to
MoleculeLoader - the hf loader is ok
- I would also add another dispatch path to
MoleculeLoader, where a pathhf://gcos/Holoetc is admissible. Separate PR perhaps.
|
@fkiraly should we assume that each pdb file is only going to return one sequence? That is what you are doing in |
Fixes #149
Fixes #154