-
Notifications
You must be signed in to change notification settings - Fork 53
allow/non/lims/filepaths #83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
allow/non/lims/filepaths #83
Conversation
This will protect us against our naming conventions for motion-corrected movies changing. It will also allow non-Allen users to use the MovieJSONGenerator, which is very useful, since it gives the user granular control over which frames are used in training and validation.
This will give the user the option to see what happened that prevented the generator from passing along all of their specified data This only affects the MovieJSONGenerator
so that the order over which the movies are iterated does not depend on the way in which json deserializes the input parameter file
|
|
||
| movie_obj = h5py.File(motion_path, "r") | ||
| motion_path = None | ||
| if os.path.isfile(local_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeromelecoq This if/else block is confusing. To reduce complexity, can we introduce a breaking change in which the full path is required in the CLI? That way the else block will go away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That will break all of my previous training json files which is annoying. I would vote against because of that as this seems to be marginal technical debt. This PR seems to allow both behavior (pass a full path or a folder).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I will say in favor of Adam's position: these names
_filenames = ["motion_corrected_video.h5", "concat_31Hz_0.h5"]
are no longer the canonical names of the motion corrected movies. They have names like {ophys_exp_id}_suite2p_motion_corrected_video.h or somesuch. concate_31Hz_0.h5 might still exist as a symlink to the other file, but the original logic ignores symlinks, so I don't think that will help you, Jerome.
| msg = f"Issues with {local_lims}\n" | ||
| msg += f"Error: {str(err)}\n" | ||
| msg += "moving on\n" | ||
| warnings.warn(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the idea of skipping over problematic inputs. @jeromelecoq can we instead leave this error as uncaught so that it is raised?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle, I agree with you. I would be careful though. Those training are occurring over the course of multiple days. if this crashes for any reasons (One file become somehow temporally unaccessible over the course of training) you likely lose your training job. I would not be surprised if this introduced some robustness to these kind of issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conversely: if one file you thought you were training on becomes inaccessible, do you really want to continue? Worse: if one drive on Isilon with (for instance) all of the experiments from 2019 became unavailable, wouldn't you rather abort training than keep going, even though you were no longer training on the dataset you intended to train on?
| self.frame_data_location = json.load(json_handle) | ||
|
|
||
| self.lims_id = list(self.frame_data_location.keys()) | ||
| self.lims_id.sort() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this line needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I was testing my changes, I wanted to make sure that every time I ran the code I got the frames in the same order (also: I just don't like it when I can't guarantee the order in which things will get processed ;)
It shouldn't affect the performance of the trainer, since the generator class loops over videos, then frames i.e. it supplies
movie0, frame0
movie1, frame0
movie2, frame0
....
movie0, frame1
movie1, frame1
movie2, frame1
....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about sorting. This list was randomized. If you sort, that will make older experiments comes first in training...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sorts the experiment ids, not the frames, correct? I want to make sure that each batch is randomly sampled, since experiment ids that are close together tend to be similar, and each batch should be randomly drawn. It's hard to tell where/if shuffling occurs somewhere downstream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this would sort the experiment ids. Without the sort it would go through the keys by order of entry in the dictionary. Sorting would make them group experiments that are close in time batch together
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. Apparently they fixed that in python 3.7 so that dicts to preserve insertion order
https://stackoverflow.com/questions/1867861/how-to-keep-keys-values-in-same-order-as-declared
I'll remove that line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI an alternative would be to read in the list, sort it, then randomize it with a random number generator that the user specified at run time. Then the onus wouldn't be on the user to remember to randomize file orders when creating the input JSON (I often scan through files in alphabetical order just to maintain my own sanity).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, onus should not be on the user to randomize the inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are welcome to randomize it in the generator. But I am still worried about sorting. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeromelecoq
This is what I am describing
The lims_ids are sorted upon ingest, but immediately randomized using a random number generator with a user-specified seed. Allowing the user to specify the seed gives them the ability to reproduce results across runs (assuming, of course, that tensorflows output depends exactly on the order in which the data is provided to it; I cannot speak to that)
What do you think?
| self.frame_data_location = json.load(json_handle) | ||
|
|
||
| self.lims_id = list(self.frame_data_location.keys()) | ||
| self.lims_id.sort() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about sorting. This list was randomized. If you sort, that will make older experiments comes first in training...
| msg = f"Issues with {local_lims}\n" | ||
| msg += f"Error: {str(err)}\n" | ||
| msg += "moving on\n" | ||
| warnings.warn(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle, I agree with you. I would be careful though. Those training are occurring over the course of multiple days. if this crashes for any reasons (One file become somehow temporally unaccessible over the course of training) you likely lose your training job. I would not be surprised if this introduced some robustness to these kind of issues?
| self.frame_data_location = json.load(json_handle) | ||
|
|
||
| self.lims_id = list(self.frame_data_location.keys()) | ||
| self.lims_id.sort() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this would sort the experiment ids. Without the sort it would go through the keys by order of entry in the dictionary. Sorting would make them group experiments that are close in time batch together
|
|
||
| movie_obj = h5py.File(motion_path, "r") | ||
| motion_path = None | ||
| if os.path.isfile(local_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That will break all of my previous training json files which is annoying. I would vote against because of that as this seems to be marginal technical debt. This PR seems to allow both behavior (pass a full path or a folder).
This PR makes it possible for users to specify filepaths that are not specific to the Allen Institute LIMS when instantiating a MovieJSONGenerator. This will be useful for outside users, and Allen Institute users, since the naming convention for our motion corrected movies may change.
I would also like to talk about that massive try/except block in
MovieJSONGenerator.__data_generation__My experience has been that it has obscured failures that I actually want to fix in my configuration files. Is there a reason we don't want to alert users when their input data is ill-formed?