Skip to content

Conversation

@danielsf
Copy link

@danielsf danielsf commented Feb 1, 2022

This PR makes it possible for users to specify filepaths that are not specific to the Allen Institute LIMS when instantiating a MovieJSONGenerator. This will be useful for outside users, and Allen Institute users, since the naming convention for our motion corrected movies may change.

I would also like to talk about that massive try/except block in MovieJSONGenerator.__data_generation__ My experience has been that it has obscured failures that I actually want to fix in my configuration files. Is there a reason we don't want to alert users when their input data is ill-formed?

This will protect us against our naming conventions for motion-corrected
movies changing. It will also allow non-Allen users to use the
MovieJSONGenerator, which is very useful, since it gives the user
granular control over which frames are used in training and validation.
This will give the user the option to see what happened that prevented the
generator from passing along all of their specified data

This only affects the MovieJSONGenerator
so that the order over which the movies are iterated does not
depend on the way in which json deserializes the input parameter file

movie_obj = h5py.File(motion_path, "r")
motion_path = None
if os.path.isfile(local_path):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeromelecoq This if/else block is confusing. To reduce complexity, can we introduce a breaking change in which the full path is required in the CLI? That way the else block will go away.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will break all of my previous training json files which is annoying. I would vote against because of that as this seems to be marginal technical debt. This PR seems to allow both behavior (pass a full path or a folder).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I will say in favor of Adam's position: these names

_filenames = ["motion_corrected_video.h5", "concat_31Hz_0.h5"]

are no longer the canonical names of the motion corrected movies. They have names like {ophys_exp_id}_suite2p_motion_corrected_video.h or somesuch. concate_31Hz_0.h5 might still exist as a symlink to the other file, but the original logic ignores symlinks, so I don't think that will help you, Jerome.

msg = f"Issues with {local_lims}\n"
msg += f"Error: {str(err)}\n"
msg += "moving on\n"
warnings.warn(msg)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the idea of skipping over problematic inputs. @jeromelecoq can we instead leave this error as uncaught so that it is raised?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle, I agree with you. I would be careful though. Those training are occurring over the course of multiple days. if this crashes for any reasons (One file become somehow temporally unaccessible over the course of training) you likely lose your training job. I would not be surprised if this introduced some robustness to these kind of issues?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conversely: if one file you thought you were training on becomes inaccessible, do you really want to continue? Worse: if one drive on Isilon with (for instance) all of the experiments from 2019 became unavailable, wouldn't you rather abort training than keep going, even though you were no longer training on the dataset you intended to train on?

self.frame_data_location = json.load(json_handle)

self.lims_id = list(self.frame_data_location.keys())
self.lims_id.sort()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this line needed?

Copy link
Author

@danielsf danielsf Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I was testing my changes, I wanted to make sure that every time I ran the code I got the frames in the same order (also: I just don't like it when I can't guarantee the order in which things will get processed ;)

It shouldn't affect the performance of the trainer, since the generator class loops over videos, then frames i.e. it supplies

movie0, frame0
movie1, frame0
movie2, frame0
....
movie0, frame1
movie1, frame1
movie2, frame1
....

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about sorting. This list was randomized. If you sort, that will make older experiments comes first in training...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sorts the experiment ids, not the frames, correct? I want to make sure that each batch is randomly sampled, since experiment ids that are close together tend to be similar, and each batch should be randomly drawn. It's hard to tell where/if shuffling occurs somewhere downstream.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this would sort the experiment ids. Without the sort it would go through the keys by order of entry in the dictionary. Sorting would make them group experiments that are close in time batch together

Copy link
Author

@danielsf danielsf Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. Apparently they fixed that in python 3.7 so that dicts to preserve insertion order

https://stackoverflow.com/questions/1867861/how-to-keep-keys-values-in-same-order-as-declared

I'll remove that line

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI an alternative would be to read in the list, sort it, then randomize it with a random number generator that the user specified at run time. Then the onus wouldn't be on the user to remember to randomize file orders when creating the input JSON (I often scan through files in alphabetical order just to maintain my own sanity).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, onus should not be on the user to randomize the inputs.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are welcome to randomize it in the generator. But I am still worried about sorting. :-)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeromelecoq
This is what I am describing

9161114

The lims_ids are sorted upon ingest, but immediately randomized using a random number generator with a user-specified seed. Allowing the user to specify the seed gives them the ability to reproduce results across runs (assuming, of course, that tensorflows output depends exactly on the order in which the data is provided to it; I cannot speak to that)

What do you think?

self.frame_data_location = json.load(json_handle)

self.lims_id = list(self.frame_data_location.keys())
self.lims_id.sort()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about sorting. This list was randomized. If you sort, that will make older experiments comes first in training...

msg = f"Issues with {local_lims}\n"
msg += f"Error: {str(err)}\n"
msg += "moving on\n"
warnings.warn(msg)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle, I agree with you. I would be careful though. Those training are occurring over the course of multiple days. if this crashes for any reasons (One file become somehow temporally unaccessible over the course of training) you likely lose your training job. I would not be surprised if this introduced some robustness to these kind of issues?

self.frame_data_location = json.load(json_handle)

self.lims_id = list(self.frame_data_location.keys())
self.lims_id.sort()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this would sort the experiment ids. Without the sort it would go through the keys by order of entry in the dictionary. Sorting would make them group experiments that are close in time batch together


movie_obj = h5py.File(motion_path, "r")
motion_path = None
if os.path.isfile(local_path):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will break all of my previous training json files which is annoying. I would vote against because of that as this seems to be marginal technical debt. This PR seems to allow both behavior (pass a full path or a folder).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants