allow/non/lims/filepaths #83

danielsf · 2022-02-01T20:15:10Z

This PR makes it possible for users to specify filepaths that are not specific to the Allen Institute LIMS when instantiating a MovieJSONGenerator. This will be useful for outside users, and Allen Institute users, since the naming convention for our motion corrected movies may change.

I would also like to talk about that massive try/except block in MovieJSONGenerator.__data_generation__ My experience has been that it has obscured failures that I actually want to fix in my configuration files. Is there a reason we don't want to alert users when their input data is ill-formed?

This will protect us against our naming conventions for motion-corrected movies changing. It will also allow non-Allen users to use the MovieJSONGenerator, which is very useful, since it gives the user granular control over which frames are used in training and validation.

This will give the user the option to see what happened that prevented the generator from passing along all of their specified data This only affects the MovieJSONGenerator

so that the order over which the movies are iterated does not depend on the way in which json deserializes the input parameter file

aamster · 2022-02-07T20:49:52Z

deepinterpolation/generator_collection.py

-
-            movie_obj = h5py.File(motion_path, "r")
+            motion_path = None
+            if os.path.isfile(local_path):


@jeromelecoq This if/else block is confusing. To reduce complexity, can we introduce a breaking change in which the full path is required in the CLI? That way the else block will go away.

That will break all of my previous training json files which is annoying. I would vote against because of that as this seems to be marginal technical debt. This PR seems to allow both behavior (pass a full path or a folder).

One thing I will say in favor of Adam's position: these names

_filenames = ["motion_corrected_video.h5", "concat_31Hz_0.h5"]

are no longer the canonical names of the motion corrected movies. They have names like {ophys_exp_id}_suite2p_motion_corrected_video.h or somesuch. concate_31Hz_0.h5 might still exist as a symlink to the other file, but the original logic ignores symlinks, so I don't think that will help you, Jerome.

aamster · 2022-02-07T20:51:47Z

deepinterpolation/generator_collection.py

+            msg = f"Issues with {local_lims}\n"
+            msg += f"Error: {str(err)}\n"
+            msg += "moving on\n"
+            warnings.warn(msg)


I don't like the idea of skipping over problematic inputs. @jeromelecoq can we instead leave this error as uncaught so that it is raised?

In principle, I agree with you. I would be careful though. Those training are occurring over the course of multiple days. if this crashes for any reasons (One file become somehow temporally unaccessible over the course of training) you likely lose your training job. I would not be surprised if this introduced some robustness to these kind of issues?

Conversely: if one file you thought you were training on becomes inaccessible, do you really want to continue? Worse: if one drive on Isilon with (for instance) all of the experiments from 2019 became unavailable, wouldn't you rather abort training than keep going, even though you were no longer training on the dataset you intended to train on?

aamster · 2022-02-07T20:52:31Z

deepinterpolation/generator_collection.py

            self.frame_data_location = json.load(json_handle)

        self.lims_id = list(self.frame_data_location.keys())
+        self.lims_id.sort()


Why is this line needed?

When I was testing my changes, I wanted to make sure that every time I ran the code I got the frames in the same order (also: I just don't like it when I can't guarantee the order in which things will get processed ;)

It shouldn't affect the performance of the trainer, since the generator class loops over videos, then frames i.e. it supplies

movie0, frame0 movie1, frame0 movie2, frame0 .... movie0, frame1 movie1, frame1 movie2, frame1 ....

Not sure about sorting. This list was randomized. If you sort, that will make older experiments comes first in training...

This sorts the experiment ids, not the frames, correct? I want to make sure that each batch is randomly sampled, since experiment ids that are close together tend to be similar, and each batch should be randomly drawn. It's hard to tell where/if shuffling occurs somewhere downstream.

Yes, this would sort the experiment ids. Without the sort it would go through the keys by order of entry in the dictionary. Sorting would make them group experiments that are close in time batch together

Sorry. Apparently they fixed that in python 3.7 so that dicts to preserve insertion order

https://stackoverflow.com/questions/1867861/how-to-keep-keys-values-in-same-order-as-declared

I'll remove that line

FYI an alternative would be to read in the list, sort it, then randomize it with a random number generator that the user specified at run time. Then the onus wouldn't be on the user to remember to randomize file orders when creating the input JSON (I often scan through files in alphabetical order just to maintain my own sanity).

Agreed, onus should not be on the user to randomize the inputs.

You are welcome to randomize it in the generator. But I am still worried about sorting. :-)

@jeromelecoq
This is what I am describing

9161114

The lims_ids are sorted upon ingest, but immediately randomized using a random number generator with a user-specified seed. Allowing the user to specify the seed gives them the ability to reproduce results across runs (assuming, of course, that tensorflows output depends exactly on the order in which the data is provided to it; I cannot speak to that)

What do you think?

jeromelecoq · 2022-02-07T21:41:50Z

deepinterpolation/generator_collection.py

            self.frame_data_location = json.load(json_handle)

        self.lims_id = list(self.frame_data_location.keys())
+        self.lims_id.sort()


Not sure about sorting. This list was randomized. If you sort, that will make older experiments comes first in training...

jeromelecoq · 2022-02-07T21:51:53Z

deepinterpolation/generator_collection.py

+            msg = f"Issues with {local_lims}\n"
+            msg += f"Error: {str(err)}\n"
+            msg += "moving on\n"
+            warnings.warn(msg)


In principle, I agree with you. I would be careful though. Those training are occurring over the course of multiple days. if this crashes for any reasons (One file become somehow temporally unaccessible over the course of training) you likely lose your training job. I would not be surprised if this introduced some robustness to these kind of issues?

jeromelecoq · 2022-02-07T23:09:20Z

deepinterpolation/generator_collection.py

            self.frame_data_location = json.load(json_handle)

        self.lims_id = list(self.frame_data_location.keys())
+        self.lims_id.sort()


Yes, this would sort the experiment ids. Without the sort it would go through the keys by order of entry in the dictionary. Sorting would make them group experiments that are close in time batch together

jeromelecoq · 2022-02-08T00:57:31Z

deepinterpolation/generator_collection.py

-
-            movie_obj = h5py.File(motion_path, "r")
+            motion_path = None
+            if os.path.isfile(local_path):


That will break all of my previous training json files which is annoying. I would vote against because of that as this seems to be marginal technical debt. This PR seems to allow both behavior (pass a full path or a folder).

danielsf added 8 commits February 1, 2022 11:35

pass error msg from data generation try/except along as warning

dcae28a

This will give the user the option to see what happened that prevented the generator from passing along all of their specified data This only affects the MovieJSONGenerator

remove unused motion_path = []

f941a9c

use context manager to open and close movies

28cd6db

sort the lims_id list in MovieJSONGenerator

02e7179

so that the order over which the movies are iterated does not depend on the way in which json deserializes the input parameter file

pep8 changes

6c24a4c

add unit test for MovieJSONGenerator

e464e9b

only keep movie_obj open as long as you need to

885037f

aamster reviewed Feb 7, 2022

View reviewed changes

jeromelecoq approved these changes Feb 8, 2022

View reviewed changes

danielsf added 2 commits February 7, 2022 17:32

do not sort the LIMS ID of experiment in the generator

ce7b9c3

allow users to specify seed used in randomization

9161114

allow/non/lims/filepaths #83

Are you sure you want to change the base?

allow/non/lims/filepaths #83

Uh oh!

Conversation

danielsf commented Feb 1, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

danielsf Feb 7, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

danielsf Feb 8, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

danielsf Feb 7, 2022 •

edited

Loading

danielsf Feb 8, 2022 •

edited

Loading