Add metadata specification #77
Draft
This PR adds a metadata specification to NeuroBlueprint.
While the PR contains the proposed spec, the text below reviews the BIDS metadata specification to make sure we are not duplicating effort or missing anything useful. This follows on from #30, which reviewed BIDS, Allen and openMINDS. While Allen and openMINDS contain specific parts that may be of use (e.g. metadata fields for particular datatypes not covered by BIDS), in general the BIDS metadata organization approach is the most intuitive and the closest to our existing standard.
BIDS review
- Every project has a mandatory top-level dataset_description.json in the project root that contains information about the dataset as key-value pairs.
- An optional participants.tsv file that contains a high-level overview of all participants in the study.
- sessions.tsv that describes sessions for each subject. Again, this gives a high-level overview of all sessions within a subject.
- These .tsv files can be accompanied by additional .json descriptive files that describe in more detail what each key is (i.e. metadata on the metadata).
- Acquired or derivatives data files can be accompanied by a sidecar .json providing metadata fields specific to that datatype, e.g. for behaviour or microscopy. For example, the BEP for behavior specifies different types of data file (e.g. _events.tsv, _physio.tsv, _stim.tsv) and each may be accompanied by a respective metadata file.
- Inheritance: in BIDS, inheritance works based on the suffix. e.g.
Proposal
The BIDS specification is extremely thorough and is ideal for its aim: describing a published dataset fully enough to allow completely automated analysis. I think it is also tailored to fully automated data collection pipelines that write accompanying metadata .json files (e.g. as is done in imaging experiments).
The emphasis for NeuroBlueprint is a little different. For the most part, researchers asking about metadata would like to add additional notes or information to their project during acquisition, rather than have a comprehensive metadata standard. As such, the emphasis is on giving people a way to easily add information to the project ad hoc, in a lab-notebook style (i.e. easy for those without coding experience). However, this should be extensible to include more detailed metadata if required.
For the high-level metadata, the plan is to include optional <>_metadata.yml files at the project, rawdata, sub and ses levels. Files in .json format are also supported, but .yml is preferred as it is more human readable. These will essentially contain the information that could be put into the BIDS dataset_description.json, participants.tsv etc., but as key-value pairs in the folder rather than as a single .tsv. I think this is preferred for our case as it is easier to write these ad hoc during acquisition than to maintain a large .tsv table. We can construct .tsv tables easily from these metadata .yml files for BIDS compatibility if required.
For the low level, we also have <datatype>_metadata.yml with required fields relevant to that datatype. This can also be included at higher levels (e.g. an ephys entry in the rawdata_metadata.yml that applies to the entire project) for an easy project overview. I think this is a good starting place.
Can we just use BIDS?
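On the BIDS-compatibility point, here is a minimal Python sketch of how per-subject key-value metadata might be collected into a participants.tsv-style table. It assumes each subject's metadata file has already been loaded into a dict. The function name build_participants_tsv and the example fields are hypothetical and not part of the proposed spec; the participant_id column and n/a for missing values follow BIDS conventions.

```python
import csv
import io


def build_participants_tsv(subject_metadata):
    """Flatten per-subject key-value metadata into one TSV string.

    subject_metadata maps a subject folder name (e.g. "sub-001") to the
    dict loaded from that subject's metadata file (assumed already parsed).
    """
    # Union of all keys in first-seen order, so ad-hoc fields are kept.
    columns = []
    for fields in subject_metadata.values():
        for key in fields:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["participant_id", *columns],
        delimiter="\t",
        restval="n/a",  # BIDS marks missing values as n/a
        lineterminator="\n",
    )
    writer.writeheader()
    for subject in sorted(subject_metadata):
        writer.writerow({"participant_id": subject, **subject_metadata[subject]})
    return buffer.getvalue()


# Hypothetical example data, invented for illustration.
example = {
    "sub-001": {"species": "mus musculus", "sex": "F"},
    "sub-002": {"species": "mus musculus", "notes": "re-implanted"},
}
print(build_participants_tsv(example))
```

Because the columns are the union of keys across subjects, a field added ad hoc for one animal does not need to exist for the others.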
I think it is worth thinking about whether we could adopt the BIDS spec outright, or with minor changes (e.g. suggest using .yml rather than .json to keep things human readable). Some parts we could use:
- dataset_description.json, participants.tsv and sessions.tsv instead of <project>_metadata.yml / rawdata_metadata.yml, sub_metadata.yml, ses_metadata.yml respectively.
I think that for an acquisition-focused spec, it makes sense to have the metadata file for each subject located in the subject folder. Using a single .yml per subject is easier to maintain than writing information into a .tsv shared between subjects. That being said, I can see the advantage of just adding a new line to participants.tsv when you acquire the data, or to sessions.tsv when adding a new session. It does make it harder to automate though.
- .json) and inheritance based on suffixes
I don't think we can directly adopt these, because we are less strict on the format of data included in the datatype folders. We also do not have suffixes for inheritance. Therefore having a simpler <datatype>_metadata.yml works better for our spec. Because of this simplicity, we will definitely run into cases we can't handle, or researchers who want to write more detailed metadata. In these cases I suggest we point them to the relevant datatype BIDS spec so they can adopt that directly. After all, NeuroBlueprint is supposed to be a stepping stone towards BIDS anyway.
Where the spec will break down (datatype level)
If there are multiple runs within a session, they cannot be covered in one file. There may also be lots of different data types within a folder that are hard to combine into one metadata file (e.g. the events, stim and physio files for behavior). We can have subfields, but this will lead to more complex metadata files. We do not mandate the format of the acquired data within the datatype folder, which also makes it difficult to have data-specific metadata.
Maybe we stick with this for now, and each sub-team looks into what metadata fields can be included, and how complicated these files are likely to get?
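To make the subfield trade-off concrete, here is a rough Python sketch of what a single behavior metadata file might look like once loaded, with one subfield per data type and per-run overrides. All keys, values and the lookup helper are invented for illustration; nothing here is part of the proposed spec.

```python
# Hypothetical content of a behavior metadata file after loading.
# Every key and value is invented for illustration only.
behav_metadata = {
    "events": {"sampling_rate_hz": 1000},
    "stim": {"monitor": "left"},
    "physio": {
        "sampling_rate_hz": 500,
        # Run-specific values that differ from the session-wide ones.
        "runs": {"run-002": {"sampling_rate_hz": 250}},
    },
}


def lookup(metadata, subfield, key, run=None):
    """Return a value for one data type, preferring a run-specific override."""
    section = metadata[subfield]
    if run is not None:
        run_fields = section.get("runs", {}).get(run, {})
        if key in run_fields:
            return run_fields[key]
    return section[key]


print(lookup(behav_metadata, "physio", "sampling_rate_hz"))
print(lookup(behav_metadata, "physio", "sampling_rate_hz", run="run-002"))
```

Even this small case needs two levels of nesting and a fallback rule, which gives a feel for how quickly a one-file-per-datatype approach could grow.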