@JoeZiminski commented Sep 15, 2025

This PR adds a metadata specification to NeuroBlueprint.

While the PR contains the proposed spec, the section below reviews the BIDS metadata specification to make sure we are not duplicating effort or missing anything useful. This follows on from #30, in which BIDS, Allen and openMINDS were reviewed. While Allen and openMINDS contain specific parts that may be of use (e.g. metadata fields for particular datatypes not covered by BIDS), in general the BIDS approach to metadata organization is the most intuitive and the closest to our existing standard.

BIDS review

  • Every project has a mandatory top-level dataset_description.json in the project root that contains dataset-level information (i.e. applying to all animals) as key-value pairs.

  • An optional participants.tsv file that contains a high-level overview of all participants in the study.

  • An optional sessions.tsv that describes the sessions for each subject. Again, this gives a high-level overview of all sessions within a subject.
  • These .tsv files can be accompanied by additional .json descriptive files that describe in more detail what each key is (i.e. metadata on the metadata).

  • Acquired or derivative data files can be accompanied by a sidecar .json providing metadata fields specific to that datatype, e.g. for behaviour or microscopy. For example, the BEP for behavior specifies different types of data file (e.g. _events.tsv, _physio.tsv, _stim.tsv), and each may be accompanied by a respective metadata file.

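  • Inheritance: in BIDS, inheritance works based on the filename suffix. A metadata file placed higher in the folder hierarchy applies to the data files below it that share its suffix, unless a file closer to the data overrides it.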
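To make the suffix-based inheritance concrete, here is a minimal Python sketch. It is a simplification and not the BIDS reference behaviour: it only matches on the suffix and ignores the entity-matching rules of the full BIDS inheritance principle. The function names (`suffix_of`, `resolve_sidecar_metadata`, `demo`) and the example files are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def suffix_of(path: Path) -> str:
    """Return the text after the last underscore, ignoring all extensions."""
    stem = path.name.split(".")[0]
    return stem.rsplit("_", 1)[-1]


def resolve_sidecar_metadata(data_file: Path, root: Path) -> dict:
    """Merge .json files that share the data file's suffix, root folder first.

    Simplified assumption: only the suffix is compared, so files closer to
    the data simply override keys set by files higher up.
    """
    suffix = suffix_of(data_file)

    # Folders from the dataset root down to the folder holding the data file.
    folders = [root]
    for part in data_file.parent.relative_to(root).parts:
        folders.append(folders[-1] / part)

    merged = {}
    for folder in folders:
        for candidate in sorted(folder.glob(f"*{suffix}.json")):
            if suffix_of(candidate) == suffix:
                merged.update(json.loads(candidate.read_text()))
    return merged


def demo() -> dict:
    """Build a tiny throwaway dataset and resolve metadata for one file."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        func = root / "sub-01" / "func"
        func.mkdir(parents=True)

        # Dataset-level sidecar: applies to every *_bold file below the root.
        (root / "task-rest_bold.json").write_text(
            json.dumps({"RepetitionTime": 2.0, "TaskName": "rest"})
        )
        # File-level sidecar: overrides one key for this subject only.
        (func / "sub-01_task-rest_bold.json").write_text(
            json.dumps({"RepetitionTime": 1.5})
        )
        data_file = func / "sub-01_task-rest_bold.nii.gz"
        data_file.touch()

        return resolve_sidecar_metadata(data_file, root)


if __name__ == "__main__":
    print(demo())
```

In this sketch the root-level file supplies `TaskName` and the file next to the data overrides `RepetitionTime`.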

Proposal

The BIDS specification is extremely thorough and is the perfect metadata specification for its aim: to fully describe a published dataset so that analysis can be completely automated. I think it is also tailored to fully automated data collection pipelines that write accompanying metadata .json files (e.g. as is done in imaging experiments).

The emphasis for NeuroBlueprint is a little different. For the most part, researchers asking about metadata would like to add notes or extra information to their project during acquisition, rather than have a comprehensive metadata standard. As such, the emphasis is on giving people an easy way to add information to the project ad hoc, in a lab-notebook style (i.e. easy for those without coding experience). However, this should be extensible to include more detailed metadata if required.

For the high-level metadata, the plan is to include optional <>_metadata.yml files at the project, rawdata, sub and ses levels. .json files are also supported, but .yml is preferred as it is more human-readable. These files will essentially contain the information that could be put into the BIDS dataset_description.json, participants.tsv etc., but as key-value pairs in the relevant folder rather than as a single .tsv. I think this is preferable in our case, as it is easier to write these ad hoc during acquisition than to maintain a large .tsv table. We can easily construct .tsv tables from these metadata .yml files for BIDS compatibility if required.
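As a rough sketch of how a participants.tsv-style table could be built from per-subject metadata files, the Python below walks the `sub-*` folders and collects their key-value pairs. Several things here are assumptions for illustration only: the file name `sub_metadata.json`, the example keys, and the helper names. It reads the .json variant so that only the standard library is needed; the .yml case would be the same with a YAML loader swapped in. Missing values are written as `n/a`, following the BIDS convention for tabular files.

```python
import csv
import io
import json
import tempfile
from pathlib import Path


def collect_subject_metadata(rawdata: Path, filename: str = "sub_metadata.json") -> list[dict]:
    """Read one optional metadata file per subject folder (file name is hypothetical)."""
    rows = []
    for sub_dir in sorted(rawdata.glob("sub-*")):
        meta_file = sub_dir / filename
        if not meta_file.is_file():
            continue  # metadata files are optional, so skip subjects without one
        row = {"participant_id": sub_dir.name}
        row.update(json.loads(meta_file.read_text()))
        rows.append(row)
    return rows


def to_tsv(rows: list[dict]) -> str:
    """Render rows as a tab-separated table, filling missing values with 'n/a'."""
    # Ad-hoc files may not share keys, so use the union in first-seen order.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t", restval="n/a", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def demo() -> str:
    """Create two example subjects with different ad-hoc keys and build the table."""
    with tempfile.TemporaryDirectory() as tmp:
        rawdata = Path(tmp) / "rawdata"
        examples = {
            "sub-001": {"sex": "F", "age_weeks": 12},
            "sub-002": {"sex": "M", "notes": "headplate reattached"},
        }
        for name, meta in examples.items():
            (rawdata / name).mkdir(parents=True)
            (rawdata / name / "sub_metadata.json").write_text(json.dumps(meta))
        return to_tsv(collect_subject_metadata(rawdata))


if __name__ == "__main__":
    print(demo())
```

Because each subject's file is read independently, subjects can carry different ad-hoc keys and the table simply gains a column for each new one.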

For the low-level metadata, we also have <datatype>_metadata.yml files with required fields relevant to that datatype. These fields can also be included at higher levels (e.g. an ephys entry in the rawdata_metadata.yml that applies to the entire project) for an easy project overview. I think this is a good starting place.
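One possible way to combine a project-wide datatype entry with a datatype-level file is sketched below in Python. The override rule (the file closest to the data wins) and the field names (`probe_type`, `sampling_rate_hz`, `reference`) are assumptions for illustration, not part of the proposed spec. The metadata is shown as already-loaded dictionaries, so the same logic would apply to .yml or .json.

```python
def effective_datatype_metadata(project_meta: dict, datatype: str, local_meta=None) -> dict:
    """Start from the project-wide entry for a datatype, then apply local overrides.

    Assumption: keys in the datatype-level file take precedence over the
    project-wide entry of the same datatype.
    """
    merged = dict(project_meta.get(datatype, {}))
    merged.update(local_meta or {})
    return merged


def missing_required(metadata: dict, required) -> list:
    """List the required field names that are absent from the metadata."""
    return [field for field in required if field not in metadata]


# Hypothetical required fields for an ephys datatype.
REQUIRED_EPHYS_FIELDS = ("probe_type", "sampling_rate_hz", "reference")

# Stand-in for the contents of rawdata_metadata.yml (project-wide).
rawdata_metadata = {
    "ephys": {"probe_type": "Neuropixels 2.0", "sampling_rate_hz": 30000},
}

# Stand-in for the contents of one session's ephys_metadata.yml.
session_ephys_metadata = {"reference": "tip"}

effective = effective_datatype_metadata(rawdata_metadata, "ephys", session_ephys_metadata)
project_only = effective_datatype_metadata(rawdata_metadata, "ephys")

if __name__ == "__main__":
    print(effective)
    print(missing_required(project_only, REQUIRED_EPHYS_FIELDS))
```

A check like `missing_required` could tell a researcher which fields still need filling in for a session, without forcing them to repeat the project-wide values.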

Can we just use BIDS?

I think it is worth thinking about whether we could adopt the BIDS spec outright, or with minor changes (e.g. suggest using .yml rather than .json to keep things human readable). Some parts we could use:

  • dataset_description.json, participants.tsv and sessions.tsv instead of <project>_metadata.yml / rawdata_metadata.yml, sub_metadata.yml, ses_metadata.yml respectively.

I think that for an acquisition-focused spec, it makes sense to have the metadata file for each subject located in the subject folder. A single .yml per subject is easier to maintain than writing information into a .tsv shared between subjects. That being said, I can see the advantage of just adding a new line to participants.tsv when you acquire the data, or to sessions.tsv when adding a new session. It does make it harder to automate though.

  • datatype level metadata (sidecar .json) and inheritance based on suffixes

I don't think we can directly adopt these, because we are less strict about the format of the data included in the datatype folders. We also do not have suffixes for inheritance. Therefore, a simpler <datatype>_metadata.yml works better for our spec. Because of this simplicity, we will definitely run into cases we can't handle, or researchers who want to write more detailed metadata. In these cases I suggest we point them to the relevant datatype BIDS spec, which they can adopt directly. After all, NeuroBlueprint is supposed to be a stepping stone towards BIDS anyway.

Where the spec will break down (datatype level)

If there are multiple runs within a session, they cannot be covered in one file. There may also be lots of different data types within a folder, which might be hard to combine into one metadata file (e.g. the events, stim and physio files for behavior). We can have subfields, but this will lead to more complex metadata files. We do not mandate the format of the acquired data within the datatype folder, which also makes it difficult to have data-specific metadata.

Maybe we stick with this for now, and each sub-team looks into what metadata fields can be included, and how complicated these files are likely to get?

@JoeZiminski changed the title from "Add placeholder page." to "Add metadata specification" on Sep 15, 2025
@JoeZiminski marked this pull request as draft on September 15, 2025 16:17