# Add per-asset dataStandard, HED standard, and extensions to StandardsType #371
base: master
```diff
@@ -14,3 +14,5 @@ sandbox/
 venv/
 venvs/
 dandischema/_version.py
+uv.lock
+.cache/
```
`CLAUDE.md` (new file, 138 lines):

# CLAUDE.md

This file provides guidance to Claude Code when working with code in this
repository.

## Project Overview

dandischema defines the Pydantic v2 metadata models for the DANDI
neurophysiology data archive. It is used by both the dandi-cli client and the
dandi-archive server. Key concerns: model definitions, JSON Schema generation,
metadata validation, schema migration between versions, and asset metadata
aggregation.

## Build/Test Commands

```bash
tox -e py3                 # Run full test suite (preferred)
pytest dandischema/        # Run tests directly in active venv
pytest dandischema/tests/test_metadata.py -v -k "test_name"  # Single test
tox -e lint                # codespell + flake8
tox -e typing              # mypy (strict, with pydantic plugin)
```

- `filterwarnings = error` is active — new warnings will fail tests.
- Coverage is collected by default (`--cov=dandischema`).

## Code Style

- **Formatter**: Black (no explicit line-length override → default 88)
- **Import sorting**: isort with `profile = "black"`, `force_sort_within_sections`,
  `reverse_relative`
- **Linting**: flake8 (max-line-length=100, ignores E203/W503)
- **Type checking**: mypy strict — `no_implicit_optional`, `warn_return_any`,
  `warn_unreachable`, pydantic plugin enabled
- **Pre-commit hooks**: trailing-whitespace, end-of-file-fixer, check-yaml,
  check-added-large-files, black, isort, codespell, flake8
- Imports at top of file; avoid function-level imports unless there is a
  concrete reason (circular deps, heavy transitive imports)

## Architecture

### Key Modules

| File | Role |
|------|------|
| `models.py` | All Pydantic models (~2000 lines). Class hierarchy rooted at `DandiBaseModel`. |
| `metadata.py` | `validate()`, `migrate()`, `aggregate_assets_summary()`. |
| `consts.py` | `DANDI_SCHEMA_VERSION`, `ALLOWED_INPUT_SCHEMAS`, `ALLOWED_TARGET_SCHEMAS`. |
| `conf.py` | Instance configuration via env vars (`DANDI_INSTANCE_NAME`, etc.). |
| `types.py` | Custom Pydantic types (`ByteSizeJsonSchema`). |
| `utils.py` | JSON schema helpers, `version2tuple()`, `name2title()`. |
| `exceptions.py` | `ValidationError`, `JsonschemaValidationError`, `PydanticValidationError`. |
| `digests/` | `DandiETag` multipart-upload checksum calculation. |
| `datacite/` | DataCite DOI metadata conversion. |

### Model Hierarchy (simplified)

```
DandiBaseModel
├── PropertyValue        # recursive (self-referencing)
├── BaseType
│   ├── StandardsType    # name, identifier, version, extensions (recursive)
│   ├── ApproachType, AssayType, SampleType, Anatomy, ...
│   └── MeasurementTechniqueType
├── Person, Organization # Contributor subclasses
├── BioSample            # recursive (wasDerivedFrom)
├── AssetsSummary        # aggregated stats
└── CommonModel
    ├── Dandiset → PublishedDandiset
    └── BareAsset → Asset → PublishedAsset
```

Several models are **self-referencing** (PropertyValue, BioSample,
StandardsType). These require `model_rebuild()` after the class definition.
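The rebuild pattern can be shown with a minimal stand-alone sketch (a simplified stand-in model, not the real `StandardsType` definition; the `HED-SCORE` value is only illustrative):

```python
from typing import List, Optional

from pydantic import BaseModel


class StandardsType(BaseModel):
    """Simplified stand-in model to show the self-reference pattern."""

    name: str
    # The forward reference "StandardsType" makes the model recursive.
    extensions: Optional[List["StandardsType"]] = None


# Resolve the forward reference after the class definition.
StandardsType.model_rebuild()

hed = StandardsType(name="HED", extensions=[StandardsType(name="HED-SCORE")])
```

Without the rebuild, pydantic cannot resolve `"StandardsType"` because the name does not exist yet while the class body is being evaluated.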
### Data Flow: Asset Metadata Aggregation

1. dandi-cli calls `asset.get_metadata()` → populates `BareAsset` including
   per-asset `dataStandard` list
2. Asset metadata is serialized via `model_dump(mode="json", exclude_none=True)`
3. Server calls `aggregate_assets_summary(assets)` →
   `_add_asset_to_stats()` per asset → `AssetsSummary`
4. `_add_asset_to_stats()` collects: numberOfBytes, numberOfFiles, approach,
   measurementTechnique, variableMeasured, species, subjects, dataStandard
5. `dataStandard` has deprecated path/encoding heuristic fallbacks for old
   clients (remove after 2026-12-01)
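Steps 3–4 can be sketched with plain dicts (illustrative only — the real `_add_asset_to_stats()` handles many more fields and edge cases):

```python
def add_asset_to_stats(asset: dict, stats: dict) -> None:
    """Fold one serialized asset's metadata into running summary stats."""
    stats["numberOfBytes"] += asset.get("contentSize", 0)
    stats["numberOfFiles"] += 1
    for std in asset.get("dataStandard") or []:
        if std not in stats["dataStandard"]:  # deduplicate across assets
            stats["dataStandard"].append(std)


assets = [
    {"contentSize": 100, "dataStandard": [{"name": "NWB"}]},
    {"contentSize": 50, "dataStandard": [{"name": "NWB"}, {"name": "HED"}]},
]
stats = {"numberOfBytes": 0, "numberOfFiles": 0, "dataStandard": []}
for asset in assets:
    add_asset_to_stats(asset, stats)
```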
### Pre-instantiated Standard Constants

```python
nwb_standard       # RRID:SCR_015242
bids_standard      # RRID:SCR_016124
ome_ngff_standard  # DOI:10.25504/FAIRsharing.9af712
hed_standard       # RRID:SCR_014074
```

These are dicts (`model_dump(mode="json", exclude_none=True)`) used by both
dandischema (heuristic fallbacks) and dandi-cli (per-asset population).
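A sketch of per-asset population using one such pre-dumped constant (the dict literal mirrors the values above and the exact dumped shape is an assumption; the assignment itself is illustrative, not dandi-cli code):

```python
# Assumed shape of the dumped constant (schemaKey is emitted because it
# has a validated default, while None-valued fields are excluded).
nwb_standard = {
    "name": "Neurodata Without Borders (NWB)",
    "identifier": "RRID:SCR_015242",
    "schemaKey": "StandardsType",
}

# Hypothetical client-side population of the per-asset field.
asset_metadata = {"schemaKey": "Asset"}
asset_metadata["dataStandard"] = [nwb_standard]
```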
### Vendorization

The schema supports deployment for different DANDI instances. Environment
variables (`DANDI_INSTANCE_NAME`, `DANDI_INSTANCE_IDENTIFIER`,
`DANDI_DOI_PREFIX`, etc.) must be set **before** importing
`dandischema.models`. This dynamically adjusts identifier patterns, DOI
prefixes, license enums, and URL patterns. CI tests multiple vendored
configurations.
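Why the ordering matters can be shown with a stdlib-only stand-in module that, like `dandischema.models`, reads its configuration at import time (all names here are illustrative, not the real module):

```python
import importlib.util
import os
import tempfile

# Stand-in for dandischema.models: captures configuration when imported.
SOURCE = 'import os\nINSTANCE_NAME = os.environ.get("DANDI_INSTANCE_NAME", "DANDI")\n'

# Setting the variable BEFORE import is what makes vendorization work.
os.environ["DANDI_INSTANCE_NAME"] = "EXAMPLE"

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(SOURCE)
    path = f.name

spec = importlib.util.spec_from_file_location("fake_models", path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
os.unlink(path)

# The value was frozen at import time; later env changes have no effect.
os.environ["DANDI_INSTANCE_NAME"] = "TOO-LATE"
```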
## Schema Change Checklist

When adding or removing fields from any model (BareAsset, Dandiset,
AssetsSummary, etc.):

1. **Update `_FIELDS_INTRODUCED` in `metadata.py:migrate()`** if adding a new
   **top-level field to Dandiset metadata** — `migrate()` only processes
   Dandiset-level dicts (not Asset metadata). Fields on BareAsset or nested
   inside existing structures (e.g. new fields on StandardsType) do not need
   entries here.

2. **Update `consts.py`** if bumping `DANDI_SCHEMA_VERSION` or adding to
   `ALLOWED_INPUT_SCHEMAS`.

3. **Add tests** covering migration/aggregation with the new field.

4. **Coordinate with dandi-cli** — new fields that dandi-cli populates need
   backward-compat guards there (check `"field" in Model.model_fields`) until
   the minimum dandischema dependency is bumped.
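The guard in step 4 might look like this on the dandi-cli side (a sketch using a plain stand-in class; in real code `model_fields` is pydantic v2's mapping of declared fields on the imported dandischema model):

```python
class StandardsType:
    """Stand-in for dandischema.models.StandardsType; only the shape of
    model_fields matters for this sketch."""

    model_fields = {"name": None, "identifier": None,
                    "version": None, "extensions": None}


standard = {"name": "Neurodata Without Borders (NWB)"}

# Only populate the new field if the installed dandischema defines it,
# so the client keeps working against an older dandischema release.
if "extensions" in StandardsType.model_fields:
    standard["extensions"] = []
```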
## Testing Notes

- Tests use `filterwarnings = error` — any new deprecation warning will fail.
- The `clear_dandischema_modules_and_set_env_vars` fixture (conftest.py)
  supports testing vendored configurations by clearing cached modules and
  setting env vars.
- Network-dependent tests are skipped when `DANDI_TESTS_NONETWORK` is set.
- DataCite tests require `DATACITE_DEV_LOGIN` / `DATACITE_DEV_PASSWORD`.
- `test_models.py:test_duplicate_classes` checks for duplicate field qnames
  across models; allowed duplicates are listed explicitly.
`dandischema/models.py`:

```diff
@@ -860,11 +860,27 @@ class MeasurementTechniqueType(BaseType):
 class StandardsType(BaseType):
     """Identifier for data standard used"""

     version: Optional[str] = Field(
         None,
         description="Version of the standard used.",
         json_schema_extra={"nskey": "schema"},
     )
+    extensions: Optional[List["StandardsType"]] = Field(
+        None,
+        description="Extensions to the standard used in this dataset "
+        "(e.g. NWB extensions like ndx-*, HED library schemas).",
+    )
+    # TODO: consider how to formalize BIDS extensions (BEPs) once BIDS
+    # has a machine-readable way to declare them.
     schemaKey: Literal["StandardsType"] = Field(
         "StandardsType", validate_default=True, json_schema_extra={"readOnly": True}
     )


+# Self-referencing model needs rebuild after class definition
+# https://docs.pydantic.dev/latest/concepts/postponed_annotations/#self-referencing-or-recursive-models
+StandardsType.model_rebuild()

 nwb_standard = StandardsType(
     name="Neurodata Without Borders (NWB)",
     identifier="RRID:SCR_015242",
```

> **Member** (review comment on the `extensions` description, with a suggested change): The standard extended can be just one applicable to an asset, not an entire dataset per definition of

> **Member** (review comment on the `extensions` field): Shouldn't we set
```diff
@@ -880,6 +896,11 @@ class StandardsType(BaseType):
     identifier="DOI:10.25504/FAIRsharing.9af712",
 ).model_dump(mode="json", exclude_none=True)

+hed_standard = StandardsType(
+    name="Hierarchical Event Descriptors (HED)",
+    identifier="RRID:SCR_014074",
+).model_dump(mode="json", exclude_none=True)


 class ContactPoint(DandiBaseModel):
     email: Optional[EmailStr] = Field(
```

> **Member** (review comment on `hed_standard`): Defining
> I think that should be done in a separate PR though.
```diff
@@ -1841,6 +1862,12 @@ class BareAsset(CommonModel):
         json_schema_extra={"nskey": "prov"},
     )

+    dataStandard: Optional[List[StandardsType]] = Field(
+        None,
+        description="Data standard(s) applicable to this asset.",
+        json_schema_extra={"readOnly": True},
+    )

     # Bare asset is to be just Asset.
     schemaKey: Literal["Asset"] = Field(
         "Asset", validate_default=True, json_schema_extra={"readOnly": True}
```

Review comments on the `dataStandard` field:

> **Member**: I would guess the answer is no -- that it is intended to be extracted from the source files by the DANDI CLI running client-side.

> **Member (Author)**: INTERESTING! We also have readOnly on above "TODO" attributes which I guess nobody ever assigned to any asset. Although we are indeed to populate it from client (or on server) but since we have other similar
> right?

> **Member**: Shouldn't we have an

> **Member**: With `from __future__ import annotations` at the beginning of the file, you don't need the type to be wrapped in a string.
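For illustration, a serialized asset using the new field might look like the fragment below (hypothetical values assembled by hand — the exact dumped shape, and the `ndx-events` extension name, are assumptions, not output from the PR):

```python
# Hypothetical per-asset metadata fragment; the nested shapes follow
# model_dump(mode="json", exclude_none=True) of the models above.
asset_metadata = {
    "schemaKey": "Asset",
    "dataStandard": [
        {
            "schemaKey": "StandardsType",
            "name": "Neurodata Without Borders (NWB)",
            "identifier": "RRID:SCR_015242",
            "extensions": [
                {"schemaKey": "StandardsType", "name": "ndx-events"},
            ],
        }
    ],
}
```

Nesting extensions under the standard they extend is what the recursive `extensions: Optional[List["StandardsType"]]` field enables.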