
Metadata correction #2177

Merged
jjnesbitt merged 9 commits into dandi:master from candleindark:meta-correction
Mar 28, 2025

@candleindark
Member

This PR provides a solution to correct the corruptions in metadata documented in dandi/dandi-schema#276.

The solution is implemented in two parts.

  1. A general command, correct_metadata, to correct dandiset metadata.
  2. A specific helper function which the general command calls to do the actual correction.
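
As a rough illustration of the second part, the helper could be sketched as below. `find_objs` is the supporting function named later in this conversation, but its signature here is a guess. `correct_affiliations`, the `ALLOWED_AFFILIATION_FIELDS` set, and the assumption that the corruption consists of extra fields on `Affiliation` objects are hypothetical and are not taken from the PR's code.

```python
from __future__ import annotations

import copy
from typing import Any

# Assumption: the only fields a valid Affiliation object may carry.
ALLOWED_AFFILIATION_FIELDS = {"schemaKey", "name", "identifier"}


def find_objs(instance: Any, schema_key: str) -> list[dict]:
    """Recursively collect every dict whose "schemaKey" equals schema_key."""
    found: list[dict] = []
    if isinstance(instance, dict):
        if instance.get("schemaKey") == schema_key:
            found.append(instance)
        for value in instance.values():
            found.extend(find_objs(value, schema_key))
    elif isinstance(instance, list):
        for item in instance:
            found.extend(find_objs(item, schema_key))
    return found


def correct_affiliations(meta: dict) -> dict | None:
    """Return a corrected deep copy of meta, or None if nothing needed fixing."""
    corrected = copy.deepcopy(meta)
    changed = False
    for affiliation in find_objs(corrected, "Affiliation"):
        # Iterate over a snapshot of the keys because the dict is mutated.
        for key in list(affiliation):
            if key not in ALLOWED_AFFILIATION_FIELDS:
                del affiliation[key]
                changed = True
    return corrected if changed else None
```

Returning `None` for untouched metadata lets a caller skip the database write for versions that were never corrupted, which matches the statement below that running on all versions only changes the corrupted ones.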

Extensive tests are provided for the helper function and its supporting function. However, because I am not familiar with this repo and Django in general, I am not able to provide tests for the command which interacts with the database. Advice and additional tests are very much appreciated.

The command can be run either on a targeted dandiset at a particular version or on all versions of all dandisets. Running on all dandiset versions will only modify the corrupted ones. If running only on targeted dandiset versions is preferable, please let me know, and I will provide the list of corrupted dandiset versions. (Additionally, would it be better to change the command's interface to accept a file listing the corrupted dandiset versions?)

@yarikoptic yarikoptic requested a review from asmacdo February 10, 2025 16:52
Member

@asmacdo asmacdo left a comment


a couple of questions to start off with-- I'll have another look tomorrow.

@yarikoptic yarikoptic added the internal, maintenance, and metadata labels Feb 12, 2025
Member

@waxlamp waxlamp left a comment


Thank you, @candleindark! One thing to change is listed below. I've also asked @jjnesbitt to provide a more comprehensive review of this PR.

@waxlamp waxlamp requested a review from jjnesbitt February 27, 2025 19:11
@candleindark
Member Author

@jjnesbitt I have put in changes requested by @waxlamp regarding the command line interface. With the latest push, two unrelated tests failed. Can they be related to the latest changes in dandi-cli?

@jjnesbitt
Member

I've pushed a commit which aims to simplify the logic in this management command. Namely:

  • Remove the configuration of the correcting function.
    • We have no real use case for this yet, and no way to change this function is provided.
  • Removal of verbose error handling logic
    • Along the same lines as the above point, since this is a single function, this is largely unnecessary. We have tests, and if the correcting function fails for some reason, the error will simply bubble up.
  • Only support correction of the Affiliation schema key
    • Similarly again, we only have the use case of Affiliation at the moment, so all of the configuration surrounding other schema keys is unnecessary at the moment.
  • Remove find_objs tests against generalized schema key
  • Write manifest files synchronously
    • Since we are not in a request context, it's better to write the manifest files to S3 synchronously, so that we can be sure it succeeds.
  • Use transactions to couple logic of metadata save and manifest file generation
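
The last two points can be illustrated with a minimal sketch. It uses the standard-library `sqlite3` module in place of Django's ORM and transaction handling, so it shows the pattern rather than the PR's code. The `version` table, `save_corrected`, and the `write_manifest` callback are hypothetical names.

```python
import json
import sqlite3


def save_corrected(conn, version_id, metadata, write_manifest):
    """Save corrected metadata and write its manifest as one unit of work."""
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so a failed manifest write also undoes the metadata update.
    with conn:
        conn.execute(
            "UPDATE version SET metadata = ? WHERE id = ?",
            (json.dumps(metadata), version_id),
        )
        # Called synchronously: any exception propagates to the caller.
        write_manifest(version_id, metadata)


def load_metadata(conn, version_id):
    """Read back the stored metadata for one version."""
    row = conn.execute(
        "SELECT metadata FROM version WHERE id = ?", (version_id,)
    ).fetchone()
    return json.loads(row[0])
```

Because the manifest write happens inside the transaction and is not deferred to a background task, the command either finishes with both the database and the manifest updated or fails loudly with the database unchanged.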

IMO this is ready, maybe @waxlamp should take another look.

@mvandenburgh mvandenburgh self-requested a review March 4, 2025 16:08
@waxlamp
Member

waxlamp commented Mar 11, 2025

As per the meeting of 2025-03-10 (involving @yarikoptic, @candleindark, @jjnesbitt, @kabilar, and @waxlamp), we decided to reduce the scope of this change to just the draft versions; @jjnesbitt will work on adjusting the PR to that narrowed scope.

@candleindark, @yarikoptic: are we planning to release a new schema version that has restrictions on extra fields? If we apply the metadata correction to draft versions, then I suppose whether or not such a schema is released, there will not be any further validation errors.

@candleindark
Member Author

> @candleindark, @yarikoptic: are we planning to release a new schema version that has restrictions on extra fields?

Yes, some minor changes to the schema will be introduced by this PR

> If we apply the metadata correction to draft versions, then I suppose whether or not such a schema is released, there will not be any further validation errors.

I can't say that, after the corrections to the draft versions, there will be no more validation errors in those versions. I can only say that if you make the corrections to the draft versions, there will be no validation errors in those versions due to the particular change brought by that PR.

dandi/dandi-schema#266 (comment) provides an analysis of validation errors in the metadata instances caused by the changes in the PR but not an analysis of validation errors in the metadata instances in general.

@waxlamp
Member

waxlamp commented Mar 12, 2025

> > If we apply the metadata correction to draft versions, then I suppose whether or not such a schema is released, there will not be any further validation errors.
>
> I can't say that, after the corrections to the draft versions, there will be no more validation errors in those versions. I can only say that if you make the corrections to the draft versions, there will be no validation errors in those versions due to the particular change brought by that PR.

Sorry, this was what I meant.

@jjnesbitt
Member

Placing this into draft mode until #2224 is merged, at which point this PR can be updated to make use of it.

@jjnesbitt jjnesbitt marked this pull request as draft March 20, 2025 15:25
@yarikoptic
Member

Do you think you would have time to work out a solution for #2224 (it is an issue), or should someone else try to approach it to facilitate a potentially (not necessarily "likely" ;) ) faster resolution?

@jjnesbitt
Member

> Do you think you would have time to work out a solution for #2224 (it is an issue), or should someone else try to approach it to facilitate a potentially (not necessarily "likely" ;) ) faster resolution?

I'm already working on #2224, it shouldn't take much time to complete.

candleindark and others added 6 commits March 27, 2025 16:08
Provide solution to correct the corruption
of `Affiliation` JSON objects documented in
dandi/dandi-schema#276
This makes the default behavior require the
user to specify a particular dandiset version
to apply the correction to. Only when the `--all`
flag is provided should the command apply the
correction to all dandiset versions
- Don't allow correct function configuration
- Remove verbose error handling logic
- Write manifest files synchronously
- Only support correction of the `Affiliation` schema key
- Remove `find_objs` tests against generalized schema key
- Use transactions to couple logic of metadata save and manifest file generation
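
The command-line behavior described in the second commit message above (a specific dandiset version by default, every version only with `--all`) could look roughly like the following. This is a sketch using the standard-library `argparse` module. The positional argument names and the `parse_target` helper are hypothetical, and the merged command may differ, especially after the scope was narrowed to draft versions.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical correct_metadata invocation."""
    parser = argparse.ArgumentParser(prog="correct_metadata")
    parser.add_argument("dandiset", nargs="?", help="dandiset identifier")
    parser.add_argument("version", nargs="?", help="version of the dandiset")
    parser.add_argument(
        "--all",
        dest="all_versions",
        action="store_true",
        help="apply the correction to all dandiset versions",
    )
    return parser


def parse_target(argv):
    """Return (dandiset, version), or None when --all was requested."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.all_versions:
        if args.dandiset is not None or args.version is not None:
            parser.error("--all cannot be combined with a specific version")
        return None
    if args.dandiset is None or args.version is None:
        # parser.error prints a usage message and exits with status 2.
        parser.error("specify a dandiset and a version, or pass --all")
    return (args.dandiset, args.version)
```

Making the targeted form the default means an accidental bare invocation fails with a usage error instead of touching every dandiset version.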
@jjnesbitt jjnesbitt marked this pull request as ready for review March 27, 2025 21:32
@jjnesbitt
Member

@waxlamp @mvandenburgh This is ready to go now.

Member

@mvandenburgh mvandenburgh left a comment


Just some suggested optimizations to memory usage. They're probably not needed, since the number of Versions is not very high, but the `heroku run` dynos have fairly limited memory, so it seems sensible to do.

Otherwise, LGTM

Co-Authored-By: Mike VanDenburgh <michael.vandenburgh@kitware.com>
@jjnesbitt jjnesbitt requested a review from mvandenburgh March 28, 2025 15:19
@jjnesbitt jjnesbitt merged commit 8b12b99 into dandi:master Mar 28, 2025
9 checks passed
@yarikoptic
Member

So this was merged and now we have the command. Who is allowed to run it? Could I?

@jjnesbitt
Member

> So this was merged and now we have the command. Who is allowed to run it? Could I?

It is only in staging at the moment, not in production. I am planning to run in staging, and then once this is deployed in production (blocked by vue3), run there as well.

@dandibot
Member

🚀 PR was released in v0.5.0 🚀

@dandibot dandibot added the released label Apr 16, 2025
@candleindark candleindark deleted the meta-correction branch April 16, 2025 19:02
@jjnesbitt
Member

This has been successfully applied in staging and production.

kabilar pushed a commit to lincbrain/linc-archive that referenced this pull request Jul 25, 2025

Labels

  • internal: Changes only affect the internal API
  • maintenance: Action to maintain the system (neither a bugfix nor an enhancement)
  • metadata: Issues of dandiset/asset metadata handling
  • released: This issue/pull request has been released.


7 participants