Metadata correction by candleindark · Pull Request #2177 · dandi/dandi-archive

candleindark · 2025-02-10T09:31:28Z

This PR provides a solution to correct the corruptions in metadata documented in dandi/dandi-schema#276.

The solution is implemented in two parts.

- A general command, `correct_metadata`, to correct dandiset metadata.
- A specific helper function, which the general command calls to do the actual correction.
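A minimal sketch of how such a helper might work, assuming the corruption consists of leftover fields on `Affiliation` objects. The allowed-field set and the name `correct_affiliation_corruption` are assumptions made for illustration; `find_objs` is named after the supporting function mentioned later in this thread, but its signature here is a guess.

```python
import copy
from typing import Any, Dict, List, Optional

# Hypothetical allow-list; the real one would come from the Affiliation
# model in dandischema.
ALLOWED_AFFILIATION_FIELDS = {"schemaKey", "name", "identifier"}


def find_objs(instance: Any, schema_key: str) -> List[Dict[str, Any]]:
    """Recursively collect every dict whose "schemaKey" equals schema_key."""
    found: List[Dict[str, Any]] = []
    if isinstance(instance, dict):
        if instance.get("schemaKey") == schema_key:
            found.append(instance)
        for value in instance.values():
            found.extend(find_objs(value, schema_key))
    elif isinstance(instance, list):
        for item in instance:
            found.extend(find_objs(item, schema_key))
    return found


def correct_affiliation_corruption(metadata: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a corrected deep copy, or None when nothing needed fixing."""
    corrected = copy.deepcopy(metadata)
    changed = False
    for obj in find_objs(corrected, "Affiliation"):
        # Drop every key that is not on the (assumed) allow-list.
        for key in set(obj) - ALLOWED_AFFILIATION_FIELDS:
            del obj[key]
            changed = True
    return corrected if changed else None
```

Returning `None` for untouched metadata lets a caller skip the database write for versions that were never corrupted.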

Extensive tests are provided for the helper function and its supporting function. However, because I am not familiar with this repo and Django in general, I am not able to provide tests for the command which interacts with the database. Advice and additional tests are very much appreciated.

The command can be run either on a single targeted dandiset version or on all versions of all dandisets. When run on all dandiset versions, it corrects only the corrupted ones. If running only on targeted dandiset versions is preferable, please let me know, and I will provide the list of corrupted dandiset versions. (Alternatively, would it be better to change the command's interface to accept a file listing the corrupted dandiset versions?)
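The two modes described above could be exposed roughly as follows. This is only a sketch using the standard-library `argparse`; the positional `dandiset` and `version` arguments and the `parse_target` helper are hypothetical, and the actual command in this PR may define its interface differently. The `--all` flag is the one named in the commit messages of this PR.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="correct_metadata")
    # Hypothetical positional arguments identifying one dandiset version.
    parser.add_argument("dandiset", nargs="?", help="dandiset identifier")
    parser.add_argument("version", nargs="?", help="version of the dandiset")
    parser.add_argument(
        "--all",
        action="store_true",
        dest="all_versions",
        help="apply the correction to all dandiset versions",
    )
    return parser


def parse_target(argv: List[str]) -> argparse.Namespace:
    """Require either one explicit dandiset version or --all, never both."""
    parser = build_parser()
    args = parser.parse_args(argv)
    has_target = args.dandiset is not None and args.version is not None
    if args.all_versions and (args.dandiset or args.version):
        parser.error("--all cannot be combined with a specific dandiset version")
    if not args.all_versions and not has_target:
        parser.error("specify a dandiset and a version, or pass --all")
    return args
```

Making the targeted form the default means an accidental bare invocation fails with a usage error instead of touching every version.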

asmacdo

A couple of questions to start off with. I'll have another look tomorrow.

dandiapi/api/management/commands/correct_metadata.py

waxlamp

Thank you, @candleindark! One thing to change, listed below. I've also asked @jjnesbitt to provide a more comprehensive review of this PR.

dandiapi/api/management/commands/correct_metadata.py

candleindark · 2025-03-02T23:44:07Z

@jjnesbitt I have put in changes requested by @waxlamp regarding the command line interface. With the latest push, two unrelated tests failed. Can they be related to the latest changes in dandi-cli?

jjnesbitt · 2025-03-03T18:03:09Z

I've pushed a commit which aims to simplify the logic in this management command. Namely:

- Remove the configuration of the correcting function.
  - We have no real use case for this yet, and no way to change this function is provided.
- Remove verbose error handling logic.
  - Along the same lines as the above point, since this is a single function, this is largely unnecessary. We have tests, and if the correcting function fails for some reason, the error will simply bubble up.
- Only support correction of the `Affiliation` schema key.
  - Similarly, we only have the use case of `Affiliation` at the moment, so all of the configuration surrounding other schema keys is unnecessary.
- Remove `find_objs` tests against generalized schema key.
- Write manifest files synchronously.
  - Since we are not in a request context, it's better to write the manifest files to S3 synchronously, so that we can be sure it succeeds.
- Use transactions to couple logic of metadata save and manifest file generation.
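The last two points can be illustrated with a small stand-alone sketch. It uses the standard-library `sqlite3` module as a stand-in for the Django ORM (where `django.db.transaction.atomic()` would play the same role); the `version` table and the `write_manifest` callback are hypothetical and only serve to show the coupling.

```python
import json
import sqlite3
from typing import Any, Callable, Dict


def save_with_manifest(
    conn: sqlite3.Connection,
    version_id: int,
    metadata: Dict[str, Any],
    write_manifest: Callable[[int, Dict[str, Any]], None],
) -> None:
    """Save metadata and write the manifest as a single unit of work."""
    # Using the connection as a context manager commits on success and
    # rolls back if anything inside the block raises.
    with conn:
        conn.execute(
            "UPDATE version SET metadata = ? WHERE id = ?",
            (json.dumps(metadata), version_id),
        )
        # Called synchronously: a failure here raises and undoes the UPDATE,
        # so the stored metadata and the manifest cannot drift apart.
        write_manifest(version_id, metadata)
```

If the manifest write were dispatched asynchronously instead, the transaction would commit before the outcome of the write was known.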

IMO this is ready, maybe @waxlamp should take another look.

waxlamp · 2025-03-11T20:21:54Z

As per the meeting of 2025-03-10 (involving @yarikoptic, @candleindark, @jjnesbitt, @kabilar, and @waxlamp), we decided to reduce the scope of this change to just the draft versions; @jjnesbitt will work on adjusting the PR to that narrowed scope.

@candleindark, @yarikoptic: are we planning to release a new schema version that has restrictions on extra fields? If we apply the metadata correction to draft versions, then I suppose whether or not such a schema is released, there will not be any further validation errors.

candleindark · 2025-03-11T21:48:28Z

> @candleindark, @yarikoptic: are we planning to release a new schema version that has restrictions on extra fields?

Yes, some minor changes to the schema will be introduced by this PR.

> If we apply the metadata correction to draft versions, then I suppose whether or not such a schema is released, there will not be any further validation errors.

I can't say that, after the corrections to the draft versions, there will be no more validation errors in those versions. I can only say that if you make the corrections to the draft versions, there will be no validation errors in those versions due to the particular change brought by that PR.

dandi/dandi-schema#266 (comment) provides an analysis of validation errors in the metadata instances caused by the changes in the PR but not an analysis of validation errors in the metadata instances in general.

waxlamp · 2025-03-12T17:23:56Z

> > If we apply the metadata correction to draft versions, then I suppose whether or not such a schema is released, there will not be any further validation errors.
>
> I can't say that, after the corrections to the draft versions, there will be no more validation errors in those versions. I can only say that if you make the corrections to the draft versions, there will be no validation errors in those versions due to the particular change brought by that PR.

Sorry, this was what I meant.

jjnesbitt · 2025-03-20T15:25:44Z

Placing this into draft mode until #2224 is merged, at which point this PR can be updated to make use of it.

yarikoptic · 2025-03-20T16:15:12Z

Do you think you would have time to work out a solution for #2224 (it is an issue), or should someone else try to approach it to facilitate a potentially (not necessarily "likely" ;) ) faster resolution?

jjnesbitt · 2025-03-20T17:12:57Z

> Do you think you would have time to work out a solution for #2224 (it is an issue), or should someone else try to approach it to facilitate a potentially (not necessarily "likely" ;) ) faster resolution?

I'm already working on #2224, it shouldn't take much time to complete.

Provide solution to correct the corruption of `Affiliation` JSON objects documented in dandi/dandi-schema#276

This makes the default behavior require the user to specify a particular dandiset version to apply the correction to. Only when the `--all` flag is provided should the command apply the correction to all dandiset versions.

- Don't allow correct function configuration
- Remove verbose error handling logic
- Write manifest files synchronously
- Only support correction of the `Affiliation` schema key
- Remove `find_objs` tests against generalized schema key
- Use transactions to couple logic of metadata save and manifest file generation

jjnesbitt · 2025-03-27T21:33:53Z

@waxlamp @mvandenburgh This is ready to go now.

mvandenburgh

Just some suggested optimizations to memory usage. It's probably not needed due to the number of Versions not being super high, but the heroku run dynos have pretty limited memory so it seems logical to do.

Otherwise, LGTM
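The memory concern above is about loading every row at once versus streaming rows in chunks. In Django this is what `QuerySet.iterator()` is for; the sketch below shows the same idea with the standard-library `sqlite3` module as a stand-in, using a hypothetical `version` table and a hypothetical helper name `iter_versions`.

```python
import sqlite3
from typing import Iterator, Tuple


def iter_versions(
    conn: sqlite3.Connection, chunk_size: int = 100
) -> Iterator[Tuple[int, str]]:
    """Yield (id, metadata) rows while holding at most chunk_size rows in memory."""
    cursor = conn.execute("SELECT id, metadata FROM version ORDER BY id")
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        # Hand rows to the caller one at a time; the next chunk is fetched
        # only once this one has been consumed.
        yield from rows
```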

dandiapi/api/management/commands/correct_metadata.py

Co-Authored-By: Mike VanDenburgh <michael.vandenburgh@kitware.com>

yarikoptic · 2025-04-02T20:03:40Z

So this was merged and now we have the command. Who is allowed to run it? Could I?

jjnesbitt · 2025-04-02T20:29:00Z

> So this was merged and now we have the command. Who is allowed to run it? Could I?

It is only in staging at the moment, not in production. I am planning to run in staging, and then once this is deployed in production (blocked by vue3), run there as well.

dandibot · 2025-04-16T18:59:03Z

🚀 PR was released in v0.5.0 🚀

jjnesbitt · 2025-04-24T20:11:35Z

This has been successfully applied in staging and production.

yarikoptic requested a review from asmacdo February 10, 2025 16:52

asmacdo reviewed Feb 10, 2025

dandiapi/api/management/commands/correct_metadata.py (outdated)

dandiapi/api/management/commands/correct_metadata.py (outdated)

yarikoptic added labels Feb 12, 2025: internal (Changes only affect the internal API), maintenance (Action to maintain the system, neither a bugfix nor an enhancement), metadata (Issues of dandiset/asset metadata handling)

waxlamp previously requested changes Feb 27, 2025

dandiapi/api/management/commands/correct_metadata.py (outdated)

waxlamp requested a review from jjnesbitt February 27, 2025 19:11

jjnesbitt reviewed Feb 28, 2025

dandiapi/api/management/commands/correct_metadata.py (outdated)

candleindark mentioned this pull request Feb 28, 2025

Add proper migration for "Organization" -> "Affiliation" change dandi/dandi-schema#276

Closed

mvandenburgh self-requested a review March 4, 2025 16:08

waxlamp assigned jjnesbitt and candleindark Mar 11, 2025

jjnesbitt requested a review from waxlamp March 14, 2025 17:13

yarikoptic mentioned this pull request Mar 18, 2025

Do try to (re)mint DOIs for prior failed to be minted but released dandisets #1953

Closed

2 tasks

jjnesbitt marked this pull request as draft March 20, 2025 15:25

jjnesbitt force-pushed the meta-correction branch from 0f84e38 to e79b486 Compare March 26, 2025 19:43

candleindark and others added 6 commits March 27, 2025 16:08

feat: Add API management command for correcting corrupt metadata

8ea2255

feat: add solution to correct Affiliation corruption

25e4cfa

Provide solution to correct the corruption of `Affiliation` JSON objects documented in dandi/dandi-schema#276

Move test_correct_metadata.py into tests/

eba4c3b

feat: change call interface of correct_metadata cmd

a21f6d0

This makes the default behavior require the user to specify a particular dandiset version to apply the correction to. Only when the `--all` flag is provided should the command apply the correction to all dandiset versions.

Add --check flag

256a6dc

Only apply correction to draft versions

b02e2ff

jjnesbitt force-pushed the meta-correction branch from e79b486 to b02e2ff Compare March 27, 2025 20:08

Add audit event to correct_metadata command

c2c14f5

jjnesbitt marked this pull request as ready for review March 27, 2025 21:32

mvandenburgh requested changes Mar 28, 2025

dandiapi/api/management/commands/correct_metadata.py (outdated)

dandiapi/api/management/commands/correct_metadata.py (outdated)

Use queryset iterators to reduce memory usage

3ee22a7

Co-Authored-By: Mike VanDenburgh <michael.vandenburgh@kitware.com>

jjnesbitt requested a review from mvandenburgh March 28, 2025 15:19

mvandenburgh approved these changes Mar 28, 2025

jjnesbitt merged commit 8b12b99 into dandi:master Mar 28, 2025
9 checks passed

dandibot added the released This issue/pull request has been released. label Apr 16, 2025

candleindark deleted the meta-correction branch April 16, 2025 19:02

kabilar pushed a commit to lincbrain/linc-archive that referenced this pull request Jul 25, 2025

Merge pull request dandi#2177 from candleindark/meta-correction

8c6bef1

Metadata correction

7 participants