Skip to content

feat: add publication format and DOI validation checks#651

Open
bendichter wants to merge 6 commits intodevfrom
add-check-related-pubs
Open

feat: add publication format and DOI validation checks#651
bendichter wants to merge 6 commits intodevfrom
add-check-related-pubs

Conversation

@bendichter
Copy link
Copy Markdown
Contributor

Add two new inspector checks for related_publications metadata:

  • check_publication_list_format: detects comma-separated DOIs/URLs that should be separate list entries
  • check_publication_doi_resolves: verifies DOI URLs resolve to valid publications via network requests

Includes documentation updates for best practices guidance.

Closes #419

Motivation

What was the reasoning behind this change? Please explain the changes briefly.

bendichter and others added 2 commits December 24, 2025 09:34
Add two new inspector checks for related_publications metadata:
- check_publication_list_format: detects comma-separated DOIs/URLs
  that should be separate list entries
- check_publication_doi_resolves: verifies DOI URLs resolve to valid
  publications via network requests

Includes documentation updates for best practices guidance.

Closes #419
bendichter and others added 4 commits March 25, 2026 09:35
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The original tests hit real DOI endpoints, making them flaky in CI and
offline environments. Replace with mock-based tests that cover the same
logic paths without network access. Add unit tests for _convert_doi_to_url
and _check_url_resolves helpers to cover previously missing lines
(HTTP error, URL error, timeout branches).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 92.72727% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.29%. Comparing base (60bfe7f) to head (062df69).
⚠️ Report is 1 commits behind head on dev.

Files with missing lines Patch % Lines
src/nwbinspector/checks/_nwbfile_metadata.py 92.72% 4 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##              dev     #651      +/-   ##
==========================================
+ Coverage   79.87%   83.29%   +3.42%     
==========================================
  Files          48       48              
  Lines        1873     1928      +55     
==========================================
+ Hits         1496     1606     +110     
+ Misses        377      322      -55     
Files with missing lines Coverage Δ
src/nwbinspector/checks/__init__.py 100.00% <ø> (ø)
src/nwbinspector/checks/_nwbfile_metadata.py 95.41% <92.72%> (-0.80%) ⬇️

... and 5 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Collaborator

@h-mayorquin h-mayorquin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the check_publication_list_format severity should be a BestPracticeViolation or even a CRITICAL one.

The schema definition for related_publications specifies this field as a 1-D array with dims: [num_publications] and shape: [null]. Each element in the array is meant to be one publication reference. When someone stuffs multiple DOIs into a single comma-separated string, they are violating the intended data structure, which I think warrants a stronger severity. I am also unable to think of false positives that would justify making this a weaker severity. What do you think?

Regarding check_publication_doi_resolves, I am wary of adding a network-based check. I think this can fail in many ways even if the DOI is valid:

  1. The user has no internet connection or is running the inspector offline.
  2. The user is behind a firewall or proxy that blocks outbound HTTP (common in institutional and CI environments).
  3. doi.org is temporarily down, rate-limiting, or returning transient errors.
  4. DOI blocks certain types of requests when coming from specific places. This happens to some of the link checks on the GitHub CI and I am expecting this will be on the CI pipeline causing problems in the same direction.

In its current state the check cannot distinguish between "the DOI is invalid" and "the network is not reaching it," which means it would report false positives in all of these cases.

A possible pattern to make this more robust: before checking the user's DOIs, first test against a well-known DOI that we know resolves (a canary request). If the canary fails, we know the problem is the network, not the data, and we skip the check entirely. If the canary succeeds, we can trust that any subsequent failures are real. With that safeguard in place, I think the check should be CRITICAL as URLs that do not resolve are clearly wrong, right?

What do you say if I open another PR with the check_publication_list_format so we merge that quickly and we think a bit more about this as network checks are uncharted territory for this repo?

@h-mayorquin h-mayorquin marked this pull request as ready for review March 25, 2026 18:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Add Check]: Comma-separated entries in releated_publication should be list of entries

3 participants