feat: add publication format and DOI validation checks #651
bendichter wants to merge 6 commits into dev from
Conversation
Add two new inspector checks for related_publications metadata:
- check_publication_list_format: detects comma-separated DOIs/URLs that should be separate list entries
- check_publication_doi_resolves: verifies DOI URLs resolve to valid publications via network requests

Includes documentation updates for best practices guidance.

Closes #419
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The original tests hit real DOI endpoints, making them flaky in CI and offline environments. Replace with mock-based tests that cover the same logic paths without network access. Add unit tests for the _convert_doi_to_url and _check_url_resolves helpers to cover previously missing lines (HTTP error, URL error, timeout branches).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report ❌ Patch coverage is
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##              dev     #651      +/-   ##
==========================================
+ Coverage   79.87%   83.29%    +3.42%
==========================================
  Files          48       48
  Lines        1873     1928       +55
==========================================
+ Hits         1496     1606      +110
+ Misses        377      322       -55
```
h-mayorquin left a comment
I think the check_publication_list_format severity should be a BestPracticeViolation or even a CRITICAL one.
The schema definition for related_publications specifies this field as a 1-D array with dims: [num_publications] and shape: [null]. Each element in the array is meant to be one publication reference. When someone stuffs multiple DOIs into a single comma-separated string, they are violating the intended data structure, which I think warrants a stronger severity. I am also unable to think of false positives that would justify making this a weaker severity. What do you think?
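To make the structural violation concrete, here is a sketch with made-up DOI values, assuming the check flags any array element containing a comma (the actual detection logic in check_publication_list_format may differ):

```python
# Schema-conformant: one publication reference per array element.
good = ["https://doi.org/10.1000/aaa", "https://doi.org/10.1000/bbb"]

# Violates the intended 1-D structure: two DOIs packed into one string.
bad = ["https://doi.org/10.1000/aaa, https://doi.org/10.1000/bbb"]


def looks_comma_packed(publications: list[str]) -> bool:
    """Rough sketch of what check_publication_list_format could flag."""
    return any("," in entry for entry in publications)
```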
Regarding check_publication_doi_resolves, I am wary of adding a network-based check. I think this can fail in many ways even if the DOI is valid:
- The user has no internet connection or is running the inspector offline.
- The user is behind a firewall or proxy that blocks outbound HTTP (common in institutional and CI environments).
- doi.org is temporarily down, rate-limiting, or returning transient errors.
- doi.org blocks certain types of requests coming from specific places. This already happens with some of the link checks on GitHub CI, and I expect it would cause the same kind of problems for this check in the CI pipeline.
In its current state the check cannot distinguish between "the DOI is invalid" and "the network is not reaching it," which means it would report false positives in all of these cases.
A possible pattern to make this more robust: before checking the user's DOIs, first test against a well-known DOI that we know resolves (a canary request). If the canary fails, we know the problem is the network, not the data, and we skip the check entirely. If the canary succeeds, we can trust that any subsequent failures are real. With that safeguard in place, I think the check should be CRITICAL as URLs that do not resolve are clearly wrong, right?
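The canary pattern above could be sketched like this. The canary DOI, helper names, and the skip-by-returning-None convention are all assumptions for illustration, not the inspector's actual API:

```python
import urllib.error
import urllib.request

# Assumed known-good DOI used purely as a network probe.
CANARY_DOI_URL = "https://doi.org/10.1038/nature12373"


def _resolves(url: str, timeout: float = 5.0) -> bool:
    """True only if the URL answers with an HTTP status < 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.HTTPError, urllib.error.URLError, TimeoutError):
        return False


def check_dois(doi_urls, resolves=_resolves):
    """Return the list of unresolvable DOI URLs, or None if the check is skipped.

    Canary pattern: if a known-good DOI does not resolve, the network (not
    the data) is at fault, so skip the check instead of reporting false
    positives. Only if the canary succeeds are failures treated as real.
    """
    if not resolves(CANARY_DOI_URL):
        return None  # network problem: skip entirely
    return [url for url in doi_urls if not resolves(url)]
```

Injecting the resolver as a parameter also keeps the check testable without network access, which fits the mock-based testing approach used elsewhere in this PR.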
What do you say I open another PR with just check_publication_list_format so we can merge that quickly, and we take a bit more time to think about this one, since network checks are uncharted territory for this repo?