Harvesting client improvements: configurable delay between GetRecord calls; a fix for a problem with long-running DataCite harvests #11486

Merged
sekmiller merged 16 commits into develop from 11473-harvesting-client-ratelimit
Mar 4, 2026

@landreev
Contributor

@landreev landreev commented May 12, 2025

What this PR does / why we need it:

This is based on a patch that I made a while ago for another Dataverse instance. It has come in handy here at HDV, and it may benefit other instances out there.
The changes are quite straightforward.

From the accompanying release note:

A setting has been added for configuring sleep intervals between OAI calls for specific harvesting clients, making it possible to harvest uninterrupted from servers that enforce rate limit policies. See the configuration guide for details. Additionally, this release fixes a problem with harvesting from DataCite OAI-PMH where initial, long-running harvests were failing on sets with large numbers of records.

Which issue(s) this PR closes:

Special notes for your reviewer:

Suggestions on how to test this:

Create a harvesting client and harvest something. For example, the controlTestSet from demo. Confirm that it's working without this new, optional setting.
Create the setting as described in the guide. If you configure an interval for the client you're testing, there will be an .info message in the log that the specified number of milliseconds will be used as the delay. If you specify it for a different client, confirm that the client you're testing is unaffected and the delay utilized is 0; i.e. no delay.
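
The per-client behavior described in these steps can be sketched roughly as follows. This is only an illustration, not the code in this PR: the class `HarvestDelaySketch`, its method names, and the shape of the map (client nickname to delay in milliseconds) are all assumptions made for the example.

```java
import java.util.Map;

// Hypothetical helper for illustration; not Dataverse's actual implementation.
class HarvestDelaySketch {

    // Returns the configured delay in milliseconds for the given client,
    // or 0 (no delay) when that client has no entry in the map.
    static long delayFor(Map<String, Long> delaysByClient, String clientNickname) {
        Long configured = delaysByClient.get(clientNickname);
        return (configured == null || configured < 0) ? 0L : configured;
    }

    // Pauses between two consecutive GetRecord calls; a delay of 0 returns immediately.
    static void pauseBetweenCalls(long delayMillis) throws InterruptedException {
        if (delayMillis > 0) {
            Thread.sleep(delayMillis);
        }
    }
}
```

With a map that only contains an entry for one client, `delayFor` returns that client's interval and 0 for every other client, which matches the "unaffected" case in the second test step.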

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

…g client calls

(the only thing I want to add is an option of enabling this setting for specific clients; similarly to how ingest size limits can be for all, or some specific formats only. #11473
@coveralls

coveralls commented May 12, 2025

Coverage Status

coverage: 24.397% (-0.02%) from 24.414%
when pulling 5b87133 on 11473-harvesting-client-ratelimit
into f20e75a on develop.

Resolved merge conflicts in:
	src/main/java/edu/harvard/iq/dataverse/settings/SettingsServiceBean.java
	src/main/java/edu/harvard/iq/dataverse/util/SystemConfig.java
@landreev landreev changed the title adds a setting for configuring a delay between GetRecord calls in harvesting client Harvesting client improvements: configurable delay between GetRecord calls; a fix for a problem with long-running DataCite harvests Feb 26, 2026
@landreev
Contributor Author

Un-drafting this thing.

@landreev landreev marked this pull request as ready for review February 26, 2026 15:57
@landreev
Contributor Author

(will sync w/ develop shortly)

@scolapasta scolapasta moved this from Ready for Triage to Ready for Review ⏩ in IQSS Dataverse Project Feb 26, 2026
@scolapasta scolapasta added this to the 6.10 milestone Feb 26, 2026
@stevenwinship stevenwinship self-assigned this Feb 26, 2026
@stevenwinship stevenwinship moved this from Ready for Review ⏩ to In Review 🔎 in IQSS Dataverse Project Feb 26, 2026
@stevenwinship stevenwinship added the FY26 Sprint 18 (2026-02-25 - 2026-03-11) label Feb 26, 2026
@landreev
Contributor Author

Synced the branch with develop just now. There were no merge conflicts to resolve however, contrary to what GitHub was saying. 🤔

@github-project-automation github-project-automation bot moved this from In Review 🔎 to Ready for QA ⏩ in IQSS Dataverse Project Feb 26, 2026
@stevenwinship stevenwinship removed their assignment Feb 26, 2026
@stevenwinship stevenwinship moved this from Ready for QA ⏩ to In Review 🔎 in IQSS Dataverse Project Feb 27, 2026
@stevenwinship stevenwinship self-assigned this Feb 27, 2026
@landreev
Contributor Author

@stevenwinship Thanks for adding the test - looks good.

@landreev landreev moved this from In Review 🔎 to Ready for QA ⏩ in IQSS Dataverse Project Feb 27, 2026
Co-authored-by: Philip Durbin <philip_durbin@harvard.edu>
@landreev
Contributor Author

landreev commented Mar 2, 2026

I just killed and deleted the latest Jenkins build, since it was triggered by a change in the release note.
The last Jenkins job that ran did in fact succeed.

@github-actions

github-actions bot commented Mar 2, 2026

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:11473-harvesting-client-ratelimit
ghcr.io/gdcc/configbaker:11473-harvesting-client-ratelimit

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

@sekmiller sekmiller self-assigned this Mar 2, 2026
@sekmiller sekmiller moved this from Ready for QA ⏩ to QA ✅ in IQSS Dataverse Project Mar 2, 2026
@sekmiller
Contributor

If you add the setting via the API with a bad JSON file, you still get the success message "Setting :HarvestingClientCallRateLimit added", but when you run the harvest you get "Disabling all harvesting client delay intervals completely until fixed!" in the log. Is there any way that we can flag this when the setting is created, or show the error in the UI on harvest instead of only in the log?
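
The all-or-nothing fallback reported here could look something like the sketch below. It is a guess at the general shape only: `DelayFallbackSketch` and `toDelaysOrEmpty` are hypothetical names, and the input is assumed to be the setting's JSON already parsed into string keys and string values. Only the log message text is taken from this thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical sketch; the real implementation may differ.
class DelayFallbackSketch {
    private static final Logger logger = Logger.getLogger(DelayFallbackSketch.class.getName());

    // Converts raw string values to millisecond delays. If any single value is
    // invalid, every delay is dropped and an empty map is returned.
    static Map<String, Long> toDelaysOrEmpty(Map<String, String> rawDelays) {
        Map<String, Long> parsed = new HashMap<>();
        try {
            for (Map.Entry<String, String> entry : rawDelays.entrySet()) {
                parsed.put(entry.getKey(), Long.parseLong(entry.getValue()));
            }
        } catch (NumberFormatException e) {
            logger.warning("Disabling all harvesting client delay intervals completely until fixed!");
            return Map.of();
        }
        return parsed;
    }
}
```

Because the problem only surfaces when the map is read during a harvest, nothing in this path can report back to whoever saved the setting, which is the gap raised above.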

@landreev
Contributor Author

landreev commented Mar 4, 2026

Hmm. I based this implementation on, or copied it from, how Oliver implemented a similar map setting for tabular ingest size limits. So yes, it is possible to create it with junk values that are only validated when used; similarly to :TabularIngestSizeLimit.

I couldn't think of a completely trivial/cheap way to quickly add validation on create. So I'd like to petition to let it slip. On the rationale that this one is an even more exotic feature than the tab. ingest limits; and something only sysadmins - advanced users, by definition - will have to tinker with; very rarely, if ever.

Do we ever validate any other settings on create though?
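
For reference, a check at creation time would not have to be large. The sketch below is an assumption-laden illustration, not a proposal for where it would hook into the settings API: `DelaySettingValidatorSketch` and `findProblems` are hypothetical, and the setting is again assumed to arrive as already-parsed string keys and string values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical validator; it does not exist in Dataverse.
class DelaySettingValidatorSketch {

    // Returns one human-readable message per invalid entry;
    // an empty list means the setting looks usable.
    static List<String> findProblems(Map<String, String> rawDelays) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, String> entry : rawDelays.entrySet()) {
            String value = entry.getValue();
            if (value == null) {
                problems.add(entry.getKey() + ": missing delay value");
                continue;
            }
            try {
                long millis = Long.parseLong(value.trim());
                if (millis < 0) {
                    problems.add(entry.getKey() + ": delay must not be negative");
                }
            } catch (NumberFormatException e) {
                problems.add(entry.getKey() + ": not a whole number of milliseconds");
            }
        }
        return problems;
    }
}
```

A non-empty result could then be turned into an error response instead of the "added" success message, at the cost of special-casing this one setting.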

@sekmiller sekmiller merged commit 122736c into develop Mar 4, 2026
14 of 15 checks passed
@github-project-automation github-project-automation bot moved this from QA ✅ to Merged 🚀 in IQSS Dataverse Project Mar 4, 2026
@scolapasta scolapasta moved this from Merged 🚀 to Done 🧹 in IQSS Dataverse Project Mar 5, 2026
Labels

FY26 Sprint 18 (2026-02-25 - 2026-03-11)

Projects

Status: Done 🧹

Development

Successfully merging this pull request may close these issues.

Add a setting for configuring a rate limit in Harvesting Client (a limit on outgoing calls)

6 participants