
Conversation

@alanorth (Contributor) commented Aug 2, 2025

Description

Use the latest version of the express-rate-limit dependency. This version supports headers conforming to the IETF's draft standard for RateLimit header fields for HTTP. It also has optional support for Express v5 if we migrate to that in the future, improved support for external rate limit stores, and improved interoperability with similar middleware such as express-slow-down.

See the long list of changes from version 5.x to 8.x.
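As a rough illustration of what the upgraded middleware enables (the option values here are examples, not necessarily the PR's actual settings):

```typescript
import express from 'express';
import { rateLimit } from 'express-rate-limit';

const app = express();

app.use(rateLimit({
  windowMs: 60 * 1000,        // 1 minute window
  limit: 500,                 // requests per key per window (v7+ renamed `max` to `limit`)
  standardHeaders: 'draft-8', // emit the IETF draft RateLimit headers
  legacyHeaders: false,       // drop the legacy X-RateLimit-* headers
}));
```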

Instructions for Reviewers

How to test or review this PR:

  • Make sure ui.rateLimiter.limit is set to some non-zero value (default 20)
  • Start the Angular frontend in production mode with npm start
  • Make a GET request to any page and look for RateLimit-* headers in the response; you should not see any X-RateLimit-* headers (see the sketch below)
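A quick way to check from Node, if you prefer (a sketch; assumes the frontend is serving on its default port 4000):

```typescript
// Print every rate-limit-related header in the SSR response.
const res = await fetch('http://localhost:4000/');
for (const [name, value] of res.headers) {
  if (name.toLowerCase().includes('ratelimit')) {
    console.log(`${name}: ${value}`);
  }
}
// Expect the draft-standard RateLimit headers, with no X-RateLimit-* entries.
```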

Checklist

This checklist provides a reminder of what we are going to look for when reviewing your PR. You do not need to complete this checklist prior to creating your PR (draft PRs are always welcome).
However, reviewers may request that you complete any actions in this list if you have not done so. If you are unsure about an item in the checklist, don't hesitate to ask. We're here to help!

  • My PR is created against the main branch of code (unless it is a backport or is fixing an issue specific to an older branch).
  • My PR is small in size (e.g. less than 1,000 lines of code, not including comments & specs/tests), or I have provided reasons as to why that's not possible.
  • My PR passes ESLint validation using npm run lint
  • My PR doesn't introduce circular dependencies (verified via npm run check-circ-deps)
  • My PR includes TypeDoc comments for all new (or modified) public methods and classes. It also includes TypeDoc for large or complex private methods.
  • My PR passes all specs/tests and includes new/updated specs or tests based on the Code Testing Guide.
  • My PR aligns with Accessibility guidelines if it makes changes to the user interface.
  • My PR uses i18n (internationalization) keys instead of hardcoded English text, to allow for translations.
  • My PR includes details on how to test it. I've provided clear instructions to reviewers on how to successfully test this fix or feature.
  • If my PR includes new libraries/dependencies (in package.json), I've made sure their licenses align with the DSpace BSD License based on the Licensing of Contributions documentation.
  • If my PR includes new features or configurations, I've provided basic technical documentation in the PR itself.
  • If my PR fixes an issue ticket, I've linked them together.

@alanorth alanorth added the dependencies (Pull requests that update a dependency file) and performance / caching (Related to performance, caching or embedded objects) labels Aug 2, 2025
@tdonohue tdonohue moved this to 🙋 Needs Reviewers Assigned in DSpace 10.0 Release Aug 4, 2025
@tdonohue tdonohue added this to the 10.0 milestone Aug 4, 2025
@tdonohue (Member) left a comment

Thanks @alanorth! This looks great so far. I've added a few inline comments above. My only other thought here is whether we should add ipv6Subnet as a configurable setting in DSpace?

It seems like a setting that some sites may want direct control over, to better tune rate limiting of IP ranges.

@alanorth alanorth force-pushed the upgrade-express-rate-limit branch from 49a5631 to f05c5d9 Compare August 5, 2025 06:09
@alanorth (Contributor, Author) commented Aug 5, 2025

Thanks for the comments @mwoodiupui and @tdonohue.

> My only other thought here is whether we should add ipv6Subnet as a configurable setting in DSpace?

Good idea, I will add it.
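For anyone following along, the new option would look roughly like this; the subnet value shown is the library's documented default, and wiring it to a DSpace config key is the part to be added (a sketch, not the actual patch):

```typescript
import { rateLimit } from 'express-rate-limit';

const limiter = rateLimit({
  windowMs: 60 * 1000, // 1 minute
  limit: 500,          // requests per key per window
  // Treat a client's whole IPv6 /56 as one key, so a bot rotating
  // through addresses in its delegated prefix can't dodge the limit.
  ipv6Subnet: 56,
});
```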

@alanorth alanorth force-pushed the upgrade-express-rate-limit branch 2 times, most recently from 81e22e4 to 1110d21 Compare August 5, 2025 06:47
@alanorth (Contributor, Author) commented Aug 5, 2025

Cypress tests are now failing in CI due to rate limits. 🤦

 The response we received from your web server was:

  > 429: Too Many Requests

This was considered a failure because the status code was not `2xx`.

Not sure how to handle that...

@alanorth alanorth force-pushed the upgrade-express-rate-limit branch from 1110d21 to f60bd1a Compare August 5, 2025 09:33
@tdonohue (Member) commented Aug 5, 2025

@alanorth : e2e tests run in "production mode", so they won't use environment.test.ts at all (so you can remove your changes there).

So, I think you'd instead need to override the default using an environment variable in build.yml... similar to what we do to turn off all caching in all tests here: https://github.com/DSpace/dspace-angular/blob/main/.github/workflows/build.yml#L28-L29

We might be able to get away with setting it to 50 or 100 for e2e tests. We'll have to see what works better...but I didn't think about the fact that e2e tests are going to be requesting a large number of pages per minute in order to run the tests.
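A sketch of what that override might look like in build.yml; the variable name assumes the usual DSPACE_* environment mapping for ui.rateLimiter.limit, so double-check it against the config code:

```yaml
env:
  # Hypothetical override: raise the SSR rate limit so e2e runs aren't throttled.
  DSPACE_UI_RATELIMITER_LIMIT: 100
```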

@alanorth alanorth force-pushed the upgrade-express-rate-limit branch from debf763 to 396faad Compare August 6, 2025 07:11
@tdonohue tdonohue requested a review from kshepherd August 7, 2025 14:19
@saschaszott (Contributor) commented Aug 7, 2025

There may be an issue with the rate limit configuration when used in NAT environments.

At our university, all users access external services through a shared public IPv4 address due to NAT. This means that the global threshold (requests per IPv4 address per minute) affects all users collectively, not individually.

If the threshold is set too low, even normal usage by a few users may trigger the rate limit and block legitimate traffic for the entire institution.

While this could be mitigated for authenticated users by applying rate limits per user ID or session, it remains problematic for unauthenticated users, who are all subject to the same global limit.
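If per-user limits were ever explored, express-rate-limit does accept a custom keyGenerator; a minimal sketch (the user lookup is hypothetical, not an existing DSpace property, and I believe ipKeyGenerator is v8's helper for keeping IPv6 handling correct):

```typescript
import { rateLimit, ipKeyGenerator } from 'express-rate-limit';
import type { Request } from 'express';

const limiter = rateLimit({
  windowMs: 60 * 1000,
  limit: 500,
  // Hypothetical: authenticated users each get their own bucket, while
  // anonymous requests behind NAT still share the per-IP bucket.
  keyGenerator: (req: Request) =>
    (req as any).user?.id ?? ipKeyGenerator(req.ip ?? '', 56),
});
```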

@saschaszott (Contributor) commented:

Would it make sense to use device fingerprinting in combination with the IP address, for example by integrating https://github.com/fingerprintjs/fingerprintjs?

@pnbecker (Member) commented Aug 7, 2025

A limit of 20 might be a problem depending on your local theme. I expect it counts every request, even if the connection stays open with keep-alive. If you have a lot of fonts, images, and other assets, and do not use CDNs for privacy reasons, 20 might be low.

@alanorth (Contributor, Author) commented Aug 8, 2025

> There may be an issue with the rate limit configuration when used in NAT environments.

@saschaszott Yes, you are right. The limit setting is configurable, but we need to find a better default.

> Would it make sense to use device fingerprinting in combination with the IP address, for example by integrating fingerprintjs/fingerprintjs?

Maybe, but that particular library has a weird license, and even if we found an alternative, anything that runs in JavaScript will only affect bots that execute JavaScript. I have heard some bots hit item pages (SSR) and scrape the metadata without executing JavaScript or loading any assets.

> A limit of 20 might be a problem depending on your local theme. I expect it counts every request, even if the connection stays open with keep-alive. If you have a lot of fonts, images, and other assets, and do not use CDNs for privacy reasons, 20 might be low.

@pnbecker Yes, fair point. Keep in mind that after the client loads a page once, it switches to CSR. Even so, I just loaded the default theme on main in a private window, checked the response of the last request, and found that I had 2 requests remaining.


So the previous default of 500 is too high, and 20 is too low. Shall we go for 50?

@tdonohue (Member) commented Aug 8, 2025

@alanorth : 50 sounds reasonable enough to try out and see if it works better. As @kshepherd pointed out in yesterday's Developers Meeting, we need to make sure to do testing with browser caching disabled. I believe my initial testing (where I suggested 20) may have been flawed because of my browser caching. So, we need to try accessing the site as a brand new user, and see what level works best for them. (I'm about to head on vacation, so I'll leave this to others to do some testing in the meantime)

As a sidenote, you are correct that some crawlers (especially Google Scholar) do not execute JavaScript. I'm not sure whether their crawl will trigger a download of other assets, but we could always ask them as necessary.

@alanorth alanorth force-pushed the upgrade-express-rate-limit branch from 396faad to bb86b87 Compare August 8, 2025 17:47
@alanorth (Contributor, Author) commented Aug 8, 2025

Thanks @tdonohue. I've updated the default to 50. For what it's worth, I re-tested and found that I used 18 requests in the default DSpace theme. Maybe the default should be some multiple of 18 since that's how many requests are needed to load the default theme. 🤔

Unfortunately, it seems to me that whatever default we choose, we will impact legitimate users and "good" bots the most.

For example, our organization also uses one public IP externally (NAT), so if we send out an email with a link to the repository and ten or twenty people click it at the same time, how many will hit HTTP 429 errors and get weird broken pages? We also have lots of concurrent submitters. So we will need to override this limit with something higher anyway.

One class of "bad" bots uses residential proxies and makes only a few requests from each IP, possibly scraping only HTML from SSR without loading any assets. It is impossible to stop these with express-rate-limit; we need some other kind of middleware, like proof of work.

The other class of "bad" bots, like Semrush, Yandex, Baidu, Bing, Meta, Bytedance/Bytespider/TikTok, etc., are dumb and make tons of requests non-stop, so this may stop them.

The "good" bots like Google definitely request more than 50 per minute at times, so we will block them too. Do they understand RateLimit-* headers?

@pnbecker (Member) commented Aug 9, 2025

I think the approach of a general rate limit is not the best. @alanorth, I think you told me that you were rate limiting access to certain paths like /item or /entities. As long as browsers have to make more requests than bots, we will never be able to tune rate limits correctly. Is there a chance to rate limit not the assets, but only the other paths that bots access without JavaScript (see the sketch below)? That would also allow us to set rate limits independently of themes.
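That might be achievable with the middleware's skip option; a minimal sketch, assuming assets can be recognized by path (the prefixes here are illustrative):

```typescript
import { rateLimit } from 'express-rate-limit';

const limiter = rateLimit({
  windowMs: 60 * 1000,
  limit: 60,
  // Don't count theme assets against the limit; only rate limit the SSR
  // page routes that JS-less bots actually request.
  skip: (req) => req.path.startsWith('/assets/') || req.path.endsWith('.css'),
});
```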

@tdonohue (Member) commented:

All, I talked to Google Scholar about this today. Their bots unfortunately do not read or follow RateLimit-* HTTP headers, which is disappointing to hear. They also said that the rate at which they harvest a site can vary based on the size of the site (larger sites are harvested at a faster rate). After more discussion, though, they agreed that it may be very difficult to find a "one size fits all" solution for any built-in rate limiter... so they admitted this may not be as useful as first brainstormed.

Overall, I still feel we should either upgrade express-rate-limit or remove it altogether. I lean towards upgrading it, just in case some sites find it useful in specific scenarios.

But, based on Google Scholar's feedback, maybe we should avoid setting a default rate limit that could be problematic for good bots. This might be a good argument for leaving the default at 500, or only decreasing it to something like 200.

This essentially means it won't be the most useful tool out-of-the-box, but it still would be available to sites if they needed it.

56 is a moderately aggressive default. It may be increased (try 60 or 64) if users are being incorrectly blocked, or decreased if you are seeing evidence of abuse.

See: https://express-rate-limit.mintlify.app/reference/configuration#ipv6subnet
@alanorth alanorth force-pushed the upgrade-express-rate-limit branch from bb86b87 to 46c7a61 Compare August 27, 2025 04:57
@alanorth (Contributor, Author) commented:

Thanks @tdonohue. I've re-worked this patch series to use the old default of 500 requests, and dropped the override for CI as it is no longer needed.

It's a shame that this cannot be a more useful tool. At this point we are only upgrading it because it is an aging dependency with its own aging dependencies. Long term I think we should drop it, because I've never seen it be useful. Sadly this means that repository admins will have to resort to web application firewalls if they have the skills, or Cloudflare if they have the budget (bad for the open Internet).

One thing I like about proof-of-work (PoW) tools like Anubis and go-away is that they have mechanisms to match user agents with known IP ranges. So you can claim you are Googlebot, but unless you come from a known Google network, you get challenged (or blocked). They are non-trivial to deploy, though, and come with other side effects.

@github-actions commented:

Hi @alanorth,
Conflicts have been detected against the base branch.
Please resolve these conflicts as soon as you can. Thanks!


Labels

  • dependencies (Pull requests that update a dependency file)
  • merge conflict
  • performance / caching (Related to performance, caching or embedded objects)

Projects

Status: 🙋 Needs Reviewers Assigned
