
Conversation

@alanorth (Contributor) commented Aug 2, 2025

Description

Use the latest version of the express-rate-limit dependency. This version supports headers conforming to the IETF's draft standard for RateLimit header fields for HTTP. It also has optional support for Express v5 if we migrate to that in the future, improved support for external rate limit stores, and improved interoperability with similar middleware such as express-slow-down.

See the long list of changes from version 5.x to 8.x.
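As a rough illustration of what the upgraded middleware enables (the option values here are examples, not necessarily the PR's actual settings):

```typescript
import express from 'express';
import { rateLimit } from 'express-rate-limit';

const app = express();

app.use(rateLimit({
  windowMs: 60 * 1000,        // 1 minute window
  limit: 500,                 // requests per key per window (v7+ renamed `max` to `limit`)
  standardHeaders: 'draft-8', // emit the IETF draft RateLimit headers
  legacyHeaders: false,       // drop the legacy X-RateLimit-* headers
}));
```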

Instructions for Reviewers

How to test or review this PR:

  • Make sure ui.rateLimiter.limit is set to some non-zero value (default 20)
  • Start the Angular frontend in production mode with npm start
  • Make a GET request to any page and look for RateLimit-* headers in the response; you should not see any X-RateLimit-* headers (see the sketch below)
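A quick way to check from Node, if you prefer (a sketch; assumes the frontend is serving on its default port 4000):

```typescript
// Print every rate-limit-related header in the SSR response.
const res = await fetch('http://localhost:4000/');
for (const [name, value] of res.headers) {
  if (name.toLowerCase().includes('ratelimit')) {
    console.log(`${name}: ${value}`);
  }
}
// Expect the draft-standard RateLimit headers, with no X-RateLimit-* entries.
```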

Checklist

This checklist provides a reminder of what we are going to look for when reviewing your PR. You do not need to complete this checklist prior to creating your PR (draft PRs are always welcome).
However, reviewers may request that you complete any actions in this list if you have not done so. If you are unsure about an item in the checklist, don't hesitate to ask. We're here to help!

  • My PR is created against the main branch of code (unless it is a backport or is fixing an issue specific to an older branch).
  • My PR is small in size (e.g. less than 1,000 lines of code, not including comments & specs/tests), or I have provided reasons as to why that's not possible.
  • My PR passes ESLint validation using npm run lint
  • My PR doesn't introduce circular dependencies (verified via npm run check-circ-deps)
  • My PR includes TypeDoc comments for all new (or modified) public methods and classes. It also includes TypeDoc for large or complex private methods.
  • My PR passes all specs/tests and includes new/updated specs or tests based on the Code Testing Guide.
  • My PR aligns with Accessibility guidelines if it makes changes to the user interface.
  • My PR uses i18n (internationalization) keys instead of hardcoded English text, to allow for translations.
  • My PR includes details on how to test it. I've provided clear instructions to reviewers on how to successfully test this fix or feature.
  • If my PR includes new libraries/dependencies (in package.json), I've made sure their licenses align with the DSpace BSD License based on the Licensing of Contributions documentation.
  • If my PR includes new features or configurations, I've provided basic technical documentation in the PR itself.
  • If my PR fixes an issue ticket, I've linked them together.

@alanorth alanorth added the dependencies (Pull requests that update a dependency file) and performance / caching (Related to performance, caching or embedded objects) labels Aug 2, 2025
@tdonohue tdonohue moved this to 🙋 Needs Reviewers Assigned in DSpace 10.0 Release Aug 4, 2025
@tdonohue tdonohue added this to the 10.0 milestone Aug 4, 2025
@tdonohue (Member) left a comment

Thanks @alanorth! This looks great so far. I've added a few inline comments above. My only other thought here is whether we should add ipv6Subnet as a configurable setting in DSpace?

It seems like a setting that some sites may want direct control over, to better tune rate limiting of IP ranges.

@alanorth alanorth force-pushed the upgrade-express-rate-limit branch from 49a5631 to f05c5d9 Compare August 5, 2025 06:09
@alanorth (Contributor, Author) commented Aug 5, 2025

Thanks for the comments @mwoodiupui and @tdonohue.

> My only other thought here is whether we should add ipv6Subnet as a configurable setting in DSpace?

Good idea, I will add it.
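For anyone following along, the new option would look roughly like this; the subnet value shown is the library's documented default, and wiring it to a DSpace config key is the part to be added (a sketch, not the actual patch):

```typescript
import { rateLimit } from 'express-rate-limit';

const limiter = rateLimit({
  windowMs: 60 * 1000, // 1 minute
  limit: 500,          // requests per key per window
  // Treat a client's whole IPv6 /56 as one key, so a bot rotating
  // through addresses in its delegated prefix can't dodge the limit.
  ipv6Subnet: 56,
});
```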

@alanorth alanorth force-pushed the upgrade-express-rate-limit branch 2 times, most recently from 81e22e4 to 1110d21 Compare August 5, 2025 06:47
@alanorth (Contributor, Author) commented Aug 5, 2025

Cypress tests are now failing in CI due to rate limits. 🤦

 The response we received from your web server was:

  > 429: Too Many Requests

This was considered a failure because the status code was not `2xx`.

Not sure how to handle that...

@alanorth alanorth force-pushed the upgrade-express-rate-limit branch from 1110d21 to f60bd1a Compare August 5, 2025 09:33
@tdonohue (Member) commented Aug 5, 2025

@alanorth : e2e tests run in "production mode", so they won't use environment.test.ts at all (so you can remove your changes there).

So, I think you'd instead need to override the default using an environment variable in build.yml... similar to what we do to turn off all caching in all tests here: https://github.com/DSpace/dspace-angular/blob/main/.github/workflows/build.yml#L28-L29

We might be able to get away with setting it to 50 or 100 for e2e tests. We'll have to see what works better...but I didn't think about the fact that e2e tests are going to be requesting a large number of pages per minute in order to run the tests.
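A sketch of what that override might look like in build.yml; the variable name assumes the usual DSPACE_* environment mapping for ui.rateLimiter.limit, so double-check it against the config code:

```yaml
env:
  # Hypothetical override: raise the SSR rate limit so e2e runs aren't throttled.
  DSPACE_UI_RATELIMITER_LIMIT: 100
```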

@alanorth alanorth force-pushed the upgrade-express-rate-limit branch from debf763 to 396faad Compare August 6, 2025 07:11
@tdonohue tdonohue requested a review from kshepherd August 7, 2025 14:19
@saschaszott (Contributor) commented Aug 7, 2025

There may be an issue with the rate limit configuration when used in NAT environments.

At our university, all users access external services through a shared public IPv4 address due to NAT. This means that the global threshold (requests per IPv4 address per minute) affects all users collectively, not individually.

If the threshold is set too low, even normal usage by a few users may trigger the rate limit and block legitimate traffic for the entire institution.

While this could be mitigated for authenticated users by applying rate limits per user ID or session, it remains problematic for unauthenticated users, who are all subject to the same global limit.
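If per-user limits were ever explored, express-rate-limit does accept a custom keyGenerator; a minimal sketch (the user lookup is hypothetical, not an existing DSpace property, and I believe ipKeyGenerator is v8's helper for keeping IPv6 handling correct):

```typescript
import { rateLimit, ipKeyGenerator } from 'express-rate-limit';
import type { Request } from 'express';

const limiter = rateLimit({
  windowMs: 60 * 1000,
  limit: 500,
  // Hypothetical: authenticated users each get their own bucket, while
  // anonymous requests behind NAT still share the per-IP bucket.
  keyGenerator: (req: Request) =>
    (req as any).user?.id ?? ipKeyGenerator(req.ip ?? '', 56),
});
```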

@saschaszott (Contributor) commented:

Would it make sense to use device fingerprinting in combination with the IP address, for example by integrating https://github.com/fingerprintjs/fingerprintjs?

@pnbecker (Member) commented Aug 7, 2025

A limit of 20 might be a problem depending on your local theme. I expect it counts every request, even if the connection stays open with keep-alive. If you have a lot of fonts, images, and other assets, and do not use CDNs for privacy reasons, 20 might be low.

@alanorth (Contributor, Author) commented Aug 8, 2025

> There may be an issue with the rate limit configuration when used in NAT environments.

@saschaszott Yes, you are right. The limit setting is configurable, but we need to find a better default.

> Would it make sense to use device fingerprinting in combination with the IP address, for example by integrating fingerprintjs/fingerprintjs?

Maybe, but that particular library has a weird license, and even if we found an alternative, anything that runs in JavaScript will only affect bots that execute JavaScript. I have heard some bots hit item pages (SSR) and scrape the metadata without executing JavaScript or loading any assets.

> A limit of 20 might be a problem depending on your local theme. I expect it counts every request, even if the connection stays open with keep-alive. If you have a lot of fonts, images, and other assets, and do not use CDNs for privacy reasons, 20 might be low.

@pnbecker Yes, fair point. Keep in mind that after the client loads a page once, it switches to CSR. Even so, I just loaded the default theme on main in a private window, checked the response of the last request, and found that I had 2 requests remaining.


So the previous default of 500 is too high, and 20 is too low. Shall we go for 50?

@tdonohue (Member) commented Aug 8, 2025

@alanorth : 50 sounds reasonable enough to try out and see if it works better. As @kshepherd pointed out in yesterday's Developers Meeting, we need to make sure to do testing with browser caching disabled. I believe my initial testing (where I suggested 20) may have been flawed because of my browser caching. So, we need to try accessing the site as a brand new user, and see what level works best for them. (I'm about to head on vacation, so I'll leave this to others to do some testing in the meantime)

As a sidenote, you are correct that some crawlers (especially Google Scholar) do not execute JavaScript. I'm not sure whether their crawl will trigger a download of other assets, but we could always ask them as necessary.

@alanorth alanorth force-pushed the upgrade-express-rate-limit branch from 396faad to bb86b87 Compare August 8, 2025 17:47
@alanorth (Contributor, Author) commented Aug 8, 2025

Thanks @tdonohue. I've updated the default to 50. For what it's worth, I re-tested and found that I used 18 requests in the default DSpace theme. Maybe the default should be some multiple of 18 since that's how many requests are needed to load the default theme. 🤔

Unfortunately, it seems to me that whatever default we choose, we will impact legitimate users and "good" bots the most.

For example, our organization also uses one public IP externally (NAT), so if we send out an email with a link to the repository and ten or twenty people click it at the same time, how many will hit HTTP 429 errors and get weird broken pages? We also have lots of concurrent submitters. So we will need to override this limit with something higher anyway.

One class of "bad" bots uses residential proxies and makes only a few requests from each IP, possibly scraping only HTML from SSR without loading any assets. It is impossible to stop these with express-rate-limit; we need some other kind of middleware, like proof of work.

The other class of "bad" bots, like Semrush, Yandex, Baidu, Bing, Meta, Bytedance/Bytespider/TikTok, etc., are dumb and make tons of requests non-stop, so this may stop them.

The "good" bots like Google definitely request more than 50 per minute at times, so we will block them too. Do they understand RateLimit-* headers?

@pnbecker (Member) commented Aug 9, 2025

I think the approach of a general rate limit is not the best. @alanorth, I think you told me that you were rate limiting access to certain paths like /item or /entities. As long as browsers have to make more requests than bots, we will never be able to tune rate limits correctly. Is there a chance to rate limit not the assets, but only the other paths that bots access without JavaScript (see the sketch below)? That would also allow us to set rate limits independently of themes.
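That might be achievable with the middleware's skip option; a minimal sketch, assuming assets can be recognized by path (the prefixes here are illustrative):

```typescript
import { rateLimit } from 'express-rate-limit';

const limiter = rateLimit({
  windowMs: 60 * 1000,
  limit: 60,
  // Don't count theme assets against the limit; only rate limit the SSR
  // page routes that JS-less bots actually request.
  skip: (req) => req.path.startsWith('/assets/') || req.path.endsWith('.css'),
});
```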

@tdonohue (Member) commented:

All, I talked to Google Scholar about this today. Their bots unfortunately do not read or follow RateLimit-* HTTP headers, which is disappointing to hear. They also said that the rate at which they harvest a site can vary based on the size of the site (larger sites are harvested at a faster rate). After more discussion, though, they agreed that it may be very difficult to find a "one size fits all" solution for any built-in rate limiter... so they admitted this may not be as useful as first brainstormed.

Overall, I still feel we should either upgrade express-rate-limit or remove it altogether. I lean towards upgrading it, just in case some sites find it useful in specific scenarios.

But, based on Google Scholar's feedback, maybe we should avoid setting a default rate limit that could be problematic for good bots. This might be a good argument for leaving the default at 500, or only decreasing it to something like 200.

This essentially means it won't be the most useful tool out-of-the-box, but it still would be available to sites if they needed it.

56 is a moderately aggressive default. It may be increased (try 60 or 64) if users are being incorrectly blocked, or decreased if you are seeing evidence of abuse.

See: https://express-rate-limit.mintlify.app/reference/configuration#ipv6subnet
@alanorth alanorth force-pushed the upgrade-express-rate-limit branch from bb86b87 to 46c7a61 Compare August 27, 2025 04:57
@alanorth (Contributor, Author) commented:

Thanks @tdonohue. I've re-worked this patch series to use the old default of 500 requests, and dropped the override for CI as it is no longer needed.

It's a shame that this cannot be a more useful tool. At this point we are only upgrading it because it is an aging dependency with its own aging dependencies. Long term I think we should drop it, because I've never seen it be useful. Sadly this means that repository admins will have to resort to web application firewalls if they have the skills, or Cloudflare if they have the budget (bad for the open Internet).

One thing I like about proof-of-work (PoW) tools like Anubis and go-away is that they have mechanisms to match user agents with known IP ranges. So you can claim you are Googlebot, but unless you come from a known Google network, you get challenged (or blocked). They are non-trivial to deploy, though, and come with other side effects.

@github-actions commented:

Hi @alanorth,
Conflicts have been detected against the base branch.
Please resolve these conflicts as soon as you can. Thanks!


Labels

  • dependencies (Pull requests that update a dependency file)
  • merge conflict
  • performance / caching (Related to performance, caching or embedded objects)

Projects

Status: 🙋 Needs Reviewers Assigned
