Upgrade express-rate-limit to v8.0.1 #4620
Conversation
tdonohue
left a comment
Thanks @alanorth ! This looks great so far. I've added a few inline comments above. My only other thought here is whether we should add ipv6Subnet as a configurable setting in DSpace?
It seems like a setting that some sites may want direct control over, to better tune rate limiting of IP ranges.
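If it does become configurable, the wiring could look something like the sketch below. The `windowMs`, `limit`, `ipv6Subnet`, `standardHeaders`, and `legacyHeaders` option names are real express-rate-limit v8 options; the `buildLimiterOptions` helper and the DSpace config shape it consumes are assumptions for illustration:

```javascript
// Hypothetical helper: translate a DSpace-style ui.rateLimiter config
// block into express-rate-limit v8 options. The config shape and helper
// name are assumptions; the option names are the library's real ones.
function buildLimiterOptions(cfg) {
  return {
    windowMs: cfg.windowMs,
    limit: cfg.limit,
    // group IPv6 clients by subnet; 56 is the library default
    ipv6Subnet: cfg.ipv6Subnet ?? 56,
    // send draft-standard RateLimit headers, not legacy X-RateLimit-*
    standardHeaders: 'draft-7',
    legacyHeaders: false,
  };
}

// inspect the resulting options
console.log(buildLimiterOptions({ windowMs: 60000, limit: 500 }));
```

The resulting object would then be passed to `rateLimit(...)` when the middleware is created.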
Thanks for the comments @mwoodiupui and @tdonohue.
Good idea, I will add it.
Cypress tests are now failing in CI due to rate limits. 🤦 Not sure how to handle that...
@alanorth : e2e tests run in "production mode", so they won't use the development configuration. So, I think you'd instead need to override the default using an environment variable in build.yml... similar to what we do to turn off all caching in all tests here: https://github.com/DSpace/dspace-angular/blob/main/.github/workflows/build.yml#L28-L29

We might be able to get away with setting it to 50 or 100 for e2e tests. We'll have to see what works better... but I didn't think about the fact that e2e tests are going to be requesting a large number of pages per minute in order to run the tests.
There may be an issue with the rate limit configuration when used in NAT environments. At our university, all users access external services through a shared public IPv4 address due to NAT. This means that the global threshold (requests per IPv4 address per minute) affects all users collectively, not individually. If the threshold is set too low, even normal usage by a few users may trigger the rate limit and block legitimate traffic for the entire institution.

While this could be mitigated for authenticated users by applying rate limits per user ID or session, it remains problematic for unauthenticated users, who are all subject to the same global limit.
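The shared-counter effect is easy to see with a toy fixed-window limiter (a simplification of what express-rate-limit's default memory store does; the names here are illustrative):

```javascript
// Toy fixed-window limiter, keyed by client IP (a simplification of
// express-rate-limit's default memory store; names are illustrative).
function makeLimiter(limit) {
  const counts = new Map();
  return function allow(ip) {
    const n = (counts.get(ip) ?? 0) + 1;
    counts.set(ip, n);
    return n <= limit; // false → the request would get HTTP 429
  };
}

// Ten NAT'd users, five requests each, all seen as one public address:
const allow = makeLimiter(20);
let blocked = 0;
for (let user = 0; user < 10; user++) {
  for (let r = 0; r < 5; r++) {
    if (!allow('203.0.113.7')) blocked++;
  }
}
console.log(blocked); // 30 of the 50 requests are over the shared limit
```

No single user exceeded the limit, yet more than half of the institution's requests would be rejected.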
Would it make sense to use device fingerprinting in combination with the IP address, for example by integrating https://github.com/fingerprintjs/fingerprintjs? |
A limit of 20 might be a problem depending on your local theme. I expect it counts every request, even if the connection stays open with keep-alive. If you have a lot of fonts, images and other things and do not use CDNs for privacy reasons, 20 might be low.
@saschaszott Yes, you are right. The limiter does count every request.
Maybe, but that particular library has a weird license, and even if we found something else, anything that runs in JavaScript will only affect bots that execute JavaScript. I have heard some bots hit item pages (SSR) and scrape the metadata without executing JavaScript or loading any assets.
@pnbecker Yes, fair point. Keep in mind that once the client loads a page, they switch to CSR. Even so, I just loaded the default theme myself to check.

So the previous default of 500 is too high, and 20 is too low. Shall we go for 50?
@alanorth : 50 sounds reasonable enough to try out and see if it works better. As @kshepherd pointed out in yesterday's Developers Meeting, we need to make sure to do testing with browser caching disabled. I believe my initial testing (where I suggested 20) may have been flawed because of my browser caching. So, we need to try accessing the site as a brand new user, and see what level works best for them. (I'm about to head on vacation, so I'll leave this to others to do some testing in the meantime)

As a sidenote, you are correct that some crawlers (especially Google Scholar) do not execute JavaScript. I'm not sure whether their crawl will trigger a download of other assets, but we could always ask them as necessary.
Thanks @tdonohue. I've updated the default to 50. For what it's worth, I re-tested and found that I used 18 requests in the default DSpace theme. Maybe the default should be some multiple of 18, since that's how many requests are needed to load the default theme. 🤔

Unfortunately, it seems to me that whatever default we choose, we will impact legitimate users and "good" bots the most. For example, our organization also uses one public IP externally (NAT), so if we send an email out with a link to the repository and ten or twenty people click it at the same time, how many will hit HTTP 429 errors and get weird broken pages? We also have lots of concurrent submitters. So we will need to override this limit with something higher anyway.

One class of "bad" bots uses residential proxies and only makes a few requests from each IP, and they may only be scraping HTML from SSR, not loading assets. It is impossible to stop these guys with a per-IP rate limit. The other class of "bad" bots, like Semrush, Yandex, Baidu, Bing, Meta, Bytedance/Bytespider/TikTok, etc., are dumb and make tons of requests non-stop, so this may stop them. The "good" bots like Google definitely request more than 50 per minute at times, so we will block them too. Do they understand the rate limit headers?
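For reference, what a cooperating client would have to do: per RFC 9110, a 429 response may carry a Retry-After header as either delay-seconds or an HTTP-date. A sketch of parsing it (the helper name is illustrative):

```javascript
// Sketch: how a well-behaved client could honor a 429's Retry-After
// header. Per RFC 9110 the value is either delay-seconds ("30") or an
// HTTP-date. Returns a delay in milliseconds, or null if unparseable.
function retryDelayMs(retryAfter, now = Date.now()) {
  if (retryAfter === undefined) return null;
  const secs = Number(retryAfter);
  if (Number.isFinite(secs)) return Math.max(0, secs * 1000);
  const date = Date.parse(retryAfter);
  return Number.isNaN(date) ? null : Math.max(0, date - now);
}

console.log(retryDelayMs('30')); // 30000
```

Whether the big crawlers actually do anything like this is exactly the open question here.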
I think the approach of a general rate limit is not the best. @alanorth, I think you told me that you were rate limiting access to certain paths like /item or /entities. As long as browsers have to make more requests than bots, we will never be able to adjust rate limits correctly. Is there a chance to not rate limit assets, but only the other paths that are accessed by bots without JavaScript? That would also allow us to set rate limits independently from themes.
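express-rate-limit does expose a real `skip` option (a predicate receiving the request) that could implement this. A sketch of an asset-exempting predicate; the extension list is illustrative, not DSpace's actual asset layout:

```javascript
// Sketch of a `skip` predicate for express-rate-limit that exempts
// static assets, so the limit applies mainly to SSR'd pages.
// The extension list here is illustrative.
const ASSET_RE = /\.(js|css|map|woff2?|ttf|eot|png|jpe?g|gif|svg|ico|webp)$/i;

function skipAssets(req) {
  // req.url may contain a query string; test against the path only
  const { pathname } = new URL(req.url, 'http://localhost');
  return ASSET_RE.test(pathname);
}

console.log(skipAssets({ url: '/assets/fonts/main.woff2' })); // true
console.log(skipAssets({ url: '/entities/publication/1234' })); // false
```

It would then be passed as `skip: skipAssets` in the limiter options, which would also make the chosen limit independent of how many assets a theme loads.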
All, I talked to Google Scholar today about this. Their bots unfortunately do not read / follow the rate limit response headers.

Overall, I still feel we should upgrade express-rate-limit regardless. But, based on Google Scholar's feedback, maybe we should avoid setting a rate limit that could be potentially problematic for good bots. This might be a good argument for leaving the default at 500, or only decreasing it to something like 200. This essentially means it won't be the most useful tool out-of-the-box, but it still would be available to sites if they needed it.
56 is a moderately aggressive default. It may be increased (try 60 or 64) if users are being incorrectly blocked, or decreased if you are seeing evidence of abuse. See: https://express-rate-limit.mintlify.app/reference/configuration#ipv6subnet
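Conceptually, `ipv6Subnet: 56` means the limiter keys on the address's /56 prefix rather than the full address, so a client can't evade the limit by rotating through addresses within its delegated block. A simplified sketch of that keying (assumes a fully expanded address, with no `::` shorthand handling; the helper name is illustrative):

```javascript
// Collapse a fully expanded IPv6 address to its prefix, e.g. /56, for
// use as a rate-limit key. Simplified sketch: no '::' shorthand support.
function ipv6Key(addr, prefixBits = 56) {
  const groups = addr.split(':').map((g) => parseInt(g, 16));
  const keep = [];
  let bits = prefixBits;
  for (const g of groups) {
    if (bits >= 16) { keep.push(g); bits -= 16; }
    else if (bits > 0) { keep.push(g & (0xffff << (16 - bits)) & 0xffff); bits = 0; }
    else keep.push(0);
  }
  return keep.map((g) => g.toString(16).padStart(4, '0')).join(':') + `/${prefixBits}`;
}

console.log(ipv6Key('2001:0db8:85a3:08d3:1319:8a2e:0370:7344'));
// 2001:0db8:85a3:0800:0000:0000:0000:0000/56
```

Two addresses in the same /56 block produce the same key, so they share one counter.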
Thanks @tdonohue. I've re-worked this patch series to use the old default of 500 requests, and dropped the override for CI as it is no longer needed. Shame that this cannot be a useful tool. At this point we should only be upgrading this because it is an aging dependency with its own aging dependencies. Long term I think we should drop it, because I've never seen it be useful. Sadly this means that repository admins will have to resort to web application firewalls if they have the skills, or Cloudflare if they have the budget (bad for the open Internet).

One thing I like about the proof of work (PoW) tools like Anubis and go-away is that they have mechanisms to match user agents with known IP ranges. So you can claim you are Googlebot, but unless you come from a known Google network, you get challenged (or blocked). It is non-trivial to deploy, though, and comes with other side effects.
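The "claimed user agent vs. known network" check reduces to CIDR membership. A minimal IPv4 sketch (the range shown is one historically associated with Googlebot; a real deployment would fetch Google's published ranges rather than hard-code one):

```javascript
// Sketch: only trust a Googlebot user agent if the client IP falls in a
// known Google network. Helpers and the single hard-coded range are
// illustrative; real lists are published by Google (googlebot.json).
function ipv4ToInt(ip) {
  return ip.split('.').reduce((acc, o) => (acc << 8) | parseInt(o, 10), 0) >>> 0;
}

function inCidr(ip, cidr) {
  const [net, bits] = cidr.split('/');
  const mask = bits === '0' ? 0 : (~0 << (32 - Number(bits))) >>> 0;
  return (ipv4ToInt(ip) & mask) === (ipv4ToInt(net) & mask);
}

console.log(inCidr('66.249.66.1', '66.249.64.0/19')); // true
console.log(inCidr('203.0.113.7', '66.249.64.0/19')); // false
```

Anything claiming to be Googlebot from outside the known ranges would get challenged or blocked, which is essentially what Anubis and go-away do.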
Hi @alanorth, |

Description
Use the latest version of the express-rate-limit dependency. This version has support for headers conforming to the "RateLimit header fields for HTTP" standardization draft adopted by the IETF. This version also has optional support for Express v5 if we migrate to that in the future, improved support for external rate limit stores, and improved support for other similar middleware like express-slow-down.
See the long list of changes from version 5.x to 8.x.
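For reviewers unfamiliar with the new headers: the IETF draft replaces the de-facto `X-RateLimit-*` family with `RateLimit-*` fields. A sketch of the field shape (the helper is purely illustrative; express-rate-limit emits these itself when `standardHeaders` is enabled):

```javascript
// Illustrative only: the draft-style header names that replace the
// legacy X-RateLimit-* family. express-rate-limit produces these itself;
// this helper just shows the shape of the fields.
function rateLimitHeaders({ limit, remaining, resetSeconds }) {
  return {
    // draft "RateLimit header fields for HTTP" family
    'RateLimit-Limit': String(limit),
    'RateLimit-Remaining': String(remaining),
    'RateLimit-Reset': String(resetSeconds),
    // the legacy family would have been X-RateLimit-Limit, etc.
  };
}

console.log(rateLimitHeaders({ limit: 500, remaining: 499, resetSeconds: 42 }));
```

This is why the test instructions below say you should see `RateLimit-*` headers and no `X-RateLimit-*` headers in responses.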
Instructions for Reviewers
Please add a more detailed description of the changes made by your PR. At a minimum, providing a bulleted list of changes in your PR is helpful to reviewers.
List of changes in this PR:
- Upgrade `express-rate-limit` from `^5.1.3` to `^8.0.1`
- Rename the `max` configuration option to `limit` and adjust the default limit from 500 requests per IP per minute to 20
- Add new `ipv6Subnet` setting

Include guidance for how to test or review your PR.
- Ensure `ui.rateLimiter.limit` is set to some non-zero value (default 20)
- Start the UI with `npm start`
- Check for `RateLimit-*` and other headers in the response (you should not see `X-RateLimit-*`)

Checklist
This checklist provides a reminder of what we are going to look for when reviewing your PR. You do not need to complete this checklist prior creating your PR (draft PRs are always welcome).
However, reviewers may request that you complete any actions in this list if you have not done so. If you are unsure about an item in the checklist, don't hesitate to ask. We're here to help!
- My PR is created against the `main` branch of code (unless it is a backport or is fixing an issue specific to an older branch).
- My PR passes ESLint validation using `npm run lint`
- My PR doesn't introduce circular dependencies (verified via `npm run check-circ-deps`)
- If my PR includes new libraries/dependencies (in `package.json`), I've made sure their licenses align with the DSpace BSD License based on the Licensing of Contributions documentation.