Description
We have been researching and trying to address the issue of the massive number of AI scrapers, which is especially affecting DSTS.
Chris sampled an hour of traffic on DSTS. There were 78,920 requests in that time, coming from 74,853 unique IP addresses and spread across 700 different User Agents, most of which looked like normal browser User Agents with no indication that they were bots.
Unfortunately, blocking or rate-limiting by IP address or User Agent is not a viable option here.
Moving forward, we have been looking at AI maze solutions. We have already enabled the new AI Labyrinth feature on Cloudflare, but with limited to no success. AI Labyrinth is supposed to identify unauthorized bots and then link those bots to a bunch of AI-generated garbage content served by Cloudflare. My guess is that they are failing to identify unauthorized bots, just as we are.
Currently we are looking at implementing a similar solution: https://chronicles.mad-scientist.club/tales/a-season-on-iocaine/
The difference is that rather than trying to identify unauthorized bots, we may just leave hidden links on the site that link them to a web of garbage content that we generate with our own instance of iocaine.
At the moment, I am not sure whether we should have one instance of iocaine per site or not, but it probably makes sense to have these links point to content that can be viewed on the same domain.
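To make the hidden-link idea concrete, here is a minimal sketch assuming a Flask front end, with a hypothetical /maze/ path on the same domain that the reverse proxy would hand off to our iocaine instance. The path, the markup, and the helper names are placeholders for illustration, not actual iocaine configuration:

```python
# Minimal sketch, assuming a Flask app. The /maze/ path is a placeholder for
# wherever the reverse proxy routes requests to our iocaine instance.
from flask import Flask, Response

app = Flask(__name__)

# Link is invisible to human visitors (display:none, aria-hidden) but scrapers
# that parse raw HTML will follow it into the maze.
HIDDEN_LINK = (
    '<a href="/maze/entrance" style="display:none" '
    'tabindex="-1" aria-hidden="true">archive</a>'
)

@app.route("/")
def index():
    # Stand-in for a real page; the hook below decorates every HTML response.
    return "<html><body><h1>Search</h1></body></html>"

@app.after_request
def add_maze_link(resp: Response) -> Response:
    # Append the hidden link just before </body> on every HTML page served.
    if resp.content_type and resp.content_type.startswith("text/html"):
        body = resp.get_data(as_text=True)
        resp.set_data(body.replace("</body>", HIDDEN_LINK + "</body>", 1))
    return resp
```

If we go this route, we would probably also want a robots.txt Disallow for the maze path so that well-behaved crawlers that respect robots.txt never wander into the garbage.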
The one commonality of the problematic traffic is the forced multi-parameter searches against the faceted search: essentially, every possible combination of every value for all search parameters is being requested in quick succession by these various sources. It is possible to limit these requests based on the number of parameters used, but this would limit legitimate users as well. One potential option is to limit how many parameters may be used anonymously and require users to log in to do more detailed searching; a rough sketch of that idea follows.
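This is a minimal sketch of the login-gated search idea, again assuming Flask. The /search route, the is_authenticated check, and the cutoff of three facets are all assumptions for illustration, not our actual search stack:

```python
# Minimal sketch: cap the number of facet parameters for anonymous requests.
from flask import Flask, abort, request

app = Flask(__name__)

MAX_ANONYMOUS_FACETS = 3                          # assumed cutoff; tune against real usage
NON_FACET_PARAMS = {"page", "per_page", "sort"}   # paging/sorting don't count as facets

def is_authenticated() -> bool:
    # Placeholder: wire this up to the site's real session/auth check.
    return bool(request.cookies.get("session"))

@app.route("/search")
def faceted_search():
    # Count distinct facet parameters in the query string.
    facets = [key for key in request.args if key not in NON_FACET_PARAMS]
    if len(facets) > MAX_ANONYMOUS_FACETS and not is_authenticated():
        # Scrapers walking every facet combination never log in, so they stop
        # here; real users are asked to sign in for more detailed searches.
        abort(401, description=(
            f"Log in to combine more than {MAX_ANONYMOUS_FACETS} search filters."
        ))
    # Placeholder result; the real handler would run the faceted query here.
    return {"query": dict(request.args), "results": []}
```

Instead of a hard 401, we could redirect anonymous users to the login page with the intended search preserved in the query string, so legitimate users lose nothing beyond the sign-in step.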