This tool is a prototype that scans a given set of pages and outputs all tracking-technology instances found. At the moment only Cookies are detected, but more types will be supported in the future.
This tool is especially useful for companies that are required by regulators to establish a baseline of their existing Cookies but do not know how to get started.
Under the hood, this tool roughly reflects how website scanning tools such as OneTrust or ObservePoint work.
By default, the tool launches 10 headless browsers in parallel to do the scanning. Check the last part of this doc for more fine-tuning options.
Normal usage, in 3 steps:
# this will scan 300 URLs in the www.google.com domain
DOMAIN=google yarn start
# check all cookies found in this job
cat output/cookie-names.txt
# get cookie information: the pages on which this cookie is dropped and the initiators that set it will be output
COOKIE=_ga_A12345 yarn get-cookie
$ COOKIE=BAIDUID yarn get-cookie
Domain:.baidu.com ,print at most 100 items
=================================================================
Found in:
https://www.csdn.net//marketing.csdn.net/questions/Q2202181748074189855
its initiator is https://gsp0.baidu.com/yrwHcjSl0MgCo2Kml5_Y_D3/...
this cookie is set via set-cookie
TODO: Add diagram to replace textual description
Before reading on, be sure to check the configs under the `config/domain` folder; they are self-explanatory.
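For orientation, a per-domain config might look roughly like the sketch below. This is only an illustration built from the fields mentioned in the steps that follow (`entryList`, `allowUrlAsTask`, `limitsByRegex`); the exact shapes of `allowUrlAsTask` and `limitsByRegex` in this repo are assumptions.

```ts
// Hypothetical per-domain config; field names come from the steps below,
// but the exact shapes (especially allowUrlAsTask and limitsByRegex) are guesses.
export default {
  // Pages the crawler starts from.
  entryList: [
    'https://www.google.com/',
    'https://www.google.com/search?q=privacy',
  ],
  // Optional predicate: return true to crawl URLs that would otherwise be
  // skipped as out-of-domain.
  allowUrlAsTask: (url: string): boolean => url.includes('google.'),
  // Optional per-regex crawl limits (assumed shape: { regex, limit }).
  limitsByRegex: [{ regex: /\/search\?/, limit: 5 }],
};
```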
- You need to pass the domain in the form `DOMAIN=google`; the tool will search for the config file under the `config/domains` folder. The tool adds `config.entryList` to its crawl queue and sets counter = 0.
- (Temporarily disabled) Load previously scanned data so that incremental scanning can be realized.
- Enter the while loop: pick up a task as long as the queue is not empty and the counter is less than `TASKS`.
- If `config.allowUrlAsTask` is not defined, the default behavior is that URLs outside of this domain are skipped; this rule does not apply to iframe tasks. You may always customize this behavior.
- Check whether the task needs to be run: if the task matches a regex defined in `config.limitsByRegex`, that regex may only run up to N times, and once N is reached the task is skipped. If the task does not match any regex, it is normalized by dropping all its trailing parameters, and the normalized instance may only run up to 10 times. These two rules prevent the crawler from endlessly crawling the same or similar content, which would be a big waste of machine resources. This is similar to how ObservePoint handles deduplication; check `skipThisUrl` in `task.ts` for more information (a minimal sketch of this kind of logic is shown right after this list).
- Use `node-fetch` to fetch the original page's HTML response.
- Patch the result by inserting the `document.cookie=` hijacking logic defined in `document-cookie-interceptor.js`.
- Render the HTML again using the modified HTML body.
- Get the console output for that web page, search for content related to `document.cookie=`, and write the result to memory.
- Search the `set-cookie` headers of each resource the page loads (`.js`, `.css`, XHR calls, etc.) and write the result to memory.
- Search the `set-cookie` header of the HTML page itself and write the result to memory.
- Find all `a` links and `iframe` srcs.
- Add them to the task queue.
- `counter++`
- If no task is left or `TASKS` tasks have been executed, write the scanned data to the `output` folder.
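The deduplication described above could be sketched roughly as follows. This is not the actual `skipThisUrl` implementation in `task.ts`; the counter bookkeeping and the `{ regex, limit }` shape of `limitsByRegex` are assumptions.

```ts
// Minimal sketch of the dedup idea described above; not the real skipThisUrl.
const regexCounts = new Map<string, number>();
const normalizedCounts = new Map<string, number>();

function skipThisUrlSketch(
  url: string,
  limitsByRegex: { regex: RegExp; limit: number }[],
): boolean {
  // 1. If the URL matches a configured regex, honor that regex's own limit.
  for (const { regex, limit } of limitsByRegex) {
    if (regex.test(url)) {
      const count = regexCounts.get(regex.source) ?? 0;
      if (count >= limit) return true; // skip this task
      regexCounts.set(regex.source, count + 1);
      return false;
    }
  }
  // 2. Otherwise normalize by dropping trailing query parameters and allow
  //    at most 10 crawls of the normalized URL.
  const normalized = url.split('?')[0];
  const count = normalizedCounts.get(normalized) ?? 0;
  if (count >= 10) return true; // skip this task
  normalizedCounts.set(normalized, count + 1);
  return false;
}
```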
- `config/master-sheet.txt` contains all 3rd-party cookies whose owners need to be detected.
- `config/document-cookie-interceptor.js` is the JS file injected for the `document.cookie=` interception; you may ignore it (a minimal sketch of this kind of interception follows this list).
- `config/deprecated.txt` is not actively used now; you may ignore it.
- `output/cookie-info.json` is the source of truth for all scanning results.
- `output/cookie-names.txt` lists all cookie + domain pairs found; it is a derivative of `cookie-info.json` for easier reading.
- `output/crawled-urls.txt` lists all URLs that have been crawled.
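For the curious, the `document.cookie=` interception can be illustrated with the sketch below. This is not the actual content of `document-cookie-interceptor.js`; it only shows the general technique of overriding the `cookie` setter and logging each write to the console so the crawler can pick it up from the page's console output, as described in the steps above. The exact marker string logged is an assumption.

```ts
// Minimal sketch of a document.cookie interceptor; the real
// config/document-cookie-interceptor.js in this repo may differ.
(() => {
  const native = Object.getOwnPropertyDescriptor(Document.prototype, 'cookie');
  if (!native || !native.get || !native.set) return;

  Object.defineProperty(document, 'cookie', {
    configurable: true,
    get() {
      // Reads go straight to the native getter.
      return native.get!.call(document);
    },
    set(value: string) {
      // Log every `document.cookie=` write so the crawler can find it in the
      // page's console output; include the stack to hint at the initiator.
      console.log('document.cookie=', value, new Error().stack);
      native.set!.call(document, value);
    },
  });
})();
```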
- Install the Node version manager NVM:
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash
- Put this into your ~/.zshrc to make the nvm command available in every shell:
export NVM_DIR="$([ -z "${XDG_CONFIG_HOME-}" ] && printf %s "${HOME}/.nvm" || printf %s "${XDG_CONFIG_HOME}/nvm")"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" # This loads nvm
- Restart your terminal and install Node 22 and yarn:
nvm install v22
npm install -g yarn
- Note that this step needs to be done every time you do git pull:
yarn
# compile typescript to javascript, the final product is in the dist/ folder
yarn compile
yarn detect
The current repo already contains results from a previous run (which might not be complete),
and you may run COOKIE=$SOME_COOKIE_NAME yarn get-cookie to get information about a cookie.
Caveat: always write the output to an external file for inspection, as VS Code truncates console output.
# by default, output at most 100 traces per domain
npx cross-env COOKIE=$SOME_COOKIE_NAME yarn get-cookie > a.txt
# but you can customize the maximum
npx cross-env MAX_LINE_COUNT=150 COOKIE=$SOME_COOKIE_NAME yarn get-cookie > a.txt
# you can also output everything and not copy to the pasteboard
npx cross-env ALL=true COOKIE=$SOME_COOKIE_NAME yarn get-cookie > a.txt
This is for normal scanning, which initializes its starting pages from `config.entryList`:
# search for all set-cookie instances
# TASKS, if not defined, defaults to 300
npx cross-env TASKS=100 yarn start

This mode only crawls the URLs in `config.entryList` and then exits:

npx cross-env STOP_IMMEDIATE_PROPAGATION=true DOMAIN=https://www.example.com yarn start

This mode only crawls the URLs in `config.entryList` plus the iframes and `a` links on those pages, and then exits:
npx cross-env STOP_PROPAGATION=true DOMAIN=google yarn start
This is for debugging:
# make sure POOL=1, otherwise you will get entangled results
npx cross-env POOL=1 yarn debug
# it is nice to turn on verbose mode + devtools in debug mode
npx cross-env POOL=1 VERBOSE=true DEV_TOOL=true yarn debug

Then open chrome://inspect in Chrome, and you will see something like this:

In case you don't see it, click "Discover network targets" and try to update it like this:

Prettier is not set up for the time being, and neither is the Husky pre-commit hook.
yarn lint
yarn lint --fix
- The `TASKS` count does not include the tasks in `config.entryList`.
- For Mac users: the code tries to close each tab after each task and destroys all browser instances after the run finishes or is interrupted, but in case the headless browsers are not released, search for "chrome for testing" in the Mac Activity Monitor and kill all the browsers. This problem does not seem to exist on Windows, as tested on another computer of mine.
- Define task count -> the default is 300 tasks, but you may change it with `TASKS=1000`.
- Change pool size -> the default pool size is 10, but you may change it with `POOL=20`. Normally one browser takes up about 1.2G of memory, so if your machine can afford at least 20 * 1.2 * 1.5 (buffer for the running system) = 36G of memory, 20 instances should not be a problem.
- Verbose output -> set the env variable `VERBOSE=true`.
- Change headless browser count -> set the env variable `POOL=$some_number`.
- Show devtools -> set the env variables `DEV_TOOL=true POOL=1`.
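For reference, these variables could be read roughly as in the snippet below. This only illustrates the documented defaults (300 tasks, pool of 10); it is not the tool's actual option parsing.

```ts
// Illustrative mapping of the env variables above to runtime settings;
// the tool's real option handling may differ.
const settings = {
  tasks: Number(process.env.TASKS ?? 300),   // total crawl tasks
  pool: Number(process.env.POOL ?? 10),      // parallel headless browsers
  verbose: process.env.VERBOSE === 'true',   // verbose logging
  devTool: process.env.DEV_TOOL === 'true',  // show devtools (use with POOL=1)
};
```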
- Some domains require interacting with a blocking page before the crawler can start crawling, for example a bot-detection page, or the PIPL consent wall you hit when visiting www.booking.com from China. In the future, allowing users to write custom logic to manipulate Puppeteer should be supported so that these blockers can be worked around with a few mimicked user interactions, for example clicking a checkbox and then clicking OK (see the sketch at the end of this list).
- At the moment only finding Cookies is supported, but finding other tracking technologies such as localStorage/sessionStorage/iframes etc. should also be easy.
- User journey: some cookies are only dropped under specific conditions, which most of the time requires the user to be logged in or to go through some user journey. Still, on each page, the capture of Cookies works the same way, via intercepting `document.cookie=` and observing `set-cookie` headers.
- A more user-friendly UI tool is being planned.
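As a rough illustration of that first idea, a per-domain hook could eventually look something like this. This is purely a sketch of the planned feature, not an existing API of this tool; the `beforeCrawl` hook name and the CSS selectors are made up.

```ts
import type { Page } from 'puppeteer';

// Hypothetical per-domain hook for dismissing a consent wall before crawling.
// Neither `beforeCrawl` nor these selectors exist in the tool today.
export async function beforeCrawl(page: Page): Promise<void> {
  // Wait briefly for a consent dialog; ignore the timeout if none appears.
  const checkbox = await page
    .waitForSelector('#consent-checkbox', { timeout: 5000 })
    .catch(() => null);
  if (!checkbox) return;

  await checkbox.click();                 // tick the consent checkbox
  await page.click('button#consent-ok');  // then click OK to unblock the page
}
```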