This tool is a prototype that scans a given set of pages and outputs all tracking-technology instances found. At the moment only Cookies are detected, but more types will be supported in the future.
This tool is especially useful for companies that are required by regulators to establish a baseline of their existing Cookies but do not know how to get started.
Under the hood, this tool roughly reflects how website scanning tools such as OneTrust or ObservePoint work.
By default, the tool launches 10 headless browsers in parallel to do the scanning. Check the last part of this doc for more fine-tuning options.
Normal usage, in 3 steps:
# this will scan 300 URLs in the www.google.com domain
DOMAIN=google yarn start
# check all cookies found in this job
cat output/cookie-names.txt
# get cookie information: the pages on which this cookie is dropped and the initiators that set it will be output
COOKIE=_ga_A12345 yarn get-cookie
$ COOKIE=BAIDUID yarn get-cookie
Domain:.baidu.com ,print at most 100 items
=================================================================
Found in:
https://www.csdn.net//marketing.csdn.net/questions/Q2202181748074189855
its initiator is https://gsp0.baidu.com/yrwHcjSl0MgCo2Kml5_Y_D3/...
this cookie is set via set-cookie
TODO: Add diagram to replace textual description
Before reading on, be sure to check the configs under the `config/domain` folder; they are self-explanatory.
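For orientation, a per-domain config might look roughly like the sketch below. This is only an illustration built from the fields mentioned in the steps that follow (`entryList`, `allowUrlAsTask`, `limitsByRegex`); the exact shapes of `allowUrlAsTask` and `limitsByRegex` in this repo are assumptions.

```ts
// Hypothetical per-domain config; field names come from the steps below,
// but the exact shapes (especially allowUrlAsTask and limitsByRegex) are guesses.
export default {
  // Pages the crawler starts from.
  entryList: [
    'https://www.google.com/',
    'https://www.google.com/search?q=privacy',
  ],
  // Optional predicate: return true to crawl URLs that would otherwise be
  // skipped as out-of-domain.
  allowUrlAsTask: (url: string): boolean => url.includes('google.'),
  // Optional per-regex crawl limits (assumed shape: { regex, limit }).
  limitsByRegex: [{ regex: /\/search\?/, limit: 5 }],
};
```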
- You need to pass the domain in the form `DOMAIN=google`; the tool will search for the config file under the `config/domains` folder. The tool adds `config.entryList` to its crawl queue and sets counter = 0.
- (Temporarily disabled) Load previously scanned data so that incremental scanning can be realized.
- Enter the while loop: pick up a task as long as the queue is not empty and the counter is less than `TASKS`.
- If `config.allowUrlAsTask` is not defined, the default behavior is that URLs outside of this domain are skipped; this rule does not apply to iframe tasks. You may always customize this behavior.
- Check whether the task needs to be run: if the task matches a regex defined in `config.limitsByRegex`, that regex may only run up to N times, and once N is reached the task is skipped. If the task does not match any regex, it is normalized by dropping all its trailing parameters, and the normalized instance may only run up to 10 times. These two rules prevent the crawler from endlessly crawling the same or similar content, which would be a big waste of machine resources. This is similar to how ObservePoint handles deduplication; check `skipThisUrl` in `task.ts` for more information (a minimal sketch of this kind of logic is shown right after this list).
- Use `node-fetch` to fetch the original page's HTML response.
- Patch the result by inserting the `document.cookie=` hijacking logic defined in `document-cookie-interceptor.js`.
- Render the HTML again using the modified HTML body.
- Get the console output for that web page, search for content related to `document.cookie=`, and write the result to memory.
- Search the `set-cookie` headers of each resource the page loads (`.js`, `.css`, XHR calls, etc.) and write the result to memory.
- Search the `set-cookie` header of the HTML page itself and write the result to memory.
- Find all `a` links and `iframe` srcs.
- Add them to the task queue.
- `counter++`
- If no task is left or `TASKS` tasks have been executed, write the scanned data to the `output` folder.
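The deduplication described above could be sketched roughly as follows. This is not the actual `skipThisUrl` implementation in `task.ts`; the counter bookkeeping and the `{ regex, limit }` shape of `limitsByRegex` are assumptions.

```ts
// Minimal sketch of the dedup idea described above; not the real skipThisUrl.
const regexCounts = new Map<string, number>();
const normalizedCounts = new Map<string, number>();

function skipThisUrlSketch(
  url: string,
  limitsByRegex: { regex: RegExp; limit: number }[],
): boolean {
  // 1. If the URL matches a configured regex, honor that regex's own limit.
  for (const { regex, limit } of limitsByRegex) {
    if (regex.test(url)) {
      const count = regexCounts.get(regex.source) ?? 0;
      if (count >= limit) return true; // skip this task
      regexCounts.set(regex.source, count + 1);
      return false;
    }
  }
  // 2. Otherwise normalize by dropping trailing query parameters and allow
  //    at most 10 crawls of the normalized URL.
  const normalized = url.split('?')[0];
  const count = normalizedCounts.get(normalized) ?? 0;
  if (count >= 10) return true; // skip this task
  normalizedCounts.set(normalized, count + 1);
  return false;
}
```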
- `config/master-sheet.txt` contains all 3rd-party cookies whose owners need to be detected.
- `config/document-cookie-interceptor.js` is the JS file injected for the `document.cookie=` interception; you may ignore it (a minimal sketch of this kind of interception follows this list).
- `config/deprecated.txt` is not actively used now; you may ignore it.
- `output/cookie-info.json` is the source of truth for all scanning results.
- `output/cookie-names.txt` lists all cookie + domain pairs found; it is a derivative of `cookie-info.json` for easier reading.
- `output/crawled-urls.txt` lists all URLs that have been crawled.
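For the curious, the `document.cookie=` interception can be illustrated with the sketch below. This is not the actual content of `document-cookie-interceptor.js`; it only shows the general technique of overriding the `cookie` setter and logging each write to the console so the crawler can pick it up from the page's console output, as described in the steps above. The exact marker string logged is an assumption.

```ts
// Minimal sketch of a document.cookie interceptor; the real
// config/document-cookie-interceptor.js in this repo may differ.
(() => {
  const native = Object.getOwnPropertyDescriptor(Document.prototype, 'cookie');
  if (!native || !native.get || !native.set) return;

  Object.defineProperty(document, 'cookie', {
    configurable: true,
    get() {
      // Reads go straight to the native getter.
      return native.get!.call(document);
    },
    set(value: string) {
      // Log every `document.cookie=` write so the crawler can find it in the
      // page's console output; include the stack to hint at the initiator.
      console.log('document.cookie=', value, new Error().stack);
      native.set!.call(document, value);
    },
  });
})();
```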
- Install the Node version manager NVM:
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash
- Put this into your ~/.zshrc to make the nvm command available in every shell:
export NVM_DIR="$([ -z "${XDG_CONFIG_HOME-}" ] && printf %s "${HOME}/.nvm" || printf %s "${XDG_CONFIG_HOME}/nvm")"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" # This loads nvm
- Restart your terminal and install Node 22 and yarn:
nvm install v22
npm install -g yarn
- Note that this step needs to be done every time you do git pull:
yarn
# compile typescript to javascript, the final product is in the dist/ folder
yarn compile
yarn detect
The current repo already contains results from a previous run (which might not be complete),
and you may run COOKIE=$SOME_COOKIE_NAME yarn get-cookie to get information about a cookie.
Caveat: always write the output to an external file for inspection, as VS Code truncates console output.
# by default, output at most 100 traces per domain
npx cross-env COOKIE=$SOME_COOKIE_NAME yarn get-cookie > a.txt
# but you can customize the maximum
npx cross-env MAX_LINE_COUNT=150 COOKIE=$SOME_COOKIE_NAME yarn get-cookie > a.txt
# you can also output everything and not copy to the pasteboard
npx cross-env ALL=true COOKIE=$SOME_COOKIE_NAME yarn get-cookie > a.txt
This is for normal scanning, which initializes its starting pages from `config.entryList`:
# search for all set-cookie instances
# TASKS, if not defined, defaults to 300
npx cross-env TASKS=100 yarn start

This mode only crawls the URLs in `config.entryList` and then exits:

npx cross-env STOP_IMMEDIATE_PROPAGATION=true DOMAIN=https://www.example.com yarn start

This mode only crawls the URLs in `config.entryList` plus the iframes and `a` links on those pages, and then exits:
npx cross-env STOP_PROPAGATION=true DOMAIN=google yarn start
This is for debugging:
# make sure POOL=1, otherwise you will get entangled results
npx cross-env POOL=1 yarn debug
# it is nice to turn on verbose mode + devtools in debug mode
npx cross-env POOL=1 VERBOSE=true DEV_TOOL=true yarn debug

Then open chrome://inspect in Chrome, and you will see something like this:

In case you don't see it, click "Discover network targets" and try to update it like this:

Prettier is not set up for the time being, and neither is the Husky pre-commit hook.
yarn lint
yarn lint --fix
- The `TASKS` count does not include the tasks in `config.entryList`.
- For Mac users: the code tries to close each tab after each task and destroys all browser instances after the run finishes or is interrupted, but in case the headless browsers are not released, search for "chrome for testing" in the Mac Activity Monitor and kill all the browsers. This problem does not seem to exist on Windows, as tested on another computer of mine.
- Define task count -> the default is 300 tasks, but you may change it with `TASKS=1000`.
- Change pool size -> the default pool size is 10, but you may change it with `POOL=20`. Normally one browser takes up about 1.2G of memory, so if your machine can afford at least 20 * 1.2 * 1.5 (buffer for the running system) = 36G of memory, 20 instances should not be a problem.
- Verbose output -> set the env variable `VERBOSE=true`.
- Change headless browser count -> set the env variable `POOL=$some_number`.
- Show devtools -> set the env variables `DEV_TOOL=true POOL=1`.
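For reference, these variables could be read roughly as in the snippet below. This only illustrates the documented defaults (300 tasks, pool of 10); it is not the tool's actual option parsing.

```ts
// Illustrative mapping of the env variables above to runtime settings;
// the tool's real option handling may differ.
const settings = {
  tasks: Number(process.env.TASKS ?? 300),   // total crawl tasks
  pool: Number(process.env.POOL ?? 10),      // parallel headless browsers
  verbose: process.env.VERBOSE === 'true',   // verbose logging
  devTool: process.env.DEV_TOOL === 'true',  // show devtools (use with POOL=1)
};
```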
- Some domains require interacting with a blocking page before the crawler can start crawling, for example a bot-detection page, or the PIPL consent wall you hit when visiting www.booking.com from China. In the future, allowing users to write custom logic to manipulate Puppeteer should be supported so that these blockers can be worked around with a few mimicked user interactions, for example clicking a checkbox and then clicking OK (see the sketch at the end of this list).
- At the moment only finding Cookies is supported, but finding other tracking technologies such as localStorage/sessionStorage/iframes etc. should also be easy.
- User journey: some cookies are only dropped under specific conditions, which most of the time requires the user to be logged in or to go through some user journey. Still, on each page, the capture of Cookies works the same way, via intercepting `document.cookie=` and observing `set-cookie` headers.
- A more user-friendly UI tool is being planned.
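As a rough illustration of that first idea, a per-domain hook could eventually look something like this. This is purely a sketch of the planned feature, not an existing API of this tool; the `beforeCrawl` hook name and the CSS selectors are made up.

```ts
import type { Page } from 'puppeteer';

// Hypothetical per-domain hook for dismissing a consent wall before crawling.
// Neither `beforeCrawl` nor these selectors exist in the tool today.
export async function beforeCrawl(page: Page): Promise<void> {
  // Wait briefly for a consent dialog; ignore the timeout if none appears.
  const checkbox = await page
    .waitForSelector('#consent-checkbox', { timeout: 5000 })
    .catch(() => null);
  if (!checkbox) return;

  await checkbox.click();                 // tick the consent checkbox
  await page.click('button#consent-ok');  // then click OK to unblock the page
}
```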