An automated pipeline for assessing WCAG 2.1/2.2 Level AA colour contrast compliance across the 500 most-crawled registered domains in Common Crawl's February 2026 crawl (CC-MAIN-2026-08), using archived page copies from the crawl's WARC files.
This pipeline:
- Takes the top 500 registered domains from CC-MAIN-2026-08 crawl statistics
- Queries Common Crawl's Columnar Index via Amazon Athena to locate archived homepage captures in a single SQL pass
- Fetches the actual HTML from WARC files via byte-range requests to data.commoncrawl.org
- Parses all CSS colour declarations (inline `style` attributes and embedded `<style>` blocks)
- Evaluates every foreground/background colour pairing against WCAG 2.1/2.2 Level AA thresholds
- Produces a comprehensive JSON results file and summary statistics
No live websites are crawled. All page content comes from Common Crawl's open archive.
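The byte-range fetch step can be sketched as follows. This is a minimal illustration, not the pipeline's actual implementation; it assumes the index lookup has already yielded a WARC filename, record offset, and record length (as the Columnar Index provides), and that each record is individually gzipped, as in Common Crawl WARC files.

```python
import gzip
import urllib.request

CC_DATA = "https://data.commoncrawl.org/"

def build_range_request(warc_filename, offset, length):
    """Build an HTTP request for exactly the bytes of one WARC record."""
    req = urllib.request.Request(CC_DATA + warc_filename)
    # HTTP Range is inclusive on both ends, hence the -1.
    req.add_header("Range", f"bytes={offset}-{offset + length - 1}")
    return req

def fetch_warc_record(warc_filename, offset, length):
    """Fetch and decompress a single gzipped WARC record (performs a network call)."""
    req = build_range_request(warc_filename, offset, length)
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())
```

Because each record is a self-contained gzip member, the slice returned by the range request decompresses on its own, with no need to download the multi-gigabyte WARC file.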
The pipeline uses Common Crawl's Columnar Index, a Parquet-based representation of the crawl index stored on S3 at `s3://commoncrawl/cc-index/table/cc-main/warc/`. A single Athena SQL query finds all 500 homepage captures in one pass.
The query:
- Filters for `crawl = 'CC-MAIN-2026-08'`, `subset = 'warc'`, `fetch_status = 200`, `url_path = '/'`, and `content_mime_detected = 'text/html'`
- Uses `ROW_NUMBER() OVER (PARTITION BY url_host_registered_domain ...)` to pick one capture per domain
- Prefers the `www` subdomain or bare domain over deep subdomains, HTTPS over HTTP, and the most recent capture
- Scans roughly 100-300 GB of columnar data at a typical cost of $0.50-1.50
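A query matching the description above looks roughly like this. It is a sketch, not necessarily the exact SQL the script emits; the `ccindex` table name and the `my_top500_domains` helper table holding the input domain list are assumptions.

```python
# Sketch of the Athena query implied above; table names are assumptions.
QUERY_SKETCH = """
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM (
    SELECT url, warc_filename, warc_record_offset, warc_record_length,
           ROW_NUMBER() OVER (
               PARTITION BY url_host_registered_domain
               ORDER BY (url_host_name = url_host_registered_domain) DESC,
                        (url_host_name = 'www.' || url_host_registered_domain) DESC,
                        (url_protocol = 'https') DESC,
                        fetch_time DESC
           ) AS rn
    FROM ccindex
    WHERE crawl = 'CC-MAIN-2026-08'
      AND subset = 'warc'
      AND fetch_status = 200
      AND url_path = '/'
      AND content_mime_detected = 'text/html'
      AND url_host_registered_domain IN (SELECT domain FROM my_top500_domains)
)
WHERE rn = 1
"""
```

The `ORDER BY` inside the window ranks candidate captures per domain (bare domain first, then `www`, then HTTPS, then recency), and `rn = 1` keeps only the best-ranked capture.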
- Python 3.9+
- For Athena auto mode: `pip install pyathena` and AWS credentials with Athena access
- No other external dependencies (uses only `urllib`, `json`, `re`, `html.parser`, `csv`, `gzip`, `io`, `concurrent.futures`)

See ATHENA_SETUP for instructions on setting up Amazon Athena.
Step 1 (`01_fetch_index.py`) supports three modes for querying the Columnar Index:

```
# Mode 1: Print the SQL query, run it yourself in the Athena console
python3 01_fetch_index.py --mode=sql

# Mode 2: Import results from a CSV downloaded from the Athena console
python3 01_fetch_index.py --mode=csv path/to/athena-results.csv

# Mode 3: Run the query directly via pyathena (requires AWS credentials)
export ATHENA_OUTPUT=s3://your-bucket/athena-results/
python3 01_fetch_index.py --mode=auto
```

Optionally use a personal database namespace to avoid touching shared resources:

```
python3 01_fetch_index.py --mode=auto --database=my_wcag --setup
```

Then run the rest of the pipeline:

```
python3 02_fetch_warc.py        # ~1 min (WARC byte-range fetches with 8 workers)
python3 03_analyse_wcag.py      # ~2 min (colour extraction + analysis)
python3 04_generate_report.py   # ~2 sec (summary statistics)
```

Or use the wrapper:

```
./run.sh                              # Query existing ccindex table
./run.sh --setup                      # Create personal table first
./run.sh --database=my_wcag --setup   # Use personal namespace
./run.sh sql                          # Print SQL only
./run.sh csv path/to/results.csv      # Import CSV
```

- `data/domains-top-500.csv` -- Input domain list with rankings
- `data/index_results.json` -- Columnar Index lookup results (WARC filename, offset, length)
- `data/warc_html/` -- Extracted HTML files from WARC records
- `output/wcag_results.json` -- Per-domain WCAG analysis results
- `output/wcag_summary.json` -- Aggregate statistics
- `output/wcag_report.csv` -- Tabular summary for spreadsheet use
- `wcag-dashboard.html` -- Interactive results dashboard
The interactive dashboard (wcag-dashboard.html) visualises the audit results across four tabs: Overview, Distribution, By Category, and Notable Sites. It is a standalone HTML file with no external dependencies beyond Google Fonts. The dashboard itself passes WCAG 2.1 Level AA colour contrast on all text/background pairings.
| Element | Minimum contrast ratio |
|---|---|
| Normal text (< 18pt, or < 14pt bold) | 4.5:1 |
| Large text (>= 18pt, or >= 14pt bold) | 3:1 |
| UI components and graphical objects | 3:1 |
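The thresholds above apply to the contrast ratio WCAG defines from relative luminance. A sketch of that computation from 8-bit sRGB values (function names are illustrative, not the script's actual helpers):

```python
def _linearise(c8):
    """Linearise one 8-bit sRGB channel per the WCAG relative-luminance formula."""
    c = c8 / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb):
    """WCAG relative luminance: weighted sum of linearised R, G, B."""
    r, g, b = (_linearise(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """WCAG contrast ratio (L_lighter + 0.05) / (L_darker + 0.05), from 1:1 to 21:1."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)

# Black on white yields the maximum possible ratio:
# contrast_ratio((0, 0, 0), (255, 255, 255)) -> 21.0
```

A pairing passes AA for normal text when `contrast_ratio(fg, bg) >= 4.5`, and for large text when it is at least 3.0.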
- Colour extraction is static: it parses CSS from the archived HTML without executing JavaScript. This means dynamically injected styles are not captured, but all inline `style` attributes and embedded `<style>` blocks are analysed.
- When only a foreground colour is specified without an explicit background, white (`#FFFFFF`) is assumed.
- When only a background colour is specified without explicit foreground text, black (`#000000`) is assumed.
- Named CSS colours (e.g., `red`, `navy`, `cornflowerblue`) are fully supported.
- Shorthand hex colours (e.g., `#fff`) are expanded to full form.
- `rgb()`, `rgba()`, `hsl()`, and `hsla()` functions are parsed.
- The crawl used is CC-MAIN-2026-08, Common Crawl's February 2026 crawl.
- Crawl ID: CC-MAIN-2026-08
- Domain ranking source: CC Crawl Statistics
- Columnar Index: `s3://commoncrawl/cc-index/table/cc-main/warc/` (queried via Amazon Athena)
- WARC data: https://data.commoncrawl.org/
- This code is released under the MIT Licence.
- Site content is dedicated to the public domain under CC0 1.0.
- Common Crawl data is available under the Common Crawl Terms of Use.