A comprehensive web application and worker system that automatically generates and maintains llms.txt files for static sites based on the standard described at llmstxt.org. The system monitors website changes and outputs a structured text artifact optimized for LLM consumption.
Important caveat: the tool currently only works with static sites. Try some of the sites here. In addition, to save on Google Cloud compute costs, the demo crawls a maximum of 100 URLs.
Visit the live demo at https://llms-txt-generator.vercel.app/
- Sign in with Google to be redirected to the projects page.
- Create projects associated with a website and schedule llms.txt generations with a daily or weekly frequency.
- Creating a project will immediately trigger a run to generate an llms.txt for the site.
- Due to Google Cloud Run cold starts, it may take a while for the worker to ingest the job and update the run state to in-progress. You can press "refresh" to check for newly in-progress or completed jobs; the page will keep refreshing automatically while an in-progress run is detected, until the job completes. (This could be fixed by setting the minimum number of Cloud Run instances to 1, but for demo purposes I keep it at zero.)
- When the worker detects new pages or page metadata changes compared to previous crawls, a new version of the llms.txt will be generated. Runs will automatically be executed with the frequency specified in the project configuration, but they can also be manually triggered on the project detail page.
- Add outgoing webhooks to receive a link to the updated llms.txt every time a run (manual or automated) generates a new version. Individual webhooks can be disabled or deleted.
The main dashboard showing all your monitoring projects with status and quick actions
Detailed view of a specific project showing run history, configuration, and generated files
The LLMs.txt Crawler consists of two main components:
- Web Application (`apps/web`): A Next.js-based dashboard for managing projects, viewing generated llms.txt files, and configuring automated monitoring
- Worker Service (`apps/worker`): A Python-based background service that crawls websites, detects changes, and generates llms.txt files
- Google Authentication: Secure sign-in using Google One Tap
- Project Management: Create, configure, and manage website monitoring projects
- Dashboard: View all your projects with status, last run times, and quick actions
- Real-time Monitoring: Track crawling progress and view detailed run histories
- File Management: Download generated `llms.txt` files and view change history
- Webhook Configuration: Set up webhooks to automatically update your site when changes are detected
The worker is a Python-based HTTP server deployed on Google Cloud Run that handles the heavy lifting of website crawling and llms.txt generation. It receives job requests via Google Cloud Tasks and processes them asynchronously.
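The repo's `cloud_tasks_client.py` handles this integration; purely as an illustration (the function name, parameters, and structure below are assumptions, not that module's API), a crawl job could be enqueued as a Cloud Tasks HTTP task that POSTs to the worker using the `google-cloud-tasks` client, which is already among the worker dependencies:

```python
import json
from google.cloud import tasks_v2

def enqueue_crawl_job(project_id, region, queue, worker_url, payload):
    """Enqueue an HTTP task that POSTs a crawl-job payload to the worker."""
    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path(project_id, region, queue)
    task = tasks_v2.Task(
        http_request=tasks_v2.HttpRequest(
            http_method=tasks_v2.HttpMethod.POST,
            url=worker_url,
            headers={"Content-Type": "application/json"},
            body=json.dumps(payload).encode("utf-8"),
        )
    )
    return client.create_task(request={"parent": parent, "task": task})
```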
- Intelligent Web Crawling: Respects `robots.txt` and implements rate limiting
- Change Detection: Two-phase detection using HTTP headers and content hashing
- Content Processing: Extracts and normalizes web content for LLM consumption
- LLMS.txt Generation: Creates structured files following the llms.txt specification
- Automated Scheduling: Receives and enqueues scheduled crawl jobs via Google Cloud Tasks queue
- Webhook Integration: Notifies external systems when content changes
- Header-based Detection: Uses ETag and Last-Modified headers for quick change identification
- Content-based Detection: Falls back to SHA256 hashing of normalized content
- Smart Crawling: Only processes pages that have actually changed
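A rough sketch of the two-phase check described above (helper and field names are illustrative, not the actual `change_detection.py` interface): a conditional GET using stored ETag/Last-Modified values, falling back to a SHA256 hash of the fetched content.

```python
import hashlib
import requests

def page_changed(url, previous):
    """previous: dict with 'etag', 'last_modified', 'content_hash' from the last crawl."""
    headers = {}
    if previous.get("etag"):
        headers["If-None-Match"] = previous["etag"]
    if previous.get("last_modified"):
        headers["If-Modified-Since"] = previous["last_modified"]

    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 304:
        # Header-based detection: the server confirms nothing changed.
        return False, previous

    # Content-based fallback: hash the body (the worker hashes *normalized*
    # content, as described in the next section).
    new_hash = hashlib.sha256(response.text.encode("utf-8")).hexdigest()
    return new_hash != previous.get("content_hash"), {
        "etag": response.headers.get("ETag"),
        "last_modified": response.headers.get("Last-Modified"),
        "content_hash": new_hash,
    }
```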
- Removes timestamps and dynamic content that shouldn't affect change detection
- Strips script tags, style elements, and other non-content elements
- Normalizes whitespace and extracts meaningful text content
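A minimal sketch of that normalization using BeautifulSoup and lxml (both listed worker dependencies); the actual rules in the worker may differ:

```python
import re
from bs4 import BeautifulSoup

def normalize_content(html):
    """Reduce a page to stable, meaningful text before hashing."""
    soup = BeautifulSoup(html, "lxml")
    # Strip script tags, style elements, and other non-content elements.
    # (Timestamp/dynamic-content stripping is omitted in this sketch.)
    for tag in soup(["script", "style", "noscript", "template"]):
        tag.decompose()
    # Normalize whitespace and extract the remaining text content.
    text = soup.get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip()
```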
- `crawler.py`: Web crawling with change detection integration
- `change_detection.py`: Content change detection using headers and SHA256 hashing
- `llms_generator.py`: Generates LLMS.txt formatted content from crawl results
- `storage.py`: Database operations and run status management
- `s3_storage.py`: S3 upload operations and artifact management
- `scheduling.py`: Cron scheduling and task management
- `webhooks.py`: Webhook management and execution
- `cloud_tasks_client.py`: Google Cloud Tasks integration
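For reference, the llms.txt format the generator targets (per llmstxt.org) is a Markdown file with an H1 title, an optional blockquote summary, and H2 sections containing link lists. A toy sketch of assembling one from crawl results (function and field names here are hypothetical, not the `llms_generator.py` API):

```python
def build_llms_txt(site_title, summary, pages):
    """pages: iterable of dicts with 'title', 'url', and optional 'description'."""
    lines = [f"# {site_title}", "", f"> {summary}", "", "## Pages"]
    for page in pages:
        entry = f"- [{page['title']}]({page['url']})"
        if page.get("description"):
            entry += f": {page['description']}"
        lines.append(entry)
    return "\n".join(lines) + "\n"
```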
- Node.js 22.x
- Python 3.11+ with `uv` for package management
- GitHub account
- Vercel account (for Next.js deployment) with this repo linked and the output directory set to `apps/web/.next`
- Supabase account (for DB+Object Storage)
- Google Cloud account (for Cloud Tasks/Cloud Run) with a project that has an Artifact Registry repo and a Cloud Tasks queue, and a service account with access to deploy to the repo and write to the queue
Create a `.env` file in the project root with the following variables. All of them must also be configured as repository secrets on GitHub, and the `.env` file must be uploaded to the Vercel environment variables page.
```
# Supabase Configuration
NEXT_PUBLIC_GOOGLE_CLIENT_ID=your_google_client_id_for_oauth
NEXT_PUBLIC_SUPABASE_URL=your_supabase_url
NEXT_PUBLIC_SUPABASE_ANON_KEY=your_supabase_anon_key
SUPABASE_SERVICE_ROLE_KEY=your_supabase_service_role_key
SUPABASE_PROJECT_ID=your_supabase_project_id

# Supabase object storage configuration
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
S3_BUCKET_NAME=your_s3_bucket_name

# Google Cloud Tasks / Cloud Run
GOOGLE_CLOUD_PROJECT_ID=your_gcp_project_id
GCP_SA_KEY=your_gcp_service_account_key # GCP service account key
PROJECT_ID=your_gcp_project_id
REGION=your_gcp_region # e.g. us-central1
REPOSITORY=your_gcp_container_repo # e.g. worker-repo
SERVICE=your_cloud_run_service_name # e.g. llms-worker
WORKER_URL=https://your-worker-endpoint
PORT=8080

# Crawling Configuration (optional)
CRAWL_MAX_PAGES=100
CRAWL_MAX_DEPTH=2
CRAWL_DELAY=0.5
```

- Create a new Supabase project
- Run the database schema from `infra/schemas.sql`:

```bash
# Connect to your Supabase database and run:
psql -h your-db-host -U postgres -d postgres -f infra/schemas.sql
```

- Install dependencies:
```bash
npm install
```

- Start the development server:

```bash
npm run dev
```

The web application will be available at http://localhost:3001
- Navigate to the worker directory:
```bash
cd apps/worker
```

- Install Python dependencies using `uv`:

```bash
uv sync
```

- Run the worker:

```bash
npm run dev
# or directly with Python:
uv run worker.py
```

The worker will start an HTTP server on port 8080 (configurable via the `PORT` environment variable). Because the frontend sends tasks to Cloud Tasks, which cannot reach a local worker, for local development we must send POST requests to the locally running worker with curl (or similar) instead of using the frontend. (A possible future improvement to reduce development friction is to expose the local worker with a tunnel using a tool like ngrok.)
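For example, a test job can be sent to the local worker like this (shown with Python's `requests`; an equivalent curl command works the same way). The payload shape matches the worker API documented below; the IDs are placeholders and should reference a project and run that exist in your database:

```python
import requests

job = {
    "id": "job_123",                # arbitrary job identifier
    "url": "https://example.com",   # site to crawl
    "projectId": "project_456",     # should reference an existing project
    "runId": "run_789",             # should reference an existing run
    "isScheduled": False,
}

resp = requests.post("http://localhost:8080/", json=job, timeout=30)
print(resp.status_code, resp.text)
```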
Core Framework:
- Next.js 15.4.2
- React 19.1.0
- TypeScript 5.9.2
Authentication & Database:
- @supabase/supabase-js 2.57.4
- @supabase/ssr 0.7.0
- google-one-tap 1.0.6
Cloud Services:
- @google-cloud/tasks 4.0.1
UI & Styling:
- Tailwind CSS 4.1.5
Core Libraries:
- python-dotenv 1.0.0+
- requests 2.31.0+
- beautifulsoup4 4.12.2+
- lxml 4.9.3+
Cloud Services:
- boto3 (AWS SDK)
- supabase 2.0.0+
- google-cloud-tasks 2.16.0+
- Sign in to the web application using Google
- Click "Create New Project"
- Enter your website URL and configuration:
- Name: Friendly name for your project
- Domain: Website URL to monitor
- Description: Optional description
- Crawl Depth: How many levels deep to crawl (default: 2)
- Schedule: How often to check for changes (daily, weekly, etc.)
The system automatically:
- Crawls your website at scheduled intervals
- Detects content changes using intelligent algorithms
- Generates updated `llms.txt` files when changes are found
- Stores artifacts in S3 for download
- Calls configured webhooks to notify external systems
- Access your project dashboard
- View the latest run status
- Click "Download llms.txt" to get the generated file
- View change history to see what was updated
Configure webhooks to automatically publish generated files to your website:
- Go to project settings
- Add webhook URL
- Configure an optional secret
- The system will POST a link to the generated `llms.txt` to your webhook
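A minimal receiver sketch (using Flask as an arbitrary example framework; the endpoint path and the notification's exact field names are not specified here, so the handler just logs the JSON body and returns a success status):

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/llms-txt-webhook", methods=["POST"])
def handle_llms_txt_update():
    # Log whatever the notification contains (it carries a link to the new
    # llms.txt), then fetch/publish the file however fits your site.
    payload = request.get_json(silent=True) or {}
    app.logger.info("llms.txt updated: %s", payload)
    return "", 204

if __name__ == "__main__":
    app.run(port=9000)
```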
```
POST /
Content-Type: application/json

{
  "id": "job_123",
  "url": "https://example.com",
  "projectId": "project_456",
  "runId": "run_789",
  "isScheduled": false
}
```

```
GET /health
```

The system provides comprehensive logging for:
- Web Application: User actions, API calls, authentication events
- Worker: Crawling progress, change detection results, S3 uploads, webhook calls
- Database: All operations are logged with timestamps and user context
A comprehensive test suite (Jest tests for the frontend API and Pytest tests for the worker) can be invoked with `npm run test` from the repo root.
