Scrape Dojo

Declarative web scraping & browser automation with JSON workflows



Note

🤖 AI-Aided Development (AIAD)

This project openly uses AI-assisted development (e.g. Claude Code) to accelerate workflows, improve code quality, and keep development momentum. All AI-generated code is reviewed and approved by humans; this is not a vibe-coding project, but a deliberate effort to build a useful product while exploring the boundaries, benefits, and trade-offs of AI-aided development.


🥷 What is Scrape Dojo?

Scrape Dojo is a self-hosted web scraping & browser automation platform. Instead of writing Puppeteer code for every site, you define workflows declaratively in JSON/JSONC — like Infrastructure-as-Code, but for scraping.

Key capabilities:

  • ⚡ 25+ built-in actions — navigate, click, type, extract, loop, download, screenshot, and more
  • 🧩 Handlebars + JSONata — dynamic templates and powerful data transformations
  • ⏰ Cron scheduling — automate scrapes with cron, webhooks, or startup triggers
  • 🔐 Encrypted secrets — AES-256-CBC at-rest encryption for credentials
  • 📡 Real-time monitoring — SSE-powered live execution tracking in Angular UI
  • 🛡️ Auth (optional) — JWT, OIDC/SSO, MFA/TOTP, API keys
  • 🗄️ Multi-DB — SQLite (default), MySQL, PostgreSQL
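The Handlebars-style placeholders used in workflow params (such as {{previousData.title}} in the logger action further below) resolve against the output of earlier actions. A minimal TypeScript sketch of that idea, assuming a simple dotted-path lookup; this is NOT Scrape Dojo's actual template engine (it uses Handlebars plus JSONata), only an illustration:

```typescript
// Sketch only: resolve {{path.to.value}} placeholders against a context
// holding the results of earlier actions. Not the project's real engine.
type Context = { previousData: Record<string, unknown> };

function renderTemplate(template: string, ctx: Context): string {
  return template.replace(/\{\{([\w.]+)\}\}/g, (_, path: string) => {
    // Walk the dotted path through the context object.
    const value = path
      .split(".")
      .reduce<unknown>((obj, key) => (obj as any)?.[key], ctx);
    return value == null ? "" : String(value);
  });
}

const ctx: Context = { previousData: { title: "Example Domain" } };
renderTemplate("Title: {{previousData.title}}", ctx);
// → "Title: Example Domain"
```

The real engine additionally supports JSONata expressions for transforming extracted data, which a plain string substitution like this cannot do.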

Important

Scrape Dojo automates real browser interactions. Please respect website terms of service and applicable legal frameworks.

Full documentation: scrape-dojo.com


🐳 Quick Start (Docker)

# 1. Generate encryption key
node -e "console.log(require('crypto').randomBytes(32).toString('hex'))"

# 2. Create docker-compose.yml
cat <<'EOF' > docker-compose.yml
services:
  scrape-dojo:
    image: ghcr.io/disane87/scrape-dojo:latest
    ports:
      - '8080:80'
    environment:
      - SCRAPE_DOJO_ENCRYPTION_KEY=your_generated_key_here
      - SCRAPE_DOJO_AUTH_JWT_SECRET=your_random_jwt_secret_here
      - SCRAPE_DOJO_AUTH_REFRESH_TOKEN_SECRET=your_random_refresh_secret_here
      - DB_TYPE=sqlite
      # - SCRAPE_DOJO_PROXY_URL=http://proxy:8080  # Optional: route scrapes through a proxy
    volumes:
      - ./data:/home/pptruser/app/data
      - ./downloads:/home/pptruser/app/downloads
      - ./logs:/home/pptruser/app/logs
      - ./config:/home/pptruser/app/config
      - ./browser-data:/home/pptruser/app/browser-data
    restart: unless-stopped
EOF

# 3. Start
docker compose up -d

Open http://localhost:8080 — UI and API on the same port.

Warning

The SCRAPE_DOJO_ENCRYPTION_KEY encrypts all secrets. Store it safely — if lost, existing secrets are unrecoverable.
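To see why a lost key is unrecoverable, here is an illustrative sketch of AES-256-CBC at-rest encryption with a 32-byte hex key like the one generated in step 1. This is NOT Scrape Dojo's actual secret-storage code, just the general shape of the scheme: without the exact key bytes, decryption of stored ciphertext fails.

```typescript
// Illustrative AES-256-CBC round trip with Node's built-in crypto module.
// A 32-byte key (64 hex chars) is required for AES-256.
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

function encrypt(plaintext: string, hexKey: string): string {
  const key = Buffer.from(hexKey, "hex"); // 32 bytes
  const iv = randomBytes(16);             // fresh IV per encryption
  const cipher = createCipheriv("aes-256-cbc", key, iv);
  const enc = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  // Store the IV alongside the ciphertext; it is not secret.
  return iv.toString("hex") + ":" + enc.toString("hex");
}

function decrypt(payload: string, hexKey: string): string {
  const [ivHex, dataHex] = payload.split(":");
  const decipher = createDecipheriv(
    "aes-256-cbc",
    Buffer.from(hexKey, "hex"),
    Buffer.from(ivHex, "hex"),
  );
  return Buffer.concat([
    decipher.update(Buffer.from(dataHex, "hex")),
    decipher.final(),
  ]).toString("utf8");
}

const key = randomBytes(32).toString("hex");
decrypt(encrypt("my-api-token", key), key); // round-trips to "my-api-token"
```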

For local development, environment variables, auth setup, and more: see the Quickstart Guide.


⚡ Your First Scrape

Create config/sites/my-first-scrape.jsonc:

{
  "$schema": "../scrapes.schema.json",
  "scrapes": [
    {
      "id": "my-first-scrape",
      "metadata": {
        "description": "Read a page title",
        "triggers": [{ "type": "manual" }]
      },
      "steps": [
        {
          "name": "Main",
          "actions": [
            {
              "name": "open",
              "action": "navigate",
              "params": { "url": "https://example.com" }
            },
            {
              "name": "title",
              "action": "extract",
              "params": { "selector": "h1" }
            },
            {
              "name": "log",
              "action": "logger",
              "params": { "message": "Title: {{previousData.title}}" }
            }
          ]
        }
      ]
    }
  ]
}

The scrape auto-appears in the UI (hot reload). Click Run or use the API:

curl http://localhost:8080/api/scrape/my-first-scrape
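To run the same scrape on a schedule instead of manually, swap the trigger in the metadata block. This is a hypothetical sketch: the exact trigger fields are defined by scrapes.schema.json and the Scheduling docs, so the "cron" field name below is an assumption, not verified against the schema.

```jsonc
{
  "metadata": {
    "description": "Read a page title every morning",
    // Assumed field name; check the Scheduling docs for the real schema.
    // "0 6 * * *" is standard five-field cron syntax for 06:00 daily.
    "triggers": [{ "type": "cron", "cron": "0 6 * * *" }]
  }
}
```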

📖 Documentation

Everything else lives in the docs:

  • 🚀 Quickstart (Docker & Source): Getting Started
  • 📐 Config format & metadata: Configuration
  • ⚡ All actions with examples: Actions Reference
  • 🧩 Templates & JSONata: Templates
  • ⏰ Scheduling & triggers: Scheduling
  • 🔐 Secrets & variables: Secrets & Variables
  • ⚙️ Environment variables: Env Reference
  • 🏗️ Architecture & API: Developer Guide
  • 🛡️ Auth (JWT/OIDC/MFA): Authentication
  • 💡 Full examples: Examples

🛠️ Development

git clone https://github.com/disane87/scrape-dojo.git && cd scrape-dojo
pnpm install
cp .env.example .env  # Set SCRAPE_DOJO_ENCRYPTION_KEY
pnpm start            # API (3000) + UI (4200)
pnpm test             # All tests

Available commands:

  • pnpm start       API + UI dev servers
  • pnpm test        All tests
  • pnpm test:api    API tests only
  • pnpm test:ui     UI tests only
  • pnpm lint        Lint all projects
  • pnpm build       Build all apps

Commits follow Conventional Commits (feat:, fix:, docs:, etc.).


🤝 Contributing

  • 🐛 Issues & bugs: GitHub Issues
  • 💡 Feature requests: New Issue
  • 🔀 Pull requests: Fork → branch → commit → PR

📄 License

MIT — use it however you like.


🌟 Contributors

Contributors

Made with ❤️ by Marco Franke

Documentation · Issues · Discussions
