
Add Architecture Decision Records for Cloudflare Crawler Replacement#3

Merged
brylie merged 8 commits into main from docs/cloudflare-crawler-adr
Mar 12, 2026


Conversation

Owner

@brylie brylie commented Mar 12, 2026

Introduce Architecture Decision Records (ADRs) to document the decision to replace the bespoke crawler with the Cloudflare Browser Rendering API. This change aims to streamline the crawling process and reduce maintenance overhead while enhancing capabilities. The initial ADR outlines the context, decision, and consequences of this architectural shift. Additionally, a template for future ADRs is included to facilitate consistent documentation.

Summary by CodeRabbit

  • Documentation
    • Added ADR framework and template for documenting major architectural decisions.
    • Introduced ADR proposing migration of crawling to Cloudflare Browser Rendering API: documents updated crawling behavior, simplified configuration options, CLI adjusted to a polling workflow, and authentication via environment variables.
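The polling workflow mentioned above can be sketched as follows. This is a minimal illustration, not the project's implementation: the job-state names (`pending`/`completed`/`failed`) and the status-response keys (`state`, `result`) are assumptions, and the status fetcher is injected as a callable so the sketch stays offline; in real use it would call the Cloudflare API using the `CLOUDFLARE_ACCOUNT_ID` and `CLOUDFLARE_API_TOKEN` environment variables.

```python
import time

def poll_until_done(fetch_status, job_id, interval_s=0.5, max_attempts=20):
    """Poll an injected status callable until the crawl job finishes.

    The 'state' and 'result' keys are assumed names for illustration,
    not the documented Cloudflare response shape.
    """
    for _ in range(max_attempts):
        status = fetch_status(job_id)
        if status["state"] == "completed":
            return status["result"]
        if status["state"] == "failed":
            raise RuntimeError(f"crawl job {job_id} failed")
        time.sleep(interval_s)
    raise TimeoutError(f"crawl job {job_id} did not finish in time")
```

A fake fetcher that reports `pending` once and then `completed` exercises the loop without any network access.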

@qodo-code-review

Review Summary by Qodo

Add Architecture Decision Records framework and Cloudflare crawler replacement ADR

📝 Documentation


Walkthroughs

Description
• Introduces Architecture Decision Records (ADRs) framework for documenting architectural decisions
• Documents decision to replace bespoke crawler with Cloudflare Browser Rendering API
• Provides ADR template and guidelines for future architectural decisions
• Details context, consequences, and alternatives for crawler replacement strategy
Diagram

```mermaid
flowchart LR
  A["ADR Framework"] --> B["README with guidelines"]
  A --> C["Template for future ADRs"]
  D["Crawler Replacement Decision"] --> E["ADR 0001: Cloudflare API adoption"]
  E --> F["Replaces BaseCrawler and CrawlerRunner"]
  E --> G["Updates config models and CLI"]
```


File Changes

1. docs/ADRs/0001-cloudflare-crawler.md 📝 Documentation +100/-0

Cloudflare crawler replacement architectural decision record

• Comprehensive ADR documenting the decision to replace the custom crawler with the Cloudflare Browser Rendering /crawl API
• Details context of existing BaseCrawler implementation and its maintenance burden
• Outlines specific changes: removing BaseCrawler, rewriting CrawlerRunner, updating CrawlerConfig model
• Lists positive consequences (reduced code, better capabilities, robots.txt compliance, Markdown output)
• Documents negative consequences (external dependency, API token management, costs, less control, async polling)
• Identifies risks including beta status, bot protection, and content quality differences
• Evaluates three alternatives considered and explains rejection rationale



2. docs/ADRs/README.md 📝 Documentation +36/-0

ADR framework guidelines and directory documentation

• Establishes ADR directory purpose and explains benefits for onboarding and decision documentation
• Defines file naming convention using sequential numbering with descriptive slugs
• Documents four ADR statuses: Proposed, Accepted, Deprecated, and Superseded
• Provides instructions for creating new ADRs by copying and modifying the template
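As a concrete illustration of the creation step described above (the `0002-example-decision.md` slug is a made-up example, not a file in this PR):

```shell
# Ensure the ADR directory and template exist (setup only, for this sketch),
# then start a new ADR by copying the template to the next sequential number
# with a descriptive slug.
mkdir -p docs/ADRs
[ -f docs/ADRs/_template.md ] || printf '# Title\n' > docs/ADRs/_template.md
cp docs/ADRs/_template.md docs/ADRs/0002-example-decision.md
```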



3. docs/ADRs/_template.md 📝 Documentation +62/-0

Template for creating new Architecture Decision Records

• Provides reusable template for future Architecture Decision Records
• Includes sections for Status, Date, Context, Decision, Consequences (Positive/Negative/Risks), Alternatives considered, and References
• Contains detailed comments explaining what content belongs in each section
• Serves as standardized format for consistent ADR documentation across the project
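Based on the sections listed above, the template's skeleton presumably looks something like the following. The headings are reconstructed from this summary, not copied from the PR:

```markdown
# NNNN. Decision title

## Status
Proposed

## Date
YYYY-MM-DD

## Context
## Decision
## Consequences
### Positive
### Negative
### Risks
## Alternatives considered
## References
```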





@qodo-code-review

qodo-code-review bot commented Mar 12, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0)



Remediation recommended

1. Broken issue reference link | 🐞 Bug | ✓ Correctness
Description
docs/ADRs/0001-cloudflare-crawler.md links to ../../issues/replace-crawler-with-cloudflare.md,
but there is no issues/ directory in the repository, so the reference will be dead. This breaks
navigation for readers trying to follow the decision’s tracking context.
Code

docs/ADRs/0001-cloudflare-crawler.md[100]

+- Related issue: [issues/replace-crawler-with-cloudflare.md](../../issues/replace-crawler-with-cloudflare.md)
Evidence
The ADR’s References section includes a relative link to an issues/ path; the repository structure
(top-level tree) contains no issues/ directory, so the link cannot resolve to a file in-repo.

docs/ADRs/0001-cloudflare-crawler.md[95-100]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The ADR references a non-existent in-repo tracking document (`../../issues/replace-crawler-with-cloudflare.md`), which makes the link dead.
## Issue Context
The repository does not currently contain an `issues/` directory, so the relative link in the ADR cannot resolve.
## Fix Focus Areas
- docs/ADRs/0001-cloudflare-crawler.md[95-100]
## Suggested changes
- Option A: Create `issues/replace-crawler-with-cloudflare.md` (and `issues/` directory) so the link works.
- Option B: Replace the link with a valid URL (e.g., a GitHub issue URL) or remove the “Related issue” line until a tracker exists.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



ⓘ The new review experience is currently in Beta. Learn more



@coderabbitai

coderabbitai bot commented Mar 12, 2026

Warning

Rate limit exceeded

@brylie has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 16 minutes and 6 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 57c680e5-be24-43d6-b0b1-f2d3d9dd95b1

📥 Commits

Reviewing files that changed from the base of the PR and between c4b15f8 and 0df04a4.

📒 Files selected for processing (1)
  • docs/ADRs/0001-cloudflare-crawler.md
📝 Walkthrough

Walkthrough

Adds ADR infrastructure (README and template) and introduces ADR 0001 proposing migration from the project's bespoke crawling/parsing stack to Cloudflare Browser Rendering /crawl API, including config and CLI adjustments and removal of internal crawler/parser dependencies.

Changes

Cohort / File(s) Summary
ADR Infrastructure
docs/ADRs/README.md, docs/ADRs/_template.md
Adds ADR README describing purpose, conventions, statuses, and creation steps; adds a reusable ADR template with sections for status, context, decision, consequences, alternatives, and references.
Cloudflare Crawler Migration Proposal
docs/ADRs/0001-cloudflare-crawler.md
New ADR proposing replacement of internal crawling/parsing (BaseCrawler, CrawlerRunner, Parser, ParserConfig, HtmlToMarkdownConfig) with a Cloudflare /crawl API client and polling workflow; updates config/CLI/auth patterns and documents trade-offs, risks, and alternatives.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Possibly related issues

Poem

🐰 I nibbled through the docs today,

ADRs in order, clear as hay,
Cloudflare winds will carry our crawl,
Less bespoke, more bouncing for all. 🥕✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
• Description Check: ✅ Passed. Check skipped: CodeRabbit’s high-level summary is enabled.
• Title check: ✅ Passed. The title accurately and specifically describes the main change: introducing Architecture Decision Records (ADRs) that document the decision to replace the bespoke crawler with Cloudflare's solution.
• Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/ADRs/0001-cloudflare-crawler.md`:
- Line 99: The Cloudflare blog reference line "[Cloudflare Blog: Crawl entire
websites with a single API call (March 10, 2026)](https://blog.cloudflare.com/)"
is missing the specific article path; update that markdown link to point to the
full article URL for the March 10, 2026 announcement (replace the href with the
complete Cloudflare blog post URL) or remove the entire reference line if the
post does not exist yet, ensuring the markdown label "[Cloudflare Blog: Crawl
entire websites with a single API call (March 10, 2026)]" in the ADR is kept
consistent or deleted accordingly.
- Line 100: The "Related issue" link in docs/ADRs/0001-cloudflare-crawler.md
points to a non-existent file ../../issues/replace-crawler-with-cloudflare.md;
fix it by either adding that missing markdown file under the issues/ directory
with the intended content, replacing the relative link with the proper GitHub
issue reference (e.g., use "#123" or the full GitHub issue URL) or removing the
"Related issue" line entirely; update the single line that starts with "Related
issue:" in docs/ADRs/0001-cloudflare-crawler.md accordingly so the ADR no longer
links to a missing resource.

In `@docs/ADRs/README.md`:
- Around line 17-19: The fenced code block containing the filename pattern
"NNNN-short-description.md" lacks a language identifier; update that markdown
block (the triple-backtick block showing NNNN-short-description.md) to include a
language token such as text or plaintext so the renderer and accessibility tools
recognize it (e.g., change ``` to ```text).
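Applied to that block, the suggested fix just adds a language token after the opening fence (reconstructed example, since the actual README content is not shown in this thread):

````markdown
```text
NNNN-short-description.md
```
````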

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 950760ff-b061-4538-b32d-e506976ccf85

📥 Commits

Reviewing files that changed from the base of the PR and between aebc9cc and 54f9f42.

📒 Files selected for processing (3)
  • docs/ADRs/0001-cloudflare-crawler.md
  • docs/ADRs/README.md
  • docs/ADRs/_template.md

brylie added 4 commits March 12, 2026 20:48
…arser with Cloudflare Browser Rendering API, highlighting the removal of redundant components and the benefits of reduced maintenance and improved crawling capabilities.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
docs/ADRs/0001-cloudflare-crawler.md (2)

35-61: Consider adding migration and testing strategy.

The ADR comprehensively describes what will change but is missing two important sections:

  1. Migration strategy: What happens to existing crawled content in content/{site}/parsed/? Will it be re-crawled using Cloudflare, or will the old content remain? If re-crawling, should it be done incrementally or all at once?

  2. Testing plan: How will you verify that:

    • Cloudflare's frontmatter structure is compatible with the vectorizer's requirements (especially source_url)
    • The downstream RAG pipeline receives proper citation URLs
    • All 5 sites can be successfully crawled within the free plan limits (if applicable)

These additions would strengthen the ADR by addressing operational aspects of the transition.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/ADRs/0001-cloudflare-crawler.md` around lines 35 - 61, Add a short
"Migration strategy" and "Testing plan" sections to the ADR: under Migration
strategy describe exactly what will happen to existing content in
content/{site}/parsed/ (keep as-is vs re-crawl), whether re-crawls will be
incremental or full, and provide a safe roll-back plan and filename/versioning
strategy for migrated Markdown files (reference current artifacts produced by
Parser/BaseCrawler and new outputs from CrawlerRunner). In Testing plan list
concrete checks and steps to validate Cloudflare output: verify YAML frontmatter
contains source_url and other fields required by the vectorizer and downstream
RAG (include exact keys to assert), test that citation URLs flow through the
vectorizer->RAG pipeline, and include load tests to confirm all 5 sites crawl
within the free plan limits; also mention automated unit/integration tests for
CrawlerRunner and CLI (crawl) and env var authentication checks for
CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN.
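The frontmatter check described in that prompt could be sketched as a small validator. The required-key list (here just `source_url`) is an assumption drawn from the review comment, and the naive line-based parsing keeps the sketch dependency-free; a real validator would use a YAML library.

```python
def frontmatter_keys(markdown_text: str) -> set[str]:
    """Return top-level keys of a '---'-delimited YAML frontmatter block."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        if ":" in line and not line.startswith((" ", "\t")):
            keys.add(line.split(":", 1)[0].strip())
    return keys

def has_required_frontmatter(markdown_text: str, required=("source_url",)) -> bool:
    """Check that every required key appears in the document's frontmatter."""
    return set(required) <= frontmatter_keys(markdown_text)
```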

97-101: Add cost and capacity estimates for operational planning.

The negative consequences mention costs and free plan limits but don't provide enough detail for decision-making:

  1. Cost estimation: Provide an estimate of monthly crawling costs based on the 5 sites, their typical page counts, and whether render: true is needed. This helps stakeholders understand the operational expense.

  2. Free plan viability: The 10-minute daily limit on the free plan needs evaluation. Will it suffice for crawling 5 Finnish government sites (migri, te_palvelut, kela, vero, dvv) at the desired frequency? If not, what paid plan is required?

  3. Migration cost: Consider adding the one-time effort to rewrite the crawler, update configs, test, and validate frontmatter compatibility as a negative consequence.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/ADRs/0001-cloudflare-crawler.md` around lines 97 - 101, Expand the
negative consequences under the "**Cost**" and "**Free plan limits**" bullets:
add a monthly cost estimate for crawling the five target sites (migri,
te_palvelut, kela, vero, dvv) using both render: true and render: false based on
typical page counts and an assumed frequency, state whether the 10-minute/day
Free Workers limit can cover the desired crawl schedule (include required paid
plan tier and minutes if not), and add a short "Migration cost" bullet
quantifying one-time engineering effort to rewrite the crawler, update configs,
run tests, and validate frontmatter compatibility (estimate hours or
person-days).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/ADRs/0001-cloudflare-crawler.md`:
- Line 64: Update the ADR text that currently says "CLI interface for the
`crawl` command (same arguments and flags)" to clarify that the command name and
basic structure remain, but the available flags will change to match the new
CrawlerConfig model; remove any mention of flags for removed fields
(delay_between_requests, max_concurrent) and explicitly state that new
Cloudflare-based parameters replace them so the `crawl` command flags reflect
the updated CrawlerConfig.
- Line 44: The ADR currently proposes removing the parse CLI command
(tapio/cli.py::parse) but doesn't provide migration steps; update the ADR to
recommend a gradual deprecation plan: mark tapio.cli.parse as deprecated (emit
warnings), update public docs (README.md and CONTRIBUTING.md) with clear
migration guidance and new workflows, modify or add tests (tests/test_cli.py and
tests/integration/test_parser_pipeline.py) to cover the deprecation path and the
recommended alternative, and document how to migrate site-specific parser
settings from tapio/config/site_configs.yaml (parser_config entries for
migri/kela/vero) into the new workflow or configuration API so existing
separation of crawl vs parse is preserved or replaced.
- Line 31: Update the Cloudflare /crawl parameter names to match the actual API:
nest include/exclude pattern settings under an options object and use camelCase
names (use options.includePatterns and options.excludePatterns instead of
include_patterns/exclude_patterns) and rename max_age to maxAge; keep
metadata.url as the canonical source. Locate the parameters in the ADR paragraph
that currently lists include_patterns, exclude_patterns, and max_age and replace
them with options.includePatterns, options.excludePatterns, and maxAge
respectively so the documented request shape matches the API.
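The corrected request shape from that comment can be illustrated with a small sketch. Only the key names (`maxAge`, `options.includePatterns`, `options.excludePatterns`) come from the review comment; the URL and pattern values are placeholder examples.

```python
# Sketch of the corrected /crawl request body using the camelCase names
# cited in the review comment; values are placeholders, not real config.
request_body = {
    "url": "https://example.com",
    "maxAge": 3600,
    "options": {
        "includePatterns": ["/en/*"],
        "excludePatterns": ["/admin/*"],
    },
}
```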

---

Nitpick comments:
In `@docs/ADRs/0001-cloudflare-crawler.md`:
- Around line 35-61: Add a short "Migration strategy" and "Testing plan"
sections to the ADR: under Migration strategy describe exactly what will happen
to existing content in content/{site}/parsed/ (keep as-is vs re-crawl), whether
re-crawls will be incremental or full, and provide a safe roll-back plan and
filename/versioning strategy for migrated Markdown files (reference current
artifacts produced by Parser/BaseCrawler and new outputs from CrawlerRunner). In
Testing plan list concrete checks and steps to validate Cloudflare output:
verify YAML frontmatter contains source_url and other fields required by the
vectorizer and downstream RAG (include exact keys to assert), test that citation
URLs flow through the vectorizer->RAG pipeline, and include load tests to
confirm all 5 sites crawl within the free plan limits; also mention automated
unit/integration tests for CrawlerRunner and CLI (crawl) and env var
authentication checks for CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN.
- Around line 97-101: Expand the negative consequences under the "**Cost**" and
"**Free plan limits**" bullets: add a monthly cost estimate for crawling the
five target sites (migri, te_palvelut, kela, vero, dvv) using both render: true
and render: false based on typical page counts and an assumed frequency, state
whether the 10-minute/day Free Workers limit can cover the desired crawl
schedule (include required paid plan tier and minutes if not), and add a short
"Migration cost" bullet quantifying one-time engineering effort to rewrite the
crawler, update configs, run tests, and validate frontmatter compatibility
(estimate hours or person-days).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 730cf62f-0baf-4bf8-a86c-6e34b68a1e78

📥 Commits

Reviewing files that changed from the base of the PR and between 54f9f42 and c4b15f8.

📒 Files selected for processing (2)
  • docs/ADRs/0001-cloudflare-crawler.md
  • docs/ADRs/README.md

Comment thread docs/ADRs/0001-cloudflare-crawler.md
1. **Remove** `BaseCrawler` class (`tapio/crawler/crawler.py`) — the entire custom crawling logic
2. **Remove** `Parser` class (`tapio/parser/parser.py`) — XPath extraction, HTML-to-Markdown conversion, frontmatter generation
3. **Remove** `ParserConfig` and `HtmlToMarkdownConfig` from `tapio/config/config_models.py`
4. **Remove** the `parse` CLI command from `tapio/cli.py`

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Search for references to the parse command in docs, scripts, and workflows
rg -n 'parse' --type=md --type=yaml --type=sh -g '!ADRs/*'
```

Repository: brylie/tapio

Length of output: 2617


🏁 Script executed:

```shell
#!/bin/bash
# Check if there are any GitHub Actions or CI/CD configs that invoke parse
fd -t f -e yml -e yaml .github | xargs rg -n 'parse' 2>/dev/null || echo "No GitHub workflows found"
```

Repository: brylie/tapio

Length of output: 30472


Parse command removal is a breaking change that impacts documented workflows and requires migration guidance.

The parse CLI command is not isolated—it has significant dependencies that removal would break:

  1. Public documentation — README.md (lines 57, 72) and CONTRIBUTING.md document the parse command with usage examples like uv run -m tapio.cli parse migri

  2. Extensive test coverage — tests/test_cli.py includes 10+ test cases specifically for parse command scenarios (single site, no site specified, custom config, error handling, etc.)

  3. Integration tests — tests/integration/test_parser_pipeline.py validates the parser pipeline as a standalone workflow

  4. Site configuration — tapio/config/site_configs.yaml defines parser_config for multiple sites (migri, kela, vero), indicating parser-specific tuning is expected

  5. Separation of concerns — The architecture supports independent parse steps (parsing previously-crawled HTML separately from the crawl step), which users may depend on

The ADR should address migration strategy: Before removing the parse command, update documentation, deprecate gradually, and provide migration guidance for workflows that rely on separated crawl and parse steps or site-specific parser tuning.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/ADRs/0001-cloudflare-crawler.md` at line 44, The ADR currently proposes
removing the parse CLI command (tapio/cli.py::parse) but doesn't provide
migration steps; update the ADR to recommend a gradual deprecation plan: mark
tapio.cli.parse as deprecated (emit warnings), update public docs (README.md and
CONTRIBUTING.md) with clear migration guidance and new workflows, modify or add
tests (tests/test_cli.py and tests/integration/test_parser_pipeline.py) to cover
the deprecation path and the recommended alternative, and document how to
migrate site-specific parser settings from tapio/config/site_configs.yaml
(parser_config entries for migri/kela/vero) into the new workflow or
configuration API so existing separation of crawl vs parse is preserved or
replaced.

@brylie brylie merged commit a7e2869 into main Mar 12, 2026
4 checks passed
