Add Architecture Decision Records for Cloudflare Crawler Replacement #3
Review Summary by Qodo

Add Architecture Decision Records framework and Cloudflare crawler replacement ADR

Description:
- Introduces Architecture Decision Records (ADRs) framework for documenting architectural decisions
- Documents decision to replace bespoke crawler with Cloudflare Browser Rendering API
- Provides ADR template and guidelines for future architectural decisions
- Details context, consequences, and alternatives for crawler replacement strategy

Diagram:

```mermaid
flowchart LR
    A["ADR Framework"] --> B["README with guidelines"]
    A --> C["Template for future ADRs"]
    D["Crawler Replacement Decision"] --> E["ADR 0001: Cloudflare API adoption"]
    E --> F["Replaces BaseCrawler and CrawlerRunner"]
    E --> G["Updates config models and CLI"]
```

File Changes:
1. docs/ADRs/0001-cloudflare-crawler.md
Code Review by Qodo
📝 Walkthrough

Adds ADR infrastructure (README and template) and introduces ADR 0001 proposing migration from the project's bespoke crawling/parsing stack to the Cloudflare Browser Rendering /crawl API, including config and CLI adjustments and removal of internal crawler/parser dependencies.

Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/ADRs/0001-cloudflare-crawler.md`:
- Line 99: The Cloudflare blog reference `[Cloudflare Blog: Crawl entire websites with a single API call (March 10, 2026)](https://blog.cloudflare.com/)` is missing the specific article path. Update the markdown link to point to the full URL of the announcement post, or remove the entire reference line if the post does not exist yet.
- Line 100: The "Related issue" link in docs/ADRs/0001-cloudflare-crawler.md points to a non-existent file `../../issues/replace-crawler-with-cloudflare.md`. Fix it by adding the missing markdown file under the issues/ directory, replacing the relative link with a proper GitHub issue reference (e.g., `#123` or the full issue URL), or removing the "Related issue" line entirely so the ADR no longer links to a missing resource.
In `@docs/ADRs/README.md`:
- Around lines 17-19: The fenced code block containing the filename pattern `NNNN-short-description.md` lacks a language identifier. Add a language token such as `text` or `plaintext` (e.g., change ``` to ```text) so renderers and accessibility tools recognize the block.
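This class of issue can be audited mechanically. The sketch below scans a Markdown string for opening triple-backtick fences that carry no language token; it is a rough check that ignores tilde fences and indented code blocks, and the sample document is invented:

```python
import re

def bare_fences(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening code fences with no language token."""
    line_numbers = []
    inside_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        match = re.match(r"^```(\S*)\s*$", line.strip())
        if match is None:
            continue
        if inside_fence:
            inside_fence = False  # this fence line closes the open block
        else:
            inside_fence = True   # this fence line opens a block
            if match.group(1) == "":
                line_numbers.append(number)
    return line_numbers

sample = "\n".join(["# ADR filenames", "```", "NNNN-short-description.md", "```"])
print(bare_fences(sample))  # the opening fence on line 2 has no language token
```

Running this over `docs/ADRs/README.md` in CI would catch regressions of the same kind.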
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 950760ff-b061-4538-b32d-e506976ccf85
📒 Files selected for processing (3)
docs/ADRs/0001-cloudflare-crawler.md
docs/ADRs/README.md
docs/ADRs/_template.md
…arser with Cloudflare Browser Rendering API, highlighting the removal of redundant components and the benefits of reduced maintenance and improved crawling capabilities.
Actionable comments posted: 3
🧹 Nitpick comments (2)
docs/ADRs/0001-cloudflare-crawler.md (2)
35-61: Consider adding migration and testing strategy. The ADR comprehensively describes what will change but is missing two important sections:

Migration strategy: What happens to existing crawled content in `content/{site}/parsed/`? Will it be re-crawled using Cloudflare, or will the old content remain? If re-crawling, should it be done incrementally or all at once?

Testing plan: How will you verify that:
- Cloudflare's frontmatter structure is compatible with the vectorizer's requirements (especially `source_url`)
- The downstream RAG pipeline receives proper citation URLs
- All 5 sites can be successfully crawled within the free plan limits (if applicable)

These additions would strengthen the ADR by addressing operational aspects of the transition.
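One concrete check from the testing plan above can be expressed as an assertion over each parsed Markdown file. The sketch below hand-parses a leading YAML frontmatter block with the standard library only; the `source_url` key comes from the review comment, while the sample document is invented:

```python
def frontmatter_keys(markdown: str) -> set[str]:
    """Extract top-level keys from a leading YAML frontmatter block."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()  # no frontmatter block at the start of the file
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter reached
            return keys
        if ":" in line and not line.startswith((" ", "\t")):
            keys.add(line.split(":", 1)[0].strip())
    return set()  # no closing delimiter found; treat as malformed

doc = "---\nsource_url: https://migri.fi/en/home\ntitle: Home\n---\n\n# Home\n"
assert "source_url" in frontmatter_keys(doc)
```

A test like this, looped over every file in `content/{site}/parsed/`, would verify that Cloudflare's output keeps the citation metadata the vectorizer depends on.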
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/ADRs/0001-cloudflare-crawler.md` around lines 35 - 61, Add a short "Migration strategy" and "Testing plan" sections to the ADR: under Migration strategy describe exactly what will happen to existing content in content/{site}/parsed/ (keep as-is vs re-crawl), whether re-crawls will be incremental or full, and provide a safe roll-back plan and filename/versioning strategy for migrated Markdown files (reference current artifacts produced by Parser/BaseCrawler and new outputs from CrawlerRunner). In Testing plan list concrete checks and steps to validate Cloudflare output: verify YAML frontmatter contains source_url and other fields required by the vectorizer and downstream RAG (include exact keys to assert), test that citation URLs flow through the vectorizer->RAG pipeline, and include load tests to confirm all 5 sites crawl within the free plan limits; also mention automated unit/integration tests for CrawlerRunner and CLI (crawl) and env var authentication checks for CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN.
97-101: Add cost and capacity estimates for operational planning. The negative consequences mention costs and free plan limits but don't provide enough detail for decision-making:

Cost estimation: Provide an estimate of monthly crawling costs based on the 5 sites, their typical page counts, and whether `render: true` is needed. This helps stakeholders understand the operational expense.

Free plan viability: The 10-minute daily limit on the free plan needs evaluation. Will it suffice for crawling 5 Finnish government sites (migri, te_palvelut, kela, vero, dvv) at the desired frequency? If not, what paid plan is required?

Migration cost: Consider adding the one-time effort to rewrite the crawler, update configs, test, and validate frontmatter compatibility as a negative consequence.
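To make the free-plan question concrete, a back-of-the-envelope calculation helps. Every input below (page counts, seconds per page, crawl frequency) is a placeholder to be replaced with measured values, so the verdict it prints is illustrative only:

```python
# Hypothetical inputs: replace with real page counts and measured render times.
pages_per_site = {"migri": 800, "te_palvelut": 400, "kela": 1200, "vero": 1500, "dvv": 300}
seconds_per_page = 2.0   # assumed average with render: true
crawls_per_month = 4     # assumed weekly re-crawl

total_pages = sum(pages_per_site.values())
minutes_per_crawl = total_pages * seconds_per_page / 60
minutes_per_month = minutes_per_crawl * crawls_per_month

# The review cites a 10-minute/day free limit; note the daily cap also means a
# single long crawl may not fit in one day even if the monthly total would.
free_plan_minutes_per_month = 10 * 30
print(f"{minutes_per_crawl:.0f} min per crawl, {minutes_per_month:.0f} min per month")
print("fits free plan:", minutes_per_month <= free_plan_minutes_per_month)
```

Even rough numbers like these would let the ADR state plainly whether the free plan suffices or which paid tier is required.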
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/ADRs/0001-cloudflare-crawler.md` around lines 97 - 101, Expand the negative consequences under the "**Cost**" and "**Free plan limits**" bullets: add a monthly cost estimate for crawling the five target sites (migri, te_palvelut, kela, vero, dvv) using both render: true and render: false based on typical page counts and an assumed frequency, state whether the 10-minute/day Free Workers limit can cover the desired crawl schedule (include required paid plan tier and minutes if not), and add a short "Migration cost" bullet quantifying one-time engineering effort to rewrite the crawler, update configs, run tests, and validate frontmatter compatibility (estimate hours or person-days).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/ADRs/0001-cloudflare-crawler.md`:
- Line 64: Update the ADR text that currently says "CLI interface for the `crawl` command (same arguments and flags)" to clarify that the command name and basic structure remain, but the available flags will change to match the new CrawlerConfig model. Remove any mention of flags for removed fields (delay_between_requests, max_concurrent) and state explicitly that new Cloudflare-based parameters replace them, so the `crawl` command flags reflect the updated CrawlerConfig.
- Line 44: The ADR proposes removing the parse CLI command (tapio/cli.py::parse) but doesn't provide migration steps. Update the ADR to recommend a gradual deprecation plan: mark tapio.cli.parse as deprecated (emit warnings), update public docs (README.md and CONTRIBUTING.md) with clear migration guidance and new workflows, modify or add tests (tests/test_cli.py and tests/integration/test_parser_pipeline.py) to cover the deprecation path and the recommended alternative, and document how to migrate site-specific parser settings from tapio/config/site_configs.yaml (parser_config entries for migri/kela/vero) into the new workflow or configuration API, so the existing separation of crawl vs parse is preserved or replaced.
- Line 31: Update the Cloudflare /crawl parameter names to match the actual API: nest the include/exclude pattern settings under an `options` object with camelCase names (`options.includePatterns` and `options.excludePatterns` instead of `include_patterns`/`exclude_patterns`), and rename `max_age` to `maxAge`; keep `metadata.url` as the canonical source. Replace the parameters in the ADR paragraph that currently lists include_patterns, exclude_patterns, and max_age so the documented request shape matches the API.
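Under the corrected naming, a request body would look roughly like the sketch below. This is an illustrative shape assembled from the review comment, not a verified copy of the Cloudflare API schema; the URL, pattern values, and `maxAge` value are placeholders:

```python
import json

# Illustrative /crawl request body using the camelCase names from the review:
# include/exclude patterns nested under "options", and "maxAge" replacing max_age.
crawl_request = {
    "url": "https://migri.fi/en/home",
    "options": {
        "includePatterns": ["https://migri.fi/en/*"],
        "excludePatterns": ["*/print/*"],
    },
    "maxAge": 86400,  # seconds; placeholder value
}
print(json.dumps(crawl_request, indent=2))
```

Documenting the request shape this way in the ADR makes it easy to diff against the real API reference before implementation.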
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 730cf62f-0baf-4bf8-a86c-6e34b68a1e78
📒 Files selected for processing (2)
docs/ADRs/0001-cloudflare-crawler.mddocs/ADRs/README.md
1. **Remove** `BaseCrawler` class (`tapio/crawler/crawler.py`) — the entire custom crawling logic
2. **Remove** `Parser` class (`tapio/parser/parser.py`) — XPath extraction, HTML-to-Markdown conversion, frontmatter generation
3. **Remove** `ParserConfig` and `HtmlToMarkdownConfig` from `tapio/config/config_models.py`
4. **Remove** the `parse` CLI command from `tapio/cli.py`
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Search for references to the parse command in docs, scripts, and workflows
rg -n 'parse' --type=md --type=yaml --type=sh -g '!ADRs/*'
```

Repository: brylie/tapio
Length of output: 2617

🏁 Script executed:

```bash
#!/bin/bash
# Check if there are any GitHub Actions or CI/CD configs that invoke parse
fd -t f -e yml -e yaml .github | xargs rg -n 'parse' 2>/dev/null || echo "No GitHub workflows found"
```

Repository: brylie/tapio
Length of output: 30472
Parse command removal is a breaking change that impacts documented workflows and requires migration guidance.

The parse CLI command is not isolated — it has significant dependencies that removal would break:

- Public documentation — README.md (lines 57, 72) and CONTRIBUTING.md document the parse command with usage examples like `uv run -m tapio.cli parse migri`
- Extensive test coverage — tests/test_cli.py includes 10+ test cases specifically for parse command scenarios (single site, no site specified, custom config, error handling, etc.)
- Integration tests — tests/integration/test_parser_pipeline.py validates the parser pipeline as a standalone workflow
- Site configuration — tapio/config/site_configs.yaml defines `parser_config` for multiple sites (migri, kela, vero), indicating parser-specific tuning is expected
- Separation of concerns — the architecture supports independent parse steps (parsing previously-crawled HTML separately from the crawl step), which users may depend on

The ADR should address migration strategy: before removing the parse command, update documentation, deprecate gradually, and provide migration guidance for workflows that rely on separated crawl and parse steps or site-specific parser tuning.
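A gradual deprecation could start by keeping the command callable while emitting a warning. The sketch below uses a plain function rather than the project's actual CLI wiring, which is not shown here; the commented-out `run_parse` call is a hypothetical delegate into the existing parser pipeline:

```python
import warnings

def parse_command(site: str) -> None:
    """Deprecated entry point: warn, then delegate to the legacy parser path."""
    warnings.warn(
        "The 'parse' command is deprecated; crawling via the Cloudflare "
        "Browser Rendering API now returns Markdown directly. "
        "See docs/ADRs/0001-cloudflare-crawler.md for migration guidance.",
        DeprecationWarning,
        stacklevel=2,
    )
    # run_parse(site)  # hypothetical call into the existing parser pipeline

# Demonstrate that callers still work but now see the deprecation warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    parse_command("migri")
print(caught[0].category.__name__)  # DeprecationWarning
```

Emitting the warning for a release cycle before removal gives users of the separated crawl/parse workflow time to migrate.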
…ations for Cloudflare integration
…ntegration, detailing updated flags and parameters.
Introduce Architecture Decision Records (ADRs) to document the decision to replace the bespoke crawler with the Cloudflare Browser Rendering API. This change aims to streamline the crawling process and reduce maintenance overhead while enhancing capabilities. The initial ADR outlines the context, decision, and consequences of this architectural shift. Additionally, a template for future ADRs is included to facilitate consistent documentation.