---
layout: default
title: "Chapter 3: Advanced Data Extraction"
parent: Firecrawl Tutorial
nav_order: 3
---
Welcome to Chapter 3: Advanced Data Extraction. In this part of Firecrawl Tutorial: Building LLM-Ready Web Scraping and Data Extraction Systems, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
In Chapter 2, you learned how to scrape pages and get back raw content in Markdown, HTML, or JSON. But raw content is only the starting point. Most AI applications need structured data -- typed fields like titles, prices, dates, and authors extracted from messy web pages into clean, predictable objects.
Firecrawl's extraction capabilities let you define schemas that describe what you want, and the engine returns exactly that. This chapter covers schema-driven extraction, CSS selector rules, LLM-powered extraction, validation, and building reusable extraction pipelines.
| Skill | Description |
|---|---|
| Schema-driven extraction | Define typed schemas and let Firecrawl extract matching data |
| CSS selector rules | Target specific DOM elements with precision |
| LLM-powered extraction | Use natural language prompts to extract complex data |
| Data validation | Ensure extracted fields meet type and completeness constraints |
| Metadata enrichment | Augment results with source URLs, timestamps, and language |
| Reusable pipelines | Build extraction templates for different page types |
When you use Firecrawl's extract endpoint, data flows through a multi-stage pipeline.
```mermaid
flowchart TD
    A[Target URL] --> B[Fetch & Render Page]
    B --> C[Parse DOM]
    C --> D{Extraction Method}
    D -- "Schema + Selectors" --> E[CSS Selector Matching]
    D -- "Schema + LLM" --> F[LLM-Powered Extraction]
    E --> G[Field Extraction]
    F --> G
    G --> H[Type Validation]
    H --> I{Valid?}
    I -- Yes --> J[Enriched Output]
    I -- No --> K[Error / Fallback]
    J --> L[Structured JSON]

    classDef input fill:#e1f5fe,stroke:#01579b
    classDef process fill:#f3e5f5,stroke:#4a148c
    classDef output fill:#e8f5e8,stroke:#1b5e20
    classDef error fill:#ffebee,stroke:#b71c1c
    class A input
    class B,C,D,E,F,G,H process
    class J,L output
    class K error
```
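To make the stages concrete, here is a minimal sketch of that pipeline as plain Python functions. Every function body is an illustrative stub, not Firecrawl's internals; the names and the sample HTML are assumptions for demonstration only.

```python
# Stand-in implementations of the pipeline stages in the diagram above.
# None of this is Firecrawl code -- it only illustrates the data flow.

def fetch_and_render(url: str) -> str:
    """Stages 1-2: fetch the page and return rendered HTML (stubbed)."""
    return "<html><body><h1>AI Trends</h1><p>Body text</p></body></html>"

def extract_fields(html: str, schema: dict) -> dict:
    """Stages 3-5: pull schema fields out of the parsed content (stubbed).

    A real engine would match CSS selectors or call an LLM here."""
    return {"title": "AI Trends", "body": "Body text"}

def validate(data: dict, schema: dict) -> bool:
    """Stage 6: check that required fields are present and non-empty."""
    return all(data.get(f) for f in schema.get("required", []))

schema = {"required": ["title", "body"]}
html = fetch_and_render("https://example.com/post")
data = extract_fields(html, schema)
print("valid" if validate(data, schema) else "invalid")  # valid
```

The point is the shape of the flow: each stage hands a well-defined value to the next, so failures can be isolated to a single stage.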
The most powerful extraction approach is defining a schema that describes the data you expect. Firecrawl uses this schema to locate and extract matching content from the page.
```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_KEY")

# Define a schema for blog articles
article_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "published_date": {"type": "string"},
        "summary": {"type": "string"},
        "body": {"type": "string"},
        "tags": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["title", "body"]
}

# Extract structured data from a blog post
result = app.scrape_url(
    "https://example.com/blog/ai-trends-2025",
    params={
        "formats": ["extract"],
        "extract": {
            "schema": article_schema
        }
    }
)

article = result["extract"]
print(f"Title: {article['title']}")
print(f"Author: {article.get('author', 'Unknown')}")
print(f"Tags: {', '.join(article.get('tags', []))}")
print(f"Body preview: {article['body'][:300]}...")
```

```javascript
import FirecrawlApp from "@mendable/firecrawl-js";

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

const articleSchema = {
  type: "object",
  properties: {
    title: { type: "string" },
    author: { type: "string" },
    published_date: { type: "string" },
    summary: { type: "string" },
    body: { type: "string" },
    tags: { type: "array", items: { type: "string" } },
  },
  required: ["title", "body"],
};

const result = await app.scrapeUrl("https://example.com/blog/ai-trends-2025", {
  formats: ["extract"],
  extract: { schema: articleSchema },
});

console.log("Title:", result.extract?.title);
console.log("Author:", result.extract?.author);
```

```bash
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H "Authorization: Bearer $FIRECRAWL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog/ai-trends-2025",
    "formats": ["extract"],
    "extract": {
      "schema": {
        "type": "object",
        "properties": {
          "title": {"type": "string"},
          "author": {"type": "string"},
          "body": {"type": "string"},
          "tags": {"type": "array", "items": {"type": "string"}}
        },
        "required": ["title", "body"]
      }
    }
  }'
```

Firecrawl schemas follow JSON Schema conventions. Here are the types you can use.
| Type | Example Value | Use Case |
|---|---|---|
| `string` | `"Hello World"` | Titles, names, descriptions, body text |
| `number` | `42.99` | Prices, ratings, counts |
| `integer` | `2025` | Years, quantities, IDs |
| `boolean` | `true` | Availability flags, feature toggles |
| `array` | `["tag1", "tag2"]` | Lists of tags, categories, authors |
| `object` | `{"name": "...", "url": "..."}` | Nested structures (author details, etc.) |
For complex pages, you can nest objects to capture hierarchical data.
```python
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "in_stock": {"type": "boolean"},
        "rating": {
            "type": "object",
            "properties": {
                "score": {"type": "number"},
                "count": {"type": "integer"}
            }
        },
        "specifications": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "label": {"type": "string"},
                    "value": {"type": "string"}
                }
            }
        }
    }
}

result = app.scrape_url(
    "https://example.com/products/widget-pro",
    params={
        "formats": ["extract"],
        "extract": {"schema": product_schema}
    }
)

product = result["extract"]
print(f"{product['name']}: ${product['price']} {product['currency']}")
print(f"Rating: {product['rating']['score']}/5 ({product['rating']['count']} reviews)")
for spec in product.get("specifications", []):
    print(f"  {spec['label']}: {spec['value']}")
```

Sometimes CSS selectors are too rigid and schemas alone do not provide enough context. Firecrawl supports prompt-based extraction, where you describe what you want in natural language and the LLM interprets the page content to fill your schema.
```python
# Use a natural language prompt to guide extraction
result = app.scrape_url(
    "https://example.com/company/about",
    params={
        "formats": ["extract"],
        "extract": {
            "prompt": "Extract the company name, founding year, number of employees, "
                      "headquarters location, and a one-sentence mission statement.",
            "schema": {
                "type": "object",
                "properties": {
                    "company_name": {"type": "string"},
                    "founded_year": {"type": "integer"},
                    "employee_count": {"type": "integer"},
                    "headquarters": {"type": "string"},
                    "mission": {"type": "string"}
                }
            }
        }
    }
)

company = result["extract"]
print(f"{company['company_name']} (est. {company['founded_year']})")
print(f"HQ: {company['headquarters']}")
print(f"Employees: {company['employee_count']}")
print(f"Mission: {company['mission']}")
```

Which approach should you use? The decision comes down to page structure and data complexity.

```mermaid
flowchart TD
    A[Extraction Task] --> B{Page Structure}
    B -- "Consistent HTML structure" --> C[Schema Only]
    B -- "Variable layout" --> D{Data Complexity}
    D -- "Simple fields" --> E[Schema + Selectors]
    D -- "Complex / contextual" --> F[Schema + Prompt]
    C --> G[Fast & Cheap]
    E --> H[Fast & Reliable]
    F --> I[Flexible but Slower]

    classDef fast fill:#e8f5e8,stroke:#1b5e20
    classDef mid fill:#fff3e0,stroke:#e65100
    classDef flex fill:#e1f5fe,stroke:#01579b
    class G fast
    class H mid
    class I flex
```
| Approach | Speed | Cost | Flexibility | Best For |
|---|---|---|---|---|
| Schema only | Fast | Low | Moderate | Uniform page structures (e-commerce, docs) |
| Schema + selectors | Fast | Low | Low | Known, stable DOM layouts |
| Schema + prompt | Slower | Higher | High | Variable layouts, contextual extraction |
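The decision logic above is simple enough to encode directly. Here is a hedged heuristic (a hypothetical helper, not a Firecrawl API) that answers the two questions from the decision diagram in order:

```python
# Heuristic chooser for an extraction approach, mirroring the decision
# diagram: stable DOM -> schema only; otherwise complexity decides.
def choose_approach(stable_dom: bool, complex_fields: bool) -> str:
    if stable_dom:
        return "schema"            # fast and cheap on uniform structures
    if complex_fields:
        return "schema+prompt"     # flexible but slower and costlier
    return "schema+selectors"      # fast and reliable for simple fields

print(choose_approach(stable_dom=True, complex_fields=False))   # schema
print(choose_approach(stable_dom=False, complex_fields=True))   # schema+prompt
print(choose_approach(stable_dom=False, complex_fields=False))  # schema+selectors
```

In practice you might start with "schema only" for every page type and escalate to selectors or prompts only for the page types where extraction quality is poor, since that keeps cost and latency at the minimum that works.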
Real-world scraping often involves different page types on the same site. Create a registry of schemas and select the right one based on URL patterns.
```python
# Schema registry for different page types
SCHEMAS = {
    "article": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "author": {"type": "string"},
            "date": {"type": "string"},
            "body": {"type": "string"},
            "tags": {"type": "array", "items": {"type": "string"}}
        }
    },
    "product": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"},
            "description": {"type": "string"},
            "in_stock": {"type": "boolean"}
        }
    },
    "profile": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "title": {"type": "string"},
            "bio": {"type": "string"},
            "social_links": {"type": "array", "items": {"type": "string"}}
        }
    }
}

def detect_page_type(url: str) -> str:
    """Select schema based on URL pattern."""
    if "/blog/" in url or "/posts/" in url:
        return "article"
    elif "/products/" in url or "/shop/" in url:
        return "product"
    elif "/team/" in url or "/people/" in url:
        return "profile"
    return "article"  # default

def extract_structured(app, url: str):
    """Extract data using the appropriate schema for the URL."""
    page_type = detect_page_type(url)
    schema = SCHEMAS[page_type]
    result = app.scrape_url(
        url,
        params={
            "formats": ["extract", "markdown"],
            "extract": {"schema": schema}
        }
    )
    return {
        "type": page_type,
        "url": url,
        "data": result["extract"],
        "markdown": result.get("markdown", ""),
    }

# Extract from different page types
urls = [
    "https://example.com/blog/ai-news",
    "https://example.com/products/widget",
    "https://example.com/team/jane-doe",
]

for url in urls:
    extracted = extract_structured(app, url)
    print(f"[{extracted['type']}] {url}")
    print(f"  Data: {extracted['data']}")
    print()
```

Extracted data is not always perfect. Build validation logic to catch missing fields, wrong types, and low-quality results.
```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ValidationResult:
    is_valid: bool
    errors: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)

def validate_article(data: dict) -> ValidationResult:
    """Validate extracted article data."""
    errors = []
    warnings = []

    # Required fields
    if not data.get("title"):
        errors.append("Missing required field: title")
    if not data.get("body"):
        errors.append("Missing required field: body")

    # Quality checks
    if data.get("body") and len(data["body"]) < 100:
        warnings.append(f"Body is very short ({len(data['body'])} chars)")
    if data.get("title") and len(data["title"]) > 200:
        warnings.append("Title is unusually long")

    # Type checks
    if data.get("tags") and not isinstance(data["tags"], list):
        errors.append("Tags must be a list")

    return ValidationResult(
        is_valid=len(errors) == 0,
        errors=errors,
        warnings=warnings,
    )

# Validate extracted data
article_data = result["extract"]
validation = validate_article(article_data)

if validation.is_valid:
    print("Extraction valid")
    for w in validation.warnings:
        print(f"  Warning: {w}")
else:
    print("Extraction failed validation:")
    for e in validation.errors:
        print(f"  Error: {e}")
```

Add context to every extraction result so you can trace data back to its source and understand when it was collected.
```python
from datetime import datetime, timezone

def enrich_with_metadata(extracted_data: dict, url: str) -> dict:
    """Add metadata to extraction results."""
    return {
        **extracted_data,
        "_metadata": {
            "source_url": url,
            "fetched_at": datetime.now(timezone.utc).isoformat(),
            "extractor_version": "1.0.0",
            "schema_type": detect_page_type(url),
        }
    }

enriched = enrich_with_metadata(result["extract"], "https://example.com/blog/post")
print(f"Source: {enriched['_metadata']['source_url']}")
print(f"Fetched: {enriched['_metadata']['fetched_at']}")
```

Combine schema selection, extraction, validation, and enrichment into a single pipeline class.
```python
import json
from pathlib import Path
from firecrawl import FirecrawlApp

class ExtractionPipeline:
    """Reusable pipeline for structured data extraction."""

    def __init__(self, app: FirecrawlApp, schemas: dict):
        self.app = app
        self.schemas = schemas

    def extract(self, url: str, page_type: str = None) -> dict:
        """Extract, validate, and enrich data from a URL."""
        if page_type is None:
            page_type = detect_page_type(url)
        schema = self.schemas.get(page_type)
        if not schema:
            raise ValueError(f"No schema found for page type: {page_type}")

        # Extract
        result = self.app.scrape_url(
            url,
            params={
                "formats": ["extract", "markdown"],
                "extract": {"schema": schema},
                "onlyMainContent": True,
            }
        )

        # Enrich
        data = enrich_with_metadata(result["extract"], url)
        return data

    def extract_batch(self, urls: list, output_dir: str = "./extracted"):
        """Extract data from multiple URLs and save results."""
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)

        results = []
        for url in urls:
            try:
                data = self.extract(url)
                results.append({"url": url, "status": "success", "data": data})
            except Exception as exc:
                results.append({"url": url, "status": "error", "error": str(exc)})

        # Save results
        with open(output_path / "extractions.json", "w") as f:
            json.dump(results, f, indent=2, default=str)

        success = sum(1 for r in results if r["status"] == "success")
        print(f"Extracted {success}/{len(urls)} URLs successfully")
        return results

# Usage
pipeline = ExtractionPipeline(app, SCHEMAS)
results = pipeline.extract_batch([
    "https://example.com/blog/post-1",
    "https://example.com/products/widget",
    "https://example.com/team/ceo",
])
```

| Problem | Possible Cause | Solution |
|---|---|---|
| Missing fields in output | Page layout does not match schema | Inspect the page HTML and adjust selectors or add a prompt |
| Wrong data types | Schema mismatch | Ensure schema types match the actual content |
| Empty extraction result | Page requires JavaScript rendering | Add a `waitFor` parameter (see Chapter 4) |
| Garbled or encoded text | Character encoding issue | Force UTF-8 decoding in params |
| Repeated content in arrays | Selector too broad | Narrow the CSS selector or add deduplication logic |
| Slow extraction | LLM prompt extraction on large pages | Reduce page size with `onlyMainContent: True` |
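For the "repeated content in arrays" row, the deduplication side of the fix is easy to build yourself. The helper below is an illustrative sketch (not part of Firecrawl) that removes duplicates from an extracted list while preserving order, treating strings case-insensitively:

```python
# Order-preserving deduplication for extracted list fields (e.g. tags)
# that a too-broad selector filled with repeats.
def dedupe(items: list) -> list:
    seen = set()
    out = []
    for item in items:
        # Normalize strings so "AI", "ai", and "AI " collapse to one entry
        key = item.strip().lower() if isinstance(item, str) else item
        if key not in seen:
            seen.add(key)
            out.append(item)
    return out

tags = ["AI", "ai", "Python", "AI ", "Python"]
print(dedupe(tags))  # ['AI', 'Python']
```

Narrowing the CSS selector is still the better first move; deduplication is a safety net for data that slips through.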
- Reuse schemas -- Define schemas once and reference them across extractions.
- Combine formats -- Request `extract` and `markdown` in one call to avoid double-fetching.
- Batch similar pages -- Group URLs by page type so you can use the same schema and parameters.
- Cache results -- Store extracted JSON so re-runs can skip already-processed URLs.
- Limit page content -- Use `onlyMainContent: True` to reduce the input size for LLM-based extraction.
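A minimal version of the "cache results" practice looks like this. The file layout, hash-based naming, and `extract_cached` helper are all assumptions for the sketch, not Firecrawl features; in real use you would point `CACHE_DIR` at a stable directory instead of a temp dir.

```python
# Skip re-extraction for URLs whose JSON result is already on disk.
import hashlib
import json
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())  # in practice: a stable dir like ./extract_cache

def cache_path(url: str) -> Path:
    """Derive a filesystem-safe cache filename from the URL."""
    return CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest()[:16] + ".json")

def extract_cached(url: str, extract_fn) -> dict:
    """Return the cached extraction if present, else run extract_fn and store it."""
    path = cache_path(url)
    if path.exists():
        return json.loads(path.read_text())
    data = extract_fn(url)
    path.write_text(json.dumps(data))
    return data

calls = []
def fake_extract(url):
    calls.append(url)  # stand-in for a real app.scrape_url call
    return {"title": "cached demo"}

first = extract_cached("https://example.com/a", fake_extract)
second = extract_cached("https://example.com/a", fake_extract)  # served from disk
print(len(calls))  # 1 -- the second call never hit the "network"
```

Hashing the URL sidesteps filesystem-unsafe characters, and storing plain JSON keeps cached results easy to inspect and delete.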
- Sanitize extracted HTML -- Never render user-sourced HTML without sanitization.
- Redact PII -- If scraping user-generated content, filter out personal data before storing.
- Log extraction failures -- Track which pages fail so you can audit data quality.
- Avoid executing scripts -- Always rely on Firecrawl's sandboxed rendering rather than running page scripts locally.
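As one concrete take on the "redact PII" point, here is a small sketch that masks email addresses and US-style phone numbers before text is stored. The regex patterns are illustrative and deliberately simple, not exhaustive PII detection.

```python
# Mask common PII patterns in scraped user-generated text before storage.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")  # US-style numbers only

def redact(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(redact("Contact jane@example.com or 555-123-4567 for details."))
# Contact [EMAIL] or [PHONE] for details.
```

For production use, a dedicated PII-detection library or service is safer than hand-rolled regexes, but even this level of filtering is better than storing raw contact details.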
Structured extraction turns raw web pages into clean, typed data objects that your applications can work with directly. By defining schemas, using LLM-powered prompts for flexible extraction, validating results, and enriching outputs with metadata, you build reliable data pipelines that feed AI applications with high-quality inputs.
- Schema-driven extraction uses JSON Schema to define expected fields and types, giving you predictable structured output.
- LLM-powered prompts add flexibility for pages with variable layouts -- use them when pure schemas are insufficient.
- Validation is essential -- always check that required fields are present and values are reasonable before downstream use.
- Metadata enrichment (source URL, timestamp, schema type) makes data traceable and auditable.
- Reusable pipelines with schema registries let you scale extraction across many page types with minimal code changes.
Now that you can extract structured data, the next challenge is that many websites hide their content behind JavaScript rendering. In Chapter 4: JavaScript & Dynamic Content, you will learn how to handle SPAs, infinite scroll, and Ajax-loaded data to ensure Firecrawl captures every piece of content.
Built with insights from the Firecrawl project.
Most teams struggle here because the hard part is not writing more code, but drawing clear boundaries between extraction, validation, and schema selection so behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without a clear rollback or observability strategy
After working through this chapter, you should be able to reason about Chapter 3: Advanced Data Extraction as an operating subsystem inside Firecrawl Tutorial: Building LLM-Ready Web Scraping and Data Extraction Systems, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around `result`, `title`, and `body` as your checklist when adapting these patterns to your own repository.
Under the hood, Chapter 3: Advanced Data Extraction usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for extraction.
- Input normalization: shape incoming data so each stage receives stable contracts.
- Core execution: run the main logic branch and propagate intermediate state through the schema.
- Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
Use the following upstream sources to verify implementation details while reading this chapter:
- View Repo (github.com) -- the Firecrawl repository, the authoritative reference for the implementation details discussed in this chapter.
Suggested trace strategy:
- search upstream code for `extract` and `scrape_url` to map concrete implementation paths
- compare docs claims against actual runtime/config code before reusing patterns in production