A minimal, scalable, and intelligent web scraping framework powered by AI agents
Miss Scraper is a minimal yet scalable implementation of intelligent web scraping, designed as a lightweight alternative to browser-use while maintaining enterprise-grade capabilities. Our key features include:
- Minimal & Scalable: Unlike browser-use's complex architecture, we provide a streamlined ~237 lines of core browser automation code
- Maximum Stealth: Powered by ZenBrowser with advanced anti-detection capabilities that bypass sophisticated bot detection systems
- Pythonic Plug-and-Play Agentic Architecture: Built with Agno, featuring an easy-to-customize, plug-and-play pattern: rapidly integrate RAG, chat storage, and new tools with minimal code
- Clean Markdown Extraction: Powered by crawl4ai for LLM-optimized web content processing
- Efficient Context Usage: Smart separation of browser state and page content for minimal token consumption
- Easy MCP Customization: Seamlessly add new tools, modify existing ones, and compose complex automation workflows
- Dynamic Data Extraction: LLM-generated Pydantic schemas ensure uniform, validated JSONL output
Miss Scraper leverages ZenBrowser (zendriver) for unparalleled stealth in web automation:
- Browser Fingerprint Spoofing: Randomized user agents, screen resolutions, and device characteristics
- Network Behavior Mimicking: Human-like request timing and pattern simulation
- Advanced Evasion: Bypasses Cloudflare, DataDome, and other sophisticated bot detection systems
- Device Emulation: Mobile and desktop device simulation with accurate viewport and touch events
- Timing Randomization: Human-like delays and interaction patterns
- Smart Network Waiting: Waits for all network requests to complete, ensuring the DOM is fully loaded
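The timing-randomization idea can be sketched in a few lines. The `human_delay` helper below is purely illustrative and not part of the toolkit; it draws a delay from a normal distribution and clamps it to a small minimum:

```python
import asyncio
import random


async def human_delay(base: float = 1.0, jitter: float = 0.6) -> float:
    """Sleep for a randomized, human-like interval and return its length."""
    # Gaussian jitter around `base`, clamped so we never sleep less than 50 ms
    delay = max(0.05, random.gauss(base, jitter / 2))
    await asyncio.sleep(delay)
    return delay
```

Calling it between browser actions (clicks, scrolls, keystrokes) makes interaction timing far less machine-like than fixed sleeps.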
Stealth Test
```python
import asyncio

import zendriver as zd


async def test_webdriver():
    browser = await zd.start(no_sandbox=True)
    tab: zd.Tab = await browser.get("https://www.browserscan.net/bot-detection")
    await tab.save_screenshot("browserscan.png")
    await tab.close()


if __name__ == "__main__":
    asyncio.run(test_webdriver())
```

Adding new browser tools is straightforward:
```python
@mcp.tool
async def browser_custom_action(param: str, ctx: Context) -> dict:
    """Your custom browser automation"""
    tab = await browser_pool.get_tab(ctx.session_id)
    # Your custom logic here
    return await get_llm_browser_state(tab, interactive_dom_map)
```

RAG Integration Example:
```python
from agno.knowledge import Embeddings
from agno.vectordb import PgVector

browser_agent = Agent(
    name="RAG-Enhanced Browser Agent",
    knowledge_base=Embeddings(
        vector_db=PgVector(table_name="web_knowledge"),
        embedder=OpenAIEmbedder()
    ),
    # ... other configurations
)
```

Storage Backend Options:
- SQLite: Default, file-based storage
- PostgreSQL: Production-grade relational storage
- Redis: High-performance in-memory storage
- Custom: Implement your own storage backend
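As a rough illustration of the "Custom" option, a minimal session store could look like the sketch below. This is not Agno's actual storage interface; `CustomSessionStore` and its methods are illustrative names only, built on the standard-library `sqlite3` module:

```python
import json
import sqlite3


class CustomSessionStore:
    """Illustrative sketch of a custom storage backend for chat sessions.

    A real backend would implement the framework's storage interface;
    this only shows the persist/load shape of the idea.
    """

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, data TEXT)"
        )

    def save(self, session_id: str, messages: list) -> None:
        # Upsert the full message history as a JSON blob
        self.conn.execute(
            "INSERT OR REPLACE INTO sessions VALUES (?, ?)",
            (session_id, json.dumps(messages)),
        )
        self.conn.commit()

    def load(self, session_id: str) -> list:
        row = self.conn.execute(
            "SELECT data FROM sessions WHERE id = ?", (session_id,)
        ).fetchone()
        return json.loads(row[0]) if row else []
```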
AI Model Flexibility:
```python
# Swap AI models easily
from agno.models.openai import OpenAI
from agno.models.anthropic import Claude
from agno.models.google.gemini import Gemini

agent = Agent(
    model=OpenAI(id="gpt-4"),  # or Claude() or Gemini()
    # ... other configurations
)
```

```bash
# Clone the repository
git clone https://github.com/eryawww/miss-scraper.git
cd miss-scraper

# Install dependencies
source install.sh
```

Required API Keys:
Rename .env.example to .env and add your API keys:
```bash
# Google AI Studio API Key (Required)
# Get from: https://aistudio.google.com
GOOGLE_API_KEY=your_google_api_key_here

# Agno API Key (Optional - for Playground/Dashboard)
# Get from: https://docs.agno.com/introduction/playground
AGNO_API_KEY=your_agno_api_key_here

# Browser Configuration
BROWSER_PAGE_LOAD_WAIT=2
MCP_ENDPOINT=http://localhost:8000
TOOLCALL_TIMEOUT_SECONDS=30
```

Miss Scraper offers two ways to interact with the browser automation system:
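For reference, the browser settings above can be read with sensible fallbacks using only the standard library. The `browser_settings` helper is a sketch (not project code); the variable names and defaults match the `.env` example:

```python
import os


def browser_settings() -> dict:
    """Read browser configuration from the environment, with defaults."""
    return {
        "page_load_wait": float(os.getenv("BROWSER_PAGE_LOAD_WAIT", "2")),
        "mcp_endpoint": os.getenv("MCP_ENDPOINT", "http://localhost:8000"),
        "toolcall_timeout": int(os.getenv("TOOLCALL_TIMEOUT_SECONDS", "30")),
    }
```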
```bash
source ./scripts/launch_mcp.sh
source ./scripts/launch_agent.sh
```

Endpoint: `POST /api/v1/chat`
Request:
```json
{
  "text": "Extract product details from amazon.com/product/xyz",
  "session_id": "optional-session-id"
}
```

Response:
```json
{
  "text": "I've extracted the product details for you...",
  "results": {
    "0": {
      "name": "Product Name",
      "price": 29.99,
      "availability": "In Stock"
    }
  },
  "session_id": "uuid-session-id"
}
```

```bash
source ./scripts/launch_playground.sh
```

The Agno Playground provides a user-friendly chatbot dashboard where you can directly interact with the browser agent, test extraction schemas, and debug automation workflows.
Our browser automation toolkit provides 7 essential tools through the MCP interface:
- `browser_navigate` - Navigate to any URL with intelligent page loading
- `browser_go_back` - Browser history navigation
- `browser_scroll` - Smooth scrolling in both directions
- `browser_click` - Click elements by interactive index
- `browser_type_keyboard` - Type text with automatic form submission
- `browser_get_page_source` - Extract clean markdown using crawl4ai for optimal LLM processing
- `browser_extract_content` - AI-powered schema-based data extraction with dynamic Pydantic validation
LLM-Generated Pydantic Schemas:
```python
# Define extraction schema
schema = {
    'product_name': FieldDef(type='string', required=True),
    'price': FieldDef(type='number', required=True),
    'rating': FieldDef(type='number', required=False)
}

# AI automatically creates Pydantic models and validates output
result = await browser_extract_content(schema)
```

Guaranteed JSON Structure:
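Conceptually, schema-driven validation boils down to type coercion plus required-field checks. The stand-alone sketch below illustrates that idea; this simplified `FieldDef` is a stand-in for the project's real schema types, not its actual implementation:

```python
from dataclasses import dataclass


@dataclass
class FieldDef:
    type: str           # 'string', 'number', or 'boolean'
    required: bool = True


# Note: naive coercion for illustration; bool() treats any
# non-empty string as True, so real code would parse booleans properly.
_COERCERS = {"string": str, "number": float, "boolean": bool}


def validate_record(schema: dict, record: dict) -> dict:
    """Coerce a raw record to the schema's types, enforcing required fields."""
    out = {}
    for name, field in schema.items():
        if name not in record or record[name] is None:
            if field.required:
                raise ValueError(f"missing required field: {name}")
            out[name] = None
            continue
        out[name] = _COERCERS[field.type](record[name])
    return out
```

Applying the same validator to every extracted record is what yields a uniform JSONL output format.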
- Type Validation: Automatic string/number/boolean conversion
- Required Fields: Ensures critical data is present
- Uniform Format: Consistent structure across all extractions
crawl4ai-Powered Content Processing:
```python
# Get clean, LLM-optimized markdown from any page
clean_content = await browser_get_page_source()
# Returns clean markdown without ads, navigation, or clutter
```

LLM-Optimized Features:
- LLM-Native Format: Easy-to-understand content hierarchy
- Token Efficiency: Reduced token usage for LLM processing
- Link Preservation: Maintains important links and references
- 99.9% Completeness: Extracts content without leaving information behind
Smart Browser State Management:
```python
# Optimized browser state - only essential information
browser_state = {
    "url": "https://example.com",
    "interactive_elements": [
        {"index": 0, "tag": "button", "content": "Login"},
        {"index": 1, "tag": "input", "content": "Search..."}
    ],
    "total_interactive": 15
}

# Separate page content call - only when extracting
page_content = await browser_get_page_source()
```

Miss Scraper follows a clean, modular architecture designed for scalability and maintainability:
```
miss_scraper/
├── agents/                      # AI Agent Modules
│   ├── repository.py            # Agent definitions and factories
│   ├── serve.py                 # FastAPI agent server
│   ├── playground.py            # Agno playground integration
│   └── static/                  # System prompts and configurations
│       ├── browser_system_prompt.md
│       └── extractor_system_prompt.md
│
└── mcp/                         # Model Context Protocol
    ├── serve.py                 # MCP server implementation
    └── tools/                   # Tool implementations
        └── browser/             # Browser automation tools
            ├── mcp.py           # Core MCP tool definitions
            ├── utils.py         # Browser utilities and helpers
            ├── pool.py          # Browser pool management
            └── schema.py        # Data schemas and validation
```
- Agno-Powered: Leverages Agno's pythonic agent framework
- Dual Agents: Browser navigation agent + Content extraction agent to ensure effective context utilization
- Persistent Storage: SQLite-based conversation memory
- Default Model: Google's Gemini 2.5 Flash, chosen for its strong performance-to-cost ratio
- Tool Composition: Modular browser automation tools
- Session Management: Isolated browser contexts per user session
- Efficient Context Usage: Smart separation of browser state and page content for minimal token consumption
- Clean Content Extraction: crawl4ai-powered markdown conversion for optimal LLM processing
- Dynamic Schema Generation: LLM-powered Pydantic model creation for structured data extraction
- Uniform JSON Output: Guarantees consistent, validated data format across all extractions
- Network Optimization: Intelligent page load detection and stability assurance (Waits for all network requests to complete before proceeding)
- ZenDriver Integration: Maximum stealth browser automation with advanced anti-detection
- Fingerprint Spoofing: Automatic browser fingerprint randomization and device emulation
- Pool Management: Scalable browser instance management with session isolation
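The pool-with-session-isolation pattern can be sketched as below. This is an illustrative asyncio sketch, not the project's actual `pool.py`: each session id maps to its own tab, and a lock guards concurrent creation (the real pool would launch zendriver tabs where the placeholder string is created):

```python
import asyncio


class BrowserPool:
    """Illustrative sketch: one isolated browser tab per session id."""

    def __init__(self):
        self._tabs: dict[str, object] = {}
        self._lock = asyncio.Lock()

    async def get_tab(self, session_id: str):
        """Return the session's tab, creating one on first use."""
        async with self._lock:
            if session_id not in self._tabs:
                # Placeholder; the real pool would open a zendriver tab here.
                self._tabs[session_id] = f"tab-for-{session_id}"
            return self._tabs[session_id]

    async def close(self, session_id: str) -> None:
        """Release the session's tab, if any."""
        async with self._lock:
            self._tabs.pop(session_id, None)
```

Keying tabs by session id is what keeps concurrent users' cookies, history, and page state isolated from one another.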
```mermaid
graph TD
    A[User Query] --> B[Browser Agent]
    B --> C[MCP Tools]
    C --> D[Browser Pool]
    D --> E[Page Interaction]
    E --> F[Content Extraction Agent]
    F --> G[Structured Data]
```
Miss Scraper reuses some code, particularly JavaScript, from the browser-use project, including interactive element detection, DOM manipulation scripts, and page state extraction utilities. We've built upon this foundation with a minimalist architecture (~237 lines vs 1000+), Agno-powered agentic design, ZenBrowser stealth capabilities, and optimized LLM context usage.
