Breaking Changes
- New Python API: `Fetcher` class with async context manager and streaming events
- src/ layout: PEP 517/518 compliant package structure
- Pydantic models: Configuration via `DocpullConfig` instead of dictionaries
- Removed v1.x modules: All deprecated code removed
New Features
- Streaming Event API: `AsyncIterator[FetchEvent]` for real-time progress
- Pipeline Architecture: Composable steps (Validate, Fetch, Convert, Dedup, Save)
- CacheManager: O(1) lookups with batched writes and TTL eviction
- StreamingDeduplicator: Real-time content deduplication via SHA-256
- JavaScript Rendering: Browser-based fetching via Playwright
- Profile Presets: RAG, MIRROR, QUICK for common use cases
- Rate Limiting: Per-host concurrent request limits
- Security: robots.txt respect and URL validation
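
To illustrate the idea behind SHA-256 content deduplication from the list above, here is a minimal standalone sketch. It is not docpull's `StreamingDeduplicator` implementation or API; the function name `dedup_stream` and the `(url, content)` tuple shape are assumptions made for this example only.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_stream(pages: Iterable[tuple[str, str]]) -> Iterator[tuple[str, str]]:
    """Yield (url, content) pairs, skipping any content already seen.

    Hypothetical helper for illustration; not part of docpull.
    """
    seen: set[str] = set()  # hex digests of content yielded so far
    for url, content in pages:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # identical body was already emitted under another URL
        seen.add(digest)
        yield url, content
```

Because the function is a generator, duplicates are dropped as pages arrive rather than after the whole crawl finishes, and only the digests are kept in memory, not the page bodies.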
Quick Start
```bash
# CLI
docpull https://docs.example.com --profile rag
```

```python
# Python API
import asyncio

from docpull import Fetcher, DocpullConfig, ProfileName, EventType

async def main() -> None:
    config = DocpullConfig(url="https://docs.example.com", profile=ProfileName.RAG)
    async with Fetcher(config) as f:
        async for event in f.run():  # events are streamed as the fetch progresses
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}")

asyncio.run(main())
```
Full Changelog
See CHANGELOG.md