v2.0.0 - Complete Architecture Rewrite

@zacharyr0th released this 29 Nov 23:26 · 5 commits to main since this release

Breaking Changes

  • New Python API: Fetcher class with async context manager and streaming events
  • src/ layout: PEP 517/518 compliant package structure
  • Pydantic models: Configuration via DocpullConfig instead of dictionaries
  • Removed v1.x modules: All deprecated code removed
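For readers new to the pattern named in the first bullet, here is a minimal, standard-library-only sketch of an async context manager that exposes its progress as a stream of events. `DemoFetcher`, `DemoEvent`, and `collect` are hypothetical names invented for illustration; they are not docpull's classes and say nothing about how `Fetcher` is implemented internally.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class DemoEvent:
    # Hypothetical event shape for illustration, not docpull's FetchEvent.
    kind: str
    current: int
    total: int


class DemoFetcher:
    """Hypothetical stand-in showing the context-manager + event-stream shape."""

    def __init__(self, urls: list[str]) -> None:
        self.urls = urls
        self.is_open = False

    async def __aenter__(self) -> "DemoFetcher":
        self.is_open = True  # a real fetcher would acquire resources here
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        self.is_open = False  # released on exit, including when an error occurs

    async def run(self) -> AsyncIterator[DemoEvent]:
        total = len(self.urls)
        for i, _url in enumerate(self.urls, start=1):
            await asyncio.sleep(0)  # placeholder for real network I/O
            yield DemoEvent("progress", i, total)


async def collect(urls: list[str]) -> list[DemoEvent]:
    # Consume the event stream inside the context manager's lifetime.
    async with DemoFetcher(urls) as fetcher:
        return [event async for event in fetcher.run()]
```

The caller handles events one at a time as they arrive, and `__aexit__` guarantees cleanup when the `async with` block ends.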

New Features

  • Streaming Event API: AsyncIterator[FetchEvent] for real-time progress
  • Pipeline Architecture: Composable steps (Validate, Fetch, Convert, Dedup, Save)
  • CacheManager: O(1) lookups with batched writes and TTL eviction
  • StreamingDeduplicator: Real-time content deduplication via SHA-256
  • JavaScript Rendering: Browser-based fetching via Playwright
  • Profile Presets: RAG, MIRROR, QUICK for common use cases
  • Rate Limiting: Per-host concurrent request limits
  • Security: robots.txt respect and URL validation
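As a rough illustration of the idea behind SHA-256 content deduplication on a stream, the sketch below hashes each page body as it arrives and skips any body whose digest has already been seen. This is an assumption-laden example using only `hashlib`; the function name `dedup_stream` and the `(url, content)` tuple shape are hypothetical and do not describe `StreamingDeduplicator`'s actual interface.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_stream(pages: Iterable[tuple[str, str]]) -> Iterator[tuple[str, str]]:
    """Yield (url, content) pairs, skipping content whose SHA-256 digest was already seen."""
    seen: set[str] = set()  # digests of content yielded so far
    for url, content in pages:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # identical content was already emitted under another URL
        seen.add(digest)
        yield url, content
```

Because the function is a generator, duplicates are dropped as pages arrive rather than in a pass over the finished crawl, and only the fixed-size digests are kept in memory.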

Quick Start

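```bash
# CLI
docpull https://docs.example.com --profile rag
```

```python
# Python API
import asyncio

from docpull import Fetcher, DocpullConfig, ProfileName, EventType


async def main() -> None:
    # Configure a crawl using the RAG profile preset
    config = DocpullConfig(url="https://docs.example.com", profile=ProfileName.RAG)
    async with Fetcher(config) as f:
        # Print progress events as they stream in
        async for event in f.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}")


asyncio.run(main())
```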

Full Changelog

See CHANGELOG.md