Skip to content

afadesigns/pyru

Repository files navigation

pyweb: The Fastest Python Web Scraper

pyweb is a command-line web scraper engineered for one purpose: to be the fastest Python web scraper in existence, achieving sub-millisecond latency. It leverages a hyper-optimized, asynchronous Rust core built on tokio and a fine-tuned reqwest client using rustls to eliminate C FFI overhead and explicitly forcing HTTP/1.1 for minimal connection latency. The code is further micro-optimized to eliminate all unnecessary allocations in the hot path. Performance is further enhanced with the mimalloc high-performance memory allocator, native CPU-specific compiler optimizations, Profile-Guided Optimization (PGO), and the io_uring asynchronous I/O interface on Linux.

Performance

pyweb is definitively the fastest Python web scraper. The final benchmark, scraping 100 pages from a local aiohttp server, was conducted after applying advanced OS-level network tuning (tcp_tw_reuse, tcp_fin_timeout) to minimize TCP connection overhead. The results below compare pyweb against the best-in-class pure-Python async solution (httpx + selectolax).

Metric pyweb (hyper-tuned async Rust) httpx+selectolax
Total Time 0.0659 seconds 0.1846 seconds
Average Latency 13.46 ms 92.36 ms
Jitter (Std Dev) 2.23 ms 2.48 ms
Requests > 50ms Threshold 0 (0.00%) 100 (100.00%)

pyweb is ~2.8x faster in total execution time and achieves ~6.86x lower average latency compared to its closest competitor. This is a direct result of a holistic optimization strategy, spanning the application code, compiler, memory allocator, I/O subsystem, TLS implementation, HTTP protocol, and the underlying operating system.

Installation

pip install pyweb-scraper

Usage

pyweb scrape [OPTIONS] [URLS]...

Options:

  • -s, --selector TEXT: CSS selector to extract specific elements.
  • -o, --output [json|text]: Output format.
  • -c, --concurrency INTEGER: Number of concurrent requests.
  • --help: Show this message and exit.

Example:

pyweb scrape "http://books.toscrape.com" -s "h3 > a" -c 200

About

Pyru is a command-line web scraper engineered for one purpose - to be the fastest Python web scraper in existence.

Topics

Resources

Stars

Watchers

Forks

Sponsor this project

 

Packages

 
 
 

Contributors