MajdKZ1/StockDataDump

StockDataDump

StockDataDump is a high-throughput historical stock data dumper designed for large-scale quantitative research pipelines.
It combines a Rust-powered concurrent fetcher with a Python orchestration and conversion layer, enabling rapid data acquisition, compact storage, and seamless transformation into analytics-ready formats.

Overview

StockDataDump provides:

  • Rust hot-path fetcher
    Concurrent HTTP downloads using Tokio + Reqwest, streaming responses directly through zstd compression into .zst dump files.

  • Python orchestration layer
    Manifest generation, job scheduling, error handling, and conversion to columnar formats such as Parquet and Feather using pandas/pyarrow/numpy.

  • Optimized storage formats
    Raw dumps use zstd compression; converted Parquet outputs support zstd or snappy for fast reads and reduced disk usage.

This architecture captures large symbol universes quickly and produces small, efficient datasets suited to backtesting, machine learning, and long-horizon research.


Repository Layout

rust-core/     # Rust fetcher (`dump-core`), built with Tokio + Reqwest + zstd
python/        # Python CLI (`stockdatadump`), manifest builder + converter
scripts/       # Helper scripts for install/clean/update workflows
dumps/         # Default output location for manifests, raw `.zst`, and Parquet files

Quick Start

1. Build the Rust fetcher

cd rust-core
cargo build --release

2. Install the Python orchestrator

From the repository root:

pip install -e python

3. Optional helper scripts (Linux/macOS)

./scripts/interface.sh   # interactive menu: build, install, clean, update
./scripts/install.sh     # installs Rust core + Python tools
./scripts/clean.sh       # removes generated artifacts
./scripts/update.sh      # rebuilds Rust and reinstalls Python package

Fetching and Converting Data

Yahoo Finance requires both a crumb and a cookie for authentication.
Pass them on the command line (--crumb, --cookie) or export them as environment variables (YAHOO_CRUMB, YAHOO_COOKIE).

Getting Yahoo Finance Credentials

To obtain valid credentials:

  1. Open your browser and navigate to https://finance.yahoo.com/quote/AAPL/history
  2. Open Developer Tools (F12) and go to the Network tab
  3. Trigger a historical data request (click the Download button or adjust the date range)
  4. Find the download request in the Network tab and click on it
  5. Extract credentials:
    • Crumb: Look in the request URL for the crumb= parameter (e.g., crumb=abc123xyz)
    • Cookie: Copy the entire Cookie header value from the request headers

Note: These credentials can expire. If requests start returning 401 errors, repeat the steps above to obtain a fresh crumb and cookie.
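
As a convenience, the crumb can be pulled out of a copied request URL programmatically instead of by eye. The following is a minimal sketch using only the Python standard library; the helper name extract_crumb, the sample URL, and its period1/period2/interval parameters are illustrative assumptions, not part of StockDataDump.

```python
from urllib.parse import urlsplit, parse_qs


def extract_crumb(request_url: str) -> str:
    """Return the value of the crumb= query parameter from a copied request URL."""
    query = parse_qs(urlsplit(request_url).query)
    values = query.get("crumb")
    if not values:
        raise ValueError("no crumb= parameter found in URL")
    # parse_qs percent-decodes values, so this is the decoded crumb.
    return values[0]


# Hypothetical URL copied from the Network tab; the crumb value is made up.
copied = (
    "https://query1.finance.yahoo.com/v7/finance/download/AAPL"
    "?period1=1672531200&period2=1704067200&interval=1d&crumb=abc123xyz"
)
print(extract_crumb(copied))  # prints "abc123xyz"
```

The README does not say whether --crumb expects the decoded or the percent-encoded form, so check which one the CLI accepts if your crumb contains special characters.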

1. Generate a manifest

stockdatadump manifest AAPL MSFT SPY \
  -o dumps/manifests/yahoo.jsonl \
  --start 2023-01-01 \
  --crumb "your-actual-crumb-value" \
  --cookie "B=actual-cookie-value; other-cookies=values"

Or use environment variables:

export YAHOO_CRUMB="your-actual-crumb-value"
export YAHOO_COOKIE="B=actual-cookie-value; other-cookies=values"
stockdatadump manifest AAPL MSFT SPY \
  -o dumps/manifests/yahoo.jsonl \
  --start 2023-01-01
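
If you drive the CLI from your own Python code, a small helper can mirror the two ways of supplying credentials. This is a sketch under stated assumptions: the function name resolve_credentials is hypothetical, and the precedence shown (explicit values win over environment variables) is an assumption about reasonable behavior, not a documented rule of stockdatadump.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_credentials(
    crumb: Optional[str] = None,
    cookie: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Prefer explicit values, then fall back to YAHOO_CRUMB / YAHOO_COOKIE."""
    env = os.environ if env is None else env
    crumb = crumb or env.get("YAHOO_CRUMB")
    cookie = cookie or env.get("YAHOO_COOKIE")
    # Collect the names of whichever credentials are still unset.
    missing = [
        name
        for name, value in (("YAHOO_CRUMB", crumb), ("YAHOO_COOKIE", cookie))
        if not value
    ]
    if missing:
        raise RuntimeError("missing credentials: " + ", ".join(missing))
    return crumb, cookie
```

Failing early with a clear message is cheaper than discovering a missing cookie through a batch of 401 responses during the fetch step.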

2. Fetch data using the Rust core

stockdatadump fetch \
  --manifest dumps/manifests/yahoo.jsonl \
  --output-dir dumps/raw \
  --concurrency 12

This writes compressed .zst files into dumps/raw/.

3. Convert raw dumps to Parquet

stockdatadump convert \
  --dumps-dir dumps/raw \
  --output dumps/arrow/dump.parquet \
  --format parquet \
  --compression zstd
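
The three steps above can be chained from a script. The sketch below only assembles the argument lists, using the subcommands and flags shown in this README; the function name build_pipeline and its parameters are hypothetical, and it assumes YAHOO_CRUMB and YAHOO_COOKIE are already exported.

```python
import shlex
from typing import List, Sequence


def build_pipeline(
    symbols: Sequence[str],
    manifest: str,
    raw_dir: str,
    parquet_path: str,
    start: str = "2023-01-01",
    concurrency: int = 12,
) -> List[List[str]]:
    """Return argv lists for the manifest, fetch, and convert steps, in order."""
    return [
        ["stockdatadump", "manifest", *symbols, "-o", manifest, "--start", start],
        ["stockdatadump", "fetch", "--manifest", manifest,
         "--output-dir", raw_dir, "--concurrency", str(concurrency)],
        ["stockdatadump", "convert", "--dumps-dir", raw_dir,
         "--output", parquet_path, "--format", "parquet", "--compression", "zstd"],
    ]


commands = build_pipeline(
    ["AAPL", "MSFT", "SPY"],
    "dumps/manifests/yahoo.jsonl",
    "dumps/raw",
    "dumps/arrow/dump.parquet",
)
for argv in commands:
    # Print each command; pass argv to subprocess.run(argv, check=True) to execute it.
    print(shlex.join(argv))
```

Keeping each command as a list rather than a single string avoids shell quoting problems when a path or symbol contains unusual characters.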

Inspecting Dumps

To quickly preview a compressed dump:

stockdatadump head dumps/raw/AAPL.zst

This decompresses the stream and prints the first few records.


Manifest Format

The Rust dump-core fetcher expects an NDJSON manifest (one JSON object per line), where each object contains a symbol and a url field:

{"symbol": "AAPL", "url": "https://query1.finance.yahoo.com/v7/finance/download/AAPL?..."}
{"symbol": "MSFT", "url": "https://query1.finance.yahoo.com/v7/finance/download/MSFT?..."}

The CLI generates manifests for you, but you can also construct them by hand for custom data sources.
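
For a custom data source, a hand-built manifest only needs one JSON object per line with the two fields shown above. The following is a minimal sketch; the function names and the example.com URLs are hypothetical placeholders, and it assumes the fetcher needs no fields beyond symbol and url.

```python
import json
from typing import Iterable, Iterator, Tuple


def manifest_lines(entries: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Yield one NDJSON line per (symbol, url) pair."""
    for symbol, url in entries:
        yield json.dumps({"symbol": symbol, "url": url})


def write_manifest(path: str, entries: Iterable[Tuple[str, str]]) -> int:
    """Write the manifest to path and return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for line in manifest_lines(entries):
            fh.write(line + "\n")
            count += 1
    return count


# Hypothetical custom source; replace with real download URLs.
entries = [
    ("AAPL", "https://example.com/history/AAPL.csv"),
    ("MSFT", "https://example.com/history/MSFT.csv"),
]
for line in manifest_lines(entries):
    print(line)
```

Calling write_manifest("dumps/manifests/custom.jsonl", entries) would produce a file that can be passed to the fetch step through --manifest.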


License

This project is licensed under OpenNET LLC.

About

Fast historical stock data dumping for quant workflows.
