Distributed Digital Asset Market Data Pipeline
QuantCore Engine is a high-frequency trading (HFT) data pipeline designed to ingest, process, and serve real-time market microstructure signals. It calculates Order Book Imbalance (OBI) for the Top 30 crypto assets with sub-second end-to-end latency.
The system implements a CQRS Pattern to decouple high-throughput computation from low-latency serving, using a Kappa-style streaming architecture where the data stream is the single source of truth.
- Real-Time Market Microstructure: Calculates Order Book Imbalance (OBI), $\frac{V_b - V_a}{V_b + V_a}$ where $V_b$ and $V_a$ are the total bid and ask volumes, to predict short-term price pressure using L2 Depth data.
- Distributed Stream Processing: Utilizes Apache Spark Structured Streaming to process nested JSON arrays of Order Books using vectorized higher-order functions (see the sketch after this list).
- High-Performance Ingestion: Python producer multiplexes 30+ WebSocket streams into a single connection, sharding data into Kafka Partitions to guarantee strict ordering per symbol.
- gRPC Streaming API: Go server pushes updates to clients via HTTP/2 server-side streaming, reducing network overhead compared to REST polling.
- Fault Tolerance: Fully containerized environment with Zookeeper-managed Kafka brokers and auto-healing Spark workers.
- Cloud-Native Deployment: Fully automated deployment to AWS EC2 using Terraform, with self-healing Docker container orchestration.
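To make the first two bullets concrete, here is a minimal PySpark sketch of the OBI computation over nested depth arrays. The field names (`s`, `b`, `a`) follow Binance's depth payload and the broker/topic names match the commands later in this README, but the schema and code are illustrative assumptions, not the contents of stream/stream_processor.py:

```python
# Hedged sketch: compute OBI per symbol from nested [[price, qty], ...] arrays
# without exploding them, using Spark's higher-order aggregate function.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.appName("obi-sketch").getOrCreate()

# Binance-style diff-depth payload (assumed, not the project's exact schema).
depth_schema = StructType([
    StructField("s", StringType()),                        # symbol
    StructField("b", ArrayType(ArrayType(StringType()))),  # bids: [[price, qty], ...]
    StructField("a", ArrayType(ArrayType(StringType()))),  # asks: [[price, qty], ...]
])

def total_qty(levels):
    # Sum the qty (index 1) of every price level inside the nested array.
    return F.aggregate(levels, F.lit(0.0),
                       lambda acc, lvl: acc + lvl.getItem(1).cast("double"))

raw = (spark.readStream                      # requires the spark-sql-kafka package
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "order_book")
       .load())

obi = (raw
       .select(F.from_json(F.col("value").cast("string"), depth_schema).alias("d"))
       .select(F.col("d.s").alias("symbol"),
               total_qty(F.col("d.b")).alias("bid_vol"),
               total_qty(F.col("d.a")).alias("ask_vol"))
       .withColumn("obi", (F.col("bid_vol") - F.col("ask_vol")) /
                          (F.col("bid_vol") + F.col("ask_vol"))))
```

The higher-order `aggregate` sums each nested `[price, qty]` level in place, avoiding an `explode`/`groupBy` pass per order book update.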
DATA SOURCE INGESTION LAYER BUFFER LAYER
+-------------+ +------------------+ +----------------------+
| Binance WS | --> | Python Producer | --> | Apache Kafka |
| (L2 Depth) | | (Multiplexer x10)| | (30 Partitions) |
+-------------+ +------------------+ +----------------------+
|
v
COMPUTE LAYER (WRITE)
+----------------------+
| Apache Spark Cluster |
| (Structured Stream) |
| - Calculate OBI |
| - Measure Latency |
+----------------------+
|
v
SERVING LAYER (READ) STORAGE LAYER
+----------------------+                      +----------------------+
| Go gRPC API Server   |       HGETALL        | Redis (In-Memory)    |
| (Streaming Response) | <------------------- | (Live Scoreboard)    |
+----------------------+                      +----------------------+
          |
          | gRPC / HTTP2 (server-side stream)
          v
   +-------------+
   | Trading Bot |
   | (Client)    |
   +-------------+
The system strictly separates the Write Model (Ingestion/Compute) from the Read Model (Serving) to optimize for conflicting requirements.
- Command Side (Write): Handles high-throughput math (18,000+ events/min) using Spark. Optimized for Throughput.
- Query Side (Read): Handles client requests using Go/Redis. Optimized for Latency.
- The Bridge: Redis acts as the materialized view, allowing the Go API to serve data in microseconds without being blocked by the heavy computational load of the Spark engine.
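As a sketch of that bridge, the compute side can materialize the view with a `foreachBatch` sink. The hash name (`obi:scoreboard`) and column names are illustrative assumptions; the real key layout lives in stream/stream_processor.py:

```python
import redis

def write_scoreboard(batch_df, batch_id):
    # foreachBatch callback: overwrite the latest OBI per symbol in one Redis
    # hash so the Go API can fetch the whole scoreboard with a single HGETALL.
    r = redis.Redis(host="redis", port=6379)
    rows = batch_df.select("symbol", "obi").collect()   # ~30 rows, cheap to collect
    if rows:
        r.hset("obi:scoreboard", mapping={row["symbol"]: row["obi"] for row in rows})

# Attached to the streaming DataFrame from the compute layer, e.g.:
# obi.writeStream.outputMode("update").foreachBatch(write_scoreboard).start()
```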
Unlike batch-based architectures, QuantCore treats the live data stream as the primary system of record.
- Continuous Intelligence: Metrics are calculated incrementally on the fly using Spark Structured Streaming, eliminating the need for nightly batch jobs.
- State Management: The system maintains the "Current State of the Market" in memory, rather than storing a historical archive on disk.
The entire production environment is provisioned automatically using Terraform.
- Dynamic Provisioning: Automates the creation of AWS EC2 instances (m5.xlarge) and security groups.
- Bootstrap Strategy: Uses user_data scripts to install Docker, clone the repository, and launch the distributed cluster on boot, ensuring reproducible deployments.
One of the core engineering challenges in HFT is processing massive data volumes without losing the strict chronological order of trades. QuantCore solves this using a Partition-Aware Streaming Strategy scaled for the Top 30 market assets.
To overcome the Global Interpreter Lock (GIL) and WebSocket limits of a single Python process, the ingestion layer is horizontally scaled.
- Sharding Strategy: The system launches 10 parallel producer instances, each responsible for a distinct slice of the symbol universe (e.g., Shard 0 handles BTC/ETH, Shard 1 handles SOL/ADA).
- Concurrency: This enables parallel network I/O and JSON parsing across multiple CPU cores before data even reaches Kafka.
The Kafka producer routes messages to partitions by hashing the Symbol key (Kafka's default key-based partitioner; see the sketch after the bullets below).
- Strict Ordering: All updates for a specific symbol (e.g., BTCUSDT) are guaranteed to land in the same partition.
- Load Balancing: With 30 partitions enabled, the system provides a dedicated logical lane for each of the Top 30 assets, preventing "noisy neighbor" latency spikes.
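A hedged sketch of the sharded, key-partitioned producer (the symbol list, environment variables, and helper function are illustrative; the real implementation is ingestion/producer.py):

```python
import json
import os
from kafka import KafkaProducer

SYMBOLS = ["BTCUSDT", "ETHUSDT", "SOLUSDT", "ADAUSDT"]   # ... extended to the Top 30
SHARD_ID = int(os.getenv("SHARD_ID", "0"))               # 0..9, one per producer process
NUM_SHARDS = int(os.getenv("NUM_SHARDS", "10"))

# Each producer process owns a distinct slice of the symbol universe.
my_symbols = SYMBOLS[SHARD_ID::NUM_SHARDS]

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode(),
    value_serializer=lambda v: json.dumps(v).encode(),
)

def publish(symbol: str, depth_update: dict) -> None:
    # Same key -> same partition -> strict chronological order per symbol.
    producer.send("order_book", key=symbol, value=depth_update)
```

If a strict one-partition-per-symbol mapping is required, the producer can also pass an explicit `partition=` index per symbol instead of relying on key hashing, since 30 keys hashed across 30 partitions can collide.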
The system simulates a high-performance cluster by vertically scaling Spark Executors to 30 Logical Cores (15 per Worker Node).
- 1:1 Concurrency: By matching 30 Kafka Partitions with 30 Spark Cores, the system achieves perfect parallelism.
- Result: No task serialization. BTC processing never queues behind ETH processing, maintaining sub-300ms latency even under high load.
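A sketch of how this 1:1 mapping could be expressed when building the Spark session (the project's actual values live in its docker-compose and spark-submit configuration; the numbers below simply mirror the text above):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("quantcore-stream")
         .master("spark://spark-master:7077")
         .config("spark.cores.max", "30")         # total cores == Kafka partition count
         .config("spark.executor.cores", "15")    # 15 logical cores per worker node
         .config("spark.sql.shuffle.partitions", "30")
         .getOrCreate())
```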
For a real-time ticker, Latency is the primary constraint.
- Redis (RAM): Provides ~200µs read latency via persistent TCP sockets. Ideal for the "Current State" scoreboard pattern.
- Rejection of Disk DBs: Traditional databases (Postgres/DynamoDB) were rejected for the hot path because the 5-10ms latency introduced by disk I/O and HTTP overhead is unacceptable for high-frequency signal distribution.
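For reference, the Python equivalent of the Go server's hot-path read, assuming the illustrative `obi:scoreboard` hash from the sketch above:

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

t0 = time.perf_counter()
scoreboard = r.hgetall("obi:scoreboard")   # {symbol: obi, ...} for all 30 assets
elapsed_us = (time.perf_counter() - t0) * 1e6
print(f"fetched {len(scoreboard)} symbols in {elapsed_us:.0f}µs")
```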
The consumption pattern for market data is Streaming, not Request-Response.
- REST: Clients must poll (GET /price) repeatedly. This creates "Thundering Herd" problems and wastes bandwidth on HTTP headers.
- gRPC: Supports long-lived HTTP/2 streams (server-side streaming in QuantCore). The client connects once, and the server pushes binary Protobuf updates continuously. This reduces payload size by ~60% and CPU usage for parsing.
Kafka acts as the Shock Absorber between the volatile data source (Binance) and the processing engine (Spark).
- Backpressure: Prevents the ingestion layer from crashing if the compute layer slows down during market spikes.
- Parallelism: Hashes symbols to specific partitions, allowing Spark workers to process BTC and ETH in parallel without race conditions.
- Docker & Docker Compose
- Go 1.21+
- Python 3.11+
- Terraform
- AWS CLI (configured)
Boot up the "Virtual Data Center" (Zookeeper, Kafka, Spark, Redis).
docker-compose up -d

Force creation of the topic with 30 partitions to enable parallel processing for the Top 30 symbols.
docker exec -it kafka kafka-topics --create \
--topic order_book \
--bootstrap-server localhost:9092 \
--partitions 30 \
--replication-factor 1

Connects to Binance and feeds Kafka.
# Create venv and install dependencies
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Run Producer
./venv/bin/python ingestion/producer.py

Submits the job to the Spark Cluster. Note that we execute this inside the container as the root user to handle JAR permissions.
docker exec -u 0 -it spark-master /opt/spark/bin/spark-submit \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
--master spark://spark-master:7077 \
/app/stream/stream_processor.py

Launch the API server.
cd api
go run main.go

Connect a dummy client to verify the stream.
cd api
go run client/main.go

Deploy the entire stack to a dedicated AWS EC2 instance automatically.
- Initialize Terraform:

  cd infra
  terraform init

- Deploy Infrastructure:

  terraform apply

  (This provisions an m5.xlarge instance, installs Docker, clones the repo, and starts the cluster via User Data scripts.)

- Start Ingestion (Sharded): Use the helper script to launch 10 parallel producers.

  # Inside the server or local machine
  cd ingestion
  ./run.sh

- Teardown:

  terraform destroy
.
├── api/                   # Serving Layer (Go gRPC)
│   ├── client/            # Test gRPC Client
│   ├── proto/             # Protobuf Contracts
│   └── main.go            # Server Entrypoint
├── infra/                 # Infrastructure as Code
│   └── main.tf            # Terraform AWS Definition
├── ingestion/             # Ingestion Layer (Python)
│   ├── producer.py        # Binance WebSocket -> Kafka
│   └── run.sh             # Helper script to launch sharded producers
├── stream/                # Compute Layer (PySpark)
│   └── stream_processor.py  # Kafka -> OBI Math -> Redis
├── docker-compose.yml     # Local Orchestration
├── Dockerfile             # Custom Spark Image with Dependencies
├── requirements.txt       # Python Dependencies
└── README.md              # System Documentation
MIT License.
