
Rust - Core Inference Infrastructure

High-performance AI orchestration with gRPC microservices

Rust workspace providing WebRTC signaling, an HTTP API, a database layer, and ML orchestration. It spawns and manages a Python ML subprocess for vision and language models.


Architecture

Rust Server (Orchestrator)
├── HTTP API (REST + OpenAPI)
├── WebRTC Signaling + Data Channels
├── Native Messaging (Chrome Extension)
├── Database Layer (7 MIA databases)
├── Model Management (Download + Cache)
└── Python ML gRPC Client
     ↓ (spawns subprocess)
    Python ML Service (MediaPipe + Transformers + LiteRT)

Design:

  • Rust = Brain (orchestration, state, networking)
  • Python = Worker (ML inference only)
  • gRPC = Communication (language-agnostic, location-transparent)
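
A minimal sketch of this split: Rust spawns the Python service and opens a tonic channel to it. The module name, port, and the one-second wait are assumptions; the real logic lives in PythonProcessManager (common/).

use std::time::Duration;
use tokio::process::Command;
use tonic::transport::Channel;

// Hypothetical sketch: spawn the Python ML worker, then connect over gRPC.
async fn start_ml_backend() -> Result<Channel, Box<dyn std::error::Error>> {
    // Rust owns the subprocess lifecycle; kill it when the handle drops.
    let _child = Command::new("python")
        .args(["-m", "ml_service", "--port", "50051"])
        .kill_on_drop(true)
        .spawn()?;

    // Give the service a moment to bind its port, then connect.
    tokio::time::sleep(Duration::from_secs(1)).await;
    let channel = Channel::from_static("http://127.0.0.1:50051")
        .connect()
        .await?;
    Ok(channel)
}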

Workspace Structure

Core Server (server/)

Main binary - Orchestrates all components

Modes:

  • native - Native messaging only
  • http - HTTP API only
  • webrtc - WebRTC signaling only
  • web - HTTP + WebRTC (no native)
  • all - Everything (default)
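
A hypothetical sketch of the flag parsing with clap (whether the server actually uses clap, and these exact names, is an assumption):

use clap::{Parser, ValueEnum};

// Mirrors the mode list above; "all" is the default.
#[derive(Clone, Debug, ValueEnum)]
enum Mode { Native, Http, Webrtc, Web, All }

#[derive(Parser)]
struct Args {
    #[arg(long, value_enum, default_value = "all")]
    mode: Mode,
    #[arg(long, default_value_t = 3000)]
    port: u16,
}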

Auto-starts: Python ML subprocess on startup

See: server/README.md


API Layer (api/)

HTTP REST API with OpenAPI documentation

Features:

  • Auto-generated routes from traits
  • Swagger/ReDoc/RapiDoc UI
  • Static file serving
  • CORS support

Endpoints:

  • /v1/health - Health check
  • /v1/models/* - Model management
  • /v1/chat/* - Chat completion
  • /v1/webrtc/* - WebRTC signaling
  • /api-doc/* - API documentation

See: api/README.md


Application State (appstate/)

Centralized state for all components

Components:

  • ModelOrchestrator - Model lifecycle (download → load → inference)
  • Database clients
  • ML clients (gRPC to Python)
  • Hardware info
  • HuggingFace auth

Pattern: Provides an AppStateProvider trait for dependency injection
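
A minimal illustration of the pattern; the method names are placeholders, not the crate's actual trait surface:

use std::sync::Arc;

// Handlers depend on a trait, not the concrete AppState, so tests
// can inject a mock. Methods here are illustrative.
trait AppStateProvider: Send + Sync {
    fn db_path(&self) -> &std::path::Path;
    fn model_is_loaded(&self, model_id: &str) -> bool;
}

async fn health_handler(state: Arc<dyn AppStateProvider>) -> &'static str {
    if state.model_is_loaded("default") { "ready" } else { "starting" }
}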

See: appstate/README.md


Database Layer (storage/)

MIA cognitive architecture - 7-database memory system

Databases:

  • conversations/ - Chat history
  • knowledge/ - Facts, entities
  • embeddings/ - Vector search
  • tool-results/ - Function call results
  • experience/ - Episodic memory
  • summaries/ - Compressed history
  • meta/ - System metadata

Features:

  • gRPC service for remote access
  • Direct in-process access
  • Atomic transactions
  • Backup/restore

See: storage/README.md


Common Types (common/)

Shared types, clients, utilities

Key Components:

  • MlClient - gRPC client for Python ML
  • PythonProcessManager - Subprocess lifecycle
  • AppStateProvider - State trait
  • Platform-specific paths
  • Error types
  • gRPC generated code

See: common/README.md


Model Management (model-cache/)

Model download, storage, serving

Features:

  • HuggingFace Hub integration
  • Parallel chunk downloads
  • Resume interrupted downloads
  • Local file serving (for Python gRPC)
  • Default model library

File Structure:

models/
├── microsoft--Florence-2-base/
│   ├── config.json
│   ├── model.safetensors
│   └── ...
└── meta-llama--Llama-2-7b-hf/
    └── ...
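
As the tree shows, a HuggingFace repo id maps to a directory name with "/" replaced by "--". A sketch of that mapping (the function name is illustrative):

// "meta-llama/Llama-2-7b-hf" → "meta-llama--Llama-2-7b-hf"
fn repo_dir_name(repo_id: &str) -> String {
    repo_id.replace('/', "--")
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn maps_slash_to_double_dash() {
        assert_eq!(repo_dir_name("microsoft/Florence-2-base"),
                   "microsoft--Florence-2-base");
    }
}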

See: model-cache/README.md


WebRTC (webrtc/)

WebRTC signaling + data channels

Features:

  • SDP offer/answer exchange
  • ICE candidate gathering
  • Data channel creation
  • Video stream handling
  • Session management

Use Cases:

  • Browser → Rust real-time video
  • Rust → Python streaming inference
  • Peer-to-peer connections

See: webrtc/README.md


Native Messaging (native-messaging/)

Chrome extension protocol

Features:

  • stdin/stdout message framing
  • JSON message parsing
  • Bi-directional communication
  • Extension manifest generation

Protocol:

[4-byte length][JSON message]
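
A minimal sketch of reading and writing that framing (the length prefix uses the platform's native byte order, which is little-endian on every platform Chrome supports; error handling is simplified):

use std::io::{self, Read, Write};

// Read one framed message: 4-byte length, then that many bytes of JSON.
fn read_message(input: &mut impl Read) -> io::Result<serde_json::Value> {
    let mut len_buf = [0u8; 4];
    input.read_exact(&mut len_buf)?;
    let len = u32::from_ne_bytes(len_buf) as usize;

    let mut body = vec![0u8; len];
    input.read_exact(&mut body)?;
    Ok(serde_json::from_slice(&body)?)
}

// Write one framed message and flush, since Chrome reads from a pipe.
fn write_message(output: &mut impl Write, msg: &serde_json::Value) -> io::Result<()> {
    let body = serde_json::to_vec(msg)?;
    output.write_all(&(body.len() as u32).to_ne_bytes())?;
    output.write_all(&body)?;
    output.flush()
}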

See: native-messaging/README.md


Hardware Detection (hardware/)

Auto-detect CPU, GPU, NPU

Detects:

  • CPUs (cores, threads, model)
  • NVIDIA GPUs (VRAM via nvidia-smi)
  • AMD GPUs (discrete/integrated)
  • Intel GPUs (Arc, integrated)
  • Acceleration (CUDA, ROCm, Vulkan, DirectML)
  • NPUs (AMD Ryzen AI, Intel Core Ultra)

Platforms: Windows (complete), Linux/macOS (planned)
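
For the NVIDIA path, VRAM detection can shell out to nvidia-smi with its standard query flags; the parsing and return shape here are illustrative:

use std::process::Command;

// Returns total VRAM in MiB, one entry per GPU, or None if nvidia-smi
// is missing or fails (no NVIDIA GPU / no driver installed).
fn nvidia_vram_mib() -> Option<Vec<u64>> {
    let out = Command::new("nvidia-smi")
        .args(["--query-gpu=memory.total", "--format=csv,noheader,nounits"])
        .output()
        .ok()?;
    if !out.status.success() {
        return None;
    }
    String::from_utf8(out.stdout)
        .ok()?
        .lines()
        .map(|line| line.trim().parse::<u64>().ok())
        .collect()
}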

See: hardware/README.md


Model Loaders

ONNX (onnx-loader/)

ONNX Runtime integration

Features:

  • Load .onnx models
  • Execution providers (CPU, CUDA, DirectML, etc.)
  • Session management
  • Tensor handling

See: onnx-loader/README.md


GGUF (gguf-loader/)

llama.cpp integration

Features:

  • Load .gguf models
  • Context management
  • Sampling parameters
  • Streaming generation

See: gguf-loader/README.md


Execution Providers (execution-providers/)

ONNX Runtime provider definitions

Providers: 25+ supported, including:

  • CUDA, TensorRT, ROCm (GPU)
  • DirectML, CoreML, Metal (Platform-specific)
  • XNNPACK, QNN, NNAPI (Edge)
  • OpenVINO, Vitis, MIGraphX (Specialized)

See: execution-providers/README.md


Pipeline (pipeline/)

Inference pipeline orchestration

Features:

  • Multi-stage pipelines
  • Preprocessing/postprocessing
  • Model chaining
  • Error handling
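
A minimal sketch of stage chaining; the trait and function are illustrative, not the crate's API:

// Each stage maps its input type to an output type; chaining is just
// function composition with error propagation.
trait Stage<I, O> {
    fn run(&self, input: I) -> Result<O, String>;
}

fn chain<A, B, C>(
    pre: impl Stage<A, B>,    // e.g. preprocessing
    model: impl Stage<B, C>,  // e.g. model inference
    input: A,
) -> Result<C, String> {
    model.run(pre.run(input)?)
}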

See: pipeline/README.md


Database Components

Indexing (indexing/)

Vector indexing and search

Features:

  • HNSW implementation
  • Cosine/Euclidean distance
  • Lock-free concurrent access
  • Incremental updates
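
For reference, the cosine variant computes 1 - cos(a, b); a scalar sketch (the crate's actual kernel may be vectorized):

// Cosine distance between two equal-length, non-zero vectors.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}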

See: indexing/README.md


Query Engine (query/)

Semantic search and retrieval

Features:

  • Hybrid search (vector + keyword)
  • Filtering
  • Ranking
  • Result fusion
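
One common way to fuse the vector and keyword rankings is reciprocal rank fusion; whether this crate uses RRF specifically is an assumption, but the sketch shows the idea:

use std::collections::HashMap;

// Each input ranking contributes 1 / (k + rank) per document; documents
// ranked highly by several lists float to the top. k ≈ 60 is typical.
fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (rank, doc) in ranking.iter().enumerate() {
            *scores.entry((*doc).to_string()).or_default() += 1.0 / (k + rank as f64 + 1.0);
        }
    }
    let mut fused: Vec<_> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}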

See: query/README.md


Knowledge Weaver (weaver/)

Graph-based knowledge management

Features:

  • Entity extraction
  • Relationship mapping
  • Graph traversal
  • Knowledge consolidation

See: weaver/README.md


Utilities

Task Scheduler (task-scheduler/)

Async task management

Features:

  • Priority queues
  • Deadline scheduling
  • Resource limits
  • Cancellation
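
Deadline scheduling reduces to a min-heap keyed on deadline; a toy sketch with std's BinaryHeap (the task shape is illustrative):

use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::time::{Duration, Instant};

fn main() {
    // Reverse turns the max-heap into a min-heap: earliest deadline first.
    let mut queue: BinaryHeap<Reverse<(Instant, &str)>> = BinaryHeap::new();
    let now = Instant::now();
    queue.push(Reverse((now + Duration::from_secs(5), "summarize")));
    queue.push(Reverse((now + Duration::from_secs(1), "embed")));

    while let Some(Reverse((_deadline, name))) = queue.pop() {
        println!("running {name}"); // "embed" pops before "summarize"
    }
}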

See: task-scheduler/README.md


Tokenization (tokenization/)

Text tokenization utilities

Features:

  • HuggingFace tokenizers
  • Byte-pair encoding
  • Token counting
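
Token counting with the HuggingFace tokenizers crate looks roughly like this (the tokenizer.json path is an assumption):

use tokenizers::Tokenizer;

// Count tokens the way the target model would see them.
fn count_tokens(text: &str) -> Result<usize, Box<dyn std::error::Error>> {
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;
    let encoding = tokenizer.encode(text, false)?; // false: no special tokens
    Ok(encoding.get_ids().len())
}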

See: tokenization/README.md


Values (values/)

Typed value system

Features:

  • Dynamic typing
  • Serialization
  • Type conversions
  • Validation

See: values/README.md


Native Handler (native-handler/)

Native messaging request handler

Features:

  • Route requests to appstate
  • Error handling
  • Response formatting

See: native-handler/README.md


Build & Run

Prerequisites

# Rust 1.75+
rustup update

# Protocol Buffers compiler
# Windows: choco install protoc
# macOS: brew install protobuf
# Linux: apt install protobuf-compiler

Build

cd Rust

# Debug build
cargo build

# Release build (optimized)
cargo build --release

# Build specific crate
cargo build -p tabagent-server

Run

# Default (all modes)
cargo run --bin tabagent-server

# Specific mode
cargo run --bin tabagent-server -- --mode web --port 3000

# Release mode
cargo run --release --bin tabagent-server

# With script (auto-kills old processes)
./run-server.ps1  # Windows PowerShell
./run-server.bat  # Windows CMD

Test

# All tests
cargo test --workspace

# Specific crate
cargo test -p storage

# Integration tests
cargo test --test '*'

# With output
cargo test -- --nocapture

gRPC Communication

Proto Definitions

Located in protos/:

  • database.proto - Database service
  • ml_inference.proto - ML services (ModelManagement, Transformers, MediaPipe)

Generate Code

# Rust (automatic via build.rs)
cargo build -p common

# Python
cd ../PythonML
./generate_protos.bat

Architecture

Rust → Python:

  • MlClient in common/src/ml_client.rs
  • Connects to localhost:50051
  • Calls: LoadModel, GenerateText, StreamFaceDetection, etc.
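
A hedged sketch of one such call through a tonic-generated client; the module path, client name, and request fields below are assumptions based on ml_inference.proto, not the actual generated code:

// Hypothetical names: adjust to whatever tonic generates from the proto.
use common::grpc::ml_inference::{
    model_management_client::ModelManagementClient, LoadModelRequest,
};

async fn load_model(model_id: &str) -> Result<(), Box<dyn std::error::Error>> {
    let mut client = ModelManagementClient::connect("http://127.0.0.1:50051").await?;
    let response = client
        .load_model(LoadModelRequest {
            model_id: model_id.to_string(),
            ..Default::default()
        })
        .await?;
    println!("loaded: {:?}", response.into_inner());
    Ok(())
}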

Database:

  • DatabaseClient in storage/src/database_client.rs
  • Can run in-process or remote (gRPC)
  • Transparent switching

See: GRPC_ARCHITECTURE.md


Configuration

Database Path

# Default (platform-specific AppData)
# Windows: %APPDATA%/TabAgent/tabagent_db/
# Linux: ~/.local/share/TabAgent/tabagent_db/
# macOS: ~/Library/Application Support/TabAgent/tabagent_db/

# Custom
cargo run --bin tabagent-server -- --db-path ./my_db

Model Cache Path

# Default
# Windows: %APPDATA%/TabAgent/models/
# Linux: ~/.local/share/TabAgent/models/
# macOS: ~/Library/Application Support/TabAgent/models/

# Custom
cargo run --bin tabagent-server -- --model-cache-path ./my_models

Crate Dependencies

server
├── api (HTTP routes)
├── appstate (shared state)
├── webrtc (signaling)
├── native-messaging (extension)
└── common (types, clients)

appstate
├── storage (database)
├── model-cache (downloads)
├── hardware (detection)
├── common (gRPC clients)
└── ... (all other crates)

storage
├── indexing (vector search)
├── query (semantic search)
├── weaver (knowledge graph)
└── common (types)

common
├── tonic, prost (gRPC)
└── (no other workspace dependencies)

Performance

Component            Metric      Value
HTTP API             Latency     <5ms (p99)
WebRTC Signaling     Setup Time  <100ms
Database Writes      Throughput  10k/s
Database Reads       Throughput  50k/s
Model Download       Speed       100MB/s (parallel chunks)
gRPC (Rust↔Python)   Latency     <1ms (localhost)

Measured on an i9-12900K with NVMe SSD.


Roadmap

High Priority

  • Linux/macOS hardware detection
  • GPU memory management (VRAM tracking)
  • Model quantization (GPTQ, AWQ)
  • WebRTC video encoding/decoding

Medium Priority

  • Distributed database (multi-node)
  • Model fine-tuning API
  • Ray tracing for attention visualization
  • Plugin system

Low Priority

  • Web UI (dashboard)
  • Monitoring/metrics (Prometheus)
  • Clustering support
  • Docker deployment

See: Individual crate TODO.md files for details


Contributing

  1. Read architecture docs
  2. Check crate-specific README.md + TODO.md
  3. Write tests
  4. Run cargo fmt and cargo clippy
  5. Ensure the build compiles without warnings

See Also