ShipItAndPray/turboquant


 ████████╗██╗   ██╗██████╗ ██████╗  ██████╗  ██████╗ ██╗   ██╗ █████╗ ███╗   ██╗████████╗
 ╚══██╔══╝██║   ██║██╔══██╗██╔══██╗██╔═══██╗██╔═══██╗██║   ██║██╔══██╗████╗  ██║╚══██╔══╝
    ██║   ██║   ██║██████╔╝██████╔╝██║   ██║██║   ██║██║   ██║███████║██╔██╗ ██║   ██║
    ██║   ██║   ██║██╔══██╗██╔══██╗██║   ██║██║▄▄ ██║██║   ██║██╔══██║██║╚██╗██║   ██║
    ██║   ╚██████╔╝██║  ██║██████╔╝╚██████╔╝╚██████╔╝╚██████╔╝██║  ██║██║ ╚████║   ██║
    ╚═╝    ╚═════╝ ╚═╝  ╚═╝╚═════╝  ╚═════╝  ╚══▀▀═╝  ╚═════╝ ╚═╝  ╚═╝╚═╝  ╚═══╝   ╚═╝
  

6x Compression for Vectors, Embeddings, and LLMs

Based on Google's TurboQuant — PolarQuant + QJL. No training required. Near-zero accuracy loss.

24 Adapters · Quick Start · Benchmarks · How It Works · LLM CLI · Platforms · CI/CD


Adapters — Plug and Play for 24 Systems

3 lines of code. No forks, no patches, no recompilation. Wrap your existing client, get 6x compression.

from turboquant.core import TurboQuantEncoder
from turboquant.adapters.redis import RedisTurboCache

encoder = TurboQuantEncoder(dim=768)
cache = RedisTurboCache(encoder, your_existing_redis_client)
cache.put("doc:1", embedding)  # 3KB → 500 bytes

Every adapter has the same API: put · get · search · put_batch · get_batch · delete · stats

Caches

| Adapter | Install | Key Feature |
| --- | --- | --- |
| Redis | pip install redis | Pipeline batching, SCAN search, TTL, key prefixing |
| Memcached | pip install pymemcache | get_multi/set_multi, CAS atomic updates |
| Ehcache | pip install py4j | Java JVM bridge (Py4J) or REST API, Ehcache 2 & 3 |
| Hazelcast | pip install hazelcast-python-client | Distributed cluster, put_all/get_all |

Databases

| Adapter | Install | Key Feature |
| --- | --- | --- |
| PostgreSQL | pip install psycopg2-binary | BYTEA + optional pgvector hybrid search, JSONB metadata |
| MySQL | pip install mysql-connector-python | MEDIUMBLOB storage, executemany bulk insert |
| SQLite | (built-in, zero deps) | WAL mode, JSON1 metadata, great for local dev |
| MongoDB | pip install pymongo | BSON Binary, Atlas Vector Search aggregation pipeline |
| DynamoDB | pip install boto3 | Binary attribute, batch_write_item (25/batch), TTL |
| Cassandra | pip install cassandra-driver | Prepared statements, UNLOGGED BATCH, native TTL |

Vector Databases

| Adapter | Install | Key Feature |
| --- | --- | --- |
| Pinecone | pip install pinecone-client | Native ANN + TurboQuant reranking for higher recall |
| Qdrant | pip install qdrant-client | HNSW search + rerank, payload filtering |
| ChromaDB | pip install chromadb | Local/server mode, metadata where-filtering |
| Milvus | pip install pymilvus | IVF/HNSW index + TurboQuant rerank |
| Weaviate | pip install weaviate-client | Schema-based, near_vector + rerank |
| FAISS | pip install faiss-cpu | Local ANN index, save/load to disk, rerank mode |

Search Engines

| Adapter | Install | Key Feature |
| --- | --- | --- |
| Elasticsearch | pip install elasticsearch | Binary field + dense_vector kNN, bulk API |
| OpenSearch | pip install opensearch-py | k-NN plugin (nmslib/faiss engine), compressed-only mode |

Object Storage

| Adapter | Install | Key Feature |
| --- | --- | --- |
| AWS S3 | pip install boto3 | ~500B objects, concurrent ThreadPool upload |
| Google Cloud Storage | pip install google-cloud-storage | Blob metadata, concurrent upload |
| Azure Blob | pip install azure-storage-blob | Container-based, blob metadata |

Embedded Key-Value Stores

| Adapter | Install | Key Feature |
| --- | --- | --- |
| LMDB | pip install lmdb | Memory-mapped B+ tree, zero-copy reads, ACID |
| RocksDB | pip install python-rocksdb | WriteBatch, LSM-tree (less write amplification with smaller values) |

Streaming

| Adapter | Install | Key Feature |
| --- | --- | --- |
| Apache Kafka | pip install confluent-kafka | Producer + Consumer, 6x smaller messages, metadata support |

Full adapter docs with examples: adapters/README.md


Quick Start

Install

pip install numpy   # Only dependency for core engine

# Then install your backend's client:
pip install redis               # for Redis adapter
pip install psycopg2-binary     # for PostgreSQL adapter
pip install pymongo             # for MongoDB adapter
# ... etc

Compress and Store Vectors

from turboquant.core import TurboQuantEncoder, TurboQuantConfig

# Create encoder (reuse across your app)
config = TurboQuantConfig(bits=4, block_size=32, qjl_proj_dim=64)
encoder = TurboQuantEncoder(dim=768, config=config)

# Compress a single vector
import numpy as np
vector = np.random.randn(768).astype(np.float32)
compressed = encoder.encode(vector)

print(f"Original:   {768 * 4} bytes")
print(f"Compressed: {compressed.nbytes()} bytes")
print(f"Ratio:      {compressed.compression_ratio():.1f}x")

# Decompress
reconstructed = encoder.decode(compressed)

# Serialize for any storage
raw_bytes = compressed.to_bytes()
restored = type(compressed).from_bytes(raw_bytes)

Use with Any Backend

# Redis
import redis
from turboquant.adapters.redis import RedisTurboCache
cache = RedisTurboCache(encoder, redis.Redis(), prefix="emb:", ttl=3600)

# PostgreSQL with pgvector
from turboquant.adapters.postgresql import PostgresTurboCache
cache = PostgresTurboCache(encoder, dsn="postgresql://localhost/mydb", use_pgvector=True)

# MongoDB with Atlas Vector Search
from pymongo import MongoClient
from turboquant.adapters.mongodb import MongoTurboCache
cache = MongoTurboCache(encoder, MongoClient(), db="myapp", collection="embeddings")

# S3
from turboquant.adapters.s3 import S3TurboCache
cache = S3TurboCache(encoder, bucket="my-vectors", prefix="embeddings/")

# SQLite (zero deps)
from turboquant.adapters.sqlite import SQLiteTurboCache
cache = SQLiteTurboCache(encoder, db_path="vectors.db")

# All adapters — same API:
cache.put("doc:1", vector)
cache.put_batch({"doc:2": v2, "doc:3": v3})
vec = cache.get("doc:1")
results = cache.search(query_vector, k=10)
print(cache.stats())

Vector DB Reranking

For Pinecone, Qdrant, Milvus, etc. — use native ANN for candidates, TurboQuant for precision:

from turboquant.adapters.qdrant import QdrantTurboCache
cache = QdrantTurboCache(encoder, qdrant_client, collection="docs")

results = cache.search(query, k=10, mode="rerank")      # ANN + TQ rerank (best quality)
results = cache.search(query, k=10, mode="native")      # ANN only (fastest)
results = cache.search(query, k=10, mode="compressed")   # TQ only (no ANN index needed)
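Conceptually, rerank mode over-fetches candidates from the ANN index, then re-scores them with exact cosine similarity against the decoded TurboQuant vectors. A minimal sketch of that second stage (the rerank helper and overfetch factor here are hypothetical, not part of the adapter API):

```python
import numpy as np

def rerank(query, candidates, decode, k=10, overfetch=4):
    """Re-score ANN candidates by exact cosine similarity.

    candidates: (key, compressed) pairs from the ANN stage, typically
    k * overfetch of them; decode() restores a float32 vector.
    """
    q = query / np.linalg.norm(query)
    scored = []
    for key, compressed in candidates[: k * overfetch]:
        v = decode(compressed)
        scored.append((key, float(q @ (v / np.linalg.norm(v)))))
    scored.sort(key=lambda kv: kv[1], reverse=True)
    return scored[:k]
```

Over-fetching is what buys back recall: the ANN stage is approximate, so asking it for more candidates than needed gives the exact re-scoring pass room to fix its mistakes.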

Build Your Own Adapter

Subclass BaseTurboAdapter — implement 4 methods, get the full API for free:

from turboquant.adapters._base import BaseTurboAdapter

class MyCache(BaseTurboAdapter):
    def _raw_get(self, key): ...       # return bytes or None
    def _raw_set(self, key, value): ...  # store bytes
    def _raw_delete(self, key): ...    # return bool
    def _raw_keys(self, pattern): ...  # return list of keys

# You now have: put, get, search, put_batch, get_batch, delete, stats
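For illustration, here is a standalone dict-backed version of those four methods (subclassing object rather than BaseTurboAdapter so the sketch runs without the package installed; in real code you would subclass BaseTurboAdapter exactly as above):

```python
import fnmatch

class DictStore:  # in real code: class MyCache(BaseTurboAdapter)
    """Dict-backed raw store implementing the four-method contract."""

    def __init__(self):
        self._store = {}

    def _raw_get(self, key):
        return self._store.get(key)          # bytes or None

    def _raw_set(self, key, value):
        self._store[key] = value             # store bytes

    def _raw_delete(self, key):
        return self._store.pop(key, None) is not None

    def _raw_keys(self, pattern):
        return [k for k in self._store if fnmatch.fnmatch(k, pattern)]
```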

Compression Benchmarks

4-bit quantization, block_size=32, QJL proj_dim=64:

| Dimension | Compression | Cosine Similarity | Bytes per Vector |
| --- | --- | --- | --- |
| 128 | 5.5x | 0.990 | 94 |
| 384 | 6.1x | 0.973 | 254 |
| 768 | 6.2x | 0.949 | 494 |
| 1536 | 6.3x | 0.907 | 974 |
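These sizes line up with the encoded layout described under How It Works. A back-of-the-envelope check for dim=768, assuming one float32 scale per block, a float16 norm, and proj_dim 1-bit QJL signs (the exact header layout is an assumption):

```python
def approx_bytes(dim, bits=4, block_size=32, qjl_proj_dim=64):
    packed = dim * bits // 8              # N-bit packed values
    scales = (dim // block_size) * 4      # one float32 scale per block (assumed)
    norm = 2                              # float16 norm
    signs = qjl_proj_dim // 8             # 1-bit QJL signs
    return packed + scales + norm + signs

print(approx_bytes(768))  # → 490, close to the measured 494
```

The few remaining bytes are presumably container overhead.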

Memory Savings at Scale

| Scenario | Raw float32 | TurboQuant | Saved |
| --- | --- | --- | --- |
| 10K vectors, dim=128 | 5 MB | 940 KB | 82% |
| 100K vectors, dim=384 | 154 MB | 25 MB | 83% |
| 1M vectors, dim=768 | 3.1 GB | 494 MB | 84% |
| 10M vectors, dim=1536 | 61.4 GB | 9.7 GB | 84% |

Throughput

| Operation | Speed (dim=768) |
| --- | --- |
| Encode | ~1,000 vec/s |
| Decode | ~1,800 vec/s |
| Similarity | ~500 pairs/s |
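Throughput varies with hardware, so it is worth measuring on your own machine. A minimal generic harness (pass the encoder.encode from Quick Start as fn):

```python
import time

def throughput(fn, items):
    """Calls per second of fn applied to each item, e.g. fn=encoder.encode."""
    start = time.perf_counter()
    for item in items:
        fn(item)
    return len(items) / (time.perf_counter() - start)
```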

How It Works

Based on Google's TurboQuant research — two-stage compression, no training required:

Stage 1: PolarQuant

  1. Random orthogonal rotation — spreads information uniformly across all vector components
  2. Block-wise quantization — each block of 32 values gets its own scale factor, quantized to N bits
  3. Norm preservation — vector magnitude stored separately at float16 precision

Stage 2: QJL (Quantized Johnson-Lindenstrauss)

  1. Random projection of the quantization residual into a lower-dimensional space
  2. 1-bit sign quantization — each projected value becomes just +1 or -1
  3. Unbiased error correction — mathematically proven to eliminate quantization bias
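The projection-and-sign steps can be sketched as follows; the Gaussian projection and bit-packing shown are illustrative assumptions, not the library's exact construction:

```python
import numpy as np

def qjl_encode(residual, proj_dim=64, seed=0):
    """Project the residual down, then keep only 1 bit (the sign) per value."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((proj_dim, residual.size)) / np.sqrt(proj_dim)
    signs = proj @ residual > 0               # one boolean per projected value
    return np.packbits(signs)                 # 64 signs → 8 bytes

print(qjl_encode(np.random.randn(768)).nbytes)  # → 8
```

A decoder can regenerate the same projection from the shared seed, so only the packed sign bits need to travel with the vector.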

Input Vector (float32)           Compressed Output (~6x smaller)
   ┌──────────────┐              ┌──────────────────────────┐
   │ [0.23, -0.1, │              │ norm (2B) + block scales │
   │  0.45, 0.67, │   encode()   │ (N*4B) + packed N-bit    │
   │  ...768 dim] │ ───────────→ │ values + QJL sign bits   │
   │ 3,072 bytes  │              │ ~494 bytes               │
   └──────────────┘              └──────────────────────────┘

LLM Quantization CLI

TurboQuant also includes a CLI for compressing HuggingFace LLMs to GGUF/GPTQ/AWQ:

turboquant meta-llama/Llama-3.1-8B-Instruct --format gguf --bits 4

That's it. Your 16GB model is now 4GB. Ship it to Ollama, vLLM, or llama.cpp.

pip install turboquant[all]  # Install all LLM backends

Target Platforms

Don't know which format to use? Just tell TurboQuant where you want to run it.

Ollama (one command, ready to run)

turboquant meta-llama/Llama-3.1-8B-Instruct --target ollama --bits 4

This quantizes to GGUF, auto-generates a Modelfile with the correct chat template, and tells you the exact ollama create command to run.

vLLM

turboquant meta-llama/Llama-3.1-8B-Instruct --target vllm --bits 4

Auto-selects AWQ (best GPU throughput for vLLM).

LM Studio / llama.cpp

turboquant meta-llama/Llama-3.1-8B-Instruct --target lmstudio --bits 4
turboquant meta-llama/Llama-3.1-8B-Instruct --target llamacpp --bits 4

Publish to HuggingFace

Quantize any model and publish to HuggingFace Hub in one command:

turboquant meta-llama/Llama-3.1-8B-Instruct \
  --format gguf --bits 4 \
  --push-to-hub yourname/Llama-3.1-8B-Instruct-GGUF

Requires: huggingface-cli login or HF_TOKEN environment variable.


Quality Evaluation

turboquant meta-llama/Llama-3.1-8B-Instruct --format gguf --bits 4 --eval

| Perplexity | Grade | Meaning |
| --- | --- | --- |
| < 10 | EXCELLENT | Minimal quality loss |
| 10-20 | GOOD | Acceptable for most use cases |
| 20-50 | FAIR | Some degradation, consider higher bits |
| > 100 | POOR | Model may be broken |
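For reference, perplexity is the exponential of the mean negative log-likelihood per token, so lower is better; a model that is always 50/50 about the next token scores 2.0:

```python
import math

def perplexity(token_logprobs):
    """exp(mean negative log-likelihood), natural-log token probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# four tokens, each predicted with probability 0.5 → perplexity 2
print(round(perplexity([math.log(0.5)] * 4), 6))
```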

Smart Recommendations

turboquant meta-llama/Llama-3.1-8B-Instruct --recommend

Detects your hardware (Apple Silicon, NVIDIA GPU, CPU-only) and recommends the best format + bits.


GitHub Action

CI/CD pipeline for LLM quantization. Auto-quantize after fine-tuning.

# .github/workflows/quantize.yml
name: Quantize Model
on:
  workflow_dispatch:
    inputs:
      model:
        description: 'Model to quantize'
        required: true

jobs:
  quantize:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ShipItAndPray/turboquant@master
        with:
          model: ${{ inputs.model }}
          format: gguf
          bits: 4
          eval: true
          push-to-hub: yourname/model-GGUF
          hf-token: ${{ secrets.HF_TOKEN }}

Action Inputs

| Input | Required | Default | Description |
| --- | --- | --- | --- |
| model | Yes | | HuggingFace model ID or local path |
| format | No | gguf | gguf, gptq, awq, or all |
| bits | No | 4 | 2, 3, 4, 5, or 8 |
| target | No | | ollama, vllm, llamacpp, lmstudio |
| push-to-hub | No | | HuggingFace repo to upload to |
| eval | No | false | Run quality evaluation |
| hf-token | No | | HuggingFace API token |

LLM Formats

| Format | Best For | Engine | GPU? |
| --- | --- | --- | --- |
| GGUF | Local/CPU, Ollama, LM Studio | llama.cpp | No |
| GPTQ | GPU serving, high throughput | vLLM, TGI | Yes |
| AWQ | Fast GPU inference | vLLM, TGI | Yes |

Don't know? Run turboquant your-model --recommend.

Supported Architectures

LLaMA (1-3.3), Mistral/Mixtral, Qwen (1.5-2.5), Phi (1-4), GPT-2/J/NeoX, Gemma, DeepSeek, and any HuggingFace model with .safetensors or .bin weights.

All CLI Options

turboquant MODEL [OPTIONS]

Positional:
  MODEL                     HuggingFace model ID or local path

Formats:
  --format, -f FORMAT       gguf, gptq, awq, or all (default: gguf)
  --bits, -b BITS           2, 3, 4, 5, or 8 (default: 4)
  --output, -o DIR          Output directory (default: ./turboquant-output)

Target Platforms:
  --target, -t TARGET       ollama, vllm, llamacpp, lmstudio

Publishing:
  --push-to-hub REPO        Upload to HuggingFace Hub (e.g. user/model-GGUF)

Quality:
  --eval                    Run perplexity evaluation after quantization
  --recommend               Show hardware-aware format recommendation

Info:
  --info                    Show model details without quantizing
  --check                   Show available backends and hardware

Requirements

  • Python 3.9+
  • NumPy (only dependency for core vector engine + adapters)
  • Backend client library for your chosen adapter (see tables above)
  • For LLM CLI: PyTorch 2.0+ and backend-specific packages

License

MIT
