TurboTensors is an experimental, low-level CPU inference engine focused on minimizing framework overhead and maximizing real-time token generation on low-end and mid-range CPUs.
It is designed primarily for small to mid-sized language models (approximately 50M–300M parameters) and has been tested mainly with Turkish LLMs such as Kayra-1-exp.
IMPORTANT: TurboTensors is NOT a general-purpose replacement for optimized BLAS-based frameworks on high-end CPUs. It intentionally prioritizes low-overhead execution over heavy vectorized libraries.
Modern CPU inference stacks (PyTorch, oneDNN, MKL) perform exceptionally well on high-end CPUs, but introduce significant overhead on resource-constrained systems.
TurboTensors targets:
- Older CPUs
- Limited cache sizes
- No AVX-512
- GPU-less environments
- Edge / experimental setups
The goal is predictable, low-latency inference, not peak FLOPS.
- **Numba-JIT Kernels (LLVM)**: Critical execution paths are JIT-compiled to native machine code, avoiding Python interpreter overhead.
- **Fused Operator Execution**: Operations such as RMSNorm and activation functions are fused to reduce memory traffic and cache misses.
- **Prefill vs Decode Separation**: Distinct computational paths for context understanding and token generation, reducing redundant work.
- **KV Cache Awareness**: Aggressive reuse of key/value tensors during autoregressive decoding.
- **Zero-Copy Safetensors Loading**: Weights are accessed with minimal memory duplication (BF16 / F16 / F32 supported).
- **Thread-Level Parallelism**: Uses nogil-style execution and explicit thread control to avoid Python GIL bottlenecks.
The latest optimization pass targets both high-core-count CPUs and low-resource profiles.
Key changes included in v4.1:
- Faster decode attention path for single-token generation (vectorized implementation)
- Lower decode loop allocation overhead (buffer reuse)
- Faster Top-K sampling path using partial partitioning
- Reduced cache-reset overhead between generations
- Better small-batch behavior with serial kernel fallback where parallel overhead dominates
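The "partial partitioning" idea behind the faster Top-K path can be sketched with NumPy's `argpartition`, which selects the k largest logits without fully sorting the vocabulary. This is an illustrative standalone version, not TurboTensors' JIT-compiled implementation:

```python
import numpy as np

def top_k_sample(logits, k, temperature=1.0, rng=None):
    """Sample a token id from the k largest logits.

    np.argpartition finds the top-k set in O(V) time, avoiding the
    O(V log V) cost of fully sorting the logit vector."""
    if rng is None:
        rng = np.random.default_rng()
    top_idx = np.argpartition(logits, -k)[-k:]   # unordered top-k indices
    scaled = logits[top_idx] / temperature
    scaled = scaled - scaled.max()               # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(top_idx, p=probs))
```

For a 32K vocabulary, skipping the full sort matters because sampling runs once per generated token on the hot decode path.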
- Test model: Kayra-1-exp (~85M parameters)
- Hardware: consumer-grade laptop CPU (non-server, no AVX-512)
- TurboTensors v4.1: ~55-65 tokens/s observed, with very low first-token latency
- HuggingFace (CPU): ~40-45 tokens/s on the same machine and model (comparison path)
NOTE: Results are hardware-dependent and primarily reflect performance on low to mid-tier CPUs. On high-end CPUs, optimized BLAS-based engines may outperform TurboTensors.
Measured with max_new_tokens=50 and the built-in benchmark/comparison flow on the current optimized code path:
| Engine | Speed (tokens/s) | Notes |
|---|---|---|
| TurboTensors v4.1 | 60.9 | Same model/tokenizer, CPU-only |
| HuggingFace CPU | 41.9 | Same machine/model, torch float32 |
Relative result on this run: TurboTensors ~1.5x faster.
To emulate a low-resource environment, CPU threads were hard-limited (OMP/MKL/OPENBLAS/NUMBA, plus torch thread limits). This is a thread-constrained simulation, not a true ARM hardware benchmark.
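A thread-constrained run of this kind can be reproduced along the following lines. The exact benchmark script may differ; these are the standard environment variables for OpenMP/MKL/OpenBLAS/Numba, which must be set before NumPy, Numba, or Torch are imported:

```python
import os

N = "1"  # simulate a single-core budget
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS",
            "OPENBLAS_NUM_THREADS", "NUMBA_NUM_THREADS"):
    os.environ[var] = N  # must happen before numpy/numba are imported

try:
    import torch
    torch.set_num_threads(int(N))          # intra-op thread pool
    torch.set_num_interop_threads(int(N))  # inter-op thread pool
except ImportError:
    pass  # torch is only needed for the Hugging Face comparison path
```

Setting the variables after the libraries have initialized their thread pools has no effect, which is a common source of invalid thread-limited benchmarks.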
Test setup:
- Model: `sixfingerdev/kayra-1-exp`
- Prompt: "Türkiye"
- Parameters: `max_new_tokens=50`, `runs=3`
| Threads | TurboTensors AVG (tok/s) | TurboTensors BEST (tok/s) | HuggingFace (tok/s) | TurboTensors / HF |
|---|---|---|---|---|
| 1 | 74.55 | 81.84 | 15.75 | 4.74x |
| 2 | 80.67 | 87.43 | 27.96 | 2.89x |
| 4 | 84.30 | 91.31 | 43.93 | 1.92x |
Interpretation:
- TurboTensors remains significantly ahead under strict CPU limits.
- As thread count increases, HuggingFace scales up, but TurboTensors still leads in these tests.
Benchmarks of core TurboTensors operations running on a GitHub Actions runner (Ubuntu, 4-core CPU).
Note: These benchmarks measure individual JIT-compiled kernel performance using a standalone test script. They demonstrate the efficiency of the low-level operations but do not represent end-to-end model generation performance.
| Operation | Time (ms) | Description |
|---|---|---|
| RMS Norm | 0.024 | Layer normalization |
| SiLU × Gate (Fused) | 0.060 | Fused activation function |
| Attention (Prefill, 32 tokens) | 0.436 | Multi-token attention |
| Attention (Decode, 1 token) | 0.093 | Single-token attention |
| Top-K Sampling | 2.410 | Token selection |
Theoretical Throughput (based on individual operations): ~386 tokens/second
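The figure appears to follow from summing one instance of each decode-path kernel and inverting; a quick check of the arithmetic (timings from the table above, the reconstruction is mine):

```python
# Measured per-call kernel timings in milliseconds (from the table above)
decode_path_ms = {
    "rms_norm":        0.024,
    "silu_gate_fused": 0.060,
    "attn_decode":     0.093,
    "top_k_sampling":  2.410,
}
per_token_ms = sum(decode_path_ms.values())  # 2.587 ms per token
tokens_per_s = 1000.0 / per_token_ms         # ~386.5 tokens/s
```

Two things follow directly: each kernel is counted only once, which is why the bound ignores per-layer repetition; and Top-K sampling dominates the per-token cost (2.41 of 2.587 ms), which is consistent with v4.1 targeting the sampling path.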
This is a theoretical upper bound calculated from individual operation timings. Actual end-to-end generation performance will be significantly lower due to:
- Sequential dependencies between operations
- Memory transfer overhead
- Multiple layer traversals
- Additional operations (embedding lookups, projections, etc.)
For real-world performance, refer to the "Performance Snapshot" section above showing ~55-65 tokens/s on consumer hardware.
- JIT Compilation: ~5.2 seconds warmup time (one-time cost)
- Memory Efficiency: Zero-copy safetensors loading
- Parallel Processing: Numba parallel loops for multi-core utilization
- Cache Optimization: KV cache reuse during autoregressive decoding
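The cache-reuse pattern can be illustrated with a small NumPy sketch; the buffer shapes and names are assumptions for illustration, not the engine's actual layout:

```python
import numpy as np

class KVCache:
    """Preallocated key/value buffers for autoregressive decoding.

    Each decode step appends one row instead of recomputing K/V for
    the entire prefix, so per-token work stays constant."""
    def __init__(self, max_len, n_heads, head_dim):
        self.k = np.zeros((max_len, n_heads, head_dim), dtype=np.float32)
        self.v = np.zeros((max_len, n_heads, head_dim), dtype=np.float32)
        self.len = 0

    def append(self, k_new, v_new):
        # One new K/V row per generated token
        self.k[self.len] = k_new
        self.v[self.len] = v_new
        self.len += 1

    def view(self):
        # Zero-copy slices over the filled prefix, ready for attention
        return self.k[:self.len], self.v[:self.len]
```

Preallocating to `max_len` up front also avoids per-step allocations in the decode loop, matching the v4.1 buffer-reuse change.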
The following output is from a standalone benchmark script that measures individual kernel performance:
```
======================================================================
TURBOTENSORS v4.0 - CORE OPERATIONS BENCHMARK
======================================================================
Configuration:
  Batch size: 1
  Sequence length: 128
  Hidden size: 640
  Attention heads: 10
  Head dimension: 64
  Vocabulary size: 32000

🔥 Warming up JIT kernels... ✓ 5.25s
✓ Warmup completed in 5.25s

[1/5] Benchmarking RMS Norm...
  Average time: 0.024 ms
[2/5] Benchmarking SiLU * Gate (Fused)...
  Average time: 0.060 ms
[3/5] Benchmarking Attention (Prefill)...
  Average time: 0.436 ms
[4/5] Benchmarking Attention (Decode)...
  Average time: 0.093 ms
[5/5] Benchmarking Top-K Sampling...
  Average time: 2.410 ms

======================================================================
BENCHMARK SUMMARY
======================================================================
Operation                        Time (ms)
----------------------------------------------------------------------
RMS Norm                         0.024
SiLU * Gate (Fused)              0.060
Attention (Prefill, 32 tok)      0.436
Attention (Decode, 1 tok)        0.093
Top-K Sampling                   2.410
----------------------------------------------------------------------
Theoretical decode throughput: ~386.6 tokens/second
(Based on individual operation timings)
======================================================================
✓ BENCHMARK COMPLETE!
======================================================================
```
Note: This benchmark output demonstrates kernel-level performance. Actual model generation performance depends on the complete architecture and is typically lower. See the "Performance Snapshot" section for real-world generation speeds.
- Python 3.8 or higher
- pip package manager
```bash
git clone https://github.com/sixfingerdev/TurboTensors.git
cd TurboTensors
pip install -r requirements.txt
```

The following packages will be installed:
- numpy (>=1.21.0) - Numerical computing
- numba (>=0.56.0) - JIT compilation for performance
- transformers (>=4.30.0) - Model tokenizer support
- huggingface_hub (>=0.16.0) - Model downloading
Optional dependency:
- torch (>=2.0.0) - Only needed for Hugging Face comparison benchmarks (commented out in requirements.txt)
Run the demo:

```bash
python main.py
```

Or use the API directly:

```python
from main import TurboLLM, download_model
from transformers import AutoTokenizer

# Define model to use
model_id = "sixfingerdev/kayra-1-exp"

# Download and load model
model_path = download_model(model_id)
model = TurboLLM(model_path)

# Load tokenizer (must match the model)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate text
output = model.generate(
    "Türkiye",
    max_new_tokens=50,
    temperature=0.8,
    top_k=50,
    repetition_penalty=1.2,
    tokenizer=tokenizer,
    stream=True
)
print(output)
```

Custom generation parameters:

```python
output = model.generate(
    prompt="Your prompt here",
    max_new_tokens=100,      # Number of tokens to generate
    temperature=0.8,         # Sampling temperature (0.0-2.0)
    top_k=50,                # Top-K sampling
    repetition_penalty=1.2,  # Penalize repeated tokens
    tokenizer=tokenizer,
    stream=True              # Stream output token by token
)
```

Note: KV caching is automatically enabled during generation to maximize performance. The engine reuses key/value tensors during autoregressive decoding, significantly reducing redundant computations.
Key ideas explored in this project:
- Cache-friendly memory layouts (L1/L2 aware)
- Partial Top-K sampling to avoid full logit sorting
- Manual control over compute granularity
- Avoidance of heavy framework abstractions
This project intentionally trades generality for clarity and control.
- Not optimized for AVX-512 or large-core-count servers
- Not intended for very large models (1B+ parameters)
- Limited numerical precision experiments so far
TurboTensors is best viewed as a research and engineering exploration, not a production-ready engine.
Enes Altıparmak (sixfingerdev)
Student — Kayseri Science High School
Interests:
- CPU inference optimization
- Turkish language models
- Tokenization and memory efficiency
- Low-level performance engineering
Understanding why systems are slow matters more than blindly using faster ones.