TurboTensors is an experimental, low-level CPU inference engine focused on minimizing framework overhead and maximizing real-time token generation on low-end and mid-range CPUs.
It is designed primarily for small to mid-sized language models (approximately 50M–300M parameters) and has been tested mainly with Turkish LLMs such as Kayra-1-exp.
IMPORTANT: TurboTensors is NOT a general-purpose replacement for optimized BLAS-based frameworks on high-end CPUs. It intentionally prioritizes low-overhead execution over heavy vectorized libraries.
Modern CPU inference stacks (PyTorch, oneDNN, MKL) perform exceptionally well on high-end CPUs, but introduce significant overhead on resource-constrained systems.
TurboTensors targets:
- Older CPUs
- Limited cache sizes
- No AVX-512
- GPU-less environments
- Edge / experimental setups
The goal is predictable, low-latency inference, not peak FLOPS.
- **Numba-JIT Kernels (LLVM)**: Critical execution paths are JIT-compiled to native machine code, avoiding Python interpreter overhead.
- **Fused Operator Execution**: Operations such as RMSNorm and activation functions are fused to reduce memory traffic and cache misses.
- **Prefill vs Decode Separation**: Distinct computational paths for context understanding and token generation, reducing redundant work.
- **KV Cache Awareness**: Aggressive reuse of key/value tensors during autoregressive decoding.
- **Zero-Copy Safetensors Loading**: Weights are accessed with minimal memory duplication (BF16 / F16 / F32 supported).
- **Thread-Level Parallelism**: Uses nogil-style execution and explicit thread control to avoid Python GIL bottlenecks.
The latest optimization pass targets both high-core-count CPUs and low-resource profiles.
Key changes included in v4.1:
- Faster decode attention path for single-token generation (vectorized implementation)
- Lower decode loop allocation overhead (buffer reuse)
- Faster Top-K sampling path using partial partitioning
- Reduced cache-reset overhead between generations
- Better small-batch behavior with serial kernel fallback where parallel overhead dominates
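The "partial partitioning" idea behind the faster Top-K path can be sketched with NumPy's `argpartition`, which selects the k largest logits without fully sorting the vocabulary. This is an illustrative standalone version, not TurboTensors' JIT-compiled implementation:

```python
import numpy as np

def top_k_sample(logits, k, temperature=1.0, rng=None):
    """Sample a token id from the k largest logits.

    np.argpartition finds the top-k set in O(V) time, avoiding the
    O(V log V) cost of fully sorting the logit vector."""
    if rng is None:
        rng = np.random.default_rng()
    top_idx = np.argpartition(logits, -k)[-k:]   # unordered top-k indices
    scaled = logits[top_idx] / temperature
    scaled = scaled - scaled.max()               # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(top_idx, p=probs))
```

For a 32K vocabulary, skipping the full sort matters because sampling runs once per generated token on the hot decode path.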
- Test model: Kayra-1-exp (~85M parameters)
- Hardware: consumer-grade laptop CPU (non-server, no AVX-512)
- TurboTensors v4.1: ~55-65 tokens/s observed, with very low first-token latency
- HuggingFace (CPU): ~40-45 tokens/s on the same machine and model (comparison path)
NOTE: Results are hardware-dependent and primarily reflect performance on low to mid-tier CPUs. On high-end CPUs, optimized BLAS-based engines may outperform TurboTensors.
Measured with max_new_tokens=50 and the built-in benchmark/comparison flow on the current optimized code path:
| Engine | Speed (tokens/s) | Notes |
|---|---|---|
| TurboTensors v4.1 | 60.9 | Same model/tokenizer, CPU-only |
| HuggingFace CPU | 41.9 | Same machine/model, torch float32 |
Relative result on this run: TurboTensors ~1.5x faster.
To emulate a low-resource environment, CPU threads were hard-limited (OMP/MKL/OPENBLAS/NUMBA, plus torch thread limits). This is a thread-constrained simulation, not a true ARM hardware benchmark.
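A thread-constrained run of this kind can be reproduced along the following lines. The exact benchmark script may differ; these are the standard environment variables for OpenMP/MKL/OpenBLAS/Numba, which must be set before NumPy, Numba, or Torch are imported:

```python
import os

N = "1"  # simulate a single-core budget
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS",
            "OPENBLAS_NUM_THREADS", "NUMBA_NUM_THREADS"):
    os.environ[var] = N  # must happen before numpy/numba are imported

try:
    import torch
    torch.set_num_threads(int(N))          # intra-op thread pool
    torch.set_num_interop_threads(int(N))  # inter-op thread pool
except ImportError:
    pass  # torch is only needed for the Hugging Face comparison path
```

Setting the variables after the libraries have initialized their thread pools has no effect, which is a common source of invalid thread-limited benchmarks.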
Test setup:
- Model: `sixfingerdev/kayra-1-exp`
- Prompt: "Türkiye"
- Parameters: `max_new_tokens=50`, `runs=3`
| Threads | TurboTensors AVG (tok/s) | TurboTensors BEST (tok/s) | HuggingFace (tok/s) | TurboTensors / HF |
|---|---|---|---|---|
| 1 | 74.55 | 81.84 | 15.75 | 4.74x |
| 2 | 80.67 | 87.43 | 27.96 | 2.89x |
| 4 | 84.30 | 91.31 | 43.93 | 1.92x |
Interpretation:
- TurboTensors remains significantly ahead under strict CPU limits.
- As thread count increases, HuggingFace scales up, but TurboTensors still leads in these tests.
Benchmarks of core TurboTensors operations running on a GitHub Actions runner (Ubuntu, 4-core CPU).
Note: These benchmarks measure individual JIT-compiled kernel performance using a standalone test script. They demonstrate the efficiency of the low-level operations but do not represent end-to-end model generation performance.
| Operation | Time (ms) | Description |
|---|---|---|
| RMS Norm | 0.024 | Layer normalization |
| SiLU × Gate (Fused) | 0.060 | Fused activation function |
| Attention (Prefill, 32 tokens) | 0.436 | Multi-token attention |
| Attention (Decode, 1 token) | 0.093 | Single-token attention |
| Top-K Sampling | 2.410 | Token selection |
Theoretical Throughput (based on individual operations): ~386 tokens/second
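The figure appears to follow from summing one instance of each decode-path kernel and inverting; a quick check of the arithmetic (timings from the table above, the reconstruction is mine):

```python
# Measured per-call kernel timings in milliseconds (from the table above)
decode_path_ms = {
    "rms_norm":        0.024,
    "silu_gate_fused": 0.060,
    "attn_decode":     0.093,
    "top_k_sampling":  2.410,
}
per_token_ms = sum(decode_path_ms.values())  # 2.587 ms per token
tokens_per_s = 1000.0 / per_token_ms         # ~386.5 tokens/s
```

Two things follow directly: each kernel is counted only once, which is why the bound ignores per-layer repetition; and Top-K sampling dominates the per-token cost (2.41 of 2.587 ms), which is consistent with v4.1 targeting the sampling path.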
This is a theoretical upper bound calculated from individual operation timings. Actual end-to-end generation performance will be significantly lower due to:
- Sequential dependencies between operations
- Memory transfer overhead
- Multiple layer traversals
- Additional operations (embedding lookups, projections, etc.)
For real-world performance, refer to the "Performance Snapshot" section above showing ~55-65 tokens/s on consumer hardware.
- JIT Compilation: ~5.2 seconds warmup time (one-time cost)
- Memory Efficiency: Zero-copy safetensors loading
- Parallel Processing: Numba parallel loops for multi-core utilization
- Cache Optimization: KV cache reuse during autoregressive decoding
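The cache-reuse pattern can be illustrated with a small NumPy sketch; the buffer shapes and names are assumptions for illustration, not the engine's actual layout:

```python
import numpy as np

class KVCache:
    """Preallocated key/value buffers for autoregressive decoding.

    Each decode step appends one row instead of recomputing K/V for
    the entire prefix, so per-token work stays constant."""
    def __init__(self, max_len, n_heads, head_dim):
        self.k = np.zeros((max_len, n_heads, head_dim), dtype=np.float32)
        self.v = np.zeros((max_len, n_heads, head_dim), dtype=np.float32)
        self.len = 0

    def append(self, k_new, v_new):
        # One new K/V row per generated token
        self.k[self.len] = k_new
        self.v[self.len] = v_new
        self.len += 1

    def view(self):
        # Zero-copy slices over the filled prefix, ready for attention
        return self.k[:self.len], self.v[:self.len]
```

Preallocating to `max_len` up front also avoids per-step allocations in the decode loop, matching the v4.1 buffer-reuse change.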
The following output is from a standalone benchmark script that measures individual kernel performance:
```
======================================================================
TURBOTENSORS v4.0 - CORE OPERATIONS BENCHMARK
======================================================================
Configuration:
  Batch size: 1
  Sequence length: 128
  Hidden size: 640
  Attention heads: 10
  Head dimension: 64
  Vocabulary size: 32000

🔥 Warming up JIT kernels... ✓ 5.25s
✓ Warmup completed in 5.25s

[1/5] Benchmarking RMS Norm...
  Average time: 0.024 ms
[2/5] Benchmarking SiLU * Gate (Fused)...
  Average time: 0.060 ms
[3/5] Benchmarking Attention (Prefill)...
  Average time: 0.436 ms
[4/5] Benchmarking Attention (Decode)...
  Average time: 0.093 ms
[5/5] Benchmarking Top-K Sampling...
  Average time: 2.410 ms

======================================================================
BENCHMARK SUMMARY
======================================================================
Operation                        Time (ms)
----------------------------------------------------------------------
RMS Norm                         0.024
SiLU * Gate (Fused)              0.060
Attention (Prefill, 32 tok)      0.436
Attention (Decode, 1 tok)        0.093
Top-K Sampling                   2.410
----------------------------------------------------------------------
Theoretical decode throughput: ~386.6 tokens/second
(Based on individual operation timings)
======================================================================
✓ BENCHMARK COMPLETE!
======================================================================
```
Note: This benchmark output demonstrates kernel-level performance. Actual model generation performance depends on the complete architecture and is typically lower. See the "Performance Snapshot" section for real-world generation speeds.
- Python 3.8 or higher
- pip package manager
```bash
git clone https://github.com/sixfingerdev/TurboTensors.git
cd TurboTensors
pip install -r requirements.txt
```

The following packages will be installed:
- numpy (>=1.21.0) - Numerical computing
- numba (>=0.56.0) - JIT compilation for performance
- transformers (>=4.30.0) - Model tokenizer support
- huggingface_hub (>=0.16.0) - Model downloading
Optional dependency:
- torch (>=2.0.0) - Only needed for Hugging Face comparison benchmarks (commented out in requirements.txt)
Run the demo:

```bash
python main.py
```

Or use the API directly:

```python
from main import TurboLLM, download_model
from transformers import AutoTokenizer

# Define model to use
model_id = "sixfingerdev/kayra-1-exp"

# Download and load model
model_path = download_model(model_id)
model = TurboLLM(model_path)

# Load tokenizer (must match the model)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate text
output = model.generate(
    "Türkiye",
    max_new_tokens=50,
    temperature=0.8,
    top_k=50,
    repetition_penalty=1.2,
    tokenizer=tokenizer,
    stream=True
)
print(output)
```

Custom generation parameters:

```python
output = model.generate(
    prompt="Your prompt here",
    max_new_tokens=100,      # Number of tokens to generate
    temperature=0.8,         # Sampling temperature (0.0-2.0)
    top_k=50,                # Top-K sampling
    repetition_penalty=1.2,  # Penalize repeated tokens
    tokenizer=tokenizer,
    stream=True              # Stream output token by token
)
```

Note: KV caching is automatically enabled during generation to maximize performance. The engine reuses key/value tensors during autoregressive decoding, significantly reducing redundant computations.
Key ideas explored in this project:
- Cache-friendly memory layouts (L1/L2 aware)
- Partial Top-K sampling to avoid full logit sorting
- Manual control over compute granularity
- Avoidance of heavy framework abstractions
This project intentionally trades generality for clarity and control.
- Not optimized for AVX-512 or large-core-count servers
- Not intended for very large models (1B+ parameters)
- Limited numerical precision experiments so far
TurboTensors is best viewed as a research and engineering exploration, not a production-ready engine.
Enes Altıparmak (sixfingerdev)
Student — Kayseri Science High School
Interests:
- CPU inference optimization
- Turkish language models
- Tokenization and memory efficiency
- Low-level performance engineering
Understanding why systems are slow matters more than blindly using faster ones.