
⚡ TurboTensors v4.1 — JET MODE (Experimental)

TurboTensors is an experimental, low-level CPU inference engine focused on minimizing framework overhead and maximizing real-time token generation on low-end and mid-range CPUs.

It is designed primarily for small to mid-sized language models (approximately 50M–300M parameters) and has been tested mainly with Turkish LLMs such as Kayra-1-exp.

IMPORTANT: TurboTensors is NOT a general-purpose replacement for optimized BLAS-based frameworks on high-end CPUs. It intentionally prioritizes low-overhead execution over heavy vectorized libraries.


🎯 Design Philosophy

Modern CPU inference stacks (PyTorch, oneDNN, MKL) perform exceptionally well on high-end CPUs, but introduce significant overhead on resource-constrained systems.

TurboTensors targets:

  • Older CPUs
  • Limited cache sizes
  • No AVX-512
  • GPU-less environments
  • Edge / experimental setups

The goal is predictable, low-latency inference, not peak FLOPS.


✨ Core Features

  • Numba-JIT Kernels (LLVM): Critical execution paths are JIT-compiled to native machine code, avoiding Python interpreter overhead.

  • Fused Operator Execution: Operations such as RMSNorm and activation functions are fused to reduce memory traffic and cache misses.

  • Prefill vs Decode Separation: Distinct computational paths for context understanding and token generation, reducing redundant work.

  • KV Cache Awareness: Aggressive reuse of key/value tensors during autoregressive decoding.

  • Zero-Copy Safetensors Loading: Weights are accessed with minimal memory duplication (BF16 / F16 / F32 supported).

  • Thread-Level Parallelism: Uses nogil-style execution and explicit thread control to avoid Python GIL bottlenecks.
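
To make "fused operator execution" concrete, here is a minimal sketch of a fused RMSNorm kernel in the Numba style the features above describe. This is not TurboTensors' actual kernel (the function name and loop structure are illustrative); it shows the pattern of computing the normalization statistic and applying the scale in a single pass so each activation row is read from memory only once.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # degrade gracefully to pure Python if Numba is absent
    def njit(**kwargs):
        return lambda f: f

@njit(cache=True, fastmath=True)
def fused_rmsnorm(x, weight, eps=1e-6):
    # Fused pass: accumulate the sum of squares, then scale in place,
    # instead of materializing an intermediate normalized tensor.
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        ss = 0.0
        for j in range(x.shape[1]):
            ss += x[i, j] * x[i, j]
        inv = 1.0 / np.sqrt(ss / x.shape[1] + eps)
        for j in range(x.shape[1]):
            out[i, j] = x[i, j] * inv * weight[j]
    return out
```

The first call pays the JIT compilation cost; subsequent calls run as native machine code, which is where the warmup time reported later in this README comes from.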


🆕 v4.1 Update (High-CPU + Constrained-CPU Tuning)

The latest optimization pass focuses on both high-core-count CPUs and low-resource profiles.

Key changes included in v4.1:

  • Faster decode attention path for single-token generation (vectorized implementation)
  • Lower decode loop allocation overhead (buffer reuse)
  • Faster Top-K sampling path using partial partitioning
  • Reduced cache-reset overhead between generations
  • Better small-batch behavior with serial kernel fallback where parallel overhead dominates
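
The "partial partitioning" Top-K path in the list above likely refers to the standard trick of selecting the k largest logits without fully sorting the vocabulary. A hedged sketch (the function name and signature are illustrative, not the project's API):

```python
import numpy as np

def topk_sample(logits, k, temperature=1.0, rng=None):
    """Sample a token id from the top-k logits without a full vocab sort."""
    if rng is None:
        rng = np.random.default_rng()
    # np.argpartition is O(V) versus O(V log V) for a full sort: it only
    # guarantees the k largest logits land in the last k slots, which is
    # all sampling needs.
    idx = np.argpartition(logits, -k)[-k:]
    probs = np.exp((logits[idx] - logits[idx].max()) / temperature)
    probs /= probs.sum()
    return idx[rng.choice(k, p=probs)]
```

For a 32K-token vocabulary this avoids sorting ~32,000 logits per generated token, which matters when sampling is a large fraction of the per-token budget (as the micro-benchmarks below suggest it is).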

📊 Performance Snapshot

Test model: Kayra-1-exp (~85M parameters)
Hardware: consumer-grade laptop CPU (non-server, no AVX-512)

TurboTensors v4.1: ~55-65 tokens/s (observed), very low first-token latency
HuggingFace (CPU): ~40-45 tokens/s (same machine, same model, comparison path)

NOTE: Results are hardware-dependent and primarily reflect performance on low to mid-tier CPUs. On high-end CPUs, optimized BLAS-based engines may outperform TurboTensors.


🔬 Benchmark Results

End-to-End Generation (Latest)

Measured with max_new_tokens=50 and the built-in benchmark/comparison flow on the current optimized code path:

| Engine | Speed (tokens/s) | Notes |
|---|---|---|
| TurboTensors v4.1 | 60.9 | Same model/tokenizer, CPU-only |
| HuggingFace CPU | 41.9 | Same machine/model, torch float32 |

Relative result on this run: TurboTensors ~1.5x faster.

Constrained CPU Benchmark (Raspberry Pi-like Simulation)

To emulate a low-resource environment, CPU threads were hard-limited (OMP/MKL/OPENBLAS/NUMBA, plus torch thread limits). This is a thread-constrained simulation, not a true ARM hardware benchmark.
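
The thread limits described above can be applied with the standard environment variables each threading layer reads. A sketch of reproducing the 1-thread row (the exact benchmark entry point is whatever you normally run, e.g. `main.py`):

```shell
# Pin every threading layer to a single thread before launching the benchmark.
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export NUMBA_NUM_THREADS=1
# For the HuggingFace comparison, additionally cap torch inside Python:
#   torch.set_num_threads(1)
```

These must be set before NumPy/Numba/torch are imported, since some libraries read them only at startup.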

Test setup:

  • Model: sixfingerdev/kayra-1-exp
  • Prompt: Türkiye
  • max_new_tokens=50, runs=3

| Threads | TurboTensors AVG (tok/s) | TurboTensors BEST (tok/s) | HuggingFace (tok/s) | TurboTensors / HF |
|---|---|---|---|---|
| 1 | 74.55 | 81.84 | 15.75 | 4.74x |
| 2 | 80.67 | 87.43 | 27.96 | 2.89x |
| 4 | 84.30 | 91.31 | 43.93 | 1.92x |

Interpretation:

  • TurboTensors remains significantly ahead under strict CPU limits.
  • As thread count increases, HuggingFace scales up, but TurboTensors still leads in these tests.

Core Operation Micro-Benchmarks (Reference)

Benchmarks of core TurboTensors operations running on a GitHub Actions runner (Ubuntu, 4-core CPU).

Note: These benchmarks measure individual JIT-compiled kernel performance using a standalone test script. They demonstrate the efficiency of the low-level operations but do not represent end-to-end model generation performance.

Core Operation Performance

| Operation | Time (ms) | Description |
|---|---|---|
| RMS Norm | 0.024 | Layer normalization |
| SiLU × Gate (Fused) | 0.060 | Fused activation function |
| Attention (Prefill, 32 tokens) | 0.436 | Multi-token attention |
| Attention (Decode, 1 token) | 0.093 | Single-token attention |
| Top-K Sampling | 2.410 | Token selection |

Theoretical Throughput (based on individual operations): ~386 tokens/second

This is a theoretical upper bound calculated from individual operation timings. Actual end-to-end generation performance will be significantly lower due to:

  • Sequential dependencies between operations
  • Memory transfer overhead
  • Multiple layer traversals
  • Additional operations (embedding lookups, projections, etc.)
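
The ~386 tokens/s bound appears to follow from summing one instance of each decode-path op and inverting. A quick check (dictionary keys are just labels, not project identifiers):

```python
# Per-token decode cost = one instance of each core decode op from the table.
op_ms = {
    "rms_norm": 0.024,
    "silu_gate_fused": 0.060,
    "attention_decode": 0.093,
    "top_k_sampling": 2.410,
}
per_token_ms = sum(op_ms.values())     # 2.587 ms per decoded token
throughput = 1000.0 / per_token_ms     # ~386 tokens/s upper bound
```

Note that Top-K sampling dominates this sum (2.410 of 2.587 ms), and that a real forward pass repeats the norm/activation/attention ops once per layer, which is the main reason end-to-end throughput is much lower.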

For real-world performance, refer to the "Performance Snapshot" section above showing ~55-65 tokens/s on consumer hardware.

Key Performance Characteristics

  • JIT Compilation: ~5.2 seconds warmup time (one-time cost)
  • Memory Efficiency: Zero-copy safetensors loading
  • Parallel Processing: Numba parallel loops for multi-core utilization
  • Cache Optimization: KV cache reuse during autoregressive decoding

Example Benchmark Output

The following output is from a standalone benchmark script that measures individual kernel performance:

======================================================================
TURBOTENSORS v4.0 - CORE OPERATIONS BENCHMARK
======================================================================

Configuration:
  Batch size: 1
  Sequence length: 128
  Hidden size: 640
  Attention heads: 10
  Head dimension: 64
  Vocabulary size: 32000

🔥 Warming up JIT kernels...
✓ Warmup completed in 5.25s

 [1/5] Benchmarking RMS Norm...
     Average time: 0.024 ms

 [2/5] Benchmarking SiLU * Gate (Fused)...
     Average time: 0.060 ms

 [3/5] Benchmarking Attention (Prefill)...
     Average time: 0.436 ms

 [4/5] Benchmarking Attention (Decode)...
     Average time: 0.093 ms

 [5/5] Benchmarking Top-K Sampling...
     Average time: 2.410 ms

======================================================================
BENCHMARK SUMMARY
======================================================================

Operation                    Time (ms)
----------------------------------------------------------------------
RMS Norm                        0.024
SiLU * Gate (Fused)             0.060
Attention (Prefill, 32 tok)     0.436
Attention (Decode, 1 tok)       0.093
Top-K Sampling                  2.410
----------------------------------------------------------------------

Theoretical decode throughput: ~386.6 tokens/second
(Based on individual operation timings)

======================================================================
✓ BENCHMARK COMPLETE!
======================================================================

Note: This benchmark output demonstrates kernel-level performance. Actual model generation performance depends on the complete architecture and is typically lower. See the "Performance Snapshot" section for real-world generation speeds.


🛠️ Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Step 1: Clone the Repository

git clone https://github.com/sixfingerdev/TurboTensors.git
cd TurboTensors

Step 2: Install Dependencies

pip install -r requirements.txt

The following packages will be installed:

  • numpy (>=1.21.0) - Numerical computing
  • numba (>=0.56.0) - JIT compilation for performance
  • transformers (>=4.30.0) - Model tokenizer support
  • huggingface_hub (>=0.16.0) - Model downloading

Optional dependency:

  • torch (>=2.0.0) - Only needed for Hugging Face comparison benchmarks (commented out in requirements.txt)

Step 3: Run the Code

python main.py

🚀 Usage

Basic Usage

from main import TurboLLM, download_model
from transformers import AutoTokenizer

# Define model to use
model_id = "sixfingerdev/kayra-1-exp"

# Download and load model
model_path = download_model(model_id)
model = TurboLLM(model_path)

# Load tokenizer (must match the model)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate text
output = model.generate(
    "Türkiye",
    max_new_tokens=50,
    temperature=0.8,
    top_k=50,
    repetition_penalty=1.2,
    tokenizer=tokenizer,
    stream=True
)

print(output)

Advanced Options

# Custom generation parameters
output = model.generate(
    prompt="Your prompt here",
    max_new_tokens=100,        # Number of tokens to generate
    temperature=0.8,            # Sampling temperature (0.0-2.0)
    top_k=50,                   # Top-K sampling
    repetition_penalty=1.2,     # Penalize repeated tokens
    tokenizer=tokenizer,
    stream=True                 # Stream output token by token
)

Note: KV caching is automatically enabled during generation to maximize performance. The engine reuses key/value tensors during autoregressive decoding, significantly reducing redundant computations.


🧠 Engineering Notes

Key ideas explored in this project:

  • Cache-friendly memory layouts (L1/L2 aware)
  • Partial Top-K sampling to avoid full logit sorting
  • Manual control over compute granularity
  • Avoidance of heavy framework abstractions

This project intentionally trades generality for clarity and control.


🚧 Limitations

  • Not optimized for AVX-512 or large-core-count servers
  • Not intended for very large models (1B+ parameters)
  • Limited numerical precision experiments so far

TurboTensors is best viewed as a research and engineering exploration, not a production-ready engine.


👨‍💻 Author

Enes Altıparmak (sixfingerdev)
Student — Kayseri Science High School

Interests:

  • CPU inference optimization
  • Turkish language models
  • Tokenization and memory efficiency
  • Low-level performance engineering

📌 Motivation

Understanding why systems are slow matters more than blindly using faster ones.

About

A high-performance CPU inference engine for LLMs, built with Numba-JIT kernels and custom memory management to outperform standard implementations on edge devices. Specifically optimized for Turkish LLMs (Kayra).
