LiteRT - Quantized Edge Model Inference

Ultra-low latency inference with TensorFlow Lite

Quantized model support for edge deployment. Runs 4-bit/8-bit models with minimal memory footprint, ideal for mobile and embedded systems.

Formerly known as TensorFlow Lite, now LiteRT (Lite Runtime).

Reference: https://ai.google.dev/edge/litert

Overview

Purpose: Run quantized .tflite models with maximum efficiency

Use Cases:

Edge devices (phones, IoT)
Low-latency requirements
Limited RAM/VRAM
Battery-powered devices

Models:

Gemma LiteRT (e.g., google/gemma-3n-E4B-it-litert-lm)
Custom quantized models
TensorFlow Lite model zoo

Architecture

.tflite Model File
    ↓
LiteRT Interpreter
    ↓
Delegates (XNNPACK/GPU/NNAPI)
    ↓
Hardware (CPU/GPU/NPU)

Delegates:

XNNPACK: Optimized CPU inference
GPU: OpenGL/Metal/Vulkan acceleration
NNAPI: Android Neural Networks API
Core ML: iOS acceleration
QNN: Qualcomm Hexagon DSP

LiteRTManager

manager.py - Core inference manager

Initialization

from litert import LiteRTManager

manager = LiteRTManager()

Load Model

success = manager.load_model(
    model_path="models/gemma-3b-litert.tflite",
    use_xnnpack=True,  # CPU acceleration
    use_gpu=False,      # GPU acceleration
    num_threads=4       # CPU threads
)

Run Inference

result = manager.generate({
    "input": input_tensor  # numpy array
})

output = result['output']

Unload

manager.unload()

Gemma LiteRT Models

Example: google/gemma-3n-E4B-it-litert-lm

Quantization: E4B (4-bit quantization with 8-bit activations)

Download:

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="google/gemma-3n-E4B-it-litert-lm",
    filename="gemma-3n-E4B-it-litert-lm.tflite"
)

Usage:

manager = LiteRTManager()
manager.load_model(model_path, use_xnnpack=True)

# Text generation (simplified)
result = manager.generate({
    "input": tokenized_input
})

Delegates

XNNPACK (CPU Acceleration)

Best for: ARM/x86 CPUs
Performance: 2-4x faster than default
Setup: Automatic (built into TensorFlow Lite)

manager.load_model(model_path, use_xnnpack=True)

GPU Delegate

Best for: Mobile GPUs, discrete GPUs
Performance: 5-10x faster than CPU
Setup: Requires GPU support in TensorFlow Lite

manager.load_model(model_path, use_gpu=True)

Platforms:

Android: OpenGL ES
iOS: Metal
Desktop: OpenCL/Vulkan

NNAPI (Android Neural Networks)

Best for: Android devices with NPU
Performance: Hardware-dependent
Setup: Android 8.1+

# TODO: Implement NNAPI delegate

Core ML (iOS)

Best for: iOS devices (iPhone, iPad)
Performance: Hardware-dependent (Neural Engine)
Setup: iOS 11+

# TODO: Implement Core ML delegate

Quantization

Supported Formats:

INT8: 8-bit integer quantization
FLOAT16: Half-precision floating point
E4B: 4-bit weights + 8-bit activations (Gemma LiteRT)

Memory Savings:

FP32 → INT8: 4x smaller
FP32 → FLOAT16: 2x smaller
FP32 → E4B: ~8x smaller

Accuracy Trade-off:

INT8: Minimal accuracy loss (<1%)
FLOAT16: Negligible loss
E4B: Slightly higher loss, acceptable for most tasks

Performance

Model	Quantization	Size	Latency	Memory
Gemma 2B FP32	None	8GB	150ms	8GB RAM
Gemma 2B INT8	8-bit	2GB	80ms	2GB RAM
Gemma 3B E4B	4-bit	1GB	50ms	1.5GB RAM

Pixel 7 Pro (Tensor G2)

Desktop Performance:

Gemma 3B LiteRT: 45 tok/s @ 4 threads (i9-12900K)
XNNPACK enabled: 2.5x faster vs. default

Integration with Services

Future: LiteRT Service (gRPC)

# services/litert_service.py
class LiteRTServiceImpl(ml_inference_pb2_grpc.LiteRTServiceServicer):
    def LoadModel(self, request, context):
        manager = LiteRTManager()
        manager.load_model(request.model_path)
        return LoadModelResponse(success=True)
    
    def Generate(self, request, context):
        result = manager.generate({"input": request.input})
        return GenerateResponse(output=result['output'])

Examples

Basic Text Generation

from litert import LiteRTManager
import numpy as np

manager = LiteRTManager()

# Load model
manager.load_model(
    "gemma-3b-litert.tflite",
    use_xnnpack=True,
    num_threads=4
)

# Prepare input (tokenize first)
input_ids = np.array([[1, 2, 3, 4]], dtype=np.int32)

# Generate
result = manager.generate({"input": input_ids})

# Process output (detokenize)
output_ids = result['output']

Benchmark Inference

import time

manager = LiteRTManager()
manager.load_model("model.tflite", use_xnnpack=True)

# Warmup
for _ in range(10):
    manager.generate({"input": dummy_input})

# Benchmark
times = []
for _ in range(100):
    start = time.time()
    manager.generate({"input": input_data})
    times.append(time.time() - start)

avg_latency = np.mean(times) * 1000  # ms
print(f"Average latency: {avg_latency:.2f}ms")

Testing

# Unit tests
pytest tests/test_litert.py -v

# Benchmark
python -m litert.benchmark --model gemma-3b-litert.tflite

Limitations

Current:

No streaming support (batch only)
Limited model coverage (not all HF models have .tflite)
Manual tokenization required
Delegates not fully implemented

Not Supported:

GGUF models (use gguf-loader instead)
ONNX models (use onnx-loader instead)
FP32 models (convert to quantized first)

Converting Models

From PyTorch to LiteRT

import tensorflow as tf
from transformers import AutoModel

# 1. Export to TensorFlow
model = AutoModel.from_pretrained("model-name")
# ... conversion code ...

# 2. Quantize
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# 3. Save
with open("model.tflite", "wb") as f:
    f.write(tflite_model)

Note: Many models need custom conversion scripts. Check model documentation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LiteRT - Quantized Edge Model Inference

Overview

Architecture

LiteRTManager

Initialization

Load Model

Run Inference

Unload

Gemma LiteRT Models

Delegates

XNNPACK (CPU Acceleration)

GPU Delegate

NNAPI (Android Neural Networks)

Core ML (iOS)

Quantization

Performance

Integration with Services

Examples

Basic Text Generation

Benchmark Inference

Testing

Limitations

Converting Models

From PyTorch to LiteRT

See Also

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

LiteRT - Quantized Edge Model Inference

Overview

Architecture

LiteRTManager

Initialization

Load Model

Run Inference

Unload

Gemma LiteRT Models

Delegates

XNNPACK (CPU Acceleration)

GPU Delegate

NNAPI (Android Neural Networks)

Core ML (iOS)

Quantization

Performance

Integration with Services

Examples

Basic Text Generation

Benchmark Inference

Testing

Limitations

Converting Models

From PyTorch to LiteRT

See Also