CPU-only LLM/VLM inference engine for Android, built on llama.cpp.
A production fork of llama.cpp stripped to the CPU backend and optimized for ARM Android devices. All GPU backends (CUDA, Metal, Vulkan, OpenCL) have been removed. A set of engine components is built on top for the Tool-Neuron Android app.
```
Kotlin SDK (gguf_lib)
        |
    JNI bridge
        |
Engine layer (engine/)
  - GGMLEngine   model load/unload, generation, KV cache, context tracking
  - VLM Engine   vision and audio understanding (20+ architectures)
  - ToolManager  model-agnostic tool calling (JSON, XML, function-call)
  - RAG Engine   late chunking, binary-quantized retrieval
  - Logging      callback-based, routes to Android logcat or a custom handler
        |
llama.cpp core (src/ + common/)
        |
GGML CPU backend (ggml/)
  - NEON, i8mm, dotprod, fp16, bf16
  - KleidiAI ARM micro-kernels
```
| Directory | Contents |
|---|---|
| `src/` | llama.cpp core: model loading, tokenization, inference, sampling |
| `include/` | public C/C++ headers (`llama.h`, `llama-cpp.h`) |
| `ggml/` | tensor library, CPU backend only, ARM-optimized |
| `common/` | chat templates, JSON-schema grammar, sampling, jinja |
| `engine/` | engine layer (ggml-engine, vlm, tool-manager, rag-engine, tn-log) |
| `vlm/` | vision/audio encoders (CLIP, SigLIP, Whisper, 20+ architectures) |
| `vendor/` | nlohmann/json, stb_image, miniaudio |
| `cmake/` | build info, license, compiler flags |
| `docs/` | API reference, architecture, build guide, benchmarks |
Any GGUF model works. All compute graphs from upstream llama.cpp are preserved.
- Text: LLaMA, Mistral, Phi, Qwen, Gemma, DeepSeek, Command-R, and 100+ architectures
- Vision: SmolVLM, LLaVA, Qwen2-VL, Qwen3-VL, InternVL, Pixtral, Gemma3-Vision, and 20+ VLM architectures
- Audio: Whisper, Conformer encoders
- Quantization: Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16, F32, IQ variants
This repo is consumed as a CMake subdirectory by an Android library module:

```cmake
set(LLAMA_DIR "/path/to/this/repo")
add_subdirectory(${LLAMA_DIR} ${CMAKE_CURRENT_BINARY_DIR}/llama)
target_link_libraries(my_jni_lib tn-engine llama common ggml)
```

All public engine headers are pure C (`extern "C"`) and safe for JNI binding.
See docs/BUILD.md for full details. Key CMake variables:
| Variable | Value | Purpose |
|---|---|---|
| `GGML_CPU` | `ON` | CPU backend |
| `GGML_CPU_ARM_ARCH` | `armv8.6-a+i8mm+dotprod+fp16` | ARM feature flags |
| `GGML_CPU_KLEIDIAI` | `ON` | ARM KleidiAI micro-kernels |
| `GGML_LTO` | `ON` | Link-time optimization |
| `BUILD_SHARED_LIBS` | `OFF` | Static link into a single `.so` |
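A typical NDK cross-compile configure step with these variables might look like the following sketch. The NDK path, ABI, and platform level are placeholders for your environment; `docs/BUILD.md` has the authoritative command:

```shell
# Configure against the Android NDK toolchain file.
# ANDROID_NDK, the ABI, and the platform level are illustrative.
cmake -B build-android \
  -DCMAKE_TOOLCHAIN_FILE="$ANDROID_NDK/build/cmake/android.toolchain.cmake" \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DGGML_CPU=ON \
  -DGGML_CPU_ARM_ARCH=armv8.6-a+i8mm+dotprod+fp16 \
  -DGGML_CPU_KLEIDIAI=ON \
  -DGGML_LTO=ON \
  -DBUILD_SHARED_LIBS=OFF
cmake --build build-android -j
```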
Tested on Cortex-X3 (armv9, i8mm, bf16, NEON, dotprod):
| Model | Quant | Generation |
|---|---|---|
| LFM2-350M | Q8_0 | 29-30 t/s |
| SmolVLM-500M | Q8_0 | 28 t/s text, 22 t/s with vision |
| Qwen3-0.6B | Q8_0 | 17-19 t/s |
| Gemma3-1B | Q4_K_M | 14 t/s |
| Document | Description |
|---|---|
| API Reference | C API for GGMLEngine, VLM, ToolManager, RAG, Logging |
| Architecture | Stack diagram, directory map, data flows |
| Build Guide | CMake variables, NDK cross-compilation |
| Performance | Benchmarks, ARM optimizations, threading |
| Models | Supported architectures, quantization, sizing |
MIT License -- see LICENSE.
Based on llama.cpp by Georgi Gerganov and contributors.