Tool-Neuron GGML Backend

CPU-only LLM/VLM inference engine for Android, built on llama.cpp.

Overview

A production fork of llama.cpp stripped to the CPU backend and optimized for ARM Android devices. All GPU backends (CUDA, Metal, Vulkan, OpenCL) have been removed. A set of engine components is built on top for the Tool-Neuron Android app.

Kotlin SDK (gguf_lib)
    |
JNI bridge
    |
Engine layer (engine/)
  - GGMLEngine    model load/unload, generation, KV cache, context tracking
  - VLM Engine    vision and audio understanding (20+ architectures)
  - ToolManager   model-agnostic tool calling (JSON, XML, function-call)
  - RAG Engine    late chunking, binary quantized retrieval
  - Logging       callback-based, routes to Android logcat or custom handler
    |
llama.cpp core (src/ + common/)
    |
GGML CPU backend (ggml/)
  - NEON, i8mm, dotprod, fp16, bf16
  - KleidiAI ARM micro-kernels

Directory Structure

src/             llama.cpp model loading, tokenization, inference, sampling
include/         public C/C++ headers (llama.h, llama-cpp.h)
ggml/            tensor library, CPU backend only, ARM optimized
common/          chat templates, JSON schema grammar, sampling, jinja
engine/          engine layer (ggml-engine, vlm, tool-manager, rag-engine, tn-log)
  vlm/           vision/audio encoders (CLIP, SigLIP, Whisper, 20+ architectures)
vendor/          nlohmann/json, stb_image, miniaudio
cmake/           build-info, license, compiler flags
docs/            API reference, architecture, build guide, benchmarks

Supported Models

Any GGUF model works. All compute graphs from upstream llama.cpp are preserved.

  • Text: LLaMA, Mistral, Phi, Qwen, Gemma, DeepSeek, Command-R, and 100+ architectures
  • Vision: SmolVLM, LLaVA, Qwen2-VL, Qwen3-VL, InternVL, Pixtral, Gemma3-Vision, and 20+ VLM architectures
  • Audio: Whisper, Conformer encoders
  • Quantization: Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16, F32, IQ variants

Usage

This repo is consumed as a CMake subdirectory by an Android library module:

set(LLAMA_DIR "/path/to/this/repo")
add_subdirectory(${LLAMA_DIR} ${CMAKE_CURRENT_BINARY_DIR}/llama)
target_link_libraries(my_jni_lib tn-engine llama common ggml)

All public engine headers are pure C (extern "C") and safe for JNI binding.

Build

See docs/BUILD.md for full details. Key CMake variables:

Variable             Value                          Purpose
GGML_CPU             ON                             CPU backend
GGML_CPU_ARM_ARCH    armv8.6-a+i8mm+dotprod+fp16    ARM feature flags
GGML_CPU_KLEIDIAI    ON                             ARM KleidiAI micro-kernels
GGML_LTO             ON                             Link-time optimization
BUILD_SHARED_LIBS    OFF                            Static link into a single .so
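A plausible NDK cross-compile invocation combining these variables with the standard Android toolchain flags (the NDK path and API level are placeholders; docs/BUILD.md remains the canonical reference):

cmake -B build-android \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DGGML_CPU=ON \
  -DGGML_CPU_ARM_ARCH=armv8.6-a+i8mm+dotprod+fp16 \
  -DGGML_CPU_KLEIDIAI=ON \
  -DGGML_LTO=ON \
  -DBUILD_SHARED_LIBS=OFF
cmake --build build-android -j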

Performance

Tested on Cortex-X3 (armv9, i8mm, bf16, NEON, dotprod):

Model           Quant     Generation
LFM2-350M       Q8_0      29-30 t/s
SmolVLM-500M    Q8_0      28 t/s text, 22 t/s with vision
Qwen3-0.6B      Q8_0      17-19 t/s
Gemma3-1B       Q4_K_M    14 t/s

Documentation

Document         Description
API Reference    C API for GGMLEngine, VLM, ToolManager, RAG, Logging
Architecture     Stack diagram, directory map, data flows
Build Guide      CMake variables, NDK cross-compilation
Performance      Benchmarks, ARM optimizations, threading
Models           Supported architectures, quantization, sizing

License

MIT License -- see LICENSE.

Based on llama.cpp by Georgi Gerganov and contributors.
