- Plain C/C++ implementation without dependencies
- Optimized for multimodal LLMs like Qwen2-VL and LLaVA
- Supported backends: ARM NEON, x86 AVX2, Qualcomm NPU (QNN), etc.
- Various quantization schemes
- End-to-end Android app demo
- Advanced support: MoE, Prompt Cache, etc.
mllm is a lightweight, fast, and easy-to-use on-device inference engine for (multimodal) LLMs, mainly targeting mobile CPUs and NPUs, initiated by the research groups led by Mengwei Xu (BUPT) and Xuanzhe Liu (PKU).
- [2025 July 30] Added a rotation quantization method for QNN backend models and added support for Qwen2-VL 2B.
- [2025 August 28] 🔥🔥🔥 Support for MLLM V1 is ending soon. Before its retirement, V1 will integrate the following features: GPT-OSS and NPU Qwen2-VL. MLLM will then transition to V2, which can be viewed on the V2 branch. V2 will include brand-new capabilities:
  - A more Pythonic model authoring approach with eager execution
  - Compilation support and an MLLM IR for easier NPU integration
  - Support for parallel execution of multiple models
  - A more refined engineering implementation
- Android Demo
- Supported models
- Quick Start
- Customization
- Roadmap
- Documentation
- Contribution
- Acknowledgments
- License
| Android Intent Invocation | Image Understanding |
|---|---|
| PhoneLM_Call.mp4 | Fuyu.mp4 |

| Chat CPU | Chat NPU |
|---|---|
| QWen1.5_Chat_CPU.mp4 | QWen1.5_Chat_NPU.mp4 |
| Model | CPU FP32 | CPU INT4 | Hexagon NPU INT8 |
|---|---|---|---|
| LLaMA 2 7B | ✔️ | ✔️ | |
| LLaMA 3 1B | ✔️ | ✔️ | |
| LLaMA 3 3B | ✔️ | ✔️ | |
| Alpaca 7B | ✔️ | ✔️ | |
| TinyLLaMA 1.1B | ✔️ | ✔️ | |
| LLaVA 7B | ✔️ | ✔️ | |
| Gemma 2B | ✔️ | ✔️ | |
| Gemma 2 2B | ✔️ | ✔️ | |
| Qwen 1.5 0.5B | ✔️ | ✔️ | ✔️ |
| Qwen 1.5 1.8B | ✔️ | ✔️ | ✔️ |
| Qwen 2.5 1.5B | ✔️ | ✔️ | ✔️ |
| Qwen 3 0.6B | ✔️ | ✔️ | |
| Mistral 7B | ✔️ | ✔️ | |
| Yi 6B | ✔️ | ✔️ | |
| StableLM 2 1.6B | ✔️ | ✔️ | |
| OPT 1.3B | ✔️ | ✔️ | |
| Phi 3 mini 3.8B | ✔️ | ✔️ | |
| MiniCPM 2B | ✔️ | ✔️ | |
| MiniCPM 3 4B | ✔️ | ✔️ | |
| MiniCPM MoE 8x2B | ✔️ | ✔️ | |
| SmolLM 1.7B | ✔️ | ✔️ | |
| DCLM 1B | ✔️ | ✔️ | |
| OpenELM 1.1B | ✔️ | ✔️ | |
| PhoneLM 1.5B | ✔️ | ✔️ | ✔️ |
| Model | CPU FP32 | CPU INT4 | Hexagon NPU INT8 |
|---|---|---|---|
| Fuyu 8B | ✔️ | ✔️ | |
| Vision Transformer | ✔️ | ✔️ | |
| CLIP | ✔️ | ✔️ | |
| ImageBind (3 modalities) | ✔️ | ✔️ | |
| LLaVA 7B | ✔️ | ✔️ | |
| Phi-3-Vision | ✔️ | ✔️ | |
| Qwen2-VL 2B | ✔️ | ✔️ | ✔️ |
git clone https://github.com/UbiquitousLearning/mllm
cd mllm
git submodule update --init --recursive \
third_party/googletest \
mllm/backends/cpu/third_party/kleidiai

Building mllm requires the following tools:
- gcc (11.4+) / clang (11.0+)
- CMake >= 3.18
- Android NDK Toolchains >= 26
Note that building the OpenMP libraries may fail on macOS due to the Apple LLVM compiler, so OpenMP is disabled on macOS by default and you may experience slower performance there. Building mllm on Linux is recommended.
NOTE: The QNN backend is a preliminary version that can already perform end-to-end inference. It is still under active development to improve performance and support more models.
We support running several Qwen-family models, including Qwen2-VL, with Qualcomm QNN to get Hexagon NPU acceleration on devices with the Snapdragon 8 Gen 3. The details of the QNN environment setup and design are described here. The prefill stage runs on the NPU (QNN) together with the CPU, and the autoregressive decoding stage runs on the CPU.
Specifically, we support the following models (models with similar architectures are also supported):
- Qwen 1.5 1.8B (demo_qwen_npu, demo_qwen_pipeline)
- Qwen 2.5 1.5B (demo_qwen_npu, demo_qwen_pipeline)
- Qwen 2 VL (demo_qwen2_vl_npu and demo_qwen2_vl_npuvit)
Build the target with the QNN backend.
cd ../script
./build_qnn_android.sh

Download the model from here, or use the following instructions to download it. You can also export PyTorch models for the QNN backend with int8 weight quantization and apply rotation quantization; details can be found in the backend-specific README.
mkdir ../models && cd ../models
# Download int8 model used by npu & q4k model used by cpu
wget https://huggingface.co/mllmTeam/qwen-1.5-1.8b-chat-mllm/resolve/main/qwen-1.5-1.8b-chat-int8.mllm?download=true -O qwen-1.5-1.8b-chat-int8.mllm
wget https://huggingface.co/mllmTeam/qwen-1.5-1.8b-chat-mllm/resolve/main/qwen-1.5-1.8b-chat-q4k.mllm?download=true -O qwen-1.5-1.8b-chat-q4k.mllm

Currently, the QNN backend uses models with W8A8 or W8A16 quantization. The scheme is determined by the Quantize and Dequantize ops in the modeling class; refer to mllm/models/qwen/modeling_qwen_npu_v2.hpp for details.
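For reference, here is a minimal, self-contained C++ sketch of what W8A8 means in this context: both the weight and the activation are quantized to int8 with a per-tensor scale, the matmul accumulates in int32, and the result is dequantized back to float (W8A16 keeps the activations at 16-bit precision instead). The type and function names below are illustrative only, not mllm's actual ops; in mllm, the corresponding Quantize/Dequantize ops are inserted explicitly in the NPU modeling class mentioned above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// A quantized tensor: int8 values plus one per-tensor scale (real = int8 * scale).
struct QuantTensor {
    std::vector<int8_t> data;
    float scale;
};

// Symmetric per-tensor int8 quantization (the "Quantize" step).
QuantTensor quantize_per_tensor(const std::vector<float>& x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    QuantTensor q;
    q.scale = amax > 0.0f ? amax / 127.0f : 1.0f;
    q.data.reserve(x.size());
    for (float v : x)
        q.data.push_back(static_cast<int8_t>(std::lround(v / q.scale)));
    return q;
}

// int8 weight x int8 activation matvec with an int32 accumulator,
// followed by the "Dequantize" step back to float.
std::vector<float> w8a8_matvec(const QuantTensor& w, int rows, int cols,
                               const QuantTensor& a) {
    std::vector<float> y(rows, 0.0f);
    for (int r = 0; r < rows; ++r) {
        int32_t acc = 0;
        for (int c = 0; c < cols; ++c)
            acc += static_cast<int32_t>(w.data[r * cols + c]) * a.data[c];
        y[r] = static_cast<float>(acc) * w.scale * a.scale;  // dequantize
    }
    return y;
}
```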
Run on an Android phone with at least 16 GB of memory, as building the QNN graphs on the device consumes a lot of memory. Once the QNN graphs have been built and saved to qnn_context.bin, runtime memory usage returns to the expected level. demo_qwen_pipeline.cpp demonstrates pipeline-parallel execution for QNN models, which yields nearly a 1.5x speedup over the original execution (a conceptual sketch of the overlap appears after the demo output below).
cd ../script
./run_qwen_qnn.sh

Results are as follows:
> ./demo_qwen_npu
[Q] <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Give me a short introduction to large language model.<|im_end|>
<|im_start|>assistant
[A] A short introduction to a large language model is a type of artificial intelligence language model that is designed to understand and generate human language text. These models are typically trained on large amounts of text data, such as books, articles, and other written materials, to learn the patterns and structures of human language. They use a combination of natural language processing (NLP)
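As a rough illustration of the pipeline-parallel idea behind demo_qwen_pipeline.cpp, the sketch below overlaps two stages across prompt chunks: while one chunk is being prefilled on the NPU, the CPU processes the previous chunk's output. This is a conceptual, self-contained example with made-up stage functions, assuming a chunked NPU/CPU overlap; it is not mllm's implementation, which schedules the actual QNN graph executions and CPU ops.

```cpp
#include <future>
#include <iostream>

// Stand-ins for the two pipeline stages; the names are illustrative only.
struct Activations { int chunk_id; };

Activations npu_prefill_chunk(int chunk_id) {
    // e.g. execute the pre-built QNN graph on one chunk of the prompt
    return Activations{chunk_id};
}

void cpu_stage(const Activations& act) {
    // e.g. the CPU-side ops for that chunk
    std::cout << "CPU finished chunk " << act.chunk_id << "\n";
}

int main() {
    const int num_chunks = 4;
    // Start chunk 0 on the "NPU" thread.
    std::future<Activations> in_flight =
        std::async(std::launch::async, npu_prefill_chunk, 0);
    for (int i = 1; i <= num_chunks; ++i) {
        Activations done = in_flight.get();  // wait for the previous chunk
        if (i < num_chunks)                  // launch the next chunk first...
            in_flight = std::async(std::launch::async, npu_prefill_chunk, i);
        cpu_stage(done);                     // ...so CPU work overlaps with it
    }
    return 0;
}
```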
export ANDROID_NDK=/path/to/your/ndk
cd scripts
./build_android.sh

Download the model from here, or use the following instructions.
mkdir ../models && cd ../models
# Download fuyu-8b-q4_k.mllm
wget https://huggingface.co/mllmTeam/fuyu-8b-mllm/resolve/main/fuyu-8b-q4_k.mllm?download=true -O fuyu-8b-q4_k.mllm

Run on an Android phone with at least 12 GB of memory.
cd ../script
./run_fuyu.sh

Results are as follows:
> ./demo_fuyu
[Q] [../assets/bus.png]Generate a coco-style caption.
[A] A blue bus driving down▁the▁road next▁to a sidewalk.
[Q] [../assets/two_cats.jpg]What's this?
[A] Two cats laying on▁top▁of a pink blanket.
Download the model from here, or use the following instructions.
mkdir ../models && cd ../models
# Download llama-2-7b-chat-q4_k.mllm
wget https://huggingface.co/mllmTeam/llama-2-7b-mllm/resolve/main/llama-2-7b-chat-q4_k.mllm?download=true -O llama-2-7b-chat-q4_k.mllm

Run on an Android phone with at least 12 GB of memory.
cd ../script
./run_llama.sh

Results are as follows:
> ./demo_llama
[Q] Hello, who are you?
[A]
I am a bot, I'm here to help answer your questions and provide information. I'm just an AI, I don't have personal experiences or feelings like humans do, but I'm here to help you with any questions you may have. Is there something specific you would like to know or talk about?
[Q] What can you do?
[A]
I can help with a wide range of tasks, including:
* Answering questions on various topics, such as science, history, technology, and more.
* Providing information on different subjects, such as culture, entertainment, and lifestyle.
* Helping with language-related tasks, such as translating text or improving grammar and syntax.
* Generating text, such as articles, stories, or even entire books.
* Assisting with coding
[Q] Please introduce Beijing University of Posts and Telecommunications.
[A]
Beijing University of Posts and Telecommunications (BUPT) is a public research university located in Beijing, China. It was established in 1955 and is one of the top universities in China, with a strong focus on engineering, technology, and telecommunications.
BUPT offers a wide range of undergraduate and graduate programs in fields such as computer science, electrical engineering, software engineering, and communication engineering. The university
Download the model from here, or use the following instructions.
mkdir ../models && cd ../models
# Download imagebind_huge-q4_k.mllm
wget https://huggingface.co/mllmTeam/imagebind_huge-mllm/resolve/main/imagebind_huge-q4_k.mllm?download=true -O imagebind_huge-q4_k.mllm

Run on an Android phone with at least 4 GB of memory.
cd ../script
./run_imagebind.sh

Results are as follows:
> ./demo_imagebind
vision X text :
0.9985647 0.0013827 0.0000526
0.0000365 0.9998636 0.0000999
0.0000115 0.0083149 0.9916736
vision X audio :
0.8054272 0.1228001 0.0717727
0.0673458 0.8429284 0.0897258
0.0021967 0.0015335 0.9962698
cd scripts
./build.sh

cd ./bin
./demo_fuyu -m ../models/fuyu-8b-q4_k.mllm -v ../vocab/fuyu_vocab.mllm

cd ./bin
./demo_llama -m ../models/llama-2-7b-chat-q4_k.mllm -v ../vocab/llama2_vocab.mllm

cd ./bin
./demo_imagebind -m ../models/imagebind_huge-q4_k.mllm -v ../vocab/clip_vocab.mllm

You can download models from here, or convert a PyTorch/safetensors model to an mllm model yourself.
cd tools/convertor
pip install -r ./requirements.txt
# for one file pytorch model
python converter.py --input_model=model.pth --output_model=model.mllm --type=torch
# for multi-file pytorch model
python converter.py --input_model=pytorch_model.bin.index.json --output_model=model.mllm --type=torch
# for one file safetensor model
python converter.py --input_model=model.safetensors --output_model=model.mllm --type=safetensor
# for multi-file safetensor model
python converter.py --input_model=model.safetensors.index.json --output_model=model.mllm --type=safetensor

You can convert a vocabulary to an mllm vocabulary as follows.
cd tools/convertor
python vocab.py --input_file=tokenizer.json --output_file=vocab.mllm --type=BPE

You can quantize an mllm model to an int4 model yourself. mllm only supports two quantization modes: Q4_0 and Q4_K.
cd bin
./quantize model.mllm model_q4_k.mllm Q4_K
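For reference, the sketch below shows ggml-style Q4_0 block quantization (mllm reuses low-level kernels from ggml on ARM CPU, see Acknowledgments): each block of 32 weights shares one scale and each weight becomes a 4-bit value. This is illustrative only; mllm's actual on-disk layout (e.g. fp16 scales, nibble packing) may differ, and Q4_K uses a more elaborate super-block scheme with additional per-sub-block scales.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>

constexpr int kBlockSize = 32;

// One Q4_0 block: 32 weights share a single scale; each weight is a 4-bit
// value around a fixed zero point of 8 (stored unpacked here for clarity).
struct BlockQ4_0 {
    float scale;
    std::array<uint8_t, kBlockSize> q;
};

BlockQ4_0 quantize_block(const float* x) {
    // Use the signed value with the largest magnitude so it maps to index 0.
    float max = 0.0f;
    for (int i = 0; i < kBlockSize; ++i)
        if (std::fabs(x[i]) > std::fabs(max)) max = x[i];
    BlockQ4_0 b;
    b.scale = max / -8.0f;
    const float id = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;
    for (int i = 0; i < kBlockSize; ++i)
        b.q[i] = static_cast<uint8_t>(std::min(15.0f, x[i] * id + 8.5f));
    return b;
}

float dequantize(const BlockQ4_0& b, int i) {
    return (static_cast<int>(b.q[i]) - 8) * b.scale;  // reconstruct the weight
}
```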
- More backends like QNN
- More models like PandaGPT
- More optimizations like LUT-GEMM
- More...
See the documentation here for more information.
Read the contribution guide before you contribute.
mllm reuses many low-level kernel implementations from ggml on ARM CPU. It also utilizes stb and wenet for pre-processing images and audio. mllm has also benefited from the following projects: llama.cpp and MNN.
This project is licensed under the terms of the MIT License. Please see the LICENSE file in the root directory for the full text of the MIT License.
Certain components (wenet) of this project are licensed under the Apache License 2.0. These components are clearly identified in their respective subdirectories along with a copy of the Apache License 2.0. For the full text of the Apache License 2.0, please refer to the LICENSE-APACHE file located in the relevant subdirectories.
@inproceedings{xu2025fast,
title={Fast On-device LLM Inference with NPUs},
author={Xu, Daliang and Zhang, Hao and Yang, Liming and Liu, Ruiqi and Huang, Gang and Xu, Mengwei and Liu, Xuanzhe},
booktitle={International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
year={2025}
}
@misc{yi2023mllm,
title = {mllm: fast and lightweight multimodal LLM inference engine for mobile and edge devices},
author = {Rongjie Yi and Xiang Li and Zhenyan Lu and Hao Zhang and Daliang Xu and Liming Yang and Weikai Xie and Chenghua Wang and Xuanzhe Liu and Mengwei Xu},
year = {2023},
publisher = {mllm Team},
url = {https://github.com/UbiquitousLearning/mllm}
}