An iOS chat app that runs Google's Gemma 4 language models entirely on-device using Apple's MLX framework. No server, no API keys — just local inference on your iPhone or iPad.
- On-device inference — Models run locally via MLX, keeping your conversations private
- Two model options — Gemma 4 E2B (2B params, fastest) and E4B (4B params, smarter)
- Conversation history — Persisted locally with SwiftData
- Streaming responses — Token-by-token output as the model generates
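Token-by-token streaming is naturally expressed in Swift as an `AsyncStream`. A minimal sketch of the shape (the `generate` function and its canned tokens are illustrative, not the app's actual API):

```swift
import Foundation

// Illustrative only: a stand-in generator that yields tokens one at a time —
// the same shape an MLX-backed service would use to stream model output.
func generate(prompt: String) -> AsyncStream<String> {
    AsyncStream { continuation in
        // A real implementation would run MLX inference here;
        // this sketch just streams a canned reply token by token.
        for token in ["Hello", ",", " world", "!"] {
            continuation.yield(token)
        }
        continuation.finish()
    }
}

// Consuming the stream: append each token to the reply as it arrives,
// which is what lets the UI update mid-generation.
func collectReply() async -> String {
    var reply = ""
    for await token in generate(prompt: "Hi") {
        reply += token
    }
    return reply
}
```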
- Xcode 16+
- iOS 18.0+
- A physical device with an Apple Silicon chip (iPhone 15 Pro or later recommended)
- ~2–4 GB of free storage for model downloads
Note: The Simulator does not support MLX acceleration. Use a physical device for usable performance.
- Clone the repo:

  ```sh
  git clone https://github.com/vdthatte/gemma4-ios.git
  ```

- Open `App/gemma4.xcodeproj` in Xcode.
- Select your physical device as the run destination.
- Build and run. On first launch, the app will prompt you to download a model from HuggingFace (~1–2 GB).
```
App/gemma4/
├── Models/      # Data models (ChatMessage, Conversation, GemmaModel)
├── Services/    # MLXService — model loading & text generation
├── ViewModels/  # ChatViewModel — orchestrates UI state & inference
└── Views/       # SwiftUI views (ChatView, MessageBubbleView)
```
The architecture is MVVM with SwiftUI and SwiftData: `MLXService` wraps `mlx-swift-lm` for model management and generation, and `ChatViewModel` mediates between it and the views.
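The MVVM wiring can be sketched as follows. This is a hypothetical outline, not the app's actual code: the `TextGenerating` protocol and the `String`-based message list are simplifications standing in for `MLXService` and the SwiftData-backed `ChatMessage` model.

```swift
import Foundation

// Hypothetical stand-in for MLXService: anything that can stream tokens.
protocol TextGenerating {
    func stream(prompt: String) -> AsyncStream<String>
}

// Sketch of the view-model role: owns UI state, forwards work to the service.
@MainActor
final class ChatViewModel {
    private let service: TextGenerating
    private(set) var messages: [String] = []

    init(service: TextGenerating) {
        self.service = service
    }

    // Appends an empty assistant message, then grows it token by token —
    // this incremental mutation is what drives the streaming UI.
    func send(_ prompt: String) async {
        messages.append(prompt)
        messages.append("")
        for await token in service.stream(prompt: prompt) {
            messages[messages.count - 1] += token
        }
    }
}
```

Keeping the service behind a protocol makes the view model testable with a stub generator, with the real MLX-backed implementation injected in the app.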
Models are 4-bit quantized variants from the mlx-community on HuggingFace:
| Model | Params | HuggingFace ID |
|---|---|---|
| Gemma 4 E2B | 2B | `mlx-community/gemma-4-E2B-it-4bit` |
| Gemma 4 E4B | 4B | `mlx-community/gemma-4-e4b-it-4bit` |
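One way the table above could map onto the app's `GemmaModel` type is a simple enum keyed by HuggingFace repo ID. This is a sketch; the actual type in `Models/` may differ:

```swift
// Hypothetical representation of the two model options; the raw value is
// the HuggingFace repo ID that would be passed to the model downloader.
enum GemmaModel: String, CaseIterable {
    case e2b = "mlx-community/gemma-4-E2B-it-4bit"  // 2B params, fastest
    case e4b = "mlx-community/gemma-4-e4b-it-4bit"  // 4B params, smarter

    /// Repo ID used when fetching the model from HuggingFace.
    var hubID: String { rawValue }
}
```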