MAM-AI

On-device medical search for nurses and midwives in Zanzibar

Demo Video · Gemma 3n Kaggle Challenge · Eval Report · Latency Report

MAM-AI is an Android application that provides medical information search for maternal and neonatal healthcare workers. It runs entirely on-device using Gemma 3n via Google AI Edge MediaPipe — no internet connection is needed after the initial model download. Users type clinical questions in natural language and receive guideline-grounded answers powered by on-device RAG (Retrieval-Augmented Generation).

Key Features

Fully offline — LLM inference, embedding, and vector search all run on the phone
On-device RAG — retrieves relevant passages from 2,826 pre-embedded medical guideline chunks via Gecko embeddings + SQLite cosine similarity
Streaming responses — answers appear token-by-token as they are generated
Conversation history — multi-turn conversations with persistent storage
Medical safety focus — prompt template emphasizes accuracy, simple language for second-language speakers, and emergency escalation
Gemma 3n E4B — 4.1 GB int4-quantized model, ~90s median query time on a Pixel 7

Architecture

┌─────────────────────────────────────────────────┐
│  Flutter UI (Dart)                              │
│  intro_page.dart · search_page.dart             │
│  conversation_store.dart                        │
├──────────────┬──────────────────────────────────┤
│ MethodChannel│  EventChannel (streaming)        │
├──────────────┴──────────────────────────────────┤
│  Android Native (Kotlin)                        │
│  MainActivity.kt · RagStream.kt                 │
│  ┌────────────────────────────────────────────┐ │
│  │ RagPipeline.kt                             │ │
│  │  ┌──────────┐ ┌──────────┐ ┌────────────┐ │ │
│  │  │ Gemma 3n │ │  Gecko   │ │  SQLite    │ │ │
│  │  │ MediaPipe│ │ Embedder │ │ VectorStore│ │ │
│  │  └──────────┘ └──────────┘ └────────────┘ │ │
│  └────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────┘

Query flow:

User types a clinical question in the Flutter chat UI
Query is sent to Android via platform MethodChannel
Gecko embeds the query → SQLite cosine similarity retrieves top-3 guideline chunks
Retrieved context + query + conversation history are assembled into a Gemma IT prompt
MediaPipe LLM generates a streaming response, sent back via EventChannel
Flutter renders the response as markdown in real time

Install

Download the APK from the GitHub Releases tab and install it on a real Android device. Emulators may not work — MediaPipe requires actual hardware acceleration.

On first launch, the app downloads ~4.5 GB of model files (LLM, embeddings model, tokenizer, vector database). After that, it works fully offline.

Repository Structure

mamai/
├── app/                    # Flutter + Android application
│   ├── lib/                #   Flutter frontend (Dart)
│   │   ├── main.dart
│   │   ├── screens/
│   │   │   ├── intro_page.dart      # License acceptance & model download
│   │   │   └── search_page.dart     # Main chat interface
│   │   └── conversation_store.dart  # Conversation persistence
│   ├── android/app/src/main/kotlin/com/example/app/
│   │   ├── MainActivity.kt          # Flutter ↔ Android channel bridge
│   │   ├── RagPipeline.kt           # Core RAG engine (LLM + embeddings + vector store)
│   │   ├── RagStream.kt             # Streaming & concurrency control
│   │   ├── BenchmarkActivity.kt     # Headless latency benchmarking
│   │   └── BenchmarkQueries.kt      # Predefined test queries
│   └── pubspec.yaml
├── rag/                    # Document preprocessing & chunking (Python)
│   ├── rag.py                       # Chunking, embedding, and RAG evaluation
│   ├── chunks_testing.py            # Chunk analysis utilities
│   └── text_extraction_json.py      # JSONL text extraction
├── finetune/               # Gemma 3n finetuning (Python, not deployed in app)
│   ├── main_training.py             # Training entry point
│   ├── config.py                    # Hyperparameters & paths
│   ├── model_setup.py               # LoRA + quantization setup
│   ├── data_processing.py           # QA dataset formatting
│   └── training.py                  # SFTTrainer wrapper
├── evaluation/             # Model quality & latency benchmarking
│   ├── run_eval.py                  # Main evaluation harness
│   ├── scoring.py                   # LLM-as-judge scoring
│   ├── benchmark_latency.py         # On-device latency analysis
│   ├── EVAL_REPORT.md               # Quality results (5 models × 6 datasets)
│   └── LATENCY_REPORT.md            # On-device latency results
└── CLAUDE.md               # Developer instructions for Claude Code

Building from Source

Prerequisites

Flutter SDK ≥ 3.8.1
Android SDK 24+ with NDK 27.0
A real Android device (not an emulator)

Build & Run

cd app
flutter pub get
flutter build apk        # Build release APK
flutter run               # Run on connected device

Monitor Performance

adb logcat -s mam-ai      # View timing, memory, and inference logs

RAG Document Pipeline

The offline document ingestion process:

Curate medical guideline PDFs
Extract text using MMORE
Chunk documents using the scripts in rag/
Copy chunks to app/assets/mamai_trim.txt
Uncomment memorizeChunks() in RagPipeline.kt, run the app (embeds chunks into SQLite)
Re-comment memorizeChunks() and pull embeddings.sqlite from the device with adb

cd rag
pip install -r requirements.txt.txt
python rag.py

Model Files

Downloaded on first launch from a temporary VPS and stored on-device:

File	Description	Source
`gemma-3n-E4B-it-int4.task`	Gemma 3n E4B LLM (int4 quantized, 4.1 GB)	Google
`Gecko_1024_quant.tflite`	Gecko embedding model (768-dim)	litert-community/Gecko-110m-en
`sentencepiece.model`	Gecko tokenizer	litert-community/Gecko-110m-en
`embeddings.sqlite`	Pre-computed embeddings for 2,826 guideline chunks	Generated via `rag/` pipeline

Note: Gemma requires license acceptance before use. The temporary VPS hosting these files will only remain up during the Kaggle challenge judging period. To self-host, update the download URLs in intro_page.dart and replace app/cert.pem with your server's TLS certificate.

Evaluation

We evaluated 5 models across 6 medical QA benchmarks (3 MCQ, 3 open-ended). See the full reports:

Eval Report — quality benchmarks
Latency Report — on-device performance

Quality Summary

Model	MCQ Avg	Open-ended Avg (/5)
GPT-5 (cloud baseline)	80.9%	4.47
Gemma 3n E4B (deployed)	45.6%	3.06
MedGemma 4B	44.5%	2.90
Meditron3 8B	41.0%	2.88
Gemma 3n E2B	41.4%	2.76

Gemma 3n E4B is the best on-device model across both MCQ accuracy and open-ended quality. Medical-domain finetuned models (MedGemma, Meditron3) did not consistently outperform it at this quantization level.

Latency Summary (Pixel 7)

Metric	E4B	E2B
Median query time	91s	205s
Decode throughput	3.3 tok/s	1.4 tok/s
Model load (warm)	~1.2s	~1.1s

E4B delivers consistent performance regardless of query length, while E2B degrades dramatically on medium/long queries.

Finetuning

We finetuned Gemma 3n E4B on medical QA data using LoRA (not yet deployed in the app).

cd finetune
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python main_training.py

Self-Hosting Model Files

To serve the model files from your own server:

Host the four model files behind nginx (or any HTTPS server)
Update the download URLs in app/lib/screens/intro_page.dart
Replace app/cert.pem with your server's TLS certificate
Rebuild the APK

License

This project is licensed under CC BY 4.0.

Name		Name	Last commit message	Last commit date
Latest commit History 84 Commits
.github/workflows		.github/workflows
app		app
evaluation		evaluation
finetune		finetune
gradle/wrapper		gradle/wrapper
rag		rag
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
GEMMA-DISTRIBUTION-NOTICE		GEMMA-DISTRIBUTION-NOTICE
LICENSE		LICENSE
README.md		README.md
gradlew		gradlew
gradlew.bat		gradlew.bat
settings.gradle		settings.gradle
view-log.sh		view-log.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MAM-AI

Key Features

Architecture

Install

Repository Structure

Building from Source

Prerequisites

Build & Run

Monitor Performance

RAG Document Pipeline

Model Files

Evaluation

Quality Summary

Latency Summary (Pixel 7)

Finetuning

Self-Hosting Model Files

License

About

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

MAM-AI

Key Features

Architecture

Install

Repository Structure

Building from Source

Prerequisites

Build & Run

Monitor Performance

RAG Document Pipeline

Model Files

Evaluation

Quality Summary

Latency Summary (Pixel 7)

Finetuning

Self-Hosting Model Files

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages