The Problem: Most AI agents rely on expensive cloud APIs and large models that are impractical for real-world on-device deployment. Users face Privacy Concerns, Latency Issues, and High Costs when their phone needs to call external services for every interaction.
Our Solution: OpenPhone introduces the first Open-Source, 3B-parameter Agentic Foundation Model designed specifically for on-device smartphone interaction. This compact vision-language model runs entirely locally, meaning No Privacy Concerns, No Cloud Dependence, and Zero API Costs.
We believe the future of mobile AI lies not only in making models larger, but in making them smarter and more efficient for real-world constraints. Our 3B model is:
- Edge-Optimized: Efficient enough for commodity GPUs and next-generation mobile NPUs.
- Privacy-First: All computation stays on your device.
- Cost-Free: No cloud inference and no ongoing API fees.
- High-Performance: Achieves performance comparable to 7B-9B models through advanced training.
Considering the compute limitations of today's edge devices, models with ≤3B parameters strike a practical balance between capability and deployability. Based on this insight, we introduce OpenPhone-3B, a lightweight yet powerful on-device agent model.
- Model Size & Architecture: Vision-language model engineered for efficient on-device reasoning under tight mobile compute constraints.
- Edge-Native Design: Primary local agent compatible with consumer GPUs and mobile NPUs, eliminating continuous cloud dependency.
- GUI-Aware Action Capabilities: Trained for visual interpretation, instruction following, and structured action generation across real mobile tasks.
- Open-Source Release: Full model weights, configurations, and inference stack enabling community deployment and development.
- Practical Sweet Spot: 3B scale delivers an optimal balance: significantly stronger than tiny models while remaining deployable where larger models fail.
- Hardware Fit: 3B parameters align well with consumer GPU memory (8-12GB) and emerging mobile NPU computational budgets (see the quick memory estimate after this list).
- Speed Advantage: 3B models deliver 3-5x faster inference than 7B alternatives while maintaining competitive accuracy for sub-second GUI responses.
- Power Efficiency: Smaller footprint extends battery life - essential for mobile deployment where power consumption affects user experience.
- Privacy-First: Enables phone tasks to run entirely on-device, preserving user privacy while eliminating network dependencies.
- Cost Savings: Local processing eliminates expensive cloud APIs and per-request charges for sustainable operation.
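To make the hardware-fit point above concrete, here is a rough back-of-the-envelope estimate of weight memory at common precisions (our own illustration, not a measurement from the OpenPhone repository):

```python
# Back-of-the-envelope weight-memory estimate for a 3B-parameter model.
# Illustrative only: actual usage also includes KV cache, activations,
# and runtime overhead, and depends on the exact architecture.
PARAMS = 3e9

def weight_memory_gib(params: float, bytes_per_param: float) -> float:
    """Memory needed for the model weights alone, in GiB."""
    return params * bytes_per_param / (1024 ** 3)

for label, bytes_per_param in [("FP16/BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"{label:9s}: ~{weight_memory_gib(PARAMS, bytes_per_param):.1f} GiB")

# FP16/BF16 weights come to roughly 5.6 GiB, leaving headroom on an
# 8-12 GB consumer GPU; INT8/INT4 quantization shrinks this further
# toward mobile NPU memory budgets.
```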
- Model Weights: OpenPhone-3B is available on Hugging Face with full licensing for research and commercial use.
- Production-Ready Serving: Pre-configured vLLM inference scripts enable efficient deployment with optimized throughput and memory usage (see the inference sketch after this list).
- Reproducible Recipe: Full training implementation including our novel two-stage approach (SFT + GRPO-style RL with synthetic GUI data).
- Customization Support: Detailed documentation in model_training/ allows researchers to adapt the model for domain-specific phone tasks or extend to new mobile platforms.
- Data Generation Paradigm: Scripts and methodologies for creating high-quality training data at scale.
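As a quick orientation, below is a minimal offline-inference sketch using vLLM's Python API. The model path and prompt are placeholders; the pre-configured scripts in ./vllm_script/ remain the reference.

```python
# Minimal vLLM offline-inference sketch (illustrative; the repository's
# scripts in ./vllm_script/ are the reference). The model path is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/OpenPhone-3B", trust_remote_code=True)  # local weights or HF repo id
params = SamplingParams(temperature=0.0, max_tokens=256)

# OpenPhone is a vision-language model; vLLM also accepts multi-modal inputs,
# but a plain text prompt keeps this example short.
outputs = llm.generate(["Open the Settings app and enable Wi-Fi."], params)
print(outputs[0].outputs[0].text)
```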
- OpenPhone: Mobile Agentic Foundation Models for AI Phone
- What is OpenPhone?
- Research Highlights
- Model Release & Resources
- Table of Contents
- Quick Start
- Key Features of OpenPhone
- Technical Innovation & Implementation
- Testing & Evaluation
- Result Generation
- Evaluation Results
- Citation
- Related Projects
- License
This project comprises three core components designed for comprehensive mobile agent development and evaluation:
- For model training, please refer to the training guide README for comprehensive setup and execution instructions.
- For the data generation pipeline, please refer to the data preparation guide README for detailed implementation steps.
Below, we focus on evaluation using the AndroidLab benchmark framework.
Installation: Follow the official AndroidLab documentation for complete setup instructions.
Environment Configuration:
- Recommended Mode: AVD on Mac (arm64) - validated in our experiments.
- App Setup: Manual installation and task-specific configuration required.
- Compatibility Note: Original Docker images are not compatible with AVD environments.
vLLM Integration:
- Inference scripts available in ./vllm_script/ directory
- Optimized for efficient small model serving
Model Access:
- OpenPhone Weights: 3B parameter model hosted on HuggingFace
- Deployment Process: Download weights → Deploy via vLLM → Configure inference service
- Service Ready: Seamless integration with evaluation pipeline
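Once the weights are served (for example via `vllm serve`, which exposes an OpenAI-compatible endpoint), the evaluation pipeline can talk to the model over HTTP. A minimal query sketch, with placeholder endpoint and model name:

```python
# Querying a locally served OpenPhone model through vLLM's OpenAI-compatible API.
# Assumes a server is already running (e.g. `vllm serve <model>`); the base_url
# and model name below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="OpenPhone-3B",
    messages=[{"role": "user", "content": "Describe the next UI action to open Wi-Fi settings."}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```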
- API Setup Required: Configure cloud model credentials in ./evaluation/evaluation.py: Line 63, Line 75, Line 81
- Coming Soon: Streamlined configuration interface in development
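Until the streamlined configuration interface lands, the credentials in ./evaluation/evaluation.py typically boil down to an API key, an endpoint, and a model identifier. The snippet below only illustrates the general shape; the actual variable names at lines 63, 75, and 81 may differ.

```python
# Illustrative shape of the cloud-model settings expected by evaluation.py.
# Placeholder names only; check lines 63, 75, and 81 for the real variables.
CLOUD_API_KEY = "YOUR_API_KEY"                 # credential for the cloud MLLM provider
CLOUD_API_BASE = "https://api.example.com/v1"  # provider endpoint URL
CLOUD_MODEL_NAME = "your-cloud-model-id"       # model used for orchestration/evaluation
```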
- Compact Architecture: Specialized 3B-scale Vision-Language Models optimized for mobile GUI tasks with minimal computational footprint.
- On-Device Deployment: True smartphone-compatible models that maintain competitive performance while running locally without cloud dependency.
- Dynamic Orchestration: Real-time task complexity assessment that intelligently switches between device and cloud models based on execution requirements.
- Cost-Performance Optimization: Strategic resource allocation that leverages cost-efficient on-device models while compensating for their limitations through selective cloud model usage.
- Extended Benchmark Suite: Beyond AndroidLab, incorporating 25+ additional tasks across popular mobile applications for real-world validation.
- Multi-Dimensional Assessment: Comprehensive evaluation covering performance metrics, computational efficiency, and practical deployment scenarios.
- Synthetic Data Generation: Leverages advanced MLLMs to create high-quality reasoning-chain training data, addressing the scarcity of manual annotations.
- Two-Stage Training: SFT injects foundational GUI knowledge, while GRPO reinforcement learning optimizes task completion accuracy.
- Small Model Enhancement: Enables 3B models to achieve performance comparable to 7B-9B models on GUI tasks through structured training.
- Dynamic Task Assessment: Real-time complexity evaluation determines when and how frequently to monitor device model performance.
- Intelligent Orchestration: Seamlessly switches between device and cloud models based on execution progress and failure patterns (see the routing sketch after this list).
- Cost-Performance Optimization: Reduces cloud invocations by ~10% while maintaining high task success rates through strategic resource allocation.
- Long-Horizon Reasoning: Multi-step chain-of-thought reasoning with reflective error correction to enhance decision-making capabilities.
- Text-Based Summarization: Compresses high-resolution screenshots into compact textual representations for efficient memory management.
- Structured Context Retention: Maintains 10-20 steps of historical context in resource-constrained environments through optimized token usage.
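The routing idea behind the orchestration bullets above can be illustrated with a short sketch. Everything here (helper names, thresholds, the complexity check) is a hypothetical simplification rather than the actual OpenPhone implementation:

```python
# Simplified device-first orchestration loop. All helpers below are
# hypothetical stand-ins for the real on-device model, cloud model,
# and complexity/monitoring logic.
import random
from dataclasses import dataclass

@dataclass
class StepResult:
    success: bool
    done: bool  # whether the agent believes the task is finished

def run_device_step(task: str) -> StepResult:   # stand-in for the 3B on-device model
    return StepResult(success=random.random() > 0.3, done=random.random() > 0.8)

def run_cloud_step(task: str) -> StepResult:    # stand-in for the cloud MLLM
    return StepResult(success=random.random() > 0.1, done=random.random() > 0.6)

def assess_complexity(task: str) -> str:        # stand-in for task-complexity scoring
    return "hard" if "cross-app" in task else "easy"

def execute_task(task: str, max_steps: int = 20, failure_budget: int = 2) -> bool:
    failures = 0
    for _ in range(max_steps):
        # Escalate to the cloud model for hard tasks or after repeated failures;
        # otherwise keep execution local to save cost and latency.
        use_cloud = assess_complexity(task) == "hard" or failures >= failure_budget
        result = run_cloud_step(task) if use_cloud else run_device_step(task)
        failures = 0 if result.success else failures + 1
        if result.done:
            return True
    return False

print(execute_task("Turn on Wi-Fi in Settings"))
```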
Test individual tasks using the following command structure:
python eval.py -n test_name -c /path/to/config.yaml --task_id task_id

Example Usage:

python eval.py -n all_cloud_v1_hyper -c ./configs/example_xml_cloud_hyper.yaml --task_id zoom_1

Convenient batch testing scripts are available in ./test_script:
- all_test_cloud_v1_hyper.sh: Evaluates all 138 AndroidLab benchmark tasks
- all_test_cloud_v1_hyper_add.sh: Evaluates tasks for four additional mobile apps
For comprehensive details about the four additional app tasks, refer to the documentation: Additional Apps Documentation
Required Configuration: Set up LLM service credentials in ./evaluation/tasks/llm_evaluator.py:
- Line 10: API configuration
- Line 12: Service URL

Enhancement: Our implementation replaces AndroidLab's rule-based evaluation with LLM-powered assessment, providing more nuanced and accurate task completion evaluation.
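For context, this LLM-powered check conceptually amounts to an LLM-as-judge call like the hedged sketch below; the prompt wording, client configuration, and model name are placeholders rather than the code in llm_evaluator.py.

```python
# Illustrative LLM-as-judge completion check (placeholders only; the real
# logic lives in ./evaluation/tasks/llm_evaluator.py).
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")  # set per Lines 10/12

def judge_completion(task_goal: str, final_state_summary: str) -> bool:
    prompt = (
        f"Task goal: {task_goal}\n"
        f"Final device state: {final_state_summary}\n"
        "Answer 'yes' if the goal was accomplished, otherwise 'no'."
    )
    reply = client.chat.completions.create(
        model="your-judge-model",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")
```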
Execute result generation with the following command:
python generate_result.py --input_folder ./logs/evaluation/ --output_folder ./logs/evaluation/ --output_excel ./logs/evaluation/test_name.xlsx
- Manual Transfer Required: Move generated evaluation files from the script directory to ./logs/
- Then Execute: Run the result generation command above
- Error Prevention: This step prevents file path conflicts and ensures proper result compilation
- Size vs Performance: OpenPhone-3B achieves performance comparable to 9B models while maintaining the deployment advantages of a compact architecture.
- Efficiency Champion: Establishes itself as a genuine "small powerhouse" that challenges the bigger-is-better assumption in mobile AI.
- Against Proprietary Models: OpenPhone-3B shows respectable performance compared to lightweight versions of proprietary models when evaluated on standard benchmarks.
- Potential of Small Models: Demonstrates promising results that validate the viability of compact open-source approaches in mobile agent development.
- Performance with Efficiency: OpenPhone's hybrid architecture delivers near-optimal performance while dramatically reducing cloud model usage.
- Intelligent Routing: Proves that smart task routing creates practical efficiency gains without sacrificing capability.
- Context Matters: Extended prompting strategies only improve performance when paired with sufficiently capable cloud models.
- Smart Matching: Highlights the importance of matching reasoning complexity to model capability rather than assuming longer prompts always help.
To evaluate the practical efficiency of our hybrid approach, we measured key metrics across different MLLMs: average total steps per task, the proportion of steps handled by on-device versus cloud models, and cloud call reduction compared to cloud-only baselines.
Cloud models still handle approximately 65% of execution steps, reflecting the computational limitations of smaller on-device models for complex reasoning tasks.
Introducing on-device processing achieves roughly 10% reduction in cloud API calls, translating to direct cost savings and reduced latency.
Advanced cloud models like GLM-4.5V show smaller reductions in cloud dependency, as their superior capabilities enable more independent task completion without requiring on-device assistance.
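For readers reproducing these numbers, the metrics reduce to simple counts over per-step logs. The sketch below assumes a toy log format (a list of "device"/"cloud" labels per task plus a cloud-only baseline call count); adapt it to the actual evaluation logs.

```python
# Toy computation of the hybrid-efficiency metrics described above.
# The log format is an assumption for illustration, not the project's format.
def hybrid_metrics(task_step_sources: list[list[str]], cloud_only_calls: int) -> dict:
    total_steps = sum(len(steps) for steps in task_step_sources)
    cloud_steps = sum(s == "cloud" for steps in task_step_sources for s in steps)
    return {
        "avg_steps_per_task": total_steps / len(task_step_sources),
        "cloud_step_ratio": cloud_steps / total_steps,
        # Reduction is measured against a cloud-only run of the same tasks,
        # since the hybrid run may take a different number of steps overall.
        "cloud_call_reduction": 1 - cloud_steps / cloud_only_calls,
    }

logs = [["device", "cloud", "cloud"], ["cloud", "device", "cloud", "cloud"]]
print(hybrid_metrics(logs, cloud_only_calls=6))
```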
We evaluated average inference time per step using vLLM across different GPU configurations to assess real-world deployment feasibility. Note that GLM-4.1V-9B-Thinking could not operate on a single 3090 GPU due to context length constraints.
| Model | GPUs | Size | SR | Time Cost / Step |
|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | Single 3090 | 7B | 10.1 | 6289.15 ms |
| OpenPhone | Single 3090 | 3B | 15.2 | 4170.63 ms |
| GLM-4.1V-9B-Thinking | Two 3090s | 9B | 24.6 | 14584.89 ms |
| Qwen2.5-VL-7B-Instruct | Two 3090s | 7B | 10.1 | 4587.79 ms |
| OpenPhone | Two 3090s | 3B | 15.2 | 3524.25 ms |
- Clear Winner: OpenPhone demonstrates significant inference speed advantages thanks to its lightweight 3B architecture
- Real-World Ready: Speed benefits become increasingly pronounced under constrained computational resources, matching typical edge deployment scenarios
- 3.5x Faster: OpenPhone on single 3090 vs GLM-4.1V-9B-Thinking on dual 3090s.
- 4x Faster: OpenPhone on dual 3090s vs GLM-4.1V-9B-Thinking on dual 3090s.
- Deployment Flexibility: GLM-4.1V-9B-Thinking's inability to run on a single 3090 severely limits its edge deployment options, a constraint OpenPhone's lightweight 3B footprint avoids.
The trade-off is clear: while larger models like GLM-4.1V-9B-Thinking achieve higher task performance, OpenPhone's speed advantages make it far more suitable for real-world on-device scenarios where response time and hardware constraints matter.
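If you want to reproduce the per-step latency numbers against your own vLLM deployment, a rough timing loop looks like the sketch below (endpoint, model name, and prompt are placeholders; the table above was produced with the project's own evaluation pipeline).

```python
# Rough per-step latency measurement against a running vLLM endpoint.
# base_url, model name, and prompt are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def avg_step_latency_ms(prompt: str, n_runs: int = 10) -> float:
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        client.chat.completions.create(
            model="OpenPhone-3B",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=128,
            temperature=0.0,
        )
        latencies.append((time.perf_counter() - start) * 1000)
    return sum(latencies) / len(latencies)

print(f"avg latency per step: {avg_step_latency_ms('Tap the Wi-Fi toggle.'):.2f} ms")
```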
If you find this work helpful to your research, please consider citing our paper.
@article{jiang2025lightagent,
title={LightAgent: Mobile Agentic Foundation Models},
author={Jiang, Yangqin and Huang, Chao},
journal={arXiv preprint arXiv:2510.22009},
year={2025}
}
OpenPhone builds upon excellent open-source projects. We sincerely thank their authors and contributors:
- AndroidLab - The benchmark framework.
- R1-V - Implementation details for the GRPO training methodology.
- LLaMA Factory - The unified training framework enabling efficient model fine-tuning.
This project is released under the MIT License.




