The Problem: Most AI agents rely on expensive cloud APIs and large models that are impractical for real-world on-device deployment. Users face Privacy Concerns, Latency Issues, and High Costs when their phone needs to call external services for every interaction.
Our Solution: OpenPhone introduces the first Open-Source, 3B-parameter Agentic Foundation Model designed specifically for on-device smartphone interaction. This compact vision-language model runs entirely locally, meaning No Privacy Concerns, No Cloud Dependence, and Zero API Costs.
We believe the future of mobile AI lies not only in making models larger, but in making them smarter and more efficient for real-world constraints. Our 3B model is:
- Edge-Optimized: Efficient enough for commodity GPUs and next-generation mobile NPUs.
- Privacy-First: All computation stays on your device.
- Cost-Free: No cloud inference and no ongoing API fees.
- High-Performance: Achieves performance comparable to 7B-9B models through advanced training.
Considering the compute limitations of today's edge devices, models with ≤3B parameters strike a practical balance between capability and deployability. Based on this insight, we introduce OpenPhone-3B, a lightweight yet powerful on-device agent model.
- Model Size & Architecture: Vision-language model engineered for efficient on-device reasoning under tight mobile compute constraints.
- Edge-Native Design: Primary local agent compatible with consumer GPUs and mobile NPUs, eliminating continuous cloud dependency.
- GUI-Aware Action Capabilities: Trained for visual interpretation, instruction following, and structured action generation across real mobile tasks.
- Open-Source Release: Full model weights, configurations, and inference stack enabling community deployment and development.
- Practical Sweet Spot: 3B scale delivers an optimal balance: significantly stronger than tiny models while remaining deployable where larger models fail.
- Hardware Fit: 3B parameters align well with consumer GPU memory (8-12GB) and emerging mobile NPU computational budgets (see the quick memory estimate after this list).
- Speed Advantage: 3B models deliver 3-5x faster inference than 7B alternatives while maintaining competitive accuracy for sub-second GUI responses.
- Power Efficiency: Smaller footprint extends battery life - essential for mobile deployment where power consumption affects user experience.
- Privacy-First: Enables phone tasks to run entirely on-device, preserving user privacy while eliminating network dependencies.
- Cost Savings: Local processing eliminates expensive cloud APIs and per-request charges for sustainable operation.
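To make the hardware-fit point above concrete, here is a rough back-of-the-envelope estimate of weight memory at common precisions (our own illustration, not a measurement from the OpenPhone repository):

```python
# Back-of-the-envelope weight-memory estimate for a 3B-parameter model.
# Illustrative only: actual usage also includes KV cache, activations,
# and runtime overhead, and depends on the exact architecture.
PARAMS = 3e9

def weight_memory_gib(params: float, bytes_per_param: float) -> float:
    """Memory needed for the model weights alone, in GiB."""
    return params * bytes_per_param / (1024 ** 3)

for label, bytes_per_param in [("FP16/BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"{label:9s}: ~{weight_memory_gib(PARAMS, bytes_per_param):.1f} GiB")

# FP16/BF16 weights come to roughly 5.6 GiB, leaving headroom on an
# 8-12 GB consumer GPU; INT8/INT4 quantization shrinks this further
# toward mobile NPU memory budgets.
```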
- Model Weights: OpenPhone-3B is available on Hugging Face with full licensing for research and commercial use.
- Production-Ready Serving: Pre-configured vLLM inference scripts enable efficient deployment with optimized throughput and memory usage (see the inference sketch after this list).
- Reproducible Recipe: Full training implementation including our novel two-stage approach (SFT + GRPO-style RL with synthetic GUI data).
- Customization Support: Detailed documentation in model_training/ allows researchers to adapt the model for domain-specific phone tasks or extend to new mobile platforms.
- Data Generation Paradigm: Scripts and methodologies for creating high-quality training data at scale.
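As a quick orientation, below is a minimal offline-inference sketch using vLLM's Python API. The model path and prompt are placeholders; the pre-configured scripts in ./vllm_script/ remain the reference.

```python
# Minimal vLLM offline-inference sketch (illustrative; the repository's
# scripts in ./vllm_script/ are the reference). The model path is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/OpenPhone-3B", trust_remote_code=True)  # local weights or HF repo id
params = SamplingParams(temperature=0.0, max_tokens=256)

# OpenPhone is a vision-language model; vLLM also accepts multi-modal inputs,
# but a plain text prompt keeps this example short.
outputs = llm.generate(["Open the Settings app and enable Wi-Fi."], params)
print(outputs[0].outputs[0].text)
```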
- OpenPhone: Mobile Agentic Foundation Models for AI Phone
- What is OpenPhone?
- Research Highlights
- Model Release & Resources
- Table of Contents
- Quick Start
- Key Features of OpenPhone
- Technical Innovation & Implementation
- Testing & Evaluation
- Result Generation
- Evaluation Results
- Citation
- Related Projects
- License
This project comprises three core components designed for comprehensive mobile agent development and evaluation:
- For model training, please refer to the training guide README for comprehensive setup and execution instructions.
- For the data generation pipeline, please refer to the data preparation guide README for detailed implementation steps.
Below, we focus on evaluation using the AndroidLab benchmark framework.
Installation: Follow the official AndroidLab documentation for complete setup instructions.
Environment Configuration:
- Recommended Mode: AVD on Mac (arm64) - validated in our experiments.
- App Setup: Manual installation and task-specific configuration required.
- Compatibility Note: Original Docker images are not compatible with AVD environments.
vLLM Integration:
- Inference scripts available in ./vllm_script/ directory
- Optimized for efficient small model serving
Model Access:
- OpenPhone Weights: 3B parameter model hosted on HuggingFace
- Deployment Process: Download weights → Deploy via vLLM → Configure inference service
- Service Ready: Seamless integration with evaluation pipeline
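Once the weights are served (for example via `vllm serve`, which exposes an OpenAI-compatible endpoint), the evaluation pipeline can talk to the model over HTTP. A minimal query sketch, with placeholder endpoint and model name:

```python
# Querying a locally served OpenPhone model through vLLM's OpenAI-compatible API.
# Assumes a server is already running (e.g. `vllm serve <model>`); the base_url
# and model name below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="OpenPhone-3B",
    messages=[{"role": "user", "content": "Describe the next UI action to open Wi-Fi settings."}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```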
- API Setup Required: Configure cloud model credentials in ./evaluation/evaluation.py: Line 63, Line 75, Line 81
- Coming Soon: Streamlined configuration interface in development
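Until the streamlined configuration interface lands, the credentials in ./evaluation/evaluation.py typically boil down to an API key, an endpoint, and a model identifier. The snippet below only illustrates the general shape; the actual variable names at lines 63, 75, and 81 may differ.

```python
# Illustrative shape of the cloud-model settings expected by evaluation.py.
# Placeholder names only; check lines 63, 75, and 81 for the real variables.
CLOUD_API_KEY = "YOUR_API_KEY"                 # credential for the cloud MLLM provider
CLOUD_API_BASE = "https://api.example.com/v1"  # provider endpoint URL
CLOUD_MODEL_NAME = "your-cloud-model-id"       # model used for orchestration/evaluation
```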
- Compact Architecture: Specialized 3B-scale Vision-Language Models optimized for mobile GUI tasks with minimal computational footprint.
- On-Device Deployment: True smartphone-compatible models that maintain competitive performance while running locally without cloud dependency.
- Dynamic Orchestration: Real-time task complexity assessment that intelligently switches between device and cloud models based on execution requirements.
- Cost-Performance Optimization: Strategic resource allocation that leverages cost-efficient on-device models while compensating for their limitations through selective cloud model usage.
- Extended Benchmark Suite: Beyond AndroidLab, incorporating 25+ additional tasks across popular mobile applications for real-world validation.
- Multi-Dimensional Assessment: Comprehensive evaluation covering performance metrics, computational efficiency, and practical deployment scenarios.
- Synthetic Data Generation: Leverages advanced MLLMs to create high-quality reasoning-chain training data, addressing the scarcity of manual annotations.
- Two-Stage Training: SFT injects foundational GUI knowledge, while GRPO reinforcement learning optimizes task completion accuracy.
- Small Model Enhancement: Enables 3B models to achieve performance comparable to 7B-9B models on GUI tasks through structured training.
- Dynamic Task Assessment: Real-time complexity evaluation determines when and how frequently to monitor device model performance.
- Intelligent Orchestration: Seamlessly switches between device and cloud models based on execution progress and failure patterns (see the routing sketch after this list).
- Cost-Performance Optimization: Reduces cloud invocations by ~10% while maintaining high task success rates through strategic resource allocation.
- Long-Horizon Reasoning: Multi-step chain-of-thought reasoning with reflective error correction to enhance decision-making capabilities.
- Text-Based Summarization: Compresses high-resolution screenshots into compact textual representations for efficient memory management.
- Structured Context Retention: Maintains 10-20 steps of historical context in resource-constrained environments through optimized token usage.
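The routing idea behind the orchestration bullets above can be illustrated with a short sketch. Everything here (helper names, thresholds, the complexity check) is a hypothetical simplification rather than the actual OpenPhone implementation:

```python
# Simplified device-first orchestration loop. All helpers below are
# hypothetical stand-ins for the real on-device model, cloud model,
# and complexity/monitoring logic.
import random
from dataclasses import dataclass

@dataclass
class StepResult:
    success: bool
    done: bool  # whether the agent believes the task is finished

def run_device_step(task: str) -> StepResult:   # stand-in for the 3B on-device model
    return StepResult(success=random.random() > 0.3, done=random.random() > 0.8)

def run_cloud_step(task: str) -> StepResult:    # stand-in for the cloud MLLM
    return StepResult(success=random.random() > 0.1, done=random.random() > 0.6)

def assess_complexity(task: str) -> str:        # stand-in for task-complexity scoring
    return "hard" if "cross-app" in task else "easy"

def execute_task(task: str, max_steps: int = 20, failure_budget: int = 2) -> bool:
    failures = 0
    for _ in range(max_steps):
        # Escalate to the cloud model for hard tasks or after repeated failures;
        # otherwise keep execution local to save cost and latency.
        use_cloud = assess_complexity(task) == "hard" or failures >= failure_budget
        result = run_cloud_step(task) if use_cloud else run_device_step(task)
        failures = 0 if result.success else failures + 1
        if result.done:
            return True
    return False

print(execute_task("Turn on Wi-Fi in Settings"))
```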
Test individual tasks using the following command structure:
python eval.py -n test_name -c /path/to/config.yaml --task_id task_id

Example Usage:

python eval.py -n all_cloud_v1_hyper -c ./configs/example_xml_cloud_hyper.yaml --task_id zoom_1

Convenient batch testing scripts are available in ./test_script:
- all_test_cloud_v1_hyper.sh: Evaluates all 138 AndroidLab benchmark tasks
- all_test_cloud_v1_hyper_add.sh: Evaluates tasks for four additional mobile apps
For comprehensive details about the four additional app tasks, refer to the documentation: Additional Apps Documentation
Required Configuration: Set up LLM service credentials in ./evaluation/tasks/llm_evaluator.py:
- Line 10: API configuration
- Line 12: Service URL

Enhancement: Our implementation replaces AndroidLab's rule-based evaluation with LLM-powered assessment, providing more nuanced and accurate task completion evaluation.
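For context, this LLM-powered check conceptually amounts to an LLM-as-judge call like the hedged sketch below; the prompt wording, client configuration, and model name are placeholders rather than the code in llm_evaluator.py.

```python
# Illustrative LLM-as-judge completion check (placeholders only; the real
# logic lives in ./evaluation/tasks/llm_evaluator.py).
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")  # set per Lines 10/12

def judge_completion(task_goal: str, final_state_summary: str) -> bool:
    prompt = (
        f"Task goal: {task_goal}\n"
        f"Final device state: {final_state_summary}\n"
        "Answer 'yes' if the goal was accomplished, otherwise 'no'."
    )
    reply = client.chat.completions.create(
        model="your-judge-model",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")
```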
Execute result generation with the following command:
python generate_result.py --input_folder ./logs/evaluation/ --output_folder ./logs/evaluation/ --output_excel ./logs/evaluation/test_name.xlsx
- Manual Transfer Required: Move generated evaluation files from the script directory to ./logs/
- Then Execute: Run the result generation command above
- Error Prevention: This step prevents file path conflicts and ensures proper result compilation
- Size vs Performance: OpenPhone-3B achieves performance comparable to 9B models while maintaining the deployment advantages of a compact architecture.
- Efficiency Champion: Establishes itself as a genuine "small powerhouse" that challenges the bigger-is-better assumption in mobile AI.
- Against Proprietary Models: OpenPhone-3B shows respectable performance compared to lightweight versions of proprietary models when evaluated on standard benchmarks.
- Potential of Small Models: Demonstrates promising results that validate the viability of compact open-source approaches in mobile agent development.
- Performance with Efficiency: OpenPhone's hybrid architecture delivers near-optimal performance while dramatically reducing cloud model usage.
- Intelligent Routing: Proves that smart task routing creates practical efficiency gains without sacrificing capability.
- Context Matters: Extended prompting strategies only improve performance when paired with sufficiently capable cloud models.
- Smart Matching: Highlights the importance of matching reasoning complexity to model capability rather than assuming longer prompts always help.
To evaluate the practical efficiency of our hybrid approach, we measured key metrics across different MLLMs: average total steps per task, the proportion of steps handled by on-device versus cloud models, and cloud call reduction compared to cloud-only baselines.
Cloud models still handle approximately 65% of execution steps, reflecting the computational limitations of smaller on-device models for complex reasoning tasks.
Introducing on-device processing achieves roughly 10% reduction in cloud API calls, translating to direct cost savings and reduced latency.
Advanced cloud models like GLM-4.5V show smaller reductions in cloud dependency, as their superior capabilities enable more independent task completion without requiring on-device assistance.
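For readers reproducing these numbers, the metrics reduce to simple counts over per-step logs. The sketch below assumes a toy log format (a list of "device"/"cloud" labels per task plus a cloud-only baseline call count); adapt it to the actual evaluation logs.

```python
# Toy computation of the hybrid-efficiency metrics described above.
# The log format is an assumption for illustration, not the project's format.
def hybrid_metrics(task_step_sources: list[list[str]], cloud_only_calls: int) -> dict:
    total_steps = sum(len(steps) for steps in task_step_sources)
    cloud_steps = sum(s == "cloud" for steps in task_step_sources for s in steps)
    return {
        "avg_steps_per_task": total_steps / len(task_step_sources),
        "cloud_step_ratio": cloud_steps / total_steps,
        # Reduction is measured against a cloud-only run of the same tasks,
        # since the hybrid run may take a different number of steps overall.
        "cloud_call_reduction": 1 - cloud_steps / cloud_only_calls,
    }

logs = [["device", "cloud", "cloud"], ["cloud", "device", "cloud", "cloud"]]
print(hybrid_metrics(logs, cloud_only_calls=6))
```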
We evaluated average inference time per step using vLLM across different GPU configurations to assess real-world deployment feasibility. Note that GLM-4.1V-9B-Thinking could not operate on a single 3090 GPU due to context length constraints.
| Model | GPUs | Size | SR | Time Cost / Step |
|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | Single 3090 | 7B | 10.1 | 6289.15 ms |
| OpenPhone | Single 3090 | 3B | 15.2 | 4170.63 ms |
| GLM-4.1V-9B-Thinking | Two 3090s | 9B | 24.6 | 14584.89 ms |
| Qwen2.5-VL-7B-Instruct | Two 3090s | 7B | 10.1 | 4587.79 ms |
| OpenPhone | Two 3090s | 3B | 15.2 | 3524.25 ms |
- Clear Winner: OpenPhone demonstrates significant inference speed advantages thanks to its lightweight 3B architecture
- Real-World Ready: Speed benefits become increasingly pronounced under constrained computational resources, matching typical edge deployment scenarios
- 3.5x Faster: OpenPhone on single 3090 vs GLM-4.1V-9B-Thinking on dual 3090s.
- 4x Faster: OpenPhone on dual 3090s vs GLM-4.1V-9B-Thinking on dual 3090s.
- Deployment Flexibility: GLM-4.1V-9B-Thinking's inability to run on a single 3090 severely limits its edge deployment options, a constraint OpenPhone's lightweight 3B footprint avoids.
The trade-off is clear: while larger models like GLM-4.1V-9B-Thinking achieve higher task performance, OpenPhone's speed advantages make it far more suitable for real-world on-device scenarios where response time and hardware constraints matter.
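If you want to reproduce the per-step latency numbers against your own vLLM deployment, a rough timing loop looks like the sketch below (endpoint, model name, and prompt are placeholders; the table above was produced with the project's own evaluation pipeline).

```python
# Rough per-step latency measurement against a running vLLM endpoint.
# base_url, model name, and prompt are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def avg_step_latency_ms(prompt: str, n_runs: int = 10) -> float:
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        client.chat.completions.create(
            model="OpenPhone-3B",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=128,
            temperature=0.0,
        )
        latencies.append((time.perf_counter() - start) * 1000)
    return sum(latencies) / len(latencies)

print(f"avg latency per step: {avg_step_latency_ms('Tap the Wi-Fi toggle.'):.2f} ms")
```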
If you find this work helpful to your research, please consider citing our paper.
@article{jiang2025lightagent,
title={LightAgent: Mobile Agentic Foundation Models},
author={Jiang, Yangqin and Huang, Chao},
journal={arXiv preprint arXiv:2510.22009},
year={2025}
}
OpenPhone builds upon excellent open-source projects. We sincerely thank their authors and contributors:
- AndroidLab - The benchmark framework.
- R1-V - Implementation details for the GRPO training methodology.
- LLaMA Factory - The unified training framework enabling efficient model fine-tuning.
This project is released under the MIT License.




