Open-AutoGLM (Secure Fork)

🔒 Security-hardened fork of zai-org/Open-AutoGLM with Russian/English localization and enhanced reliability features.

📖 What is This?

Phone Agent is an AI-powered framework for automating Android devices using Vision-Language Models (VLM). It captures screenshots, understands UI elements, and executes actions like tapping, swiping, and typing — all controlled by natural language commands.

Who is This For?

Audience	Use Case
QA Engineers	Automated UI testing without writing scripts
Accessibility	Voice-controlled phone automation for users with disabilities
Researchers	Studying AI agents and mobile automation
Developers	Prototyping AI-driven mobile applications

Example

python main.py --lang en "Open Chrome and search for weather forecast"

The agent will:

Take a screenshot
Identify the Chrome icon
Tap to open it
Find the search bar
Type the query
Report completion
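
The steps above amount to an observe-decide-act loop. The following is a minimal sketch of that loop, not the project's implementation: the callables `take_screenshot`, `ask_model`, and `execute` are hypothetical stand-ins, and the assumption that the model signals completion with a reply starting with `finish` is ours.

```python
from typing import Callable


def run_agent(
    task: str,
    take_screenshot: Callable[[], bytes],
    ask_model: Callable[[str, bytes], str],
    execute: Callable[[str], None],
    max_steps: int = 50,
) -> str:
    """Observe-decide-act loop; every callable is a hypothetical stand-in."""
    for step in range(1, max_steps + 1):
        image = take_screenshot()        # observe the current screen
        reply = ask_model(task, image)   # let the VLM choose the next action
        if reply.startswith("finish"):   # assumed completion signal
            return f"done after {step} step(s)"
        execute(reply)                   # act on the device
    return "stopped: max_steps reached"
```

Passing the device and model as plain callables keeps the loop testable without a phone or a model server attached.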

🏗️ Architecture

High-Level Overview

flowchart TB
    subgraph User["👤 User"]
        CLI[CLI / Python API]
    end
    
    subgraph Agent["🤖 Phone Agent Core"]
        direction TB
        PA[PhoneAgent]
        AC[AgentConfig]
        DS[DeviceState Checker]
        VA[Validator]
    end
    
    subgraph Model["🧠 AI Model"]
        MC[ModelClient]
        VLM[Vision-Language Model<br/>AutoGLM-Phone-9B]
    end
    
    subgraph Actions["⚡ Action Handler"]
        AH[ActionHandler]
        Parser[Safe Parser<br/>No eval!]
    end
    
    subgraph ADB["📱 ADB Layer"]
        direction TB
        Conn[Connection Manager]
        Dev[Device Control]
        SS[Screenshot]
        Input[Text Input]
    end
    
    subgraph Phone["📲 Android Device"]
        Screen[Screen Display]
        Apps[Applications]
    end
    
    CLI --> PA
    PA --> DS
    DS --> Conn
    PA --> MC
    MC <--> VLM
    VLM --> Parser
    Parser --> VA
    VA --> AH
    AH --> Dev
    AH --> Input
    Dev --> Phone
    Input --> Phone
    SS --> Phone
    SS --> MC

Detailed Component Diagram

flowchart LR
    subgraph phone_agent["📦 phone_agent/"]
        direction TB
        
        subgraph core["Core"]
            agent[agent.py<br/>PhoneAgent class]
            init[__init__.py<br/>Package exports]
        end
        
        subgraph adb["adb/"]
            connection[connection.py<br/>USB/WiFi/Remote]
            device[device.py<br/>tap, swipe, back]
            screenshot[screenshot.py<br/>Screen capture]
            input[input.py<br/>ADB Keyboard]
        end
        
        subgraph actions["actions/"]
            handler[handler.py<br/>Action execution]
        end
        
        subgraph model["model/"]
            client[client.py<br/>OpenAI-compatible API]
        end
        
        subgraph config["config/"]
            apps[apps.py<br/>App mappings]
            prompts_en[prompts_en.py]
            prompts_ru[prompts_ru.py]
            i18n[i18n.py<br/>Translations]
        end
        
        subgraph utils["Utilities (NEW)"]
            utils_py[utils.py<br/>Retry & Logging]
            device_state[device_state.py<br/>Pre-flight checks]
            validation[validation.py<br/>Response validation]
        end
    end
    
    agent --> handler
    agent --> client
    agent --> device_state
    handler --> device
    handler --> input
    agent --> screenshot

Execution Flow

sequenceDiagram
    autonumber
    participant U as User
    participant A as PhoneAgent
    participant DS as DeviceState
    participant M as VLM Model
    participant H as Handler
    participant D as Android Device
    
    U->>A: run("Open Settings")
    
    rect rgb(240, 248, 255)
        Note over A,DS: Pre-flight Check (NEW)
        A->>DS: check_device_state()
        DS->>D: ADB get-state
        DS->>D: Check screen on/off
        DS->>D: Check lock state
        DS-->>A: DeviceState{ready: true}
    end
    
    loop Until task complete or max_steps
        A->>D: Take screenshot
        D-->>A: PNG image
        A->>M: Send image + prompt
        M-->>A: do(action="Tap", element=[500,300])
        
        rect rgb(255, 248, 240)
            Note over A,H: Validation (NEW)
            A->>H: validate_action()
            H-->>A: ValidationResult{valid: true}
        end
        
        A->>H: execute(action)
        H->>D: ADB input tap 540 648
        D-->>H: Success
    end
    
    A-->>U: "Task completed"
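
In the diagram the model returns `element=[500,300]` while the handler taps `540 648`. That is consistent with normalized 0-999 coordinates being scaled to a 1080x2160 screen. The sketch below shows that arithmetic; the divisor of 1000, the screen size, and the function name `to_pixels` are assumptions made for illustration.

```python
def to_pixels(element, width, height, scale=1000):
    """Map a normalized [x, y] pair (0-999) to pixel coordinates.

    The divisor `scale` and integer rounding are assumptions,
    chosen so that the numbers in the diagram line up.
    """
    x, y = element
    return x * width // scale, y * height // scale


# [500, 300] on a 1080x2160 screen gives (540, 648), matching the diagram.
print(to_pixels([500, 300], width=1080, height=2160))
```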

⚠️ Why This Fork?

Original Project Issues

The original Open-AutoGLM has critical security vulnerabilities that make it unsafe for production use:

🔴 Critical: Remote Code Execution (RCE)

File: phone_agent/actions/handler.py (line 285)

# DANGEROUS - Original code
if response.startswith("do"):
    action = eval(response)  # ← Executes arbitrary Python code!

Risk: If an attacker compromises the model server or performs a MITM attack, they can inject malicious code:

# Attacker sends this instead of normal action:
do(action="Tap") or __import__('os').system('rm -rf /')

This would execute system commands on your machine.

🟡 Other Issues

Issue	Description
Chinese-only	Original prompts and UI are primarily in Chinese
No device checks	Agent starts without verifying device is ready
No retry logic	ADB commands fail silently on first error
No logging	Hard to debug issues
No validation	Invalid coordinates crash the agent
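
As an illustration of the retry and logging gap listed above, here is a minimal sketch of a retry decorator with exponential backoff. It is not the fork's `utils.py`; the name `retry`, its parameters, and the logger name are hypothetical.

```python
import logging
import time
from functools import wraps

logger = logging.getLogger("retry_sketch")  # hypothetical logger name


def retry(attempts=3, delay=0.5, backoff=2.0, exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped call, multiplying the wait by `backoff` after each failure."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as exc:
                    if attempt == attempts:
                        raise  # out of attempts: surface the error, do not hide it
                    logger.warning("%s failed (attempt %d/%d): %s",
                                   func.__name__, attempt, attempts, exc)
                    sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use the default `time.sleep` applies.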

🔒 Security Improvements

What We Fixed

Vulnerability	Solution
`eval()` RCE	Replaced with regex-based safe parser
No input validation	Added coordinate range checking (0-999)
No action whitelist	Only known actions are executed
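
The whitelist and range check in the table can be sketched as follows. This is an illustration under assumptions, not the fork's `validation.py`: the action names in `ALLOWED_ACTIONS`, the coordinate keys, and the function name `validate_action` are hypothetical.

```python
# Hypothetical whitelist; the fork's actual action names may differ.
ALLOWED_ACTIONS = {"Tap", "Swipe", "Type", "Back", "Home", "Launch", "Wait"}
COORDINATE_KEYS = ("element", "start", "end")  # assumed keys that carry [x, y]


def validate_action(action: dict) -> tuple[bool, str]:
    """Reject unknown actions and coordinates outside 0-999."""
    name = action.get("action")
    if name not in ALLOWED_ACTIONS:
        return False, f"unknown action: {name!r}"
    for key in COORDINATE_KEYS:
        if key not in action:
            continue
        point = action[key]
        in_range = (
            isinstance(point, list)
            and len(point) == 2
            and all(isinstance(v, int) and 0 <= v <= 999 for v in point)
        )
        if not in_range:
            return False, f"{key} must be two integers in 0-999"
    return True, "ok"
```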

Safe Parser Implementation

# NEW - Safe parsing without eval()
def _safe_parse_do_action(response: str) -> dict:
    """Parse do(...) using regex, not eval()."""
    import re
    import json
    
    result = {"_metadata": "do"}
    pattern = r'(\w+)\s*=\s*(?:"([^"]*)"|\[([^\]]*)\])'
    
    for match in re.finditer(pattern, response):
        key, str_val, arr_val = match.groups()
        if str_val is not None:
            result[key] = str_val
        elif arr_val is not None:
            result[key] = json.loads(f"[{arr_val}]")
    
    return result
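
To see how this approach treats the injection payload shown earlier, the sketch below reuses the same regex in a standalone function (renamed `safe_parse_do_action` here for illustration) and feeds it both a normal reply and the malicious one. The injected expression is never evaluated; it simply does not match the `key="value"` or `key=[...]` pattern and is dropped.

```python
import json
import re

# Same pattern as above: key="string" or key=[array]
_PATTERN = re.compile(r'(\w+)\s*=\s*(?:"([^"]*)"|\[([^\]]*)\])')


def safe_parse_do_action(response: str) -> dict:
    """Extract key/value pairs from do(...) without evaluating anything."""
    result = {"_metadata": "do"}
    for match in _PATTERN.finditer(response):
        key, str_val, arr_val = match.groups()
        if str_val is not None:
            result[key] = str_val
        elif arr_val is not None:
            result[key] = json.loads(f"[{arr_val}]")
    return result


normal = safe_parse_do_action('do(action="Tap", element=[500,300])')
attack = safe_parse_do_action("do(action=\"Tap\") or __import__('os').system('rm -rf /')")

print(normal)  # {'_metadata': 'do', 'action': 'Tap', 'element': [500, 300]}
print(attack)  # {'_metadata': 'do', 'action': 'Tap'}
```

One limitation of this pattern is that string values containing an escaped double quote are cut at the first quote, so text arguments may need extra handling.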

📲 Installation

Prerequisites

Component	Version	Notes
Python	3.10+	Required
ADB	Latest	Android SDK Platform Tools
Android Device	7.0+	USB debugging enabled
ADB Keyboard	-	Required for text input

Step 1: Install ADB

Windows

Download Platform Tools
Extract to C:\platform-tools

Add to PATH:

[Environment]::SetEnvironmentVariable("Path", [Environment]::GetEnvironmentVariable("Path", "User") + ";C:\platform-tools", "User")

Restart terminal

macOS

brew install android-platform-tools

Linux

sudo apt install android-tools-adb

Step 2: Enable USB Debugging on Android

Go to Settings → About Phone
Tap Build Number 7 times (enables Developer Options)
Go to Settings → Developer Options
Enable USB Debugging
Connect phone via USB
Accept the RSA key fingerprint prompt on the phone

Step 3: Install ADB Keyboard

Download and install ADB Keyboard APK:

adb install ADBKeyboard.apk

Enable it: Settings → Languages & Input → Virtual Keyboard → ADB Keyboard

Step 4: Install Phone Agent

git clone https://github.com/YOUR_USERNAME/Open-AutoGLM.git
cd Open-AutoGLM
pip install -e .

Step 5: Set Up Model

Option A: Cloud API (recommended for testing)

# BigModel API
export PHONE_AGENT_BASE_URL="https://open.bigmodel.cn/api/paas/v4"
export PHONE_AGENT_API_KEY="your-api-key"
export PHONE_AGENT_MODEL="autoglm-phone"

Option B: Local Model (requires GPU)

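# Deploy with vLLM or sglang
# --served-model-name must match the model_name the client sends
python -m vllm.entrypoints.openai.api_server \
    --model zai-org/AutoGLM-Phone-9B \
    --served-model-name autoglm-phone-9b \
    --port 8000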

Step 6: Verify Installation

python main.py --list-devices
# Should show your connected device

python main.py --lang en "Open Settings"
# Should navigate to Settings app
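
If you want to check the connection from a script rather than by eye, the output of `adb devices` can be parsed directly. The sketch below is independent of the project's code; the function names are hypothetical, and `list_devices` assumes `adb` is on your PATH.

```python
import subprocess


def parse_adb_devices(output: str) -> list[tuple[str, str]]:
    """Return (serial, state) pairs from the text printed by `adb devices`."""
    devices = []
    for line in output.splitlines():
        # Skip the header and any "* daemon ..." startup notices.
        if line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            devices.append((parts[0], parts[1]))
    return devices


def list_devices() -> list[tuple[str, str]]:
    """Run `adb devices` and parse its output."""
    completed = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, check=True
    )
    return parse_adb_devices(completed.stdout)
```

A state of `device` means the phone is ready; `unauthorized` usually means the USB debugging prompt has not been accepted yet.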

🚀 Usage

Command Line

# Interactive mode
python main.py --lang en

# Single task
python main.py --lang en "Open Chrome and search for Python tutorials"

# Russian interface
python main.py --lang ru "Открой настройки и проверь WiFi"

# Remote device
python main.py --connect 192.168.1.100:5555 --lang en "Open Gmail"
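
For readers extending the CLI, the flags used above could be declared with `argparse` roughly as follows. This is a sketch of one possible definition, not a copy of `main.py`; the defaults and help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags shown in the examples (defaults are assumed)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("task", nargs="?",
                        help="natural-language task; omit for interactive mode")
    parser.add_argument("--lang", choices=["en", "ru"], default="en",
                        help="interface language")
    parser.add_argument("--connect", metavar="HOST:PORT",
                        help="remote device address")
    parser.add_argument("--list-devices", action="store_true",
                        help="list connected devices and exit")
    return parser
```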

Python API

from phone_agent import PhoneAgent, setup_logging, check_device_state
from phone_agent.agent import AgentConfig
from phone_agent.model import ModelConfig
import logging

# Enable logging
setup_logging(logging.INFO, log_file="agent.log")

# Check device before starting
state = check_device_state()
if not state.is_ready:
    print(f"Device issues: {state.get_issues()}")
    raise SystemExit(1)

# Configure
model_config = ModelConfig(
    base_url="http://localhost:8000/v1",
    model_name="autoglm-phone-9b",
)

agent_config = AgentConfig(
    max_steps=50,
    lang="en",  # or "ru"
    check_device_state=True,  # Pre-flight checks enabled
)

# Run
agent = PhoneAgent(model_config, agent_config)
result = agent.run("Open Telegram and check messages")
print(f"Result: {result}")

📁 Project Structure

Open-AutoGLM/
├── main.py                    # CLI entry point
├── Dockerfile                 # 🆕 Docker container
├── docker-compose.yml         # 🆕 Docker Compose
├── phone_agent/
│   ├── __init__.py           # Package exports
│   ├── agent.py              # PhoneAgent class
│   ├── utils.py              # 🆕 Retry, logging
│   ├── device_state.py       # 🆕 Device checks
│   ├── validation.py         # 🆕 Response validation
│   ├── models.py             # 🆕 Pydantic models
│   ├── ui_tree.py            # 🆕 UI element detection
│   ├── api.py                # 🆕 REST API (FastAPI)
│   ├── web_ui.py             # 🆕 Web Dashboard
│   ├── adb/
│   │   ├── connection.py     # USB/WiFi/Remote
│   │   ├── device.py         # Tap, swipe, etc.
│   │   ├── screenshot.py     # Screen capture
│   │   └── input.py          # ADB Keyboard
│   ├── actions/
│   │   └── handler.py        # 🔧 Safe parser (fixed)
│   ├── model/
│   │   └── client.py         # OpenAI API client
│   └── config/
│       ├── apps.py           # App mappings
│       ├── prompts_en.py     # English prompts
│       ├── prompts_ru.py     # 🆕 Russian prompts
│       └── i18n.py           # Translations
├── tests/                     # 🆕 Unit tests
│   └── test_phone_agent.py
├── .github/workflows/         # 🆕 CI/CD
│   └── ci.yml
└── README.md

🌐 Web UI & REST API

Web Dashboard

Launch the web interface to monitor and control the agent:

python -m phone_agent.web_ui
# Open http://localhost:3000/ui

Features:

📱 Live device status (battery, screen, app)
🎯 Execute tasks via natural language
📋 Click on UI elements directly
📝 Action log with timestamps

REST API

Run the API server for programmatic access:

python -m phone_agent.api --host 127.0.0.1 --port 8080 --api-key your-secret-key

Endpoints:

Method	Endpoint	Description
GET	`/`	API status
GET	`/device`	Device state
GET	`/ui/tree`	UI elements
POST	`/task`	Execute task
POST	`/action`	Execute single action
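
A client call to `POST /task` might be built as in the sketch below, using only the standard library. The header name `X-API-Key` and the JSON body shape `{"task": ...}` are assumptions, since the request schema is not documented here; check `phone_agent/api.py` for the actual names.

```python
import json
import urllib.request


def build_task_request(base_url: str, api_key: str, task: str) -> urllib.request.Request:
    """Build (but do not send) a POST /task request."""
    body = json.dumps({"task": task}).encode("utf-8")  # assumed payload shape
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/task",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-API-Key": api_key,  # assumed header name
        },
    )
```

To send it, pass the result to `urllib.request.urlopen(...)` while the API server is running.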

Security features:

🔒 Localhost-only by default
🔑 API key authentication
⏱️ Rate limiting (60 req/min)
📋 Action whitelist
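
A limit of 60 requests per minute can be enforced with a sliding window. The sketch below shows the idea in plain Python; it is not the fork's implementation, and the class name and the choice of one limiter instance per client are assumptions.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls in any `window`-second span."""

    def __init__(self, limit=60, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._hits = deque()  # timestamps of accepted calls, oldest first

    def allow(self) -> bool:
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._hits and now - self._hits[0] >= self.window:
            self._hits.popleft()
        if len(self._hits) >= self.limit:
            return False
        self._hits.append(now)
        return True
```

A server would typically keep one limiter per API key or client address and answer with HTTP 429 when `allow()` returns `False`.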

🐳 Docker

Quick Start

# Build image
docker build -t phone-agent .

# Run with USB passthrough (Linux)
docker run --privileged -v /dev/bus/usb:/dev/bus/usb phone-agent

# Or use Docker Compose
docker-compose up

Environment Variables

Variable	Default	Description
`PHONE_AGENT_LANG`	`en`	Language (en/ru)
`PHONE_AGENT_BASE_URL`	`http://localhost:8000/v1`	Model API URL
`PHONE_AGENT_API_KEY`	-	Model API key
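
The variables in the table could be read into a small settings object as sketched below. The names `Settings` and `load_settings` are hypothetical; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    lang: str
    base_url: str
    api_key: Optional[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read PHONE_AGENT_* variables, applying the defaults from the table."""
    env = os.environ if env is None else env
    lang = env.get("PHONE_AGENT_LANG", "en")
    if lang not in ("en", "ru"):
        raise ValueError(f"unsupported PHONE_AGENT_LANG: {lang!r}")
    return Settings(
        lang=lang,
        base_url=env.get("PHONE_AGENT_BASE_URL", "http://localhost:8000/v1"),
        api_key=env.get("PHONE_AGENT_API_KEY"),  # no default: None if unset
    )
```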

🎯 UI Tree Parsing

Access UI elements programmatically for precise interactions:

from phone_agent import get_ui_tree, find_element_coordinates

# Get all UI elements
tree = get_ui_tree()

# Find element by text
button = tree.find_one(text="Submit", clickable=True)
if button:
    print(f"Found at {button.center}")  # (540, 800)

# Find all input fields
inputs = tree.get_input_fields()

# Find coordinates by text
coords = find_element_coordinates(text="Login")
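
For background, Android's `uiautomator dump` reports each element's rectangle as a `bounds` attribute of the form `[left,top][right,bottom]`. Whether `ui_tree.py` relies on that dump is an assumption on our part; the sketch below only shows how a center point such as `(540, 800)` follows from such a rectangle. The function name `center_of` is hypothetical.

```python
import re

_BOUNDS = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def center_of(bounds: str) -> tuple[int, int]:
    """Return the center of a uiautomator-style bounds string."""
    match = _BOUNDS.fullmatch(bounds)
    if match is None:
        raise ValueError(f"unrecognized bounds: {bounds!r}")
    left, top, right, bottom = map(int, match.groups())
    return (left + right) // 2, (top + bottom) // 2


print(center_of("[480,760][600,840]"))  # (540, 800)
```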

🧪 Testing

Run the test suite:

# Install dev dependencies
pip install pytest pytest-cov

# Run tests
pytest tests/ -v

# With coverage
pytest tests/ --cov=phone_agent --cov-report=html

📦 Pydantic Models

Type-safe configuration with validation:

from phone_agent import (
    ModelConfigPydantic,
    AgentConfigPydantic,
    ActionRequest,
    ActionType,
    Coordinates,
)

# Validated config (raises on invalid values)
model_config = ModelConfigPydantic(
    base_url="http://localhost:8000/v1",
    temperature=0.1,  # Must be 0.0-2.0
)

# Validated action
action = ActionRequest(
    action=ActionType.TAP,
    element=Coordinates(x=500, y=300),  # Must be 0-999
)

🔗 Links

Original Project: zai-org/Open-AutoGLM
Model (HuggingFace): AutoGLM-Phone-9B
Model (ModelScope): AutoGLM-Phone-9B
ADB Keyboard: senzhk/ADBKeyBoard

📄 License

Apache License 2.0 — see LICENSE.

⚠️ Disclaimer

This project is for research and educational purposes only. Do not use it for:

Unauthorized access to devices
Bypassing security measures
Any illegal activities

Always obtain proper authorization before automating any device.

DmitrL-dev/Open-AutoGLM-Russian-version

Folders and files

Latest commit

History

Repository files navigation