AI-powered autonomous agent that controls computers through visual understanding and task planning.
OpenOperator uses multi-modal LLMs to analyze screenshots, create step-by-step plans, and execute desktop tasks autonomously — clicking, typing, navigating applications, and self-correcting when things go wrong. It supports Windows and macOS, with a web-based UI for task submission and live desktop viewing.
- Plan-and-Solve Workflow — generates structured plans, executes steps with validation, and dynamically replans on failure (inspired by arxiv.org/abs/2305.04091)
- Multi-Modal Perception — screenshot analysis via OmniParser (YOLOv8 + OCR) for UI element detection and text extraction
- Multi-Provider LLM Support — Azure OpenAI, OpenAI GPT-4o, Anthropic Claude, and Ollama (local inference) through a unified client
- Cross-Platform Computer Control — Windows (pywinauto), macOS (PyAutoGUI), and browser automation (Playwright)
- Web UI with Live Desktop View — Gradio chat interface with NoVNC integration for real-time desktop monitoring
- Observability — Elasticsearch + Kibana for logs, telemetry, and execution tracking
- Configuration-Driven Tasks — JSON scenario definitions with pre/post-task functions and evaluation criteria
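The plan–execute–replan cycle behind the first feature can be sketched in a few lines. This is a rough illustration only: the function names are hypothetical and do not mirror the real agent_oo4 internals, though the iteration limits echo the scenario config keys shown later.

```python
# Illustrative Plan-and-Solve loop; all names here are hypothetical.
MAX_PLAN_VERSIONS = 20      # cf. "max_plan_versions" in scenario configs
MAX_STEP_ITERATIONS = 3     # cf. "max_plan_step_iterations"

def run_task(instruction, plan_fn, execute_fn, replan_fn):
    """Plan, execute step by step, and replan when a step keeps failing."""
    plan = plan_fn(instruction)                # LLM drafts a step list
    for _version in range(MAX_PLAN_VERSIONS):
        for step in plan:
            for _attempt in range(MAX_STEP_ITERATIONS):
                if execute_fn(step):           # act, then validate the result
                    break                      # step succeeded, move on
            else:
                # Step failed every attempt: ask the LLM for a revised plan.
                plan = replan_fn(instruction, plan, step)
                break                          # restart with the new plan
        else:
            return True                        # every step succeeded
    return False                               # gave up after too many replans
```

The nested `for`/`else` keeps the control flow explicit: a step that exhausts its attempts triggers a replan, and only a plan whose every step succeeds ends the task.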
```mermaid
graph TD
    UI["Web UI<br/>(Gradio + NoVNC)"]
    Agent["Agent OO4"]
    Planner["Planner"]
    Executor["Executor (agent_me)"]
    Replanner["Replanner"]
    OmniParser["OmniParser Server<br/>(YOLOv8 + OCR)"]
    ComputerControl["Computer Control<br/>(Mouse/Keyboard)"]
    BrowserControl["Browser Control<br/>(Playwright)"]
    LLM["LLM Providers<br/>(Azure OpenAI / OpenAI / Claude / Ollama)"]
    Computer["Windows / macOS VM"]
    ELK["Elasticsearch + Kibana<br/>(Observability)"]

    UI --> Agent
    Agent --> Planner
    Planner --> Executor
    Executor --> Replanner
    Replanner --> Planner
    Planner --> LLM
    Executor --> LLM
    Replanner --> LLM
    Executor --> OmniParser
    Executor --> ComputerControl
    Executor --> BrowserControl
    ComputerControl --> Computer
    BrowserControl --> Computer
    Agent --> ELK
```
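To make the Executor-to-OmniParser hop above concrete, a client might send a base64-encoded screenshot and reduce the reply to clickable targets. The payload shape, the `elements`/`bbox`/`text` response fields, and the route name are assumptions for illustration, not the server's documented API:

```python
import base64
import json

OMNIPARSER_URL = "http://127.0.0.1:8000"  # matches the OMNIPARSER_URL default

def build_parse_request(png_bytes: bytes) -> bytes:
    # Hypothetical payload shape; the real server_omniparser schema may differ.
    return json.dumps(
        {"image_base64": base64.b64encode(png_bytes).decode()}
    ).encode()

def extract_clickable(parse_response: dict) -> list:
    # Reduce a (hypothetical) element list to (label, center point) pairs
    # that an executor could hand to the computer-control service.
    targets = []
    for el in parse_response.get("elements", []):
        x1, y1, x2, y2 = el["bbox"]
        targets.append((el["text"], ((x1 + x2) // 2, (y1 + y2) // 2)))
    return targets
```

An actual call would POST the request body to an endpoint under OMNIPARSER_URL (route name assumed) and feed the JSON reply to `extract_clickable`.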
- Python 3.12+
- uv package manager
- Docker & Docker Compose
- GPU recommended for OmniParser (CUDA 12.6+)
```sh
docker compose up
```

This starts OmniParser, Elasticsearch, and Kibana. Uncomment the windowscomputer and agentoo1 services in compose.yml for the full stack.
1. Start a Windows computer (Docker, requires KVM)

```sh
cd computers/windows/docker
docker compose up
```

See computers/README.md for other options (Parallels VM, macOS).
2. Start the OmniParser server

```sh
cd servers/server_omniparser
uv run server.py
```

3. Start the agent
```sh
cd agents/agent_oo4
uv venv && source .venv/bin/activate
uv sync
cd ..
uv run -m agent_oo4.main
```

4. Start the Web UI
```sh
cd ui
uv run app.py
# Open http://127.0.0.1:7860
```

Create .env files in the appropriate directories. Key variables:
| Variable | Description | Default |
|---|---|---|
| `AZURE_OPENAI_BASEURL` | Azure OpenAI endpoint URL | — |
| `AZURE_API_KEY` | Azure OpenAI API key | — |
| `AZURE_MODEL` | Azure model name | `gpt-4o` |
| `AZURE_MODEL_DEPLOYMENT_NAME` | Azure deployment name | — |
| `OMNIPARSER_URL` | OmniParser service URL | `http://127.0.0.1:8000` |
| `COMPUTER_CONTROL_URL` | Computer control service URL | `http://127.0.0.1:5050` |
| `BROWSER_CONTROL_URL` | Browser control service URL | `http://127.0.0.1:5051` |
| `OPENAI_API_KEY` | OpenAI API key (optional) | — |
| `ANTHROPIC_API_KEY` | Anthropic API key (optional) | — |
| `OLLAMA_URL` | Ollama endpoint (optional) | `http://127.0.0.1:11434` |
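A minimal sketch of reading these variables with their table defaults, using plain `os.getenv` semantics. The `ServiceConfig` dataclass is illustrative; the actual loader in agents/core may differ:

```python
import os
from dataclasses import dataclass

@dataclass
class ServiceConfig:
    # Field names are illustrative; env var names and defaults come from
    # the table above.
    omniparser_url: str
    computer_control_url: str
    browser_control_url: str
    azure_model: str

def load_config(env=os.environ) -> ServiceConfig:
    return ServiceConfig(
        omniparser_url=env.get("OMNIPARSER_URL", "http://127.0.0.1:8000"),
        computer_control_url=env.get("COMPUTER_CONTROL_URL", "http://127.0.0.1:5050"),
        browser_control_url=env.get("BROWSER_CONTROL_URL", "http://127.0.0.1:5051"),
        azure_model=env.get("AZURE_MODEL", "gpt-4o"),
    )
```

Passing `env` explicitly keeps the loader testable without mutating the process environment.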
Tasks are defined as JSON files in agents/configs/. Example (agents/configs/teams/scenario-2.json):

```json
{
  "instruction": "Find chats in Teams, switch between 5 chat threads, summarize the latest chat",
  "workflow.params": {
    "max_plan_versions": 20,
    "max_plan_step_iterations": 3,
    "max_plan_step_actions": 5
  },
  "environment.start": [
    { "func": "close_all_windows" },
    { "func": "start_network_proxy" },
    { "func": "open_application", "args": { "app_name": "ms-teams" } }
  ]
}
```

```
├── agents/
│   ├── agent_oo4/            # Main agent (Plan-and-Solve workflow)
│   │   └── workflow/         # Planner, Executor, Replanner nodes
│   ├── core/                 # Shared library (LLM clients, state, config)
│   ├── configs/              # Task scenario definitions (JSON)
│   └── functions/            # Pre/post-task functions
├── servers/
│   ├── server_omniparser/        # UI parsing service (YOLOv8 + OCR)
│   ├── server_computer_control/  # Mouse/keyboard control
│   ├── server_browser_control/   # Playwright browser automation
│   ├── server_network_proxy/     # MITM proxy for traffic capture
│   ├── server_evaluator/         # Task evaluation service
│   └── server_teams_control/     # Microsoft Teams automation
├── computers/
│   ├── windows/              # Windows VM setup (Docker/Parallels)
│   └── macos/                # macOS Docker setup
├── models/                   # Local model configurations
├── ui/                       # Web UI (Gradio + NoVNC)
├── infra/                    # Infrastructure as Code
├── compose.yml               # Docker Compose orchestration
└── docs/                     # Additional documentation
```
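A scenario file like the Teams example above can be loaded and sanity-checked in a few lines. The required keys are taken from that sample; the validation helper itself is illustrative, not the loader used by agents/core:

```python
import json
from pathlib import Path

# Top-level keys observed in the sample scenario JSON.
REQUIRED_KEYS = {"instruction", "workflow.params", "environment.start"}

def load_scenario(path: str) -> dict:
    """Load a scenario JSON and fail fast if required keys are absent."""
    scenario = json.loads(Path(path).read_text())
    missing = REQUIRED_KEYS - scenario.keys()
    if missing:
        raise ValueError(f"scenario missing keys: {sorted(missing)}")
    return scenario
```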
Elasticsearch and Kibana are included for observability. After running docker compose up, open Kibana at http://localhost:5601 to view agent logs and telemetry.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request