AI-powered autonomous agent that controls computers through visual understanding and task planning.
OpenOperator uses multi-modal LLMs to analyze screenshots, create step-by-step plans, and execute desktop tasks autonomously — clicking, typing, navigating applications, and self-correcting when things go wrong. It supports Windows and macOS, with a web-based UI for task submission and live desktop viewing.
- Plan-and-Solve Workflow — generates structured plans, executes steps with validation, and dynamically replans on failure (inspired by arxiv.org/abs/2305.04091)
- Multi-Modal Perception — screenshot analysis via OmniParser (YOLOv8 + OCR) for UI element detection and text extraction
- Multi-Provider LLM Support — Azure OpenAI, OpenAI GPT-4o, Anthropic Claude, and Ollama (local inference) through a unified client
- Cross-Platform Computer Control — Windows (pywinauto), macOS (PyAutoGUI), and browser automation (Playwright)
- Web UI with Live Desktop View — Gradio chat interface with NoVNC integration for real-time desktop monitoring
- Observability — Elasticsearch + Kibana for logs, telemetry, and execution tracking
- Configuration-Driven Tasks — JSON scenario definitions with pre/post-task functions and evaluation criteria
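The plan–execute–replan cycle behind the first feature can be sketched in a few lines. This is a rough illustration only: the function names are hypothetical and do not mirror the real agent_oo4 internals, though the iteration limits echo the scenario config keys shown later.

```python
# Illustrative Plan-and-Solve loop; all names here are hypothetical.
MAX_PLAN_VERSIONS = 20      # cf. "max_plan_versions" in scenario configs
MAX_STEP_ITERATIONS = 3     # cf. "max_plan_step_iterations"

def run_task(instruction, plan_fn, execute_fn, replan_fn):
    """Plan, execute step by step, and replan when a step keeps failing."""
    plan = plan_fn(instruction)                # LLM drafts a step list
    for _version in range(MAX_PLAN_VERSIONS):
        for step in plan:
            for _attempt in range(MAX_STEP_ITERATIONS):
                if execute_fn(step):           # act, then validate the result
                    break                      # step succeeded, move on
            else:
                # Step failed every attempt: ask the LLM for a revised plan.
                plan = replan_fn(instruction, plan, step)
                break                          # restart with the new plan
        else:
            return True                        # every step succeeded
    return False                               # gave up after too many replans
```

The nested `for`/`else` keeps the control flow explicit: a step that exhausts its attempts triggers a replan, and only a plan whose every step succeeds ends the task.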
```mermaid
graph TD
    UI["Web UI<br/>(Gradio + NoVNC)"]
    Agent["Agent OO4"]
    Planner["Planner"]
    Executor["Executor (agent_me)"]
    Replanner["Replanner"]
    OmniParser["OmniParser Server<br/>(YOLOv8 + OCR)"]
    ComputerControl["Computer Control<br/>(Mouse/Keyboard)"]
    BrowserControl["Browser Control<br/>(Playwright)"]
    LLM["LLM Providers<br/>(Azure OpenAI / OpenAI / Claude / Ollama)"]
    Computer["Windows / macOS VM"]
    ELK["Elasticsearch + Kibana<br/>(Observability)"]

    UI --> Agent
    Agent --> Planner
    Planner --> Executor
    Executor --> Replanner
    Replanner --> Planner
    Planner --> LLM
    Executor --> LLM
    Replanner --> LLM
    Executor --> OmniParser
    Executor --> ComputerControl
    Executor --> BrowserControl
    ComputerControl --> Computer
    BrowserControl --> Computer
    Agent --> ELK
```
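To make the Executor-to-OmniParser hop above concrete, a client might send a base64-encoded screenshot and reduce the reply to clickable targets. The payload shape, the `elements`/`bbox`/`text` response fields, and the route name are assumptions for illustration, not the server's documented API:

```python
import base64
import json

OMNIPARSER_URL = "http://127.0.0.1:8000"  # matches the OMNIPARSER_URL default

def build_parse_request(png_bytes: bytes) -> bytes:
    # Hypothetical payload shape; the real server_omniparser schema may differ.
    return json.dumps(
        {"image_base64": base64.b64encode(png_bytes).decode()}
    ).encode()

def extract_clickable(parse_response: dict) -> list:
    # Reduce a (hypothetical) element list to (label, center point) pairs
    # that an executor could hand to the computer-control service.
    targets = []
    for el in parse_response.get("elements", []):
        x1, y1, x2, y2 = el["bbox"]
        targets.append((el["text"], ((x1 + x2) // 2, (y1 + y2) // 2)))
    return targets
```

An actual call would POST the request body to an endpoint under OMNIPARSER_URL (route name assumed) and feed the JSON reply to `extract_clickable`.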
- Python 3.12+
- uv package manager
- Docker & Docker Compose
- GPU recommended for OmniParser (CUDA 12.6+)
```sh
docker compose up
```

This starts OmniParser, Elasticsearch, and Kibana. Uncomment the windowscomputer and agentoo1 services in compose.yml for the full stack.
1. Start a Windows computer (Docker, requires KVM)

```sh
cd computers/windows/docker
docker compose up
```

See computers/README.md for other options (Parallels VM, macOS).
2. Start the OmniParser server

```sh
cd servers/server_omniparser
uv run server.py
```

3. Start the agent
```sh
cd agents/agent_oo4
uv venv && source .venv/bin/activate
uv sync
cd ..
uv run -m agent_oo4.main
```

4. Start the Web UI
```sh
cd ui
uv run app.py
# Open http://127.0.0.1:7860
```

Create .env files in the appropriate directories. Key variables:
| Variable | Description | Default |
|---|---|---|
| `AZURE_OPENAI_BASEURL` | Azure OpenAI endpoint URL | — |
| `AZURE_API_KEY` | Azure OpenAI API key | — |
| `AZURE_MODEL` | Azure model name | `gpt-4o` |
| `AZURE_MODEL_DEPLOYMENT_NAME` | Azure deployment name | — |
| `OMNIPARSER_URL` | OmniParser service URL | `http://127.0.0.1:8000` |
| `COMPUTER_CONTROL_URL` | Computer control service URL | `http://127.0.0.1:5050` |
| `BROWSER_CONTROL_URL` | Browser control service URL | `http://127.0.0.1:5051` |
| `OPENAI_API_KEY` | OpenAI API key (optional) | — |
| `ANTHROPIC_API_KEY` | Anthropic API key (optional) | — |
| `OLLAMA_URL` | Ollama endpoint (optional) | `http://127.0.0.1:11434` |
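A minimal sketch of reading these variables with their table defaults, using plain `os.getenv` semantics. The `ServiceConfig` dataclass is illustrative; the actual loader in agents/core may differ:

```python
import os
from dataclasses import dataclass

@dataclass
class ServiceConfig:
    # Field names are illustrative; env var names and defaults come from
    # the table above.
    omniparser_url: str
    computer_control_url: str
    browser_control_url: str
    azure_model: str

def load_config(env=os.environ) -> ServiceConfig:
    return ServiceConfig(
        omniparser_url=env.get("OMNIPARSER_URL", "http://127.0.0.1:8000"),
        computer_control_url=env.get("COMPUTER_CONTROL_URL", "http://127.0.0.1:5050"),
        browser_control_url=env.get("BROWSER_CONTROL_URL", "http://127.0.0.1:5051"),
        azure_model=env.get("AZURE_MODEL", "gpt-4o"),
    )
```

Passing `env` explicitly keeps the loader testable without mutating the process environment.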
Tasks are defined as JSON files in agents/configs/. Example (agents/configs/teams/scenario-2.json):

```json
{
  "instruction": "Find chats in Teams, switch between 5 chat threads, summarize the latest chat",
  "workflow.params": {
    "max_plan_versions": 20,
    "max_plan_step_iterations": 3,
    "max_plan_step_actions": 5
  },
  "environment.start": [
    { "func": "close_all_windows" },
    { "func": "start_network_proxy" },
    { "func": "open_application", "args": { "app_name": "ms-teams" } }
  ]
}
```

```
├── agents/
│   ├── agent_oo4/            # Main agent (Plan-and-Solve workflow)
│   │   └── workflow/         # Planner, Executor, Replanner nodes
│   ├── core/                 # Shared library (LLM clients, state, config)
│   ├── configs/              # Task scenario definitions (JSON)
│   └── functions/            # Pre/post-task functions
├── servers/
│   ├── server_omniparser/        # UI parsing service (YOLOv8 + OCR)
│   ├── server_computer_control/  # Mouse/keyboard control
│   ├── server_browser_control/   # Playwright browser automation
│   ├── server_network_proxy/     # MITM proxy for traffic capture
│   ├── server_evaluator/         # Task evaluation service
│   └── server_teams_control/     # Microsoft Teams automation
├── computers/
│   ├── windows/              # Windows VM setup (Docker/Parallels)
│   └── macos/                # macOS Docker setup
├── models/                   # Local model configurations
├── ui/                       # Web UI (Gradio + NoVNC)
├── infra/                    # Infrastructure as Code
├── compose.yml               # Docker Compose orchestration
└── docs/                     # Additional documentation
```
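A scenario file like the Teams example above can be loaded and sanity-checked in a few lines. The required keys are taken from that sample; the validation helper itself is illustrative, not the loader used by agents/core:

```python
import json
from pathlib import Path

# Top-level keys observed in the sample scenario JSON.
REQUIRED_KEYS = {"instruction", "workflow.params", "environment.start"}

def load_scenario(path: str) -> dict:
    """Load a scenario JSON and fail fast if required keys are absent."""
    scenario = json.loads(Path(path).read_text())
    missing = REQUIRED_KEYS - scenario.keys()
    if missing:
        raise ValueError(f"scenario missing keys: {sorted(missing)}")
    return scenario
```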
Elasticsearch and Kibana are included for observability. After running docker compose up, open Kibana at http://localhost:5601 to view agent logs and telemetry.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request