Clawd Cursor


OS-level desktop automation server. Gives any AI model eyes, hands, and ears on a real computer.
Model-agnostic · Works with Claude, GPT, Gemini, Llama, or any tool-calling model · Free with local models


Website · Discord · Quick Start · Connect · How It Works · API · Changelog


What's New in v0.7.5

Provider-agnostic Computer Use. App guides. Cleaner architecture.

  • Provider-agnostic refactor — eliminated all hardcoded model/provider checks. Uses declarative capability flags (reasoningVisionModel, computerUse, openaiCompat) instead of string matching. Works with ANY provider out of the box.
  • App Guide system — community-contributed JSON instruction manuals for 86+ apps. Install with clawdcursor guides install excel. Teaches the AI keyboard shortcuts, workflows, UI layout, and tips for each app. Loaded automatically at runtime.
  • Google Gemini 2.5 Flash — auto-detected from GEMINI_API_KEY or GOOGLE_API_KEY. Budget-friendly at ~$0.15/1M tokens.
  • 3-stage pipeline — Stage 1 (deterministic, free) → Stage 2 (text LLM, cheap) → Stage 3 (vision LLM, expensive). Most tasks complete at Stage 1-2.
  • Fallback architecture — designed as the last-mile fallback for any AI agent. When APIs, CLIs, and integrations don't exist, clawdcursor gives your agent eyes and hands on the desktop.
  • Task classifier — zero-cost regex classifier routes tasks to optimal pipeline stage (mechanical/navigation/reasoning/spatial).
  • Guide registry CLI — clawdcursor guides available lists 86 apps. clawdcursor guides install spotify downloads shortcuts. Community-driven knowledge base.
  • Lazy guide loading — guides inject into the LLM context only when the target app is detected, after the first screenshot.
  • Adaptive learning — successful tasks save their action sequences to app guide JSON. Next time the same app is used, Stage 2 reads the learned workflow and executes it natively — no vision fallback needed. The system gets smarter with every interaction.
  • returnPartial mode — external agents (OpenClaw, Claude Code) send POST /task {"returnPartial": true}. If Stage 2 fails, control returns to the calling agent instead of burning tokens on Stage 3 vision. The agent finishes with MCP tools (smarter than one-shot vision).
  • POST /learn endpoint — after completing a task with MCP tools, agents report what worked. Saves workflows, keyboard shortcuts, and tips to the app's guide JSON. Community-driven knowledge base that grows with every interaction.
  • Per-layer API keys — mixed-provider pipelines (e.g., Kimi text + Anthropic vision) use separate API keys per layer. No more auth failures from sending the wrong key to the wrong provider.
  • CDP auto-init in serve mode — browser DevTools Protocol connects automatically on startup. Web tasks get DOM access instead of falling back to pure vision.
  • minimize_window tool — new cross-platform tool. Minimize any window by name, PID, or title in 1 call. Windows (Super+Down), macOS (Cmd+M), Linux (wmctrl).
  • smart_click timeout — 10s default timeout prevents silent hangs on complex UIs. Returns diagnostic error instead of blocking.
  • focus_window phantom cleanup — automatically minimizes off-screen full-screen windows (-14,-14 bounds) that steal focus before focusing the target.
  • Spatial layout analysis — textual scaffolding: detects toolbar zones, content center, sidebars from OCR element positions. Any text LLM knows WHERE to click without vision.
  • 42 tools (was 40) — added minimize_window and smart_read. Full tool reference in SKILL.md.
  • 172 tests pass — provider matrix tests (27), focus window, action router, OCR, safety, coordinates, and more.
  • New config schema — textModel/visionModel field names. Old names still accepted.
  • Memory leak fixes — buffer release in OCR engine, TERMINAL env var quoting for Linux.
  • Cross-platform hardening — Linux GPU detection, PID file locking, signal handlers.
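The zero-cost classifier above can be sketched as a small ordered rule table. This is an illustrative TypeScript sketch, not clawdcursor's actual code; the `classifyTask` name and the regex patterns are assumptions, only the four category names come from the feature list:

```typescript
// Illustrative sketch of a zero-cost regex task classifier.
// Categories come from the feature list above; the patterns and
// function name are assumptions, not clawdcursor's real code.
type TaskClass = "mechanical" | "navigation" | "reasoning" | "spatial";

const rules: Array<[RegExp, TaskClass]> = [
  [/\b(type|press|scroll|copy|paste)\b/i, "mechanical"],         // raw input actions
  [/\b(open|launch|switch to|go to|navigate)\b/i, "navigation"], // app/window moves
  [/\b(click|button|icon|menu|drag)\b/i, "spatial"],             // needs screen layout
];

function classifyTask(task: string): TaskClass {
  for (const [pattern, cls] of rules) {
    if (pattern.test(task)) return cls;
  }
  return "reasoning"; // no cheap match: route to an LLM stage
}
```

Under this sketch, "Open Notepad and write a haiku" routes to navigation and can start cheap, while a task with no mechanical, navigation, or spatial keywords falls through to reasoning.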

v0.7.2 features (still present)

  • Smart pipeline — L0 (Browser) -> L1 (Action Router) -> L1.5 (Deterministic Flows) -> L2 (Skill Cache) -> L2.5 (OCR Reasoner) -> L2.5b (A11y Reasoner) -> L3 (Computer Use). Most tasks never reach L3.
  • 40 universal tools — served via REST (GET /tools, POST /execute/:name) and MCP stdio from a single definition. Any model that can call functions can control your desktop.
  • 3 transport modes — start (full agent + tools), serve (tools only, bring your own brain), mcp (MCP stdio for Claude Code, Cursor, Windsurf, Zed)
  • CDP browser integration — Chrome DevTools Protocol for DOM interaction, text extraction, click-by-selector. Auto-connects to Edge/Chrome.
  • Action verifier — ground-truth checking after every action. Blocks false success reports.
  • A11y click resolver — bounds-based coordinate resolution, zero LLM cost
  • Deterministic flows — hardcoded keyboard sequences for common tasks (email compose, app switch). Zero LLM calls, instant.
  • No-progress loop detector — blocks same action repeated 3+ times. Forces the LLM to try something different.
  • Premature-done blocker — evidence-based completion checking. Won't report success unless verified.
  • Structured task logging — JSONL per-task logs with verified_success vs unverified_success distinction
  • First-run onboarding — consent flow explains what desktop control means before tools activate
  • Standalone data directory — all data in ~/.clawdcursor/ (migrates from legacy paths automatically)
  • Error reporting (opt-in) — clawdcursor report lets users send redacted task logs to help improve the agent
  • API key validation on startup — clear checkmark/cross messages so you know immediately if your key works
  • Graceful error handling — 401/402/429 responses produce actionable messages instead of hanging
  • Configurable browser — custom exe path, process name, and CDP port via config
  • DPI-safe coordinate conversion — consistent across all 8 call sites
  • 13 providers auto-detected — up from 6, plus any OpenAI-compatible endpoint
  • OCR Reasoner — screenshot + OS-level OCR + cheap text LLM, with stagnation detection and done-verification
  • Universal OS support — platform-aware system prompts, DPI detection (Retina via NSScreen, Linux via xrandr/GDK_SCALE), and browser CDP paths for Windows, macOS, and Linux
  • Security hardening — log sanitization (API keys/tokens redacted), token fingerprinting on startup, auth required on sensitive GET endpoints (/favorites, /task-logs, /logs)
  • Task execution lock — prevents race conditions from concurrent /task requests
  • Clipboard fallback — catches a11y bridge failure, falls back to typeText
  • Install verification — scripts/verify-install.js checks Node version + native deps with platform-specific fix guidance

v0.6.3 vs v0.7.5

| | v0.6.3 | v0.7.5 |
|---|---|---|
| Architecture | 4-layer pipeline (L0-L3) | 3-stage pipeline: deterministic → text LLM → vision LLM |
| Transport | REST API only | REST + MCP stdio + tools-only server |
| Tools | Monolithic agent, no tool exposure | 40 discrete tools, OpenAI function-calling format |
| Browser | Playwright-only, no DOM access | CDP integration — click by selector, read text, type by label |
| Verification | LLM self-reports success (often wrong) | Ground-truth action verifier — reads actual content back |
| False positives | Common — agent says "done" prematurely | Premature-done blocker + evidence-based completion |
| Loops | Agent can repeat same failed action forever | No-progress detector blocks after 3 repeats in 8 steps |
| Click resolution | Vision model guesses coordinates | A11y bounds-based resolver (zero LLM cost), vision as fallback |
| Common tasks | Every task goes through LLM | Deterministic flows for email, app-switch — zero LLM calls |
| Task logging | Console output only | Structured JSONL per-task, verified vs unverified success |
| Data directory | ~/.openclaw/clawdcursor/ (coupled) | ~/.clawdcursor/ (standalone, auto-migrates) |
| Dependencies | Tied to OpenClaw platform | Fully standalone — works with any AI, any client |
| Onboarding | None — starts immediately | First-run consent flow for desktop control |
| MCP support | None | Native MCP stdio for Claude Code, Cursor, Windsurf, Zed |
| Error reporting | None | Opt-in redacted task log submission |
| Model coupling | Anthropic-favored defaults | Truly model-agnostic — 14 providers auto-detected + any OpenAI-compatible endpoint |
| OS support | Windows only | Windows, macOS, Linux — platform-aware prompts, DPI, browser paths |
| Security | Token in logs, no endpoint auth | Log sanitization, token fingerprint, auth on sensitive endpoints |

The Glove for Any AI Hand

Think of your AI as the hand and Clawd Cursor as the glove.

The hand has the intelligence — it reasons, plans, and decides what to do. The glove gives it grip on the physical world. Clawd Cursor wraps your entire desktop — every window, every button, every text field, every pixel — and exposes it as simple tool calls that any AI model can use.

Your AI is the brain. Clawd Cursor is the body.

If it's visible on your screen, Clawd Cursor can interact with it. Native apps, web apps, legacy software, internal tools, desktop games — anything with a GUI. No app-specific integrations needed. No APIs to configure per-service. One universal interface that turns any AI into a desktop operator.

This is what makes v0.7.5 different from every other automation tool: it doesn't care which AI drives it. Claude, GPT, Gemini, Llama running locally, a custom model you trained yourself, or a simple Python script making function calls. If it can call tools, it can control your computer.

Your AI (any model)          Clawd Cursor (the glove)
  "Click the Send button"  ->  find_element + mouse_click
  "What's on screen?"      ->  desktop_screenshot + read_screen
  "Type my email"          ->  type_text
  "Open Chrome to gmail"   ->  open_app + navigate_browser
  "Read that table"        ->  cdp_read_text

Three Ways to Connect

Clawd Cursor is a tool server. It doesn't care which AI model drives it.

1. Built-in Agent (start)

Full autonomous agent with built-in LLM pipeline. Send a task, get a result.

clawdcursor start
curl http://localhost:3847/task -H "Content-Type: application/json" \
  -d '{"task": "Open Notepad and write a haiku about the ocean"}'

2. Tools-Only Server (serve)

Exposes 40 desktop tools via REST API. You bring the brain — Claude, GPT, Gemini, Llama, a script, anything.

clawdcursor serve

# Discover available tools (OpenAI function-calling format)
curl http://localhost:3847/tools

# Execute any tool
curl http://localhost:3847/execute/desktop_screenshot
curl http://localhost:3847/execute/mouse_click -d '{"x": 500, "y": 300}'
curl http://localhost:3847/execute/type_text -d '{"text": "Hello world"}'
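Because /tools returns OpenAI function-calling definitions, a thin client can forward the model's tool calls straight to POST /execute/:name. A minimal sketch — the ToolCall shape mirrors OpenAI's tool-call format, and toExecuteRequest is a hypothetical helper, not part of clawdcursor:

```typescript
// Hypothetical helper: translate an OpenAI-style tool call into the
// request shape the tools-only server expects (POST /execute/:name).
interface ToolCall {
  name: string;      // e.g. "mouse_click"
  arguments: string; // JSON text, as OpenAI-compatible APIs return it
}

function toExecuteRequest(call: ToolCall, base = "http://localhost:3847") {
  return {
    url: `${base}/execute/${call.name}`,
    method: "POST" as const,
    headers: { "Content-Type": "application/json" },
    body: call.arguments, // forwarded verbatim as the JSON body
  };
}
```

A driver loop then becomes: send screen context to the model, take each tool call it returns, fetch(req.url, req), and feed the results back.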

3. MCP Mode (mcp)

Runs as an MCP tool server over stdio. Works with Claude Code, Cursor, Windsurf, Zed, or any MCP-compatible client.

// Claude Code: ~/.claude/settings.json
{
  "mcpServers": {
    "clawdcursor": {
      "command": "node",
      "args": ["/path/to/clawdcursor/dist/index.js", "mcp"]
    }
  }
}

Tool Categories (40 tools)

| Category | Tools | Examples |
|---|---|---|
| Perception | 9 | desktop_screenshot, read_screen, get_active_window, get_focused_element, smart_read, ocr_read_screen |
| Mouse | 6 | mouse_click, mouse_double_click, mouse_drag, mouse_scroll |
| Keyboard | 5 | key_press, type_text, smart_type, shortcuts_list, shortcuts_execute |
| Window/App | 6 | focus_window, open_app, get_windows, invoke_element |
| Browser CDP | 10 | cdp_connect, cdp_click, cdp_type, cdp_read_text |
| Orchestration | 4 | delegate_to_agent, smart_click, navigate_browser, wait |

Quick Start

Windows

git clone https://github.com/AmrDab/clawdcursor.git
cd clawdcursor
npm install
npm run setup      # builds + registers 'clawdcursor' command globally

# Start the full agent
clawdcursor start

# Or start tools-only (bring your own AI)
clawdcursor serve

# Or run as MCP server
clawdcursor mcp

macOS

git clone https://github.com/AmrDab/clawdcursor.git
cd clawdcursor && npm install && npm run setup

# Grant Accessibility permissions to your terminal first!
# System Settings -> Privacy & Security -> Accessibility -> Add Terminal/iTerm

chmod +x scripts/mac/*.sh scripts/mac/*.jxa
clawdcursor start

Linux

git clone https://github.com/AmrDab/clawdcursor.git
cd clawdcursor && npm install && npm run setup

# Linux: browser control via CDP only (no native desktop automation yet)
clawdcursor start

See docs/MACOS-SETUP.md for the full macOS onboarding guide.

First run will:

  1. Show a desktop control consent warning (one-time)
  2. Scan for AI providers from environment variables and CLI flags
  3. Auto-configure the optimal pipeline
  4. Start the server on http://localhost:3847

Provider Setup

Free (no API key needed):

ollama pull qwen2.5:7b   # or any model
clawdcursor start

Budget cloud (Gemini 2.5 Flash — recommended):

echo "GEMINI_API_KEY=AIza..." > .env    # get free key at aistudio.google.com
clawdcursor start

Any cloud provider:

echo "AI_API_KEY=your-key-here" > .env
clawdcursor doctor    # optional — auto-detects from key format
clawdcursor start

Explicit provider:

clawdcursor start --provider anthropic --api-key sk-ant-...
clawdcursor start --base-url https://api.example.com/v1 --api-key KEY

| Provider | Key prefix | Vision | Computer Use | Budget |
|---|---|---|---|---|
| Google (Gemini) | AIza... | Yes | No | ★ Best value |
| Anthropic | sk-ant- | Yes | Yes | Premium |
| OpenAI | sk- | Yes | No | Standard |
| Groq | gsk_ | Yes | No | Fast |
| Together AI | - | Yes | No | Open models |
| DeepSeek | sk- | Yes | No | Cheapest |
| Kimi/Moonshot | sk- (long) | No | No | |
| Mistral | - | Yes | No | |
| xAI (Grok) | xai- | Yes | No | |
| Alibaba/Qwen (DashScope) | sk- | Yes | No | |
| Fireworks | fw_... | Yes | No | |
| Cohere | - | Yes | No | |
| Perplexity | pplx- | Yes | No | |
| Ollama (local) | - | Auto-detected | No | Free |
| Any OpenAI-compatible | - | Varies | No | |
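Key-prefix detection, which clawdcursor doctor appears to rely on, can be read straight off the table. A sketch only — the function name is an assumption, and the real logic must use more signals than this, since the sk- prefix is shared by OpenAI, DeepSeek, Kimi, and Qwen:

```typescript
// Illustrative key-prefix detection based on the table above.
// Ambiguous sk- keys default to "openai" here, which the real
// tool cannot simply do (DeepSeek/Kimi/Qwen also use sk-).
function detectProvider(apiKey: string): string {
  if (apiKey.startsWith("sk-ant-")) return "anthropic";
  if (apiKey.startsWith("AIza")) return "gemini";
  if (apiKey.startsWith("gsk_")) return "groq";
  if (apiKey.startsWith("xai-")) return "xai";
  if (apiKey.startsWith("pplx-")) return "perplexity";
  if (apiKey.startsWith("fw_")) return "fireworks";
  if (apiKey.startsWith("sk-")) return "openai"; // ambiguous prefix
  return "unknown";
}
```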

How It Works

OCR-First Adaptive Pipeline

Every task flows cheapest-first. Learned patterns skip LLM entirely.

                        +-------------+
                        |  User Task  |
                        +------+------+
                               |
                               v
                    +--------------------+
                    | Screenshot Capture  |
                    | + A11y Tree Scan    |
                    | (parallel)          |
                    +--------------------+
                               |
                               v
                  +--------------------------+
                  |  Adaptive Pattern Lookup  |
                  |  JSON guides + Skill Cache|
                  |  — known workflow?        |
                  +-----+---------------+----+
                        |               |
                  Match found      No match
                        |               |
                        v               v
              +-----------------+  +------------------+
              | Direct Execute  |  |  Tesseract OCR   |
              | Skip OCR + LLM  |  |  Text + bounding |
              | (instant, free) |  |  boxes + spatial  |
              +--------+--------+  |  layout analysis  |
                       |           +--------+---------+
                       |                    |
                       |                    v
                       |           +------------------+
                       |           | Text LLM Reason  |
                       |           | Plan + execute   |
                       |           | actions per step |
                       |           +--------+---------+
                       |                    |
                       |            +-------+-------+
                       |            |               |
                       |         Success      OCR can't read
                       |            |               |
                       |            |               v
                       |            |      +----------------+
                       |            |      |  Vision LLM    |
                       |            |      |  Screenshot    |
                       |            |      |  loop (last    |
                       |            |      |  resort)       |
                       |            |      +--------+-------+
                       |            |               |
                       |            +-------+-------+
                       |                    |
                       v                    v
              +----------------------------------+
              |       Learn Pattern              |
              |  Save to JSON guide + Skill Cache|
              |  Next time: direct execute (free)|
              +----------------------------------+

Adaptive loop: First run takes 9 steps with LLM. Second run replays the learned pattern in 1 step (free).

returnPartial: External agents (OpenClaw, Claude Code) can send {"returnPartial": true} — if text LLM fails, control returns to the agent instead of burning tokens on vision. The agent finishes with MCP tools and calls POST /learn to teach clawdcursor.
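The learn-then-replay cycle boils down to a cache of action sequences keyed by app and task. A minimal sketch; the Action shape, class name, and key scheme are illustrative, not clawdcursor's internal API:

```typescript
// Minimal sketch of the adaptive skill cache: a successful task's
// action sequence is stored, and a later identical task replays it
// without any LLM call. All names here are illustrative.
interface Action {
  tool: string;                  // e.g. "key_press"
  args: Record<string, unknown>; // tool arguments
}

class SkillCache {
  private cache = new Map<string, Action[]>();

  private key(app: string, task: string): string {
    return `${app}::${task.trim().toLowerCase()}`; // crude normalization
  }

  learn(app: string, task: string, actions: Action[]): void {
    this.cache.set(this.key(app, task), actions);
  }

  lookup(app: string, task: string): Action[] | undefined {
    return this.cache.get(this.key(app, task));
  }
}
```

On a cache hit the pipeline goes straight to Direct Execute; on a miss it falls through to OCR and the text LLM, then learns the result for next time.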

Model Recommendations (2026)

| Provider | Text model (Stage 2) | Vision model (Stage 3) | Notes |
|---|---|---|---|
| Google (Gemini) | gemini-2.5-flash | gemini-2.5-flash | Budget pick — one model for both, 1M ctx |
| Anthropic | claude-haiku-4-5 | claude-sonnet-4-20250514 | Best accuracy, Computer Use support |
| OpenAI | gpt-4o-mini | gpt-4o | Reliable, widely supported |
| Groq | llama-3.3-70b-versatile | llama-3.2-90b-vision-preview | Fastest inference |
| DeepSeek | deepseek-chat | deepseek-chat | Cheapest reasoning |
| Ollama | (auto-detected) | (auto-detected) | Free, runs locally |

Cost Comparison (approximate, per 1K tasks)

| Provider | Stage 2 cost | Stage 3 cost | Best for |
|---|---|---|---|
| Ollama (local) | $0 | $0 | Privacy, offline |
| Google Gemini 2.5 Flash | ~$0.15 | ~$0.60 | Budget cloud |
| DeepSeek Chat | ~$0.14 | ~$0.14 | Cheapest cloud |
| Groq Llama | ~$0.59 | ~$2.70 | Speed |
| OpenAI GPT-4o | ~$0.15 | ~$2.50 | Reliability |
| Anthropic Claude | ~$0.80 | ~$15 | Accuracy, Computer Use |

Tip: Configure Gemini 2.5 Flash for both text and vision roles — it's the best single-model budget option. One API key, one model, handles everything.

Action Verification

Every action is verified after execution:

  • Type actions: Reads back the focused element's text content
  • Click actions: Checks if window/focus changed as expected
  • Key presses: Verifies the expected state change occurred
  • CDP actions: Re-reads DOM to confirm changes
  • Task completion: Ground-truth check reads actual content (Notepad text, email window state, etc.)

If verification fails, the agent retries with a different approach instead of reporting false success.
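For a type action, for example, verification reduces to reading the focused element back and checking that the typed text survived. A hedged sketch — the function name and the whitespace/case normalization are assumptions:

```typescript
// Sketch of ground-truth verification for a type action: compare the
// text read back from the focused element against what was typed.
// The normalization choice here is illustrative.
function verifyTyped(expected: string, readBack: string): boolean {
  const norm = (s: string) => s.replace(/\s+/g, " ").trim().toLowerCase();
  return norm(readBack).includes(norm(expected));
}
```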

No-Progress Detection

If the LLM repeats the same action 3+ times in an 8-step window, it's blocked and forced to try something different. Combined with the premature-done blocker (requires evidence of completion for write tasks), this prevents the two most common failure modes: infinite loops and premature success.
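A sliding-window detector is enough to implement that rule. A sketch with an illustrative action-signature format (e.g. "mouse_click@500,300"); only the 3-in-8 thresholds come from the text above:

```typescript
// Sketch of the no-progress detector: keep the last 8 action
// signatures and refuse any action that reaches 3 occurrences
// inside that window, forcing the LLM to try something else.
class LoopDetector {
  private window: string[] = [];

  allow(signature: string): boolean {
    this.window.push(signature);
    if (this.window.length > 8) this.window.shift(); // keep 8-step window
    const count = this.window.filter((s) => s === signature).length;
    return count < 3; // the 3rd identical attempt is blocked
  }
}
```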


API Endpoints

http://localhost:3847

| Endpoint | Method | Description |
|---|---|---|
| / | GET | Web dashboard UI |
| /tools | GET | List all 40 tools (OpenAI function-calling format) |
| /execute/:name | POST | Execute a tool by name |
| /task | POST | Submit a task: {"task": "Open Chrome"} |
| /status | GET | Agent state and current task |
| /task-logs | GET | Recent task summaries (structured JSONL) |
| /task-logs/current | GET | Current task's step-by-step log |
| /report | POST | Submit an error report (opt-in) |
| /logs | GET | Last 200 console log entries |
| /screenshot | GET | Current screen as PNG |
| /action | POST | Direct action execution (LLM-space coords) |
| /confirm | POST | Approve/reject pending safety action |
| /abort | POST | Stop the current task |
| /favorites | GET/POST/DELETE | Saved command favorites |
| /stop | POST | Graceful server shutdown |
| /health | GET | Server health + version |

Architecture

                     Any AI Model
                    (Claude, GPT, Gemini, Llama, scripts, etc.)
                          |
            +-------------+-------------+
            |             |             |
        REST API      MCP stdio    Built-in Agent
        (serve)        (mcp)        (start)
            |             |             |
            +-------------+-------------+
                          |
               Clawd Cursor Tool Server
               40 tools, single definition
                          |
        +---------+-------+-------+---------+
        |         |       |       |         |
    Perception  Mouse  Keyboard  Window  Browser
    screenshot  click  key_press focus   cdp_click
    read_screen drag   type_text open    cdp_type
    a11y_tree   scroll           switch  cdp_read
                          |
               Native Desktop Layer
          nut-js + PowerShell/JXA/AT-SPI + Playwright
                          |
                    Your Desktop

Safety

| Tier | Actions | Behavior |
|---|---|---|
| Auto | Navigation, reading, opening apps | Runs immediately |
| Preview | Typing, form filling | Logs before executing |
| Confirm | Sending messages, deleting, purchases | Pauses for approval |

First run shows a desktop control consent warning. Dangerous key combos (Alt+F4, Ctrl+Alt+Del) are blocked. Server binds to localhost only.
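Tier assignment can be pictured as a simple intent match on the pending action. The patterns below are illustrative only; the actual rules are certainly finer-grained:

```typescript
// Illustrative mapping from an action description to the three
// safety tiers in the table above. Patterns are assumptions.
type Tier = "auto" | "preview" | "confirm";

function safetyTier(action: string): Tier {
  if (/\b(send|delete|purchase|pay|buy)\b/i.test(action)) return "confirm";
  if (/\b(type|fill|enter)\b/i.test(action)) return "preview";
  return "auto"; // navigation, reading, opening apps
}
```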

CLI Commands

clawdcursor start        Start the full agent (built-in LLM pipeline)
clawdcursor serve        Start tools-only server (no built-in LLM)
clawdcursor mcp          Run as MCP tool server over stdio
clawdcursor doctor       Diagnose and auto-configure
clawdcursor task <t>     Send a task to running agent
clawdcursor report       Send an error report (opt-in, redacted)
clawdcursor dashboard    Open the web dashboard
clawdcursor install      Set up API key and configure pipeline
clawdcursor uninstall    Remove all config and data
clawdcursor stop         Stop the running server
clawdcursor kill         Force stop

Options:
  --port <port>          API port (default: 3847)
  --provider <provider>  anthropic|openai|ollama|groq|together|deepseek|kimi|gemini|mistral|xai|qwen|fireworks|cohere|perplexity|...
  --model <model>        Override vision model
  --api-key <key>        AI provider API key
  --base-url <url>       Custom API endpoint
  --debug                Save screenshots to debug/ folder

Platform Support

| Platform | UI Automation | OCR | Browser (CDP) | Status |
|---|---|---|---|---|
| Windows (x64/ARM64) | PowerShell + .NET UI Automation | Windows.Media.Ocr | Chrome/Edge | Full support |
| macOS (Intel/Apple Silicon) | JXA + System Events | Apple Vision framework | Chrome/Edge | Full support |
| Linux (x64/ARM64) | AT-SPI (planned) | Tesseract OCR | Chrome/Edge | Browser + OCR |

Prerequisites

  • Node.js 20+ (x64 or ARM64)
  • Windows: PowerShell (included)
  • macOS 10.15+: Accessibility permissions granted, Xcode CLI tools (xcode-select --install)
  • Linux: tesseract-ocr and python3 for OCR (sudo apt install tesseract-ocr)
  • AI API Key — optional. Works offline with Ollama or tools-only mode.

Tech Stack

TypeScript - Node.js - @nut-tree-fork/nut-js - Playwright - sharp - Express - MCP SDK - Zod - Any OpenAI-compatible API - Anthropic Computer Use - Windows UI Automation - macOS Accessibility (JXA) - Chrome DevTools Protocol

License

MIT


clawdcursor.com
