Clawd Cursor


OS-level desktop automation server. Gives any AI model eyes, hands, and ears on a real computer.
Model-agnostic · Works with Claude, GPT, Gemini, Llama, or any tool-calling model · Free with local models


Website · Discord · Quick Start · Connect · How It Works · API · Changelog


What's New in v0.7.5

Provider-agnostic Computer Use. App guides. Cleaner architecture.

  • Provider-agnostic refactor — eliminated all hardcoded model/provider checks. Uses declarative capability flags (reasoningVisionModel, computerUse, openaiCompat) instead of string matching. Works with ANY provider out of the box.
  • App Guide system — community-contributed JSON instruction manuals for 86+ apps. Install with clawdcursor guides install excel. Teaches the AI keyboard shortcuts, workflows, UI layout, and tips for each app. Loaded automatically at runtime.
  • Google Gemini 2.5 Flash — auto-detected from GEMINI_API_KEY or GOOGLE_API_KEY. Budget-friendly at ~$0.15/1M tokens.
  • 3-stage pipeline — Stage 1 (deterministic, free) → Stage 2 (text LLM, cheap) → Stage 3 (vision LLM, expensive). Most tasks complete at Stage 1-2.
  • Fallback architecture — designed as the last-mile fallback for any AI agent. When APIs, CLIs, and integrations don't exist, clawdcursor gives your agent eyes and hands on the desktop.
  • Task classifier — zero-cost regex classifier routes tasks to optimal pipeline stage (mechanical/navigation/reasoning/spatial).
  • Guide registry CLI — clawdcursor guides available lists 86 apps. clawdcursor guides install spotify downloads shortcuts. Community-driven knowledge base.
  • Lazy guide loading — guides inject into the LLM context only when the target app is detected, after the first screenshot.
  • Adaptive learning — successful tasks save their action sequences to app guide JSON. Next time the same app is used, Stage 2 reads the learned workflow and executes it natively — no vision fallback needed. The system gets smarter with every interaction.
  • returnPartial mode — external agents (OpenClaw, Claude Code) send POST /task {"returnPartial": true}. If Stage 2 fails, control returns to the calling agent instead of burning tokens on Stage 3 vision. The agent finishes with MCP tools (smarter than one-shot vision).
  • POST /learn endpoint — after completing a task with MCP tools, agents report what worked. Saves workflows, keyboard shortcuts, and tips to the app's guide JSON. Community-driven knowledge base that grows with every interaction.
  • Per-layer API keys — mixed-provider pipelines (e.g., Kimi text + Anthropic vision) use separate API keys per layer. No more auth failures from sending the wrong key to the wrong provider.
  • CDP auto-init in serve mode — browser DevTools Protocol connects automatically on startup. Web tasks get DOM access instead of falling back to pure vision.
  • minimize_window tool — new cross-platform tool. Minimize any window by name, PID, or title in 1 call. Windows (Super+Down), macOS (Cmd+M), Linux (wmctrl).
  • smart_click timeout — 10s default timeout prevents silent hangs on complex UIs. Returns diagnostic error instead of blocking.
  • focus_window phantom cleanup — automatically minimizes off-screen full-screen windows (-14,-14 bounds) that steal focus before focusing the target.
  • Spatial layout analysis — textual scaffolding: detects toolbar zones, content center, sidebars from OCR element positions. Any text LLM knows WHERE to click without vision.
  • 42 tools (was 40) — added minimize_window and smart_read. Full tool reference in SKILL.md.
  • 172 tests pass — provider matrix tests (27), focus window, action router, OCR, safety, coordinates, and more.
  • New config schema — textModel/visionModel field names. Old names still accepted.
  • Memory leak fixes — buffer release in OCR engine, TERMINAL env var quoting for Linux.
  • Cross-platform hardening — Linux GPU detection, PID file locking, signal handlers.
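The zero-cost classifier above can be sketched as a small ordered rule table. This is an illustrative TypeScript sketch, not clawdcursor's actual code; the `classifyTask` name and the regex patterns are assumptions, only the four category names come from the feature list:

```typescript
// Illustrative sketch of a zero-cost regex task classifier.
// Categories come from the feature list above; the patterns and
// function name are assumptions, not clawdcursor's real code.
type TaskClass = "mechanical" | "navigation" | "reasoning" | "spatial";

const rules: Array<[RegExp, TaskClass]> = [
  [/\b(type|press|scroll|copy|paste)\b/i, "mechanical"],         // raw input actions
  [/\b(open|launch|switch to|go to|navigate)\b/i, "navigation"], // app/window moves
  [/\b(click|button|icon|menu|drag)\b/i, "spatial"],             // needs screen layout
];

function classifyTask(task: string): TaskClass {
  for (const [pattern, cls] of rules) {
    if (pattern.test(task)) return cls;
  }
  return "reasoning"; // no cheap match: route to an LLM stage
}
```

Under this sketch, "Open Notepad and write a haiku" routes to navigation and can start cheap, while a task with no mechanical, navigation, or spatial keywords falls through to reasoning.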

v0.7.2 features (still present)

  • Smart pipeline — L0 (Browser) -> L1 (Action Router) -> L1.5 (Deterministic Flows) -> L2 (Skill Cache) -> L2.5 (OCR Reasoner) -> L2.5b (A11y Reasoner) -> L3 (Computer Use). Most tasks never reach L3.
  • 40 universal tools — served via REST (GET /tools, POST /execute/:name) and MCP stdio from a single definition. Any model that can call functions can control your desktop.
  • 3 transport modes — start (full agent + tools), serve (tools only, bring your own brain), mcp (MCP stdio for Claude Code, Cursor, Windsurf, Zed)
  • CDP browser integration — Chrome DevTools Protocol for DOM interaction, text extraction, click-by-selector. Auto-connects to Edge/Chrome.
  • Action verifier — ground-truth checking after every action. Blocks false success reports.
  • A11y click resolver — bounds-based coordinate resolution, zero LLM cost
  • Deterministic flows — hardcoded keyboard sequences for common tasks (email compose, app switch). Zero LLM calls, instant.
  • No-progress loop detector — blocks same action repeated 3+ times. Forces the LLM to try something different.
  • Premature-done blocker — evidence-based completion checking. Won't report success unless verified.
  • Structured task logging — JSONL per-task logs with verified_success vs unverified_success distinction
  • First-run onboarding — consent flow explains what desktop control means before tools activate
  • Standalone data directory — all data in ~/.clawdcursor/ (migrates from legacy paths automatically)
  • Error reporting (opt-in) — clawdcursor report lets users send redacted task logs to help improve the agent
  • API key validation on startup — clear checkmark/cross messages so you know immediately if your key works
  • Graceful error handling — 401/402/429 responses produce actionable messages instead of hanging
  • Configurable browser — custom exe path, process name, and CDP port via config
  • DPI-safe coordinate conversion — consistent across all 8 call sites
  • 13 providers auto-detected — up from 6, plus any OpenAI-compatible endpoint
  • OCR Reasoner — screenshot + OS-level OCR + cheap text LLM, with stagnation detection and done-verification
  • Universal OS support — platform-aware system prompts, DPI detection (Retina via NSScreen, Linux via xrandr/GDK_SCALE), and browser CDP paths for Windows, macOS, and Linux
  • Security hardening — log sanitization (API keys/tokens redacted), token fingerprinting on startup, auth required on sensitive GET endpoints (/favorites, /task-logs, /logs)
  • Task execution lock — prevents race conditions from concurrent /task requests
  • Clipboard fallback — catches a11y bridge failure, falls back to typeText
  • Install verification — scripts/verify-install.js checks Node version + native deps with platform-specific fix guidance

v0.6.3 vs v0.7.5

| | v0.6.3 | v0.7.5 |
|---|---|---|
| Architecture | 4-layer pipeline (L0-L3) | 3-stage pipeline: deterministic → text LLM → vision LLM |
| Transport | REST API only | REST + MCP stdio + tools-only server |
| Tools | Monolithic agent, no tool exposure | 40 discrete tools, OpenAI function-calling format |
| Browser | Playwright-only, no DOM access | CDP integration — click by selector, read text, type by label |
| Verification | LLM self-reports success (often wrong) | Ground-truth action verifier — reads actual content back |
| False positives | Common — agent says "done" prematurely | Premature-done blocker + evidence-based completion |
| Loops | Agent can repeat same failed action forever | No-progress detector blocks after 3 repeats in 8 steps |
| Click resolution | Vision model guesses coordinates | A11y bounds-based resolver (zero LLM cost), vision as fallback |
| Common tasks | Every task goes through LLM | Deterministic flows for email, app-switch — zero LLM calls |
| Task logging | Console output only | Structured JSONL per-task, verified vs unverified success |
| Data directory | ~/.openclaw/clawdcursor/ (coupled) | ~/.clawdcursor/ (standalone, auto-migrates) |
| Dependencies | Tied to OpenClaw platform | Fully standalone — works with any AI, any client |
| Onboarding | None — starts immediately | First-run consent flow for desktop control |
| MCP support | None | Native MCP stdio for Claude Code, Cursor, Windsurf, Zed |
| Error reporting | None | Opt-in redacted task log submission |
| Model coupling | Anthropic-favored defaults | Truly model-agnostic — 14 providers auto-detected + any OpenAI-compatible endpoint |
| OS support | Windows only | Windows, macOS, Linux — platform-aware prompts, DPI, browser paths |
| Security | Token in logs, no endpoint auth | Log sanitization, token fingerprint, auth on sensitive endpoints |

The Glove for Any AI Hand

Think of your AI as the hand and Clawd Cursor as the glove.

The hand has the intelligence — it reasons, plans, and decides what to do. The glove gives it grip on the physical world. Clawd Cursor wraps your entire desktop — every window, every button, every text field, every pixel — and exposes it as simple tool calls that any AI model can use.

Your AI is the brain. Clawd Cursor is the body.

If it's visible on your screen, Clawd Cursor can interact with it. Native apps, web apps, legacy software, internal tools, desktop games — anything with a GUI. No app-specific integrations needed. No APIs to configure per-service. One universal interface that turns any AI into a desktop operator.

This is what makes v0.7.5 different from every other automation tool: it doesn't care which AI drives it. Claude, GPT, Gemini, Llama running locally, a custom model you trained yourself, or a simple Python script making function calls. If it can call tools, it can control your computer.

Your AI (any model)          Clawd Cursor (the glove)
  "Click the Send button"  ->  find_element + mouse_click
  "What's on screen?"      ->  desktop_screenshot + read_screen
  "Type my email"          ->  type_text
  "Open Chrome to gmail"   ->  open_app + navigate_browser
  "Read that table"        ->  cdp_read_text

Three Ways to Connect

Clawd Cursor is a tool server. It doesn't care which AI model drives it.

1. Built-in Agent (start)

Full autonomous agent with built-in LLM pipeline. Send a task, get a result.

clawdcursor start
curl http://localhost:3847/task -H "Content-Type: application/json" \
  -d '{"task": "Open Notepad and write a haiku about the ocean"}'

2. Tools-Only Server (serve)

Exposes 40 desktop tools via REST API. You bring the brain — Claude, GPT, Gemini, Llama, a script, anything.

clawdcursor serve

# Discover available tools (OpenAI function-calling format)
curl http://localhost:3847/tools

# Execute any tool
curl http://localhost:3847/execute/desktop_screenshot
curl http://localhost:3847/execute/mouse_click -d '{"x": 500, "y": 300}'
curl http://localhost:3847/execute/type_text -d '{"text": "Hello world"}'
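Because /tools returns OpenAI function-calling definitions, a thin client can forward the model's tool calls straight to POST /execute/:name. A minimal sketch — the ToolCall shape mirrors OpenAI's tool-call format, and toExecuteRequest is a hypothetical helper, not part of clawdcursor:

```typescript
// Hypothetical helper: translate an OpenAI-style tool call into the
// request shape the tools-only server expects (POST /execute/:name).
interface ToolCall {
  name: string;      // e.g. "mouse_click"
  arguments: string; // JSON text, as OpenAI-compatible APIs return it
}

function toExecuteRequest(call: ToolCall, base = "http://localhost:3847") {
  return {
    url: `${base}/execute/${call.name}`,
    method: "POST" as const,
    headers: { "Content-Type": "application/json" },
    body: call.arguments, // forwarded verbatim as the JSON body
  };
}
```

A driver loop then becomes: send screen context to the model, take each tool call it returns, fetch(req.url, req), and feed the results back.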

3. MCP Mode (mcp)

Runs as an MCP tool server over stdio. Works with Claude Code, Cursor, Windsurf, Zed, or any MCP-compatible client.

// Claude Code: ~/.claude/settings.json
{
  "mcpServers": {
    "clawdcursor": {
      "command": "node",
      "args": ["/path/to/clawdcursor/dist/index.js", "mcp"]
    }
  }
}

Tool Categories (40 tools)

| Category | Tools | Examples |
|---|---|---|
| Perception | 9 | desktop_screenshot, read_screen, get_active_window, get_focused_element, smart_read, ocr_read_screen |
| Mouse | 6 | mouse_click, mouse_double_click, mouse_drag, mouse_scroll |
| Keyboard | 5 | key_press, type_text, smart_type, shortcuts_list, shortcuts_execute |
| Window/App | 6 | focus_window, open_app, get_windows, invoke_element |
| Browser CDP | 10 | cdp_connect, cdp_click, cdp_type, cdp_read_text |
| Orchestration | 4 | delegate_to_agent, smart_click, navigate_browser, wait |

Quick Start

Windows

git clone https://github.com/AmrDab/clawdcursor.git
cd clawdcursor
npm install
npm run setup      # builds + registers 'clawdcursor' command globally

# Start the full agent
clawdcursor start

# Or start tools-only (bring your own AI)
clawdcursor serve

# Or run as MCP server
clawdcursor mcp

macOS

git clone https://github.com/AmrDab/clawdcursor.git
cd clawdcursor && npm install && npm run setup

# Grant Accessibility permissions to your terminal first!
# System Settings -> Privacy & Security -> Accessibility -> Add Terminal/iTerm

chmod +x scripts/mac/*.sh scripts/mac/*.jxa
clawdcursor start

Linux

git clone https://github.com/AmrDab/clawdcursor.git
cd clawdcursor && npm install && npm run setup

# Linux: browser control via CDP only (no native desktop automation yet)
clawdcursor start

See docs/MACOS-SETUP.md for the full macOS onboarding guide.

First run will:

  1. Show a desktop control consent warning (one-time)
  2. Scan for AI providers from environment variables and CLI flags
  3. Auto-configure the optimal pipeline
  4. Start the server on http://localhost:3847

Provider Setup

Free (no API key needed):

ollama pull qwen2.5:7b   # or any model
clawdcursor start

Budget cloud (Gemini 2.5 Flash — recommended):

echo "GEMINI_API_KEY=AIza..." > .env    # get free key at aistudio.google.com
clawdcursor start

Any cloud provider:

echo "AI_API_KEY=your-key-here" > .env
clawdcursor doctor    # optional — auto-detects from key format
clawdcursor start

Explicit provider:

clawdcursor start --provider anthropic --api-key sk-ant-...
clawdcursor start --base-url https://api.example.com/v1 --api-key KEY

| Provider | Key prefix | Vision | Computer Use | Budget |
|---|---|---|---|---|
| Google (Gemini) | AIza... | Yes | No | ★ Best value |
| Anthropic | sk-ant- | Yes | Yes | Premium |
| OpenAI | sk- | Yes | No | Standard |
| Groq | gsk_ | Yes | No | Fast |
| Together AI | - | Yes | No | Open models |
| DeepSeek | sk- | Yes | No | Cheapest |
| Kimi/Moonshot | sk- (long) | No | No | |
| Mistral | - | Yes | No | |
| xAI (Grok) | xai- | Yes | No | |
| Alibaba/Qwen (DashScope) | sk- | Yes | No | |
| Fireworks | fw_... | Yes | No | |
| Cohere | - | Yes | No | |
| Perplexity | pplx- | Yes | No | |
| Ollama (local) | - | Auto-detected | No | Free |
| Any OpenAI-compatible | - | Varies | No | |
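Key-prefix detection, which clawdcursor doctor appears to rely on, can be read straight off the table. A sketch only — the function name is an assumption, and the real logic must use more signals than this, since the sk- prefix is shared by OpenAI, DeepSeek, Kimi, and Qwen:

```typescript
// Illustrative key-prefix detection based on the table above.
// Ambiguous sk- keys default to "openai" here, which the real
// tool cannot simply do (DeepSeek/Kimi/Qwen also use sk-).
function detectProvider(apiKey: string): string {
  if (apiKey.startsWith("sk-ant-")) return "anthropic";
  if (apiKey.startsWith("AIza")) return "gemini";
  if (apiKey.startsWith("gsk_")) return "groq";
  if (apiKey.startsWith("xai-")) return "xai";
  if (apiKey.startsWith("pplx-")) return "perplexity";
  if (apiKey.startsWith("fw_")) return "fireworks";
  if (apiKey.startsWith("sk-")) return "openai"; // ambiguous prefix
  return "unknown";
}
```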

How It Works

OCR-First Adaptive Pipeline

Every task flows cheapest-first. Learned patterns skip LLM entirely.

                        +-------------+
                        |  User Task  |
                        +------+------+
                               |
                               v
                    +--------------------+
                    | Screenshot Capture  |
                    | + A11y Tree Scan    |
                    | (parallel)          |
                    +--------------------+
                               |
                               v
                  +--------------------------+
                  |  Adaptive Pattern Lookup  |
                  |  JSON guides + Skill Cache|
                  |  — known workflow?        |
                  +-----+---------------+----+
                        |               |
                  Match found      No match
                        |               |
                        v               v
              +-----------------+  +------------------+
              | Direct Execute  |  |  Tesseract OCR   |
              | Skip OCR + LLM  |  |  Text + bounding |
              | (instant, free) |  |  boxes + spatial  |
              +--------+--------+  |  layout analysis  |
                       |           +--------+---------+
                       |                    |
                       |                    v
                       |           +------------------+
                       |           | Text LLM Reason  |
                       |           | Plan + execute   |
                       |           | actions per step |
                       |           +--------+---------+
                       |                    |
                       |            +-------+-------+
                       |            |               |
                       |         Success      OCR can't read
                       |            |               |
                       |            |               v
                       |            |      +----------------+
                       |            |      |  Vision LLM    |
                       |            |      |  Screenshot    |
                       |            |      |  loop (last    |
                       |            |      |  resort)       |
                       |            |      +--------+-------+
                       |            |               |
                       |            +-------+-------+
                       |                    |
                       v                    v
              +----------------------------------+
              |       Learn Pattern              |
              |  Save to JSON guide + Skill Cache|
              |  Next time: direct execute (free)|
              +----------------------------------+

Adaptive loop: First run takes 9 steps with LLM. Second run replays the learned pattern in 1 step (free).

returnPartial: External agents (OpenClaw, Claude Code) can send {"returnPartial": true} — if text LLM fails, control returns to the agent instead of burning tokens on vision. The agent finishes with MCP tools and calls POST /learn to teach clawdcursor.
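The learn-then-replay cycle boils down to a cache of action sequences keyed by app and task. A minimal sketch; the Action shape, class name, and key scheme are illustrative, not clawdcursor's internal API:

```typescript
// Minimal sketch of the adaptive skill cache: a successful task's
// action sequence is stored, and a later identical task replays it
// without any LLM call. All names here are illustrative.
interface Action {
  tool: string;                  // e.g. "key_press"
  args: Record<string, unknown>; // tool arguments
}

class SkillCache {
  private cache = new Map<string, Action[]>();

  private key(app: string, task: string): string {
    return `${app}::${task.trim().toLowerCase()}`; // crude normalization
  }

  learn(app: string, task: string, actions: Action[]): void {
    this.cache.set(this.key(app, task), actions);
  }

  lookup(app: string, task: string): Action[] | undefined {
    return this.cache.get(this.key(app, task));
  }
}
```

On a cache hit the pipeline goes straight to Direct Execute; on a miss it falls through to OCR and the text LLM, then learns the result for next time.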

Model Recommendations (2026)

| Provider | Text model (Stage 2) | Vision model (Stage 3) | Notes |
|---|---|---|---|
| Google (Gemini) | gemini-2.5-flash | gemini-2.5-flash | Budget pick — one model for both, 1M ctx |
| Anthropic | claude-haiku-4-5 | claude-sonnet-4-20250514 | Best accuracy, Computer Use support |
| OpenAI | gpt-4o-mini | gpt-4o | Reliable, widely supported |
| Groq | llama-3.3-70b-versatile | llama-3.2-90b-vision-preview | Fastest inference |
| DeepSeek | deepseek-chat | deepseek-chat | Cheapest reasoning |
| Ollama | (auto-detected) | (auto-detected) | Free, runs locally |

Cost Comparison (approximate, per 1K tasks)

| Provider | Stage 2 cost | Stage 3 cost | Best for |
|---|---|---|---|
| Ollama (local) | $0 | $0 | Privacy, offline |
| Google Gemini 2.5 Flash | ~$0.15 | ~$0.60 | Budget cloud |
| DeepSeek Chat | ~$0.14 | ~$0.14 | Cheapest cloud |
| Groq Llama | ~$0.59 | ~$2.70 | Speed |
| OpenAI GPT-4o | ~$0.15 | ~$2.50 | Reliability |
| Anthropic Claude | ~$0.80 | ~$15 | Accuracy, Computer Use |

Tip: Configure Gemini 2.5 Flash for both text and vision roles — it's the best single-model budget option. One API key, one model, handles everything.

Action Verification

Every action is verified after execution:

  • Type actions: Reads back the focused element's text content
  • Click actions: Checks if window/focus changed as expected
  • Key presses: Verifies the expected state change occurred
  • CDP actions: Re-reads DOM to confirm changes
  • Task completion: Ground-truth check reads actual content (Notepad text, email window state, etc.)

If verification fails, the agent retries with a different approach instead of reporting false success.
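For a type action, for example, verification reduces to reading the focused element back and checking that the typed text survived. A hedged sketch — the function name and the whitespace/case normalization are assumptions:

```typescript
// Sketch of ground-truth verification for a type action: compare the
// text read back from the focused element against what was typed.
// The normalization choice here is illustrative.
function verifyTyped(expected: string, readBack: string): boolean {
  const norm = (s: string) => s.replace(/\s+/g, " ").trim().toLowerCase();
  return norm(readBack).includes(norm(expected));
}
```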

No-Progress Detection

If the LLM repeats the same action 3+ times in an 8-step window, it's blocked and forced to try something different. Combined with the premature-done blocker (requires evidence of completion for write tasks), this prevents the two most common failure modes: infinite loops and premature success.
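A sliding-window detector is enough to implement that rule. A sketch with an illustrative action-signature format (e.g. "mouse_click@500,300"); only the 3-in-8 thresholds come from the text above:

```typescript
// Sketch of the no-progress detector: keep the last 8 action
// signatures and refuse any action that reaches 3 occurrences
// inside that window, forcing the LLM to try something else.
class LoopDetector {
  private window: string[] = [];

  allow(signature: string): boolean {
    this.window.push(signature);
    if (this.window.length > 8) this.window.shift(); // keep 8-step window
    const count = this.window.filter((s) => s === signature).length;
    return count < 3; // the 3rd identical attempt is blocked
  }
}
```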


API Endpoints

http://localhost:3847

| Endpoint | Method | Description |
|---|---|---|
| / | GET | Web dashboard UI |
| /tools | GET | List all 40 tools (OpenAI function-calling format) |
| /execute/:name | POST | Execute a tool by name |
| /task | POST | Submit a task: {"task": "Open Chrome"} |
| /status | GET | Agent state and current task |
| /task-logs | GET | Recent task summaries (structured JSONL) |
| /task-logs/current | GET | Current task's step-by-step log |
| /report | POST | Submit an error report (opt-in) |
| /logs | GET | Last 200 console log entries |
| /screenshot | GET | Current screen as PNG |
| /action | POST | Direct action execution (LLM-space coords) |
| /confirm | POST | Approve/reject pending safety action |
| /abort | POST | Stop the current task |
| /favorites | GET/POST/DELETE | Saved command favorites |
| /stop | POST | Graceful server shutdown |
| /health | GET | Server health + version |

Architecture

                     Any AI Model
                    (Claude, GPT, Gemini, Llama, scripts, etc.)
                          |
            +-------------+-------------+
            |             |             |
        REST API      MCP stdio    Built-in Agent
        (serve)        (mcp)        (start)
            |             |             |
            +-------------+-------------+
                          |
               Clawd Cursor Tool Server
               40 tools, single definition
                          |
        +---------+-------+-------+---------+
        |         |       |       |         |
    Perception  Mouse  Keyboard  Window  Browser
    screenshot  click  key_press focus   cdp_click
    read_screen drag   type_text open    cdp_type
    a11y_tree   scroll           switch  cdp_read
                          |
               Native Desktop Layer
          nut-js + PowerShell/JXA/AT-SPI + Playwright
                          |
                    Your Desktop

Safety

| Tier | Actions | Behavior |
|---|---|---|
| Auto | Navigation, reading, opening apps | Runs immediately |
| Preview | Typing, form filling | Logs before executing |
| Confirm | Sending messages, deleting, purchases | Pauses for approval |

First run shows a desktop control consent warning. Dangerous key combos (Alt+F4, Ctrl+Alt+Del) are blocked. Server binds to localhost only.
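Tier assignment can be pictured as a simple intent match on the pending action. The patterns below are illustrative only; the actual rules are certainly finer-grained:

```typescript
// Illustrative mapping from an action description to the three
// safety tiers in the table above. Patterns are assumptions.
type Tier = "auto" | "preview" | "confirm";

function safetyTier(action: string): Tier {
  if (/\b(send|delete|purchase|pay|buy)\b/i.test(action)) return "confirm";
  if (/\b(type|fill|enter)\b/i.test(action)) return "preview";
  return "auto"; // navigation, reading, opening apps
}
```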

CLI Commands

clawdcursor start        Start the full agent (built-in LLM pipeline)
clawdcursor serve        Start tools-only server (no built-in LLM)
clawdcursor mcp          Run as MCP tool server over stdio
clawdcursor doctor       Diagnose and auto-configure
clawdcursor task <t>     Send a task to running agent
clawdcursor report       Send an error report (opt-in, redacted)
clawdcursor dashboard    Open the web dashboard
clawdcursor install      Set up API key and configure pipeline
clawdcursor uninstall    Remove all config and data
clawdcursor stop         Stop the running server
clawdcursor kill         Force stop

Options:
  --port <port>          API port (default: 3847)
  --provider <provider>  anthropic|openai|ollama|groq|together|deepseek|kimi|gemini|mistral|xai|qwen|fireworks|cohere|perplexity|...
  --model <model>        Override vision model
  --api-key <key>        AI provider API key
  --base-url <url>       Custom API endpoint
  --debug                Save screenshots to debug/ folder

Platform Support

| Platform | UI Automation | OCR | Browser (CDP) | Status |
|---|---|---|---|---|
| Windows (x64/ARM64) | PowerShell + .NET UI Automation | Windows.Media.Ocr | Chrome/Edge | Full support |
| macOS (Intel/Apple Silicon) | JXA + System Events | Apple Vision framework | Chrome/Edge | Full support |
| Linux (x64/ARM64) | AT-SPI (planned) | Tesseract OCR | Chrome/Edge | Browser + OCR |

Prerequisites

  • Node.js 20+ (x64 or ARM64)
  • Windows: PowerShell (included)
  • macOS 10.15+: Accessibility permissions granted, Xcode CLI tools (xcode-select --install)
  • Linux: tesseract-ocr and python3 for OCR (sudo apt install tesseract-ocr)
  • AI API Key — optional. Works offline with Ollama or tools-only mode.

Tech Stack

TypeScript - Node.js - @nut-tree-fork/nut-js - Playwright - sharp - Express - MCP SDK - Zod - Any OpenAI-compatible API - Anthropic Computer Use - Windows UI Automation - macOS Accessibility (JXA) - Chrome DevTools Protocol

License

MIT


clawdcursor.com
