an ai agent that controls your android phone. give it a goal in plain english — it figures out what to tap, type, and swipe.
i wanted to turn my old android devices into ai agents. after a few hours reverse engineering accessibility trees and playing with tailscale... it worked.
think of it this way — a few years back, we could automate android with predefined flows. now imagine that automation layer has an llm brain. it can read any screen, understand what's happening, decide what to do, and execute. you don't need apis. you don't need to build integrations. just install your favourite apps and tell the agent what you want done.
one of the coolest things it can do right now is delegate incoming requests to chatgpt, gemini, or google search on the device... and bring the result back. no api keys for those services needed — it just uses the apps like a human would.
```
$ bun run src/kernel.ts
enter your goal: open youtube and search for "lofi hip hop"

--- step 1/30 ---
think: i'm on the home screen. launching youtube.
action: launch (842ms)

--- step 2/30 ---
think: youtube is open. tapping search icon.
action: tap (623ms)

--- step 3/30 ---
think: search field focused.
action: type "lofi hip hop" (501ms)

--- step 4/30 ---
action: enter (389ms)

--- step 5/30 ---
think: search results showing. done.
action: done (412ms)
```
the core idea is dead simple — a perception → reasoning → action loop that repeats until the goal is done (or it runs out of steps).
```
┌─────────────────────────────────────────┐
│ your goal                               │
│ "send good morning to mom on whatsapp"  │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────┐
│                                                 │
│   ┌──────────────┐                              │
│   │ 1. perceive  │                              │
│   └──────┬───────┘                              │
│          │                                      │
│   dump accessibility tree via adb               │
│   parse xml → interactive ui elements           │
│   diff with previous screen (detect changes)    │
│   optionally capture screenshot                 │
│          │                                      │
│          ▼                                      │
│   ┌──────────────┐                              │
│   │ 2. reason    │                              │
│   └──────┬───────┘                              │
│          │                                      │
│   send screen state + goal + history to llm     │
│   llm returns { think, plan, action }           │
│   "i see the search icon at (890, 156).         │
│    i should tap it."                            │
│          │                                      │
│          ▼                                      │
│   ┌──────────────┐                              │
│   │ 3. act       │                              │
│   └──────┬───────┘                              │
│          │                                      │
│   execute via adb: tap, type, swipe, etc.       │
│   feed result back to llm on next step          │
│   check if goal is done                         │
│          │                                      │
│          ▼                                      │
│   done? ─────── yes ──→ exit                    │
│          │                                      │
│          no                                     │
│          │                                      │
│          └─────── loop back to perceive         │
│                                                 │
└─────────────────────────────────────────────────┘
```
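the loop can be sketched in a few lines of typescript. this is illustrative only — the `perceive`, `reason`, and `act` functions here are stand-ins, not droidclaw's actual kernel.ts:

```typescript
// minimal sketch of the perceive → reasoning → action loop.
// perceive/reason/act are injected stand-ins for the adb- and llm-backed versions.
type Decision = { think: string; action: string };

async function runAgent(
  goal: string,
  perceive: () => Promise<string>,                                  // dump + parse the screen
  reason: (goal: string, screen: string, history: Decision[]) => Promise<Decision>,
  act: (d: Decision) => Promise<string>,                            // execute via adb
  maxSteps = 30,
): Promise<{ done: boolean; steps: number }> {
  const history: Decision[] = [];
  for (let step = 1; step <= maxSteps; step++) {
    const screen = await perceive();                 // 1. perceive
    const decision = await reason(goal, screen, history); // 2. reason
    history.push(decision);                          // multi-turn memory
    if (decision.action === "done") return { done: true, steps: step };
    await act(decision);                             // 3. act; result feeds the next step
  }
  return { done: false, steps: maxSteps };           // ran out of steps
}
```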
llms controlling uis sounds fragile. and it is, if you don't handle the failure modes. here's what droidclaw does:
- stuck loop detection — if the screen doesn't change for 3 steps, recovery hints get injected into the prompt. context-aware hints based on what type of action is failing (tap vs swipe vs wait).
- repetition tracking — a sliding window of recent actions catches retry loops even across screen changes. if the agent taps the same coordinates 3+ times, it gets told to stop and try something else.
- drift detection — if the agent spams navigation actions (swipe, back, wait) without interacting with anything, it gets nudged to take direct action.
- vision fallback — when the accessibility tree is empty (webviews, flutter apps, games), a screenshot gets sent to the llm instead, with coordinate-based tap suggestions.
- action feedback — every action result (success/failure + message) gets fed back to the llm on the next step. the agent knows whether its last move worked.
- multi-turn memory — conversation history is maintained across steps so the llm has context about what it already tried.
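repetition tracking amounts to counting duplicates in a sliding window of recent actions. a minimal sketch — the window size and threshold here are assumptions, not droidclaw's exact values:

```typescript
// sliding-window retry-loop detector (illustrative; real logic lives in kernel.ts).
// an action is serialized as a string key, e.g. "tap:890,156".
function isRetryLoop(recentActions: string[], windowSize = 5, threshold = 3): boolean {
  const window = recentActions.slice(-windowSize);   // only look at the last few steps
  const counts = new Map<string, number>();
  for (const a of window) counts.set(a, (counts.get(a) ?? 0) + 1);
  // flag if any single action repeats threshold+ times, even across screen changes
  return Math.max(0, ...counts.values()) >= threshold;
}
```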
```
curl -fsSL https://droidclaw.ai/install.sh | sh
```
this installs bun and adb if missing, clones the repo, and sets up .env.
prerequisites:
- bun (required — node/npm won't work; droidclaw uses bun-specific apis like `Bun.spawnSync` and native `.env` loading)
- adb (android debug bridge — comes with android sdk platform tools)
- an android phone with usb debugging enabled
- an llm provider api key (or ollama for fully local)
```
# install adb
# macos:
brew install android-platform-tools
# linux:
sudo apt install android-tools-adb
# windows:
# download from https://developer.android.com/tools/releases/platform-tools

# install bun
curl -fsSL https://bun.sh/install | bash

# clone and setup
git clone https://github.com/unitedbyai/droidclaw.git
cd droidclaw
bun install
cp .env.example .env
```
edit .env and pick a provider. fastest way to start is groq (free tier):
```
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_your_key_here
```
or run fully local with ollama (no api key, no internet needed):
```
ollama pull llama3.2
# then in .env:
LLM_PROVIDER=ollama
OLLAMA_MODEL=llama3.2
```
- go to settings → about phone → tap "build number" 7 times to enable developer options
- go to settings → developer options → enable "usb debugging"
- plug in via usb and tap "allow" on the phone when prompted
```
adb devices   # should show your device
```
```
bun run src/kernel.ts
# type your goal and press enter
```
droidclaw has three modes, each for a different use case:
```
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  interactive mode     workflows            flows                    │
│  ─────────────────    ─────────────────    ─────────────────        │
│                                                                     │
│  type a goal and      chain goals          fixed sequences          │
│  the agent figures    across multiple      of taps and types.       │
│  it out on the fly.   apps with ai.        no llm, instant.         │
│                                                                     │
│  $ bun run            --workflow           --flow                   │
│    src/kernel.ts        file.json            file.yaml              │
│                                                                     │
│  best for:            best for:            best for:                │
│  one-off tasks,       multi-app tasks,     things you do            │
│  exploration,         recurring routines,  exactly the same         │
│  quick commands       morning briefings    way every time           │
│                                                                     │
│  uses llm: yes        uses llm: yes        uses llm: no             │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
just type what you want:
```
bun run src/kernel.ts
# enter your goal: open settings and turn on dark mode
```
workflows are json files describing a sequence of sub-goals. each step can optionally switch to a different app. the llm decides how to navigate, what to tap, what to type.
```
bun run src/kernel.ts --workflow examples/workflows/research/weather-to-whatsapp.json
```
```json
{
  "name": "weather to whatsapp",
  "steps": [
    {
      "app": "com.google.android.googlequicksearchbox",
      "goal": "search for chennai weather today"
    },
    {
      "goal": "share the result to whatsapp contact Sanju"
    }
  ]
}
```
you can inject specific data into steps using formData:
```json
{
  "name": "slack standup",
  "steps": [
    {
      "app": "com.Slack",
      "goal": "open #standup channel, type the message and send it",
      "formData": {
        "Message": "yesterday: api integration\ntoday: tests\nblockers: none"
      }
    }
  ]
}
```
for tasks where you don't need ai thinking — just a fixed sequence of taps and types. no llm calls, instant execution. think of it like a macro.
```
bun run src/kernel.ts --flow examples/flows/send-whatsapp.yaml
```
```yaml
appId: com.whatsapp
name: Send WhatsApp Message
---
- launchApp
- wait: 2
- tap: "Contact Name"
- wait: 1
- tap: "Message"
- type: "hello from droidclaw"
- tap: "Send"
- done: "Message sent"
```
| | workflows | flows |
|---|---|---|
| format | json | yaml |
| uses ai | yes | no |
| handles ui changes | yes | no |
| speed | slower (llm calls) | instant |
| best for | complex/multi-app tasks | simple repeatable tasks |
35 ready-to-use workflows organised by category:
messaging — whatsapp, telegram, slack, email
- slack-standup — post daily standup to a channel
- whatsapp-broadcast — send a message to multiple contacts
- telegram-send-message — send a telegram message
- email-reply — draft and send an email reply
- whatsapp-to-email — forward whatsapp messages to email
- slack-check-messages — read unread slack messages
- email-digest — summarise recent emails
- telegram-channel-digest — digest a telegram channel
- whatsapp-reply — reply to a whatsapp message
- send-whatsapp-vi — send whatsapp to a specific contact
social — instagram, youtube, cross-posting
- social-media-post — post across platforms
- social-media-engage — like/comment on posts
- instagram-post-check — check recent instagram posts
- youtube-watch-later — save videos to watch later
productivity — calendar, notes, github, notifications
- morning-briefing — read messages, calendar, weather across apps
- github-check-prs — check open pull requests
- calendar-create-event — create a calendar event
- notes-capture — capture a quick note
- notification-cleanup — clear and triage notifications
- screenshot-share-slack — screenshot and share to slack
- translate-and-reply — translate a message and reply
- logistics-workflow — multi-app logistics coordination
research — search, compare, monitor
- weather-to-whatsapp — get weather via google, share to whatsapp
- multi-app-research — research across multiple apps
- price-comparison — compare prices across shopping apps
- news-roundup — collect news from multiple sources
- google-search-report — search google and save results
- check-flight-status — check flight status
lifestyle — food, transport, music, fitness
- food-order — order food from a delivery app
- uber-ride — book an uber ride
- spotify-playlist — create or add to a spotify playlist
- maps-commute — check commute time
- fitness-log — log a workout
- expense-tracker — log an expense
- wifi-password-share — share wifi password
- do-not-disturb — toggle do not disturb with exceptions
flows — 5 deterministic flow templates (no ai)
- send-whatsapp — send a whatsapp message
- google-search — run a google search
- create-contact — add a new contact
- clear-notifications — clear all notifications
- toggle-wifi — toggle wifi on/off
the agent has 28 actions it can use. these are the building blocks — each one maps to an adb command.
basic interactions:
tap type enter longpress clear paste swipe scroll
navigation:
home back launch switch_app open_url open_settings notifications
clipboard:
clipboard_get clipboard_set
multi-step skills (compound actions that handle common patterns):
read_screen submit_message copy_visible_text wait_for_content find_and_tap compose_email
system:
screenshot shell keyevent pull_file push_file wait done
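to make "each one maps to an adb command" concrete, here's a sketch of how a few of the basic actions could translate to adb shell `input` commands. this is an illustrative subset, not the actual actions.ts:

```typescript
// illustrative action → adb shell command mapping (the full 28 live in actions.ts).
// commands would be run as: adb shell <command>
type UiAction =
  | { type: "tap"; x: number; y: number }
  | { type: "type"; text: string }
  | { type: "swipe"; x1: number; y1: number; x2: number; y2: number; ms: number }
  | { type: "back" };

function toAdbShell(a: UiAction): string {
  switch (a.type) {
    case "tap":
      return `input tap ${a.x} ${a.y}`;
    case "type":
      // adb's `input text` has no space support; spaces must be encoded as %s
      return `input text '${a.text.replace(/ /g, "%s")}'`;
    case "swipe":
      return `input swipe ${a.x1} ${a.y1} ${a.x2} ${a.y2} ${a.ms}`;
    case "back":
      return `input keyevent KEYCODE_BACK`;
  }
}
```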
the multi-step skills are interesting — they replace 5-10 manual actions with a single call. for example, read_screen auto-scrolls through the entire screen, collects all text, and copies it to clipboard. compose_email fills To, Subject, and Body fields in the correct order using android intents. these dramatically reduce the number of llm decisions needed.
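a read_screen-style skill boils down to "scroll until nothing new appears, accumulating text". a minimal sketch — `visibleText` and `scrollDown` are stand-ins for the adb-backed helpers:

```typescript
// sketch of an auto-scrolling read skill (illustrative, not droidclaw's skills.ts).
// keeps scrolling until a scroll yields no new text lines, then returns everything seen.
async function readScreen(
  visibleText: () => Promise<string[]>,  // text of currently visible elements
  scrollDown: () => Promise<void>,       // one swipe-up gesture
  maxScrolls = 10,
): Promise<string[]> {
  const seen = new Set<string>();        // dedupes lines that overlap between scrolls
  let prevSize = -1;
  for (let i = 0; i < maxScrolls && seen.size !== prevSize; i++) {
    prevSize = seen.size;
    for (const line of await visibleText()) seen.add(line);
    await scrollDown();
  }
  return [...seen];
}
```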
| provider | cost | vision | notes |
|---|---|---|---|
| groq | free tier | no | fastest to start, great for most tasks |
| ollama | free (local) | yes* | no api key, runs entirely on your machine |
| openrouter | per token | yes | 200+ models, single api |
| openai | per token | yes | gpt-4o, strong reasoning |
| bedrock | per token | yes | claude/llama on aws |
*ollama vision requires a vision-capable model like llama3.2-vision or llava
all configuration lives in .env. here's what you can tweak:
| key | default | what it does |
|---|---|---|
| LLM_PROVIDER | groq | which llm to use (groq/openai/ollama/bedrock/openrouter) |
| MAX_STEPS | 30 | how many steps before the agent gives up |
| STEP_DELAY | 2 | seconds to wait between actions (lets the ui settle) |
| STUCK_THRESHOLD | 3 | how many unchanged steps before stuck recovery kicks in |
| VISION_MODE | fallback | off / fallback (only when accessibility tree is empty) / always |
| MAX_ELEMENTS | 40 | max ui elements sent to the llm per step (scored & ranked) |
| MAX_HISTORY_STEPS | 10 | how many past steps to keep in conversation context |
| STREAMING_ENABLED | true | stream llm responses (shows progress dots) |
| LOG_DIR | logs | directory for session json logs |
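an env loader with validation like the one described can be sketched in a few lines. names mirror the table above, but this is a simplified stand-in for the real config.ts:

```typescript
// minimal env config loader with defaults and validation (illustrative sketch).
const defaults = {
  LLM_PROVIDER: "groq",
  MAX_STEPS: "30",
  STEP_DELAY: "2",
  STUCK_THRESHOLD: "3",
} as const;

function loadConfig(env: Record<string, string | undefined> = process.env) {
  const get = (k: keyof typeof defaults) => env[k] ?? defaults[k];
  const maxSteps = Number(get("MAX_STEPS"));
  // fail fast on bad values instead of looping forever or exiting instantly
  if (!Number.isInteger(maxSteps) || maxSteps <= 0) {
    throw new Error("MAX_STEPS must be a positive integer");
  }
  return {
    provider: get("LLM_PROVIDER"),
    maxSteps,
    stepDelay: Number(get("STEP_DELAY")),
    stuckThreshold: Number(get("STUCK_THRESHOLD")),
  };
}
```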
the entire agent is ~10 files in src/:
```
src/
├── kernel.ts          the main perception → reasoning → action loop
├── actions.ts         28 action implementations (tap, type, swipe, etc.)
├── skills.ts          6 multi-step skills (read_screen, compose_email, etc.)
├── workflow.ts        workflow orchestration engine (multi-app sub-goals)
├── flow.ts            yaml flow runner (deterministic, no llm)
├── llm-providers.ts   5 providers + the system prompt that teaches the llm
├── sanitizer.ts       accessibility xml parser → structured ui elements
├── config.ts          env config loader with validation
├── constants.ts       keycodes, swipe coordinates, defaults
└── logger.ts          session logging (json, crash-safe partial writes)
```
```
                 kernel.ts
                     │
        ┌────────────┼────────────────┐
        │            │                │
        ▼            ▼                ▼
  sanitizer.ts  llm-providers.ts  actions.ts
  (parse screen)  (ask the llm)   (execute via adb)
                                      │
                                      ├── skills.ts
                                      │   (multi-step compound actions)
                                      │
  config.ts ◄────── all files read config
  constants.ts ◄─── keycodes, coordinates
  workflow.ts ──── calls kernel.runAgent() per sub-goal
  flow.ts ──────── calls actions.executeAction() directly (no llm)
  logger.ts ◄───── kernel writes step logs here
```
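the sanitizer's job can be illustrated with a tiny regex-based parser over a uiautomator xml dump: pull out clickable nodes and turn their bounds into tap coordinates. the real sanitizer.ts is more thorough than this sketch:

```typescript
// illustrative parser for `adb shell uiautomator dump` output.
// uiautomator nodes look like:
//   <node ... text="Search" ... clickable="true" ... bounds="[840,100][940,212]" .../>
type UiElement = { text: string; cx: number; cy: number };

function parseDump(xml: string): UiElement[] {
  const elements: UiElement[] = [];
  // attribute order in real dumps is text → ... → clickable → ... → bounds
  const node =
    /<node [^>]*?text="([^"]*)"[^>]*?clickable="true"[^>]*?bounds="\[(\d+),(\d+)\]\[(\d+),(\d+)\]"/g;
  for (const m of xml.matchAll(node)) {
    const [, text, x1, y1, x2, y2] = m;
    elements.push({
      text,
      cx: Math.round((Number(x1) + Number(x2)) / 2), // tap target = element center
      cy: Math.round((Number(y1) + Number(y2)) / 2),
    });
  }
  return elements;
}
```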
the default setup is usb — phone plugged into your laptop. but you can go much further.
install tailscale on both your android device and your laptop/server. once they're on the same tailnet, connect adb over the network:
```
# on your phone: enable wireless debugging
# settings → developer options → wireless debugging
# note the ip:port shown

# from anywhere in the world:
adb connect <phone-tailscale-ip>:<port>
adb devices   # should show your phone
bun run src/kernel.ts
```
now your phone is a remote ai agent. leave it on a desk plugged into power, and control it from a vps, your laptop at a cafe, or a cron job running workflows every morning at 8am. the phone doesn't need to be on the same wifi or even in the same country.
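for example, a crontab entry like this could kick off a workflow every morning at 8am (the repo path and log redirection here are illustrative):

```
0 8 * * * cd /home/you/droidclaw && bun run src/kernel.ts --workflow examples/workflows/productivity/morning-briefing.json >> logs/cron.log 2>&1
```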
this is what makes old android devices useful again — they become always-on agents that can do things on apps that don't have apis.
```
bun run src/kernel.ts                       # interactive mode (prompts for goal)
bun run src/kernel.ts --workflow file.json  # run a workflow
bun run src/kernel.ts --flow file.yaml      # run a deterministic flow

bun install        # install dependencies
bun run build      # compile to dist/
bun run typecheck  # type-check (tsc --noEmit)
```
"adb: command not found" — install adb (`brew install android-platform-tools` on mac) or set ADB_PATH in .env to point to your adb binary.
"no devices found" — make sure usb debugging is enabled, you've tapped "allow" on the phone, and the cable supports data transfer (not just charging).
agent keeps repeating the same action — stuck detection should handle this automatically. if it persists, try a stronger model (groq's llama-3.3-70b or openai's gpt-4o).
empty accessibility tree — some apps (flutter, webviews, games) don't expose accessibility info. set VISION_MODE=always in .env to send screenshots every step instead.
swipe coordinates seem off — droidclaw auto-detects screen resolution at startup. if your device has an unusual resolution, check the console output on step 1 for the detected resolution.
built by unitedby.ai — an open ai community
droidclaw's workflow orchestration was influenced by android action kernel from action state labs. we took the core idea of sub-goal decomposition and built a different system around it — with stuck recovery, 28 actions, multi-step skills, and vision fallback.
mit