an ai agent that controls your android phone. give it a goal in plain english — it figures out what to tap, type, and swipe.
i wanted to turn my old android devices into ai agents. after a few hours reverse engineering accessibility trees and playing with tailscale... it worked.
think of it this way — a few years back, we could automate android with predefined flows. now imagine that automation layer has an llm brain. it can read any screen, understand what's happening, decide what to do, and execute. you don't need apis. you don't need to build integrations. just install your favourite apps and tell the agent what you want done.
one of the coolest things it can do right now is delegate incoming requests to chatgpt, gemini, or google search on the device... and bring the result back. no api keys for those services needed — it just uses the apps like a human would.
```
$ bun run src/kernel.ts
enter your goal: open youtube and search for "lofi hip hop"

--- step 1/30 ---
think: i'm on the home screen. launching youtube.
action: launch (842ms)

--- step 2/30 ---
think: youtube is open. tapping search icon.
action: tap (623ms)

--- step 3/30 ---
think: search field focused.
action: type "lofi hip hop" (501ms)

--- step 4/30 ---
action: enter (389ms)

--- step 5/30 ---
think: search results showing. done.
action: done (412ms)
```
the core idea is dead simple — a perception → reasoning → action loop that repeats until the goal is done (or it runs out of steps).
```
┌─────────────────────────────────────────┐
│ your goal                               │
│ "send good morning to mom on whatsapp"  │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────┐
│                                                 │
│   ┌──────────────┐                              │
│   │ 1. perceive  │                              │
│   └──────┬───────┘                              │
│          │                                      │
│   dump accessibility tree via adb               │
│   parse xml → interactive ui elements           │
│   diff with previous screen (detect changes)    │
│   optionally capture screenshot                 │
│          │                                      │
│          ▼                                      │
│   ┌──────────────┐                              │
│   │ 2. reason    │                              │
│   └──────┬───────┘                              │
│          │                                      │
│   send screen state + goal + history to llm     │
│   llm returns { think, plan, action }           │
│   "i see the search icon at (890, 156).         │
│    i should tap it."                            │
│          │                                      │
│          ▼                                      │
│   ┌──────────────┐                              │
│   │ 3. act       │                              │
│   └──────┬───────┘                              │
│          │                                      │
│   execute via adb: tap, type, swipe, etc.       │
│   feed result back to llm on next step          │
│   check if goal is done                         │
│          │                                      │
│          ▼                                      │
│   done? ─────── yes ──→ exit                    │
│          │                                      │
│          no                                     │
│          │                                      │
│          └─────── loop back to perceive         │
│                                                 │
└─────────────────────────────────────────────────┘
```
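the loop can be sketched in a few lines of typescript. this is illustrative only — the `perceive`, `reason`, and `act` functions here are stand-ins, not droidclaw's actual kernel.ts:

```typescript
// minimal sketch of the perceive → reasoning → action loop.
// perceive/reason/act are injected stand-ins for the adb- and llm-backed versions.
type Decision = { think: string; action: string };

async function runAgent(
  goal: string,
  perceive: () => Promise<string>,                                  // dump + parse the screen
  reason: (goal: string, screen: string, history: Decision[]) => Promise<Decision>,
  act: (d: Decision) => Promise<string>,                            // execute via adb
  maxSteps = 30,
): Promise<{ done: boolean; steps: number }> {
  const history: Decision[] = [];
  for (let step = 1; step <= maxSteps; step++) {
    const screen = await perceive();                 // 1. perceive
    const decision = await reason(goal, screen, history); // 2. reason
    history.push(decision);                          // multi-turn memory
    if (decision.action === "done") return { done: true, steps: step };
    await act(decision);                             // 3. act; result feeds the next step
  }
  return { done: false, steps: maxSteps };           // ran out of steps
}
```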
llms controlling uis sounds fragile. and it is, if you don't handle the failure modes. here's what droidclaw does:
- stuck loop detection — if the screen doesn't change for 3 steps, recovery hints get injected into the prompt. context-aware hints based on what type of action is failing (tap vs swipe vs wait).
- repetition tracking — a sliding window of recent actions catches retry loops even across screen changes. if the agent taps the same coordinates 3+ times, it gets told to stop and try something else.
- drift detection — if the agent spams navigation actions (swipe, back, wait) without interacting with anything, it gets nudged to take direct action.
- vision fallback — when the accessibility tree is empty (webviews, flutter apps, games), a screenshot gets sent to the llm instead, with coordinate-based tap suggestions.
- action feedback — every action result (success/failure + message) gets fed back to the llm on the next step. the agent knows whether its last move worked.
- multi-turn memory — conversation history is maintained across steps so the llm has context about what it already tried.
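repetition tracking amounts to counting duplicates in a sliding window of recent actions. a minimal sketch — the window size and threshold here are assumptions, not droidclaw's exact values:

```typescript
// sliding-window retry-loop detector (illustrative; real logic lives in kernel.ts).
// an action is serialized as a string key, e.g. "tap:890,156".
function isRetryLoop(recentActions: string[], windowSize = 5, threshold = 3): boolean {
  const window = recentActions.slice(-windowSize);   // only look at the last few steps
  const counts = new Map<string, number>();
  for (const a of window) counts.set(a, (counts.get(a) ?? 0) + 1);
  // flag if any single action repeats threshold+ times, even across screen changes
  return Math.max(0, ...counts.values()) >= threshold;
}
```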
```
curl -fsSL https://droidclaw.ai/install.sh | sh
```
this installs bun and adb if missing, clones the repo, and sets up .env.
prerequisites:
- bun (required — node/npm won't work; droidclaw uses bun-specific apis like `Bun.spawnSync` and native `.env` loading)
- adb (android debug bridge — comes with android sdk platform tools)
- an android phone with usb debugging enabled
- an llm provider api key (or ollama for fully local)
```
# install adb
# macos:
brew install android-platform-tools
# linux:
sudo apt install android-tools-adb
# windows:
# download from https://developer.android.com/tools/releases/platform-tools

# install bun
curl -fsSL https://bun.sh/install | bash

# clone and setup
git clone https://github.com/unitedbyai/droidclaw.git
cd droidclaw
bun install
cp .env.example .env
```
edit .env and pick a provider. fastest way to start is groq (free tier):
```
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_your_key_here
```
or run fully local with ollama (no api key, no internet needed):
```
ollama pull llama3.2
# then in .env:
LLM_PROVIDER=ollama
OLLAMA_MODEL=llama3.2
```
- go to settings → about phone → tap "build number" 7 times to enable developer options
- go to settings → developer options → enable "usb debugging"
- plug in via usb and tap "allow" on the phone when prompted
```
adb devices   # should show your device
```
```
bun run src/kernel.ts
# type your goal and press enter
```
droidclaw has three modes, each for a different use case:
```
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  interactive mode     workflows            flows                    │
│  ─────────────────    ─────────────────    ─────────────────        │
│                                                                     │
│  type a goal and      chain goals          fixed sequences          │
│  the agent figures    across multiple      of taps and types.       │
│  it out on the fly.   apps with ai.        no llm, instant.         │
│                                                                     │
│  $ bun run            --workflow           --flow                   │
│    src/kernel.ts        file.json            file.yaml              │
│                                                                     │
│  best for:            best for:            best for:                │
│  one-off tasks,       multi-app tasks,     things you do            │
│  exploration,         recurring routines,  exactly the same         │
│  quick commands       morning briefings    way every time           │
│                                                                     │
│  uses llm: yes        uses llm: yes        uses llm: no             │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
just type what you want:
```
bun run src/kernel.ts
# enter your goal: open settings and turn on dark mode
```
workflows are json files describing a sequence of sub-goals. each step can optionally switch to a different app. the llm decides how to navigate, what to tap, what to type.
```
bun run src/kernel.ts --workflow examples/workflows/research/weather-to-whatsapp.json
```
```json
{
  "name": "weather to whatsapp",
  "steps": [
    {
      "app": "com.google.android.googlequicksearchbox",
      "goal": "search for chennai weather today"
    },
    {
      "goal": "share the result to whatsapp contact Sanju"
    }
  ]
}
```
you can inject specific data into steps using formData:
```json
{
  "name": "slack standup",
  "steps": [
    {
      "app": "com.Slack",
      "goal": "open #standup channel, type the message and send it",
      "formData": {
        "Message": "yesterday: api integration\ntoday: tests\nblockers: none"
      }
    }
  ]
}
```
for tasks where you don't need ai thinking — just a fixed sequence of taps and types. no llm calls, instant execution. think of it like a macro.
```
bun run src/kernel.ts --flow examples/flows/send-whatsapp.yaml
```
```yaml
appId: com.whatsapp
name: Send WhatsApp Message
---
- launchApp
- wait: 2
- tap: "Contact Name"
- wait: 1
- tap: "Message"
- type: "hello from droidclaw"
- tap: "Send"
- done: "Message sent"
```
| | workflows | flows |
|---|---|---|
| format | json | yaml |
| uses ai | yes | no |
| handles ui changes | yes | no |
| speed | slower (llm calls) | instant |
| best for | complex/multi-app tasks | simple repeatable tasks |
35 ready-to-use workflows organised by category:
messaging — whatsapp, telegram, slack, email
- slack-standup — post daily standup to a channel
- whatsapp-broadcast — send a message to multiple contacts
- telegram-send-message — send a telegram message
- email-reply — draft and send an email reply
- whatsapp-to-email — forward whatsapp messages to email
- slack-check-messages — read unread slack messages
- email-digest — summarise recent emails
- telegram-channel-digest — digest a telegram channel
- whatsapp-reply — reply to a whatsapp message
- send-whatsapp-vi — send whatsapp to a specific contact
social — instagram, youtube, cross-posting
- social-media-post — post across platforms
- social-media-engage — like/comment on posts
- instagram-post-check — check recent instagram posts
- youtube-watch-later — save videos to watch later
productivity — calendar, notes, github, notifications
- morning-briefing — read messages, calendar, weather across apps
- github-check-prs — check open pull requests
- calendar-create-event — create a calendar event
- notes-capture — capture a quick note
- notification-cleanup — clear and triage notifications
- screenshot-share-slack — screenshot and share to slack
- translate-and-reply — translate a message and reply
- logistics-workflow — multi-app logistics coordination
research — search, compare, monitor
- weather-to-whatsapp — get weather via google, share to whatsapp
- multi-app-research — research across multiple apps
- price-comparison — compare prices across shopping apps
- news-roundup — collect news from multiple sources
- google-search-report — search google and save results
- check-flight-status — check flight status
lifestyle — food, transport, music, fitness
- food-order — order food from a delivery app
- uber-ride — book an uber ride
- spotify-playlist — create or add to a spotify playlist
- maps-commute — check commute time
- fitness-log — log a workout
- expense-tracker — log an expense
- wifi-password-share — share wifi password
- do-not-disturb — toggle do not disturb with exceptions
flows — 5 deterministic flow templates (no ai)
- send-whatsapp — send a whatsapp message
- google-search — run a google search
- create-contact — add a new contact
- clear-notifications — clear all notifications
- toggle-wifi — toggle wifi on/off
the agent has 28 actions it can use. these are the building blocks — each one maps to an adb command.
basic interactions:
tap type enter longpress clear paste swipe scroll
navigation:
home back launch switch_app open_url open_settings notifications
clipboard:
clipboard_get clipboard_set
multi-step skills (compound actions that handle common patterns):
read_screen submit_message copy_visible_text wait_for_content find_and_tap compose_email
system:
screenshot shell keyevent pull_file push_file wait done
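to make "each one maps to an adb command" concrete, here's a sketch of how a few of the basic actions could translate to adb shell `input` commands. this is an illustrative subset, not the actual actions.ts:

```typescript
// illustrative action → adb shell command mapping (the full 28 live in actions.ts).
// commands would be run as: adb shell <command>
type UiAction =
  | { type: "tap"; x: number; y: number }
  | { type: "type"; text: string }
  | { type: "swipe"; x1: number; y1: number; x2: number; y2: number; ms: number }
  | { type: "back" };

function toAdbShell(a: UiAction): string {
  switch (a.type) {
    case "tap":
      return `input tap ${a.x} ${a.y}`;
    case "type":
      // adb's `input text` has no space support; spaces must be encoded as %s
      return `input text '${a.text.replace(/ /g, "%s")}'`;
    case "swipe":
      return `input swipe ${a.x1} ${a.y1} ${a.x2} ${a.y2} ${a.ms}`;
    case "back":
      return `input keyevent KEYCODE_BACK`;
  }
}
```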
the multi-step skills are interesting — they replace 5-10 manual actions with a single call. for example, read_screen auto-scrolls through the entire screen, collects all text, and copies it to clipboard. compose_email fills To, Subject, and Body fields in the correct order using android intents. these dramatically reduce the number of llm decisions needed.
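a read_screen-style skill boils down to "scroll until nothing new appears, accumulating text". a minimal sketch — `visibleText` and `scrollDown` are stand-ins for the adb-backed helpers:

```typescript
// sketch of an auto-scrolling read skill (illustrative, not droidclaw's skills.ts).
// keeps scrolling until a scroll yields no new text lines, then returns everything seen.
async function readScreen(
  visibleText: () => Promise<string[]>,  // text of currently visible elements
  scrollDown: () => Promise<void>,       // one swipe-up gesture
  maxScrolls = 10,
): Promise<string[]> {
  const seen = new Set<string>();        // dedupes lines that overlap between scrolls
  let prevSize = -1;
  for (let i = 0; i < maxScrolls && seen.size !== prevSize; i++) {
    prevSize = seen.size;
    for (const line of await visibleText()) seen.add(line);
    await scrollDown();
  }
  return [...seen];
}
```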
| provider | cost | vision | notes |
|---|---|---|---|
| groq | free tier | no | fastest to start, great for most tasks |
| ollama | free (local) | yes* | no api key, runs entirely on your machine |
| openrouter | per token | yes | 200+ models, single api |
| openai | per token | yes | gpt-4o, strong reasoning |
| bedrock | per token | yes | claude/llama on aws |
*ollama vision requires a vision-capable model like llama3.2-vision or llava
all configuration lives in .env. here's what you can tweak:
| key | default | what it does |
|---|---|---|
| LLM_PROVIDER | groq | which llm to use (groq/openai/ollama/bedrock/openrouter) |
| MAX_STEPS | 30 | how many steps before the agent gives up |
| STEP_DELAY | 2 | seconds to wait between actions (lets the ui settle) |
| STUCK_THRESHOLD | 3 | how many unchanged steps before stuck recovery kicks in |
| VISION_MODE | fallback | off / fallback (only when accessibility tree is empty) / always |
| MAX_ELEMENTS | 40 | max ui elements sent to the llm per step (scored & ranked) |
| MAX_HISTORY_STEPS | 10 | how many past steps to keep in conversation context |
| STREAMING_ENABLED | true | stream llm responses (shows progress dots) |
| LOG_DIR | logs | directory for session json logs |
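an env loader with validation like the one described can be sketched in a few lines. names mirror the table above, but this is a simplified stand-in for the real config.ts:

```typescript
// minimal env config loader with defaults and validation (illustrative sketch).
const defaults = {
  LLM_PROVIDER: "groq",
  MAX_STEPS: "30",
  STEP_DELAY: "2",
  STUCK_THRESHOLD: "3",
} as const;

function loadConfig(env: Record<string, string | undefined> = process.env) {
  const get = (k: keyof typeof defaults) => env[k] ?? defaults[k];
  const maxSteps = Number(get("MAX_STEPS"));
  // fail fast on bad values instead of looping forever or exiting instantly
  if (!Number.isInteger(maxSteps) || maxSteps <= 0) {
    throw new Error("MAX_STEPS must be a positive integer");
  }
  return {
    provider: get("LLM_PROVIDER"),
    maxSteps,
    stepDelay: Number(get("STEP_DELAY")),
    stuckThreshold: Number(get("STUCK_THRESHOLD")),
  };
}
```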
the entire agent is ~10 files in src/:
```
src/
├── kernel.ts          the main perception → reasoning → action loop
├── actions.ts         28 action implementations (tap, type, swipe, etc.)
├── skills.ts          6 multi-step skills (read_screen, compose_email, etc.)
├── workflow.ts        workflow orchestration engine (multi-app sub-goals)
├── flow.ts            yaml flow runner (deterministic, no llm)
├── llm-providers.ts   5 providers + the system prompt that teaches the llm
├── sanitizer.ts       accessibility xml parser → structured ui elements
├── config.ts          env config loader with validation
├── constants.ts       keycodes, swipe coordinates, defaults
└── logger.ts          session logging (json, crash-safe partial writes)
```
```
                 kernel.ts
                     │
        ┌────────────┼────────────────┐
        │            │                │
        ▼            ▼                ▼
  sanitizer.ts  llm-providers.ts  actions.ts
  (parse screen)  (ask the llm)   (execute via adb)
                                      │
                                      ├── skills.ts
                                      │   (multi-step compound actions)
                                      │
  config.ts ◄────── all files read config
  constants.ts ◄─── keycodes, coordinates
  workflow.ts ──── calls kernel.runAgent() per sub-goal
  flow.ts ──────── calls actions.executeAction() directly (no llm)
  logger.ts ◄───── kernel writes step logs here
```
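the sanitizer's job can be illustrated with a tiny regex-based parser over a uiautomator xml dump: pull out clickable nodes and turn their bounds into tap coordinates. the real sanitizer.ts is more thorough than this sketch:

```typescript
// illustrative parser for `adb shell uiautomator dump` output.
// uiautomator nodes look like:
//   <node ... text="Search" ... clickable="true" ... bounds="[840,100][940,212]" .../>
type UiElement = { text: string; cx: number; cy: number };

function parseDump(xml: string): UiElement[] {
  const elements: UiElement[] = [];
  // attribute order in real dumps is text → ... → clickable → ... → bounds
  const node =
    /<node [^>]*?text="([^"]*)"[^>]*?clickable="true"[^>]*?bounds="\[(\d+),(\d+)\]\[(\d+),(\d+)\]"/g;
  for (const m of xml.matchAll(node)) {
    const [, text, x1, y1, x2, y2] = m;
    elements.push({
      text,
      cx: Math.round((Number(x1) + Number(x2)) / 2), // tap target = element center
      cy: Math.round((Number(y1) + Number(y2)) / 2),
    });
  }
  return elements;
}
```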
the default setup is usb — phone plugged into your laptop. but you can go much further.
install tailscale on both your android device and your laptop/server. once they're on the same tailnet, connect adb over the network:
```
# on your phone: enable wireless debugging
# settings → developer options → wireless debugging
# note the ip:port shown

# from anywhere in the world:
adb connect <phone-tailscale-ip>:<port>
adb devices   # should show your phone
bun run src/kernel.ts
```
now your phone is a remote ai agent. leave it on a desk plugged into power, and control it from a vps, your laptop at a cafe, or a cron job running workflows every morning at 8am. the phone doesn't need to be on the same wifi or even in the same country.
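for example, a crontab entry like this could kick off a workflow every morning at 8am (the repo path and log redirection here are illustrative):

```
0 8 * * * cd /home/you/droidclaw && bun run src/kernel.ts --workflow examples/workflows/productivity/morning-briefing.json >> logs/cron.log 2>&1
```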
this is what makes old android devices useful again — they become always-on agents that can do things on apps that don't have apis.
```
bun run src/kernel.ts                       # interactive mode (prompts for goal)
bun run src/kernel.ts --workflow file.json  # run a workflow
bun run src/kernel.ts --flow file.yaml      # run a deterministic flow

bun install        # install dependencies
bun run build      # compile to dist/
bun run typecheck  # type-check (tsc --noEmit)
```
"adb: command not found" — install adb (`brew install android-platform-tools` on mac) or set ADB_PATH in .env to point to your adb binary.
"no devices found" — make sure usb debugging is enabled, you've tapped "allow" on the phone, and the cable supports data transfer (not just charging).
agent keeps repeating the same action — stuck detection should handle this automatically. if it persists, try a stronger model (groq's llama-3.3-70b or openai's gpt-4o).
empty accessibility tree — some apps (flutter, webviews, games) don't expose accessibility info. set VISION_MODE=always in .env to send screenshots every step instead.
swipe coordinates seem off — droidclaw auto-detects screen resolution at startup. if your device has an unusual resolution, check the console output on step 1 for the detected resolution.
built by unitedby.ai — an open ai community
droidclaw's workflow orchestration was influenced by android action kernel from action state labs. we took the core idea of sub-goal decomposition and built a different system around it — with stuck recovery, 28 actions, multi-step skills, and vision fallback.
mit