An autonomous ML agent that thinks like an experienced MLE. It works through a cognitive loop: ORIENT → RESEARCH → HYPOTHESIZE → EXECUTE → ANALYZE → VALIDATE → DECIDE.
- Claude Code or Codex CLI
- A git repo for your ML project
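Both prerequisites can be confirmed with a small preflight check before installing. The sketch below is only an illustration and is not part of ml-ralph: it assumes the Claude Code and Codex CLI executables are named `claude` and `codex` on your PATH, and the helper names `inside_git_repo` and `check_prerequisites` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def inside_git_repo(path: Path) -> bool:
    """Return True if `path` is inside a git work tree."""
    if shutil.which("git") is None:
        return False
    result = subprocess.run(
        ["git", "rev-parse", "--is-inside-work-tree"],
        cwd=path,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and result.stdout.strip() == "true"


def check_prerequisites(path, which=shutil.which, is_repo=inside_git_repo):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # Assumption: the agent CLIs are installed as `claude` and `codex`.
    if not any(which(tool) for tool in ("claude", "codex")):
        problems.append("neither `claude` nor `codex` found on PATH")
    if not is_repo(path):
        problems.append(f"{path} is not inside a git repository")
    return problems
```

Calling `check_prerequisites(Path.cwd())` from the project root returns an empty list when one of the two CLIs is available and the directory is under git.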
pip install ml-ralph

Or with uv:

uv tool install --editable .

ml-ralph init

Open Claude Code and use the /ml-ralph skill:
/ml-ralph
Ralph will ask clarifying questions to understand your ML problem and create a PRD.
ml-ralph run

Ralph works through the cognitive loop until success criteria are met.
After running Ralph, your project will have:
your-project/
├── .ml-ralph/
│   ├── RALPH.md       # Agent instructions
│   ├── prd.json       # PRD (the contract)
│   ├── ralph.json     # Execution state
│   ├── backlog.json   # Hypotheses queue
│   └── log.jsonl      # Thinking log (research, learnings, analysis)
├── .claude/skills/ml-ralph/
├── .codex/skills/ml-ralph/
├── CLAUDE.md
└── AGENTS.md
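Given its `.jsonl` extension, `log.jsonl` can presumably be read as JSON Lines, with one JSON object per line. The sketch below assumes that format and nothing about the schema: the `type` key used for grouping is a hypothetical field name, so swap in whatever key your log actually contains.

```python
import json
from collections import Counter
from pathlib import Path


def read_jsonl(path: Path) -> list:
    """Parse a JSON Lines file into a list, skipping blank lines."""
    entries = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def count_by_key(entries: list, key: str = "type") -> Counter:
    """Count entries per value of `key`; entries lacking it count as 'unknown'."""
    # Assumption: "type" is a hypothetical field, not a documented one.
    return Counter(entry.get(key, "unknown") for entry in entries)
```

For example, `count_by_key(read_jsonl(Path(".ml-ralph/log.jsonl")))` gives a quick summary of what the agent has logged so far.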
| Command | Purpose |
|---|---|
| `ml-ralph init` | Initialize Ralph in current project |
| `ml-ralph run` | Run autonomous execution loop |
# Use Claude Code (default)
ml-ralph run --tool claude
# Use OpenAI Codex
ml-ralph run --tool codex
# Codex with custom sandbox mode (default: workspace-write)
ml-ralph run --tool codex --sandbox danger-full-access
# Set max iterations (default: 100)
ml-ralph run --max-iterations 200
# Force overwrite on init
ml-ralph init --force

When using --tool codex, you can control the sandbox policy:
| Mode | Description |
|---|---|
| `read-only` | Agent can only read files, not modify |
| `workspace-write` | Agent can modify files in workspace (default) |
| `danger-full-access` | Full system access (use with caution) |
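When launching Ralph from a script, it can help to validate the options before spawning the process. The following is a minimal sketch that only assembles the flags documented here (`--tool`, `--sandbox`, `--max-iterations`). The function `build_run_command` is hypothetical, and both rejecting `--sandbox` for non-Codex tools and combining `--max-iterations` with the other flags are assumptions rather than documented CLI behavior.

```python
TOOLS = ("claude", "codex")
SANDBOX_MODES = ("read-only", "workspace-write", "danger-full-access")


def build_run_command(tool="claude", sandbox=None, max_iterations=100):
    """Assemble an `ml-ralph run` argument list from the documented flags."""
    if tool not in TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    command = ["ml-ralph", "run", "--tool", tool,
               "--max-iterations", str(max_iterations)]
    if sandbox is not None:
        # Assumption: the sandbox policy is only meaningful for Codex.
        if tool != "codex":
            raise ValueError("--sandbox applies only to --tool codex")
        if sandbox not in SANDBOX_MODES:
            raise ValueError(f"unknown sandbox mode: {sandbox!r}")
        command += ["--sandbox", sandbox]
    return command
```

The resulting list can be passed straight to `subprocess.run(..., check=True)`. Starting with `read-only` and widening the policy only when needed is the more conservative choice.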
ORIENT → RESEARCH → HYPOTHESIZE → EXECUTE → ANALYZE → VALIDATE → DECIDE
   ↑                                                                │
   └────────────────────────────────────────────────────────────────┘
- ORIENT: Understand the problem, constraints, failure modes
- RESEARCH: Learn from existing knowledge, find SOTA approaches
- HYPOTHESIZE: Form testable bets with expected outcomes
- EXECUTE: Implement minimal changes, run experiments
- ANALYZE: Understand results, examine failures, find patterns
- VALIDATE: Check for leakage, ensure results are trustworthy
- DECIDE: Keep/revert/pivot based on evidence
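
The loop above can be sketched as a small state machine. This illustrates the control flow only and is not Ralph's actual implementation: the `Phase` enum mirrors the seven documented phases, while `run_loop`, the `run_phase` callback, and the `success` predicate are hypothetical names introduced here. Treating one full pass through all phases as a single iteration, capped the way `--max-iterations` suggests, is also an assumption.

```python
from enum import Enum
from typing import Callable


class Phase(Enum):
    ORIENT = "orient"
    RESEARCH = "research"
    HYPOTHESIZE = "hypothesize"
    EXECUTE = "execute"
    ANALYZE = "analyze"
    VALIDATE = "validate"
    DECIDE = "decide"


def run_loop(
    run_phase: Callable[[Phase, dict], None],
    success: Callable[[dict], bool],
    max_iterations: int = 100,
):
    """Cycle through all phases until `success(state)` holds or the cap is hit.

    Returns the final state and the number of completed iterations.
    """
    state: dict = {}
    for iteration in range(1, max_iterations + 1):
        for phase in Phase:  # Enum iteration preserves definition order
            run_phase(phase, state)
        # After DECIDE, check the success criteria before returning to ORIENT.
        if success(state):
            return state, iteration
    return state, max_iterations
```

Keeping the phases in an enum makes the order explicit and lets each phase handler record its output in the shared `state` dictionary, which stands in here for files such as `ralph.json` and `log.jsonl`.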
