A local implementation of Recursive Language Models (RLM), based on the paper "Recursive Language Models" by Zhang, Kraska & Khattab (MIT OASYS Lab).
RLM gives an LLM a live Python REPL and a recursive sub-LLM — turning a single inference call into an iterative reasoning loop. The model can write and execute code, inspect results, call smaller models as tools, and keep iterating until it's confident in its answer. This makes it practical for tasks that are impossible in a single forward pass: exploring 100K-line documents, multi-step computation, structured data extraction, and more.
Runs entirely locally via Ollama, with optional cloud model support (OpenAI, Anthropic, Together, Groq, and any Ollama-hosted cloud model).
```
 User query + optional context
               │
        ┌──────▼──────┐
        │  Root LLM   │   (any Ollama or cloud model)
        └──────┬──────┘
               │  outputs either:
     ┌─────────┴──────────────┐
     │                        │
 ```repl``` code block    FINAL(answer)
     │                        │
 ┌───▼────────┐          ── done ──
 │  REPL env  │   sandboxed Python exec
 │            │   `context` var = user data
 │            │   `llm_query(p)` → sub-LLM call
 └───┬────────┘
     │  stdout / variables fed back as next user message
     └──► loop back to Root LLM
```
The model writes code in fenced `repl` blocks. The engine executes them and feeds the output back into the conversation. This repeats until the model outputs `FINAL(answer)`.
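This execute-and-feed-back loop can be sketched with a stubbed model. Note that `run_rlm_loop`, `stub_model`, and the message format here are hypothetical names for illustration only, not the engine's actual API:

```python
import io
import re
import contextlib

FENCE = "`" * 3  # a literal ``` as it appears in the model's output

def run_rlm_loop(model_fn, context, max_iterations=15):
    """Minimal sketch of the RLM loop: execute repl blocks,
    feed captured stdout back, stop when the model emits FINAL(...)."""
    repl_globals = {"context": context}  # persistent REPL state across turns
    messages = [{"role": "user", "content": "Answer using the REPL."}]
    for _ in range(max_iterations):
        reply = model_fn(messages)
        final = re.search(r"FINAL\((.*)\)", reply, re.DOTALL)
        if final:
            return final.group(1)  # model is confident; stop looping
        code = re.search(FENCE + r"repl\n(.*?)" + FENCE, reply, re.DOTALL)
        if code is None:
            continue
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):  # capture print() output
            exec(code.group(1), repl_globals)
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": buf.getvalue()})
    return None

# Stubbed "model": first inspects the context, then answers.
def stub_model(messages):
    if len(messages) == 1:
        return FENCE + "repl\nprint(len(context))\n" + FENCE
    return "FINAL(context has " + messages[-1]["content"].strip() + " chars)"

print(run_rlm_loop(stub_model, "hello world"))  # → context has 11 chars
```

The real engine adds the sub-LLM call, sandboxing, and iteration caps on top of this skeleton.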
Requirements: Python 3.10+, Ollama running locally.

```shell
git clone https://github.com/zitacron/RLM-Ollama
cd RLM-Ollama
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Pull at least one model before using:

```shell
ollama pull qwen3.5:2b
```

Optional cloud providers:

```shell
pip install openai     # OpenAI, Together AI, Groq, Fireworks, OpenRouter
pip install anthropic  # Anthropic Claude
```

chat.py provides a full conversational interface. Every query runs through the RLM loop — the model gets a REPL and can call sub-LLMs before answering.
```shell
python3 chat.py
python3 chat.py --model qwen3:4b
python3 chat.py --model qwen3:4b --no-think                        # disable thinking tokens (faster)
python3 chat.py --model llama3.1:8b --recursive-model qwen3:0.6b   # smaller sub-LLM
python3 chat.py --model qwen3:4b --context path/to/bigfile.txt     # preload context
```

In-session commands:
| Command | Effect |
|---|---|
| `/context <file>` | Load a file as context mid-session |
| `/model <name>` | Switch the root model |
| `/quit` | Exit |
All options:

| Flag | Default | Description |
|---|---|---|
| `--model` / `-m` | `llama3.1:8b` | Root LLM model |
| `--recursive-model` / `-r` | same as `--model` | Sub-LLM for `llm_query()` calls inside the REPL |
| `--context` / `-c` | none | Path to a file preloaded as context |
| `--host` | `localhost:11434` | Ollama server address |
| `--max-iterations` | `15` | Cap on root-LLM turns per query |
| `--num-ctx` | `131072` | Context window tokens passed to Ollama |
| `--temperature` | `0.6` | Sampling temperature |
| `--think` / `--no-think` | auto | Enable/disable thinking mode (qwen3, deepseek-r1, etc.) |
Ollama cloud models (e.g. `glm-5:cloud`) require signing in first:

```shell
ollama signin
python3 chat.py --model glm-5:cloud
```

benchmark_models.py runs a Needle-in-a-Haystack test across multiple models in one shot. It embeds a secret number in a large block of random text and measures each model's accuracy, speed, and iteration count.
By default tests qwen3:0.6b and qwen3:1.7b. Auto-detects cloud providers from environment variables.
```shell
python3 benchmark_models.py
```

Basic options:

```shell
# Change haystack size and number of trials per model
python3 benchmark_models.py --lines 5000 --trials 5

# Add extra local Ollama models
python3 benchmark_models.py --ollama qwen3:4b llama3.1:8b

# Skip the default local models (cloud-only run)
python3 benchmark_models.py --no-local

# Reproducible run
python3 benchmark_models.py --seed 42
```

Cloud models — auto-detected from env vars:
| Env var | Provider | Default model tested |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI | `gpt-4o-mini` |
| `ANTHROPIC_API_KEY` | Anthropic | `claude-haiku-3-5-latest` |
| `TOGETHER_API_KEY` | Together AI | `meta-llama/Llama-3-8b-chat-hf` |
| `GROQ_API_KEY` | Groq | `llama-3.1-8b-instant` |
| `FIREWORKS_API_KEY` | Fireworks | `llama-3.1-8b-instruct` |
| `OPENROUTER_API_KEY` | OpenRouter | `mistral-7b-instruct` |
```shell
# Explicit cloud models
python3 benchmark_models.py --cloud openai:gpt-4o anthropic:claude-opus-4-5

# Skip auto-detection, use only what you specify
python3 benchmark_models.py --no-cloud-auto --cloud groq:llama-3.1-70b-versatile
```

Remote / cloud-hosted Ollama instances:

```shell
# Inline URL per model
python3 benchmark_models.py --ollama http://myserver:11434/qwen3:72b

# Global host applied to all --ollama models (and default models)
python3 benchmark_models.py --ollama-host http://myserver:11434
python3 benchmark_models.py --ollama-host http://myserver:11434 --ollama qwen3:14b llama3.1:8b
```

Sample output:
```
╭──────────────────────────────────────────────────────────╮
│ Model                  Provider   Acc   AvgTime   Iters  │
│ qwen3:0.6b (local)     ollama     33%      42.1     5.0  │
│ qwen3:4b (local)       ollama    100%      38.7     3.0  │
│ OpenAI / gpt-4o-mini   openai    100%       8.2     2.3  │
╰──────────────────────────────────────────────────────────╯
```
A minimal self-contained script that generates a 10K-line haystack, hides a random 7-digit number in it, and runs RLM to find it.
```shell
python3 example_niah.py
```

Hardcoded to `qwen3:0.6b`. Edit the file to change the model or haystack size.
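Roughly, the haystack construction can be sketched like this. `make_haystack` and the filler words are hypothetical, not the script's actual code:

```python
import random

def make_haystack(n_lines=10_000, seed=None):
    """Generate n_lines of filler text with a random 7-digit
    needle hidden on one random line; return (haystack, needle)."""
    rng = random.Random(seed)
    needle = rng.randint(1_000_000, 9_999_999)
    words = ["alpha", "beta", "gamma", "delta", "epsilon"]
    lines = [
        f"line {i}: " + " ".join(rng.choice(words) for _ in range(8))
        for i in range(n_lines)
    ]
    # bury the needle on one random line
    lines[rng.randrange(n_lines)] += f" The secret number is {needle}."
    return "\n".join(lines), needle

haystack, needle = make_haystack(seed=42)
assert str(needle) in haystack
print(len(haystack.splitlines()))  # → 10000
```

The interesting part is what the model does with it: the haystack becomes `context`, and the model must search it via REPL code rather than read it all in one prompt.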
```python
from rlm import RLM

rlm = RLM(
    model="qwen3:4b",
    recursive_model="qwen3:0.6b",  # optional smaller sub-LLM
    max_iterations=10,
    think=False,
)

result = rlm.completion(
    prompt="Summarise the key events and extract all dates.",
    context=open("my_document.txt").read(),
)

print(result.answer)
print(f"{result.iterations} iterations, {result.total_time:.1f}s")
```

With a cloud model:
```python
from rlm import RLM
from rlm.cloud_client import CloudClient

client = CloudClient(model="gpt-4o-mini")  # uses OPENAI_API_KEY from env
rlm = RLM(root_client=client, max_iterations=10)

result = rlm.completion(prompt="What is 17 * 23?")
print(result.answer)
```

`RLMResult` fields:
| Field | Type | Description |
|---|---|---|
| `answer` | `str` | The extracted final answer |
| `iterations` | `int` | Number of root-LLM turns used |
| `total_time` | `float` | Wall-clock seconds |
| `usage` | `dict` | Token counts (`total_prompt_tokens`, `total_completion_tokens`) |
Inside the REPL, the model has access to:

| Name | Type | Description |
|---|---|---|
| `context` | `str` \| `dict` \| `list` | The user-supplied data |
| `llm_query(prompt)` | function | Calls the sub-LLM; returns a string |
| `print(...)` | function | Output is captured and fed back to the model |
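A sandboxed execution of this shape can be sketched with `exec` over a restricted namespace. This is an illustration only, not the engine's actual sandbox, and the sub-LLM is stubbed:

```python
import io
import contextlib

def run_sandboxed(code, context, llm_query):
    """Execute model-written code with only the names described
    above, capturing print() output for the feedback loop."""
    env = {
        "__builtins__": __builtins__,  # a real sandbox would restrict this
        "context": context,
        "llm_query": llm_query,
    }
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, env)
    return buf.getvalue(), env  # stdout plus any variables left behind

out, env = run_sandboxed(
    "chunk = context[:5]\nprint(llm_query('Echo: ' + chunk))",
    context="hello world",
    llm_query=lambda p: p.upper(),  # stub sub-LLM
)
print(out)  # → ECHO: HELLO
```

Returning `env` as well matters: it is what lets `FINAL_VAR(varname)` hand back a variable the code computed rather than printed.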
The model terminates the loop with:

- `FINAL(answer)` — return a literal answer
- `FINAL_VAR(varname)` — return the value of a REPL variable
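Detecting these two terminators can be sketched as a small parser; `extract_final` is a hypothetical helper, and the engine's real parsing may differ:

```python
import re

def extract_final(reply, repl_vars):
    """Return the answer if the reply contains FINAL(...) or
    FINAL_VAR(...); return None to keep looping."""
    m = re.search(r"FINAL_VAR\((\w+)\)", reply)
    if m:
        return repl_vars[m.group(1)]  # value of a REPL variable
    m = re.search(r"FINAL\((.*)\)", reply, re.DOTALL)
    if m:
        return m.group(1)  # literal answer text
    return None

print(extract_final("FINAL(42)", {}))                          # → 42
print(extract_final("FINAL_VAR(summary)", {"summary": "ok"}))  # → ok
```

Checking `FINAL_VAR` before `FINAL` matters, since a naive `FINAL(` search would also match inside `FINAL_VAR(`.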
By default Ollama stores models in `/usr/share/ollama/.ollama/models`. To point it at a different drive, set `OLLAMA_MODELS` permanently via a systemd drop-in:

```shell
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_MODELS=/path/to/your/drive/ollama-models"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama
```

References:

- Zhang, Kraska & Khattab, Recursive Language Models (MIT OASYS Lab). arXiv:2512.24601 · GitHub: alexzhang13/rlm