A local implementation of Recursive Language Models (RLM), based on the paper "Recursive Language Models" by Zhang, Kraska & Khattab (MIT OASYS Lab).
RLM gives an LLM a live Python REPL and a recursive sub-LLM — turning a single inference call into an iterative reasoning loop. The model can write and execute code, inspect results, call smaller models as tools, and keep iterating until it's confident in its answer. This makes it practical for tasks that are impossible in a single forward pass: exploring 100K-line documents, multi-step computation, structured data extraction, and more.
Runs entirely locally via Ollama, with optional cloud model support (OpenAI, Anthropic, Together, Groq, and any Ollama-hosted cloud model).
```
 User query + optional context
               │
        ┌──────▼──────┐
        │  Root LLM   │   (any Ollama or cloud model)
        └──────┬──────┘
               │  outputs either:
     ┌─────────┴──────────────┐
     │                        │
 ```repl``` code block    FINAL(answer)
     │                        │
 ┌───▼────────┐          ── done ──
 │  REPL env  │   sandboxed Python exec
 │            │   `context` var = user data
 │            │   `llm_query(p)` → sub-LLM call
 └───┬────────┘
     │  stdout / variables fed back as next user message
     └──► loop back to Root LLM
```
The model writes code in fenced `repl` blocks. The engine executes them and feeds the output back into the conversation. This repeats until the model outputs `FINAL(answer)`.
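This execute-and-feed-back loop can be sketched with a stubbed model. Note that `run_rlm_loop`, `stub_model`, and the message format here are hypothetical names for illustration only, not the engine's actual API:

```python
import io
import re
import contextlib

FENCE = "`" * 3  # a literal ``` as it appears in the model's output

def run_rlm_loop(model_fn, context, max_iterations=15):
    """Minimal sketch of the RLM loop: execute repl blocks,
    feed captured stdout back, stop when the model emits FINAL(...)."""
    repl_globals = {"context": context}  # persistent REPL state across turns
    messages = [{"role": "user", "content": "Answer using the REPL."}]
    for _ in range(max_iterations):
        reply = model_fn(messages)
        final = re.search(r"FINAL\((.*)\)", reply, re.DOTALL)
        if final:
            return final.group(1)  # model is confident; stop looping
        code = re.search(FENCE + r"repl\n(.*?)" + FENCE, reply, re.DOTALL)
        if code is None:
            continue
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):  # capture print() output
            exec(code.group(1), repl_globals)
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": buf.getvalue()})
    return None

# Stubbed "model": first inspects the context, then answers.
def stub_model(messages):
    if len(messages) == 1:
        return FENCE + "repl\nprint(len(context))\n" + FENCE
    return "FINAL(context has " + messages[-1]["content"].strip() + " chars)"

print(run_rlm_loop(stub_model, "hello world"))  # → context has 11 chars
```

The real engine adds the sub-LLM call, sandboxing, and iteration caps on top of this skeleton.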
Requirements: Python 3.10+, Ollama running locally.

```shell
git clone https://github.com/zitacron/RLM-Ollama
cd RLM-Ollama
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Pull at least one model before using:

```shell
ollama pull qwen3.5:2b
```

Optional cloud providers:

```shell
pip install openai     # OpenAI, Together AI, Groq, Fireworks, OpenRouter
pip install anthropic  # Anthropic Claude
```

chat.py provides a full conversational interface. Every query runs through the RLM loop — the model gets a REPL and can call sub-LLMs before answering.
```shell
python3 chat.py
python3 chat.py --model qwen3:4b
python3 chat.py --model qwen3:4b --no-think                        # disable thinking tokens (faster)
python3 chat.py --model llama3.1:8b --recursive-model qwen3:0.6b   # smaller sub-LLM
python3 chat.py --model qwen3:4b --context path/to/bigfile.txt     # preload context
```

In-session commands:
| Command | Effect |
|---|---|
| `/context <file>` | Load a file as context mid-session |
| `/model <name>` | Switch the root model |
| `/quit` | Exit |
All options:

| Flag | Default | Description |
|---|---|---|
| `--model` / `-m` | `llama3.1:8b` | Root LLM model |
| `--recursive-model` / `-r` | same as `--model` | Sub-LLM for `llm_query()` calls inside the REPL |
| `--context` / `-c` | none | Path to a file preloaded as context |
| `--host` | `localhost:11434` | Ollama server address |
| `--max-iterations` | `15` | Cap on root-LLM turns per query |
| `--num-ctx` | `131072` | Context window tokens passed to Ollama |
| `--temperature` | `0.6` | Sampling temperature |
| `--think` / `--no-think` | auto | Enable/disable thinking mode (qwen3, deepseek-r1, etc.) |
Ollama cloud models (e.g. `glm-5:cloud`) require signing in first:

```shell
ollama signin
python3 chat.py --model glm-5:cloud
```

benchmark_models.py runs a Needle-in-a-Haystack test across multiple models in one shot. It embeds a secret number in a large block of random text and measures each model's accuracy, speed, and iteration count.
By default tests qwen3:0.6b and qwen3:1.7b. Auto-detects cloud providers from environment variables.
```shell
python3 benchmark_models.py
```

Basic options:

```shell
# Change haystack size and number of trials per model
python3 benchmark_models.py --lines 5000 --trials 5

# Add extra local Ollama models
python3 benchmark_models.py --ollama qwen3:4b llama3.1:8b

# Skip the default local models (cloud-only run)
python3 benchmark_models.py --no-local

# Reproducible run
python3 benchmark_models.py --seed 42
```

Cloud models — auto-detected from env vars:
| Env var | Provider | Default model tested |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI | `gpt-4o-mini` |
| `ANTHROPIC_API_KEY` | Anthropic | `claude-haiku-3-5-latest` |
| `TOGETHER_API_KEY` | Together AI | `meta-llama/Llama-3-8b-chat-hf` |
| `GROQ_API_KEY` | Groq | `llama-3.1-8b-instant` |
| `FIREWORKS_API_KEY` | Fireworks | `llama-3.1-8b-instruct` |
| `OPENROUTER_API_KEY` | OpenRouter | `mistral-7b-instruct` |
```shell
# Explicit cloud models
python3 benchmark_models.py --cloud openai:gpt-4o anthropic:claude-opus-4-5

# Skip auto-detection, use only what you specify
python3 benchmark_models.py --no-cloud-auto --cloud groq:llama-3.1-70b-versatile
```

Remote / cloud-hosted Ollama instances:

```shell
# Inline URL per model
python3 benchmark_models.py --ollama http://myserver:11434/qwen3:72b

# Global host applied to all --ollama models (and default models)
python3 benchmark_models.py --ollama-host http://myserver:11434
python3 benchmark_models.py --ollama-host http://myserver:11434 --ollama qwen3:14b llama3.1:8b
```

Sample output:
```
╭──────────────────────────────────────────────────────────╮
│ Model                  Provider   Acc   AvgTime   Iters  │
│ qwen3:0.6b (local)     ollama     33%      42.1     5.0  │
│ qwen3:4b (local)       ollama    100%      38.7     3.0  │
│ OpenAI / gpt-4o-mini   openai    100%       8.2     2.3  │
╰──────────────────────────────────────────────────────────╯
```
A minimal self-contained script that generates a 10K-line haystack, hides a random 7-digit number in it, and runs RLM to find it.
```shell
python3 example_niah.py
```

Hardcoded to `qwen3:0.6b`. Edit the file to change the model or haystack size.
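Roughly, the haystack construction can be sketched like this. `make_haystack` and the filler words are hypothetical, not the script's actual code:

```python
import random

def make_haystack(n_lines=10_000, seed=None):
    """Generate n_lines of filler text with a random 7-digit
    needle hidden on one random line; return (haystack, needle)."""
    rng = random.Random(seed)
    needle = rng.randint(1_000_000, 9_999_999)
    words = ["alpha", "beta", "gamma", "delta", "epsilon"]
    lines = [
        f"line {i}: " + " ".join(rng.choice(words) for _ in range(8))
        for i in range(n_lines)
    ]
    # bury the needle on one random line
    lines[rng.randrange(n_lines)] += f" The secret number is {needle}."
    return "\n".join(lines), needle

haystack, needle = make_haystack(seed=42)
assert str(needle) in haystack
print(len(haystack.splitlines()))  # → 10000
```

The interesting part is what the model does with it: the haystack becomes `context`, and the model must search it via REPL code rather than read it all in one prompt.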
```python
from rlm import RLM

rlm = RLM(
    model="qwen3:4b",
    recursive_model="qwen3:0.6b",  # optional smaller sub-LLM
    max_iterations=10,
    think=False,
)

result = rlm.completion(
    prompt="Summarise the key events and extract all dates.",
    context=open("my_document.txt").read(),
)

print(result.answer)
print(f"{result.iterations} iterations, {result.total_time:.1f}s")
```

With a cloud model:
```python
from rlm import RLM
from rlm.cloud_client import CloudClient

client = CloudClient(model="gpt-4o-mini")  # uses OPENAI_API_KEY from env
rlm = RLM(root_client=client, max_iterations=10)

result = rlm.completion(prompt="What is 17 * 23?")
print(result.answer)
```

`RLMResult` fields:
| Field | Type | Description |
|---|---|---|
| `answer` | `str` | The extracted final answer |
| `iterations` | `int` | Number of root-LLM turns used |
| `total_time` | `float` | Wall-clock seconds |
| `usage` | `dict` | Token counts (`total_prompt_tokens`, `total_completion_tokens`) |
Inside the REPL, the model has access to:

| Name | Type | Description |
|---|---|---|
| `context` | `str` \| `dict` \| `list` | The user-supplied data |
| `llm_query(prompt)` | function | Calls the sub-LLM; returns a string |
| `print(...)` | function | Output is captured and fed back to the model |
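A sandboxed execution of this shape can be sketched with `exec` over a restricted namespace. This is an illustration only, not the engine's actual sandbox, and the sub-LLM is stubbed:

```python
import io
import contextlib

def run_sandboxed(code, context, llm_query):
    """Execute model-written code with only the names described
    above, capturing print() output for the feedback loop."""
    env = {
        "__builtins__": __builtins__,  # a real sandbox would restrict this
        "context": context,
        "llm_query": llm_query,
    }
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, env)
    return buf.getvalue(), env  # stdout plus any variables left behind

out, env = run_sandboxed(
    "chunk = context[:5]\nprint(llm_query('Echo: ' + chunk))",
    context="hello world",
    llm_query=lambda p: p.upper(),  # stub sub-LLM
)
print(out)  # → ECHO: HELLO
```

Returning `env` as well matters: it is what lets `FINAL_VAR(varname)` hand back a variable the code computed rather than printed.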
The model terminates the loop with:

- `FINAL(answer)` — return a literal answer
- `FINAL_VAR(varname)` — return the value of a REPL variable
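Detecting these two terminators can be sketched as a small parser; `extract_final` is a hypothetical helper, and the engine's real parsing may differ:

```python
import re

def extract_final(reply, repl_vars):
    """Return the answer if the reply contains FINAL(...) or
    FINAL_VAR(...); return None to keep looping."""
    m = re.search(r"FINAL_VAR\((\w+)\)", reply)
    if m:
        return repl_vars[m.group(1)]  # value of a REPL variable
    m = re.search(r"FINAL\((.*)\)", reply, re.DOTALL)
    if m:
        return m.group(1)  # literal answer text
    return None

print(extract_final("FINAL(42)", {}))                          # → 42
print(extract_final("FINAL_VAR(summary)", {"summary": "ok"}))  # → ok
```

Checking `FINAL_VAR` before `FINAL` matters, since a naive `FINAL(` search would also match inside `FINAL_VAR(`.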
By default Ollama stores models in `/usr/share/ollama/.ollama/models`. To point it at a different drive, set `OLLAMA_MODELS` permanently via a systemd drop-in:

```shell
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_MODELS=/path/to/your/drive/ollama-models"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama
```

References:

- Zhang, Kraska & Khattab, Recursive Language Models (MIT OASYS Lab). arXiv:2512.24601 · GitHub: alexzhang13/rlm