
RLM-Ollama

A local implementation of Recursive Language Models (RLM), based on the paper "Recursive Language Models" by Zhang, Kraska & Khattab (MIT OASYS Lab).

RLM gives an LLM a live Python REPL and a recursive sub-LLM — turning a single inference call into an iterative reasoning loop. The model can write and execute code, inspect results, call smaller models as tools, and keep iterating until it's confident in its answer. This makes it practical for tasks that are impossible in a single forward pass: exploring 100K-line documents, multi-step computation, structured data extraction, and more.

Runs entirely locally via Ollama, with optional cloud model support (OpenAI, Anthropic, Together, Groq, and any Ollama-hosted cloud model).


How It Works

┌──────────────────────────────────────────────────────────┐
│  User query + optional context                           │
│              │                                           │
│       ┌──────▼──────┐                                    │
│       │  Root LLM   │  (any Ollama or cloud model)       │
│       └──────┬──────┘                                    │
│              │  outputs either:                          │
│    ┌─────────┴───────────────┐                           │
│  ```repl``` code block    FINAL(answer)                  │
│    │                         │                           │
│  ┌─▼──────────┐         ─── done ───                     │
│  │  REPL env  │  sandboxed Python exec                   │
│  │            │  `context` var = user data               │
│  │            │  `llm_query(p)` → Sub-LLM call           │
│  └─┬──────────┘                                          │
│    │ stdout / variables fed back as next user message    │
│    └──► loop back to Root LLM                            │
└──────────────────────────────────────────────────────────┘

The model writes code in ```repl``` blocks. The engine executes each block and feeds its output back into the conversation as the next user message. This repeats until the model outputs FINAL(answer).
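The loop above can be sketched in a few lines of Python. This is an illustrative sketch, not the repo's actual engine: the message format, regexes, and the `toy_llm` stub are made up, and real sandboxing is omitted.

```python
import contextlib
import io
import re

def run_rlm(llm, context, max_iterations=15):
    """Minimal sketch of the RLM loop: call the root LLM, execute any
    repl code block it emits, feed the captured stdout back as the
    next message, and stop when the model outputs FINAL(...)."""
    messages = ["Answer the query. The data is in the `context` variable."]
    env = {"context": context, "llm_query": llm}  # names exposed inside the REPL
    for _ in range(max_iterations):
        reply = llm("\n".join(messages))
        final = re.search(r"FINAL\((.*)\)", reply, re.S)
        if final:
            return final.group(1)
        block = re.search(r"```repl\n(.*?)```", reply, re.S)
        if block:
            buf = io.StringIO()
            with contextlib.redirect_stdout(buf):
                exec(block.group(1), env)  # sandboxing omitted in this sketch
            messages.append("REPL output:\n" + buf.getvalue())
    return None

# Toy root LLM: first turn probes the context, second turn answers.
def toy_llm(prompt):
    if "REPL output" not in prompt:
        return "```repl\nprint(len(context))\n```"
    return "FINAL(the context is 11 characters long)"

answer = run_rlm(toy_llm, "hello world")
```

The real engine differs (sandboxing, variable capture, token accounting), but the control flow — generate, execute, feed back, repeat — is the same.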


Installation

Requirements: Python 3.10+, Ollama running locally.

git clone https://github.com/zitacron/RLM-Ollama
cd RLM-Ollama
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Pull at least one model before using:

ollama pull qwen3.5:2b

Cloud model support (optional)

pip install openai      # OpenAI, Together AI, Groq, Fireworks, OpenRouter
pip install anthropic   # Anthropic Claude

Utilities

chat.py — Interactive terminal chat

A full conversational interface. Every query runs through the RLM loop — the model gets a REPL and can call sub-LLMs before answering.

python3 chat.py
python3 chat.py --model qwen3:4b
python3 chat.py --model qwen3:4b --no-think        # disable thinking tokens (faster)
python3 chat.py --model llama3.1:8b --recursive-model qwen3:0.6b  # smaller sub-LLM
python3 chat.py --model qwen3:4b --context path/to/bigfile.txt     # preload context

In-session commands:

| Command | Effect |
|---|---|
| `/context <file>` | Load a file as context mid-session |
| `/model <name>` | Switch the root model |
| `/quit` | Exit |

All options:

| Flag | Default | Description |
|---|---|---|
| `--model` / `-m` | `llama3.1:8b` | Root LLM model |
| `--recursive-model` / `-r` | same as `--model` | Sub-LLM for `llm_query()` calls inside the REPL |
| `--context` / `-c` | none | Path to a file preloaded as context |
| `--host` | `localhost:11434` | Ollama server address |
| `--max-iterations` | `15` | Cap on root-LLM turns per query |
| `--num-ctx` | `131072` | Context window (tokens) passed to Ollama |
| `--temperature` | `0.6` | Sampling temperature |
| `--think` / `--no-think` | auto | Enable/disable thinking mode (qwen3, deepseek-r1, etc.) |

Ollama cloud models (e.g. glm-5:cloud) require signing in first:

ollama signin
python3 chat.py --model glm-5:cloud

benchmark_models.py — Multi-model NIAH benchmark

Runs a Needle-in-a-Haystack test across multiple models in one shot. Embeds a secret number in a large block of random text and measures each model's accuracy, speed, and iteration count.

By default tests qwen3:0.6b and qwen3:1.7b. Auto-detects cloud providers from environment variables.

python3 benchmark_models.py

Basic options:

# Change haystack size and number of trials per model
python3 benchmark_models.py --lines 5000 --trials 5

# Add extra local Ollama models
python3 benchmark_models.py --ollama qwen3:4b llama3.1:8b

# Skip the default local models (cloud-only run)
python3 benchmark_models.py --no-local

# Reproducible run
python3 benchmark_models.py --seed 42

Cloud models — auto-detected from env vars:

| Env var | Provider | Default model tested |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI | `gpt-4o-mini` |
| `ANTHROPIC_API_KEY` | Anthropic | `claude-haiku-3-5-latest` |
| `TOGETHER_API_KEY` | Together AI | `meta-llama/Llama-3-8b-chat-hf` |
| `GROQ_API_KEY` | Groq | `llama-3.1-8b-instant` |
| `FIREWORKS_API_KEY` | Fireworks | `llama-3.1-8b-instruct` |
| `OPENROUTER_API_KEY` | OpenRouter | `mistral-7b-instruct` |

# Explicit cloud models
python3 benchmark_models.py --cloud openai:gpt-4o anthropic:claude-opus-4-5

# Skip auto-detection, use only what you specify
python3 benchmark_models.py --no-cloud-auto --cloud groq:llama-3.1-70b-versatile

Remote / cloud-hosted Ollama instances:

# Inline URL per model
python3 benchmark_models.py --ollama http://myserver:11434/qwen3:72b

# Global host applied to all --ollama models (and default models)
python3 benchmark_models.py --ollama-host http://myserver:11434
python3 benchmark_models.py --ollama-host http://myserver:11434 --ollama qwen3:14b llama3.1:8b

Sample output:

╭──────────────────────────────────────────────────────────╮
│ Model                    Provider   Acc   AvgTime  Iters │
│ qwen3:0.6b (local)       ollama     33%      42.1    5.0 │
│ qwen3:4b (local)         ollama    100%      38.7    3.0 │
│ OpenAI / gpt-4o-mini     openai    100%       8.2    2.3 │
╰──────────────────────────────────────────────────────────╯

example_niah.py — Quick needle-in-a-haystack demo

A minimal self-contained script that generates a 10K-line haystack, hides a random 7-digit number in it, and runs RLM to find it.

python3 example_niah.py

Hardcoded to qwen3:0.6b. Edit the file to change the model or haystack size.
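The core of such a script can be sketched as follows. This is an illustrative sketch, not the repo's exact code: the filler-line format and the needle sentence are invented here.

```python
import random

def make_haystack(lines=10_000, seed=None):
    """Generate `lines` filler lines and hide a random 7-digit
    needle on one randomly chosen line (illustrative sketch)."""
    rng = random.Random(seed)
    needle = rng.randint(1_000_000, 9_999_999)
    rows = [f"line {i}: {rng.random():.6f}" for i in range(lines)]
    rows[rng.randrange(lines)] = f"The secret number is {needle}."
    return "\n".join(rows), needle

haystack, needle = make_haystack(lines=100, seed=42)
```

The haystack would then be passed as `context` and RLM asked to recover the needle; accuracy is simply whether the returned answer contains the hidden number.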


Using RLM as a library

from rlm import RLM

rlm = RLM(
    model="qwen3:4b",
    recursive_model="qwen3:0.6b",   # optional smaller sub-LLM
    max_iterations=10,
    think=False,
)

result = rlm.completion(
    prompt="Summarise the key events and extract all dates.",
    context=open("my_document.txt").read(),
)

print(result.answer)
print(f"{result.iterations} iterations, {result.total_time:.1f}s")

With a cloud model:

from rlm import RLM
from rlm.cloud_client import CloudClient

client = CloudClient(model="gpt-4o-mini")   # uses OPENAI_API_KEY from env

rlm = RLM(root_client=client, max_iterations=10)
result = rlm.completion(prompt="What is 17 * 23?")
print(result.answer)

RLMResult fields:

| Field | Type | Description |
|---|---|---|
| `answer` | `str` | The extracted final answer |
| `iterations` | `int` | Number of root-LLM turns used |
| `total_time` | `float` | Wall-clock seconds |
| `usage` | `dict` | Token counts (`total_prompt_tokens`, `total_completion_tokens`) |

REPL protocol

Inside the REPL, the model has access to:

| Name | Type | Description |
|---|---|---|
| `context` | `str`, `dict`, or `list` | The user-supplied data |
| `llm_query(prompt)` | function | Calls the sub-LLM; returns a string |
| `print(...)` | function | Output is captured and fed back to the model |

The model terminates the loop with:

  • FINAL(answer) — return a literal answer
  • FINAL_VAR(varname) — return the value of a REPL variable
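For illustration, the body of a single model-written repl block might look like the snippet below. The names `context` and `llm_query` are normally injected by the engine; they are stubbed here (with made-up data) so the snippet runs standalone.

```python
# Stubs for the names the engine injects into the REPL.
context = "2024-01-05: kickoff\n2024-02-10: design review\n2024-03-01: launch"
llm_query = lambda prompt: "Three milestones: kickoff, design review, launch."

# What a model-written repl block might do:
dates = [line.split(":")[0] for line in context.splitlines()]
summary = llm_query("Summarise these events:\n" + context)
print(dates)
print(summary)

# The model would then end its turn with FINAL_VAR(summary),
# returning the value of the REPL variable `summary` as its answer.
```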

Configuring Ollama model storage

By default, Ollama on Linux stores models in /usr/share/ollama/.ollama/models. To point it at a different drive, set OLLAMA_MODELS permanently via a systemd drop-in:

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_MODELS=/path/to/your/drive/ollama-models"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama
