| layout | default |
|---|---|
| title | Ollama Tutorial - Chapter 7: Integrations |
| nav_order | 7 |
| has_children | false |
| parent | Ollama Tutorial |
Welcome to Chapter 7: Integrations with OpenAI API, LangChain, and LlamaIndex. In this part of Ollama Tutorial: Running and Serving LLMs Locally, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Use Ollama with common AI frameworks and OpenAI-compatible SDKs.
Ollama exposes an OpenAI-compatible API, which means virtually any tool, framework, or library that works with OpenAI can work with Ollama by simply changing the base URL. This chapter walks through complete, working integration examples for the most popular frameworks and tools in the AI ecosystem.
We will cover direct SDK usage, framework integrations, structured output with Instructor, agent frameworks, a self-hosted chat UI, and a full chat application built with FastAPI and WebSockets.
The diagram below shows how Ollama sits at the center of your AI stack, providing a local, private inference endpoint that many tools can connect to simultaneously.
flowchart TD
OL[Ollama Server<br/>localhost:11434]
OL --- A[OpenAI Python SDK]
OL --- B[OpenAI Node SDK]
OL --- C[LangChain]
OL --- D[LlamaIndex]
OL --- E[Instructor]
OL --- F[Open WebUI]
OL --- G[AutoGen / CrewAI]
OL --- H[Custom FastAPI App]
OL --- I[LiteLLM Proxy]
OL --- J[curl / HTTP Clients]
style OL fill:#f90,stroke:#333,color:#000
Every integration in this chapter connects to Ollama through the same HTTP API. The key connection details are always the same:
- Base URL: `http://localhost:11434` (native API) or `http://localhost:11434/v1` (OpenAI-compatible)
- API Key: Any string works (Ollama does not enforce authentication)
- Model name: Whatever you have pulled (e.g., `llama3`, `mistral`, `codellama`)
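The connection details above can be sketched using only the Python standard library. The helper below (`build_chat_request` is a hypothetical name, not part of any SDK) assembles a request for the OpenAI-compatible `/chat/completions` route and shows that the API key is only a placeholder. It builds the request without sending it; sending assumes an Ollama server is listening on the default port.

```python
import json
import urllib.request

OPENAI_COMPAT_BASE = "http://localhost:11434/v1"  # OpenAI-compatible routes


def build_chat_request(model, messages, base_url=OPENAI_COMPAT_BASE, api_key="ollama"):
    """Build (but do not send) a POST request for /chat/completions."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Placeholder value; Ollama does not enforce authentication.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request("llama3", [{"role": "user", "content": "Hello"}])
print(request.full_url)  # http://localhost:11434/v1/chat/completions
```

To actually send it, pass `request` to `urllib.request.urlopen` while the server is running.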
The easiest way to integrate Ollama is through the official OpenAI SDKs. Just change the base URL and you are done.
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama", # Required by the SDK but not enforced by Ollama
)
# Simple chat completion
response = client.chat.completions.create(
model="llama3",
messages=[
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function to flatten a nested list."},
],
temperature=0.3,
)
print(response.choices[0].message.content)

from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
stream = client.chat.completions.create(
model="llama3",
messages=[{"role": "user", "content": "Explain how hash tables work."}],
stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()  # Newline at the end

import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:11434/v1",
apiKey: "ollama",
});
async function main() {
const response = await client.chat.completions.create({
model: "mistral",
messages: [
{ role: "system", content: "You are a concise technical writer." },
{ role: "user", content: "Summarize RAG in 3 bullet points." },
],
});
console.log(response.choices[0].message.content);
}
main();

import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:11434/v1",
apiKey: "ollama",
});
async function main() {
const stream = await client.chat.completions.create({
model: "llama3",
messages: [{ role: "user", content: "Write a haiku about programming." }],
stream: true,
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content || "";
process.stdout.write(content);
}
console.log();
}
main();

from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.embeddings.create(
model="nomic-embed-text",
input=["What is retrieval augmented generation?", "How do vector databases work?"],
)
for i, embedding in enumerate(response.data):
    print(f"Embedding {i}: {len(embedding.embedding)} dimensions")

LangChain is one of the most popular frameworks for building LLM applications. Ollama integrates cleanly as both an LLM and an embeddings provider.
from langchain_ollama import OllamaLLM
from langchain_core.prompts import PromptTemplate
llm = OllamaLLM(model="llama3", base_url="http://localhost:11434")
prompt = PromptTemplate.from_template(
"Explain {topic} in 3 bullet points for a beginner."
)
chain = prompt | llm
result = chain.invoke({"topic": "vector databases"})
print(result)

from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage
chat = ChatOllama(
model="llama3",
base_url="http://localhost:11434",
temperature=0.3,
)
messages = [
SystemMessage(content="You are a Python expert. Be concise."),
HumanMessage(content="What are the main differences between lists and tuples?"),
]
# Streaming response
for chunk in chat.stream(messages):
    print(chunk.content, end="", flush=True)
print()

from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Load and split documents
loader = TextLoader("./docs/architecture.md")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
splits = splitter.split_documents(docs)
# Create vector store with Ollama embeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text", base_url="http://localhost:11434")
vectorstore = Chroma.from_documents(splits, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# Build RAG chain
llm = ChatOllama(model="llama3", base_url="http://localhost:11434")
template = """Answer the question based on the following context:
Context: {context}
Question: {question}
Answer:"""
prompt = ChatPromptTemplate.from_template(template)
chain = (
{"context": retriever, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
result = chain.invoke("How is the system architected?")
print(result)

import { ChatOllama } from "@langchain/ollama";
import { HumanMessage, SystemMessage } from "@langchain/core/messages";
const chat = new ChatOllama({
model: "llama3",
baseUrl: "http://localhost:11434",
temperature: 0.3,
});
const response = await chat.invoke([
new SystemMessage("You are a helpful assistant."),
new HumanMessage("Explain the observer pattern in one paragraph."),
]);
console.log(response.content);

LlamaIndex specializes in connecting LLMs to your data. Ollama works as both the LLM and the embedding model.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
# Configure Ollama as the default LLM and embedding model
Settings.llm = Ollama(model="llama3", request_timeout=120)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
# Load documents and build index
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
# Query
query_engine = index.as_query_engine()
response = query_engine.query("What are the main components of the system?")
print(response)

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
Settings.llm = Ollama(model="llama3", request_timeout=120)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
# Create a chat engine that remembers conversation history
chat_engine = index.as_chat_engine(chat_mode="condense_question")
# Multi-turn conversation
response1 = chat_engine.chat("What does this project do?")
print(f"Bot: {response1}")
response2 = chat_engine.chat("How is it deployed?")
print(f"Bot: {response2}")
response3 = chat_engine.chat("What were the challenges mentioned?")
print(f"Bot: {response3}")

Instructor is a library that makes it easy to get structured, validated output from LLMs using Pydantic models. It works with Ollama through the OpenAI compatibility layer.
pip install instructor openai pydantic

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
# Patch the OpenAI client with Instructor
client = instructor.from_openai(
OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
mode=instructor.Mode.JSON,
)
class MovieReview(BaseModel):
    title: str = Field(description="The movie title")
    rating: float = Field(ge=0, le=10, description="Rating from 0 to 10")
    pros: list[str] = Field(description="List of positive aspects")
    cons: list[str] = Field(description="List of negative aspects")
    summary: str = Field(description="One-sentence summary")
review = client.chat.completions.create(
model="llama3",
messages=[
{"role": "user", "content": "Write a review of the movie Inception."},
],
response_model=MovieReview,
)
print(f"Title: {review.title}")
print(f"Rating: {review.rating}/10")
print(f"Pros: {', '.join(review.pros)}")
print(f"Cons: {', '.join(review.cons)}")
print(f"Summary: {review.summary}")

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
client = instructor.from_openai(
OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
mode=instructor.Mode.JSON,
)
class ContactInfo(BaseModel):
    name: str
    email: str | None = None
    phone: str | None = None
    company: str | None = None

class ExtractedContacts(BaseModel):
    contacts: list[ContactInfo]
text = """
Hi, I'm Sarah Chen from Acme Corp. You can reach me at sarah@acme.com or 555-0123.
Also CC my colleague Bob Martinez (bob.m@acme.com).
"""
result = client.chat.completions.create(
model="llama3",
messages=[
{"role": "user", "content": f"Extract contact information from: {text}"},
],
response_model=ExtractedContacts,
)
for contact in result.contacts:
    print(f"{contact.name} ({contact.company}) - {contact.email}, {contact.phone}")

Open WebUI is a self-hosted web interface for Ollama that provides a ChatGPT-like experience. It is the fastest way to give your team a polished chat UI backed by local models.
docker run -d \
--name open-webui \
-p 3000:8080 \
-v open-webui:/app/backend/data \
--add-host=host.docker.internal:host-gateway \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 in your browser. Create an account (the first account becomes admin) and start chatting with your local models.
version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    volumes:
      - open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
volumes:
  ollama-data:
  open-webui-data:

- Multi-model chat: Switch between models in a single conversation.
- Document upload: Upload PDFs and text files for RAG-powered Q&A.
- Prompt library: Save and share prompt templates across your team.
- User management: Admin controls, user roles, and usage tracking.
- Model management: Pull, delete, and configure models from the web UI.
Agent frameworks let you create teams of AI agents that collaborate to solve complex tasks. Ollama provides the local LLM backbone.
from autogen import ConversableAgent
config_list = [
{
"model": "llama3",
"base_url": "http://localhost:11434/v1",
"api_key": "ollama",
}
]
llm_config = {"config_list": config_list, "temperature": 0.3}
# Create a coding assistant agent
assistant = ConversableAgent(
name="coding_assistant",
system_message="You are a Python expert. Write clean, well-documented code.",
llm_config=llm_config,
)
# Create a code reviewer agent
reviewer = ConversableAgent(
name="code_reviewer",
system_message="You are a code reviewer. Review the code for bugs, style issues, and suggest improvements.",
llm_config=llm_config,
)
# Start a conversation between the agents
result = assistant.initiate_chat(
reviewer,
message="Write a Python class for a thread-safe queue with max size.",
max_turns=3,
)

from crewai import Agent, Task, Crew
from langchain_ollama import ChatOllama
llm = ChatOllama(model="llama3", base_url="http://localhost:11434")
# Define agents
researcher = Agent(
role="Technical Researcher",
goal="Research and summarize technical topics thoroughly",
backstory="You are a senior technical researcher with expertise in software architecture.",
llm=llm,
verbose=True,
)
writer = Agent(
role="Technical Writer",
goal="Write clear, engaging technical content",
backstory="You are an experienced technical writer who makes complex topics accessible.",
llm=llm,
verbose=True,
)
# Define tasks
research_task = Task(
description="Research the key differences between REST and GraphQL APIs. Include pros, cons, and use cases.",
agent=researcher,
expected_output="A detailed comparison document.",
)
writing_task = Task(
description="Take the research and write a beginner-friendly blog post comparing REST and GraphQL.",
agent=writer,
expected_output="A polished blog post of about 500 words.",
)
# Create crew and run
crew = Crew(agents=[researcher, writer], tasks=[research_task, writing_task], verbose=True)
result = crew.kickoff()
print(result)

For teams that want a custom chat experience, here is a complete working example using FastAPI for the backend and WebSockets for real-time streaming.
import json
import httpx
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.responses import HTMLResponse
from fastapi.staticfiles import StaticFiles
app = FastAPI()
OLLAMA_URL = "http://localhost:11434/api/chat"
@app.websocket("/ws/chat")
async def websocket_chat(websocket: WebSocket):
    await websocket.accept()
    conversation_history = []
    try:
        while True:
            # Receive user message
            user_message = await websocket.receive_text()
            conversation_history.append({"role": "user", "content": user_message})
            # Stream response from Ollama
            payload = {
                "model": "llama3",
                "messages": conversation_history,
                "stream": True,
            }
            assistant_message = ""
            async with httpx.AsyncClient(timeout=120.0) as client:
                async with client.stream("POST", OLLAMA_URL, json=payload) as response:
                    async for line in response.aiter_lines():
                        if line:
                            data = json.loads(line)
                            if "message" in data and "content" in data["message"]:
                                token = data["message"]["content"]
                                assistant_message += token
                                await websocket.send_text(
                                    json.dumps({"type": "token", "content": token})
                                )
            # Signal end of message
            await websocket.send_text(json.dumps({"type": "done"}))
            conversation_history.append(
                {"role": "assistant", "content": assistant_message}
            )
    except WebSocketDisconnect:
        print("Client disconnected")


@app.get("/")
async def get_chat_page():
    return HTMLResponse(CHAT_HTML)
CHAT_HTML = """
<!DOCTYPE html>
<html>
<head>
<title>Ollama Chat</title>
<style>
body { font-family: system-ui, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px; }
#messages { height: 500px; overflow-y: auto; border: 1px solid #ddd; padding: 15px;
border-radius: 8px; margin-bottom: 15px; background: #fafafa; }
.user { color: #0066cc; margin: 10px 0; }
.assistant { color: #333; margin: 10px 0; }
#input-area { display: flex; gap: 10px; }
#user-input { flex: 1; padding: 10px; border: 1px solid #ddd; border-radius: 6px; font-size: 14px; }
button { padding: 10px 20px; background: #0066cc; color: white; border: none;
border-radius: 6px; cursor: pointer; font-size: 14px; }
button:hover { background: #0052a3; }
</style>
</head>
<body>
<h1>Ollama Chat</h1>
<div id="messages"></div>
<div id="input-area">
<input type="text" id="user-input" placeholder="Type a message..." onkeypress="if(event.key==='Enter')sendMessage()">
<button onclick="sendMessage()">Send</button>
</div>
<script>
const ws = new WebSocket(`ws://${location.host}/ws/chat`);
const messages = document.getElementById('messages');
let currentAssistant = null;
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
if (data.type === 'token') {
if (!currentAssistant) {
currentAssistant = document.createElement('div');
currentAssistant.className = 'assistant';
currentAssistant.innerHTML = '<strong>Assistant:</strong> ';
messages.appendChild(currentAssistant);
}
// Append as text so model output is never interpreted as HTML
currentAssistant.appendChild(document.createTextNode(data.content));
messages.scrollTop = messages.scrollHeight;
} else if (data.type === 'done') {
currentAssistant = null;
}
};
function sendMessage() {
const input = document.getElementById('user-input');
const text = input.value.trim();
if (!text) return;
const userDiv = document.createElement('div');
userDiv.className = 'user';
userDiv.innerHTML = '<strong>You:</strong> ';
userDiv.appendChild(document.createTextNode(text));
messages.appendChild(userDiv);
ws.send(text);
input.value = '';
messages.scrollTop = messages.scrollHeight;
}
</script>
</body>
</html>
"""
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

# Install dependencies
pip install fastapi uvicorn httpx websockets
# Start Ollama (if not already running)
ollama serve
# Start the chat server
python server.py

Open http://localhost:8000 in your browser. You now have a real-time streaming chat interface backed by your local Ollama models.
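The streaming loop in the server code above reads Ollama's native `/api/chat` stream, which arrives as one JSON object per line. A minimal sketch of that parsing step in isolation, using hand-written sample lines rather than a live server (`collect_stream` is a hypothetical helper name):

```python
import json


def collect_stream(lines):
    """Join the content tokens from an iterable of newline-delimited JSON lines."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # Skip blank lines
        data = json.loads(line)
        parts.append(data.get("message", {}).get("content", ""))
        if data.get("done"):
            break  # The final object carries "done": true
    return "".join(parts)


# Hand-written sample lines in the shape the server code expects.
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(collect_stream(sample))  # Hello
```

Keeping the parsing in a plain function like this makes it testable without a WebSocket client or a running model.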
LiteLLM provides a unified interface to many LLM providers, including Ollama. This is useful when you want to switch between providers without changing your application code.
from litellm import completion
# Ollama via LiteLLM (prefix model name with "ollama/")
response = completion(
model="ollama/llama3",
messages=[{"role": "user", "content": "What is the capital of France?"}],
api_base="http://localhost:11434",
)
print(response.choices[0].message.content)

You can also run LiteLLM as a proxy server that translates OpenAI API calls to Ollama:
pip install 'litellm[proxy]'
litellm --model ollama/llama3 --port 4000

Now any OpenAI-compatible client can connect to http://localhost:4000 and it will route to Ollama.
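Because both Ollama and the LiteLLM proxy speak the OpenAI protocol, switching between them can come down to configuration. The sketch below is one way to keep that choice out of application code. The `LLM_BACKEND` and `LLM_API_KEY` variable names, the `BACKENDS` table, and the model name sent to the proxy are all assumptions for illustration; check your proxy configuration for the model name it expects.

```python
import os

# Hypothetical backend table; the URLs are the defaults used in this chapter.
BACKENDS = {
    "ollama": {"base_url": "http://localhost:11434/v1", "model": "llama3"},
    "litellm": {"base_url": "http://localhost:4000", "model": "ollama/llama3"},
}


def client_settings(env=None):
    """Return base_url, model, and api_key for the backend named in LLM_BACKEND."""
    env = os.environ if env is None else env
    name = env.get("LLM_BACKEND", "ollama")
    if name not in BACKENDS:
        raise ValueError(f"Unknown backend {name!r}; expected one of {sorted(BACKENDS)}")
    settings = dict(BACKENDS[name])  # Copy so callers cannot mutate the table
    settings["api_key"] = env.get("LLM_API_KEY", "ollama")  # Placeholder key
    return settings


print(client_settings({"LLM_BACKEND": "litellm"}))
```

The returned `base_url` and `api_key` can be passed straight to `OpenAI(...)` as in the earlier SDK examples, with `model` used in the `create` call.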
Any Python web framework can call Ollama through the OpenAI SDK or direct HTTP requests:
from fastapi import FastAPI
from openai import OpenAI
app = FastAPI()
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
@app.post("/api/ask")
def ask(question: str):
    # Plain `def`: FastAPI runs it in a worker thread, so the blocking SDK call
    # does not stall the event loop.
    response = client.chat.completions.create(
        model="llama3",
        messages=[{"role": "user", "content": question}],
    )
    return {"answer": response.choices[0].message.content}

Ollama embeddings work with all major vector databases:
- Chroma: `langchain_community.vectorstores.Chroma` with Ollama embeddings
- Qdrant: `qdrant_client` with Ollama embedding endpoint
- Weaviate: Configure the OpenAI module to point at Ollama
- Pinecone: Generate embeddings via Ollama, upsert to Pinecone
- n8n: Use the HTTP Request node to call Ollama's API
- Make (Integromat): HTTP module with Ollama endpoints
- Zapier: Webhooks to Ollama API
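For the HTTP-based automation tools above, the node or webhook needs three things: the URL, the method, and a JSON body. A minimal sketch of the same call against Ollama's native `/api/generate` route, written in Python so the shape is easy to copy into a tool's HTTP step. The helper names are hypothetical, and `ask` assumes a server is running on the default port.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"  # Native (non-OpenAI) route


def generate_payload(prompt, model="llama3"):
    """JSON body for a single, non-streaming completion."""
    return {"model": model, "prompt": prompt, "stream": False}


def extract_text(response_body):
    """Pull the generated text out of a non-streaming /api/generate reply."""
    return json.loads(response_body)["response"]


def ask(prompt, model="llama3", timeout=120):
    """Send the request; needs a running Ollama server."""
    request = urllib.request.Request(
        GENERATE_URL,
        data=json.dumps(generate_payload(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_text(response.read())


print(json.dumps(generate_payload("Summarize this ticket in one sentence.")))
```

Setting `"stream": False` matters for these tools, since most HTTP nodes expect a single JSON document rather than a line-by-line stream.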
- Keep model names consistent. If your LangChain code uses `"llama3"` and your direct API calls use `"llama3:latest"`, you may load the model twice. Standardize on one name.
- Set timeouts. Local models can take time to load on first request. Set HTTP timeouts to at least 60 seconds for the first call.
- Tune per integration. A chatbot might use `temperature 0.7` while a code generator uses `temperature 0.2`. Set these per integration, not globally.
- Use streaming. For any user-facing application, streaming provides a much better experience than waiting for the full response.
- Monitor token counts. The API response includes token usage. Log this to track costs (even though local inference is "free," context length affects performance).
- Test with the same model. When switching from OpenAI to Ollama, expect different behavior. Test thoroughly and adjust prompts as needed.
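Two of the tips above lend themselves to small helpers. The sketch below standardizes model names by making the implicit `:latest` tag explicit, and computes generation speed from the `eval_count` and `eval_duration` (nanoseconds) fields that the native API includes in its final response object. Both function names are hypothetical.

```python
def canonical_model_name(name):
    """Append ':latest' when the name has no explicit tag."""
    # Only inspect the last path segment so a registry host with a port is left alone.
    last_segment = name.rsplit("/", 1)[-1]
    return name if ":" in last_segment else f"{name}:latest"


def tokens_per_second(final_chunk):
    """Generation speed from the final native /api/chat or /api/generate object."""
    eval_count = final_chunk.get("eval_count", 0)
    eval_duration_ns = final_chunk.get("eval_duration", 0)  # Nanoseconds
    if not eval_duration_ns:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)


print(canonical_model_name("llama3"))         # llama3:latest
print(canonical_model_name("llama3:latest"))  # llama3:latest
print(tokens_per_second({"eval_count": 120, "eval_duration": 4_000_000_000}))  # 30.0
```

Running every configured model name through one function at startup is a cheap way to keep names identical across integrations.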
Most teams struggle here because the hard part is not writing more code but deciding clear boundaries for `content`, `ollama`, and `messages`, so that behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without clear rollback or observability strategy
After working through this chapter, you should be able to reason about Chapter 7: Integrations with OpenAI API, LangChain, and LlamaIndex as an operating subsystem inside Ollama Tutorial: Running and Serving LLMs Locally, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around `model`, `print`, and `chat` as your checklist when adapting these patterns to your own repository.
Under the hood, Chapter 7: Integrations with OpenAI API, LangChain, and LlamaIndex usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for `content`.
- Input normalization: shape incoming data so `ollama` receives stable contracts.
- Core execution: run the main logic branch and propagate intermediate state through `messages`.
- Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
Use the following upstream sources to verify implementation details while reading this chapter:
- Ollama Repository
  Why it matters: authoritative reference on `Ollama Repository` (github.com).
- Ollama Releases
  Why it matters: authoritative reference on `Ollama Releases` (github.com).
- Ollama Website and Docs
  Why it matters: authoritative reference on `Ollama Website and Docs` (ollama.com).
Suggested trace strategy:
- search upstream code for `content` and `ollama` to map concrete implementation paths
- compare docs claims against actual runtime/config code before reusing patterns in production