| layout | default |
|---|---|
| title | LiteLLM Tutorial - Chapter 3: Completion API |
| nav_order | 3 |
| has_children | false |
| parent | LiteLLM Tutorial |
Welcome to Chapter 3: Completion API. In this part of LiteLLM Tutorial: Unified LLM Gateway and Routing Layer, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Master text and chat completions with advanced parameters, formatting, and multi-turn conversations.
The completion API is the core of LiteLLM. This chapter covers how to craft effective prompts, use advanced parameters, and handle different types of completions across all providers.
The standard chat completion format:

```python
import litellm

response = litellm.completion(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."},
    ],
)

print(response.choices[0].message.content)
```

Understanding conversation roles:
- system: Sets the AI's behavior and context
- user: Human messages
- assistant: AI responses (can include previous responses for context)
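
To make the role rules above concrete, here is a minimal sketch of a pre-flight check for a message list. The `validate_messages` helper and the `ALLOWED_ROLES` set are hypothetical names, not part of LiteLLM, and the sketch assumes only the three roles listed above with plain-string content (tool messages and multimodal content are out of scope).

```python
# Hypothetical helper, not a LiteLLM API: checks only the three roles
# described above and assumes each message is a dict with string content.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_messages(messages):
    """Return a list of problems found in a chat message list (empty if none)."""
    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content must be a string")
        # Convention assumed here: a system message, if present, comes first.
        if role == "system" and i != 0:
            problems.append(f"message {i}: system message should come first")
    return problems
```

Running a check like this before sending a request tends to give clearer errors than a provider-side rejection.
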
```python
messages = [
    {
        "role": "system",
        "content": "You are an expert Python programmer. Provide clear, well-commented code examples.",
    },
    {
        "role": "user",
        "content": "Write a function to calculate fibonacci numbers recursively.",
    },
]

response = litellm.completion(model="gpt-4", messages=messages)
```

Control model behavior with parameters:
```python
response = litellm.completion(
    model="gpt-4",
    messages=messages,
    max_tokens=500,          # Maximum response length
    temperature=0.7,         # Randomness (0.0-2.0 on OpenAI; some providers cap at 1.0)
    top_p=0.9,               # Nucleus sampling
    frequency_penalty=0.0,   # Reduce repetition (-2.0 to 2.0)
    presence_penalty=0.0,    # Encourage new topics (-2.0 to 2.0)
    stop=["\n\n", "###"],    # Stop sequences
    n=1,                     # Number of completions to generate
    logit_bias={},           # Bias token probabilities
)
```

| Parameter | Range | Description | Use Case |
|---|---|---|---|
| `temperature` | 0.0-2.0 | Randomness in output | Creative writing (high), Code (low) |
| `top_p` | 0.0-1.0 | Nucleus sampling | Alternative to temperature |
| `max_tokens` | 1+ | Maximum response length | Control costs and length |
| `frequency_penalty` | -2.0-2.0 | Reduce token repetition | Avoid loops in text |
| `presence_penalty` | -2.0-2.0 | Encourage new topics | Diverse responses |
| `stop` | strings | Stop generation at these strings | Structured outputs |

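
One way to apply the table above consistently is to keep named presets per task and validate overrides against the documented ranges before a call. The sketch below is an illustration only: `PRESETS` and `build_params` are hypothetical names, and the preset values are assumptions chosen to match the "Use Case" column, not recommendations from LiteLLM.

```python
# Hypothetical presets; values are illustrative assumptions, not LiteLLM defaults.
PRESETS = {
    "code": {"temperature": 0.2, "top_p": 1.0},
    "factual": {"temperature": 0.1, "top_p": 1.0},
    "creative": {"temperature": 0.9, "top_p": 0.95},
}


def build_params(task, max_tokens=500, **overrides):
    """Merge a preset with overrides and check the ranges from the table."""
    if task not in PRESETS:
        raise ValueError(f"unknown task {task!r}")
    params = {**PRESETS[task], "max_tokens": max_tokens, **overrides}

    if not 0.0 <= params["temperature"] <= 2.0:
        raise ValueError("temperature must be within 0.0-2.0")
    if not 0.0 <= params["top_p"] <= 1.0:
        raise ValueError("top_p must be within 0.0-1.0")
    for name in ("frequency_penalty", "presence_penalty"):
        if name in params and not -2.0 <= params[name] <= 2.0:
            raise ValueError(f"{name} must be within -2.0-2.0")
    if params["max_tokens"] < 1:
        raise ValueError("max_tokens must be at least 1")
    return params
```

The returned dictionary can then be unpacked into a call, for example `litellm.completion(model="gpt-4", messages=messages, **build_params("code"))`.
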
Maintain context across multiple exchanges:
```python
conversation = [
    {"role": "system", "content": "You are a helpful coding tutor."},
    {"role": "user", "content": "How do I reverse a string in Python?"},
]

# First response
response1 = litellm.completion(model="gpt-4", messages=conversation)
print("Assistant:", response1.choices[0].message.content)

# Continue conversation
conversation.append({
    "role": "assistant",
    "content": response1.choices[0].message.content,
})
conversation.append({
    "role": "user",
    "content": "Can you show me a more efficient way using slicing?",
})

response2 = litellm.completion(model="gpt-4", messages=conversation)
print("Assistant:", response2.choices[0].message.content)
```

Create a helper class for managing conversations:
```python
import litellm


class ConversationManager:
    def __init__(self, model="gpt-4", system_message=None):
        self.model = model
        self.messages = []
        if system_message:
            self.messages.append({"role": "system", "content": system_message})

    def add_message(self, role, content):
        """Add a message to the conversation."""
        self.messages.append({"role": role, "content": content})

    def send_message(self, user_message, **kwargs):
        """Send a user message and get response."""
        self.add_message("user", user_message)
        response = litellm.completion(
            model=self.model,
            messages=self.messages,
            **kwargs
        )
        assistant_message = response.choices[0].message.content
        self.add_message("assistant", assistant_message)
        return assistant_message, response

    def get_history(self):
        """Get conversation history."""
        return self.messages.copy()

    def clear_history(self):
        """Clear conversation history, keeping the system message if present."""
        system_msg = None
        if self.messages and self.messages[0]["role"] == "system":
            system_msg = self.messages[0]
        self.messages = [system_msg] if system_msg else []


# Usage
chat = ConversationManager(
    model="gpt-4",
    system_message="You are a knowledgeable history teacher."
)

response, _ = chat.send_message("Tell me about the Roman Empire")
print("Response:", response[:200] + "...")

response, _ = chat.send_message("What were their biggest achievements?")
print("Follow-up:", response[:200] + "...")
```

Force specific output formats:
```python
import json

import litellm

# JSON output
json_prompt = """
Extract the following information from the text and return as JSON:
- Name
- Age
- Occupation

Text: John Smith is a 35-year-old software engineer from San Francisco.

Return only valid JSON.
"""

response = litellm.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": json_prompt}],
    temperature=0.1  # Lower temperature for consistent formatting
)

try:
    data = json.loads(response.choices[0].message.content)
    print("Extracted:", data)
except json.JSONDecodeError:
    print("Failed to parse JSON response")
```

Specialized prompts for code:
```python
import litellm


def generate_code(requirement, language="python"):
    """Generate code based on requirements."""
    prompt = f"""
    Write a {language} function that {requirement}.

    Requirements:
    - Include docstring
    - Add type hints
    - Handle edge cases
    - Include example usage

    Return only the code, no explanation.
    """
    response = litellm.completion(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # Low temperature for code
        # Stop at the first code fence. Caveat: if the reply opens with a
        # fence, generation stops immediately and the content is empty.
        stop=["```"]
    )
    return response.choices[0].message.content


# Usage
code = generate_code("calculates the factorial of a number with memoization")
print(code)
```

Provide examples for better results:
```python
import litellm


def analyze_sentiment(text):
    """Analyze sentiment using few-shot examples."""
    examples = """
Here are examples of sentiment analysis:

Text: "I love this product, it's amazing!"
Sentiment: positive

Text: "This is terrible, I hate it."
Sentiment: negative

Text: "It's okay, nothing special."
Sentiment: neutral

Now analyze this text:
"""
    prompt = examples + text + "\n\nSentiment:"
    response = litellm.completion(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,
        max_tokens=10
    )
    return response.choices[0].message.content.strip()


# Usage
sentiment = analyze_sentiment("This movie was fantastic!")
print(f"Sentiment: {sentiment}")
```

Encourage step-by-step reasoning:
```python
import litellm


def solve_problem(problem):
    """Solve a problem with chain of thought."""
    prompt = f"""
    Solve this problem step by step. Show your work clearly.

    Problem: {problem}

    Think through this systematically:
    1. Understand the problem
    2. Identify the key information
    3. Consider different approaches
    4. Choose the best method
    5. Execute the solution
    6. Verify the answer

    Final Answer: [Your final answer here]
    """
    response = litellm.completion(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=1000
    )
    return response.choices[0].message.content


# Usage
solution = solve_problem("If a train travels at 60 mph for 2 hours, how far does it go?")
print(solution)
```

Generate multiple responses for comparison:
```python
import litellm


def generate_multiple_responses(prompt, n=3, model="gpt-4"):
    """Generate multiple completions for the same prompt."""
    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        n=n,              # Number of completions
        temperature=0.8,  # Higher temperature for variety
        max_tokens=200
    )
    return [choice.message.content for choice in response.choices]


# Usage
# Note: `n` originates in the OpenAI API; not every provider supports it,
# so check provider support before relying on multiple choices.
responses = generate_multiple_responses(
    "Write a creative slogan for a coffee shop",
    n=5,
    model="claude-3-sonnet-20240229"
)

for i, response in enumerate(responses, 1):
    print(f"{i}. {response}")
```

Leverage unique provider capabilities:
```python
import litellm

# Anthropic's extended thinking (only on models that support it; check the
# LiteLLM docs for the parameter shape your installed version accepts)
response = litellm.completion(
    model="anthropic/claude-3-7-sonnet-20250219",
    messages=[{"role": "user", "content": "Solve this complex math problem..."}],
    max_tokens=4000,  # Must be larger than the thinking budget
    thinking={"type": "enabled", "budget_tokens": 2000},
)

# OpenAI's function calling (`functions`/`function_call` is the legacy
# interface; newer code typically uses `tools`/`tool_choice`)
function_response = litellm.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    functions=[
        {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    ],
    function_call="auto"
)

# Check if function was called (the attribute may exist but be None)
func_call = getattr(function_response.choices[0].message, "function_call", None)
if func_call:
    print(f"Function: {func_call.name}")
    print(f"Arguments: {func_call.arguments}")
```

Robust completion handling:
```python
import time

import litellm


def safe_completion(model, messages, max_retries=3, **kwargs):
    """Completion with input validation, bounded retries, and error handling."""
    # Default parameters; caller-supplied kwargs override them
    params = {
        "max_tokens": 1000,
        "temperature": 0.7,
        "timeout": 30
    }
    params.update(kwargs)

    # Validate inputs
    if not messages or not isinstance(messages, list):
        raise ValueError("Messages must be a non-empty list")
    for msg in messages:
        if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
            raise ValueError("Each message must have 'role' and 'content' fields")

    try:
        response = litellm.completion(model=model, messages=messages, **params)

        # Validate response
        if not response.choices:
            raise ValueError("No choices returned in response")
        content = response.choices[0].message.content
        if not content or not content.strip():
            raise ValueError("Empty response content")
        return response

    except litellm.RateLimitError:
        if max_retries <= 0:
            raise  # Give up instead of retrying forever
        print("Rate limit exceeded. Waiting and retrying...")
        time.sleep(60)  # Wait 1 minute
        return safe_completion(model, messages, max_retries - 1, **kwargs)
    except litellm.AuthenticationError as e:
        raise ValueError(f"Invalid API key for {model}") from e
    except litellm.APIError as e:
        print(f"API error: {e}")
        # Could implement fallback to different model here
        raise
    except Exception as e:
        print(f"Unexpected error: {e}")
        raise


# Usage
try:
    response = safe_completion(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello!"}]
    )
    print("Response:", response.choices[0].message.content)
except Exception as e:
    print(f"Error: {e}")
```

Estimate costs before making calls:
```python
def estimate_completion_cost(model, messages, max_tokens=1000):
    """Estimate cost for a completion request."""
    # Rough token estimation
    total_chars = sum(len(msg["content"]) for msg in messages)
    estimated_input_tokens = total_chars // 4  # Rough approximation
    estimated_output_tokens = max_tokens

    # Cost per 1K tokens (approximate; verify against current provider pricing)
    costs = {
        "gpt-4": {"input": 0.03, "output": 0.06},
        "gpt-4-turbo": {"input": 0.01, "output": 0.03},
        "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
        "claude-3-opus-20240229": {"input": 0.015, "output": 0.075},
        "claude-3-haiku-20240307": {"input": 0.00025, "output": 0.00125},
    }

    if model not in costs:
        return None  # Unknown cost

    model_costs = costs[model]
    input_cost = (estimated_input_tokens / 1000) * model_costs["input"]
    output_cost = (estimated_output_tokens / 1000) * model_costs["output"]

    return {
        "estimated_input_tokens": estimated_input_tokens,
        "estimated_output_tokens": estimated_output_tokens,
        "estimated_cost": input_cost + output_cost,
        "currency": "USD"
    }


# Usage
cost_estimate = estimate_completion_cost(
    "gpt-4",
    [{"role": "user", "content": "Write a 500-word essay about AI"}],
    max_tokens=1000
)

if cost_estimate:
    print(f"Estimated cost: ${cost_estimate['estimated_cost']:.4f}")
    print(f"Input tokens: {cost_estimate['estimated_input_tokens']}")
    print(f"Output tokens: {cost_estimate['estimated_output_tokens']}")
```

- Temperature Tuning: Use lower temperatures (0.1-0.3) for factual/coding tasks, higher (0.7-0.9) for creative tasks
- Max Tokens: Set appropriate limits to control costs and response length
- System Messages: Use system messages to set context and behavior
- Conversation Context: Maintain conversation history for multi-turn interactions
- Error Handling: Always wrap API calls in try/except blocks
- Cost Monitoring: Track usage and set budgets
- Prompt Engineering: Craft clear, specific prompts for better results
The completion API is your primary interface to LLM capabilities. Mastering these patterns will enable you to build sophisticated AI applications across any provider.
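
As an illustration of the "track usage and set budgets" practice, the following sketch accumulates spend from per-call token counts and refuses to go past a limit. `BudgetTracker` and `BudgetExceeded` are hypothetical names, not LiteLLM classes, and the per-1K prices are supplied by the caller rather than looked up.

```python
class BudgetExceeded(RuntimeError):
    """Raised when recorded spend goes past the configured budget."""


class BudgetTracker:
    # Hypothetical helper: prices are per 1K tokens and supplied by the caller.
    def __init__(self, budget_usd, input_price_per_1k, output_price_per_1k):
        self.budget_usd = budget_usd
        self.input_price = input_price_per_1k
        self.output_price = output_price_per_1k
        self.spent_usd = 0.0

    def record(self, prompt_tokens, completion_tokens):
        """Add one call's cost to the running total and return that cost."""
        cost = (prompt_tokens / 1000) * self.input_price
        cost += (completion_tokens / 1000) * self.output_price
        self.spent_usd += cost
        if self.spent_usd > self.budget_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f} of ${self.budget_usd:.4f}"
            )
        return cost

    @property
    def remaining_usd(self):
        return max(0.0, self.budget_usd - self.spent_usd)
```

With OpenAI-format responses, the token counts would typically come from `response.usage.prompt_tokens` and `response.usage.completion_tokens` after each call.
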
- Define the runtime boundary for Chapter 3: Completion API.
- Separate control-plane decisions from data-plane execution.
- Capture input contracts, transformation points, and output contracts.
- Trace state transitions across request lifecycle stages.
- Identify extension hooks and policy interception points.
- Map ownership boundaries for team and automation workflows.
- Specify rollback and recovery paths for unsafe changes.
- Track observability signals for correctness, latency, and cost.
| Decision Area | Low-Risk Path | High-Control Path | Tradeoff |
|---|---|---|---|
| Runtime mode | managed defaults | explicit policy config | speed vs control |
| State handling | local ephemeral | durable persisted state | simplicity vs auditability |
| Tool integration | direct API use | mediated adapter layer | velocity vs governance |
| Rollout method | manual change | staged + canary rollout | effort vs safety |
| Incident response | best effort logs | runbooks + SLO alerts | cost vs reliability |
| Failure Mode | Early Signal | Root Cause Pattern | Countermeasure |
|---|---|---|---|
| stale context | inconsistent outputs | missing refresh window | enforce context TTL and refresh hooks |
| policy drift | unexpected execution | ad hoc overrides | centralize policy profiles |
| auth mismatch | 401/403 bursts | credential sprawl | rotation schedule + scope minimization |
| schema breakage | parser/validation errors | unmanaged upstream changes | contract tests per release |
| retry storms | queue congestion | no backoff controls | jittered backoff + circuit breakers |
| silent regressions | quality drop without alerts | weak baseline metrics | eval harness with thresholds |
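
The "retry storms" countermeasure in the table can be sketched as exponential backoff with full jitter. This is a generic sketch, not a LiteLLM feature: `call_with_backoff` and its parameters are hypothetical, the circuit-breaker half of the countermeasure is not shown, and the `sleep` argument is injectable so the logic can be tested without waiting.

```python
import random
import time


def call_with_backoff(fn, retryable=(Exception,), max_attempts=5,
                      base=1.0, cap=30.0, sleep=time.sleep):
    """Call fn(); on a retryable error wait a jittered delay and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error
            # Full jitter: uniform delay in [0, min(cap, base * 2**attempt)]
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

A call site might wrap a completion in a lambda, for example `call_with_backoff(lambda: litellm.completion(model="gpt-4", messages=messages), retryable=(litellm.RateLimitError,))`. Randomizing the delay spreads retries from many clients so they do not arrive in synchronized waves.
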
- Establish a reproducible baseline environment.
- Capture chapter-specific success criteria before changes.
- Implement minimal viable path with explicit interfaces.
- Add observability before expanding feature scope.
- Run deterministic tests for happy-path behavior.
- Inject failure scenarios for negative-path validation.
- Compare output quality against baseline snapshots.
- Promote through staged environments with rollback gates.
- Record operational lessons in release notes.
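
The "deterministic tests" and "inject failure scenarios" steps above can be sketched by passing the completion function in as an argument, so tests can substitute a stub for the network call. Everything here is hypothetical scaffolding: `ask`, `fake_response`, and the test functions are illustrative names, and the stub only mimics the `choices[0].message.content` shape used throughout this chapter.

```python
from types import SimpleNamespace


def fake_response(text):
    """Build an object with the choices[0].message.content shape."""
    message = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


def ask(question, complete, model="gpt-4"):
    """Send one user question; `complete` is called like litellm.completion."""
    response = complete(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content.strip()


def test_happy_path():
    calls = []

    def stub(**kwargs):
        calls.append(kwargs)
        return fake_response("  4 \n")

    assert ask("2+2?", stub) == "4"
    assert calls[0]["messages"][0]["role"] == "user"


def test_failure_is_propagated():
    def broken(**kwargs):
        raise TimeoutError("simulated timeout")

    try:
        ask("2+2?", broken)
    except TimeoutError:
        return
    raise AssertionError("expected TimeoutError")
```

In production code the same function would be called as `ask("2+2?", litellm.completion)`; the tests never touch the network, so they stay fast and repeatable.
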
- chapter-level assumptions are explicit and testable
- API/tool boundaries are documented with input/output examples
- failure handling includes retry, timeout, and fallback policy
- security controls include auth scopes and secret rotation plans
- observability includes logs, metrics, traces, and alert thresholds
- deployment guidance includes canary and rollback paths
- docs include links to upstream sources and related tracks
- post-release verification confirms expected behavior under load
- Langfuse Tutorial
- Vercel AI SDK Tutorial
- OpenAI Python SDK Tutorial
- Aider Tutorial
- Chapter 1: Getting Started
- Build a minimal end-to-end implementation for Chapter 3: Completion API.
- Add instrumentation and measure baseline latency and error rate.
- Introduce one controlled failure and confirm graceful recovery.
- Add policy constraints and verify they are enforced consistently.
- Run a staged rollout and document rollback decision criteria.
- Which execution boundary matters most for this chapter and why?
- What signal detects regressions earliest in your environment?
- What tradeoff did you make between delivery speed and governance?
- How would you recover from the highest-impact failure mode?
- What must be automated before scaling to team-wide adoption?
Most teams struggle here because the hard part is not writing more code but defining clear boundaries for `messages`, `content`, and `response`, so that behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without clear rollback or observability strategy
After working through this chapter, you should be able to reason about Chapter 3: Completion API as an operating subsystem inside LiteLLM Tutorial: Unified LLM Gateway and Routing Layer, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around `self`, `role`, and `model` as your checklist when adapting these patterns to your own repository.
Under the hood, Chapter 3: Completion API usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for `messages`.
- Input normalization: shape incoming data so `content` receives stable contracts.
- Core execution: run the main logic branch and propagate intermediate state through `response`.
- Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
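
The six stages above can be sketched as a single function with one commented step per stage. This is an illustration of the described sequence rather than how LiteLLM is implemented internally: `run_pipeline`, the `execute` callable, and the `max_output_chars` limit are all hypothetical.

```python
import logging
import time

log = logging.getLogger("completion_pipeline")


def run_pipeline(raw_messages, execute, max_output_chars=4000):
    """Run one request through the stages; `execute` maps messages to text."""
    # 1. Context bootstrap: start timing for this request
    started = time.perf_counter()

    # 2. Input normalization: coerce every message to a stable shape
    messages = [
        {"role": str(m["role"]), "content": str(m["content"]).strip()}
        for m in raw_messages
    ]

    # 3. Core execution: delegate to the injected completion callable
    content = execute(messages)

    # 4. Policy and safety checks: reject empty or oversized output
    if not content or not content.strip():
        raise ValueError("empty completion")
    if len(content) > max_output_chars:
        raise ValueError("completion exceeds max_output_chars")

    # 5. Output composition: return one canonical payload
    result = {
        "content": content.strip(),
        "n_messages": len(messages),
        "latency_s": time.perf_counter() - started,
    }

    # 6. Operational telemetry: log what is needed for debugging
    log.info("completion ok: %d messages, %.3fs",
             result["n_messages"], result["latency_s"])
    return result
```

Because each stage either produces a value or raises, a failure points at a specific step, which matches the advice to check stages in order.
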
Use the following upstream sources to verify implementation details while reading this chapter:
- LiteLLM Repository
  Why it matters: authoritative reference on `LiteLLM Repository` (github.com).
- LiteLLM Releases
  Why it matters: authoritative reference on `LiteLLM Releases` (github.com).
- LiteLLM Docs
  Why it matters: authoritative reference on `LiteLLM Docs` (docs.litellm.ai).
Suggested trace strategy:
- search upstream code for `messages` and `content` to map concrete implementation paths
- compare docs claims against actual runtime/config code before reusing patterns in production