layout	default
title	Ollama Tutorial - Chapter 1: Getting Started
nav_order	1
has_children	false
parent	Ollama Tutorial

Chapter 1: Getting Started with Ollama

Welcome to Chapter 1: Getting Started with Ollama. In this part of the Ollama Tutorial: Running and Serving LLMs Locally, you will first build a mental model of how Ollama works, then move on to installation, first use, and the practical tradeoffs of running models locally.

Install Ollama, pull your first model, and run a local chat with an OpenAI-compatible API.

Overview

Ollama runs LLMs locally with zero cloud dependency. It wraps the high-performance llama.cpp inference engine in a friendly CLI and an OpenAI-style HTTP API. It works on macOS, Linux, and Windows (via WSL2), supports both CPU and GPU inference, and makes switching between models as simple as ollama run <model>.

Whether you want to prototype an AI feature without worrying about API keys, keep sensitive data off third-party servers, or simply experiment with open-source models on your own hardware, Ollama is one of the fastest ways to get started.

How Ollama Works Under the Hood

Before you install anything, it helps to understand the architecture at a high level.

flowchart LR
    A["CLI / HTTP Client"] -->|"REST request"| B["Ollama Server\n(port 11434)"]
    B -->|"loads weights"| C["Model\n(GGUF file)"]
    C -->|"llama.cpp\ninference"| D["Token Stream"]
    D -->|"response"| A

When you type ollama run llama3 in your terminal, here is what happens behind the scenes:

CLI sends a request -- The ollama binary is a thin client. It sends an HTTP request to the Ollama server process (which listens on localhost:11434 by default).
Server loads the model -- The server locates the requested model on disk, loads its weights into memory (RAM or VRAM), and prepares the inference context.
llama.cpp runs inference -- Ollama uses llama.cpp as its core inference engine. This C/C++ library is optimized for running quantized models efficiently on consumer hardware.
Tokens stream back -- The model generates tokens one at a time. By default, Ollama streams them back to the client so you see text appear incrementally, just like a cloud-hosted chatbot.

The GGUF Model Format

Ollama stores models in the GGUF (GPT-Generated Unified Format) format. GGUF is a single-file binary format designed specifically for llama.cpp. Each GGUF file bundles:

Model metadata -- Architecture details, vocabulary size, context length, and other hyperparameters.
Tokenizer data -- The byte-pair encoding (BPE) or SentencePiece tables needed to convert text into tokens and back.
Quantized weight tensors -- The actual neural network weights, compressed using quantization schemes like Q4_0, Q4_K_M, or Q8_0.
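
To make the single-file layout concrete, here is a minimal sketch that reads only the fixed-size header at the start of a GGUF file. It assumes the version 2+ layout from the published GGUF specification (4 magic bytes, a 32-bit version, then 64-bit tensor and metadata counts, all little-endian). The names `GGUFHeader` and `read_gguf_header` are illustrative and not part of Ollama.

```python
import struct
from typing import NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


# Assumed layout (GGUF v2+): 4-byte magic, uint32 version,
# uint64 tensor count, uint64 metadata key/value count, little-endian.
_HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(data: bytes) -> GGUFHeader:
    """Parse the fixed-size header at the start of a GGUF file."""
    if len(data) < _HEADER.size:
        raise ValueError("not enough bytes for a GGUF header")
    magic, version, tensors, kv_pairs = _HEADER.unpack_from(data)
    if magic != b"GGUF":
        raise ValueError(f"unexpected magic bytes: {magic!r}")
    return GGUFHeader(version, tensors, kv_pairs)
```

With a real model you would pass the first 24 bytes of the file. The metadata key/value pairs and tensor descriptions follow the header and are not parsed by this sketch.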

Quantization is the key to running large models on modest hardware. A full-precision (32-bit) 7B-parameter model needs roughly 28 GB of memory for its weights alone (7 billion parameters at 4 bytes each), but a 4-bit quantized version fits in about 4 GB.
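
The arithmetic behind those figures can be checked with a few lines of Python. This is a rough sketch that counts the weights only; the function name is illustrative, and real 4-bit formats store some extra scaling data per block, which is part of why the practical figure is closer to 4 GB than 3.5 GB.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


full_precision = weight_memory_gb(7e9, 32)  # 32-bit floats
four_bit = weight_memory_gb(7e9, 4)         # idealized 4-bit quantization
print(f"FP32: {full_precision:.1f} GB, 4-bit: {four_bit:.1f} GB")
# prints: FP32: 28.0 GB, 4-bit: 3.5 GB
```

The KV cache and runtime buffers add to this, so treat the result as a lower bound when sizing hardware.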

Model Layers and Memory

When a model is loaded, Ollama maps its layers into memory. On systems with a compatible GPU, layers can be offloaded to VRAM for faster inference. You can control this behavior with the num_gpu parameter (the number of layers to offload, set in a Modelfile or in the API request options) or let Ollama decide automatically. The server will:

Place as many layers on the GPU as VRAM allows.
Fall back to system RAM for the remaining layers.
Use memory-mapped files so the OS can page layers in and out efficiently.

This hybrid approach means you can run a 13B model on a machine with only 8 GB of VRAM -- some layers will run on the GPU and the rest on the CPU, giving you a balance of speed and capability.
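
The idea of splitting layers between VRAM and system RAM can be sketched as a simple greedy calculation. This is not Ollama's actual scheduling logic; the layer size, the layer count, and the amount of VRAM held in reserve are all assumptions chosen for illustration.

```python
from typing import Tuple


def split_layers(n_layers: int, layer_mb: int, vram_mb: int,
                 reserve_mb: int = 2048) -> Tuple[int, int]:
    """Return (gpu_layers, cpu_layers) for a simple greedy placement.

    reserve_mb is VRAM held back for the KV cache and scratch buffers;
    the 2048 MB default is an assumption for illustration only.
    """
    usable = max(vram_mb - reserve_mb, 0)
    gpu_layers = min(n_layers, usable // layer_mb)
    return gpu_layers, n_layers - gpu_layers


# Hypothetical 13B model: 40 layers of roughly 185 MB each at 4-bit.
print(split_layers(n_layers=40, layer_mb=185, vram_mb=8192))  # (33, 7)
```

In this hypothetical case most layers land on the GPU and the remainder run on the CPU, which matches the mixed placement described above.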

Prerequisites

macOS 12+ / Linux (Ubuntu, Debian, RHEL, Fedora, Arch) / Windows 10 or 11 via WSL2
Sufficient RAM for your target model size (see the hardware requirements table below)
Optional: NVIDIA GPU with CUDA 11.7+ drivers, AMD GPU with ROCm, or Apple Silicon (M1/M2/M3/M4) for GPU-accelerated inference
curl for API tests
Python 3.8+ or Node.js 18+ for the SDK quickstart examples

Hardware Requirements

The amount of RAM you need depends directly on the model size and quantization level. Here is a practical reference table for 4-bit quantized models (for example Q4_K_M, a common default quantization in the Ollama library):

Model Size	RAM Required	VRAM (GPU) Recommended	Example Models
1B -- 3B	4 GB	2 GB+	`tinyllama`, `phi3:mini`, `gemma:2b`
7B	8 GB	6 GB+	`llama3:8b`, `mistral`, `gemma:7b`
13B	16 GB	10 GB+	`llama2:13b`, `codellama:13b`
34B	32 GB	24 GB+	`codellama:34b`, `yi:34b`
70B	64 GB	48 GB+	`llama3:70b`, `mixtral:8x7b`

Tips for choosing a model:

If you have 8 GB of RAM and no dedicated GPU, stick with 7B models. They offer surprisingly good quality for most tasks.
If you have 16 GB of RAM, 13B models are the sweet spot -- noticeably smarter than 7B with reasonable speed.
70B models require serious hardware. Consider cloud GPU instances if you do not have 64 GB+ of RAM.
Apple Silicon Macs with unified memory work particularly well because the GPU and CPU share the same memory pool.
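
As a quick way to apply the table, the following sketch maps an amount of RAM to the largest size class listed for it. The thresholds are copied from the table above; the function name and tier list are illustrative, and the result is a starting point rather than a guarantee.

```python
from typing import Optional

# (minimum RAM in GB, model size class), copied from the table above.
RAM_TIERS = [(64, "70B"), (32, "34B"), (16, "13B"), (8, "7B"), (4, "1B-3B")]


def largest_size_class(ram_gb: float) -> Optional[str]:
    """Return the largest size class whose RAM requirement is met."""
    for min_ram, size_class in RAM_TIERS:
        if ram_gb >= min_ram:
            return size_class
    return None  # below the smallest tier in the table


print(largest_size_class(16))  # 13B
```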

Install Ollama

macOS

The simplest way to install on macOS is via Homebrew:

brew install ollama

Alternatively, download the macOS app directly from ollama.com/download. The desktop app includes the CLI and runs the server as a background service automatically.

Linux (Official Script)

The one-liner install script works on most distributions:

curl -fsSL https://ollama.com/install.sh | sh

This installs the ollama binary to /usr/local/bin and sets up a systemd service. On non-systemd distributions, you will need to start the server manually.

Windows (via WSL2)

Install WSL2 and Ubuntu from the Microsoft Store.
Open your Ubuntu terminal and run the Linux install script:
```
curl -fsSL https://ollama.com/install.sh | sh
```
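If you have an NVIDIA GPU, install a current NVIDIA driver on the Windows host. WSL2 exposes the GPU to Linux through that driver, so you should not install a separate Linux display driver inside WSL2. The NVIDIA Container Toolkit is only needed inside WSL2 if you plan to run Ollama in Docker there.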

Docker

Running Ollama in Docker is a great option for reproducible environments, CI/CD pipelines, or keeping your host system clean:

# CPU-only
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# NVIDIA GPU (requires nvidia-container-toolkit)
docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# AMD GPU (ROCm)
docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm

Once the container is running, you can execute Ollama commands inside it:

docker exec -it ollama ollama pull llama3
docker exec -it ollama ollama run llama3 "Hello from Docker!"

Or simply call the API from your host machine since port 11434 is mapped.

Verifying Your Installation

After installing, run these checks to make sure everything is working:

# 1. Check the CLI version
ollama --version
# Expected output: ollama version is 0.x.x

# 2. Start the server (skip if using the macOS app or Docker)
ollama serve &

# 3. Verify the server is responding
curl http://localhost:11434
# Expected output: "Ollama is running"

# 4. Pull a small test model
ollama pull tinyllama

# 5. Run a quick test
ollama run tinyllama "Say hello in three languages"

If step 3 returns "Ollama is running", your installation is healthy and ready to go.
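
If you want to run the same check from a script, a sketch along these lines works with only the standard library. It relies on the status text shown in step 3; the function names, the default URL argument, and the 2-second timeout are assumptions for illustration.

```python
import urllib.error
import urllib.request


def looks_healthy(body: str) -> bool:
    """True if the response body is the server's status banner."""
    return body.strip() == "Ollama is running"


def server_is_up(base_url: str = "http://localhost:11434",
                 timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return looks_healthy(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar errors.
        return False
```

Calling `server_is_up()` before sending real requests gives a clearer failure than a stack trace from the first API call.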

Start the Ollama Service

ollama serve   # starts the server on port 11434 (runs in the foreground)

Default API endpoint: http://localhost:11434
Logs: ~/.ollama/logs on macOS; on Linux installs that use systemd, read them with journalctl -u ollama (location varies by platform)
Automatic start: On macOS (desktop app) and Linux (systemd), the server starts automatically at boot. You only need to run ollama serve manually if you installed via a method that does not set up a service.

To confirm the server is running at any time:

curl http://localhost:11434
# Returns: "Ollama is running"

Pull Your First Model

ollama pull llama3
ollama list          # verify the model is downloaded

The first pull will take a few minutes depending on your internet speed -- llama3 (8B, 4-bit) is approximately 4.7 GB. Progress is displayed in the terminal.
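
The same check that `ollama list` performs can be done over HTTP. The sketch below assumes the server's `/api/tags` endpoint returns a JSON object with a `models` list whose entries carry `name` and `size` (bytes) fields; the helper names are illustrative.

```python
import json
import urllib.request
from typing import Any, Dict, List


def summarize_models(payload: Dict[str, Any]) -> List[str]:
    """Turn a /api/tags style payload into 'name  size' lines."""
    lines = []
    for model in payload.get("models", []):
        size_gb = model.get("size", 0) / 1e9
        lines.append(f"{model['name']}  {size_gb:.1f} GB")
    return lines


def fetch_models(base_url: str = "http://localhost:11434") -> Dict[str, Any]:
    """Fetch the list of installed models from a running server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

With the server running, `print("\n".join(summarize_models(fetch_models())))` lists what has been pulled so far.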

Popular starter models:

Model	Size	Best For
`tinyllama`	~600 MB	Testing, fast iteration
`phi3:mini`	~2.3 GB	Small and capable, great for code
`mistral`	~4.1 GB	Balanced quality and speed
`llama3`	~4.7 GB	Strong general-purpose model
`codellama`	~3.8 GB	Code generation and analysis

Understanding Model Tags

When you see a model name like llama3:8b-instruct-q4_K_M, it follows a structured naming convention:

<family>:<size>-<variant>-<quantization>

Here is what each part means:

Family (llama3) -- The base model family or architecture.
Size (8b) -- The number of parameters (e.g., 8 billion).
Variant (instruct) -- The fine-tuning variant. Common variants include:
- instruct -- Fine-tuned to follow instructions (most common for chat).
- chat -- Optimized for conversational interactions.
- code -- Fine-tuned specifically for programming tasks.
- text -- Base model without instruction tuning.
Quantization (q4_K_M) -- The compression method. Lower numbers mean smaller files but slightly reduced quality:
- q4_0 -- 4-bit, basic quantization. Smallest size, lower quality.
- q4_K_M -- 4-bit, K-quant medium. Good balance (a common default in the Ollama library).
- q5_K_M -- 5-bit, K-quant medium. Better quality, slightly larger.
- q8_0 -- 8-bit quantization. Near-original quality, roughly double the size.

When you run ollama pull llama3, Ollama automatically downloads the default tag (latest), which is usually an instruct-tuned variant with 4-bit quantization (Q4_0 or Q4_K_M, depending on the model). You can be explicit if you want a specific version:

ollama pull llama3:8b              # default quantization
ollama pull llama3:8b-instruct-q8_0   # higher quality, larger file
ollama pull llama3:70b             # much larger model
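
The naming convention described above can be expressed as a small parser. This is a sketch under the assumption that tags follow the `<family>:<size>-<variant>-<quantization>` pattern; tags in the registry do not always do so, and the `ModelTag` type and the quantization regex here are illustrative.

```python
import re
from typing import NamedTuple, Optional


class ModelTag(NamedTuple):
    family: str
    size: Optional[str]
    variant: Optional[str]
    quantization: Optional[str]


# Parts such as q4_0, q4_K_M, q8_0, or fp16 are treated as quantization labels.
_QUANT = re.compile(r"^(q\d|f16|fp16|f32)", re.IGNORECASE)


def parse_model_name(name: str) -> ModelTag:
    """Split 'family:size-variant-quant' into its labeled parts."""
    family, _, tag = name.partition(":")
    if not tag or tag == "latest":
        return ModelTag(family, None, None, None)
    size = variant = quant = None
    for part in tag.split("-"):
        if _QUANT.match(part):
            quant = part
        elif size is None and re.match(r"^\d", part):
            size = part  # e.g. 8b, 70b, 8x7b
        else:
            variant = part if variant is None else f"{variant}-{part}"
    return ModelTag(family, size, variant, quant)


print(parse_model_name("llama3:8b-instruct-q4_K_M"))
```

Missing parts come back as `None`, so `parse_model_name("llama3:70b")` reports only the family and the size.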

Run Your First Chat

Interactive chat in the terminal:

ollama run llama3 "What is Ollama?"          # one-off prompt with immediate response
ollama run llama3                             # enters the REPL for an ongoing conversation

Inside the REPL you can type messages, and the model will respond. Press Ctrl+D or type /bye to exit.

Your First Conversation -- Multi-Turn Chat

One of Ollama's strengths is maintaining context across multiple turns in a conversation. Here is how multi-turn chat works with the HTTP API:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "stream": false,
  "messages": [
    {"role": "system", "content": "You are a helpful cooking assistant."},
    {"role": "user", "content": "I have chicken, rice, and broccoli. What can I make?"},
    {"role": "assistant", "content": "You could make a chicken stir-fry with rice and broccoli! Slice the chicken, stir-fry it with the broccoli, season with soy sauce and garlic, and serve over steamed rice."},
    {"role": "user", "content": "Great idea! How long should I cook the chicken?"}
  ]
}'

The key insight is that you send the entire conversation history with each request. Ollama does not store session state between HTTP calls -- the context is built from the messages array you provide. This is the same pattern used by the OpenAI Chat Completions API, so if you have experience with that, you will feel right at home.

In the REPL (ollama run llama3), this history is managed automatically for you. Each message you type is appended to the conversation, giving the model full context of what you have discussed.
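
Because the server keeps no session state, the client has to carry the history itself. The sketch below shows one way to do that with the standard library; the `Conversation` class and `send` helper are hypothetical names, and the 120-second timeout is an arbitrary choice.

```python
import json
import urllib.request
from typing import Dict, List


class Conversation:
    """Keeps the message history that must be resent on every request."""

    def __init__(self, model: str, system: str = "") -> None:
        self.model = model
        self.messages: List[Dict[str, str]] = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def payload(self, user_text: str) -> dict:
        """Append the user's turn and build the /api/chat request body."""
        self.messages.append({"role": "user", "content": user_text})
        return {"model": self.model, "stream": False,
                "messages": list(self.messages)}

    def record_reply(self, reply: str) -> None:
        """Store the assistant's answer so the next turn includes it."""
        self.messages.append({"role": "assistant", "content": reply})


def send(conv: Conversation, user_text: str,
         base_url: str = "http://localhost:11434") -> str:
    """POST one turn to a running server and record the reply."""
    body = json.dumps(conv.payload(user_text)).encode("utf-8")
    req = urllib.request.Request(f"{base_url}/api/chat", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        reply = json.load(resp)["message"]["content"]
    conv.record_reply(reply)
    return reply
```

Each call to `send` grows the history by two messages, which is also why long conversations eventually run into the model's context length.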

Call the HTTP API (OpenAI-Compatible)

Ollama exposes two API styles: the native Ollama API and an OpenAI-compatible endpoint.

Native Ollama API

curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {"role": "user", "content": "Tell me a short joke"}
  ]
}'

By default this streams newline-delimited JSON objects. Add "stream": false to receive a single JSON response.

OpenAI-Compatible API

curl http://localhost:11434/v1/chat/completions -d '{
  "model": "llama3",
  "messages": [
    {"role": "user", "content": "Tell me a short joke"}
  ]
}' -H "Content-Type: application/json"

This endpoint mirrors the OpenAI chat completions format (choices, message, usage), making it straightforward to point existing OpenAI-based code at your local Ollama instance.
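
The two response shapes differ mainly in where the text lives. As a small sketch, assuming non-streaming responses that have already been parsed into dictionaries, a helper like the hypothetical `extract_text` below can read either one.

```python
from typing import Any, Dict


def extract_text(response: Dict[str, Any]) -> str:
    """Return the assistant text from either response shape."""
    if "choices" in response:
        # OpenAI-style body from /v1/chat/completions
        return response["choices"][0]["message"]["content"]
    # Native body from /api/chat
    return response["message"]["content"]
```

This keeps the rest of an application independent of which endpoint it happens to call.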

Python Quickstart

Basic Request

import requests

resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3",
    "messages": [{"role": "user", "content": "Explain RAG in 3 bullets"}],
    "stream": False
})
print(resp.json()["message"]["content"])

Streaming Responses

For a better user experience -- especially with longer answers -- you can stream the response token by token. This lets you display text as it is generated, rather than waiting for the full response:

import requests
import json

resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3",
    "messages": [{"role": "user", "content": "Write a short poem about open source"}],
    "stream": True
}, stream=True)

for line in resp.iter_lines():
    if line:
        chunk = json.loads(line)
        # Each chunk contains a partial message
        content = chunk.get("message", {}).get("content", "")
        print(content, end="", flush=True)

        # The last chunk has "done": true
        if chunk.get("done"):
            print()  # newline at the end

When stream is True, Ollama sends a sequence of JSON objects, one per line. Each object contains a small piece of the response in message.content. The final object includes "done": true along with timing and token-count metadata.
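
If you need the complete text as well as the final metadata, the line-by-line format can be folded into a single result. This sketch assumes each line is one JSON object as described above; `collect_stream` is an illustrative name.

```python
import json
from typing import Any, Dict, Iterable, Tuple


def collect_stream(lines: Iterable[bytes]) -> Tuple[str, Dict[str, Any]]:
    """Join streamed chunks into full text; return (text, final_chunk)."""
    parts = []
    final: Dict[str, Any] = {}
    for line in lines:
        if not line:
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            final = chunk  # carries timing and token-count metadata
            break
    return "".join(parts), final
```

With the streaming request shown earlier, `text, final = collect_stream(resp.iter_lines())` returns the whole answer once the model has finished.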

Using the Official Python Library

Ollama also provides an official Python client that simplifies things further:

pip install ollama

import ollama

# Non-streaming
response = ollama.chat(model="llama3", messages=[
    {"role": "user", "content": "What is the capital of France?"}
])
print(response["message"]["content"])

# Streaming
for chunk in ollama.chat(model="llama3", messages=[
    {"role": "user", "content": "Explain quantum computing simply"}
], stream=True):
    print(chunk["message"]["content"], end="", flush=True)

Node.js Quickstart

Because Ollama's /v1 endpoint is OpenAI-compatible, you can use the official OpenAI Node.js SDK directly:

npm install openai

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama"  // required by the SDK but not used by Ollama
});

const chat = await client.chat.completions.create({
  model: "llama3",
  messages: [{ role: "user", content: "Summarize Ollama in one sentence" }],
});

console.log(chat.choices[0].message.content);

Key CLI Commands

ollama list                # list all installed models
ollama pull <model>        # download a model from the registry
ollama run <model>         # start a chat or send a one-off prompt
ollama show <model>        # display model metadata (parameters, template, license)
ollama rm <model>          # remove a model from disk
ollama cp <src> <dst>      # copy/rename a local model
ollama ps                  # show currently loaded models and memory usage

The ollama ps command is particularly useful for debugging. It shows you which models are currently loaded into memory, how much RAM/VRAM they are using, and whether they are running on CPU or GPU.
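
The same information is available programmatically. The sketch below assumes the server's `/api/ps` endpoint returns a `models` list whose entries include `size` and `size_vram` byte counts; treat those field names as an assumption and verify them against your Ollama version. The helper name is illustrative.

```python
from typing import Any, Dict


def gpu_share(model: Dict[str, Any]) -> float:
    """Fraction of a loaded model's memory that sits in VRAM (0.0 to 1.0)."""
    total = model.get("size", 0)
    if total <= 0:
        return 0.0
    return model.get("size_vram", 0) / total


# Hypothetical entry, shaped like one item of the assumed "models" list.
entry = {"name": "llama3:latest", "size": 8000, "size_vram": 6000}
print(f"{gpu_share(entry):.0%} on GPU")  # 75% on GPU
```

A value below 100% means part of the model is running on the CPU, which usually explains slower than expected generation.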

Environment Variables

You can customize Ollama's behavior using environment variables. Set these before starting the server, or add them to your shell profile for persistence:

Variable	Default	Description
`OLLAMA_HOST`	`127.0.0.1:11434`	Bind address and port for the server. Set to `0.0.0.0:11434` to allow remote connections.
`OLLAMA_MODELS`	`~/.ollama/models`	Directory where model files are stored. Change this to use a different disk or partition.
`OLLAMA_KEEP_ALIVE`	`5m`	How long a model stays loaded in memory after the last request. Set to `0` to unload immediately, or `-1` to keep loaded forever.
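`OLLAMA_NUM_PARALLEL`	`1`	Number of parallel request slots per model. Increase for concurrent users. The default has changed between releases, so check the documentation for your version.
`OLLAMA_MAX_LOADED_MODELS`	`1`	Maximum number of models loaded in memory simultaneously. The default has changed between releases, so check the documentation for your version.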
`OLLAMA_DEBUG`	`0`	Set to `1` to enable verbose debug logging.
`OLLAMA_ORIGINS`	(none)	Comma-separated list of allowed CORS origins for browser-based access.
`OLLAMA_TMPDIR`	system default	Directory for temporary files during model downloads and operations.

Example: Changing the model storage directory and allowing remote access:

export OLLAMA_HOST=0.0.0.0:11434
export OLLAMA_MODELS=/mnt/large-disk/ollama-models
export OLLAMA_KEEP_ALIVE=30m
ollama serve
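
Client scripts often need to turn the same `OLLAMA_HOST` value into a URL they can call. The following is a minimal sketch of that conversion; the function name is hypothetical, it does not handle IPv6 literals, and it maps the bind-all address `0.0.0.0` to the loopback address for local use.

```python
from typing import Mapping


def client_base_url(env: Mapping[str, str]) -> str:
    """Build the URL a local client should call from OLLAMA_HOST."""
    raw = env.get("OLLAMA_HOST", "127.0.0.1:11434")
    if "://" in raw:
        scheme, _, raw = raw.partition("://")
    else:
        scheme = "http"
    host, _, port = raw.partition(":")
    if host in ("", "0.0.0.0"):
        host = "127.0.0.1"  # 0.0.0.0 is a bind address, not a destination
    return f"{scheme}://{host}:{port or '11434'}"


print(client_base_url({"OLLAMA_HOST": "0.0.0.0:11434"}))
# prints: http://127.0.0.1:11434
```

In a real script you would pass `os.environ` instead of a literal dictionary.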

Files and Paths

Item	macOS / Linux	Windows (WSL)
Models	`~/.ollama/models`	`~/.ollama/models` (inside WSL)
Logs	`~/.ollama/logs` (macOS); `journalctl -u ollama` (Linux with systemd)	`~/.ollama/logs` (inside WSL)
Config	`~/.ollama/config.json`	`~/.ollama/config.json`
Binary	`/usr/local/bin/ollama`	`/usr/local/bin/ollama` (inside WSL)

The config.json file is rarely needed -- environment variables and CLI flags cover most configuration needs.

Troubleshooting

Problem	Solution
"could not connect"	Make sure the server is running with `ollama serve`, or check that the macOS app / Docker container is active.
Port conflict on 11434	Set `OLLAMA_HOST=127.0.0.1:11435` (or another free port) and restart the server. Use `0.0.0.0` only if you also want to accept remote connections.
Slow model downloads	Retry the pull -- downloads resume automatically. Check your network connection.
Out of memory errors	Try a smaller model or a more aggressive quantization (e.g., `q4_0` instead of `q8_0`). Close other memory-heavy applications.
GPU not detected	Verify drivers are installed (`nvidia-smi` for NVIDIA, `rocm-smi` for AMD). On macOS, Apple Silicon GPUs are used automatically.
Model not found	Check the model name with `ollama list`. Pull it first with `ollama pull <model>`.
Slow inference on CPU	This is expected for larger models. Consider a smaller model, or add a GPU for significant speedup.

Next: Chapter 2: Models & Modelfiles

Depth Expansion Playbook

This chapter is expanded to v1-style depth for production-grade learning and implementation quality.

Strategic Context

tutorial: Ollama Tutorial: Running and Serving LLMs Locally
tutorial slug: ollama-tutorial
chapter focus: Chapter 1: Getting Started with Ollama
system context: Ollama Tutorial
objective: move from surface-level usage to repeatable engineering operation

Architecture Decomposition

Define the runtime boundary for Chapter 1: Getting Started with Ollama.
Separate control-plane decisions from data-plane execution.
Capture input contracts, transformation points, and output contracts.
Trace state transitions across request lifecycle stages.
Identify extension hooks and policy interception points.
Map ownership boundaries for team and automation workflows.
Specify rollback and recovery paths for unsafe changes.
Track observability signals for correctness, latency, and cost.

Operator Decision Matrix

Decision Area	Low-Risk Path	High-Control Path	Tradeoff
Runtime mode	managed defaults	explicit policy config	speed vs control
State handling	local ephemeral	durable persisted state	simplicity vs auditability
Tool integration	direct API use	mediated adapter layer	velocity vs governance
Rollout method	manual change	staged + canary rollout	effort vs safety
Incident response	best effort logs	runbooks + SLO alerts	cost vs reliability

Failure Modes and Countermeasures

Failure Mode	Early Signal	Root Cause Pattern	Countermeasure
stale context	inconsistent outputs	missing refresh window	enforce context TTL and refresh hooks
policy drift	unexpected execution	ad hoc overrides	centralize policy profiles
auth mismatch	401/403 bursts	credential sprawl	rotation schedule + scope minimization
schema breakage	parser/validation errors	unmanaged upstream changes	contract tests per release
retry storms	queue congestion	no backoff controls	jittered backoff + circuit breakers
silent regressions	quality drop without alerts	weak baseline metrics	eval harness with thresholds
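
The "jittered backoff" countermeasure in the table can be sketched as follows. This is a generic pattern (capped exponential backoff with full jitter), not something Ollama provides; the function name and the default base and cap values are assumptions.

```python
import random
from typing import List, Optional


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Delays in seconds for each retry.

    Each delay is drawn uniformly between 0 and min(cap, base * 2**attempt),
    so retries from many clients spread out instead of arriving together.
    """
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** i)) for i in range(attempts)]
```

A client would sleep for each delay in turn between failed requests to the local server and give up once the list is exhausted.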

Implementation Runbook

Establish a reproducible baseline environment.
Capture chapter-specific success criteria before changes.
Implement minimal viable path with explicit interfaces.
Add observability before expanding feature scope.
Run deterministic tests for happy-path behavior.
Inject failure scenarios for negative-path validation.
Compare output quality against baseline snapshots.
Promote through staged environments with rollback gates.
Record operational lessons in release notes.

Quality Gate Checklist

chapter-level assumptions are explicit and testable
API/tool boundaries are documented with input/output examples
failure handling includes retry, timeout, and fallback policy
security controls include auth scopes and secret rotation plans
observability includes logs, metrics, traces, and alert thresholds
deployment guidance includes canary and rollback paths
docs include links to upstream sources and related tracks
post-release verification confirms expected behavior under load

Advanced Practice Exercises

Build a minimal end-to-end implementation for Chapter 1: Getting Started with Ollama.
Add instrumentation and measure baseline latency and error rate.
Introduce one controlled failure and confirm graceful recovery.
Add policy constraints and verify they are enforced consistently.
Run a staged rollout and document rollback decision criteria.

Review Questions

Which execution boundary matters most for this chapter and why?
What signal detects regressions earliest in your environment?
What tradeoff did you make between delivery speed and governance?
How would you recover from the highest-impact failure mode?
What must be automated before scaling to team-wide adoption?

Scenario Playbook 1: Chapter 1: Getting Started with Ollama

tutorial context: Ollama Tutorial: Running and Serving LLMs Locally
trigger condition: incoming request volume spikes after release
initial hypothesis: identify the smallest reproducible failure boundary
immediate action: protect user-facing stability before optimization work
engineering control: introduce adaptive concurrency limits and queue bounds
verification target: latency p95 and p99 stay within defined SLO windows
rollback trigger: pre-defined quality gate fails for two consecutive checks
communication step: publish incident status with owner and ETA
learning capture: add postmortem and convert findings into automated tests

Scenario Playbook 2: Chapter 1: Getting Started with Ollama

tutorial context: Ollama Tutorial: Running and Serving LLMs Locally
trigger condition: tool dependency latency increases under concurrency
initial hypothesis: identify the smallest reproducible failure boundary
immediate action: protect user-facing stability before optimization work
engineering control: enable staged retries with jitter and circuit breaker fallback
verification target: error budget burn rate remains below escalation threshold
rollback trigger: pre-defined quality gate fails for two consecutive checks
communication step: publish incident status with owner and ETA
learning capture: add postmortem and convert findings into automated tests

Scenario Playbook 3: Chapter 1: Getting Started with Ollama

tutorial context: Ollama Tutorial: Running and Serving LLMs Locally
trigger condition: schema updates introduce incompatible payloads
initial hypothesis: identify the smallest reproducible failure boundary
immediate action: protect user-facing stability before optimization work
engineering control: pin schema versions and add compatibility shims
verification target: throughput remains stable under target concurrency
rollback trigger: pre-defined quality gate fails for two consecutive checks
communication step: publish incident status with owner and ETA
learning capture: add postmortem and convert findings into automated tests

What Problem Does This Solve?

Most teams struggle here because the hard part is not writing more code, but drawing clear boundaries between the Ollama server, the models it loads, and the content sent to them, so that behavior stays predictable as complexity grows.

In practical terms, this chapter helps you avoid three common failures:

coupling core logic too tightly to one implementation path
missing the handoff boundaries between setup, execution, and validation
shipping changes without clear rollback or observability strategy

After working through this chapter, you should be able to reason about Chapter 1: Getting Started with Ollama as an operating subsystem inside Ollama Tutorial: Running and Serving LLMs Locally, with explicit contracts for inputs, state transitions, and outputs.

Use the examples in this chapter (the llama3 model, the chat endpoint, and the role-tagged messages) as your checklist when adapting these patterns to your own repository.

How it Works Under the Hood

Under the hood, Chapter 1: Getting Started with Ollama usually follows a repeatable control path:

Context bootstrap: initialize the runtime configuration and prerequisites for the Ollama server.
Input normalization: shape incoming data so the model receives a stable, well-defined request.
Core execution: run the main logic branch and carry intermediate state forward in the message content.
Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
Output composition: return canonical result payloads for downstream consumers.
Operational telemetry: emit logs/metrics needed for debugging and performance tuning.

When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.

Source Walkthrough

Use the following upstream sources to verify implementation details while reading this chapter:

Ollama Repository (github.com). Why it matters: authoritative reference for the source code.
Ollama Releases (github.com). Why it matters: authoritative reference for release notes and version changes.
Ollama Website and Docs (ollama.com). Why it matters: authoritative reference for downloads and documentation.

Suggested trace strategy:

search upstream code for ollama and model to map concrete implementation paths
compare docs claims against actual runtime/config code before reusing patterns in production
