layout	default
title	Ollama Tutorial - Chapter 2: Models & Modelfiles
nav_order	2
has_children	false
parent	Ollama Tutorial

Chapter 2: Models, Pulling, and Modelfiles

Welcome to Chapter 2: Models, Pulling, and Modelfiles. In this part of Ollama Tutorial: Running and Serving LLMs Locally, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.

Discover, manage, and customize models with Modelfiles and parameters. Learn how quantization works, which model families exist, how to import your own weights, and how to keep your disk tidy.

The Model Lifecycle

Before diving into commands, it helps to see the big picture. Every model you work with in Ollama moves through a simple lifecycle:

flowchart LR
    A["Pull"] --> B["Store"]
    B --> C["Load"]
    C --> D["Serve"]
    D --> E["Customize"]
    E --> B

    style A fill:#4a9eff,stroke:#2b7acc,color:#fff
    style B fill:#50c878,stroke:#359b5c,color:#fff
    style C fill:#ffa94d,stroke:#cc8a3d,color:#fff
    style D fill:#ff6b6b,stroke:#cc5555,color:#fff
    style E fill:#cc5de8,stroke:#a34bba,color:#fff

Stage	What happens
Pull	Download a model from the Ollama library (or import a local GGUF file).
Store	The model blob is saved under `~/.ollama/models` on disk.
Load	When you first run a model, Ollama loads its weights into RAM (and VRAM when a GPU is available).
Serve	The loaded model handles inference requests through the CLI or API.
Customize	You create a Modelfile to wrap a base model with a system prompt, parameters, or adapters, then store the result as a new named model.

This cycle means you can always go back and re-customize without re-downloading the base weights.

Browsing and Pulling Models

The Ollama library at ollama.com/library lists every officially supported model. From the CLI you can pull any of them by name and optional tag:

ollama list                     # show all installed models
ollama pull llama3:8b           # download a specific size tag
ollama pull mistral             # latest default tag
ollama pull phi3:mini           # small and fast

Tags follow the pattern model:variant. When you omit the tag, Ollama pulls the default variant (usually the smallest recommended size).

Common tags at a glance

llama3, llama3:8b, llama3:70b
mistral, mixtral:8x7b
phi3:mini, phi3:medium
qwen2, qwen2:72b
gemma, gemma:7b

Removing and copying models

ollama rm llama3:8b             # delete a model from disk
ollama cp llama3:8b my-llama3   # copy/rename (cheap, shares blobs)

Copying is almost free on disk because Ollama uses content-addressed storage. The copy points to the same underlying blob until you customize it.

Popular Models Comparison

Choosing a model can be overwhelming. The table below gives you a starting point. RAM figures are approximate for the default quantization that Ollama ships.

Model	Default Size	Strengths	Best Use Case	RAM Needed
Llama 3 8B	~4.7 GB	Strong general reasoning, good instruction following	General assistant, summarization, Q&A	8 GB
Llama 3 70B	~40 GB	Near-frontier quality, excellent coding	Complex reasoning, long-form writing	48 GB+
Mistral 7B	~4.1 GB	Fast, strong for its size, great at European languages	Chatbots, translation, light coding	8 GB
Mixtral 8x7B	~26 GB	Mixture-of-experts, high throughput	Multi-domain assistant, enterprise apps	32 GB+
Phi-3 Mini	~2.4 GB	Tiny footprint, surprisingly capable	Edge devices, quick prototyping, CI pipelines	4 GB
Phi-3 Medium	~7.5 GB	Best quality-to-size in the Phi family	Coding assistance, structured extraction	10 GB
Qwen 2 7B	~4.4 GB	Strong multilingual, excellent at Chinese/English	Multilingual chat, data extraction	8 GB
Qwen 2 72B	~41 GB	Top-tier multilingual reasoning	Research, enterprise multilingual	48 GB+
Gemma 7B	~5.0 GB	Lightweight, Google-trained on high-quality data	Summarization, lightweight assistants	8 GB
CodeGemma 7B	~5.0 GB	Purpose-built for code generation	Code completion, refactoring, review	8 GB
DeepSeek Coder	~4.5 GB	Trained on 2T tokens of code	Code generation, debugging, unit tests	8 GB

Tip: If you are just getting started, phi3:mini lets you validate your entire pipeline on a laptop with 8 GB of RAM. Once everything works, swap in a larger model for production quality.

Model Parameters (Runtime)

Every time you call a model you can override its default generation parameters. Use the options object in API calls or pass flags in the CLI:

Parameter	Range / Type	What it controls
`temperature`	0 -- 1.5	Randomness. Lower is more deterministic.
`top_p`	0 -- 1.0	Nucleus sampling cutoff.
`top_k`	integer	Limits token pool per step.
`repeat_penalty`	float	Penalizes repeated tokens.
`presence_penalty`	float	Penalizes tokens that have already appeared.
`frequency_penalty`	float	Penalizes tokens proportional to their frequency.
`num_ctx`	integer	Context window in tokens (e.g., 4096, 8192).
`num_gpu`	integer	Number of GPU layers. Auto-detected; override if needed.
`num_predict`	integer	Maximum tokens to generate.

Example: tuning via the API

curl http://localhost:11434/api/chat -d '{
  "model": "mistral",
  "options": {
    "temperature": 0.7,
    "top_p": 0.9,
    "num_ctx": 4096
  },
  "messages": [{"role": "user", "content": "Give me 3 startup ideas"}]
}'

Example: tuning via the CLI

ollama run mistral --verbose \
  "Give me 3 startup ideas" \
  --num-ctx 4096

The --verbose flag prints timing stats after each response so you can see how parameter changes affect speed.

Modelfiles (Build Custom Models)

A Modelfile is a plain-text recipe that tells Ollama how to create a new model from a base model plus your own configuration. Think of it like a Dockerfile, but for LLMs.

Modelfile syntax reference

Instruction	Purpose
`FROM <base>`	Required. The base model name or path to a GGUF file.
`SYSTEM "<text>"`	Sets the default system prompt baked into the model.
`PARAMETER <key> <value>`	Sets a default generation parameter.
`TEMPLATE "<text>"`	Custom prompt template using Go template syntax.
`FORMAT json`	Encourages structured JSON output.
`ADAPTER <path>`	Applies a LoRA or QLoRA adapter file.
`LICENSE "<text>"`	Embeds a license string into model metadata.
`MESSAGE <role> <content>`	Seeds the conversation with example messages.

Build and run workflow

# 1. Write a Modelfile (any filename works)
# 2. Create the model
ollama create my-model -f ./Modelfile

# 3. Test it
ollama run my-model "Hello, who are you?"

# 4. Inspect metadata
ollama show my-model

Now let's look at three complete, real-world Modelfile examples.

Worked Example 1: Code Assistant

This model is tuned for concise, accurate code generation. It uses a low temperature so answers are deterministic, a large context window for reading long files, and a system prompt that keeps responses focused.

# Modelfile.code-assistant
FROM llama3:8b

SYSTEM """You are a senior software engineer acting as a code assistant.
Rules:
- Always respond with code first, explanation second.
- Use the language the user specifies. Default to Python if unspecified.
- Include brief inline comments in generated code.
- When fixing bugs, show the original line and the corrected line.
- Never apologize or add filler phrases.
"""

PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.1
PARAMETER num_predict 2048

Build and test:

ollama create code-assistant -f Modelfile.code-assistant

ollama run code-assistant "Write a Python function that merges two sorted lists in O(n) time"

Because the temperature is set to 0.1 the model will produce nearly identical output every time, which is exactly what you want for code generation.

Worked Example 2: JSON Extractor

This model takes unstructured text and returns clean JSON. The FORMAT json instruction plus a strict system prompt make it reliable for pipelines.

# Modelfile.json-extractor
FROM mistral

SYSTEM """You are a structured data extraction engine.
Given any input text, extract entities and return ONLY valid JSON.
Do not include markdown fences, explanations, or commentary.

Output schema:
{
  "entities": [
    {
      "name": "<string>",
      "type": "<person|organization|location|date|other>",
      "confidence": <0.0-1.0>
    }
  ],
  "summary": "<one sentence summary of the input>"
}
"""

FORMAT json
PARAMETER temperature 0.0
PARAMETER top_p 0.85
PARAMETER num_ctx 4096
PARAMETER num_predict 1024
PARAMETER repeat_penalty 1.0

MESSAGE user "Acme Corp announced on Jan 5 that CEO Jane Doe will open a new office in Berlin."
MESSAGE assistant "{\"entities\":[{\"name\":\"Acme Corp\",\"type\":\"organization\",\"confidence\":0.95},{\"name\":\"Jane Doe\",\"type\":\"person\",\"confidence\":0.97},{\"name\":\"Berlin\",\"type\":\"location\",\"confidence\":0.99},{\"name\":\"Jan 5\",\"type\":\"date\",\"confidence\":0.90}],\"summary\":\"Acme Corp CEO Jane Doe will open a new Berlin office.\"}"

Build and test:

ollama create json-extractor -f Modelfile.json-extractor

echo "SpaceX launched Starship from Boca Chica on March 14. Elon Musk watched from mission control." \
  | ollama run json-extractor

The MESSAGE instructions at the bottom act as few-shot examples. They teach the model the exact output shape you expect, which dramatically improves consistency.

Worked Example 3: Creative Writer

This model is the opposite of the code assistant -- it values variety, flair, and unexpected turns. A high temperature and generous top-k let it explore the token space.

# Modelfile.creative-writer
FROM llama3:8b

SYSTEM """You are a world-class creative writing collaborator.
- Write vivid, sensory prose with varied sentence lengths.
- Embrace metaphor, subtext, and surprising word choices.
- Match the genre and tone the user requests (fantasy, sci-fi, literary, humor, etc.).
- When asked to continue a passage, maintain voice consistency.
- Avoid cliches. Prefer fresh imagery over stock phrases.
"""

PARAMETER temperature 1.2
PARAMETER top_p 0.95
PARAMETER top_k 80
PARAMETER num_ctx 8192
PARAMETER num_predict 4096
PARAMETER repeat_penalty 1.15
PARAMETER presence_penalty 0.6

Build and test:

ollama create creative-writer -f Modelfile.creative-writer

ollama run creative-writer "Write the opening paragraph of a noir detective story set on Mars"

The high repeat_penalty and presence_penalty push the model away from repetitive phrasing, which is essential for creative text that needs to stay interesting over long passages.

Adding Templates and Formats

The TEMPLATE instruction lets you control the raw prompt structure sent to the model. This is useful when a model expects a particular chat format:

FROM mistral
TEMPLATE """
{{ if .System }}[INST] <<SYS>>{{ .System }}<</SYS>> {{ end }}
{{ .Prompt }} [/INST]
"""
FORMAT json
PARAMETER temperature 0.1

TEMPLATE uses Go template syntax. .System and .Prompt are injected by Ollama.
FORMAT json tells Ollama to constrain the output grammar toward valid JSON.

Using LoRA / Adapters

If you have fine-tuned a LoRA adapter, you can layer it on top of a base model:

FROM llama3
ADAPTER ./adapters/my-lora.bin
SYSTEM "You are a domain-specific assistant."
PARAMETER num_ctx 4096

The adapter file must be in a format compatible with the base model architecture. Ollama currently supports GGUF-based adapters.

Understanding Quantization

When people talk about running a "7B model in 4 GB of RAM," quantization is the reason that is possible. This section explains what it is and how to pick the right level.

What is quantization?

Full-precision model weights use 16 bits (FP16) or 32 bits (FP32) per parameter. A 7-billion-parameter model at FP16 needs about 14 GB just for the weights. Quantization reduces each weight to fewer bits -- 8, 5, 4, or even 2 -- shrinking the file and the RAM footprint at the cost of some quality.

The GGUF format

GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp and Ollama. It bundles the quantized weights, tokenizer, and metadata into a single file. When you pull a model from the Ollama library, you are downloading a GGUF blob.

Common quantization levels

Quant Label	Bits per Weight	File Size (7B)	Quality	Speed	When to use
`Q2_K`	~2.5	~2.7 GB	Low	Fastest	Extreme memory constraints, rough prototyping
`Q3_K_S`	~3.0	~3.2 GB	Low-Medium	Very fast	Edge devices with 4 GB RAM
`Q3_K_M`	~3.3	~3.5 GB	Medium-Low	Fast	Budget hardware testing
`Q4_K_S`	~4.0	~3.9 GB	Medium	Fast	Good default for speed-sensitive apps
`Q4_K_M`	~4.5	~4.3 GB	Good (recommended)	Balanced	Best all-round choice for most users
`Q5_K_S`	~5.0	~4.8 GB	Good-High	Moderate	When you want a quality bump
`Q5_K_M`	~5.3	~5.1 GB	High	Moderate	Quality-sensitive production use
`Q6_K`	~6.0	~5.8 GB	Very High	Slower	Near-original quality
`Q8_0`	~8.0	~7.5 GB	Excellent	Slow	Benchmarking, reference baseline
`FP16`	16	~14 GB	Original	Slowest	Research, fine-tuning starting point

How to read a model tag

When you see a tag like llama3:8b-q4_K_M, it tells you:

llama3 -- model family
8b -- 8 billion parameters
q4_K_M -- Q4_K_M quantization

If no quantization is mentioned in the tag, Ollama picks a sensible default (usually Q4_K_M or Q4_0).

Picking the right quantization

A simple rule of thumb:

Laptop / 8 GB RAM -- use Q4_K_M or smaller.
Desktop / 16--32 GB RAM -- use Q5_K_M for a quality boost.
Server / 64 GB+ RAM -- use Q6_K or Q8_0 if quality matters more than throughput.
GPU with limited VRAM (6--8 GB) -- use Q4_K_S to keep the whole model in VRAM.

The quality difference between Q4_K_M and Q5_K_M is noticeable in creative writing and nuanced reasoning. For straightforward Q&A or code generation, Q4_K_M is usually indistinguishable from higher quants.

Model Families

The open-weight LLM landscape has several major families. Each has different training data, architecture choices, and sweet spots.

Llama (Meta)

Meta's Llama family is the most widely used open-weight foundation. Llama 3 improved instruction following, multilingual support, and coding ability over Llama 2.

Strengths: Broad general knowledge, strong reasoning, large community and ecosystem.
Best for: General assistants, summarization, Q&A, RAG pipelines.
Sizes: 8B, 70B (and community-made variants).

Mistral / Mixtral (Mistral AI)

Mistral 7B punches well above its weight class. Mixtral 8x7B is a mixture-of-experts model that activates only a fraction of its parameters per token, giving excellent throughput.

Strengths: Speed, European language quality, strong coding, efficient inference.
Best for: Chatbots, translation, real-time applications, multilingual support.
Sizes: 7B (Mistral), 8x7B (Mixtral).

Phi (Microsoft)

The Phi series proves that small models trained on high-quality, curated data can compete with models several times their size.

Strengths: Tiny RAM footprint, good reasoning for its size, fast inference.
Best for: Edge deployment, CI pipeline testing, resource-constrained environments, mobile.
Sizes: 3.8B (Mini), 14B (Medium).

Qwen (Alibaba Cloud)

Qwen 2 is a multilingual powerhouse with particularly strong Chinese and English performance. It also performs well on math and code benchmarks.

Strengths: Best-in-class multilingual, strong math and structured data.
Best for: Multilingual applications, data extraction, math tutoring.
Sizes: 7B, 72B (and smaller instruct variants).

Gemma (Google)

Gemma models are built on the same research and technology as Google's Gemini family but released as open weights. CodeGemma is a variant fine-tuned specifically for code tasks.

Strengths: Clean training data, solid reasoning, well-documented.
Best for: Summarization, lightweight assistants, code completion (CodeGemma).
Sizes: 2B, 7B.

Quick decision guide

I need...	Try first
A reliable general assistant	Llama 3 8B
The smallest possible model	Phi-3 Mini
Multilingual support	Qwen 2 7B
Fast inference on modest hardware	Mistral 7B
Code generation and review	DeepSeek Coder or CodeGemma
Frontier-level quality locally	Llama 3 70B or Qwen 2 72B

Importing Custom Models

You are not limited to the Ollama library. If you find a GGUF file on HuggingFace or train your own, you can import it directly.

Step 1: Download a GGUF from HuggingFace

Many model authors publish GGUF quantizations on HuggingFace. Look for repositories with names like *-GGUF. You can download with curl, wget, or the HuggingFace CLI:

# Example: downloading a Q4_K_M quant of a model
# Replace the URL with the actual file from the HuggingFace repo
curl -L -o my-model.Q4_K_M.gguf \
  "https://huggingface.co/TheBloke/Example-7B-GGUF/resolve/main/example-7b.Q4_K_M.gguf"

Step 2: Create a Modelfile pointing to the GGUF

# Modelfile.custom
FROM ./my-model.Q4_K_M.gguf

SYSTEM "You are a helpful assistant."
PARAMETER temperature 0.7
PARAMETER num_ctx 4096

The FROM line points to a local file path instead of a library model name. Ollama will register the GGUF blob in its store.

Step 3: Build and test

ollama create my-custom-model -f Modelfile.custom
ollama run my-custom-model "Tell me about yourself"

Step 4: Verify the import

ollama show my-custom-model

This prints the model's metadata including parameter count, quantization level, context length, and your system prompt.

Tips for importing

Make sure the GGUF file is compatible with the version of llama.cpp that your Ollama release uses. If you get an error about unsupported format versions, update Ollama first.
You can also import safetensors by first converting them to GGUF using the llama.cpp conversion scripts, but GGUF is the simplest path.
Adapters (LoRA) must also be in GGUF format when used with the ADAPTER instruction.

Model Storage Management

As you experiment with many models, disk usage can grow quickly. A single 70B model can take 40 GB or more. Here is how to stay in control.

Where models live

Platform	Default path
macOS / Linux	`~/.ollama/models`
Windows (WSL)	`~/.ollama/models` inside your WSL distro

You can change the storage location by setting the OLLAMA_MODELS environment variable before starting the server:

export OLLAMA_MODELS="/mnt/external-drive/ollama-models"
ollama serve

Checking disk usage

# List all models with their sizes
ollama list

# See detailed info for one model
ollama show llama3:8b

# Check total disk usage of the models directory
du -sh ~/.ollama/models

Cleaning up

# Remove a specific model
ollama rm llama3:70b

# Remove a custom model you no longer need
ollama rm json-extractor

# List what remains
ollama list

Sharing blobs between models

Ollama uses content-addressed storage. When you create a custom model with ollama create, it does not duplicate the base model weights. The custom model's metadata layer points to the same blob. This means:

ollama cp llama3:8b my-llama costs almost zero extra disk.
Creating ten Modelfile variants of mistral only stores the weights once.
Deleting a custom model does not delete the shared base blob unless it is the last reference.

Practical disk-saving tips

Remove models you are not using. It sounds obvious, but it is the single most effective strategy.
Prefer smaller quants when quality is acceptable. Q4_K_M is roughly half the size of Q8_0.
Avoid pulling both the tagged and untagged version. ollama pull mistral and ollama pull mistral:latest are the same blob, but ollama pull mistral:7b might be different. Check with ollama list.
Move storage to an external drive using OLLAMA_MODELS if your boot drive is small.
Back up your ~/.ollama/models directory if you maintain curated custom models. A simple rsync or tar works.

Best Practices

Start small. Validate your entire pipeline with phi3:mini before scaling up.
Pin your tags. Use llama3:8b instead of just llama3 to avoid unexpected weight changes when the library updates.
Use FORMAT json in Modelfiles when you need structured output for downstream parsing.
Keep separate custom models for different personas and use cases. They are cheap to store.
Version your Modelfiles in git alongside your application code so changes are tracked and reproducible.
Test quantization trade-offs on your actual workload. Benchmark with a sample of real prompts rather than relying only on leaderboard scores.

Summary

In this chapter you learned how to browse and pull models, compare popular options, write Modelfiles with real-world examples, understand quantization trade-offs, navigate the major model families, import custom GGUF weights from HuggingFace, and manage disk usage. With these skills you can confidently pick, configure, and maintain the right models for any project.

Previous: Chapter 1: Getting Started | Next: Chapter 3: Chat & Completions

Depth Expansion Playbook

This chapter is expanded to v1-style depth for production-grade learning and implementation quality.

Strategic Context

tutorial: Ollama Tutorial: Running and Serving LLMs Locally
tutorial slug: ollama-tutorial
chapter focus: Chapter 2: Models, Pulling, and Modelfiles
system context: Ollama Tutorial
objective: move from surface-level usage to repeatable engineering operation

Architecture Decomposition

Define the runtime boundary for Chapter 2: Models, Pulling, and Modelfiles.
Separate control-plane decisions from data-plane execution.
Capture input contracts, transformation points, and output contracts.
Trace state transitions across request lifecycle stages.
Identify extension hooks and policy interception points.
Map ownership boundaries for team and automation workflows.
Specify rollback and recovery paths for unsafe changes.
Track observability signals for correctness, latency, and cost.

Operator Decision Matrix

Decision Area	Low-Risk Path	High-Control Path	Tradeoff
Runtime mode	managed defaults	explicit policy config	speed vs control
State handling	local ephemeral	durable persisted state	simplicity vs auditability
Tool integration	direct API use	mediated adapter layer	velocity vs governance
Rollout method	manual change	staged + canary rollout	effort vs safety
Incident response	best effort logs	runbooks + SLO alerts	cost vs reliability

Failure Modes and Countermeasures

Failure Mode	Early Signal	Root Cause Pattern	Countermeasure
stale context	inconsistent outputs	missing refresh window	enforce context TTL and refresh hooks
policy drift	unexpected execution	ad hoc overrides	centralize policy profiles
auth mismatch	401/403 bursts	credential sprawl	rotation schedule + scope minimization
schema breakage	parser/validation errors	unmanaged upstream changes	contract tests per release
retry storms	queue congestion	no backoff controls	jittered backoff + circuit breakers
silent regressions	quality drop without alerts	weak baseline metrics	eval harness with thresholds

Implementation Runbook

Establish a reproducible baseline environment.
Capture chapter-specific success criteria before changes.
Implement minimal viable path with explicit interfaces.
Add observability before expanding feature scope.
Run deterministic tests for happy-path behavior.
Inject failure scenarios for negative-path validation.
Compare output quality against baseline snapshots.
Promote through staged environments with rollback gates.
Record operational lessons in release notes.

Quality Gate Checklist

chapter-level assumptions are explicit and testable
API/tool boundaries are documented with input/output examples
failure handling includes retry, timeout, and fallback policy
security controls include auth scopes and secret rotation plans
observability includes logs, metrics, traces, and alert thresholds
deployment guidance includes canary and rollback paths
docs include links to upstream sources and related tracks
post-release verification confirms expected behavior under load

Source Alignment

Cross-Tutorial Connection Map

Advanced Practice Exercises

Build a minimal end-to-end implementation for Chapter 2: Models, Pulling, and Modelfiles.
Add instrumentation and measure baseline latency and error rate.
Introduce one controlled failure and confirm graceful recovery.
Add policy constraints and verify they are enforced consistently.
Run a staged rollout and document rollback decision criteria.

Review Questions

Which execution boundary matters most for this chapter and why?
What signal detects regressions earliest in your environment?
What tradeoff did you make between delivery speed and governance?
How would you recover from the highest-impact failure mode?
What must be automated before scaling to team-wide adoption?

What Problem Does This Solve?

Most teams struggle here because the hard part is not writing more code, but deciding clear boundaries for ollama, PARAMETER, model so behavior stays predictable as complexity grows.

In practical terms, this chapter helps you avoid three common failures:

coupling core logic too tightly to one implementation path
missing the handoff boundaries between setup, execution, and validation
shipping changes without clear rollback or observability strategy

After working through this chapter, you should be able to reason about Chapter 2: Models, Pulling, and Modelfiles as an operating subsystem inside Ollama Tutorial: Running and Serving LLMs Locally, with explicit contracts for inputs, state transitions, and outputs.

Use the implementation notes around Modelfile, llama3, assistant as your checklist when adapting these patterns to your own repository.

How it Works Under the Hood

Under the hood, Chapter 2: Models, Pulling, and Modelfiles usually follows a repeatable control path:

Context bootstrap: initialize runtime config and prerequisites for ollama.
Input normalization: shape incoming data so PARAMETER receives stable contracts.
Core execution: run the main logic branch and propagate intermediate state through model.
Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
Output composition: return canonical result payloads for downstream consumers.
Operational telemetry: emit logs/metrics needed for debugging and performance tuning.

When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.

Source Walkthrough

Use the following upstream sources to verify implementation details while reading this chapter:

Ollama Repository Why it matters: authoritative reference on Ollama Repository (github.com).
Ollama Releases Why it matters: authoritative reference on Ollama Releases (github.com).
Ollama Website and Docs Why it matters: authoritative reference on Ollama Website and Docs (ollama.com).

Suggested trace strategy:

search upstream code for ollama and PARAMETER to map concrete implementation paths
compare docs claims against actual runtime/config code before reusing patterns in production

FilesExpand file tree

02-models.md

Latest commit

History

02-models.md

File metadata and controls

Chapter 2: Models, Pulling, and Modelfiles

The Model Lifecycle

Browsing and Pulling Models

Common tags at a glance

Removing and copying models

Popular Models Comparison

Model Parameters (Runtime)

Example: tuning via the API

Example: tuning via the CLI

Modelfiles (Build Custom Models)

Modelfile syntax reference

Build and run workflow

Worked Example 1: Code Assistant

Worked Example 2: JSON Extractor

Worked Example 3: Creative Writer

Adding Templates and Formats

Using LoRA / Adapters

Understanding Quantization

What is quantization?

The GGUF format

Common quantization levels

How to read a model tag

Picking the right quantization

Model Families

Llama (Meta)

Mistral / Mixtral (Mistral AI)

Phi (Microsoft)

Qwen (Alibaba Cloud)

Gemma (Google)

Quick decision guide

Importing Custom Models

Step 1: Download a GGUF from HuggingFace

Step 2: Create a Modelfile pointing to the GGUF

Step 3: Build and test

Step 4: Verify the import

Tips for importing

Model Storage Management

Where models live

Checking disk usage

Cleaning up

Sharing blobs between models

Practical disk-saving tips

Best Practices

Summary

Depth Expansion Playbook

Strategic Context

Architecture Decomposition

Operator Decision Matrix

Failure Modes and Countermeasures

Implementation Runbook

Quality Gate Checklist

Source Alignment

Cross-Tutorial Connection Map

Advanced Practice Exercises

Review Questions

What Problem Does This Solve?

How it Works Under the Hood

Source Walkthrough

Chapter Connections