Gemma 4 FiftyOne Zoo Model

A FiftyOne remote Model Zoo integration for Google Gemma 4, a multimodal vision-language model family supporting image and video understanding. Learn more about the FiftyOne Model Zoo in the Voxel51 docs.

Structured operations (detect, point, classify) use Gemma 4's native function calling for reliable structured output. Text operations (vqa, caption, ocr) use plain generation. All outputs pass through parse_response, which cleanly separates the model's thinking from its content.

Installation

Using pip:

pip install fiftyone "transformers>=4.52.0" torch torchvision accelerate huggingface-hub

Or using uv:

uv add fiftyone "transformers>=4.52.0" torch torchvision accelerate huggingface-hub

For video processing, you also need torchcodec and ffmpeg:

# pip
pip install torchcodec

# or uv
uv add torchcodec

ffmpeg must be installed separately as a system package:

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg

# Windows (via chocolatey)
choco install ffmpeg

Quick Start

import fiftyone as fo
import fiftyone.zoo as foz

# Register the model source
foz.register_zoo_model_source(
    "https://github.com/Burhan-Q/gemma4",
    overwrite=True,
)

# Download the model weights
foz.download_zoo_model(
    "https://github.com/Burhan-Q/gemma4",
    model_name="google/gemma-4-E4B-it",
)

# Load a dataset and run inference
dataset = foz.load_zoo_dataset("quickstart")

model = foz.load_zoo_model(
    "google/gemma-4-E4B-it",
    media_type="image",
    operation="vqa",
)

model.prompt = "Describe what is happening in this image."
dataset.apply_model(model, label_field="description")

session = fo.launch_app(dataset)

Supported Models

Model	Effective Params	Context	Modalities	VRAM (est.)
`google/gemma-4-E2B-it`	2.3B (5.1B total)	128K	Text / Image / Video / Audio	~10 GB
`google/gemma-4-E4B-it`	4.5B (8B total)	128K	Text / Image / Video / Audio	~16 GB
`google/gemma-4-26B-A4B-it`	3.8B active (25.2B MoE)	256K	Text / Image	~50 GB
`google/gemma-4-31B-it`	30.7B dense	256K	Text / Image	~62 GB

Notes:

Only the E2B and E4B models support video and audio input. The 26B-A4B and 31B models are image-only.
The 26B-A4B model (MoE architecture) requires CUDA. It does not currently run on Apple Silicon MPS due to a missing torch.histc implementation for the MoE expert routing layer.
All instruction-tuned models use the -it suffix. Base (pre-trained) models exist but are not supported by this integration since they lack the chat template and tool calling capabilities.
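
The table and notes above can be condensed into a small selection helper. This is a minimal sketch, not part of the integration: `ModelSpec` and `pick_model` are hypothetical names, and the VRAM figures are the estimates from the table, so treat the result as a starting point rather than a guarantee that a model will fit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelSpec:
    name: str
    vram_gb: float        # estimated VRAM, copied from the table above
    supports_video: bool  # True only for the E2B and E4B models


# Hypothetical lookup table built from the Supported Models section.
MODELS = [
    ModelSpec("google/gemma-4-E2B-it", 10, True),
    ModelSpec("google/gemma-4-E4B-it", 16, True),
    ModelSpec("google/gemma-4-26B-A4B-it", 50, False),
    ModelSpec("google/gemma-4-31B-it", 62, False),
]


def pick_model(available_vram_gb: float, needs_video: bool = False) -> Optional[str]:
    """Return the largest listed model that fits, or None if none fits."""
    candidates = [
        m
        for m in MODELS
        if m.vram_gb <= available_vram_gb
        and (m.supports_video or not needs_video)
    ]
    if not candidates:
        return None
    # "Largest" here means the highest estimated VRAM requirement.
    return max(candidates, key=lambda m: m.vram_gb).name


print(pick_model(24, needs_video=True))
```

The helper does not account for the CUDA requirement of the 26B-A4B model, so check the notes above before using that variant on Apple Silicon.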

⚠️ Important

Use short prompts for better results, especially for smaller models.

Image Operations

Load the model with media_type="image" (the default). Switch operations by setting model.operation.

dataset = foz.load_zoo_dataset("quickstart")
dataset.compute_metadata()

model = foz.load_zoo_model(
    "google/gemma-4-E4B-it",
    media_type="image",
)

Visual Question Answering (VQA)

Prompt the model with questions about each image.

model = foz.load_zoo_model("google/gemma-4-E4B-it", operation="vqa")
model.prompt = "What objects are visible in this image?"

dataset.apply_model(model, label_field="q_vqa")

print(dataset.first().q_vqa)  # fo.Classification with label=text

Output: fo.Classification

Image Captioning

The model generates a detailed caption of the image scene.

model = foz.load_zoo_model("google/gemma-4-E4B-it", operation="caption")
model.prompt = "Describe this image in one sentence."

dataset.apply_model(model, label_field="caption")

Output: fo.Classification

Image Object Detection

The model labels objects and localizes them with bounding boxes.

Uses function calling with the report_detections tool. The model outputs bounding boxes in its native box_2d format, [y1, x1, y2, x2] with coordinates scaled 0-1000. These are automatically converted to FiftyOne's [x, y, w, h] format with values in the [0, 1] range.

model = foz.load_zoo_model("google/gemma-4-E4B-it", operation="detect")
model.prompt = "Detect all objects."

dataset.apply_model(model, label_field="dets")

# Inspect results
for det in dataset.first().dets.detections:
    print(det.label, det.bounding_box)  # [x, y, w, h] normalized

Output: fo.Detections
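
The integration performs the box conversion for you, but the arithmetic is easy to reproduce when debugging raw model output. The following is a minimal sketch under the format described above; `box_2d_to_fiftyone` is a hypothetical helper, not the integration's own function, and the corner sorting and clamping are defensive assumptions added for illustration.

```python
from typing import List, Sequence


def box_2d_to_fiftyone(box_2d: Sequence[float], scale: float = 1000.0) -> List[float]:
    """Convert [y1, x1, y2, x2] on a 0-`scale` grid to relative [x, y, w, h]."""
    y1, x1, y2, x2 = box_2d

    # Assumption: clamp stray values into the valid range.
    def clamp(v: float) -> float:
        return min(max(v, 0.0), scale)

    # Assumption: tolerate corners given in the wrong order.
    x_min, x_max = sorted((clamp(x1), clamp(x2)))
    y_min, y_max = sorted((clamp(y1), clamp(y2)))

    return [
        x_min / scale,            # top-left x
        y_min / scale,            # top-left y
        (x_max - x_min) / scale,  # width
        (y_max - y_min) / scale,  # height
    ]


print(box_2d_to_fiftyone([100, 200, 500, 600]))
```

Note the axis swap: the native format lists y before x, while FiftyOne lists x before y.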

Image Keypoint Detection

The model labels objects and places a single keypoint at each object's center.

Uses function calling with the report_points tool. Coordinates use the same native ordering as detections, [y, x], and are automatically converted to [x, y] for FiftyOne.

model = foz.load_zoo_model("google/gemma-4-E4B-it", operation="point")
model.prompt = "Point to the center of each animal in this image."

dataset.apply_model(model, label_field="pts")

for kp in dataset.first().pts.keypoints:
    print(kp.label, kp.points)  # [[x, y]] normalized

Output: fo.Keypoints
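
For reference, the point conversion amounts to an axis swap plus rescaling. This sketch assumes points use the same 0-1000 scale as detection boxes, which the text above implies but does not state outright; `point_to_fiftyone` is a hypothetical name.

```python
from typing import List, Sequence


def point_to_fiftyone(point: Sequence[float], scale: float = 1000.0) -> List[float]:
    """Convert a native [y, x] point to a relative [x, y] point."""
    y, x = point
    # Assumption: native coordinates lie on a 0-`scale` grid.
    return [x / scale, y / scale]


print(point_to_fiftyone([250, 750]))
```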

Image Classification

The model generates labels for the objects in an image.

Uses function calling with the report_classifications tool.

model = foz.load_zoo_model("google/gemma-4-E4B-it", operation="classify")
model.prompt = "Classify this image."

dataset.apply_model(model, label_field="cls")

for c in dataset.first().cls.classifications:
    print(c.label)

Output: fo.Classifications

Image Optical Character Recognition (OCR)

The model extracts text from images.

OCR works best on document images. Use max_soft_tokens=560 or higher for fine text.

from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub("Voxel51/visual_ai_at_neurips2025", max_samples=2)

model = foz.load_zoo_model(
    "google/gemma-4-E4B-it",
    operation="ocr",
    max_soft_tokens=560,  # higher resolution for text extraction
)
model.prompt = "Extract all visible text from this image."

dataset.apply_model(model, label_field="text")

print(dataset.first().text.label)

Output: fo.Classification

Per-Sample Prompts

Use a dataset field as the prompt source for each sample:

# First, generate descriptions
model.operation = "vqa"
model.prompt = "List all objects in this image."
dataset.apply_model(model, label_field="objects")

# Then use those descriptions to ground detection
model.operation = "detect"
dataset.apply_model(
    model,
    label_field="grounded_dets",
    prompt_field="objects",
)

Overriding the System Prompt

Each operation has a default system prompt. Override it for custom behavior:

model = foz.load_zoo_model("google/gemma-4-E4B-it", operation="detect")
model.system_prompt = "You are a quality inspector. Detect all defects..."
model.prompt = "Find any scratches, dents, or discoloration."

dataset.apply_model(model, label_field="defects")

Video Operations

Load the model with media_type="video". Only E2B and E4B models support video. Operations that produce temporal or frame-level labels require dataset.compute_metadata().

Prerequisite: Video processing requires ffprobe (part of ffmpeg).

video_dataset = foz.load_zoo_dataset("quickstart-video")
video_dataset.compute_metadata()

model = foz.load_zoo_model(
    "google/gemma-4-E4B-it",
    media_type="video",
)

Video Description

Plain-text summary. Does not require metadata.

model.operation = "description"
video_dataset.apply_model(model, label_field="desc")
# result: sample.desc_summary (str)

Video Temporal Localization

Detects activity events with start/end timestamps.

model.operation = "temporal_localization"
video_dataset.apply_model(model, label_field="events")
# result: sample.events_events (fo.TemporalDetections)
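
Temporal labels in FiftyOne are stored as 1-based frame ranges, so timestamps in seconds need the video's frame rate and frame count, which is why metadata must be computed first. The sketch below illustrates that mapping under simple assumptions (constant frame rate, floor rounding); `timestamps_to_frame_support` is a hypothetical helper and the integration's actual conversion may differ.

```python
import math
from typing import List


def timestamps_to_frame_support(
    start_s: float, end_s: float, fps: float, total_frames: int
) -> List[int]:
    """Map a [start, end] interval in seconds to a 1-based [first, last] frame range."""

    def to_frame(t: float) -> int:
        frame = math.floor(t * fps) + 1          # frame numbers start at 1
        return max(1, min(frame, total_frames))  # stay inside the video

    first, last = to_frame(start_s), to_frame(end_s)
    return [first, max(first, last)]


# fps and total_frames would come from sample.metadata after compute_metadata().
print(timestamps_to_frame_support(1.0, 2.5, fps=30.0, total_frames=300))
```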

Video Object Tracking

Tracks objects across frames with per-frame bounding boxes.

model.operation = "tracking"
video_dataset.apply_model(model, label_field="tracking")
# result: sample.frames[N].tracking_objects (fo.Detections)

Video OCR

Extracts text with bounding boxes per frame.

model.operation = "ocr"
video_dataset.apply_model(model, label_field="vocr")
# result: sample.frames[N].vocr_text_content (fo.Detections)

Video Comprehensive Analysis

All analyses in a single pass: summary, events, objects, scene info, activities.

model.operation = "comprehensive"
video_dataset.apply_model(model, label_field="analysis")

Video Custom Prompts

Full control over the prompt for domain-specific analysis.

model = foz.load_zoo_model(
    "google/gemma-4-E4B-it",
    media_type="video",
    operation="custom",
    custom_prompt="Count the number of people entering and leaving the frame.",
)

video_dataset.apply_model(model, label_field="count")
# result: sample.count_result (str)

Configuration Parameters

All parameters can be set at load time or modified after loading via properties.

model = foz.load_zoo_model(
    "google/gemma-4-E4B-it",
    operation="detect",
    max_new_tokens=4096,
    temperature=0.7,
    max_soft_tokens=560,
)

# Or modify after loading
model.max_new_tokens = 2048
model.temperature = 0.5

Generation Parameters

Parameter	Default	Description
`max_new_tokens`	2048	Maximum tokens to generate. Must be high enough for model thinking (if enabled) plus the response.
`temperature`	1.0	Sampling temperature
`top_p`	0.95	Nucleus sampling threshold
`top_k`	64	Top-k sampling parameter
`do_sample`	True	Sampling (True) vs greedy decoding (False)
`repetition_penalty`	1.0	Penalize repeated tokens
`enable_thinking`	False	Enable step-by-step reasoning mode. See Thinking Mode for caveats.
`cache_implementation`	None	KV cache strategy for `generate()`. `"static"` pre-allocates cache (used in official Gemma 4 examples). May not work with all model variants (e.g., 26B MoE).
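
The sampling parameters only matter when do_sample is True; with greedy decoding, temperature, top_p, and top_k have no effect. The sketch below shows one way to assemble generation keyword arguments from the defaults in the table. It assumes the parameters are forwarded to a Hugging Face `generate()` call; `build_generation_kwargs` is a hypothetical helper and not the integration's code.

```python
from typing import Any, Dict

# Defaults copied from the Generation Parameters table.
DEFAULTS: Dict[str, Any] = {
    "max_new_tokens": 2048,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
    "do_sample": True,
    "repetition_penalty": 1.0,
}

SAMPLING_ONLY = ("temperature", "top_p", "top_k")


def build_generation_kwargs(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the defaults and drop sampling knobs for greedy decoding."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown generation parameters: {sorted(unknown)}")

    kwargs = {**DEFAULTS, **overrides}
    if not kwargs["do_sample"]:
        # Greedy decoding ignores these, so leave them out entirely.
        for key in SAMPLING_ONLY:
            kwargs.pop(key)
    return kwargs


print(build_generation_kwargs(temperature=0.5))
print(build_generation_kwargs(do_sample=False, max_new_tokens=512))
```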

Vision Parameters

Parameter	Default	Description
`max_soft_tokens`	varies	Vision token budget per image. Must be one of: 70, 140, 280, 560, 1120. Default is operation-dependent (see below).

Operation-specific defaults for max_soft_tokens:

Operation	Default	Rationale
detect, point, classify	280	Balanced detail for object localization
vqa, caption	280	General-purpose
ocr	560	Higher resolution needed for text extraction

Override for your use case:

# Maximum detail for document OCR
model = foz.load_zoo_model(..., operation="ocr", max_soft_tokens=1120)

# Fast detection with lower resolution
model = foz.load_zoo_model(..., operation="detect", max_soft_tokens=140)
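
Because max_soft_tokens accepts only five values, it can help to validate an override before loading a model. This is a minimal sketch using the allowed values and per-operation defaults listed above; `resolve_max_soft_tokens` is a hypothetical helper, and the integration may validate or report errors differently.

```python
from typing import Optional

ALLOWED_SOFT_TOKENS = (70, 140, 280, 560, 1120)

# Per-operation defaults from the table above.
OPERATION_DEFAULTS = {
    "detect": 280,
    "point": 280,
    "classify": 280,
    "vqa": 280,
    "caption": 280,
    "ocr": 560,
}


def resolve_max_soft_tokens(operation: str, override: Optional[int] = None) -> int:
    """Return the vision token budget for an operation, validating any override."""
    if operation not in OPERATION_DEFAULTS:
        raise ValueError(f"Unknown image operation: {operation!r}")

    value = OPERATION_DEFAULTS[operation] if override is None else override
    if value not in ALLOWED_SOFT_TOKENS:
        raise ValueError(
            f"max_soft_tokens must be one of {ALLOWED_SOFT_TOKENS}, got {value}"
        )
    return value


print(resolve_max_soft_tokens("ocr"))
print(resolve_max_soft_tokens("detect", override=140))
```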

Thinking Mode

Gemma 4 supports a reasoning mode where the model shows step-by-step thinking before its answer.

model = foz.load_zoo_model(
    "google/gemma-4-E4B-it",
    operation="detect",
    enable_thinking=True,
)

model.prompt = "Detect all road signs in this image."
dataset.apply_model(model, label_field="signs")

# Reasoning is stored as a dynamic attribute on each label
det = dataset.first().signs.detections[0]
print(det.label)
print(det["reasoning"])  # model's thinking chain, if present

Important: Thinking mode significantly increases token usage and inference time. For structured operations (detect, point, classify), thinking can cause the model to exhaust its generation budget before it produces the tool call. Keep enable_thinking=False (the default) for structured operations; if you do enable it, increase max_new_tokens.

Additional Information

See ADDITIONAL_INFO.md for setup verification, architecture details, logging configuration, and technical details.

Citation

@misc{gemma4,
  title  = {Gemma 4 Technical Report},
  author = {Google DeepMind},
  year   = {2026},
  url    = {https://ai.google.dev/gemma}
}
