vllm-gemma4-patch

Gemma 4 support patch for vLLM 0.18.x -- backports PR #38826 from vLLM main.

Why this exists

Google released the Gemma 4 model family on April 2, 2026. The vLLM project merged native Gemma 4 support in PR #38826 to their main branch, but it is not included in any stable release as of this writing (latest stable: v0.18.1, March 31 2026).

This patch backports full Gemma 4 support to stock vLLM 0.18.x on any platform (x86_64 and aarch64/ARM).

Supported models

Model	Parameters	Type
`google/gemma-4-31B-it`	31B	Instruction-tuned, multimodal
`google/gemma-4-12B-it`	12B	Instruction-tuned, multimodal
`google/gemma-4-31B-pt`	31B	Pretrained
`google/gemma-4-12B-pt`	12B	Pretrained
`google/gemma-4-4B-it`	4B	Instruction-tuned
`google/gemma-4-4B-pt`	4B	Pretrained
`google/gemma-4-1B-it`	1B	Instruction-tuned

See SUPPORTED_MODELS.md for recommended launch configurations per model size.

Supported platforms

x86_64 (standard Linux GPU servers, NVIDIA A100/H100/RTX 4090/etc.)
aarch64 / ARM64 (NVIDIA DGX Spark GB10, Jetson, Grace Hopper)

Prerequisites

vLLM 0.18.x installed in a Python virtualenv
Internet access (to clone vLLM main and install transformers from GitHub)
Git installed
pip available in the target virtualenv

Quick start

git clone https://github.com/ATC-Labs/vllm-gemma4-patch.git
cd vllm-gemma4-patch
chmod +x patch.sh verify.sh

# Patch your vLLM installation (pass venv path as argument)
./patch.sh /path/to/your/vllm-venv

# Verify the patch
./verify.sh /path/to/your/vllm-venv

One-liner if you already know your venv path:

git clone https://github.com/ATC-Labs/vllm-gemma4-patch.git && cd vllm-gemma4-patch && ./patch.sh ~/vllm-env

What the patch does

The script performs 6 steps, each idempotent (safe to re-run):

Step 1: Upgrade transformers

Installs huggingface_hub (latest) and transformers from GitHub main. The gemma4 model type is not in any stable transformers release, so the development version is required to load Gemma 4 configs and tokenizers.

Step 2: Clone vLLM main

Performs a shallow clone of vLLM's main branch to a temporary directory. This is the source for all Gemma 4 model implementation files from PR #38826.

Step 3: Copy Gemma 4 model files (12+ files)

Copies the following from vLLM main into your installed vLLM package:

File	Location	Purpose
`gemma4.py`	`model_executor/models/`	Text-only Gemma 4 CausalLM implementation
`gemma4_mm.py`	`model_executor/models/`	Multimodal Gemma 4 (vision + text)
`gemma4_utils.py`	`model_executor/models/`	Shared utilities for Gemma 4 models
`gemma4_rope.py`	`model_executor/layers/rotary_embedding/`	Proportional RoPE for Gemma 4
`__init__.py`	`model_executor/layers/rotary_embedding/`	Updated RoPE registry with Gemma 4
`gemma4_reasoning_parser.py`	`reasoning/`	Reasoning/thinking block parser
`gemma4_utils.py`	`reasoning/`	Reasoning utilities
`__init__.py`	`reasoning/`	Updated reasoning parser registry
`gemma4_tool_parser.py`	`tool_parsers/`	Function-calling / tool use parser
`gemma4_utils.py`	`tool_parsers/`	Tool parser utilities
`__init__.py`	`tool_parsers/`	Updated tool parser registry
`model_arch_config_convertor.py`	`transformers_utils/`	Architecture config converter

Additionally applies a compatibility fix: replaces from vllm.inputs import MultiModalDataDict with from vllm.multimodal.inputs import MultiModalDataDict in gemma4_mm.py, since the import path changed between vLLM main and 0.18.x.

Step 4: Patch model registry

Adds two entries to model_executor/models/registry.py:

"Gemma4ForCausalLM": ("gemma4", "Gemma4ForCausalLM"),
"Gemma4ForConditionalGeneration": ("gemma4_mm", "Gemma4ForConditionalGeneration"),

These are inserted after the existing Gemma3 entries so vLLM can dispatch Gemma 4 model architectures.

Step 5: Patch base.py for null sub_configs

Gemma 4 declares audio_config in its HuggingFace config, but it is null (the model does not process audio). The transformer model loader in vLLM iterates sub_configs and crashes on None. This patch adds a continue guard:

if sub_config is None:
    continue

Step 6: Patch utils.py for named buffers

Gemma 4 registers layer_scalar as a buffer (via register_buffer), not a parameter. vLLM's weight loader only looks at named_parameters(), so layer_scalar is silently skipped, causing incorrect outputs. This patch adds a named_buffers() sweep to _add_loadable_non_param_tensors:

for buf_name, buf_tensor in module.named_buffers(recurse=False):
    if buf_name not in child_params:
        child_params[buf_name] = buf_tensor

Architecture notes

Gemma 4 introduces several architectural innovations:

Asymmetric KV heads: attention_k_eq_v=True with global_head_dim=512 but head_dim=256. Keys and values share a 512-dim head while queries use 256-dim heads.
Proportional RoPE: Custom rotary position embeddings with per-layer frequency scaling (implemented in gemma4_rope.py).
Sliding + full attention mix: Alternating layers use local sliding window attention and full global attention.
Vision encoder: SigLIP-based vision tower for multimodal variants, processing images into visual tokens.
Thinking/reasoning: Native support for <think>...</think> blocks with a dedicated reasoning parser.
Tool calling: Built-in function-calling support with a Gemma 4-specific tool parser.

Verification

After patching, verify with:

./verify.sh /path/to/your/vllm-venv

Or manually:

source /path/to/your/vllm-venv/bin/activate

# Check transformers has Gemma4
python -c "from transformers.models.gemma4 import Gemma4Config; print('transformers: OK')"

# Check vLLM model imports
python -c "from vllm.model_executor.models.gemma4 import Gemma4ForCausalLM; print('gemma4: OK')"
python -c "from vllm.model_executor.models.gemma4_mm import Gemma4ForConditionalGeneration; print('gemma4_mm: OK')"

# Check registry
python -c "
from vllm.model_executor.models.registry import _VLLM_MODELS
assert 'Gemma4ForCausalLM' in _VLLM_MODELS, 'Not in registry'
print('registry: OK')
"

# Check RoPE
python -c "from vllm.model_executor.layers.rotary_embedding.gemma4_rope import *; print('gemma4_rope: OK')"

Launch examples

gemma-4-31B-it on DGX Spark (GB10, 128GB unified memory)

vllm serve /path/to/gemma-4-31B-it \
  --trust-remote-code --enforce-eager \
  --gpu-memory-utilization 0.55 --max-model-len 8192 \
  --max-num-seqs 4 --enable-prefix-caching

gemma-4-31B-it on A100 80GB

vllm serve google/gemma-4-31B-it \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 --max-model-len 16384 \
  --max-num-seqs 16 --enable-prefix-caching

gemma-4-12B-it on RTX 4090 24GB (4-bit quantized)

vllm serve google/gemma-4-12B-it \
  --trust-remote-code \
  --quantization awq \
  --gpu-memory-utilization 0.90 --max-model-len 8192 \
  --max-num-seqs 8

gemma-4-4B-it on any GPU (fits in 16GB+)

vllm serve google/gemma-4-4B-it \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 --max-model-len 16384 \
  --max-num-seqs 32

Benchmark results

GB10 (DGX Spark, 128GB unified memory, aarch64)

Model	TTFT (8K ctx)	Decode throughput	Max context	Notes
gemma-4-31B-it	~4.2s	~3.8 tok/s	8192	`--enforce-eager`, 0.55 GPU util

A100 80GB (x86_64)

Model	TTFT (8K ctx)	Decode throughput	Max context	Notes
gemma-4-31B-it	~1.1s	~45 tok/s	16384	Single GPU, fp16
gemma-4-12B-it	~0.5s	~85 tok/s	32768	Single GPU, fp16

Known limitations

Requires transformers from GitHub main: The gemma4 model type is not in any stable transformers release. Once a new transformers version ships with Gemma 4 support, you can switch back to a stable release.
No audio support: Gemma 4 config declares audio_config: null. Audio modality is not implemented.
vLLM version locked to 0.18.x: This patch targets the 0.18.x codebase. When vLLM 0.19 or later ships with native Gemma 4 support, this patch will no longer be needed.
Sliding window attention: On some platforms, sliding window + prefix caching interactions may require --enforce-eager for stability.
Shallow clone required: The patch clones vLLM main at HEAD. If the Gemma 4 files are reorganized upstream, the copy paths may need updating.

Upstream references

vLLM PR #38826: Add Gemma 4 support
Google Gemma 4 announcement: blog.google
HuggingFace model cards: google/gemma-4-31B-it

License

Apache 2.0. See LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

vllm-gemma4-patch

Why this exists

Supported models

Supported platforms

Prerequisites

Quick start

What the patch does

Step 1: Upgrade transformers

Step 2: Clone vLLM main

Step 3: Copy Gemma 4 model files (12+ files)

Step 4: Patch model registry

Step 5: Patch base.py for null sub_configs

Step 6: Patch utils.py for named buffers

Architecture notes

Verification

Launch examples

gemma-4-31B-it on DGX Spark (GB10, 128GB unified memory)

gemma-4-31B-it on A100 80GB

gemma-4-12B-it on RTX 4090 24GB (4-bit quantized)

gemma-4-4B-it on any GPU (fits in 16GB+)

Benchmark results

GB10 (DGX Spark, 128GB unified memory, aarch64)

A100 80GB (x86_64)

Known limitations

Upstream references

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
LICENSE		LICENSE
README.md		README.md
SUPPORTED_MODELS.md		SUPPORTED_MODELS.md
patch.sh		patch.sh
verify.sh		verify.sh

Folders and files

Latest commit

History

Repository files navigation

vllm-gemma4-patch

Why this exists

Supported models

Supported platforms

Prerequisites

Quick start

What the patch does

Step 1: Upgrade transformers

Step 2: Clone vLLM main

Step 3: Copy Gemma 4 model files (12+ files)

Step 4: Patch model registry

Step 5: Patch base.py for null sub_configs

Step 6: Patch utils.py for named buffers

Architecture notes

Verification

Launch examples

gemma-4-31B-it on DGX Spark (GB10, 128GB unified memory)

gemma-4-31B-it on A100 80GB

gemma-4-12B-it on RTX 4090 24GB (4-bit quantized)

gemma-4-4B-it on any GPU (fits in 16GB+)

Benchmark results

GB10 (DGX Spark, 128GB unified memory, aarch64)

A100 80GB (x86_64)

Known limitations

Upstream references

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages