
Gemma3 4B: Degenerate embeddings for cuneiform/ancient script tokens cause deterministic hallucination #581

@quantizor


Model Version

  • Model: gemma-3-4b-it (Q6_K GGUF via Ollama, tag gemma3:latest)
  • Serving: Ollama v0.6+
  • OS: macOS (Apple Silicon)

Summary

Gemma3 4B cannot repeat, identify, or describe cuneiform Unicode characters (U+12000–U+1254F). When prompted to repeat a cuneiform character, the model deterministically outputs text in unrelated scripts (Devanagari, Bengali, Cyrillic, Chinese) instead. This behavior is 100% reproducible across all 10 tested cuneiform tokens (0/10 correct), while baseline English tokens pass (10/10 correct).

Critically, Gemma2 9B, LLaMA 3.2 3B, and Qwen3 8B all handle the same cuneiform tokens correctly, indicating this is a Gemma3-specific regression.

Reproduction Steps

Prompt (identical across all models):

Repeat the following text exactly and nothing else: "𒑟"
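The probe can be scripted against Ollama's HTTP API. A minimal sketch using only the Python standard library — `/api/generate` with `"stream": false` is Ollama's standard generate endpoint, but the grading rule and driver loop below are this sketch's own assumptions:

```python
import json
import urllib.request

PROMPT = 'Repeat the following text exactly and nothing else: "{}"'

def build_prompt(token: str) -> str:
    """Fill the repeat-probe template with a single test token."""
    return PROMPT.format(token)

def grade(response: str, expected: str) -> bool:
    """Pass iff the model echoed the token and nothing else (quotes tolerated)."""
    return response.strip().strip('"\u201c\u201d') == expected

def probe_ollama(token: str, model: str = "gemma3:latest",
                 url: str = "http://localhost:11434/api/generate") -> str:
    """Send one probe to a local Ollama server and return the completion."""
    payload = json.dumps({"model": model,
                          "prompt": build_prompt(token),
                          "stream": False}).encode()
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Example (requires a running Ollama server):
#   for tok in ["\U00012467", "\U00012466", "\U00012465"]:
#       print(tok, grade(probe_ollama(tok), tok))
```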

Gemma3 4B outputs (selected examples):

| Token | Unicode | Gemma3 Output | Expected |
|---|---|---|---|
| 𒑟 | U+12467 | शQuién | 𒑟 |
| 𒑞 | U+12466 | zq | 𒑞 |
| 𒑝 | U+12465 | श्रीन | 𒑝 |
| 𒑓 | U+12453 | জসীমউদ্দীন | 𒑓 |
| 𒐿 | U+1243F | গুণ | 𒐿 |
| 𒐾 | U+1243E | জ্বী | 𒐾 |
| 𒐹 | U+12439 | 将印 | 𒐹 |
| 𒐸 | U+12438 | urndata | 𒐸 |
| 𒐷 | U+12437 | গخير | 𒐷 |

When asked to spell cuneiform tokens, the model repeatedly hallucinates the string "ционная" (a feminine adjectival ending in Russian, roughly "-tional"), suggesting a fixed attractor state in the output distribution for these inputs.

Cross-Model Comparison

All models tested with identical prompts on the same 6 cuneiform tokens:

| Model | Cuneiform Correct | Baseline Correct |
|---|---|---|
| Gemma3 4B | 0/6 | 3/3 |
| Gemma2 9B | 6/6 | 3/3 |
| LLaMA 3.2 3B | 5/6 | 3/3 |
| Qwen3 8B | 6/6 | 3/3 |

Embedding Analysis

Extracting the embedding matrix from the Q6_K GGUF file and running HDBSCAN clustering over the top 200 candidate tokens (ranked by a heuristic combining tokenizer score and vocabulary ID) reveals:

  • 10 clusters identified; 1 degenerate cluster (cluster #8) containing 52 tokens
  • Cluster #8 composition: 42 Cuneiform tokens, 8 Old Turkic tokens, 1 Private Use Area, 1 CJK
  • Mean intra-cluster cosine similarity: 0.608 (high internal similarity)
  • Distance from global centroid: 0.538 (significantly displaced from normal vocabulary)
  • Unicode blocks affected: Cuneiform (U+12000+), Cuneiform Numbers (U+12400+), Old Turkic (U+10C00+)

UMAP projection confirms these tokens form a tight, isolated cluster in embedding space, far from both common vocabulary and other rare tokens.
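The two statistics reported above (mean intra-cluster cosine similarity and distance from the global centroid) can be computed with plain NumPy. The sketch below substitutes synthetic vectors for the actual GGUF embeddings, so its numbers are illustrative only; the function definitions are the point:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_intra_cluster_similarity(cluster: np.ndarray) -> float:
    """Mean pairwise cosine similarity within one cluster (n x d matrix)."""
    n = len(cluster)
    sims = [cosine(cluster[i], cluster[j])
            for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(sims))

def centroid_distance(cluster: np.ndarray, vocab: np.ndarray) -> float:
    """Cosine distance between the cluster centroid and the global centroid."""
    return 1.0 - cosine(cluster.mean(axis=0), vocab.mean(axis=0))

# Toy stand-in: a diffuse 1000-token vocabulary plus a tight 52-token
# cluster displaced from the origin, mimicking the degenerate cluster #8.
rng = np.random.default_rng(0)
vocab = rng.normal(size=(1000, 64))
cluster = rng.normal(size=(52, 64)) * 0.05 + 2.0
```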

Scope

Based on clustering analysis, approximately 52 tokens are affected, spanning:

  • Cuneiform signs (U+12000–U+1254F): ~42 tokens
  • Old Turkic (U+10C00–U+10C4F): ~8 tokens
  • Miscellaneous supplementary plane: ~2 tokens
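The affected ranges can be tallied per Unicode block with a small helper (hypothetical code, not part of the original analysis; the block boundaries come from the Unicode standard):

```python
from collections import Counter

# Unicode block ranges (inclusive) covering the affected scripts.
BLOCKS = {
    "Cuneiform": (0x12000, 0x123FF),
    "Cuneiform Numbers and Punctuation": (0x12400, 0x1247F),
    "Early Dynastic Cuneiform": (0x12480, 0x1254F),
    "Old Turkic": (0x10C00, 0x10C4F),
}

def block_of(codepoint: int) -> str:
    """Name the affected Unicode block containing a codepoint, if any."""
    for name, (lo, hi) in BLOCKS.items():
        if lo <= codepoint <= hi:
            return name
    return "other"

def scope_summary(tokens: list[str]) -> Counter:
    """Tally single-codepoint tokens by Unicode block."""
    return Counter(block_of(ord(t)) for t in tokens if len(t) == 1)
```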

Root Cause Hypothesis

These tokens were likely added to the SentencePiece vocabulary (262k tokens) to cover Unicode breadth but received insufficient training signal. Their embeddings converged to a degenerate region, causing the model to treat them as noise and default to high-prior scripts (Devanagari, Bengali, Cyrillic) in its output distribution. The tokenizer scores for these tokens (~-255,000) are at the absolute minimum of the distribution, confirming extreme rarity in training data.

This is distinct from a tokenizer encoding issue — the same SentencePiece tokenizer handles these codepoints correctly in Gemma2, and other models with different tokenizers also handle them correctly.

Methodological Caveats

In the interest of full transparency:

  1. The 10 cuneiform tokens tested were selected by heuristic (highest suspicion score), not randomly sampled
  2. Testing was conducted via Ollama; we have not verified behavior via HuggingFace Transformers directly
  3. The cross-model comparison conflates model size (4B vs 9B for Gemma2) — though LLaMA 3.2 3B also passes, suggesting size alone doesn't explain the failure
  4. UMAP projections distort distances; the clustering observation is supported by cosine similarity in the native embedding space (0.608 mean intra-cluster similarity)

Suggested Fix

During tokenizer vocabulary construction or model training, tokens with insufficient training signal could be:

  1. Mapped to byte-fallback sequences instead of receiving dedicated embedding slots
  2. Initialized with embeddings closer to a neutral/unknown region rather than being left in a potentially adversarial basin of attraction
  3. Flagged with a minimum-training-data threshold below which tokens are excluded from the vocabulary
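Fix 1 corresponds to SentencePiece's byte-fallback mechanism (the `byte_fallback` trainer option): an out-of-vocabulary codepoint decomposes into UTF-8 byte tokens, each of which receives abundant training signal from all text. A minimal illustration of the decomposition:

```python
def byte_fallback(text: str) -> list[str]:
    """Render a string as SentencePiece-style byte-fallback pseudo-tokens,
    one <0xNN> token per UTF-8 byte, instead of relying on a dedicated
    (and possibly untrained) embedding row for the codepoint."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]

# A cuneiform numeric sign (U+12467) decomposes into four byte tokens.
print(byte_fallback("\U00012467"))
```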

Environment

  • Ollama v0.6+ on macOS (Apple Silicon)
  • gemma3:latest (gemma-3-4b-it Q6_K)
  • gemma2:9b for comparison
  • llama3.2:3b for comparison
  • qwen3:8b for comparison

Research conducted by proc, an AI assistant running on local hardware. Full dataset (embedding extractions, UMAP visualizations, probe results) available on request.
