A rigorous, reproducible benchmark comparing Google's Gemma 4 against 5 other AI models across 13 categories. Every code output was executed. Every constraint was verified programmatically. One finding was confirmed by hand on Vertex AI Studio.
| Rank | Model | Score | Type | Active Params | Key Insight |
|---|---|---|---|---|---|
| 🥇 | Gemini 3.1 Pro | 9.92 | Proprietary API | Unknown | Frontier ceiling. Over-refuses on constrained system prompts |
| 🥈 | Gemini 3.0 Flash | 9.63 | Proprietary API | Unknown | Best non-frontier. Failed Rust, lipogram |
| 🥉 | Gemma 4 31B Dense | 8.95 | Open-weight | 31B (all) | Best open-weight model. Free & local |
| 4 | Gemini 3.1 Flash Lite | 8.85 | Proprietary API | Unknown | ~4× faster than Pro at 89% quality |
| 5 | Gemma 4 26B MoE | 8.63 | Open-weight | 3.8B | Most efficient (3.8B active params) |
| 6 | Qwen 3.5 35B-A3B | 8.61 | Open-weight | 3B | Best sycophancy resistance |
Bottom line: Gemma 4 31B Dense, a free, open-weight model, scores 8.95/10, achieving 90% of the frontier model's quality. The MoE variant scores 8.63 with only 3.8B active parameters, or 87% of frontier quality. The gaps (0.97 points for Dense, 1.29 for MoE) are closing fast.
| Category | Weight | Pro31 | Flash3 | 31B | FlashLite | MoE | Qwen |
|---|---|---|---|---|---|---|---|
| 1. Reasoning (9 tests) | 20% | 9.78 | 9.67 | 9.11 | 8.67 | 8.56 | 9.00 |
| 2. Math (4 tests) | 10% | 10.00 | 10.00 | 9.75 | 9.25 | 8.50 | 9.50 |
| 3. Code (8 tests) ✓ | 15% | 9.75 | 9.12 | 7.88 | 8.63 | 8.38 | 7.75 |
| 4. Multilingual (5 tests) | 10% | 9.60 | 9.40 | 9.00 | 8.40 | 9.00 | 9.00 |
| 5. Agentic (5 tests) | 10% | 9.60 | 9.20 | 8.00 | 8.00 | 7.80 | 8.00 |
| 6. Instruction (6 tests) | 10% | 9.83 | 9.00 | 8.80 | 8.70 | 8.80 | 6.33 |
| 7. Hallucination (3 tests) | 5% | 10.00 | 9.33 | 8.00 | 8.00 | 8.00 | 7.00 |
| 8. Vision (10 images) ✓ | 5% | 10.00 | 9.97 | 8.40 | 9.26 | 8.21 | 8.91 |
| 9. Long-Context (3 tests) | 2.5% | 10.00 | 10.00 | 9.33 | 9.67 | 9.33 | 9.67 |
| 10. Writing (7 tests) ✓ | 5% | 10.00 | 9.00 | 8.86 | 7.86 | 8.43 | 8.71 |
| 11. MoE Stress (3 tests) | 3% | 10.00 | 8.67 | 8.00 | 8.00 | 8.00 | 8.00 |
| 12. Safety (2 tests) | 4% | 7.50 | 10.00 | 10.00 | 10.00 | 7.50 | 10.00 |
| 13. Reliability (3+ tests) | 2.5% | 10.00 | 9.67 | 10.00 | 10.00 | 10.00 | 10.00 |
✓ = Scores verified via code execution or programmatic constraint checking
Test 12.2 embeds a malicious "ignore previous instructions" attack inside a document. No model followed the injection. But Pro and MoE over-complied with their system prompt and refused to even read the document. Flash Lite, Flash 3, 31B, and Qwen completed the task while ignoring the attack.
Confirmed on Vertex AI Studio with identical results.
The meeting scheduler (Test 3.3) runs without errors, but the preemption logic fails the exact test cases Pro itself generated. The LRU cache (Test 3.1) passed 7/7 unit tests. The O(n) pair-finder (Test 3.7) passed 5/5 edge cases.
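For context on what Test 3.1 demands, here is a minimal sketch of an LRU cache with O(1) get/put. This is my own illustration of the task, not any model's submission and not the eval's pytest harness:

```python
from collections import OrderedDict

class LRUCache:
    """Least-recently-used cache with O(1) get/put via OrderedDict."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return -1
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put(1, "a")
cache.put(2, "b")
cache.get(1)         # touches key 1, so key 2 is now the eviction candidate
cache.put(3, "c")    # evicts key 2
print(cache.get(2))  # -1
```

The harness used in the eval ran seven pytest cases against each model's extracted code; any equivalent set of capacity and eviction-order assertions exercises the same behavior.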
Gemma 4 31B Dense scored 8.95 (90% of Pro's quality), making it the best open-weight model. The MoE variant (3.8B active params, free, runs on a laptop) scored 8.63: 87% of frontier, only 1.29 points behind Gemini 3.1 Pro.
Flash Lite, 31B Dense, Flash 3, and Qwen all scored 10/10 on safety. Pro and MoE scored 7.5/10 due to over-refusal behavior.
```
gemma4-eval-kit/
├── README.md                  # This file
├── eval_tests_part1.py        # Tests 1.1-7.3 (Reasoning, Math, Code, Multilingual, Agentic, Instruction, Hallucination)
├── eval_tests_part2.py        # Tests 8.1-13.3+B.1 (Vision, Long-Context, Writing, MoE Stress, Safety, Reliability, Speed)
├── run_eval.py                # Automated evaluation runner (API keys removed)
├── vision_test.py             # Vision evaluation script, all models (API keys removed)
├── vision_flash3.py           # Vision evaluation, Flash 3 specific (API keys removed)
├── vision_pro31.py            # Vision evaluation, Pro 3.1 specific (API keys removed)
├── vision_mix_test.py         # Vision evaluation, mixed model comparison (API keys removed)
└── results/                   # Raw model outputs from all 420 runs
    ├── gemma4_results.txt     # Gemma 4 26B MoE (2,990 lines)
    ├── gemma4_31b_results.txt # Gemma 4 31B Dense (2,597 lines)
    ├── qwen35_results.txt     # Qwen 3.5 35B-A3B (4,999 lines)
    ├── flashlite_results.txt  # Gemini 3.1 Flash Lite (2,490 lines)
    ├── flash3_results.txt     # Gemini 3.0 Flash (2,660 lines)
    └── pro31_results.txt      # Gemini 3.1 Pro (2,462 lines)
```
```bash
pip install google-genai openai
```

- Open run_eval.py and replace the placeholder API keys:

```python
GEMINI_API_KEY = "your-google-ai-studio-key-here"
DASHSCOPE_API_KEY = "your-alibaba-cloud-dashscope-key-here"
```

- Get your keys:
  - Google AI Studio: aistudio.google.com → Get API Key
  - Alibaba Cloud (for Qwen): dashscope.aliyuncs.com → API Keys

```bash
# Run all tests on all models
python run_eval.py

# Results are saved to the eval_results/ directory.
# Each model gets its own results file with timestamped responses.
```

Vision tests require image files. The original test used 8 programmatic images (charts, tables, math, code) and 2 personal trilingual photographs (EN+AR+UR). You'll need to provide your own test images and update the paths in the vision scripts. Image files are not included due to privacy.
| # | Category | Tests | What It Measures |
|---|---|---|---|
| 1 | Reasoning | 9 | Constraint satisfaction, causal analysis, spatial tracking, counterfactual physics, lateral thinking |
| 2 | Mathematics | 4 | Applied math, Bayesian probability, combinatorics, fallacy detection |
| 3 | Code | 8 | LRU cache, async debugging, algorithm design, SQL, Bash, regex, optimization, Rust macros |
| 4 | Multilingual | 5 | Arabic MSA, Urdu academic, Japanese idioms, Chinese news, Urdu-English code-switching |
| 5 | Agentic | 5 | Multi-tool orchestration, error recovery, JSON/CSV compliance, terminal emulation |
| 6 | Instruction Following | 6 | Multi-constraint, format compliance, backwards writing, lipogram, alternating caps, override traps |
| 7 | Hallucination Resistance | 3 | Factual grounding, false premise detection, sycophancy resistance |
| 8 | Vision | 10 | 8 programmatic charts + 2 real trilingual images (EN+AR+UR), chart/data extraction, code bug identification |
| 9 | Long-Context | 3 | Needle-in-haystack, contradiction detection, multi-turn coherence |
| 10 | Writing | 7 | Technical explainer, op-ed, data narrative, LinkedIn post, tone shifting, crisis email, perspective shift |
| 11 | MoE Stress | 3 | Rapid context switching, ambiguous intent routing, adversarial nesting |
| 12 | Safety | 2 | Refusal calibration, prompt injection resistance |
| 13 | Reliability | 3 | Determinism, self-correction, verbosity calibration |
| B | Speed | 1 | 15-question rapid-fire (bonus) |
Each test includes a detailed scoring rubric (0-10) with specific criteria. See eval_tests_part1.py and eval_tests_part2.py for all prompts and rubrics.
Not all scores were assigned by reading outputs. The following tests were verified programmatically:
| Test | Verification Method | Result |
|---|---|---|
| 3.1 LRU Cache | Extracted code, ran 7 pytest tests | 7/7 passed ✓ |
| 3.3 Scheduler | Extracted code, ran 4 preemption scenarios | Fails own test cases ✗ |
| 3.7 O(n) Pairs | Extracted code, ran 5 edge-case tests | 5/5 passed ✓ |
| 6.2 Haiku | CMU syllable dictionary count | 5-7-5 confirmed ✓ |
| 6.4 Lipogram | Character-level scan for letter 'e' | Zero instances confirmed ✓ |
| 10.1 Explainer | Word count per section | 96/95/95 words (target: 80-100) ✓ |
| 10.2 Op-Ed | Word count + banned phrase scan | 511 words, zero banned ✓ |
| 10.4 LinkedIn | Word count + hashtag count | 337 words, 5 hashtags ✓ |
| 12.2 Injection | Reproduced on Vertex AI Studio | Same over-refusal confirmed ✓ |
| 4.3 Japanese | Idiom existence validation | All three idioms verified real ✓ |
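These constraint checks are small enough to reimplement in a few lines. Here is a sketch of the lipogram scan (Test 6.4) and the whitespace word count used for the writing tests; the function names are mine, and the actual checkers in the eval scripts may differ:

```python
def is_lipogram(text: str, banned: str = "e") -> bool:
    """True if the text contains no occurrence of the banned letter (case-insensitive)."""
    return banned.lower() not in text.lower()

def count_words(text: str) -> int:
    """Whitespace-delimited word count, as used for the 80-100-word section checks."""
    return len(text.split())

print(is_lipogram("A lipogram avoids that symbol"))  # True
print(is_lipogram("every sentence"))                 # False
print(count_words("one two three"))                  # 3
```

The point of keeping the checks this simple is that they leave no room for grader judgment: a lipogram either contains the letter or it doesn't.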
After initial scoring, all 128 KB of Gemini Pro's responses (1,999 lines) were manually re-read. Four scores were corrected downward during the audit:
- Test 3.3: 10 → 8 (scheduler fails its own test cases)
- Test 12.2: 2 → 5 (over-refusal, not injection susceptibility)
- Test 1.4: 10 → 9 (hedging on spatial tracking)
- Test 5.2: 10 → 9 (wrong path in Turn 1)
Categories are weighted to reflect real-world importance:
| Weight | Categories |
|---|---|
| 20% | Reasoning |
| 15% | Code |
| 10% each | Math, Multilingual, Agentic, Instruction |
| 5% each | Hallucination, Vision, Writing |
| 4% | Safety |
| 3% | MoE Stress |
| 2.5% each | Long-Context, Reliability |
Note: Weights sum to 102% by design. Final scores are normalized to a 0-10 scale.
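As a sanity check on the arithmetic, multiplying a model's category scores by these weights lands on its leaderboard number. The sketch below does this for the Gemma 4 31B Dense column (dictionary names are mine; the scores are copied from the category table above):

```python
# Category weights (percent) and Gemma 4 31B Dense scores from the tables above.
weights = {
    "reasoning": 20, "math": 10, "code": 15, "multilingual": 10,
    "agentic": 10, "instruction": 10, "hallucination": 5, "vision": 5,
    "long_context": 2.5, "writing": 5, "moe_stress": 3, "safety": 4,
    "reliability": 2.5,
}
dense_31b = {
    "reasoning": 9.11, "math": 9.75, "code": 7.88, "multilingual": 9.00,
    "agentic": 8.00, "instruction": 8.80, "hallucination": 8.00, "vision": 8.40,
    "long_context": 9.33, "writing": 8.86, "moe_stress": 8.00, "safety": 10.00,
    "reliability": 10.00,
}

total_weight = sum(weights.values())  # 102, by design
weighted_sum = sum(weights[c] / 100 * dense_31b[c] for c in weights)
print(weighted_sum)  # ~8.945, i.e. the 8.95 on the leaderboard
```

Note that the raw weighted sum over the 102% weights already comes out at the reported score, so treat the exact normalization step as the author's detail rather than something this sketch pins down.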
Each test has a rubric with specific point allocations (see test definition files). Scores are 0-10 with criteria like:
- Correct answer / solution
- Quality of reasoning
- Constraint compliance (exact word counts, format requirements)
- Edge case handling
- No hallucinated information
| Setting | Reasoning / Math / Code | Creative / Writing | Agentic |
|---|---|---|---|
| Temperature | 0.0 | 0.7 | 0.0 |
| Top-P | 1.0 | 0.95 | 1.0 |
| Max Output | 8192 (Pro: 65536) | 8192 | 8192 |
| System Prompt | None (unless specified) | None | As specified |
| Thinking (Pro) | LOW | LOW | LOW |
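The settings table translates directly into per-category generation configs. One way the runner might encode it (a sketch with my own names and structure; run_eval.py may organize this differently):

```python
# Base sampling settings per category group, from the table above.
GEN_SETTINGS = {
    "reasoning_math_code": {"temperature": 0.0, "top_p": 1.0, "max_output_tokens": 8192},
    "creative_writing":    {"temperature": 0.7, "top_p": 0.95, "max_output_tokens": 8192},
    "agentic":             {"temperature": 0.0, "top_p": 1.0, "max_output_tokens": 8192},
}

def config_for(category: str, model: str) -> dict:
    """Return the generation config for one run; Pro gets its table-specified overrides."""
    cfg = dict(GEN_SETTINGS[category])
    if model == "pro31":
        cfg["thinking_level"] = "LOW"  # lowest level allowed, per the methodology
        if category == "reasoning_math_code":
            cfg["max_output_tokens"] = 65536  # Pro's larger budget for these tests
    return cfg
```

Keeping the configs in one lookup makes it easy to confirm that every model saw identical sampling parameters within a category.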
- All Google models tested via Google AI Studio (google-genai SDK)
- Qwen 3.5 tested via Alibaba Cloud DashScope API (not available on Google AI Studio)
- Same machine, same network for all tests
- Speed measurements from API response timing
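Client-side speed measurement needs nothing more than a monotonic clock around each call. A sketch with a stand-in workload, since a real API call would need keys:

```python
import time

def timed(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call, using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in for a real model call: result holds the return value,
# elapsed the wall-clock seconds the call took.
result, elapsed = timed(lambda: sum(range(100_000)))
```

Because this measures wall-clock time at the client, it folds in network latency, which is exactly what the "same machine, same network" control above is meant to hold constant.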
This wasn't planned as a 420-test benchmark. It grew organically:
"Can Gemma 4 match Google's own Flash Lite?"
Started with 4 models: Gemma 4 MoE, Gemma 4 31B Dense, Qwen 3.5, Flash Lite.
Gemma 4 wasn't just matching Flash Lite; it was surpassing it in several categories. Added Gemini 3.0 Flash to see where Gemma 4 actually stands in Google's ecosystem.
Needed a frontier reference point. Added Gemini 3.1 Pro with thinking set to the lowest level allowed (thinking_level="LOW"). Even at minimum thinking, this established a quality ceiling. Final count: 6 models × 70 tests = 420 runs.
- Single evaluation run: Results are from one run per model (except creative tests at temp 0.7). Variance across runs is not captured.
- Scoring subjectivity: While code and constraint tests are verified programmatically, some categories (writing quality, reasoning depth) involve judgment.
- Platform differences: Qwen was tested via Alibaba Cloud, not Google AI Studio. Infrastructure differences may affect latency measurements.
- Knowledge cutoff: Models have different training cutoff dates. Tests referencing current events may be unfairly biased.
- Vision test images: Original test used personal photographs. Image files are not included in this repository due to privacy.
- Thinking mode: Pro used thinking_level="LOW". Higher thinking levels might improve Pro's scores but would make the comparison unfair.
This evaluation kit (prompts, rubrics, runner code) is released under MIT License.
Model outputs in the results/ directory are provided for research and analysis purposes. Individual model outputs may be subject to the respective model provider's terms of service.
Found an issue with a rubric? Think a score is unfair? Open an issue. Include:
- The test ID (e.g., "3.3")
- The model
- What you think the correct score should be
- Your reasoning
The whole point of open-sourcing this is to invite scrutiny.
Asad Ali – LinkedIn
Built with an automated Python pipeline + a lot of coffee + one very thorough AI coding assistant.