Statistical Mode — Handling Flaky LLM Tests with pass@k

Problem: LLM agents are non-deterministic. A test that passes once might fail the next run. Traditional pass/fail doesn't work when your AI agent gives different answers each time.

Solution: EvalView's statistical mode runs each test N times and uses statistical metrics (pass@k, pass^k, flakiness score, confidence intervals) to give you reliable pass/fail decisions for non-deterministic AI agents.

LLMs are non-deterministic. A test that passes once might fail the next run. Statistical mode addresses this by running tests multiple times and using statistical thresholds for pass/fail decisions.

Quick Start: CLI Flag

The easiest way to enable statistical mode is with the --runs flag:

# Run each test 10 times
evalview run --runs 10

# Run with custom pass rate (70% must pass)
evalview run --runs 10 --pass-rate 0.7

# Filter by difficulty and run statistically
evalview run --difficulty hard --runs 5

This overrides any per-test variance configuration.

Per-Test Configuration (YAML)

Add variance config to your test case:

# tests/test-cases/my-test.yaml
name: "My Agent Test"
input:
  query: "Analyze the market trends"

expected:
  tools:
    - fetch_data
    - analyze

thresholds:
  min_score: 70

  # Statistical mode config
  variance:
    runs: 10           # Run test 10 times
    pass_rate: 0.8     # 80% of runs must pass
    min_mean_score: 70 # Average score must be >= 70
    max_std_dev: 15    # Score std dev must be <= 15

What You Get

Pass rate - Percentage of runs that passed
pass@k / pass^k - Industry-standard reliability metrics (see below)
Score statistics - Mean, std dev, min/max, percentiles, confidence intervals
Flakiness score - 0 (stable) to 1 (flaky) with category labels
Contributing factors - Why the test is flaky (score variance, tool inconsistency, etc.)

Reliability Metrics: pass@k vs pass^k

Two metrics that tell you different things:

Metric	Question it answers	High =
pass@k	"Will it work if I give it a few tries?"	Usually finds a solution
pass^k	"Will it work reliably every time?"	Consistent and reliable

Reliability Metrics:
  pass@10:       99.9% (usually finds a solution)
  pass^10:       2.8% (unreliable)

This example shows an agent that eventually works but isn't production-ready.

Example Output

Statistical Evaluation: My Agent Test
PASSED

┌─ Run Summary ─────────────────────────┐
│  Total Runs:     10                   │
│  Passed:         8                    │
│  Failed:         2                    │
│  Pass Rate:      80% (required: 80%)  │
└───────────────────────────────────────┘

Score Statistics:
  Mean:      79.86    95% CI: [78.02, 81.70]
  Std Dev:   2.97     ▂▂▁▁▁ Low variance
  Min:       75.5
  Max:       84.5

┌─ Flakiness Assessment ────────────────┐
│  Flakiness Score: 0.12 ██░░░░░░░░     │
│  Category:        low_variance        │
│  Pass Rate:       80%                 │
└───────────────────────────────────────┘

Variance Configuration Options

Option	Description	Default
`runs`	Number of times to run the test	10
`pass_rate`	Minimum percentage of runs that must pass	0.8 (80%)
`min_mean_score`	Minimum average score across all runs	Same as `min_score`
`max_std_dev`	Maximum allowed standard deviation	15

When to Use Statistical Mode

Use it when:

Your agent's output varies between runs
You're testing creative/generative tasks
You need production-readiness confidence
You're debugging flaky tests

Skip it when:

Your agent is deterministic (temperature=0)
You just need a quick sanity check
Running in CI where time is critical

Complete Example

See examples/statistical-mode-example.yaml for a complete example.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Statistical Mode — Handling Flaky LLM Tests with pass@k

Quick Start: CLI Flag

Per-Test Configuration (YAML)

What You Get

Reliability Metrics: pass@k vs pass^k

Example Output

Variance Configuration Options

When to Use Statistical Mode

Complete Example

Related Documentation

FilesExpand file tree

STATISTICAL_MODE.md

Latest commit

History

STATISTICAL_MODE.md

File metadata and controls

Statistical Mode — Handling Flaky LLM Tests with pass@k

Quick Start: CLI Flag

Per-Test Configuration (YAML)

What You Get

Reliability Metrics: pass@k vs pass^k

Example Output

Variance Configuration Options

When to Use Statistical Mode

Complete Example

Related Documentation