Releases: hidai25/eval-view

v0.6.1 — Full MCP feature parity

28 Mar 22:22

What's new

  • Full MCP feature parity — all CLI flags now exposed via MCP tools (heal, strict, statistical, budget, tags, variants, and more)
  • New MCP tools: compare_agents (A/B test two endpoints) and replay (trajectory diff viewer)
  • 33 MCP regression tests — protocol, schema contracts, flag wiring, routing, timeouts
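
Over MCP, tools like these are invoked with a standard JSON-RPC tools/call request. A hypothetical compare_agents call might look like the following; the argument names are assumptions for illustration, not the tool's documented schema (the endpoints are borrowed from the compare example in v0.4.0):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "compare_agents",
    "arguments": {
      "endpoint_a": "http://prod.internal/invoke",
      "endpoint_b": "http://staging.internal/invoke",
      "tests": "tests/"
    }
  }
}
```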

Fixes

  • Stable JSON response contract on run_check regardless of flags
  • --report no longer opens browser from MCP server
  • Replay timeout increased to 120s
  • Subprocess calls use stdin=DEVNULL to prevent hangs
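
The stdin=DEVNULL fix follows a standard pattern: a child process that reads an inherited, never-closed stdin from a long-lived server can block forever, while DEVNULL hands it an already-exhausted stream. An illustrative sketch (not evalview's actual code):

```python
import subprocess
import sys

# The child reads stdin; with an inherited, never-closed stdin it could
# hang. stdin=subprocess.DEVNULL gives it an empty stream instead.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"],
    stdin=subprocess.DEVNULL,
    capture_output=True,
    text=True,
    timeout=30,
)
print(result.stdout.strip())  # "0" -- the child saw an empty stdin
```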

Install / Upgrade

pip install --upgrade evalview

Full changelog: https://github.com/hidai25/eval-view/blob/main/CHANGELOG.md

v0.6.0 — Auto-heal engine

27 Mar 09:22

What's new

  • Auto-heal engine — evalview check --heal retries flaky tests, distinguishes non-determinism from real regressions, and self-heals output drift
  • Model change detection — detects when the underlying model has changed and adjusts evaluation accordingly
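
The flaky-vs-regression distinction behind --heal can be sketched as a retry loop; this illustrates the idea only and is not evalview's actual heal logic:

```python
def classify(run_test, retries=3):
    """Re-run a failing test a few times. Consistent failure suggests a
    real regression; any pass suggests non-determinism. Illustrative
    sketch only, not evalview's implementation."""
    outcomes = [run_test() for _ in range(retries)]
    if not any(outcomes):
        return "regression"   # failed every retry: real break
    if all(outcomes):
        return "healed"       # passed on every retry: transient blip
    return "flaky"            # mixed results: non-deterministic test

print(classify(lambda: False))  # regression
print(classify(lambda: True))   # healed
```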

Install / Upgrade

pip install --upgrade evalview

Full changelog: https://github.com/hidai25/eval-view/blob/main/CHANGELOG.md

v0.5.5 — Watch Mode, Native Adapters, Smart DX, Python API

25 Mar 09:23

What's New

New Commands

  • evalview watch — re-run regression checks on every file save with live scorecard ($0 in quick mode)
  • evalview badge — shields.io status badge, auto-updates on every check
  • evalview monitor --dashboard — live terminal dashboard with per-test history dots

Native Adapters

  • Pydantic AI (pydantic-ai) — calls agent.run() in-process, extracts tool calls from typed messages
  • CrewAI (crewai-native) — calls crew.kickoff() in-process, captures tools via event bus

Smart DX

  • Assertion wizard — capture real traffic, get pre-configured assertions automatically
  • Auto-variant discovery — --statistical N --auto-variant finds and saves non-deterministic paths
  • Budget circuit breaker — --budget 0.50 enforces spend limits mid-execution
  • Eval profiles — init auto-detects agent type and configures evaluators

Python API

  • gate(), gate_async(), gate_or_revert() — programmatic regression checks
  • OpenClaw integration with check_and_decide() for autonomous loops

GitHub Action

  • Auto PR comments, artifact uploads, version pinning — all in one step

Documentation

  • CrewAI, Pydantic AI, and OpenClaw integration guides
  • README rewritten for conversion
  • 26 community issues for contributors

Full changelog: https://github.com/hidai25/eval-view/blob/main/CHANGELOG.md

Install: pip install evalview==0.5.5 or curl -fsSL https://raw.githubusercontent.com/hidai25/eval-view/main/install.sh | bash

v0.5.4 — Python API, Dashboard, OpenClaw Integration

23 Mar 15:32

What's New

Python API

  • gate() and gate_async() — programmatic regression checks, no CLI needed
  • gate(quick=True) — skip LLM judge for free, sub-second checks
  • from evalview import gate, DiffStatus — clean top-level imports
  • Typed results: GateResult, TestDiff, GateSummary

Terminal Dashboard

  • Scorecard panel with colored health bar, streak tracker, and gauge
  • Unicode sparkline trends from drift history
  • Confidence scoring on each verdict (z-score based signal vs noise)
  • Smart accept suggestions when changes look intentional
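
The z-score idea behind confidence scoring can be sketched like this; the formula is illustrative, and evalview's exact computation may differ:

```python
from statistics import mean, stdev

def confidence_z(history, current):
    """How many standard deviations the current score sits from the
    historical mean; a large |z| means the change is unlikely to be
    run-to-run noise. Illustrative sketch of signal vs noise."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return 0.0 if current == mu else float("inf")
    return (current - mu) / sigma

history = [86.0, 85.5, 86.2, 85.8, 86.1]
z = confidence_z(history, 80.0)
print(round(z, 1))  # strongly negative: a real drop, not noise
```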

HTML Report Dashboard

  • SVG health gauge with pass/fail breakdown
  • Chart.js trend lines for output similarity over time
  • Confidence badges on diff rows
  • Accept suggestion boxes with copy-paste commands

OpenClaw Integration

  • evalview openclaw install — install gate skill into claw workspace
  • evalview openclaw check — run gate with auto-revert
  • gate_or_revert() / check_and_decide() Python helpers
  • Built-in SKILL.md for autonomous agent loops

MCP Server

  • run_check rewired to call gate() directly (no subprocess)
  • Fallback to subprocess on error
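
The in-process-first, subprocess-on-error pattern looks roughly like this sketch; run_check_inprocess is a placeholder standing in for the direct gate() call, and the CLI command here is a stand-in, not evalview's real invocation:

```python
import subprocess

def run_check_inprocess():
    # Placeholder for the direct gate() call; raise to simulate an error.
    raise RuntimeError("in-process path unavailable")

def run_check(cli_cmd=("echo", "check ok")):
    """Prefer the fast in-process path; fall back to invoking the CLI as
    a subprocess on any error. Pattern sketch only."""
    try:
        return run_check_inprocess()
    except Exception:
        completed = subprocess.run(list(cli_cmd), capture_output=True, text=True)
        return completed.stdout.strip()

print(run_check())  # reaches the subprocess fallback
```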

Other

  • evalview snapshot --preview — dry-run before saving baselines
  • python -m evalview support
  • Centralized model defaults (DEFAULT_MODELS, DEFAULT_JUDGE_MODEL)
  • Updated all defaults from gpt-4o-mini to gpt-5.4-mini
  • 22 new API tests (1147 total passing)
  • mypy clean (166 source files, 0 errors)

Install

pip install evalview==0.5.4

v0.5.3

18 Mar 22:10

HTML Report Redesign

Overview tab — Compact KPI strip replaces 6-card hero grid. Removed duplicate Agent Model/Token Usage cards and Distribution donut. Full-width score chart. No-judge notice when hallucination/safety checks are skipped.

Execution Trace tab — Adaptive collapse: suites with ≤4 tests start fully expanded; with 5+ tests only the first is expanded. Larger chevron buttons.

Diffs tab — Collapsible items (passed collapsed, changes expanded). Removed duplicate tool tags. Lazy-rendered trajectory diagrams behind toggle. Baseline→current score display (86.0 → 87.5 +1.5). Tooltips on lexical/semantic similarity.

Timeline tab — KPI summary strip. Side-by-side latency + cost charts. Color-coded bars by test.

All tabs — Larger Mermaid diagram fonts. Removed SVG max-height cap.

v0.5.2

17 Mar 08:57

What's New in v0.5.2

Cold-Start Test Generation (evalview generate)

  • Production-grade test generation from live agent probing — no manual YAML writing needed
  • Interactive probe budget and model selection
  • Multi-turn conversation tests generated as single cohesive test cases
  • Domain-aware draft generation with coherence filtering
  • --synth-model flag to override the synthesis model
  • Real-time elapsed timer during probe runs
  • Delta reporting: shows changes since last generation

Improved Reports

  • Model and token usage displayed in HTML reports
  • Judge cost tracking surfaced in check reports
  • Per-query model shown in trace cost breakdown
  • Cleaner baseline metadata and timeline in check reports
  • Turn-level details with clickable chevrons in multi-turn traces

Better Onboarding (evalview init)

  • Remembers active test suite for plain snapshot and check
  • Auto-approves generated drafts with scoped snapshot guidance
  • Detects local agents on /execute and /health endpoints
  • Refreshes stale config when a live agent is detected

Check Command Improvements

  • Shows last baseline snapshot timestamp
  • Auto-generates local HTML report on failures
  • Streamlined regression demo flow

Model Support

  • GPT-5 family model support (gpt-5.4, gpt-5.4-mini)
  • Interactive model selection from available providers

Multi-Turn & Monitoring

  • Multi-turn golden baselines with per-turn tool sequences
  • Cost/latency spike alerts in monitor mode
  • Batch edge-case expansion for test coverage

Bug Fixes

  • Fix multi-turn filter — different output is meaningful regardless of tools
  • Fix probe progress for skipped follow-ups
  • Predictable timing — one discovery probe, and multi-turn probes count against the budget
  • Always show agent model in run output
  • Eliminate duplicate multi-turn tests
  • Silence Ollama JSON fallback warnings in normal runs

Docs

  • Trimmed README from 1420 to 274 lines — details moved to dedicated docs
  • Comparison docs and SEO content added

v0.5.1

13 Mar 20:13

What's New

Added

  • evalview generate — draft test suite generation from agent probing or log imports, with approval gating and CI review flow
  • Approval workflow — generated tests require explicit approval before becoming baselines
  • CI review comments — evalview ci comment posts generation reports on PRs

Fixed

  • Python 3.9 compatibility: replaced datetime.UTC with timezone.utc
  • Mypy type errors in generate command and test generation module
  • Codebase refactor and cleanup across 71 files

Full Changelog: v0.5.0...v0.5.1

v0.5.0 — Production Monitoring

12 Mar 12:05

What's New

Production Monitoring (evalview monitor)

  • Continuous regression detection — runs evalview check in a loop with configurable interval (default: 5 min)
  • Slack alerts — webhook notifications on new regressions, recovery notifications when resolved
  • Smart dedup — only alerts on NEW failures, no re-alerts on persistent issues
  • JSONL history export — --history monitor.jsonl appends cycle data for trend analysis and dashboards
  • Graceful shutdown — Ctrl+C stops cleanly with cost summary
  • Config support — CLI flags, config.yaml, or EVALVIEW_SLACK_WEBHOOK env var
evalview monitor                                         # Check every 5 min
evalview monitor --interval 60                           # Every minute
evalview monitor --slack-webhook https://hooks.slack.com/services/...
evalview monitor --history monitor.jsonl                 # Save trends
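
The smart dedup reduces to set differences between consecutive monitoring cycles. A minimal sketch of the idea, not evalview's actual code:

```python
def diff_cycles(previous_failures, current_failures):
    """Alert only on newly failing tests; report recoveries for failures
    that cleared since the last cycle. Persistent failures trigger
    neither, so there are no repeat alerts."""
    alerts = sorted(current_failures - previous_failures)
    recoveries = sorted(previous_failures - current_failures)
    return alerts, recoveries

alerts, recoveries = diff_cycles({"test_a", "test_b"}, {"test_b", "test_c"})
print(alerts, recoveries)  # ['test_c'] ['test_a']
```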

Community Contributions

  • CSV export — evalview check --csv results.csv (@muhammadrashid4587)
  • Timeout flag — evalview check --timeout 60 (@zamadye)
  • Better errors — human-friendly connection failure messages (@passionworkeer)
  • JSONL history — --history flag for monitor (@clawtom)

Bug Fixes & Refactoring

  • Fixed severity comparison bug (was using string matching instead of enum comparison)
  • Fixed JSONL history pass count (was using fail_on filter instead of actual counts)
  • Extracted shared _parse_fail_statuses utility for consistent fail_on parsing
  • Eliminated redundant config loading in monitor loop

Deployment

# Quick background run
nohup evalview monitor --slack-webhook https://... &

# Docker
docker run -d -v $(pwd):/app -w /app evalview monitor --slack-webhook https://...

Full Changelog: v0.4.1...v0.5.0

v0.4.1

09 Mar 09:47

What's New

Mistral Adapter

  • Direct Mistral API support via pip install evalview[mistral]
  • Lazy import — no dependency unless you use it

PII Evaluator

  • Opt-in detection for emails, phones, SSNs, credit cards, addresses
  • Luhn algorithm validation for credit cards to reduce false positives
  • Enable with checks: { pii: true } in test YAML
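
The Luhn check that backs the credit-card detector is a standard checksum. A minimal sketch:

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right, subtract
    9 from doubles over 9, and require the total to be divisible by 10.
    Filters out 16-digit strings that are not plausible card numbers."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True (well-known test number)
print(luhn_valid("4111 1111 1111 1112"))  # False
```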

Multi-Turn HTML Reports

  • Mermaid sequence diagrams showing conversation turns with tool calls
  • Per-turn query and tool breakdown in the Execution Trace tab

Security

  • GitHub Action: replaced eval $CMD with bash arrays, moved inputs to env vars
  • Mermaid diagrams: fixed autoescape breaking arrows, sanitized user content

README

  • New hero section with logo, sequence diagram screenshot, data flow diagram
  • "Your data stays local" privacy explanation
  • Updated model version examples to Claude 4.5/4.6

Full Changelog: v0.4.0...v0.4.1

v0.4.0 — Multi-turn testing, A/B comparison, Cloud sync

05 Mar 10:55

What's new in 0.4.0

Multi-turn conversation testing

Test stateful, multi-step conversations with the new turns: YAML field. Each turn gets the accumulated conversation history injected automatically.

name: flight-booking-conversation
turns:
  - query: "I want to fly from NYC to Paris next Friday"
    expected:
      tools: [search_flights]
  - query: "Book the cheapest economy option"
    expected:
      tools: [book_flight]
      output:
        contains: ["confirmed", "Paris"]
  - query: "Send me a confirmation email"
    expected:
      tools: [send_email]
expected:
  tools: [search_flights, book_flight, send_email]
thresholds:
  min_score: 80
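
The automatic history injection can be pictured as follows; the payload shape ({"query": ..., "history": ...}) is an assumption for illustration, not evalview's actual request format — only the accumulation behavior is the point:

```python
def build_turn_payloads(turns):
    """Each turn's request carries the conversation so far, so later
    turns can resolve references like "the cheapest option"."""
    history, payloads = [], []
    for query in turns:
        payloads.append({"query": query, "history": list(history)})
        history.append(query)
    return payloads

payloads = build_turn_payloads(
    ["fly NYC to Paris", "book cheapest economy", "send confirmation email"]
)
print(payloads[2]["history"])  # the first two queries
```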

A/B endpoint comparison

Run the same test suite against two endpoints and get a per-test verdict table.

evalview compare \
  --v1 http://prod.internal/invoke --label-v1 "gpt-4o (prod)" \
  --v2 http://staging.internal/invoke --label-v2 "claude-sonnet (staging)" \
  --tests tests/

Cloud baseline sync

evalview login      # OAuth sign-in
evalview snapshot   # baselines auto-sync to cloud
evalview check      # teammates pull your baselines automatically

Other highlights

  • evalview capture — HTTP proxy records real agent traffic as test YAMLs
  • evalview install-hooks — inject regression checks into git pre-push
  • Silent model update detection — alerts when provider swaps model behind same API name
  • Gradual drift detection — OLS regression over 10-check window
  • Semantic diff — --semantic-diff scores by meaning, not character similarity
  • Auto-open HTML report after every evalview run
  • evalview init now auto-detects your agent endpoint and generates starter tests
  • Test quality gating — low-quality generated tests are skipped, not silently polluting scores
  • mypy clean — 0 errors across 109 source files
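
The OLS regression behind gradual drift detection reduces to the standard least-squares slope over a sliding window of scores; a sketch assuming a 10-check window (the threshold logic is not shown and would be evalview-specific):

```python
def ols_slope(scores):
    """Least-squares slope of score versus check index. A consistently
    negative slope over the window flags gradual drift even when no
    single check fails outright."""
    n = len(scores)
    x_mean = (n - 1) / 2
    y_mean = sum(scores) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(scores))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

window = [90, 89.5, 89, 88.4, 88, 87.5, 87, 86.4, 86, 85.5]
slope = ols_slope(window)
print(round(slope, 2))  # about -0.5 points per check
```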


Full changelog: CHANGELOG.md