Releases: hidai25/eval-view
v0.6.1 — Full MCP feature parity
What's new
- Full MCP feature parity — all CLI flags now exposed via MCP tools (heal, strict, statistical, budget, tags, variants, and more)
- New MCP tools: `compare_agents` (A/B test two endpoints) and `replay` (trajectory diff viewer)
- 33 MCP regression tests — protocol, schema contracts, flag wiring, routing, timeouts
Fixes
- Stable JSON response contract on `run_check` regardless of flags
- `--report` no longer opens the browser from the MCP server
- Replay timeout increased to 120s
- Subprocess calls use `stdin=DEVNULL` to prevent hangs
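The `stdin=DEVNULL` fix follows a standard pattern: a child process that inherits an interactive stdin can block forever waiting for input, while redirecting stdin to the null device makes it see EOF immediately. A minimal generic sketch of the pattern (not eval-view's actual code):

```python
import subprocess
import sys
from subprocess import DEVNULL, PIPE

# With stdin=DEVNULL the child sees EOF immediately instead of
# blocking on an inherited (possibly interactive) stdin.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"],
    stdin=DEVNULL,
    stdout=PIPE,
    text=True,
    timeout=30,
)
print(result.stdout.strip())  # the child read 0 bytes before EOF
```

Without the redirect, the same child would hang on `sys.stdin.read()` whenever the parent's stdin is a terminal with no input coming.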
Install / Upgrade
pip install --upgrade evalview
Full changelog: https://github.com/hidai25/eval-view/blob/main/CHANGELOG.md
v0.6.0 — Auto-heal engine
What's new
- Auto-heal engine — `evalview check --heal` retries flaky tests, distinguishes non-determinism from real regressions, and self-heals output drift
- Model change detection — detects when the underlying model has changed and adjusts evaluation accordingly
Install / Upgrade
pip install --upgrade evalview
Full changelog: https://github.com/hidai25/eval-view/blob/main/CHANGELOG.md
v0.5.5 — Watch Mode, Native Adapters, Smart DX, Python API
What's New
New Commands
- `evalview watch` — re-run regression checks on every file save with live scorecard ($0 in quick mode)
- `evalview badge` — shields.io status badge, auto-updates on every check
- `evalview monitor --dashboard` — live terminal dashboard with per-test history dots
Native Adapters
- Pydantic AI (`pydantic-ai`) — calls `agent.run()` in-process, extracts tool calls from typed messages
- CrewAI (`crewai-native`) — calls `crew.kickoff()` in-process, captures tools via event bus
Smart DX
- Assertion wizard — capture real traffic, get pre-configured assertions automatically
- Auto-variant discovery — `--statistical N --auto-variant` finds and saves non-deterministic paths
- Budget circuit breaker — `--budget 0.50` enforces spend limits mid-execution
- Eval profiles — `init` auto-detects agent type and configures evaluators
Python API
- `gate()`, `gate_async()`, `gate_or_revert()` — programmatic regression checks
- OpenClaw integration with `check_and_decide()` for autonomous loops
GitHub Action
- Auto PR comments, artifact uploads, version pinning — all in one step
Documentation
- CrewAI, Pydantic AI, and OpenClaw integration guides
- README rewritten for conversion
- 26 community issues opened for contributors
Full changelog: https://github.com/hidai25/eval-view/blob/main/CHANGELOG.md
Install: `pip install evalview==0.5.5` or `curl -fsSL https://raw.githubusercontent.com/hidai25/eval-view/main/install.sh | bash`
v0.5.4 — Python API, Dashboard, OpenClaw Integration
What's New
Python API
- `gate()` and `gate_async()` — programmatic regression checks, no CLI needed
- `gate(quick=True)` — skip the LLM judge for free, sub-second checks
- `from evalview import gate, DiffStatus` — clean top-level imports
- Typed results: `GateResult`, `TestDiff`, `GateSummary`
Terminal Dashboard
- Scorecard panel with colored health bar, streak tracker, and gauge
- Unicode sparkline trends from drift history
- Confidence scoring on each verdict (z-score based signal vs noise)
- Smart accept suggestions when changes look intentional
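Z-score confidence of this kind measures how far the current score sits from the historical mean, in units of historical run-to-run noise. A rough illustration of the idea (not eval-view's actual formula):

```python
from statistics import mean, stdev

def drift_confidence(history: list[float], current: float) -> float:
    """Z-score of the current score against historical noise.

    A large |z| means the change stands out from normal run-to-run
    variation (likely a real change); a small |z| is likely noise.
    """
    mu = mean(history)
    sigma = stdev(history) or 1e-9  # guard against flat history
    return (current - mu) / sigma

stable = [86.0, 85.5, 86.2, 85.8, 86.1]
print(round(drift_confidence(stable, 86.0), 2))  # small |z|: within noise
print(round(drift_confidence(stable, 78.0), 2))  # large negative z: clear signal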
HTML Report Dashboard
- SVG health gauge with pass/fail breakdown
- Chart.js trend lines for output similarity over time
- Confidence badges on diff rows
- Accept suggestion boxes with copy-paste commands
OpenClaw Integration
- `evalview openclaw install` — install gate skill into claw workspace
- `evalview openclaw check` — run gate with auto-revert
- `gate_or_revert()` / `check_and_decide()` Python helpers
- Built-in SKILL.md for autonomous agent loops
MCP Server
- `run_check` rewired to call `gate()` directly (no subprocess)
- Fallback to subprocess on error
Other
- `evalview snapshot --preview` — dry-run before saving baselines
- `python -m evalview` support
- Centralized model defaults (`DEFAULT_MODELS`, `DEFAULT_JUDGE_MODEL`)
- Updated all defaults from gpt-4o-mini to gpt-5.4-mini
- 22 new API tests (1147 total passing)
- mypy clean (166 source files, 0 errors)
Install
pip install evalview==0.5.4
v0.5.3
HTML Report Redesign
Overview tab — Compact KPI strip replaces 6-card hero grid. Removed duplicate Agent Model/Token Usage cards and Distribution donut. Full-width score chart. No-judge notice when hallucination/safety checks are skipped.
Execution Trace tab — Adaptive collapse: ≤4 tests all expanded, 5+ only first. Larger chevron buttons.
Diffs tab — Collapsible items (passed collapsed, changes expanded). Removed duplicate tool tags. Lazy-rendered trajectory diagrams behind toggle. Baseline→current score display (86.0 → 87.5 +1.5). Tooltips on lexical/semantic similarity.
Timeline tab — KPI summary strip. Side-by-side latency + cost charts. Color-coded bars by test.
All tabs — Larger Mermaid diagram fonts. Removed SVG max-height cap.
v0.5.2
What's New in v0.5.2
Cold-Start Test Generation (evalview generate)
- Production-grade test generation from live agent probing — no manual YAML writing needed
- Interactive probe budget and model selection
- Multi-turn conversation tests generated as single cohesive test cases
- Domain-aware draft generation with coherence filtering
- `--synth-model` flag to override the synthesis model
- Real-time elapsed timer during probe runs
- Delta reporting: shows changes since last generation
Improved Reports
- Model and token usage displayed in HTML reports
- Judge cost tracking surfaced in check reports
- Per-query model shown in trace cost breakdown
- Cleaner baseline metadata and timeline in check reports
- Turn-level details with clickable chevrons in multi-turn traces
Better Onboarding (evalview init)
- Remembers active test suite for plain `snapshot` and `check`
- Auto-approves generated drafts with scoped snapshot guidance
- Detects local agents on `/execute` and `/health` endpoints
- Refreshes stale config when a live agent is detected
Check Command Improvements
- Shows last baseline snapshot timestamp
- Auto-generates local HTML report on failures
- Streamlined regression demo flow
Model Support
- GPT-5 family model support (gpt-5.4, gpt-5.4-mini)
- Interactive model selection from available providers
Multi-Turn & Monitoring
- Multi-turn golden baselines with per-turn tool sequences
- Cost/latency spike alerts in monitor mode
- Batch edge-case expansion for test coverage
Bug Fixes
- Fix multi-turn filter — different output is meaningful regardless of tools
- Fix probe progress for skipped follow-ups
- Predictable probe timing — one discovery probe; multi-turn probes count against the budget
- Always show agent model in run output
- Eliminate duplicate multi-turn tests
- Silence Ollama JSON fallback warnings in normal runs
Docs
- Trimmed README from 1420 to 274 lines — details moved to dedicated docs
- Comparison docs and SEO content added
v0.5.1
What's New
Added
- `evalview generate` — draft test suite generation from agent probing or log imports, with approval gating and CI review flow
- Approval workflow — generated tests require explicit approval before becoming baselines
- CI review comments — `evalview ci comment` posts generation reports on PRs
Fixed
- Python 3.9 compatibility: replaced `datetime.UTC` with `timezone.utc`
- Fixed mypy type errors in generate command and test generation module
- Codebase refactor and cleanup across 71 files
Full Changelog: v0.5.0...v0.5.1
v0.5.0 — Production Monitoring
What's New
Production Monitoring (evalview monitor)
- Continuous regression detection — runs `evalview check` in a loop with configurable interval (default: 5 min)
- Slack alerts — webhook notifications on new regressions, recovery notifications when resolved
- Smart dedup — only alerts on NEW failures, no re-alerts on persistent issues
- JSONL history export — `--history monitor.jsonl` appends cycle data for trend analysis and dashboards
- Graceful shutdown — Ctrl+C stops cleanly with cost summary
- Config support — CLI flags, `config.yaml`, or `EVALVIEW_SLACK_WEBHOOK` env var
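Only-new-failure dedup is essentially a set difference between the current cycle's failing tests and those already alerted on; a recovery notice is the reverse difference. A minimal sketch of the idea (assumed logic, not the actual monitor code):

```python
def diff_alerts(previous_failures: set[str], current_failures: set[str]):
    """Return (new_failures, recoveries) between two monitor cycles."""
    new = current_failures - previous_failures        # alert only on these
    recovered = previous_failures - current_failures  # send recovery notice
    return new, recovered

# "refund-flow" is still failing, so it triggers no duplicate alert.
prev = {"booking-flow", "refund-flow"}
curr = {"refund-flow", "search-flow"}
new, recovered = diff_alerts(prev, curr)
print(sorted(new))        # newly failing
print(sorted(recovered))  # newly recovered
```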
evalview monitor # Check every 5 min
evalview monitor --interval 60 # Every minute
evalview monitor --slack-webhook https://hooks.slack.com/services/...
evalview monitor --history monitor.jsonl # Save trends
Community Contributions
- CSV export — `evalview check --csv results.csv` (@muhammadrashid4587)
- Timeout flag — `evalview check --timeout 60` (@zamadye)
- Better errors — human-friendly connection failure messages (@passionworkeer)
- JSONL history — `--history` flag for monitor (@clawtom)
Bug Fixes & Refactoring
- Fixed severity comparison bug (was using string matching instead of enum comparison)
- Fixed JSONL history pass count (was using fail_on filter instead of actual counts)
- Extracted shared `_parse_fail_statuses` utility for consistent `fail_on` parsing
- Eliminated redundant config loading in monitor loop
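A shared parser for an option like `fail_on` typically normalizes a comma-separated string into a validated set of statuses, so every command handles whitespace and case identically. The sketch below is a guess at the shape; the function name, statuses, and behavior are hypothetical, not eval-view's actual implementation:

```python
VALID_STATUSES = {"fail", "error", "warn"}  # hypothetical status vocabulary

def parse_fail_statuses(raw: str) -> set[str]:
    """Normalize a comma-separated fail_on value into a status set.

    Centralizing this avoids each command re-implementing (and subtly
    diverging on) whitespace trimming and case folding.
    """
    statuses = {part.strip().lower() for part in raw.split(",") if part.strip()}
    unknown = statuses - VALID_STATUSES
    if unknown:
        raise ValueError(f"unknown fail_on statuses: {sorted(unknown)}")
    return statuses

print(sorted(parse_fail_statuses("Fail, ERROR")))  # normalized and deduplicated
```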
Deployment
# Quick background run
nohup evalview monitor --slack-webhook https://... &
# Docker
docker run -d -v $(pwd):/app -w /app evalview monitor --slack-webhook https://...
Full Changelog: v0.4.1...v0.5.0
v0.4.1
What's New
Mistral Adapter
- Direct Mistral API support via `pip install evalview[mistral]`
- Lazy import — no dependency unless you use it
PII Evaluator
- Opt-in detection for emails, phones, SSNs, credit cards, addresses
- Luhn algorithm validation for credit cards to reduce false positives
- Enable with `checks: { pii: true }` in test YAML
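The Luhn check mentioned above is a standard checksum: double every second digit from the right, subtract 9 from any result above 9, and require the total to be divisible by 10. A self-contained version of the algorithm (illustrative, not eval-view's code):

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum.

    Rejecting digit runs that merely *look* like card numbers is what
    cuts false positives in credit-card PII detection.
    """
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 12:  # too short to be a card number
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # well-known test card: True
print(luhn_valid("4111 1111 1111 1112"))  # checksum fails: False
```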
Multi-Turn HTML Reports
- Mermaid sequence diagrams showing conversation turns with tool calls
- Per-turn query and tool breakdown in the Execution Trace tab
Security
- GitHub Action: replaced `eval $CMD` with bash arrays, moved inputs to env vars
- Mermaid diagrams: fixed autoescape breaking arrows, sanitized user content
README
- New hero section with logo, sequence diagram screenshot, data flow diagram
- "Your data stays local" privacy explanation
- Updated model version examples to Claude 4.5/4.6
Full Changelog: v0.4.0...v0.4.1
v0.4.0 — Multi-turn testing, A/B comparison, Cloud sync
What's new in 0.4.0
Multi-turn conversation testing
Test stateful, multi-step conversations with the new turns: YAML field. Each turn gets the accumulated conversation history injected automatically.
name: flight-booking-conversation
turns:
  - query: "I want to fly from NYC to Paris next Friday"
    expected:
      tools: [search_flights]
  - query: "Book the cheapest economy option"
    expected:
      tools: [book_flight]
      output:
        contains: ["confirmed", "Paris"]
  - query: "Send me a confirmation email"
    expected:
      tools: [send_email]
expected:
  tools: [search_flights, book_flight, send_email]
thresholds:
  min_score: 80
A/B endpoint comparison
Run the same test suite against two endpoints and get a per-test verdict table.
evalview compare \
--v1 http://prod.internal/invoke --label-v1 "gpt-4o (prod)" \
--v2 http://staging.internal/invoke --label-v2 "claude-sonnet (staging)" \
  --tests tests/
Cloud baseline sync
evalview login # OAuth sign-in
evalview snapshot # baselines auto-sync to cloud
evalview check # teammates pull your baselines automatically
Other highlights
- `evalview capture` — HTTP proxy records real agent traffic as test YAMLs
- `evalview install-hooks` — inject regression checks into git pre-push
- Silent model update detection — alerts when provider swaps model behind same API name
- Gradual drift detection — OLS regression over 10-check window
- Semantic diff — `--semantic-diff` scores by meaning, not character similarity
- Auto-open HTML report after every `evalview run`
- `evalview init` now auto-detects your agent endpoint and generates starter tests
- Test quality gating — low-quality generated tests are skipped, not silently polluting scores
- mypy clean — 0 errors across 109 source files
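OLS-based gradual drift detection amounts to fitting a least-squares slope to scores over the last N checks: a consistently negative slope flags slow degradation that no single check would trip. A minimal sketch of the idea (window of 10 per the notes above; not eval-view's actual code):

```python
def drift_slope(scores: list[float]) -> float:
    """Least-squares slope of scores vs. check index (points per check)."""
    n = len(scores)
    x_mean = (n - 1) / 2
    y_mean = sum(scores) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(scores))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

# Ten checks drifting down roughly half a point per run: slope near -0.5,
# even though each individual step looks like noise.
window = [88.0, 87.4, 87.1, 86.6, 86.0, 85.4, 85.2, 84.6, 84.1, 83.5]
print(round(drift_slope(window), 2))
```

In practice one would alert when the slope stays below some threshold (e.g. losing more than a fraction of a point per check) for the whole window.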
Community contributions
- Pydantic field validation for `TestCase` (#54 by @illbeurs)
- Edge tests for `CostEvaluator` and `LatencyEvaluator` (#55 by @illbeurs)
- `health_check()` on `OllamaAdapter` (#57 by @gauravxthakur)
- `ConsoleReporter` docstrings (#56 by @gauravxthakur)
Full changelog: CHANGELOG.md