Releases: hidai25/eval-view
v0.6.1 — Full MCP feature parity
What's new
- Full MCP feature parity — all CLI flags now exposed via MCP tools (heal, strict, statistical, budget, tags, variants, and more)
- New MCP tools: `compare_agents` (A/B test two endpoints) and `replay` (trajectory diff viewer)
- 33 MCP regression tests — protocol, schema contracts, flag wiring, routing, timeouts
Fixes
- Stable JSON response contract on `run_check` regardless of flags
- `--report` no longer opens the browser from the MCP server
- Replay timeout increased to 120s
- Subprocess calls use `stdin=DEVNULL` to prevent hangs
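The `stdin=DEVNULL` fix follows a standard pattern: a child process that inherits an interactive stdin can block forever waiting for input, while redirecting stdin to the null device makes it see EOF immediately. A minimal generic sketch of the pattern (not eval-view's actual code):

```python
import subprocess
import sys
from subprocess import DEVNULL, PIPE

# With stdin=DEVNULL the child sees EOF immediately instead of
# blocking on an inherited (possibly interactive) stdin.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"],
    stdin=DEVNULL,
    stdout=PIPE,
    text=True,
    timeout=30,
)
print(result.stdout.strip())  # the child read 0 bytes before EOF
```

Without the redirect, the same child would hang on `sys.stdin.read()` whenever the parent's stdin is a terminal with no input coming.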
Install / Upgrade
pip install --upgrade evalview
Full changelog: https://github.com/hidai25/eval-view/blob/main/CHANGELOG.md
v0.6.0 — Auto-heal engine
What's new
- Auto-heal engine — `evalview check --heal` retries flaky tests, distinguishes non-determinism from real regressions, and self-heals output drift
- Model change detection — detects when the underlying model has changed and adjusts evaluation accordingly
Install / Upgrade
pip install --upgrade evalview
Full changelog: https://github.com/hidai25/eval-view/blob/main/CHANGELOG.md
v0.5.5 — Watch Mode, Native Adapters, Smart DX, Python API
What's New
New Commands
- `evalview watch` — re-run regression checks on every file save with live scorecard ($0 in quick mode)
- `evalview badge` — shields.io status badge, auto-updates on every check
- `evalview monitor --dashboard` — live terminal dashboard with per-test history dots
Native Adapters
- Pydantic AI (`pydantic-ai`) — calls `agent.run()` in-process, extracts tool calls from typed messages
- CrewAI (`crewai-native`) — calls `crew.kickoff()` in-process, captures tools via event bus
Smart DX
- Assertion wizard — capture real traffic, get pre-configured assertions automatically
- Auto-variant discovery — `--statistical N --auto-variant` finds and saves non-deterministic paths
- Budget circuit breaker — `--budget 0.50` enforces spend limits mid-execution
- Eval profiles — `init` auto-detects agent type and configures evaluators
Python API
- `gate()`, `gate_async()`, `gate_or_revert()` — programmatic regression checks
- OpenClaw integration with `check_and_decide()` for autonomous loops
GitHub Action
- Auto PR comments, artifact uploads, version pinning — all in one step
Documentation
- CrewAI, Pydantic AI, and OpenClaw integration guides
- README rewritten for conversion
- 26 community issues opened for contributors
Full changelog: https://github.com/hidai25/eval-view/blob/main/CHANGELOG.md
Install: `pip install evalview==0.5.5` or `curl -fsSL https://raw.githubusercontent.com/hidai25/eval-view/main/install.sh | bash`
v0.5.4 — Python API, Dashboard, OpenClaw Integration
What's New
Python API
- `gate()` and `gate_async()` — programmatic regression checks, no CLI needed
- `gate(quick=True)` — skip the LLM judge for free, sub-second checks
- `from evalview import gate, DiffStatus` — clean top-level imports
- Typed results: `GateResult`, `TestDiff`, `GateSummary`
Terminal Dashboard
- Scorecard panel with colored health bar, streak tracker, and gauge
- Unicode sparkline trends from drift history
- Confidence scoring on each verdict (z-score based signal vs noise)
- Smart accept suggestions when changes look intentional
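Z-score confidence of this kind measures how far the current score sits from the historical mean, in units of historical run-to-run noise. A rough illustration of the idea (not eval-view's actual formula):

```python
from statistics import mean, stdev

def drift_confidence(history: list[float], current: float) -> float:
    """Z-score of the current score against historical noise.

    A large |z| means the change stands out from normal run-to-run
    variation (likely a real change); a small |z| is likely noise.
    """
    mu = mean(history)
    sigma = stdev(history) or 1e-9  # guard against flat history
    return (current - mu) / sigma

stable = [86.0, 85.5, 86.2, 85.8, 86.1]
print(round(drift_confidence(stable, 86.0), 2))  # small |z|: within noise
print(round(drift_confidence(stable, 78.0), 2))  # large negative z: clear signal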
HTML Report Dashboard
- SVG health gauge with pass/fail breakdown
- Chart.js trend lines for output similarity over time
- Confidence badges on diff rows
- Accept suggestion boxes with copy-paste commands
OpenClaw Integration
- `evalview openclaw install` — install gate skill into claw workspace
- `evalview openclaw check` — run gate with auto-revert
- `gate_or_revert()` / `check_and_decide()` Python helpers
- Built-in SKILL.md for autonomous agent loops
MCP Server
- `run_check` rewired to call `gate()` directly (no subprocess)
- Fallback to subprocess on error
Other
- `evalview snapshot --preview` — dry-run before saving baselines
- `python -m evalview` support
- Centralized model defaults (`DEFAULT_MODELS`, `DEFAULT_JUDGE_MODEL`)
- Updated all defaults from gpt-4o-mini to gpt-5.4-mini
- 22 new API tests (1147 total passing)
- mypy clean (166 source files, 0 errors)
Install
pip install evalview==0.5.4
v0.5.3
HTML Report Redesign
Overview tab — Compact KPI strip replaces 6-card hero grid. Removed duplicate Agent Model/Token Usage cards and Distribution donut. Full-width score chart. No-judge notice when hallucination/safety checks are skipped.
Execution Trace tab — Adaptive collapse: ≤4 tests all expanded, 5+ only first. Larger chevron buttons.
Diffs tab — Collapsible items (passed collapsed, changes expanded). Removed duplicate tool tags. Lazy-rendered trajectory diagrams behind toggle. Baseline→current score display (86.0 → 87.5 +1.5). Tooltips on lexical/semantic similarity.
Timeline tab — KPI summary strip. Side-by-side latency + cost charts. Color-coded bars by test.
All tabs — Larger Mermaid diagram fonts. Removed SVG max-height cap.
v0.5.2
What's New in v0.5.2
Cold-Start Test Generation (evalview generate)
- Production-grade test generation from live agent probing — no manual YAML writing needed
- Interactive probe budget and model selection
- Multi-turn conversation tests generated as single cohesive test cases
- Domain-aware draft generation with coherence filtering
- `--synth-model` flag to override the synthesis model
- Real-time elapsed timer during probe runs
- Delta reporting: shows changes since last generation
Improved Reports
- Model and token usage displayed in HTML reports
- Judge cost tracking surfaced in check reports
- Per-query model shown in trace cost breakdown
- Cleaner baseline metadata and timeline in check reports
- Turn-level details with clickable chevrons in multi-turn traces
Better Onboarding (evalview init)
- Remembers active test suite for plain `snapshot` and `check`
- Auto-approves generated drafts with scoped snapshot guidance
- Detects local agents on `/execute` and `/health` endpoints
- Refreshes stale config when a live agent is detected
Check Command Improvements
- Shows last baseline snapshot timestamp
- Auto-generates local HTML report on failures
- Streamlined regression demo flow
Model Support
- GPT-5 family model support (gpt-5.4, gpt-5.4-mini)
- Interactive model selection from available providers
Multi-Turn & Monitoring
- Multi-turn golden baselines with per-turn tool sequences
- Cost/latency spike alerts in monitor mode
- Batch edge-case expansion for test coverage
Bug Fixes
- Fix multi-turn filter — different output is meaningful regardless of tools
- Fix probe progress for skipped follow-ups
- Predictable probe timing — one discovery probe; multi-turn probes count against the budget
- Always show agent model in run output
- Eliminate duplicate multi-turn tests
- Silence Ollama JSON fallback warnings in normal runs
Docs
- Trimmed README from 1420 to 274 lines — details moved to dedicated docs
- Comparison docs and SEO content added
v0.5.1
What's New
Added
- `evalview generate` — draft test suite generation from agent probing or log imports, with approval gating and CI review flow
- Approval workflow — generated tests require explicit approval before becoming baselines
- CI review comments — `evalview ci comment` posts generation reports on PRs
Fixed
- Python 3.9 compatibility: replaced `datetime.UTC` with `timezone.utc`
- Fixed mypy type errors in generate command and test generation module
- Codebase refactor and cleanup across 71 files
Full Changelog: v0.5.0...v0.5.1
v0.5.0 — Production Monitoring
What's New
Production Monitoring (evalview monitor)
- Continuous regression detection — runs `evalview check` in a loop with configurable interval (default: 5 min)
- Slack alerts — webhook notifications on new regressions, recovery notifications when resolved
- Smart dedup — only alerts on NEW failures, no re-alerts on persistent issues
- JSONL history export — `--history monitor.jsonl` appends cycle data for trend analysis and dashboards
- Graceful shutdown — Ctrl+C stops cleanly with cost summary
- Config support — CLI flags, `config.yaml`, or `EVALVIEW_SLACK_WEBHOOK` env var
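Only-new-failure dedup is essentially a set difference between the current cycle's failing tests and those already alerted on; a recovery notice is the reverse difference. A minimal sketch of the idea (assumed logic, not the actual monitor code):

```python
def diff_alerts(previous_failures: set[str], current_failures: set[str]):
    """Return (new_failures, recoveries) between two monitor cycles."""
    new = current_failures - previous_failures        # alert only on these
    recovered = previous_failures - current_failures  # send recovery notice
    return new, recovered

# "refund-flow" is still failing, so it triggers no duplicate alert.
prev = {"booking-flow", "refund-flow"}
curr = {"refund-flow", "search-flow"}
new, recovered = diff_alerts(prev, curr)
print(sorted(new))        # newly failing
print(sorted(recovered))  # newly recovered
```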
evalview monitor # Check every 5 min
evalview monitor --interval 60 # Every minute
evalview monitor --slack-webhook https://hooks.slack.com/services/...
evalview monitor --history monitor.jsonl # Save trends
Community Contributions
- CSV export — `evalview check --csv results.csv` (@muhammadrashid4587)
- Timeout flag — `evalview check --timeout 60` (@zamadye)
- Better errors — human-friendly connection failure messages (@passionworkeer)
- JSONL history — `--history` flag for monitor (@clawtom)
Bug Fixes & Refactoring
- Fixed severity comparison bug (was using string matching instead of enum comparison)
- Fixed JSONL history pass count (was using fail_on filter instead of actual counts)
- Extracted shared `_parse_fail_statuses` utility for consistent `fail_on` parsing
- Eliminated redundant config loading in monitor loop
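A shared parser for an option like `fail_on` typically normalizes a comma-separated string into a validated set of statuses, so every command handles whitespace and case identically. The sketch below is a guess at the shape; the function name, statuses, and behavior are hypothetical, not eval-view's actual implementation:

```python
VALID_STATUSES = {"fail", "error", "warn"}  # hypothetical status vocabulary

def parse_fail_statuses(raw: str) -> set[str]:
    """Normalize a comma-separated fail_on value into a status set.

    Centralizing this avoids each command re-implementing (and subtly
    diverging on) whitespace trimming and case folding.
    """
    statuses = {part.strip().lower() for part in raw.split(",") if part.strip()}
    unknown = statuses - VALID_STATUSES
    if unknown:
        raise ValueError(f"unknown fail_on statuses: {sorted(unknown)}")
    return statuses

print(sorted(parse_fail_statuses("Fail, ERROR")))  # normalized and deduplicated
```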
Deployment
# Quick background run
nohup evalview monitor --slack-webhook https://... &
# Docker
docker run -d -v $(pwd):/app -w /app evalview monitor --slack-webhook https://...
Full Changelog: v0.4.1...v0.5.0
v0.4.1
What's New
Mistral Adapter
- Direct Mistral API support via `pip install evalview[mistral]`
- Lazy import — no dependency unless you use it
PII Evaluator
- Opt-in detection for emails, phones, SSNs, credit cards, addresses
- Luhn algorithm validation for credit cards to reduce false positives
- Enable with `checks: { pii: true }` in test YAML
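The Luhn check mentioned above is a standard checksum: double every second digit from the right, subtract 9 from any result above 9, and require the total to be divisible by 10. A self-contained version of the algorithm (illustrative, not eval-view's code):

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum.

    Rejecting digit runs that merely *look* like card numbers is what
    cuts false positives in credit-card PII detection.
    """
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 12:  # too short to be a card number
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # well-known test card: True
print(luhn_valid("4111 1111 1111 1112"))  # checksum fails: False
```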
Multi-Turn HTML Reports
- Mermaid sequence diagrams showing conversation turns with tool calls
- Per-turn query and tool breakdown in the Execution Trace tab
Security
- GitHub Action: replaced `eval $CMD` with bash arrays, moved inputs to env vars
- Mermaid diagrams: fixed autoescape breaking arrows, sanitized user content
README
- New hero section with logo, sequence diagram screenshot, data flow diagram
- "Your data stays local" privacy explanation
- Updated model version examples to Claude 4.5/4.6
Full Changelog: v0.4.0...v0.4.1
v0.4.0 — Multi-turn testing, A/B comparison, Cloud sync
What's new in 0.4.0
Multi-turn conversation testing
Test stateful, multi-step conversations with the new turns: YAML field. Each turn gets the accumulated conversation history injected automatically.
name: flight-booking-conversation
turns:
  - query: "I want to fly from NYC to Paris next Friday"
    expected:
      tools: [search_flights]
  - query: "Book the cheapest economy option"
    expected:
      tools: [book_flight]
      output:
        contains: ["confirmed", "Paris"]
  - query: "Send me a confirmation email"
    expected:
      tools: [send_email]
expected:
  tools: [search_flights, book_flight, send_email]
thresholds:
  min_score: 80
A/B endpoint comparison
Run the same test suite against two endpoints and get a per-test verdict table.
evalview compare \
--v1 http://prod.internal/invoke --label-v1 "gpt-4o (prod)" \
--v2 http://staging.internal/invoke --label-v2 "claude-sonnet (staging)" \
  --tests tests/
Cloud baseline sync
evalview login # OAuth sign-in
evalview snapshot # baselines auto-sync to cloud
evalview check # teammates pull your baselines automatically
Other highlights
- `evalview capture` — HTTP proxy records real agent traffic as test YAMLs
- `evalview install-hooks` — inject regression checks into git pre-push
- Silent model update detection — alerts when provider swaps model behind same API name
- Gradual drift detection — OLS regression over 10-check window
- Semantic diff — `--semantic-diff` scores by meaning, not character similarity
- Auto-open HTML report after every `evalview run`
- `evalview init` now auto-detects your agent endpoint and generates starter tests
- Test quality gating — low-quality generated tests are skipped, not silently polluting scores
- mypy clean — 0 errors across 109 source files
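OLS-based gradual drift detection amounts to fitting a least-squares slope to scores over the last N checks: a consistently negative slope flags slow degradation that no single check would trip. A minimal sketch of the idea (window of 10 per the notes above; not eval-view's actual code):

```python
def drift_slope(scores: list[float]) -> float:
    """Least-squares slope of scores vs. check index (points per check)."""
    n = len(scores)
    x_mean = (n - 1) / 2
    y_mean = sum(scores) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(scores))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

# Ten checks drifting down roughly half a point per run: slope near -0.5,
# even though each individual step looks like noise.
window = [88.0, 87.4, 87.1, 86.6, 86.0, 85.4, 85.2, 84.6, 84.1, 83.5]
print(round(drift_slope(window), 2))
```

In practice one would alert when the slope stays below some threshold (e.g. losing more than a fraction of a point per check) for the whole window.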
Community contributions
- Pydantic field validation for `TestCase` (#54 by @illbeurs)
- Edge tests for `CostEvaluator` and `LatencyEvaluator` (#55 by @illbeurs)
- `health_check()` on `OllamaAdapter` (#57 by @gauravxthakur)
- `ConsoleReporter` docstrings (#56 by @gauravxthakur)
Full changelog: CHANGELOG.md