Enterprise-grade evaluation framework for AI/ML models with versioned test suites, deterministic scoring, and CI/CD integration for regression testing.
The Toolkit Eval Harness is a lightweight evaluation tool for AI/ML engineers: version "golden" task suites, score predictions with deterministic metrics, and produce CI-friendly JSON reports with regression comparisons. It complements Neural Forge model lifecycle gates and provides a solid foundation for model evaluation in production environments.
- Versioned Test Suites: Immutable, version-controlled evaluation datasets
- Deterministic Scoring: Consistent, reproducible evaluation metrics
- Suite Packs: Compressed, hash-verified test suite packages
- Digital Signatures: Optional cryptographic signing for integrity verification
- Multiple Scoring Methods: Exact match, JSON schema validation, plugin scorers
- Flexible Test Cases: Support for various input/output formats
- Tagging System: Organize tests by category, difficulty, or use case
- Plugin System: Extend with custom scorers via entry points or programmatic registration
- CI/CD Friendly: JSON reports with exit codes for pipeline integration
- Regression Detection: Automated comparison against baseline evaluations
- Multiple Output Formats: JSON, table, and CSV output for different workflows
- Execution Metadata: Timing, tool version, and platform info in reports
- Package Signing: Ed25519 cryptographic signatures for integrity
- Hash Verification: SHA-256 checksums for all packages
- Zip Path Traversal Protection: Safe extraction with member path validation
- Health Checks: Built-in dependency and environment verification
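The hash-verification and path-traversal protections above can be sketched in plain Python. This is a minimal illustration of the two techniques, not the harness's actual implementation; the function names are my own:

```python
import hashlib
import zipfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute a file's SHA-256 checksum in streaming fashion."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def safe_extract(zip_path: Path, dest: Path) -> None:
    """Extract a zip archive, rejecting any member whose resolved
    path would escape dest (zip path traversal, a.k.a. "zip slip")."""
    dest = dest.resolve()
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            target = (dest / member).resolve()
            if not target.is_relative_to(dest):  # Python 3.9+
                raise ValueError(f"unsafe member path: {member}")
        zf.extractall(dest)
```

Validating every member path before extraction is the standard defense: a member like `../evil.txt` resolves outside the destination directory and is rejected before any file is written.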
# Install from source
git clone https://github.com/AKIVA-AI/toolkit-eval-harness.git
cd toolkit-eval-harness
pip install -e ".[dev]"
# Install with signing support
pip install -e ".[signing]"
# Install in production
pip install toolkit-eval-harness

# 1. Create a test suite pack
toolkit-eval pack create --suite-dir examples/suite --out packs/suite.zip
# 2. Verify pack integrity
toolkit-eval pack verify --suite packs/suite.zip
# 3. Run evaluation
toolkit-eval run --suite packs/suite.zip --predictions examples/preds.jsonl --out report.json
# 4. Compare with baseline (CI gating)
toolkit-eval compare --baseline baseline.json --candidate report.json

- `pack create` - Create a suite pack from a directory
- `pack verify` - Verify pack integrity (hashes)
- `pack sign` - Sign suite packs (optional)
- `pack verify-signature` - Verify pack signatures
- `pack inspect` - Show pack metadata
- `run` - Run evaluation against predictions
- `compare` - Compare a candidate report to a baseline (CI gating)
- `validate-report` - Validate a report JSON file
- `check-deps` - Health check and environment verification
- `keygen` - Generate signing keys
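As a rough sketch of what deterministic exact-match scoring over a JSONL predictions file (like `examples/preds.jsonl` above) involves: the `id`/`output` field names and the report shape below are illustrative assumptions, not the harness's documented schema.

```python
import json
from pathlib import Path


def exact_match(expected: str, predicted: str) -> float:
    """Deterministic scorer: 1.0 on exact string equality, else 0.0."""
    return 1.0 if expected == predicted else 0.0


def score_predictions(expected_by_id: dict, preds_path: Path) -> dict:
    """Score a JSONL predictions file (one JSON object per line)
    against expected outputs keyed by test-case id."""
    per_case = {}
    with preds_path.open() as fh:
        for line in fh:
            rec = json.loads(line)
            per_case[rec["id"]] = exact_match(expected_by_id[rec["id"]], rec["output"])
    accuracy = sum(per_case.values()) / len(per_case) if per_case else 0.0
    return {"per_case": per_case, "accuracy": accuracy}
```

A custom scorer like `exact_match` would be wired in through the plugin system (entry points or programmatic registration); the exact entry-point group name is defined by the project, not shown here.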
- `0` - Success
- `1` - General error
- `4` - Regression detected (for `compare` command)
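In CI, the `compare` step can be gated on these exit codes. A hedged Python sketch follows; the helper names are mine, and the subprocess call assumes `toolkit-eval` is installed on `PATH`:

```python
import subprocess

REGRESSION_EXIT = 4  # "regression detected", per the exit-code table above


def gate(returncode: int) -> str:
    """Map a `toolkit-eval compare` exit code to a CI decision."""
    if returncode == 0:
        return "pass"
    if returncode == REGRESSION_EXIT:
        return "regression"
    return "error"


def run_compare(baseline: str, candidate: str) -> str:
    """Invoke the CLI and translate its exit code (assumes CLI on PATH)."""
    result = subprocess.run(
        ["toolkit-eval", "compare", "--baseline", baseline, "--candidate", candidate]
    )
    return gate(result.returncode)
```

Distinguishing regressions (code 4) from general errors (code 1) lets a pipeline fail a merge gate for regressions while surfacing harness misconfiguration separately.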
MIT License - see LICENSE file for details.