Enterprise-grade evaluation framework for AI/ML models with versioned test suites, deterministic scoring, and CI/CD integration for regression testing.
The Toolkit Eval Harness is a lightweight evaluation tool for AI/ML engineers: version "golden" task suites, score predictions with deterministic metrics, and produce CI-friendly JSON reports with regression comparisons. It complements Neural Forge model lifecycle gates and provides a solid foundation for model evaluation in production environments.
- Versioned Test Suites: Immutable, version-controlled evaluation datasets
- Deterministic Scoring: Consistent, reproducible evaluation metrics
- Suite Packs: Compressed, hash-verified test suite packages
- Digital Signatures: Optional cryptographic signing for integrity verification
- Multiple Scoring Methods: Exact match, JSON schema validation, plugin scorers
- Flexible Test Cases: Support for various input/output formats
- Tagging System: Organize tests by category, difficulty, or use case
- Plugin System: Extend with custom scorers via entry points or programmatic registration
- CI/CD Friendly: JSON reports with exit codes for pipeline integration
- Regression Detection: Automated comparison against baseline evaluations
- Multiple Output Formats: JSON, table, and CSV output for different workflows
- Execution Metadata: Timing, tool version, and platform info in reports
- Package Signing: Ed25519 cryptographic signatures for integrity
- Hash Verification: SHA-256 checksums for all packages
- Zip Path Traversal Protection: Safe extraction with member path validation
- Health Checks: Built-in dependency and environment verification
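The hash-verification and path-traversal protections above can be sketched in plain Python. This is a minimal illustration of the two techniques, not the harness's actual implementation; the function names are my own:

```python
import hashlib
import zipfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute a file's SHA-256 checksum in streaming fashion."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def safe_extract(zip_path: Path, dest: Path) -> None:
    """Extract a zip archive, rejecting any member whose resolved
    path would escape dest (zip path traversal, a.k.a. "zip slip")."""
    dest = dest.resolve()
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            target = (dest / member).resolve()
            if not target.is_relative_to(dest):  # Python 3.9+
                raise ValueError(f"unsafe member path: {member}")
        zf.extractall(dest)
```

Validating every member path before extraction is the standard defense: a member like `../evil.txt` resolves outside the destination directory and is rejected before any file is written.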
# Install from source
git clone https://github.com/AKIVA-AI/toolkit-eval-harness.git
cd toolkit-eval-harness
pip install -e ".[dev]"
# Install with signing support
pip install -e ".[signing]"
# Install in production
pip install toolkit-eval-harness

# 1. Create a test suite pack
toolkit-eval pack create --suite-dir examples/suite --out packs/suite.zip
# 2. Verify pack integrity
toolkit-eval pack verify --suite packs/suite.zip
# 3. Run evaluation
toolkit-eval run --suite packs/suite.zip --predictions examples/preds.jsonl --out report.json
# 4. Compare with baseline (CI gating)
toolkit-eval compare --baseline baseline.json --candidate report.json

- `pack create` - Create a suite pack from a directory
- `pack verify` - Verify pack integrity (hashes)
- `pack sign` - Sign suite packs (optional)
- `pack verify-signature` - Verify pack signatures
- `pack inspect` - Show pack metadata
- `run` - Run evaluation against predictions
- `compare` - Compare a candidate report to a baseline (CI gating)
- `validate-report` - Validate a report JSON file
- `check-deps` - Health check and environment verification
- `keygen` - Generate signing keys
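As a rough sketch of what deterministic exact-match scoring over a JSONL predictions file (like `examples/preds.jsonl` above) involves: the `id`/`output` field names and the report shape below are illustrative assumptions, not the harness's documented schema.

```python
import json
from pathlib import Path


def exact_match(expected: str, predicted: str) -> float:
    """Deterministic scorer: 1.0 on exact string equality, else 0.0."""
    return 1.0 if expected == predicted else 0.0


def score_predictions(expected_by_id: dict, preds_path: Path) -> dict:
    """Score a JSONL predictions file (one JSON object per line)
    against expected outputs keyed by test-case id."""
    per_case = {}
    with preds_path.open() as fh:
        for line in fh:
            rec = json.loads(line)
            per_case[rec["id"]] = exact_match(expected_by_id[rec["id"]], rec["output"])
    accuracy = sum(per_case.values()) / len(per_case) if per_case else 0.0
    return {"per_case": per_case, "accuracy": accuracy}
```

A custom scorer like `exact_match` would be wired in through the plugin system (entry points or programmatic registration); the exact entry-point group name is defined by the project, not shown here.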
- `0` - Success
- `1` - General error
- `4` - Regression detected (for `compare` command)
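In CI, the `compare` step can be gated on these exit codes. A hedged Python sketch follows; the helper names are mine, and the subprocess call assumes `toolkit-eval` is installed on `PATH`:

```python
import subprocess

REGRESSION_EXIT = 4  # "regression detected", per the exit-code table above


def gate(returncode: int) -> str:
    """Map a `toolkit-eval compare` exit code to a CI decision."""
    if returncode == 0:
        return "pass"
    if returncode == REGRESSION_EXIT:
        return "regression"
    return "error"


def run_compare(baseline: str, candidate: str) -> str:
    """Invoke the CLI and translate its exit code (assumes CLI on PATH)."""
    result = subprocess.run(
        ["toolkit-eval", "compare", "--baseline", baseline, "--candidate", candidate]
    )
    return gate(result.returncode)
```

Distinguishing regressions (code 4) from general errors (code 1) lets a pipeline fail a merge gate for regressions while surfacing harness misconfiguration separately.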
MIT License - see LICENSE file for details.