The open-source MultiAgentOps evaluation harness for any industry scenario.
Detecting Relational Boundary Erosion in AI systems: a framework for testing whether models maintain honest, calibrated, and appropriate boundaries.
A minimal, code-first retrieval observability harness that measures why RAG systems fail to surface relevant evidence, without changing retrieval or generation.
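The kind of diagnosis such a harness performs can be sketched as: given the gold evidence ids for a query and the retriever's ranked output, classify whether the evidence surfaced, was retrieved but ranked below the cutoff, or was never retrieved at all. The function name and return labels below are illustrative assumptions, not the project's actual API.

```python
def diagnose_retrieval(retrieved_ids, gold_ids, k=5):
    """Classify why gold evidence did or did not surface in the top-k.

    Illustrative sketch, not the harness's real interface. Returns one of:
    'surfaced', 'ranked_too_low', or 'not_retrieved'.
    """
    ranks = {doc_id: i for i, doc_id in enumerate(retrieved_ids)}
    gold_ranks = [ranks[g] for g in gold_ids if g in ranks]
    if not gold_ranks:
        return "not_retrieved"    # evidence absent from the candidate pool
    if min(gold_ranks) < k:
        return "surfaced"         # at least one gold doc made the top-k
    return "ranked_too_low"       # retrieved, but below the cutoff
```

Because the check only reads the ranked list, it can run alongside any retriever without changing retrieval or generation.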
frontier-evals-harness is a lightweight framework for benchmarking frontier language models. It provides deterministic suite versioning, modular adapters, standardized scoring, and paired statistical comparisons with confidence intervals. Built for regression tracking and analysis, it enables reproducible evaluation without infrastructure.
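A paired comparison with confidence intervals, as described, can be sketched with a paired bootstrap over per-example score differences: because both models are scored on the same suite items, resampling example indices preserves the pairing. This is a generic sketch of the technique, not frontier-evals-harness's actual code or API.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Paired bootstrap CI for the mean score difference (model A - model B).

    scores_a[i] and scores_b[i] are two models' scores on the same example,
    so we resample per-example differences. Illustrative sketch only.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)  # fixed seed for reproducible comparisons
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    boots = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boots.append(sum(sample) / n)
    boots.sort()
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, (lo, hi)
```

If the interval excludes zero, the score gap is unlikely to be resampling noise; reporting the interval rather than a bare mean is what makes regression tracking trustworthy.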
Controlled experiment isolating reranking as a first-class RAG system boundary, measuring how evidence priority, not recall, changes retrieval outcomes.
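The priority-versus-recall distinction can be made concrete: a reranker reorders a fixed candidate pool, so recall over the pool is unchanged while the composition of the top-k can shift. In this sketch, the pool, the document ids, and the scoring table standing in for a reranking model are all assumed for illustration.

```python
def rerank(candidates, score_fn, k=3):
    """Reorder a fixed candidate pool by a reranker's score and keep the top-k.

    No candidate is added or dropped, so pool-level recall is unchanged;
    only which ids occupy the top-k (evidence priority) changes.
    """
    return sorted(candidates, key=score_fn, reverse=True)[:k]

# Hypothetical pool from a first-stage retriever, in retriever order.
pool = ["doc_7", "doc_2", "doc_9", "doc_4"]
# Assumed reranker relevance scores; higher means more relevant.
scores = {"doc_7": 0.2, "doc_2": 0.9, "doc_9": 0.4, "doc_4": 0.7}
top = rerank(pool, scores.get, k=2)
```

Holding the pool fixed while swapping `score_fn` is what isolates reranking as the experimental variable.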