The open-source MultiAgentOps evaluation harness for any industry scenario.
Detecting Relational Boundary Erosion in AI systems: a framework for testing whether models maintain honest, calibrated, and appropriate boundaries.
A minimal, code-first retrieval observability harness that measures why RAG systems fail to surface relevant evidence, without changing retrieval or generation.
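The kind of diagnosis such a harness performs can be sketched as: given the gold evidence ids for a query and the retriever's ranked output, classify whether the evidence surfaced, was retrieved but ranked below the cutoff, or was never retrieved at all. The function name and return labels below are illustrative assumptions, not the project's actual API.

```python
def diagnose_retrieval(retrieved_ids, gold_ids, k=5):
    """Classify why gold evidence did or did not surface in the top-k.

    Illustrative sketch, not the harness's real interface. Returns one of:
    'surfaced', 'ranked_too_low', or 'not_retrieved'.
    """
    ranks = {doc_id: i for i, doc_id in enumerate(retrieved_ids)}
    gold_ranks = [ranks[g] for g in gold_ids if g in ranks]
    if not gold_ranks:
        return "not_retrieved"    # evidence absent from the candidate pool
    if min(gold_ranks) < k:
        return "surfaced"         # at least one gold doc made the top-k
    return "ranked_too_low"       # retrieved, but below the cutoff
```

Because the check only reads the ranked list, it can run alongside any retriever without changing retrieval or generation.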
frontier-evals-harness is a lightweight framework for benchmarking frontier language models. It provides deterministic suite versioning, modular adapters, standardized scoring, and paired statistical comparisons with confidence intervals. Built for regression tracking and analysis, it enables reproducible evaluation without infrastructure.
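A paired comparison with confidence intervals, as described, can be sketched with a paired bootstrap over per-example score differences: because both models are scored on the same suite items, resampling example indices preserves the pairing. This is a generic sketch of the technique, not frontier-evals-harness's actual code or API.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Paired bootstrap CI for the mean score difference (model A - model B).

    scores_a[i] and scores_b[i] are two models' scores on the same example,
    so we resample per-example differences. Illustrative sketch only.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)  # fixed seed for reproducible comparisons
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    boots = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boots.append(sum(sample) / n)
    boots.sort()
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, (lo, hi)
```

If the interval excludes zero, the score gap is unlikely to be resampling noise; reporting the interval rather than a bare mean is what makes regression tracking trustworthy.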
Controlled experiment isolating reranking as a first-class RAG system boundary, measuring how evidence priority, not recall, changes retrieval outcomes.
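The priority-versus-recall distinction can be made concrete: a reranker reorders a fixed candidate pool, so recall over the pool is unchanged while the composition of the top-k can shift. In this sketch, the pool, the document ids, and the scoring table standing in for a reranking model are all assumed for illustration.

```python
def rerank(candidates, score_fn, k=3):
    """Reorder a fixed candidate pool by a reranker's score and keep the top-k.

    No candidate is added or dropped, so pool-level recall is unchanged;
    only which ids occupy the top-k (evidence priority) changes.
    """
    return sorted(candidates, key=score_fn, reverse=True)[:k]

# Hypothetical pool from a first-stage retriever, in retriever order.
pool = ["doc_7", "doc_2", "doc_9", "doc_4"]
# Assumed reranker relevance scores; higher means more relevant.
scores = {"doc_7": 0.2, "doc_2": 0.9, "doc_9": 0.4, "doc_4": 0.7}
top = rerank(pool, scores.get, k=2)
```

Holding the pool fixed while swapping `score_fn` is what isolates reranking as the experimental variable.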