evals
Here are 148 public repositories matching this topic...
AI Observability & Evaluation (Updated Mar 2, 2026 · Jupyter Notebook)
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks, including CrewAI, Agno, OpenAI Agents SDK, LangChain, AutoGen, AG2, and CamelAI. (Updated Oct 30, 2025 · Python)
Build, evaluate, and optimize AI systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more. (Updated Mar 2, 2026 · Python)
AI observability platform for production LLM and agent systems. (Updated Mar 2, 2026 · Python)
Evaluation and tracking for LLM experiments and AI agents. (Updated Feb 28, 2026 · Python)
Laminar: open-source observability platform purpose-built for AI agents. YC S24. (Updated Mar 2, 2026 · TypeScript)
Open-source, production-ready customer service with built-in evals and monitoring. (Updated Jan 12, 2026 · TypeScript)
🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with DuckDB or PostgreSQL. (Updated Mar 2, 2026 · Python)
Harbor is a framework for running agent evaluations and for creating and using RL environments. (Updated Mar 2, 2026 · Python)
[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding. (Updated Jul 12, 2025 · Jupyter Notebook)
Test generation for prompts. (Updated Mar 1, 2026 · TeX)
Vivaria is METR's tool for running evaluations and conducting agent elicitation research. (Updated Feb 15, 2026 · TypeScript)
Evalica, your favourite evaluation toolkit. (Updated Mar 1, 2026 · Python)
Run coding agents against each other; merge the winner. (Updated Feb 28, 2026 · TypeScript)
Benchmarking large language models for FHIR. (Updated Feb 4, 2026 · TypeScript)