llm-as-evaluator

Star

Here are 14 public repositories matching this topic...

prometheus-eval / prometheus-eval

Star

Evaluate your LLM's response with Prometheus and GPT4 💯

python evaluation gpt4 llm llmops vllm litellm llm-as-a-judge llm-as-evaluator

Updated Apr 25, 2025
Python

Pacific-AI-Corp / langtest

Star

Deliver safe & effective language models

nlp artificial-intelligence benchmarks benchmark-framework model-assessment ai-safety mlops responsible-ai ml-safety trustworthy-ai ethics-in-ai ml-testing large-language-models llm ai-testing llm-test llm-evaluation-toolkit llm-as-evaluator llm-testing

Updated Oct 25, 2025
Python

IAAR-Shanghai / xFinder

Star

[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation

Updated Feb 26, 2025
Python

KID-22 / LLM-IR-Bias-Fairness-Survey

Star

This is the repo for the survey of Bias and Fairness in IR with LLMs.

information-retrieval recommender-systems bias ir fairness large-language-models llm chatgpt llm4rec llm4rs llm-as-a-judge llm-as-evaluator llm4ir

Updated Sep 4, 2025

zhaochen0110 / Timo

Star

Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)

temporal-reasoning sota-model llms rlhf rlaif llm-as-a-judge llm-as-evaluator self-critic-framework colm2024

Updated Oct 23, 2024
Python

minnesotanlp / cobbler

Star

Code and data for Koo et al's ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

nlp evaluation bias bias-detection llm llms llm-evaluation llms-benchmarking llm-as-judge llm-as-a-judge llm-as-evaluator

Updated Feb 16, 2024
Jupyter Notebook

HillPhelmuth / LlmAsJudgeEvalPlugins

Star

LLM-as-judge evals as Semantic Kernel Plugins

semantickernel llm-evaluation llm-as-a-judge llm-as-evaluator

Updated Aug 22, 2025
C#

djokester / groqeval

Star

Use groq for evaluations

groq llm generative-ai mixtral llm-as-a-judge llm-as-evaluator llama3

Updated Jul 11, 2024
Python

Kakz / prometheus-llm

Star

PrometheusLLM is a unique transformer architecture inspired by dignity and recursion. This project aims to explore new frontiers in AI research and welcomes contributions from the community. 🐙🌟

deep-learning mcp evaluation pipelines tracing language-model self-organization cognitive-architecture hermeneutics philosophy-of-mind gpt4 llm llmops ollama litellm llm-as-evaluator autopoietic-systems prompt-logging

Updated Oct 30, 2025
Python

Evaluation system for computer-use agents that uses LLMs to assess agent performance on web browsing and interaction tasks. This judge system reads screenshots, agent trajectories, and final results to provide detailed scoring and feedback.

computer-vision evaluation evaluation-metrics performance-testing evaluation-framework web-agent llm-as-judge llm-as-a-judge llm-as-evaluator computer-use