diff --git a/openhands/usage/cli/critic.mdx b/openhands/usage/cli/critic.mdx
index fbf1405f..59dcf9df 100644
--- a/openhands/usage/cli/critic.mdx
+++ b/openhands/usage/cli/critic.mdx
@@ -16,7 +16,11 @@ For detailed information about the critic feature, including programmatic access
 
 ## What is the Critic?
 
-The critic is an LLM-based evaluator that analyzes agent actions and conversation history to predict the quality or success probability of agent decisions. It provides:
+The critic is an LLM-based evaluator that analyzes agent actions and conversation history to predict the quality or success probability of agent decisions (see our technical report: [A Rubric-Supervised Critic from Sparse Real-World Outcomes](https://arxiv.org/abs/2603.03800) for detailed methodology).
+
+It provides:
+
+It provides:
 
 - **Quality scores**: Probability scores between 0.0 and 1.0 indicating predicted success
 - **Real-time feedback**: Scores computed during agent execution, not just at completion
diff --git a/sdk/guides/critic.mdx b/sdk/guides/critic.mdx
index 9fbffccf..a78bc78e 100644
--- a/sdk/guides/critic.mdx
+++ b/sdk/guides/critic.mdx
@@ -21,7 +21,7 @@ A **critic** is an evaluator that analyzes agent actions and conversation histor
 
 You can use critic scores to build automated workflows, such as triggering the agent to reflect on and fix its previous solution when the critic indicates poor task performance.
 
-This critic is a more advanced extension of the approach described in our blog post [SOTA on SWE-Bench Verified with Inference-Time Scaling and Critic Model](https://openhands.dev/blog/sota-on-swe-bench-verified-with-inference-time-scaling-and-critic-model). A technical report with detailed evaluation metrics is forthcoming.
+This critic is a more advanced extension of the approach described in our blog post [SOTA on SWE-Bench Verified with Inference-Time Scaling and Critic Model](https://openhands.dev/blog/sota-on-swe-bench-verified-with-inference-time-scaling-and-critic-model). For detailed evaluation metrics and methodology, see our technical report: [A Rubric-Supervised Critic from Sparse Real-World Outcomes](https://arxiv.org/abs/2603.03800).
 
 ## Quick Start