Objective
Test whether the judge scores each dimension of a nuanced, multi-dimensional rubric on its own merits, and whether revisions target the weakest dimensions.
Test Plan
- Task: "Write a technical blog post introduction about LLM-as-judge evaluation patterns"
- Rubric with 5+ dimensions:
  - Technical accuracy (correct terminology, no hallucinated claims)
  - Audience calibration (accessible to senior engineers, not too basic)
  - Hook quality (first sentence grabs attention)
  - Structural clarity (logical flow, paragraph transitions)
  - Actionability (reader knows what they'll learn)
- Include reference context: 2-3 real papers on LLM-as-judge
- Include required concepts: ["calibration", "inter-rater reliability", "rubric design"]
- Run 5 rounds with threshold 0.85
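The plan above could be captured as a single configuration object so the run is reproducible. This is a minimal sketch only: the `JudgeTestConfig` class and all of its field names are hypothetical and do not come from any existing harness, and `reference_context` is left empty because the papers have not been chosen yet.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class JudgeTestConfig:
    """Hypothetical container for the test plan; field names are assumptions."""
    task: str
    rubric: dict[str, str]            # dimension name -> what the judge looks for
    required_concepts: list[str]
    reference_context: list[str] = field(default_factory=list)
    rounds: int = 5                   # from the plan: run 5 rounds
    threshold: float = 0.85           # from the plan: threshold 0.85


CONFIG = JudgeTestConfig(
    task="Write a technical blog post introduction about LLM-as-judge evaluation patterns",
    rubric={
        "technical_accuracy": "correct terminology, no hallucinated claims",
        "audience_calibration": "accessible to senior engineers, not too basic",
        "hook_quality": "first sentence grabs attention",
        "structural_clarity": "logical flow, paragraph transitions",
        "actionability": "reader knows what they'll learn",
    },
    required_concepts=["calibration", "inter-rater reliability", "rubric design"],
    # reference_context stays empty here: the plan calls for 2-3 real papers,
    # to be filled in by whoever runs the test.
)
```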
Success Criteria
- Each dimension scored independently (not all the same score)
- Revisions target the lowest-scoring dimension
- Required concepts appear in final output
- Reference context influences accuracy scoring
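The first three criteria can be checked mechanically once per-round scores are recorded. The sketch below assumes a hypothetical record shape: each round is a dict with `scores` (dimension name to float) and `revised_dimensions` (the dimensions the following revision claimed to target). The fourth criterion is not covered, since it needs a separate run without reference context to compare against.

```python
def check_success(rounds, final_output, required_concepts):
    """Return a pass/fail flag for each mechanically checkable criterion.

    `rounds` is an assumed format: a list of dicts with keys
    "scores" (dict of dimension -> float) and "revised_dimensions" (list of str).
    """
    results = {}

    # Criterion 1: within every round, dimensions must not all share one score.
    results["independent_scores"] = all(
        len(set(r["scores"].values())) > 1 for r in rounds
    )

    # Criterion 2: every revision must include the lowest-scoring dimension.
    # Rounds without a revision are skipped; on ties, the first minimum is used.
    results["targets_lowest"] = all(
        min(r["scores"], key=r["scores"].get) in r["revised_dimensions"]
        for r in rounds
        if r.get("revised_dimensions")
    )

    # Criterion 3: required concepts appear in the final text (case-insensitive).
    text = final_output.lower()
    results["concepts_present"] = all(c.lower() in text for c in required_concepts)

    return results
```

Substring matching is a deliberately loose check for required concepts; a stricter test might also require the judge to confirm each concept is used correctly.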
Acceptance