Level 3 — Rubric Depth: Multi-dimensional complex evaluation

## Objective

Test whether the judge scores each dimension of a nuanced, multi-dimensional rubric on its own merits, and whether revisions target the weakest dimensions.

## Test Plan

* Task: "Write a technical blog post introduction about LLM-as-judge evaluation patterns"
* Rubric with 5 dimensions:
  1. Technical accuracy (correct terminology, no hallucinated claims)
  2. Audience calibration (accessible to senior engineers, not too basic)
  3. Hook quality (first sentence grabs attention)
  4. Structural clarity (logical flow, paragraph transitions)
  5. Actionability (reader knows what they'll learn)
* Include reference context: 2-3 real papers on LLM-as-judge
* Include required concepts: ["calibration", "inter-rater reliability", "rubric design"]
* Run 5 rounds with threshold 0.85
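
The plan above could be captured as a small config object so the run is reproducible. This is only a sketch: `RubricTestConfig`, `CONFIG`, `meets_threshold`, and the snake_case dimension keys are hypothetical names, and treating the 0.85 threshold as a floor on the unweighted mean of the dimension scores is an assumption, since the issue does not say how the threshold is applied. The reference papers are left out because the issue does not name them.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RubricTestConfig:
    """Hypothetical container for the Level 3 test parameters."""
    task: str
    dimensions: tuple[str, ...]
    required_concepts: tuple[str, ...]
    max_rounds: int = 5
    threshold: float = 0.85


CONFIG = RubricTestConfig(
    task="Write a technical blog post introduction about LLM-as-judge evaluation patterns",
    dimensions=(
        "technical_accuracy",
        "audience_calibration",
        "hook_quality",
        "structural_clarity",
        "actionability",
    ),
    required_concepts=("calibration", "inter-rater reliability", "rubric design"),
)


def meets_threshold(scores: dict[str, float], config: RubricTestConfig) -> bool:
    # Assumption: the threshold applies to the unweighted mean of all dimensions.
    return mean(scores[d] for d in config.dimensions) >= config.threshold
```

If the judge instead requires every dimension to clear 0.85, `meets_threshold` would use `min` rather than `mean`.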

## Success Criteria

* Each dimension scored independently (not all the same score)
* Revisions target the lowest-scoring dimension
* Required concepts appear in final output
* Reference context influences accuracy scoring
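
Two of these criteria reduce to simple checks, sketched below. `revision_target` and `missing_concepts` are hypothetical helpers, not part of any existing judge API. The sketch assumes scores arrive as a dimension-to-score mapping, that ties go to the first dimension in mapping order, and that a concept counts as present on a case-insensitive substring match.

```python
def revision_target(scores: dict[str, float]) -> str:
    """Return the lowest-scoring dimension, which the next revision should target."""
    # Assumption: on a tie, the first dimension in dict order wins.
    return min(scores, key=scores.get)


def missing_concepts(text: str, required: list[str]) -> list[str]:
    """Return the required concepts that do not appear in the text."""
    lowered = text.lower()
    # Assumption: case-insensitive substring match is a good enough presence test.
    return [concept for concept in required if concept.lower() not in lowered]
```

For example, with scores of 0.8, 0.7, and 0.5 for accuracy, audience, and hook, `revision_target` returns the hook dimension, so the revision prompt should focus on the opening sentence.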

## Acceptance

- [ ] All 5 dimensions scored per round
- [ ] Dimension scores diverge (not uniform)
- [ ] Weakest dimension improves across rounds
- [ ] Final output contains all required concepts
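
The first three checkboxes could be verified mechanically from the per-round scores. The following is a sketch under stated assumptions: `check_rounds` and `EXPECTED_DIMENSIONS` are hypothetical names, "diverge" is read as "not every dimension has the same score within a round", and "improves" is read as "the dimension that was weakest in round 1 scores higher in the final round". The fourth checkbox is a text check on the final output and is not covered here.

```python
EXPECTED_DIMENSIONS = {
    "technical_accuracy",
    "audience_calibration",
    "hook_quality",
    "structural_clarity",
    "actionability",
}


def check_rounds(rounds: list[dict[str, float]]) -> dict[str, bool]:
    """Evaluate the score-based acceptance items over a run's round history."""
    if not rounds:
        raise ValueError("at least one round of scores is required")

    # Every round must report exactly the five rubric dimensions.
    all_scored = all(set(r) == EXPECTED_DIMENSIONS for r in rounds)

    # Assumption: "diverge" means no round has identical scores on every dimension.
    scores_diverge = all(len(set(r.values())) > 1 for r in rounds)

    # Assumption: "improves" compares the round-1 weakest dimension
    # against its own score in the final round.
    first, last = rounds[0], rounds[-1]
    weakest = min(first, key=first.get)
    weakest_improves = last.get(weakest, float("-inf")) > first[weakest]

    return {
        "all_dimensions_scored": all_scored,
        "scores_diverge": scores_diverge,
        "weakest_improves": weakest_improves,
    }
```

A stricter reading of the third item would require the weakest dimension of each round to improve in the following round; the sketch only compares the endpoints.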
