Level 3 — Rubric Depth: Multi-dimensional complex evaluation #71

@cirdan-greyhaven

Objective

Test whether the judge scores each dimension of a nuanced, multi-dimensional rubric on its own merits, and whether each revision targets the weakest dimensions.

Test Plan

  • Task: "Write a technical blog post introduction about LLM-as-judge evaluation patterns"
  • Rubric with 5 dimensions:
    1. Technical accuracy (correct terminology, no hallucinated claims)
    2. Audience calibration (accessible to senior engineers, not too basic)
    3. Hook quality (first sentence grabs attention)
    4. Structural clarity (logical flow, paragraph transitions)
    5. Actionability (reader knows what they'll learn)
  • Include reference context: 2-3 real papers on LLM-as-judge
  • Include required concepts: ["calibration", "inter-rater reliability", "rubric design"]
  • Run 5 rounds with threshold 0.85
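
For concreteness, the plan above could be captured as a small config object. This is only a sketch: `RubricDimension`, `JudgeTestConfig`, and every field name are hypothetical and not taken from any existing harness. The reference papers are deliberately left empty for the tester to supply, and how the 0.85 threshold is applied (per dimension or to an aggregate) is left open here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RubricDimension:
    """One rubric dimension (hypothetical structure, for illustration only)."""
    name: str
    description: str


@dataclass
class JudgeTestConfig:
    """Hypothetical container for the Level 3 test plan (assumes Python 3.9+)."""
    task: str
    dimensions: list[RubricDimension]
    required_concepts: list[str]
    # The tester fills this with 2-3 real LLM-as-judge papers; none are invented here.
    reference_context: list[str] = field(default_factory=list)
    max_rounds: int = 5
    threshold: float = 0.85


CONFIG = JudgeTestConfig(
    task="Write a technical blog post introduction about LLM-as-judge evaluation patterns",
    dimensions=[
        RubricDimension("technical_accuracy", "Correct terminology, no hallucinated claims"),
        RubricDimension("audience_calibration", "Accessible to senior engineers, not too basic"),
        RubricDimension("hook_quality", "First sentence grabs attention"),
        RubricDimension("structural_clarity", "Logical flow, paragraph transitions"),
        RubricDimension("actionability", "Reader knows what they'll learn"),
    ],
    required_concepts=["calibration", "inter-rater reliability", "rubric design"],
)
```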

Success Criteria

  • Each dimension scored independently (not all the same score)
  • Revisions target the lowest-scoring dimension
  • Required concepts appear in final output
  • Reference context influences accuracy scoring
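
One way to make "revisions target the lowest-scoring dimension" checkable is to derive the revision target directly from the per-dimension scores. The sketch below assumes the judge returns a mapping from dimension name to a score in [0, 1]. The function names and the wording of the instruction are hypothetical, and the scores in the usage example are made up for illustration.

```python
def weakest_dimension(scores: dict[str, float]) -> str:
    """Return the lowest-scoring dimension (ties resolve to the first one listed)."""
    if not scores:
        raise ValueError("scores must contain at least one dimension")
    return min(scores, key=scores.get)


def revision_instruction(scores: dict[str, float], threshold: float = 0.85) -> str:
    """Build a revision prompt that names the weakest dimension explicitly."""
    target = weakest_dimension(scores)
    return (
        f"Revise the draft to improve '{target}' "
        f"(scored {scores[target]:.2f}, target {threshold:.2f}); "
        "keep the other dimensions intact."
    )


# Illustrative scores only, not real judge output.
example_scores = {
    "technical_accuracy": 0.90,
    "audience_calibration": 0.80,
    "hook_quality": 0.55,
    "structural_clarity": 0.75,
    "actionability": 0.70,
}
print(revision_instruction(example_scores))
```

Naming the target dimension in the revision prompt also leaves a trace in the logs, so a reviewer can confirm after the run that each round aimed at the dimension that actually scored lowest.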

Acceptance

  • All 5 dimensions scored per round
  • Dimension scores diverge (not uniform)
  • Weakest dimension improves across rounds
  • Final output contains all required concepts
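
The four acceptance items can be sketched as automated checks over the recorded rounds. This is a minimal sketch under stated assumptions: `check_acceptance` and its argument shapes are hypothetical, "diverge" is read as "not all scores equal within any round", "weakest dimension improves" is read as "the dimension that was lowest in round 1 scores higher in the final round", and concept presence is a case-insensitive substring match. Other readings are possible.

```python
def check_acceptance(
    rounds: list[dict[str, float]],
    final_output: str,
    dimensions: list[str],
    required_concepts: list[str],
) -> dict[str, bool]:
    """Evaluate the four acceptance items; `rounds` holds one score dict per round, in order."""
    if not rounds:
        raise ValueError("at least one round of scores is required")

    first, last = rounds[0], rounds[-1]
    # Dimension that scored lowest in the first round.
    first_weakest = min(first, key=first.get)
    text = final_output.lower()

    return {
        # Every round reports exactly the expected dimensions.
        "all_dimensions_scored": all(set(r) == set(dimensions) for r in rounds),
        # No round has identical scores across all dimensions.
        "scores_diverge": all(len(set(r.values())) > 1 for r in rounds),
        # The initially weakest dimension ends higher than it started.
        "weakest_improves": last.get(first_weakest, float("-inf")) > first[first_weakest],
        # Every required concept appears in the final text.
        "concepts_present": all(c.lower() in text for c in required_concepts),
    }
```

A run would pass only if every value in the returned dict is `True`, which also makes it easy to report which specific item failed.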
