Sample code for experiments on using LLM-as-a-judge techniques to monitor the quality of narrative feedback in medical education.
Holcomb MJ, Dalton TO, Jamieson AR. Beyond Single-Loop Feedback: A Self-Validating LLM System for Medical Student Documentation Assessment. Presentation at: Learn Serve Lead 2025: The AAMC Annual Meeting; November 1-5, 2025; San Antonio, TX.
- Michael J. Holcomb, MS¹
- Thomas O. Dalton, MD²
- Andrew R. Jamieson, PhD¹

¹Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, TX
²Department of Internal Medicine, University of Texas Southwestern Medical Center, Dallas, TX
This repository contains:
- Judge evaluation system: A Python script that uses Claude to evaluate feedback quality across six dimensions (a sketch of the call pattern follows this list)
- Prompts: Templates for generating and evaluating narrative feedback
- Examples: Sample clinical cases, checklists, student notes, and feedback
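At its core, the judge fills the evaluation prompt with the checklist, the student note, and the feedback under review, then asks Claude to score and analyze each dimension. A minimal sketch of that call pattern is below; the `judge()` helper and the `{checklist}`/`{student_note}`/`{feedback}` placeholders are illustrative assumptions, not the exact interface of `judge_feedback.py` or `prompts/judge/judge.md`.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(template: str, checklist: str, student_note: str, feedback: str) -> str:
    # Fill the judge prompt template with the three inputs (placeholder names are assumed).
    prompt = template.format(checklist=checklist, student_note=student_note, feedback=feedback)
    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    # Raw judge response: per-dimension scores and analyses, later parsed into JSON.
    return response.content[0].text
```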
# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install Python dependencies
pip install -r requirements.txt
# Set your Anthropic API key (choose one method)
# Option A: Using environment variables (temporary, current session only)
export ANTHROPIC_API_KEY='your-api-key-here'
# Option B: Using .env file (recommended, persistent)
echo "ANTHROPIC_API_KEY=your-api-key-here" > .env# Basic usage (outputs to output.json by default)
# Basic usage (outputs to output.json by default)
python judge_feedback.py \
examples/checklist.md \
examples/student_note1.md \
examples/feedback/student_note1_simple.md
# Specify custom output path
python judge_feedback.py \
examples/checklist.md \
examples/student_note2.md \
examples/feedback/student_note2_mega.md \
-o evaluation_results.json
# Use a different Claude model
python judge_feedback.py \
examples/checklist.md \
examples/student_note1.md \
examples/feedback/student_note1_simple.md \
--model claude-3-opus-20240229

The script provides:
- Beautiful terminal output with color-coded scores and summaries (a rendering sketch follows this list)
- JSON output file with complete metadata, parsed scores, analyses, and raw response
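As a rough illustration of the terminal output, the sketch below renders one row per quality dimension with `rich` (a listed dependency); the table layout and color thresholds are assumptions, not the script's exact formatting.

```python
from rich.console import Console
from rich.table import Table

def print_scores(quality_dimensions: dict) -> None:
    # One row per dimension, colored green (4-5), yellow (3), or red (1-2).
    table = Table(title="Feedback Quality Scores")
    table.add_column("Dimension")
    table.add_column("Score", justify="right")
    for name, result in quality_dimensions.items():
        score = result["score"]
        color = "green" if score >= 4 else "yellow" if score == 3 else "red"
        table.add_row(name, f"[{color}]{score}/5[/{color}]")
    Console().print(table)
```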
The judge evaluates feedback across six dimensions (1-5 scale each):
- Polarity (Sentiment): Does the feedback's tone match the student's actual performance level? This dimension records whether the feedback is positive, negative, or mixed; it is NOT a quality metric, since feedback should be negative when the student performed poorly.
- Factuality (Clinical Accuracy): Is the feedback medically accurate?
- Evidence (Groundedness): Does feedback align with actual note content?
- Actionability (Guidance Quality): Can students understand what actions to take?
- Connectivity (Linking to Performance): Are suggestions logically connected to observations?
- Developmental Alignment: Is feedback appropriate for MS2 students transitioning to clerkship?
The Total Quality Score is calculated by summing the five quality dimensions (maximum 25 points).
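A minimal sketch of that arithmetic, assuming the parsed judge output follows the JSON structure shown later in this README (the function is illustrative, not the script's implementation):

```python
QUALITY_DIMENSIONS = [
    "Factuality", "Evidence", "Actionability", "Connectivity", "Developmental Alignment",
]

def summarize_quality(evaluation: dict) -> dict:
    # Polarity is tracked separately and intentionally excluded from the total.
    total = sum(evaluation["quality_dimensions"][d]["score"] for d in QUALITY_DIMENSIONS)
    max_score = 5 * len(QUALITY_DIMENSIONS)  # 5 dimensions x 5 points = 25
    return {
        "total_quality_score": total,
        "max_quality_score": max_score,
        "quality_percentage": round(100 * total / max_score, 1),
    }
```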
lsl-beyond-single-loop-feedback/
├── judge_feedback.py # Main evaluation script
├── requirements.txt # Python dependencies
├── README.md # This file
├── LICENSE # Academic research license
├── .gitignore # Git ignore patterns
│
├── prompts/
│ ├── feedback/
│ │ ├── simple_feedback.md # Concise feedback generation prompt
│ │ └── mega_feedback.md # Detailed feedback with reflection prompt
│ └── judge/
│ └── judge.md # LLM-as-judge evaluation prompt
│
├── resources/
│ └── screenshot.png # Terminal output screenshot
│
└── examples/
├── case.md # OSCE case: Acute Appendicitis
├── checklist.md # Evaluation checklist
├── student_note1.md # Strong performance example
├── student_note2.md # Weak performance example
└── feedback/
├── student_note1_simple.md # Simple feedback for strong student
├── student_note1_mega.md # Detailed feedback for strong student
├── student_note2_simple.md # Simple feedback for weak student
└── student_note2_mega.md # Detailed feedback for weak student
The output JSON file contains structured results with polarity tracked separately from quality dimensions:
{
"metadata": {
"timestamp": "2025-11-02T10:30:00.123456",
"model": "claude-sonnet-4-5-20250929",
"inputs": {
"checklist_path": "/path/to/checklist.md",
"student_note_path": "/path/to/student_note.md",
"feedback_path": "/path/to/feedback.md"
}
},
"evaluation": {
"polarity": {
"score": 4,
"analysis": "The feedback consistently emphasizes strengths..."
},
"quality_dimensions": {
"Factuality": {
"score": 5,
"analysis": "The feedback is clinically accurate throughout..."
},
"Evidence": {
"score": 4,
"analysis": "The feedback accurately references specific elements..."
},
"Actionability": {
"score": 4,
"analysis": "The feedback provides specific, implementable guidance..."
},
"Connectivity": {
"score": 4,
"analysis": "The feedback effectively connects observations..."
},
"Developmental Alignment": {
"score": 5,
"analysis": "The feedback appropriately targets MS2-to-clerkship..."
}
},
"total_quality_score": 22,
"max_quality_score": 25,
"quality_percentage": 88.0
},
"raw_response": "Full text response from Claude..."
}

usage: judge_feedback.py [-h] [-o OUTPUT] [-m MODEL] [--judge-prompt JUDGE_PROMPT]
checklist student_note feedback
Evaluate narrative feedback quality using Claude as a judge.
positional arguments:
checklist Path to the checklist file (markdown format)
student_note Path to the student note file (markdown format)
feedback Path to the feedback file to evaluate (plain text)
optional arguments:
-h, --help show this help message and exit
-o OUTPUT, --output OUTPUT
Output JSON file path (default: output.json)
-m MODEL, --model MODEL
Claude model to use (default: claude-3-5-sonnet-20241022)
--judge-prompt JUDGE_PROMPT
Path to judge prompt template (default: prompts/judge/judge.md)
# Evaluate concise feedback
python judge_feedback.py \
examples/checklist.md \
examples/student_note1.md \
examples/feedback/student_note1_simple.md \
-o eval_simple.json
# Evaluate detailed feedback with reflection
python judge_feedback.py \
examples/checklist.md \
examples/student_note1.md \
examples/feedback/student_note1_mega.md \
-o eval_mega.json

# Strong student performance
python judge_feedback.py \
examples/checklist.md \
examples/student_note1.md \
examples/feedback/student_note1_simple.md \
-o eval_strong.json
# Weak student performance
python judge_feedback.py \
examples/checklist.md \
examples/student_note2.md \
examples/feedback/student_note2_simple.md \
-o eval_weak.json
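The two result files can then be compared programmatically. A minimal sketch, using the field names from the JSON example above:

```python
import json

# Compare judge results for the strong and weak performance runs above.
for label, path in [("strong", "eval_strong.json"), ("weak", "eval_weak.json")]:
    with open(path) as f:
        evaluation = json.load(f)["evaluation"]
    print(
        f"{label}: polarity {evaluation['polarity']['score']}/5, "
        f"quality {evaluation['total_quality_score']}/{evaluation['max_quality_score']} "
        f"({evaluation['quality_percentage']:.1f}%)"
    )
```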
Requirements:
- Python 3.8+
- Anthropic API key
- Dependencies: anthropic, rich, python-dotenv
Copyright (c) 2025 The University of Texas Southwestern Medical Center. All rights reserved.
This software is licensed for academic research use only. Commercial use is expressly prohibited.
See LICENSE file for full terms and conditions.
For issues or questions about this evaluation system, you may e-mail the authors at firstname.lastname@utsouthwestern.edu.
Authors:
- Michael J. Holcomb, MS
- Thomas O. Dalton, MD
- Andrew R. Jamieson, PhD
