Add aggregator that counts the number of unique error messages in the LiveCodeBench pipeline #186

luizvalle · 2025-10-27T21:39:48Z

Add aggregator that counts the number of unique error messages in the LiveCodeBench pipeline

This makes it easier to see what caused the unit test failures.

Verification

Ran the pipeline locally. It produced two JSON reports:

countuniqueerrormessagesaggregator__Unique_Error_Messages_report.json

{
  "Code execution timed out after 20.0 seconds.": 32,
  "Expected stdout: 1, but got: ": 20,
  "Expected stdout: 4, but got: ": 6,
  "Expected stdout: 12, but got: ": 6,
  "Expected stdout: 2, but got: ": 4,
  "Expected stdout: 169, but got: ": 4,
  "Expected stdout: 20, but got: ": 4,
  "Expected stdout: 3, but got: ": 4,
  "Expected stdout: 17, but got: ": 4,
  "Expected stdout: 41, but got: ": 2,
  "Expected stdout: 18, but got: ": 2,
  "Expected stdout: 10, but got: ": 2,
  "Expected stdout: 9, but got: ": 2,
  "Expected stdout: 26, but got: ": 2,
  "Expected stdout: 154, but got: ": 2,
  "Expected stdout: 45, but got: ": 2,
  "Expected stdout: 5, but got: ": 2,
  "Expected stdout: 7, but got: ": 2,
  "Expected stdout: 33, but got: ": 2,
  "Expected stdout: 13, but got: ": 2,
  "Expected stdout: 157, but got: ": 2,
  "Expected stdout: 15, but got: ": 2,
  "Expected stdout: 11, but got: ": 2,
  "Expected stdout: 8, but got: ": 2,
  "Expected stdout: 23, but got: ": 2,
  "Expected stdout: 24, but got: ": 2
}
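
For readers unfamiliar with the report shape, the overall report above can be approximated with a plain frequency count. This is a minimal sketch, not the code in this PR: the function name `count_unique_error_messages` and the choice to skip empty or missing messages are assumptions made for illustration.

```python
import json
from collections import Counter


def count_unique_error_messages(messages):
    """Map each distinct error message to its count, most frequent first.

    Assumption: entries that are None or empty (e.g. passing tests) are skipped.
    """
    counts = Counter(m for m in messages if m)
    # most_common() orders by descending count; dict preserves that order.
    return dict(counts.most_common())


# Hypothetical per-test error messages, not real pipeline output.
sample = [
    "Code execution timed out after 20.0 seconds.",
    "Expected stdout: 1, but got: ",
    "Code execution timed out after 20.0 seconds.",
    None,
]
report = count_unique_error_messages(sample)
print(json.dumps(report, indent=2))
```

Sorting by descending count mirrors the ordering seen in the report, so the most common failure cause appears first.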

countuniqueerrormessagesaggregator__Unique_Error_Messages_report.json

{
  "hard": {
    "Expected stdout: 1, but got: ": 20,
    "Expected stdout: 4, but got: ": 6,
    "Expected stdout: 12, but got: ": 6,
    "Expected stdout: 3, but got: ": 4,
    "Expected stdout: 169, but got: ": 4,
    "Expected stdout: 20, but got: ": 4,
    "Expected stdout: 2, but got: ": 4,
    "Expected stdout: 17, but got: ": 4,
    "Expected stdout: 41, but got: ": 2,
    "Expected stdout: 18, but got: ": 2,
    "Expected stdout: 10, but got: ": 2,
    "Expected stdout: 9, but got: ": 2,
    "Expected stdout: 26, but got: ": 2,
    "Expected stdout: 154, but got: ": 2,
    "Expected stdout: 45, but got: ": 2,
    "Expected stdout: 5, but got: ": 2,
    "Expected stdout: 7, but got: ": 2,
    "Expected stdout: 33, but got: ": 2,
    "Expected stdout: 13, but got: ": 2,
    "Expected stdout: 157, but got: ": 2,
    "Expected stdout: 15, but got: ": 2,
    "Expected stdout: 11, but got: ": 2,
    "Expected stdout: 8, but got: ": 2,
    "Expected stdout: 23, but got: ": 2,
    "Expected stdout: 24, but got: ": 2
  },
  "medium": {
    "Code execution timed out after 20.0 seconds.": 32
  }
}
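
The per-difficulty report above has the same counts nested under a grouping key. A rough sketch of that grouping follows, using only the standard library; the column names `error_message` and `difficulty` and the helper `count_errors_by_group` are hypothetical and may not match the names used in the pipeline config.

```python
from collections import Counter, defaultdict


def count_errors_by_group(rows, message_key="error_message", group_key="difficulty"):
    """Return {group: {message: count}} for rows that carry an error message.

    Assumption: rows are dicts, and rows without a message are ignored.
    """
    grouped = defaultdict(Counter)
    for row in rows:
        message = row.get(message_key)
        if message:
            grouped[row[group_key]][message] += 1
    # Within each group, order messages by descending count.
    return {group: dict(counter.most_common()) for group, counter in grouped.items()}


# Hypothetical rows, not real pipeline output.
rows = [
    {"difficulty": "hard", "error_message": "Expected stdout: 1, but got: "},
    {"difficulty": "hard", "error_message": "Expected stdout: 1, but got: "},
    {"difficulty": "medium", "error_message": "Code execution timed out after 20.0 seconds."},
    {"difficulty": "medium", "error_message": None},
]
by_difficulty = count_errors_by_group(rows)
print(by_difficulty)
```

Splitting by difficulty makes patterns visible that the flat report hides: in the verification run, every timeout falls under "medium" and every stdout mismatch under "hard".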

eureka_ml_insights/metrics/live_code_bench/count_unique_error_messages_aggregator.py

Copilot

Pull Request Overview

This PR adds functionality to track and report unique error messages from unit test failures in the LiveCodeBench pipeline. This helps identify the root causes of test failures more easily.

Key changes:

Introduces a new aggregator class to count unique error messages
Adds error message tracking by overall results and by difficulty level
Refactors hardcoded column names to use class constants for consistency

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
eureka_ml_insights/user_configs/live_code_bench.py	Adds new column name constants, imports the error messages aggregator, refactors hardcoded strings to constants, and configures two new aggregator instances for error message tracking
eureka_ml_insights/metrics/live_code_bench/count_unique_error_messages_aggregator.py	Implements the CountUniqueErrorMessagesAggregator class that counts unique error messages with optional grouping support


Add aggregator that counts the number of unique error messages

d161e71

vidhishanair reviewed Oct 27, 2025


eureka_ml_insights/metrics/live_code_bench/count_unique_error_messages_aggregator.py

luizvalle requested review from Copilot and vidhishanair October 27, 2025 23:34

Copilot AI reviewed Oct 27, 2025


Use vectorization

a352e74
