@luizvalle (Contributor) commented on Oct 27, 2025

Add an aggregator that counts occurrences of each unique error message in the LiveCodeBench pipeline

This makes it easier to see what caused the unit test failures.

Verification

I ran the pipeline locally. It produces two JSON reports: one with overall counts and one with counts grouped by difficulty.

countuniqueerrormessagesaggregator__Unique_Error_Messages_report.json

{
  "Code execution timed out after 20.0 seconds.": 32,
  "Expected stdout: 1, but got: ": 20,
  "Expected stdout: 4, but got: ": 6,
  "Expected stdout: 12, but got: ": 6,
  "Expected stdout: 2, but got: ": 4,
  "Expected stdout: 169, but got: ": 4,
  "Expected stdout: 20, but got: ": 4,
  "Expected stdout: 3, but got: ": 4,
  "Expected stdout: 17, but got: ": 4,
  "Expected stdout: 41, but got: ": 2,
  "Expected stdout: 18, but got: ": 2,
  "Expected stdout: 10, but got: ": 2,
  "Expected stdout: 9, but got: ": 2,
  "Expected stdout: 26, but got: ": 2,
  "Expected stdout: 154, but got: ": 2,
  "Expected stdout: 45, but got: ": 2,
  "Expected stdout: 5, but got: ": 2,
  "Expected stdout: 7, but got: ": 2,
  "Expected stdout: 33, but got: ": 2,
  "Expected stdout: 13, but got: ": 2,
  "Expected stdout: 157, but got: ": 2,
  "Expected stdout: 15, but got: ": 2,
  "Expected stdout: 11, but got: ": 2,
  "Expected stdout: 8, but got: ": 2,
  "Expected stdout: 23, but got: ": 2,
  "Expected stdout: 24, but got: ": 2
}

countuniqueerrormessagesaggregator__Unique_Error_Messages_report.json

{
  "hard": {
    "Expected stdout: 1, but got: ": 20,
    "Expected stdout: 4, but got: ": 6,
    "Expected stdout: 12, but got: ": 6,
    "Expected stdout: 3, but got: ": 4,
    "Expected stdout: 169, but got: ": 4,
    "Expected stdout: 20, but got: ": 4,
    "Expected stdout: 2, but got: ": 4,
    "Expected stdout: 17, but got: ": 4,
    "Expected stdout: 41, but got: ": 2,
    "Expected stdout: 18, but got: ": 2,
    "Expected stdout: 10, but got: ": 2,
    "Expected stdout: 9, but got: ": 2,
    "Expected stdout: 26, but got: ": 2,
    "Expected stdout: 154, but got: ": 2,
    "Expected stdout: 45, but got: ": 2,
    "Expected stdout: 5, but got: ": 2,
    "Expected stdout: 7, but got: ": 2,
    "Expected stdout: 33, but got: ": 2,
    "Expected stdout: 13, but got: ": 2,
    "Expected stdout: 157, but got: ": 2,
    "Expected stdout: 15, but got: ": 2,
    "Expected stdout: 11, but got: ": 2,
    "Expected stdout: 8, but got: ": 2,
    "Expected stdout: 23, but got: ": 2,
    "Expected stdout: 24, but got: ": 2
  },
  "medium": {
    "Code execution timed out after 20.0 seconds.": 32
  }
}
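As a quick sanity check on the two reports above, the per-difficulty counts should sum back to the overall counts. The sketch below illustrates that check on a small excerpt of the reported values; the helper name `flatten_grouped_report` is hypothetical and not part of the PR.

```python
from collections import Counter


def flatten_grouped_report(grouped):
    """Sum per-group message counts into one overall mapping.

    Hypothetical helper for illustration; not part of the PR.
    """
    total = Counter()
    for counts in grouped.values():
        total.update(counts)  # adds counts for messages already seen
    return dict(total)


# Excerpt of the overall report shown above.
overall = {
    "Code execution timed out after 20.0 seconds.": 32,
    "Expected stdout: 1, but got: ": 20,
    "Expected stdout: 4, but got: ": 6,
}

# Matching excerpt of the grouped report shown above.
by_difficulty = {
    "hard": {
        "Expected stdout: 1, but got: ": 20,
        "Expected stdout: 4, but got: ": 6,
    },
    "medium": {
        "Code execution timed out after 20.0 seconds.": 32,
    },
}

flattened = flatten_grouped_report(by_difficulty)
reports_agree = flattened == overall
```

Dictionary equality ignores key order, so the comparison holds even though the two reports list messages in different orders.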

Copilot AI left a comment

Pull Request Overview

This PR adds functionality to track and report unique error messages from unit test failures in the LiveCodeBench pipeline. This helps identify the root causes of test failures more easily.

Key changes:

  • Introduces a new aggregator class to count unique error messages
  • Adds error message tracking by overall results and by difficulty level
  • Refactors hardcoded column names to use class constants for consistency

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| eureka_ml_insights/user_configs/live_code_bench.py | Adds new column name constants, imports the error messages aggregator, refactors hardcoded strings to constants, and configures two new aggregator instances for error message tracking |
| eureka_ml_insights/metrics/live_code_bench/count_unique_error_messages_aggregator.py | Implements the CountUniqueErrorMessagesAggregator class that counts unique error messages with optional grouping support |
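The idea of counting unique error messages with optional grouping can be sketched in a few lines of standard-library Python. This is only an illustration of the technique, not the `CountUniqueErrorMessagesAggregator` implementation from the PR: the function name, the row format, and the column names `error_message` and `difficulty` are all assumptions.

```python
from collections import Counter, defaultdict


def count_unique_error_messages(rows, message_key="error_message", group_key=None):
    """Count how often each non-empty error message occurs.

    Hypothetical sketch: `rows` is a list of dicts, and the key names
    are assumed, not taken from the PR. Rows with an empty or missing
    message (e.g. passing tests) are skipped.
    """
    if group_key is None:
        counts = Counter(
            row[message_key] for row in rows if row.get(message_key)
        )
        # most_common() orders messages from most to least frequent.
        return dict(counts.most_common())

    grouped = defaultdict(Counter)
    for row in rows:
        message = row.get(message_key)
        if message:
            grouped[row[group_key]][message] += 1
    return {group: dict(c.most_common()) for group, c in grouped.items()}


# Illustrative rows; the messages mirror the format shown in the PR description.
rows = [
    {"difficulty": "hard", "error_message": "Expected stdout: 1, but got: "},
    {"difficulty": "hard", "error_message": "Expected stdout: 1, but got: "},
    {"difficulty": "medium", "error_message": "Code execution timed out after 20.0 seconds."},
    {"difficulty": "easy", "error_message": ""},  # a passing test, no message
]

overall_counts = count_unique_error_messages(rows)
grouped_counts = count_unique_error_messages(rows, group_key="difficulty")
```

Calling the function without `group_key` corresponds to the overall report, and passing `group_key="difficulty"` corresponds to the per-difficulty report.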
