
[FR]: Support for population-based metrics #3764

@SvenLorenz

Description

Proposal summary

I have a classification problem where my metrics are population-based (e.g. recall and precision), and I am trying to optimize a prompt for it.

This is how I imagine it could work:

metric_tracker = MetricTracker()

def population_metric(dataset_item, llm_output):
    metric_tracker.update(dataset_item["output"], llm_output)
    return metric_tracker.score()

optimizer.optimize_prompt(metric=population_metric, metric_accumulation="last")

Or, to make this more flexible, you could let the user inject a metric_accumulation_function instead of a keyword. That would also benefit people who want, for example, weighted averaging. A sketch of what that could look like is below.
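
A minimal sketch of what an injectable accumulation function might look like, assuming the optimizer hands it the list of per-sample ScoreResult objects; the metric_accumulation_function parameter and the take_last / weighted_average helpers are hypothetical, not part of the current API:

# "last" semantics: with a stateful population metric, the final
# per-sample score already reflects the whole dataset.
def take_last(score_results):
    return score_results[-1].value

# Weighted averaging as another example of a custom accumulation.
def weighted_average(score_results):
    weights = [getattr(r, "weight", 1.0) for r in score_results]  # hypothetical per-result weight
    return sum(w * r.value for w, r in zip(weights, score_results)) / sum(weights)

optimizer.optimize_prompt(
    ...,
    metric=population_metric,
    metric_accumulation_function=take_last,  # hypothetical parameter
)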

Motivation

What problem are you trying to solve?

Being able to optimize prompts for classification problems that can't rely on accuracy alone.

How are you currently solving this problem?

I currently use the following approach to optimize for the F1 metric:

from typing import Callable


class MetricTracker:
    def __init__(self):
        self.true_positives = 0
        self.false_positives = 0
        self.true_negatives = 0
        self.false_negatives = 0
        self.id_set = set()

    def update(self, expected_output: str, prediction: str, identifier: str) -> None:
        # Seeing a dataset-item id twice means a new prompt candidate is being
        # evaluated, so reset the counts to avoid leakage between candidates.
        if identifier in self.id_set:
            self.reset()
        self.id_set.add(identifier)
        if expected_output == prediction:
            if expected_output == "Detection":
                self.true_positives += 1
            else:  # "No Detection"
                self.true_negatives += 1
        else:
            if prediction == "Detection":
                self.false_positives += 1
            else:  # "No Detection"
                self.false_negatives += 1

    @property
    def recall(self) -> float:
        denominator = self.true_positives + self.false_negatives
        if denominator == 0:
            return 0.0
        return self.true_positives / denominator

    @property
    def precision(self) -> float:
        denominator = self.true_positives + self.false_positives
        if denominator == 0:
            return 0.0
        return self.true_positives / denominator

    def f_beta(self, beta: float = 1) -> float:
        precision = self.precision
        recall = self.recall
        beta_squared = beta**2
        if precision == 0 and recall == 0:
            return 0.0
        return (
            (1 + beta_squared)
            * (precision * recall)
            / (beta_squared * precision + recall)
        )

    def reset(self) -> None:
        self.id_set = set()
        self.true_positives = 0
        self.false_positives = 0
        self.true_negatives = 0
        self.false_negatives = 0

def metric_wrapper(tracker: MetricTracker) -> Callable:
    # Close over a shared tracker so every sample updates the same running counts.
    def classification_metric(dataset_item, llm_output):
        tracker.update(dataset_item["output"], llm_output, dataset_item["id"])
        return tracker.f_beta(beta=1)
    return classification_metric
...

    tracker = MetricTracker()
    classification_metric = metric_wrapper(tracker=tracker)
    result = optimizer.optimize_prompt(
        ...,
        metric=classification_metric,
        # Evaluate every item so the tracker's counts cover the full dataset each round.
        n_samples=len(dataset.get_items()),
    )

I had two main issues (plus one minor one) with my currently working, hacked-together solution:

  1. When to reset the tracker, since I don't want leakage between prompt candidates. Currently I solve this by tracking dataset_item["id"] in my tracker and resetting when I see an id again, but that requires going through the whole dataset for every prompt. A hook for when a new round starts would be better (see the sketch after this list).
  2. The final accumulation of the metrics is based on averaging, which doesn't work for recall and precision. For my use case I changed

        avg_score = sum(
            [score_result_.value for score_result_ in objective_score_results]
        ) / len(objective_score_results)

     to just taking the last entry. (That's why I proposed the metric_accumulation="last" option.)
  3. Minor issue: since recall and precision make no sense on a per-sample basis, the UI also doesn't make much sense here, but fixing that would probably require a major overhaul and I can absolutely live with the current state of things.
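
For illustration, here is a minimal sketch of the kind of round-start hook that would remove the id-tracking workaround; the on_round_start parameter is purely hypothetical and not part of the current API:

tracker = MetricTracker()

result = optimizer.optimize_prompt(
    ...,
    metric=metric_wrapper(tracker=tracker),
    # Hypothetical hook: called once before each prompt candidate is evaluated,
    # so the tracker can be cleared without scanning dataset ids.
    on_round_start=tracker.reset,
)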

What are the benefits of this feature?

Being able to support a wider variety of problem sets.

Thank you and nice work!
