
[FR]: Support for population-based metrics #3764

@SvenLorenz

Description

Proposal summary

I have a classification problem where my metrics are population-based (e.g. recall and precision), and I am trying to optimize a prompt for it.

This is how I imagine it could work:

metric_tracker = MetricTracker()

def population_metric(dataset_item, llm_output):
    metric_tracker.update(dataset_item["output"], llm_output)
    return metric_tracker.score()

optimizer.optimize_prompt(metric=population_metric, metric_accumulation="last")

Or, to make this more flexible, you could let the user inject a metric_accumulation_function instead of a keyword. That would also benefit people who want, for example, weighted averaging. A sketch of what that could look like is below.
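
A minimal sketch of what an injectable accumulation function might look like, assuming the optimizer hands it the list of per-sample ScoreResult objects; the metric_accumulation_function parameter and the take_last / weighted_average helpers are hypothetical, not part of the current API:

# "last" semantics: with a stateful population metric, the final
# per-sample score already reflects the whole dataset.
def take_last(score_results):
    return score_results[-1].value

# Weighted averaging as another example of a custom accumulation.
def weighted_average(score_results):
    weights = [getattr(r, "weight", 1.0) for r in score_results]  # hypothetical per-result weight
    return sum(w * r.value for w, r in zip(weights, score_results)) / sum(weights)

optimizer.optimize_prompt(
    ...,
    metric=population_metric,
    metric_accumulation_function=take_last,  # hypothetical parameter
)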

Motivation

What problem are you trying to solve?

Being able to optimize prompts for classification problems that can't rely on accuracy alone.

How are you currently solving this problem?

I currently use the following approach to optimize for the F1 metric:

from typing import Callable


class MetricTracker:
    def __init__(self):
        self.true_positives = 0
        self.false_positives = 0
        self.true_negatives = 0
        self.false_negatives = 0
        self.id_set = set()

    def update(self, expected_output: str, prediction: str, identifier: str) -> None:
        # Seeing a dataset-item id twice means a new prompt candidate is being
        # evaluated, so reset the counts to avoid leakage between candidates.
        if identifier in self.id_set:
            self.reset()
        self.id_set.add(identifier)
        if expected_output == prediction:
            if expected_output == "Detection":
                self.true_positives += 1
            else:  # "No Detection"
                self.true_negatives += 1
        else:
            if prediction == "Detection":
                self.false_positives += 1
            else:  # "No Detection"
                self.false_negatives += 1

    @property
    def recall(self) -> float:
        denominator = self.true_positives + self.false_negatives
        if denominator == 0:
            return 0.0
        return self.true_positives / denominator

    @property
    def precision(self) -> float:
        denominator = self.true_positives + self.false_positives
        if denominator == 0:
            return 0.0
        return self.true_positives / denominator

    def f_beta(self, beta: float = 1) -> float:
        precision = self.precision
        recall = self.recall
        beta_squared = beta**2
        if precision == 0 and recall == 0:
            return 0.0
        return (
            (1 + beta_squared)
            * (precision * recall)
            / (beta_squared * precision + recall)
        )

    def reset(self) -> None:
        self.id_set = set()
        self.true_positives = 0
        self.false_positives = 0
        self.true_negatives = 0
        self.false_negatives = 0

def metric_wrapper(tracker: MetricTracker) -> Callable:
    # Close over a shared tracker so every sample updates the same running counts.
    def classification_metric(dataset_item, llm_output):
        tracker.update(dataset_item["output"], llm_output, dataset_item["id"])
        return tracker.f_beta(beta=1)
    return classification_metric
...

    tracker = MetricTracker()
    classification_metric = metric_wrapper(tracker=tracker)
    result = optimizer.optimize_prompt(
        ...,
        metric=classification_metric,
        # Evaluate every item so the tracker's counts cover the full dataset each round.
        n_samples=len(dataset.get_items()),
    )

I had two main issues (plus one minor one) with my currently working, hacked-together solution:

  1. When to reset the tracker, since I don't want leakage between prompt candidates. Currently I solve this by tracking dataset_item["id"] in my tracker and resetting when I see an id again, but that requires going through the whole dataset for every prompt. A hook for when a new round starts would be better (see the sketch after this list).
  2. The final accumulation of the metrics is based on averaging, which doesn't work for recall and precision. For my use case I changed

        avg_score = sum(
            [score_result_.value for score_result_ in objective_score_results]
        ) / len(objective_score_results)

     to just taking the last entry. (That's why I proposed the metric_accumulation="last" option.)
  3. Minor issue: since recall and precision make no sense on a per-sample basis, the UI also doesn't make much sense here, but fixing that would probably require a major overhaul and I can absolutely live with the current state of things.
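
For illustration, here is a minimal sketch of the kind of round-start hook that would remove the id-tracking workaround; the on_round_start parameter is purely hypothetical and not part of the current API:

tracker = MetricTracker()

result = optimizer.optimize_prompt(
    ...,
    metric=metric_wrapper(tracker=tracker),
    # Hypothetical hook: called once before each prompt candidate is evaluated,
    # so the tracker can be cleared without scanning dataset ids.
    on_round_start=tracker.reset,
)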

What are the benefits of this feature?

Being able to support a wider variety of problem sets.

Thank you and nice work!
