Description
Proposal summary
I have a classification problem where my metrics are population-based (e.g. Recall and Precision) and I am trying to optimize the prompt for this.
This is how I imagine it would work:
```python
metric_tracker = MetricTracker()

def population_metric(dataset_item, llm_output):
    metric_tracker.update(dataset_item["output"], llm_output)
    return metric_tracker.score()

optimizer.optimize_prompt(metric=population_metric, metric_accumulation="last")
```
Alternatively, to make this more extensible, you could let the user inject a metric_accumulation_function instead of a keyword. This would also benefit people who want to do weighted averaging, for example (see the sketch below).
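To sketch what I mean (purely hypothetical API: `metric_accumulation_fn` and these helpers are my own invention, not part of the current SDK):

```python
from typing import Sequence


def take_last(scores: Sequence[float]) -> float:
    """Hypothetical accumulation: keep only the final (population-level) score."""
    return scores[-1] if scores else 0.0


def weighted_average(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Hypothetical accumulation: weighted mean of per-sample scores."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Imagined usage -- 'metric_accumulation_fn' does not exist in the current SDK:
# optimizer.optimize_prompt(metric=population_metric, metric_accumulation_fn=take_last)
```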
Motivation
What problem are you trying to solve?
Being able to optimize prompts for classification problems that can't rely on accuracy alone.
How are you currently solving this problem?
I currently have the following approach for optimizing for the F1 metric:
```python
from typing import Callable


class MetricTracker:
    """Accumulates confusion-matrix counts across dataset items."""

    def __init__(self):
        self.true_positives = 0
        self.false_positives = 0
        self.true_negatives = 0
        self.false_negatives = 0
        self.id_set = set()

    def update(self, expected_output: str, prediction: str, identifier: str) -> None:
        # A repeated id means a new prompt candidate is being evaluated, so start fresh.
        if identifier in self.id_set:
            self.reset()
        self.id_set.add(identifier)

        if expected_output == prediction:
            if expected_output == "Detection":
                self.true_positives += 1
            else:  # "No Detection"
                self.true_negatives += 1
        else:
            if prediction == "Detection":
                self.false_positives += 1
            else:  # "No Detection"
                self.false_negatives += 1

    @property
    def recall(self) -> float:
        denominator = self.true_positives + self.false_negatives
        if denominator == 0:
            return 0.0
        return self.true_positives / denominator

    @property
    def precision(self) -> float:
        denominator = self.true_positives + self.false_positives
        if denominator == 0:
            return 0.0
        return self.true_positives / denominator

    def f_beta(self, beta: float = 1) -> float:
        precision = self.precision
        recall = self.recall
        beta_squared = beta**2
        if precision == 0 and recall == 0:
            return 0.0
        return (
            (1 + beta_squared)
            * (precision * recall)
            / (beta_squared * precision + recall)
        )

    def reset(self) -> None:
        self.id_set = set()
        self.true_positives = 0
        self.false_positives = 0
        self.true_negatives = 0
        self.false_negatives = 0


def metric_wrapper(tracker: MetricTracker) -> Callable:
    def classification_metric(dataset_item, llm_output):
        tracker.update(dataset_item["output"], llm_output, dataset_item["id"])
        return tracker.f_beta(beta=1)

    return classification_metric
```
```python
...
tracker = MetricTracker()
classification_metric = metric_wrapper(tracker=tracker)
result = optimizer.optimize_prompt(
    ....,
    metric=classification_metric,
    n_samples=len(dataset.get_items()),
)
```
I had two main issues with my currently working, hacked-together solution:
- When to reset the tracker, since I don't want leakage between prompt candidates. Currently I solve this by tracking dataset_item["id"] in my tracker and resetting once I see an id for a second time, but that only works if every prompt candidate goes through the whole dataset. A hook for when a new round starts would be better.
- The final accumulation of the metrics is based on averaging (opik/sdks/opik_optimizer/src/opik_optimizer/task_evaluator.py, lines 134 to 136 in bd5b1c8):

  ```python
  avg_score = sum(
      [score_result_.value for score_result_ in objective_score_results]
  ) / len(objective_score_results)
  ```

  This doesn't work for recall and precision, so for my use case I changed it to just take the last entry. (That's why I put the metric_accumulation="last" option in the proposal; see the illustration after this list.)
- Minor issue: since recall and precision make no sense on a per-sample basis, the UI also doesn't make much sense for them, but fixing that would probably require a major overhaul and I can absolutely live with the current state of things.
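To make the averaging issue concrete, here is a small illustration with made-up labels; averaging the per-sample running F1 values is not the same as the population F1 the tracker reports after the last item:

```python
# Toy run over four items, recording the running F1 after each update.
tracker = MetricTracker()
items = [
    ("a", "Detection", "Detection"),
    ("b", "No Detection", "No Detection"),
    ("c", "Detection", "No Detection"),
    ("d", "Detection", "Detection"),
]
running_scores = []
for identifier, expected, predicted in items:
    tracker.update(expected, predicted, identifier)
    running_scores.append(tracker.f_beta(beta=1))

print(running_scores)                             # [1.0, 1.0, 0.666..., 0.8]
print(sum(running_scores) / len(running_scores))  # ~0.867 <- what averaging reports
print(running_scores[-1])                         # 0.8    <- the actual population F1
```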
What are the benefits of this feature?
Being able to support a wider variety of problem sets.
Thank you and nice work!