Evaluation improvements #217

le1nux · 2025-05-06T23:01:04Z

No description provided.

rrutmann

Please have a look at the missing file. Other than that, the PR looks good to me; the other remarks I made are optional.

rrutmann · 2025-05-13T14:13:53Z

src/ml_filter/analysis/utils.py

                )

    document_scores_df = pd.DataFrame(document_scores)
+


Why is the filtering necessary? We handle the case of missing / unmatched document IDs later by filtering on documents that are common to the annotators we are currently comparing.
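To illustrate the later filtering I am referring to, here is a minimal sketch rather than the actual code in `utils.py`. It assumes a long-format DataFrame with hypothetical columns `doc_id`, `annotator` and `score`, and at most one score per document and annotator; the function name is made up for this example.

```python
import pandas as pd


def common_document_scores(df: pd.DataFrame, annotator_a: str, annotator_b: str) -> pd.DataFrame:
    """Return the scores of two annotators, restricted to documents both of them annotated.

    Column names (doc_id, annotator, score) are assumptions for this sketch.
    """
    # One score series per annotator, indexed by document ID.
    scores_a = df[df["annotator"] == annotator_a].set_index("doc_id")["score"]
    scores_b = df[df["annotator"] == annotator_b].set_index("doc_id")["score"]

    # Keep only the document IDs that appear for both annotators.
    common_ids = scores_a.index.intersection(scores_b.index)

    return pd.DataFrame(
        {annotator_a: scores_a.loc[common_ids], annotator_b: scores_b.loc[common_ids]}
    )


example = pd.DataFrame(
    {
        "doc_id": ["d1", "d2", "d2", "d3"],
        "annotator": ["a", "a", "b", "b"],
        "score": [1, 2, 3, 4],
    }
)
common = common_document_scores(example, "a", "b")
```

With a pairwise intersection like this, documents missing for one annotator drop out at comparison time, which is why an additional up-front filter looks redundant to me.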

rrutmann · 2025-05-13T14:57:02Z

notebooks/edu_content_human_as_a_judge.ipynb

A few comments on the notebook:

You define jsonl_path = "annotations__educational_content__en__gt.jsonl". This fails for me because the file is missing. Should this be gt_annotations_path instead?

Why do we plot the standard deviations as a histogram, but the spread as a cumulative distribution?

In the section about spreads, why do we print Document ID 1 and 2? Shouldn't they be the same? Or is this just a sanity check?

For the evaluation of our predictions, we aggregate the human annotations using majority voting. In the notebook, we're only looking at the mean of the human annotations. Should we add info about the majority voting as well? E.g. we could add a plot of the distribution of the human annotations aggregated with majority voting.

The computations look correct to me.
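Regarding the majority-voting remark above, a minimal sketch of how both aggregations could be shown side by side in the notebook. The sample annotations and the tie-breaking rule (lowest score wins) are assumptions for illustration and may differ from what the evaluation code does.

```python
from collections import Counter
from statistics import mean


def majority_vote(scores: list[int]) -> int:
    """Return the most frequent score.

    Ties are broken by taking the lowest tied score; this rule is an
    assumption of the sketch, not necessarily the project's behaviour.
    """
    counts = Counter(scores)
    highest_count = max(counts.values())
    return min(score for score, count in counts.items() if count == highest_count)


# Hypothetical human annotations: document ID -> scores from three annotators.
annotations = {"doc_a": [1, 1, 4], "doc_b": [2, 3, 3]}

aggregated = {
    doc_id: {"mean": mean(scores), "majority": majority_vote(scores)}
    for doc_id, scores in annotations.items()
}
```

For `doc_a` the mean is 2 while the majority vote is 1, so the two distributions can differ noticeably; plotting both would make that visible.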

le1nux added 6 commits May 7, 2025 00:59

feat: added NDCG@all metric

6757c5c

feat: added some consistency checks for input data

4db5eb8

feat: added llm-as-a-judge evaluation

3f2e74f

feat: more diagrams in mlfilter paper notebook

8c9f50a

feat: added f1 @k metric

d04d472

feat: more work on plots

d026813

le1nux requested a review from rrutmann May 12, 2025 19:23

rrutmann requested changes May 13, 2025


feat: model correlation claculation and confusion matrix refactoring

66f2dc9
