Evaluation improvements #217
base: human_inner_annotator_agreement
Please have a look at the missing file. Other than that, the PR looks good to me; the other remarks I made are optional.
    )

    document_scores_df = pd.DataFrame(document_scores)
Why is the filtering necessary? We already handle missing or unmatched document IDs later by restricting each comparison to the documents that are common to the annotators being compared.
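To make the point above concrete, here is a minimal sketch of restricting a pairwise comparison to shared documents. The long-format layout and the column names `document_id`, `annotator`, and `score`, as well as the helper name `common_document_scores`, are assumptions for illustration and are not taken from the PR.

```python
import pandas as pd


def common_document_scores(df, annotator_a, annotator_b):
    """Return the scores of two annotators on the documents both of them rated.

    Assumes each annotator rated a given document at most once.
    """
    a = df[df["annotator"] == annotator_a].set_index("document_id")["score"]
    b = df[df["annotator"] == annotator_b].set_index("document_id")["score"]
    # Keep only the document IDs present for both annotators.
    common = a.index.intersection(b.index).sort_values()
    return pd.DataFrame({annotator_a: a.loc[common], annotator_b: b.loc[common]})


# Hypothetical example data: "d3" and "d4" were each rated by one annotator only.
example_df = pd.DataFrame(
    {
        "document_id": ["d1", "d2", "d3", "d1", "d2", "d4"],
        "annotator": ["a", "a", "a", "b", "b", "b"],
        "score": [1, 2, 3, 1, 3, 5],
    }
)
paired = common_document_scores(example_df, "a", "b")
```

With this kind of pairwise restriction in place, documents missing for one annotator drop out of that comparison only, so an additional global filter up front may be redundant.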
A few comments on the notebook:
- You define jsonl_path = "annotations__educational_content__en__gt.jsonl". This fails for me because the file is missing. Should this be gt_annotations_path instead?
- Why do we plot the standard deviations as a histogram, but the spread as a cumulative distribution?
- In the section about spreads, why do we print Document ID 1 and 2? Shouldn't they be the same? Or is this just a sanity check?
- For the evaluation of our predictions, we aggregate the human annotations using majority voting. In the notebook, we're only looking at the mean of the human annotations. Should we add info about the majority voting as well? E.g. we could add a plot of the distribution of the human annotations aggregated with majority voting.
The computations look correct to me.
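As a rough illustration of the majority-voting suggestion above, the sketch below aggregates per-document annotations both by mean and by majority vote so the two can be compared side by side. The column names, the example data, and in particular the tie-breaking rule (lowest score among the most frequent) are assumptions; the evaluation code may resolve ties differently.

```python
from collections import Counter

import pandas as pd


def majority_vote(scores):
    """Return the most frequent score; ties go to the lowest score (an assumption)."""
    counts = Counter(scores)
    top = max(counts.values())
    return min(score for score, count in counts.items() if count == top)


# Hypothetical annotations: three annotators per document.
annotations = pd.DataFrame(
    {
        "document_id": ["d1"] * 3 + ["d2"] * 3 + ["d3"] * 3,
        "score": [1, 1, 4, 2, 3, 3, 0, 2, 5],
    }
)

# One row per document, with both aggregation strategies as columns.
per_doc = annotations.groupby("document_id")["score"].agg(
    mean="mean", majority=majority_vote
)

# Distribution of the majority-voted labels, ready for a bar plot.
majority_distribution = per_doc["majority"].value_counts().sort_index()
```

Plotting `majority_distribution` next to the histogram of `per_doc["mean"]` would show how far the two aggregation strategies diverge.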
No description provided.