Milestones

[EV1] Setup evaluation pipeline
No due date • 8/8 issues closed
100% complete, 0 open, 8 closed
[ML3] Impl of classifier-based annotation pipeline
Implementation of the classifier-based annotation pipeline.
Todo:
* Decide on an inference backend (TGI, Optimum, vLLM)
* Decide on the dataset format
Assignees: Tim, Felix, Max
No due date • 12/12 issues closed
100% complete, 0 open, 12 closed
[Art2] Trained classifier(s)
We will have one classifier per prompt (possibly per language).
Overdue by 1 year(s) • Due by December 4, 2024 • 3/3 issues closed
100% complete, 0 open, 3 closed
[Art1] Prompt-based Annotation Datasets
All datasets for training the BERT-based classifiers are generated.
Overdue by 1 year(s) • Due by November 28, 2024 • 6/6 issues closed
100% complete, 0 open, 6 closed
[Art4] ML-filtered Fineweb2
ML-filtered FineWeb2 dataset, used for training the first models on MN5.
Overdue by 11 month(s) • Due by December 15, 2024 • 4/4 issues closed
100% complete, 0 open, 4 closed
[ML2] Impl. of classifier training pipeline
BERT-based scoring pipeline consisting of:
* training BERT-based classifiers (using the HF script) on annotation data (multi-class classification with multiple BERT models vs. multi-label classification with a single BERT model)
* evaluation of the classifiers against ground-truth data (probably only educational scores)
HF implementation:
* https://github.com/huggingface/cosmopedia/tree/main/classification
* https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier
Assignees: Richard, Florian, David, (Abbas)
Overdue by 1 year(s) • Due by November 29, 2024 • 34/34 issues closed
100% complete, 0 open, 34 closed
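The evaluation step named in [ML2] (classifier predictions against ground-truth scores) could look roughly like the sketch below. Everything in it is an assumption for illustration: the function name `evaluate_scores`, the integer score range 0 to 5, and the choice of accuracy and macro-F1 as metrics are not taken from the milestone.

```python
from collections import Counter


def evaluate_scores(predicted, ground_truth, labels=range(6)):
    """Compare predicted integer scores with ground-truth scores.

    Assumption: scores are integers; the default 0-5 label range is
    illustrative and not specified by the milestone.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not predicted:
        raise ValueError("need at least one example")

    # Count each (true score, predicted score) combination.
    pairs = Counter(zip(ground_truth, predicted))
    correct = sum(n for (t, p), n in pairs.items() if t == p)

    f1_per_label = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        if tp + fp + fn == 0:
            continue  # label absent from both lists; leave it out of the average
        f1_per_label[label] = 2 * tp / (2 * tp + fp + fn)

    macro_f1 = sum(f1_per_label.values()) / len(f1_per_label) if f1_per_label else 0.0
    return {
        "accuracy": correct / len(predicted),
        "macro_f1": macro_f1,
        "f1_per_label": f1_per_label,
    }
```

Labels that appear in neither list are skipped, so a small evaluation set that never contains the highest scores does not pull the macro average toward zero.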
[ML1] Impl. of prompt-based annotation pipeline
End-to-end pipeline that annotates all documents in a JSONL file according to the scoring schema defined in a prompt. Supports different models such as Llama or Gemma. Includes a script that combines the annotations with the original documents in the following format (suggestion only):
```
{ [original values], annotations: [{tag: educational, model: llama, score: 1}, ...]}
```
Assignees: Max, Abbas, Mehdi, Alex, Richard, (Felix)
Overdue by 1 year(s) • Due by November 19, 2024 • 40/40 issues closed
100% complete, 0 open, 40 closed
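The combine step described in [ML1] can be sketched as follows. This is a minimal illustration of the suggested output format, not the project's script. The function name `merge_annotations`, the `id` join key, and the layout of the annotation records (one JSON object per line with `id`, `tag`, `model`, `score`) are all assumptions.

```python
import json
from collections import defaultdict


def merge_annotations(doc_lines, annotation_lines, id_key="id"):
    """Attach annotations to their source documents.

    doc_lines: iterable of JSONL strings, one document per line.
    annotation_lines: iterable of JSONL strings, each holding id_key, tag,
        model and score (hypothetical record layout).
    Returns a list of JSONL strings in the original document order.
    """
    by_doc = defaultdict(list)
    for line in annotation_lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        by_doc[record[id_key]].append(
            {"tag": record["tag"], "model": record["model"], "score": record["score"]}
        )

    merged = []
    for line in doc_lines:
        if not line.strip():
            continue
        doc = json.loads(line)
        # Original values are kept; unannotated documents get an empty list.
        doc["annotations"] = by_doc.get(doc[id_key], [])
        merged.append(json.dumps(doc, ensure_ascii=False))
    return merged


# Usage with two made-up documents and one made-up annotation.
docs = [
    '{"id": "a", "text": "Photosynthesis converts light into chemical energy."}',
    '{"id": "b", "text": "Buy now!"}',
]
anns = ['{"id": "a", "tag": "educational", "model": "llama", "score": 1}']
for merged_line in merge_annotations(docs, anns):
    print(merged_line)
```

Keeping the annotations in a list per document lets several models or tags score the same document without changing the original fields.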