No due date • 8/8 issues closed
Implementation of the classifier-based annotation pipeline.
Todo:
* Decide on an inference backend (TGI, Optimum, vLLM)
* Decide on the dataset format
Assignees: Tim, Felix, Max
No due date • 12/12 issues closed
We will have one classifier per prompt (possibly per language).
Due by December 4, 2024 • 3/3 issues closed
All datasets for training the BERT-based classifiers are generated.
Due by November 28, 2024 • 6/6 issues closed
ML-filtered FineWeb2 dataset, used for training the first models on MN5.
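One way to picture the ML-filtering step is as a threshold on a per-document classifier score. The sketch below is only an illustration under assumptions: the field name `score` and the threshold value are hypothetical and are not taken from the milestone.

```python
from typing import Iterable, Iterator


def filter_by_score(
    docs: Iterable[dict],
    score_key: str = "score",  # hypothetical field name
    threshold: float = 3,      # hypothetical cut-off
) -> Iterator[dict]:
    """Yield only the documents whose classifier score reaches the threshold."""
    for doc in docs:
        # Documents without a score are dropped rather than guessed at.
        if score_key in doc and doc[score_key] >= threshold:
            yield doc


sample_docs = [
    {"id": "a", "score": 1},
    {"id": "b", "score": 3},
    {"id": "c", "score": 4},
    {"id": "d"},
]
kept_ids = [doc["id"] for doc in filter_by_score(sample_docs)]
```

With the sample data, only documents `b` and `c` pass the filter.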
Due by December 15, 2024 • 4/4 issues closed
BERT-based scoring pipeline consisting of:
* Training BERT-based classifiers (using the HF script) on annotation data (multi-class classification with multiple BERT models vs. multi-label classification with a single BERT model)
* Evaluation of the classifiers against ground-truth data (probably only educational scores)
HF implementation:
https://github.com/huggingface/cosmopedia/tree/main/classification
https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier
Assignees: Richard, Florian, David, (Abbas)
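The evaluation step could be sketched as comparing predicted scores with ground-truth scores. The example below is a minimal, dependency-free illustration that assumes integer score labels; the choice of accuracy and macro F1 as metrics is an assumption, not something the milestone specifies.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of documents whose predicted score equals the ground truth."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-label F1 over all labels seen in either list."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical educational scores for four documents.
ground_truth = [0, 1, 2, 2]
predictions = [0, 1, 2, 1]
acc = accuracy(ground_truth, predictions)
f1 = macro_f1(ground_truth, predictions)
```

Macro F1 weights every score level equally, which can matter when high educational scores are rare in the ground-truth data.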
Due by November 29, 2024 • 34/34 issues closed
End-to-end pipeline that annotates all documents in a JSONL file based on the scoring schema defined in a prompt. Supports different models such as Llama or Gemma.
Script that combines annotations with the original documents in the following format (suggestion only):
```
{ [original values], annotations: [{tag: educational, model: llama, score: 1}, ...]}
```
Assignees: Max, Abbas, Mehdi, Alex, Richard, (Felix)
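The combining script might look roughly like the sketch below, which attaches an `annotations` list to each JSONL document in the suggested format. The document key `id`, the in-memory `annotations_by_id` mapping, and both function names are hypothetical; the milestone does not say how annotations are matched to documents.

```python
import json
from typing import Iterable, Iterator, Mapping


def merge_annotations(doc: dict, annotations: Iterable[dict]) -> dict:
    """Return a copy of the document with an 'annotations' list added."""
    merged = dict(doc)  # keep all original values untouched
    merged["annotations"] = [
        {"tag": a["tag"], "model": a["model"], "score": a["score"]}
        for a in annotations
    ]
    return merged


def merge_jsonl(
    doc_lines: Iterable[str],
    annotations_by_id: Mapping[str, list],
    id_key: str = "id",  # hypothetical join key
) -> Iterator[str]:
    """Yield one JSON line per input document, with its annotations attached."""
    for line in doc_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSONL input
        doc = json.loads(line)
        yield json.dumps(merge_annotations(doc, annotations_by_id.get(doc[id_key], [])))


doc_lines = ['{"id": "doc-1", "text": "Photosynthesis converts light into energy."}']
annotations_by_id = {
    "doc-1": [{"tag": "educational", "model": "llama", "score": 1}],
}
merged_docs = [json.loads(line) for line in merge_jsonl(doc_lines, annotations_by_id)]
```

Documents without any matching annotation receive an empty `annotations` list, so every output line has the same shape.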
Due by November 19, 2024 • 40/40 issues closed