Milestones

  • No due date
    8/8 issues closed
  • Implementation of the classifier-based annotation pipeline.
    Todo:
    * Decide on the inference backend (TGI, Optimum, vLLM)
    * Decide on the dataset format
    Assignees: Tim, Felix, Max

    No due date
    12/12 issues closed
  • We will have one classifier per prompt (possibly per language).

    Overdue by 1 year(s)
    Due by December 4, 2024
    3/3 issues closed
  • All datasets for training the BERT-based classifiers are generated.

    Overdue by 1 year(s)
    Due by November 28, 2024
    6/6 issues closed
  • ML-filtered FineWeb2 dataset, used for training the first models on MN5.

    Overdue by 11 month(s)
    Due by December 15, 2024
    4/4 issues closed
  • BERT-based scoring pipeline consisting of:
    * training BERT-based classifiers (using the HF script) on annotation data (multi-class classification with multiple BERT models vs. multi-label classification with a single BERT model)
    * evaluating the classifiers against ground-truth data (probably only educational scores)
    HF implementation:
    https://github.com/huggingface/cosmopedia/tree/main/classification
    https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier
    Assignees: Richard, Florian, David, (Abbas)

    Overdue by 1 year(s)
    Due by November 29, 2024
    34/34 issues closed
  • End-to-end pipeline that annotates all documents in a JSONL file based on the scoring schema defined in a prompt. Supports different models such as Llama or Gemma. Includes a script that combines the annotations with the original documents in the following format (suggestion only):
    ```
    { [original values], annotations: [{tag: educational, model: llama, score: 1}, ...]}
    ```
    Assignees: Max, Abbas, Mehdi, Alex, Richard, (Felix)

    Overdue by 1 year(s)
    Due by November 19, 2024
    40/40 issues closed
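
The combined format suggested in the end-to-end annotation pipeline milestone could be produced roughly as follows. This is only a minimal sketch, not the project's actual script: the `id` join key, the function names (`merge_annotations`, `read_jsonl`, `write_jsonl`) and the sample records are all assumptions made for illustration. It assumes each annotation record carries the id of the document it belongs to, plus `tag`, `model` and `score` fields as in the suggested format.

```python
import json
from collections import defaultdict


def read_jsonl(path):
    """Read a JSONL file into a list of dicts (one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def write_jsonl(path, records):
    """Write a list of dicts as JSONL."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def merge_annotations(documents, annotations, id_key="id"):
    """Attach all annotations to the document that shares their id.

    `id_key` is a hypothetical join field; the milestone does not
    specify how annotations reference their documents.
    """
    by_doc = defaultdict(list)
    for ann in annotations:
        # Keep everything except the join key (tag, model, score, ...).
        by_doc[ann[id_key]].append(
            {k: v for k, v in ann.items() if k != id_key}
        )

    merged = []
    for doc in documents:
        out = dict(doc)  # original values are kept unchanged
        out["annotations"] = by_doc.get(doc[id_key], [])
        merged.append(out)
    return merged


# Illustrative in-memory data (made up for this sketch).
docs = [{"id": "a", "text": "first doc"}, {"id": "b", "text": "second doc"}]
anns = [
    {"id": "a", "tag": "educational", "model": "llama", "score": 1},
    {"id": "a", "tag": "educational", "model": "gemma", "score": 2},
]
merged = merge_annotations(docs, anns)
```

Documents without any annotation keep an empty `annotations` list, so every output record has the same shape regardless of which models were run.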