This repository provides the code solution for Task 1 of the Data Fusion Contest.
Short description: a single DistilBERT model.
Place: 7/265 (top 3%)
Public LB = 0.8683
Private LB = 0.8674
To install requirements:

```bash
pip install -r requirements.txt
```
The task is to predict the predefined category of an item in a receipt based on its name.
Solution progression (each step is added on top of the previous ones):

- Baseline — the Russian part of multilingual DistilBERT as is (spoiler: it was cased): Public = 0.7875
- \+ Pretraining on the masked language modeling task: Public = 0.8261
- \+ Label smoothing: Public = 0.8323
- \+ Custom model architecture (weighted sum of hidden states + multi-sample dropout, see the sketch after this list): Public = 0.8354
- \+ Lowercasing: Public = 0.8459
- \+ Increasing the number of training epochs to 50: Public = 0.8532
- \+ Pseudolabeling (distilbert → distilbert): Public = 0.8626
- \+ Pseudolabeling (RuBERT → distilbert): Public = 0.8683
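The custom head combines a learnable weighted sum of all encoder hidden states with multi-sample dropout. Below is a minimal sketch of that idea; the checkpoint name, number of labels, dropout rate, and number of dropout samples are illustrative assumptions, not values taken from this repository:

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class WeightedLayerClassifier(nn.Module):
    """Sketch of a head with weighted hidden-state pooling + multi-sample dropout."""

    def __init__(self, model_name="distilbert-base-multilingual-cased",
                 num_labels=100, n_dropout_samples=5, p_dropout=0.3):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name,
                                                  output_hidden_states=True)
        # One learnable weight per hidden state (all layers + the embeddings).
        n_states = self.backbone.config.num_hidden_layers + 1
        self.layer_weights = nn.Parameter(torch.zeros(n_states))
        self.dropouts = nn.ModuleList(
            [nn.Dropout(p_dropout) for _ in range(n_dropout_samples)])
        self.classifier = nn.Linear(self.backbone.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        # Stack hidden states: (n_states, batch, seq_len, hidden),
        # then take a softmax-weighted sum over the layer dimension.
        hidden = torch.stack(out.hidden_states)
        w = torch.softmax(self.layer_weights, dim=0)
        pooled = (w[:, None, None, None] * hidden).sum(dim=0)[:, 0]  # [CLS] token
        # Multi-sample dropout: average logits over several dropout masks.
        return torch.stack(
            [self.classifier(drop(pooled)) for drop in self.dropouts]).mean(dim=0)
```

Label smoothing then only changes the loss, e.g. `nn.CrossEntropyLoss(label_smoothing=0.1)` in PyTorch 1.10+; the smoothing factor 0.1 is an assumption, as the exact value is not stated here.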
Training pipeline:

- Pretrain RuBERT and DistilBERT on all unique texts using the masked language modeling task: `train_mlm_base_tokenizer.ipynb` (see the MLM sketch below)
- Finetune the pretrained RuBERT on the labeled texts (~40k unique texts): `rubert_base.ipynb`
- Create pseudolabels for all ~1M unique texts using the finetuned RuBERT: `pseudo_label.ipynb` (see the pseudolabeling sketch below)
- Finetune DistilBERT on these pseudolabels: `pseudo_label.ipynb`
- Create the submission .zip with the finetuned DistilBERT: `pseudo_label.ipynb`
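The MLM pretraining step follows the standard Hugging Face recipe, roughly as sketched below; the checkpoint name, example texts, masking probability, epoch count, and output path are assumptions for illustration, not the notebook's exact settings:

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Hypothetical receipt item names; the repo pretrains on all unique texts.
texts = ["Молоко 3.2% 1л", "Хлеб бородинский"]
texts = [t.lower() for t in texts]  # lowercasing helped per the ablation above

model_name = "distilbert-base-multilingual-cased"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=64),
    batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens; the model learns to reconstruct them.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm_pretrained", num_train_epochs=3),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
model.save_pretrained("mlm_pretrained")
```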
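The pseudolabeling step (teacher → student) can be sketched as follows. The checkpoint path is hypothetical, and the hard argmax labels are an assumption — the notebook may keep soft probabilities instead:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical path; in this repo the teacher is the finetuned RuBERT
# and the student is the MLM-pretrained DistilBERT.
teacher = AutoModelForSequenceClassification.from_pretrained("rubert_finetuned").eval()
tokenizer = AutoTokenizer.from_pretrained("rubert_finetuned")

unlabeled_texts = ["сыр гауда 200г", "батарейки aa 4шт"]  # ~1M texts in the repo

# The teacher predicts a pseudolabel for every unlabeled text.
with torch.no_grad():
    enc = tokenizer(unlabeled_texts, padding=True, truncation=True,
                    max_length=64, return_tensors="pt")
    pseudo_labels = teacher(**enc).logits.argmax(dim=-1)

# The student (DistilBERT) is then finetuned on (text, pseudo_label) pairs
# exactly as it would be on real labels; only the targets change.
```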