This project implements a classification task using a Transformer model. On the IMDB sentiment analysis task, it achieves an accuracy of over 85%.
It also contains BERT training and the following:
- Transformer-based neural MT (NMT) training and decoding
- Training and fine-tuning of mBart for neural MT (experimental)
- BERT encoder (default BERT)

Requirements:
- Python (3.6+)
- PyTorch (1.3+)
- SentencePiece
- NumPy

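
Before installing, it can help to see which of the requirements listed above are already present. The following is a minimal sketch and not part of the project: the helper name `check_requirements` is hypothetical, and the import names `torch`, `sentencepiece`, and `numpy` are assumed to correspond to the listed packages. It only checks that the modules can be found and that the interpreter is recent enough. It does not verify package versions such as PyTorch 1.3+.

```python
import importlib.util
import sys

# Import names assumed (not confirmed by the project) to match the listed packages.
REQUIRED_MODULES = ("torch", "sentencepiece", "numpy")


def check_requirements(modules=REQUIRED_MODULES, min_python=(3, 6)):
    """Return a dict mapping each requirement name to True (found) or False."""
    report = {"python": sys.version_info[:2] >= min_python}
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for name, found in check_requirements().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

Anything reported as missing should be covered by the `pip3 install -r requirements.txt` step.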
pip3 install -r requirements.txt
python -m spacy download en
cd examples/translation/
bash prepare-iwslt14.sh
cd -
bash prep.sh
bash train.sh
bash decode.sh
bash translate_file.sh
cd examples/translation/
bash prepare-iwslt14.sh
# This adds a language tag at the end of each segment in the corpus
sed -e 's/$/ <\/s> <EN>/' train.en > src-train-mbart.txt
sed -e 's/$/ <\/s> <DE>/' train.de >> src-train-mbart.txt
sed -e 's/^/<EN> /' train.en > temp-file.en
sed -e 's/^/<DE> /' train.de > temp-file.de
sed -e 's/$/ <\/s> <EN>/' temp-file.en > tgt-train-mbart.txt
sed -e 's/$/ <\/s> <DE>/' temp-file.de >> tgt-train-mbart.txt
sed -e 's/$/ <\/s> <EN>/' valid.en > src-valid-mbart.txt
sed -e 's/$/ <\/s> <DE>/' valid.de >> src-valid-mbart.txt
sed -e 's/^/<EN> /' valid.en > temp-file.en
sed -e 's/^/<DE> /' valid.de > temp-file.de
sed -e 's/$/ <\/s> <EN>/' temp-file.en > tgt-valid-mbart.txt
sed -e 's/$/ <\/s> <DE>/' temp-file.de >> tgt-valid-mbart.txt
rm temp-file.en temp-file.de
cd -
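
To make the effect of the `sed` commands above easier to see, here is a rough Python sketch of the same tagging scheme applied to in-memory segments. It is an illustration only, not the project's code: the function names and the sample sentences are hypothetical. Source segments get a ` </s> <LANG>` suffix, target segments additionally get a `<LANG> ` prefix, and both languages end up in the same source and target lists, mirroring the `>` followed by `>>` redirections.

```python
def tag_source(segment, lang):
    """Append the end-of-sentence marker and the language tag."""
    return f"{segment} </s> <{lang}>"


def tag_target(segment, lang):
    """Prefix the language tag, then append the marker and the tag."""
    return f"<{lang}> {segment} </s> <{lang}>"


def build_mbart_corpus(corpora):
    """Build (source_lines, target_lines) from a list of (lang, segments) pairs.

    Languages are concatenated in the order given, so source and target
    stay line-aligned.
    """
    source_lines, target_lines = [], []
    for lang, segments in corpora:
        for segment in segments:
            source_lines.append(tag_source(segment, lang))
            target_lines.append(tag_target(segment, lang))
    return source_lines, target_lines


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the IWSLT14 corpus.
    src, tgt = build_mbart_corpus([("EN", ["hello world"]), ("DE", ["hallo welt"])])
    for s, t in zip(src, tgt):
        print(s, "|", t)
```
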
bash prep_mbart.sh
bash train_mbart.sh
Note: This model can now be used directly for NMT training as described in the section above. Simply provide the model path (--save_model), and it will be used automatically for further fine-tuning.
It is also important to use the same vocabulary for corpus preparation as the one used for mBart training. See the sample shell scripts in the following section for both corpus preparation and training.
# This adds a language tag at the end of each segment in the corpus
sed -e 's/$/ <\/s> <EN>/' train.en > src-train-finetune-mbart.txt
sed -e 's/^/<DE> /' train.de > temp-file.de
sed -e 's/$/ <\/s> <DE>/' temp-file.de > tgt-train-finetune-mbart.txt
sed -e 's/$/ <\/s> <EN>/' valid.en > src-valid-finetune-mbart.txt
sed -e 's/^/<DE> /' valid.de > temp-file.de
sed -e 's/$/ <\/s> <DE>/' temp-file.de > tgt-valid-finetune-mbart.txt
rm temp-file.de
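
For fine-tuning, the commands above tag a parallel corpus rather than a mixed one: English source segments get a ` </s> <EN>` suffix, and German target segments get a `<DE> ` prefix plus a ` </s> <DE>` suffix. The sketch below shows this on in-memory data, assuming one segment per line and line-aligned files. The function names and sample sentences are hypothetical, and the length check is an added safeguard that the shell commands do not perform.

```python
def tag_parallel_pair(src_segment, tgt_segment, src_lang="EN", tgt_lang="DE"):
    """Tag one source/target pair in the fine-tuning format."""
    src = f"{src_segment} </s> <{src_lang}>"
    tgt = f"<{tgt_lang}> {tgt_segment} </s> <{tgt_lang}>"
    return src, tgt


def tag_parallel_corpus(src_segments, tgt_segments, src_lang="EN", tgt_lang="DE"):
    """Tag every aligned pair; refuse corpora whose sides differ in length."""
    if len(src_segments) != len(tgt_segments):
        raise ValueError("source and target must have the same number of segments")
    return [
        tag_parallel_pair(s, t, src_lang, tgt_lang)
        for s, t in zip(src_segments, tgt_segments)
    ]


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the IWSLT14 corpus.
    for src, tgt in tag_parallel_corpus(["good morning"], ["guten morgen"]):
        print(src, "|", tgt)
```
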
bash prep_finetune_mbart_nmt.sh
bash finetune_mbart_nmt.sh
cd examples/translation/
bash prepare-iwslt14.sh
cat train.en > train-roberta.txt
cat train.de >> train-roberta.txt
cat valid.en > valid-roberta.txt
cat valid.de >> valid-roberta.txt
cd -
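
The `cat` commands above simply concatenate the English and German files into one monolingual training file and one validation file. A rough Python equivalent, useful when a shell is not available, might look like the sketch below. The helper name `concatenate_corpora` is hypothetical, and UTF-8 encoding is an assumption about the corpus files.

```python
def concatenate_corpora(input_paths, output_path):
    """Copy the lines of each input file, in order, into output_path.

    Returns the total number of lines written. UTF-8 is assumed.
    """
    total_lines = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    out.write(line)
                    total_lines += 1
    return total_lines
```

For example, `concatenate_corpora(["train.en", "train.de"], "train-roberta.txt")` would produce the same file as the first two `cat` commands.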
bash prep_roberta.sh
bash train_roberta.sh
python classify.py
Raj Nath Patel (patelrajnath@gmail.com)
Linkedin: https://ie.linkedin.com/in/raj-nath-patel-2262b024
0.1
Copyright Raj Nath Patel 2020 - present
Pytorch-dl is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
You should have received a copy of the GNU General Public License along with the Pytorch-dl project. If not, see http://www.gnu.org/licenses/.