Description
I'm attempting to run the training script for the GPT-2 CALM model on the ClubFloyd dataset, following the instructions from your EMNLP 2020 paper. I've set up my environment as recommended but am running into problems with the training process.
Environment:
- Python version: 3.6.15
- Operating System: Ubuntu 20.04
- GPU: Nvidia Titan RTX
- Dependencies: torch==1.4, transformers==2.5.1, jericho, fasttext, wandb, importlib_metadata
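For reference, a quick check that the runtime matches the versions listed above:

```python
# Sanity check for the versions and GPU listed above.
import torch
import transformers

print(torch.__version__)              # expect 1.4.x
print(transformers.__version__)       # expect 2.5.1
print(torch.cuda.is_available())      # expect True
print(torch.cuda.get_device_name(0))  # expect the Titan RTX
```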
Issue:
The training doesn't behave as expected: the model overfits to the training data (training accuracy climbs and training loss keeps falling), while validation performance barely improves or even worsens, regardless of adjustments to hyperparameters such as batch size and GPU count.
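For clarity on the table below, the loss/accuracy columns are per-iteration metrics on the train and validation splits. Here is a simplified sketch of a token-level evaluation pass in the style I'm using (my paraphrase with placeholder names, not the repo's exact code):

```python
# Sketch (paraphrased, not the repo's exact code) of computing per-split
# cross-entropy loss and next-token accuracy under a GPT-2 LM head.
import torch

@torch.no_grad()
def evaluate(model, loader, device="cuda"):
    model.eval()
    total_loss, total_correct, total_tokens = 0.0, 0, 0
    for input_ids in loader:                   # each batch: (B, T) token ids
        input_ids = input_ids.to(device)
        outputs = model(input_ids, labels=input_ids)
        loss, logits = outputs[0], outputs[1]  # transformers 2.5.1 returns tuples
        preds = logits[:, :-1].argmax(-1)      # predict token t+1 from the prefix
        targets = input_ids[:, 1:]
        total_correct += (preds == targets).sum().item()
        total_tokens += targets.numel()
        total_loss += loss.item() * targets.numel()
    return total_loss / total_tokens, total_correct / total_tokens
```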
Attempts:
| Params | Iteration | Train Acc | Val Acc | Train Loss | Val Loss |
|---|---|---|---|---|---|
| num GPU = 1, batch size = 1 | 1 | 0.14 | 0.15 | 2.38 | 2.35 |
|  | 2 | 0.18 | 0.14 | 2.01 | 2.35 |
|  | 3 | 0.22 | 0.15 | 1.80 | 2.43 |
|  | 4 | 0.26 | 0.14 | 1.63 | 2.56 |
|  | 5 | 0.30 | 0.14 | 1.50 | 2.71 |
| num GPU = 3, batch size = 1 | 1 | 0.13 | 0.14 | 0.79 | 2.30 |
|  | 2 | 0.17 | 0.14 | 0.67 | 2.26 |
|  | 3 | 0.20 | 0.15 | 0.61 | 2.28 |
|  | 4 | 0.22 | 0.15 | 0.57 | 2.33 |
|  | 5 | 0.25 | 0.14 | 0.53 | 2.38 |
| num GPU = 1, batch size = 15 | 1 | 0.10 | 0.13 | 0.18 | 2.32 |
|  | 2 | 0.13 | 0.13 | 0.15 | 2.28 |
|  | 3 | 0.15 | 0.14 | 0.14 | 2.27 |
|  | 4 | 0.17 | 0.14 | 0.13 | 2.27 |
|  | 5 | 0.18 | 0.14 | 0.13 | 2.31 |
| num GPU = 3, batch size = 15 | 1 | 0.10 | 0.12 | 0.06 | 2.35 |
|  | 2 | 0.12 | 0.13 | 0.05 | 2.30 |
|  | 3 | 0.14 | 0.13 | 0.05 | 2.29 |
|  | 4 | 0.15 | 0.14 | 0.05 | 2.28 |
|  | 5 | 0.16 | 0.13 | 0.05 | 2.27 |
| num GPU = 8, batch size = 12 | 1 | 0.09 | 0.11 | 0.03 | 2.41 |
|  | 2 | 0.12 | 0.12 | 0.03 | 2.34 |
|  | 3 | 0.13 | 0.13 | 0.02 | 2.31 |
|  | 4 | 0.14 | 0.13 | 0.02 | 2.29 |
|  | 5 | 0.14 | 0.14 | 0.02 | 2.29 |
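The pattern is consistent across configurations: train loss keeps dropping while val loss bottoms out after 2-4 iterations and then flattens or rises. To at least avoid keeping the most-overfit checkpoint, I can wrap the runs in early stopping on validation loss; a minimal sketch around the `evaluate` helper above (`train_one_epoch` is a placeholder for the repo's training step, not its actual API):

```python
# Minimal early-stopping wrapper (placeholder helpers, not the repo's code):
# stop when validation loss hasn't improved for `patience` consecutive
# iterations, and restore the checkpoint from the best iteration.
import copy

def train_with_early_stopping(model, train_loader, val_loader,
                              max_epochs=20, patience=2):
    best_val, best_state, stale = float("inf"), None, 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model, train_loader)       # placeholder train step
        val_loss, val_acc = evaluate(model, val_loader)
        if val_loss < best_val:
            best_val, stale = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            stale += 1
            if stale >= patience:
                break                              # val loss stopped improving
    model.load_state_dict(best_state)
    return model, best_val
```

This pins down the best checkpoint, but it doesn't explain why validation accuracy plateaus near 0.14 in every configuration.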
Request:
Do you have any ideas about why these training runs might not be converging, whether due to hardware differences, hyperparameter settings, or something else?
Thank you for your time.