Having issues getting training to converge - hyperparameter issue? #8

@dayvidwang

Description

I'm attempting to run the training script for the GPT-2 CALM model on the ClubFloyd dataset, following the instructions from your EMNLP 2020 paper. I've set up my environment as recommended but am running into problems with training.

Environment:

- Python version: 3.6.15
- Operating System: Ubuntu 20.04
- GPU: Nvidia Titan RTX
- Dependencies: torch==1.4, transformers==2.5.1, jericho, fasttext, wandb, importlib_metadata
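
In case it helps, here is the quick sanity check I used to confirm the pinned versions and GPU visibility (standard torch/transformers calls, nothing repo-specific):

```python
# Confirm pinned library versions and GPU visibility.
import torch
import transformers

print("torch:", torch.__version__)                # expect 1.4.0
print("transformers:", transformers.__version__)  # expect 2.5.1
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```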

Issue:

Training overfits: train accuracy climbs and train loss falls steadily, while validation accuracy stays flat around 0.13-0.15 and validation loss plateaus or gets worse. This holds even after adjusting hyperparameters such as batch size and GPU count.
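
To be concrete about how I'm judging "hardly improves": I track validation loss after each iteration and treat the run as stalled once it stops improving. A minimal sketch of that check (`train_one_epoch` and `evaluate` are stand-ins for the script's actual train/eval steps, not names from the repo):

```python
# Early-stopping check on validation loss (sketch; helper names are hypothetical).
best_val_loss = float("inf")
patience, bad_iters = 2, 0
for iteration in range(1, 6):
    train_one_epoch(model, train_loader)    # stand-in for the script's training step
    val_loss = evaluate(model, val_loader)  # stand-in for the script's validation step
    if val_loss < best_val_loss - 1e-3:     # require a small absolute improvement
        best_val_loss, bad_iters = val_loss, 0
    else:
        bad_iters += 1
        if bad_iters >= patience:
            break  # validation stalled while train loss kept falling
```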

Attempts:

| num GPU | batch size | Iteration | Train Acc | Val Acc | Train Loss | Val Loss |
|---------|------------|-----------|-----------|---------|------------|----------|
| 1 | 1  | 1 | 0.14 | 0.15 | 2.38 | 2.35 |
| 1 | 1  | 2 | 0.18 | 0.14 | 2.01 | 2.35 |
| 1 | 1  | 3 | 0.22 | 0.15 | 1.80 | 2.43 |
| 1 | 1  | 4 | 0.26 | 0.14 | 1.63 | 2.56 |
| 1 | 1  | 5 | 0.30 | 0.14 | 1.50 | 2.71 |
| 3 | 1  | 1 | 0.13 | 0.14 | 0.79 | 2.30 |
| 3 | 1  | 2 | 0.17 | 0.14 | 0.67 | 2.26 |
| 3 | 1  | 3 | 0.20 | 0.15 | 0.61 | 2.28 |
| 3 | 1  | 4 | 0.22 | 0.15 | 0.57 | 2.33 |
| 3 | 1  | 5 | 0.25 | 0.14 | 0.53 | 2.38 |
| 1 | 15 | 1 | 0.10 | 0.13 | 0.18 | 2.32 |
| 1 | 15 | 2 | 0.13 | 0.13 | 0.15 | 2.28 |
| 1 | 15 | 3 | 0.15 | 0.14 | 0.14 | 2.27 |
| 1 | 15 | 4 | 0.17 | 0.14 | 0.13 | 2.27 |
| 1 | 15 | 5 | 0.18 | 0.14 | 0.13 | 2.31 |
| 3 | 15 | 1 | 0.10 | 0.12 | 0.06 | 2.35 |
| 3 | 15 | 2 | 0.12 | 0.13 | 0.05 | 2.30 |
| 3 | 15 | 3 | 0.14 | 0.13 | 0.05 | 2.29 |
| 3 | 15 | 4 | 0.15 | 0.14 | 0.05 | 2.28 |
| 3 | 15 | 5 | 0.16 | 0.13 | 0.05 | 2.27 |
| 8 | 12 | 1 | 0.09 | 0.11 | 0.03 | 2.41 |
| 8 | 12 | 2 | 0.12 | 0.12 | 0.03 | 2.34 |
| 8 | 12 | 3 | 0.13 | 0.13 | 0.02 | 2.31 |
| 8 | 12 | 4 | 0.14 | 0.13 | 0.02 | 2.29 |
| 8 | 12 | 5 | 0.14 | 0.14 | 0.02 | 2.29 |
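
One pattern worth noting: the reported train loss scales almost inversely with the effective batch size (per-GPU batch size times GPU count), which makes me wonder whether the logged train loss is averaged over the whole accumulated batch rather than per example. A quick check on the iteration-1 numbers from the table (my own arithmetic, not output from the script):

```python
# Iteration-1 train loss from each run above, paired with effective batch size.
# Tuples are (num_gpus, per_gpu_batch_size, reported_train_loss).
runs = [(1, 1, 2.38), (3, 1, 0.79), (1, 15, 0.18), (3, 15, 0.06), (8, 12, 0.03)]
for gpus, batch, loss in runs:
    eff = gpus * batch
    # If the logged loss were per example, loss * eff should be roughly constant.
    print(f"gpus={gpus} batch={batch:2d} eff={eff:3d} loss*eff={loss * eff:.2f}")
```

This prints loss*eff values of about 2.38, 2.37, 2.70, 2.70, and 2.88, i.e. close to the batch-size-1 loss, so the runs may be more comparable than the raw train-loss column suggests.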

Request:

Do you have any ideas about why these training runs might not be converging, whether it's due to a hardware difference, a hyperparameter difference, or something else?

Thank you for your time.
