In both the Jupyter notebooks and the paper, I noticed that instead of Adam, the most commonly used optimizer for transformers, you used Adagrad for all of the experiments. Is there a reason behind this, or was it simply an empirical observation?
Additionally, are other newly developed optimizers (RAdam, NovoGrad, DiffGrad, etc.) compatible with the method introduced, or do they defeat its purpose?