In both the Jupyter notebooks and the paper, I noticed that instead of Adam, the most commonly used optimizer for transformers, you used Adagrad for all of the experiments. Is there a reason behind this, or was it simply an empirical observation?
Additionally, are other newly developed optimizers (RAdam, NovoGrad, DiffGrad, etc.) compatible with the method introduced, or do they defeat its purpose?