As per the discussion in the January e3nn meeting, it is currently difficult to train models with large L. We suspect this is because different paths in the network (in1 x in2 -> out... wash, rinse, repeat) have very different sensitivities to inputs and convolutions, which calls for careful regularization.
Several strategies were suggested:
- Choose different learning rates for the parameters of different paths or output L's (@mariogeiger); see the first sketch after this list
- Change the learning rates of different paths over the course of training (@mariogeiger); also covered in the first sketch below
- Initialize weights based on path (@mariogeiger); see the second sketch below
  - e.g. start with a purely scalar network that learns to include higher tensor contributions
- Start off with a scalar-only network, train it, and then gradually add higher L's (@JoshRackers and @muhrin); see the third sketch below
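
For the first two ideas, here is a minimal PyTorch sketch, assuming the per-path weights can be collected separately (e.g. by the output L they feed). The toy `path_weights` dict, the per-L decay factor, and the warmup schedule are all placeholders to tune, not e3nn API:

```python
import torch

# Toy stand-in: one weight tensor per path, keyed by the output L it feeds.
# In a real e3nn network these would be the per-path weights of the
# TensorProduct layers; this dict is purely illustrative.
path_weights = {L: torch.nn.Parameter(torch.randn(8, 8)) for L in range(3)}

# Strategy 1: one optimizer param group per output L, with a learning rate
# that shrinks geometrically with L (the factor of 2 is a guess to tune).
base_lr = 1e-2
optimizer = torch.optim.Adam(
    [{"params": [w], "lr": base_lr / 2.0 ** L} for L, w in path_weights.items()]
)

# Strategy 2: re-scale the group learning rates during training so that
# higher-L paths warm up later (warmup lengths are illustrative).
def update_path_lrs(optimizer: torch.optim.Optimizer, step: int) -> None:
    for group, L in zip(optimizer.param_groups, path_weights):
        warmup = 1000 * (L + 1)          # higher L waits longer
        ramp = min(1.0, step / warmup)   # linear ramp from 0 to the full rate
        group["lr"] = (base_lr / 2.0 ** L) * ramp
```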
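
For path-based initialization, one way to get the "purely scalar at the start" behaviour is to shrink the initial weights of higher-L paths (reusing the toy `path_weights` dict from above; the shrink factor is an assumption):

```python
import torch

path_weights = {L: torch.nn.Parameter(torch.randn(8, 8)) for L in range(3)}

# Scale initial weights down by L so the network starts out effectively
# scalar and gradients have to grow the tensor contributions back in.
with torch.no_grad():
    for L, w in path_weights.items():
        w.mul_(1e-2 ** L)  # L=0 untouched, L=1 scaled by 1e-2, L=2 by 1e-4
```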
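
And for the gradual-L curriculum, one simple (hypothetical) mechanism is to freeze all paths above a rising L cutoff. Note that freezing only stops updates, so for a truly scalar start it should be combined with the small initialization above or an explicit mask on the higher-L outputs:

```python
import torch

path_weights = {L: torch.nn.Parameter(torch.randn(8, 8)) for L in range(3)}

def set_active_Lmax(step: int, steps_per_L: int = 5000) -> None:
    """Freeze every path above the current L cutoff; the cutoff rises as
    training proceeds (the 5000-step interval is illustrative)."""
    Lmax = step // steps_per_L
    for L, w in path_weights.items():
        w.requires_grad_(L <= Lmax)

# In the training loop, before backward():
#   set_active_Lmax(step)
# Adam skips parameters with no gradient, so frozen paths stay put.
```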
Please add to the thread if I missed anything.