I think there could be value in creating a separate dataset for pretraining. It would cover the same chemical space as the standard SPICE dataset, but have many more conformations and be computed at a much lower level of theory. The idea would be to pretrain your model on the large dataset, then fine-tune it on the smaller, higher-quality one.
This raises several questions.
- How large should the pretraining dataset be? I suggest roughly 10x the standard one.
- What level of theory should it use? An obvious choice would be GFN2-xTB, since it's very fast (a fraction of a second for most calculations) and its accuracy is not too bad.
- Should it include larger molecules than in the SPICE dataset? For example, longer peptides and drug molecules with more than 50 atoms.
- What results should it include? To keep the size manageable, I suggest energies, forces, and nothing else.
- How should the conformations be generated? In particular, should it include higher energy conformations than we currently have?
For example, the current dipeptides and PubChem subsets include 50 conformations for each molecule: 25 high-energy ones sampled at 500 K and 25 low-energy ones that are partially energy minimized. For the pretraining dataset we might instead include 100 conformations at each of four temperatures: 100 K, 300 K, 500 K, and 1000 K. In place of DES 370K we could use DES 5M.
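As a rough sanity check, here is a minimal sketch of how the per-molecule conformation counts compare between the current scheme and the one proposed above. The variable and function names are made up for illustration, and it only covers the dipeptides and PubChem style of sampling, not the DES swap or any added larger molecules.

```python
# Current scheme: conformations per molecule, by category.
current_plan = {
    "high energy (sampled at 500 K)": 25,
    "low energy (partially minimized)": 25,
}

# Proposed pretraining scheme: temperature in K -> conformations per molecule.
proposed_plan = {100: 100, 300: 100, 500: 100, 1000: 100}


def conformations_per_molecule(plan):
    """Total number of conformations a plan generates for one molecule."""
    return sum(plan.values())


current = conformations_per_molecule(current_plan)
proposed = conformations_per_molecule(proposed_plan)
ratio = proposed / current

print(f"current: {current}, proposed: {proposed}, ratio: {ratio:.1f}x")
```

For the same set of molecules this gives 400 conformations instead of 50, an 8x increase, which is in the same ballpark as the roughly 10x target. Adding larger molecules would push the total higher.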