This repository contains the CGSchnet ML model for molecular dynamics and machine learning pipelines, providing tools for training, testing, and building upon coarse-grained neural network force fields. The model architecture is inspired by CGSchNet.
This repository implements a machine learning framework for coarse-grained molecular dynamics simulations. The pipeline consists of three main stages:
- Preprocessing: Convert all-atom MD trajectories into coarse-grained representations and fit classical prior force fields
- Training: Train neural network models to predict delta forces (corrections to the prior force field)
- Simulation: Run MD simulations using the trained models combined with prior force fields
The system supports multiple coarse-graining schemes (CA, CACB) and various prior force field configurations with bonds, angles, dihedrals, and non-bonded terms.
- Python 3.8+
- PyTorch 2.0+
- CUDA-capable GPU (recommended)
- Required Python packages: torchmd, moleculekit, mdtraj, numpy, pyyaml
Key hyperparameters specified in config files (e.g., configs/config.yaml):
embedding_dimension: Dimension of learned atomic embeddingsnum_layers: Number of interaction layers in the neural networknum_rbf: Number of radial basis functionscutoff: Interaction cutoff distancemax_num_neighbors: Maximum neighbors per atomactivation: Activation function (e.g., 'silu')
python train.py <input_dir> <output_dir> [options]Required:
input_dir: Path to preprocessed data directoryoutput_dir: Directory for saving checkpoints and results
Key Options:
--config: Path to model configuration YAML file--gpus: GPU IDs to use (e.g., "0,1,2,3" or "cpu")--batch: Batch size for training--epochs: Number of training epochs--lr: Learning rate--wd: Weight decay for regularization--atoms-per-call: Max atoms per sub-batch (for memory management)--energy-weight: Weight for energy loss term--force-weight: Weight for force loss term--val-ratio: Validation set ratio (default: 0.1)--early-stopping: Patience for early stopping (default: 1)
Learning Rate Schedulers:
--exp-lr: Exponential decay (parameter: gamma)--cos-lr: Cosine annealing (parameters: T_max, eta_min)--plateau-lr: Reduce on plateau (parameters: factor, patience, threshold, min_lr)
Preprocessing converts all-atom MD trajectories into coarse-grained representations and fits prior force fields.
python preprocess.py <input_path> -o <output_dir> --prior <prior_type>python preprocess.py benchmark_set.yaml \
-o /path/to/output/preprocessed_CA_Majewski \
--prior CA_Majewski2022_v1 \
--temp 300 \
--num-cores 32Required:
input: Input directory or YAML config file listing trajectories-o, --output: Output directory for preprocessed data--prior: Prior force field type (see Available Priors below)
Optional:
--pdbids: Specific PDB IDs to process--num-frames: Number of frames to process per trajectory--frame-slice: Python slice notation (e.g., "0:1000:10")--temp: Temperature in Kelvin (default: 300)--optimize-forces: Use statistically optimal force aggregation--prior-file: Use existing prior instead of fitting new one--no-box: Disable periodic boundary conditions--num-cores: Number of parallel processes (default: 32)
CA: Basic CA representation with bonds onlyCA_lj: CA with bonds and repulsionCA_lj_angle: CA with bonds, angles, and repulsionCA_lj_angle_dihedral: CA with all classical termsCA_lj_angleXCX_dihedralX: CA with unified angle/dihedral parametersCA_Majewski2022_v1: Parameters from Majewski et al. 2022CACB: CA-CB representationCACB_lj: CA-CB with repulsionCACB_lj_angle_dihedral: CA-CB with all terms
Preprocessing expects HDF5 files with structure:
input_dir/
├── pdb_id_1/
│ └── result/
│ └── output_pdb_id_1.h5 # Contains coordinates, forces, box (optional)
├── pdb_id_2/
│ └── result/
│ └── output_pdb_id_2.h5
...
output_dir/
├── priors.yaml # Fitted prior force field parameters
├── prior_params.json # Force field configuration
├── result/
│ ├── info.json # Preprocessing metadata
│ └── ok_list.txt # Successfully processed PDB IDs
└── <pdb_id>/
├── raw/
│ ├── coordinates.npy # CG coordinates
│ ├── forces.npy # CG forces from all-atom MD
│ ├── deltaforces.npy # Delta forces (MD - prior)
│ ├── embeddings.npy # Atom type embeddings
│ └── box.npy # Periodic box (if used)
└── processed/
└── topology.psf # CG topology
Train neural network models to predict corrections to the prior force field.
python train.py \
/path/to/preprocessed_data \
/path/to/results \
--config configs/config.yaml \
--gpus 0,1,2,3 \
--batch 4 \
--epochs 35 \
--lr 0.001 \
--exp-lr 0.85 \
--atoms-per-call 140000 \
--energy-weight 0.1 \
--force-weight 1.0python train.py \
/path/to/preprocessed_with_tica \
/path/to/results_energy_matching \
--config configs/config_cutoff2.yaml \
--gpus 0,1,2,3 \
--batch 4 \
--epochs 35 \
--lr 0.001 \
--wd 0 \
--exp-lr 0.85 \
--atoms-per-call 140000 \
--energy-weight 0.5 \
--force-weight 1.0To resume from a checkpoint, simply provide the same output directory:
python train.py \
/path/to/preprocessed_data \
/path/to/existing_results \
--config configs/config.yaml \
--gpus 0,1,2,3results_dir/
├── checkpoint.pth # Latest model checkpoint
├── checkpoint-best.pth # Best validation loss checkpoint
├── history.npy # Training/validation loss history
├── epoch_history.json # Detailed per-epoch metrics
├── training_info.json # Training configuration
├── priors.yaml # Copy of prior force field
└── prior_params.json # Copy of prior configuration
Use trained models to run coarse-grained MD simulations.
python simulate.py \
/path/to/results/checkpoint.pth \
/path/to/starting_structure.pdb \
--output simulation.h5 \
--steps 1000000 \
--save-steps 1000 \
--temperature 300 \
--timestep 2 \
--replicas 1python simulate.py \
/path/to/results/checkpoint-best.pth \
/path/to/structure1.pdb /path/to/structure2.pdb /path/to/structure3.pdb \
--output replicas.pkl \
--steps 5000000 \
--save-steps 1000 \
--temperature 300 \
--replicas 3To run simulations using only the classical prior (without ML corrections):
python simulate.py \
/path/to/results/checkpoint.pth \
/path/to/starting_structure.pdb \
--output prior_only.h5 \
--prior-only \
--steps 1000000 \
--save-steps 1000Required:
checkpoint_path: Path to trained model checkpointprocessed_path: One or more starting structures (PDB or preprocessed directories)
Key Options:
--output: Output trajectory file (.pdb, .h5, or .pkl)--steps: Total simulation steps--save-steps: Save frequency (frames)--temperature: Simulation temperature in Kelvin (default: 300)--timestep: Integration timestep in femtoseconds (default: 1)--replicas: Number of parallel replicas (default: 1)--prior-only: Run without ML model (classical prior only)--no-box: Disable periodic boundary conditions--max-num-neighbors: Override max neighbors parameter
- PDB format (
.pdb): Human-readable, compatible with visualization tools - HDF5 format (
.h5): Compact binary format with energies and metadata - Pickle format (
.pkl): Contains MDTraj trajectory objects for analysis
Reset optimizer or scheduler state in checkpoints:
python edit_checkpoint.py /path/to/checkpoint.pth --reset-optimizer --reset-schedulerView checkpoint information:
python edit_checkpoint.py /path/to/checkpoint.pth --infoTrain classical prior parameters directly from data:
python learn_prior.py \
/path/to/preprocessed_data \
/path/to/prior_output \
--epochs 50 \
--batch 10 \
--lr 0.005Inspired by CGSchNet.
License information will be added soon.
We welcome contributions! Please use GitHub Issues to:
- Request new features
- Report bugs
- Suggest improvements
- Ask questions about the pipeline