CGML Base Model Repository

This repository contains the CGSchnet ML model for molecular dynamics and machine learning pipelines, providing tools for training, testing, and building upon coarse-grained neural network force fields. The model architecture is inspired by CGSchNet.

Overview

This repository implements a machine learning framework for coarse-grained molecular dynamics simulations. The pipeline consists of three main stages:

Preprocessing: Convert all-atom MD trajectories into coarse-grained representations and fit classical prior force fields
Training: Train neural network models to predict delta forces (corrections to the prior force field)
Simulation: Run MD simulations using the trained models combined with prior force fields

The system supports multiple coarse-graining schemes (CA, CACB) and various prior force field configurations with bonds, angles, dihedrals, and non-bonded terms.

Installation

⚠️ Note: Installation instructions will be provided as the code is finalized for public release.

Prerequisites

Python 3.8+
PyTorch 2.0+
CUDA-capable GPU (recommended)
Required Python packages: torchmd, moleculekit, mdtraj, numpy, pyyaml

Model Inputs

Training Configuration

Key hyperparameters specified in config files (e.g., configs/config.yaml):

embedding_dimension: Dimension of learned atomic embeddings
num_layers: Number of interaction layers in the neural network
num_rbf: Number of radial basis functions
cutoff: Interaction cutoff distance
max_num_neighbors: Maximum neighbors per atom
activation: Activation function (e.g., 'silu')

Training Arguments

python train.py <input_dir> <output_dir> [options]

Required:

input_dir: Path to preprocessed data directory
output_dir: Directory for saving checkpoints and results

Key Options:

--config: Path to model configuration YAML file
--gpus: GPU IDs to use (e.g., "0,1,2,3" or "cpu")
--batch: Batch size for training
--epochs: Number of training epochs
--lr: Learning rate
--wd: Weight decay for regularization
--atoms-per-call: Max atoms per sub-batch (for memory management)
--energy-weight: Weight for energy loss term
--force-weight: Weight for force loss term
--val-ratio: Validation set ratio (default: 0.1)
--early-stopping: Patience for early stopping (default: 1)

Learning Rate Schedulers:

--exp-lr: Exponential decay (parameter: gamma)
--cos-lr: Cosine annealing (parameters: T_max, eta_min)
--plateau-lr: Reduce on plateau (parameters: factor, patience, threshold, min_lr)

Preprocessing

Preprocessing converts all-atom MD trajectories into coarse-grained representations and fits prior force fields.

Basic Usage

python preprocess.py <input_path> -o <output_dir> --prior <prior_type>

Example: Preprocessing with CA Representation

python preprocess.py benchmark_set.yaml \
    -o /path/to/output/preprocessed_CA_Majewski \
    --prior CA_Majewski2022_v1 \
    --temp 300 \
    --num-cores 32

Preprocessing Arguments

Required:

input: Input directory or YAML config file listing trajectories
-o, --output: Output directory for preprocessed data
--prior: Prior force field type (see Available Priors below)

Optional:

--pdbids: Specific PDB IDs to process
--num-frames: Number of frames to process per trajectory
--frame-slice: Python slice notation (e.g., "0:1000:10")
--temp: Temperature in Kelvin (default: 300)
--optimize-forces: Use statistically optimal force aggregation
--prior-file: Use existing prior instead of fitting new one
--no-box: Disable periodic boundary conditions
--num-cores: Number of parallel processes (default: 32)

Available Prior Force Fields

CA: Basic CA representation with bonds only
CA_lj: CA with bonds and repulsion
CA_lj_angle: CA with bonds, angles, and repulsion
CA_lj_angle_dihedral: CA with all classical terms
CA_lj_angleXCX_dihedralX: CA with unified angle/dihedral parameters
CA_Majewski2022_v1: Parameters from Majewski et al. 2022
CACB: CA-CB representation
CACB_lj: CA-CB with repulsion
CACB_lj_angle_dihedral: CA-CB with all terms

Input Data Format

Preprocessing expects HDF5 files with structure:

input_dir/
├── pdb_id_1/
│   └── result/
│       └── output_pdb_id_1.h5  # Contains coordinates, forces, box (optional)
├── pdb_id_2/
│   └── result/
│       └── output_pdb_id_2.h5
...

Output Structure

output_dir/
├── priors.yaml              # Fitted prior force field parameters
├── prior_params.json        # Force field configuration
├── result/
│   ├── info.json           # Preprocessing metadata
│   └── ok_list.txt         # Successfully processed PDB IDs
└── <pdb_id>/
    ├── raw/
    │   ├── coordinates.npy  # CG coordinates
    │   ├── forces.npy       # CG forces from all-atom MD
    │   ├── deltaforces.npy  # Delta forces (MD - prior)
    │   ├── embeddings.npy   # Atom type embeddings
    │   └── box.npy         # Periodic box (if used)
    └── processed/
        └── topology.psf     # CG topology

Training

Train neural network models to predict corrections to the prior force field.

Basic Training Example

python train.py \
    /path/to/preprocessed_data \
    /path/to/results \
    --config configs/config.yaml \
    --gpus 0,1,2,3 \
    --batch 4 \
    --epochs 35 \
    --lr 0.001 \
    --exp-lr 0.85 \
    --atoms-per-call 140000 \
    --energy-weight 0.1 \
    --force-weight 1.0

Training with Energy Matching

python train.py \
    /path/to/preprocessed_with_tica \
    /path/to/results_energy_matching \
    --config configs/config_cutoff2.yaml \
    --gpus 0,1,2,3 \
    --batch 4 \
    --epochs 35 \
    --lr 0.001 \
    --wd 0 \
    --exp-lr 0.85 \
    --atoms-per-call 140000 \
    --energy-weight 0.5 \
    --force-weight 1.0

Resuming Training

To resume from a checkpoint, simply provide the same output directory:

python train.py \
    /path/to/preprocessed_data \
    /path/to/existing_results \
    --config configs/config.yaml \
    --gpus 0,1,2,3

Training Outputs

results_dir/
├── checkpoint.pth           # Latest model checkpoint
├── checkpoint-best.pth      # Best validation loss checkpoint
├── history.npy             # Training/validation loss history
├── epoch_history.json      # Detailed per-epoch metrics
├── training_info.json      # Training configuration
├── priors.yaml            # Copy of prior force field
└── prior_params.json      # Copy of prior configuration

Running Simulations

Use trained models to run coarse-grained MD simulations.

Basic Simulation

python simulate.py \
    /path/to/results/checkpoint.pth \
    /path/to/starting_structure.pdb \
    --output simulation.h5 \
    --steps 1000000 \
    --save-steps 1000 \
    --temperature 300 \
    --timestep 2 \
    --replicas 1

Multiple Replicas

python simulate.py \
    /path/to/results/checkpoint-best.pth \
    /path/to/structure1.pdb /path/to/structure2.pdb /path/to/structure3.pdb \
    --output replicas.pkl \
    --steps 5000000 \
    --save-steps 1000 \
    --temperature 300 \
    --replicas 3

Prior-Only Simulation

To run simulations using only the classical prior (without ML corrections):

python simulate.py \
    /path/to/results/checkpoint.pth \
    /path/to/starting_structure.pdb \
    --output prior_only.h5 \
    --prior-only \
    --steps 1000000 \
    --save-steps 1000

Simulation Arguments

Required:

checkpoint_path: Path to trained model checkpoint
processed_path: One or more starting structures (PDB or preprocessed directories)

Key Options:

--output: Output trajectory file (.pdb, .h5, or .pkl)
--steps: Total simulation steps
--save-steps: Save frequency (frames)
--temperature: Simulation temperature in Kelvin (default: 300)
--timestep: Integration timestep in femtoseconds (default: 1)
--replicas: Number of parallel replicas (default: 1)
--prior-only: Run without ML model (classical prior only)
--no-box: Disable periodic boundary conditions
--max-num-neighbors: Override max neighbors parameter

Output Formats

PDB format (.pdb): Human-readable, compatible with visualization tools
HDF5 format (.h5): Compact binary format with energies and metadata
Pickle format (.pkl): Contains MDTraj trajectory objects for analysis

Utilities

Editing Checkpoints

Reset optimizer or scheduler state in checkpoints:

python edit_checkpoint.py /path/to/checkpoint.pth --reset-optimizer --reset-scheduler

View checkpoint information:

python edit_checkpoint.py /path/to/checkpoint.pth --info

Learning Prior Force Fields

Train classical prior parameters directly from data:

python learn_prior.py \
    /path/to/preprocessed_data \
    /path/to/prior_output \
    --epochs 50 \
    --batch 10 \
    --lr 0.005

Inspired by CGSchNet.

License

License information will be added soon.

Contributing

We welcome contributions! Please use GitHub Issues to:

Request new features
Report bugs
Suggest improvements
Ask questions about the pipeline

Status

⚠️ Note: Code is currently being prepared for public release. Full documentation and tutorials will be available within 1-2 weeks.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
modules @ 78e5ec5		modules @ 78e5ec5
.gitmodules		.gitmodules
README.md		README.md
edit_checkpoint.py		edit_checkpoint.py
learn_prior.py		learn_prior.py
preprocess.py		preprocess.py
preprocess_energy.py		preprocess_energy.py
simulate.py		simulate.py
train.py		train.py

BioMedAI-UCSC/CGML_model_base

Folders and files

Latest commit

History

Repository files navigation

CGML Base Model Repository

Overview

Installation

Prerequisites

Model Inputs

Training Configuration

Training Arguments

Preprocessing

Basic Usage

Example: Preprocessing with CA Representation

Preprocessing Arguments

Available Prior Force Fields

Input Data Format

Output Structure

Training

Basic Training Example

Training with Energy Matching

Resuming Training

Training Outputs

Running Simulations

Basic Simulation

Multiple Replicas

Prior-Only Simulation

Simulation Arguments

Output Formats

Utilities

Editing Checkpoints

Learning Prior Force Fields

License

Contributing

Status

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages