This repository contains the models, data, and scripts used in the GLM-Prior paper, which introduces a transformer-based nucleotide sequence classification approach for inferring transcription factor–target gene interaction priors.
GLM-Prior is the first stage in a dual-stage training pipeline that includes:
- GLM-Prior – a fine-tuned genomic language model that predicts TF–gene interactions from nucleotide sequences.
- PMF-GRN – a probabilistic matrix factorization model that performs GRN inference using prior knowledge from GLM-Prior.
- Environment Setup
- Dataset Preparation
- GLM-Prior Pipeline (Stage 1)
- Hyperparameter Sweep
- GRN Inference with PMF-GRN (Stage 2)
- Datasets and Models
GLM-Prior is designed to run within a Singularity container using a Conda environment. Follow the steps below to create and activate the environment. Installation takes less than approximately 5 minutes.
    # Create the Conda environment and install dependencies
    conda create -p /ext3/pmf-prior-network python=3.10 -y
    pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121
    pip install hydra-core pandas datasets scikit-learn transformers wandb

    # Launch the Singularity container, then update and activate the environment
    singularity exec --nv --overlay overlay-15GB-500K.ext3:ro --bind local:$HOME/.local cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif /bin/bash
    conda env update --prefix /ext3/pmf-prior-network --file environment_prior_network.yaml --prune
    source /ext3/env.sh
    conda activate /ext3/pmf-prior-network
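Before submitting jobs, it can help to confirm that the key packages are importable from inside the container. The helper below is our own stdlib-only illustration, not part of the repository:

```python
import importlib.util

def check_packages(names):
    """Return a dict mapping each package name to whether it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Import names for the packages installed above
# (hydra-core imports as `hydra`, scikit-learn as `sklearn`)
required = ["torch", "hydra", "pandas", "datasets", "sklearn", "transformers", "wandb"]
missing = [name for name, found in check_packages(required).items() if not found]
if missing:
    print(f"Missing packages: {missing}")
else:
    print("All required packages found.")
```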
To prepare training data, use the notebooks provided in the create_sequence_datasets/ directory. There are separate notebooks for yeast, mouse, and human.
Each notebook:
- Lists required downloads (FASTA files, GTF annotations, TF motifs)
- Walks through generating gene/TF sequences
- Saves output sequences and prior matrices for model input
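Conceptually, the prior matrix the notebooks save is a binary genes-by-TFs indicator built from known interactions. A minimal stdlib sketch of that construction (function and variable names are illustrative, not the notebooks' actual code):

```python
def build_prior_matrix(tfs, genes, interactions):
    """Build a binary prior matrix: rows are genes, columns are TFs.
    interactions is an iterable of (tf, gene) pairs with prior evidence."""
    tf_index = {tf: j for j, tf in enumerate(tfs)}
    gene_index = {g: i for i, g in enumerate(genes)}
    matrix = [[0] * len(tfs) for _ in genes]
    for tf, gene in interactions:
        matrix[gene_index[gene]][tf_index[tf]] = 1
    return matrix

prior = build_prior_matrix(
    tfs=["TF1", "TF2"],
    genes=["geneA", "geneB", "geneC"],
    interactions=[("TF1", "geneA"), ("TF2", "geneC")],
)
# prior[0][0] == 1 (TF1 -> geneA); prior[1][0] == 0 (no evidence for TF1 -> geneB)
```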
Dataset Locations:
- Yeast: data/yeast/
- Mouse: data/mouse/
- Human: data/human/
Mouse and human reference datasets can be downloaded from BEELINE and placed into their respective folders.
To train the GLM-Prior model on DNA sequence input:
1. Edit configuration files with appropriate file paths and optimal hyperparameters for full training:
   - config/train_prior_network_pipeline.yaml
   - config/prior_network/finetune_nt.yaml

   In the pipeline config, set:
   - prior_network_singularity_overlay – overlay path for the environment
   - prior_network_singularity_img – path to the Singularity image

2. Launch the pipeline:

       sbatch train_prior_network_pipeline.slurm TPN_pipeline
This will launch dynamically generated SLURM scripts to:
- Tokenize sequences
- Train GLM-Prior
- Perform inference on gene-TF pairs
- Evaluate predictions vs. a gold standard
- Save outputs to:
  - output/<experiment_name>/prior_network_predictions.tsv
  - output/<experiment_name>/auprc_vs_gold.json
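The evaluation step scores predictions against the gold standard with AUPRC. As a reference point, average precision can be computed from scores and binary labels like this (a stdlib sketch; the pipeline itself may use a library implementation such as scikit-learn's average_precision_score):

```python
def average_precision(scores, labels):
    """Average precision: precision at each positive hit, weighted by recall step.
    scores: predicted interaction probabilities; labels: 1 for gold-standard edges, else 0."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(labels)
    tp = 0
    ap = 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            ap += (tp / k) / total_pos  # precision at rank k times the recall increment
    return ap

print(average_precision([0.9, 0.8, 0.7], [1, 0, 1]))  # ~0.833
```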
To optimize GLM-Prior for a new dataset, we recommend running a hyperparameter sweep over 1 training epoch before full training. To run a sweep:
Edit ./train_prior_network/finetune_nt_hp_sweep.sh to set:
- class weights
- learning rates
- downsampling rates
- gradient accumulation steps
Confirm paths, set num_train_epochs: 1 in config/prior_network/finetune_nt.yaml, and run:

    ./train_prior_network/finetune_nt_hp_sweep.sh $USER /scratch/$USER/GLM-Prior/envs/overlay-15GB-500K.ext3 /scratch/work/public/singularity/cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif ./train_prior_network/hp-sweep/
Each configuration will be submitted as an individual SLURM job. Weights & Biases will automatically track all sweeps. Select the configuration with the best F1 score and update your training config accordingly.
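Conceptually, the sweep script expands these hyperparameter lists into one SLURM job per combination. A sketch of that expansion, with placeholder values rather than the script's actual defaults:

```python
from itertools import product

# Illustrative search space; the real values live in finetune_nt_hp_sweep.sh
class_weights = [1.0, 5.0, 10.0]
learning_rates = [1e-5, 3e-5]
downsampling_rates = [0.1, 0.5]
grad_accum_steps = [1, 4]

configs = [
    {"class_weight": cw, "learning_rate": lr, "downsample": ds, "grad_accum": ga}
    for cw, lr, ds, ga in product(
        class_weights, learning_rates, downsampling_rates, grad_accum_steps
    )
]
print(len(configs))  # 3 * 2 * 2 * 2 = 24 configurations, one SLURM job each
```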
Binarized prior knowledge from GLM-Prior can be used as input to the PMF-GRN model to perform full GRN inference.
PMF-GRN takes this prior-knowledge matrix together with single-cell gene expression data to infer directed regulatory edges between TFs and their target genes. Its output includes a gene regulatory network and transcription factor activity estimates for all genes and TFs, as well as metrics such as uncertainty calibration and AUPRC.
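Binarizing GLM-Prior's predicted interaction probabilities into the 0/1 prior matrix can be done by thresholding; the cutoff below is a placeholder, not a value prescribed by the paper:

```python
def binarize_prior(probabilities, threshold=0.5):
    """Convert predicted TF-gene interaction probabilities into a 0/1 prior matrix."""
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]

probs = [[0.92, 0.10], [0.48, 0.73]]
print(binarize_prior(probs))  # [[1, 0], [0, 1]]
```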
Datasets and models associated with the paper can be found on HuggingFace.
