- 1. Introduction
- 2. Requirements
- 3. Input formats
- 4. Setting up a config file for the pipeline
- 5. Output files
- 6. Running the pipeline
- 7. Training a PARM model!
- Citation
PARM (Promoter Activity Regulatory Model) is a deep learning model that predicts promoter activity from the DNA sequence itself. A convolutional neural network trained on MPRA data, PARM is very lightweight and produces predictions in a cell-type-specific manner. See the PARM paper and the PARM repository.
This repo contains a Snakemake pipeline to generate the input data for PARM from the raw MPRA counts data.
See below the schematic representation of the pre-processing steps:
Each step of this pipeline requires specific packages, but all of them are provided automatically via Conda environments. The only requirements you need before running it are:
- Snakemake v8.25.3
- Conda v23.3.1
The example_data directory contains a small subset of the data used in the PARM paper, where an MPRA library of promoter fragments was transfected into different cells (here, HAP1 and AGS, with two biological replicates each).
This pipeline requires three main inputs: the raw count table of the MPRA fragments (input counts), the coordinates of the features that we want our fragments to overlap with (regulatory features), and a tabular file listing the features that should belong to each of the five folds of the data (features in folds; more details below).
Input count files are located in example_data/promoter_library/ and contain MPRA count data for each chromosome. Each file is a tab-separated text file compressed with gzip, containing the coordinates of the fragments, the iPCR and/or pDNA counts, and the cDNA counts for each replicate.
Preview of chr1.txt.gz:
| chr | start | end | strand | iPCR | pDNA_pHY3_T2 | HAP1_B1 | HAP1_B2 | AGS_B1 | AGS_B2 |
|---|---|---|---|---|---|---|---|---|---|
| chr1 | 15552 | 15772 | + | 4 | 12.0 | 13.0 | 0.0 | 0.0 | 0.0 |
| chr1 | 51502 | 51657 | - | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| chr1 | 101217 | 101437 | + | 3 | 19.0 | 0.0 | 4.0 | 0.0 | 0.0 |
| chr1 | 228479 | 228717 | - | 14 | 57.0 | 2.0 | 18.0 | 0.0 | 0.0 |
Column descriptions:
- chr: chromosome identifier
- start: start position of the MPRA fragment
- end: end position of the MPRA fragment
- strand: strand orientation (+/-)
- iPCR: input PCR counts
- pDNA_pHY3_T2: normalization column (plasmid DNA counts)
- HAP1_B1, HAP1_B2: cDNA counts for HAP1 cells, two biological replicates
- AGS_B1, AGS_B2: cDNA counts for AGS cells, two biological replicates
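A count file like the preview above can be loaded with pandas. The sketch below is illustrative only (the `load_counts` helper and the `min_ipcr` filter are not part of the pipeline); it uses a tiny in-memory stand-in for `chr1.txt.gz`:

```python
import io

import pandas as pd

# Hypothetical helper: load one per-chromosome count file and keep fragments
# with a minimum iPCR count (column names as in the preview above).
def load_counts(path_or_buffer, min_ipcr=1):
    df = pd.read_csv(path_or_buffer, sep="\t")
    return df[df["iPCR"] >= min_ipcr].reset_index(drop=True)

# Tiny in-memory stand-in for chr1.txt.gz, mirroring the first rows above.
tsv = (
    "chr\tstart\tend\tstrand\tiPCR\tpDNA_pHY3_T2\tHAP1_B1\tHAP1_B2\tAGS_B1\tAGS_B2\n"
    "chr1\t15552\t15772\t+\t4\t12.0\t13.0\t0.0\t0.0\t0.0\n"
    "chr1\t51502\t51657\t-\t1\t0.0\t0.0\t0.0\t0.0\t0.0\n"
)
counts = load_counts(io.StringIO(tsv), min_ipcr=2)
print(len(counts))  # 1 — only the first fragment passes the filter
```

For the real files, replace the `StringIO` buffer with the path to a `.txt.gz` file; pandas decompresses gzip automatically.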
Regulatory features are BED files located in example_data/regulatory_features/tss_selection_m300_p100/. These define genomic regions of interest (e.g., transcription start sites, enhancers) that will be used to filter and organize the MPRA data. Each BED file contains coordinates for regulatory elements for the corresponding chromosome.
If your BED files contain more than one feature type (e.g., promoters and enhancers), we recommend prefixing the feature name with an index (separated by an underscore), so that it is easier to distinguish them later. This pipeline automatically creates a `FEATtype` column in the output data with the index value of each fragment. In the example data, we used `0` for promoters and `1` for enhancers. If you don't need this, you can simply ignore the `FEATtype` column in the output.
To train the PARM models, we split our data into six parts: folds 0-4 and a test set. Then we train every model five independent times, each time using a different fold for validation. Therefore, we need to specify which features should go into each fold.
The example_data/regulatory_features/features_in_folds/tss_selection_m300_p100/ directory contains BED-like files that define which regulatory features belong to each cross-validation fold:
- `features_in_fold0.bed` to `features_in_fold4.bed`: training/validation folds
- `features_in_test.bed`: test set features
Preview of features_in_fold0.bed:
| chr | start | end | strand | TSSabspos | group | TSS |
|---|---|---|---|---|---|---|
| chr1 | 568614 | 569015 | + | 568614_569015 | 0_MTATP8P1_ENST00000467115.1_568614_569015 | 0_MTATP8P1_ENST00000467115.1 |
| chr1 | 874354 | 874755 | + | 874354_874755 | 0_SAMD11_ENST00000455979.1_874354_874755 | 0_SAMD11_ENST00000455979.1 |
| chr1 | 940400 | 941000 | . | 940400_941000 | 1_EnhA1_chr1:940400-941000_940400_941000 | 1_EnhA1_chr1:940400-941000 |
| chr1 | 1044600 | 1045000 | . | 1044600_1045000 | 1_EnhA2_chr1:1044600-1045000_1044600_1045000 | 1_EnhA2_chr1:1044600-1045000 |
Column descriptions:
- chr: chromosome identifier
- start: start position of the feature
- end: end position of the feature
- strand: strand orientation
- TSSabspos: absolute position of the feature (start and end coordinates pasted)
- group: unique group identifier combining feature type, name, and coordinates
- TSS: name of the feature
These files ensure that each regulatory feature is assigned to exactly one fold, preventing data leakage during cross-validation.
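The disjointness property can be checked with a few lines of Python. This is a sketch, not part of the pipeline; the `group` identifiers below are small hypothetical examples standing in for values read from the real `features_in_*.bed` files:

```python
from itertools import combinations

# Hypothetical "group" identifiers per fold file; in practice you would
# collect them with pandas from each features_in_*.bed file.
fold_groups = {
    "fold0": {"0_SAMD11_ENST00000455979.1_874354_874755"},
    "fold1": {"0_MTATP8P1_ENST00000467115.1_568614_569015"},
    "test":  {"1_EnhA1_chr1:940400-941000_940400_941000"},
}

def assert_disjoint(groups):
    # Compare every pair of folds; any shared feature would leak data
    # between training, validation, and test.
    for (a, sa), (b, sb) in combinations(groups.items(), 2):
        overlap = sa & sb
        if overlap:
            raise ValueError(f"features present in both {a} and {b}: {overlap}")
    return True

print(assert_disjoint(fold_groups))  # True — no feature appears twice
```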
As in any Snakemake workflow, you first need to set up a configuration file before running this pipeline. Below we guide you through the fields of the config file, using our example data.
The input parameters should be set as follows:
```yaml
# Input counts
##############
# Base directory of your input data
INPUT_DIR: example_data
# List of dataset subdirectories within INPUT_DIR
INPUT:
- promoter_library

# Regulatory features
######################
# Base directory containing regulatory feature BED files
REGULATORY_FEATURES_DIR: "example_data/regulatory_features"
# List of regulatory feature sets to process
REGULATORY_FEATURES:
- tss_selection_m300_p100

# Features in folds
######################
# Directory containing fold assignments for regulatory features
FEATURES_IN_FOLDS_BASEDIR: example_data/regulatory_features/features_in_folds
```

Each input dataset should follow this structure:

```
your_input_dir/
└── your_dataset_name/
    ├── chr1.txt.gz
    ├── chr2.txt.gz
    ├── chr3.txt.gz
    ├── ...
    └── chrX.txt.gz
```
If you have multiple datasets, the pipeline will merge them.
The regulatory feature BED files should be organized as:

```
REGULATORY_FEATURES_DIR/
└── your_feature_set_name/
    ├── chr1.bed
    ├── chr2.bed
    ├── chr3.bed
    ├── ...
    └── chrX.bed
```

And the fold-assignment files should follow the same naming:

```
FEATURES_IN_FOLDS_BASEDIR/
└── your_feature_set_name/   # Same name as above
    ├── features_in_fold0.bed
    ├── features_in_fold1.bed
    ├── features_in_fold2.bed
    ├── features_in_fold3.bed
    ├── features_in_fold4.bed
    └── features_in_test.bed
```
This pipeline also supports multiple feature sets. It will produce one output per set.
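Before launching the pipeline, it can be worth checking that every expected fold file exists for each feature set. This sketch is not part of the pipeline; it simply walks the directory layout shown above (the demo builds a temporary directory with one file deliberately absent):

```python
import tempfile
from pathlib import Path

# Sketch: report any missing fold files, following the layout above.
def check_folds(basedir, feature_sets):
    expected = [f"features_in_fold{i}.bed" for i in range(5)] + ["features_in_test.bed"]
    missing = []
    for fs in feature_sets:
        for name in expected:
            path = Path(basedir) / fs / name
            if not path.exists():
                missing.append(str(path))
    return missing

# Demo on a temporary directory where features_in_test.bed is absent.
with tempfile.TemporaryDirectory() as tmp:
    fs_dir = Path(tmp) / "tss_selection_m300_p100"
    fs_dir.mkdir()
    for i in range(5):
        (fs_dir / f"features_in_fold{i}.bed").touch()
    missing = check_folds(tmp, ["tss_selection_m300_p100"])

print(missing)  # only features_in_test.bed is reported missing
```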
List of the chromosomes you want to include:

```yaml
CHR:
- chr1
- chr2
- chr3
# ... add all chromosomes you have
- chrX
```

Map your cell types to the corresponding columns in your input count files:

```yaml
CELLTYPES_TO_COLUMNS:
  HAP1:        # Cell type name
  - HAP1_B1    # Biological replicate 1
  - HAP1_B2    # Biological replicate 2
  AGS:         # Second cell type
  - AGS_B1
  - AGS_B2
```

Set up which column of your input should be used for plasmid normalization of the counts:
```yaml
# Column name used for normalization (typically iPCR or pDNA)
NORMALIZATION_COLUMN:
- pDNA_pHY3_T2
# Minimum raw-count threshold for the column above
NORMALIZATION_THRESHOLD: 10
# Output directory name
OUTDIR: "output_directory"
# Reference genome path
GENOME: /path/to/your/genome.fa
# Size of the one-hot matrix. This is the L_max parameter of PARM
MATRIX_SIZE: 600
# Random seed for reproducibility
RANDOM_SEED: 42
# Pseudocount added to prevent log(0) errors during normalization
NORMALIZATION_PSEUDOCOUNT: AUTO
```

After a successful run, the output directory will look like this:

```
output_directory/
├── onehot/                      # <- One-hot encoded data (input for PARM training)
│   └── tss_selection_m300_p100/
│       ├── fold0.hdf5
│       ├── fold1.hdf5
│       ├── fold2.hdf5
│       ├── fold3.hdf5
│       ├── fold4.hdf5
│       └── test.hdf5
├── metadata/                    # <- Tabular files with Log2RPM values and fragment coordinates
│   └── tss_selection_m300_p100/
│       ├── metadata_fold0.bed.gz
│       ├── metadata_fold1.bed.gz
│       ├── metadata_fold2.bed.gz
│       ├── metadata_fold3.bed.gz
│       ├── metadata_fold4.bed.gz
│       └── metadata_test.bed.gz
├── replicate_correlations.pdf   # <- QC plot
└── tmp/                         # <- Intermediate files
```
The HDF5 files in the onehot/ directory contain the final input matrices for PARM training, with one-hot encoded DNA sequences and corresponding normalized activity scores.
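One-hot encoding maps each base to one of four channels in an (L_max x 4) matrix, zero-padded to `MATRIX_SIZE` (600 in the example config). The sketch below is illustrative; the A/C/G/T column order is an assumption, so check the PARM repository for the exact convention:

```python
import numpy as np

# Illustrative one-hot encoder; column order A, C, G, T is an assumption.
def one_hot(seq, matrix_size=600):
    lut = {"A": 0, "C": 1, "G": 2, "T": 3}
    mat = np.zeros((matrix_size, 4), dtype=np.float32)
    for i, base in enumerate(seq[:matrix_size].upper()):
        if base in lut:  # N and other ambiguity codes stay all-zero
            mat[i, lut[base]] = 1.0
    return mat

mat = one_hot("ACGTN")
print(mat.shape)  # (600, 4)
```

Sequences longer than `MATRIX_SIZE` are truncated here; positions beyond the fragment end remain all-zero padding.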
In both the HDF5 files and the metadata files, promoter activity is stored as Log2RPM values (named `Log2RPM_[cell]`). To compute this value, the pipeline first converts the normalization counts (iPCR or pDNA) and the fragment counts (cDNA) to RPM, then takes the cDNA/normalization ratio.
Next, it sets a replicate-specific pseudocount, defined as the 0.1 quantile of the non-zero ratio values for that replicate.
Finally, it computes the Log2RPM value per replicate, defined as:

    Log2RPM_replicate = log2(ratio + pseudocount)

After this, the Log2RPM_replicate values are averaged across replicates to obtain the final promoter activity (`Log2RPM_cell`).
The QC plot shows a replicate correlation matrix of the Log2RPM_replicate values for every cell type, both at the fragment level (each dot is a fragment) and at the feature level (each dot is a feature). Here is an example using the provided data for AGS cells:
1. **Set up your configuration file:** Copy and modify `example_config.yaml` with your specific parameters.

2. **Activate your Snakemake environment:** Ensure you have Snakemake v8.25.3 and Conda v23.3.1 installed.

3. **Dry run (recommended):** Test your configuration without running the pipeline:

   ```
   snakemake --configfile your_config.yaml --dry-run
   ```

4. **Run the pipeline:** Execute with your desired number of cores:

   ```
   snakemake --configfile your_config.yaml --cores 10 --use-conda
   ```
For the example data, the pipeline should run in ~10 minutes.
Once the preprocessing pipeline completes successfully, you'll have HDF5 files ready for PARM model training. Go to the PARM repository for more details.
If you make use of PARM and/or this pipeline, please cite:

