vansteensellab/PARM_preprocessing_pipeline

1. Introduction

PARM (Promoter Activity Regulatory Model) is a deep learning model that predicts promoter activity from the DNA sequence itself. A convolutional neural network trained on MPRA data, PARM is very lightweight and produces predictions in a cell-type-specific manner. See the PARM paper and the PARM repository.

This repo contains a Snakemake pipeline to generate the input data for PARM from the raw MPRA counts data.

See below a schematic representation of the pre-processing steps:

overview

2. Requirements

Each step of this pipeline requires specific packages, but we provide them all through Conda. The only requirements you need before running it are:

  • Snakemake v8.25.3
  • Conda v23.3.1

3. Input formats

The example_data directory contains a small subset of the data used in the PARM paper, where an MPRA library of promoter fragments was transfected into different cells (here, HAP1 and AGS, with two biological replicates each).

This pipeline requires three main inputs: the raw count table of the MPRA fragments (input counts), the coordinates of the features that we want our fragments to overlap with (regulatory features), and a tabular file listing the features that belong to each of the five folds of the data (features in folds; more details here).

3.1 Input counts

Input count files are located in example_data/promoter_library/ and contain MPRA count data for each chromosome. Each file is a tab-separated text file compressed with gzip, containing the coordinates of the fragments and the iPCR and/or pDNA counts, as well as cDNA counts on each replicate.

Preview of chr1.txt.gz:

chr start end strand iPCR pDNA_pHY3_T2 HAP1_B1 HAP1_B2 AGS_B1 AGS_B2
chr1 15552 15772 + 4 12.0 13.0 0.0 0.0 0.0
chr1 51502 51657 - 1 0.0 0.0 0.0 0.0 0.0
chr1 101217 101437 + 3 19.0 0.0 4.0 0.0 0.0
chr1 228479 228717 - 14 57.0 2.0 18.0 0.0 0.0

Column descriptions:

  • chr: chromosome identifier
  • start: start position of the MPRA fragment
  • end: end position of the MPRA fragment
  • strand: strand orientation (+/-)
  • iPCR: input PCR counts
  • pDNA_pHY3_T2: normalization column (plasmid DNA counts)
  • HAP1_B1, HAP1_B2: cDNA counts for HAP1 cells, two biological replicates
  • AGS_B1, AGS_B2: cDNA counts for AGS cells, two biological replicates
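For illustration, one of these per-chromosome files can be loaded with pandas. This is a minimal sketch: an in-memory string holding two rows from the preview above stands in for the actual gzipped file (in practice you would pass the path, e.g. example_data/promoter_library/chr1.txt.gz, to pd.read_csv with sep="\t"):

```python
import io
import pandas as pd

# In-memory stand-in for a gzipped per-chromosome count file.
raw = io.StringIO(
    "chr\tstart\tend\tstrand\tiPCR\tpDNA_pHY3_T2\tHAP1_B1\tHAP1_B2\tAGS_B1\tAGS_B2\n"
    "chr1\t15552\t15772\t+\t4\t12.0\t13.0\t0.0\t0.0\t0.0\n"
    "chr1\t51502\t51657\t-\t1\t0.0\t0.0\t0.0\t0.0\t0.0\n"
)
counts = pd.read_csv(raw, sep="\t")
print(counts.shape)  # (2, 10)
```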

3.2 Regulatory features

Regulatory features are BED files located in example_data/regulatory_features/tss_selection_m300_p100/. These define genomic regions of interest (e.g., transcription start sites, enhancers) that will be used to filter and organize the MPRA data. Each BED file contains coordinates for regulatory elements for the corresponding chromosome.

If your BED files contain more than one feature type (e.g., promoters and enhancers), we recommend adding an index in front of the feature name (separated by an underscore), so that they are easier to distinguish later. The pipeline automatically creates a FEATtype column in the output data with the index value of each fragment. In the example data, we used 0 for promoters and 1 for enhancers. If you don't need this, you can simply ignore the FEATtype column in the output.
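Concretely, the convention looks like this (feature names copied from the example data; the integer prefix before the first underscore is what ends up in the FEATtype column):

```python
# Feature names from the example data: "0_" marks promoters, "1_" enhancers.
promoter = "0_SAMD11_ENST00000455979.1"
enhancer = "1_EnhA1_chr1:940400-941000"
print(int(promoter.split("_")[0]))  # 0
print(int(enhancer.split("_")[0]))  # 1
```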

3.3 Features in folds

To train the PARM models, we split our data into six parts: folds 0-4 and a test set. We then train every model five independent times, each time using a different fold to validate the training. Therefore, we need to specify which features go to each fold.

The example_data/regulatory_features/features_in_folds/tss_selection_m300_p100/ directory contains BED-like files that define which regulatory features belong to each cross-validation fold:

  • features_in_fold0.bed to features_in_fold4.bed: Training/validation folds
  • features_in_test.bed: Test set features

Preview of features_in_fold0.bed:

chr start end strand TSSabspos group TSS
chr1 568614 569015 + 568614_569015 0_MTATP8P1_ENST00000467115.1_568614_569015 0_MTATP8P1_ENST00000467115.1
chr1 874354 874755 + 874354_874755 0_SAMD11_ENST00000455979.1_874354_874755 0_SAMD11_ENST00000455979.1
chr1 940400 941000 . 940400_941000 1_EnhA1_chr1:940400-941000_940400_941000 1_EnhA1_chr1:940400-941000
chr1 1044600 1045000 . 1044600_1045000 1_EnhA2_chr1:1044600-1045000_1044600_1045000 1_EnhA2_chr1:1044600-1045000

Column descriptions:

  • chr: chromosome identifier
  • start: start position of the feature
  • end: end position of the feature
  • strand: strand orientation
  • TSSabspos: absolute position of the feature (start and end coordinates pasted)
  • group: unique group identifier combining feature type, name, and coordinates
  • TSS: name of the feature

These files ensure that each regulatory feature is assigned to exactly one fold, preventing data leakage during cross-validation.
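The disjointness property can be checked directly: collect the unique feature identifiers (the group column) from each fold file and verify that none is repeated. This is a minimal sketch with hypothetical feature ids standing in for the parsed BED contents:

```python
# Hypothetical fold assignments; in practice these sets would come from the
# 'group' column of each features_in_*.bed file.
folds = {
    "fold0": {"0_SAMD11", "1_EnhA1"},
    "fold1": {"0_NOC2L", "1_EnhA2"},
    "test": {"0_KLHL17"},
}
all_ids = [fid for ids in folds.values() for fid in ids]
# No feature id may appear in more than one fold (no data leakage).
leakage_free = len(all_ids) == len(set(all_ids))
print(leakage_free)  # True
```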

4. Setting up a config file for the pipeline

As in any Snakemake workflow, you first need to set up a configuration file before running this pipeline. Below we guide you through the fields of the config file, using our example data.

4.1 Configuring input parameters

The input parameters should be set as follows:

# Input counts
##############
# Base directory of your input data
INPUT_DIR: example_data
# List of dataset subdirectories within INPUT_DIR
INPUT:
  - promoter_library

# Regulatory features
######################
# Base directory containing regulatory feature BED files
REGULATORY_FEATURES_DIR: "example_data/regulatory_features"
# List of regulatory feature sets to process
REGULATORY_FEATURES:
  - tss_selection_m300_p100

# Features in folds
######################
# Directory containing fold assignments for regulatory features
FEATURES_IN_FOLDS_BASEDIR: example_data/regulatory_features/features_in_folds

Expected structure for input data

your_input_dir/
└── your_dataset_name/
    ├── chr1.txt.gz
    ├── chr2.txt.gz
    ├── chr3.txt.gz
    ├── ...
    └── chrX.txt.gz

If you have multiple datasets, the pipeline will merge them.

Expected structure for regulatory features

REGULATORY_FEATURES_DIR/
└── your_feature_set_name/
    ├── chr1.bed
    ├── chr2.bed
    ├── chr3.bed
    ├── ...
    └── chrX.bed

Expected structure for features in folds

FEATURES_IN_FOLDS_BASEDIR/
└── your_feature_set_name/  # Same name as above
    ├── features_in_fold0.bed
    ├── features_in_fold1.bed
    ├── features_in_fold2.bed
    ├── features_in_fold3.bed
    ├── features_in_fold4.bed
    └── features_in_test.bed

This pipeline also supports multiple feature sets. It will produce one output per set.

4.2 Other required parameters

Chromosome list

List of the chromosomes you want to include.

CHR:
  - chr1
  - chr2
  - chr3
  # ... add all chromosomes you have
  - chrX

Celltype columns

Map your cell types to the corresponding columns in your input count files:

CELLTYPES_TO_COLUMNS:
  HAP1:                    # Cell type name
    - HAP1_B1              # Biological replicate 1
    - HAP1_B2              # Biological replicate 2
  AGS:                     # Second cell type
    - AGS_B1
    - AGS_B2

Normalization

Set up which column of your input should be used for plasmid normalization of the counts.

# Column name used for normalization (typically iPCR or pDNA)
NORMALIZATION_COLUMN: 
  - pDNA_pHY3_T2

# Minimum raw-count threshold for the column above
NORMALIZATION_THRESHOLD: 10
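The effect of NORMALIZATION_THRESHOLD can be sketched as a simple row filter: fragments whose normalization counts fall below the cutoff are excluded. This is an illustrative sketch (values mirror the chr1.txt.gz preview above), not the pipeline's exact implementation:

```python
import pandas as pd

# Fragment counts from the chr1.txt.gz preview above.
counts = pd.DataFrame({
    "chr": ["chr1"] * 4,
    "start": [15552, 51502, 101217, 228479],
    "pDNA_pHY3_T2": [12.0, 0.0, 19.0, 57.0],
})
NORMALIZATION_THRESHOLD = 10
# Keep only fragments with enough normalization counts.
kept = counts[counts["pDNA_pHY3_T2"] >= NORMALIZATION_THRESHOLD]
print(len(kept))  # 3
```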

4.3 Additional Configuration

# Output directory name
OUTDIR: "output_directory"

# Reference genome path
GENOME: /path/to/your/genome.fa

# Size of the onehot matrix. This is the L_max parameter of PARM
MATRIX_SIZE: 600

# Random seed for reproducibility
RANDOM_SEED: 42

# Pseudocount added to prevent log(0) errors during normalization
NORMALIZATION_PSEUDOCOUNT: AUTO
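To make MATRIX_SIZE concrete: each fragment sequence is one-hot encoded into a 4 x MATRIX_SIZE matrix, with shorter sequences zero-padded and longer ones truncated. The base ordering, padding scheme, and handling of ambiguous bases below are illustrative assumptions, not necessarily PARM's exact layout:

```python
import numpy as np

MATRIX_SIZE = 600  # the L_max parameter of PARM
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed base ordering

def onehot_encode(seq, size=MATRIX_SIZE):
    mat = np.zeros((4, size), dtype=np.float32)
    for i, base in enumerate(seq[:size].upper()):
        if base in BASE_INDEX:  # ambiguous bases (e.g. N) stay all-zero
            mat[BASE_INDEX[base], i] = 1.0
    return mat

m = onehot_encode("ACGTN")
print(m.shape)  # (4, 600)
```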

5. Output files

output_directory/
├── onehot/    # <- One-hot encoded data (input for PARM training)
│   └── tss_selection_m300_p100/
│       ├── fold0.hdf5
│       ├── fold1.hdf5
│       ├── fold2.hdf5
│       ├── fold3.hdf5
│       ├── fold4.hdf5
│       └── test.hdf5
├── metadata/    # <- Tabular file with Log2RPM values and fragment coordinates
│   └── tss_selection_m300_p100/
│       ├── metadata_fold0.bed.gz
│       ├── metadata_fold1.bed.gz
│       ├── metadata_fold2.bed.gz
│       ├── metadata_fold3.bed.gz
│       ├── metadata_fold4.bed.gz
│       └── metadata_test.bed.gz
├── replicate_correlations.pdf    # <- QC plot
└── tmp/    # <- Intermediate files

The HDF5 files in the onehot/ directory contain the final input matrices for PARM training, with one-hot encoded DNA sequences and corresponding normalized activity scores.

In both the HDF5 files and the metadata files, promoter activity is stored as Log2RPM values (named Log2RPM_[cell]). To obtain this value, the pipeline first converts the normalization counts (iPCR or pDNA) and the fragment cDNA counts to RPM, then takes the cDNA/normalization ratio. Next, it sets a replicate-specific pseudocount, defined as the 0.1 quantile of the non-zero ratio values for that replicate. Finally, it computes the Log2RPM values per replicate, defined as:

Log2RPM equation

After this, the Log2RPM_replicate values are averaged across replicates to obtain the final promoter activity (Log2RPM_cell).
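These steps can be sketched for one replicate as follows. This is a minimal sketch with simulated counts; the exact handling of zero normalization counts in the real pipeline may differ:

```python
import numpy as np

def log2rpm(cdna, norm):
    cdna_rpm = cdna / cdna.sum() * 1e6  # cDNA counts to RPM
    norm_rpm = norm / norm.sum() * 1e6  # iPCR/pDNA counts to RPM
    # cDNA/normalization ratio; fragments with zero norm counts get ratio 0.
    ratio = np.divide(cdna_rpm, norm_rpm,
                      out=np.zeros_like(cdna_rpm), where=norm_rpm > 0)
    # Replicate-specific pseudocount: 0.1 quantile of the non-zero ratios.
    pseudo = np.quantile(ratio[ratio > 0], 0.1)
    return np.log2(ratio + pseudo)

rng = np.random.default_rng(42)  # simulated counts for illustration
norm = rng.poisson(30, size=1000).astype(float)
rep1 = log2rpm(rng.poisson(20, size=1000).astype(float), norm)
rep2 = log2rpm(rng.poisson(20, size=1000).astype(float), norm)
log2rpm_cell = (rep1 + rep2) / 2  # final promoter activity per fragment
```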

The QC plot shows a replicate correlation matrix of the Log2RPM_replicate values for every cell type, both at the fragment level (each dot being a fragment) and at the feature level (each dot being a feature). Here is an example using the provided data for AGS cells:

Correlation matrix of Log2RPM_replicate values at the fragment level

Replicate correlation for AGS cells


Correlation matrix of Log2RPM_replicate values at the feature level

Replicate correlation for AGS cells

6. Running the pipeline

  1. Set up your configuration file: Copy and modify example_config.yaml with your specific parameters.

  2. Activate Snakemake environment: Ensure you have Snakemake v8.25.3 and Conda v23.3.1 installed.

  3. Dry run (recommended): Test your configuration without running the pipeline:

    snakemake --configfile your_config.yaml --dry-run
  4. Run the pipeline: Execute with your desired number of cores:

    snakemake --configfile your_config.yaml --cores 10 --use-conda

For the example data, the pipeline should run in ~10 minutes.

7. Training a PARM model!

Once the preprocessing pipeline completes successfully, you'll have HDF5 files ready for PARM model training. Go to the PARM repository for more details.

Citation

If you make use of PARM and/or this pipeline, please cite:

Barbadilla-Martínez, L.; Klaassen, N.; Franceschini-Santos, V. H.; Breda, J.; Hernandez-Quiles, M.; van Lieshout, T.; Urzua Traslaviña, C.; Yücel, H.; Boi, M.; Hermana-Garcia-Agullo, C.; Gregoricchio, S.; Zwart, W.; Voest, E.; Franke, L.; Vermeulen, M.; de Ridder, J.; van Steensel, B. (2024). The regulatory grammar of human promoters uncovered by MPRA-trained deep learning. bioRxiv.
