vansteensellab/PARM_preprocessing_pipeline

1. Introduction

PARM (Promoter Activity Regulatory Model) is a deep learning model that predicts promoter activity from the DNA sequence itself. A convolutional neural network trained on MPRA data, PARM is very lightweight and produces predictions in a cell-type-specific manner. See the PARM paper and the PARM repository.

This repo contains a Snakemake pipeline to generate the input data for PARM from the raw MPRA counts data.

See below a schematic representation of the pre-processing steps:

overview

2. Requirements

Each step of this pipeline requires specific packages, but we provide them all through Conda. The only requirements you need before running it are:

  • Snakemake v8.25.3
  • Conda v23.3.1

3. Input formats

The example_data directory contains a small subset of the data used in the PARM paper, where an MPRA library of promoter fragments was transfected into different cells (here, HAP1 and AGS, with two biological replicates each).

This pipeline requires three main inputs: the raw count table of the MPRA fragments (input counts), the coordinates of the features that we want our fragments to overlap with (regulatory features), and a tabular file listing the features that belong to each of the five folds of the data (features in folds; more details here).

3.1 Input counts

Input count files are located in example_data/promoter_library/ and contain MPRA count data for each chromosome. Each file is a tab-separated text file compressed with gzip, containing the coordinates of the fragments and the iPCR and/or pDNA counts, as well as cDNA counts on each replicate.

Preview of chr1.txt.gz:

chr start end strand iPCR pDNA_pHY3_T2 HAP1_B1 HAP1_B2 AGS_B1 AGS_B2
chr1 15552 15772 + 4 12.0 13.0 0.0 0.0 0.0
chr1 51502 51657 - 1 0.0 0.0 0.0 0.0 0.0
chr1 101217 101437 + 3 19.0 0.0 4.0 0.0 0.0
chr1 228479 228717 - 14 57.0 2.0 18.0 0.0 0.0

Column descriptions:

  • chr: chromosome identifier
  • start: start position of the MPRA fragment
  • end: end position of the MPRA fragment
  • strand: strand orientation (+/-)
  • iPCR: input PCR counts
  • pDNA_pHY3_T2: normalization column (plasmid DNA counts)
  • HAP1_B1, HAP1_B2: cDNA counts for HAP1 cells, two biological replicates
  • AGS_B1, AGS_B2: cDNA counts for AGS cells, two biological replicates
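For illustration, one of these per-chromosome files can be loaded with pandas. This is a minimal sketch: an in-memory string holding two rows from the preview above stands in for the actual gzipped file (in practice you would pass the path, e.g. example_data/promoter_library/chr1.txt.gz, to pd.read_csv with sep="\t"):

```python
import io
import pandas as pd

# In-memory stand-in for a gzipped per-chromosome count file.
raw = io.StringIO(
    "chr\tstart\tend\tstrand\tiPCR\tpDNA_pHY3_T2\tHAP1_B1\tHAP1_B2\tAGS_B1\tAGS_B2\n"
    "chr1\t15552\t15772\t+\t4\t12.0\t13.0\t0.0\t0.0\t0.0\n"
    "chr1\t51502\t51657\t-\t1\t0.0\t0.0\t0.0\t0.0\t0.0\n"
)
counts = pd.read_csv(raw, sep="\t")
print(counts.shape)  # (2, 10)
```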

3.2 Regulatory features

Regulatory features are BED files located in example_data/regulatory_features/tss_selection_m300_p100/. These define genomic regions of interest (e.g., transcription start sites, enhancers) that will be used to filter and organize the MPRA data. Each BED file contains coordinates for regulatory elements for the corresponding chromosome.

If your BED files contain more than one feature type (e.g., promoters and enhancers), we recommend adding an index in front of the feature name (separated by an underscore), so that they are easier to distinguish later. The pipeline automatically creates a FEATtype column in the output data with the index value of each fragment. In the example data, we used 0 for promoters and 1 for enhancers. If you don't need this, you can simply ignore the FEATtype column in the output.
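Concretely, the convention looks like this (feature names copied from the example data; the integer prefix before the first underscore is what ends up in the FEATtype column):

```python
# Feature names from the example data: "0_" marks promoters, "1_" enhancers.
promoter = "0_SAMD11_ENST00000455979.1"
enhancer = "1_EnhA1_chr1:940400-941000"
print(int(promoter.split("_")[0]))  # 0
print(int(enhancer.split("_")[0]))  # 1
```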

3.3 Features in folds

To train the PARM models, we split our data into six parts: folds 0-4 and a test set. We then train every model five independent times, each time using a different fold to validate the training. Therefore, we need to specify which features go to each fold.

The example_data/regulatory_features/features_in_folds/tss_selection_m300_p100/ directory contains BED-like files that define which regulatory features belong to each cross-validation fold:

  • features_in_fold0.bed to features_in_fold4.bed: Training/validation folds
  • features_in_test.bed: Test set features

Preview of features_in_fold0.bed:

chr start end strand TSSabspos group TSS
chr1 568614 569015 + 568614_569015 0_MTATP8P1_ENST00000467115.1_568614_569015 0_MTATP8P1_ENST00000467115.1
chr1 874354 874755 + 874354_874755 0_SAMD11_ENST00000455979.1_874354_874755 0_SAMD11_ENST00000455979.1
chr1 940400 941000 . 940400_941000 1_EnhA1_chr1:940400-941000_940400_941000 1_EnhA1_chr1:940400-941000
chr1 1044600 1045000 . 1044600_1045000 1_EnhA2_chr1:1044600-1045000_1044600_1045000 1_EnhA2_chr1:1044600-1045000

Column descriptions:

  • chr: chromosome identifier
  • start: start position of the feature
  • end: end position of the feature
  • strand: strand orientation
  • TSSabspos: absolute position of the feature (start and end coordinates pasted)
  • group: unique group identifier combining feature type, name, and coordinates
  • TSS: name of the feature

These files ensure that each regulatory feature is assigned to exactly one fold, preventing data leakage during cross-validation.
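The disjointness property can be checked directly: collect the unique feature identifiers (the group column) from each fold file and verify that none is repeated. This is a minimal sketch with hypothetical feature ids standing in for the parsed BED contents:

```python
# Hypothetical fold assignments; in practice these sets would come from the
# 'group' column of each features_in_*.bed file.
folds = {
    "fold0": {"0_SAMD11", "1_EnhA1"},
    "fold1": {"0_NOC2L", "1_EnhA2"},
    "test": {"0_KLHL17"},
}
all_ids = [fid for ids in folds.values() for fid in ids]
# No feature id may appear in more than one fold (no data leakage).
leakage_free = len(all_ids) == len(set(all_ids))
print(leakage_free)  # True
```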

4. Setting up a config file for the pipeline

As in any Snakemake workflow, you first need to set up a configuration file before running this pipeline. Below we guide you through the fields of the config file, using our example data.

4.1 Configuring input parameters

The input parameters should be set as follows:

# Input counts
##############
# Base directory of your input data
INPUT_DIR: example_data
# List of dataset subdirectories within INPUT_DIR
INPUT:
  - promoter_library

# Regulatory features
######################
# Base directory containing regulatory feature BED files
REGULATORY_FEATURES_DIR: "example_data/regulatory_features"
# List of regulatory feature sets to process
REGULATORY_FEATURES:
  - tss_selection_m300_p100

# Features in folds
######################
# Directory containing fold assignments for regulatory features
FEATURES_IN_FOLDS_BASEDIR: example_data/regulatory_features/features_in_folds

Expected structure for input data

your_input_dir/
└── your_dataset_name/
    ├── chr1.txt.gz
    ├── chr2.txt.gz
    ├── chr3.txt.gz
    ├── ...
    └── chrX.txt.gz

If you have multiple datasets, the pipeline will merge them.

Expected structure for regulatory features

REGULATORY_FEATURES_DIR/
└── your_feature_set_name/
    ├── chr1.bed
    ├── chr2.bed
    ├── chr3.bed
    ├── ...
    └── chrX.bed

Expected structure for features in folds

FEATURES_IN_FOLDS_BASEDIR/
└── your_feature_set_name/  # Same name as above
    ├── features_in_fold0.bed
    ├── features_in_fold1.bed
    ├── features_in_fold2.bed
    ├── features_in_fold3.bed
    ├── features_in_fold4.bed
    └── features_in_test.bed

This pipeline also supports multiple feature sets. It will produce one output per set.

4.2 Other required parameters

Chromosome list

List of the chromosomes you want to include.

CHR:
  - chr1
  - chr2
  - chr3
  # ... add all chromosomes you have
  - chrX

Celltype columns

Map your cell types to the corresponding columns in your input count files:

CELLTYPES_TO_COLUMNS:
  HAP1:                    # Cell type name
    - HAP1_B1              # Biological replicate 1
    - HAP1_B2              # Biological replicate 2
  AGS:                     # Second cell type
    - AGS_B1
    - AGS_B2

Normalization

Set up which column of your input should be used for plasmid normalization of the counts.

# Column name used for normalization (typically iPCR or pDNA)
NORMALIZATION_COLUMN: 
  - pDNA_pHY3_T2

# Minimum raw-count threshold for the column above
NORMALIZATION_THRESHOLD: 10
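The effect of NORMALIZATION_THRESHOLD can be sketched as a simple row filter: fragments whose normalization counts fall below the cutoff are excluded. This is an illustrative sketch (values mirror the chr1.txt.gz preview above), not the pipeline's exact implementation:

```python
import pandas as pd

# Fragment counts from the chr1.txt.gz preview above.
counts = pd.DataFrame({
    "chr": ["chr1"] * 4,
    "start": [15552, 51502, 101217, 228479],
    "pDNA_pHY3_T2": [12.0, 0.0, 19.0, 57.0],
})
NORMALIZATION_THRESHOLD = 10
# Keep only fragments with enough normalization counts.
kept = counts[counts["pDNA_pHY3_T2"] >= NORMALIZATION_THRESHOLD]
print(len(kept))  # 3
```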

4.3 Additional Configuration

# Output directory name
OUTDIR: "output_directory"

# Reference genome path
GENOME: /path/to/your/genome.fa

# Size of the onehot matrix. This is the L_max parameter of PARM
MATRIX_SIZE: 600

# Random seed for reproducibility
RANDOM_SEED: 42

# Pseudocount added to prevent log(0) errors during normalization
NORMALIZATION_PSEUDOCOUNT: AUTO
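To make MATRIX_SIZE concrete: each fragment sequence is one-hot encoded into a 4 x MATRIX_SIZE matrix, with shorter sequences zero-padded and longer ones truncated. The base ordering, padding scheme, and handling of ambiguous bases below are illustrative assumptions, not necessarily PARM's exact layout:

```python
import numpy as np

MATRIX_SIZE = 600  # the L_max parameter of PARM
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed base ordering

def onehot_encode(seq, size=MATRIX_SIZE):
    mat = np.zeros((4, size), dtype=np.float32)
    for i, base in enumerate(seq[:size].upper()):
        if base in BASE_INDEX:  # ambiguous bases (e.g. N) stay all-zero
            mat[BASE_INDEX[base], i] = 1.0
    return mat

m = onehot_encode("ACGTN")
print(m.shape)  # (4, 600)
```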

5. Output files

output_directory/
├── onehot/    # <- One-hot encoded data (input for PARM training)
│   └── tss_selection_m300_p100/
│       ├── fold0.hdf5
│       ├── fold1.hdf5
│       ├── fold2.hdf5
│       ├── fold3.hdf5
│       ├── fold4.hdf5
│       └── test.hdf5
├── metadata/    # <- Tabular file with Log2RPM values and fragment coordinates
│   └── tss_selection_m300_p100/
│       ├── metadata_fold0.bed.gz
│       ├── metadata_fold1.bed.gz
│       ├── metadata_fold2.bed.gz
│       ├── metadata_fold3.bed.gz
│       ├── metadata_fold4.bed.gz
│       └── metadata_test.bed.gz
├── replicate_correlations.pdf    # <- QC plot
└── tmp/    # <- Intermediate files

The HDF5 files in the onehot/ directory contain the final input matrices for PARM training, with one-hot encoded DNA sequences and corresponding normalized activity scores.

In both the HDF5 files and the metadata files, promoter activity is stored as Log2RPM values (named Log2RPM_[cell]). To obtain this value, the pipeline first converts the normalization counts (iPCR or pDNA) and the fragment cDNA counts to RPM, then takes the cDNA/normalization ratio. Next, it sets a replicate-specific pseudocount, defined as the 0.1 quantile of the non-zero ratio values for that replicate. Finally, it computes the Log2RPM values per replicate, defined as:

Log2RPM equation

After this, the Log2RPM_replicate values are averaged across replicates to obtain the final promoter activity (Log2RPM_cell).
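These steps can be sketched for one replicate as follows. This is a minimal sketch with simulated counts; the exact handling of zero normalization counts in the real pipeline may differ:

```python
import numpy as np

def log2rpm(cdna, norm):
    cdna_rpm = cdna / cdna.sum() * 1e6  # cDNA counts to RPM
    norm_rpm = norm / norm.sum() * 1e6  # iPCR/pDNA counts to RPM
    # cDNA/normalization ratio; fragments with zero norm counts get ratio 0.
    ratio = np.divide(cdna_rpm, norm_rpm,
                      out=np.zeros_like(cdna_rpm), where=norm_rpm > 0)
    # Replicate-specific pseudocount: 0.1 quantile of the non-zero ratios.
    pseudo = np.quantile(ratio[ratio > 0], 0.1)
    return np.log2(ratio + pseudo)

rng = np.random.default_rng(42)  # simulated counts for illustration
norm = rng.poisson(30, size=1000).astype(float)
rep1 = log2rpm(rng.poisson(20, size=1000).astype(float), norm)
rep2 = log2rpm(rng.poisson(20, size=1000).astype(float), norm)
log2rpm_cell = (rep1 + rep2) / 2  # final promoter activity per fragment
```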

The QC plot shows a replicate correlation matrix of the Log2RPM_replicate values for every cell type, both at the fragment level (each dot being a fragment) and at the feature level (each dot being a feature). Here is an example using the provided data for AGS cells:

Correlation matrix of Log2RPM_replicate values at the fragment level

Replicate correlation for AGS cells


Correlation matrix of Log2RPM_replicate values at the feature level

Replicate correlation for AGS cells

6. Running the pipeline

  1. Set up your configuration file: Copy and modify example_config.yaml with your specific parameters.

  2. Activate Snakemake environment: Ensure you have Snakemake v8.25.3 and Conda v23.3.1 installed.

  3. Dry run (recommended): Test your configuration without running the pipeline:

    snakemake --configfile your_config.yaml --dry-run
  4. Run the pipeline: Execute with your desired number of cores:

    snakemake --configfile your_config.yaml --cores 10 --use-conda

For the example data, the pipeline should run in ~10 minutes.

7. Training a PARM model!

Once the preprocessing pipeline completes successfully, you'll have HDF5 files ready for PARM model training. Go to the PARM repository for more details.

Citation

If you make use of PARM and/or this pipeline, please cite:

Barbadilla-Martínez, L.; Klaassen, N.; Franceschini-Santos, V. H.; Breda, J.; Hernandez-Quiles, M.; van Lieshout, T.; Urzua Traslaviña, C.; Yücel, H.; Boi, M.; Hermana-Garcia-Agullo, C.; Gregoricchio, S.; Zwart, W.; Voest, E.; Franke, L.; Vermeulen, M.; de Ridder, J.; van Steensel, B. (2024). The regulatory grammar of human promoters uncovered by MPRA-trained deep learning. bioRxiv.
