Welcome to the GitHub repository for the following publication: Genome-scale quantification and prediction of drug-induced readthrough of pathogenic premature termination codons (Toledano I, Supek F & Lehner B, 2023)
Here you'll find source code for computational analyses and to reproduce the figures in the paper.
- 1. Required Software
- 2. Required Data
- 3. Installation Instructions
- 4. Usage
To run the Stop_codon_readthrough pipeline you will need the following software and associated packages:
- R (dplyr, stringr, stringi, GGally, ggpubr, ggplot2, viridis, tidyverse, seqinr, matrixStats, data.table, rtracklayer, openxlsx, reshape2, caret, hexbin, png, grid, gridExtra, MuMIn, tidyr, rstatix, ggridges, hrbrthemes, glmnet, spgs, ggtext, devtools, ggdendroplot, UpSetR)
Read counts (DiMSum output), readthrough efficiencies, and required miscellaneous files should be downloaded from Main dataset and other files, Fig.1, Fig.2, Fig.3, Fig.4, Fig.5 and Fig.6 (files are organised based on the Figure they are used in) to your project directory (named 'base_dir') i.e. where output files should be written.
Make sure you have git and conda installed and then run (expected install time <10min):
# Install dependencies (preferably in a fresh conda environment)
conda install -c conda-forge r-dplyr, r-stringr, r-stringi, r-ggally r-ggpubr r-ggplot2 r-viridis r-tidyverse r-seqinr r-matrixstats r-data.table r-openxlsx r-reshape2 r-caret r-hexbin r-png r-gridextra r-mumin r-tidyr r-rstatix r-ggridges r-hrbrthemes r-glmnet r-spgs r-ggtext r-devtools r-UpSetR r-biocmanager
conda install -c bioconda bioconductor-biomart
conda install -c bioconda bioconductor-rtracklayer
conda install conda-forge::r-gridgraphics
Alternatively load the 'RT_diseasePTCs.yml' (set up in Linux Operating System (Scientific Linux 7.2)) which contains the conda environment already generated.
The 7 R Markdown files contain the code to reproduce the figures and results from the computational analyses described in the following publication: Genome-scale quantification and prediction of drug-induced readthrough of pathogenic premature termination codons (Toledano I, Supek F & Lehner B, 2023). See Required Data for instructions on how to obtain all required data and miscellaneous files before running the pipeline. If using/downloading the files from Required Data and only plotting the figures, the expected run time is <10min. However, if generating all the files (i.e. the in silico PTC saturation dataset of the human genome) and models needed for all main and supplementary figures, the expected run time is ~2days (without data parallelisation). All steps in which the user can decide whether to generate the file/model or to download it from Required Data are indicated. R Markdown files are meant to be run in the following order:
- 1. Generate_treated_samples.Rmd
- 2. Fig1_EDFig1.Rmd
- 3. Fig2_EDFig2.Rmd
- 4. Fig3_EDFig3.Rmd
- 5. Fig4_EDFig4.Rmd
- 6. Fig5_EDFig5.Rmd
- 7. Fig6.Rmd
The following software package is required for pre-processing of raw FASTQ files:
- DiMSum v1.2.9 (pipeline for pre-processing deep mutational scanning data i.e. FASTQ to fitness). Download the FastQ files from Sequence Read Archive (SRA) with accession number PRJNA996618: http://www.ncbi.nlm.nih.gov/bioproject/996618 to your base directory (base_dir). Store the Clitocine, DAP and SRI FastQ files in a separate folder (named 'round_A_fastq') than CC90009, FUr, Gentamicin, G418, SJ6986 and untreated conditions (folder named 'round_B_fastq'). That is because they were assayed in two different rounds (named 'A' and 'B') and we did a separate Dimsum run for each. Shell scripts to run both Dimsum rounds can be found in Required Data.
Configuration files and additional scripts for running DiMSum are available in the "DiMSum" folder here.