Modular Snakemake workflows for the comprehensive analyses of mature tRNAs and other small RNAs (smRNAs) from high-throughput sequencing data.
- Introduction
- Installation
- Database Build Module
- Database Build Implementation
- Preprocessing Module
- Preprocessing Implementation
- Preprocessing Outputs
- Differential Expression Implementation
- Contact
- Citation and Licensing
This pipeline supports the analysis of mature-tRNAs and other small RNAs (smRNAs) for human (hg38), mouse (mm10), and fly(dm6) genomes through 3 main modules. While typical RNA-Seq preprocessing strategies are emplyed in this pipeline, special considerations to handle tRNA biology are included. Custom reference databases encompass putative mature tRNAs as well as unique tRNA isodecoders as well as tRNA loci in the native host genome along with other annotated smRNAs including miRNA, sRNA, siRNA, snRNA, snoRNA, ribozymes and others. Mature tRNA transcripts are modified to include the addition of a 3' CCA tail which are not encoded genomically. Results are reported at two levels: the gene level which included mature tRNAs plus all other native tRNA-loci and smRNAs, and the isodecoder level, which is tRNA-specific.
The pipeline generates a number of helpful figures for analysis including PCA plots, coverage plots, and normalized abundances of mature tRNAs and other smRNAs, viewable as a pdf report generated at the end of the differential expression module.
To install this code, clone the github repository
#----- Clone repository
git clone https://github.com/Dartmouth-Data-Analytics-Core/GDSC-clover-Seq/
Several conda environments are required to run this code successfully. For your convenience, these conda environments have been prebuilt and are hosted publically at the following path:
/dartfs/rc/nosnapshots/G/GMBSR_refs/envs/GDSC-Clover-Seq
If you wish to build these environments yourself, the associated yaml file in the env_config folder can be built using the following command:
#----- Build conda environment
conda env create -f env_config/<name.yaml>NOTE: THIS IS OPTIONAL, BUT IT IS RECOMMENDED TO USE A PRE-BUILT DATABASE
Reference databases are created through files obtained from gtRNAdb through tRNA-scan experiments. These custom databases encompass not only mature-tRNA sequences, but the full native host genome as well, allowing access to all tNRA-loci. Mature tRNA sequences are created with the addition of a 3' CCA tail, removal of introns, and addition of the histidine post-transcriptional 5'G base. Sequences are padded on both ends by 20 bp to allow for extra bases when mapping. Within this module, multiple-sequence alignments (MSAs) are generated for both the mature tRNA sequences and genomic tRNA loci using cmalign and Infernal with eukaryotic tRNA covariance models from tRNAscan_SE. These MSAs allow the downstream calculation of Sprinzl-positions for each tRNA, facilitating per-base read coverage. Additionally, this module generates a Bowtie2 index for the tRNA-genome (mature tRNA sequences + the full reference genome).
These references have been pre-downloaded, pre-built, and pre-indexed and are hosted by Genomics and Molecular Biology Shared Resources on Discovery for ease of use and efficiency. However, if you wish to build from scratch, or customize the reference, the first module of this pipeline can be used. Pre-built references can be accessed at the following path on Discovery:
/dartfs-hpc/rc/lab/G/GMBSR_bioinfo/genomic_references/tRAX_databases
| Snakemake Rule | Purpose | Conda Environent |
|---|---|---|
generate_gtRNA_db |
Download Ensembl GTF, gtRNA-db data tarball, genome and tRNA fasta file and build resulting tRNA database for downstream analysis in the preprocessing module | clover-seq |
concat_tRNAs |
Concatenate mature tRNA fasta and genome fasta to get tRNA-genome fasta file for use in profiling all smRNAs | clover-seq |
tRNA_bt2_index |
Create Bowtie2 index of tRNA-genome for use in alignment step of preprocessing module | clover-bowtie2 |
To create a tRNA database from scratch, following the steps below:
- Update your
module1_job.script.shscript to point Snakemake to one of the prebuilt configs inprebuilt_configsfolder. These configs are organism-specific and point to specific URLs on gtRNAdb.
#----- Snakemake call
snakemake -s workflows/module-1-build-db.smk \
--configfile prebuilt_configs/hg38_config.yaml \ # EDIT THIS LINE
--use-conda \
--conda-frontend conda \
--conda-prefix /dartfs/rc/nosnapshots/G/GMBSR_refs/envs/GDSC-Clover-Seq \
--profile cluster_profile \
--rerun-incomplete \
--keep-going
- Submit your
module1_job.scriptscript
#----- Submit snakemake job
sbatch module1_job.script.sh
To check the status of your Snakemake job and all child jobs it spawns, run the following (replacing NETID with your Dartmouth NetID)
TIP This job checking line can be applied for all 3 modules!
#----- Check user job status
squeue | grep "NETID"TIP You will notice your directory will populate with logs following the convention: log_X_RuleID_JobID.out. For rules that run on a per-sample basis, there should be one log for each sample. For rules that run once for all samples together, there should be only a single log.
Clover-Seq is adapted from the tRAX Pipeline to be implemented as a Snakemake workflow, allowing sample parallelization and improved modularity and efficiency. As input, this pipeline takes raw fastq.gz files, a config.yaml, and a sample metadata sheet.
This module starts with basic "best-practices" for RNA-Sequencing by trimming adapters. Currently, the Snakemake workflow only supports single-end, UMI-free reads, but support for paired-end and UMIs is under development. A minimum required read length of 15 bp is required for a read to be retained. Following adapter trimming, reads are mapped using Bowtie2 in very-sensitive mode, ignoring quality scores and allowing a maximum of 100 alignments per read. Non-tRNA smRNA annotations are obtained through an Ensembl gtf file (contained in the tRNA database). Read count metrics are calculated for both tRNA-specific alignments (mapping to the mature tRNA sequences) as well as tRNA-isotype mappings (mapping to tRNA-loci).
This module supports the calculation of tRNA fragment types as well. Reads where both the 5' and 3' end lie within 5 nt of their respective ends on the mature tRNAs are categorized as whole-tRNAs. Reads overlapping or aligning closely to the 5' end are categorized as 5' fragments. Likewise, reads aligning closely to the 3' end are categorized as 3' fragments. Fragments not meeting any of these criteria are categorized as "other" (Holmes, et al., 2022)
A gene-level dataset and tRNA-isotype dataset are generated. The former contains mature-tRNA mappings + all other smRNA featrures, and the latter contains tRNA-isotypes from genomic loci. These datasets are used to calculate principal components (separately), normalization size factors through DESeq2, and Rds files for differental expression in module 3.
| Snakemake Rule | Purpose | Conda Environent |
|---|---|---|
trimming |
Trim fastq files to remove adapters and reads not meeting size threshold | clover-seq |
tRNA_align |
Align sequencing reads to combined tRNA sequences and host genome, perform length and quality filtering | clover-bowtie2 |
tRNA_mark_duplicates |
Flag sequencing duplicates | Picard |
tRNA_map_stats |
Collate idxstats and flagstats metrics | clover-seq |
tRNA_count |
Count tRNA isotype reads as well as gene-level tRNAs/smRNAs in the genome | clover-seq |
read_length_distribution |
Collate read-length statistics for all reads (tRNA + non-tRNA) | clover-seq |
count_smRNAs |
Collate smRNA biotype counts across groups and samples | clover-seq |
normalize_and_PCA |
Calculate size factors and normalized-counts for tRNA-isotypes and gene-level counts, run PCA | clover-seq |
get_tRNA_coverage |
Calculare normalized coverage at each Sprinzl position for mature tRNAs | clover-seq |
get_mismatches |
Calculate normalized mismatches for each position in mature tRNAs | clover-seq |
plot_counts |
Generation of various plots | clover-seq |
To run the preprocessing pipeline, modify the Sample_list_SE.txt file. This is a comma-delimited three column file (Sample_ID, fastq_1, Group).
Note that modifying header names will cause the pipeline to fail! Simply modify your sample names, path to the raw fastq.gz file, and Group. Below is an example of how this file should look.
Sample_ID,fastq_1,Group
IB1,data/IB_1_S1_R1_001.fastq.gz,WT
IB2,data/IB_2_S2_R1_001.fastq.gz,WT
IB3,data/IB_3_S3_R1_001.fastq.gz,KO
IB4,data/IB_4_S4_R1_001.fastq.gz,KO
The config.yaml file provides important file paths that allow the Snakefile to dynamically run. Utilize the config specific to your organism in the preprocessing prebuilt configs folder. A list of case-sensitive parameters as specified in the config are listed below:
| Parameter | Value |
|---|---|
sample_txt |
Sample_list_SE.txt |
layout |
one of single or paired (currently only supports single) |
genome |
one of hg38, mm10, or dm6 |
refLevel |
A reference level from the group column in Sample_list_SE.txt |
trna_db |
Path to publically hosted, pre-built tRNA databases |
bt2_index |
Path + bowtie2 index prefix for publically hosted, pre-built tRNA genome Bowtie2 index |
adapter_1 |
Adapter sequence 1 (Illumina universal) |
adapter_2 |
Adapter sequence 2 if paired-end (Illumina universal) |
maxMaps |
Number of mutli-mappings to accept per read |
nPenalty |
Penalty score for ambiguous bases (set to 5 to account for tRNA biology) |
log2FC_magnitude_threshold |
Magnitude of effect size (log2 Fold change) at which results will be considered significant |
padj_threshold |
Adjusted p-value threshold at which results will be considered statisitcally significant |
Modify the module2_job.script.sh script accordingly to point to your config file using the --configfile argument. Below is an example using the hg38 tRNA genome and specifying to submit 10 jobs in parallel across different cluster nodes as specified by --profile cluster_profile
#----- Run snakemake workflow
snakemake -s workflows/module-2-preprocess.smk \
--use-conda \
--configfile prebuilt_configs/hg38_config.yaml \ # EDIT THIS LINE
--conda-frontend conda \
--conda-prefix /dartfs/rc/nosnapshots/G/GMBSR_refs/envs/GDSC-Clover-Seq \
--profile cluster_profile \
--rerun-incomplete \
--keep-going To submit your job, run the following code.
#----- Submit snakemake job script
sbatch module2_job.script.shThe preprocessing module of this workflow contains 9 major output folder. Their contents are exaplined in the following sections.
Contains trimmed fastq files and trimming logs in the logs folder (if SE and using CutAdapt)
Contains multiple alignment files. Additionally, a third folder (02_tRNA_unaligned) holds fastq files containing reads that did not align to the tRNA-genome.
| Files | Content | Rule |
|---|---|---|
.srt.bam |
Bam files sorted by coordinate and filtered for reads between 15-90 bp | rule tRNA_alignment |
.alignment.log.txt |
Bowtie2 alignment log | rule tRNA_alignment |
.mkdup.bam |
Filtered and sorted bam file with duplicates flagged | rule tRNA_mark_duplicates |
.mkdup.log.txt |
Picard Markduplicates log file | rule tRNA_mark_duplicates |
stats/.mkdup.bam.idxstats |
Samtools idxstats metrics | rule tRNA_map_stats |
stats/.mkdup.bam.flagstat |
Samtools flagstat metrics | rule tRNA_map_stats |
full_alignment_read_length_distribution.txt |
Per-sample read length distribution information for all reads | read_length_distribution |
Contains detailed per-sample tRNA count information at the gene and isoform level as well as coverage and mismatch information.
| Files | Content | Rule |
|---|---|---|
genetype_counts.txt |
rule tRNA_count |
|
tRNA_isotype_counts.txt |
Per sample counts of tRNA isoforms | rule tRNA_count |
gene_level_counts_detailed.txt |
Per sample counts of tRNAs broken down by read type (5', 3' antisense, other) and genome smRNAs/tRNA-loci | rule tRNA_count |
gene_level_counts_collapsed.txt |
Per sample counts of tRNAs collapsed to isodecoder level and genome smRNAs | rule tRNA_count |
tRNA_ends_counts.txt |
Contains counts for CCA, CC, and C ending tRNAs | rule tRNA_count |
mature_tRNA_coverages.txt |
Contains normalized coverages for tRNAs by Sprinzl position | rule get_tRNA_coverage |
Z_scored_summed_mismatches_relative_to_*.png |
Contains normalized mismatch information at each position, summed across replicates, relative to your reference group, visualized as a Z-scaled heatmap | rule get_mismatches |
mismatch_figures |
Folder containing normalized mismatch information at each position, summed across replicates, relative to your reference group, split by isoacceptor family, vlisualized as a Z-scaled and hierarchically clustered heatmap | rule get_mismatches |
Contains detailed per-sample smRNA count information at the gene biotype level for tRNAs and other smRNAs.
| Files | Content | Rule |
|---|---|---|
raw_amino_counts_by_group.txt |
Per-greoup raw total counts for all amino acids | rule smRNA_count |
read_length_distribution.txt |
Per-sample read length distribution for tRNAs split by pre-tRNAs and mature tRNAs | rule smRNA_count |
smRNA_raw_counts_by_group.txt |
Per-group raw total counts for all smRNA gene biotypes | rule smRNA_count |
smRNA_raw_counts_by_sample.txt |
Per-sample raw total counts for all smRNA gene biotypes | rule smRNA_count |
raw_anticodon_counts_by_sample.txt |
Per-sample raw total counts for all anticodons | rule smRNA_count |
subgroup_counts.txt |
Per-sample raw total counts grouped by tRNA or non-tRNA | rule smRNA_count |
Contains normalization information, and normalized tRNA counts.
| Files | Content | Rule |
|---|---|---|
gene_level_counts_size_factors.csv |
Per-sample size factors calculated by median-of-ratios (DESeq2) for tRNA isodecoders + genomic smRNAs | rule normalize_and_PCA |
normalized_gene_level_counts.csv |
Per-sample normalized counts calculated by rlog transformation (DESeq2) for tRNA isodecoders + genomic smRNAs | rule normalize_and_PCA |
tRNA_isotype_counts_size_factors |
Per-sample size factors calculated by median-of-ratios (DESeq2) for tRNA isotypes only | rule normalize_and_PCA |
normalized_tRNA_isotype_counts.csv |
Per-sample normalized counts calculated by rlog transformation (DESeq2) for tRNA isotypes only | rule normalize_and_PCA |
CCA_ends_normalized.csv |
CCA-end counts normalized using tRNA-isotype dataset size factors | rule plot_counts |
Contains tabular data for principal component analysis and associated PCA plots.
| Files | Content | Rule |
|---|---|---|
gene_level_variance_plot.png |
Shows ranked gene variances for rlog-normalized tRNA isodecoders + genomic smRNAs | rule normalize_and_PCA |
gene_level_loadings.csv |
PCA loadings per sample based on 500 most variable features for rlog-normalized tRNA isodecoders + genomic smRNAs | rule normalize_and_PCA |
gene_level_PCA.png |
PCA plot based on 500 most variable features for rlog-normalized tRNA isodecoders + genomic smRNAs | rule normalize_and_PCA |
tRNA_isotype_variance_plot.png |
Shows ranked gene variance for rlog-normalized tRNA isotypes only | rule normalize_and_PCA |
tRNA_isotype_loadings.csv |
PCA loadings per sample based on 500 most variable features for rlog-normalized tRNA isotypes only | rule normalize_and_PCA |
tRNA_isotype_PCA.png |
PCA plot based on 500 most variable features for rlog-normalized tRNA isotypes only | rule normalize_and_PCA |
PCA_Analysis_Summary.png |
Side by side PCA plots for full gene-level and tRNA only analysis | rule normalize_and_PCA |
PCA_Direct_Comparison.png |
PCA data for both experiments on one plot to highlight shifts | rule normalize_and_PCA |
Contains DESeq2 serialized R-data objects to be used in the differential expression workflow.
| Files | Content | Rule |
|---|---|---|
gene_level_DESeq2_object.Rds |
DESeq dataset object for tRNA isodecoders + genomic smRNAs dataset with (reference level set in config) | rule normalize_and_PCA |
tRNA_isotype_DESeq2_object.Rds |
DESeq dataset object for tRNA isotypes only (reference level set in config) | rule normalize_and_PCA |
Contains various plots to explore counts.
| Files | Content | Rule |
|---|---|---|
CCA_ends_normalized_absolute_abundances.png |
Absolute abundances of CCA end types per sample, normalized using tRNA-isotype size factors with resulting normalized tabular data saved in 05_normalized folder | rule plot_counts |
CCA_ends_relative_abundances.png |
Relative abundances (calculated from non-normalized CCA-end counts) of CCA end types per sample | rule plot_counts |
Grouped_boxplot_norm_tRNA_isotypes_by_Sample_and_Anticodon.png |
High detial plot showing counts of each tRNA isotype for specific codons | rule plot_counts |
Isoacceptor_counts_normalized.png |
Normalized absolute abundance of each isoacceptor type | rule plot_counts |
Isoacceptor_counts_by_sample_normalized.png |
Absolute composition of amino-acids by Sample | rule plot_counts |
mature_tRNA_coverages |
Folder containing multiple pngs (1 per mature tRNA feature detected), showing normalized coverage per Sprinzl position, split by sample | rule plot_counts |
Contains general QC summary in html format with associated data tables
| Files | Content | Rule |
|---|---|---|
tRNA_multi_QC_report_data |
Folder containing multi-qc data | rule all |
tRNA_multi_QC_report.html |
Web viewable html summary of QC metrics for trimming and alignment | rule all |
Differential expression is conducted using DESeq2. Rds files containing differential expression results are generated in the preprocessing module are analyzed in this module.
This code expects two Rds files, gene_level_DESeq2_object.Rds and tRNA_isotype_DESeq2_object.Rds within your 07_rds_files folder.
DESeq2 results undergo adaptive shrinkage (ashr) which is a method for empirical Bayes shrinkage estimation. Ashr estimates effect sizes (e.g. log2 fold changes) and adjusted standard errors by shrinking noisy estimates toward a central value with the goal of reducing noise and avoiding over-estimation of effects when counts are low, or high. This approach is adaptive as shrinkage depends on the standard error of each estimate and the estimated global distirbution of true effects.
Each dataset produces two main plots: an MA plot and a volcano plot. MA plots show the average expression (A) in log10 scale on the X-axis and log2 fold change (M) on the y-axis. Genes on the right side of the plot are highly expressed while those on the left are lowly expressed. Genes are also sized by their significance (-log10 padj). Larger points indicate greater significance.
The differential expression module consists of a single rule which produces the following outputs:
| Files | Content | Rule |
|---|---|---|
gene_level_DESeq2_shrink_Results.csv |
Contains DESeq2 results and normalized counts per sample (Median-of-ratios) for gene-level dataset | rule extract_DESeq2_results |
tRNA_isotype_DESeq2_shrink_Results.csv |
Contains DESeq2 results and normalized counts per sample (Median-of-ratios) for tRNA isotype dataset | rule extract_DESeq2_results |
Figures |
Folder containing MA and volcano plots for both datasets | rule extract_DESeq2_results |
Upon completion of the differential expression module, a pdf report clover-seq-report.pdf is generated.
Please address questions to DataAnalyticsCore@groups.dartmouth.edu or submit an issue in the GitHub repository.
This codebase is adapted from the original tRAX tool, licensed under GPL v3.0., modified by Mike Martinez, Dartmouth Genomic Data Science Core
Citation: Holmes AD, Howard JM, Chan PP, and Lowe TM.
This pipeline was created with funds from the COBRE grant 1P20GM130454. If you use the pipeline in your own work, please acknowledge the pipeline by citing the grant number in your manuscript.
Protocols.io DOI: dx.doi.org/10.17504/protocols.io.n2bvjewz5gk5/v1




