- Longread_denovo_benchmark
Code to generate, process and quality-check long-read de novo transcriptome assemblies.
We first obtained a subset of transcripts that are widely expressed in the GTEx v9 long-read dataset (92 samples) using the GENCODE comprehensive annotation (v44). We kept transcripts with more than 5 reads in at least 15 samples after Salmon quantification (18145 genes, 40509 transcripts), and stored their mean counts per million (CPM) as the control group’s baseline expression.
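The filtering rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the repository's code: the function names (`cpm`, `widely_expressed`, `mean_cpm`) and the toy data are hypothetical, and CPM is assumed to be computed over all transcripts in a sample before filtering, which the text does not specify.

```python
def cpm(counts):
    """Convert one sample's {transcript: reads} into counts per million."""
    total = sum(counts.values())
    if total == 0:
        return {tx: 0.0 for tx in counts}
    return {tx: 1e6 * n / total for tx, n in counts.items()}


def widely_expressed(samples, min_reads=5, min_samples=15):
    """Transcripts with more than min_reads reads in at least min_samples samples."""
    hits = {}
    for counts in samples:
        for tx, n in counts.items():
            if n > min_reads:
                hits[tx] = hits.get(tx, 0) + 1
    return {tx for tx, k in hits.items() if k >= min_samples}


def mean_cpm(samples, keep):
    """Mean CPM across all samples for the kept transcripts."""
    per_sample = [cpm(c) for c in samples]
    return {
        tx: sum(s.get(tx, 0.0) for s in per_sample) / len(per_sample)
        for tx in keep
    }


# Toy data (hypothetical): 3 samples, so the sample threshold is lowered to 2.
samples = [
    {"tx1": 10, "tx2": 3, "tx3": 7},
    {"tx1": 8, "tx2": 6, "tx3": 0},
    {"tx1": 12, "tx2": 2, "tx3": 9},
]
keep = widely_expressed(samples, min_reads=5, min_samples=2)
baseline = mean_cpm(samples, keep)
```

With the real data the defaults (`min_reads=5`, `min_samples=15`) correspond to the thresholds quoted above.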
We then generated a perturbed set of CPM values by changing transcript expression in three ways: (1) we randomly selected 1000 genes and changed all transcripts of each gene concordantly (500 genes 2-fold up and 500 genes 2-fold down); (2) we randomly selected another 1000 genes and, within each gene, swapped the expression of 2 randomly chosen transcripts; (3) we randomly selected another 1000 genes and, within each gene, changed the expression of 1 randomly chosen transcript (500 transcripts 2-fold up and 500 transcripts 2-fold down). The updated CPM values were stored as the perturbed group’s baseline expression.
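A minimal Python sketch of the three perturbations is given below. It is not the repository's code. The function name `perturb`, the seed and the toy data are hypothetical. Two assumptions go beyond the text: the three gene sets are taken to be disjoint, and the genes for perturbations (2) and (3) are drawn only from genes with at least two transcripts. CPM values are not renormalised afterwards.

```python
import random


def perturb(baseline, gene_of, n_genes=1000, fold=2.0, seed=1):
    """baseline: {transcript: CPM}; gene_of: {transcript: gene}."""
    rng = random.Random(seed)
    by_gene = {}
    for tx in sorted(baseline):
        by_gene.setdefault(gene_of[tx], []).append(tx)

    # Assumption: swap and single-transcript genes need >= 2 transcripts.
    multi = sorted(g for g, txs in by_gene.items() if len(txs) >= 2)
    picked = rng.sample(multi, 2 * n_genes)
    swap_genes, single_genes = picked[:n_genes], picked[n_genes:]
    # Assumption: whole-gene changes use genes not picked above.
    rest = sorted(set(by_gene) - set(picked))
    whole_genes = rng.sample(rest, n_genes)

    out = dict(baseline)
    half = n_genes // 2

    # (1) all transcripts of a gene change together: first half up, rest down
    for i, g in enumerate(whole_genes):
        factor = fold if i < half else 1 / fold
        for tx in by_gene[g]:
            out[tx] = baseline[tx] * factor

    # (2) swap the expression of two random transcripts within a gene
    for g in swap_genes:
        a, b = rng.sample(by_gene[g], 2)
        out[a], out[b] = baseline[b], baseline[a]

    # (3) change one random transcript per gene: first half up, rest down
    for i, g in enumerate(single_genes):
        tx = rng.choice(by_gene[g])
        out[tx] = baseline[tx] * (fold if i < half else 1 / fold)

    groups = {"whole": whole_genes, "swap": swap_genes, "single": single_genes}
    return out, groups


# Toy data (hypothetical): 12 genes with 3 transcripts each, 2 genes per set.
baseline = {f"g{g}.t{t}": 10.0 * (t + 1) for g in range(12) for t in range(3)}
gene_of = {tx: tx.split(".")[0] for tx in baseline}
perturbed, groups = perturb(baseline, gene_of, n_genes=2)
```

Returning the selected gene sets alongside the perturbed values keeps the ground truth available for later evaluation.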
We then generated a count matrix and a CPM matrix for 3 control replicates and 3 perturbed replicates by sampling from a gamma distribution followed by a Poisson distribution (Baldoni et al., 2024). Both long-read and short-read FASTQ files were simulated using SQANTI-SIM with default settings and the ONT R9.4 cDNA error profile (v 0.2.1) (Mestre-Tomás et al., 2023). The long-read data contained 6 million reads in total with an average read length of 1085 bp, and the short-read data was 100 bp paired-end. We then subsampled the short-read data to match the total number of bases in the long-read data (6.5 billion bases).
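The gamma-then-Poisson sampling can be illustrated with the standard library alone. This sketch is not the simulation code used here. The dispersion value, the per-replicate library size of 1 million reads (6 million split evenly over 6 replicates) and the function names are assumptions. The Poisson sampler is also a shortcut: it uses Knuth's method for small means and a rounded normal approximation for larger ones.

```python
import math
import random


def poisson(rng, lam):
    """Poisson draw: Knuth's method for small means, normal approximation otherwise."""
    if lam <= 0:
        return 0
    if lam < 30:
        limit, k, p = math.exp(-lam), 0, rng.random()
        while p > limit:
            k += 1
            p *= rng.random()
        return k
    # Approximation used only to keep this sketch dependency-free.
    return max(0, round(rng.gauss(lam, math.sqrt(lam))))


def simulate_counts(cpm, lib_size, dispersion, n_reps, seed=1):
    """Return a list of {transcript: count} dicts, one per replicate."""
    rng = random.Random(seed)
    shape = 1.0 / dispersion
    reps = []
    for _ in range(n_reps):
        counts = {}
        for tx, value in cpm.items():
            mu = value * lib_size / 1e6  # expected count from baseline CPM
            # Gamma noise with mean 1 and variance equal to the dispersion.
            lam = mu * rng.gammavariate(shape, dispersion)
            counts[tx] = poisson(rng, lam)
        reps.append(counts)
    return reps


# Toy baseline (hypothetical values); dispersion 0.05 is an arbitrary choice.
baseline_cpm = {"tx1": 600000.0, "tx2": 399990.0, "tx3": 10.0}
reps = simulate_counts(baseline_cpm, lib_size=1_000_000, dispersion=0.05, n_reps=3)
```

Drawing the Poisson mean from a gamma distribution yields negative-binomial counts, so the dispersion parameter controls the variability between replicates.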
The simulated data is non-stranded and contains 2000 differentially expressed (DE) genes, 2000 genes with differential transcript usage (DTU), 5927 transcripts with DTU and 6933 DE transcripts. It is available for download at https://doi.org/10.5281/zenodo.14263456.
All assemblies are uploaded to https://doi.org/10.5281/zenodo.17538009.
Data downloaded and processed using code.
Hybrid simulation data generated using code.
Hybrid PCR-cDNA data generated using code.
Data downloaded and processed using code.
We now provide the code to run all quality checks using the Nextflow pipeline denovo_qc_nextflow. Older QC scripts can also be found in qc/.
Code for generating quality metrics
We have generated the following measures to compare the quality of assemblies.
1. Transrate analysis of de novo transcriptome
2. BUSCO analysis using raw de novo transcriptome
3. SQANTI3 analysis of de novo transcriptome
4. BUSCO analysis using genome corrected de novo transcriptome
5. Salmon quantification of pooled samples using de novo transcriptome
6. (Optional) Oarfish quantification of pooled samples or single-cell using de novo transcriptome
7. Corset clustering of de novo transcriptome
8. Salmon quantification of individual samples using de novo transcriptome
Code for summarising and DE analysis
We then performed DE analysis (DGE, DTE and DTU) using the count matrix from step 8 for bulk RNA-seq, and from step 6 for single-cell data (after clustering and pseudobulking).
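For the single-cell branch, pseudobulking amounts to summing per-cell counts within each group before DE testing. The sketch below is a generic illustration, not the repository's implementation. The function name `pseudobulk`, the grouping by (sample, cluster) and the toy data are assumptions.

```python
from collections import defaultdict


def pseudobulk(cell_counts, cell_group):
    """Sum per-cell transcript counts within each group, e.g. (sample, cluster)."""
    bulk = defaultdict(lambda: defaultdict(float))
    for cell, counts in cell_counts.items():
        group = cell_group[cell]
        for tx, n in counts.items():
            bulk[group][tx] += n
    return {group: dict(counts) for group, counts in bulk.items()}


# Toy data (hypothetical): three cells from one sample, two clusters.
cells = {
    "c1": {"tx1": 2, "tx2": 1},
    "c2": {"tx1": 3},
    "c3": {"tx2": 4},
}
cell_to_group = {"c1": ("s1", "k0"), "c2": ("s1", "k0"), "c3": ("s1", "k1")}
bulk = pseudobulk(cells, cell_to_group)
```

The resulting per-group count table can then be analysed in the same way as the bulk count matrix.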
Towards accurate, reference-free differential expression: A comprehensive evaluation of long-read de novo transcriptome assembly. Feng Yan, Pedro L. Baldoni, James Lancaster, Matthew E. Ritchie, Mathew G. Lewsey, Quentin Gouil, Nadia M. Davidson. bioRxiv 2025.02.02.635999; doi: https://doi.org/10.1101/2025.02.02.635999v2
This project is licensed under the MIT License. See the LICENSE file for details.