A comprehensive evaluation of long-read de novo transcriptome assembly

Longread_denovo_benchmark

Code to generate, process, and quality-check long-read de novo transcriptome assemblies.

Generate simulation data

We first obtained a subset of transcripts that are widely expressed in the GTEx v9 long-read dataset (92 samples), using the Gencode comprehensive annotation (v44). After Salmon quantification, we kept transcripts with more than 5 reads in at least 15 samples (18145 genes, 40509 transcripts) and stored their mean counts per million (CPM) as the control group's baseline expression.
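The filtering rule above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code: the function names `widely_expressed` and `mean_cpm`, the toy counts, and the choice to compute library sizes from all transcripts (rather than only the kept ones) are assumptions.

```python
from statistics import mean

def widely_expressed(counts, min_reads=5, min_samples=15):
    """Return IDs of transcripts with more than `min_reads` reads
    in at least `min_samples` samples.

    counts: dict mapping transcript ID -> list of read counts, one per sample.
    """
    return [
        tx for tx, per_sample in counts.items()
        if sum(c > min_reads for c in per_sample) >= min_samples
    ]

def mean_cpm(counts, keep):
    """Mean counts per million across samples for the transcripts in `keep`."""
    n_samples = len(next(iter(counts.values())))
    # Assumption: library size is the column sum over ALL transcripts,
    # not only the ones that pass the filter.
    lib_sizes = [sum(v[i] for v in counts.values()) for i in range(n_samples)]
    return {
        tx: mean(c / lib * 1e6 for c, lib in zip(counts[tx], lib_sizes))
        for tx in keep
    }

# Toy example with 20 samples (hypothetical data).
toy = {
    "tx_a": [10] * 20,            # passes: 20 samples above 5 reads
    "tx_b": [10] * 10 + [0] * 10, # fails: only 10 samples above 5 reads
    "tx_c": [6] * 15 + [5] * 5,   # passes: exactly 15 samples above 5 reads
}
kept = widely_expressed(toy)
baseline = mean_cpm(toy, kept)
```

Note that the threshold is strict ("more than 5 reads"), so a sample with exactly 5 reads does not count towards the 15.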

We then generated a perturbed set of CPM values in which transcript expression was changed in three ways: (1) we randomly selected 1000 genes and changed all transcripts belonging to each gene concordantly (500 genes 2-fold up and 500 genes 2-fold down); (2) we randomly selected another 1000 genes, then picked 2 random transcripts from each gene and swapped their expression; (3) we randomly selected another 1000 genes, then picked 1 random transcript from each gene and changed its expression (500 transcripts 2-fold up and 500 transcripts 2-fold down). The updated CPM values were stored as the perturbed group's baseline expression.
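The three perturbations can be illustrated with a short Python sketch. It is written under stated assumptions and is not the repository's implementation: the function name `perturb_cpm`, the seed, and the toy data are hypothetical, and the sketch assumes that the three gene groups are disjoint and that groups (2) and (3) are drawn from genes with at least two transcripts.

```python
import random

def perturb_cpm(cpm, gene_of, n_per_group=1000, fold=2.0, seed=1):
    """Return (perturbed CPM dict, dict of the gene groups that were changed).

    cpm: transcript ID -> baseline CPM; gene_of: transcript ID -> gene ID.
    """
    rng = random.Random(seed)
    by_gene = {}
    for tx in sorted(cpm):
        by_gene.setdefault(gene_of[tx], []).append(tx)

    # Assumption: swap and single-transcript groups need multi-isoform genes.
    multi = [g for g in sorted(by_gene) if len(by_gene[g]) >= 2]
    picked = rng.sample(multi, 2 * n_per_group)
    swap_genes, single_genes = picked[:n_per_group], picked[n_per_group:]
    used = set(picked)
    rest = [g for g in sorted(by_gene) if g not in used]
    gene_level = rng.sample(rest, n_per_group)

    out = dict(cpm)
    half = n_per_group // 2

    # (1) Whole-gene change: first half 2-fold up, second half 2-fold down.
    for i, g in enumerate(gene_level):
        factor = fold if i < half else 1.0 / fold
        for tx in by_gene[g]:
            out[tx] = cpm[tx] * factor

    # (2) Swap the expression of two random transcripts within each gene.
    for g in swap_genes:
        a, b = rng.sample(by_gene[g], 2)
        out[a], out[b] = cpm[b], cpm[a]

    # (3) Change one random transcript per gene, half up and half down.
    for i, g in enumerate(single_genes):
        tx = rng.choice(by_gene[g])
        out[tx] = cpm[tx] * (fold if i < half else 1.0 / fold)

    groups = {"gene_level": gene_level, "swap": swap_genes, "single": single_genes}
    return out, groups

# Toy example: 12 genes with 2 transcripts each, 2 genes per group.
toy_cpm = {f"g{g}_t{t}": float(10 * g + t + 1) for g in range(12) for t in range(2)}
toy_gene = {tx: tx.split("_")[0] for tx in toy_cpm}
perturbed, groups = perturb_cpm(toy_cpm, toy_gene, n_per_group=2)
```

In this sketch a swap leaves the gene total unchanged, so group (2) alters transcript usage without altering gene-level expression.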

We then generated a count matrix and a CPM matrix for 3 control replicates and 3 perturbed replicates by sampling from a gamma distribution followed by a Poisson distribution (Baldoni et al., 2024). Both long-read and short-read FASTQ files were simulated using SQANTI-SIM (v0.2.1) with default settings and the ONT R9.4 cDNA error profile (Mestre-Tomás et al., 2023). The long-read data contained 6 million reads in total, with an average read length of 1085 bp; the short-read data was 100 bp paired-end. We then subsampled the short-read data to match the total number of bases in the long-read data (6.5 billion bases).
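As a rough illustration of gamma-then-Poisson sampling, the sketch below draws a gamma-distributed expected count for each transcript and replicate, then a Poisson count around it. The dispersion, library size, seed, and all function names are assumptions chosen for the example and are not taken from the source; the Poisson helper also uses a normal approximation for large means to stay within the standard library. The last helper shows the arithmetic for matching bases: 6.5 billion bases at 2 x 100 bp per read pair.

```python
import math
import random

def rpois(lam, rng):
    """Draw a Poisson count (Knuth's method; normal approximation above 50)."""
    if lam <= 0:
        return 0
    if lam > 50:
        return max(0, round(rng.gauss(lam, math.sqrt(lam))))
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def simulate_counts(base_cpm, n_reps=3, lib_size=1_000_000, dispersion=0.1, seed=0):
    """Gamma-Poisson counts per transcript and replicate.

    lib_size and dispersion are hypothetical values, not the study's settings.
    """
    rng = random.Random(seed)
    shape = 1.0 / dispersion  # gamma mean = mu, variance = dispersion * mu**2
    counts = {}
    for tx, cpm in base_cpm.items():
        mu = cpm * lib_size / 1e6  # expected count at this library size
        row = []
        for _ in range(n_reps):
            lam = rng.gammavariate(shape, mu / shape) if mu > 0 else 0.0
            row.append(rpois(lam, rng))
        counts[tx] = row
    return counts

def to_cpm(counts):
    """Convert a count matrix (transcript -> per-replicate counts) to CPM."""
    n = len(next(iter(counts.values())))
    totals = [sum(row[i] for row in counts.values()) for i in range(n)]
    return {
        tx: [c / t * 1e6 if t else 0.0 for c, t in zip(row, totals)]
        for tx, row in counts.items()
    }

def pairs_to_match_bases(target_bases, read_len=100):
    """Number of paired-end read pairs whose total bases match target_bases."""
    return target_bases // (2 * read_len)

control = simulate_counts({"tx_a": 100.0, "tx_b": 20.0, "tx_c": 0.0})
control_cpm = to_cpm(control)
n_pairs = pairs_to_match_bases(6_500_000_000)  # 32.5 million pairs
```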

The simulated data is non-stranded and contains 2000 differentially expressed (DE) genes, 2000 genes with differential transcript usage (DTU), 5927 transcripts with DTU, and 6933 DE transcripts. It is available for download at https://doi.org/10.5281/zenodo.14263456.

Assemble

All assemblies are uploaded to https://doi.org/10.5281/zenodo.17538009.

ONT simulation data (unstranded)

Data downloaded and processed using code.

Hybrid simulation data generated using code.

Code for assembling

ONT PCR-cDNA data from cancer cell lines (stranded)

Hybrid PCR-cDNA data generated using code.

Code for assembling

ONT Direct RNA data from cancer cell lines (stranded)

Code for assembling

PacBio Kinnex human PBMC single-cell data, 10x 3' kit (stranded)

Data downloaded and processed using code.

Code for assembling.

ONT Pea PCR-cDNA data (stranded)

Code for assembling

Generate summary and quality metrics

We now provide code to run all quality checks with the Nextflow pipeline denovo_qc_nextflow. The older QC scripts can still be found in qc/.

Code for generating quality metrics

We have generated the following measures to compare the quality of assemblies.

1. Transrate analysis of de novo transcriptome
2. BUSCO analysis using raw de novo transcriptome
3. SQANTI3 analysis of de novo transcriptome
4. BUSCO analysis using genome-corrected de novo transcriptome
5. Salmon quantification of pooled samples using de novo transcriptome
6. (Optional) Oarfish quantification of pooled samples or single-cell data using de novo transcriptome
7. Corset clustering of de novo transcriptome
8. Salmon quantification of individual samples using de novo transcriptome

Summarise metrics and run differential analysis in R

Code for summarising and DE analysis

We then performed differential analysis (DGE, DTE, DTU) using the count matrix from step 8 (Salmon quantification of individual samples) for bulk RNA-seq, and from step 6 (Oarfish quantification) for single-cell data, after first clustering and pseudobulking the cells.
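Pseudobulking is commonly done by summing per-cell counts within each sample and cluster, which gives one bulk-like column per group. The Python sketch below shows that idea only; the repository's analysis is in R and may aggregate differently, and the function name `pseudobulk` and the toy cells are hypothetical.

```python
from collections import defaultdict

def pseudobulk(cell_counts, cell_group):
    """Sum per-cell transcript counts within each (sample, cluster) group.

    cell_counts: cell barcode -> {transcript ID: count}
    cell_group:  cell barcode -> (sample, cluster) label
    """
    bulk = defaultdict(lambda: defaultdict(float))
    for cell, tx_counts in cell_counts.items():
        group = cell_group[cell]
        for tx, count in tx_counts.items():
            bulk[group][tx] += count
    return {group: dict(tx_counts) for group, tx_counts in bulk.items()}

# Toy example: three cells from one sample, two clusters (hypothetical data).
cells = {
    "cell1": {"tx_a": 2, "tx_b": 1},
    "cell2": {"tx_a": 3},
    "cell3": {"tx_b": 4},
}
labels = {
    "cell1": ("s1", "T"),
    "cell2": ("s1", "T"),
    "cell3": ("s1", "B"),
}
pb = pseudobulk(cells, labels)
```

The resulting per-group count columns can then be treated like bulk samples in the differential analysis.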

Generate plot

Reference

Towards accurate, reference-free differential expression: A comprehensive evaluation of long-read de novo transcriptome assembly. Feng Yan, Pedro L. Baldoni, James Lancaster, Matthew E. Ritchie, Mathew G. Lewsey, Quentin Gouil, Nadia M. Davidson. bioRxiv 2025.02.02.635999; doi: https://doi.org/10.1101/2025.02.02.635999v2

License

This project is licensed under the MIT License. See the LICENSE file for details.
