This pipeline prepares reference data and performs long-read simulation using SQANTI-SIM. This directory contains the scripts required to simulate long-read transcriptomic data with controlled Differential Gene Expression (DGE), Differential Transcript Expression (DTE), and Differential Transcript Usage (DTU). It is organized into two sequential steps.
Script: 1.run_ref_prep.sh
This step performs the following:
- Alignment & Quantification: Aligns raw GTEx long-read data to the reference transcriptome using minimap2 and quantifies abundance using salmon.
- Baseline Estimation: Runs gtex_info.R to generate baselineAbundance.rds and a list of expressed transcripts (txid.txt).
- Subsetting: Creates GTF, FASTA, and BED files restricted to the identified transcripts.
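
The subsetting idea in the last bullet can be illustrated with a minimal sketch. It assumes that txid.txt holds one transcript ID per line and that the GTF carries a `transcript_id "..."` attribute in column 9. The helper name `subset_gtf` and the toy records are hypothetical and are not taken from 1.run_ref_prep.sh, which may perform this step differently.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: keep GTF records whose transcript_id appears in an ID file.
# Usage: subset_gtf <txid.txt> <annotation.gtf>
# Records without a transcript_id attribute (for example gene lines) are dropped.
subset_gtf() {
  awk -F'\t' '
    NR == FNR { keep[$1] = 1; next }   # first file: one transcript ID per line
    /^#/      { next }                 # skip GTF header comments
    match($9, /transcript_id "[^"]+"/) {
      # the key plus opening quote is 15 characters; drop it and the closing quote
      id = substr($9, RSTART + 15, RLENGTH - 16)
      if (id in keep) print
    }
  ' "$1" "$2"
}

# Toy inputs (made-up IDs and coordinates) so the sketch runs on its own.
workdir=$(mktemp -d)
printf '%s\n' 'ENST0001.1' 'ENST0003.1' > "$workdir/txid.txt"
{
  printf 'chr1\tTOY\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "ENST0001.1";\n'
  printf 'chr1\tTOY\ttranscript\t200\t300\t.\t+\t.\tgene_id "G1"; transcript_id "ENST0002.1";\n'
  printf 'chr2\tTOY\texon\t10\t50\t.\t-\t.\tgene_id "G2"; transcript_id "ENST0003.1";\n'
} > "$workdir/toy.gtf"

subset_gtf "$workdir/txid.txt" "$workdir/toy.gtf" > "$workdir/subset.gtf"
cat "$workdir/subset.gtf"
```

Matching on the parsed attribute rather than a plain substring search avoids keeping a record whose ID merely contains a listed ID as a prefix.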
Usage:
sbatch 1.run_ref_prep.sh
# Ensure this completes successfully before running Step 2

Script: 2.run_simulation.sh
This step performs the actual simulation:
- Design: Runs sqanti-sim.py design to create the simulation index.
- Differential Expression: Calls get_diff.R to establish DGE/DTU/DTE ground truth lists.
- Dispersion: Calls simulate_dispersion.R to simulate biological variation across replicates.
- Run Simulation: Executes sqanti-sim.py sim to generate synthetic FASTQ reads for Control and DE conditions.
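
Once the simulation has finished, the generated FASTQ files can be sanity-checked by counting reads per file. The sketch below is one possible check, not part of 2.run_simulation.sh. It assumes uncompressed FASTQ with exactly four lines per record; the helper name `count_fastq_reads` and the file names are hypothetical, since the actual output names are not stated here.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print the number of records in an uncompressed FASTQ file.
# Fails if the line count is not a multiple of four (truncated or malformed file).
count_fastq_reads() {
  awk 'END {
    if (NR % 4 != 0) {
      print "line count is not a multiple of 4" > "/dev/stderr"
      exit 1
    }
    print NR / 4
  }' "$1"
}

# Toy FASTQ files standing in for one Control and one DE replicate.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > "$workdir/control_rep1.fastq"
printf '@r1\nACGT\n+\nIIII\n' > "$workdir/de_rep1.fastq"

# Report file name and read count, tab-separated.
for fq in "$workdir"/*.fastq; do
  printf '%s\t%s\n' "$(basename "$fq")" "$(count_fastq_reads "$fq")"
done
```

Comparing these counts against the read numbers requested for each sample is a quick way to spot a truncated run before any downstream analysis.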
Usage:
sbatch 2.run_simulation.sh

- Input Data: Expects raw data at ../../dataset/GTeX/long-read/sequence_data/
- Reference Files:
  - Transcript Reference FASTA: gencode.v44.transcripts.fa (Required for Step 1)
  - Genome Reference FASTA: GRCh38.primary_assembly.genome.fa (Required for Step 2)
  - Annotation GTF: gencode.v44.annotation.gtf
- Software & Versions:
  - minimap2 (2.26-r1175)
  - salmon (1.10.2)
  - samtools (1.19.2)
  - seqkit (2.5.1)
  - SQANTI3 (5.1.2)
  - SQANTI-SIM (0.2.1)
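
Before submitting either job, it can help to confirm that the command-line tools listed above are available in the current environment. The following is a minimal sketch that only checks presence on PATH, not versions. The helper name `missing_tools` is hypothetical, and SQANTI3 and SQANTI-SIM are left out because how they are installed or invoked on a given system is not stated here.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print the name of every requested tool that is not on PATH,
# one per line. Prints nothing when all tools are found.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Command-line tools from the software list; versions are not verified here.
missing=$(missing_tools minimap2 salmon samtools seqkit)

if [ -n "$missing" ]; then
  printf 'Missing tools:\n%s\n' "$missing" >&2
else
  echo "All required tools found on PATH."
fi
```

The sketch reports missing tools without aborting, so it can be run interactively on a login node; a job script would more likely exit with a non-zero status instead.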