A method for measuring allele-specific telomere length (ATL) and characterizing telomere variant repeat (TVR) sequences from long reads.
If this software has been useful for your work, please cite us at:
Stephens, Z., & Kocher, J. P. (2024). Characterization of telomere variant repeats using long reads enables allele-specific telomere length estimation. BMC Bioinformatics, 25(1), 194.
https://link.springer.com/article/10.1186/s12859-024-05807-5
Telogator2 dependencies can be easily installed via conda:
git clone https://github.com/zstephens/telogator2.git && cd telogator2/
# create & activate conda environment
conda env create -f conda_env_telogator2.yaml
conda activate telogator2
# run test data
python telogator2.py -i test_data/hg002-ont-1p.fa.gz \
-o results/ \
-r ont
Alternatively, Telogator2 can be installed via pip:
python3.12 -m venv venv
source venv/bin/activate
pip install git+https://github.com/zstephens/telogator2.git@v2.2.2
# run test data
telogator2 -i test_data/hg002-ont-1p.fa.gz \
-o results/ \
-r ont \
--minimap2 /path/to/minimap2
An aligner executable must be specified via one of --minimap2, --winnowmap, or --pbmm2.
-i accepts fa, fa.gz, fq, fq.gz, or bam (multiple files can be provided, e.g. -i reads1.fa reads2.fa). For Revio reads sequenced with SMRT Link 13 or later, we advise including both the "hifi" BAM and the "fail" BAM as input.
Sequencing platforms differ in their error profiles, so we recommend running Telogator2 with options chosen for the platform that produced the reads:
PacBio Revio HiFi (30x) - -r hifi -n 4
PacBio Sequel II (10x) - -r hifi -n 3
Nanopore R10 (30x) - -r ont -n 4
Telogator2 may be unable to analyze older Nanopore data, as reads basecalled with Guppy have prohibitively high sequencing error rates in telomere regions.
For large datasets, such as data from enrichment methods described by Karimian et al. or Schmidt et al., higher thresholds may be needed to reduce false positives: -r ont -n 10.
By default Telogator2 is run with 4 processes. Runtime can be greatly reduced by specifying more, e.g. -p 8 or -p 16, based on your system's available CPU resources.
The following are full-sized datasets and may take several hours to run:
HiFi reads (~70x): hg002-telreads_pacbio.fa.gz
ONT reads (~25x): hg002-telreads_ont.fa.gz
The primary output files are:
tlens_by_allele.tsv - allele-specific telomere lengths
all_final_alleles.png - plots of all alleles (TVR + telomere regions)
violin_atl.png - violin plot of ATLs at each chromosome arm
The main results are in tlens_by_allele.tsv, which has the following columns:
chr - anchor chromosome arm
  - subtelomeres that could not be aligned are labeled chrU for 'unmapped'
position - anchor coordinate
ref_samp - the specific T2T reference contig to which the subtelomere was aligned
allele_id - ID number for this specific allele
  - ids ending in i indicate subtelomeres that were aligned to known interstitial telomere regions. These alleles should likely be excluded from subsequent analyses.
TL_p75 - ATL (reports 75th percentile by default)
read_TLs - ATL of each supporting read in the cluster
read_lengths - length of each read in the cluster
read_mapq - mapping quality of each read in the cluster
tvr_len - length of the cluster's TVR region
tvr_consensus - consensus TVR region sequence
supporting_reads - readnames of each read in the cluster
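As an illustration of working with these columns, the sketch below reads a tab-separated table and drops alleles whose allele_id ends in i, following the advice on interstitial telomere regions. It assumes the file has a header row using the column names listed here (a leading # on the header is tolerated); that layout is an assumption, and the three example rows are invented toy values, not real Telogator2 output.

```python
import csv
import io


def load_alleles(handle):
    """Read a tab-separated table with a header row into a list of dicts."""
    reader = csv.reader(handle, delimiter="\t")
    # Assumption: the first row holds column names, possibly prefixed with '#'.
    header = [name.lstrip("#") for name in next(reader)]
    return [dict(zip(header, row)) for row in reader if row]


def is_interstitial(allele):
    """True when the allele id ends in 'i' (aligned to an interstitial region)."""
    return allele["allele_id"].endswith("i")


def is_unmapped(allele):
    """True when the subtelomere could not be aligned (labeled chrU)."""
    return allele["chr"] == "chrU"


# Invented toy rows for demonstration only; a subset of columns is shown.
EXAMPLE = (
    "#chr\tposition\tref_samp\tallele_id\tTL_p75\n"
    "chr1p\t1000\tcontigA\t0\t5200\n"
    "chr2q\t2000\tcontigB\t1i\t800\n"
    "chrU\t0\tcontigC\t2\t3100\n"
)

if __name__ == "__main__":
    alleles = load_alleles(io.StringIO(EXAMPLE))
    kept = [a for a in alleles if not is_interstitial(a)]
    for allele in kept:
        label = "unmapped" if is_unmapped(allele) else allele["chr"]
        print(label, allele["allele_id"], allele["TL_p75"])
```

For a real run, replace the in-memory example with open("results/tlens_by_allele.tsv", newline=""). Whether chrU alleles should also be excluded depends on the analysis; the sketch only flags them.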
The reference sequence used for telomere anchoring currently contains the first and last 500kb of each chromosome from the following T2T assemblies:
T2T-chm13 - https://github.com/marbl/CHM13
T2T-yao - https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA017932
T2T-cn1 - https://github.com/T2T-CN1/CN1
T2T-hg002 - https://github.com/marbl/hg002
T2T-ksa001 - https://github.com/bio-ontology-research-group/KSA001
T2T-i002c - https://github.com/LHG-GG/I002C
More subtelomere contigs may be added as they become available.
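To make "first and last 500kb of each chromosome" concrete, here is a minimal sketch of slicing both ends from a sequence string. It is not the procedure used to build the bundled reference; the chromosome_ends function name and its flank parameter are hypothetical, and FASTA parsing is left out.

```python
def chromosome_ends(sequence, flank=500_000):
    """Return (first `flank` bases, last `flank` bases) of a sequence.

    If the sequence is shorter than `flank`, both ends are the whole sequence.
    """
    if flank <= 0:
        raise ValueError("flank must be a positive number of bases")
    return sequence[:flank], sequence[-flank:]


if __name__ == "__main__":
    # Tiny toy sequence with a 3-base flank so the slices are easy to see.
    p_end, q_end = chromosome_ends("AACCGGTT", flank=3)
    print(p_end, q_end)
```

Python slicing clamps to the sequence bounds, so short contigs need no special handling beyond noting that the two ends overlap.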
Experimental support has been added for some non-human references, e.g. mouse:
python telogator2.py -i input.fa \
-o results/ \
-t source/resources/non-human/telogator-ref-mouse.fa.gz \
Or maize:
python telogator2.py -i test_data/ZMMo17-hifi-7p8p.fa.gz \
-o results/ \
-r hifi \
-t source/resources/non-human/telogator-ref-maize.fa.gz \
-k source/resources/non-human/kmers_maize.tsv \