This pipeline evaluates assemblies by aligning them to a reference, calling variants, quantifying those variants, and, if applicable, comparing them to gold-standard variant lists. The end result is a jupyter R notebook template that can be run to generate plots and a summary statistic table.
- assembly_list - tab-delimited file with the fields
<assembly name> <first fasta> <second fasta> <truth name> <assembly id> <datatype>If an assembly contains only one fasta, the second fasta can be set to an empty file. If no truthset is available,<truth name>should be NONE. Note that the<assembly name>should consist of the sample name followed by a period, followed by arbitrary other text. Sample name should not contain a period. - populations - tab-delimited file connecting sample to population. Fields are:
<sample> <population code> <superpopulation code> <sex> - ref - reference fasta (not gzipped!)
- ref_index - fai index of reference fasta
- truth_list_snp_indel - tab-delimited file with the fields
<truth name> <vcf location> - truth_list_confident_regions - tab-delimited file with the fields
<truth name> <bed location> - truth_list_sv_vcf - tab-delimited file with the fields
<truth name> <vcf location> - gap_file - bed file denoting reference gaps
- str_track - bed file denoting simple repeats
- seg_dup_track - bed file denoting segmental duplications
- jupyter_notebook_template - path to VariantCalling.ipynb
- Contigs from both fasta files (possibly representing two different haplotypes) are combined into a single fasta, with the suffixes
_H1and_H2appended to contig names from the first and second fasta file, respectively. If the second fasta file is empty, the first fasta file is simply taken as-is. - Contigs are aligned to the given reference using
minimap2with theasm5preset. paftools callis used to call variants from thecstags in the minimap alignment.- A custom perl script is used to "genotype" the variant calls:
- If site is covered by 1 contig, call a homozygous variant (1/1)
- If site is covered by 2+ contigs and the variant appears 2+ times, call a homozygous variant (1/1)
- If site is covered by 2+ contigs and different variants appear, call a heterozygous variant (1/2)
- If site is covered by 2+ contigs and only one variant appears, call heterozygous variant (0/1)
- Custom python scripts are used to call SVs based on split reads and classify them by type.
- Indels >= 40bp from step 3 are combined with SVs from step 5 to form a combined set of SVs.
new.ipynbis a jupyter notebook that can be opened and run to produce the final plots/table
happy.sensitivity.png- Gives sensitivity to the snp/indel truthset as determined by hap.pyhet_fates.png- For each variant in the truthset, show whether it was called heterozygous, homozygous alt, or not called (homozygous ref) in the variant callset from the assembly.<assembly name>.no_gaps_coverage_summary.png- Gives the percentage of chr1-22 non-gap reference bases at each coverage 0X-9X, as well as 10X+<assembly name>.no_gaps_coverage.png- Same thing, broken down by chr and including chrX and chrY<assembly name>.ours.nonRep.lengths.txt.sv_lengths.png- Histogram of INS and DEL lengths. Between 0 and 1000 bp the bin size is 10bp. There are also bins for 1001-10000bp, 10001-100000bp, 100001-1000000bp, and 1Mb+.nonrepSvCounts.png- Counts for all SVs that weren't classified as STR or SegDup, broken down by SV typesegDupSvCounts.png- Counts for all SVs that weren't classified as STR but did match the SegDup track with either breakpoint.strSvCounts.png- Counts for all SVs that matched the STR track with either breakpoint.sv.sensitivity.png- Percentage of the SV truthset found in the assembly using a variety of comparison techniques. SVs are stratified by whether either breakpoint matched the STR track, SegDup track, or neither (Non-repetitive)sv.sensitivityCounts.png- Gives the counts of SVs from the truthset detected by each comparison method.truthHetIndelFates.png- For each heterozygous indel in the truthset, show whether it was called heterozygous, homozygous alt, or not called (homozygous ref) using 1bp, 10bp, or 50bp slop.truthHetIndelFatesErrorsOnly.png- Same as above but only shows the errors to give more detailed counts.truthset.repetitive.count.png- Counts the number of unique elements from the STR and SegDup tracks covered by the SV truthset, and determines how many of those were also found in the SV callset.truthset.repetitive.sensitivity.png- Same as above but displayed as sensitivity.stats.tsv- Tab-delimited file with the following summary stats for each assembly:ID: from the input assembly_listName: from the input assembly_listsnv_count: Total SNVs founddeletions: Total small (1-50 bp) deletions foundinsertions: Total small (1-50 bp) insertions foundsv_count: Total SVs foundsnv_detection_sensitivity: SNV sensitivity ignoring genotype errors (TRUTH.TP + FP.gt)/TRUTH.TOTAL in hap.py metricsindel_detection_sensitivity: Indel sensitivity ignoring genotype errors (TRUTH.TP + FP.gt)/TRUTH.TOTAL in hap.py metricssnv_detection_specificity: SNV specificity ignoring genotype errors (QUERY.TP + FP.gt)/QUERY.TOTAL in hap.py metricsindel_detection_specificity: Indel specificity ignoring genotype errors (QUERY.TP + FP.gt)/QUERY.TOTAL in hap.py metricssv_sensitivity: SV sensitivity usingbedtools pairtopairwith 50 bp slop at each breakpointsv_specificity: SV specificity usingbedtools pairtopairwith 50 bp slot at each breakpointhet_snp_sensitivity: Sensitivity for heterozygous snps onlyhet_indel_sensitivity: Sensitivity for heterozygous indels onlyhom_snp_sensitivity: Sensitivity for homozygous snps onlyhom_indel_sensitivity: Sensitivity for homozygous indels onlyDEL: Total deletions detected from split-read alignmentsINS: Total insertions detected from split-read alignmentsDUP: Total duplications detected from split-read alignmentsINV: Total inversions detected from split-read alignmentsTRANS: Total translocations detected from split-read alignmentssv_nonrep_count: Total SVs that don't overlap STR or SEGDUP regionssv_str_count: Total SVs that overlap STR regionssv_segdup_count: Total SVs that overlap SEGDUP but not STR regions