JosephLalli/LiftoverIndel

LiftoverIndel

A tool to lift over variants between reference assemblies in an indel-aware manner. It identifies variants that overlap a region where the target reference has an indel relative to the origin reference, and incorporates that indel into the lifted variant. When a variant lands near (but not directly overlapping) a reference difference between builds, the tool performs haplotype realignment: it reconstructs the source haplotype in a local window and uses global alignment against the target reference to find the correct variant representation.
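
To make the haplotype realignment idea concrete, here is a minimal sketch. It is not the tool's implementation: the function names `apply_variant` and `diff_against_target` are hypothetical, the sequences are toy data, and trimming the shared suffix and prefix stands in for the global alignment the tool actually uses. It assumes the source and target windows cover the same locus.

```python
def apply_variant(window, window_start, pos, ref, alt):
    """Rebuild the source haplotype: replace REF with ALT inside a source window.

    `window_start` and `pos` are 1-based coordinates on the source assembly.
    """
    offset = pos - window_start
    if offset < 0 or window[offset:offset + len(ref)] != ref:
        raise ValueError("REF allele does not match the source window")
    return window[:offset] + alt + window[offset + len(ref):]


def diff_against_target(haplotype, target, target_start):
    """Describe the haplotype as one (pos, ref, alt) variant on the target window.

    Trims the shared suffix first, then the shared prefix, so indels are
    placed as far left as possible. Returns None when the haplotype already
    matches the target reference.
    """
    if haplotype == target:
        return None
    limit = min(len(haplotype), len(target))
    suffix = 0
    while suffix < limit and haplotype[-1 - suffix] == target[-1 - suffix]:
        suffix += 1
    prefix = 0
    while prefix < limit - suffix and haplotype[prefix] == target[prefix]:
        prefix += 1
    ref = target[prefix:len(target) - suffix]
    alt = haplotype[prefix:len(haplotype) - suffix]
    if not ref or not alt:
        # VCF indels need an anchor base shared by REF and ALT.
        if prefix == 0:
            raise ValueError("window too small to anchor the indel")
        prefix -= 1
        ref = target[prefix] + ref
        alt = target[prefix] + alt
    return target_start + prefix, ref, alt


source_window = "ACGTACGT"  # toy source reference, starting at position 100
target_window = "ACGTCGT"   # toy target reference at position 200, lacking one A
haplotype = apply_variant(source_window, 100, 103, "T", "G")
print(diff_against_target(haplotype, target_window, 200))  # (203, 'T', 'GA')
```

In the toy example, the source SNV `T>G` sits next to a base that is missing from the target, so on the target it has to be written as `T>GA`: the nearby reference difference is folded into the lifted variant.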

Under active development, but should work for most CHM13/GRCh38 reference liftovers.

Please note:

  • Input variants must be biallelic.
  • Output variants should be left-aligned and sorted.
  • Both BCF and VCF files can be read, but BCF input makes the liftover run much faster. I encourage converting all files, including the reference differences VCF, to BCF format for this reason.
  • Output format should be autodetected from the provided output file extension.
  • Only genotypes (GT) are lifted over. Other INFO/FORMAT fields are explicitly not touched during liftover, so while a few may still be correct, most non-GT fields will be wrong. If lifting these fields is important to you, please post an issue and I will do my best to add this feature.
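
As a quick pre-flight check for the biallelic requirement, a sketch like the following can flag offending records in an uncompressed VCF. The helper name `multiallelic_records` is hypothetical and not part of this repository; it only inspects the ALT column of plain-text VCF lines.

```python
def multiallelic_records(vcf_lines):
    """Yield (chrom, pos, alt) for records whose ALT column lists several alleles."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, alt = fields[0], int(fields[1]), fields[4]
        if "," in alt:  # VCF separates multiple ALT alleles with commas
            yield chrom, pos, alt


example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr22\t100\t.\tA\tG\t.\t.\t.\n",
    "chr22\t200\t.\tC\tT,CA\t.\t.\t.\n",
]
for record in multiallelic_records(example):
    print(record)
```

Any record this reports needs to be split into biallelic records (for instance with a normalization tool such as bcftools norm) before it is passed to the liftover.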

Requirements

pip install -r requirements.txt

Dependencies:

  • pyliftover
  • tqdm
  • intervaltree
  • cyvcf2
  • numpy
  • biopython

Usage

Once requirements are installed:

python3 liftover_indels.py \
    --input-vcf vcf_to_lift.bcf \
    --ref-diffs-vcf vcf_of_assembly_differences.bcf \
    --output-vcf lifted_over_output.bcf \
    --chain chainfile.chain \
    --target-fasta target_fasta.fasta \
    [options]

For example, to lift only chr22 with debug logging and 4 reader threads:

python3 liftover_indels.py \
    --input-vcf input.bcf \
    --ref-diffs-vcf chm13v2-grch38.sort.bcf \
    --output-vcf output.bcf \
    --chain chm13v2-grch38.chain \
    --target-fasta GRCh38.fasta \
    --chrom chr22 --debug --threads 4

  • --ref-diffs-vcf must be in target assembly coordinates.
  • --output-vcf can be set to /dev/stdout or - for piping.
  • Run python3 liftover_indels.py --help for full option details.
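
If you drive the liftover from a Python pipeline, the command line above can be assembled programmatically. This is only a sketch: `build_liftover_command` is a hypothetical helper, not something shipped with the tool, and it uses only the flags documented in this README.

```python
import shlex


def build_liftover_command(input_vcf, ref_diffs_vcf, output_vcf, chain,
                           target_fasta, chroms=(), threads=None, debug=False):
    """Return the liftover_indels.py invocation as an argument list."""
    cmd = [
        "python3", "liftover_indels.py",
        "--input-vcf", input_vcf,
        "--ref-diffs-vcf", ref_diffs_vcf,
        "--output-vcf", output_vcf,
        "--chain", chain,
        "--target-fasta", target_fasta,
    ]
    if chroms:
        cmd += ["--chrom", *chroms]  # --chrom accepts one or more contigs
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if debug:
        cmd.append("--debug")
    return cmd


cmd = build_liftover_command(
    "input.bcf", "chm13v2-grch38.sort.bcf", "output.bcf",
    "chm13v2-grch38.chain", "GRCh38.fasta",
    chroms=["chr22"], threads=4, debug=True,
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing an argument list rather than a shell string avoids quoting problems when file paths contain spaces.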

Options

| Flag | Default | Description |
|------|---------|-------------|
| --chrom CHR [CHR ...] | all | Restrict liftover to specific contigs (e.g. --chrom chr1 chr22) |
| --no-realign | off | Disable haplotype realignment near reference differences |
| --realign-distance | 50 | Max distance (bp) to search for nearby ref diffs during realignment |
| --realign-flank | 20 | Flanking bases added to each side of the realignment window |
| --realign-max-window | 200 | Maximum total realignment window size (bp) |
| --threads | 2 | Threads for VCF/BCF reading via cyvcf2 |
| --debug | off | Enable verbose debug logging to stderr |
| --quiet | off | Suppress progress bars |
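
The three realignment options interact, so here is one plausible reading of how they could combine. This is an assumption for illustration, not the tool's actual logic: the function name `realignment_window`, the 1-based inclusive coordinates, and the exact way distance and window size are measured are all hypothetical.

```python
def realignment_window(variant_start, variant_end, diff_start, diff_end,
                       distance=50, flank=20, max_window=200):
    """Return a (start, end) realignment window, or None if realignment is skipped.

    Coordinates are 1-based and inclusive. Defaults mirror --realign-distance,
    --realign-flank and --realign-max-window.
    """
    # Gap between the variant and the reference difference (0 if they overlap).
    gap = max(0, max(variant_start, diff_start) - min(variant_end, diff_end))
    if gap > distance:
        return None  # the reference difference is too far away to matter
    start = max(1, min(variant_start, diff_start) - flank)
    end = max(variant_end, diff_end) + flank
    if end - start + 1 > max_window:
        return None  # window would exceed the configured maximum
    return start, end


print(realignment_window(1000, 1000, 1030, 1032))  # (980, 1052)
print(realignment_window(1000, 1000, 1100, 1100))  # None
```

Under this reading, raising --realign-distance lets more variants be realigned, while --realign-max-window bounds the cost of each alignment.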

Output Files

In addition to the main lifted output, three sidecar files are written (using the output path as a base name):

  • <base>.unliftable.bcf -- variants that could not be lifted because they lacked start and/or end coordinates in the target assembly.
  • <base>.multiple_overlaps.bcf -- variants that overlapped multiple reference differences and could not be unambiguously lifted.
  • <base>.ref_seq_mismatches.bcf -- variants that failed post-liftover reference sequence validation.
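
For downstream scripts that need to locate the sidecar files, the naming pattern above can be sketched as follows. How the tool derives `<base>` from the output path is an assumption here (stripping a trailing `.vcf`, `.vcf.gz` or `.bcf`); check the files actually written by your run. `sidecar_paths` is a hypothetical helper.

```python
SIDECAR_NAMES = ("unliftable", "multiple_overlaps", "ref_seq_mismatches")
KNOWN_EXTENSIONS = (".vcf.gz", ".vcf", ".bcf")


def sidecar_paths(output_path):
    """Map each sidecar name to its expected path for a given output file."""
    base = output_path
    for ext in KNOWN_EXTENSIONS:
        if output_path.endswith(ext):
            base = output_path[:-len(ext)]  # assumed: extension is dropped
            break
    return {name: f"{base}.{name}.bcf" for name in SIDECAR_NAMES}


for name, path in sidecar_paths("results/lifted.bcf").items():
    print(name, path)
```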

INFO Tags

The following INFO tags are added to each successfully lifted variant:

| Tag | Description |
|-----|-------------|
| SRC_CHROM | Original contig before liftover |
| SRC_POS | Original position before liftover |
| Original_REF | Original REF allele before liftover |
| Original_ALT | Original ALT allele before liftover |
| SRC_REF_ALT | Original REF,ALT combined string |
| Original_ID | Original variant ID before liftover |
| Flipped_during_liftover | Set to Flipped when REF/ALT were swapped (genotypes adjusted accordingly) |
| Realigned_during_liftover | Set to Realigned when haplotype realignment was used to resolve the variant |
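
These tags make it possible to trace a lifted record back to its source site. The sketch below parses a raw INFO string with plain Python; the helper names and the example INFO string are illustrative only, and it assumes the tags are written as ordinary `KEY=VALUE` entries.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style entries map to True."""
    tags = {}
    for entry in info.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        tags[key] = value if sep else True
    return tags


def original_site(info):
    """Recover the pre-liftover site and liftover status from the INFO tags."""
    tags = parse_info(info)
    return {
        "chrom": tags["SRC_CHROM"],
        "pos": int(tags["SRC_POS"]),
        "ref": tags["Original_REF"],
        "alt": tags["Original_ALT"],
        "flipped": tags.get("Flipped_during_liftover") == "Flipped",
        "realigned": tags.get("Realigned_during_liftover") == "Realigned",
    }


# Made-up INFO string for illustration
info = ("SRC_CHROM=chr22;SRC_POS=10500;Original_REF=A;Original_ALT=G;"
        "Flipped_during_liftover=Flipped")
print(original_site(info))
```

Records marked as flipped had REF and ALT swapped, so allele-specific annotations carried over from the source file should be interpreted against the original alleles.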

Assembly Differences VCF

You can generate the assembly differences VCF yourself or, for CHM13/GRCh38 liftovers, obtain it from the HPRC AWS bucket:

  • GRCh38-CHM13 vcf.gz file (GRCh38 coordinates). Use when lifting CHM13 -> GRCh38 coordinates.
  • CHM13-GRCh38 vcf.gz file (CHM13 coordinates). Use when lifting GRCh38 -> CHM13 coordinates.
