Tool to lift over variants between reference assemblies in an indel-aware manner. Importantly, this tool identifies variants that overlap a region of the target reference that has an indel relative to the origin reference and incorporates that indel into the lifted variant. When a variant lands near (but not directly overlapping) a reference difference between builds, the tool performs haplotype realignment -- reconstructing the source haplotype in a local window and using global alignment against the target reference to find the correct variant representation.
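To make the haplotype realignment idea concrete, here is a minimal pure-Python sketch. It is not the tool's actual code: the function names `apply_variant` and `global_align`, the toy sequences, and the scoring scheme (match +1, mismatch -1, gap -2) are all assumptions chosen for illustration. It builds the source haplotype for one variant inside a small window, then globally aligns it against a target-reference window that differs by a one-base deletion.

```python
def apply_variant(window, offset, ref, alt):
    """Return the haplotype made by replacing REF with ALT at a 0-based offset."""
    if window[offset:offset + len(ref)] != ref:
        raise ValueError("REF allele does not match the source window")
    return window[:offset] + alt + window[offset + len(ref):]


def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns both sequences with '-' gaps."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy data (hypothetical): the target window lacks one T relative to the source.
source_window = "ACGTTGCA"
target_window = "ACGTGCA"
haplotype = apply_variant(source_window, 3, "T", "C")  # SNV T>C at offset 3
aligned_target, aligned_hap = global_align(target_window, haplotype)
```

Columns where `aligned_target` and `aligned_hap` disagree (mismatches or gaps) are the differences that would have to be expressed as the lifted variant in target coordinates.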
Under active development, but should work for most CHM13/GRCh38 reference liftovers.
- Input variants must be biallelic.
- Output variants should be left-aligned and sorted.
- Both BCF and VCF files can be read, but BCF input makes the liftover run much faster. I encourage converting all files, even the reference liftover VCFs, to BCF format for this reason.
- Output format should be autodetected from the provided output file extension.
- Only genotypes (GT) are lifted over. Other INFO/FORMAT fields are explicitly not touched during liftover, so while some may still be correct, many or most non-GT fields will be wrong. If lifting these fields is important to you, please post an issue and I will do my best to add this feature.
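As an illustration of what "left-aligned" means in the notes above, the following is a minimal sketch of the standard trim-and-shift normalization for a single biallelic variant. It is not taken from this tool; the function name `left_align` and the 0-based coordinate convention are assumptions made for the example.

```python
def left_align(ref_seq, pos, ref, alt):
    """Left-align and trim one biallelic variant.

    ref_seq is the reference sequence as a string and pos is the 0-based
    index of the first REF base (hypothetical convention for this sketch).
    """
    changed = True
    while changed:
        changed = False
        # Drop a trailing base shared by both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        # At the very start of the sequence the allele is left empty.
        if (not ref or not alt) and pos > 0:
            pos -= 1
            ref = ref_seq[pos] + ref
            alt = ref_seq[pos] + alt
            changed = True
    # Trim shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# A CA deletion inside a CA repeat shifts to the leftmost equivalent position.
shifted = left_align("GCACACAG", 4, "ACA", "A")
```

In this toy case the deletion written at index 4 is moved to index 0 as `GCA` > `G`, which is the leftmost equivalent representation within the repeat.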
pip install -r requirements.txt
Dependencies:
- pyliftover
- tqdm
- intervaltree
- cyvcf2
- numpy
- biopython
Once requirements are installed:
python3 liftover_indels.py \
    --input-vcf vcf_to_lift.bcf \
    --ref-diffs-vcf vcf_of_assembly_differences.bcf \
    --output-vcf lifted_over_output.bcf \
    --chain chainfile.chain \
    --target-fasta target_fasta.fasta \
    [options]
For example, to lift only chr22 with debug logging and 4 reader threads:
python3 liftover_indels.py \
    --input-vcf input.bcf \
    --ref-diffs-vcf chm13v2-grch38.sort.bcf \
    --output-vcf output.bcf \
    --chain chm13v2-grch38.chain \
    --target-fasta GRCh38.fasta \
    --chrom chr22 --debug --threads 4
- `--ref-diffs-vcf` must be in target assembly coordinates.
- `--output-vcf` can be set to `/dev/stdout` or `-` for piping.
- Run `python3 liftover_indels.py --help` for full option details.
| Flag | Default | Description |
|---|---|---|
| `--chrom CHR [CHR ...]` | all | Restrict liftover to specific contigs (e.g. `--chrom chr1 chr22`) |
| `--no-realign` | off | Disable haplotype realignment near reference differences |
| `--realign-distance` | 50 | Max distance (bp) to search for nearby ref diffs during realignment |
| `--realign-flank` | 20 | Flanking bases added to each side of the realignment window |
| `--realign-max-window` | 200 | Maximum total realignment window size (bp) |
| `--threads` | 2 | Threads for VCF/BCF reading via cyvcf2 |
| `--debug` | off | Enable verbose debug logging to stderr |
| `--quiet` | off | Suppress progress bars |
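The three realignment options interact, and a small sketch may help when tuning them. The code below is one plausible reading of their descriptions, not the tool's implementation: the function name `realign_window`, the 0-based half-open coordinates, the exact boundary handling, and the use of a plain list instead of an interval tree are all assumptions.

```python
def realign_window(var_start, var_end, ref_diffs,
                   distance=50, flank=20, max_window=200):
    """Return the (start, end) realignment window, or None if none applies.

    var_start/var_end and each (start, end) in ref_diffs are 0-based,
    half-open target coordinates (hypothetical convention for this sketch).
    """
    # Reference differences within `distance` bp of the variant.
    nearby = [(s, e) for s, e in ref_diffs
              if s < var_end + distance and e > var_start - distance]
    if not nearby:
        return None  # nothing close enough, so no realignment is needed

    # Cover the variant and every nearby difference, then pad with flanks.
    start = min(var_start, min(s for s, _ in nearby)) - flank
    end = max(var_end, max(e for _, e in nearby)) + flank
    start = max(start, 0)

    # Refuse windows larger than the configured maximum.
    if end - start > max_window:
        return None
    return start, end


window = realign_window(1000, 1001, [(1030, 1035), (5000, 5002)])
```

With the default values, a variant at 1000 and a reference difference at 1030-1035 give the window `(980, 1055)`; a difference thousands of bases away is ignored, and a difference that would stretch the window past 200 bp yields `None` under this reading.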
In addition to the main lifted output, three sidecar files are written (using the output path as a base name):
- `<base>.unliftable.bcf` -- variants that could not be lifted because they lacked start and/or end coordinates in the target assembly.
- `<base>.multiple_overlaps.bcf` -- variants that overlapped multiple reference differences and could not be unambiguously lifted.
- `<base>.ref_seq_mismatches.bcf` -- variants that failed post-liftover reference sequence validation.
The following INFO tags are added to each successfully lifted variant:
| Tag | Description |
|---|---|
| `SRC_CHROM` | Original contig before liftover |
| `SRC_POS` | Original position before liftover |
| `Original_REF` | Original REF allele before liftover |
| `Original_ALT` | Original ALT allele before liftover |
| `SRC_REF_ALT` | Original REF,ALT combined string |
| `Original_ID` | Original variant ID before liftover |
| `Flipped_during_liftover` | Set to `Flipped` when REF/ALT were swapped (genotypes adjusted accordingly) |
| `Realigned_during_liftover` | Set to `Realigned` when haplotype realignment was used to resolve the variant |
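For readers wondering what "genotypes adjusted accordingly" means for a flipped variant, here is a minimal sketch. It assumes biallelic genotypes stored as lists of allele indices with `-1` for a missing allele; the function name `flip_genotype` and this encoding are illustrative assumptions, not the tool's internals.

```python
def flip_genotype(alleles):
    """Swap allele indices 0 and 1; leave missing (-1) calls untouched."""
    return [1 - a if a in (0, 1) else a for a in alleles]


# After a REF/ALT swap, a 0/1 het stays het and a 1/1 hom-alt becomes 0/0.
het = flip_genotype([0, 1])
hom = flip_genotype([1, 1])
partial = flip_genotype([-1, 0])
```

Because input variants must be biallelic, only the indices 0 and 1 need to be exchanged; any other value is passed through unchanged in this sketch.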
The assembly differences VCF can either be generated yourself or, for CHM13/GRCh38 liftovers, obtained from the HPRC AWS bucket:
- GRCh38-CHM13 vcf.gz file (GRCh38 coordinates). Use when lifting CHM13 -> GRCh38 coordinates.
- CHM13-GRCh38 vcf.gz file (CHM13 coordinates). Use when lifting GRCh38 -> CHM13 coordinates.