Annotation of VCF file for import into VarFish (through Web UI).
- ExAC r1.0
- gnomAD exomes r2.1
- gnomAD genomes r2.1
- ClinVar
- Thousand Genomes
- hgmd_public from ENSEMBL r75
VarFish Annotator uses HTSJDK for reading variant call format (VCF) files. HTSJDK supports reading VCF v4.3, so the output of any tool that produces well-formed VCF can be read. VCF itself specifies only a few required fields, and callers may use fields in slightly different ways. We thus document below which fields are used and interpreted by VarFish Annotator to prepare the files for VarFish.
The following fields are considered:
- CHROM: Written out in a normalized way, depending on the genome build. That is, the chr prefix will be present for GRCh38 and absent for GRCh37.
- POS: The 1-based chromosomal position is written out.
- REF: The reference allele is written out.
- ALT: For each alternative allele and each gene, a record is written out. Asterisk alleles are ignored (see Broad's GATK documentation on this).
- FORMAT and per-sample fields:
  - GT: The genotype is written out with phasing information.
  - AD: The allelic depth of the alternative allele of the current record is written out.
  - DP: The total depth is written out so that the allelic balance of the alternative allele of the current record can be computed.
  - GQ: The genotype quality is written out.
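The normalization and filtering rules above can be sketched in a few lines of Python. This is only an illustration, not VarFish Annotator's actual (Java) implementation. The function names `normalize_chrom`, `usable_alts`, and `allelic_balance` are hypothetical, the build names are assumed to be the strings `GRCh37` and `GRCh38`, allelic balance is assumed to be AD of the alternative allele divided by DP, and special contigs (for example the MT/chrM naming difference) are not handled.

```python
def normalize_chrom(chrom: str, build: str) -> str:
    """Return the contig name with the prefix convention of the given build."""
    # Strip an existing "chr" prefix so both input conventions are accepted.
    bare = chrom[3:] if chrom.startswith("chr") else chrom
    if build == "GRCh38":
        return "chr" + bare
    if build == "GRCh37":
        return bare
    raise ValueError(f"unsupported genome build: {build}")


def usable_alts(alts):
    """Drop asterisk alleles and keep the remaining ALT alleles in order."""
    return [alt for alt in alts if alt != "*"]


def allelic_balance(ad_alt: int, dp: int):
    """Fraction of reads supporting the alternative allele, or None if DP is 0."""
    if dp <= 0:
        return None
    return ad_alt / dp
```

For example, `normalize_chrom("chr1", "GRCh37")` yields `1`, and a record with ALT alleles `A,*` contributes only the `A` allele.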
Supported Callers and Caller Annotation
The following variant callers are explicitly supported.
- Delly 2 (SVs)
- Dragen CNV caller
- Dragen SV caller
- Manta
- GATK gCNV
- XHMM (deprecated)
In the other cases, VarFish annotator will fall back to a "generic" import where only the per-sample fields GT, FT, and GQ are interpreted.
Your caller should also write out INFO/END, INFO/SVTYPE, and INFO/SVLEN as defined by VCF 4.2.
VarFish Annotator will look at the field INFO/SVMETHOD to annotate calls with the caller where the call originated from.
If this field is empty then you should define --default-sv-method so you get appropriately labeled output.
If you have any problem with your data then please tell us by opening a GitHub issue.
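The fallback from INFO/SVMETHOD to the value given with --default-sv-method can be pictured as follows. This is a minimal sketch with a hypothetical function name (`resolve_sv_method`) and a plain dict standing in for the parsed INFO column. What VarFish Annotator does when neither value is available is not stated here, so the sketch simply raises an error in that case.

```python
def resolve_sv_method(info: dict, default_sv_method=None) -> str:
    """Pick the caller label for one SV record.

    `info` is a dict of INFO key/value pairs; `default_sv_method` plays the
    role of the value passed via --default-sv-method.
    """
    method = info.get("SVMETHOD")
    if method:  # present and non-empty
        return method
    if default_sv_method:
        return default_sv_method
    # Assumption for this sketch only: treat the missing label as an error.
    raise ValueError("no INFO/SVMETHOD and no default SV method given")
```

The method labels used in the tests and in any call to this function are made-up example strings.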
Interpretation of top-level and INFO VCF fields
The following fields are considered:
- CHROM: Written out in a normalized way, depending on the genome build. That is, the chr prefix will be present for GRCh38 and absent for GRCh37.
- POS: The 1-based chromosomal position is written out.
- REF: The reference allele is written out.
- ALT: For each alternative allele and each gene, a record is written out. Asterisk alleles are ignored (see Broad's GATK documentation on this).
- INFO/CHR2: Second chromosome of the SV, if not on the same chromosome.
- INFO/END: End position of the SV.
- INFO/CT: Paired-end signature induced connection type.
- INFO/CIPOS: Confidence interval around the start point of the SV.
- INFO/CIEND: Confidence interval around the end point of the SV.
- INFO/SVMETHOD: The name of the caller that was used.
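To make the role of these fields concrete, the following sketch derives the two breakpoint ranges of an SV from POS, INFO/END, INFO/CIPOS, and INFO/CIEND. It is an illustration under stated assumptions rather than the tool's own logic: the function name `sv_intervals` is hypothetical, INFO values are assumed to be already parsed into Python ints and tuples, CIPOS/CIEND are treated as offsets relative to POS/END as in the VCF specification, and a missing CHR2 is taken to mean that the SV ends on the same chromosome.

```python
def sv_intervals(chrom: str, pos: int, info: dict) -> dict:
    """Return the chromosomes and confidence ranges of both SV breakpoints."""
    chr2 = info.get("CHR2", chrom)  # same chromosome unless CHR2 is given
    end = int(info["END"])
    cipos = info.get("CIPOS", (0, 0))  # offsets around POS
    ciend = info.get("CIEND", (0, 0))  # offsets around END
    return {
        "chrom": chrom,
        "chr2": chr2,
        "start_range": (pos + cipos[0], pos + cipos[1]),
        "end_range": (end + ciend[0], end + ciend[1]),
    }
```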
Interpretation of FORMAT and per sample fields
- Common
  - GT: Genotype, written as gt
  - FT: Per-genotype filter values, written as ft
  - GQ: Phred-scaled genotype quality, written as gq
- Delly2
  - DR: Reference pairs, written as pec = DR + DV
  - DV: Variant pairs, written as pev
  - RR: Reference junction count, written as src = RR + RV
  - RV: Variant junction count, written as srv
  - RDCN: Copy number estimate, written as cn
- Dragen CNV
  - SM: Average normalized coverage, written as anc
  - BC: Bucket count, written as point count pc
  - PE: Discordant read count at start/end, written as pev = PE[0] + PE[1]
- Dragen SV
  - PR: Paired reads of reference and variant, written as pec = PR[0] + PR[1] and pev = PR[1]
  - SR: Split reads of reference and variant, written as src = SR[0] + SR[1] and srv = SR[1]
- For GATK gCNV
  - CN: Integer copy number, written as cn
  - NP: Number of points in segment, written as np
- Manta (equivalent to Dragen SV)
- For XHMM
  - RD: Average normalized coverage, written as an
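The per-caller mappings above boil down to simple arithmetic on the FORMAT values. The sketch below shows the Delly2 and Dragen SV/Manta cases as listed. The function names are hypothetical, the FORMAT values are assumed to be already parsed into ints and tuples, and treating a missing PR or SR field as (0, 0) is an assumption of this sketch, not documented behavior.

```python
def delly2_counts(fmt: dict) -> dict:
    """Map Delly2 FORMAT fields to pec/pev/src/srv/cn."""
    return {
        "pec": fmt["DR"] + fmt["DV"],  # paired-end coverage
        "pev": fmt["DV"],              # paired-end variant support
        "src": fmt["RR"] + fmt["RV"],  # split-read coverage
        "srv": fmt["RV"],              # split-read variant support
        "cn": fmt["RDCN"],             # copy number estimate
    }


def dragen_sv_counts(fmt: dict) -> dict:
    """Map Dragen SV / Manta FORMAT fields (PR, SR) to pec/pev/src/srv."""
    pr = fmt.get("PR", (0, 0))  # (reference, variant) paired reads
    sr = fmt.get("SR", (0, 0))  # (reference, variant) split reads
    return {
        "pec": pr[0] + pr[1],
        "pev": pr[1],
        "src": sr[0] + sr[1],
        "srv": sr[1],
    }
```

For a Delly2 genotype with DR=10, DV=4, RR=7, RV=3, this gives pec=14, pev=4, src=10, and srv=3.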
The following will create varfish-annotator-db-1907.h2.db and fill it.
# DOWNLOAD=path/to/varfish-db-downloader
# ANNOTATOR_VERSION=0.9
# ANNOTATOR_DATA_RELEASE=1907
# java -jar varfish-annotator-cli-$ANNOTATOR_VERSION-SNAPSHOT.jar \
init-db \
--db-release-info "varfish-annotator:v$ANNOTATOR_VERSION" \
--db-release-info "varfish-annotator-db:r$ANNOTATOR_DATA_RELEASE" \
\
--ref-path /fast/projects/cubit/18.12/static_data/reference/GRCh37/hs37d5/hs37d5.fa \
\
--db-release-info "clinvar:2019-02-20" \
--clinvar-path $DOWNLOAD/GRCh37/clinvar/latest/clinvar_tsv_main/output/clinvar_allele_trait_pairs.single.b37.tsv \
--clinvar-path $DOWNLOAD/GRCh37/clinvar/latest/clinvar_tsv_main/output/clinvar_allele_trait_pairs.multi.b37.tsv \
\
--db-path ./varfish-annotator-db-$ANNOTATOR_DATA_RELEASE \
\
--db-release-info "exac:r1.0" \
--exac-path $DOWNLOAD/GRCh37/ExAC/r1/download/ExAC.r1.sites.vep.vcf.gz \
\
--db-release-info "gnomad_exomes:r2.1" \
$(for path in $DOWNLOAD/GRCh37/gnomAD_exomes/r2.1/download/gnomad.exomes.r2.1.sites.chr*.normalized.vcf.bgz; do \
echo --gnomad-exomes-path $path; \
done) \
\
--db-release-info "gnomad_genomes:r2.1" \
$(for path in $DOWNLOAD/GRCh37/gnomAD_genomes/r2.1/download/gnomad.genomes.r2.1.sites.chr*.normalized.vcf.bgz; do \
echo --gnomad-genomes-path $path; \
done) \
\
--db-release-info "thousand_genomes:v3.20101123" \
$(for path in $DOWNLOAD/GRCh37/thousand_genomes/phase3/ALL.chr*.phase3_shapeit2_mvncall_integrated_v5a.20130502.sites.vcf.gz; do \
echo --thousand-genomes-path $path; \
done) \
\
--db-release-info "hgmd_public:ensembl_r75" \
--hgmd-public $DOWNLOAD/GRCh37/hgmd_public/ensembl_r75/HgmdPublicLocus.tsv
# mvn com.coveo:fmt-maven-plugin:format -Dverbose=true
The folder /tests contains some data sets that are appropriate for system (aka "end-to-end") tests of the software.
hg19-chr22 -- This folder contains examples for annotating GATK HC and Delly2 calls on the first 20MB of chr22. Only the variants overlapping with ADA2 and GAB4 are used.
You can build the data sets with the build.sh script that is available in each folder.
This script also serves for documenting the test data's provenance.
The Jannovar software must be available as jannovar (e.g., through bioconda) on your PATH and you will need samtools.
The tests use junit5-system-exit for detecting System.exit() calls.
In JDK 18 you have to use the -Djava.security.manager=allow flag.
This is tracked in issue tginsberg/junit5-system-exit#10.
There is an issue with removing temporary directories on Windows.
Apparently, HTSJDK does not properly close files.
Set -Djunit.jupiter.tempdir.cleanup.mode.default=NEVER to work around this issue.