ClinVar-first small-variant triage and reporting tool for research-oriented review workflows.
The workbench accepts a VCF, matches variants against a local ClinVar snapshot, highlights conflicting interpretations, optionally enriches findings with PharmGKB, ranks the review queue, and emits analyst-friendly HTML and machine-readable outputs.
VCFs are compact and machine-friendly, but they are not ideal review artifacts. Analysts often need to answer a narrower question first:
- Which variants matched a known ClinVar record?
- Which findings have conflicting interpretations?
- Which records should be reviewed first?
- Which variants may have pharmacogenomics context worth surfacing?
This repository focuses on that gap. It is not a full annotation platform or a clinical interpretation engine. It is a reproducible, inspectable triage workbench for small-variant review.
- reads `.vcf` and `.vcf.gz` inputs
- normalizes one record per alternate allele
- matches variants to a local ClinVar snapshot using assembly-aware coordinate and allele keys
- attaches conflict and submission context when available
- optionally enriches variants with PharmGKB gene, variant, clinical annotation, and guideline data
- ranks findings with transparent heuristics
- writes HTML, Markdown, JSON, CSV, and run metadata outputs
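The "one record per alternate allele" normalization can be pictured with a small sketch. This is illustrative only, not the repository's actual parser in `vcf_parser.py`; the function name and record layout are invented:

```python
# Illustrative sketch: expand a multi-allelic VCF data line into one
# normalized record per alternate allele. NOT the repository's actual parser.
def split_alt_alleles(vcf_line):
    chrom, pos, _vid, ref, alts = vcf_line.rstrip("\n").split("\t")[:5]
    return [
        {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}
        for alt in alts.split(",")
    ]

# A multi-allelic line with ALT=A,C yields two normalized records.
records = split_alt_alleles("17\t43045711\trs1\tG\tA,C\t50\tPASS")
print(len(records))  # 2
```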
This project is:
- a research triage workbench
- a reproducible annotation and reporting tool
This project is not:
- a clinical decision support system
- an ACMG classifier
- a star-allele caller
- a treatment recommendation engine
```
VCF / VCF.GZ
     |
     v
VCF parser
     |
     v
Normalized InputVariant records
     |
     v
ClinVar exact-match index
     |
     +--> conflict attachment
     |
     +--> submission evidence attachment
     |
     v
AnnotatedVariant records
     |
     +--> optional PharmGKB enrichment
     |
     v
RankedVariant records
     |
     +--> HTML report
     +--> prioritized_variants.json
     +--> annotated_variants.csv
     +--> summary.json
     `--> run_metadata.json
```
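The exact-match step in the diagram above amounts to a dictionary lookup on an assembly-aware coordinate-and-allele key. The sketch below is a hedged approximation; the actual key strategy in `src/clinvar_index.py` may differ, and the record value is a placeholder:

```python
# Sketch of an assembly-aware coordinate-and-allele key for exact matching.
# The real key strategy in src/clinvar_index.py may differ from this.
def clinvar_match_key(assembly, chrom, pos, ref, alt):
    chrom = str(chrom).removeprefix("chr")   # tolerate "chr17" vs "17"
    return (assembly, chrom, int(pos), ref.upper(), alt.upper())

# Toy index with a single placeholder record value.
index = {clinvar_match_key("GRCh38", "17", 43045711, "G", "A"): "example-record"}

# The lookup tolerates chr-prefix and case differences in the query.
hit = index.get(clinvar_match_key("GRCh38", "chr17", "43045711", "g", "a"))
print(hit)  # example-record
```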
variant-review-workbench/
|-- src/
| |-- annotator.py
| |-- cli.py
| |-- clinvar_index.py
| |-- models.py
| |-- pgx_enrichment.py
| |-- ranker.py
| |-- report_builder.py
| `-- vcf_parser.py
|-- templates/
| `-- report.html.j2
|-- data/
| |-- clinvar/
| |-- pharmgkb/
| |-- references/
| `-- demo.vcf
|-- tests/
|-- README.md
`-- pyproject.toml
- input VCF or VCF.GZ
- ClinVar `variant_summary.txt.gz`
- reference assembly: `GRCh37` or `GRCh38`
- ClinVar `summary_of_conflicting_interpretations.txt`
- ClinVar `submission_summary.txt.gz`
- optional PharmGKB enrichment via `--enable-pharmgkb`
Download the required ClinVar files from the official NCBI FTP tab_delimited directory:
- directory listing: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/
- `variant_summary.txt.gz`: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz
- `summary_of_conflicting_interpretations.txt`: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/summary_of_conflicting_interpretations.txt
- `submission_summary.txt.gz`: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/submission_summary.txt.gz
Recommended local placement for the examples in this README:
- `data/clinvar/raw/variant_summary.txt.gz`
- `data/clinvar/raw/summary_of_conflicting_interpretations.txt`
- `data/clinvar/raw/submission_summary.txt.gz`
Each run writes:
- `annotated_variants.csv` - stable CSV export of ranked variant records; list-valued fields are serialized as JSON arrays inside each cell
- `prioritized_variants.json` - machine-readable prioritized variant artifact with `schema_version`, `artifact_type`, and `records`
- `summary.json` - machine-readable summary artifact with stable count fields and priority-tier counts
- `run_metadata.json` - reproducibility metadata, source provenance, and counts
- `report.html` - analyst-facing HTML report with top findings, conflicts, methods, and limitations
- `report.md` - Markdown report generated from the same shared report context as the HTML view
- `report_export.json` - JSON-safe export of the rendered report context for downstream consumers
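A downstream script might guard on the `prioritized_variants.json` envelope before consuming `records`. The payload below is a minimal invented example: only the three envelope field names come from this README, and the per-record field names (such as `review_priority_tier`) are assumptions for illustration:

```python
import json

# Minimal invented payload. Only schema_version, artifact_type, and records
# are envelope field names documented by this README; the record fields and
# the schema_version value are illustrative assumptions.
payload = json.loads("""
{
  "schema_version": "1",
  "artifact_type": "prioritized_variants",
  "records": [
    {"gene_symbol": "BRCA1", "review_priority_tier": "high_review_priority"},
    {"gene_symbol": "TP53", "review_priority_tier": "review"}
  ]
}
""")

# Check the envelope before using records downstream.
assert payload["artifact_type"] == "prioritized_variants"
high = [r for r in payload["records"]
        if r.get("review_priority_tier") == "high_review_priority"]
print(len(high))  # 1
```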
This tool now uses a persistent processed ClinVar cache.
- by default the CLI builds and reuses a SQLite cache at `data/clinvar/processed/clinvar_lookup_cache.sqlite3`
- the first local run against a new raw ClinVar snapshot is a preprocessing run and can take a long time
- once that cache exists, repeated runs against the same snapshot should be much faster
Observed timing on the staged repository data:
- first run with cache build: about `1402` seconds (roughly 23 minutes 22 seconds)
- warm run reusing the cache: about `2.83` seconds
What this means in practice:
- if a fresh local environment appears slow on the first real run, that is expected
- that first local run is building a reusable queryable index from the raw ClinVar files
- the warm-run path is the intended day-to-day workflow
- hosted deployments should not rely on the first public request to perform this build
Cache controls:
- default cache location: `data/clinvar/processed/clinvar_lookup_cache.sqlite3`
- override location: `--clinvar-cache-db <path>`
- disable the cache and force raw-file reads: `--disable-clinvar-cache`
- prebuild or refresh the cache without running a report: `python -m src.cache_bootstrap ...`
The machine-readable outputs are intentionally separate from the human-oriented HTML report.
- `prioritized_variants.json` is the canonical structured export for downstream scripts
- `annotated_variants.csv` uses the same field set as the JSON records where practical
- list-valued fields such as `condition_names`, `flags`, and `ranking_rationale` remain lists in JSON and are encoded as JSON arrays in CSV cells
- `summary.json` uses stable count names: `input_variant_count`, `clinvar_matched_count`, `clinvar_unmatched_count`, `conflict_flagged_count`, `pharmgkb_enriched_count`, `gene_symbol_mismatch_count`, and `review_priority_tier_counts`
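Decoding the CSV's JSON-array cells back into Python lists might look like the sketch below. The loader and the inline sample row are illustrative; only the three list-valued column names are taken from this README:

```python
import csv
import io
import json

# Column names documented as list-valued; the loader itself is a sketch.
LIST_COLUMNS = {"condition_names", "flags", "ranking_rationale"}

def load_annotated_variants(csv_text):
    """Parse annotated_variants.csv text, decoding JSON-array cells to lists."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for col in LIST_COLUMNS.intersection(row):
            row[col] = json.loads(row[col])
        rows.append(row)
    return rows

# Tiny inline sample standing in for a real export.
sample = (
    'gene_symbol,flags,ranking_rationale\n'
    'BRCA1,"[""conflict""]","[""pathogenic match"",""conflict surfaced""]"\n'
)
rows = load_annotated_variants(sample)
print(rows[0]["flags"])  # ['conflict']
```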
```
python -m pip install -r requirements.txt
```

Pick one of these entry points:
- CLI: run the existing pipeline directly and inspect the generated artifacts in `outputs/`.
- Web: start the Flask app locally and submit a VCF from the browser.
To inspect the generated outputs first, start with the web interface or open `outputs/demo_run/report.html`.
The repository also includes a thin Flask web layer that reuses the same backend pipeline as the CLI.
Local startup:
```
flask --app src.web.app run --host 127.0.0.1 --port 5000
```

Then open http://127.0.0.1:5000.
Web flow:
- Upload a `.vcf` or `.vcf.gz` file.
- Select `GRCh37` or `GRCh38`.
- Optionally enable PharmGKB enrichment.
- Either open the browser report or go straight to `html`, `json`, or `md` export.
For deployment-specific details, see DEPLOYMENT.md.
Hosted process summary:
- the web app reuses a shared processed ClinVar cache rather than building one per uploaded VCF
- `VRW_RUN_RETENTION_HOURS` applies to temporary run workspaces, not to cache warm-up or user cooldown
- the recommended hosted workflow is to build `clinvar_lookup_cache.sqlite3` offline and upload it to the mounted disk before public use
Hosted/runtime environment variables:
- `VRW_JOB_EXECUTION_MODE`
- `VRW_MAX_UPLOAD_MB`
- `VRW_UPLOAD_ROOT`
- `VRW_RUN_OUTPUT_ROOT`
- `VRW_RUN_RETENTION_HOURS`
- `VRW_CLINVAR_RAW_DIR`
- `VRW_CLINVAR_PROCESSED_DIR`
- `VRW_CLINVAR_VARIANT_SUMMARY`
- `VRW_CLINVAR_CONFLICT_SUMMARY`
- `VRW_CLINVAR_SUBMISSION_SUMMARY`
- `VRW_CLINVAR_CACHE_DB`
- `VRW_DISABLE_CLINVAR_CACHE`
- optional: `VRW_DATA_ROOT`
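A hosted process might resolve a few of these variables as in the sketch below. Only the 25 MB upload cap default is stated in this README; the upload-root and cache-path fallbacks are taken from the recommended disk layout below, and everything else here is an invented illustration, not the app's actual configuration code:

```python
import os

def resolve_runtime_config(env=None):
    """Illustrative env-var resolution. Only the 25 MB upload cap default is
    documented; other fallbacks mirror the README's recommended disk layout."""
    env = os.environ if env is None else env
    return {
        "max_upload_mb": int(env.get("VRW_MAX_UPLOAD_MB", "25")),
        "upload_root": env.get("VRW_UPLOAD_ROOT", "/var/data/uploads"),
        "disable_clinvar_cache": env.get("VRW_DISABLE_CLINVAR_CACHE") == "1",
        "cache_db": env.get(
            "VRW_CLINVAR_CACHE_DB",
            "/var/data/clinvar/processed/clinvar_lookup_cache.sqlite3",
        ),
    }

cfg = resolve_runtime_config({"VRW_MAX_UPLOAD_MB": "50"})
print(cfg["max_upload_mb"])  # 50
```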
Recommended hosted disk layout:
- uploads: `/var/data/uploads`
- run outputs: `/var/data/runs`
- ClinVar raw files: `/var/data/clinvar/raw`
- ClinVar cache: `/var/data/clinvar/processed/clinvar_lookup_cache.sqlite3`
Hosted recommendation:
- do not rely on the first public web submission to build the shared ClinVar cache
- prebuild the processed cache offline, then place the finished SQLite file on the mounted disk before treating the site as ready
- the web app is a convenience interface over the same research pipeline, not a clinical product
- upload acceptance is limited to `.vcf` and `.vcf.gz` files, and the request size cap defaults to 25 MB
- the hosted web interface stores submitted files in per-run workspaces and removes stale workspaces opportunistically after the configured retention window
- the hosted web interface is not a PHI-grade privacy boundary, so protected health information should not be uploaded
- `/healthz` is intended to return healthy only after the configured ClinVar source files and cache parent directories exist on disk
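That readiness rule can be approximated with a filesystem check like the sketch below. The function name and exact logic are assumptions based on the description above, not the app's actual `/healthz` handler:

```python
import pathlib
import tempfile

def is_ready(variant_summary, conflict_summary, submission_summary, cache_db):
    """Approximation of the described readiness rule (NOT the real handler):
    healthy only once the ClinVar source files and the cache's parent
    directory exist on disk."""
    sources = (variant_summary, conflict_summary, submission_summary)
    return (all(pathlib.Path(p).is_file() for p in sources)
            and pathlib.Path(cache_db).parent.is_dir())

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    files = [root / name for name in (
        "variant_summary.txt.gz",
        "summary_of_conflicting_interpretations.txt",
        "submission_summary.txt.gz",
    )]
    for f in files:
        f.write_bytes(b"")                   # stage empty stand-in files
    cache = root / "processed" / "clinvar_lookup_cache.sqlite3"
    ready_before = is_ready(*files, cache)   # False: cache dir missing
    cache.parent.mkdir()
    ready_after = is_ready(*files, cache)    # True: everything staged
    print(ready_before, ready_after)  # False True
```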
CLI reviewer path:
```
python -m src.cli `
--input data\demo.vcf `
--assembly GRCh38 `
--variant-summary data\clinvar\raw\variant_summary.txt.gz `
--conflict-summary data\clinvar\raw\summary_of_conflicting_interpretations.txt `
--submission-summary data\clinvar\raw\submission_summary.txt.gz `
--out-dir outputs\demo_run `
--report-title "Variant Review Report" `
--top-findings-limit 5
```

Web reviewer path:

```
flask --app src.web.app run --host 127.0.0.1 --port 5000
```

Then submit data\demo.vcf through the homepage.
This command will build the processed ClinVar cache on first use if it does not already exist.
```
python -m src.cli `
--input data\demo.vcf `
--assembly GRCh38 `
--variant-summary data\clinvar\raw\variant_summary.txt.gz `
--conflict-summary data\clinvar\raw\summary_of_conflicting_interpretations.txt `
--submission-summary data\clinvar\raw\submission_summary.txt.gz `
--out-dir outputs\demo_run
```

Use a specific cache path:
```
python -m src.cli `
--input data\demo.vcf `
--assembly GRCh38 `
--variant-summary data\clinvar\raw\variant_summary.txt.gz `
--conflict-summary data\clinvar\raw\summary_of_conflicting_interpretations.txt `
--submission-summary data\clinvar\raw\submission_summary.txt.gz `
--clinvar-cache-db data\clinvar\processed\clinvar_lookup_cache.sqlite3 `
--out-dir outputs\demo_run
```

Disable the processed cache entirely:
```
python -m src.cli `
--input data\demo.vcf `
--assembly GRCh38 `
--variant-summary data\clinvar\raw\variant_summary.txt.gz `
--conflict-summary data\clinvar\raw\summary_of_conflicting_interpretations.txt `
--submission-summary data\clinvar\raw\submission_summary.txt.gz `
--disable-clinvar-cache `
--out-dir outputs\demo_run_no_cache
```

Prebuild or refresh the processed ClinVar cache without running a report:
```
python -m src.cache_bootstrap `
--variant-summary data\clinvar\raw\variant_summary.txt.gz `
--conflict-summary data\clinvar\raw\summary_of_conflicting_interpretations.txt `
--submission-summary data\clinvar\raw\submission_summary.txt.gz `
--clinvar-cache-db data\clinvar\processed\clinvar_lookup_cache.sqlite3 `
--force-rebuild
```

Run the demo with PharmGKB enrichment enabled:

```
python -m src.cli `
--input data\demo.vcf `
--assembly GRCh38 `
--variant-summary data\clinvar\raw\variant_summary.txt.gz `
--conflict-summary data\clinvar\raw\summary_of_conflicting_interpretations.txt `
--submission-summary data\clinvar\raw\submission_summary.txt.gz `
--out-dir outputs\demo_run_pgx `
--enable-pharmgkb
```

The repository demo input is intentionally small but not arbitrary. It is designed to show different review situations in one run:
- `BRCA1` - strong pathogenic ClinVar match with conflict surfaced
- `APC` - strong pathogenic ClinVar match without conflict
- `DPYD` - drug-response-oriented ClinVar match that becomes especially useful in the PharmGKB-enabled run
- `TP53` - conflict-heavy variant that stays important but scores below the strongest pathogenic findings
The demo file lives at `data/demo.vcf`.
On the current staged ClinVar snapshot, the base demo run produces:
- `4` input variants
- `4` ClinVar matches
- `3` conflict-flagged findings
- `3` `high_review_priority` findings
- `1` `review` finding
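These demo counts can be cross-checked against `summary.json` with a short script. The payload below simply restates the counts above using the stable field names this README documents; the consistency checks themselves are an illustrative suggestion:

```python
import json

# Inline payload restating the demo counts with the documented field names.
summary = json.loads("""
{
  "input_variant_count": 4,
  "clinvar_matched_count": 4,
  "clinvar_unmatched_count": 0,
  "conflict_flagged_count": 3,
  "review_priority_tier_counts": {"high_review_priority": 3, "review": 1}
}
""")

# Internal-consistency checks a reviewer might run on any summary.json.
assert summary["clinvar_matched_count"] + summary["clinvar_unmatched_count"] \
    == summary["input_variant_count"]
assert sum(summary["review_priority_tier_counts"].values()) \
    <= summary["input_variant_count"]
print("demo counts are internally consistent")
```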
The PharmGKB-enabled demo run adds cached public PGx context and, on the current demo set, enriches all 4 variants.
Inspect these generated artifacts in order:
- `outputs/demo_run/report.html`
- `outputs/demo_run/summary.json`
- `outputs/demo_run/prioritized_variants.json`
- `outputs/demo_run_pgx/report.html`
What the base demo should immediately show:
- the hero summary confirms all four demo variants matched ClinVar
- the top findings section shows both conflict-flagged and non-conflict findings
- the conflict review queue is not empty
- the rationale text explains why `BRCA1`, `APC`, and `DPYD` outrank `TP53`
What the PharmGKB demo should immediately show:
- the same ClinVar-first queue remains intact
- optional PGx context adds signal without replacing the core ClinVar interpretation layer
- the output provenance records both ClinVar and PharmGKB sources
Base report hero:
This view gives the fastest high-level read on the run:
- `4` input variants
- `4` ClinVar matches
- `3` conflict-flagged findings
- `3` high-priority findings
Base report top findings:
This view shows the ranked findings near the top of the report:
- `BRCA1` as a high-priority conflict-flagged finding
- `APC` as a strong non-conflict pathogenic finding
- `DPYD` as a drug-response-oriented finding that becomes more interesting in the PharmGKB-enabled run
Base report conflict review queue:
This view shows the conflict review queue section:
- the conflict review queue is populated
- the highest-friction findings are isolated into a reviewable section
Base report variant table:
This view shows the denser analyst-facing table:
- tier, score, ClinVar significance, review status, conflict flag, and workflow flags appear together
- the row-level output mirrors the machine-readable exports
PharmGKB-enabled source provenance:
This view shows the provenance section after a PharmGKB-enabled run:
- ClinVar sources remain explicit
- PharmGKB API and local PharmGKB cache are both recorded
- the optional enrichment layer is visible without obscuring the ClinVar-first workflow
Ranking is heuristic and intentionally transparent.
The current score uses:
- ClinVar clinical significance
- ClinVar review strength
- conflict surfacing
- input impact severity
- optional PharmGKB context
- gene-symbol mismatch penalty
- visible truncated-input notices when `--max-input-variants` is used
The system emits a numeric score, a priority tier, and a rationale list for every ranked variant.
Priority tiers:
- `high_review_priority`
- `review`
- `context_only`
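The signals listed above could combine in an additive scorer like the toy below. This is purely illustrative: it is not the repository's `ranker.py`, and the weights and tier thresholds are invented to show the shape of a transparent heuristic that emits a score, a tier, and a rationale list:

```python
# Toy additive scorer; NOT src/ranker.py. Weights and thresholds are invented.
SIGNIFICANCE_WEIGHTS = {"pathogenic": 5.0, "likely_pathogenic": 4.0,
                        "drug_response": 3.0}

def toy_rank(significance, review_stars, has_conflict,
             has_pgx_context, symbol_mismatch):
    score, rationale = 0.0, []
    if significance in SIGNIFICANCE_WEIGHTS:
        score += SIGNIFICANCE_WEIGHTS[significance]
        rationale.append(f"ClinVar significance: {significance}")
    score += 0.5 * review_stars
    rationale.append(f"ClinVar review stars: {review_stars}")
    if has_conflict:
        score += 1.0
        rationale.append("conflicting interpretations surfaced")
    if has_pgx_context:
        score += 0.5
        rationale.append("PharmGKB context attached")
    if symbol_mismatch:
        score -= 1.0
        rationale.append("gene-symbol mismatch penalty")
    tier = ("high_review_priority" if score >= 5.0
            else "review" if score >= 2.0 else "context_only")
    return score, tier, rationale

score, tier, why = toy_rank("pathogenic", 2, True, False, False)
print(tier)  # high_review_priority
```

The rationale list mirrors the README's design intent: every ranked variant carries a human-readable explanation of its score, not just a number.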
- Which variants have strong ClinVar support and should be reviewed first?
- Which findings are conflict-flagged and require closer inspection?
- Which unmatched records remain context-only?
- Which variants have optional PGx context worth surfacing for downstream review?
Used as the core local reference layer for variant matching and conflict attachment.
- FTP root: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/
- maintenance and release notes: https://www.ncbi.nlm.nih.gov/clinvar/docs/maintenance_use/
- FTP primer: https://www.ncbi.nlm.nih.gov/clinvar/docs/ftp_primer/
Primary files used by this workbench:
Used only as optional enrichment.
- API base: https://api.pharmgkb.org/v1
- docs: https://api.pharmgkb.org/
Current integration uses stable public queries for:
- gene lookup by symbol
- variant lookup by symbol
- clinical annotations by gene symbol
- guideline annotations by gene symbol
Human-readable source links are also maintained in:
This repository contains code written for the workbench itself. Upstream datasets and APIs remain governed by their respective providers.
- Repository code license: Apache License 2.0
- ClinVar data usage and redistribution expectations should be reviewed through NCBI documentation.
- PharmGKB API usage should follow PharmGKB terms and public API guidance.
Users of this repository should verify current upstream licensing and attribution requirements before redistributing derived datasets or packaging source snapshots.
Run the current unit suite:
```
python -m unittest discover -s tests -v
```

GitHub Actions also runs syntax checks and the unit suite on push and pull request.
The implemented system currently has coverage for:
- VCF parsing
- ClinVar index loading
- conflict and submission attachment
- annotation behavior
- ranking behavior
- report generation
- CLI orchestration
- PharmGKB caching, failure handling, and integration
- web route behavior
- web job execution and failure handling
Implemented and tested:
- ClinVar-first local annotation pipeline
- HTML report generation
- CSV and JSON exports
- optional PharmGKB enrichment
- end-to-end CLI orchestration
- thin Flask web interface with homepage, docs page, per-run uploads, report embedding, and exports
Current automated test count:
- `96` passing unit tests
- matching is exact and assembly-aware, but does not yet perform deeper variant normalization beyond the current key strategy
- unmatched variants are intentionally left as context-only rather than force-interpreted
- PharmGKB enrichment is optional and network-dependent when enabled
- this is a focused small-variant review tool, not a full-scale annotation framework
This tool is for research triage and educational review only. It is not intended for diagnosis, treatment selection, or other clinical decision-making.
The hosted web demo is also not a production privacy boundary. Do not submit protected health information or rely on uploaded-run retention as a formal data-governance control.




