VarForge is a fast, single-binary Rust tool for generating synthetic cancer sequencing test data with controlled ground truth. It produces realistic FASTQ and BAM files with known mutations, tumour parameters, UMI barcodes, structural variants, and cfDNA fragment profiles for benchmarking bioinformatics pipelines.
| Feature | VarForge | BAMSurgeon | ART | NEAT |
|---|---|---|---|---|
| Single binary, no runtime deps | Yes | No | No | No |
| Somatic mutations (SNV/indel/MNV) | Yes | Yes | No | Partial |
| Structural variants (DEL/INS/INV/DUP/TRA) | Yes | No | No | No |
| SV signatures (HRD, TDP, chromothripsis) | Yes | No | No | No |
| COSMIC SBS signature weighting | Yes | No | No | No |
| Tumour purity / clonal architecture | Yes | No | No | No |
| Paired tumour-normal simulation | Yes | Partial | No | No |
| Germline variant simulation | Yes | No | No | Yes |
| cfDNA fragment model | Yes | No | No | No |
| Long-read fragment model | Yes | No | No | No |
| Duplex UMI barcodes | Yes | No | No | No |
| FFPE / oxoG artefacts | Yes | No | No | No |
| Longitudinal / multi-sample series | Yes | No | No | No |
| Copy number alterations | Yes | No | No | No |
| GC bias model | Yes | No | Partial | No |
| Hybrid-capture / amplicon model | Yes | No | No | No |
| Microsatellite instability (MSI) | Yes | No | No | No |
| Truth VCF output | Yes | Partial | No | Yes |
| YAML configuration | Yes | No | No | No |
cargo install varforge
git clone https://github.com/varforge/varforge
cd varforge
cargo build --release
./target/release/varforge --help
Requires Rust 1.74 or later. No C libraries are needed. The entire dependency stack is pure Rust.
The following example generates a 30x WGS tumour sample with 5000 random somatic mutations.
1. Create a config file (quickstart.yaml):
reference: /data/ref/hg38.fa
output:
directory: out/quickstart
fastq: true
truth_vcf: true
sample:
name: TUMOUR
coverage: 30.0
tumour:
purity: 0.70
mutations:
random:
count: 5000
vaf_min: 0.05
vaf_max: 0.60
snv_fraction: 0.80
indel_fraction: 0.15
mnv_fraction: 0.05
seed: 422. Validate the config:
varforge validate --config quickstart.yaml
3. Run the simulation:
varforge simulate --config quickstart.yaml
4. Inspect the output:
out/quickstart/
TUMOUR_R1.fastq.gz # Read 1 FASTQ
TUMOUR_R2.fastq.gz # Read 2 FASTQ
truth.vcf.gz # Ground-truth VCF with all injected variants
manifest.tsv # Sample metadata (name, coverage, purity, paths)
CLI overrides allow any config value to be changed at the command line:
varforge simulate --config quickstart.yaml \
--coverage 60 --purity 0.5 --seed 99
Variable substitution lets configs use placeholders resolved at runtime:
reference: ${reference}varforge simulate --config quickstart.yaml --set reference=/data/ref/hg38.fa
Presets skip the config entirely for common scenarios:
varforge simulate --config quickstart.yaml --preset wgs
varforge simulate --config quickstart.yaml --preset cancer:melanoma
varforge [OPTIONS] <COMMAND>
Commands:
simulate Run a simulation from a YAML config
validate Validate a YAML config without running
edit Spike variants into an existing BAM file
learn-profile Learn an error/quality profile from a real BAM file
benchmark-suite Run a VAF x coverage benchmark grid
Global options:
-t, --threads <N> Number of threads (default: all available cores)
--log-level <LEVEL> error | warn | info | debug | trace (default: info)
varforge simulate --config <FILE> [OPTIONS]
Options:
-c, --config <FILE> Path to YAML configuration file (required)
-o, --output-dir <DIR> Override output directory
--seed <N> Override random seed
--coverage <F> Override coverage depth (x)
--read-length <N> Override read length (bp)
--purity <F> Override tumour purity (0.0-1.0)
--fragment-mean <F> Override fragment mean length (bp)
--fragment-sd <F> Override fragment length standard deviation (bp)
--random-mutations <N> Generate N random mutations (no VCF needed)
--vaf-range <MIN-MAX> VAF range for random mutations (e.g. 0.001-0.05)
--preset <NAME> Apply a named preset (see Presets section)
--set <KEY=VALUE> Set config variables; replaces ${key} in YAML (repeatable)
--list-presets List all available presets and exit
--dry-run Validate config and estimate output size only
varforge validate --config <FILE>
Parses the YAML config and checks all fields for consistency. Exits with status 0 if valid, non-zero otherwise with a descriptive error message.
varforge edit --bam <IN.bam> --vcf <VARIANTS.vcf> --output <OUT.bam>
Spikes variants from a VCF directly into an existing BAM file without re-simulating reads. Useful for adding a handful of known mutations to a real or previously simulated dataset.
varforge learn-profile --bam <BAM> --output <PROFILE.json>
Learns an empirical base-quality and error profile from a real BAM file. The resulting JSON can be referenced from the quality.profile_path config field to produce reads with a realistic, data-driven quality model instead of the parametric default.
varforge benchmark-suite --config <FILE> --vafs 0.01,0.05,0.1 --coverages 100,500,1000
Runs a grid of simulations across specified VAF and coverage values. Each combination produces its own output directory. Useful for generating sensitivity curves and limit-of-detection analyses.
All simulation parameters are specified in a YAML file. Only reference and output.directory are required. Everything else has a sensible default.
| Field | Type | Default | Description |
|---|---|---|---|
reference |
path | (required) | Path to FASTA reference genome |
output |
OutputConfig | (required) | Output format and directory |
sample |
SampleConfig | see below | Read generation parameters |
fragment |
FragmentConfig | see below | Insert size distribution |
quality |
QualityConfig | see below | Base quality model |
tumour |
TumourConfig | null | Tumour purity and clonal architecture |
mutations |
MutationConfig | null | Somatic mutation injection |
umi |
UmiConfig | null | UMI barcode configuration |
artifacts |
ArtifactConfig | null | Sequencing artefact simulation |
copy_number |
list of CopyNumberConfig | null | Copy number alterations |
gc_bias |
GcBiasConfig | null | GC content coverage bias |
capture |
CaptureConfig | null | Hybrid-capture or amplicon enrichment model |
germline |
GermlineConfig | null | Germline SNP/indel simulation |
paired |
PairedConfig | null | Matched tumour-normal pair mode |
samples |
list of SampleEntry | null | Multi-sample / longitudinal series |
chromosomes |
list of strings | null (all) | Restrict simulation to named chromosomes |
regions_bed |
path | null | Restrict simulation to BED file regions |
vafs |
list of floats | null | Batch mode: one run per VAF value |
preset |
string | null | Chemistry preset name (applied before YAML values) |
performance |
PerformanceConfig | see below | Streaming pipeline tuning |
seed |
integer | null (random) | Random seed for reproducibility |
threads |
integer | null (all cores) | Worker thread count |
output:
directory: out/my_run # required
fastq: true # write gzip-compressed FASTQ files (default: true)
bam: false # write coordinate-sorted BAM (default: false)
truth_vcf: true # write ground-truth VCF (default: true)
germline_vcf: true # write germline truth VCF when germline is enabled (default: true)
manifest: true # write manifest.tsv (default: true)
single_read_bam: false # single-read BAM for long-read platforms (default: false)
mapq: 60 # mapping quality for BAM records (default: 60)sample:
name: TUMOUR_01 # sample name used in file names and read headers (default: SAMPLE)
read_length: 150 # read length in bp (default: 150)
coverage: 30.0 # mean target coverage depth (default: 30.0)
platform: illumina # sequencing platform tag; written to BAM @RG header (optional)Controls the insert size (fragment length) distribution.
fragment:
model: normal # normal | cfda (default: normal)
mean: 300.0 # mean fragment length in bp (default: 300.0)
sd: 50.0 # standard deviation in bp (default: 50.0)Fragment models:
normal: Gaussian distribution. Suitable for standard library prep from fresh-frozen tissue or cell lines.cfda: Short, nucleosome-phased distribution reflecting cell-free DNA in plasma. Typical mean ~167 bp with mononucleosomal and dinucleosomal peaks and 10 bp periodicity from nucleosome positioning.
cfDNA-specific options (only apply when model: cfda):
fragment:
model: cfda
ctdna_fraction: 0.05 # fraction of tumour-derived shorter fragments (default: derived from purity)
mono_sd: 20.0 # SD of mononucleosomal peak in bp (default: 20.0)
di_sd: 30.0 # SD of dinucleosomal peak in bp (default: 30.0)
end_motif_model: plasma # plasma cfDNA 4-mer end motif rejection sampling (optional)Long-read fragment model (PacBio, Nanopore):
fragment:
long_read:
mean: 15000 # mean length in bp (default: 15000)
sd: 5000 # standard deviation in bp (default: 5000)
min_len: 1000 # minimum fragment length (default: 1000)
max_len: 100000 # maximum fragment length (default: 100000)When long_read is set, the log-normal sampler is used instead of the normal or cfDNA sampler. Combine with output.single_read_bam: true for realistic long-read BAM output.
quality:
mean_quality: 36 # mean Phred quality score for the first cycle (default: 36)
tail_decay: 0.003 # per-cycle quality decay rate (default: 0.003)
profile_path: null # optional path to empirical profile JSON from learn-profileIf profile_path is set, the empirical profile overrides mean_quality and tail_decay.
tumour:
purity: 0.70 # fraction of cells that are tumour (0.0-1.0; default: 1.0)
ploidy: 2 # tumour ploidy (default: 2)
msi: false # microsatellite instability mode (default: false)
clones: # optional list of clones for subclonal architecture
- id: trunk
ccf: 1.0 # cancer cell fraction (0.0-1.0)
- id: subclone_a
ccf: 0.40
parent: trunk # parent clone ID (optional; omit for founding clone)When clones is empty, all mutations are assigned to a single clonal population at the specified purity. When clones are defined, mutations are distributed across the clone tree and their effective VAF is:
VAF = purity x CCF / ploidy
When msi: true, indel rates at homopolymer and dinucleotide repeat loci are elevated to simulate MSI-high tumours.
mutations:
vcf: /path/to/variants.vcf.gz # optional: inject specific variants from VCF
random: # optional: add random somatic mutations
count: 5000 # number of mutations to generate
vaf_min: 0.001 # minimum VAF (default: 0.001)
vaf_max: 0.50 # maximum VAF (default: 0.5)
snv_fraction: 0.80 # fraction that are SNVs (default: 0.80)
indel_fraction: 0.15 # fraction that are indels (default: 0.15)
mnv_fraction: 0.05 # fraction that are MNVs (default: 0.05)
signature: SBS7a # COSMIC SBS signature for weighted base selection (optional)
sv_signature: HRD # SV signature: HRD, TDP, or CHROMOTHRIPSIS (optional)
sv_count: 10 # number of SVs to generate for the signature (default: 10)
include_driver_mutations: false # inject driver mutations from cancer preset (default: false)snv_fraction + indel_fraction + mnv_fraction must sum to exactly 1.0.
Both vcf and random may be specified simultaneously. VCF variants are injected first, then random mutations are added at non-overlapping positions.
SV signatures generate structural variants with biologically realistic size and type distributions:
HRD: large deletions (100 kbp to 10 Mbp) characteristic of homologous recombination deficiency.TDP: short tandem duplications (1 kbp to 10 kbp) characteristic of the tandem duplicator phenotype.CHROMOTHRIPSIS: clustered rearrangements (deletions, inversions, duplications) on a single chromosome.
COSMIC SBS signatures weight the alternate base selection by trinucleotide context probabilities from the COSMIC catalogue. For example, signature: SBS7a produces UV-type C>T mutations in dipyrimidine contexts typical of melanoma.
umi:
length: 8 # UMI barcode length in bases (default: 8)
duplex: false # enable duplex (double-stranded) UMI mode (default: false)
pcr_cycles: 10 # number of PCR amplification cycles (default: 10)
family_size_mean: 3.0 # mean read family size (default: 3.0)
family_size_sd: 1.5 # standard deviation of family size (default: 1.5)
inline: true # prepend UMI to read sequence (default: false)When inline: true, the UMI is prepended to the read sequence (e.g. for fgbio ExtractUmisFromBam). When inline: false, the UMI is written into the read name (e.g. @READ:ACGTACGT).
When duplex: true, each molecule is tagged with a strand-specific UMI pair supporting duplex consensus calling tools such as fgbio CallDuplexConsensusReads.
artifacts:
ffpe_damage_rate: 0.02 # C>T deamination rate (0.0-1.0; null = disabled)
oxog_rate: 0.01 # 8-oxoG C>A transversion rate (0.0-1.0; null = disabled)
duplicate_rate: 0.15 # PCR duplicate fraction (0.0-1.0; null = disabled)
pcr_error_rate: 0.001 # PCR substitution error rate per base (null = disabled)All fields are optional. Omit the entire artifacts block (or set individual rates to null) to disable artefact simulation.
copy_number:
- region: "chr7:55000000-55200000" # chrom:start-end (1-based, inclusive)
tumor_cn: 4 # tumour copy number (default: 2)
normal_cn: 2 # normal copy number (default: 2)
major_cn: 3 # major allele CN for LOH modelling (optional)
minor_cn: 1 # minor allele CN for LOH modelling (optional)Multiple entries may be listed. Overlapping regions are applied in order (last wins). Read depth in each region is scaled proportionally to tumor_cn / normal_cn.
gc_bias:
enabled: true # apply GC bias model (default: true when block is present)
model: default # default | flat | custom (default: "default")
severity: 1.0 # bias multiplier: 0 = none, 1 = realistic, 2 = extreme (default: 1.0)The default model applies an empirical coverage reduction at GC extremes (< 30 % or > 70 % GC). Setting severity: 0 disables the effect while keeping the block present.
capture:
enabled: true # activate capture model (default: true)
mode: panel # panel | amplicon (default: panel)
targets_bed: /data/panels/panel.bed # path to capture target BED (optional)
off_target_fraction: 0.20 # fraction of reads mapping off-target (default: 0.2)
coverage_uniformity: 0.30 # per-target LogNormal sigma (0 = uniform; default: 0.3)
edge_dropoff_bases: 50 # exponential dropoff at target edges in bp (default: 50)
primer_trim: 0 # bases to trim from read ends in amplicon mode (default: 0)
coverage_cv_target: 0.25 # warn if achieved CV exceeds this (optional)
on_target_fraction_target: 0.95 # warn if on-target fraction falls below this (optional)In panel mode, reads are distributed across targets with off-target spillover. In amplicon mode, fragments exactly span each target region with no off-target reads. When targets_bed is omitted, the capture model distributes reads uniformly across whichever chromosomes or regions are active.
germline:
het_snp_density: 0.6 # heterozygous SNPs per kbp (default: 0.6)
hom_snp_density: 0.3 # homozygous SNPs per kbp (default: 0.3)
het_indel_density: 0.05 # heterozygous indels per kbp (default: 0.05)
vcf: /path/to/germline.vcf # use specific germline variants instead of random (optional)Germline variants are assigned VAF 0.5 (heterozygous) or 1.0 (homozygous). They appear in the separate germline_truth.vcf output.
paired:
normal_coverage: 30.0 # coverage for the normal sample (default: 30.0)
normal_sample_name: NORMAL # sample name for normal output (default: NORMAL)
tumour_contamination_in_normal: 0.0 # tumour contamination in normal (0.0-1.0; default: 0.0)When present, VarForge runs two simulations: one tumour sample (with all somatic and germline variants) and one normal sample (germline only). Outputs are written to tumour/ and normal/ sub-directories under output.directory.
When samples is present, VarForge generates one output sub-directory per entry and a combined manifest.tsv. Each entry shares the top-level reference, mutations, tumour, and fragment settings but can override coverage, tumour fraction, and fragment model independently.
samples:
- name: timepoint_1
coverage: 1000.0
tumour_fraction: 0.05 # ctDNA fraction for this sample (default: 1.0)
fragment_model: cfda # override fragment model (optional)
clonal_shift: # per-clone CCF adjustments at this timepoint (optional)
subclone_a: 0.10
- name: timepoint_2
coverage: 1000.0
tumour_fraction: 0.002
fragment_model: cfdaperformance:
output_buffer_regions: 64 # max region batches buffered in streaming channel (default: 64)Higher values use more memory but provide better overlap between compute and I/O. Lower values reduce peak memory.
Presets are named configuration bundles that set sensible defaults for common scenarios. A preset is applied before the YAML config values, so explicit YAML fields always win.
varforge simulate --config base.yaml --preset <NAME>
varforge simulate --list-presets
| Preset | Coverage | Fragment | Mutations | UMI | Notes |
|---|---|---|---|---|---|
small |
1x | normal | 100 random (chr22 only) | no | Smoke test; completes in ~30 s |
panel |
500x | normal | 50 random | inline 8-mer | Targeted panel benchmarking |
wgs |
30x | normal | 5 000 random | no | Whole-genome variant calling |
cfdna |
200x | cfda (167 bp) | 200 random, VAF 0.1-5% | duplex | Liquid biopsy simulation |
ffpe |
30x | normal | 500 random | no | FFPE artefacts enabled |
umi |
1 000x | normal | 50 random | duplex 9-mer | High-depth duplex consensus |
Cancer presets are accessed with the cancer: namespace prefix. Each preset sets biologically realistic mutation counts, VAF ranges, tumour purity, and mutation-type fractions based on published COSMIC mutational signatures.
varforge simulate --config base.yaml --preset cancer:melanoma
| Preset | Cancer | Dominant Signature | Typical TMB | Purity |
|---|---|---|---|---|
cancer:lung_adeno |
Lung adenocarcinoma | SBS4 (smoking, C>A) | ~8 mut/Mb | 60% |
cancer:colorectal |
Colorectal (MSS) | SBS1/SBS5 (aging, C>T) | ~5 mut/Mb | 65% |
cancer:breast_tnbc |
Triple-negative breast | SBS3 (HRD, flat) | ~5 mut/Mb | 55% |
cancer:melanoma |
Cutaneous melanoma | SBS7a/b (UV, C>T) | ~30 mut/Mb | 70% |
cancer:aml |
Acute myeloid leukaemia | SBS1/SBS5 (aging) | ~1 mut/Mb | 80% |
cancer:prostate |
Prostate adenocarcinoma | SBS1/SBS5 (aging) | ~2 mut/Mb | 50% |
cancer:pancreatic |
Pancreatic ductal | SBS1/SBS5 (aging) | ~3 mut/Mb | 25% |
cancer:glioblastoma |
Glioblastoma (IDH-wt) | SBS1/SBS5 (aging) | ~4 mut/Mb | 65% |
Ready-to-use example configs are in the examples/ directory.
| File | Use case |
|---|---|
examples/minimal.yaml |
Simplest possible simulation (defaults only) |
examples/wgs_30x.yaml |
Standard 30x WGS tumour with random mutations |
examples/panel_umi.yaml |
Targeted panel with inline 8-mer UMI |
examples/cfdna_monitoring.yaml |
cfDNA longitudinal series (4 timepoints) |
examples/ffpe_artifacts.yaml |
FFPE-damaged tumour sample |
examples/tumor_normal.yaml |
Matched tumour/normal pair |
examples/subclonal.yaml |
Four-clone tumour with copy number alterations |
examples/high_depth.yaml |
1000x duplex UMI for low-VAF detection |
examples/custom_mutations.yaml |
Inject specific variants from a VCF file |
examples/twist_duplex_benchmark.yaml |
Twist duplex capture panel with HRD SVs |
Read names follow the format:
@{SAMPLE}:{CHROM}:{POS}:{READ_NUM}[:UMI={BARCODE}]
Example: @TUMOUR:chr7:55191822:1:UMI=ACGTACGT
The UMI suffix is only present when a umi block is configured and inline: false.
When BAM output is enabled (output.bam: true), the following non-standard tags are written:
| Tag | Type | Description |
|---|---|---|
RX |
Z | Raw UMI sequence |
MI |
Z | Molecule ID (read family identifier) |
tp |
i | 1 if the read carries a simulated somatic variant, 0 otherwise |
cl |
Z | Clone ID the variant was assigned to |
The truth VCF written to truth.vcf.gz uses the following INFO fields:
| Field | Description |
|---|---|
VAF |
Target variant allele frequency |
CLONE |
Clone ID the variant was assigned to |
CCF |
Cancer cell fraction of the assigned clone |
TYPE |
Variant type: SNV, INDEL, MNV, or SV |
# Generate matched tumour/normal with known variants.
reference: /data/ref/hg38.fa
output:
directory: out/caller_bench
bam: true
truth_vcf: true
samples:
- name: TUMOUR
coverage: 60.0
tumour_fraction: 1.0
- name: NORMAL
coverage: 30.0
tumour_fraction: 0.0
tumour:
purity: 0.65
mutations:
random:
count: 1000
vaf_min: 0.05
vaf_max: 0.60
snv_fraction: 0.80
indel_fraction: 0.15
mnv_fraction: 0.05
seed: 42Run your caller against TUMOUR/ with NORMAL/ as the matched normal. Evaluate with:
bcftools stats --apply-filters PASS \
caller_output.vcf.gz truth.vcf.gz
Use examples/panel_umi.yaml with inline: true and then pipe through fgbio:
fgbio ExtractUmisFromBam \
--input out/panel_umi/PANEL_UMI.bam \
--output extracted.bam \
--read-structure 8M+T 8M+T
fgbio GroupReadsByUmi \
--input extracted.bam \
--output grouped.bam \
--strategy paired
fgbio CallMolecularConsensusReads \
--input grouped.bam \
--output consensus.bam \
--min-reads 1
Generate cfDNA samples at multiple tumour fractions and measure detection rate at each:
for TF in 0.10 0.05 0.01 0.005 0.001; do
varforge simulate --config examples/cfdna_monitoring.yaml \
--output-dir out/tf_${TF} \
--purity ${TF} \
--seed 42
done
Generate data with HRD-type structural variants for SV caller evaluation:
mutations:
random:
count: 500
snv_fraction: 0.80
indel_fraction: 0.15
mnv_fraction: 0.05
sv_signature: HRD
sv_count: 20Use examples/ffpe_artifacts.yaml to generate data with realistic FFPE damage, then evaluate your artefact filter:
- True positives: variants present in
truth.vcf.gz - Artefacts: C>T calls not in the truth VCF with strand bias (use the
tpBAM tag to distinguish)
VarForge uses a streaming architecture: rayon workers generate reads in parallel and send them through a bounded crossbeam channel to a dedicated writer thread. Memory scales with channel depth, not dataset size.
varforge simulate --config cfg.yaml --threads 16
Or set in the config:
threads: 16For development and testing, restrict to one or a few chromosomes:
chromosomes:
- chr22Or to a BED file of target regions:
regions_bed: /data/panels/hotspot_panel.bed| Scenario | Coverage | Mutations | Time |
|---|---|---|---|
small preset |
1x, chr22 | 100 | ~30 s |
| Panel (chr7, chr12, chr17) | 500x | 50 | ~2 min |
| WGS 30x | 30x, all | 5 000 | ~25 min |
| WGS 60x | 60x, all | 5 000 | ~50 min |
| Ultra-deep panel 1000x | 1000x, 3 chroms | 50 | ~8 min |
Peak memory is approximately:
2 x read_length x threads x (coverage / 30) MB
For 30x WGS with 150 bp reads on 8 threads: ~600 MB. For 1000x panel on 8 threads: ~200 MB (limited region).
| Scenario | Recommended tool |
|---|---|
| Realistic Illumina base-quality profiles | VarForge (parametric or learn-profile) |
| Controlled somatic variant spike-in | VarForge or BAMSurgeon |
| Spike into a real patient BAM | BAMSurgeon (preserves real read background) |
| Simple read generation, no mutations | ART or NanoSim |
| Whole-genome de novo simulation | NEAT or VarForge |
| cfDNA / liquid biopsy data | VarForge (only tool with native cfDNA model) |
| UMI-tagged duplex sequencing | VarForge (only tool with native duplex model) |
| FFPE artefact simulation | VarForge (only tool with FFPE + oxoG model) |
| Multi-sample longitudinal series | VarForge (only tool with native time-series) |
| Structural variant signatures | VarForge (only tool with HRD/TDP/chromothripsis models) |
| Long-read data | VarForge (log-normal fragment model) or NanoSim |
VarForge is the right choice when you need a complete, reproducible, ground-truth dataset for pipeline benchmarking and do not have access to real patient sequencing data. BAMSurgeon is the better choice when you need to insert a small number of variants into an existing real-data BAM while preserving the authentic read background.