VarForge

VarForge is a fast, single-binary Rust tool for generating synthetic cancer sequencing test data with controlled ground truth. It produces realistic FASTQ and BAM files with known mutations, tumour parameters, UMI barcodes, structural variants, and cfDNA fragment profiles for benchmarking bioinformatics pipelines.

Features

Feature	VarForge	BAMSurgeon	ART	NEAT
Single binary, no runtime deps	Yes	No	No	No
Somatic mutations (SNV/indel/MNV)	Yes	Yes	No	Partial
Structural variants (DEL/INS/INV/DUP/TRA)	Yes	No	No	No
SV signatures (HRD, TDP, chromothripsis)	Yes	No	No	No
COSMIC SBS signature weighting	Yes	No	No	No
Tumour purity / clonal architecture	Yes	No	No	No
Paired tumour-normal simulation	Yes	Partial	No	No
Germline variant simulation	Yes	No	No	Yes
cfDNA fragment model	Yes	No	No	No
Long-read fragment model	Yes	No	No	No
Duplex UMI barcodes	Yes	No	No	No
FFPE / oxoG artefacts	Yes	No	No	No
Longitudinal / multi-sample series	Yes	No	No	No
Copy number alterations	Yes	No	No	No
GC bias model	Yes	No	Partial	No
Hybrid-capture / amplicon model	Yes	No	No	No
Microsatellite instability (MSI)	Yes	No	No	No
Truth VCF output	Yes	Partial	No	Yes
YAML configuration	Yes	No	No	No

Installation

From crates.io

cargo install varforge

From source

git clone https://github.com/varforge/varforge
cd varforge
cargo build --release
./target/release/varforge --help

Requires Rust 1.74 or later. No C libraries are needed. The entire dependency stack is pure Rust.

Quickstart

The following example generates a 30x WGS tumour sample with 5000 random somatic mutations.

1. Create a config file (quickstart.yaml):

reference: /data/ref/hg38.fa

output:
  directory: out/quickstart
  fastq: true
  truth_vcf: true

sample:
  name: TUMOUR
  coverage: 30.0

tumour:
  purity: 0.70

mutations:
  random:
    count: 5000
    vaf_min: 0.05
    vaf_max: 0.60
    snv_fraction: 0.80
    indel_fraction: 0.15
    mnv_fraction: 0.05

seed: 42

2. Validate the config:

varforge validate --config quickstart.yaml

3. Run the simulation:

varforge simulate --config quickstart.yaml

4. Inspect the output:

out/quickstart/
  TUMOUR_R1.fastq.gz       # Read 1 FASTQ
  TUMOUR_R2.fastq.gz       # Read 2 FASTQ
  truth.vcf.gz             # Ground-truth VCF with all injected variants
  manifest.tsv             # Sample metadata (name, coverage, purity, paths)

CLI overrides change common config values (coverage, purity, seed and others listed under simulate) at the command line:

varforge simulate --config quickstart.yaml \
    --coverage 60 --purity 0.5 --seed 99

Variable substitution lets configs use placeholders resolved at runtime:

reference: ${reference}

varforge simulate --config quickstart.yaml --set reference=/data/ref/hg38.fa
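To make the placeholder semantics concrete, here is a minimal Python sketch of ${key} substitution. It is not VarForge's implementation: the function names, the set of characters allowed in a key, and the choice to fail on an undefined key are all assumptions made for illustration.

```python
import re

# Matches ${name} placeholders; the allowed name characters are an assumption.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def parse_set_args(pairs):
    """Turn ["reference=/data/ref/hg38.fa", ...] into a dict, splitting on the first '='."""
    return dict(pair.split("=", 1) for pair in pairs)


def substitute(text, variables):
    """Replace each ${key} in text with variables[key]; raise on unknown keys."""

    def lookup(match):
        key = match.group(1)
        if key not in variables:
            # Failing loudly is a design choice of this sketch, not documented behaviour.
            raise KeyError(f"no value supplied for placeholder ${{{key}}}")
        return variables[key]

    return PLACEHOLDER.sub(lookup, text)


template = "reference: ${reference}\n"
rendered = substitute(template, parse_set_args(["reference=/data/ref/hg38.fa"]))
print(rendered, end="")
```

Splitting on the first '=' only keeps values that themselves contain '=' intact.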

Presets supply defaults for common scenarios, so the config only needs the fields you want to change:

varforge simulate --config quickstart.yaml --preset wgs
varforge simulate --config quickstart.yaml --preset cancer:melanoma

CLI Reference

varforge [OPTIONS] <COMMAND>

Commands:
  simulate         Run a simulation from a YAML config
  validate         Validate a YAML config without running
  edit             Spike variants into an existing BAM file
  learn-profile    Learn an error/quality profile from a real BAM file
  benchmark-suite  Run a VAF x coverage benchmark grid

Global options:
  -t, --threads <N>          Number of threads (default: all available cores)
      --log-level <LEVEL>    error | warn | info | debug | trace (default: info)

`simulate`

varforge simulate --config <FILE> [OPTIONS]

Options:
  -c, --config <FILE>           Path to YAML configuration file (required)
  -o, --output-dir <DIR>        Override output directory
      --seed <N>                Override random seed
      --coverage <F>            Override coverage depth (x)
      --read-length <N>         Override read length (bp)
      --purity <F>              Override tumour purity (0.0-1.0)
      --fragment-mean <F>       Override fragment mean length (bp)
      --fragment-sd <F>         Override fragment length standard deviation (bp)
      --random-mutations <N>    Generate N random mutations (no VCF needed)
      --vaf-range <MIN-MAX>     VAF range for random mutations (e.g. 0.001-0.05)
      --preset <NAME>           Apply a named preset (see Presets section)
      --set <KEY=VALUE>         Set config variables; replaces ${key} in YAML (repeatable)
      --list-presets            List all available presets and exit
      --dry-run                 Validate config and estimate output size only

`validate`

varforge validate --config <FILE>

Parses the YAML config and checks all fields for consistency. Exits with status 0 if the config is valid; otherwise prints a descriptive error message and exits with a non-zero status.

`edit`

varforge edit --bam <IN.bam> --vcf <VARIANTS.vcf> --output <OUT.bam>

Spikes variants from a VCF directly into an existing BAM file without re-simulating reads. Useful for adding a handful of known mutations to a real or previously simulated dataset.

`learn-profile`

varforge learn-profile --bam <BAM> --output <PROFILE.json>

Learns an empirical base-quality and error profile from a real BAM file. The resulting JSON can be referenced from the quality.profile_path config field to produce reads with a realistic, data-driven quality model instead of the parametric default.

`benchmark-suite`

varforge benchmark-suite --config <FILE> --vafs 0.01,0.05,0.1 --coverages 100,500,1000

Runs a grid of simulations across specified VAF and coverage values. Each combination produces its own output directory. Useful for generating sensitivity curves and limit-of-detection analyses.
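The grid is the Cartesian product of the two lists, so the command above yields 3 x 3 = 9 simulations. The Python sketch below enumerates those combinations; the directory-style labels it prints are hypothetical and may not match the names VarForge actually uses.

```python
from itertools import product

vafs = [0.01, 0.05, 0.1]
coverages = [100, 500, 1000]

# One simulation per (VAF, coverage) pair.
grid = list(product(vafs, coverages))

for vaf, coverage in grid:
    # Label format is an assumption made for illustration only.
    print(f"vaf_{vaf}_cov_{coverage}")

print(len(grid), "runs")
```

Because deep, low-VAF cells dominate the runtime, it can help to start with a small grid and widen it once the pipeline under test runs cleanly.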

Configuration Reference

All simulation parameters are specified in a YAML file. Only reference and output.directory are required. Everything else has a sensible default.

Top-level fields

Field	Type	Default	Description
`reference`	path	(required)	Path to FASTA reference genome
`output`	OutputConfig	(required)	Output format and directory
`sample`	SampleConfig	see below	Read generation parameters
`fragment`	FragmentConfig	see below	Insert size distribution
`quality`	QualityConfig	see below	Base quality model
`tumour`	TumourConfig	null	Tumour purity and clonal architecture
`mutations`	MutationConfig	null	Somatic mutation injection
`umi`	UmiConfig	null	UMI barcode configuration
`artifacts`	ArtifactConfig	null	Sequencing artefact simulation
`copy_number`	list of CopyNumberConfig	null	Copy number alterations
`gc_bias`	GcBiasConfig	null	GC content coverage bias
`capture`	CaptureConfig	null	Hybrid-capture or amplicon enrichment model
`germline`	GermlineConfig	null	Germline SNP/indel simulation
`paired`	PairedConfig	null	Matched tumour-normal pair mode
`samples`	list of SampleEntry	null	Multi-sample / longitudinal series
`chromosomes`	list of strings	null (all)	Restrict simulation to named chromosomes
`regions_bed`	path	null	Restrict simulation to BED file regions
`vafs`	list of floats	null	Batch mode: one run per VAF value
`preset`	string	null	Chemistry preset name (applied before YAML values)
`performance`	PerformanceConfig	see below	Streaming pipeline tuning
`seed`	integer	null (random)	Random seed for reproducibility
`threads`	integer	null (all cores)	Worker thread count

`output`

output:
  directory: out/my_run   # required
  fastq: true             # write gzip-compressed FASTQ files (default: true)
  bam: false              # write coordinate-sorted BAM (default: false)
  truth_vcf: true         # write ground-truth VCF (default: true)
  germline_vcf: true      # write germline truth VCF when germline is enabled (default: true)
  manifest: true          # write manifest.tsv (default: true)
  single_read_bam: false  # single-read BAM for long-read platforms (default: false)
  mapq: 60                # mapping quality for BAM records (default: 60)

`sample`

sample:
  name: TUMOUR_01       # sample name used in file names and read headers (default: SAMPLE)
  read_length: 150      # read length in bp (default: 150)
  coverage: 30.0        # mean target coverage depth (default: 30.0)
  platform: illumina    # sequencing platform tag; written to BAM @RG header (optional)

`fragment`

Controls the insert size (fragment length) distribution.

fragment:
  model: normal    # normal | cfda (default: normal)
  mean: 300.0      # mean fragment length in bp (default: 300.0)
  sd: 50.0         # standard deviation in bp (default: 50.0)

Fragment models:

normal: Gaussian distribution. Suitable for standard library prep from fresh-frozen tissue or cell lines.
cfda: Short, nucleosome-phased distribution reflecting cell-free DNA in plasma. Typical mean ~167 bp with mononucleosomal and dinucleosomal peaks and 10 bp periodicity from nucleosome positioning.

cfDNA-specific options (only apply when model: cfda):

fragment:
  model: cfda
  ctdna_fraction: 0.05     # fraction of tumour-derived shorter fragments (default: derived from purity)
  mono_sd: 20.0            # SD of mononucleosomal peak in bp (default: 20.0)
  di_sd: 30.0              # SD of dinucleosomal peak in bp (default: 30.0)
  end_motif_model: plasma  # plasma cfDNA 4-mer end motif rejection sampling (optional)

Long-read fragment model (PacBio, Nanopore):

fragment:
  long_read:
    mean: 15000    # mean length in bp (default: 15000)
    sd: 5000       # standard deviation in bp (default: 5000)
    min_len: 1000  # minimum fragment length (default: 1000)
    max_len: 100000  # maximum fragment length (default: 100000)

When long_read is set, the log-normal sampler is used instead of the normal or cfDNA sampler. Combine with output.single_read_bam: true for realistic long-read BAM output.
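As an illustration of what a log-normal length sampler looks like, the Python sketch below converts an arithmetic mean and SD into log-normal parameters and clamps each draw to the configured bounds. It assumes that mean and sd describe the lengths themselves (not their logarithms) and that out-of-range draws are clamped rather than redrawn; VarForge may handle either point differently.

```python
import math
import random


def lognormal_params(mean, sd):
    """Convert an arithmetic mean/SD into the (mu, sigma) of a log-normal."""
    sigma_sq = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma_sq / 2.0
    return mu, math.sqrt(sigma_sq)


def sample_long_read_length(rng, mean=15000.0, sd=5000.0,
                            min_len=1000, max_len=100000):
    """Draw one fragment length and clamp it to [min_len, max_len]."""
    mu, sigma = lognormal_params(mean, sd)
    length = int(round(rng.lognormvariate(mu, sigma)))
    return max(min_len, min(max_len, length))


rng = random.Random(42)  # fixed seed for a reproducible illustration
lengths = [sample_long_read_length(rng) for _ in range(10000)]
print(min(lengths), max(lengths), round(sum(lengths) / len(lengths)))
```

Unlike a Gaussian, the log-normal never produces negative lengths and has the long right tail typical of long-read libraries.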

`quality`

quality:
  mean_quality: 36        # mean Phred quality score for the first cycle (default: 36)
  tail_decay: 0.003       # per-cycle quality decay rate (default: 0.003)
  profile_path: null      # optional path to empirical profile JSON from learn-profile

If profile_path is set, the empirical profile overrides mean_quality and tail_decay.

`tumour`

tumour:
  purity: 0.70    # fraction of cells that are tumour (0.0-1.0; default: 1.0)
  ploidy: 2       # tumour ploidy (default: 2)
  msi: false      # microsatellite instability mode (default: false)
  clones:         # optional list of clones for subclonal architecture
    - id: trunk
      ccf: 1.0           # cancer cell fraction (0.0-1.0)
    - id: subclone_a
      ccf: 0.40
      parent: trunk      # parent clone ID (optional; omit for founding clone)

When clones is empty, all mutations are assigned to a single clonal population at the specified purity. When clones are defined, mutations are distributed across the clone tree and their effective VAF is:

VAF = purity x CCF / ploidy
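As a worked example of this formula, the Python sketch below computes the expected VAF for the two clones in the config above at purity 0.70 and ploidy 2. The function name is hypothetical; the arithmetic is just the stated formula.

```python
def expected_vaf(purity, ccf, ploidy=2):
    """Expected VAF under the stated formula: purity x CCF / ploidy."""
    if not 0.0 <= purity <= 1.0 or not 0.0 <= ccf <= 1.0:
        raise ValueError("purity and ccf must lie in [0, 1]")
    if ploidy <= 0:
        raise ValueError("ploidy must be positive")
    return purity * ccf / ploidy


clones = {"trunk": 1.0, "subclone_a": 0.40}
for clone_id, ccf in clones.items():
    print(clone_id, round(expected_vaf(0.70, ccf), 4))
```

Trunk mutations land at 0.35 and subclone_a mutations at 0.14, which is a useful sanity check when choosing coverage for a target limit of detection.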

When msi: true, indel rates at homopolymer and dinucleotide repeat loci are elevated to simulate MSI-high tumours.

`mutations`

mutations:
  vcf: /path/to/variants.vcf.gz   # optional: inject specific variants from VCF
  random:                          # optional: add random somatic mutations
    count: 5000                    # number of mutations to generate
    vaf_min: 0.001                 # minimum VAF (default: 0.001)
    vaf_max: 0.50                  # maximum VAF (default: 0.5)
    snv_fraction: 0.80             # fraction that are SNVs (default: 0.80)
    indel_fraction: 0.15           # fraction that are indels (default: 0.15)
    mnv_fraction: 0.05             # fraction that are MNVs (default: 0.05)
    signature: SBS7a               # COSMIC SBS signature for weighted base selection (optional)
  sv_signature: HRD                # SV signature: HRD, TDP, or CHROMOTHRIPSIS (optional)
  sv_count: 10                     # number of SVs to generate for the signature (default: 10)
  include_driver_mutations: false  # inject driver mutations from cancer preset (default: false)

snv_fraction + indel_fraction + mnv_fraction must sum to exactly 1.0.
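When configs are generated programmatically, it can be handy to check this constraint before calling varforge validate. The Python sketch below does that and also estimates how many variants of each type to expect in the truth VCF. The function names, the small floating-point tolerance, and the rounding rule are assumptions; VarForge's own check and rounding may differ.

```python
import math


def check_fractions(snv, indel, mnv, tol=1e-9):
    """Raise if the three fractions do not sum to 1.0 (within a tiny tolerance)."""
    total = snv + indel + mnv
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"fractions sum to {total}, expected 1.0")


def expected_counts(count, snv, indel, mnv):
    """Split a mutation count by type; MNVs take the remainder so the total is exact."""
    check_fractions(snv, indel, mnv)
    n_snv = round(count * snv)
    n_indel = round(count * indel)
    return {"snv": n_snv, "indel": n_indel, "mnv": count - n_snv - n_indel}


print(expected_counts(5000, 0.80, 0.15, 0.05))
```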

Both vcf and random may be specified simultaneously. VCF variants are injected first, then random mutations are added at non-overlapping positions.

SV signatures generate structural variants with biologically realistic size and type distributions:

HRD: large deletions (100 kbp to 10 Mbp) characteristic of homologous recombination deficiency.
TDP: short tandem duplications (1 kbp to 10 kbp) characteristic of the tandem duplicator phenotype.
CHROMOTHRIPSIS: clustered rearrangements (deletions, inversions, duplications) on a single chromosome.

COSMIC SBS signatures weight the alternate base selection by trinucleotide context probabilities from the COSMIC catalogue. For example, signature: SBS7a produces UV-type C>T mutations in dipyrimidine contexts typical of melanoma.

`umi`

umi:
  length: 8              # UMI barcode length in bases (default: 8)
  duplex: false          # enable duplex (double-stranded) UMI mode (default: false)
  pcr_cycles: 10         # number of PCR amplification cycles (default: 10)
  family_size_mean: 3.0  # mean read family size (default: 3.0)
  family_size_sd: 1.5    # standard deviation of family size (default: 1.5)
  inline: true           # prepend UMI to read sequence (default: false)

When inline: true, the UMI is prepended to the read sequence (e.g. for fgbio ExtractUmisFromBam). When inline: false, the UMI is written into the read name (e.g. @TUMOUR:chr7:55191822:1:UMI=ACGTACGT).

When duplex: true, each molecule is tagged with a strand-specific UMI pair supporting duplex consensus calling tools such as fgbio CallDuplexConsensusReads.

`artifacts`

artifacts:
  ffpe_damage_rate: 0.02    # C>T deamination rate (0.0-1.0; null = disabled)
  oxog_rate: 0.01           # 8-oxoG C>A transversion rate (0.0-1.0; null = disabled)
  duplicate_rate: 0.15      # PCR duplicate fraction (0.0-1.0; null = disabled)
  pcr_error_rate: 0.001     # PCR substitution error rate per base (null = disabled)

All fields are optional. Omit the entire artifacts block (or set individual rates to null) to disable artefact simulation.

`copy_number`

copy_number:
  - region: "chr7:55000000-55200000"   # chrom:start-end (1-based, inclusive)
    tumor_cn: 4                        # tumour copy number (default: 2)
    normal_cn: 2                       # normal copy number (default: 2)
    major_cn: 3                        # major allele CN for LOH modelling (optional)
    minor_cn: 1                        # minor allele CN for LOH modelling (optional)

Multiple entries may be listed. Overlapping regions are applied in order (last wins). Read depth in each region is scaled proportionally to tumor_cn / normal_cn.
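The "last wins" rule and the depth scaling can be sketched in a few lines of Python. This is an illustration of the behaviour described above, not VarForge's code: the class and function names are hypothetical, and the sketch ignores tumour purity, which may also affect the simulated depth.

```python
from dataclasses import dataclass


@dataclass
class CopyNumberEntry:
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    tumor_cn: int = 2
    normal_cn: int = 2


def parse_region(region, tumor_cn=2, normal_cn=2):
    """Parse "chrom:start-end" into a CopyNumberEntry."""
    chrom, span = region.rsplit(":", 1)
    start, end = span.split("-")
    return CopyNumberEntry(chrom, int(start), int(end), tumor_cn, normal_cn)


def depth_multiplier(entries, chrom, pos):
    """Return tumor_cn / normal_cn for the last entry covering pos, else 1.0."""
    multiplier = 1.0
    for entry in entries:  # later entries overwrite earlier ones
        if entry.chrom == chrom and entry.start <= pos <= entry.end:
            multiplier = entry.tumor_cn / entry.normal_cn
    return multiplier


entries = [
    parse_region("chr7:55000000-55200000", tumor_cn=4),
    parse_region("chr7:55100000-55150000", tumor_cn=1),  # overlaps; listed last
]
print(depth_multiplier(entries, "chr7", 55050000))  # gain only
print(depth_multiplier(entries, "chr7", 55120000))  # overlap: last entry wins
print(depth_multiplier(entries, "chr1", 1000))      # no entry
```

The three calls return 2.0, 0.5 and 1.0 respectively.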

`gc_bias`

gc_bias:
  enabled: true      # apply GC bias model (default: true when block is present)
  model: default     # default | flat | custom (default: "default")
  severity: 1.0      # bias multiplier: 0 = none, 1 = realistic, 2 = extreme (default: 1.0)

The default model applies an empirical coverage reduction at GC extremes (< 30 % or > 70 % GC). Setting severity: 0 disables the effect while keeping the block present.

`capture`

capture:
  enabled: true                             # activate capture model (default: true)
  mode: panel                               # panel | amplicon (default: panel)
  targets_bed: /data/panels/panel.bed       # path to capture target BED (optional)
  off_target_fraction: 0.20                 # fraction of reads mapping off-target (default: 0.2)
  coverage_uniformity: 0.30                 # per-target LogNormal sigma (0 = uniform; default: 0.3)
  edge_dropoff_bases: 50                    # exponential dropoff at target edges in bp (default: 50)
  primer_trim: 0                            # bases to trim from read ends in amplicon mode (default: 0)
  coverage_cv_target: 0.25                  # warn if achieved CV exceeds this (optional)
  on_target_fraction_target: 0.95           # warn if on-target fraction falls below this (optional)

In panel mode, reads are distributed across targets with off-target spillover. In amplicon mode, fragments exactly span each target region with no off-target reads. When targets_bed is omitted, the capture model distributes reads uniformly across whichever chromosomes or regions are active.

`germline`

germline:
  het_snp_density: 0.6     # heterozygous SNPs per kbp (default: 0.6)
  hom_snp_density: 0.3     # homozygous SNPs per kbp (default: 0.3)
  het_indel_density: 0.05   # heterozygous indels per kbp (default: 0.05)
  vcf: /path/to/germline.vcf  # use specific germline variants instead of random (optional)

Germline variants are assigned VAF 0.5 (heterozygous) or 1.0 (homozygous). They appear in the separate germline_truth.vcf output.

`paired`

paired:
  normal_coverage: 30.0                   # coverage for the normal sample (default: 30.0)
  normal_sample_name: NORMAL              # sample name for normal output (default: NORMAL)
  tumour_contamination_in_normal: 0.0     # tumour contamination in normal (0.0-1.0; default: 0.0)

When present, VarForge runs two simulations: one tumour sample (with all somatic and germline variants) and one normal sample (germline only). Outputs are written to tumour/ and normal/ sub-directories under output.directory.

`samples` (multi-sample / longitudinal)

When samples is present, VarForge generates one output sub-directory per entry and a combined manifest.tsv. Each entry shares the top-level reference, mutations, tumour, and fragment settings but can override coverage, tumour fraction, and fragment model independently.

samples:
  - name: timepoint_1
    coverage: 1000.0
    tumour_fraction: 0.05        # ctDNA fraction for this sample (default: 1.0)
    fragment_model: cfda         # override fragment model (optional)
    clonal_shift:                # per-clone CCF adjustments at this timepoint (optional)
      subclone_a: 0.10
  - name: timepoint_2
    coverage: 1000.0
    tumour_fraction: 0.002
    fragment_model: cfda

`performance`

performance:
  output_buffer_regions: 64    # max region batches buffered in streaming channel (default: 64)

Higher values use more memory but provide better overlap between compute and I/O. Lower values reduce peak memory.

Presets Reference

Presets are named configuration bundles that set sensible defaults for common scenarios. A preset is applied before the YAML config values, so explicit YAML fields always win.

varforge simulate --config base.yaml --preset <NAME>
varforge simulate --list-presets
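The precedence rule (preset first, explicit YAML on top) amounts to a layered merge. The Python sketch below shows one way such a merge could work. It assumes nested blocks are merged field by field rather than replaced wholesale, which this README does not specify, and the preset contents shown are illustrative values taken from the wgs row of the table below and the documented defaults.

```python
def deep_merge(base, override):
    """Return base updated with override, merging nested dicts recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Illustrative stand-in for what a preset might set.
preset = {
    "sample": {"coverage": 30.0, "read_length": 150},
    "mutations": {"random": {"count": 5000}},
}
# Explicit YAML values supplied by the user.
yaml_config = {"sample": {"coverage": 60.0}}

effective = deep_merge(preset, yaml_config)
print(effective)
```

Here the YAML coverage of 60.0 replaces the preset's 30.0, while read_length and the mutation count fall through from the preset.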

Built-in presets

Preset	Coverage	Fragment	Mutations	UMI	Notes
`small`	1x	normal	100 random (chr22 only)	no	Smoke test; completes in ~30 s
`panel`	500x	normal	50 random	inline 8-mer	Targeted panel benchmarking
`wgs`	30x	normal	5000 random	no	Whole-genome variant calling
`cfdna`	200x	cfda (167 bp)	200 random, VAF 0.1-5%	duplex	Liquid biopsy simulation
`ffpe`	30x	normal	500 random	no	FFPE artefacts enabled
`umi`	1000x	normal	50 random	duplex 9-mer	High-depth duplex consensus

Cancer-type presets

Cancer presets are accessed with the cancer: namespace prefix. Each preset sets biologically realistic mutation counts, VAF ranges, tumour purity, and mutation-type fractions based on published COSMIC mutational signatures.

varforge simulate --config base.yaml --preset cancer:melanoma

Preset	Cancer	Dominant Signature	Typical TMB	Purity
`cancer:lung_adeno`	Lung adenocarcinoma	SBS4 (smoking, C>A)	~8 mut/Mb	60%
`cancer:colorectal`	Colorectal (MSS)	SBS1/SBS5 (aging, C>T)	~5 mut/Mb	65%
`cancer:breast_tnbc`	Triple-negative breast	SBS3 (HRD, flat)	~5 mut/Mb	55%
`cancer:melanoma`	Cutaneous melanoma	SBS7a/b (UV, C>T)	~30 mut/Mb	70%
`cancer:aml`	Acute myeloid leukaemia	SBS1/SBS5 (aging)	~1 mut/Mb	80%
`cancer:prostate`	Prostate adenocarcinoma	SBS1/SBS5 (aging)	~2 mut/Mb	50%
`cancer:pancreatic`	Pancreatic ductal	SBS1/SBS5 (aging)	~3 mut/Mb	25%
`cancer:glioblastoma`	Glioblastoma (IDH-wt)	SBS1/SBS5 (aging)	~4 mut/Mb	65%

Example Configs

Ready-to-use example configs are in the examples/ directory.

File	Use case
`examples/minimal.yaml`	Simplest possible simulation (defaults only)
`examples/wgs_30x.yaml`	Standard 30x WGS tumour with random mutations
`examples/panel_umi.yaml`	Targeted panel with inline 8-mer UMI
`examples/cfdna_monitoring.yaml`	cfDNA longitudinal series (4 timepoints)
`examples/ffpe_artifacts.yaml`	FFPE-damaged tumour sample
`examples/tumor_normal.yaml`	Matched tumour/normal pair
`examples/subclonal.yaml`	Four-clone tumour with copy number alterations
`examples/high_depth.yaml`	1000x duplex UMI for low-VAF detection
`examples/custom_mutations.yaml`	Inject specific variants from a VCF file
`examples/twist_duplex_benchmark.yaml`	Twist duplex capture panel with HRD SVs

Output Formats

FASTQ headers

Read names follow the format:

@{SAMPLE}:{CHROM}:{POS}:{READ_NUM}[:UMI={BARCODE}]

Example: @TUMOUR:chr7:55191822:1:UMI=ACGTACGT

The UMI suffix is only present when a umi block is configured and inline: false.
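Because the read name encodes the fragment's origin, it can be parsed to check where a read was simulated from. The Python sketch below is one possible parser for the format above. The regular expression and function name are assumptions, and the pattern will not handle sample or contig names that themselves contain colons (for example some hg38 HLA contigs).

```python
import re

# {SAMPLE}:{CHROM}:{POS}:{READ_NUM} with an optional :UMI={BARCODE} suffix.
# Allowing '-' inside the barcode (for paired UMIs) is an assumption.
READ_NAME = re.compile(
    r"^@?(?P<sample>[^:]+):(?P<chrom>[^:]+):(?P<pos>\d+):(?P<read_num>\d+)"
    r"(?::UMI=(?P<umi>[ACGTN-]+))?$"
)


def parse_read_name(name):
    """Split a VarForge-style read name into its fields; umi is None if absent."""
    match = READ_NAME.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognised read name: {name!r}")
    fields = match.groupdict()
    fields["pos"] = int(fields["pos"])
    fields["read_num"] = int(fields["read_num"])
    return fields


print(parse_read_name("@TUMOUR:chr7:55191822:1:UMI=ACGTACGT"))
print(parse_read_name("@TUMOUR:chr7:55191822:2"))
```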

BAM tags

When BAM output is enabled (output.bam: true), the following tags are written. RX and MI are standard SAM tags; tp and cl are VarForge-specific:

Tag	Type	Description
`RX`	Z	Raw UMI sequence
`MI`	Z	Molecule ID (read family identifier)
`tp`	i	1 if the read carries a simulated somatic variant, 0 otherwise
`cl`	Z	Clone ID the variant was assigned to

Truth VCF fields

The truth VCF written to truth.vcf.gz uses the following INFO fields:

Field	Description
`VAF`	Target variant allele frequency
`CLONE`	Clone ID the variant was assigned to
`CCF`	Cancer cell fraction of the assigned clone
`TYPE`	Variant type: `SNV`, `INDEL`, `MNV`, or `SV`
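These INFO fields can be read with the standard library alone. The Python sketch below parses the INFO column of a single record; the record shown is a made-up example rather than real VarForge output, and the function name is hypothetical.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict, converting VAF and CCF to float."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True  # flag-style keys carry no value
    for key in ("VAF", "CCF"):
        if key in fields:
            fields[key] = float(fields[key])
    return fields


# Hypothetical truth VCF record, for illustration only.
line = "chr7\t55191822\t.\tT\tG\t.\tPASS\tVAF=0.14;CLONE=subclone_a;CCF=0.4;TYPE=SNV"
info = parse_info(line.split("\t")[7])
print(info)
```

To process a whole file, the same function can be applied to each non-header line read through gzip.open("truth.vcf.gz", "rt").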

Use Case Recipes

Benchmarking a somatic variant caller

# Generate matched tumour/normal with known variants.
reference: /data/ref/hg38.fa
output:
  directory: out/caller_bench
  bam: true
  truth_vcf: true
samples:
  - name: TUMOUR
    coverage: 60.0
    tumour_fraction: 1.0
  - name: NORMAL
    coverage: 30.0
    tumour_fraction: 0.0
tumour:
  purity: 0.65
mutations:
  random:
    count: 1000
    vaf_min: 0.05
    vaf_max: 0.60
    snv_fraction: 0.80
    indel_fraction: 0.15
    mnv_fraction: 0.05
seed: 42

Run your caller against TUMOUR/ with NORMAL/ as the matched normal. Evaluate with:

bcftools stats --apply-filters PASS \
    caller_output.vcf.gz truth.vcf.gz
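For a quick precision and recall figure without extra tooling, the comparison can also be sketched in plain Python. The example below matches variants exactly on CHROM, POS, REF and ALT, which assumes both VCFs use the same normalised representation (for example after bcftools norm). The function names and the toy records are hypothetical.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines, one per ALT allele."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            keys.add((chrom, int(pos), ref, alt))
    return keys


def score(truth, calls):
    """Return true/false positive and false negative counts with precision and recall."""
    tp = len(truth & calls)
    fp = len(calls - truth)
    fn = len(truth - calls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Toy records for illustration only.
truth_lines = ["##fileformat=VCFv4.2", "chr1\t100\t.\tA\tT", "chr1\t200\t.\tG\tC", "chr2\t300\t.\tC\tT"]
call_lines = ["chr1\t100\t.\tA\tT", "chr2\t300\t.\tC\tT", "chr3\t400\t.\tG\tA"]
result = score(variant_keys(truth_lines), variant_keys(call_lines))
print(result)
```

Stratifying the same calculation by the VAF or TYPE fields of the truth VCF gives a sensitivity curve per variant class.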

Benchmarking UMI deduplication (fgbio)

Use examples/panel_umi.yaml with inline: true and BAM output enabled, then process the resulting BAM with fgbio:

fgbio ExtractUmisFromBam \
    --input out/panel_umi/PANEL_UMI.bam \
    --output extracted.bam \
    --read-structure 8M+T 8M+T

fgbio GroupReadsByUmi \
    --input extracted.bam \
    --output grouped.bam \
    --strategy paired

fgbio CallMolecularConsensusReads \
    --input grouped.bam \
    --output consensus.bam \
    --min-reads 1

Liquid biopsy sensitivity curve

Generate cfDNA samples at multiple tumour fractions and measure detection rate at each:

for TF in 0.10 0.05 0.01 0.005 0.001; do
    varforge simulate --config examples/cfdna_monitoring.yaml \
        --output-dir out/tf_${TF} \
        --purity ${TF} \
        --seed 42
done

SV signature benchmarking

Generate data with HRD-type structural variants for SV caller evaluation:

mutations:
  random:
    count: 500
    snv_fraction: 0.80
    indel_fraction: 0.15
    mnv_fraction: 0.05
  sv_signature: HRD
  sv_count: 20

FFPE artefact filter development

Use examples/ffpe_artifacts.yaml to generate data with realistic FFPE damage, then evaluate your artefact filter:

True positives: variants present in truth.vcf.gz
Artefacts: C>T calls not in the truth VCF with strand bias (use the tp BAM tag to distinguish)

Performance

VarForge uses a streaming architecture: rayon workers generate reads in parallel and send them through a bounded crossbeam channel to a dedicated writer thread. Memory scales with channel depth, not dataset size.
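The pattern is easiest to see in miniature. The Python sketch below is an analogue of the design, not VarForge's Rust code: threads stand in for rayon workers, queue.Queue stands in for the bounded crossbeam channel, and the batch contents are placeholders. The channel depth of 64 mirrors the performance.output_buffer_regions default.

```python
import queue
import threading

CHANNEL_DEPTH = 64  # mirrors the performance.output_buffer_regions default
SENTINEL = None     # tells the writer that all workers have finished


def worker(regions, channel):
    """Generate a batch of reads per region and push it into the channel."""
    for region in regions:
        batch = [f"{region}:read{i}" for i in range(3)]  # placeholder reads
        channel.put(batch)  # blocks while the channel is full (back-pressure)


def writer(channel, sink):
    """Drain batches from the channel until the sentinel arrives."""
    while True:
        batch = channel.get()
        if batch is SENTINEL:
            break
        sink.extend(batch)


channel = queue.Queue(maxsize=CHANNEL_DEPTH)
written = []
writer_thread = threading.Thread(target=writer, args=(channel, written))
writer_thread.start()

regions = [f"chr22:{i}" for i in range(200)]
workers = [
    threading.Thread(target=worker, args=(regions[i::4], channel))
    for i in range(4)
]
for thread in workers:
    thread.start()
for thread in workers:
    thread.join()
channel.put(SENTINEL)
writer_thread.join()
print(len(written))
```

Because put blocks once 64 batches are waiting, producers can never run arbitrarily far ahead of the writer, which is why the number of in-flight batches is bounded regardless of how many regions are simulated.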

Thread count

varforge simulate --config cfg.yaml --threads 16

Or set in the config:

threads: 16

Restricting scope

For development and testing, restrict to one or a few chromosomes:

chromosomes:
  - chr22

Or to a BED file of target regions:

regions_bed: /data/panels/hotspot_panel.bed

Approximate runtimes (8-core laptop, hg38)

Scenario	Coverage	Mutations	Time
`small` preset	1x, chr22	100	~30 s
Panel (chr7, chr12, chr17)	500x	50	~2 min
WGS 30x	30x, all	5000	~25 min
WGS 60x	60x, all	5000	~50 min
Ultra-deep panel 1000x	1000x, 3 chroms	50	~8 min

Memory usage

Peak memory is approximately:

2 x read_length x threads x (coverage / 30) MB

For 30x WGS with 150 bp reads on 8 threads: ~600 MB. For 1000x panel on 8 threads: ~200 MB (limited region).

Comparison with Other Tools

Scenario	Recommended tool
Realistic Illumina base-quality profiles	VarForge (parametric or `learn-profile`)
Controlled somatic variant spike-in	VarForge or BAMSurgeon
Spike into a real patient BAM	BAMSurgeon (preserves real read background)
Simple read generation, no mutations	ART or NanoSim
Whole-genome de novo simulation	NEAT or VarForge
cfDNA / liquid biopsy data	VarForge (only tool with native cfDNA model)
UMI-tagged duplex sequencing	VarForge (only tool with native duplex model)
FFPE artefact simulation	VarForge (only tool with FFPE + oxoG model)
Multi-sample longitudinal series	VarForge (only tool with native time-series)
Structural variant signatures	VarForge (only tool with HRD/TDP/chromothripsis models)
Long-read data	VarForge (log-normal fragment model) or NanoSim

VarForge is the right choice when you need a complete, reproducible, ground-truth dataset for pipeline benchmarking and do not have access to real patient sequencing data. BAMSurgeon is the better choice when you need to insert a small number of variants into an existing real-data BAM while preserving the authentic read background.

Licence

MIT
