A comprehensive guide to parallelized variant calling using the scatter-gather approach for high-throughput whole genome sequencing analysis.
- Overview
- What is Scatter-Gather?
- Why Scatter-Gather for WGS?
- Architecture
- Advantages
- Implementation
- Performance Comparison
- Best Practices
- Tools & Compatibility
- Common Pitfalls
- Example Workflows
- Resources
Whole Genome Sequencing (WGS) generates massive datasets requiring intensive computational resources for variant calling:
- Human genome: ~3 billion base pairs
- Typical WGS: 30-100x coverage
- Data size: 50-200 GB per sample
- Processing time: 12-48 hours single-threaded
Scatter-gather parallelization divides the genome into independent intervals, processes them in parallel, and merges the results - dramatically reducing runtime while maintaining accuracy.
Scatter-gather is a parallel processing pattern that:
SCATTER
↓
┌──────────────────┼──────────────────┐
↓ ↓ ↓
[Interval 1] [Interval 2] ... [Interval N]
↓ ↓ ↓
[Process 1] [Process 2] ... [Process N]
↓ ↓ ↓
[Output 1] [Output 2] ... [Output N]
└──────────────────┼──────────────────┘
↓
GATHER
↓
[Final Output]
- SCATTER: Split genome into independent intervals
- PROCESS: Call variants on each interval in parallel
- GATHER: Merge interval results into final output
Without Scatter-Gather:
Single Process: [========================================] 24 hours
One CPU core
With Scatter-Gather (50 intervals):
Process 1: [====] 30 minutes
Process 2: [====] 30 minutes
Process 3: [====] 30 minutes
...
Process 50: [====] 30 minutes
↓
Total Time: 30-45 minutes (with 50 cores)
Speedup: 30-50x faster with proportional compute resources
| Approach | Memory Required | Duration |
|---|---|---|
| Whole genome | 60-100 GB | 24 hours |
| Per interval (50 intervals) | 2-4 GB | 30 mins |
Benefit: Fits within typical HPC node memory limits
- Efficient HPC utilization: Matches cluster job scheduler paradigms
- Fault tolerance: Failed intervals can be re-run independently
- Flexibility: Scale parallelization to available resources
- Reduced memory overhead: Smaller data chunks in RAM
- Better caching: Interval-sized data fits in CPU cache
- Lower peak disk I/O: Distributed read/write operations
chr1 [=================]
chr2 [==============]
chr3 [=============]
...
chrX [======]
chrY [==]
Pros: Simple, ~25 intervals
Cons: Uneven workload (chr1 is 8x larger than chr21)
chr1 [===][===][===][===][===]...
chr2 [===][===][===][===]...
chr3 [===][===][===]...
Pros: Even workload distribution
Cons: More intervals to manage (~50-200)
Genome: [===] [======] [===] [========] [==]
| | | | |
Intervals: [===] [======] [===] [========] [==]
(skip N regions, centromeres, gaps)
Pros:
- Skip non-callable regions (N-regions, centromeres)
- Optimal computational efficiency
- Even workload with ~50-100 intervals
Cons: Requires interval list generation
project/
├── intervals/
│ ├── interval_001.bed
│ ├── interval_002.bed
│ └── ...
│
├── scatter/
│ ├── sample_interval_001.vcf
│ ├── sample_interval_002.vcf
│ └── ...
│
└── gather/
└── sample_final.vcf
| Sample Type | Serial | Scatter-Gather (50 intervals) | Speedup |
|---|---|---|---|
| WGS 30x | 24 hours | 30-45 minutes | 32-48x |
| WGS 60x | 48 hours | 60-90 minutes | 32-48x |
| WGS 100x | 72 hours | 90-120 minutes | 36-48x |
Assumes 50 parallel cores available
Traditional Approach:
Node 1: [====================] 100% utilized
Node 2: [ ] 0% utilized
Node 3: [ ] 0% utilized
Total Efficiency: 33%
Scatter-Gather Approach:
Node 1: [========] intervals 1-20
Node 2: [========] intervals 21-40
Node 3: [========] intervals 41-50
Total Efficiency: 90%+
Serial Processing:
Progress: [================X] CRASH
↓
Start over from beginning
Lost: 20 hours of work
Scatter-Gather:
Interval 15: [===X] FAILED
↓
Re-run only interval 15
Lost: 30 minutes of work
10 cores → 50 intervals → 2 hours
25 cores → 50 intervals → 1 hour
50 cores → 50 intervals → 30 minutes
100 cores → 50 intervals → 30 minutes (I/O bound)
Scale to match available infrastructure!
Variant calling memory usage:
| Genome Coverage | Serial Memory | Scatter Memory (per job) |
|---|---|---|
| 30x WGS | 64 GB | 2-4 GB |
| 60x WGS | 96 GB | 3-6 GB |
| 100x WGS | 128 GB | 4-8 GB |
Result: Can process high-coverage WGS on standard compute nodes
HPC schedulers prefer many small jobs over few large jobs:
- Shorter queue times: Small jobs schedule faster
- Better backfilling: Fills scheduler gaps efficiently
- Higher priority: Many short jobs vs few long jobs
gatk SplitIntervals \
-R reference.fasta \
-L regions.bed \
--scatter-count 50 \
--subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION \
-O intervals/Parameters:
--scatter-count: Number of intervals (typically 50-100)--subdivision-mode: How to splitBALANCING_WITHOUT_INTERVAL_SUBDIVISION: Even distribution (recommended)INTERVAL_SUBDIVISION: Can split single regionsBALANCING_WITHOUT_INTERVAL_SUBDIVISION_WITH_OVERFLOW: Handles remainder
# Create intervals by chromosome
for chr in {1..22} X Y; do
echo "chr${chr}" > intervals/chr${chr}.bed
done#!/bin/bash
#SBATCH --job-name=variant_calling
#SBATCH --array=1-50 # 50 intervals
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
#SBATCH --time=2:00:00
# Get interval file for this array task
INTERVAL="intervals/interval_$(printf "%03d" $SLURM_ARRAY_TASK_ID).bed"
# Call variants on this interval
gatk HaplotypeCaller \
-R reference.fasta \
-I sample.bam \
-L ${INTERVAL} \
-O scatter/sample_interval_${SLURM_ARRAY_TASK_ID}.vcfJob submission:
sbatch scatter_variant_calling.sh
# Launches 50 parallel jobs automatically# Generate commands file
for i in {1..50}; do
echo "gatk HaplotypeCaller -R ref.fa -I sample.bam \
-L intervals/interval_${i}.bed \
-O scatter/sample_interval_${i}.vcf"
done > commands.txt
# Execute in parallel
parallel -j 50 < commands.txt# Create input list
ls scatter/sample_interval_*.vcf > vcf_list.txt
# Merge VCFs
gatk GatherVcfs \
-I vcf_list.txt \
-O final/sample_complete.vcf.gz# Index all interval VCFs
for vcf in scatter/*.vcf; do
bgzip -c $vcf > ${vcf}.gz
tabix -p vcf ${vcf}.gz
done
# Concatenate in order
bcftools concat \
scatter/sample_interval_*.vcf.gz \
-O z \
-o final/sample_complete.vcf.gz
# Index final VCF
tabix -p vcf final/sample_complete.vcf.gz| Method | Intervals | Cores | Time | Cost* |
|---|---|---|---|---|
| Serial | 1 | 1 | 24 hours | $24 |
| Scatter-Gather | 25 (chr) | 25 | 1.5 hours | $37.50 |
| Scatter-Gather | 50 | 50 | 45 minutes | $37.50 |
| Scatter-Gather | 100 | 100 | 30 minutes | $50 |
*Assuming $1/core-hour
Optimal: 50 intervals with 50 cores
- Cost: ~50% increase
- Time: 96% reduction
- Result: Worth it for production workflows!
| Method | Time | Peak Memory |
|---|---|---|
| Serial | 48 hours | 96 GB |
| Scatter-Gather (50) | 90 minutes | 6 GB/job |
Key Advantage: Fits on standard compute nodes!
Speedup vs Number of Intervals (30x WGS)
50x | ┌────────
| ┌───┘
40x | ┌───┘
| ┌──┘
30x | ┌──┘
| ┌──┘
20x | ┌─┘
|─┘
10x |
└─────┴─────┴─────┴─────┴─────┴────
10 25 50 100 200 500
Number of Intervals
Optimal: 50-100 intervals
Diminishing returns: >100 intervals (I/O bound)
Guidelines:
- WGS: 50-100 intervals
- WES: 10-25 intervals (less data)
- Targeted panels: 1-10 intervals
Calculate optimal intervals:
Optimal = Available Cores × 0.8
(Leave 20% overhead for I/O and gather)
Example: 64-core HPC → 50 intervals
✅ DO:
- Use GATK SplitIntervals for even distribution
- Respect gene boundaries for functional analyses
- Include padding around intervals (100-200 bp)
❌ DON'T:
- Split in the middle of genes
- Create overlapping intervals
- Use uneven interval sizes
Per interval job:
#SBATCH --mem=4G # Start with 4GB
#SBATCH --cpus-per-task=2 # 2 CPUs per task
#SBATCH --time=2:00:00 # 2 hours (buffer)Adjust based on:
- Coverage depth (↑ coverage → ↑ memory)
- Variant density (high variation → more memory)
- Tool used (HaplotypeCaller > Mutect2 > FreeBayes)
Problem: Variants at interval boundaries
Solution 1: Interval padding
gatk HaplotypeCaller \
-L interval.bed \
--interval-padding 100 \ # 100bp padding
-O output.vcfSolution 2: Overlapping intervals
Interval 1: [========>]
Interval 2: [<========>]
Interval 3: [<========]
^^^
Overlap region
Check after gathering:
# 1. Verify all intervals present
expected=50
found=$(ls scatter/*.vcf | wc -l)
if [ $found -ne $expected ]; then
echo "ERROR: Missing intervals!"
exit 1
fi
# 2. Check final VCF integrity
bcftools stats final/sample.vcf.gz > stats.txt
# 3. Compare variant counts
total_scattered=$(grep "number of records" scatter/*stats | \
awk '{sum+=$NF} END {print sum}')
total_gathered=$(bcftools view -H final/sample.vcf.gz | wc -l)
if [ $total_scattered -ne $total_gathered ]; then
echo "WARNING: Variant count mismatch!"
fiImplement retry logic:
#!/bin/bash
MAX_RETRIES=3
RETRY_COUNT=0
while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do
gatk HaplotypeCaller ... && break
RETRY_COUNT=$((RETRY_COUNT + 1))
echo "Retry $RETRY_COUNT of $MAX_RETRIES"
sleep 60
done
if [ $RETRY_COUNT -eq $MAX_RETRIES ]; then
echo "FAILED after $MAX_RETRIES attempts"
exit 1
fi| Tool | Scatter Support | Gather Method | Notes |
|---|---|---|---|
| GATK HaplotypeCaller | ✅ Native | GatherVcfs | Recommended |
| GATK Mutect2 | ✅ Native | GatherVcfs | Somatic calling |
| FreeBayes | ✅ Manual | bcftools concat | Requires -L flag |
| DeepVariant | ✅ Native | concat/merge | Built-in support |
| Strelka2 | ✅ Manual | Requires merging | Region-based |
| DRAGEN | ✅ Native | Built-in | Hardware accelerated |
WDL (Workflow Description Language):
workflow ScatterGatherVC {
scatter (interval in intervals) {
call HaplotypeCaller {
input:
bam = input_bam,
interval = interval
}
}
call GatherVcfs {
input: vcfs = HaplotypeCaller.output_vcf
}
}Nextflow:
process haplotypeCaller {
input:
each interval from intervals_ch
output:
file "*.vcf" into vcf_ch
script:
"""
gatk HaplotypeCaller -L ${interval} ...
"""
}
process gatherVcfs {
input:
file vcfs from vcf_ch.collect()
output:
file "final.vcf"
script:
"""
gatk GatherVcfs -I ${vcfs} -O final.vcf
"""
}Snakemake:
rule scatter:
input:
bam="sample.bam",
interval="intervals/interval_{i}.bed"
output:
"scatter/interval_{i}.vcf"
shell:
"gatk HaplotypeCaller -L {input.interval} ..."
rule gather:
input:
expand("scatter/interval_{i}.vcf", i=range(1, 51))
output:
"final.vcf"
shell:
"gatk GatherVcfs -I {input} -O {output}"Problem:
1000 intervals × 2 minutes = 2000 minutes overhead
Result: Slower than serial!
Solution: Aim for 50-100 intervals (20-40 minutes each)
Problem:
chr1: [====================] 4 hours
chr21: [===] 20 minutes
Total time: 4 hours (limited by longest)
Solution: Use BALANCING_WITHOUT_INTERVAL_SUBDIVISION mode
Problem:
Process killed: Out of memory
Allocated: 4GB
Required: 8GB (high coverage region)
Solution:
- Profile memory usage on test intervals
- Add 50% buffer
- Monitor peak memory in production
Problem:
Interval 1: [...===X| ] Variant at boundary
Interval 2: [ |===... ] Same variant
Result: Variant called twice or missed!
Solution: Use interval padding (100-200 bp)
Problem:
Expected: 4,500,000 variants
Got: 4,499,850 variants
Missing: 150 variants (from failed interval)
Solution: Always validate:
- Interval completion
- Variant counts
- VCF integrity
- File checksums
#!/bin/bash
# Complete scatter-gather WGS variant calling pipeline
set -euo pipefail
# Configuration
SAMPLE="SampleX"
BAM="input/${SAMPLE}.bam"
REF="reference/hg38.fa"
INTERVALS_DIR="intervals"
SCATTER_DIR="scatter"
FINAL_DIR="final"
N_INTERVALS=50
# Step 0: Setup
mkdir -p ${INTERVALS_DIR} ${SCATTER_DIR} ${FINAL_DIR}
# Step 1: Generate intervals
echo "Generating ${N_INTERVALS} intervals..."
gatk SplitIntervals \
-R ${REF} \
--scatter-count ${N_INTERVALS} \
--subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION \
-O ${INTERVALS_DIR}
# Step 2: Scatter - Launch parallel jobs
echo "Launching ${N_INTERVALS} parallel variant calling jobs..."
for i in $(seq -f "%04g" 1 ${N_INTERVALS}); do
INTERVAL="${INTERVALS_DIR}/${i}-scattered.interval_list"
OUTPUT="${SCATTER_DIR}/${SAMPLE}_${i}.vcf"
sbatch --job-name=vc_${i} \
--mem=8G \
--cpus-per-task=2 \
--time=2:00:00 \
--output=logs/vc_${i}.log \
--wrap="gatk HaplotypeCaller \
-R ${REF} \
-I ${BAM} \
-L ${INTERVAL} \
-O ${OUTPUT}"
done
# Step 3: Wait for all jobs to complete
echo "Waiting for variant calling jobs to complete..."
while [ $(squeue -u $USER -n vc_ -h | wc -l) -gt 0 ]; do
sleep 60
echo "Jobs remaining: $(squeue -u $USER -n vc_ -h | wc -l)"
done
# Step 4: Validate all intervals completed
echo "Validating interval completion..."
EXPECTED=${N_INTERVALS}
COMPLETED=$(ls ${SCATTER_DIR}/*.vcf | wc -l)
if [ ${COMPLETED} -ne ${EXPECTED} ]; then
echo "ERROR: Expected ${EXPECTED} VCFs, found ${COMPLETED}"
exit 1
fi
# Step 5: Gather - Merge VCFs
echo "Gathering VCFs..."
ls ${SCATTER_DIR}/*.vcf > vcf_list.txt
gatk GatherVcfs \
-I vcf_list.txt \
-O ${FINAL_DIR}/${SAMPLE}.vcf.gz
# Step 6: Index final VCF
echo "Indexing final VCF..."
tabix -p vcf ${FINAL_DIR}/${SAMPLE}.vcf.gz
# Step 7: Generate statistics
echo "Generating statistics..."
bcftools stats ${FINAL_DIR}/${SAMPLE}.vcf.gz > ${FINAL_DIR}/${SAMPLE}.stats.txt
# Step 8: Quality control
echo "Running QC checks..."
TOTAL_VARIANTS=$(bcftools view -H ${FINAL_DIR}/${SAMPLE}.vcf.gz | wc -l)
echo "Total variants called: ${TOTAL_VARIANTS}"
if [ ${TOTAL_VARIANTS} -lt 3000000 ]; then
echo "WARNING: Low variant count (expected ~3-5M for WGS)"
fi
echo "Pipeline complete!"
echo "Final VCF: ${FINAL_DIR}/${SAMPLE}.vcf.gz"Adjust intervals based on genomic features:
# Increase intervals in high-variation regions
HIGH_VAR_REGIONS="MHC, HLA, KIR" # Immune regions
LOW_VAR_REGIONS="deserts" # Gene deserts
# 2x density in high-variation regions
gatk SplitIntervals \
-R reference.fa \
-L ${HIGH_VAR_REGIONS} \
--scatter-count 100 \
-O intervals/high_var/
gatk SplitIntervals \
-R reference.fa \
-XL ${HIGH_VAR_REGIONS} \
--scatter-count 50 \
-O intervals/standard/Distribute intervals by complexity:
Simple intervals → Low-memory nodes (2 GB)
Complex intervals → High-memory nodes (16 GB)
# Classify intervals by expected memory
for interval in intervals/*.bed; do
coverage=$(calculate_coverage ${interval})
if [ ${coverage} -gt 100 ]; then
# High coverage → high memory
sbatch --mem=16G variant_call.sh ${interval}
else
# Standard coverage → standard memory
sbatch --mem=4G variant_call.sh ${interval}
fi
doneSymptoms:
Job exceeded time limit
Killed at: 1:59:59 / 2:00:00
Solutions:
- Increase time limit
- Reduce interval size (more intervals)
- Profile long-running intervals
Symptoms:
java.lang.OutOfMemoryError: Java heap space
Solutions:
# Increase Java heap
--java-options "-Xmx6G"
# Increase job memory
#SBATCH --mem=8G
# Monitor actual usage:
sacct -j JOBID --format=MaxRSSSymptoms:
Scattered: 4,500,000 variants
Gathered: 4,499,500 variants
Missing: 500 variants
Debug:
# Check for failed intervals
for i in {1..50}; do
if [ ! -f scatter/interval_${i}.vcf ]; then
echo "Missing: interval ${i}"
fi
done
# Validate each interval VCF
for vcf in scatter/*.vcf; do
gatk ValidateVariants -V ${vcf} -R reference.fa || echo "Invalid: ${vcf}"
doneSpeedup Factor:
Speedup = Serial Time / Parallel Time
Example:
Serial: 24 hours
Parallel: 45 minutes
Speedup = 24 * 60 / 45 = 32x
Efficiency:
Efficiency = Speedup / Number of Cores
Example:
Speedup: 32x
Cores: 50
Efficiency = 32 / 50 = 0.64 (64%)
Target: >60% efficiency is good for WGS
# Track job progress
watch -n 10 'squeue -u $USER | grep vc_'
# Monitor resource usage
sstat -j JOBID --format=AveCPU,MaxRSS,AveVMSize
# Post-job analysis
sacct -j JOBID --format=JobID,MaxRSS,Elapsed,StateSetup:
- Tumor/normal pairs (60x coverage)
- Need same-day results for treatment decisions
Challenge:
60x tumor: 48 hours serial
60x normal: 48 hours serial
Total: 96 hours (4 days) - UNACCEPTABLE
Solution:
Scatter-gather (100 intervals, 100 cores):
Tumor: 60 minutes
Normal: 60 minutes
Total: 120 minutes (2 hours) ✓
Plus analysis time: 4 hours
Total turnaround: 6 hours (same day)
Outcome: Enabled precision medicine in clinical workflow
Traditional CPU: 50 intervals × 45 minutes = 37.5 hours compute
GPU-accelerated: Single GPU = 15 minutes
Example: NVIDIA Parabricks
- ~80x faster than CPU
- Higher cost per hour, but total cost lower
- Best for ultra-high throughput
ML Model: Predict optimal interval size based on:
- Coverage distribution
- Variant density
- Reference complexity
Result: Dynamic interval generation
→ Better load balancing
→ 20-30% additional speedup
✅ Scatter-gather is essential for WGS in production environments
✅ 30-50x speedup with proper implementation
✅ Better resource utilization on HPC clusters
✅ Enables clinical turnaround times (hours instead of days)
✅ Cost-effective for high-throughput projects
| Use Case | Recommendation |
|---|---|
| Single WGS sample | Optional (convenience) |
| 10+ WGS samples | Recommended |
| Clinical workflow | Essential |
| Population studies (1000+) | Essential |
| Time-sensitive | Essential |
- Start small: Test with 10-25 intervals
- Profile your data: Measure time/memory per interval
- Scale up: Optimize interval count for your infrastructure
- Automate: Use workflow managers (Nextflow, Snakemake, WDL)
- Monitor: Track efficiency and optimize
- GATK Best Practices: https://gatk.broadinstitute.org/
- Cromwell (WDL): https://cromwell.readthedocs.io/
- Nextflow: https://www.nextflow.io/
- Snakemake: https://snakemake.readthedocs.io/
- GATK: Variant calling suite
- Picard: VCF manipulation
- BCFtools: VCF processing
- Hail: Large-scale genomics (Python/Spark)
- GATK Forum: https://gatk.broadinstitute.org/hc/en-us/community/topics
- Biostars: https://www.biostars.org/
- SEQanswers: http://seqanswers.com/
- McKenna et al. (2010). "The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data." Genome Research
- Poplin et al. (2018). "Scaling accurate genetic variant discovery to tens of thousands of samples." bioRxiv
- Van der Auwera & O'Connor (2020). "Genomics in the Cloud." O'Reilly Media
If you use scatter-gather approaches in your research, please cite:
@article{mckenna2010genome,
title={The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data},
author={McKenna, Aaron and Hanna, Matthew and Banks, Eric and others},
journal={Genome research},
volume={20},
number={9},
pages={1297--1303},
year={2010}
}Contributions are welcome! Please submit issues or pull requests.
- Additional workflow examples
- Cloud platform integrations
- Performance benchmarks
- Tool-specific implementations
This documentation is provided under the MIT License. See LICENSE file for details.
- GATK team at Broad Institute
- HPC communities worldwide
- Bioinformatics researchers advancing parallel genomics
Last Updated: November 2025
Version: 1.0
Maintainers: Bioinformatics Community
┌─────────────────────────────────────────────────────────┐
│ SCATTER-GATHER QUICK REFERENCE │
├─────────────────────────────────────────────────────────┤
│ │
│ Optimal Intervals: 50-100 for WGS │
│ Memory per job: 4-8 GB │
│ Time per interval: 20-40 minutes │
│ Expected speedup: 30-50x │
│ │
│ SCATTER: gatk SplitIntervals --scatter-count 50 │
│ PROCESS: parallel variant calling (array jobs) │
│ GATHER: gatk GatherVcfs or bcftools concat │
│ │
│ Always: - Validate completion │
│ - Check variant counts │
│ - Use interval padding │
│ - Monitor resource usage │
│ │
└─────────────────────────────────────────────────────────┘
END OF DOCUMENT