An automated toolkit for running Apache Spark TPC-DS performance benchmarks on Amazon EMR with different JVM runtimes. This project allows you to compare performance between Azul Zing, Azul Zulu, and Amazon Corretto JVMs using standardized TPC-DS workloads.
- Infrastructure as Code: Complete Terraform automation for EMR cluster provisioning
- Multi-JVM Support: Compare Azul Zing, Azul Zulu, and Amazon Corretto performance
- Flexible Networking: Works with existing VPCs or creates new infrastructure
- Configurable Profiles: Pre-defined configurations for different test scenarios
- Monitoring Tools: Built-in scripts for job status and cluster monitoring
- Automated Bootstrap: Custom JDK installation and performance tuning
- AWS CLI (v2.0+)

  ```bash
  # Install AWS CLI (macOS)
  curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
  sudo installer -pkg AWSCLIV2.pkg -target /

  # Configure with your credentials
  aws configure
  ```

- Terraform CLI (v1.12.0+)

  ```bash
  # macOS
  brew install terraform

  # Linux
  wget https://releases.hashicorp.com/terraform/1.12.0/terraform_1.12.0_linux_amd64.zip
  unzip terraform_1.12.0_linux_amd64.zip
  sudo mv terraform /usr/local/bin/
  ```

- jq (for JSON processing in monitoring scripts)

  ```bash
  # macOS
  brew install jq

  # Linux
  sudo apt-get install jq   # Ubuntu/Debian
  sudo yum install jq       # CentOS/RHEL
  ```

- uv (for Python dependency management and script execution)

  ```bash
  # macOS and Linux
  curl -LsSf https://astral.sh/uv/install.sh | sh

  # Or with pip
  pip install uv

  # Or with Homebrew (macOS)
  brew install uv
  ```
- AWS Account: With appropriate permissions for EMR, EC2, VPC, and S3
- EC2 Key Pair: For SSH access to cluster nodes
- S3 Bucket: For storing benchmark data, logs, and bootstrap scripts
- IAM Permissions: Ensure your AWS credentials have permissions for:
- EMR cluster creation and management
- EC2 instance management
- VPC and networking (if creating new VPC)
- S3 bucket access
- IAM role creation for EMR service roles
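As a quick sanity check before provisioning, you can confirm that your credentials resolve and that the EMR API is reachable (a minimal sketch; the region is an example):

```bash
# Confirm the active AWS identity (account and role/user)
aws sts get-caller-identity

# Confirm EMR API access in the target region
aws emr list-clusters --region us-west-2
```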
```bash
git clone <repository-url>
cd emr-tpcds
```

Create an S3 bucket for your benchmarks (replace `your-benchmark-bucket` with your actual bucket name):
```bash
# Create S3 bucket
aws s3 mb s3://your-benchmark-bucket --region us-west-2

# Set environment variable for convenience
export YOUR_S3_BUCKET=your-benchmark-bucket
```

Important: This repository does not include TPC-DS data generation. You need to:
- Generate TPC-DS data using official TPC-DS tools or Spark's data generation utilities
- Upload the data to `s3://your-bucket/data/sf15000-parquet/` (for the 15 TB dataset)
- Ensure the data is partitioned appropriately for optimal Spark performance
We generated the 15 TB dataset with this tool: single-node data generator. It is also possible to do it in a more Spark-native way with this. The benchmark JAR was built from the instructions here.
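Once the data is uploaded, a quick sanity check of the layout can look like this (bucket name and path are placeholders for your own):

```bash
# List the table/partition directories of the generated dataset
aws s3 ls s3://your-bucket/data/sf15000-parquet/

# Report total object count and size
aws s3 ls s3://your-bucket/data/sf15000-parquet/ --recursive --summarize | tail -2
```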
If using custom JDKs (Azul Zing or Zulu), you need to prepare download URLs:
- Zulu:
  - Visit Azul Downloads
  - Select: Java 17, Linux, ARM 64-bit, `.tar.gz` format
  - Copy the download URL
- Zing:
  - Contact Azul Systems for access to the Zing JDK
  - Obtain a download URL for the Linux ARM64 `.tar.gz` format
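Before wiring a URL into Terraform, it is worth confirming from a shell that it actually downloads (a sketch; the Zulu URL matches the example below, and the Zing URL is whatever Azul provides):

```bash
# A 200 response with a Content-Length indicates a usable tarball URL
curl -fIL "https://cdn.azul.com/zulu/bin/zulu17.50.19-ca-jdk17.0.11-linux_aarch64.tar.gz"
```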
Edit the JDK URLs in `variables.tf`:

```hcl
locals {
  default_tar_urls = {
    zulu    = "https://cdn.azul.com/zulu/bin/zulu17.50.19-ca-jdk17.0.11-linux_aarch64.tar.gz"
    zing    = "https://your-repository.com/path/to/zing-jdk17-linux_aarch64.tar.gz"
    default = ""
  }
}
```

Create or edit `terraform.tfvars`:
```hcl
# Required variables
log_bucket     = "your-benchmark-bucket"
scripts_bucket = "your-benchmark-bucket"
ssh_key_name   = "your-ec2-key-name"

# JVM runtime selection
runtime_variant = "zing" # Options: "zing", "zulu", "default"

# Instance configuration
master_instance_type = "c8gd.2xlarge"
core_instance_type   = "c8gd.8xlarge"
core_instance_count  = 3

# Optional: Use existing VPC (both required if using existing)
# existing_vpc_id    = "vpc-1234567890abcdef0"
# existing_subnet_id = "subnet-1234567890abcdef0"

# Optional: Custom region and networking
# aws_region  = "us-west-2"
# vpc_cidr    = "10.20.0.0/16"
# subnet_cidr = "10.20.0.0/24"
```

```bash
# Initialize Terraform
terraform init

# Plan deployment (optional)
terraform plan

# Deploy cluster
terraform apply

# Save cluster ID
export CURRENT_SPARK_CLUSTER_ID=$(terraform output -raw cluster_id)
echo "Cluster ID: $CURRENT_SPARK_CLUSTER_ID"
```

```bash
# Run with default configuration
./run_test.sh $CURRENT_SPARK_CLUSTER_ID your-benchmark-bucket

# Run with custom profile
./run_test.sh $CURRENT_SPARK_CLUSTER_ID your-benchmark-bucket --profile configs/15tb-full-run-zing.profile

# Run with environment variable for bucket (set YOUR_S3_BUCKET)
./run_test.sh $CURRENT_SPARK_CLUSTER_ID --profile configs/15tb-full-run-zing.profile
```

```bash
# Check job status
./status.sh

# Monitor cluster resources
./hosts.sh $CURRENT_SPARK_CLUSTER_ID

# Or with environment variable
CURRENT_SPARK_CLUSTER_ID=$CURRENT_SPARK_CLUSTER_ID ./hosts.sh
```

```bash
# Destroy cluster when done
terraform destroy
```

| Variable | Description | Default | Required |
|---|---|---|---|
| `log_bucket` | S3 bucket for EMR logs | `"your-s3-bucket-name"` | Yes |
| `scripts_bucket` | S3 bucket for bootstrap scripts | `"your-s3-bucket-name"` | Yes |
| `ssh_key_name` | EC2 key pair name | `"your-ssh-key-name"` | Yes |
| `runtime_variant` | JVM runtime (`"zing"`, `"zulu"`, `"default"`) | `"default"` | No |
| `existing_vpc_id` | Use existing VPC | `null` | No |
| `existing_subnet_id` | Use existing subnet | `null` | No |
| `master_instance_type` | Master node instance type | `"c7gd.2xlarge"` | No |
| `core_instance_type` | Core node instance type | `"c7gd.2xlarge"` | No |
| `core_instance_count` | Number of core instances | `1` | No |
| `aws_region` | AWS region | `"us-west-2"` | No |
If you have an existing VPC and subnet, specify both:
```hcl
existing_vpc_id    = "vpc-1234567890abcdef0"
existing_subnet_id = "subnet-1234567890abcdef0"
```

Important:

- Both variables must be provided together
- The subnet must have internet access (associated with a route table that has an internet gateway route)
- The subnet must have sufficient IP addresses for your cluster size
- The VPC must have DNS hostnames and DNS resolution enabled
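These requirements can be checked from the CLI before running `terraform apply` (a sketch, assuming the example IDs above):

```bash
# DNS hostnames and DNS resolution must both be enabled on the VPC
aws ec2 describe-vpc-attribute --vpc-id vpc-1234567890abcdef0 --attribute enableDnsHostnames
aws ec2 describe-vpc-attribute --vpc-id vpc-1234567890abcdef0 --attribute enableDnsSupport

# The subnet must have enough free IP addresses for the cluster
aws ec2 describe-subnets --subnet-ids subnet-1234567890abcdef0 \
  --query 'Subnets[0].AvailableIpAddressCount'
```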
Profiles are bash scripts in the `configs/` directory that override default benchmark settings:
| Variable | Description | Example |
|---|---|---|
| `SPARK_EXECUTOR_INSTANCES` | Number of Spark executors | `"11"` |
| `SPARK_EXECUTOR_MEMORY` | Memory per executor | `"10g"` |
| `SPARK_EXECUTOR_CORES` | CPU cores per executor | `"8"` |
| `INPUT_PATH` | S3 path to TPC-DS data | `"s3://bucket/data/sf15000-parquet"` |
| `OUTPUT_PATH` | S3 path for results | `"s3://bucket/logs/TEST-15TB-RESULT"` |
| `ITERATIONS` | Number of benchmark iterations | `"3"` |
| `TPCDS_QUERIES` | Comma-separated query list | `"q1-v2.4,q2-v2.4,..."` |
| `CURR_OPT_CONF` | JVM-specific options | `"-XX:ActiveProcessorCount=8"` |
```bash
# Copy existing profile
cp configs/15tb-full-run-default.profile configs/my-custom.profile

# Edit configuration
vi configs/my-custom.profile

# Use custom profile
./run_test.sh $CLUSTER_ID $BUCKET --profile configs/my-custom.profile
```

- Status Monitoring (`status.sh`):
  - Shows EMR step status and timing
  - Requires the `CURRENT_SPARK_CLUSTER_ID` environment variable
- Host Monitoring (`hosts.sh`):
  - Shows CPU, memory, and disk usage per node
  - Supports parallel SSH connections for faster execution
```bash
# Configure SSH settings
export SSH_USER=ec2-user
export SSH_KEY=~/.ssh/your-key.pem
export PARALLEL=1 # Enable concurrent SSH
export JOBS=8     # Max concurrent connections

# Run monitoring
./hosts.sh $CLUSTER_ID
```

- EMR Service Logs: `s3://your-bucket/logs/emr/cluster-id/`
- Application Logs: `s3://your-bucket/logs/emr/cluster-id/containers/`
- Bootstrap Logs: `/tmp/bootstrap.log` on cluster nodes
- Benchmark Results: `s3://your-bucket/logs/TEST-*-RESULT/`
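To pull a cluster's logs down from S3 for inspection, something like the following works (a sketch; the `steps/` prefix follows EMR's standard log layout, and EMR writes its log files gzip-compressed):

```bash
# Copy the cluster's step logs locally
aws s3 sync "s3://your-bucket/logs/emr/$CURRENT_SPARK_CLUSTER_ID/steps/" ./emr-step-logs/

# Search the compressed logs for errors
find ./emr-step-logs -name '*.gz' -exec zgrep -li "error" {} +
```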
- Cluster Creation Fails:
  - Check AWS service limits (EC2 instances, VPC limits)
  - Verify IAM permissions
  - Ensure the availability zone has the requested instance types
- Bootstrap Failures:
  - Check JDK download URL accessibility
  - Verify S3 bucket permissions for the bootstrap script
  - Review `/tmp/bootstrap.log` on cluster nodes
- Job Execution Failures:
  - Verify TPC-DS data exists at the specified S3 path
  - Check that Spark configuration parameters match cluster capacity
  - Review EMR step logs in the AWS Console
- SSH Access Issues:
  - Ensure the security group allows SSH (port 22) from your IP
  - Verify the EC2 key pair is correct
  - Check VPC/subnet internet gateway configuration
- c8gd.2xlarge: ~$0.77/hour
- c8gd.8xlarge: ~$3.07/hour
- c7gd.2xlarge: ~$0.69/hour
- Hourly: ~$10/hour
- Full benchmark (3-4 hours): ~$30-40
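The hourly figure follows from the example cluster shape (1 × c8gd.2xlarge master + 3 × c8gd.8xlarge core nodes):

```bash
# Back-of-the-envelope check using the on-demand rates above
echo "0.77 + 3 * 3.07" | bc   # => 9.98, i.e. ~$10/hour
```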
Key Features:
- Concurrent garbage collection (C4) with sub-millisecond pause times
- ReadyNow warm-up elimination technology
- Automatic heap management
Essential JVM flags for TPC-DS:

```
-XX:ActiveProcessorCount=$SPARK_EXECUTOR_CORES
-XX:TopTierCompileThresholdTriggerMillis=60000
-XX:ProfileLogIn=/mnt/vmoutput/tpcds-15tb-full-gen1.profile
```

ReadyNow Profile Training:
For production benchmarks, you must train ReadyNow profiles to eliminate JVM warm-up variance. Pre-configured profiles are available:
- `configs/15tb-zing-gen0.profile` - Initial profile collection
- `configs/15tb-zing-gen1.profile` - Refined profile training
- `configs/15tb-zing-production.profile` - Production measurements
→ See the "Training Zing ReadyNow Profiles for TPC-DS" section below for complete step-by-step training instructions.
- OpenJDK-based with performance improvements
- Standard HotSpot optimizations
- AWS-optimized OpenJDK
- Pre-configured for EMR environment
Key parameters to adjust based on your cluster:
```bash
# Memory configuration
SPARK_EXECUTOR_MEMORY="10g"             # Per-executor memory
SPARK_EXECUTOR_MEMORY_OVERHEAD="3g"     # Additional memory for overhead

# CPU configuration
SPARK_EXECUTOR_CORES="8"                # Cores per executor
SPARK_EXECUTOR_INSTANCES="11"           # Total executors

# Network and reliability
SPARK_NETWORK_TIMEOUT="300s"            # Network timeout
SPARK_EXECUTOR_HEARTBEAT_INTERVAL="10s" # Heartbeat frequency
```

The following configuration is included by default for all JVMs in `main.tf`:
```
# With IDs in class names, each batch generates new classes, so there is less
# profile reuse and extra code-cache churn. Without IDs, the same generated class
# is reused, so profiles and compiled methods persist, lowering steady-state latency.
spark.sql.codegen.useIdInClassName: false

# The global cache of generated classes defaults to 100 entries. Once you exceed
# that, Spark evicts and recompiles classes, wasting CPU and lengthening query latency.
spark.sql.codegen.cache.maxEntries: 9999
```

ReadyNow profiles improve through iterative training. Each "generation" builds on the previous one:
- Generation 0 (Gen0): Initial cold-start profile collection
- Generation 1 (Gen1): Refined profile using Gen0 as input
- Production: Final measurements using Gen1 for optimal performance
Minimum Training Requirements:
- Each generation requires running the full TPC-DS query suite (all 99+ queries)
- Multiple iterations (3+ recommended) ensure comprehensive method coverage
In EMR Spark clusters, each executor JVM generates its own ReadyNow profile. With 11 executors, you'll get 11 profile files per run. For the next generation, you need to:
- Collect profiles from all executors
- Select one representative profile (typically the largest; see the sketch after this list)
- Distribute that profile to all executors for the next run
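A sketch of that selection step, run on any node that holds the profiles (file names assume the Gen0 naming used in the walkthrough below):

```bash
# Pick the largest Gen0 profile -- size is a rough proxy for method coverage
BEST=$(ls -S /mnt/vmoutput/tpcds-15tb-full-gen0-*.profile | head -1)

# Drop the per-process suffix so the next generation can load it by a fixed name
cp "$BEST" /mnt/vmoutput/tpcds-15tb-full-gen0.profile
```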
- EMR cluster deployed with `runtime_variant = "zing"`
- TPC-DS data available in S3 (`s3://your-bucket/data/sf15000-parquet/`)
- SSH access to cluster nodes configured
- Cluster ID saved: `export CURRENT_SPARK_CLUSTER_ID=j-XXXXXXXXXXXXX`
1. Run benchmark with Gen0 configuration:

```bash
# Deploy Zing cluster
terraform apply

# Run Gen0 training
./run_test.sh $CURRENT_SPARK_CLUSTER_ID your-bucket --profile configs/15tb-zing-gen0.profile
```

2. Wait for completion (3-4 hours):
```bash
# Monitor progress
./status.sh
```

3. Collect Gen0 profiles from cluster nodes:
```bash
ssh -i ~/.ssh/your-key.pem ec2-user@$ONE_OF_THE_EXECUTORS_IP

# List all Gen0 profiles
cd /mnt/vmoutput
ls -lh tpcds-15tb-full-gen0-*.profile

# Expected output: multiple files, one per executor
# tpcds-15tb-full-gen0-12345.profile (3.2M)
# tpcds-15tb-full-gen0-12346.profile (3.1M)
# tpcds-15tb-full-gen0-12347.profile (3.3M)
# ...

# Select the largest profile (typically has the best coverage)
# Rename for Gen1 input (remove the process ID)
cp tpcds-15tb-full-gen0-12347.profile tpcds-15tb-full-gen0.profile
```

4. Upload the Gen0 profile to S3 (alternatively, host it on any HTTP server, or even upload it manually to every executor host before each new run):
```bash
aws s3 cp tpcds-15tb-full-gen0.profile s3://your-bucket/readynow-profiles/
exit
```

5. Verify profile size:

```bash
# Download and check locally
aws s3 cp s3://your-bucket/readynow-profiles/tpcds-15tb-full-gen0.profile .
ls -lh tpcds-15tb-full-gen0.profile
```

1. Update bootstrap script to download the Gen0 profile:
Edit `bootstrap.sh.tpl` and uncomment/add the download section:

```bash
sudo mkdir -p /mnt/vmoutput

# Download Gen0 profile for Gen1 training
sudo curl -L -f --connect-timeout 30 --max-time 300 \
  -o /mnt/vmoutput/tpcds-15tb-full-gen0.profile \
  https://s3.amazonaws.com/your-bucket/readynow-profiles/tpcds-15tb-full-gen0.profile
sudo chmod -R 1777 /mnt/vmoutput
```

2. Redeploy cluster with the updated bootstrap:
```bash
# Destroy old cluster
terraform destroy

# Deploy new cluster (downloads the Gen0 profile during bootstrap)
terraform apply
```

3. Run Gen1 training:

```bash
./run_test.sh $CURRENT_SPARK_CLUSTER_ID your-bucket --profile configs/15tb-zing-gen1.profile
```

4. Collect and upload the Gen1 profile (same process as Gen0):
```bash
ssh -i ~/.ssh/your-key.pem ec2-user@$ONE_OF_THE_EXECUTORS_IP

# Select the best Gen1 profile
cd /mnt/vmoutput
cp tpcds-15tb-full-gen1-12347.profile tpcds-15tb-full-gen1.profile

# Upload to S3
aws s3 cp tpcds-15tb-full-gen1.profile s3://your-bucket/readynow-profiles/
exit
```

1. Update bootstrap script for the Gen1 profile:
Edit `bootstrap.sh.tpl` to download Gen1 instead of Gen0:

```bash
sudo mkdir -p /mnt/vmoutput

# Download Gen1 profile for production runs
sudo curl -L -f --connect-timeout 30 --max-time 300 \
  -o /mnt/vmoutput/tpcds-15tb-full-gen1.profile \
  https://s3.amazonaws.com/your-bucket/readynow-profiles/tpcds-15tb-full-gen1.profile
sudo chmod -R 1777 /mnt/vmoutput
```

2. Deploy production cluster:
```bash
terraform destroy
terraform apply
```

3. Run production benchmarks:

```bash
./run_test.sh $CURRENT_SPARK_CLUSTER_ID your-bucket --profile configs/15tb-zing-production.profile
```

This configuration uses the Gen1 profile for an optimal warm start, producing consistent, production-ready benchmark results.
Verify profiles are loaded successfully:
```bash
ssh -i ~/.ssh/your-key.pem ec2-user@$ONE_OF_THE_EXECUTORS_IP

# Check bootstrap log
grep -i "readynow\|profile" /tmp/bootstrap.log

# Check the profile exists and is readable
ls -lh /mnt/vmoutput/*.profile
file /mnt/vmoutput/*.profile # Should show: data
```

Use consistent naming for tracking generations:
```
s3://your-bucket/readynow-profiles/
├── tpcds-15tb-full-gen0.profile          # First generation
├── tpcds-15tb-full-gen1.profile          # Second generation
├── tpcds-15tb-full-gen0-20250130.profile # Archived with date
└── tpcds-15tb-full-gen1-20250130.profile # Archived with date
```
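Archiving a generation with a date suffix is a one-liner (a sketch matching the layout above):

```bash
# Keep a dated copy alongside the current profile
aws s3 cp s3://your-bucket/readynow-profiles/tpcds-15tb-full-gen1.profile \
  "s3://your-bucket/readynow-profiles/tpcds-15tb-full-gen1-$(date +%Y%m%d).profile"
```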
Regenerate profiles when:
- TPC-DS query list changes
- Spark configuration changes significantly (executor memory, cores)
- Upgrading Spark versions
- Upgrading Zing JDK versions
A single trained profile can be reused across multiple clusters if:
- Same Spark version and configuration
- Same TPC-DS query workload
- Same Zing JDK version
- Similar cluster size (executor count can vary slightly)
Symptoms: No `.profile` files in `/mnt/vmoutput/` after a benchmark run

Solutions:

- Check that the JVM flags are passed correctly:

  ```bash
  # On a cluster node during the run
  ps aux | grep java | grep ProfileLogOut
  ```

- Verify directory permissions:

  ```bash
  ls -ld /mnt/vmoutput/ # Should show: drwxrwxrwt (1777 permissions)
  ```

- Check for disk space:

  ```bash
  df -h /mnt/
  ```
Symptoms: Bootstrap errors or a missing profile during the run

Solutions:

- Verify the S3 URL is accessible:

  ```bash
  # Test from the master node
  curl -I https://s3.amazonaws.com/your-bucket/readynow-profiles/tpcds-15tb-full-gen0.profile
  ```

- Check IAM permissions for the EMR EC2 role:
  - Must have `s3:GetObject` permission for the profile bucket

- Check bootstrap logs:

  ```bash
  tail -100 /tmp/bootstrap.log
  ```
- Azul ReadyNow Documentation
- Optimizer Hub ReadyNow Orchestrator - Automated profile management (commercial tool)
- See `configs/15tb-full-run-zing.profile` for an alternative inline configuration approach
For evaluating results, we use the data from the Spark event logs. If `spark.eventLog.dir` is not set, retrieve the AWS EMR Spark event logs as follows:

1. In the AWS EMR web UI, go to the cluster where the benchmark ran.
2. Start the Spark History UI by clicking Spark History Server.
3. Wait for the Spark History UI to open and for the event logs from the benchmark runs to become available (check by refreshing the page), then download them as `.zip` files.
This repository includes two Python analysis scripts:

- `spark_eventlog_analyze.py`: Converts Spark event logs to an aggregated CSV format with per-query metrics
- `tpcds_eventlog_compare.py`: Compares two configurations and generates a comparison CSV and S-curve plots
The analysis scripts require Python 3.13+ with the following packages:

- `polars` - for efficient data processing
- `matplotlib` - for visualization
Using uv (recommended):

All Python scripts in this repository can be run using uv, which automatically manages dependencies without needing a virtual environment:

```bash
# Run directly with uv - dependencies are handled automatically
uv run spark_eventlog_analyze.py -o zing-1.csv /path/to/eventLogs-application_1758748016442_0001-1.zip
uv run tpcds_eventlog_compare.py -o results corretto-*.csv zing-*.csv
```

Alternative: Traditional Python environment:
```bash
# Create a virtual environment and install dependencies
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install polars matplotlib

# Run the scripts
python spark_eventlog_analyze.py -o zing-1.csv /path/to/eventlog.zip
python tpcds_eventlog_compare.py -o results corretto-*.csv zing-*.csv
```

For each event log (JSON, optionally in `.zip` or `.gz`), convert it to a more lightweight aggregated CSV:
```bash
uv run spark_eventlog_analyze.py -o zing-1.csv /path/to/eventLogs-application_1758748016442_0001-1.zip
uv run spark_eventlog_analyze.py -o zing-2.csv /path/to/eventLogs-application_1758748016442_0002-1.zip
# ... repeat for additional runs

uv run spark_eventlog_analyze.py -o corretto-1.csv /path/to/eventLogs-application_1756681602934_0001-1.zip
uv run spark_eventlog_analyze.py -o corretto-2.csv /path/to/eventLogs-application_1756681602934_0002-1.zip
# ... repeat for additional runs
```

Output CSV columns: `executionId`, `description`, `num_jobs`, `num_tasks`, `makespan_ms`, `task_slot_ms`, `executor_run_ms`, `executor_cpu_ms`, `cpu_vs_wall_pct`, `deserialize_ms`, `result_serialize_ms`, `gc_ms`, `shuffle_fetch_wait_ms`, `shuffle_write_time_ms`, `input_bytes`, `output_bytes`, `shuffle_read_bytes`, `shuffle_write_bytes`
Compare the converted event logs of two configurations:

```bash
uv run tpcds_eventlog_compare.py -o results corretto-*.csv zing-*.csv
```

Important:
- Configuration names (e.g. `corretto`, `zing`) and run sequence numbers are implicitly derived from the CSV file names, in the format `{config}-{run}.csv`
- The first named configuration is taken as the baseline
- Multiple runs of the same configuration are automatically aggregated by taking the mean
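For example, with the files below, `corretto` (named first) is the baseline, `zing` is the target, and the two runs of each configuration are averaged:

```bash
$ ls *.csv
corretto-1.csv  corretto-2.csv  zing-1.csv  zing-2.csv
$ uv run tpcds_eventlog_compare.py -o results corretto-*.csv zing-*.csv
```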
Optional filters:

```bash
# Only consider queries where the target runs longer than 60 seconds
uv run tpcds_eventlog_compare.py -o results --longer-than 60 corretto-*.csv zing-*.csv
```

The results folder will contain:
```
results/
├── zing-vs-corretto.csv
└── zing-vs-corretto-total_time.png
```
- PNG file: S-curve comparison of the `total_time` metric (wall-clock time to complete each query), sorted by speedup ratio
- CSV schema: `query, total_time_corretto, total_time_zing, executor_time_corretto, executor_time_zing, executor_cpu_time_corretto, executor_cpu_time_zing`

The CSV contains aggregated per-query metrics for both configurations and can be used for further analysis or custom plotting.
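For a quick look at the comparison without custom tooling, the CSV renders cleanly with standard shell utilities (a sketch):

```bash
# Pretty-print the per-query comparison table
column -s, -t < results/zing-vs-corretto.csv | head
```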