378 changes: 378 additions & 0 deletions docs/articles/test-workflows-matrix-and-sharding.mdx
either to distribute the load or to verify it on different setups.

Test Workflows have a built-in mechanism for all these cases - both static and dynamic.

## Configuration File Setup

Test Workflow sharding is configured through YAML files that define TestWorkflow custom resources in your Kubernetes cluster.

### Where to Define Configuration

You can create and apply Test Workflow configurations in several ways:

1. **Create a YAML file** (e.g., `my-workflow.yaml`) with your TestWorkflow definition
2. **Apply it using kubectl**:
```bash
kubectl apply -f my-workflow.yaml
```
3. **Or use the Testkube CLI**:
```bash
testkube create testworkflow -f my-workflow.yaml
```
4. **Or use the Testkube Dashboard** - navigate to Test Workflows and create/edit workflows through the UI

All Test Workflows are stored as custom resources in your Kubernetes cluster under the `testworkflows.testkube.io/v1` API version.

### Basic Configuration Structure

A minimal sharded workflow configuration looks like this:

```yaml
apiVersion: testworkflows.testkube.io/v1
kind: TestWorkflow
metadata:
name: my-sharded-workflow
spec:
# Your container and content configuration
container:
image: your-test-image:latest

steps:
- name: Run tests
parallel:
count: 3 # Number of shards to create
shell: 'run-your-tests.sh'
```

## Choosing the Right Shard Number

The number of shards you configure has a direct impact on performance and resource utilization:

### Performance Impact

| Shard Count | Execution Time | Resource Usage | Best For |
|-------------|----------------|----------------|----------|
| 1 (no sharding) | Baseline | Low | Small test suites, limited resources |
| 2-5 | ~50-80% reduction | Medium | Medium test suites (10-50 tests) |
| 5-10 | ~60-90% reduction | High | Large test suites (50-200 tests) |
| 10+ | ~70-95% reduction | Very High | Very large test suites (200+ tests) |

:::tip
**General Guidelines:**
- **Small test suites (fewer than 10 tests)**: Use 1-2 shards. More shards add overhead without benefit.
- **Medium test suites (10-50 tests)**: Use 3-5 shards for optimal balance.
- **Large test suites (50-200 tests)**: Use 5-10 shards based on available cluster resources.
- **Very large test suites (200+ tests)**: Use 10-20 shards, but monitor resource consumption.

The optimal number depends on:
- **Test duration**: Longer tests benefit more from sharding
- **Cluster capacity**: Each shard requires a pod with allocated resources
- **Test distribution**: Shards work best when tests can be evenly distributed
:::

### Resource Considerations

Each shard runs in its own pod, so consider:
- **CPU and memory**: Each shard consumes the resources defined in `container.resources`
- **Cluster capacity**: Ensure your cluster can schedule `count` pods with the requested `resources` simultaneously
- **Cost**: More shards = more parallel pods = higher infrastructure costs during execution
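
As a quick capacity sketch (illustrative numbers, not a recommendation), the total footprint is simply the per-shard request multiplied by the shard count:

```yaml
parallel:
  count: 5
  container:
    resources:
      requests:
        cpu: 500m      # 5 shards × 500m = 2.5 cores requested in total
        memory: 512Mi  # 5 shards × 512Mi = 2.5Gi requested in total
```

If the cluster cannot schedule all of these pods at once, some shards will sit in `Pending` until capacity frees up.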

## Step-by-Step Configuration Guide

### Step 1: Determine Your Sharding Strategy

Choose between static and dynamic sharding:

**Static Sharding (`count` only)**: Fixed number of shards
```yaml
parallel:
count: 5 # Always creates exactly 5 shards
```

**Dynamic Sharding (`maxCount` + `shards`)**: Adaptive based on test data
```yaml
parallel:
maxCount: 5 # Creates up to 5 shards based on available tests
shards:
testFiles: 'glob("tests/**/*.spec.js")'
```

### Step 2: Define Resource Limits

Specify resources for each shard to prevent resource contention:

```yaml
parallel:
count: 3
container:
resources:
requests:
cpu: 1 # Each shard gets 1 CPU
memory: 1Gi # Each shard gets 1GB RAM
limits:
cpu: 2
memory: 2Gi
```

### Step 3: Configure Data Distribution

For dynamic sharding, define how to split your test data:

```yaml
parallel:
maxCount: 5
shards:
testFiles: 'glob("cypress/e2e/**/*.cy.js")' # Discover test files
shell: |
# Access distributed test files via shard.testFiles
npx cypress run --spec '{{ join(shard.testFiles, ",") }}'
```

### Step 4: Apply and Verify

```bash
# Apply your workflow
kubectl apply -f my-sharded-workflow.yaml

# Run the workflow
testkube run testworkflow my-sharded-workflow -f

# Monitor execution
kubectl get pods -l testworkflow=my-sharded-workflow
```

## Common Use Cases

### Use Case 1: Sharding Cypress Tests

Distribute Cypress E2E tests across multiple shards:

```yaml
apiVersion: testworkflows.testkube.io/v1
kind: TestWorkflow
metadata:
name: cypress-sharded
spec:
content:
git:
uri: https://github.com/your-org/your-repo
paths: [cypress]
container:
image: cypress/included:13.6.4
workingDir: /data/repo/cypress

steps:
- name: Install dependencies
shell: npm ci

- name: Run tests in parallel
parallel:
maxCount: 5 # Up to 5 shards for optimal distribution
shards:
testFiles: 'glob("cypress/e2e/**/*.cy.js")'
description: 'Shard {{ index + 1 }}/{{ count }}: {{ join(shard.testFiles, ", ") }}'
transfer:
- from: /data/repo
container:
resources:
requests:
cpu: 1
memory: 1Gi
run:
args: [--spec, '{{ join(shard.testFiles, ",") }}']
```

### Use Case 2: Load Testing with K6

Generate load from multiple nodes:

```yaml
apiVersion: testworkflows.testkube.io/v1
kind: TestWorkflow
metadata:
name: k6-load-test
spec:
container:
image: grafana/k6:latest

steps:
- name: Run distributed load test
parallel:
count: 10 # 10 shards generating concurrent load
description: 'Load generator {{ index + 1 }}/{{ count }}'
container:
resources:
requests:
cpu: 2
memory: 2Gi
shell: |
k6 run --vus 50 --duration 5m \
--tag shard={{ index }} script.js
```

### Use Case 3: Multi-Browser Testing with Playwright

Test across different browsers with sharding:

```yaml
apiVersion: testworkflows.testkube.io/v1
kind: TestWorkflow
metadata:
name: playwright-multi-browser
spec:
container:
image: mcr.microsoft.com/playwright:latest

steps:
- name: Run tests
parallel:
matrix:
browser: [chromium, firefox, webkit] # Test on each browser
count: 3 # Shard each browser's tests into 3 parts
description: '{{ matrix.browser }} - shard {{ shardIndex + 1 }}/{{ shardCount }}'
shell: |
npx playwright test \
--project={{ matrix.browser }} \
--shard={{ shardIndex + 1 }}/{{ shardCount }}
```

## Usage

Matrix and sharding features are supported in [**Services (`services`)**](./test-workflows-services), and both [**Test Suite (`execute`)**](./test-workflows-test-suites) and [**Parallel Steps (`parallel`)**](./test-workflows-parallel) operations.
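
For example, the same `matrix` block used with `parallel` above can be attached to a test-suite style `execute` step — a sketch, with a hypothetical referenced workflow name; see the linked pages for the exact fields each operation supports:

```yaml
steps:
- execute:
    workflows:
    - name: example-smoke-test   # hypothetical workflow referenced by the suite
      matrix:
        browser: [chromium, firefox]
```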
Will start 8 instances:
| `index` | `matrixIndex` | `matrix.browser` | `matrix.memory` | `shardIndex` | `shard.url` |
|---------|---------------|------------------|-----------------|--------------|-------------|
| `5` | `2` | `"firefox"` | `"1Gi"` | `1` | `["https://app.testkube.io"]` |
| `6` | `3` | `"firefox"` | `"2Gi"` | `0` | `["https://testkube.io", "https://docs.testkube.io"]` |
| `7` | `3` | `"firefox"` | `"2Gi"` | `1` | `["https://app.testkube.io"]` |

## Troubleshooting and Best Practices

### Common Issues

#### Issue: Shards Not Starting

**Symptoms**: Some or all shards remain in pending state

**Solutions**:
1. **Check cluster resources**: Ensure your cluster has enough capacity for all shards
```bash
kubectl describe nodes # Check available resources
kubectl get pods -n testkube # Check pod status
```
2. **Review resource requests**: Each shard needs allocated resources
```yaml
container:
resources:
requests:
cpu: 500m # Reduce if resources are limited
memory: 512Mi
```
3. **Reduce shard count**: If resources are constrained, use fewer shards
```yaml
parallel:
count: 3 # Reduced from 10
```

#### Issue: Uneven Test Distribution

**Symptoms**: Some shards finish much faster than others

**Solutions**:
1. **Use dynamic sharding** with `maxCount` instead of `count`:
```yaml
parallel:
maxCount: 5 # Adapts to available tests
shards:
testFiles: 'glob("tests/**/*.test.js")'
```
2. **Ensure test files are similar in size/duration**: Group fast and slow tests evenly
3. **Monitor execution times**:
```bash
testkube get twe EXECUTION_ID # Check individual shard durations
```

#### Issue: Out of Memory Errors

**Symptoms**: Pods crash with OOM (Out of Memory) errors

**Solutions**:
1. **Increase memory limits**:
```yaml
container:
resources:
limits:
memory: 4Gi # Increased from 2Gi
```
2. **Reduce tests per shard**: Increase shard count to distribute load
```yaml
parallel:
count: 10 # More shards = fewer tests per shard
```

### Best Practices

#### 1. Start Conservative and Scale Up

Begin with a small shard count and increase based on results:
```yaml
# Week 1: Baseline
parallel:
count: 2

# Week 2: If successful, increase
parallel:
count: 5

# Week 3: Optimize based on metrics
parallel:
count: 8 # Sweet spot for your test suite
```

#### 2. Monitor Resource Usage

Track resource consumption to optimize shard configuration:
```bash
# Watch resource usage during execution
kubectl top pods -n testkube -l testworkflow=my-workflow

# Review completed execution metrics
testkube get twe EXECUTION_ID
```

#### 3. Use Descriptive Names

Make debugging easier with clear descriptions:
```yaml
parallel:
count: 5
description: 'Shard {{ index + 1 }}/{{ count }} - {{ len(shard.testFiles) }} tests'
```

#### 4. Implement Retry Logic

Account for transient failures in sharded tests:
```yaml
steps:
- name: Run tests with retry
parallel:
count: 3
retry:
count: 2 # Retry failed shards up to 2 times
shell: 'run-tests.sh'
```

#### 5. Consider Cost vs. Speed Tradeoffs

More shards = faster execution but higher cost:
- **Development**: Use fewer shards (2-3) to save resources
- **CI/CD**: Use optimal shards (5-8) for speed
- **Production validation**: Use maximum shards (10+) for critical releases
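
One way to switch between these profiles without editing the step itself is to expose the shard count as a workflow config parameter — a sketch, assuming your Testkube version supports typed `config` parameters and expressions in `count`:

```yaml
spec:
  config:
    shards:
      type: integer
      default: 3            # conservative default for development runs
  steps:
  - name: Run tests
    parallel:
      count: '{{ config.shards }}'  # overridden per environment at run time
      shell: 'run-your-tests.sh'
```

A CI pipeline can then pass a larger value when triggering the run, e.g. something along the lines of `testkube run testworkflow my-workflow --config shards=8`.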

#### 6. Balance Matrix and Sharding

When combining matrix and sharding, avoid excessive parallelism:
```yaml
# This creates 3 browsers × 5 shards = 15 pods
parallel:
matrix:
browser: [chrome, firefox, safari] # 3 combinations
count: 5 # 5 shards per combination
# Total: 15 concurrent pods - ensure cluster can handle this!
```

## Additional Resources

- [Test Workflows Overview](./test-workflows.md)
- [Parallel Execution](./test-workflows-parallel.mdx)
- [Test Workflow Expressions](./test-workflows-expressions.md)
- [Job and Pod Configuration](./test-workflows-job-and-pod.md)
- [Sharded Cypress Example](./examples/cypress-sharded.mdx)
- [Sharded Playwright Example](./examples/playwright-sharded.mdx)