93 changes: 93 additions & 0 deletions k8s/README.md
@@ -0,0 +1,93 @@
# Lading in k8s Demonstration

A testing setup demonstrating memory limits for the Datadog Agent under lading load.

The experiment is rigged up through `experiment.sh`. That script takes a memory
parameter for each configured Agent pod container, setting them as limits in
`manifests/datadog-agent.yaml`. The experiment runs for a given duration
-- 300 seconds at a minimum is suggested -- and does one of two things:

* watches for container restarts during the experiment (sketched below),
  signaling failure if one is detected, or
* runs to the full experiment duration and queries Prometheus to calculate the
  peak memory consumed by each Agent container, relative to its configured
  limit.
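
A minimal sketch of the restart watch, assuming `kubectl` access to the kind
cluster; the pod selector and polling interval are illustrative, and the real
logic lives in `experiment.sh`:

```bash
# Hypothetical restart watch: poll the Agent pod, fail on any container restart.
POD=$(kubectl get pods -l app.kubernetes.io/name=datadog-agent \
  -o jsonpath='{.items[0].metadata.name}')
while true; do
  restarts=$(kubectl get pod "$POD" \
    -o jsonpath='{range .status.containerStatuses[*]}{.restartCount}{"\n"}{end}' \
    | awk '{s+=$1} END {print s}')
  if [ "${restarts:-0}" -gt 0 ]; then
    echo "FAILURE: container restart detected" >&2
    exit 1
  fi
  sleep 5
done
```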

Experiments are **isolated from the internet** to avoid sending metrics and
other telemetry to the actual Datadog intake. See `manifests/deny-egress.yaml`
for details.

## Prerequisites

- kind: `brew install kind`
- kubectl: `brew install kubectl`
- helm: `brew install helm`
- jq: `brew install jq`
- python3: System Python 3
- Docker: a running Docker daemon (required by kind)

## Usage

### Test a specific memory limit

```bash
# Test 2000 MB total for 5 minutes with explicit per-container limits
./k8s/experiment.sh --total-limit 2000 --agent-memory 1200 \
  --trace-memory 400 --sysprobe-memory 300 --process-memory 100 \
  --tags "purpose:test,limit:2000mb"
```

All memory flags are mandatory and must sum to `--total-limit`, which acts as a consistency check.
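
A minimal sketch of that check, assuming the flags have already been parsed
into shell variables (names are illustrative):

```bash
# Hypothetical consistency check: per-container limits must sum to the total.
if [ $((AGENT_MEMORY + TRACE_MEMORY + SYSPROBE_MEMORY + PROCESS_MEMORY)) -ne "$TOTAL_LIMIT" ]; then
  echo "error: per-container limits must sum to --total-limit" >&2
  exit 1
fi
```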

### Finding a minimum memory limit

Run the script multiple times with different limits (see the sketch after this
list). Possible results:

- **OOMKilled** (FAILURE): the Agent needs more memory; the script exits
- **Stable** (SUCCESS): the Agent survived the test duration; the cluster is
  kept running for examination
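
A hypothetical manual search, assuming `experiment.sh` exits non-zero on
failure: step the `agent` container's limit down while holding the other
containers fixed, and stop at the first OOMKill.

```bash
# Descending search: the last run that succeeds is the smallest stable
# configuration among those tried.
for agent_mb in 1600 1200 1000 900; do
  total=$((agent_mb + 400 + 300 + 100))
  ./k8s/experiment.sh --total-limit "$total" --agent-memory "$agent_mb" \
    --trace-memory 400 --sysprobe-memory 300 --process-memory 100 \
    --tags "purpose:search,limit:${total}mb" || break
done
```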

## Manifests

All manifests live in the `manifests/` directory. The script uses template
substitution (sketched after the list below) for:

- **manifests/datadog-agent.yaml**: DatadogAgent CRD for the Datadog Operator
  - Uses `{{ AGENT_MEMORY_MB }}`, `{{ TRACE_MEMORY_MB }}`, `{{ SYSPROBE_MEMORY_MB }}`,
    `{{ PROCESS_MEMORY_MB }}`, and `{{ DD_TAGS }}` placeholders
  - Configured for DogStatsD via Unix domain socket at `/var/run/datadog/dsd.socket`
  - Shares `/var/run/datadog` via hostPath with the lading pod

- **manifests/lading.yaml**: Lading load generator (lading 0.29.2)
  - ConfigMap with the exact config from the `uds_dogstatsd_to_api` test
  - Sends 100 MiB/s of DogStatsD metrics
  - High cardinality: 1k-10k contexts, many tags
  - Service with Prometheus scrape annotations for lading metrics

- **manifests/lading-intake.yaml**: Lading intake (blackhole) mimicking the Datadog API (lading 0.29.2)
  - Receives and discards agent output for self-contained testing

- **manifests/datadog-secret.yaml**: Placeholder secret (fake API key, not validated)
- **manifests/deny-egress.yaml**: NetworkPolicy blocking internet egress (security isolation)
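
A minimal sketch of the rendering step, assuming `sed`-style substitution (the
actual mechanism is whatever `experiment.sh` implements):

```bash
# Hypothetical template rendering: fill placeholders, then apply the manifest.
sed -e "s/{{ AGENT_MEMORY_MB }}/${AGENT_MEMORY}/" \
    -e "s/{{ TRACE_MEMORY_MB }}/${TRACE_MEMORY}/" \
    -e "s/{{ SYSPROBE_MEMORY_MB }}/${SYSPROBE_MEMORY}/" \
    -e "s/{{ PROCESS_MEMORY_MB }}/${PROCESS_MEMORY}/" \
    -e "s/{{ DD_TAGS }}/${TAGS}/" \
    manifests/datadog-agent.yaml | kubectl apply -f -
```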

## Test configuration

Taken from
[`datadog-agent/test/regression/cases/uds_dogstatsd_to_api`](https://github.com/DataDog/datadog-agent/blob/main/test/regression/cases/uds_dogstatsd_to_api/lading/lading.yaml). This
experiment is **high stress** for metrics intake, and high memory use from the
`agent` container is expected.

Adjust lading's load generation configuration in the ConfigMap named
`lading-config`. Adjust Agent configuration in `manifests/datadog-agent.yaml`.

## Cleanup

The cluster is left online after the script exits. Re-running `experiment.sh`
will destroy the cluster. To clean up manually:

```bash
kind delete cluster --name lading-test
```
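
To confirm removal, `kind get clusters` should no longer list `lading-test`:

```bash
kind get clusters
```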

## Notes

- **Agent version**: 7.72.1
- **Lading version**: 0.29.2
- **Agent features enabled**: APM (trace-agent), Log Collection, NPM/system-probe, DogStatsD, Prometheus scrape
65 changes: 65 additions & 0 deletions k8s/analyze_memory.py
@@ -0,0 +1,65 @@
#!/usr/bin/env python3
"""Query Prometheus for the peak memory of each Datadog Agent container over
the experiment window and report usage against the configured limits.

Usage:
    analyze_memory.py <prom_url> <pod> <duration> <agent_limit_mb>
                      <trace_limit_mb> <sysprobe_limit_mb> <process_limit_mb>
"""
import sys
import json
import urllib.request
import urllib.parse


def query_container(prom_url, pod, container, duration):
    # Peak working-set memory for one container over the experiment window.
    query = f'max_over_time(container_memory_working_set_bytes{{namespace="default",pod="{pod}",container="{container}"}}[{duration}s])'
    params = {'query': query}
    url = f"{prom_url}?{urllib.parse.urlencode(params)}"

    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            data = json.loads(response.read().decode())

        # Instant-query result: 'value' is a [timestamp, value] pair.
        if data['status'] == 'success' and data['data']['result']:
            value_bytes = float(data['data']['result'][0]['value'][1])
            return data, value_bytes
        return data, None
    except Exception as e:
        print(f"Error querying {container}: {e}", file=sys.stderr)
        return None, None


def main():
    if len(sys.argv) != 8:
        print("Usage: analyze_memory.py <prom_url> <pod> <duration> <agent_limit_mb> <trace_limit_mb> <sysprobe_limit_mb> <process_limit_mb>", file=sys.stderr)
        sys.exit(1)

    prom_url = sys.argv[1]
    pod = sys.argv[2]
    duration = sys.argv[3]
    agent_limit = int(sys.argv[4])
    trace_limit = int(sys.argv[5])
    sysprobe_limit = int(sys.argv[6])
    process_limit = int(sys.argv[7])
    total_limit = agent_limit + trace_limit + sysprobe_limit + process_limit

    # Per-container limits in MB, matching the containers experiment.sh configures.
    containers = {
        'agent': agent_limit,
        'trace-agent': trace_limit,
        'system-probe': sysprobe_limit,
        'process-agent': process_limit
    }

    results = {}

    for container, limit_mb in containers.items():
        _, value_bytes = query_container(prom_url, pod, container, duration)

        if value_bytes is not None:
            value_mb = value_bytes / 1024 / 1024
            percent = (value_mb / limit_mb) * 100
            results[container] = (value_mb, limit_mb, percent)
            print(f" {container}: {value_mb:.2f} MB / {limit_mb} MB ({percent:.1f}%)")
        else:
            # Missing metrics count as 0 MB, so TOTAL below is a lower bound.
            print(f" {container}: Could not retrieve metrics")
            results[container] = (0, limit_mb, 0)

    # Calculate total peak usage across containers.
    total_mb = sum(r[0] for r in results.values())
    total_percent = (total_mb / total_limit) * 100
    print(f" TOTAL: {total_mb:.2f} MB / {total_limit} MB ({total_percent:.1f}%)")


if __name__ == '__main__':
    main()
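
A hypothetical invocation of the analysis script; the Prometheus URL and pod
name are illustrative, and the limits match the example run above:

```bash
python3 k8s/analyze_memory.py \
  "http://localhost:9090/api/v1/query" \
  datadog-agent-abcde 300 1200 400 300 100
```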