93 changes: 93 additions & 0 deletions k8s/README.md
@@ -0,0 +1,93 @@
# Lading in k8s Demonstration

A testing setup demonstrating memory limits for the Datadog Agent under lading load.

The experiment is rigged up through `experiment.sh`. That script takes a memory
parameter for each configured Agent pod container, setting them as limits in
`manifests/datadog-agent.yaml`. The experiment runs for a given duration
-- 300 seconds at a minimum is suggested -- and does one of two things:

* watches for container restarts during the experiment (sketched below),
  signaling failure if one is detected, or
* runs to the full experiment duration and queries Prometheus to calculate the
  peak memory consumed by each Agent container, relative to its configured
  limit.
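
A minimal sketch of the restart watch, assuming `kubectl` access to the kind
cluster; the pod selector and polling interval are illustrative, and the real
logic lives in `experiment.sh`:

```bash
# Hypothetical restart watch: poll the Agent pod, fail on any container restart.
POD=$(kubectl get pods -l app.kubernetes.io/name=datadog-agent \
  -o jsonpath='{.items[0].metadata.name}')
while true; do
  restarts=$(kubectl get pod "$POD" \
    -o jsonpath='{range .status.containerStatuses[*]}{.restartCount}{"\n"}{end}' \
    | awk '{s+=$1} END {print s}')
  if [ "${restarts:-0}" -gt 0 ]; then
    echo "FAILURE: container restart detected" >&2
    exit 1
  fi
  sleep 5
done
```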

Experiments are **isolated from the internet** to avoid sending metrics and
other telemetry to the actual Datadog intake. See `manifests/deny-egress.yaml`
for details.

## Prerequisites

- kind: `brew install kind`
- kubectl: `brew install kubectl`
- helm: `brew install helm`
- jq: `brew install jq`
- python3: System Python 3
- Docker: a running Docker daemon (required by kind)

## Usage

### Test a specific memory limit

```bash
# Test 2000 MB total for 5 minutes with explicit per-container limits
./k8s/experiment.sh --total-limit 2000 --agent-memory 1200 \
  --trace-memory 400 --sysprobe-memory 300 --process-memory 100 \
  --tags "purpose:test,limit:2000mb"
```

All memory flags are mandatory and must sum to `--total-limit`, which acts as a consistency check.
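
A minimal sketch of that check, assuming the flags have already been parsed
into shell variables (names are illustrative):

```bash
# Hypothetical consistency check: per-container limits must sum to the total.
if [ $((AGENT_MEMORY + TRACE_MEMORY + SYSPROBE_MEMORY + PROCESS_MEMORY)) -ne "$TOTAL_LIMIT" ]; then
  echo "error: per-container limits must sum to --total-limit" >&2
  exit 1
fi
```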

### Finding a minimum memory limit

Run the script multiple times with different limits (see the sketch after this
list). Possible results:

- **OOMKilled** (FAILURE): the Agent needs more memory; the script exits
- **Stable** (SUCCESS): the Agent survived the test duration; the cluster is
  kept running for examination
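
A hypothetical manual search, assuming `experiment.sh` exits non-zero on
failure: step the `agent` container's limit down while holding the other
containers fixed, and stop at the first OOMKill.

```bash
# Descending search: the last run that succeeds is the smallest stable
# configuration among those tried.
for agent_mb in 1600 1200 1000 900; do
  total=$((agent_mb + 400 + 300 + 100))
  ./k8s/experiment.sh --total-limit "$total" --agent-memory "$agent_mb" \
    --trace-memory 400 --sysprobe-memory 300 --process-memory 100 \
    --tags "purpose:search,limit:${total}mb" || break
done
```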

## Manifests

All manifests live in the `manifests/` directory. The script uses template
substitution (sketched after the list below) for:

- **manifests/datadog-agent.yaml**: DatadogAgent CRD for the Datadog Operator
  - Uses `{{ AGENT_MEMORY_MB }}`, `{{ TRACE_MEMORY_MB }}`, `{{ SYSPROBE_MEMORY_MB }}`,
    `{{ PROCESS_MEMORY_MB }}`, and `{{ DD_TAGS }}` placeholders
  - Configured for DogStatsD via Unix domain socket at `/var/run/datadog/dsd.socket`
  - Shares `/var/run/datadog` via hostPath with the lading pod

- **manifests/lading.yaml**: Lading load generator (lading 0.29.2)
  - ConfigMap with the exact config from the `uds_dogstatsd_to_api` test
  - Sends 100 MiB/s of DogStatsD metrics
  - High cardinality: 1k-10k contexts, many tags
  - Service with Prometheus scrape annotations for lading metrics

- **manifests/lading-intake.yaml**: Lading intake (blackhole) mimicking the Datadog API (lading 0.29.2)
  - Receives and discards agent output for self-contained testing

- **manifests/datadog-secret.yaml**: Placeholder secret (fake API key, not validated)
- **manifests/deny-egress.yaml**: NetworkPolicy blocking internet egress (security isolation)
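
A minimal sketch of the rendering step, assuming `sed`-style substitution (the
actual mechanism is whatever `experiment.sh` implements):

```bash
# Hypothetical template rendering: fill placeholders, then apply the manifest.
sed -e "s/{{ AGENT_MEMORY_MB }}/${AGENT_MEMORY}/" \
    -e "s/{{ TRACE_MEMORY_MB }}/${TRACE_MEMORY}/" \
    -e "s/{{ SYSPROBE_MEMORY_MB }}/${SYSPROBE_MEMORY}/" \
    -e "s/{{ PROCESS_MEMORY_MB }}/${PROCESS_MEMORY}/" \
    -e "s/{{ DD_TAGS }}/${TAGS}/" \
    manifests/datadog-agent.yaml | kubectl apply -f -
```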

## Test configuration

Taken from
[`datadog-agent/test/regression/cases/uds_dogstatsd_to_api`](https://github.com/DataDog/datadog-agent/blob/main/test/regression/cases/uds_dogstatsd_to_api/lading/lading.yaml). This
experiment is **high stress** for metrics intake, and high memory use from the
`agent` container is expected.

Adjust lading's load generation configuration in the ConfigMap named
`lading-config`. Adjust Agent configuration in `manifests/datadog-agent.yaml`.

## Cleanup

The cluster is left online after the script exits. Re-running `experiment.sh`
will destroy the cluster. To clean up manually:

```bash
kind delete cluster --name lading-test
```
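
To confirm removal, `kind get clusters` should no longer list `lading-test`:

```bash
kind get clusters
```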

## Notes

- **Agent version**: 7.72.1
- **Lading version**: 0.29.2
- **Agent features enabled**: APM (trace-agent), Log Collection, NPM/system-probe, DogStatsD, Prometheus scrape
65 changes: 65 additions & 0 deletions k8s/analyze_memory.py
@@ -0,0 +1,65 @@
#!/usr/bin/env python3
"""Query Prometheus for the peak memory of each Datadog Agent container over
the experiment window and report usage against the configured limits.

Usage:
    analyze_memory.py <prom_url> <pod> <duration> <agent_limit_mb>
                      <trace_limit_mb> <sysprobe_limit_mb> <process_limit_mb>
"""
import sys
import json
import urllib.request
import urllib.parse


def query_container(prom_url, pod, container, duration):
    # Peak working-set memory for one container over the experiment window.
    query = f'max_over_time(container_memory_working_set_bytes{{namespace="default",pod="{pod}",container="{container}"}}[{duration}s])'
    params = {'query': query}
    url = f"{prom_url}?{urllib.parse.urlencode(params)}"

    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            data = json.loads(response.read().decode())

        # Instant-query result: 'value' is a [timestamp, value] pair.
        if data['status'] == 'success' and data['data']['result']:
            value_bytes = float(data['data']['result'][0]['value'][1])
            return data, value_bytes
        return data, None
    except Exception as e:
        print(f"Error querying {container}: {e}", file=sys.stderr)
        return None, None


def main():
    if len(sys.argv) != 8:
        print("Usage: analyze_memory.py <prom_url> <pod> <duration> <agent_limit_mb> <trace_limit_mb> <sysprobe_limit_mb> <process_limit_mb>", file=sys.stderr)
        sys.exit(1)

    prom_url = sys.argv[1]
    pod = sys.argv[2]
    duration = sys.argv[3]
    agent_limit = int(sys.argv[4])
    trace_limit = int(sys.argv[5])
    sysprobe_limit = int(sys.argv[6])
    process_limit = int(sys.argv[7])
    total_limit = agent_limit + trace_limit + sysprobe_limit + process_limit

    # Per-container limits in MB, matching the containers experiment.sh configures.
    containers = {
        'agent': agent_limit,
        'trace-agent': trace_limit,
        'system-probe': sysprobe_limit,
        'process-agent': process_limit
    }

    results = {}

    for container, limit_mb in containers.items():
        _, value_bytes = query_container(prom_url, pod, container, duration)

        if value_bytes is not None:
            value_mb = value_bytes / 1024 / 1024
            percent = (value_mb / limit_mb) * 100
            results[container] = (value_mb, limit_mb, percent)
            print(f" {container}: {value_mb:.2f} MB / {limit_mb} MB ({percent:.1f}%)")
        else:
            # Missing metrics count as 0 MB, so TOTAL below is a lower bound.
            print(f" {container}: Could not retrieve metrics")
            results[container] = (0, limit_mb, 0)

    # Calculate total peak usage across containers.
    total_mb = sum(r[0] for r in results.values())
    total_percent = (total_mb / total_limit) * 100
    print(f" TOTAL: {total_mb:.2f} MB / {total_limit} MB ({total_percent:.1f}%)")


if __name__ == '__main__':
    main()
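
A hypothetical invocation of the analysis script; the Prometheus URL and pod
name are illustrative, and the limits match the example run above:

```bash
python3 k8s/analyze_memory.py \
  "http://localhost:9090/api/v1/query" \
  datadog-agent-abcde 300 1200 400 300 100
```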