|
| 1 | +--- |
| 2 | +title: "AIOps on Linux Configuration" |
| 3 | +format: html |
| 4 | +--- |
| 5 | + |
| 6 | +# Self Monitoring |
| 7 | + |
| 8 | +--- |
| 9 | + |
| 10 | +## Setting Up a Prometheus AlertManager Webhook in AIOps |
| 11 | + |
| 12 | +### 1. Define the Webhook in the AIOps UI |
| 13 | +1. Navigate to **Integrations** in the AIOps console and select **Add integration**. |
| 14 | +2. Under the **Events** category, select **Prometheus AlertManager** and click **Get started**. |
| 15 | +3. Provide a **Name** (e.g. *Prometheus*) and optional **description** for the webhook to identify its purpose (e.g., *Prometheus Alerts (Self Monitoring)*). |
| 16 | +4. Select **None** for **Authentication type** and click **Next**. |
| 17 | + |
| 18 | +--- |
| 19 | + |
| 20 | +### 2. Map Prometheus Alert JSON to AIOps Schema |
| 21 | +1. In the webhook configuration screen, locate the **Mapping** section. |
| 22 | +2. Use the following JSONata mapping: |
| 23 | + |
| 24 | +```json |
| 25 | +( |
| 26 | +  /* Set resource based on labels available */ |
| 27 | +  $resource := function($labels){( |
| 28 | +    $name := $labels.name ? $labels.name |
| 29 | +      : $labels.node_name ? $labels.node_name |
| 30 | +      : $labels.statefulset ? $labels.statefulset |
| 31 | +      : $labels.deployment ? $labels.deployment |
| 32 | +      : $labels.daemonset ? $labels.daemonset |
| 33 | +      : $labels.pod ? $labels.pod |
| 34 | +      : $labels.container ? $labels.container |
| 35 | +      : $labels.instance ? $labels.instance |
| 36 | +      : $labels.app ? $labels.app |
| 37 | +      : $labels.job_name ? $labels.job_name |
| 38 | +      : $labels.job ? $labels.job |
| 39 | +      : $labels.type ? $labels.type : $labels.prometheus; |
| 40 | + |
| 41 | +    /* Conditional Namespace Append */ |
| 42 | +    $namespace_appended := $labels.namespace ? ($name & '/' & $labels.namespace) : $name; |
| 43 | + |
| 44 | +    /* Check if the determined $name is likely a node/hardware name */ |
| 45 | +    $is_node_alert := $labels.node_name or $labels.instance; |
| 46 | + |
| 47 | +    $is_node_alert ? $name : $namespace_appended; /* Only append if NOT a node alert */ |
| 48 | +  )}; |
| 49 | +  /* Map to event schema */ |
| 50 | +  alerts.( |
| 51 | +    { |
| 52 | +      "summary": annotations.summary ? annotations.summary: annotations.description ? annotations.description : annotations.message ? annotations.message, |
| 53 | +      "severity": $lowercase(labels.severity) = "critical" ? 6 : $lowercase(labels.severity) = "major" ? 5 : $lowercase(labels.severity) = "minor" ? 4 : $lowercase(labels.severity) = "warning" ? 3 : 1, |
| 54 | +      "resource": { |
| 55 | +        "name": $resource(labels) |
| 56 | +      }, |
| 57 | +      "type": { |
| 58 | +        "eventType": $lowercase(status) = "firing" ? "problem": "resolution", |
| 59 | +        "classification": labels.alertname |
| 60 | +      }, |
| 61 | +      "links": [ |
| 62 | +        { |
| 63 | +          "url": generatorURL |
| 64 | +        } |
| 65 | +      ], |
| 66 | +      "sender": { |
| 67 | +        "name": "Prometheus", |
| 68 | +        "type": "Webhook Connector" |
| 69 | +      }, |
| 70 | +      "details": labels |
| 71 | +    } |
| 72 | +  ) |
| 73 | +) |
| 74 | +``` |
| 75 | +3. Click **Save**. |
| 76 | + |
| 77 | +--- |
| 78 | + |
| 79 | +### 3. Generate the Webhook and Capture the URL |
| 80 | +1. The webhook will start initializing. Wait until initialization completes. |
| 81 | +2. A unique **Webhook route** will be displayed (e.g., `https://<aiops-domain>/webhook-connector/<id>`) once the webhook is **Running**. |
| 82 | +3. Copy this URL — it will be used in the **AlertmanagerConfig** in Prometheus to send alerts to AIOps. |
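
Before wiring up Alertmanager, it can help to check the mapping with a hand-built payload. The sketch below prints a minimal Alertmanager-style webhook body containing one alert. The function name `build_test_payload` and all label and annotation values (`TestAlert`, `test-node-1`, the `prometheus.example` URL) are hypothetical test data, not values from a real environment.

```shell
#!/bin/bash
# Build a minimal Alertmanager-style webhook payload with a single alert.
# All label and annotation values are hypothetical test data.
build_test_payload() {
    local status="${1:-firing}"   # "firing" or "resolved"
    cat <<EOF
{
  "version": "4",
  "status": "${status}",
  "receiver": "aiops-webhook-receiver",
  "alerts": [
    {
      "status": "${status}",
      "labels": {
        "alertname": "TestAlert",
        "severity": "warning",
        "instance": "test-node-1"
      },
      "annotations": {
        "summary": "Test alert sent by hand"
      },
      "generatorURL": "http://prometheus.example/graph"
    }
  ]
}
EOF
}

build_test_payload firing

# To send it by hand (assumption: the route accepts unauthenticated POSTs,
# matching the Authentication type of None chosen earlier):
# curl -k -X POST -H 'Content-Type: application/json' \
#      -d "$(build_test_payload firing)" "<webhook route>"
```

With the mapping above, this payload should yield an event with severity `3`, resource name `test-node-1`, event type `problem`, and classification `TestAlert`. Passing `resolved` instead of `firing` should produce the matching `resolution` event.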
| 83 | + |
| 84 | +--- |
| 85 | + |
| 86 | +## Prometheus Alertmanager: Webhook Receiver Configuration for AIOps |
| 87 | + |
| 88 | +This section outlines the steps required to configure the **Prometheus Operator's Alertmanager** to successfully send alerts to the AIOps webhook endpoint. |
| 89 | + |
| 90 | +The process involves two main phases: |
| 91 | + |
| 92 | +- **Network Configuration**: Ensuring the webhook FQDN is resolvable within the cluster. |
| 93 | +- **Alerting Configuration**: Defining the Alertmanager receiver and routing. |
| 94 | + |
| 95 | +--- |
| 96 | + |
| 97 | +### 1. Network Configuration (DNS Resolution) |
| 98 | + |
| 99 | +The Alertmanager pod must be able to resolve the AIOps webhook FQDN (e.g. `whconn-d59baea5-a620-4efd-bfdc-bbbce5314530-aiops.aiops-haproxy.gym.lan`). Since this FQDN is custom and resolves to a specific HAProxy IP (`192.168.252.9`), the entry must be added to **CoreDNS**. |
| 100 | + |
| 101 | +#### Update the `coredns-custom` ConfigMap |
| 102 | + |
| 103 | +Edit the `coredns-custom` ConfigMap in the `kube-system` namespace to include the webhook domain, mapping it to your HAProxy IP (`192.168.252.9`). This is needed because a custom FQDN like this one is typically not known to the upstream DNS servers that cluster DNS forwards to. |
| 104 | + |
| 105 | +**Note**: Replace `192.168.252.9` with your actual HAProxy IP if different. Replace `<webhook route>` with the FQDN from the webhook route generated by AIOps (e.g. `whconn-d59baea5-a620-4efd-bfdc-bbbce5314530-aiops.aiops-haproxy.gym.lan`). |
| 106 | + |
| 107 | +**Additional Note**: The ConfigMap below also contains DNS mappings for the Cloud Pak console and the AIOps UI. These may or may not apply to your environment. |
| 108 | + |
| 109 | +```bash |
| 110 | +kubectl apply -f - <<EOF |
| 111 | +apiVersion: v1 |
| 112 | +kind: ConfigMap |
| 113 | +metadata: |
| 114 | +  name: coredns-custom |
| 115 | +  namespace: kube-system |
| 117 | +data: |
| 118 | +  default.server: | |
| 119 | +    cp-console-aiops.aiops-haproxy.gym.lan { |
| 120 | +        hosts { |
| 121 | +            192.168.252.9 cp-console-aiops.aiops-haproxy.gym.lan |
| 122 | +            fallthrough |
| 123 | +        } |
| 124 | +    } |
| 125 | +    aiops-cpd.aiops-haproxy.gym.lan { |
| 126 | +        hosts { |
| 127 | +            192.168.252.9 aiops-cpd.aiops-haproxy.gym.lan |
| 128 | +            fallthrough |
| 129 | +        } |
| 130 | +    } |
| 131 | +    <webhook route> { |
| 132 | +        hosts { |
| 133 | +            192.168.252.9 <webhook route> |
| 134 | +            fallthrough |
| 135 | +        } |
| 136 | +    } |
| 137 | +EOF |
| 138 | +``` |
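
The ConfigMap above needs only the host name from the webhook route, not the full URL. As a small sketch, the hypothetical helper `webhook_fqdn` below strips the scheme, path, and any port from a route using plain shell parameter expansion.

```shell
#!/bin/bash
# Extract the host name (FQDN) from a webhook route URL.
# webhook_fqdn is a hypothetical helper, not part of AIOps or kubectl.
webhook_fqdn() {
    local url="$1"
    url="${url#*://}"            # drop the scheme, e.g. "https://"
    url="${url%%/*}"             # drop the path
    printf '%s\n' "${url%%:*}"   # drop an optional ":port"
}

webhook_fqdn 'https://whconn-d59baea5-a620-4efd-bfdc-bbbce5314530-aiops.aiops-haproxy.gym.lan/webhook-connector/fj3u0bq23tk'
```

For the sample route used in this guide, the function prints `whconn-d59baea5-a620-4efd-bfdc-bbbce5314530-aiops.aiops-haproxy.gym.lan`, which is the value to substitute for `<webhook route>`.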
| 139 | + |
| 140 | +#### Restart CoreDNS |
| 141 | + |
| 142 | +Force CoreDNS to reload the new ConfigMap by restarting the deployment: |
| 143 | + |
| 144 | +```bash |
| 145 | +kubectl -n kube-system rollout restart deployment coredns |
| 146 | +``` |
| 147 | + |
| 148 | +--- |
| 149 | + |
| 150 | +After CoreDNS restarts, Alertmanager can resolve the webhook hostname. Alerts will flow to your AIOps webhook once the receiver in the next step is configured. |
| 151 | + |
| 152 | +--- |
| 153 | + |
| 154 | +### 2. Configure Alertmanager Receiver |
| 155 | + |
| 156 | +Since the Prometheus Operator uses the **AlertmanagerConfig custom resource (CR)**, we define the webhook receiver and routing within this resource. |
| 157 | + |
| 158 | +#### Define the AlertmanagerConfig CR |
| 159 | + |
| 160 | +Create or update the `AlertmanagerConfig` CR (named `aiops-webhook-receiver` in this example) to include the receiver and routing. |
| 161 | + |
| 162 | +Replace the sample webhook route `https://whconn-d59baea5-a620-4efd-bfdc-bbbce5314530-aiops.aiops-haproxy.gym.lan/webhook-connector/fj3u0bq23tk` with |
| 163 | +your actual webhook route and save to a file named `aiops-alertmanagerconfig.yaml`. |
| 164 | + |
| 165 | +```yaml |
| 166 | +apiVersion: monitoring.coreos.com/v1alpha1 |
| 167 | +kind: AlertmanagerConfig |
| 168 | +metadata: |
| 169 | +  name: aiops-webhook-receiver |
| 170 | +  namespace: prometheus-operator # Must be in the same namespace as Alertmanager |
| 171 | +  labels: |
| 172 | +    alertmanagerConfig: main # Must match your Alertmanager CR selector |
| 173 | +spec: |
| 174 | +  # 1. Define the Receiver |
| 175 | +  receivers: |
| 176 | +  - name: 'aiops-webhook-receiver' |
| 177 | +    webhookConfigs: |
| 178 | +    - url: 'https://whconn-d59baea5-a620-4efd-bfdc-bbbce5314530-aiops.aiops-haproxy.gym.lan/webhook-connector/fj3u0bq23tk' # REPLACE |
| 179 | +      sendResolved: true |
| 180 | +      # Required for self-signed certificates |
| 181 | +      httpConfig: |
| 182 | +        tlsConfig: |
| 183 | +          insecureSkipVerify: true |
| 184 | + |
| 185 | +  # 2. Define the Route |
| 186 | +  route: |
| 187 | +    receiver: 'aiops-webhook-receiver' # Route all alerts to the new receiver |
| 188 | +    groupBy: ['alertname', 'severity'] |
| 189 | +    groupWait: 30s |
| 190 | +    groupInterval: 5m |
| 191 | +    repeatInterval: 4h |
| 192 | +``` |
| 193 | + |
| 194 | +#### Apply the Configuration |
| 195 | + |
| 196 | +Apply the manifest: |
| 197 | + |
| 198 | +```bash |
| 199 | +kubectl apply -f aiops-alertmanagerconfig.yaml |
| 200 | +``` |
| 201 | + |
| 202 | +--- |
| 203 | + |
| 204 | +### 3. Alert Lifecycle |
| 205 | + |
| 206 | +This section assumes that you have created a rule in Prometheus that triggers an alert when root filesystem (`/`) usage on an AIOps node exceeds 90%. |
| 207 | + |
| 208 | +#### Trigger Storage Alert |
| 209 | + |
| 210 | +Use the following script, `trigger_disk_alert.sh`, to trigger a storage alert on the root filesystem of an AIOps node. |
| 211 | + |
| 212 | +```bash |
| 213 | +#!/bin/bash |
| 214 | + |
| 215 | +# Configuration |
| 216 | +TARGET_PERCENT=90 |
| 217 | +MOUNT_POINT="/" |
| 218 | +SAFETY_BUFFER_MB=10 # Add 10MB buffer to ensure we pass the threshold |
| 219 | +TARGET_FILE="/tmp/ROOT_FILL_FILE.bin" |
| 220 | + |
| 221 | +echo "--- Disk Usage Alert Trigger ---" |
| 222 | + |
| 223 | +# 1. Get total and used KB for the root filesystem (-P keeps each filesystem on one line) |
| 224 | +STATS=$(df -kP "${MOUNT_POINT}" 2>/dev/null | awk 'NR==2{print $2, $3}') |
| 225 | +if [ -z "${STATS}" ]; then  # the pipeline's exit status is awk's, so test for empty output |
| 226 | +    echo "Error: Failed to get disk statistics for ${MOUNT_POINT}. Exiting." |
| 227 | +    exit 1 |
| 228 | +fi |
| 229 | + |
| 230 | +TOTAL_KB=$(echo "$STATS" | awk '{print $1}') |
| 231 | +USED_KB=$(echo "$STATS" | awk '{print $2}') |
| 232 | +# Available space is not needed for the calculation, so only total and used KB are read |
| 233 | + |
| 234 | +# Calculate percentages using integer arithmetic (multiplying by 100 first for precision) |
| 235 | +CURRENT_PERCENT=$(( (USED_KB * 100) / TOTAL_KB )) |
| 236 | + |
| 237 | +# Convert KB to MB for display purposes only |
| 238 | +TOTAL_MB=$(( TOTAL_KB / 1024 )) |
| 239 | +USED_MB=$(( USED_KB / 1024 )) |
| 240 | + |
| 241 | +echo "Filesystem: ${MOUNT_POINT}" |
| 242 | +echo "Total Size: ${TOTAL_MB} MB" |
| 243 | +echo "Used Size: ${USED_MB} MB (${CURRENT_PERCENT}%)" |
| 244 | +echo "Target: ${TARGET_PERCENT}% usage" |
| 245 | + |
| 246 | +# 2. Check if the disk is already above the target |
| 247 | +# Integer check: If (Current Used KB * 100) is >= (Total KB * Target Percent) |
| 248 | +if [ $(( USED_KB * 100 )) -ge $(( TOTAL_KB * TARGET_PERCENT )) ]; then |
| 249 | +    echo "Current usage (${CURRENT_PERCENT}%) is already above the target (${TARGET_PERCENT}%). No file created." |
| 250 | +    exit 0 |
| 251 | +fi |
| 252 | + |
| 253 | +# 3. Calculate the required KB to reach the target percentage |
| 254 | +# T_target_KB = (TOTAL_KB * TARGET_PERCENT) / 100 |
| 255 | +TARGET_USAGE_KB=$(( (TOTAL_KB * TARGET_PERCENT) / 100 )) |
| 256 | + |
| 257 | +# Calculate buffer size in KB |
| 258 | +SAFETY_BUFFER_KB=$(( SAFETY_BUFFER_MB * 1024 )) |
| 259 | + |
| 260 | +# Required KB = (Target KB - Current Used KB) + Safety Buffer KB |
| 261 | +REQUIRED_KB=$(( TARGET_USAGE_KB - USED_KB + SAFETY_BUFFER_KB )) |
| 262 | + |
| 263 | + |
| 264 | +# 4. Convert required KB to MB (dd count uses 1MB blocks) and round up |
| 265 | +# Use shell arithmetic for simple rounding up: (KB + 1023) / 1024 |
| 266 | +REQUIRED_MB_COUNT=$(( (REQUIRED_KB + 1023) / 1024 )) |
| 267 | + |
| 268 | +# 5. Execute dd command |
| 269 | +echo "--------------------------------------" |
| 270 | +echo "Creating file of size: ${REQUIRED_MB_COUNT} MB at ${TARGET_FILE}" |
| 271 | +echo "This will push usage over ${TARGET_PERCENT}%..." |
| 272 | + |
| 273 | +# Execute the dd command using the calculated count |
| 274 | +# Note: Requires sudo access to write to the filesystem |
| 275 | +sudo dd if=/dev/zero of="${TARGET_FILE}" bs=1M count="${REQUIRED_MB_COUNT}" 2>/dev/null |
| 276 | + |
| 277 | +# 6. Final verification (Use awk to extract the percentage from df -h) |
| 278 | +NEW_PERCENT=$(df -h "${MOUNT_POINT}" | awk 'NR==2{print $5}') |
| 279 | +echo "Creation complete." |
| 280 | +echo "New usage: ${NEW_PERCENT}" |
| 281 | +echo "--------------------------------------" |
| 282 | + |
| 283 | +exit 0 |
| 284 | +``` |
| 285 | + |
| 286 | +Run the script. |
| 287 | + |
| 288 | +```bash |
| 289 | +chmod +x trigger_disk_alert.sh && ./trigger_disk_alert.sh |
| 290 | +``` |
| 291 | + |
| 292 | +Sample output. |
| 293 | + |
| 294 | +``` |
| 295 | +--- Disk Usage Alert Trigger --- |
| 296 | +Filesystem: / |
| 297 | +Total Size: 2916 MB |
| 298 | +Used Size: 2041 MB (69%) |
| 299 | +Target: 90% usage |
| 300 | +-------------------------------------- |
| 301 | +Creating file of size: 594 MB at /tmp/ROOT_FILL_FILE.bin |
| 302 | +This will push usage over 90%... |
| 303 | +Creation complete. |
| 304 | +New usage: 91% |
| 305 | +-------------------------------------- |
| 306 | +``` |
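
The size calculation in the script can be checked in isolation before writing anything to disk. The sketch below repeats the same integer arithmetic in a hypothetical function, `required_fill_mb`, using made-up figures rather than values from a real node.

```shell
#!/bin/bash
# Compute how many 1 MB blocks must be written to push usage past a target.
# Arguments: total KB, used KB, target percent, safety buffer in MB.
# required_fill_mb is a hypothetical helper mirroring the script's arithmetic.
required_fill_mb() {
    local total_kb="$1" used_kb="$2" target_percent="$3" buffer_mb="$4"
    local target_kb=$(( total_kb * target_percent / 100 ))
    local required_kb=$(( target_kb - used_kb + buffer_mb * 1024 ))
    echo $(( (required_kb + 1023) / 1024 ))   # round up to whole MB
}

# Hypothetical figures: 1,000,000 KB total, 500,000 KB used, 90% target, 10 MB buffer.
required_fill_mb 1000000 500000 90 10
```

For these figures the target is 900,000 KB, so 400,000 KB plus the 10,240 KB buffer must be written, which rounds up to 401 MB.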
| 307 | + |
| 308 | +#### Alert in Prometheus |
| 309 | + |
| 310 | +Log in to the Prometheus Explorer Alerts console with your AIOps credentials. The URL is `https://aiops-cpd.<domain>/self-monitoring/explorer/alerts`, where `<domain>` is the |
| 311 | +network domain AIOps is installed on (e.g. [https://aiops-cpd.aiops-haproxy.gym.lan/self-monitoring/explorer/alerts](https://aiops-cpd.aiops-haproxy.gym.lan/self-monitoring/explorer/alerts)). |
| 312 | + |
| 313 | +Within a few minutes you will see a `NodeDiskUsage` alert firing. |
| 314 | + |
| 315 | + |
| 316 | + |
| 317 | +#### Alert in AIOps |
| 318 | + |
| 319 | +In AIOps, navigate to the Alerts list. Here you will see the critical Prometheus alert for High Disk Usage. |
| 320 | + |
| 321 | + |
| 322 | + |
| 323 | +Double-click the alert to open its details. |
| 324 | + |
| 325 | + |
| 326 | + |
| 327 | +#### Resolve Alert |
| 328 | + |
| 329 | +On the same node where you ran the disk usage script, free the disk space by deleting the created file. |
| 330 | + |
| 331 | +```bash |
| 332 | +sudo rm -f /tmp/ROOT_FILL_FILE.bin |
| 333 | +``` |
| 334 | + |
| 335 | +After a few minutes, Prometheus will clear the alert and the clear action will cascade to AIOps. |
| 336 | + |
| 337 | + |