Commit 3a0aa3e

Merge pull request #1 from ibm-client-engineering/feature/update-handbook-docs
Prometheus self monitoring
2 parents 48c7fa9 + 359a0cb commit 3a0aa3e

6 files changed

+339
-0
lines changed

_quarto.yml

Lines changed: 2 additions & 0 deletions
@@ -42,6 +42,8 @@ website:
   href: src/solution_overview/prepare.qmd
 - text: Deploy
   href: src/solution_overview/deploy.qmd
+- text: Configure
+  href: src/solution_overview/configuration.qmd
 - text: Administration
   href: src/solution_overview/administration.qmd
 # - section: Implementation Methodology
Lines changed: 337 additions & 0 deletions
@@ -0,0 +1,337 @@
---
title: "AIOps on Linux Configuration"
format: html
---

# Self Monitoring

---

## Setting Up a Prometheus AlertManager Webhook in AIOps

### 1. Define the Webhook in the AIOps UI
1. Navigate to **Integrations** in the AIOps console and select **Add integration**.
2. Under the **Events** category, select **Prometheus AlertManager** and click **Get started**.
3. Provide a **Name** (e.g. *Prometheus*) and an optional **Description** that identifies the webhook's purpose (e.g., *Prometheus Alerts (Self Monitoring)*).
4. Select **None** for **Authentication type** and click **Next**.

---

### 2. Map Prometheus Alert JSON to AIOps Schema
1. In the webhook configuration screen, locate the **Mapping** section.
2. Use the following JSONata mapping:

```json
(
    /* Set resource based on labels available */
    $resource := function($labels){(
        $name := $labels.name ? $labels.name
            : $labels.node_name ? $labels.node_name
            : $labels.statefulset ? $labels.statefulset
            : $labels.deployment ? $labels.deployment
            : $labels.daemonset ? $labels.daemonset
            : $labels.pod ? $labels.pod
            : $labels.container ? $labels.container
            : $labels.instance ? $labels.instance
            : $labels.app ? $labels.app
            : $labels.job_name ? $labels.job_name
            : $labels.job ? $labels.job
            : $labels.type ? $labels.type : $labels.prometheus;

        /* Conditional Namespace Append */
        $namespace_appended := $labels.namespace ? ($name & '/' & $labels.namespace) : $name;

        /* Check if the determined $name is likely a node/hardware name */
        $is_node_alert := $labels.node_name or $labels.instance;

        $is_node_alert ? $name : $namespace_appended; /* Only append if NOT a node alert */
    )};
    /* Map to event schema */
    alerts.(
        {
            "summary": annotations.summary ? annotations.summary : annotations.description ? annotations.description : annotations.message ? annotations.message,
            "severity": $lowercase(labels.severity) = "critical" ? 6 : $lowercase(labels.severity) = "major" ? 5 : $lowercase(labels.severity) = "minor" ? 4 : $lowercase(labels.severity) = "warning" ? 3 : 1,
            "resource": {
                "name": $resource(labels)
            },
            "type": {
                "eventType": $lowercase(status) = "firing" ? "problem" : "resolution",
                "classification": labels.alertname
            },
            "links": [
                {
                    "url": generatorURL
                }
            ],
            "sender": {
                "name": "Prometheus",
                "type": "Webhook Connector"
            },
            "details": labels
        }
    )
)
```
3. Click **Save**.
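
The severity and event-type rules in the mapping can be hard to read as nested ternaries. The following is a minimal shell sketch of the same two rules, useful for checking what a given Prometheus label will become in AIOps. The function names `aiops_severity` and `aiops_event_type` are hypothetical helpers for illustration and are not part of AIOps or Prometheus.

```bash
#!/usr/bin/env bash

# Hypothetical helper: mirrors the "severity" expression in the JSONata mapping.
aiops_severity() {
  local label
  # The mapping lowercases the label before comparing, so do the same here.
  label=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$label" in
    critical) echo 6 ;;
    major)    echo 5 ;;
    minor)    echo 4 ;;
    warning)  echo 3 ;;
    *)        echo 1 ;;  # any other value, including an empty label
  esac
}

# Hypothetical helper: mirrors the "eventType" expression in the mapping.
aiops_event_type() {
  local status
  status=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  if [ "$status" = "firing" ]; then
    echo problem
  else
    echo resolution
  fi
}

for sev in critical Warning info; do
  printf '%s -> %s\n' "$sev" "$(aiops_severity "$sev")"
done
aiops_event_type firing
aiops_event_type resolved
```

Any severity label outside the four named values falls through to `1`, so alerts without a `severity` label still arrive in AIOps, only at the lowest mapped value.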

---

### 3. Generate the Webhook and Capture the URL
1. The webhook starts initializing. Wait until its status is **Running**.
2. Once the webhook is **Running**, a unique **Webhook route** is displayed (e.g., `https://<aiops-domain>/webhook-connector/<id>`).
3. Copy this URL. It is used in the **AlertmanagerConfig** in Prometheus to send alerts to AIOps.
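
The DNS step later in this guide needs only the host part (FQDN) of the webhook route. As a small sketch, the host can be extracted with shell parameter expansion. The function name `webhook_host` is a hypothetical helper, and the URL is the sample route used in this guide, not a live endpoint.

```bash
#!/usr/bin/env bash

# Sample webhook route from this guide; replace with the route shown in the AIOps UI.
WEBHOOK_ROUTE='https://whconn-d59baea5-a620-4efd-bfdc-bbbce5314530-aiops.aiops-haproxy.gym.lan/webhook-connector/fj3u0bq23tk'

# Hypothetical helper: print only the host part of a URL.
webhook_host() {
  local url=$1
  url=${url#*://}   # drop the scheme, e.g. "https://"
  url=${url%%/*}    # drop the path
  url=${url%%:*}    # drop an optional port
  printf '%s\n' "$url"
}

webhook_host "$WEBHOOK_ROUTE"
# prints: whconn-d59baea5-a620-4efd-bfdc-bbbce5314530-aiops.aiops-haproxy.gym.lan
```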

---

## Prometheus Alertmanager: Webhook Receiver Configuration for AIOps

This section outlines the steps required to configure the **Prometheus Operator's Alertmanager** to send alerts to the AIOps webhook endpoint.

The process involves two main phases:

- **Network Configuration**: Ensuring the webhook FQDN is resolvable within the cluster.
- **Alerting Configuration**: Defining the Alertmanager receiver and routing.

---

### 1. Network Configuration (DNS Resolution)

The Alertmanager pod must be able to resolve the AIOps webhook FQDN (e.g. `whconn-d59baea5-a620-4efd-bfdc-bbbce5314530-aiops.aiops-haproxy.gym.lan`). Because this FQDN is custom and resolves to a specific HAProxy IP (`192.168.252.9` in this example), an entry for it must be added to **CoreDNS**.

#### Update the `coredns-custom` ConfigMap

Edit the `coredns-custom` ConfigMap in the `kube-system` namespace to include the webhook domain, mapping it to your HAProxy IP (`192.168.252.9`). This is needed because the custom domain is not otherwise resolvable from inside the cluster.

**Note**: Replace `192.168.252.9` with your actual HAProxy IP if different. Replace `<webhook route>` with the FQDN from the webhook route generated by AIOps (e.g. `whconn-d59baea5-a620-4efd-bfdc-bbbce5314530-aiops.aiops-haproxy.gym.lan`).

**Additional Note**: The ConfigMap below also contains DNS mappings for the Cloud Pak console and the AIOps UI. These may or may not apply to your environment.

```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  default.server: |
    cp-console-aiops.aiops-haproxy.gym.lan {
        hosts {
            192.168.252.9 cp-console-aiops.aiops-haproxy.gym.lan
            fallthrough
        }
    }
    aiops-cpd.aiops-haproxy.gym.lan {
        hosts {
            192.168.252.9 aiops-cpd.aiops-haproxy.gym.lan
            fallthrough
        }
    }
    <webhook route> {
        hosts {
            192.168.252.9 <webhook route>
            fallthrough
        }
    }
EOF
```
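
Each server block in the ConfigMap has the same shape, so it can be generated rather than typed by hand. The sketch below renders one block for a given hostname and IP. The function name `coredns_hosts_block` is a hypothetical helper, and the hostname and IP are the sample values from this guide.

```bash
#!/usr/bin/env bash

# Hypothetical helper: render one CoreDNS server block that maps a hostname to an IP.
# Arguments: $1 = hostname (FQDN), $2 = IP address
coredns_hosts_block() {
  local host=$1 ip=$2
  cat <<EOF
$host {
    hosts {
        $ip $host
        fallthrough
    }
}
EOF
}

coredns_hosts_block whconn-d59baea5-a620-4efd-bfdc-bbbce5314530-aiops.aiops-haproxy.gym.lan 192.168.252.9
```

The output still has to be indented under the `default.server: |` key before it is pasted into the ConfigMap.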

#### Restart CoreDNS

Force CoreDNS to reload the new ConfigMap by restarting the deployment:

```bash
kubectl -n kube-system rollout restart deployment coredns
```

---

After CoreDNS restarts, Alertmanager can resolve the webhook hostname. Firing alerts reach the AIOps webhook once the receiver described in the next section is configured.

---

### 2. Configure Alertmanager Receiver

Because the Prometheus Operator manages Alertmanager configuration through the **AlertmanagerConfig** custom resource (CR), the webhook receiver and routing are defined in that resource.

#### Define the AlertmanagerConfig CR

Create or update the `AlertmanagerConfig` CR (named `aiops-webhook-receiver` in this example) to include the receiver and routing.

Replace the sample webhook route `https://whconn-d59baea5-a620-4efd-bfdc-bbbce5314530-aiops.aiops-haproxy.gym.lan/webhook-connector/fj3u0bq23tk` with your actual webhook route and save the manifest to a file named `aiops-alertmanagerconfig.yaml`.

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: aiops-webhook-receiver
  namespace: prometheus-operator # Must be in the same namespace as Alertmanager
  labels:
    alertmanagerConfig: main # Must match your Alertmanager CR selector
spec:
  # 1. Define the Receiver
  receivers:
    - name: 'aiops-webhook-receiver'
      webhookConfigs:
        - url: 'https://whconn-d59baea5-a620-4efd-bfdc-bbbce5314530-aiops.aiops-haproxy.gym.lan/webhook-connector/fj3u0bq23tk' # REPLACE
          sendResolved: true
          # Required for self-signed certificates
          httpConfig:
            tlsConfig:
              insecureSkipVerify: true

  # 2. Define the Route
  route:
    receiver: 'aiops-webhook-receiver' # Route all alerts to the new receiver
    groupBy: ['alertname', 'severity']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
```

#### Apply the Configuration

Apply the manifest:

```bash
kubectl apply -f aiops-alertmanagerconfig.yaml
```

---

### 3. Alert Lifecycle

This section assumes that you have created a rule in Prometheus that fires an alert when root filesystem (`/`) usage on an AIOps node exceeds 90%.

#### Trigger Storage Alert

Use the following script, `trigger_disk_alert.sh`, to trigger a storage alert on the root filesystem of an AIOps node.

```bash
#!/bin/bash

# Configuration
TARGET_PERCENT=90
MOUNT_POINT="/"
SAFETY_BUFFER_MB=10 # Add 10MB buffer to ensure we pass the threshold
# Assumes /tmp is on the same filesystem as MOUNT_POINT (not a separate tmpfs)
TARGET_FILE="/tmp/ROOT_FILL_FILE.bin"

echo "--- Disk Usage Alert Trigger ---"

# 1. Get disk statistics for the root filesystem in Kilobytes (KB)
# Uses df -kP: KB units, and one output line per filesystem for reliable parsing.
# The pipeline's exit status is awk's, so check for empty output instead.
STATS=$(df -kP "${MOUNT_POINT}" 2>/dev/null | awk 'NR==2{print $2, $3}')
if [ -z "${STATS}" ]; then
    echo "Error: Failed to get disk statistics for ${MOUNT_POINT}. Exiting."
    exit 1
fi

TOTAL_KB=$(echo "$STATS" | awk '{print $1}')
USED_KB=$(echo "$STATS" | awk '{print $2}')

# Calculate percentages using integer arithmetic (multiplying by 100 first for precision)
CURRENT_PERCENT=$(( (USED_KB * 100) / TOTAL_KB ))

# Convert KB to MB for display purposes only
TOTAL_MB=$(( TOTAL_KB / 1024 ))
USED_MB=$(( USED_KB / 1024 ))

echo "Filesystem: ${MOUNT_POINT}"
echo "Total Size: ${TOTAL_MB} MB"
echo "Used Size: ${USED_MB} MB (${CURRENT_PERCENT}%)"
echo "Target: ${TARGET_PERCENT}% usage"

# 2. Check if the disk is already at or above the target
# Integer check: If (Current Used KB * 100) is >= (Total KB * Target Percent)
if [ $(( USED_KB * 100 )) -ge $(( TOTAL_KB * TARGET_PERCENT )) ]; then
    echo "Current usage (${CURRENT_PERCENT}%) is already at or above the target (${TARGET_PERCENT}%). No file created."
    exit 0
fi

# 3. Calculate the required KB to reach the target percentage
# T_target_KB = (TOTAL_KB * TARGET_PERCENT) / 100
TARGET_USAGE_KB=$(( (TOTAL_KB * TARGET_PERCENT) / 100 ))

# Calculate buffer size in KB
SAFETY_BUFFER_KB=$(( SAFETY_BUFFER_MB * 1024 ))

# Required KB = (Target KB - Current Used KB) + Safety Buffer KB
REQUIRED_KB=$(( TARGET_USAGE_KB - USED_KB + SAFETY_BUFFER_KB ))

# 4. Convert required KB to MB (dd count uses 1MB blocks) and round up
# Use shell arithmetic for simple rounding up: (KB + 1023) / 1024
REQUIRED_MB_COUNT=$(( (REQUIRED_KB + 1023) / 1024 ))

# 5. Execute dd command
echo "--------------------------------------"
echo "Creating file of size: ${REQUIRED_MB_COUNT} MB at ${TARGET_FILE}"
echo "This will push usage over ${TARGET_PERCENT}%..."

# Execute the dd command using the calculated count
# Note: Requires sudo access to write to the filesystem
if ! sudo dd if=/dev/zero of="${TARGET_FILE}" bs=1M count="${REQUIRED_MB_COUNT}" 2>/dev/null; then
    echo "Error: dd failed to create ${TARGET_FILE}. Exiting."
    exit 1
fi

# 6. Final verification (Use awk to extract the percentage from df -h)
NEW_PERCENT=$(df -hP "${MOUNT_POINT}" | awk 'NR==2{print $5}')
echo "Creation complete."
echo "New usage: ${NEW_PERCENT}"
echo "--------------------------------------"

exit 0
```

Run the script:

```bash
chmod +x trigger_disk_alert.sh && ./trigger_disk_alert.sh
```

Sample output:

```
--- Disk Usage Alert Trigger ---
Filesystem: /
Total Size: 2916 MB
Used Size: 2041 MB (69%)
Target: 90% usage
--------------------------------------
Creating file of size: 594 MB at /tmp/ROOT_FILL_FILE.bin
This will push usage over 90%...
Creation complete.
New usage: 91%
--------------------------------------
```
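
The file size in the sample output can be reproduced by repeating the script's arithmetic on its own. The sketch below assumes the total and used sizes are exactly 2916 MB and 2041 MB; the script works from kilobyte values reported by `df`, so real inputs differ slightly from these rounded figures. The function name `required_fill_mb` is a hypothetical helper.

```bash
#!/usr/bin/env bash

# Hypothetical helper: size in MB of the fill file needed to cross the target.
# Arguments: $1 = total KB, $2 = used KB, $3 = target percent (default 90),
#            $4 = safety buffer in MB (default 10)
required_fill_mb() {
  local total_kb=$1 used_kb=$2 target_percent=${3:-90} buffer_mb=${4:-10}
  local target_kb=$(( total_kb * target_percent / 100 ))
  local required_kb=$(( target_kb - used_kb + buffer_mb * 1024 ))
  # Round up to whole MB, as the script does for the dd block count.
  echo $(( (required_kb + 1023) / 1024 ))
}

# Assumed inputs: 2916 MB total and 2041 MB used, converted to KB.
required_fill_mb $(( 2916 * 1024 )) $(( 2041 * 1024 ))
# prints: 594
```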

#### Alert in Prometheus

Log in to the Prometheus Explorer Alerts console with your AIOps credentials. The URL is `https://aiops-cpd.<domain>/self-monitoring/explorer/alerts`, where `<domain>` is the network domain AIOps is installed on (e.g. [https://aiops-cpd.aiops-haproxy.gym.lan/self-monitoring/explorer/alerts](https://aiops-cpd.aiops-haproxy.gym.lan/self-monitoring/explorer/alerts)).

Within a few minutes you will see a `NodeDiskUsage` alert firing.

![](images/prometheus_disk_firing.png)

#### Alert in AIOps

In AIOps, navigate to the Alerts list. Here you will see the critical Prometheus alert for High Disk Usage.

![](images/aiops_alert_disk_usage.png)

Double-click the alert to open the details.

![](images/aiops_alert_disk_usage_details.png)

#### Resolve Alert

On the same node where you ran the disk usage script, free the disk space by deleting the created file.

```bash
sudo rm -f /tmp/ROOT_FILL_FILE.bin
```

After a few minutes, Prometheus will clear the alert and the clear action will cascade to AIOps.

![](images/aiops_alert_disk_usage_cleared.png)