ibm-client-engineering
diff --git a/‎_quarto.yml‎
Lines changed: 2 additions & 0 deletions b/‎_quarto.yml‎
diff --git a/‎src/solution_overview/configuration.qmd‎
Lines changed: 337 additions & 0 deletions b/‎src/solution_overview/configuration.qmd‎
diff --git a/‎src/solution_overview/images/aiops_alert_disk_usage.png‎
49.1 KB b/‎src/solution_overview/images/aiops_alert_disk_usage.png‎
diff --git a/‎src/solution_overview/images/aiops_alert_disk_usage_cleared.png‎
44.5 KB b/‎src/solution_overview/images/aiops_alert_disk_usage_cleared.png‎
diff --git a/‎src/solution_overview/images/aiops_alert_disk_usage_details.png‎
48.6 KB b/‎src/solution_overview/images/aiops_alert_disk_usage_details.png‎
diff --git a/‎src/solution_overview/images/prometheus_disk_firing.png‎
62 KB b/‎src/solution_overview/images/prometheus_disk_firing.png‎
@@ -42,6 +42,8 @@ website:
             href: src/solution_overview/prepare.qmd
           - text: Deploy 
             href: src/solution_overview/deploy.qmd
+          - text: Configure 
+            href: src/solution_overview/configuration.qmd
           - text: Administration
             href: src/solution_overview/administration.qmd
         # - section: Implementation Methodology
 
@@ -0,0 +1,337 @@
+---
+title: "AIOps on Linux Configuration"
+format: html
+---
+
+# Self Monitoring
+
+---
+
+## Setting Up a Prometheus AlertManager Webhook in AIOps
+
+### 1. Define the Webhook in the AIOps UI
+1. Navigate to **Integrations** in the AIOps console and select **Add integration**.
+2. Under the **Events** category, select **Prometheus AlertManager** and click **Get started**.
+3. Provide a **Name** (e.g. *Prometheus*) and optional **description** for the webhook to identify its purpose (e.g., *Prometheus Alerts (Self Monitoring)*).
+4. Select **None** for **Authentication type** and click **Next**.
+
+---
+
+### 2. Map Prometheus Alert JSON to AIOps Schema
+1. In the webhook configuration screen, locate the **Mapping** section.
+2. Use the following JSONata mapping:
+
+```json
+(
+    /* Set resource based on labels available */
+    $resource := function($labels){(
+      $name := $labels.name ? $labels.name
+        : $labels.node_name ? $labels.node_name
+        : $labels.statefulset ? $labels.statefulset
+        : $labels.deployment ? $labels.deployment
+        : $labels.daemonset ? $labels.daemonset
+        : $labels.pod ? $labels.pod
+        : $labels.container ? $labels.container
+        : $labels.instance ? $labels.instance
+        : $labels.app ? $labels.app
+        : $labels.job_name ? $labels.job_name
+        : $labels.job ? $labels.job
+        : $labels.type ? $labels.type: $labels.prometheus;
+
+      /* Conditional Namespace Append */
+      $namespace_appended := $labels.namespace ? ($name & '/' & $labels.namespace) : $name;
+
+      /* Check if the determined $name is likely a node/hardware name */
+      $is_node_alert := $labels.node_name or $labels.instance;
+
+      $is_node_alert ? $name : $namespace_appended; /* Only append if NOT a node alert */
+    )};    
+    /* Map to event schema */
+    alerts.(
+      { 
+        "summary": annotations.summary ? annotations.summary: annotations.description ? annotations.description : annotations.message ? annotations.message,
+        "severity": $lowercase(labels.severity) = "critical" ? 6 : $lowercase(labels.severity) = "major" ? 5 : $lowercase(labels.severity) = "minor" ? 4 : $lowercase(labels.severity) = "warning" ? 3 : 1, 
+        "resource": {
+          "name": $resource(labels)
+        },
+        "type": {
+          "eventType": $lowercase(status) = "firing" ? "problem": "resolution",
+          "classification": labels.alertname
+        },
+        "links": [
+          {
+              "url": generatorURL
+          }
+        ],
+        "sender": {
+          "name": "Prometheus",
+          "type": "Webhook Connector"
+        },
+       "details": labels
+      }
+    )
+  )
+```
+3. Click **Save**.
+
+---
+
+### 3. Generate the Webhook and Capture the URL
+1. The webhook starts initializing. Wait until initialization completes.
+2. A unique **Webhook route** will be displayed (e.g., `https://<aiops-domain>/webhook-connector/<id>`) once the webhook is **Running**.
+3. Copy this URL. It is used in the **AlertmanagerConfig** in Prometheus to send alerts to AIOps.
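+
+Before Alertmanager is configured, the mapping can be exercised by hand. The sketch below is an illustration, not an official AIOps procedure: `build_test_alert` is a hypothetical helper that prints a minimal payload in the shape Alertmanager sends to webhook receivers, and all label and annotation values are made up. With the mapping above, such an alert would be expected to map to severity `6` and resource name `aiops-node-1`.
+
+```bash
+# Hypothetical helper: prints a minimal Alertmanager-style webhook payload.
+# The receiver name, labels, annotations, and generatorURL are invented
+# values for illustration only.
+build_test_alert() {
+    status="${1:-firing}"   # "firing" or "resolved"
+    cat <<EOF
+{
+  "version": "4",
+  "status": "${status}",
+  "receiver": "test",
+  "alerts": [
+    {
+      "status": "${status}",
+      "labels": {
+        "alertname": "NodeDiskUsage",
+        "severity": "critical",
+        "instance": "aiops-node-1"
+      },
+      "annotations": {
+        "summary": "Test alert: root filesystem usage above 90%"
+      },
+      "generatorURL": "https://prometheus.example/graph"
+    }
+  ]
+}
+EOF
+}
+
+# Print the payload for inspection
+build_test_alert firing
+
+# To send it, pipe the payload to curl from a host that can resolve the
+# webhook FQDN (-k skips TLS verification for self-signed certificates):
+# build_test_alert firing | curl -k -X POST -H 'Content-Type: application/json' --data @- '<webhook route>'
+```
+
+Calling `build_test_alert resolved` produces the matching clear event, which should map to the `resolution` event type.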
+
+---
+
+## Prometheus Alertmanager: Webhook Receiver Configuration for AIOps
+
+This section outlines the steps required to configure the **Prometheus Operator's Alertmanager** to successfully send alerts to the AIOps webhook endpoint.
+
+The process involves two main phases:
+
+- **Network Configuration**: Ensuring the webhook FQDN is resolvable within the cluster.
+- **Alerting Configuration**: Defining the Alertmanager receiver and routing.
+
+---
+
+### 1. Network Configuration (DNS Resolution)
+
+The Alertmanager pod must be able to resolve the AIOps webhook FQDN (e.g. `whconn-d59baea5-a620-4efd-bfdc-bbbce5314530-aiops.aiops-haproxy.gym.lan`). Since this FQDN is custom and resolves to a specific HAProxy IP (`192.168.252.9`), the entry must be added to **CoreDNS**.
+
+#### Update the `coredns-custom` ConfigMap
+
+Edit the `coredns-custom` ConfigMap in the `kube-system` namespace to include the webhook domain, mapping it to your HAProxy IP (`192.168.252.9`). This is necessary because the custom domain is not known to the cluster's default DNS configuration.
+
+**Note**: Replace `192.168.252.9` with your actual HAProxy IP if different. Replace `<webhook route>` with the FQDN from the webhook route generated by AIOps (e.g. `whconn-d59baea5-a620-4efd-bfdc-bbbce5314530-aiops.aiops-haproxy.gym.lan`).
+
+**Additional Note**: The ConfigMap below also contains DNS mappings for the Cloud Pak console and the AIOps UI. These may or may not apply to your environment.
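+
+The FQDN is the host portion of the webhook route. A minimal sketch of extracting it with POSIX parameter expansion is shown here; `webhook_fqdn` is a hypothetical helper name, not part of AIOps or `kubectl`.
+
+```bash
+# Hypothetical helper: prints the host part of a webhook route URL.
+webhook_fqdn() {
+    url="$1"
+    host="${url#*://}"   # drop the scheme (https://)
+    host="${host%%/*}"   # drop the path (/webhook-connector/<id>)
+    host="${host%%:*}"   # drop an optional :port
+    printf '%s\n' "$host"
+}
+
+# Using the sample webhook route from this guide
+webhook_fqdn "https://whconn-d59baea5-a620-4efd-bfdc-bbbce5314530-aiops.aiops-haproxy.gym.lan/webhook-connector/fj3u0bq23tk"
+# prints whconn-d59baea5-a620-4efd-bfdc-bbbce5314530-aiops.aiops-haproxy.gym.lan
+```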
+
+```bash
+kubectl apply -f - <<EOF
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: coredns-custom
+  namespace: kube-system
+data:
+  default.server: |
+    cp-console-aiops.aiops-haproxy.gym.lan {
+        hosts {
+              192.168.252.9 cp-console-aiops.aiops-haproxy.gym.lan
+              fallthrough
+        }
+    }
+    aiops-cpd.aiops-haproxy.gym.lan {
+        hosts {
+              192.168.252.9 aiops-cpd.aiops-haproxy.gym.lan
+              fallthrough
+        }
+    }
+    <webhook route> {
+        hosts {
+              192.168.252.9 <webhook route>
+              fallthrough
+        }
+    }
+EOF
+```
+
+#### Restart CoreDNS
+
+Force CoreDNS to reload the new ConfigMap by restarting the deployment:
+
+```bash
+kubectl -n kube-system rollout restart deployment coredns
+```
+
+---
+
+After CoreDNS restarts, the Alertmanager pod can resolve the webhook hostname. Alerts flow to your AIOps webhook once the receiver in the next step is configured.
+
+---
+
+### 2. Configure Alertmanager Receiver
+
+Because the Prometheus Operator manages Alertmanager configuration through the **AlertmanagerConfig custom resource (CR)**, we define the webhook receiver and routing within this resource.
+
+#### Define the AlertmanagerConfig CR
+
+Create or update the `AlertmanagerConfig` CR (named `aiops-webhook-receiver` in this example) to include the receiver and routing.
+
+Replace the sample webhook route `https://whconn-d59baea5-a620-4efd-bfdc-bbbce5314530-aiops.aiops-haproxy.gym.lan/webhook-connector/fj3u0bq23tk` with 
+your actual webhook route and save to a file named `aiops-alertmanagerconfig.yaml`.
+
+```yaml
+apiVersion: monitoring.coreos.com/v1alpha1
+kind: AlertmanagerConfig
+metadata:
+  name: aiops-webhook-receiver
+  namespace: prometheus-operator # Must be in the same namespace as Alertmanager
+  labels:
+    alertmanagerConfig: main # Must match your Alertmanager CR selector
+spec:
+  # 1. Define the Receiver
+  receivers:
+  - name: 'aiops-webhook-receiver'
+    webhookConfigs:
+      - url: 'https://whconn-d59baea5-a620-4efd-bfdc-bbbce5314530-aiops.aiops-haproxy.gym.lan/webhook-connector/fj3u0bq23tk' # REPLACE
+        sendResolved: true
+        # Required for self-signed certificates
+        httpConfig:
+          tlsConfig:
+            insecureSkipVerify: true
+          
+  # 2. Define the Route
+  route:
+    receiver: 'aiops-webhook-receiver' # Route all alerts to the new receiver
+    groupBy: ['alertname', 'severity'] 
+    groupWait: 30s
+    groupInterval: 5m
+    repeatInterval: 4h
+```
+
+#### Apply the Configuration
+
+Apply the manifest:
+
+```bash
+kubectl apply -f aiops-alertmanagerconfig.yaml
+```
+
+---
+
+### 3. Alert Lifecycle
+
+This section assumes that you have created a rule in Prometheus that triggers an alert when the root filesystem (`/`) usage of an AIOps node exceeds 90%.
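+
+If no such rule exists yet, one possible shape is sketched here. It is an illustration under several assumptions, not the rule used to produce the screenshots: it assumes node_exporter metrics (`node_filesystem_avail_bytes`, `node_filesystem_size_bytes`) are being scraped, and the resource name, namespace, `release: prometheus` label, `for: 2m` duration, and summary text are placeholders to adapt. Only the alert name `NodeDiskUsage` and the 90% threshold come from this guide.
+
+```bash
+# Write a hypothetical PrometheusRule manifest. The quoted 'EOF' keeps the
+# shell from expanding $labels inside the {{ ... }} template.
+cat > node-disk-usage-rule.yaml <<'EOF'
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: node-disk-usage
+  namespace: prometheus-operator   # assumption: namespace of your Prometheus
+  labels:
+    release: prometheus            # assumption: must match your Prometheus ruleSelector
+spec:
+  groups:
+    - name: node-disk
+      rules:
+        - alert: NodeDiskUsage
+          # Percentage of the root filesystem in use
+          expr: |
+            (1 - node_filesystem_avail_bytes{mountpoint="/"}
+                 / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 90
+          for: 2m
+          labels:
+            severity: critical
+          annotations:
+            summary: "High Disk Usage on {{ $labels.instance }}"
+EOF
+
+# Then apply it to the cluster:
+# kubectl apply -f node-disk-usage-rule.yaml
+```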
+
+#### Trigger Storage Alert
+
+Use the following script, `trigger_disk_alert.sh`, to trigger a storage alert on the root filesystem of an AIOps node.
+
+```bash
+#!/bin/bash
+
+# Configuration
+TARGET_PERCENT=90
+MOUNT_POINT="/"
+SAFETY_BUFFER_MB=10 # Add 10MB buffer to ensure we pass the threshold
+TARGET_FILE="/tmp/ROOT_FILL_FILE.bin"
+
+echo "--- Disk Usage Alert Trigger ---"
+
+# 1. Get disk statistics for the root filesystem in Kilobytes (KB)
+# Uses df -kP: KB units, and POSIX format so each filesystem stays on one line
+STATS=$(df -kP "${MOUNT_POINT}" 2>/dev/null | awk 'NR==2{print $2, $3}')
+# The pipeline returns awk's exit status, so detect a df failure via empty output
+if [ -z "${STATS}" ]; then
+    echo "Error: Failed to get disk statistics for ${MOUNT_POINT}. Exiting."
+    exit 1
+fi
+
+TOTAL_KB=$(echo "$STATS" | awk '{print $1}')
+USED_KB=$(echo "$STATS" | awk '{print $2}')
+
+# Calculate percentages using integer arithmetic (multiplying by 100 first for precision)
+CURRENT_PERCENT=$(( (USED_KB * 100) / TOTAL_KB ))
+
+# Convert KB to MB for display purposes only
+TOTAL_MB=$(( TOTAL_KB / 1024 ))
+USED_MB=$(( USED_KB / 1024 ))
+
+echo "Filesystem: ${MOUNT_POINT}"
+echo "Total Size: ${TOTAL_MB} MB"
+echo "Used Size:  ${USED_MB} MB (${CURRENT_PERCENT}%)"
+echo "Target:     ${TARGET_PERCENT}% usage"
+
+# 2. Check if the disk is already above the target
+# Integer check: If (Current Used KB * 100) is >= (Total KB * Target Percent)
+if [ $(( USED_KB * 100 )) -ge $(( TOTAL_KB * TARGET_PERCENT )) ]; then
+    echo "Current usage (${CURRENT_PERCENT}%) is already above the target (${TARGET_PERCENT}%). No file created."
+    exit 0
+fi
+
+# 3. Calculate the required KB to reach the target percentage
+# T_target_KB = (TOTAL_KB * TARGET_PERCENT) / 100
+TARGET_USAGE_KB=$(( (TOTAL_KB * TARGET_PERCENT) / 100 ))
+
+# Calculate buffer size in KB
+SAFETY_BUFFER_KB=$(( SAFETY_BUFFER_MB * 1024 ))
+
+# Required KB = (Target KB - Current Used KB) + Safety Buffer KB
+REQUIRED_KB=$(( TARGET_USAGE_KB - USED_KB + SAFETY_BUFFER_KB ))
+
+
+# 4. Convert required KB to MB (dd count uses 1MB blocks) and round up
+# Use shell arithmetic for simple rounding up: (KB + 1023) / 1024
+REQUIRED_MB_COUNT=$(( (REQUIRED_KB + 1023) / 1024 ))
+
+# 5. Execute dd command
+echo "--------------------------------------"
+echo "Creating file of size: ${REQUIRED_MB_COUNT} MB at ${TARGET_FILE}"
+echo "This will push usage over ${TARGET_PERCENT}%..."
+
+# Execute the dd command using the calculated count
+# Note: Requires sudo access to write to the filesystem
+sudo dd if=/dev/zero of="${TARGET_FILE}" bs=1M count="${REQUIRED_MB_COUNT}" 2>/dev/null
+
+# 6. Final verification (Use awk to extract the percentage from df -h)
+NEW_PERCENT=$(df -h "${MOUNT_POINT}" | awk 'NR==2{print $5}')
+echo "Creation complete."
+echo "New usage: ${NEW_PERCENT}"
+echo "--------------------------------------"
+
+exit 0
+```
+
+Run the script.
+
+```bash
+chmod +x trigger_disk_alert.sh && ./trigger_disk_alert.sh
+```
+
+Sample output.
+
+```
+--- Disk Usage Alert Trigger ---
+Filesystem: /
+Total Size: 2916 MB
+Used Size:  2041 MB (69%)
+Target:     90% usage
+--------------------------------------
+Creating file of size: 594 MB at /tmp/ROOT_FILL_FILE.bin
+This will push usage over 90%...
+Creation complete.
+New usage: 91%
+--------------------------------------
+```
+
+#### Alert in Prometheus
+
+Log in to the Prometheus Explorer Alerts console with your AIOps credentials. The URL is `https://aiops-cpd.<domain>/self-monitoring/explorer/alerts`, where `<domain>` is the
+network domain AIOps is installed on (e.g. [https://aiops-cpd.aiops-haproxy.gym.lan/self-monitoring/explorer/alerts](https://aiops-cpd.aiops-haproxy.gym.lan/self-monitoring/explorer/alerts)).
+
+Within a few minutes you will see a `NodeDiskUsage` alert firing.
+
+![](images/prometheus_disk_firing.png)
+
+#### Alert in AIOps
+
+In AIOps, navigate to the Alerts list. Here you will see the critical Prometheus alert for High Disk Usage.
+
+![](images/aiops_alert_disk_usage.png)
+
+Double-click the alert to open its details.
+
+![](images/aiops_alert_disk_usage_details.png)
+
+#### Resolve Alert
+
+On the same node where you ran the disk usage script, free the disk space by deleting the created file.
+
+```bash
+sudo rm -f /tmp/ROOT_FILL_FILE.bin
+```
+
+After a few minutes, Prometheus will clear the alert and the clear action will cascade to AIOps.
+
+![](images/aiops_alert_disk_usage_cleared.png)