On my MicroK8s cluster consisting of 3 units, after relating the microk8s charm to grafana-agent over both the juju-info and cos-agent relations, I ended up with 4 active alerts: KubeAPIDown, KubeControllerManagerDown, KubeletDown, KubeSchedulerDown.
The root cause seems to be that these alerts rely on the presence of the juju_charm label, which is missing in my environment:
One of the alert rules:
absent(up{job="apiserver",juju_application="microk8s",juju_charm="grafana-agent",juju_model="microk8s",juju_model_uuid="57280f89-7c62-4703-8622-02de020641d2"} == 1)
count(up{job="apiserver",juju_application="microk8s",juju_charm="grafana-agent",juju_model="microk8s",juju_model_uuid="57280f89-7c62-4703-8622-02de020641d2"})
Empty query result
versus (without the juju_charm label in the query, the result is 3, as expected):
count(up{job="apiserver",juju_application="microk8s",juju_model="microk8s",juju_model_uuid="57280f89-7c62-4703-8622-02de020641d2"})
{} 3
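For illustration, a minimal sketch of what the alert expression could look like with the juju_charm matcher dropped, so it matches the series that actually exist in my environment (hypothetical; the upstream rule template may differ):

```
absent(up{job="apiserver",juju_application="microk8s",juju_model="microk8s",juju_model_uuid="57280f89-7c62-4703-8622-02de020641d2"} == 1)
```

With the matcher removed, absent() no longer fires spuriously, since the up series are present and equal to 1.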
The microk8s cluster itself is healthy, and all services are running.
Another alert-related problem I discovered is that the client certificate expiration alerts fire very close to the actual expiration date. Entering the critical state only 24 hours before expiry leaves little time for the cluster administrator to react in real-life scenarios.
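As a sketch of what an earlier heads-up could look like, assuming the rule is based on the standard apiserver_client_certificate_expiration_seconds histogram exposed by the API server (the metric name is an assumption; the charm's actual rule may use a different source), a warning-severity rule at 7 days might be:

```yaml
# Hypothetical warning-level rule: fire when the 1st percentile of
# observed client certificate lifetimes drops below 7 days.
- alert: KubeClientCertificateExpirationWarning
  expr: |
    apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0
    and on(job)
    histogram_quantile(0.01,
      sum by (job, le) (
        rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m])
      )
    ) < 7 * 24 * 3600
  labels:
    severity: warning
```

Keeping the existing 24-hour rule as critical and adding an earlier warning tier would give administrators time to rotate certificates before the situation becomes urgent.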
Versions:
- juju 2.9.32
- microk8s charm: latest/edge, rev 115
- microk8s snap: v1.28.0 5788 1.28/stable
- grafana-agent charm: rev 12, latest/candidate