A set of scripts to run basic checks on an OpenShift cluster. PRs welcome!
⚠️ This is an unofficial tool; don't blame us if it breaks your cluster.
$ ./openshift-checks.sh -h
Usage: openshift-checks.sh [-h]
This script will run a minimum set of checks to an OpenShift cluster
Available options:
-h, --help Print this help and exit
-v, --verbose Print script debug info
-l, --list Lists the available checks
-s <script>, --single <script> Executes only the provided script
--no-info Disable cluster info commands (default: enabled)
--no-checks Disable cluster check commands (default: enabled)
--no-ssh Disable ssh-based check commands (default: enabled)
--prechecks path/to/install-config.yaml Executes only prechecks (default: disabled)
--results-only Only shows pass/fail results from checks (default: disabled)
With no options, it will run all checks and info commands with no debug info

An automated container build of this repository's main branch is available at quay.io/rhsysdeseng/openshift-checks.
You can use it with your own kubeconfig file and any parameters you need, for example:
$ podman run -it --rm -v /home/foobar/kubeconfig:/kubeconfig:Z -e KUBECONFIG=/kubeconfig quay.io/rhsysdeseng/openshift-checks:latest -h

You can even create a handy alias:
$ alias openshift-checks="podman run -it --rm -v /home/foobar/kubeconfig:/kubeconfig:Z -e KUBECONFIG=/kubeconfig quay.io/rhsysdeseng/openshift-checks:latest"

Then, simply run it as:
$ openshift-checks -s info/00-clusterversion
Using default/api-foobar-example-com:6443/system:admin context
...

Note: If your kubeconfig file doesn't have the proper permissions, you may get the error "KUBECONFIG not set". In that case, verify that the kubeconfig file is readable by the user that runs inside the container, or just run chmod o+r kubeconfig on your host.
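The permission note above can be checked before starting the container. The following is a minimal sketch, not part of this repository: the helper name others_can_read and the default path are assumptions for illustration, and it only covers the "chmod o+r" case (it does not know which user the container actually runs as).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the file exists and "others" have read permission.
others_can_read() {
  local file=$1
  # find prints the path only when the o+r bit (004) is set on it.
  [ -f "$file" ] && [ -n "$(find "$file" -prune -perm -004)" ]
}

# Example path taken from the podman commands above; pass your own as the first argument.
kubeconfig=${1:-/home/foobar/kubeconfig}
if others_can_read "$kubeconfig"; then
  echo "OK: $kubeconfig is readable by other users"
else
  echo "WARNING: $kubeconfig is missing or not readable by other users (try: chmod o+r $kubeconfig)"
fi
```

If the warning is printed, fixing the permissions on the host and rerunning the podman command should be enough in the common case.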
You can build your own container with the included Containerfile:
$ podman build --tag foobar/openshiftchecks .
STEP 1: FROM registry.access.redhat.com/ubi8/ubi:latest
...
$ podman push foobar/openshiftchecks
...

Then, run it by replacing quay.io/rhsysdeseng/openshift-checks:latest with your own image,
such as foobar/openshiftchecks:latest:
$ podman run -it --rm -v /home/foobar/kubeconfig:/kubeconfig:Z -e KUBECONFIG=/kubeconfig foobar/openshiftchecks:latest -h
Usage: openshift-checks.sh [-h]
...

The checks can be scheduled to run periodically in an OpenShift cluster by creating a CronJob.
Check the cronjob.yaml example.
The openshift-checks.sh script is just a wrapper around bash scripts located
in the info, checks or ssh directories.
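To illustrate the wrapper idea, here is a minimal sketch of a loop that runs every executable file in a set of directories and reports pass/fail. It is not the actual openshift-checks.sh code: the function name run_scripts and the PASS/FAIL output format are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Sketch only: run every executable file found in the given directories.
# Returns 0 if all scripts succeeded, 1 if at least one exited non-zero.
run_scripts() {
  local failed=0 dir script
  for dir in "$@"; do
    for script in "$dir"/*; do
      # Skip anything that is not an executable regular file
      # (this also covers an empty directory, where the glob stays unexpanded).
      if [ ! -f "$script" ] || [ ! -x "$script" ]; then
        continue
      fi
      if "$script"; then
        echo "PASS ${script}"
      else
        echo "FAIL ${script}"
        failed=$((failed + 1))
      fi
    done
  done
  echo "${failed} script(s) failed"
  [ "$failed" -eq 0 ]
}
```

With this sketch, something like run_scripts info checks ssh would walk the three directories in order; the real script adds option handling on top of that.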
Scripts in the checks directory:

| Script | Description |
|---|---|
| alertmanager | Checks if there are warning or error alerts firing |
| bz1948052 | Checks if the node is using a kernel version affected by BZ1948052 |
| chronyc | Checks if the worker clocks are synced using chronyc |
| clusterversion_errors | Checks if there are clusterversion errors |
| csr | Checks if there are pending CSRs |
| ctrlnodes | Checks if any controller nodes have had the NoSchedule taint removed |
| entropy | Checks if the workers have enough entropy |
| iptables-22623-22624 | Checks if the nodes' iptables rules are blocking 22623/tcp or 22624/tcp |
| mcp | Checks if there are degraded mcp |
| mellanox-firmware-version | Checks if the nodes' Mellanox ConnectX-4 firmware version is below the recommended version |
| nodes | Checks if there are not ready or not schedulable nodes |
| notrunningpods | Checks if there are not running pods |
| operators | Checks if there are operators in 'bad' state |
| pdb | Checks if there are PodDisruptionBudgets with 0 disruptions allowed |
| port-thrashing | Checks if there are OVN pods thrashing |
| pvc | Checks if there are persistent volume claims that are not bound |
| restarts | Checks if there are pods restarted > n times (10 by default) |
| sriov | Checks if the SR-IOV network state is synced |
| terminating | Checks if there are pods terminating |
| ovn-pods-memory-usage | Checks if the memory usage of the OVN pods is under the LIMIT threshold |
| zombies | Checks if more than 5 zombie processes exist on the hosts |

Scripts in the ssh directory:

| Script | Description |
|---|---|
| bz1941840 | Checks if the authentication-operator is using excessive RAM, which can lead to a hung kubelet (BZ1941840) |

Scripts in the info directory:

| Script | Description |
|---|---|
| clusterversion | Show the clusterversion |
| clusteroperators | Show the clusteroperators |
| nodes | Show the nodes status |
| pods | Show the pods running in the cluster |
| machineset | Show the machinesets status |
| biosversion | Show the nodes' BIOS version |
| bmh-machine-node | Show the node, machine and bmh relationship |
| container-images-running | Show the images of the containers running in the cluster |
| container-images-stored | Show the container images stored in the cluster hosts |
| ethtool-firmware-version | Show the nodes' NIC firmware version using ethtool |
| mtu | Show the nodes' MTU for some interfaces |
| node-versions | Show node components versions such as kubelet, crio, kernel, etc. |
| ovs-hostnames | Show the ovs database chassis hostnames |
| locks | List all pods with locks on each node |

Prechecks (run with --prechecks):

| Script | Description |
|---|---|
| install-config-valid-yaml | Checks if the install-config.yaml file is a valid yaml file |
| dns-hostnames | Checks if the api and wildcard DNS entries are correct |

Environment variables:

| Environment variable | Default value | Description |
|---|---|---|
| INTEL_IDS | 8086:158b | Intel device IDs to check for firmware. Can be overridden for non-supported NICs. |
| OCDEBUGIMAGE | registry.redhat.io/rhel8/support-tools:latest | Used by oc debug. |
| OSETOOLSIMAGE | registry.redhat.io/openshift4/ose-tools-rhel8:latest | Used by oc debug in ethtool-firmware-version |
| RESTART_THRESHOLD | 10 | Used by the restarts script. |
| THRASHING_THRESHOLD | 10 | Used by the port-thrashing script. |
| PARALLELJOBS | 1 | By default, all the oc debug commands run serially; set this variable to a value greater than 1 to run them in parallel. |
| OVN_MEMORY_LIMIT | 5000 | Used by the ovn-pods-memory-usage script to set the maximum memory LIMIT (in Mi) to trigger the warning. |
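
As an illustration of how such a default can be honored, the sketch below uses the usual shell pattern of falling back to the documented value when the variable is unset. The function names restart_threshold and too_many_restarts are hypothetical and are not taken from the restarts script itself.

```shell
#!/usr/bin/env bash
# Sketch: use RESTART_THRESHOLD from the environment, or the documented default of 10.
restart_threshold() {
  echo "${RESTART_THRESHOLD:-10}"
}

# Hypothetical helper: succeeds when a pod's restart count is above the threshold.
too_many_restarts() {
  local restarts=$1
  [ "$restarts" -gt "$(restart_threshold)" ]
}
```

Assuming the scripts read these variables from the environment as the table suggests, an override would look like RESTART_THRESHOLD=20 ./openshift-checks.sh on the host, or an extra -e RESTART_THRESHOLD=20 in the podman run command when using the container image.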
The current intel-firmware-version and mellanox-firmware-version checks only check the firmware version of the NICs supported by the SR-IOV operator (in 4.6).
You can add your own device ID if needed by modifying the script (hint: the
variable is called IDS and the format is vendorID_A:deviceID_A vendorID_B:deviceID_B).
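To make the IDS format concrete, here is a small sketch of a space-separated vendorID:deviceID list and a lookup against it. The helper id_is_listed is hypothetical, and the second ID in the list is only an illustrative value, not a recommendation.

```shell
#!/usr/bin/env bash
# Space-separated vendorID:deviceID pairs.
# 8086:158b is the documented INTEL_IDS default; 15b3:1015 is an illustrative extra entry.
IDS="8086:158b 15b3:1015"

# Hypothetical helper: succeeds if the given vendorID:deviceID pair is in IDS.
id_is_listed() {
  local wanted=$1 id
  # IDS is intentionally unquoted so that it splits on whitespace.
  for id in $IDS; do
    if [ "$id" = "$wanted" ]; then
      return 0
    fi
  done
  return 1
}
```

The numeric IDs of the devices in a host can typically be read from lspci -n output, which uses the same vendor:device notation.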
Add a new script to get some information or to perform some check in the proper folder and create a pull request.
You can pipe the script to mail and if there are any errors, an email will be
sent.
First, configure postfix (already included in RHEL8) as a relay host (see https://access.redhat.com/solutions/217503). As an example:
- Append the following settings in /etc/postfix/main.cf:
myhostname = kni1-bootstrap.example.com
relayhost = smtp.example.com
- Restart the postfix service:
sudo systemctl restart postfix
- Test it:
echo "Hola" | mail -s 'Subject' johndoe@example.com

Then, run the script as:
/openshift-checks.sh > /tmp/oc-errors 2>&1 || mail -s "Something has failed" johndoe@example.com < /tmp/oc-errors

As a bonus, you can include this in a cronjob for periodic checks.
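
For periodic runs, the run-and-mail command above can be wrapped so that each run uses its own temporary file instead of a fixed /tmp path. This is only a sketch under assumptions: run_and_notify and notify are hypothetical names, and notify simply reuses the mail invocation from the example above.

```shell
#!/usr/bin/env bash
# Hypothetical notifier: reads the captured output on stdin and mails it.
notify() {
  mail -s "Something has failed" johndoe@example.com
}

# Sketch: run a command, capture stdout and stderr, and notify only on failure.
# The exit status of the wrapped command is preserved.
run_and_notify() {
  local log status=0
  log=$(mktemp)
  "$@" > "$log" 2>&1 || status=$?
  if [ "$status" -ne 0 ]; then
    notify < "$log"
  fi
  rm -f "$log"
  return "$status"
}
```

A cron entry could then call a small script that sources these functions and runs, for example, run_and_notify ./openshift-checks.sh --results-only.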
Using risu.py requires installing the Python requirements listed in the requirements.txt file, preferably within a virtual environment. Once those are installed, execute:
./risu.py -l

This automatically executes the tests against the current environment and generates two output files:
- osc.json
- osc.html
When loaded over a web server, the html file will pull the json file over AJAX and represent the results of the tests in a graphical way.
