openshift-checks

A set of scripts to run basic checks on an OpenShift cluster. PRs welcome!

⚠️ This is an unofficial tool, don't blame us if it breaks your cluster

Usage

$ ./openshift-checks.sh -h
Usage: openshift-checks.sh [-h]

This script will run a minimum set of checks to an OpenShift cluster

Available options:

-h, --help                               Print this help and exit
-v, --verbose                            Print script debug info
-l, --list                               Lists the available checks
-s <script>, --single <script>           Executes only the provided script
--no-info                                Disable cluster info commands (default: enabled)
--no-checks                              Disable cluster check commands (default: enabled)
--no-ssh                                 Disable ssh-based check commands (default: enabled)
--prechecks path/to/install-config.yaml  Executes only prechecks (default: disabled)
--results-only                           Only shows pass/fail results from checks (default: disabled)

With no options, it will run all checks and info commands with no debug info

Container

A container image is built automatically from the main branch of this repository and is available at quay.io/rhsysdeseng/openshift-checks.

Run it with your own kubeconfig file and any parameters you need:

$ podman run -it --rm -v /home/foobar/kubeconfig:/kubeconfig:Z -e KUBECONFIG=/kubeconfig quay.io/rhsysdeseng/openshift-checks:latest -h

You can even create a handy alias:

$ alias openshift-checks="podman run -it --rm -v /home/foobar/kubeconfig:/kubeconfig:Z -e KUBECONFIG=/kubeconfig quay.io/rhsysdeseng/openshift-checks:latest"

Then, simply run it as:

$ openshift-checks -s info/00-clusterversion
Using default/api-foobar-example-com:6443/system:admin context
...
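
An alias hard-codes the kubeconfig path. As a possible alternative, a small shell function can take the path from the environment. This is only a sketch: the function name `openshift_checks` and the fallback to `$HOME/.kube/config` are illustrative assumptions, and it assumes `KUBECONFIG` points to a single file rather than a colon-separated list.

```shell
# Hypothetical wrapper: same podman invocation as the alias above,
# but the kubeconfig path comes from $KUBECONFIG (single file assumed).
openshift_checks() {
  local kubeconfig="${KUBECONFIG:-$HOME/.kube/config}"
  podman run -it --rm \
    -v "${kubeconfig}:/kubeconfig:Z" \
    -e KUBECONFIG=/kubeconfig \
    quay.io/rhsysdeseng/openshift-checks:latest "$@"
}

# Usage (not executed here):
#   openshift_checks -l
```

Unlike the alias, the function picks up a different kubeconfig whenever `KUBECONFIG` changes in the current shell.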

Note: If your kubeconfig file doesn't have the proper permissions, you may get the error "KUBECONFIG not set". In that case, verify that the user inside the container can read the kubeconfig file, or simply run chmod o+r kubeconfig on your host.
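
The permission check from the note can be scripted before starting the container. The helper below is a minimal sketch, not part of the repository: the name `ensure_kubeconfig_readable` is hypothetical, and it relies on `find -perm -o=r` to test the "others" read bit.

```shell
# Add the others-read bit to the kubeconfig only when it is missing.
# Returns non-zero if the file does not exist.
ensure_kubeconfig_readable() {
  local file="$1"
  if [ ! -f "$file" ]; then
    echo "kubeconfig not found: $file" >&2
    return 1
  fi
  # find prints the path only when the others-read bit is set
  if [ -z "$(find "$file" -maxdepth 0 -perm -o=r)" ]; then
    chmod o+r "$file"
  fi
}

# Usage (not executed here):
#   ensure_kubeconfig_readable /home/foobar/kubeconfig
```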

Build your own container

You can build your own container with the included Containerfile:

$ podman build --tag foobar/openshiftchecks .
STEP 1: FROM registry.access.redhat.com/ubi8/ubi:latest
...
$ podman push foobar/openshiftchecks
...

Then, run it by replacing quay.io/rhsysdeseng/openshift-checks:latest with your own image, such as foobar/openshiftchecks:latest:

$ podman run -it --rm -v /home/foobar/kubeconfig:/kubeconfig:Z -e KUBECONFIG=/kubeconfig foobar/openshiftchecks:latest -h
Usage: openshift-checks.sh [-h]
...

CronJob

The checks can be scheduled to run periodically in an OpenShift cluster by creating a CronJob.

Check the cronjob.yaml example.

How it works

The openshift-checks.sh script is just a wrapper around bash scripts located in the info, checks or ssh directories.
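
To illustrate the wrapper idea, the sketch below runs every file found in the given directories with bash and reports which ones failed. It is not the actual openshift-checks.sh code: the function name `run_scripts` and the assumption that a failing check exits with a non-zero status are illustrative only.

```shell
# Sketch only: run each file in the given directories with bash and
# report PASS/FAIL based on its exit status.
run_scripts() {
  local failed=0 dir script
  for dir in "$@"; do
    for script in "$dir"/*; do
      [ -f "$script" ] || continue
      if bash "$script"; then
        echo "PASS $script"
      else
        echo "FAIL $script"
        failed=$((failed + 1))
      fi
    done
  done
  # Succeed only when no script failed
  [ "$failed" -eq 0 ]
}

# Usage (not executed here):
#   run_scripts info checks ssh
```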

Checks

| Script | Description |
| --- | --- |
| alertmanager | Checks if there are warning or error alerts firing |
| bz1948052 | Checks if the node is using a kernel version affected by BZ1948052 |
| chronyc | Checks if the worker clocks are synced using chronyc |
| clusterversion_errors | Checks if there are clusterversion errors |
| csr | Checks if there are pending csr |
| ctrlnodes | Checks if any controller nodes have had the NoSchedule taint removed |
| entropy | Checks if the workers have enough entropy |
| iptables-22623-22624 | Checks if the nodes' iptables rules are blocking 22623/tcp or 22624/tcp |
| mcp | Checks if there are degraded mcp |
| mellanox-firmware-version | Checks if the nodes' Mellanox ConnectX-4 firmware version is below the recommended version |
| nodes | Checks if there are not ready or not schedulable nodes |
| notrunningpods | Checks if there are not running pods |
| operators | Checks if there are operators in 'bad' state |
| pdb | Checks if there are PodDisruptionBudgets with 0 disruptions allowed |
| port-thrashing | Checks if there are OVN pods thrashing |
| pvc | Checks if there are persistent volume claims that are not bound |
| restarts | Checks if there are pods restarted > n times (10 by default) |
| sriov | Checks if the SR-IOV network state is synced |
| terminating | Checks if there are pods terminating |
| ovn-pods-memory-usage | Checks if the memory usage of the OVN pods is under the LIMIT threshold |
| zombies | Checks if more than 5 zombie processes exist on the hosts |

SSH Checks

| Script | Description |
| --- | --- |
| bz1941840 | Checks if the authentication-operator is using excessive RAM -> hung kubelet BZ1941840 |

Info

| Script | Description |
| --- | --- |
| clusterversion | Show the clusterversion |
| clusteroperators | Show the clusteroperators |
| nodes | Show the nodes status |
| pods | Show the pods running in the cluster |
| machineset | Show the machinesets status |
| biosversion | Show the nodes' BIOS version |
| bmh-machine-node | Show the node, machine and bmh relationship |
| container-images-running | Show the images of the containers running in the cluster |
| container-images-stored | Show the container images stored in the cluster hosts |
| ethtool-firmware-version | Show the nodes' NIC firmware version using ethtool |
| mtu | Show the nodes' MTU for some interfaces |
| node-versions | Show node components versions such as kubelet, crio, kernel, etc. |
| ovs-hostnames | Show the ovs database chassis hostnames |
| locks | List all pods with locks on each node |

Prechecks

| Script | Description |
| --- | --- |
| install-config-valid-yaml | Checks if the install-config.yaml file is a valid yaml file |
| dns-hostnames | Checks if the api and wildcard DNS entries are correct |
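
As an illustration of what a DNS precheck has to verify, the sketch below builds the API name and one name under the wildcard from the cluster name and base domain (`metadata.name` and `baseDomain` in install-config.yaml) and tries to resolve them. It is not the repository's dns-hostnames script: the function names, the `probe` label and the use of `getent` are assumptions made for this example.

```shell
# Print the names a DNS precheck would need to resolve for a cluster.
# "probe" is an arbitrary label: any name under *.apps should resolve.
expected_dns_names() {
  local cluster="$1" base_domain="$2"
  echo "api.${cluster}.${base_domain}"
  echo "probe.apps.${cluster}.${base_domain}"
}

# Try to resolve each name; dig or host could be used instead of getent.
check_dns_names() {
  local name rc=0
  for name in $(expected_dns_names "$1" "$2"); do
    if getent hosts "$name" > /dev/null; then
      echo "OK   $name"
    else
      echo "FAIL $name"
      rc=1
    fi
  done
  return "$rc"
}

# Usage (not executed here):
#   check_dns_names mycluster example.com
```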

Environment variables

| Environment variable | Default value | Description |
| --- | --- | --- |
| INTEL_IDS | 8086:158b | Intel device IDs to check for firmware. Can be overridden for non-supported NICs. |
| OCDEBUGIMAGE | registry.redhat.io/rhel8/support-tools:latest | Used by oc debug. |
| OSETOOLSIMAGE | registry.redhat.io/openshift4/ose-tools-rhel8:latest | Used by oc debug in ethtool-firmware-version. |
| RESTART_THRESHOLD | 10 | Used by the restarts script. |
| THRASHING_THRESHOLD | 10 | Used by the port-thrashing script. |
| PARALLELJOBS | 1 | The oc debug commands run serially unless this variable is set to a value greater than 1. |
| OVN_MEMORY_LIMIT | 5000 | Used by the ovn-pods-memory-usage script as the maximum memory limit (in Mi) before a warning is triggered. |
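
The variables can be set for a single run without exporting them. The sketch below shows the usual shell pattern for honouring an override while keeping a default; the function `effective_settings` is hypothetical and only demonstrates the pattern with the defaults listed above, it is not taken from the scripts themselves.

```shell
# Pattern for honouring an override while keeping the documented default
# (10 for RESTART_THRESHOLD, 1 for PARALLELJOBS).
effective_settings() {
  echo "RESTART_THRESHOLD=${RESTART_THRESHOLD:-10}"
  echo "PARALLELJOBS=${PARALLELJOBS:-1}"
}

# Override for a single run (not executed here):
#   RESTART_THRESHOLD=20 PARALLELJOBS=4 ./openshift-checks.sh --no-info
```

When running through the container image, the variables would instead need to be passed to the container, for example with podman's `-e` option.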

About firmware version

The current intel-firmware-version and mellanox-firmware-version checks only check the firmware version of the NICs supported by the SR-IOV operator (in 4.6).

You can add your own device ID if needed by modifying the script. Hint: the variable is called IDS and the format is vendorID_A:deviceID_A vendorID_B:deviceID_B.
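
To make the ID format concrete, the sketch below splits a space-separated list of vendorID:deviceID pairs into its parts. The function name `split_ids` is hypothetical and the second ID in the usage line is only an illustrative value; the actual scripts may process the list differently.

```shell
# Split a space-separated list of vendorID:deviceID pairs.
split_ids() {
  local id
  for id in $1; do   # unquoted on purpose: split the list on whitespace
    echo "vendor=${id%%:*} device=${id#*:}"
  done
}

# Usage (not executed here):
#   split_ids "8086:158b 15b3:1015"
```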

Collaborate

To contribute, add a new script that gathers some information or performs some check to the proper folder, then create a pull request.

Tips & Tricks

Send an email if some check fails

You can redirect the script output to a file and send that file by mail whenever the script reports an error.

First you can configure postfix (already included in RHEL8) as relay host (see https://access.redhat.com/solutions/217503). As an example:

  • Append the following settings in /etc/postfix/main.cf:
myhostname = kni1-bootstrap.example.com
relayhost = smtp.example.com
  • Restart the postfix service:
sudo systemctl restart postfix
  • Test it:
echo "Hola" | mail -s 'Subject' johndoe@example.com

Then, run the script as:

./openshift-checks.sh > /tmp/oc-errors 2>&1 || mail -s "Something has failed" johndoe@example.com < /tmp/oc-errors
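
The same idea can be wrapped in a function that uses a temporary file instead of a fixed path and preserves the exit status of the checks. This is a sketch under assumptions: `notify_on_failure` and the `MAILER` override are hypothetical names introduced here, and `mail` is assumed to read the message body from standard input as in the command above.

```shell
# Run a command, mail its combined output if it fails, keep its exit status.
# MAILER is a hypothetical override hook; it defaults to the mail command.
notify_on_failure() {
  local recipient="$1" log rc=0
  shift
  log="$(mktemp)"
  "$@" > "$log" 2>&1 || rc=$?
  if [ "$rc" -ne 0 ]; then
    "${MAILER:-mail}" -s "Something has failed" "$recipient" < "$log"
  fi
  rm -f "$log"
  return "$rc"
}

# Usage (not executed here):
#   notify_on_failure johndoe@example.com ./openshift-checks.sh
```

Returning the original exit status lets a calling cron wrapper or CI job still see that the checks failed.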

As a bonus you can include this in a cronjob for periodic checks.

Get JSON and HTML output

This requires installing the Python requirements listed in the requirements.txt file, preferably within a virtual environment. Once those are installed, execute:

./risu.py -l

This automatically executes the tests against the current environment and generates two output files:

  • osc.json
  • osc.html

When loaded over a web server, the html file will pull the json file over AJAX and display the results of the tests graphically.
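
One possible way to prepare the environment and view the report locally is sketched below. The function names, the `.venv` directory and port 8000 are assumptions for this example; any web server that serves osc.html and osc.json from the same directory should work.

```shell
# Create a virtual environment and install the Python requirements.
setup_venv() {
  python3 -m venv .venv &&
    . .venv/bin/activate &&
    pip install -r requirements.txt
}

# Serve the current directory so osc.html can fetch osc.json over HTTP.
serve_report() {
  python3 -m http.server "${1:-8000}"
}

# Usage (not executed here):
#   setup_venv && ./risu.py -l && serve_report
```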
