SDN-C Site Health Determination

Casablanca

This functionality was introduced in the Casablanca release.

In the Beijing release, the Kubernetes dashboard was suggested for monitoring the general health of a site (see 5. Install and Use Kubernetes UI).

Overview

For an operator or PROM to decide whether one site should be made active over another, each site's ability to process messaging must first be ascertained.

Manually checking site health

To manually check the health of a site, the operator can run the sdnc.monitor script from the Kubernetes master of the site in question. The release name is a required argument; the namespace defaults to onap if not specified.

sdnc.monitor
ubuntu@k8s-s2-master:~/oom/kubernetes/sdnc/resources/geo/bin$ ./sdnc.monitor dev
healthy

This version of the script is a wrapper: it uses kubectl to reach the PROM pod and run the sdnc.monitor script there, which is the script that performs the health checks on the components in the site.
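As an illustration of the wrapper idea only, the sketch below builds a kubectl exec command line that would run the in-pod script. It is not the actual wrapper script: the function name build_monitor_command is hypothetical, the pod name is passed in rather than looked up from the release name, and the in-pod path /app/bin/sdnc.monitor is assumed from the direct-invocation example on this page.

```python
from typing import List


def build_monitor_command(pod: str, namespace: str = "onap",
                          debug: bool = False) -> List[str]:
    """Build an argv list that runs sdnc.monitor inside the given PROM pod.

    Hypothetical helper: the real wrapper resolves the pod from the release
    name, which is not reproduced here.
    """
    # Assumed in-pod location, taken from the direct-invocation example.
    cmd = ["kubectl", "exec", "-n", namespace, pod, "--",
           "/app/bin/sdnc.monitor"]
    if debug:
        # --debug is the argument described under "Advanced health reporting".
        cmd.append("--debug")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so no cluster is needed.
    print(" ".join(build_monitor_command("dev-prom-6485f566fb-hdhzs")))
```

The resulting list could be handed to subprocess.run on a host where kubectl is configured for the site.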

Alternatively, the sdnc.monitor script available in the PROM pod can be run directly:



sdnc.monitor
root@dev-prom-6485f566fb-hdhzs:/app/bin# ./sdnc.monitor
healthy

Advanced health reporting

To help troubleshoot an unhealthy site, include the --debug argument. It shows which health checks are passing and which are failing and, for each failing check, the health check output, which helps identify the root cause.


The use of consul in component health checks

The consul health checks used to determine site health are specified in the PROM pod's values.yaml file, e.g. ~/oom/kubernetes/sdnc/prom/values.yaml.

prom values.yaml
config:
  ...
  healthChecks:
    # All top-level checks must pass
    - "Health Check: SDNC - SDN Host"
    - "Health Check: SDNC"
    - "Health Check: SDNC ODL Cluster"
    - "Health Check: SDNC Portal"
    # Within nested lists, only one must pass
    - - "Health Check: SDNC-SDN-CTL-DB-01"
      - "Health Check: SDNC-SDN-CTL-DB-02"

In the above example, the first four health checks (three for OpenDaylight and one for the admin portal) must all pass, as well as at least one of the two MySQL port checks. Short-circuit evaluation is used to determine site health in as few consul queries as possible.
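The evaluation rule described above can be sketched in a few lines of Python. This is a minimal illustration, not PROM's actual implementation: the names site_is_healthy and make_stub are hypothetical, and the passes callback stands in for a real consul query. Evaluation stops at the first failing top-level check, and a nested list stops at its first passing member.

```python
from typing import Callable, List, Set, Tuple, Union

Check = Union[str, List[str]]

HEALTH_CHECKS: List[Check] = [
    "Health Check: SDNC - SDN Host",
    "Health Check: SDNC",
    "Health Check: SDNC ODL Cluster",
    "Health Check: SDNC Portal",
    ["Health Check: SDNC-SDN-CTL-DB-01", "Health Check: SDNC-SDN-CTL-DB-02"],
]


def site_is_healthy(checks: List[Check],
                    passes: Callable[[str], bool]) -> bool:
    """All top-level entries must pass; a nested list needs one passing member."""
    for entry in checks:
        if isinstance(entry, list):
            # any() over a generator stops at the first passing member.
            if not any(passes(name) for name in entry):
                return False
        elif not passes(entry):
            # First failing top-level check decides the result.
            return False
    return True


def make_stub(failing: Set[str]) -> Tuple[Callable[[str], bool], List[str]]:
    """Return a fake query function plus a log of the checks it was asked about."""
    queried: List[str] = []

    def passes(name: str) -> bool:
        queried.append(name)
        return name not in failing

    return passes, queried


if __name__ == "__main__":
    passes, queried = make_stub({"Health Check: SDNC-SDN-CTL-DB-01"})
    print(site_is_healthy(HEALTH_CHECKS, passes), len(queried))
```

With the stub above, a failure of only the first MySQL check still yields a healthy site, because the second check in the nested list passes.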