SDN-C Site Failover
Updated for Casablanca release.
Overview
With SDN-C deployed in a geo-redundant fashion (see Deployment of Geo-Redundant SDN-C), activity can be switched from one site to the other in one of two ways:
manually by the site operator
automatically via PROM, based on health of active site
In either case, after the failover has been completed and the sites have transitioned from 'standby' to 'active' and vice versa, the DNS entry for the SDN-C deployment is automatically updated (see 2.1 Enable Remote Access to CoreDNS) in order to provide clients with the correct SDN-C target for their messaging.
Manual (forced) failover
The manual option is intended for site operators who want to force activity to a particular site so that they can perform maintenance or other work on the other site without impacting service. Before doing so, determine the current role of each site (see SDN-C Site Role Detection).
From the Kubernetes master node of the site that is to become active, run the sdnc.makeActive script, passing the release name as its argument:
sdnc.makeActive
ubuntu@k8s-s2-master:~/oom/kubernetes/sdnc/resources/geo/bin$ ./sdnc.makeActive dev   # 'dev' is the release name
This script will make use of kubectl to access the PROM pod and execute promoverride.py with the appropriate parameters to force PROM to switch activity to the local site.
Alternatively, the promoverride.py script could be executed directly from the PROM pod if so desired:
promoverride.py
root@dev-prom-6485f566fb-hdhzs:/app/config# ../promoverride.py -i sdnc02
Here, 'sdnc02' is the identifier specified during deployment for the site that should become active. When using the promoverride.py script directly, you may switch activity to either site, without having to be logged into the PROM pod on that site.
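As a rough illustration of what sdnc.makeActive is described as doing, the following Python sketch builds an equivalent kubectl exec invocation. It is not the script's actual implementation: the function names are hypothetical, the 'onap' namespace is taken from the kubectl hint shown later on this page, and the /app/promoverride.py path is inferred from the shell prompt in the example above. The PROM pod name must be supplied by the caller.

```python
import subprocess
from typing import List


def build_override_command(prom_pod: str, site_id: str,
                           namespace: str = "onap") -> List[str]:
    """Return the kubectl command that runs promoverride.py in the PROM pod.

    Assumptions (not confirmed by the source): namespace 'onap' and the
    script living at /app/promoverride.py inside the pod.
    """
    return [
        "kubectl", "-n", namespace, "exec", prom_pod, "--",
        "/app/promoverride.py", "-i", site_id,
    ]


def force_activity(prom_pod: str, site_id: str) -> None:
    """Run the override; raises CalledProcessError if kubectl fails."""
    subprocess.run(build_override_command(prom_pod, site_id), check=True)
```

Calling build_override_command("dev-prom-6485f566fb-hdhzs", "sdnc02") yields a command equivalent to the direct promoverride.py call shown above.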
Automatic failover
The PROM instance in each SDN-C site periodically determines the health of the local site from the health of each of its components (see SDN-C Site Health Determination). This information is published to MUSIC so that the remote site is also aware of it.
If the local PROM instance determines that the local site is currently 'standby' (see SDN-C Site Role Detection) and the remote site has become unhealthy, it will proceed to automatically initiate failover procedures, making the local site 'active' while the remote site is reverted to 'standby' (provided it is in a good enough state to do so).
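The trigger condition described above can be sketched as a small predicate. This is only an illustration of the stated rule, not PROM's code: the Role enum and the function name are hypothetical, and any additional checks PROM may perform (for example on the local site's own health) are not modelled.

```python
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    STANDBY = "standby"


def should_initiate_failover(local_role: Role, remote_healthy: bool) -> bool:
    """True only when the local site is standby and the remote site is unhealthy."""
    return local_role is Role.STANDBY and not remote_healthy
```

Under this rule an active site never initiates a failover itself; only the standby site reacts to the remote (active) site becoming unhealthy.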
Catastrophic failover
In certain circumstances, a "simple" failover – in which most components in the failed site are still available to be manipulated – cannot be performed. If the ODL cluster in the failing site cannot be contacted from the site that is to become active or if the Kubernetes master node in the remote site cannot be contacted (which will prevent communication to the managed pods), a catastrophic failover will be performed.
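The choice between the two failover types, as stated in the paragraph above, can be expressed as follows. The function and parameter names are hypothetical, and how reachability is actually probed is outside the scope of this sketch.

```python
def select_failover_type(remote_odl_reachable: bool,
                         remote_k8s_master_reachable: bool) -> str:
    """Return 'simple' only if both the remote ODL cluster and the remote
    Kubernetes master node can be contacted; otherwise 'catastrophic'."""
    if remote_odl_reachable and remote_k8s_master_reachable:
        return "simple"
    return "catastrophic"
```

Losing contact with either the ODL cluster or the Kubernetes master node of the failing site is enough on its own to select the catastrophic path.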
In a "catastrophic" failover, the healthy SDN-C site needs to be reconfigured as a standalone site (without geo-redundancy). Part of the process involves a Helm upgrade that reconfigures the site to remove geo-redundancy. This requires the SDN-C pods in the site to be restarted so that the reconfiguration takes effect.
After one site has suffered a catastrophic failure and the other site has been made non-geo-enabled, the geo pair can be reconstituted by following the site recovery procedure.
Detecting reconfiguration to non-geo-redundancy
After a catastrophic failover, the operator may wish to confirm that the active site has actually reverted to a non-geo-redundant configuration. This can be done by connecting to an SDN-C pod in the site and looking for 'GEO' in the environment variables in use:
Non-geo deployment
ubuntu@k8s-master:~$ k8s exec dev-sdnc-0 -it sh
Defaulting container name to sdnc.
Use 'kubectl describe pod/dev-sdnc-0 -n onap' to see all of the containers in this pod.
# env | grep GEO
#
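If the output of env has been captured (for instance by a monitoring script), the same check can be done programmatically. The sketch below mirrors env | grep GEO; the function name is hypothetical, and GEO_ENABLED in the usage note is only a placeholder variable name, not one confirmed by this page.

```python
from typing import List


def geo_variables(env_output: str) -> List[str]:
    """Return the lines of `env` output that contain 'GEO' (like grep GEO)."""
    return [line for line in env_output.splitlines() if "GEO" in line]


def is_geo_enabled(env_output: str) -> bool:
    """Treat the site as geo-redundant if any GEO variable is present."""
    return bool(geo_variables(env_output))
```

An empty result, as in the non-geo session above, indicates that the site is running without geo-redundancy; output containing a line such as the placeholder GEO_ENABLED=true would be reported as geo-enabled.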
Geo deployment
DNS updates
After a successful failover, whether manual or automatic, the sdnc.dnsswitch script found on the PROM pod is invoked automatically. This script communicates with the CoreDNS pod and updates the SDN-C deployment's DNS record so that it points to the local (i.e. 'active') site. If the failover does not complete successfully, this step is skipped, since the site is very likely unable to process messaging.
The sdnc.dnsswitch script is intended to be utilized by PROM but could be run manually if so desired: