SDN-C Site Failover

Overview

With SDN-C deployed in a geo-redundant fashion (see Deployment of Geo-Redundant SDN-C), activity can be switched from one site to the other in one of two ways:

  • manually by the site operator

  • automatically via PROM, based on health of active site

In either case, after the failover has been completed and the sites have transitioned from 'standby' to 'active' and vice versa, the DNS entry for the SDN-C deployment is automatically updated (see 2.1 Enable Remote Access to CoreDNS) in order to provide clients with the correct SDN-C target for their messaging.

Manual (forced) failover

The manual option is for site operators who want to force activity to a particular site so that they can perform maintenance or other work on the other site without impacting service. Before doing so, it is suggested that the current role of each site be determined (see SDN-C Site Role Detection).

From the Kubernetes master node in the site that is to become active, run the sdnc.makeActive script, passing the release name ('dev' in this example):

sdnc.makeActive
ubuntu@k8s-s2-master:~/oom/kubernetes/sdnc/resources/geo/bin$ ./sdnc.makeActive dev

This script uses kubectl to access the PROM pod and execute promoverride.py with the appropriate parameters to force PROM to switch activity to the local site.

Alternatively, the promoverride.py script can be executed directly from the PROM pod:

promoverride.py
root@dev-prom-6485f566fb-hdhzs:/app/config# ../promoverride.py -i sdnc02

Here, 'sdnc02' is the identifier that was specified during deployment for the site that is to become active. When using the promoverride.py script directly, you may switch activity to either site; you do not have to be logged into the PROM pod on the site that is to become active.
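
As a rough illustration of how the two approaches relate, the sketch below builds a kubectl command that would run promoverride.py inside a PROM pod. It is not the content of sdnc.makeActive, which is not shown on this page. The function name build_override_command is hypothetical, the 'onap' namespace is taken from the kubectl hint shown elsewhere on this page, and the /app/promoverride.py path is inferred from the shell prompt in the example above; adjust all of these to your deployment.

```python
from typing import List


def build_override_command(pod_name: str, site_id: str, namespace: str = "onap") -> List[str]:
    """Build a kubectl command that runs promoverride.py inside the PROM pod.

    Assumptions (not confirmed by this page): the script lives at
    /app/promoverride.py and the PROM pod runs in the given namespace.
    """
    if not pod_name or not site_id:
        raise ValueError("pod_name and site_id are both required")
    return [
        "kubectl", "exec", pod_name, "-n", namespace, "--",
        "/app/promoverride.py", "-i", site_id,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(" ".join(build_override_command("dev-prom-6485f566fb-hdhzs", "sdnc02")))
```

The command is returned as an argument list so that it could be handed to subprocess.run without shell quoting concerns.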

Automatic failover

The PROM instance in each SDN-C site is responsible for periodically ascertaining the health of the local site based on the health of each component (see SDN-C Site Health Determination). This information is published to MUSIC so that the remote site is also aware of it.

If the local PROM instance determines that the local site is currently 'standby' (see SDN-C Site Role Detection) and the remote site has become unhealthy, it will proceed to automatically initiate failover procedures, making the local site 'active' while the remote site is reverted to 'standby' (provided it is in a good enough state to do so).
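
The trigger condition described above can be sketched as a small predicate. This is only an illustration of the rule as stated on this page, not PROM's actual code; the function name and the string values used for the role are assumptions.

```python
def should_initiate_failover(local_role: str, remote_healthy: bool) -> bool:
    """Return True when the local site should take over activity.

    Mirrors the rule in the text: the local site is 'standby' and the
    remote (active) site has become unhealthy.
    """
    return local_role == "standby" and not remote_healthy


if __name__ == "__main__":
    for role in ("standby", "active"):
        for healthy in (True, False):
            decision = should_initiate_failover(role, healthy)
            print(f"local={role} remote_healthy={healthy} failover={decision}")
```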

Catastrophic failover

In certain circumstances, a "simple" failover – in which most components in the failed site are still available to be manipulated – cannot be performed. If the ODL cluster in the failing site cannot be contacted from the site that is to become active or if the Kubernetes master node in the remote site cannot be contacted (which will prevent communication to the managed pods), a catastrophic failover will be performed.
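
The choice between the two kinds of failover can be summarized as follows. This is a minimal sketch of the two conditions named above; how PROM actually probes the ODL cluster and the remote Kubernetes master is not covered on this page, so both reachability flags are treated as inputs and the function name is hypothetical.

```python
def choose_failover_type(odl_cluster_reachable: bool, k8s_master_reachable: bool) -> str:
    """Classify a failover as 'simple' or 'catastrophic'.

    A catastrophic failover is needed if either the ODL cluster in the
    failing site or the remote Kubernetes master cannot be contacted.
    """
    if odl_cluster_reachable and k8s_master_reachable:
        return "simple"
    return "catastrophic"


if __name__ == "__main__":
    print(choose_failover_type(True, True))
    print(choose_failover_type(True, False))
```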

In a "catastrophic" failover, the healthy SDN-C site needs to be reconfigured as a standalone site (without geo-redundancy). Part of the process involves a Helm upgrade that reconfigures the site to remove geo-redundancy. This requires the SDN-C pods in the site to be restarted for the reconfiguration to take effect.

After one site has suffered a catastrophic failure and the other site has been made non-geo-enabled, the geo pair can be reconstituted by following the site recovery procedure.

Detecting reconfiguration to non-geo-redundancy

After a catastrophic failover, the operator may wish to confirm that the active site has actually reverted to a non-geo-redundant configuration. This can be done by connecting to an SDN-C pod in the site and looking for 'GEO' in the environment variables in use:

Non-geo deployment
ubuntu@k8s-master:~$ k8s exec dev-sdnc-0 -it sh
Defaulting container name to sdnc.
Use 'kubectl describe pod/dev-sdnc-0 -n onap' to see all of the containers in this pod.
# env | grep GEO
#
Geo deployment
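
Separately from the captured examples, the 'env | grep GEO' check in this section can be sketched as a helper that scans captured env output. The function name is hypothetical, and EXAMPLE_GEO_FLAG is a made-up variable used only for illustration; the actual variable names present in a geo deployment are not listed on this page.

```python
from typing import List


def geo_variables(env_output: str) -> List[str]:
    """Return names of variables whose 'NAME=value' line contains 'GEO'.

    Like 'grep GEO', this matches the text anywhere on the line.
    """
    names = []
    for line in env_output.splitlines():
        if "GEO" in line:
            names.append(line.split("=", 1)[0])
    return names


if __name__ == "__main__":
    # EXAMPLE_GEO_FLAG is hypothetical; real output will differ.
    sample = "PATH=/usr/bin\nEXAMPLE_GEO_FLAG=true"
    found = geo_variables(sample)
    print("geo-enabled" if found else "non-geo", found)
```

An empty result corresponds to the non-geo case, where the grep prints nothing.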

DNS updates

After a successful failover, whether manual or automatic, the sdnc.dnsswitch script found on the PROM pod is invoked automatically. This script communicates with the CoreDNS pod and updates the SDN-C deployment's DNS record so that it points to the local (i.e. 'active') site. If the failover did not complete successfully, this step is not carried out (since the site is very likely unable to process messaging).

The sdnc.dnsswitch script is intended to be used by PROM but can be run manually if desired:

sdnc.dnsswitch




Earlier Releases

Beijing



Manual site failover

The promoverride.py script for manual (forced) failover is not available in Beijing, nor is automatic failover orchestrated by PROM.

In order to carry out a site failover in Beijing, the operator would invoke the sdnc.failover script found on the Kubernetes master in the standby site:

sdnc.failover

Note: The sdnc.failover script in Beijing is limited to situations where the failure affecting the active site is not catastrophic, meaning that most components in the active site can still be reached.

Manual DNS update

After successfully failing over a site, the operator would then be required to update the CoreDNS configuration so that the SDN-C hostname resolves to the appropriate site. (The sdnc.dnsswitch script method is not available in the Beijing release.)

Follow the steps below to update DNS manually after a site failover. All steps must be run on the CoreDNS master node.

The examples use the following configuration:

CoreDNS master node IP address: 10.147.101.135

primary site (site1) master node IP address: 10.147.99.140

secondary site (site2) master node IP address: 10.147.101.23


  1. Verify the CoreDNS server to get the existing mapping (here it points to the primary site, site1).

  2. Edit the zone file to comment out the SDNC mapping to the primary site (site1) and uncomment the mapping to the secondary site (site2).

  3. Edit the CoreDNS configmap to comment out the SDNC mapping to the primary site (site1) and uncomment the mapping to the secondary site (site2).

  4. Note that a cache time is configured in the configmap. Wait for it to elapse (30 seconds here), then send the signal that makes CoreDNS refresh its settings.

  5. Verify that the "sdnc.example.com" domain now points to the secondary site (site2).

It may take some time for the DNS resolver to refresh the address, depending on the configured cache time. If the update is not visible, send the refresh signal (step 4) again after some time.
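
The comment/uncomment edit in steps 2 and 3 can be sketched as a text transformation. This is a minimal sketch under stated assumptions: the mappings are written one per line, commented-out lines start with ';' (the standard zone file comment character), and the record layout in the example is hypothetical because the actual zone file is not shown on this page. The function name switch_zone_mapping is also hypothetical. Review the result before applying it to a real file, since any commented line that mentions the new IP address as a separate token would also be uncommented.

```python
def switch_zone_mapping(zone_text: str, old_ip: str, new_ip: str) -> str:
    """Comment out records for old_ip and uncomment records for new_ip.

    Assumes one record per line and ';' as the comment marker.
    The returned text has no trailing newline.
    """
    result = []
    for line in zone_text.splitlines():
        stripped = line.lstrip()
        is_comment = stripped.startswith(";")
        # Compare whole whitespace-separated tokens so that, for example,
        # 10.147.101.23 does not match 10.147.101.235.
        fields = stripped.lstrip(";").split()
        if is_comment and new_ip in fields:
            line = stripped.lstrip(";").lstrip()  # uncomment the new mapping
        elif not is_comment and old_ip in fields:
            line = ";" + line  # comment out the old mapping
        result.append(line)
    return "\n".join(result)


if __name__ == "__main__":
    # Hypothetical record layout; IP addresses are those of the examples.
    zone = (
        "sdnc.example.com. IN A 10.147.99.140\n"
        ";sdnc.example.com. IN A 10.147.101.23"
    )
    print(switch_zone_mapping(zone, "10.147.99.140", "10.147.101.23"))
```

Swapping the two IP arguments reverses the change, which is useful when activity is later returned to the primary site.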