...
...
Introduction
The ONAP Operations Manager (OOM) provides a set of capabilities that facilitate carrier-grade deployments of ONAP. ONAP deployments need to be capable of offering service under adverse conditions, typically with overall availability measured at five-nines (99.999% uptime), or about 5 minutes of downtime per year. This requirement might seem strict for an orchestration system, but keep in mind that ONAP’s closed loop control system could be providing monitoring and control for one or more critical VNFs that need to meet stringent up-time requirements as found in the TL 9000 Quality Management System Measurements Handbook.
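The "about 5 minutes per year" figure follows directly from the availability target. The short calculation below is a sketch only; the function name and the choice of a 365.25-day average year are assumptions made for illustration.

```python
# Downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days (assumption)


def downtime_minutes_per_year(availability: float) -> float:
    """Return the allowed downtime in minutes per year for a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    targets = [("three-nines", 0.999), ("four-nines", 0.9999), ("five-nines", 0.99999)]
    for label, target in targets:
        print(f"{label}: {downtime_minutes_per_year(target):.2f} minutes/year")
```

For five-nines this gives roughly 5.26 minutes per year, which is the budget that every failure, restart and upgrade must fit within.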
The Road to High Availability
The progression of the ONAP project towards a fully carrier-grade platform has started and will continue through the Beijing release and possibly subsequent releases. The steps along this progression are roughly as follows:
...
For each of these steps, the following sections describe the requirements in more detail and the technologies used to achieve them.
Highly Available Kubernetes Deployments
There is a high degree of variability possible in the deployment of Kubernetes. It may be installed and managed by hand, deployed with third-party tools like Rancher, or provided by a cloud provider, as with Microsoft Azure Container Service; the Kubernetes documentation describes the options. Kubernetes also provides guidance on creating deployments that may be suitable for carrier-grade deployments of ONAP on its Building High-Availability Clusters wiki page.
Reliable and Repeatable Deployment
During the Amsterdam release, OOM provided a set of capabilities to deploy some or all of the ONAP components rapidly and efficiently as a cloud native application with the Kubernetes container orchestration system (note that DCAE is an exception here, as DCAE provides its own orchestration system). Each of the components has a deployment specification that describes not only the containers and the container requirements but also the relationships or dependencies between the containers. These dependencies dictate the order in which the containers are started for the first time, so that the dependencies are always met without arbitrary sleep times between container startups. For example, the SDC back-end container requires the Elastic-Search, Cassandra and Kibana containers within SDC to be ready, and is also dependent on DMaaP (or the message-router) being ready, before it becomes fully operational. Here is the deployment specification that describes these dependencies:
...
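The idea of starting containers in dependency order, rather than relying on sleep times, can be sketched as a topological sort. This is an illustration of the ordering principle only, not how OOM or Kubernetes implements it; the component names below are informal labels taken from the SDC example in the text, not real deployment identifiers.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each key lists the components that must be
# ready before it can become fully operational.
DEPENDENCIES = {
    "sdc-back-end": {"elastic-search", "cassandra", "kibana", "message-router"},
    "elastic-search": set(),
    "cassandra": set(),
    "kibana": {"elastic-search"},  # assumed for illustration only
    "message-router": set(),
}


def startup_order(dependencies):
    """Return a start order in which every component follows its dependencies.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(" -> ".join(startup_order(DEPENDENCIES)))
```

Any order returned places the SDC back-end after everything it depends on, which is the property the deployment specification is meant to guarantee.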
As critical state is stored outside of the ONAP containers on storage media specific to the cloud environment, specific instructions on how to back up and restore such storage are outside the scope of ONAP.
Health Monitoring
All highly available systems include at least one facility to monitor the health of components within the system. Such health monitors are often used as inputs to distributed coordination systems (such as etcd, ZooKeeper, or Consul) and monitoring systems (such as Nagios or Zabbix). Within ONAP, Consul is the monitoring system of choice and is deployed by OOM in two parts, a server cluster and agents. A three-way, centralized Consul server cluster is deployed as a highly available monitor of all of the ONAP components. The Consul server provides a user interface that allows a user to graphically view the current health status of all of the ONAP components for which agents have been created. Monitoring of ONAP components is configured in the agents within JSON files that are stored in Gerrit under consul-agent-config. A sample of the user interface from the ONAP Integration labs follows.
...
Initially, the Consul agents use the same health monitoring facilities as the robot test infrastructure, which typically just validate that the endpoint is reachable. Some health checks already support more advanced checking, such as validating that a database is able to create, update and delete an entry. Consul exposes an API that allows external agents to use the results of the health check, such as the Kubernetes "liveness" probes described below.
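The difference between a reachability check and a deeper check can be made concrete with a small example. The sketch below exercises a create, update and delete cycle against a database; it uses an in-memory SQLite database purely as a stand-in, and the table name and function name are hypothetical. It is not taken from the ONAP health checks.

```python
import sqlite3


def database_crud_healthy(conn: sqlite3.Connection) -> bool:
    """Return True if a row can be created, updated and deleted on this connection."""
    try:
        with conn:  # commits on success, rolls back if a statement fails
            conn.execute(
                "CREATE TABLE IF NOT EXISTS health_probe "
                "(id INTEGER PRIMARY KEY, note TEXT)"
            )
            cursor = conn.execute(
                "INSERT INTO health_probe (note) VALUES (?)", ("created",)
            )
            row_id = cursor.lastrowid
            conn.execute(
                "UPDATE health_probe SET note = ? WHERE id = ?", ("updated", row_id)
            )
            row = conn.execute(
                "SELECT note FROM health_probe WHERE id = ?", (row_id,)
            ).fetchone()
            updated = row == ("updated",)
            conn.execute("DELETE FROM health_probe WHERE id = ?", (row_id,))
            remaining = conn.execute(
                "SELECT COUNT(*) FROM health_probe WHERE id = ?", (row_id,)
            ).fetchone()[0]
        return updated and remaining == 0
    except sqlite3.Error:
        # Any database error (including a closed connection) counts as unhealthy.
        return False


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print("healthy" if database_crud_healthy(connection) else "unhealthy")
```

A check of this kind catches failures, such as a database that accepts connections but cannot write, that a simple port probe would miss.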
Component Recoverability
OOM deploys ONAP on Kubernetes as defined by the deployment specifications mentioned earlier. These same deployment specifications are also used to implement automatic recoverability of ONAP components when individual components fail. Once ONAP is deployed, a "liveness" probe starts checking the health of the components after a specified startup time. These liveness probes can simply check that a port is available, that a built-in health check is reporting good health, or that the Consul health check is positive. Should a liveness probe indicate a failed container, the container will be restarted as described in the deployment specification. Should the deployment specification indicate that this container or component has one or more dependencies (for example, a dependency on a database), the dependencies will be satisfied before the container or component is restarted. This mechanism ensures that, after a failure, all of the ONAP components restart successfully. Note that, during the Amsterdam release, deployment specifications were created for all ONAP components, but not all of these deployment specifications are restartable (idempotent). Further work is required during the Beijing release to ensure recoverability of all the ONAP components.
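The restart decision driven by a liveness probe can be illustrated with a simplified model. The sketch below assumes a restart is triggered after a number of consecutive failed probes (Kubernetes calls this the failureThreshold, which defaults to 3); it ignores the initial delay and probe timing, and the function name is hypothetical.

```python
from typing import List, Sequence


def restarts_triggered(probe_results: Sequence[bool], failure_threshold: int = 3) -> List[int]:
    """Return the probe indices at which a container restart would be triggered.

    A restart is triggered once `failure_threshold` consecutive probes fail.
    The failure counter then resets, as it would for a freshly started container.
    """
    if failure_threshold < 1:
        raise ValueError("failure_threshold must be at least 1")
    restarts = []
    consecutive_failures = 0
    for index, healthy in enumerate(probe_results):
        if healthy:
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            restarts.append(index)
            consecutive_failures = 0
    return restarts


if __name__ == "__main__":
    # One transient failure is tolerated; three failures in a row are not.
    history = [True, False, True, False, False, False, True]
    print(restarts_triggered(history))
```

Requiring several consecutive failures avoids restarting a container because of a single slow or dropped probe.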
Centralized Logging
An important tool in achieving minimal downtime is the ability to rapidly diagnose problems and determine the root cause. The Logging Enhancements Project has been building a centralized log collection system based on the Elastic Stack and a Filebeat collector container that is instantiated alongside the containers for each of the ONAP components. Here is an example from the aai-traversal deployment specification:
...
Filebeat collects logs from within the namespace of each component and ships them to the centralized logging stack that was deployed by OOM with the other ONAP components. Users can point their web browsers to the Kibana component and see all of the raw logs as well as predefined dashboards that show the state of ONAP in real time.
Intra Component Clustering
The OOM project is not responsible for creating highly available versions of all of the ONAP components, but it does provide, via Kubernetes, many built-in facilities to build clustered, highly available systems, including Services with load-balancers (including support for External Load Balancers), Ingress Resources, and Replica Sets. Some of the open-source projects that form the basis of ONAP components directly support clustered configurations, for example ODL (see Setting Up Clustering) and MariaDB (see Getting Started with MariaDB Galera Cluster).
...
The SDN-C Clustering on Kubernetes page describes a working example in which many of these techniques are used together.
Pod Placement Rules
OOM will use the rich set of Kubernetes node and pod affinity / anti-affinity rules to minimize the chance of a single failure resulting in a loss of ONAP service. Node affinity / anti-affinity is used to guide the Kubernetes orchestrator in the placement of pods on nodes (physical or virtual machines). For example:
...
Pod affinity / anti-affinity is the concept of creating a spatial relationship between pods when the Kubernetes orchestrator assigns them to nodes (both initially and in operation), as explained in Inter-pod affinity and anti-affinity. For example, one might choose to co-locate all of the ONAP SDC containers on a single node, as they are not critical runtime components and co-location minimizes overhead. On the other hand, one might choose to ensure that all of the containers in an ODL cluster (SDNC and APPC) are placed on separate nodes, so that a node failure has minimal impact on the operation of the cluster. An example of pod affinity / anti-affinity is shown below:
...
This example contains both podAffinity and podAntiAffinity rules. The first rule is a must (requiredDuringSchedulingIgnoredDuringExecution), while the second will be met pending other considerations (preferredDuringSchedulingIgnoredDuringExecution).
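The shape of such a rule pair can be sketched by building the affinity section of a pod specification as a Python dictionary and printing it as JSON, which Kubernetes accepts alongside YAML. The field names are those of the Kubernetes pod specification; the function name, the "app" label key and the label values are assumptions chosen for illustration, not values from the ONAP charts.

```python
import json


def pod_affinity_rules(co_locate_with: str, spread_from: str) -> dict:
    """Build an affinity section with one hard affinity and one soft anti-affinity rule."""

    def selector(app_label: str) -> dict:
        # Match pods carrying the (hypothetical) label app=<app_label>.
        return {
            "matchExpressions": [
                {"key": "app", "operator": "In", "values": [app_label]}
            ]
        }

    return {
        "affinity": {
            "podAffinity": {
                # Hard rule: the pod is only scheduled onto a node that already
                # runs a pod matching the selector.
                "requiredDuringSchedulingIgnoredDuringExecution": [
                    {
                        "labelSelector": selector(co_locate_with),
                        "topologyKey": "kubernetes.io/hostname",
                    }
                ]
            },
            "podAntiAffinity": {
                # Soft rule: the scheduler tries to avoid nodes running matching
                # pods, but may still place the pod there if it has to.
                "preferredDuringSchedulingIgnoredDuringExecution": [
                    {
                        "weight": 100,
                        "podAffinityTerm": {
                            "labelSelector": selector(spread_from),
                            "topologyKey": "kubernetes.io/hostname",
                        },
                    }
                ]
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(pod_affinity_rules("sdc", "sdnc"), indent=2))
```

Using kubernetes.io/hostname as the topology key makes "together" and "apart" mean "on the same node" and "on different nodes"; a zone or region label would apply the same rules at a coarser granularity.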
ONAP S/W Upgrades & Rollbacks
Kubernetes has built-in capabilities to enable the upgrade of pods without causing a loss of the service being provided by that pod or pods (if configured as a cluster). As described in the OOM User Guide, ONAP components provide an abstracted ‘service’ end point with the pods or containers providing this service hidden from other ONAP components by a load balancer. This capability is used during upgrades to allow a pod with a new image to be added to the service before removing the pod with the old image. This ‘make before break’ capability ensures minimal downtime.
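The "make before break" sequence can be illustrated with a small simulation. The sketch below models a service backed by a fixed number of pods and upgrades them one at a time, always adding a new pod before removing an old one; in Kubernetes terms this corresponds roughly to a rolling update with maxSurge of 1 and maxUnavailable of 0. The function name and version labels are hypothetical, and the model ignores readiness checks and timing.

```python
from typing import List


def rolling_upgrade(replicas: int, old: str = "v1", new: str = "v2") -> List[List[str]]:
    """Return the pods behind the service after each step of the upgrade."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    pods = [old] * replicas
    steps = [list(pods)]
    for _ in range(replicas):
        pods.append(new)  # make: add a pod running the new image
        steps.append(list(pods))
        pods.remove(old)  # break: remove one pod running the old image
        steps.append(list(pods))
    return steps


if __name__ == "__main__":
    for step in rolling_upgrade(3):
        print(step)
```

At every step the service is backed by at least the original number of pods, which is why the upgrade does not reduce capacity.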
...
Note that although OOM uses Kubernetes facilities to minimize the effort required of the ONAP component owners to implement a successful rolling upgrade strategy, there are other considerations that must be taken into account. For example, APIs, both internal and external to ONAP, should be designed to gracefully accept transactions from a peer at a different software version to avoid deadlock situations. Embedded version codes in messages may facilitate such capabilities.
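One way to picture version-tolerant message handling is a receiver that reads an embedded version code and degrades gracefully. The sketch below is an assumption-laden illustration: the message layout, the field names ("version", "payload", "priority") and the set of supported versions are all hypothetical and do not describe any ONAP API.

```python
import json

SUPPORTED_VERSIONS = {1, 2}  # hypothetical versions understood by this receiver


def handle_message(raw: str) -> dict:
    """Accept a message from a peer that may run an older or newer software version."""
    message = json.loads(raw)
    # A missing version code is treated as the oldest version.
    version = message.get("version", 1)
    if version not in SUPPORTED_VERSIONS:
        # Reject explicitly rather than hanging or failing on unknown content.
        return {"status": "rejected", "reason": f"unsupported version {version}"}
    payload = message.get("payload", {})
    # Version 2 peers may send an optional "priority" field; version 1 peers do not,
    # so a default is applied instead of treating the field as required.
    priority = payload.get("priority", "normal") if version >= 2 else "normal"
    return {"status": "accepted", "version": version, "priority": priority}


if __name__ == "__main__":
    print(handle_message('{"version": 1, "payload": {}}'))
    print(handle_message('{"version": 2, "payload": {"priority": "high"}}'))
    print(handle_message('{"version": 3, "payload": {}}'))
```

During a rolling upgrade both old and new pods serve the same endpoint, so a receiver written this way keeps working whichever version its peer happens to be.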
Geo-Redundant Deployments
As described in the Pod Placement Rules section, OOM enables the placement of specific pods into specific zones or regions, thus providing protection from a single cluster failure. These placement rules can also be used to distribute specific resources, such as a DCAE cluster, close to the VNFs that DCAE is monitoring. To build such a distributed network, operators will use Kubernetes Federation to link multiple clusters.
...
The Kubernetes federation control plane enables clusters that are geographically separated to function as a single deployment, with DNS servers and load balancers distributing work across the clusters. Federation also enables hybrid clouds, for example with nodes provided by both a private OpenStack cluster and a Microsoft Azure cluster. Note that clusters can each scale to thousands of nodes, so it is unlikely that capacity will be the sole reason for deploying ONAP within a federation of clusters.
List of Epics
The following list of JIRA Epics represents the development activities required to complete the OOM-related carrier-grade activities (to be confirmed):
...