What's needed to deploy ONAP
ONAP is a set of different applications. Since the Casablanca release, the preferred way to deploy ONAP is OOM (ONAP Operations Manager).
OOM is a set of Helm charts plus a Helm plugin (deploy).
Each Helm chart deploys one ONAP component (AAI, SO, VFC, ...) on a Kubernetes cluster.
The helm deploy plugin simplifies the deployment of the whole solution (faster deployment, use of standard Helm metadata storage).
So, in order to deploy ONAP, you need a working Kubernetes environment (Ingress, storage class, CNI, ...) with Helm installed (see the software requirements for the correct version to use according to the ONAP version).
On top of this Kubernetes installation, you may also need third party components for specific use cases:
- Cert Manager if you want to use CMPv2 certificates in DCAE and SDNC
- Prometheus (preferably installed with this helm chart) if you want to scrape some metrics
- Strimzi (PoC in Jakarta, default in Kohn) for Kafka deployment
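Before starting a deployment, it can help to confirm that the client tools are reachable. The following is a minimal sketch, not part of OOM: the tool list (kubectl and helm) is an assumption based on the requirements above, and the check does not verify versions or cluster features such as Ingress or storage classes.

```python
import shutil


def missing_tools(tools=("kubectl", "helm"), which=shutil.which):
    """Return the names of the tools that cannot be found on the PATH.

    The default tool list is an assumption; adapt it to your environment.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("kubectl and helm were found on the PATH")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default is enough.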
Strategy to deploy the stack
The OOM and Integration teams have decided to take a modular approach to cover all the operations needed to deploy and validate an ONAP instance:
- Create the virtual machines
- Deploy Kubernetes
- Deploy Platform Services on Kubernetes
- Deploy ONAP
- Test ONAP
Important choices have been made:
- All deployments use GitLab CI
- All deployments use Ansible playbooks
- An "orchestrator" of GitLab CI deployments is used
Chained CI: the Gitlab CI deployment "orchestrator"
Chained CI is a dynamic GitLab CI pipeline manager which calls an underlying pipeline at each stage, waits for its completion, retrieves its artifacts and moves on to the next stage if it was successful.
Via a declarative (yaml) file, we can:
- chain any pipeline
- use outputs of previous stages as input of a stage
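The chaining idea can be illustrated with a small sketch. This is not Chained CI's code: the stage functions, the artifact names (`inventory`, `kubeconfig`) and the `run_chain` helper are all hypothetical, and real stages would trigger GitLab pipelines rather than call local functions.

```python
def run_chain(stages, initial_artifacts=None):
    """Run stages in order; each stage receives the artifacts gathered so far.

    `stages` is a list of (name, callable) pairs. Each callable takes the
    artifacts dict and returns (success, new_artifacts).
    The chain stops at the first failing stage.
    """
    artifacts = dict(initial_artifacts or {})
    completed = []
    for name, stage in stages:
        success, produced = stage(artifacts)
        if not success:
            return False, completed, artifacts
        artifacts.update(produced)
        completed.append(name)
    return True, completed, artifacts


# Hypothetical stages standing in for the real underlying pipelines.
def create_vms(artifacts):
    return True, {"inventory": ["master1", "worker1"]}


def deploy_k8s(artifacts):
    # Uses the inventory produced by the previous stage.
    if "inventory" not in artifacts:
        return False, {}
    return True, {"kubeconfig": "admin.conf"}


if __name__ == "__main__":
    ok, done, artifacts = run_chain(
        [("create_vms", create_vms), ("deploy_k8s", deploy_k8s)]
    )
    print(ok, done, sorted(artifacts))
```

Running `deploy_k8s` alone fails in this sketch because no earlier stage produced an inventory, which mirrors the dependency between stages.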
Chained CI configuration is mainly split into 4 types of files:
- a unique file describing the underlying projects with their pipelines: https://gitlab.com/Orange-OpenSource/lfn/ci_cd/chained-ci/-/blob/master/pod_inventory/group_vars/all.yml
- a per-chain file describing the steps to perform:
  - deployment of Kubernetes for a master deployment: https://gitlab.com/Orange-OpenSource/lfn/ci_cd/chained-ci/-/blob/master/pod_inventory/host_vars/onap_daily_pod4_k8s_master.yml
  - deployment and test of the master version: https://gitlab.com/Orange-OpenSource/lfn/ci_cd/chained-ci/-/blob/master/pod_inventory/host_vars/onap_daily_pod4_master.yml
- an IDF (Installed Description File) / PDF (Platform Description File) giving the specific configuration for a particular deployment:
  - the PDF file describes the servers to create. Example for a gating deployment: https://gitlab.com/Orange-OpenSource/lfn/ci_cd/chained-ci/-/blob/master/pod_config/config/az7-gating3.yaml
  - the IDF file gives all the values needed by the different steps. Example for a gating deployment: https://gitlab.com/Orange-OpenSource/lfn/ci_cd/chained-ci/-/blob/master/pod_config/config/idf-az7-gating3.yaml
More information:
official documentation: https://docs.onap.org/projects/onap-integration/en/latest/onap-integration-ci.html#integration-ci
Creating virtual machines
input
- the description of the wanted infrastructure (networks, servers, volume, floating IPs, ...)
- credentials to use IaaS API
output
- an inventory file with the created machines and their purpose
Implementations:
OS Infra Manager for OpenStack deployments
AZ Infra Manager for Azure VM deployments
Creating Kubernetes cluster
input
- an inventory file with servers placed in specific groups:
- kube-master for servers hosting API part of K8S
- etcd for servers hosting etcd
- kube-worker for servers hosting "compute" part of K8S
- k8s-cluster with kube-node and kube-worker servers
- k8s-full-cluster with all the servers (master, etcd, worker, jumphost)
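A quick sanity check of the inventory groups could look like the sketch below. It assumes the inventory has already been loaded into a plain mapping of group name to host list; the loading step and the `missing_groups` helper are hypothetical and not part of the installer projects.

```python
# Group names taken from the list above.
REQUIRED_GROUPS = (
    "kube-master",
    "etcd",
    "kube-worker",
    "k8s-cluster",
    "k8s-full-cluster",
)


def missing_groups(inventory):
    """Return the required groups that are absent or empty in `inventory`.

    `inventory` is assumed to be a mapping of group name to a list of hosts.
    """
    return [group for group in REQUIRED_GROUPS if not inventory.get(group)]


if __name__ == "__main__":
    example = {"kube-master": ["master1"], "etcd": ["master1"], "kube-worker": []}
    print(missing_groups(example))
```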
output
- a deployed kubernetes cluster
- the admin kube config
implementations
Kubespray Automatic installation (used in dailies / weeklies)
RKE Automatic installation (not used anymore)
RKE2 Automatic installation (used internally)
AKS Automatic installation (used for gating)
Adding Services to the Kubernetes cluster
input
- a valid kube config
- an accessible kubernetes API
output
- installed platform components (helm, prometheus, others)
implementations
This installation is done in the "postconfiguration" part of the Kubernetes cluster project.
There are 2 types of implementations:
- the legacy one, where a subset is installed (prometheus) and the rest is done via "freeform" (see https://gitlab.com/Orange-OpenSource/lfn/ci_cd/chained-ci/-/blob/master/pod_config/config/idf-az7-gating3.yaml#L62-98)
- the "new" one, where everything must be created beforehand (no more freeform), which gives better control and thus better durability. It is not used on OOM today
Deploying ONAP
input
- a valid kube config
- helm
- possibly an override file to choose which components to use / specific configuration for ONAP or some components
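As an illustration of what an override file can contain, the sketch below builds a mapping that switches components on or off and prints it as JSON, which YAML parsers generally accept. The `<component>: {enabled: ...}` layout is an assumption to be checked against the OOM charts you deploy, and `build_override` is a hypothetical helper.

```python
import json


def build_override(enabled, disabled=()):
    """Build an override mapping that switches components on or off.

    The {"<component>": {"enabled": bool}} layout is an assumption.
    """
    override = {name: {"enabled": True} for name in enabled}
    override.update({name: {"enabled": False} for name in disabled})
    return override


if __name__ == "__main__":
    # Component names are the ones used as examples earlier in this page.
    print(json.dumps(build_override(["aai", "so"], ["vfc"]), indent=2))
```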
output
- a (hopefully working) ONAP deployment
implementation
ONAP OOM Automatic installation
ONAP OOM Automatic installation refresh (not used today on community deployments unfortunately)
Testing ONAP
input
- a valid kube config
- the name of the namespace where ONAP is installed (onap by default)
- a docker service
output
- a report on the performed tests
implementation
Specificities of ONAP Dailies / Weeklies on Orange premises
The OpenStack API is not exposed on the Internet, so all calls must go through a jumphost (rebond.opnfv.fr)
Specificities of ONAP gating on Azure
As Azure has no OpenStack APIs, a small OpenStack instance based on devstack (installed with DevStack Automatic Installation) is created near each worker.
Gating
Gating is built on top of the "automatic deployment" described above.
As for daily deployments, two chains are created in Chained CI per gating environment (there are 2 gating environments today):
- Infrastructure deployment (Virtual Machines + Kubernetes + Platform services + Dedicated OpenStack)
- ONAP deployment and test
One of the differences is that the first chain does not trigger the second one.
Infrastructure deployment chain is meant to be performed once in a while (after ~100 days, artifacts are too old in gitlab and it must be reinstalled)
ONAP deployment and test chain is meant to be performed anytime a gate is ready to be launched.
As we have a limited number of platforms and potentially a larger number of gates to run, a queue system needs to be put in front.
At the time this gating system was created, no "out of the box" queue system was found (or understood: we never understood how to use zuul, for example).
So the decision was made to create 4 μservices using an MQTT broker named mosquitto as the messaging system:
- Gerrit 2 MQTT: creates topics / messages for every event sent by Gerrit (via SSH)
- MQTT 2 Gerrit: sends comments (optionally with a score) to a specific Gerrit review when a message is sent on a specific topic
- Chained CI MQTT Trigger (master mode): listens to messages on specific topics and queues them when they belong to a wanted topic. It resends them when a worker asks for a job
- Chained CI MQTT Trigger (worker mode): when free, listens to messages on specific topics and launches a gate (if elected) when receiving one. It asks for a job every xx seconds when free
Further details are documented separately, but this is how it works in the two "main" cases:
Workers are free
- A new patchset is created on a watched repo (OOM for example)
- Gerrit2MQTT creates a message on /onap/oom/patchset-created
- Chained CI MQTT Trigger Master reads the message and puts it in its internal queue
- A worker is free and proposes to take the job
- Master acknowledges and removes the message from the queue
- Worker starts a chained ci and waits for completion. Depending on the completion status, it retrieves the failed jobs and the summary messages
- Worker sends them to the Gerrit notification topic
- MQTT 2 Gerrit sees the message, retrieves the Gerrit number and the patchset number and uploads the message
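The mapping between a Gerrit event and an MQTT topic can be sketched as follows. The topic layout (`<prefix>/<project>/<event type>`) is inferred from the single example topic above, and the event fields follow the shape of Gerrit stream events (`type`, `change.project`, `change.number`, `patchSet.number`); both should be treated as assumptions, and the two helpers are hypothetical.

```python
def topic_for_event(event, prefix="/onap"):
    """Build the MQTT topic for a Gerrit event (layout is an assumption)."""
    project = event["change"]["project"]
    return f"{prefix}/{project}/{event['type']}"


def review_reference(event):
    """Extract the Gerrit change number and patchset number from an event."""
    return int(event["change"]["number"]), int(event["patchSet"]["number"])


if __name__ == "__main__":
    # Hand-written sample event, not captured from a real Gerrit server.
    sample = {
        "type": "patchset-created",
        "change": {"project": "oom", "number": 12345},
        "patchSet": {"number": 3},
    }
    print(topic_for_event(sample), review_reference(sample))
```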
Workers are not free
- A new patchset is created on a watched repo (OOM for example)
- Gerrit2MQTT creates a message on /onap/oom/patchset-created
- Chained CI MQTT Trigger Master reads the message and puts it in its internal queue
- Later, a worker becomes free and sends a message to its master to announce it can take a job
- Master dequeues the oldest message and resends it
- Worker proposes to take the job
- Master acknowledges and removes the message from the queue
- Worker starts a chained ci and waits for completion. Depending on the completion status, it retrieves the failed jobs and the summary messages
- Worker sends the summary and the failed job list to the Gerrit notification topic
- MQTT 2 Gerrit sees the message, retrieves the Gerrit number and the patchset number and uploads the message
- Worker announces it's free
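The master side of this exchange can be modelled with an in-memory queue. This is only one interpretation of the steps above, not the actual Chained CI MQTT Trigger code: it uses no MQTT client, the class and method names are hypothetical, and it assumes the message stays queued until a worker's proposal is acknowledged.

```python
from collections import deque


class Master:
    """Queues gate requests and hands them to workers that ask for a job."""

    def __init__(self, watched_topics):
        self.watched_topics = set(watched_topics)
        self.queue = deque()

    def on_message(self, topic, payload):
        """Queue a message only when it arrives on a watched topic."""
        if topic in self.watched_topics:
            self.queue.append((topic, payload))
            return True
        return False

    def on_worker_free(self):
        """Return the oldest queued message (the one to resend), if any."""
        return self.queue[0] if self.queue else None

    def acknowledge(self, message):
        """Remove the message once a worker has proposed to take it."""
        try:
            self.queue.remove(message)
            return True
        except ValueError:
            return False


if __name__ == "__main__":
    master = Master(["/onap/oom/patchset-created"])
    master.on_message("/onap/oom/patchset-created", {"change": 1})
    job = master.on_worker_free()
    master.acknowledge(job)
    print(job, len(master.queue))
```

Keeping the message until acknowledgement means a request is not lost if no worker ends up taking it.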
Current deployments
All gating μservices are deployed on the Azure ONAP "gating" Kubernetes cluster (alongside a Nexus).
Each gating system has a Chained CI MQTT Trigger worker μservice.
One Chained CI MQTT Trigger master is created (we can have several that would monitor different repos / have different workers)
Maintenance work on gating
Access to gating systems
You first need to give your SSH key to one of the admins.
Then add the following to your .ssh/config to access the gating systems:
Host rebond.francecentral.cloudapp.azure.com
  User cloud

Host azure*.onap.eu *.integration.onap.eu
  User cloud
  StrictHostKeyChecking no
  CheckHostIP no
  UserKnownHostsFile /dev/null
  ProxyJump rebond.francecentral.cloudapp.azure.com

# Networks used in Azure cloud
Host 192.168.64.* 192.168.65.* 192.168.53.* 192.168.66.* 192.168.67.*
  ProxyJump rebond.francecentral.cloudapp.azure.com
  StrictHostKeyChecking no
  CheckHostIP no
  UserKnownHostsFile /dev/null
  User cloud
Access to weeklies/dailies in Orange can be done by adding this to your .ssh/config (if granted):
Host rebond.opnfv.fr
  User user

Host master*.onap.eu *.daily.onap.eu *.internal.onap.eu *-weekly.onap.eu istanbul.onap.eu
  User debian
  StrictHostKeyChecking no
  CheckHostIP no
  UserKnownHostsFile /dev/null
  ProxyJump rebond.opnfv.fr
Full filesystem
The filesystem of the jumphost is full and therefore no tests can be launched
check (on jumphost):
df -h
remediation: clean docker
docker system prune -a
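The same check can be scripted, for example to raise an alert before the disk fills up. This is a minimal sketch using the Python standard library; the 90% threshold is an arbitrary assumption, and the percentage may differ slightly from the one shown by df, which accounts for reserved blocks.

```python
import shutil


def used_percent(path="/"):
    """Return the percentage of used space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path="/", threshold=90.0):
    """Tell whether usage is at or above `threshold` (an assumed value)."""
    return used_percent(path) >= threshold


if __name__ == "__main__":
    print(f"{used_percent('/'):.1f}% used, nearly full: {is_nearly_full('/')}")
```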
Full OpenStack
The cleanup on OpenStack has not been performed and we cannot start new tests on the gate
check (on jumphost):
openstack --os-cloud admin stack list
the list should be empty, or contain only stacks with a very recent creation time
remediation: clean openstack
openstack --os-cloud admin stack delete UUIDs_OF_STACKS
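Finding the stacks to delete can be scripted from the JSON form of the stack list. The sketch below assumes the list was produced with the client's JSON formatter (-f json), that each entry has "ID" and "Creation Time" keys, and that timestamps look like 2024-01-01T00:00:00Z; these are assumptions to verify against your client version. The 6 hour limit and the `stale_stacks` helper are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone


def stale_stacks(stack_list_json, now, max_age=timedelta(hours=6)):
    """Return the IDs of stacks older than `max_age`.

    `stack_list_json` is assumed to be the JSON output of the stack list
    command; `now` must be a timezone-aware datetime.
    """
    stale = []
    for stack in json.loads(stack_list_json):
        created = datetime.strptime(
            stack["Creation Time"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - created > max_age:
            stale.append(stack["ID"])
    return stale


if __name__ == "__main__":
    # Hand-written sample data, not real command output.
    sample = json.dumps([{"ID": "a", "Creation Time": "2024-01-01T00:00:00Z"}])
    print(stale_stacks(sample, datetime.now(timezone.utc)))
```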
Lost integration images
ONAP images are no longer available on nexus3.onap.org
check: verify that images are present on nexus3
remediation: create a ticket
Too old system
The gate is quite old and no longer works properly
check (on jumphost): uptime on jumphost is more than 100 days and gates are behaving weirdly
uptime
remediation: reinstall gate
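The age check can also be automated. This is a minimal sketch that assumes a Linux jumphost, where the first field of /proc/uptime is the uptime in seconds; the 100 day limit comes from the text above and the helper names are hypothetical.

```python
def uptime_days(proc_uptime_text):
    """Convert the content of /proc/uptime into a number of days."""
    seconds = float(proc_uptime_text.split()[0])
    return seconds / 86400


def needs_reinstall(proc_uptime_text, max_days=100):
    """Tell whether the uptime exceeds `max_days`."""
    return uptime_days(proc_uptime_text) > max_days


if __name__ == "__main__":
    try:
        with open("/proc/uptime") as handle:
            text = handle.read()
    except OSError:
        # /proc/uptime only exists on Linux.
        print("uptime is not available on this system")
    else:
        print(f"up for {uptime_days(text):.1f} days")
```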