Beijing Release Resiliency Testing Status

DONE

We use the vFW use case as the baseline for these tests.

Prerequisite: Instantiate a vFW with the closed loop running.

  • Error detection is very fast: less than 1 second

  • Recovery:

    • Killing a Docker container: the system normally returns to a normal state in less than 1 minute (SDNC and APPC take up to 5 minutes).

    • Deleting a pod: recovery normally takes much longer, especially for SDNC and APPC (up to 15 minutes).

  • Note: A Helm upgrade sometimes broke the whole system and left it unusable. However, we think this may not be a normal use case for a production environment.
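The recovery times summarized above could be measured with a small helper along the following lines. This is a minimal sketch, not the procedure actually used for these tests: the function names, the label selector (`app=vid`), the polling interval, and the 30-minute default timeout are all assumptions chosen for illustration. It deletes a pod with `kubectl delete pod` and then polls `kubectl wait --for=condition=Ready` until the pods matching the selector are ready again.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


def run_kubectl(args: Sequence[str]) -> int:
    """Run kubectl with the given arguments and return its exit code."""
    return subprocess.run(["kubectl", *args], check=False).returncode


def time_pod_recovery(
    pod: str,
    selector: str,  # hypothetical label selector, e.g. "app=vid"
    namespace: str = "onap",
    timeout_s: float = 1800.0,  # assumed limit: 30 minutes
    poll_s: float = 5.0,
    run: Callable[[Sequence[str]], int] = run_kubectl,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Delete `pod`, then poll until all pods matching `selector` are Ready.

    Returns the elapsed seconds (including the time the delete itself takes),
    or None if the pods are not Ready within `timeout_s`.
    """
    start = clock()
    # By default `kubectl delete` blocks until the old pod is gone.
    if run(["delete", "pod", pod, "-n", namespace]) != 0:
        raise RuntimeError(f"could not delete pod {pod}")
    while clock() - start < timeout_s:
        # `kubectl wait` exits non-zero on timeout or when no pod matches yet.
        rc = run([
            "wait", "--for=condition=Ready", "pod",
            "-l", selector, "-n", namespace, "--timeout=10s",
        ])
        if rc == 0:
            return clock() - start
        sleep(poll_s)
    return None
```

Called as `time_pod_recovery("dev-vid-6d66f9b8c-9vdlt", "app=vid")`, it would return the number of seconds until the replacement pod reports Ready. Pod readiness does not by itself show that the closed loop has recovered, so that still needs a separate check.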

Time (EDT)

Categories

Sub-Categories (In Error Mode Component)

Time to Detect Failure and Repair

Pass?

Notes



VNF Onboarding and Distribution

SDC

< 5 minutes

Pass

Timing?? 30 minutes. Use a script that kills those components randomly while VNF onboarding continues.

ete-k8s.sh onap healthdist

After kicking off the command, we waited 1 minute and then killed SDC.

The first distribution failed; we then redistributed, and it succeeded.



SO

< 5 minutes

Pass

After kicking off the command, we waited 1 minute and then killed SO.

The first distribution failed; we then redistributed, and it succeeded.



A&AI

< 5 minutes

Pass

  1. Killed aai-modelloader; the task finished in 3:04 minutes.

  2. Killed two aai-cassandra pods; the task finished in ~1 minute.



SDNC

< 8 minutes

Pass

  1. Run preload using scripts

Deleted the SDNC pod. It took a very long time to come back, possibly because of network issues. The system was then left in a "weird" state, and SDC gave us the following error:

< 5 minutes

Pass

  1. Deleted one of the SDNC containers, e.g. sdnc-0.

  2. Ran health and preload.





VNF Instantiation

SDC

< 2 seconds

Pass

Tested by manually killing the Docker container.



VID

< 1 minute

Pass

  1. kubectl delete pod dev-vid-6d66f9b8c-9vdlt -n onap # back in 1 minute

  2. kubectl delete pod dev-vid-mariadb-fc95657d9-wqn9s -n onap   # back in 1 minute



SO

5 minutes

Pass

The SO pod restarted as part of hard rebooting 2 of the 9 k8s VMs.



A&AI

20 minutes

Pass

Restarted aai-model-loader, aai-hbase, and aai-sparky-be due to hard rebooting 2 more k8s VMs.

Recovery probably took extra time because many other pods were restarting at the same time and needed time to converge.



SDNC

5 minutes

Pass

The SDNC pods restarted as part of hard rebooting 2 of the 9 k8s VMs.



MultiVIM

< 5 minutes

Pass

Deleted the multicloud pods and verified that the new pods that came up can orchestrate VNFs as usual.



Closed Loop

(Pre-installed manually)

DCAE

< 5 minutes

Pass

Deleted dep-dcae-ves-collector-767d745fd4-wk4ht. No discernible interruption to closed loop. Pod restarted in 1 minute.

Deleted dep-dcae-tca-analytics-d7fb6cffb-6ccpm. No discernible interruption to closed loop. Pod restarted in 2 minutes.

Deleted dev-dcae-db-0. Closed loop failed after about 1 minute. Pod restarted in 2 minutes. Closed loop then suffered from intermittent packet gaps and only recovered after the packet generator was rebooted. The most likely cause is an intermittent network problem or an issue within the packet generator.

Deleted dev-dcae-redis-0. No discernible interruption to closed loop. Pod restarted in 2 minutes.



DMaaP

10 seconds

Pass

Deleted dev-dmaap-bus-controller-657845b569-q7fr2. No discernible interruption to closed loop. Pod restarted in 10 seconds.



Policy

(Policy documentation: Policy on OOM)

15 minutes

Pass

Deleted dev-pdp-0. No discernible interruption to closed loop. Pod restarted in 2 minutes.

Deleted dev-drools-0. Closed loop failed immediately. Pod restarted in 2 minutes. Closed loop recovered in 15 minutes.

Deleted dev-pap-5c7995667f-wvrgr. No discernible interruption to closed loop. Pod restarted in 2 minutes.

Deleted dev-policydb-5cddbc96cf-hr4jr. No discernible interruption to closed loop. Pod restarted in 2 minutes.

Deleted dev-nexus-7cb59bcfb7-prb5v. No discernible interruption to closed loop. Pod restarted in 2 minutes.



A&AI

Never

Fail

Deleted aai-modelloader. Closed loop failed immediately. Pod restarted in < 5 minutes. Closed loop never recovered.

--- the rest done on a different instance ---

Deleted dev-aai-55b4c4f4d6-c6hcj. No discernible interruption to closed loop. Pod restarted in 2 minutes.

Deleted dev-aai-babel-6f54f4957d-h2ngd. No discernible interruption to closed loop. Pod restarted in < 5 minutes.

Deleted dev-aai-cassandra-0. No discernible interruption to closed loop. Pod restarted in 2 minutes.

Deleted dev-aai-data-router-69b8d8ff64-7qvjl. After two minutes all packets stopped; traffic recovered in 5 minutes (possibly an intermittent network or packet generator issue). Pod restarted in 2 minutes.

Deleted dev-aai-hbase-5d9f9b4595-m72pf. No discernible interruption to closed loop. Pod restarted in 2 minutes.

Deleted dev-aai-resources-5f658d4b64-66p7b. Closed loop failed immediately. Pod restarted in 2 minutes. Closed loop never recovered.



APPC (3-node cluster)

20 minutes

Pass

Deleted dev-appc-0. Closed loop failed immediately. dev-appc-0 pod restarted in 15 minutes. Closed loop recovered in 20 minutes.

Deleted dev-appc-cdt-57548cf886-8z468. No discernible interruption to closed loop. Pod restarted in 2 minutes.

Deleted dev-appc-db-0. No discernible interruption to closed loop. Pod restarted in 3 minutes.
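The onboarding row above mentions a script that kills components at random while onboarding continues. That script is not part of this page, so the following is only a sketch of how such a script might pick and delete a pod. The function names, the `dev` release prefix, and the component names are assumptions based on the pod names listed in the notes (for example `dev-appc-0`).

```python
import random
import subprocess
from typing import Iterable, List, Optional


def list_pods(namespace: str = "onap") -> List[str]:
    """Return the pod names in `namespace` (lines look like 'pod/<name>')."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "name"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line.split("/", 1)[-1] for line in out.splitlines() if line.strip()]


def pick_victim(
    pods: Iterable[str],
    components: Iterable[str],  # hypothetical component names, e.g. "appc"
    release: str = "dev",       # assumed Helm release prefix
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick one random pod belonging to any of the given components."""
    rng = rng or random.Random()
    prefixes = tuple(f"{release}-{c}-" for c in components)
    candidates = sorted(p for p in pods if p.startswith(prefixes))
    return rng.choice(candidates) if candidates else None


def kill_pod(pod: str, namespace: str = "onap") -> None:
    """Delete the pod; Kubernetes is expected to start a replacement."""
    subprocess.run(["kubectl", "delete", "pod", pod, "-n", namespace], check=True)
```

A test loop could call `pick_victim(list_pods(), ["sdc", "so", "aai", "sdnc"])` at intervals and pass the result to `kill_pod` while the onboarding command is running. Selection is kept separate from deletion so that the choice can be logged or dry-run first.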

Requirement

Area: Resiliency

Priority: High

Min. Level:

  • Level 2 – run-time projects

  • Level 1 – remaining projects

Stretch Goal:

  • Level 3 – run-time projects

  • Level 2 – remaining projects

Level Descriptions (Abbreviated):

  • 1 – manual failure and recovery (< 30 minutes)

  • 2 – automated detection and recovery (single site) (< 30 minutes)

  • 3 – automated detection and recovery (geo redundancy)
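As an illustration of how a measured result maps onto these levels, the sketch below classifies a single test outcome. It is a simplified reading of the abbreviated level descriptions, not an official scoring tool: the `Result` type, the `automated` flag, and the treatment of "never recovered" as level 0 are assumptions. Level 3 (geo redundancy) is never returned because the tests on this page were run on a single site.

```python
from dataclasses import dataclass
from typing import Optional

LIMIT_MINUTES = 30.0  # recovery bound stated for levels 1 and 2


@dataclass
class Result:
    component: str
    recovery_minutes: Optional[float]  # None means the system never recovered
    automated: bool  # True if detection and recovery needed no manual step


def achieved_level(result: Result) -> int:
    """Return 0, 1, or 2 according to the abbreviated level descriptions."""
    if result.recovery_minutes is None:
        return 0  # no recovery at all
    if result.recovery_minutes >= LIMIT_MINUTES:
        return 0  # recovered, but not within 30 minutes
    return 2 if result.automated else 1


if __name__ == "__main__":
    # Values taken from the closed loop rows above; `automated=True` is an
    # assumption, since Kubernetes restarted the pods without intervention.
    for r in (Result("APPC", 20, True), Result("A&AI", None, True)):
        print(r.component, achieved_level(r))
```

Under these assumptions, the APPC closed loop result (recovered in 20 minutes) would count as level 2, while the A&AI result (never recovered) would not meet level 1.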