The intent of the 72 hour stability test is not to exhaustively test all functions but to run a steady load against the system and look for issues like memory leaks that are not found in the short-duration install and functional testing during the development cycle.
This page will collect notes on the 72 hour stability test run for Frankfurt.
See El Alto Stability Run Notes for comparison to previous runs.
Summary of Results
WORK IN PROGRESS
Setup
The integration-longevity tenant in the Intel/Windriver environment was used for the 72 hour tests.
The onap-ci job for "Project windriver-longevity-release-manual" was used for the deployment, with the OOM branch set to frankfurt and the Integration branch set to master. Integration master was used so we could catch the latest updates to integration scripts and VNF heat templates.
The Jenkins job needs a couple of updates for each release:
- Set the integration branch to 'origin/master'
- Modify the parameters to deploy.sh to specify "-i master" and "-o frankfurt" to get integration master and OOM frankfurt clones onto the NFS server (see the sketch below).
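For reference, a minimal sketch of the resulting deploy step (the wrapper location is an assumption; only the -i and -o flags come from the job parameters above):

```bash
# Hypothetical invocation of the deploy wrapper from the Jenkins job:
# -i selects the Integration branch, -o selects the OOM branch cloned on the NFS server
./deploy.sh -i master -o frankfurt
```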
The path for robot logs on dockerdata-nfs changed in Frankfurt: /dev-robot/ becomes /dev/robot/.
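For example, on the NFS server (the /dockerdata-nfs mount point comes from the note above; the exact subdirectory layout is an assumption):

```bash
# El Alto and earlier
ls /dockerdata-nfs/dev-robot/
# Frankfurt
ls /dockerdata-nfs/dev/robot/
```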
The stability tests used robot container image 1.6.1-STAGING-20200519T201214Z
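One way to confirm which image is running (assuming the robot deployment is named dev-robot, consistent with the pod names later on this page):

```bash
# Print the image of the first container in the robot deployment
kubectl -n onap get deployment dev-robot \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```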
Robot container updates:
API_TYPE was set to GRA_API since we have deprecated VNF_API.
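API_TYPE is an ordinary Robot Framework variable, so the override can also be passed on the command line; a sketch, assuming a direct robot invocation rather than the container's wrapper scripts:

```bash
# -v overrides a Robot Framework variable, -i selects tests by tag;
# "." is the suite directory for this illustration
robot -v API_TYPE:GRA_API -i stability72hr .
```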
Shakedown consists of creating some temporary tags (stability72hrvLB, stability72hrvVG, stability72hrVFWCL) to make sure each sub-test ran successfully (including cleanup) in the environment before the Jenkins job started with the higher level testsuite tag stability72hr that covers all three test types (see the example below).
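For example, one sub-test at a time can be shaken down by tag (a sketch, assuming the standard ete-k8s.sh wrapper from oom/kubernetes/robot):

```bash
# Run only the vLB stability sub-test, including its cleanup, in the onap namespace
./ete-k8s.sh onap stability72hrvLB
```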
Clean out the old build history using a Jenkins script console script (Manage Jenkins → Script Console):
```groovy
// Delete all stored builds for the stability job and reset its build counter
def jobName = "windriver-longevity-stability72hr"
def job = Jenkins.instance.getItem(jobName)
job.getBuilds().each { it.delete() }   // remove each archived build
job.nextBuildNumber = 1                // restart numbering at #1
job.save()                             // persist the new build number
```
appc.properties was updated to apply the fix for DMaaP message processing so that it calls http://localhost:8181 for the streams update.
VNF Orchestration Tests
This test uses the onap-ci job "Project windriver-longevity-stability72hr" to automatically onboard, distribute, and instantiate the ONAP open source test VNFs vLB, vVG, and vFWCL.
The scripts run validation tests after the install.
The scripts then delete the VNFs and clean up the environment for the next run.
The script tests AAF, DMaaP, SDC, VID, AAI, SO, SDNC, and APPC with the open source VNFs.
There was a problem with the robot scripts for vLB where they were not finding the base_lb.yaml file in the artifacts due to a change in the structure. A two line change was made to the VNF orchestration script to look for the 'heat3' key to resolve the issue. A Jira (INT-1598) was created to track the changes to the robot scripts.
These tests started at Jenkins job #1.
Each test run generates over 500 MB of Robot Framework log data.
Each test run also runs the kubectl top nodes command to see CPU and memory utilization across the k8s cluster.
We periodically run the kubectl top pods command as well to check on the pods using the most memory and CPU; the exact commands are shown below.
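These are the same commands captured in the status rows of the table below; the pod view sorts on the third (memory) column and keeps the top 20 entries:

```bash
# CPU/memory utilization per worker node in the k8s cluster
kubectl top nodes

# Top 20 pods in the onap namespace, sorted numerically by the memory column
kubectl -n onap top pod | sort -rn -k3 | head -20
```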
http://10.12.6.182:8080/jenkins/job/windriver-longevity-stability72hr/
Test # | Comment | Message |
---|---|---|
k8s utilization Wed May 20 18:45:15 UTC 2020 | Memory: | |
#1 | TOOLING Startup issues - modified the customer uuid to shorten the string in the tooling, since it looked like robot selenium was having trouble "seeing" the string in the drop-down. | vDNS: NoSuchElementException: Message: Could not locate element with visible text: ETE_Customer_aaaf3926-d765-4c47-93b9-857e674d2d01 vVG: NoSuchElementException: Message: Could not locate element with visible text: ETE_Customer_08f8a099-3e2b-480f-8153-5b4173d9394a vFW: Succeeded |
#4 | ENV ${vnf} = vFWCLvPKG The Robot heatbridge run after the deployment failed trying to find the stack in OpenStack, which usually means OpenStack was slow in deploying the VNF. Heatbridge had succeeded for the vFWCLvSNK inside the same service instantiation. | Keyword 'Get Deployed Stack' failed after retrying for 10 minutes. The last error was: KeyError: 'stack' |
#13 | ENV ${vnf} = vFWCLvPKG The Robot heatbridge run after the deployment failed trying to find the stack in OpenStack, which usually means OpenStack was slow in deploying the VNF. Heatbridge had succeeded for the vFWCLvSNK inside the same service instantiation. | Keyword 'Get Deployed Stack' failed after retrying for 10 minutes. The last error was: KeyError: 'stack' |
#14 | TOOLING or ENV The vDNS and vVG robot scripts couldn't find elements in the GUI drop-downs. Likely transient networking issues. vFW succeeded, and all three are in the test run (vDNS, vVG, vFW in that order). | vDNS: Keyword 'Wait For Model' failed after retrying for 3 minutes. The last error was: Element 'xpath=//tr[td/span/text() = 'vLB 2020-05-20 13-06-03']/td/button[contains(text(),'Deploy')]' not visible after 1 minute. vVG: NoSuchElementException: Message: Could not locate element with visible text: ETE_Customer_9f739343-cbc7-4ee4-8697-ea52f06e7796 vFW Succeeded |
#15 | TOOLING Virtual Volume Group - Failure of robot selenium to find the customer in the search window. Timing issue. | NoSuchElementException: Message: Could not locate element with visible text: ETE_Customer_26e85655-1f44-4e7e-8cd2-e9fab290af01 |
#17 | ENV or TOOLING Failure of robot selenium at the second VNF in the service package. Robot likely needs tuning of the wait for the module name to appear in the drop-down under transient conditions. | Element 'xpath=//div[contains(.,'Ete_vFWCLvPKG_f716b1bd_1')]/div/button[contains(.,'Add VF-Module')]' did not appear in 1 minute. |
#18 | ENV K8s worker node problem. kubectl top nodes listed k8s-04 as unknown. k8s-04 is on 10.12.6.0, which could be a contributing factor; .0 and .32 addresses in Windriver show suspect behavior. The worker going down caused a set of containers to be restarted, which is the right behavior from a k8s standpoint. The test could not run while the robot container was down. | 12:00:25 Instantiate Virtual DNS GRA command terminated with exit code 137 12:22:22 + retval=137 12:22:22 ++ echo 'kubectl exec -n onap dev-robot-56c5b65dd-dkks4 -- ls -1t /share/logs | grep stability72hr | head -1' 12:22:22 ++ ssh -i /var/lib/jenkins/.ssh/onap_key ubuntu@10.12.5.205 sudo su 12:22:25 error: unable to upgrade connection: container not found ("robot") |
#19 #20 | TOOLING K8s restarted the robot pod, so the manual fixes to vnf_orchestration_test_template for the heat3 parsing issues were lost. Reapplied the manual fixes so parsing SDC artifacts to find the base_vlb resource succeeded again. | Unable to find catalog resource for vLB base_vlb' |
#32 | TOOLING Robot script did not find the subscriber name in the search results. Likely a timing issue: robot is too fast, looking for the JSON data in the drop-down before it is fully loaded. | Create Service Instance → vid_interface . Click On Element When Visible //select[@prompt='Select Subscriber Name' |
#35 | ENV vDNS instantiate failed at the OpenStack stage. Potentially slowed OpenStack caused SO to resubmit a request that subsequently became a duplicate from the OpenStack perspective. Looks like a functional bug in the SO-to-OpenStack interaction triggered by the environment, not stability related. | CREATE failed: Conflict: resources.vlb_0_onap_private_port_0: IP address 10.0.211.24 already allocated in subnet be057760-1ffa-4827-a6df-75d355c4d45a\nNeutron server returns request_ids: ['req-ca6e5f39-7462-47c6-aaa8-9653783828cb'] |
#37 | ENV vVG and vFW failed on VID screen errors looking for data items. Investigation shows that the aai-traversal pod restarted. Looks like slow networking caused the pod to be redeployed, but that is not conclusive. Initially SO and VID failed health check until aai-traversal was up; then both passed health check. | |
Thu May 21 12:33:45 UTC 2020 | Memory: root@long-nfs:/home/ubuntu# kubectl -n onap top pod | sort -rn -k3 | head -20 | |
#38 | ENV vDNS - Timeout waiting for the model to be visible via the Deploy button in VID. vVG and vFW succeeded. Transient slowness, since the 2nd and 3rd VNFs succeeded. | Keyword 'Wait For Model' failed after retrying for 3 minutes. The last error was: TypeError: object of type 'NoneType' has no len() |
#47 | TOOLING vDNS - Selenium error seeing the Subscriber Name. vVG and vFW worked. Transient. | vid_interface . Click On Element When Visible //select[@prompt='Select Subscriber Name'] StaleElementReferenceException: Message: stale element reference: element is not attached to the page document |
Fri May 22 03:41:11 UTC 2020 | root@long-nfs:/home/ubuntu# kubectl -n onap top pod | sort -nr -k 3 | head -20 | |
#53 | ENV vDNS instantiate failed at the OpenStack stage. Potentially slowed OpenStack caused SO to resubmit a request that subsequently became a duplicate from the OpenStack perspective. Looks like a functional bug in the SO-to-OpenStack interaction triggered by the environment, not stability related. vVG and vFW succeeded in the same test. | STATUS: Received vfModuleException from VnfAdapter: category='INTERNAL' message='Exception during create VF org.onap.so.openstack.utils.StackCreationException: Stack Creation Failed Openstack Status: CREATE_FAILED Status Reason: Resource CREATE failed: Conflict: resources.vlb_0_onap_private_port_0: IP address 10.0.250.24 already allocated in subnet be057760-1ffa-4827-a6df-75d355c4d45a\nNeutron server returns request_ids |
Fri May 22 09:35:28 UTC 2020 | root@long-nfs:/home/ubuntu# kubectl -n onap top pod | sort -nr -k 3 | head -20 root@long-nfs:/home/ubuntu# kubectl -n onap top nodes | |
#58 | ENV ODL cluster communication error on vFW preload. This type of error is usually associated with network latency issues between nodes. The Akka configuration should be evaluated to loosen the timeout settings for public cloud or other slow environments. Discuss with Dan. A GET to https://{{sdnc_ssl_port}}/restconf/config/VNF-API:preload-vnfs/ succeeds. | O Get Request using : alias=sdnc, uri=/restconf/config/VNF-API:preload-vnfs/vnf-preload-list/Vfmodule_Ete_vFWCLvFWSNK_e401f06d_0/VfwclVfwsnkA143de8bE20f..base_vfw..module-0, headers={'X-FromAppId': 'robot-ete', 'X-TransactionId': '922f999d-2444-4bcd-b5ad-60fbf553735d', 'Content-Type': 'application/json', 'Accept': 'application/json'} json=None 04:36:17.031 INFO Received response from [sdnc]: {"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"Error executeRead ReadData for path /(org:onap:sdnctl:vnf?revision=2015-07-20)preload-vnfs/vnf-preload-list/vnf-preload-list[{(org:onap:sdnctl:vnf?revision=2015-07-20)vnf-type=VfwclVfwsnkA143de8bE20f..base_vfw..module-0, (org:onap:sdnctl:vnf?revision=2015-07-20)vnf-name=Vfmodule_Ete_vFWCLvFWSNK_e401f06d_0}]","error-info":"Shard member-2-shard-default-config currently has no leader. Try again later."}]}} https://{{sdnc_ssl_port}}/jolokia/read/org.opendaylight.controller:type=DistributedOperationalDatastore,Category=ShardManager,name=shard-manager-operational |
Interim Status on VNF Orchestration
Notice the improved test duration after the automated K8s node reconfiguration moved loads off k8s-04.
We will run final numbers at the end of the test, but most of the problems appear to be environment and tooling issues.
Closed Loop Tests
This test uses the onap-ci job "Project windriver-longevity-vfwclosedloop".
The test uses the robot test script "demo-k8s.sh vfwclosedloop". The script sets the number of streams on the vPacketGenerator to 10, waits for the control loop to bring the 10 set streams back down to 5 streams, then sets the streams to 1 and again waits for the return to 5 streams.
A successful run exercises the loop from the VNF through DCAE, DMaaP, Policy, AAI, AAF, and APPC (the packet generator interface involved is sketched below).
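For reference, the stream count the loop manipulates is held in the packet generator's sample-plugin configuration. A hedged sketch of reading it (port 8183, the sample-plugin path, and the admin credentials follow the usual ONAP vFW demo packet generator and are assumptions here; PKG_IP is defined in the Jenkins job below):

```bash
# Inspect the currently enabled pg-streams on the packet generator (assumed API)
curl -u admin:admin \
  http://${PKG_IP}:8183/restconf/config/sample-plugin:sample-plugin/pg-streams
```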
In the Jenkins job:
Modify the NFS_IP and PKG_IP parameters to point to the current NFS server and packet generator in the tenant; a hypothetical invocation using these values follows below.
NFS_IP=10.12.5.205
PKG_IP=10.12.5.247
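A minimal sketch of what the job runs with these parameters (the ssh key, user, and script path are assumptions patterned on the stability job console log earlier on this page; the vfwclosedloop argument is the standard demo-k8s.sh usage):

```bash
# Run one closed-loop check from the NFS/control host
ssh -i ~/.ssh/onap_key ubuntu@${NFS_IP} \
  "cd oom/kubernetes/robot && ./demo-k8s.sh onap vfwclosedloop ${PKG_IP}"
```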
Initially the policy in the TCA Key Value store was not in sync with Policy due to the Demo VNF instantiation issue.
Since consul-server-ui is not enabled by default, we had to edit the service to expose consul-server-ui as a NodePort (see the kubectl sketch below) and then go to the UI page to edit the ControlLoop vFW policy to use the same model-invariant-id that was used with the instantiate so the A&AI query would succeed.
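One way to do the NodePort exposure (the service name comes from the text; the onap namespace and the patch approach are assumptions):

```bash
# Switch the consul-server-ui service to a NodePort, then look up the assigned port
kubectl -n onap patch svc consul-server-ui -p '{"spec": {"type": "NodePort"}}'
kubectl -n onap get svc consul-server-ui
```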
http://10.12.5.185:32512/ui/#/dc1/kv/dcae-tca-analytics/edit (note: the NodePort was ephemeral)
closedLoopControlName was edited in two places (for the Hi and Low thresholds) to specify "ControlLoop-vFirewall-cdf42e53-b49b-4d9f-a621-fa9521111615"; "cdf42e53-b49b-4d9f-a621-fa9521111615" was the new, matching model-invariant-id.
The tests start with #1
http://10.12.6.182:8080/jenkins/job/windriver-longevity-vfwclosedloop/
Test # | Comment | Message |
---|---|---|
0-20 | No errors | |
21-40 | No errors | |
Interim Status on Closed Loop Testing (~30% through the stability run)