Frankfurt Stability Run Notes
The intent of the 72 hour stability test is not to exhaustively exercise every function, but to run a steady load against the system and look for issues such as memory leaks that are not caught by the short-duration install and functional testing done during the development cycle.
This page will collect notes on the 72 hour stability test run for Frankfurt.
See El Alto Stability Run Notes for comparison to previous runs.
Summary of Results
The 72 hour stability run result was PASS.
The onboard and instantiate tests ran for over 115 hours before environment issues stopped the test. There were errors due to both tooling and environment problems, as indicated in the log.
Overall memory utilization grew only about 2% on the worker nodes despite the environment issues. Interestingly, memory on the kubernetes orchestration nodes grew more, which could mean we are over-driving the APIs in some fashion.
We did not limit other tenant activities in Windriver during this test run, and we saw the impact: things like the re-install of SB00 in the tenant and general network latency made openstack slower to instantiate.
For future stability runs we should return to the practice of shutting down non-critical tenants in the test environment to free up host resources for the test run (or find other ways to prevent other testing from affecting the stability run).
The control loop tests were 100% successful, and the cycle time for the loop was fairly consistent despite the environment issues. Future control loop stability tests should consider doing more policy-edit type activities and running more control loops if host resources are available. The 10 second VES telemetry event interval is quite aggressive, so we are sending more load into the VES collector and TCA engine during onset events than would be typical; adding additional loops should factor that in.
The jenkins jobs ran fairly well, although instantiating the Demo vFWCL took longer than usual; this should be factored into future test planning.
Setup
The integration-longevity tenant in Intel/Windriver environment was used for the 72 hour tests.
The onap-ci job "Project windriver-longevity-release-manual" was used for the deployment, with OOM set to frankfurt and the Integration branch set to master. Integration master was used so we could catch the latest updates to integration scripts and VNF heat templates.
The jenkins job needs a couple of updates for each release:
Set the integration branch to 'origin/master'
Modify the parameters to deploy.sh to specify "-i master" and "-o frankfurt" to get Integration master and OOM frankfurt clones onto the NFS server.
The path for robot logs on dockerdata-nfs changed in Frankfurt: /dev-robot/ becomes /dev/robot.
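The Frankfurt path change can be handled with a one-line substitution in any scripts that reference the old layout; a minimal sketch (only the dev-robot to dev/robot change comes from these notes, the surrounding path segments are illustrative assumptions):

```shell
#!/bin/sh
# Frankfurt moved robot logs from .../dev-robot/... to .../dev/robot/...
# on dockerdata-nfs; rewrite an old-style path accordingly.
# (The "logs/demo" suffix is a made-up example.)
old="/dockerdata-nfs/dev-robot/logs/demo"
new=$(printf '%s\n' "$old" | sed 's|/dev-robot/|/dev/robot/|')
printf '%s\n' "$new"   # -> /dockerdata-nfs/dev/robot/logs/demo
```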
The stability tests used robot container image 1.6.1-STAGING-20200519T201214Z
robot container updates:
API_TYPE was set to GRA_API since we have deprecated VNF_API.
Shakedown consists of creating temporary tags (stability72hrvLB, stability72hrvVG, stability72hrVFWCL) to make sure each sub-test ran successfully (including cleanup) in the environment before the jenkins job started with the higher level test suite tag stability72hr that covers all three test types.
Clean out the old build jobs using a jenkins console script (Manage Jenkins):
// Delete all previous builds for the longevity job and reset numbering
def jobName = "windriver-longevity-stability72hr"
def job = Jenkins.instance.getItem(jobName)
job.getBuilds().each { it.delete() }
job.nextBuildNumber = 1
job.save()
appc.properties updated to apply the fix for DMaaP message processing to call http://localhost:8181 for the streams update.
VNF Orchestration Tests
This test uses the onap-ci job "Project windriver-longevity-stability72hr" to automatically onboard, distribute and instantiate the ONAP opensource test VNFs vLB, vVG and vFWCL.
The scripts run validation tests after the install.
The scripts then delete the VNFs and clean up the environment for the next run.
The script tests AAF, DMaaP, SDC, VID, AAI, SO, SDNC, APPC with the open source VNFs.
There was a problem with the robot scripts for vLB where the base_lb.yaml file was not being found in the artifacts due to a change in their structure. A two-line change to the vnf orchestration script to look for the 'heat3' key was made to resolve the issue. A Jira was created to track the changes to the robot scripts: INT-1598 (robot test script issues during 72 hour stability run setup, Closed).
These tests started at jenkins job #1
Each test run generates over 500 MB of data through the Robot Framework.
Each test run also runs the kubectl top nodes command to see CPU and memory utilization across the k8s cluster.
We also periodically run the kubectl top pods command to check the top memory- and CPU-consuming pods.
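The periodic checks above boil down to two one-liners; the sketch below runs the same sort/head stage against canned sample output so the pipeline can be exercised without a cluster (the pod names and numbers are made up):

```shell
#!/bin/sh
# On the cluster the checks were:
#   kubectl -n onap top nodes
#   kubectl -n onap top pod | sort -rn -k3 | head -20
# Here we feed the sort stage sample 'top pod' lines instead of live data;
# sort -rn -k3 orders by the numeric prefix of the memory column, descending.
sample="dev-so-bpmn-infra 12m 1200Mi
dev-appc-0 6m 2849Mi
dev-aai-traversal 8m 1750Mi"

printf '%s\n' "$sample" | sort -rn -k3 | head -2
```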
http://10.12.6.182:8080/jenkins/job/windriver-longevity-stability72hr/
Test # | Comment | Message |
---|---|---|
k8 utilization Wed May 20 18:45:15 UTC 2020 | Memory: | |
#1 | TOOLING Startup issues - modified the customer UUID to shorten the string in the tooling since it looked like robot selenium was having trouble "seeing" the string in the drop-down. | vDNS: NoSuchElementException: Message: Could not locate element with visible text: ETE_Customer_aaaf3926-d765-4c47-93b9-857e674d2d01 vVG: NoSuchElementException: Message: Could not locate element with visible text: ETE_Customer_08f8a099-3e2b-480f-8153-5b4173d9394a vFW: Succeeded |
#4 | ENV ${vnf} = vFWCLvPKG. The robot heatbridge run after the deployment failed trying to find the stack in openstack, which usually means openstack was slow in deploying the VNF. Heatbridge had succeeded for the vFWCLvSNK inside the same service instantiate. | Keyword 'Get Deployed Stack' failed after retrying for 10 minutes. The last error was: KeyError: 'stack' |
#13 | ENV ${vnf} = vFWCLvPKG. The robot heatbridge run after the deployment failed trying to find the stack in openstack, which usually means openstack was slow in deploying the VNF. Heatbridge had succeeded for the vFWCLvSNK inside the same service instantiate. | Keyword 'Get Deployed Stack' failed after retrying for 10 minutes. The last error was: KeyError: 'stack' |
#14 | TOOLING or ENV The vDNS and vVG robot scripts couldn't find elements on the GUI drop-downs. Likely transient networking issues. vFW succeeded and all three are in the test run (vDNS, vVG, vFW in that order). | vDNS : Keyword 'Wait For Model' failed after retrying for 3 minutes. The last error was: Element 'xpath=//tr[td/span/text() = 'vLB 2020-05-20 13-06-03']/td/button[contains(text(),'Deploy')]' not visible after 1 minute. vVG: NoSuchElementException: Message: Could not locate element with visible text: ETE_Customer_9f739343-cbc7-4ee4-8697-ea52f06e7796 vFW Succeeded |
#15 | TOOLING Virtual Volume Group - Failure in robot selenium to find customer in search window. Timing issue. | NoSuchElementException: Message: Could not locate element with visible text: ETE_Customer_26e85655-1f44-4e7e-8cd2-e9fab290af01 |
#17 | ENV or TOOLING Failure in robot selenium at second VNF in service package. Likely tuning of robot needed waiting for the module name to appear in the drop down under transient conditions. | Element 'xpath=//div[contains(.,'Ete_vFWCLvPKG_f716b1bd_1')]/div/button[contains(.,'Add VF-Module')]' did not appear in 1 minute. |
#18 | ENV K8s worker node problem. kubectl top nodes listed k8s-04 as unknown. k8s-04 is on 10.12.6.0, which could be a contributing factor - .0 and .32 addresses in windriver have suspect behavior. The worker going down caused a set of containers to be restarted, which is the right behavior from a k8s standpoint. The test could not run while the robot container was down. | 12:00:25 Instantiate Virtual DNS GRA command terminated with exit code 137 |
#19 #20 | TOOLING k8s restarted the robot pod. The manual fixes to vnf_orchestration_test_template for the heat3 parsing issues were lost with the restart. Reapplied the manual fixes so parsing SDC artifacts to find the base_vlb resource succeeded again. | Unable to find catalog resource for vLB base_vlb' |
#32 | TOOLING Robot script did not find the subscriber name in search results. Likely a timing issue: robot looks for JSON data in the drop-down before it is fully loaded. | Create Service Instance → vid_interface . Click On Element When Visible //select[@prompt='Select Subscriber Name' |
#35 | ENV vDNS instantiate failed at the openstack stage. A slowdown in openstack potentially caused SO to resubmit a request that subsequently became a duplicate from the openstack perspective. Looks like a functional bug in the SO-to-openstack path triggered by the environment, not stability related. | CREATE failed: Conflict: resources.vlb_0_onap_private_port_0: IP address 10.0.211.24 already allocated in subnet be057760-1ffa-4827-a6df-75d355c4d45a\nNeutron server returns request_ids: ['req-ca6e5f39-7462-47c6-aaa8-9653783828cb'] |
#37 | ENV vG and vFW failed with VID screen errors looking for data items. Investigation shows that the aai-traversal pod restarted. Looks like slow networking caused the pod to be redeployed, but this is not conclusive. Initially SO and VID failed health check until aai-traversal was up; then both passed health check. | |
Thu May 21 12:33:45 UTC 2020 | Memory: root@long-nfs:/home/ubuntu# kubectl -n onap top pod | sort -rn -k3 | head -20 | |
#38 | ENV vDNS - Timeout waiting for the model to be visible via the Deploy button in VID. vVG and vFW succeeded. Transient slowness, since the 2nd and 3rd VNFs succeeded. | Keyword 'Wait For Model' failed after retrying for 3 minutes. The last error was: TypeError: object of type 'NoneType' has no len() |
#47 | TOOLING vDNS - Selenium error seeing the Subscriber Name. vVG and vFW worked. Transient. | vid_interface . Click On Element When Visible //select[@prompt='Select Subscriber Name'] StaleElementReferenceException: Message: stale element reference: element is not attached to the page document |
Fri May 22 03:41:11 UTC 2020 | root@long-nfs:/home/ubuntu# kubectl -n onap top pod | sort -nr -k 3 | head -20 | |
#53 | ENV vDNS instantiate failed at the openstack stage. A slowdown in openstack potentially caused SO to resubmit a request that subsequently became a duplicate from the openstack perspective. Looks like a functional bug in the SO-to-openstack path triggered by the environment, not stability related. vVG and vFW succeeded in the same test. | STATUS: Received vfModuleException from VnfAdapter: category='INTERNAL' message='Exception during create VF org.onap.so.openstack.utils.StackCreationException: Stack Creation Failed Openstack Status: CREATE_FAILED Status Reason: Resource CREATE failed: Conflict: resources.vlb_0_onap_private_port_0: IP address 10.0.250.24 already allocated in subnet be057760-1ffa-4827-a6df-75d355c4d45a\nNeutron server returns request_ids |
Fri May 22 09:35:28 UTC 2020 | root@long-nfs:/home/ubuntu# kubectl -n onap top pod | sort -nr -k 3 | head -20 root@long-nfs:/home/ubuntu# kubectl -n onap top nodes | |
#58 | ENV ODL cluster communication error on vFW preload. This type of error is usually associated with network latency issues between nodes. The Akka configuration should be evaluated to loosen the timeout settings for public cloud or other slow environments. Discuss with Dan. A later GET to https://{{sdnc_ssl_port}}/restconf/config/VNF-API:preload-vnfs/ succeeds. | O Get Request using : alias=sdnc, uri=/restconf/config/VNF-API:preload-vnfs/vnf-preload-list/Vfmodule_Ete_vFWCLvFWSNK_e401f06d_0/VfwclVfwsnkA143de8bE20f..base_vfw..module-0, headers={'X-FromAppId': 'robot-ete', 'X-TransactionId': '922f999d-2444-4bcd-b5ad-60fbf553735d', 'Content-Type': 'application/json', 'Accept': 'application/json'} json=None 04:36:17.031 INFO Received response from [sdnc]: {"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"Error executeRead ReadData for path /(org:onap:sdnctl:vnf?revision=2015-07-20)preload-vnfs/vnf-preload-list/vnf-preload-list[{(org:onap:sdnctl:vnf?revision=2015-07-20)vnf-type=VfwclVfwsnkA143de8bE20f..base_vfw..module-0, (org:onap:sdnctl:vnf?revision=2015-07-20)vnf-name=Vfmodule_Ete_vFWCLvFWSNK_e401f06d_0}]","error-info":"Shard member-2-shard-default-config currently has no leader. Try again later."}]}} https://{{sdnc_ssl_port}}/jolokia/read/org.opendaylight.controller:type=DistributedOperationalDatastore,Category=ShardManager,name=shard-manager-operational cluster health{
"request": {
"mbean": "org.opendaylight.controller:Category=ShardManager,name=shard-manager-operational,type=DistributedOperationalDatastore",
"type": "read"
},
"value": {
"LocalShards": [
"member-3-shard-default-operational",
"member-3-shard-prefix-configuration-shard-operational",
"member-3-shard-topology-operational",
"member-3-shard-entity-ownership-operational",
"member-3-shard-inventory-operational",
"member-3-shard-toaster-operational"
],
"SyncStatus": true,
"MemberName": "member-3"
},
"timestamp": 1590141147,
"status": 200
} |
#59 #60 #61 | ENV Looks like the #58 environment issue affected networking or pod performance for SDC-BE as well. Deleted the SDNC-1 pod to fix the shard leader issue. Deleted the SDC-BE pod to fix the SDC issue (it had failed liveness probes and the automated k8s restart did not work). The new containers created by k8s were successful. | |
#72 | ENV vFWvPKG Heatbridge failed to see the deployed stack after 10 minutes. Usually means an openstack issue. vFWvSNK, vDNS and vVG had succeeded. | Keyword 'Get Deployed Stack' failed after retrying for 10 minutes. The last error was: KeyError: 'stack' |
#73 | TOOLING vFWvSNK Heatbridge AAI validation failed to find the node in AAI. After the test, we re-ran the query and the data was there. Most likely the tooling did not wait long enough for replication across the Cassandra nodes to occur. We should consider adding a delay in robot between the openstack completion in SO and the AAI query. | AAI Heatbridge Validation post response: {"requestError":{"serviceException":{"messageId":"SVC3001","text":"Resource not found for %1 using id %2 (msg=%3) (ec=%4)","variables":["POST Search","getNamedQueryResponse","Node Not Found:No Node of type vserver found for properties","ERR.5.4.6114"]}}} post request Post Request using : alias=aai, uri=/aai/search/named-query, data={
"query-parameters": {
"named-query": {
"named-query-uuid": "f199cb88-5e69-4b1f-93e0-6f257877d066"
}
},
"instance-filters": {
"instance-filter": [
{
"vserver":
{
"vserver-name": "vofwl01fwle37c"
}
}
]
}
} post test query results{
"inventory-response-item": [
{
"vserver": {
"vserver-id": "175e2a27-436d-423b-9518-21d5c504299f",
"vserver-name": "vofwl01fwle37c",
"vserver-name2": "vofwl01fwle37c",
"prov-status": "ACTIVE",
"vserver-selflink": "http://10.12.25.2:8774/v2.1/28481f6939614cfd83e6767a0e039bcc/servers/175e2a27-436d-423b-9518-21d5c504299f",
"in-maint": false,
"is-closed-loop-disabled": false,
"resource-version": "1590194419699"
},
"extra-properties": {},
"inventory-response-items": {
"inventory-response-item": [
{
"model-name": "vFWCL_vFWSNK cd634f60-3362",
"generic-vnf": {
"vnf-id": "86a93f6d-0540-41da-8e98-5de910ec4088",
"vnf-name": "Ete_vFWCLvFWSNK_5afbe37c_0",
"vnf-type": "vFWCL 2020-05-23 00-25-/vFWCL_vFWSNK cd634f60-3362 0",
"service-id": "a105505b-bb52-4b0d-a2fd-165056e7e6ea",
"prov-status": "ACTIVE",
"orchestration-status": "Active",
"in-maint": false,
"is-closed-loop-disabled": false,
"resource-version": "1590194431230",
"model-invariant-id": "76369d2d-2797-441b-a197-764b581d7a1c",
"model-version-id": "0a064ab0-bfc1-4aee-95b6-13d8161179bd",
"model-customization-id": "da2190dc-721d-4f05-9412-3bcea987d736"
},
"extra-properties": {
"extra-property": [
{
"property-name": "model-ver.model-version-id",
"property-value": "0a064ab0-bfc1-4aee-95b6-13d8161179bd"
},
{
"property-name": "model-ver.model-name",
"property-value": "vFWCL_vFWSNK cd634f60-3362"
},
{
"property-name": "model.model-type",
"property-value": "resource"
},
{
"property-name": "model.model-invariant-id",
"property-value": "76369d2d-2797-441b-a197-764b581d7a1c"
},
{
"property-name": "model-ver.model-version",
"property-value": "1.0"
}
]
},
"inventory-response-items": {
"inventory-response-item": [
{
"model-name": "VfwclVfwsnkCd634f603362..base_vfw..module-0",
"vf-module": {
"vf-module-id": "dd63d577-7b63-4274-b346-6ad1a5f07e31",
"vf-module-name": "Vfmodule_Ete_vFWCLvFWSNK_5afbe37c_0",
"heat-stack-id": "Vfmodule_Ete_vFWCLvFWSNK_5afbe37c_0/00692c06-b498-4a4f-99f8-5d0529e40921",
"orchestration-status": "active",
"is-base-vf-module": true,
"automated-assignment": false,
"resource-version": "1590194420852",
"model-invariant-id": "74d0a469-8791-42d9-ad84-0b7f6720c00f",
"model-version-id": "177de70f-2afd-4617-81b8-01765eec8e53",
"model-customization-id": "c44d47c1-706b-4fa3-960b-57792cba809c",
"module-index": 0
},
"extra-properties": {
"extra-property": [
{
"property-name": "model-ver.model-version-id",
"property-value": "177de70f-2afd-4617-81b8-01765eec8e53"
},
{
"property-name": "model-ver.model-name",
"property-value": "VfwclVfwsnkCd634f603362..base_vfw..module-0"
},
{
"property-name": "model.model-type",
"property-value": "resource"
},
{
"property-name": "model.model-invariant-id",
"property-value": "74d0a469-8791-42d9-ad84-0b7f6720c00f"
},
{
"property-name": "model-ver.model-version",
"property-value": "1"
}
]
}
}
]
}
},
{
"tenant": {
"tenant-id": "28481f6939614cfd83e6767a0e039bcc",
"tenant-name": "Integration-Longevity",
"resource-version": "1590230418024"
},
"extra-properties": {},
"inventory-response-items": {
"inventory-response-item": [
{
"cloud-region": {
"cloud-owner": "CloudOwner",
"cloud-region-id": "RegionOne",
"cloud-type": "SharedNode",
"owner-defined-type": "OwnerType",
"cloud-region-version": "v1",
"cloud-zone": "CloudZone",
"orchestration-disabled": false,
"in-maint": false,
"resource-version": "1589992859784"
},
"extra-properties": {}
}
]
}
}
]
}
}
]
} |
#77 | TOOLING vFWvPKG failed on AAI validation after heat bridge. Same as #73. Query succeeded after the test when run from POSTMAN. | post response: {"requestError":{"serviceException":{"messageId":"SVC3001","text":"Resource not found for %1 using id %2 (msg=%3) (ec=%4)","variables":["POST Search","getNamedQueryResponse","Node Not Found:No Node of type vserver found for properties","ERR.5.4.6114"]}}} |
Sat May 23 11:01:16 UTC 2020 83 hours of testing completed (10 over the 72 hour planned duration) | root@long-nfs:~/oom/kubernetes/robot# kubectl -n onap top pod | sort -rn -k 3 | head -20 dev-appc-0 6m 2849Mi | |
The jenkins job collected the top nodes percentage growth. It is interesting that the memory for the k8s control plane nodes (orch-1, orch-2, orch-3) grew more than the ONAP worker nodes. | #83 top nodes; #2 top nodes | |
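The percentage-growth number reported above can be reproduced from any two top nodes snapshots; a minimal sketch (the MiB values below are illustrative, not the actual run data):

```shell
#!/bin/sh
# Memory growth between two 'kubectl top nodes' snapshots for one node:
#   growth% = (end - start) * 100 / start
start=2045   # MiB at test #2 (illustrative)
end=2086     # MiB at test #83 (illustrative, ~2% growth)
awk -v s="$start" -v e="$end" 'BEGIN { printf "%.1f%%\n", (e - s) * 100 / s }'
# -> 2.0%
```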
Interim Status on VNF Orchestration
Notice the improved test duration after the K8 node automated reconfiguration to move loads off k8s-04.
We will run final numbers at the end of the test but most of the problems appear to be environment and tooling issues.
Closed Loop Tests
This test uses the onap-ci job "Project windriver-longevity-vfwclosedloop".
The test uses the robot test script "demo-k8s.sh vfwclosedloop". The script sets the number of streams on the vPacket Generator to 10, waits for the control loop to change the 10 streams to 5, then sets the streams to 1 and again waits for 5 streams.
A successful run exercises the loop from the VNF through DCAE, DMaaP, Policy, AAI, AAF and APPC.
In the jenkins job:
Modify the NFS_IP and PKG_IP in the jenkins job to point to the current nfs server and packet generator in the tenant
NFS_IP=10.12.5.205
PKG_IP=10.12.5.247
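The pass criterion for each loop iteration can be sketched as a simple poll loop; the snippet below only simulates it (the polled stream counts are a mocked series passed in as an argument, where the real robot script queries the packet generator):

```shell
#!/bin/sh
# Simulation of the vfwclosedloop pass check: the robot script sets
# pg-streams to 10 (then 1) and waits for the control loop to bring the
# count back to 5. 'wait_for_streams' consumes a mocked series of polled
# counts instead of querying the real vPacket Generator.
TARGET=5

wait_for_streams() {
  for count in $1; do
    if [ "$count" -eq "$TARGET" ]; then
      echo "converged"
      return 0
    fi
  done
  echo "timeout"
  return 1
}

# After setting 10 streams, successive polls see the control loop act:
wait_for_streams "10 10 7 5"   # prints "converged"
```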
Initially the policy in the TCA Key Value store was not in sync with Policy due to the Demo VNF instantiation issue.
Since consul-server-ui is not enabled by default, we had to edit the service to expose consul-server-ui as a NodePort and then go to the UI page to edit the ControlLoop vFW policy to use the same model-invariant-id that was used with the instantiate, so the A&AI query would succeed.
http://10.12.5.185:32512/ui/#/dc1/kv/dcae-tca-analytics/edit (note the NodePort was ephemeral)
closedLoopControlName was edited in two places (for the Hi and Low thresholds) to specify "ControlLoop-vFirewall-cdf42e53-b49b-4d9f-a621-fa9521111615"; "cdf42e53-b49b-4d9f-a621-fa9521111615" was the new, matching model-invariant-id.
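The service edit can be done with a generic kubectl patch (the service name comes from these notes; the patch shown is the standard Kubernetes way to switch a Service type, sketched here without a live cluster):

```shell
#!/bin/sh
# Switch consul-server-ui to a NodePort so the KV editor is reachable.
# Against a live cluster this would be:
#   kubectl -n onap patch svc consul-server-ui -p "$patch"
# Remember the assigned NodePort is ephemeral across service edits.
patch='{"spec":{"type":"NodePort"}}'
printf '%s\n' "$patch"
```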
The tests start with #1
http://10.12.6.182:8080/jenkins/job/windriver-longevity-vfwclosedloop/
Test # | Comment | Message |
---|---|---|
0 - 20 | No errors | |
21-40 | No errors | |
41-60 | No errors | SB00 re-install during #46/#47 may have caused a background network load in windriver to affect response times |
Interim Status on closed loop testing ~30% through stability run