References

CPS-2146: Analysis of Out of Memory and related Errors in NCMP

Jira: CPS-2161
Jira: CPS-2146

...

# | Issue | Notes | Decision
1 | Placeholder for issue | |




Background

The use of Hazelcast during NCMP's CM-handle Module Sync is leading to:

...

Summary of Hazelcast structures for Module/Data Sync

Structure | Type | Notes
moduleSyncWorkQueue | BlockingQueue<DataNode> | Entire CM handles are stored in work queue for module sync. This creates very high memory usage during CM handle registration. The use of this blocking queue likely causes issues with load balancing during module sync also.
moduleSyncStartedOnCmHandles | Map<String, Object> | One entry is stored in memory per CM handle in ADVISED state.
dataSyncSemaphores | Map<String, Boolean> | Note this map is only populated if data sync is enabled for a CM handle. If the feature is used, it will store one entry per CM handle with data sync enabled.

Consistency problems

Consistency problems are evidenced by log entries showing duplicate CM-handles being created:

...

[Diagram: Proposed LCM State Machine]

Aside: For Module Upgrade, the state transition from READY to LOCKED to ADVISED could be simplified to READY to ADVISED.

A side effect of introducing a SYNCING state will be an additional LCM event notification.
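
The effect on event counts can be illustrated with a small sketch. It assumes that one LCM event is published per state change and that module sync currently moves a handle directly from ADVISED to READY. Both are assumptions made for illustration, since the diagram is not reproduced here. The class name LcmStateSketch and the method eventsFor are hypothetical and are not part of NCMP.

```java
import java.util.ArrayList;
import java.util.List;

public class LcmStateSketch {

    // State names taken from this page; the real NCMP enum may define more.
    public enum CmHandleState { ADVISED, SYNCING, READY, LOCKED }

    // Returns one hypothetical LCM event description per state change along the path.
    public static List<String> eventsFor(String cmHandleId, CmHandleState... path) {
        List<String> events = new ArrayList<>();
        for (int i = 1; i < path.length; i++) {
            events.add(cmHandleId + ": " + path[i - 1] + " -> " + path[i]);
        }
        return events;
    }

    public static void main(String[] args) {
        // Assumed module sync path without SYNCING: a single state change.
        System.out.println(eventsFor("ch-1", CmHandleState.ADVISED, CmHandleState.READY));
        // With the proposed SYNCING state: one extra state change, hence one extra event.
        System.out.println(eventsFor("ch-1",
                CmHandleState.ADVISED, CmHandleState.SYNCING, CmHandleState.READY));
        // Module upgrade via LOCKED versus the simplified READY to ADVISED transition.
        System.out.println(eventsFor("ch-1",
                CmHandleState.READY, CmHandleState.LOCKED, CmHandleState.ADVISED));
        System.out.println(eventsFor("ch-1", CmHandleState.READY, CmHandleState.ADVISED));
    }
}
```

Under these assumptions, adding SYNCING increases the events for a successful module sync from one to two, while simplifying Module Upgrade to READY to ADVISED removes one event.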

Module Set Syncing

Proof of Concept

A PoC is being constructed: WIP Remove hazelcast map for module sync | https://gerrit.nordix.org/c/onap/cps/+/20724

The PoC showed that, when running multiple instances of NCMP, approximately 10% of batches were processed by both instances simultaneously. The resulting database exceptions caused some handles to go to LOCKED state. Two solutions are proposed:

  1. Add a distributed lock (from Hazelcast) to create a critical section, allowing only one instance at a time to move handles to SYNCING state
  2. Allow collisions by gracefully handling AlreadyDefinedException in the code

Solution 1 has been verified to work and makes registration 50% faster than the current implementation. Solution 2 has not yet been tested, so it remains to be determined which solution offers better performance and reliability.
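
The two solutions can be contrasted in a minimal single-JVM sketch. It is not the PoC code. Two threads stand in for two NCMP instances, a local ReentrantLock stands in for the Hazelcast distributed lock, and in-memory collections stand in for the database. The class ModuleSyncCollisionSketch, its methods, and the local AlreadyDefinedException are all hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class ModuleSyncCollisionSketch {
    // Hypothetical stand-in for the exception named in solution 2.
    public static class AlreadyDefinedException extends RuntimeException {
        public AlreadyDefinedException(String id) { super("already defined: " + id); }
    }

    // Stand-in for the database: CM-handle id -> state.
    private final Map<String, String> states = new ConcurrentHashMap<>();
    // Stand-in for persisted module data; a failed add() models a duplicate insert.
    private final Set<String> persisted = ConcurrentHashMap.newKeySet();
    // Local lock standing in for a cluster-wide lock (NOT distributed).
    private final Lock lock = new ReentrantLock();

    public ModuleSyncCollisionSketch(int handleCount) {
        for (int i = 0; i < handleCount; i++) {
            states.put("ch-" + i, "ADVISED");
        }
    }

    // Solution 1: selecting ADVISED handles and moving them to SYNCING is a critical section.
    public List<String> claimBatchWithLock(int batchSize) {
        lock.lock();
        try {
            List<String> batch = new ArrayList<>();
            for (String id : states.keySet()) {
                if (batch.size() == batchSize) break;
                if ("ADVISED".equals(states.get(id))) {
                    states.put(id, "SYNCING");
                    batch.add(id);
                }
            }
            return batch;
        } finally {
            lock.unlock();
        }
    }

    // Solution 2: no lock; a duplicate insert is caught and skipped instead of locking the handle.
    public int syncAllowingCollisions(List<String> batch) {
        int synced = 0;
        for (String id : batch) {
            try {
                if (!persisted.add(id)) throw new AlreadyDefinedException(id);
                states.put(id, "READY");
                synced++;
            } catch (AlreadyDefinedException e) {
                // Another instance already handled this CM handle; keep its result.
            }
        }
        return synced;
    }

    public long countInState(String state) {
        return states.values().stream().filter(state::equals).count();
    }

    // Two threads play two NCMP instances; returns how many handles were claimed twice.
    public static int duplicateClaimsWithLock(int handleCount, int batchSize) {
        ModuleSyncCollisionSketch db = new ModuleSyncCollisionSketch(handleCount);
        Set<String> claimed = ConcurrentHashMap.newKeySet();
        AtomicInteger duplicates = new AtomicInteger();
        Runnable instance = () -> {
            List<String> batch = db.claimBatchWithLock(batchSize);
            while (!batch.isEmpty()) {
                for (String id : batch) {
                    if (!claimed.add(id)) duplicates.incrementAndGet();
                }
                batch = db.claimBatchWithLock(batchSize);
            }
        };
        Thread a = new Thread(instance);
        Thread b = new Thread(instance);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return duplicates.get();
    }

    public static void main(String[] args) {
        System.out.println("duplicates with lock: " + duplicateClaimsWithLock(100, 10));
        ModuleSyncCollisionSketch db = new ModuleSyncCollisionSketch(3);
        List<String> sameBatch = List.of("ch-0", "ch-1", "ch-2");
        int first = db.syncAllowingCollisions(sameBatch);
        int second = db.syncAllowingCollisions(sameBatch);
        System.out.println("synced " + first + " then " + second
                + ", LOCKED: " + db.countInState("LOCKED"));
    }
}
```

In this sketch the lock prevents any handle from being claimed twice, while the collision-tolerant path lets both instances attempt the same batch and discards the duplicate work. The sketch says nothing about relative performance, which, as noted above, is still to be measured.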