Table of Contents

References


Assumptions <optional>

<optional; assumptions are decisions made up front, i.e. everyone agrees on the answer, but they are important to mention>

...

  • CPS-2161 (System Jira)
  • CPS-2146 (System Jira)

Issues & Decisions

#  Issue                                      Notes / Decision
1  Placeholder for issue                      This is an open issue
2  Do we need an analysis template?           It is convention for (new) developers, to guide them. Luke Gleeson (Unlicensed) and Toine Siebelink agreed we do, to have consistent study pages.
3  This is a very important (blocking) issue

<Note: use green for closed issues, yellow for important ones if needed>

Any Other Header

<we do not want to dictate the remainder of an analysis; it will depend on the type of user story at hand>

Background

The use of Hazelcast during NCMP's CM-handle Module Sync is leading to:

  1. High memory usage during CM-handle registration
  2. Consistency problems
  3. Poor load balancing between NCMP instances for module sync

Summary of Hazelcast structures for Module/Data Sync

Structure                     Type                     Notes
moduleSyncWorkQueue           BlockingQueue<DataNode>  Entire CM handles are stored in the work queue for module sync. This creates very high memory usage during CM-handle registration. The use of this blocking queue likely also causes issues with load balancing during module sync.
moduleSyncStartedOnCmHandles  Map<String, Object>      One entry is stored in memory per CM handle in ADVISED state.
dataSyncSemaphores            Map<String, Boolean>     This map is only populated if data sync is enabled for a CM handle. If the feature is used, it stores one entry per CM handle with data sync enabled.
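The memory concern with moduleSyncWorkQueue can be illustrated with a minimal stand-alone sketch. The `DataNode` record below is a hypothetical stand-in for the real class (which carries a full node tree), and plain `java.util.concurrent` queues are used instead of the actual Hazelcast-backed queue:

```java
import java.util.Collection;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class WorkQueueSketch {

    // Hypothetical stand-in for the DataNode queued per CM handle; the
    // real class carries a full node tree with attributes and children,
    // which is why queueing whole objects is memory-heavy.
    public record DataNode(String cmHandleId, String attributes) { }

    // Alternative approach: queue only the CM-handle ids and let each
    // worker re-read the full DataNode from the database when it takes
    // an item, so the queue itself stays small.
    public static BlockingQueue<String> toIdOnlyQueue(Collection<DataNode> nodes) {
        BlockingQueue<String> idQueue = new LinkedBlockingQueue<>();
        for (DataNode node : nodes) {
            idQueue.add(node.cmHandleId());
        }
        return idQueue;
    }

    public static void main(String[] args) {
        List<DataNode> advised = List.of(
                new DataNode("ch-1", "large-attribute-payload"),
                new DataNode("ch-2", "large-attribute-payload"));
        System.out.println(toIdOnlyQueue(advised)); // [ch-1, ch-2]
    }
}
```

The trade-off is one extra database read per queued item in exchange for keeping per-entry queue memory roughly constant regardless of CM-handle size.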

Consistency problems

Consistency problems are evidenced by log entries showing duplicate CM-handles being created:

STATEMENT:  insert into fragment (anchor_id,attributes,parent_id,xpath) values ($1,$2,$3,$4) RETURNING *
DETAIL:  Key (anchor_id, xpath)=(2, /dmi-registry/cm-handles[@id='C9B31349E93B850D52EFD2F632BAE598']) already exists.
ERROR:  duplicate key value violates unique constraint "fragment_anchor_id_xpath_key"

Additionally, in CPS-2146 it was reported that:

moduleSync was quite chaotic between the two NCMP pods, both of them logged that the other one is working on the given cmHandle which reached the READY state minutes ago.

These consistency issues are likely a result of Hazelcast requiring an odd number of cluster members to reliably resolve conflicts via quorum.
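The quorum arithmetic behind that point can be sketched with the simple majority formula (how Hazelcast applies it internally is an assumption here, not taken from this page):

```java
public class QuorumSketch {

    // Majority quorum needed for consensus among n cluster members.
    public static int quorum(int members) {
        return members / 2 + 1;
    }

    public static void main(String[] args) {
        // With 2 members, quorum is 2: after a network split neither side
        // holds a majority, and non-consensus (AP-mode) structures may let
        // both sides keep working independently on the same CM handles.
        System.out.println("2 members -> quorum " + quorum(2));
        // With 3 members, quorum is 2: one side of any split still holds
        // a majority and can make consistent decisions.
        System.out.println("3 members -> quorum " + quorum(3));
    }
}
```

This is why an even-sized (e.g. two-pod) NCMP deployment is the worst case for quorum-based conflict resolution.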

Proposed Changes

It is proposed the LCM (Lifecycle Management) State Machine be changed to include an explicit state for syncing modules (or data).

The existing LCM State Machine is outlined here:

[Drawio diagram: Existing LCM State Machine]

The proposed LCM State Machine is:

[Drawio diagram: Proposed LCM State Machine]

Aside: For Module Upgrade, the state transition from READY to LOCKED to ADVISED could be simplified to READY to ADVISED.

A side effect of introducing a SYNCING state will be an additional LCM event notification.
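The proposed machine can be sketched as an enum plus a transition table. This is a hypothetical model built only from the states named on this page (ADVISED, SYNCING, READY, LOCKED); the actual NCMP CmHandleState enum may contain further values:

```java
import java.util.Map;
import java.util.Set;

public class LcmStateSketch {

    // States named on this page; SYNCING is the proposed addition.
    public enum CmHandleState { ADVISED, SYNCING, READY, LOCKED }

    // Assumed transitions under the proposal: ADVISED handles are claimed
    // into SYNCING, then promoted to READY (or LOCKED on failure). READY
    // may go to LOCKED then ADVISED for module upgrade, or directly to
    // ADVISED in the simplified variant mentioned in the aside.
    public static final Map<CmHandleState, Set<CmHandleState>> TRANSITIONS = Map.of(
            CmHandleState.ADVISED, Set.of(CmHandleState.SYNCING),
            CmHandleState.SYNCING, Set.of(CmHandleState.READY, CmHandleState.LOCKED),
            CmHandleState.READY, Set.of(CmHandleState.LOCKED, CmHandleState.ADVISED),
            CmHandleState.LOCKED, Set.of(CmHandleState.ADVISED));

    public static boolean isAllowed(CmHandleState from, CmHandleState to) {
        return TRANSITIONS.getOrDefault(from, Set.of()).contains(to);
    }
}
```

A transition table like this also makes the extra LCM event visible: every ADVISED-to-READY journey now emits two notifications (ADVISED to SYNCING, SYNCING to READY) instead of one.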

Module Set Syncing

Proof of Concept

A PoC is being constructed: WIP: Remove Hazelcast map for module sync (https://gerrit.nordix.org/c/onap/cps/+/20724)

From the PoC, it was determined that when running multiple instances of NCMP, approximately 10% of batches were processed by both instances simultaneously, which led to some handles going to LOCKED state due to database exceptions. Two solutions were proposed:

  1. Add a distributed lock (from Hazelcast) to create a critical section, allowing only one instance at a time to move handles to SYNCING state
  2. Allow collisions, by gracefully handling AlreadyDefinedExceptions in the code

Solution 1 has been verified to work and gives approximately 50% faster registration than the current implementation. Solution 2 has not yet been tested, so it remains to be determined which solution offers better performance and reliability.
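Solution 1 can be sketched as follows. A `ReentrantLock` stands in for the distributed lock (in the PoC this would be a Hazelcast-provided lock shared by all NCMP instances; the method and class names below are illustrative, not taken from the PoC):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class BatchClaimSketch {

    // Stand-in for the distributed lock; in the PoC this would be shared
    // across NCMP instances so the critical section is cluster-wide.
    private static final Lock workQueueLock = new ReentrantLock();

    // Solution 1: only one instance at a time drains a batch of ADVISED
    // handles and marks them SYNCING, so no two instances can claim the
    // same handles and trigger duplicate-key database exceptions.
    public static List<String> claimBatch(Queue<String> advisedHandles, int batchSize) {
        workQueueLock.lock();
        try {
            List<String> batch = new ArrayList<>();
            String handle;
            while (batch.size() < batchSize && (handle = advisedHandles.poll()) != null) {
                batch.add(handle); // in NCMP: set the handle's state to SYNCING here
            }
            return batch;
        } finally {
            workQueueLock.unlock();
        }
    }

    public static void main(String[] args) {
        Queue<String> advised = new ArrayDeque<>(List.of("ch-1", "ch-2", "ch-3"));
        System.out.println(claimBatch(advised, 2)); // [ch-1, ch-2]
        System.out.println(claimBatch(advised, 2)); // [ch-3]
    }
}
```

Solution 2 would instead drop the lock and wrap the state update in a try/catch that treats the duplicate-key exception as "another instance already claimed this handle" and skips it.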