Related Jira(s)
- CPS-1415
Assumptions
# | Assumption | Notes |
---|---|---|
1 | When a DMI restarts, all cm-handles related to that DMI will be considered to have trust level COMPLETE | Temporary assumption until the 'Audit' function has been implemented |
Issues and Decisions
# | Issue | Notes | Decision |
---|---|---|---|
1 | How fast should CPS (and the DB) be able to process the maximum number of heartbeat failures? | Is 60K really realistic? If ENM goes down we should get a notification for each node; do we? | PoC has shown 60 seconds is reasonable |
2 | Restart of NCMP | Should/can this be handled? | |
3 | Does the DMI Plugin provide NCMP with a health check URL during registration, or do we just rely on the default one provided by Spring Boot Actuator? | Document the contract. It's just the interface that matters, not the implementation. | |
4 | Look for the DMI data service (dmiDataPlugin) for the health check. | | |
Description
- Define scenarios which cause a CM Handle to go stale
- Implement changes to support tracking of CM Handle Freshness/Staleness
What might trigger a cmhandle to go to STALE?
- dmi plugin identifies that the device is no longer contactable
- dmi plugin identifies that an underlying device manager managing the device (node) is out of sync with the device itself.
Requirements
Functional
# | Interface | Requirement | Additional Information | Sign-off |
---|---|---|---|---|
1 | CPS-NCMP-E-05 | The 'trustlevel' is visible on the same methods that currently return the 'cm handle state' | Can be a new or an existing (preferred) endpoint | |
2 | CPS-NCMP-E-05 | CM Handles can be queried (filter condition) on 'trustlevel' | Using a new 'trustLevel' condition (cannot use cpsPath condition) | |
3 | CPS-NCMP-E-05 | Once a CM Handle is registered, the trust level for that CM Handle should be reported as 'COMPLETE' | | |
4 | CPS-NCMP-E-05 | Once a DMI (plugin) is detected to be down, the trust level for all affected CM Handles should be reported as 'NONE' | It might not need to be persisted | |
5 | CPS-NCMP-I-01.e | DMI plugin can report the current trust level of a single cm handle id | i.e. the DMI can tell NCMP the trust level is 'NONE' when a node heartbeat failure is detected and 'COMPLETE' once it is restored | |
Error Handling
# | Error Scenario | Expected behavior |
---|---|---|
1 | NCMP restart (all instances) | To be discussed; not sure if it can/should be handled. Trust levels should be 'NONE' and need to be restored using an audit request (not in scope) |
Characteristics
# | Parameter | Expectation | Notes | Sign-off |
---|---|---|---|---|
1 | dmi-down detection speed | 30 seconds | ||
2 | device heartbeat frequency (message emitted by DMI plugin for each device) | 60 seconds | ||
3 | maximum supported devices (by NCMP) | 60,000 | Given #2 and #3 this means NCMP needs to process 60,000 messages per minute | |
4 | maximum number of cm-handles reported down by DMI in one request and/or per minute | 30,000 / minute | A peak can be processed within 60 seconds | |
5 | time to process all trust level updates for DMI-down and/or peak load by DMI | 1 second | | |
6 | search endpoint speed (if trust level is incorporated into the search endpoints) | should not be impacted | | |
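As a rough sanity check of rows 2 to 4, the message rates can be worked out directly. The sketch below assumes every device emits exactly one heartbeat message per interval, which the table implies but does not state outright; the variable names are illustrative only.

```python
# Back-of-envelope load figures derived from the Characteristics table.
# Assumption: each device emits exactly one heartbeat message per interval.
HEARTBEAT_INTERVAL_SECONDS = 60         # row 2
MAX_DEVICES = 60_000                    # row 3
PEAK_DOWN_REPORTS_PER_MINUTE = 30_000   # row 4

heartbeats_per_minute = MAX_DEVICES * 60 // HEARTBEAT_INTERVAL_SECONDS
heartbeats_per_second = heartbeats_per_minute / 60
peak_down_reports_per_second = PEAK_DOWN_REPORTS_PER_MINUTE / 60

print(heartbeats_per_minute)         # 60000
print(heartbeats_per_second)         # 1000.0
print(peak_down_reports_per_second)  # 500.0
```

In other words, the steady-state heartbeat load is about 1,000 messages per second, and a peak of down reports adds roughly 500 per second on top.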
Out-of-Scope
- This epic will only introduce trustlevels NONE and COMPLETE. PARTIAL and POOR may be added later as below.
- Re-registration, i.e. resolving trust level degradation, is not in scope of this epic
- NCMP will not send notifications on trust level changes to external consumers
High Level Interactions
Interface | Name | Trigger | Description | Type | Endpoint or Topic | Schema |
---|---|---|---|---|---|---|
1 | HealthCheck | 30 second interval (configurable) | NCMP is to perform a health check against each of the DMI Plugins | REST | http://'$1'/manage/health/readiness (the standard health check endpoint provided by Spring Boot Actuator; it is not stored anywhere, only documented for now) | |
2 | CMHandle trust level change | The trust level of a CMHandle managed by a DMI Plugin has changed | Data contains {trustLevel: ENUM}; the event id is the cmhandle id | Kafka | TBD | <cloudEvents-header> id: <cmhandleId>; type: org.onap.cm.events.trustlevel-notification; data: { trustlevel: "COMPLETE" } |
3 | TrustLevel Request | Client Request | TrustLevel is to be returned based on the values in above Maps | REST | TBD | |
4 | Document the health check endpoint | 30 second interval (configurable) | Document the standard health check endpoint URL provided by the DMI plugins. We rely on the standard URLs and do not store them anywhere. | REST | | |
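To make the interface 2 schema more concrete, here is a minimal sketch of how a DMI plugin might assemble such an event. The helper name `build_trust_level_event` and the sample cm handle id are hypothetical, the topic is still TBD so nothing is published, and the payload key follows the `trustlevel` spelling from the Schema column rather than the `trustLevel` spelling in the Description column.

```python
import json

# Trust levels in scope for this epic; PARTIAL and POOR are out of scope.
TRUST_LEVELS = {"NONE", "COMPLETE"}

# Event type taken from the Schema column of interface 2.
EVENT_TYPE = "org.onap.cm.events.trustlevel-notification"


def build_trust_level_event(cm_handle_id: str, trust_level: str) -> dict:
    """Build a dict shaped like the interface 2 event (hypothetical helper)."""
    if trust_level not in TRUST_LEVELS:
        raise ValueError(f"unsupported trust level: {trust_level}")
    return {
        "id": cm_handle_id,  # the event id is the cm handle id
        "type": EVENT_TYPE,
        "data": {"trustlevel": trust_level},
    }


event = build_trust_level_event("cm-handle-1", "COMPLETE")
print(json.dumps(event, indent=2))
```

Rejecting unknown values at the producer keeps PARTIAL and POOR out of the stream until a later epic introduces them.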
Managing TrustLevels
DMI Plugins
- NCMP is checking every DMI Plugin for health at interface 1 every 30 seconds using the DMI Trust Map
- IF a DMI Plugin goes down, that DMI Plugin's trust level is updated to NONE in the DMI Trust Map
- IF a DMI Plugin comes back up, Trust level is set back to COMPLETE.
More details of health check URL can be accessed via:
CPS-1857 Document watchdog job impl. with health check URL
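One watchdog cycle over the DMI Trust Map could look roughly like the sketch below. It is not the CPS-1857 implementation: the function names are hypothetical, the `'$1'` placeholder is assumed to be the DMI plugin's service name, and the HTTP call is injected as a callable so the example runs without a network.

```python
from typing import Callable, Dict, Iterable


def readiness_url(dmi_service_name: str) -> str:
    """Fill the placeholder in the interface 1 endpoint with a DMI service name."""
    return f"http://{dmi_service_name}/manage/health/readiness"


def run_health_check_cycle(
    dmi_service_names: Iterable[str],
    is_ready: Callable[[str], bool],
    dmi_trust_map: Dict[str, str],
) -> None:
    """Run one watchdog cycle and record each DMI plugin's trust level."""
    for name in dmi_service_names:
        try:
            up = is_ready(readiness_url(name))
        except Exception:
            # Assumption: an unreachable plugin is treated like a failed check.
            up = False
        dmi_trust_map[name] = "COMPLETE" if up else "NONE"


# Usage with a stubbed probe instead of a real HTTP call.
trust_map: Dict[str, str] = {}
stub_probe = lambda url: "dmi-a" in url
run_health_check_cycle(["dmi-a", "dmi-b"], stub_probe, trust_map)
print(trust_map)  # {'dmi-a': 'COMPLETE', 'dmi-b': 'NONE'}
```

Running the same cycle again after a plugin recovers overwrites its entry with COMPLETE, which matches the "comes back up" rule above.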
CMHandles HB
- It is the responsibility of the DMI Plugins to update NCMP about the HBs of CMHandles
- Through interface 2, DMI Plugins will publish a Kafka event whenever the trust level of a CMHandle changes.
- NCMP receives this event and updates the Untrustworthy CMHandles Set accordingly
- NCMP needs to be able to handle a throughput of 60,000 state changes per minute with 2 instances
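On the NCMP side, applying a received event to the Untrustworthy CMHandles Set might be sketched as follows. This assumes the event has already been deserialized into a dict with the `id` and `data.trustlevel` fields from the interface schema; the function name is hypothetical and no Kafka client is involved.

```python
from typing import Set


def apply_trust_level_event(event: dict, untrustworthy_cm_handles: Set[str]) -> None:
    """Update the set of untrustworthy cm handles from one trust level event."""
    cm_handle_id = event["id"]
    trust_level = event["data"]["trustlevel"]
    if trust_level == "NONE":
        untrustworthy_cm_handles.add(cm_handle_id)
    elif trust_level == "COMPLETE":
        # discard() is a no-op if the cm handle was never marked untrustworthy.
        untrustworthy_cm_handles.discard(cm_handle_id)
    else:
        raise ValueError(f"unsupported trust level: {trust_level}")


untrustworthy: Set[str] = set()
apply_trust_level_event({"id": "ch-1", "data": {"trustlevel": "NONE"}}, untrustworthy)
apply_trust_level_event({"id": "ch-2", "data": {"trustlevel": "NONE"}}, untrustworthy)
apply_trust_level_event({"id": "ch-1", "data": {"trustlevel": "COMPLETE"}}, untrustworthy)
print(untrustworthy)  # {'ch-2'}
```

Because both branches are idempotent, a duplicated or replayed event leaves the set unchanged, which helps when two instances consume the same stream.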
Reading Trust Level
- Body of request to be discussed: will the request provide a DMI or a list of CMHandles?
- Interface 3
- NCMP will first check DMI Trust Map for the CMHandle
- If the DMI managing the CMHandle is marked as untrustworthy, return NONE without checking the Untrustworthy CMHandles Map
- If that DMI is trustworthy, check the Untrustworthy CMHandles Map; if the CMHandle is in the map, return NONE
- Logically: IF (DMITrustMap.getDMIPlugin.getTrustLevel == NONE) RETURN NONE
- ELSE IF (UntrustworthyCMHandlesMap.getDMIPlugin.contains(CMHandle)) RETURN NONE
- ELSE RETURN COMPLETE
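The lookup order above can be sketched as a small function. The page refers to the cm handle structure both as a set and as a map; this sketch assumes a map keyed by DMI plugin name whose values are sets of cm handle ids. The function name and sample ids are hypothetical, and a DMI plugin missing from the trust map is assumed to be trustworthy.

```python
from typing import Dict, Set


def resolve_trust_level(
    cm_handle_id: str,
    dmi_plugin_name: str,
    dmi_trust_map: Dict[str, str],
    untrustworthy_cm_handles: Dict[str, Set[str]],
) -> str:
    """Return NONE or COMPLETE for a cm handle, checking the DMI plugin first."""
    # Step 1: a DMI plugin that is down makes all of its cm handles NONE.
    if dmi_trust_map.get(dmi_plugin_name) == "NONE":
        return "NONE"
    # Step 2: the DMI plugin is up, so check the individual cm handle.
    if cm_handle_id in untrustworthy_cm_handles.get(dmi_plugin_name, set()):
        return "NONE"
    # Step 3: nothing marks the cm handle as untrustworthy.
    return "COMPLETE"


dmi_trust = {"dmi-a": "COMPLETE", "dmi-b": "NONE"}
untrusted = {"dmi-a": {"ch-2"}}
print(resolve_trust_level("ch-1", "dmi-a", dmi_trust, untrusted))  # COMPLETE
print(resolve_trust_level("ch-2", "dmi-a", dmi_trust, untrusted))  # NONE
print(resolve_trust_level("ch-3", "dmi-b", dmi_trust, untrusted))  # NONE
```

Checking the DMI plugin first means a DMI-down event never has to touch the per-handle entries, which fits the note that this state might not need to be persisted.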