SO Carrier-Grade Enhancement Points for R2

The document below proposes how SO should approach non-functional Carrier-Grade requirements and work with other ONAP components.

Slides 8-20 propose an SO Monitoring feature (rationale, requirements, proposed UIs, and high-level design brainstorming).

SO Scalability Requirements

SO scalability will be supported by using OOM to manage multiple SO instances.
- In OOM, the number of SO component instances will be configured to control the number of active SO instances.
  - Target scalability will be supported. The number of MariaDB instances could differ from that of the other SO components.
- Each BPMN execution engine will be configured for a shared database, so engines can be scaled out promptly and are ready to handle assignments.
SO endpoints will be registered to MSB for communication load-balancing.
SO run-time scalability handling
- SO will have multiple Camunda Execution engine instances which share the centralized data store.
  - The centralized data store will be replicated, and the replication will be transparent to other SO components.
- The individual execution engine instances do not maintain session state across transactions.
  - The complete state is flushed out to the shared database when a process instance is complete or waiting for events (e.g., asynchronous event, message, human task, etc.).
  - Or, asynchronous continuations can be used during the workflow design when it is necessary to control save points actively (by design) and flush out the process instance states to the database.
  - Once a process instance is passivated, another engine instance can pick up and execute the remaining process instance flows.
- Multiple SDC distribution client instances will be instantiated.
  - An SDC notification will be routed to (or picked up by) one of the SDC notification client instances. Then, the assigned client instance will:
    - query SDC for templates/models.
    - parse the templates/models and store them in the Catalog DB.
  - Because template/model changes and SDC notification client activity are infrequent, a small number (2) of SDC distribution client instances can be configured.
- Multiple API handler instances will be instantiated, and all of the instances are active (active-active).
  - The requests from VID, External API and UUI towards the API handler instances will be distributed/routed via load-balancing. MSB is expected to handle their load-balancing.
  - An assigned API handler instance will communicate with the orchestration execution engine and Data store in a scalable manner.
    - Communications (invoking BPMN execution) with the orchestration execution engine will go through MSB, with no direct connections to hard-coded endpoints.
    - To store requests and select recipes, the API Handler will communicate with the Data store (Request DB, Service Catalog), which is replicated.
- Multiple Resource/Controller Adapters will be instantiated for active-active operations.
  - The communications between the BPMN/TOSCA resource recipes and the adapter instances will be load-balanced through MSB.
- External communications with other ONAP components such as DCAE, OOF, A&AI, SDNC, etc. will be done through MSB/DMaaP in a scalable manner, consistent with the communication requirements above.
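
The run-time behavior described above for the execution engines (no session state kept in the engine, state flushed to the shared database at wait states, another engine resuming a passivated instance) can be illustrated with a minimal Python sketch. This is not Camunda or SO code: SharedStore, Engine, the step list, and the req-1 identifier are hypothetical names chosen for illustration, and an in-memory dictionary stands in for the replicated database.

```python
import copy


class SharedStore:
    """Hypothetical stand-in for the shared, replicated database."""

    def __init__(self):
        self._rows = {}

    def save(self, pid, state):
        # Deep copy so the stored row is independent of the engine's memory.
        self._rows[pid] = copy.deepcopy(state)

    def load(self, pid):
        return copy.deepcopy(self._rows[pid])


class Engine:
    """Hypothetical stateless engine: keeps no process state between calls."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def run(self, pid, steps):
        state = self.store.load(pid)          # state comes only from the store
        while state["next_step"] < len(steps):
            step_name, is_wait_state = steps[state["next_step"]]
            state["history"].append((step_name, self.name))
            state["next_step"] += 1
            if is_wait_state:
                break                          # passivate at the wait state
        self.store.save(pid, state)            # flush the complete state
        return state


# Each step is (name, is_wait_state); the names are illustrative only.
STEPS = [("assign", False), ("wait_for_callback", True), ("activate", False)]

store = SharedStore()
store.save("req-1", {"next_step": 0, "history": []})

Engine("engine-a", store).run("req-1", STEPS)          # runs until the wait state
final = Engine("engine-b", store).run("req-1", STEPS)  # a different engine resumes
print(final["history"])
```

Because the second engine reads everything it needs from the store, either engine could have handled the second call, which is the property that lets a load balancer route work to any active instance.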

SO Resiliency / Auto Heal Requirements

SO resiliency / auto heal will be supported by utilizing OOM.
- In OOM, the number of SO component instances will be configured.
- SO component instances in different containers will be configured for fail-over.
- Each BPMN execution engine will be configured for a shared data store to make the engine instance stateless.
- OOM will detect failed SO components / containers and create new instances for substitution - cattle-based instantiation.
SO Resiliency / Auto Heal run-time handling
- When an SO microservice component in a container goes down, OOM will bring up another instance of the component / container.
- SDC Distribution Client, API Handler, BPMN Execution and Adapters will be stateless for a quick fail-over.
  - Multiple instances of the SDC Distribution Client, API Handler, BPMN Execution and Adapters will be instantiated for active-active HA.
- Fail-over components will pick up and finish the interrupted assignments.
  - In the BPMN case, the engine will re-execute the interrupted workflows from the save points.
  - When an entire workflow is abandoned, the API Handler will retry the request.
- The centralized data store components will be replicated to avoid a single point of failure. [Question: can OOM handle Data Store replication configuration?]
  - Master-Slave data replication
  - by utilizing MUSIC (OOM and MUSIC integration is coming)
  - Service Catalog, Camunda DB and Request DB data will be replicated
- The individual execution engine instances do not maintain session state across transactions. This is also a key for reliability support.
  - The complete state will be flushed out to the shared database.
  - Asynchronous continuations will be used when it is necessary to control save points actively and flush out the process instance state, so that another engine instance can pick up the remaining process instance flows.
- Retry handling
  - TBD
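
The save-point fail-over described in the list above (an interrupted workflow is re-executed from its last save point by another engine instance) can be sketched as follows. This is a simplified model under stated assumptions, not Camunda or SO code: the execute function, the EngineCrash exception, the step names, and the dictionary used as the shared data store are all hypothetical. It does not cover the retry handling that is still TBD.

```python
import copy


class EngineCrash(Exception):
    """Simulated loss of an engine container (hypothetical)."""


def execute(store, pid, steps, engine, log, crash_before=None):
    """Run steps from the last committed save point; commit only at save points."""
    state = copy.deepcopy(store[pid])            # load the last committed state
    for index in range(state["next_step"], len(steps)):
        name, save_point_after = steps[index]
        if name == crash_before:
            # Work done since the last save point is lost with the engine.
            raise EngineCrash(f"{engine} lost before {name}")
        log.append(f"{engine}:{name}")           # record the step's side effect
        state["next_step"] = index + 1
        if save_point_after:
            store[pid] = copy.deepcopy(state)    # flush state at the save point
    store[pid] = copy.deepcopy(state)            # flush when the process completes
    return state


# Each step is (name, save_point_after); the names are illustrative only.
STEPS = [("reserve", True), ("configure", False), ("activate", False)]
store = {"req-1": {"next_step": 0}}
log = []

try:
    execute(store, "req-1", STEPS, "engine-a", log, crash_before="activate")
except EngineCrash:
    pass  # in the document's model, OOM would bring up a replacement instance

committed_after_crash = copy.deepcopy(store["req-1"])
final = execute(store, "req-1", STEPS, "engine-b", log)
print(log)
```

In this model the configure step runs twice, once on each engine, because it came after the last save point. That suggests steps between save points should be safe to repeat, or that save points should be placed immediately after steps that are not.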

ARIA Orchestrator

ARIA Orchestrator is not addressed here because we are not sure about its location.
TBD

Note:

SO components are being refactored. Additional modules can be identified.