SO Carrier-Grade Enhancement Points for R2

This document proposes how SO should approach non-functional Carrier-Grade requirements and work with other ONAP components.

Slides 8-20 propose the SO Monitoring feature (rationale, requirements, proposed UIs, and high-level design brainstorming).

SO Monitoring Feature Requirements

A design idea
- Why not Camunda Cockpit as is: the current Camunda Cockpit was designed from a BPMN process management perspective (note: TOSCA cases need further study).
  - It does not meet service-level orchestration monitoring needs.
  - It is designed for BPMN definition/execution monitoring and requires process knowledge to use.
- We need a higher-level monitoring abstraction for both BPMN and TOSCA.
  - Associate the Service Instance Id (or other keys) with the top-level process instance id. For the association, we:
    - could use a process variable holding the Service Instance Id (or other keys), or
    - could use a database holding the association.
  - Allow VID, UUI, or external apps to monitor workflow processes (graphically and text-based) based on extensible search keys.
- We need platform-level runtime and history process activity reporting capabilities out of the box.
  - Regardless of whether Camunda Enterprise Edition or Community Edition is used.
  - Query the Camunda/ARIA database to extract activities.
- The following diagram depicts the high-level concept.
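
As a minimal sketch of the association idea in the list above (not a rendering of the diagram), the following Python keeps an index from extensible search keys to top-level process instance ids. The class name, method names, and key names are hypothetical; the in-memory dictionary only stands in for the "database holding the association" option, and the process-variable option would look different.

```python
from collections import defaultdict


class ProcessAssociationIndex:
    """Hypothetical in-memory stand-in for the association store."""

    def __init__(self):
        # (key name, key value) -> set of top-level process instance ids
        self._by_key = defaultdict(set)

    def associate(self, process_instance_id, **search_keys):
        """Record every search key against one top-level process instance."""
        for name, value in search_keys.items():
            self._by_key[(name, value)].add(process_instance_id)

    def find(self, **search_keys):
        """Return the process instance ids that match ALL given keys."""
        if not search_keys:
            return set()
        matches = [
            self._by_key.get((name, value), set())
            for name, value in search_keys.items()
        ]
        return set.intersection(*matches)


index = ProcessAssociationIndex()
# One service instance realized by two top-level process instances.
index.associate("proc-1", service_instance_id="svc-42", request_id="req-7")
index.associate("proc-2", service_instance_id="svc-42", request_id="req-8")
print(sorted(index.find(service_instance_id="svc-42")))
print(sorted(index.find(service_instance_id="svc-42", request_id="req-8")))
```

Because the keys are free-form name/value pairs, new search keys can be added without changing the lookup code, which is one way to read "extensible search keys".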

Rationale
- Search is a Camunda Enterprise feature, but we need to provide search capability for the non-enterprise edition as well.
  - Finding the right process instance(s) for an NS/VNF service request is tedious.
  - To facilitate monitoring, we need more than what the Camunda Community/Enterprise editions support.
  - SO Monitoring provides a process monitoring (instance-search) hyperlink to the SO clients for launching process monitoring.
    - Automates tedious manual steps for finding target process instance(s)
    - Access customized SO Service List / Camunda Cockpit widgets from VID, UUI and external APIs.
- If a service provider uses Camunda Enterprise edition, they can still utilize this SO monitoring on top of Camunda enterprise edition features.
- Many Camunda Enterprise features, such as CRUDV and version control of process definitions, would be part of SDC.
  - ONAP separates workflow design from runtime.
  - Several enterprise features are not part of SO monitoring features and are not applicable.
- What are current TOSCA orchestrator monitoring capabilities?
  - SO monitoring should cover both imperative and declarative orchestration.
  - Question: can we have a uniform way of monitoring both?

High-Level Requirements
- Dashboard views of Service lists
  - Filtering capabilities based on search criteria
  - Configurable search criteria
- Dashboard views of statistics (donuts, pie charts, etc.) for filtered service instances
- Service Instance Rendering and detail panel views
  - with sub-service instance drill-down and drill-up capabilities
    - A service instance could be realized by multiple process instances
  - with process / task detail
  - Topology (workflow) views during/after orchestration
- Input/output data views for process/task/service task (messages, parameters)
  - Display on service / task detail panel
  - Provide message log views (could be on a pop-up widget)
- Color coding/visual indication of statistics, service type, and status
- Troubleshooting capabilities: manipulate the workflow during orchestration and retry from the current location (stretch goal)
- TOSCA orchestration monitoring (stretch goal)
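
The statistics views listed above (donuts, pie charts) need per-status counts over the filtered service instances. The sketch below shows one way such an aggregation could look; the function name and the `status` field are assumptions for illustration, not part of any existing SO API.

```python
from collections import Counter


def status_breakdown(service_instances):
    """Return {status: (count, percent)} for a list of dict records.

    Each record is assumed to carry a "status" entry (hypothetical field).
    """
    counts = Counter(rec["status"] for rec in service_instances)
    total = sum(counts.values())
    return {
        status: (count, round(100.0 * count / total, 1))
        for status, count in counts.items()
    }


filtered = [
    {"service_id": "svc-1", "status": "COMPLETED"},
    {"service_id": "svc-2", "status": "COMPLETED"},
    {"service_id": "svc-3", "status": "FAILED"},
    {"service_id": "svc-4", "status": "IN_PROGRESS"},
]
for status, (count, percent) in sorted(status_breakdown(filtered).items()):
    print(f"{status}: {count} ({percent}%)")
```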

Widget Requirements
- Service List Widget
  - Provides monitoring capabilities for processed services based on search criteria
    - Configurable Search Criteria filtering: Service ID, Operation Type, Status, User Id, Date/Time range
    - Actual filtering criteria fields could be changed based on configuration
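
The configurable filtering described for the Service List Widget can be sketched as follows. The record type, field names, and the `ENABLED_CRITERIA` setting are hypothetical; they mirror the criteria named above (Service ID, Operation Type, Status, User Id, Date/Time range) and are only meant to show how the set of filter fields could be driven by configuration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ServiceRecord:
    service_id: str
    operation_type: str
    status: str
    user_id: str
    started_at: datetime


# Hypothetical configuration: which exact-match fields the widget exposes.
ENABLED_CRITERIA = ("service_id", "operation_type", "status", "user_id")


def filter_services(records, criteria, start=None, end=None,
                    enabled=ENABLED_CRITERIA):
    """Return records matching all criteria and the optional time range."""
    unknown = set(criteria) - set(enabled)
    if unknown:
        raise ValueError(f"criteria not enabled: {sorted(unknown)}")
    result = []
    for rec in records:
        if any(getattr(rec, name) != value for name, value in criteria.items()):
            continue
        if start is not None and rec.started_at < start:
            continue
        if end is not None and rec.started_at > end:
            continue
        result.append(rec)
    return result


records = [
    ServiceRecord("svc-1", "CREATE", "COMPLETED", "alice", datetime(2018, 3, 1, 9, 0)),
    ServiceRecord("svc-2", "DELETE", "FAILED", "bob", datetime(2018, 3, 2, 9, 0)),
    ServiceRecord("svc-3", "CREATE", "FAILED", "alice", datetime(2018, 3, 3, 9, 0)),
]
failed = filter_services(records, {"status": "FAILED"},
                         start=datetime(2018, 3, 3))
print([rec.service_id for rec in failed])
```

Changing `enabled` (for example, dropping `user_id`) changes which fields callers may filter on without touching the filtering logic.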

SO Scalability Requirements

SO scalability will be supported by managing multiple SO instances by utilizing OOM.
- In OOM, the number of SO component instances will be configured to control the number of active SO instances.
  - Target scalability will be supported. The number of MariaDB instances could differ from that of the other SO components.
- Each BPMN execution engine will be configured for a shared database, so that engines can be scaled out promptly and are ready to handle assignments.

SO endpoints will be registered to MSB for communication load-balancing.

SO run-time scalability handling
- SO will have multiple Camunda Execution engine instances which share the centralized data store.
  - The centralized data store will be replicated, and the replication will be transparent to other SO components.
- The individual execution engine instances do not maintain session state across transactions.
  - The complete state is flushed out to the shared database when a process instance completes or waits for events (e.g., asynchronous event, message, human task, etc.).
  - Alternatively, asynchronous continuations can be used in the workflow design when it is necessary to control save points actively (by design) and flush the process instance state to the database.
  - Once a process instance is passivated, another engine instance can pick up and execute the remaining process instance flows.
- Multiple SDC distribution client instances will be instantiated.
  - An SDC notification will be routed to (or picked up by) one of the SDC notification client instances. Then, the assigned client instance will:
    - query for templates/models from SDC.
    - parse the templates/models and store them in the Catalog DB.
  - Because templates/models change infrequently and SDC notification client activity is low, a small number (2) of SDC distribution client instances can be configured.
- Multiple API handler instances will be instantiated, and all of the instances are active (active-active).
  - The requests from VID, External API and UUI towards the API handler instances will be distributed/routed via load-balancing. MSB is expected to handle their load-balancing.
  - An assigned API handler instance will communicate with the orchestration execution engine and Data store in a scalable manner.
    - Communications (invoking BPMN execution) with the orchestration execution engine will go through MSB, with no direct connections to hard-coded endpoints.
    - For storing requests and selecting recipes, the API Handler will communicate with the Data store (Request DB, Service Catalog), which is replicated.
- Multiple Resource/Controller Adapters will be instantiated for active-active operations.
  - The communications between the BPMN/TOSCA resource recipes and the adapter instances will be load-balanced through MSB.
- External communications with other ONAP components such as DCAE, OOF, A&AI, SDNC, etc. will be done through MSB/DMaaP in a scalable manner, like the communication requirements above.
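
The stateless-engine behavior described above (state flushed to the shared data store at a wait state, then picked up by any engine instance) can be illustrated with a small simulation. This is a sketch under stated assumptions, not Camunda code: `SharedStore`, `Engine`, the workflow steps, and the `WAIT:` convention are all hypothetical, and a dictionary of JSON strings stands in for the shared database.

```python
import json

# Hypothetical three-step workflow; a "WAIT:" prefix marks a wait state.
WORKFLOW = ["assign_resources", "WAIT:controller_callback", "activate_service"]


class SharedStore:
    """Stand-in for the shared database. State is stored as JSON text so
    that engines never share live in-memory objects."""

    def __init__(self):
        self._rows = {}

    def save(self, instance_id, state):
        self._rows[instance_id] = json.dumps(state)

    def load(self, instance_id):
        return json.loads(self._rows[instance_id])


class Engine:
    """Holds no process state of its own; everything lives in the store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def start(self, instance_id):
        return self._run(instance_id, {"next_step": 0, "log": []})

    def resume(self, instance_id):
        """Handle the awaited event; any engine instance may do this."""
        state = self.store.load(instance_id)
        if state["status"] != "WAITING":
            raise RuntimeError(f"{instance_id} is not waiting")
        state["next_step"] += 1  # move past the wait state
        return self._run(instance_id, state)

    def _run(self, instance_id, state):
        status = "COMPLETED"
        while state["next_step"] < len(WORKFLOW):
            step = WORKFLOW[state["next_step"]]
            if step.startswith("WAIT:"):
                status = "WAITING"  # passivate here
                break
            state["log"].append(f"{self.name}:{step}")
            state["next_step"] += 1
        state["status"] = status
        self.store.save(instance_id, state)  # flush the complete state
        return status


store = SharedStore()
engine_a = Engine("engine-a", store)
engine_b = Engine("engine-b", store)
print(engine_a.start("pi-1"))   # engine-a runs until the wait state
print(engine_b.resume("pi-1"))  # engine-b finishes the remaining flow
print(store.load("pi-1")["log"])
```

Since neither engine keeps anything between calls, adding or removing engine instances does not affect which process instances can continue.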

SO Resiliency / Auto Heal Requirements

SO resiliency / auto heal will be supported by utilizing OOM.
- In OOM, the number of SO component instances will be configured.
- SO component instances in different containers will be configured for fail-over.
- Each BPMN execution engine will be configured for a shared data store to make the engine instance stateless.
- OOM will detect failed SO components / containers and create replacement instances (cattle-based instantiation).

SO Resiliency / Auto Heal run-time handling
- When an SO microservice component in a container is down, OOM will bring up another instance of the component / container.
- SDC Distribution Client, API Handler, BPMN Execution and Adapters will be stateless for a quick fail-over.
  - Multiple instances of the SDC Distribution Client, API Handler, BPMN Execution and Adapters will be instantiated for active-active HA.
- Fail-over components will pick up and finish the interrupted assignments.
  - In the BPMN case, the engine will re-execute the interrupted workflows from the save points.
  - When an entire workflow is abandoned, the API Handler will retry the requests.
- The centralized data store components will be replicated to avoid a single point of failure. [Question: can OOM handle Data Store replication configuration?]
  - Master-Slave data replication
  - by utilizing MUSIC (OOM and MUSIC integration is coming)
  - Service Catalog, Camunda DB and Request DB data will be replicated
- The individual execution engine instances do not maintain session state across transactions. This is also a key for reliability support.
  - The complete state will be flushed out to the shared database.
  - Asynchronous continuations will be used when it is necessary to control save points actively and flush out the process instance state, so that another engine instance can pick up the remaining process instance flows.
- Retry handling
  - TBD
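
The save-point behavior described above (a fail-over engine re-executes an interrupted workflow from the last save point) can be sketched with a toy simulation. The step names, the save-point flags, and `EngineCrash` are hypothetical, and a plain dictionary stands in for the shared data store; retry policy itself is left open, as in the list above.

```python
# Hypothetical workflow: (step name, is there a save point before it?).
STEPS = [
    ("create_vnf", True),
    ("configure_vnf", False),
    ("update_inventory", True),
    ("notify", False),
]


class EngineCrash(Exception):
    """Simulates an engine instance going down mid-workflow."""


def run_from_save_point(store, instance_id, executed, crash_before=None):
    """Run the remaining steps, persisting progress only at save points.

    store:    dict of instance id -> index of the step to resume from
    executed: list that collects step names as they run (side effects)
    """
    index = store.get(instance_id, 0)
    while index < len(STEPS):
        name, is_save_point = STEPS[index]
        if is_save_point:
            store[instance_id] = index  # flush state before this step
        if name == crash_before:
            raise EngineCrash(name)
        executed.append(name)
        index += 1
    store[instance_id] = len(STEPS)  # workflow finished


store, executed = {}, []
try:
    run_from_save_point(store, "pi-1", executed, crash_before="notify")
except EngineCrash:
    pass  # the first engine is gone; its in-memory progress is lost

# A fail-over engine resumes from the last persisted save point.
run_from_save_point(store, "pi-1", executed)
print(executed)
```

In this sketch `update_inventory` runs twice, because it sits after the last save point that was persisted before the crash. That suggests steps following a save point should tolerate re-execution, although how SO handles this is not specified here.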

ARIA Orchestrator

ARIA Orchestrator is not addressed here because we are not sure about its location.
TBD

Note:

SO components are being refactored. Additional modules can be identified.