SO Resiliency / Auto Heal Requirements

  • Configuration

    • SO resiliency / auto heal will be supported by utilizing OOM along with MUSIC.

      • In OOM, the number of SO component instances will be configured.

      • SO component instances in different containers will be configured for fail-over.

      • Each BPMN execution engine will be configured to use a shared data store so that each engine instance is stateless.

      • SO will interface with MariaDB through MDBC.

  • SO Resiliency / Auto Heal run-time handling

    • OOM will detect failed SO components / containers and create new instances to replace them (cattle-based instantiation).

    • When a microservice SO component in a container is down, OOM will bring up another instance of that component / container.
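
The replacement behaviour in the two items above can be pictured as a reconciliation step that compares the configured instance count with the healthy instances. The sketch below is only an illustration of that idea, not OOM code: OOM performs this itself, and the function name `reconcile`, the tuple layout, and the component labels are all assumptions made for the example.

```python
# Illustrative sketch only. OOM carries out this kind of reconciliation itself;
# `reconcile` and the instance tuples below are hypothetical names.

def reconcile(desired, running):
    """Return an instance list that matches the configured instance counts.

    desired: dict mapping component name -> configured number of instances
    running: list of (component, instance_id, healthy) tuples
    """
    result = []
    for component, replicas in desired.items():
        # Failed instances are discarded rather than repaired (cattle, not pets).
        healthy = [inst for inst in running
                   if inst[0] == component and inst[2]][:replicas]
        serial = 0
        # Create fresh instances until the configured count is reached.
        while len(healthy) < replicas:
            serial += 1
            healthy.append((component, f"{component}-replacement-{serial}", True))
        result.extend(healthy)
    return result


def demo_reconcile():
    desired = {"api-handler": 2, "bpmn": 2}
    running = [
        ("api-handler", "a1", True),
        ("api-handler", "a2", False),  # failed instance
        ("bpmn", "b1", True),
        ("bpmn", "b2", True),
    ]
    return reconcile(desired, running)
```

In this sketch the failed `a2` instance is dropped and a new `api-handler` instance is created in its place, while the healthy instances are left untouched.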

    • SDC Distribution Client, API Handler, BPMN Execution and Adapters will be stateless for a quick fail-over. 

      • Multiple instances of the SDC Distribution Client, API Handler, BPMN Execution and Adapters will be instantiated for active-active HA.

    • Fail-over components will pick up and finish the interrupted assignments.

      • In the BPMN case, the engine will re-execute the interrupted workflows from their save points.

      • When an entire workflow is abandoned, the API Handler will retry the request.
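
The pick-up behaviour above can be sketched as follows, assuming progress is recorded in the shared data store after every completed step. `SharedStore`, `run_workflow`, and the step names are hypothetical and stand in for the shared database and the BPMN engine; this is not an SO or Camunda API.

```python
# Hypothetical sketch: SharedStore stands in for the shared database.

class SharedStore:
    def __init__(self):
        self.save_points = {}  # request id -> index of the next step to run


class EngineCrash(Exception):
    """Simulates an engine instance failing in the middle of a workflow."""


def run_workflow(store, request_id, steps, log, crash_before=None):
    """Run steps starting at the last save point, saving after each step."""
    # With no save point recorded, the workflow starts from the beginning,
    # which is also what a retried request would do.
    start = store.save_points.get(request_id, 0)
    for index in range(start, len(steps)):
        if index == crash_before:
            raise EngineCrash(f"engine failed before step {index}")
        log.append(steps[index])                   # stand-in for executing the step
        store.save_points[request_id] = index + 1  # record the save point


def demo_failover():
    store, log = SharedStore(), []
    steps = ["assign", "create", "activate"]
    try:
        # First engine instance fails after completing "assign".
        run_workflow(store, "req-1", steps, log, crash_before=1)
    except EngineCrash:
        pass
    # A second engine instance resumes from the save point.
    run_workflow(store, "req-1", steps, log)
    return log, store.save_points["req-1"]
```

Because the second run starts at the recorded save point, "assign" is not executed twice. A step that was interrupted partway through would be re-executed in full, so in a design like this the steps would need to be safe to repeat.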

    • The centralized data store components will be replicated to avoid a single point of failure. [Question: can OOM handle data store replication configuration?]

      • Master-Slave data replication 

      • by utilizing MUSIC (OOM and MUSIC integration is coming)

      • Service Catalog, Camunda DB and Request DB data will be replicated

    • The individual execution engine instances do not maintain session state across transactions. This is also key to reliability.

      • The complete state will be flushed out to the shared database.

      • Asynchronous continuations will be used when it is necessary to actively control save points and flush out the process instance state, so that another engine instance can pick up the remaining flow of the process instance.
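
One way to picture how another engine instance picks up a flushed process instance is a table of pending continuation jobs in the shared database, each claimed under a time-limited lock. The sketch below assumes that model; `JobStore`, its methods, the engine names, and the lock duration are hypothetical and are not the engine's actual API.

```python
# Hypothetical sketch of claiming persisted continuation jobs with expiring locks.

class JobStore:
    """Stand-in for a shared database table of pending continuation jobs."""

    def __init__(self):
        self.jobs = {}  # job id -> {"state", "owner", "lock_until"}

    def add(self, job_id, state):
        # The complete process instance state is stored with the job.
        self.jobs[job_id] = {"state": state, "owner": None, "lock_until": 0.0}

    def acquire(self, engine, now, lock_seconds):
        """Claim the first job whose lock is free or expired, else return None."""
        for job_id, job in self.jobs.items():
            if job["lock_until"] <= now:
                job["owner"] = engine
                job["lock_until"] = now + lock_seconds
                return job_id, job["state"]
        return None

    def complete(self, job_id, engine):
        """Remove a finished job; only the current lock owner may do so."""
        if self.jobs[job_id]["owner"] != engine:
            raise PermissionError(f"{engine} does not hold the lock on {job_id}")
        del self.jobs[job_id]


def demo_pickup():
    store = JobStore()
    store.add("job-1", {"process": "createService", "next_activity": "activate"})
    first = store.acquire("engine-a", now=0.0, lock_seconds=30.0)
    blocked = store.acquire("engine-b", now=10.0, lock_seconds=30.0)
    # engine-a fails without completing; its lock expires at t = 30.
    second = store.acquire("engine-b", now=31.0, lock_seconds=30.0)
    store.complete("job-1", "engine-b")
    return first, blocked, second, len(store.jobs)
```

Since the engine instances keep no state of their own in this model, `engine-b` receives exactly the state that was flushed to the store and can continue the flow once the lock held by the failed instance expires.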

    • Retry handling

      • TBD

  • Note

    • ARIA Orchestrator

      • ARIA Orchestrator is not addressed here because its location has not yet been determined.

      • TBD

    • SO components are being refactored. Additional modules may be identified.