Resiliency Levels (New)

Level Definitions

    • Level 0: no redundancy

    • Level 1: support manual failure detection & rerouting or recovery within a single site; tested to complete within 30 minutes

    • Level 2: support automated failure detection & rerouting 

      • within a single geographic site

      • stateless components: establish baseline measure of failed requests for a component failure within a site 

      • stateful components: establish baseline of data loss for a component failure within a site

    • Level 3: support automated failure detection & rerouting

      • across multiple sites 

      • stateless components 

        • improve on the number of failed requests for component failure within a site

        • establish baseline for failed requests for site failure 

      • stateful components 

        • improve on data loss metrics for component failure within a site 

        • establish baseline for data loss for site failure
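The "establish baseline" and "improve on" wording in Levels 2 and 3 can be made concrete with a small measurement helper. The following is a minimal sketch only: the `FailureRun` record, the `improves_on` function, the sample numbers, and the choice to express the metric as a failure ratio are all assumptions for illustration, not part of the level definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureRun:
    """Outcome of one failure-injection run (hypothetical record type)."""
    total_requests: int
    failed_requests: int

    @property
    def failure_ratio(self) -> float:
        # Guard against an empty run so the ratio is always defined.
        if self.total_requests == 0:
            return 0.0
        return self.failed_requests / self.total_requests


def improves_on(baseline: FailureRun, candidate: FailureRun) -> bool:
    """True when the candidate fails a strictly smaller share of requests."""
    return candidate.failure_ratio < baseline.failure_ratio


# Made-up numbers: a Level 2 baseline for a component failure within a site,
# and a later run of the same failure scenario.
level2_baseline = FailureRun(total_requests=10_000, failed_requests=250)
later_run = FailureRun(total_requests=10_000, failed_requests=40)

if __name__ == "__main__":
    print(improves_on(level2_baseline, later_run))
```

The same shape could hold a data-loss measure for stateful components by replacing the request counts with, for example, records written and records lost.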

Minimum Levels (Dublin)

  • Runtime Projects: Level 2 (stretch goal Level 3)

    • NOTE: For Dublin, the building blocks will be put in place for Level 3 geo-redundancy, and a few projects will pilot it

  • All other Projects: Level 1 (stretch goal Level 2)
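The minimum levels above can be expressed as a simple check, for instance in a release checklist script. This is a sketch under assumptions: the enum member names and the `is_runtime_project` flag are hypothetical, and stretch goals are deliberately not enforced.

```python
from enum import IntEnum


class ResiliencyLevel(IntEnum):
    """Levels 0-3 as defined above; member names are illustrative."""
    NONE = 0              # no redundancy
    MANUAL = 1            # manual detection and recovery, single site
    AUTO_SINGLE_SITE = 2  # automated detection and rerouting, single site
    AUTO_MULTI_SITE = 3   # automated detection and rerouting, multiple sites


def minimum_level(is_runtime_project: bool) -> ResiliencyLevel:
    """Dublin minimum: Level 2 for runtime projects, Level 1 for all others."""
    if is_runtime_project:
        return ResiliencyLevel.AUTO_SINGLE_SITE
    return ResiliencyLevel.MANUAL


def meets_minimum(achieved: ResiliencyLevel, is_runtime_project: bool) -> bool:
    """Compare the achieved level against the minimum for the project type."""
    return achieved >= minimum_level(is_runtime_project)
```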

Guidance for Implementation

  • Level 2 resiliency within a single site can be easily implemented by project teams using OOM, Kubernetes clusters, and health checks.

  • CNI - OOM is introducing CNI (Container Network Interface) support, which will allow multi-site Kubernetes clusters (VXLAN or BGP).  Pods can be labeled by geographic location so that they are scheduled onto nodes carrying the corresponding label.  The labels would need to be defined in the Helm charts.  See OOM-1506.

  • MUSIC - MUSIC supports geo-redundancy, particularly for stateful components where traditional clustering techniques are insufficient (not performant, for example).  Pilot projects: Portal, OOF, and SDC.

  • Integration testing details TBD.
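To illustrate how health checks and geo-location labels fit together, the sketch below builds a Kubernetes Deployment manifest as a Python dictionary. It is not taken from the OOM charts: the label key `example.org/site`, the `/healthz` path, the port, the image name, and the probe timings are all placeholder assumptions, and in OOM the equivalent settings would live in Helm chart templates and values rather than in Python.

```python
import json

# Hypothetical node label key; real clusters would define their own.
SITE_LABEL_KEY = "example.org/site"


def build_deployment(name: str, image: str, site: str,
                     replicas: int = 2, port: int = 8080) -> dict:
    """Return a Deployment manifest pinned to nodes labeled with `site`."""
    labels = {"app": name, "site": site}
    # Assumed HTTP health endpoint; each component would expose its own.
    probe = {"httpGet": {"path": "/healthz", "port": port}}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{name}-{site}", "labels": labels},
        "spec": {
            "replicas": replicas,  # more than one replica for in-site rerouting
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Schedule pods only onto nodes carrying the site label.
                    "nodeSelector": {SITE_LABEL_KEY: site},
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": port}],
                        # Liveness: restart the container when it stops responding.
                        "livenessProbe": {**probe, "initialDelaySeconds": 10,
                                          "periodSeconds": 10},
                        # Readiness: stop routing traffic until the check passes.
                        "readinessProbe": {**probe, "periodSeconds": 5},
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = build_deployment("portal", "example/portal:latest", "site-a")
    print(json.dumps(manifest, indent=2))
```

Calling `build_deployment` once per site yields one Deployment per geographic location, each restricted to its own labeled nodes. The JSON output can be inspected directly, since Kubernetes accepts manifests in JSON as well as YAML.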

Contacts

OOM, MUSIC, and Integration teams.