Resiliency Levels (New)
Level Definitions
Level 0: no redundancy
Level 1: support manual failure detection & rerouting or recovery within a single site; tested to complete within 30 minutes
Level 2: support automated failure detection & rerouting
within a single geographic site
stateless components: establish baseline measure of failed requests for a component failure within a site
stateful components: establish baseline of data loss for a component failure within a site
Level 3: support automated failure detection & rerouting
across multiple sites
stateless components
improve on the number of failed requests for a component failure within a site
establish baseline of failed requests for a site failure
stateful components
improve on data loss metrics for a component failure within a site
establish baseline of data loss for a site failure
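The "establish baseline" and "improve on" goals above imply comparing failure-injection runs against each other. A minimal sketch of that comparison for stateless components follows. The `FailureRunResult` structure, the function names, and the choice of failure ratio as the metric are all assumptions made for illustration and are not defined by this page.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FailureRunResult:
    """Outcome of one failure-injection run (hypothetical structure)."""
    total_requests: int
    failed_requests: int

    @property
    def failure_ratio(self) -> float:
        # Guard against an empty run so the ratio is always defined.
        if self.total_requests == 0:
            return 0.0
        return self.failed_requests / self.total_requests


def summarize_run(outcomes: Iterable[bool]) -> FailureRunResult:
    """Count requests and failures; each outcome is True if the request succeeded."""
    total = 0
    failed = 0
    for ok in outcomes:
        total += 1
        if not ok:
            failed += 1
    return FailureRunResult(total, failed)


def improves_on(baseline: FailureRunResult, candidate: FailureRunResult) -> bool:
    """True if the candidate run has a strictly lower failure ratio than the baseline."""
    return candidate.failure_ratio < baseline.failure_ratio


if __name__ == "__main__":
    # Made-up outcomes: a Level 2 baseline run and a later Level 3 run.
    baseline = summarize_run([True, False, False, True, True])
    candidate = summarize_run([True, True, False, True, True])
    print(baseline.failed_requests, candidate.failed_requests,
          improves_on(baseline, candidate))  # prints: 2 1 True
```

The same shape could be reused for stateful components by recording lost records instead of failed requests, if a project chooses to measure data loss that way.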
Minimum Levels (Dublin)
Runtime Projects: Level 2 (stretch goal Level 3)
NOTE: For Dublin, the building blocks will be put in place for Level 3 geo-redundancy, and a few projects will pilot it
All other Projects: Level 1 (stretch goal Level 2)
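As an illustration only, the Dublin minimums above could be encoded so that a project's achieved level can be checked mechanically. The enum member names and the `"runtime"` / `"other"` keys are hypothetical; only the level numbers and the two minimums come from this page.

```python
from enum import IntEnum


class ResiliencyLevel(IntEnum):
    """Resiliency levels 0 to 3 as defined on this page (member names are illustrative)."""
    NONE = 0              # Level 0: no redundancy
    MANUAL = 1            # Level 1: manual detection & rerouting or recovery, single site
    AUTO_SINGLE_SITE = 2  # Level 2: automated detection & rerouting, single site
    AUTO_MULTI_SITE = 3   # Level 3: automated detection & rerouting, multiple sites


# Dublin minimum levels; the dictionary keys are assumptions for this sketch.
DUBLIN_MINIMUM = {
    "runtime": ResiliencyLevel.AUTO_SINGLE_SITE,
    "other": ResiliencyLevel.MANUAL,
}


def meets_minimum(project_kind: str, achieved: ResiliencyLevel) -> bool:
    """Return True if the achieved level is at or above the Dublin minimum."""
    return achieved >= DUBLIN_MINIMUM[project_kind]
```

For example, `meets_minimum("runtime", ResiliencyLevel.MANUAL)` evaluates to `False`, since runtime projects need at least Level 2.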
Guidance for Implementation
Project teams can implement Level 2 resiliency within a single site using OOM, Kubernetes clusters, and health checks.
CNI - OOM is introducing CNI, which will allow for multi-site Kubernetes clusters (VXLAN or BGP). Pods can be labeled by geo location and are then scheduled onto nodes carrying the corresponding label. The labels would need to be defined in the Helm charts. See OOM-1506.
MUSIC - MUSIC supports geo-redundancy, particularly for stateful components where traditional clustering techniques are insufficient (for example, not performant enough). Pilot projects: Portal, OOF, and SDC.
Integration testing details TBD.
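The health-check and geo-labeling guidance above can be sketched as a Kubernetes Pod manifest that combines liveness and readiness probes with a `nodeSelector`. This is a hedged illustration, not the OOM implementation: the `site` label key, the `/healthz` path, the port, the probe timings, and the image name are all assumptions. In OOM these values would come from the Helm charts.

```python
import json


def build_pod_spec(name: str, image: str, site: str, port: int = 8080) -> dict:
    """Build a Pod manifest with health probes and a geo node selector."""
    probe = {
        # Hypothetical HTTP health endpoint exposed by the component.
        "httpGet": {"path": "/healthz", "port": port},
        "initialDelaySeconds": 10,
        "periodSeconds": 5,
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name, "site": site}},
        "spec": {
            # Schedule only onto nodes labeled with the same (assumed) site key.
            "nodeSelector": {"site": site},
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "ports": [{"containerPort": port}],
                    # Liveness restarts a hung container; readiness removes it
                    # from Service endpoints until it can serve traffic.
                    "livenessProbe": dict(probe),
                    "readinessProbe": dict(probe),
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = build_pod_spec("demo-component", "example/demo:1.0", "site-a")
    # kubectl accepts JSON as well as YAML manifests.
    print(json.dumps(manifest, indent=2))
```

A bare Pod is used here only to keep the sketch short. In practice the same `nodeSelector` and probe fields would sit in the pod template of a Deployment or StatefulSet so that failed pods are replaced automatically.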
Contacts
OOM, MUSIC, and Integration teams.