2019-05-13: Original post
jhartley@luminanetworks.com – for any questions
Goals for akka.conf:
- Specify THIS cluster member's resolvable FQDN or IP address. (Tip: Use FQDNs, and ensure they're resolvable in your env.)
- List every cluster member in the seed-nodes list.
- Tune optional variables, noting that the defaults for many of these are far too low.
- Keep this file nearly identical on all instances; only the "roles" and "hostname" values are unique to this member (a per-member delta is sketched just after the example below).
Example of a 3-node configuration, tuned:
odl-cluster-data {
  akka {
    loglevel = ""
    remote {
      netty.tcp {
        hostname = "odl1.region.customer.com"
        port = 2550
      }
      use-passive-connections = off
    }
    actor {
      debug {
        autoreceive = on
        lifecycle = on
        unhandled = on
        fsm = on
        event-stream = on
      }
    }
    cluster {
      seed-nodes = [
        "akka.tcp://opendaylight-cluster-data@odl1.region.customer.com:2550",
        "akka.tcp://opendaylight-cluster-data@odl2.region.customer.com:2550",
        "akka.tcp://opendaylight-cluster-data@odl3.region.customer.com:2550"
      ]
      seed-node-timeout = 15s
      roles = ["member-1"]
    }
    persistence {
      journal-plugin-fallback {
        circuit-breaker {
          max-failures = 10
          call-timeout = 90s
          reset-timeout = 30s
        }
        recovery-event-timeout = 90s
      }
      snapshot-store-plugin-fallback {
        circuit-breaker {
          max-failures = 10
          call-timeout = 90s
          reset-timeout = 30s
        }
        recovery-event-timeout = 90s
      }
    }
  }
}
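To show how little changes per member, here is a minimal sketch of the delta for the second node (using the odl2.region.customer.com / member-2 names already listed above; everything not shown stays byte-for-byte identical to member-1's file):

# member-2's akka.conf differs from member-1's only in these two settings
odl-cluster-data {
  akka {
    remote {
      netty.tcp {
        # this member's own resolvable FQDN
        hostname = "odl2.region.customer.com"
      }
    }
    cluster {
      # this member's unique role name
      roles = ["member-2"]
    }
  }
}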
Goals for org.opendaylight.controller.cluster.datastore.cfg:
- Unlike akka.conf, this is a plain key=value properties file; if the same key appears more than once, the later entry takes effect.
- The goal here is to significantly reduce the race condition that occurs when starting all members of a cluster, and the race condition any freshly restarted or "cleaned" member hits when rejoining.
### Note: Some sites use batch-size of 1, not reflecting that here ###
persistent-actor-restart-min-backoff-in-seconds=10
persistent-actor-restart-max-backoff-in-seconds=40
persistent-actor-restart-reset-backoff-in-seconds=20
shard-transaction-commit-timeout-in-seconds=120
shard-isolated-leader-check-interval-in-millis=30000
operation-timeout-in-seconds=120
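As a hedged illustration of the "later entry wins" behavior described above (reusing one key from the block above; in practice you would set each key only once):

# earlier entry, superseded by the later one
operation-timeout-in-seconds=5
# later entry for the same key is the one that takes effect
operation-timeout-in-seconds=120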
Goals for module-shards.conf:
- Name which members retain copies of which data shards.
- These shard name fields are the "friendly" names assigned to the explicit namespaces in modules.conf.
- In a K8S/Swarm environment, it's easiest to keep this identical on all members. Unique shard replication (or isolation) strategies are for another document/discussion, and require non-trivial planning.
module-shards = [
  {
    name = "default"
    shards = [
      {
        name = "default"
        replicas = [
          "member-1",
          "member-2",
          "member-3"
        ]
      }
    ]
  },
  {
    name = "topology"
    shards = [
      {
        name = "topology"
        replicas = [
          "member-1",
          "member-2",
          "member-3"
        ]
      }
    ]
  },
  {
    name = "inventory"
    shards = [
      {
        name = "inventory"
        replicas = [
          "member-1",
          "member-2",
          "member-3"
        ]
      }
    ]
  }
]
Thus, for example, it would be legitimate to have a single entry that includes ONLY "default", if desired. In that case the only module shards would be default-config and default-operational, plus some of the auto-created shards.
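A minimal sketch of that "default only" variant (same member names as above; whether it is appropriate depends on which models your applications actually use):

module-shards = [
  {
    name = "default"
    shards = [
      {
        name = "default"
        replicas = [
          "member-1",
          "member-2",
          "member-3"
        ]
      }
    ]
  }
]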