
Note:

  • In a scenario where a participant is stuck in deploying, the instance will be in TIMEOUT and the user can take an action such as deploying again or undeploying. In that scenario the participant intermediary has to receive the next message, kill the thread that is stuck in deploying, and create a new thread.
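The stuck-thread handling described in the note can be sketched as follows. This is a minimal illustration, not the actual intermediary code: the class name and structure are assumptions, showing only the idea of interrupting an in-flight deploy task when the next message arrives and starting a fresh one.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: when a new message arrives while a deploy task is still
// running (instance in TIMEOUT), the stuck task is interrupted and replaced.
class DeployTaskManager {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private Future<?> currentTask;

    // Called when the intermediary receives the next message for an element.
    synchronized void submit(Runnable action) {
        if (currentTask != null && !currentTask.isDone()) {
            // The previous operation is stuck: interrupt its thread.
            currentTask.cancel(true);
        }
        currentTask = executor.submit(action);
    }

    synchronized boolean isRunning() {
        return currentTask != null && !currentTask.isDone();
    }

    void shutdown() {
        executor.shutdownNow();
    }
}
```

A deploy task must be written to respond to interruption (e.g. by checking `Thread.interrupted()` or letting `InterruptedException` propagate) for the cancellation to take effect.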


The requirements are:

  1. Participants can be replicated; each participant can have an arbitrary number of replicas
  2. Composition definitions, instances, element instances, and all their data including properties are identical in all participant replicas
  3. When anything is changed in one replica, the change is propagated to all replicas of the participant
  4. An operation on a composition element can be sent to any replica of a participant, which means that for a given element the deploy could go to replica 1, the update to replica 2, and the delete to replica 3, as one would expect in any HA solution
  5. The first free replica fetches the operation message from Kafka
  6. The ACM runtime will be made HA (more than one replica of ACM-R will be supported) and will run on an HA Postgres
  7. The REST API that returns participant information will be updated to include replica information
  8. Replicas are "eventually consistent", with consistency typically achieved within hundreds of milliseconds once the message for the change is triggered

This solution uses an approach similar to the one Kubernetes uses with etcd to implement CRDs and CRs. It implements replication in participants by introducing:

  1. The concept of participant replicas in ACM-R and the Participant Intermediary
  2. A lightweight mechanism for replica updates between replicas and ACM-R. Every time a replica changes its data, the change is reported to ACM-R, and ACM-R updates all other replicas of that participant

[Diagram: data management as implemented today in ACM-R]

The diagram above depicts data management as implemented today in ACM-R. ACM-R maintains the "source of truth" for all compositions and their elements in the ACM-R database.

  1. Composition type data and InProperties are pushed from ACM-R to the participants and are read-only in the participants. The participant can update the Composition type OutProperties, and when that happens, those changes are propagated to the ACM-R database.
  2. Composition element data (mainly state information) is built up by interactions between ACM-R and participants for the various ACM operations (Deploy/Update/Undeploy, etc.), and ACM-R always maintains the current composition element data in the ACM-R database.
  3. Composition element data and InProperties are pushed by ACM-R to the participant and are read-only in the participants. The participant can update the Composition element OutProperties, and when that happens, those changes are propagated to the ACM-R database.

Therefore, for the three types of data above, the ACM-R database today has the state of the participant. This means that creating participant replicas is rather trivial: we leave the current data handling mechanism in place, introduce the concept of participant replicas in ACM-R, and introduce data synchronization across participant replicas.

We introduce a participant replication table in the ACM-R database, an index that records the replicas that exist for each participant. When a replica of a component implementing a participant is created (by Kubernetes or otherwise), the participant intermediary in the component registers with ACM-R as it does today. The only difference is that the participant intermediary sends the participant ID together with a replica number. ACM-R is updated to accept registrations from participants with the same Participant ID and different replica numbers. When the first replica for a certain participant registers, ACM-R handles the registration exactly as it does today and records the replica as the single replica that exists for this participant. When the next replica registers, ACM-R recognises that this is a second replica for a participant that already exists and records it as such. Rather than priming this replica, ACM-R copies all the data from the first replica to this replica. The registration of further replicas continues to follow this pattern.
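The replica-aware registration logic above can be sketched as follows. The class and method names here are illustrative, not the real ACM-R API: the sketch only shows the branching between priming the first replica and copying data to later ones.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch of the participant replication table: an index of the
// replicas registered for each participant ID.
class ReplicaIndex {
    // participantId -> ordered list of registered replica ids
    private final Map<UUID, List<UUID>> replicas = new HashMap<>();

    /**
     * Handles a registration carrying a participant ID and a replica id.
     * Returns "PRIME" for the first replica (handled exactly as today) and
     * "COPY" for later replicas, whose data is copied from the first replica
     * instead of being primed.
     */
    String register(UUID participantId, UUID replicaId) {
        List<UUID> list = replicas.computeIfAbsent(participantId, k -> new ArrayList<>());
        boolean first = list.isEmpty();
        if (!list.contains(replicaId)) {
            list.add(replicaId);
        }
        return first ? "PRIME" : "COPY";
    }

    int replicaCount(UUID participantId) {
        return replicas.getOrDefault(participantId, List.of()).size();
    }
}
```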

During normal operation, ACM-R receives and executes requests towards participants. When an operation has been completed, ACM-R synchronizes the data from the replica that executed the operation to all the other replicas using Participant Synchronization.

If ACM-R is informed by a replica that an Implementing Component changed composition element properties, Participant Synchronization synchronizes these changes to all other Participant Intermediary replicas.
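The synchronization fan-out can be sketched like this. The names and the sender abstraction are assumptions for illustration only; the key point is that the ACM-R database (the source of truth) is updated first, and the change is then pushed to every replica except the one that reported it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.function.BiConsumer;

// Hypothetical sketch of Participant Synchronization for element OutProperties.
class ParticipantSynchronizer {
    // elementId -> outProperties, standing in for the ACM-R database
    private final Map<UUID, Map<String, Object>> database = new HashMap<>();
    private final List<UUID> replicaIds;
    private final BiConsumer<UUID, Map<String, Object>> sender; // (replicaId, outProperties)

    ParticipantSynchronizer(List<UUID> replicaIds, BiConsumer<UUID, Map<String, Object>> sender) {
        this.replicaIds = replicaIds;
        this.sender = sender;
    }

    /** Called when reportingReplica notifies ACM-R of changed element OutProperties. */
    void onOutPropertiesChanged(UUID reportingReplica, UUID elementId, Map<String, Object> outProperties) {
        // Persist to the source of truth first.
        database.put(elementId, new HashMap<>(outProperties));
        // Then synchronize every other replica of the participant.
        for (UUID replica : replicaIds) {
            if (!replica.equals(reportingReplica)) {
                sender.accept(replica, outProperties);
            }
        }
    }

    Map<String, Object> stored(UUID elementId) {
        return database.get(elementId);
    }
}
```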

In this solution:

  1. Participant Design backward compatibility is preserved; there is no change to the participant intermediary interface for participant implementations
  2. The ACM-R participant endpoint will contain the list of replicas
  3. ACM-R is made HA so that it can itself scale
  4. We can use Kafka load balancing on the participants and get the load-balancing functionality for free
  5. A new Kafka topic is used for synchronization
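Point 4 relies on Kafka consumer groups: if all replicas of a participant subscribe with the same `group.id`, Kafka delivers each operation message to exactly one member of the group, so the first free replica picks it up with no extra load-balancing code. A minimal sketch of such a consumer configuration, with illustrative broker and group names:

```java
import java.util.Properties;

// Sketch: consumer properties for a participant replica. Every replica of the
// same participant uses the same group.id, so Kafka assigns each message to
// only one of them (standard consumer-group load balancing).
class ReplicaConsumerConfig {
    static Properties forParticipant(String participantId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // illustrative address
        props.put("group.id", "participant-" + participantId);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }
}
```

Note that per-group delivery is per partition: the participant topic needs at least as many partitions as replicas for all replicas to receive work concurrently.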

Restarting

Participant New Delhi version

  • ACM-R receives a message from a participant intermediary that has restarted
  • ACM-R sets the restarting flag to true on all compositions/instances (the user cannot take actions)
  • ACM-R sends a restarting message to the participant intermediary
  • The participant intermediary handles the restarting message (to save data in memory) and calls the participant
  • The participant saves local data and takes actions if the composition/instance was in priming/deploying/...
  • The participant sends the final state to ACM-R
  • ACM-R receives the message from the participant about the final state and sets the restarting flag to null

Oslo version: 

  • ACM-R receives a message from a participant intermediary that has restarted
  • If the Participant is the Oslo version:
    • ACM-R sends a restarting message to the participant intermediary replica
    • The participant intermediary handles the restarting message (to save data in memory) and calls the participant
    • The participant saves local data
  • If the Participant is the New Delhi version:
    • ACM-R sets the restarting flag to true on all compositions/instances (the user cannot take actions)
    • ACM-R sends a restarting message to the participant intermediary
    • The participant intermediary handles the restarting message (to save data in memory) and calls the participant
    • The participant saves local data and takes actions if the composition/instance was in priming/deploying/...
    • The participant sends the final state to ACM-R
    • ACM-R receives the message from the participant about the final state and sets the restarting flag to null
    • The Participant is then at the Oslo version

Old implementation:

AcElementListenerV2
    @Override
    public void handleRestartInstance(CompositionElementDto compositionElement, InstanceElementDto instanceElement,
                                      DeployState deployState, LockState lockState) throws PfModelException {

        // An operation was in flight when the participant went down: re-issue it.
        if (DeployState.DEPLOYING.equals(deployState)) {
            deploy(compositionElement, instanceElement);
            return;
        }
        if (DeployState.UNDEPLOYING.equals(deployState)) {
            undeploy(compositionElement, instanceElement);
            return;
        }
        if (DeployState.UPDATING.equals(deployState)) {
            update(compositionElement, instanceElement, instanceElement);
            return;
        }
        if (DeployState.DELETING.equals(deployState)) {
            delete(compositionElement, instanceElement);
            return;
        }
        if (LockState.LOCKING.equals(lockState)) {
            lock(compositionElement, instanceElement);
            return;
        }
        if (LockState.UNLOCKING.equals(lockState)) {
            unlock(compositionElement, instanceElement);
            return;
        }
        // No operation was in flight: report the current state back as restored.
        intermediaryApi.updateAutomationCompositionElementState(instanceElement.instanceId(),
            instanceElement.elementId(), deployState, lockState, StateChangeResult.NO_ERROR, "Restarted");
    }


Refactored:

AcElementListenerV2
    @Override
    public void handleRestartInstance(CompositionElementDto compositionElement, InstanceElementDto instanceElement,
                                      DeployState deployState, LockState lockState) throws PfModelException {
        // Intentionally empty: with replication, restart recovery is no longer
        // needed here, as replica data is kept up to date via Participant Synchronization.
    }

Policy participant:

AutomationCompositionElementHandler
    @Override
    public void handleRestartInstance(UUID automationCompositionId, AcElementDeploy element,
            Map<String, Object> properties, DeployState deployState, LockState lockState) throws PfModelException {
        // A deploy was interrupted by the restart: run it again.
        if (DeployState.DEPLOYING.equals(deployState)) {
            deploy(automationCompositionId, element, properties);
            return;
        }
        // Restore the service template fragment needed by later operations.
        if (DeployState.UNDEPLOYING.equals(deployState) || DeployState.DEPLOYED.equals(deployState)
                || DeployState.UPDATING.equals(deployState)) {
            var automationCompositionDefinition = element.getToscaServiceTemplateFragment();
            serviceTemplateMap.put(element.getId(), automationCompositionDefinition);
        }
        if (DeployState.UNDEPLOYING.equals(deployState)) {
            undeploy(automationCompositionId, element.getId());
            return;
        }
        // Otherwise report the completed state back to ACM-R.
        deployState = AcmUtils.deployCompleted(deployState);
        lockState = AcmUtils.lockCompleted(deployState, lockState);
        intermediaryApi.updateAutomationCompositionElementState(automationCompositionId, element.getId(), deployState,
                lockState, StateChangeResult.NO_ERROR, "Restarted");
    }

Policy participant refactored:

AutomationCompositionElementHandler
    @Override
    public void handleRestartInstance(UUID automationCompositionId, AcElementDeploy element,
            Map<String, Object> properties, DeployState deployState, LockState lockState) throws PfModelException {

        // Only restore (or clear) the in-memory service template; no operations are replayed.
        if (DeployState.DEPLOYING.equals(deployState) || DeployState.UNDEPLOYING.equals(deployState)
                || DeployState.DEPLOYED.equals(deployState) || DeployState.UPDATING.equals(deployState)) {
            var automationCompositionDefinition = element.getToscaServiceTemplateFragment();
            serviceTemplateMap.put(element.getId(), automationCompositionDefinition);
        }
        if (DeployState.DELETED.equals(deployState)) {
            serviceTemplateMap.remove(element.getId());
        }
    }