Baseline Measurements

Baseline Measure of Failed Requests for a Component Failure within a Site

The rule management module of Holmes is mainly used for user interactions, especially for controlling the lifecycle of rules. Because this module is used infrequently and rules are usually deployed manually, it receives few requests. In addition, all requests can be re-sent once the module has recovered. Request failures during a component failure can therefore reasonably be ignored for this module.

The engine management module is the core module of Holmes and is mainly responsible for alarm processing, so request loss for this module can be treated as equivalent to data loss. It is evaluated in the next section.

Baseline of Data Loss for a Component Failure within a Site

The average start-up time of Holmes is around 20 seconds. All baseline estimates below are based on this approximation.

Because the performance of Holmes differs greatly with and without A&AI (referred to as AAI below), both scenarios should be taken into account when identifying the baseline.

Data Analysis with AAI

According to the testing report on the Testing Results page, correlation analysis for alarms is extremely slow when AAI is involved. Even if we take the best performance figure as a reference, the data loss during the start-up of Holmes is:

1.73 alarms/s * 20s = 34.6 alarms


Data Analysis without AAI

Now, let's move forward to the scenario under which AAI is not involved during the correlation analysis.

According to the Testing Results, the peak rate for alarm processing is around 350 alarms per second. Based on this approximation, the data loss in this scenario is:

350 alarms/s * 20s = 7000 alarms
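The two estimates on this page follow the same rule: processing rate multiplied by downtime. A minimal sketch of that arithmetic is shown below, assuming that every alarm arriving during start-up is lost and that the rates are the ones quoted from the Testing Results. The function and variable names are illustrative and are not part of Holmes.

```python
def estimated_alarm_loss(alarms_per_second: float, downtime_seconds: float) -> float:
    """Estimate the alarms lost while the component is down.

    Assumes alarms keep arriving at a constant rate and that none of them
    are buffered or replayed after the restart.
    """
    return alarms_per_second * downtime_seconds


STARTUP_SECONDS = 20  # approximate Holmes start-up time used on this page

# Rates quoted on this page: best case with AAI, peak rate without AAI.
loss_with_aai = estimated_alarm_loss(1.73, STARTUP_SECONDS)
loss_without_aai = estimated_alarm_loss(350, STARTUP_SECONDS)

print(f"with AAI:    about {loss_with_aai:.1f} alarms")
print(f"without AAI: about {loss_without_aai:.0f} alarms")
```

Because the estimate is linear in both inputs, shortening the start-up time or buffering alarms upstream reduces the loss proportionally.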

From the figures above, we can see that the data loss could be far beyond what users would accept. The Holmes team therefore needs to optimize the component in the future to reduce data loss.

Here are two suggestions:

  1. Cache the AAI data and refresh it periodically so that Holmes does not have to make an HTTP call to AAI every time it tries to correlate one alarm with another.

  2. DCAE and its subordinate systems (e.g. PNFs and VNFs) should provide a data synchronization mechanism so that components can fetch data on their own after they are deployed or restarted.
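The first suggestion can be illustrated with a small time-to-live cache. This is only a sketch under stated assumptions: the class name `TtlCache`, the injected `fetch` function, and the 60-second lifetime are hypothetical, and no real AAI client API is shown. It also refreshes an entry lazily when the entry has expired, which is a simpler stand-in for the periodic background refresh described above.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Cache values per key and re-fetch them once they are older than ttl_seconds."""

    def __init__(
        self,
        fetch: Callable[[str], dict],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch  # hypothetical loader, e.g. a wrapper around the HTTP call to AAI
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested without sleeping
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: no remote call
        value = self._fetch(key)  # missing or stale: refresh from the source
        self._entries[key] = (now, value)
        return value


def fake_aai_lookup(resource_id: str) -> dict:
    # Stand-in for a real inventory query; returns made-up data.
    return {"id": resource_id}


cache = TtlCache(fake_aai_lookup, ttl_seconds=60)
first = cache.get("vnf-1")   # calls fake_aai_lookup
second = cache.get("vnf-1")  # served from the cache
```

With such a cache, repeated correlations against the same resource within the lifetime window would trigger a single remote lookup. The trade-off is that inventory changes become visible only after the entry expires.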