Baseline Measure of Failed Requests for a Component Failure within a Site
The rule management module of Holmes is mainly used for user interaction, especially lifecycle control of rules. Because this module is used infrequently and rule deployment is usually done manually, it receives few requests. In addition, any failed request can be re-sent once the module has recovered. Request failures during a component failure can therefore reasonably be ignored for this module.
The engine management module is the core module of Holmes and is mainly responsible for alarm processing, so request loss there is equivalent to data loss. It is therefore evaluated in the next section.
Baseline of Data Loss for a Component Failure within a Site
The average start-up time of Holmes is around 20 seconds. All baseline estimates below are based on this approximation.
Because the performance of Holmes differs greatly with and without A&AI (AAI), both scenarios should be taken into account when identifying the baseline.
Data Analysis with AAI
According to the testing report on the Testing Results page, correlation analysis for alarms is extremely slow when AAI is involved. Even taking the best performance figure as a reference, the data loss during the start-up of Holmes is
1.73 alarms/s * 20s = 34.6 alarms
Data Analysis without AAI
Next, consider the scenario in which AAI is not involved in the correlation analysis.
According to the Testing Results, the peak alarm processing rate is around 350 alarms per second. Based on this approximation, the data loss in this scenario is
350 alarms/s * 20s = 7000 alarms
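Both estimates follow the same rule: loss = alarm rate * downtime. The short Python sketch below reproduces that arithmetic. It assumes, as the figures here do, that alarms keep arriving at the measured processing rate and that every alarm arriving during start-up is lost. The function and constant names are illustrative only and are not part of Holmes.

```python
def estimate_alarm_loss(rate_per_second: float, downtime_seconds: float) -> float:
    """Return the number of alarms lost while the component is down.

    Assumes a constant arrival rate and that nothing is buffered or replayed.
    """
    return rate_per_second * downtime_seconds


STARTUP_SECONDS = 20  # approximate Holmes start-up time quoted in the text

# Rates taken from the Testing Results figures quoted in the text.
loss_with_aai = estimate_alarm_loss(1.73, STARTUP_SECONDS)
loss_without_aai = estimate_alarm_loss(350, STARTUP_SECONDS)

print(f"with AAI:    {loss_with_aai:.1f} alarms")
print(f"without AAI: {loss_without_aai:.0f} alarms")
```

Changing `STARTUP_SECONDS` shows how directly the baseline depends on start-up time: halving the start-up time halves the estimated loss in both scenarios.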
The figures above show that the data loss could far exceed what users would accept. The Holmes team will therefore need to optimize the component to reduce data loss.
Here are two suggestions:
- Cache the AAI data and refresh it periodically so that Holmes does not have to make an HTTP call to AAI every time it tries to correlate one alarm with another.
- DCAE and its subordinate systems (e.g. PNFs and VNFs) should provide a data synchronization mechanism so that components can fetch data on their own after they are deployed or restarted.
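The first suggestion could be sketched as a small time-to-live (TTL) cache placed in front of the AAI lookup. This is a minimal illustration, not the Holmes implementation: Holmes itself is not written in Python, the class name `TtlCache` and the `fetch` callable are hypothetical, and entries are refreshed lazily on expiry rather than by a background timer, which is one simple way to approximate a periodic refresh.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Cache lookups for a fixed time so repeated queries skip the remote call."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # hypothetical function that queries AAI over HTTP
        self._ttl = ttl_seconds  # how long a cached entry stays valid
        self._clock = clock      # injectable clock, useful for testing
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # still fresh: no remote call
        value = self._fetch(key)  # missing or expired: fetch and store again
        self._entries[key] = (now, value)
        return value
```

With such a cache, correlating many alarms that refer to the same resource would trigger at most one remote lookup per resource per TTL window. The trade-off is that cached topology data can be stale for up to `ttl_seconds`.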