...
The principles outlined in the Four Golden Signals developed by Google Site Reliability Engineers have been adopted to define the key metrics for Policy Fwk components: API, PAP, Policy-Distribution, Policy-DB, PDPs (APEX, Drools, XACML).
- Request Rate - Number of requests per second served by Policy services, i.e. by API and PAP; number of requests/events per second processed by the PDPs.
- Errors - Number of those processed requests/events that fail.
- Latency/Duration (expressed as a time interval) - Amount of time those requests take; for PDPs, the relevant metrics denote event processing times.
- Saturation - Degree of fullness or % utilization of a service, emphasizing the resources that are most constrained: CPU, memory, I/O, and custom metrics by domain.
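As an illustration of how the four signals above relate to raw counters, here is a minimal Python sketch. It assumes a service exposes cumulative counters that are sampled at two points in time; the `CounterSample` shape and all field names are hypothetical and are not taken from any Policy Fwk component.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Cumulative counters read at one instant (hypothetical shape)."""
    timestamp_s: float           # when the sample was taken, in seconds
    requests_total: int          # requests/events processed so far
    errors_total: int            # requests/events that failed so far
    duration_seconds_sum: float  # total processing time so far, in seconds


def golden_signals(prev: CounterSample, curr: CounterSample) -> dict:
    """Derive request rate, error ratio and mean latency between two samples."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    requests = curr.requests_total - prev.requests_total
    errors = curr.errors_total - prev.errors_total
    duration = curr.duration_seconds_sum - prev.duration_seconds_sum
    return {
        "request_rate_per_s": requests / elapsed,
        "error_ratio": errors / requests if requests else 0.0,
        "mean_latency_s": duration / requests if requests else 0.0,
    }


def saturation_percent(used: float, limit: float) -> float:
    """Express utilization of a constrained resource as a percentage."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return 100.0 * used / limit
```

The sketch reports a mean latency only; percentile latencies would need histogram data, which is outside what this example assumes.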
System Metrics that apply to all Policy components
These metrics have been available and exposed via a Prometheus endpoint since the Istanbul release.
Note: Standard metrics are already exposed for Policy DB (MariaDB) via common charts.
Metric | Prometheus Query |
---|---|
Memory usage | rate(jvm_memory_bytes_used[30s])*100 |
CPU Usage | rate(process_cpu_seconds_total[30s])*100 |
JVM threads | jvm_threads_current |
Process uptime | time() - process_start_time_seconds |
Garbage Collectors | GCs per second: rate(jvm_gc_collection_seconds_count[1m]); Avg GC time: rate(jvm_gc_collection_seconds_sum[1m]) / rate(jvm_gc_collection_seconds_count[1m]) |
Note: SSL certificate expiry is a key metric to alert on; however, this can be handled outside the scope of Policy Fwk.
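To make the arithmetic behind the garbage collector and uptime entries in the system metrics table concrete, the following Python sketch reproduces it on raw counter values. It is a simplified illustration only: `counter_rate` ignores the counter-reset handling and extrapolation that the PromQL `rate()` function performs, and the function names are hypothetical.

```python
def counter_rate(prev_value: float, curr_value: float, elapsed_s: float) -> float:
    """Per-second increase of a monotonically increasing counter.

    Simplified stand-in for PromQL rate(): no reset handling, no extrapolation.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (curr_value - prev_value) / elapsed_s


def avg_gc_time_s(sum_prev: float, sum_curr: float,
                  count_prev: float, count_curr: float,
                  elapsed_s: float) -> float:
    """Average time per collection: rate(..._seconds_sum) / rate(..._seconds_count)."""
    gc_per_second = counter_rate(count_prev, count_curr, elapsed_s)
    if gc_per_second == 0:
        return 0.0  # no collections in the window
    return counter_rate(sum_prev, sum_curr, elapsed_s) / gc_per_second


def uptime_s(now_s: float, process_start_time_seconds: float) -> float:
    """Uptime derived from the start timestamp reported by the process."""
    return now_s - process_start_time_seconds
```

For example, 12 collections taking 0.6 seconds in total over a 60 second window give 0.2 GCs per second and an average GC time of 0.05 seconds.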
Key metrics for Policy API
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of policy-api service | Yes | Yes | Exposed by policy-api healthcheck and policy-pap consolidated healthcheck. |
Latency | Yes | Yes | To be implemented for all CRUD endpoints exposed by policy-api. Sample s3p numbers for policy-api stress tests. |
Successful API request counter | Yes | Yes | Prometheus query for the number of successful API calls per minute |
Failed API request counter | Yes | Yes | Prometheus query for the number of API calls per minute with status codes outside the 2xx family |
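The split between the successful and failed request counters above can be sketched in Python as follows. This is an illustrative assumption about how such counters could be bucketed, not the policy-api implementation; the `(timestamp, status code)` event shape and the function names are hypothetical.

```python
from collections import Counter


def is_success(status_code: int) -> bool:
    """A call counts as successful when its status code is in the 2xx family."""
    return 200 <= status_code <= 299


def count_per_minute(events):
    """Bucket (epoch_seconds, status_code) pairs into per-minute counters.

    Returns {minute_start_epoch: Counter({"success": n, "failure": m})}.
    """
    buckets = {}
    for timestamp_s, status_code in events:
        minute_start = int(timestamp_s // 60) * 60
        bucket = buckets.setdefault(minute_start, Counter())
        bucket["success" if is_success(status_code) else "failure"] += 1
    return buckets
```

Note that under this definition redirects (3xx) are counted as failures along with 4xx and 5xx responses, which matches the "outside the 2xx family" wording but may be stricter than some deployments want.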
Key metrics for Policy PAP
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of the policy-pap service | Yes | Yes | Exposed by the policy-pap healthcheck API. |
Successful API request counter (request rate, API requests per minute) | Yes | Yes | To be implemented for all the endpoints exposed by policy-pap. Sample s3p numbers for policy-pap stress tests. |
Failed API request counter (failure rate, API errors per minute) | Yes | Yes | To be implemented for all the endpoints exposed by policy-pap. Number of API calls per minute with status codes outside the 2xx family. |
Latency | Yes | Yes | To be implemented for all the endpoints exposed by policy-pap. |
Policy deployment statistics (policyDeployFailureCount) | Yes | Yes | |
Key metrics for Policy Distribution
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of the policy-distribution service | Yes | Yes | Exposed by policy-distribution healthcheck and consolidated policy-pap healthcheck |
Successful API request counter | Yes | Yes | To be implemented for all the endpoints exposed by policy-distribution. Sample s3p numbers for policy-distribution stress tests. |
Failed API request counter | Yes | Yes | To be implemented for all the endpoints exposed by policy-distribution. Number of API calls per minute with status codes outside the 2xx family. |
Latency | Yes | Yes | To be implemented for all the endpoints exposed by policy-distribution. |
Policy distribution statistics distributions | Yes | Yes |
Key metrics for Policy APEX PDP
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of policy-apex-pdp | Yes | Yes | Exposed by policy-apex-pdp healthcheck and policy-pap consolidated healthcheck. |
TOSCA Policy Deployment counter (per apex-pdp instance) policyDeployCount | Yes | Yes | Exposed by policy-pap statistics. The engine count can be inferred from the size of the JSON array object "engineStats". |
TOSCA Policy Execution counter (per apex-pdp instance): number of policies executed | Yes | Yes | Note: the stats currently display APEX policy counters. |
Engine stats (by engineID, per apex-pdp instance) | Yes | Yes | eventCount: number of APEX events processed. averageExecutionTime: average time taken to process an APEX policy. Count of events processed (per engine thread, per apex-pdp instance): number of incoming trigger events processed by policy-apex-pdp, number processed successfully, and number that resulted in a failure. Time at which the policy engine was last started (uptime is derived from this metric). Note: the stats currently display APEX event counters processed by the engine. |
Latency | Yes | Yes | Time taken to process an incoming APEX event. Note: the stats currently display the execution time for processing an APEX policy, which is a measure of system saturation and is sufficient. |
Kafka consumer lag | No | No | Can be implemented outside of the Policy FWK. Monitor kafka consumer lag increase for kafka/dmaap-message-router topics related to apex-pdp |
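The APEX table notes that the engine count can be inferred from the size of the "engineStats" array and names the `eventCount` and `averageExecutionTime` fields. The Python sketch below shows one way to summarize such a payload. The overall JSON layout is an assumption made for illustration; only the three names just mentioned come from this page, and the event-weighted mean is a choice of this example rather than something the statistics API is known to provide.

```python
import json


def summarize_engine_stats(payload: str) -> dict:
    """Summarize a statistics document containing an "engineStats" array.

    Assumes each array element is an object that may carry "eventCount"
    and "averageExecutionTime"; missing fields are treated as zero.
    """
    document = json.loads(payload)
    engines = document.get("engineStats", [])
    total_events = sum(e.get("eventCount", 0) for e in engines)
    # Weight each engine's average by the number of events it processed.
    weighted_time = sum(
        e.get("eventCount", 0) * e.get("averageExecutionTime", 0.0)
        for e in engines
    )
    return {
        "engine_count": len(engines),
        "total_events": total_events,
        "mean_execution_time": weighted_time / total_events if total_events else 0.0,
    }
```

The unit of `averageExecutionTime` is not stated on this page, so the sketch passes the value through without converting it.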
Key metrics for Policy Drools PDP
Note: Drools PDP counters are exposed on a per-control-loop implementation basis.
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of policy-drools-pdp | Yes | No | Exposed by policy-drools-pdp healthcheck and policy-pap consolidated healthcheck. |
Policy Deployment counter (per drools-pdp instance) policyDeployCount | Yes | No | |
Policy Execution counter (per drools-pdp instance) policyExecutedCount | Yes | No | |
Latency | No | No | Time taken for an incoming event to be processed by the Drools controller. |
Count of Drools facts | No | No | An ever-increasing number of Drools facts can lead to an out-of-memory condition. |
Kafka consumer lag | No | No | Can be implemented outside of the Policy FWK. Monitor Kafka consumer lag increase for kafka/dmaap-message-router topics related to drools. |
Key metrics for Policy XACML PDP
TODO: The statistics exposed can be more granular
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of policy-xacml-pdp | Yes | No | Exposed by policy-pap consolidated healthcheck. Also exposed by the XACML healthcheck API. |
Policy Deployment counter totalPoliciesCount | Yes | No | XACML PDP statistics API |
Policy execution error counter totalErrorCount | Yes | No | |
Policy execution success counter by type permitDecisionsCount | Yes | No | |
Latency | No | No | Time taken for an incoming event to be processed via the XACML policies. |
Kafka consumer lag | No | No | Can be implemented outside of the Policy FWK. Monitor Kafka consumer lag increase for kafka/dmaap-message-router topics related to XACML. |
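The Kafka consumer lag rows in the PDP tables refer to the gap between the latest offset written to a partition and the offset the consumer group has committed. A minimal Python sketch of that calculation follows. It works on plain dictionaries keyed by partition number rather than on a Kafka client, and both function names are hypothetical; treating a partition with no committed offset as offset 0 is also an assumption of this example.

```python
def consumer_lag(end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: log end offset minus committed offset.

    Partitions without a committed offset are assumed to start at 0.
    """
    lag = {}
    for partition, end_offset in end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end_offset - committed, 0)
    return lag


def lag_is_increasing(total_lag_samples: list) -> bool:
    """True when total lag grew between every pair of consecutive samples."""
    return len(total_lag_samples) >= 2 and all(
        later > earlier
        for earlier, later in zip(total_lag_samples, total_lag_samples[1:])
    )
```

Alerting on a sustained increase, rather than on a single non-zero reading, avoids firing on the short-lived lag that appears whenever a burst of messages arrives.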