...
Metric | Prometheus Query |
---|---|
Memory usage | rate(jvm_memory_bytes_used[30s])*100 |
CPU Usage | rate(process_cpu_seconds_total[30s])*100 |
JVM threads | jvm_threads_current |
Process uptime | time() - process_start_time_seconds |
Garbage Collectors | GCs per second: rate(jvm_gc_collection_seconds_count[1m]); Avg GC time: rate(jvm_gc_collection_seconds_sum[1m]) / rate(jvm_gc_collection_seconds_count[1m]) |
Note: SSL certificate expiry is a key metric to alert on; however, it can be handled outside the scope of Policy Fwk.
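
The garbage collection and uptime entries above are ratios and differences of raw counters. The following is a minimal sketch of that arithmetic, assuming two scrapes of the same counters; the `Sample` class and function names are hypothetical, and the real `rate()` function additionally handles counter resets and extrapolation, which this sketch does not.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Sample:
    """One scrape of the GC counters (hypothetical container, not a Prometheus type)."""
    timestamp: float         # scrape time in seconds
    gc_seconds_sum: float    # value of jvm_gc_collection_seconds_sum
    gc_seconds_count: float  # value of jvm_gc_collection_seconds_count


def per_second_rate(old: float, new: float, elapsed: float) -> float:
    """Simplified stand-in for rate(): counter increase divided by elapsed seconds."""
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (new - old) / elapsed


def gc_stats(first: Sample, last: Sample) -> Tuple[float, float]:
    """Return (GCs per second, average seconds per GC) between two scrapes."""
    elapsed = last.timestamp - first.timestamp
    gcs_per_second = per_second_rate(first.gc_seconds_count, last.gc_seconds_count, elapsed)
    gc_time_per_second = per_second_rate(first.gc_seconds_sum, last.gc_seconds_sum, elapsed)
    # Avoid dividing by zero when no collection happened in the window.
    average = gc_time_per_second / gcs_per_second if gcs_per_second else 0.0
    return gcs_per_second, average


def uptime_seconds(now: float, process_start_time_seconds: float) -> float:
    """Uptime is the current time minus the recorded process start time."""
    return now - process_start_time_seconds
```

For example, 30 collections taking 1.5 seconds in total over a 60 second window give 0.5 GCs per second and an average of 0.05 seconds per collection.
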
Key metrics for Policy API
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of policy-api service | Yes | NoYes | Exposed by policy-api healthcheck and policy-pap consolidated healthcheck. |
Latency | NoYesNo | Yes | To be implemented for all CRUD endpoints exposed by policy-api. Sample s3p numbers for policy-api stress tests. |
Successful API request counter | NoYesNo | Yes | Prometheus query for number of successful API calls per minute |
Failed API request counter | NoYesNo | Yes | Prometheus query for number of API calls with non-2xx status codes per minute |
SSL certificate expiry time | No | No | Can be done outside the scope of Policy Fwk |
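
The success and failure counters in the table split requests by status code family. A minimal sketch of that classification, assuming request records arrive as (timestamp in seconds, HTTP status code) pairs; the function names and the record shape are hypothetical and do not reflect how policy-api implements its counters.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def is_success(status_code: int) -> bool:
    """A call counts as successful when its status code is in the 2xx family."""
    return 200 <= status_code <= 299


def calls_per_minute(requests: Iterable[Tuple[float, int]]) -> Dict[int, Counter]:
    """Bucket (timestamp_seconds, status_code) pairs into per-minute counts.

    Keys are minute indexes (timestamp // 60); each value counts
    "success" and "failure" calls seen in that minute.
    """
    buckets: Dict[int, Counter] = {}
    for timestamp, status_code in requests:
        minute = int(timestamp // 60)
        counts = buckets.setdefault(minute, Counter())
        counts["success" if is_success(status_code) else "failure"] += 1
    return buckets
```
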
Key metrics for Policy PAP
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of the policy-pap service | Yes | No | policy-pap healthcheck API |
Status of PDPs as registered with policy-pap | Yes | No | policy-pap consolidated healthcheck API |
Successful API request counter | NoYesNo | Yes | To be implemented for all the endpoints exposed by policy-pap. Sample s3p numbers for policy-pap stress tests. |
Failed API request counter | NoYesNo | Yes | To be implemented for all the endpoints exposed by policy-pap. Number of API calls with non-2xx status codes per minute |
Latency | NoYesNo | Yes | To be implemented for all the endpoints exposed by policy-pap. |
Policy deployment statistics policyDeployFailureCount | Yes | NoYes | Sample: |
SSL certificate expiry time | No | No | Can be done outside the scope of Policy Fwk |
Key metrics for Policy Distribution
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of the policy-distribution service | Yes | NoYes | Exposed by policy-distribution healthcheck and consolidated policy-pap healthcheck |
Successful API request counter | NoYesNo | Yes | To be implemented for all the endpoints exposed by policy-distribution. Sample s3p numbers for policy-distribution stress tests. |
Failed API request counter | NoYesNo | Yes | To be implemented for all the endpoints exposed by policy-distribution. Number of API calls with non 200 family of status codes per minute |
Latency | NoYesNo | Yes | To be implemented for all the endpoints exposed by policy-distribution. |
Policy distribution statistics distributions | Yes | NoYes | |
SSL certificate expiry time | No | No | Can be done outside the scope of Policy Fwk |
Key metrics for Policy APEX PDP
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of policy-apex-pdp | Yes | NoYes | Exposed by policy-apex-pdp healthcheck and policy-pap consolidated healthcheck. |
TOSCA Policy Deployment counter (per apex-pdp instance) policyDeployCount | Yes | NoYes | Exposed by policy-pap statistics |
TOSCA Policy Execution counter (per apex-pdp instance): number of policies executed. Note: the stats currently display APEX policy counters | No | No | |
Engine count (can be inferred from the size of the JSON array object "engineStats") | Yes | No | |
Engine availability details | Yes | Yes | |
Engine stats (by engineID per apex-pdp instance) | | | |
Latency | No | No | |
engineTimestamp: timestamp at which the statistics were recorded; eventCount: number of APEX events processed | Yes | No | |
Count of events processed (per engine thread, per apex-pdp instance): number of incoming trigger events processed by policy-apex-pdp. Note: the stats currently display APEX event counters processed by the engine | No | No | |
... (uptime is derived from this metric) | Yes | Yes | |
Latency | Yes | Yes | Time taken to process an incoming trigger event (APEX event). Note: the stats currently display the execution time for processing an APEX policy, which is a measure of system saturation and is sufficient. |
Kafka consumer lag | No | No | Can be implemented outside of the Policy Fwk. Monitor Kafka consumer lag increase for kafka/dmaap-message-router topics related to apex-pdp |
SSL certificate expiry time | No | No | Can be done outside the scope of Policy Fwk |
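
The table notes that the engine count can be inferred from the size of the "engineStats" JSON array. A minimal sketch of that inference, assuming a statistics payload that carries "engineStats" as a top-level array whose entries include "eventCount"; the payload shape, the "engineId" field and the function names are illustrative assumptions, not the documented apex-pdp response format.

```python
import json
from typing import Any, Dict, List


def load_engine_stats(stats_json: str) -> List[Dict[str, Any]]:
    """Extract the "engineStats" array from a statistics payload (assumed shape)."""
    payload = json.loads(stats_json)
    engines = payload.get("engineStats", [])
    if not isinstance(engines, list):
        raise ValueError('"engineStats" is expected to be a JSON array')
    return engines


def engine_count(stats_json: str) -> int:
    """Engine count is the number of entries in the "engineStats" array."""
    return len(load_engine_stats(stats_json))


def total_event_count(stats_json: str) -> int:
    """Sum eventCount across engines; entries without the field count as zero."""
    return sum(int(engine.get("eventCount", 0)) for engine in load_engine_stats(stats_json))


# Hypothetical payload for illustration only.
SAMPLE = json.dumps({
    "engineStats": [
        {"engineId": "engine-0", "engineTimestamp": 1700000000000, "eventCount": 12},
        {"engineId": "engine-1", "engineTimestamp": 1700000000500, "eventCount": 30},
    ]
})
```

With the sample payload, `engine_count(SAMPLE)` counts two engines and `total_event_count(SAMPLE)` adds their event counters.
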
Key metrics for Policy Drools PDP
...
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of policy-drools-pdp | Yes | No | Exposed by policy-drools-pdp healthcheck and policy-pap consolidated healthcheck. |
Policy Deployment counter (per drools-pdp instance) policyDeployCount | Yes | No | Sample: |
Policy Execution counter (per drools-pdp instance) policyExecutedCount | Yes | No | |
Latency | No | No | Time taken for an incoming event to be processed by the drools controller. |
Count of Drools facts | No | No | An ever-increasing number of Drools facts can lead to an out-of-memory error. |
Kafka consumer lag | No | No | Can be implemented external to the Policy Fwk. Monitor Kafka consumer lag increase for kafka/dmaap-message-router topics related to drools |
SSL certificate expiry time | No | No | Can be done outside the scope of Policy Fwk |
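
The Kafka consumer lag row asks for monitoring an increase in lag rather than a single value. A minimal sketch of one way to do that, assuming per-partition end offsets and committed offsets have already been collected by some external tool; the function names and the three-sample threshold are illustrative assumptions.

```python
from typing import Dict, List


def consumer_lag(end_offsets: Dict[int, int], committed_offsets: Dict[int, int]) -> int:
    """Total lag across partitions: log end offset minus committed offset.

    Partitions without a committed offset are treated as starting from 0,
    and negative differences are clamped to 0.
    """
    return sum(
        max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    )


def lag_is_increasing(lag_samples: List[int], min_samples: int = 3) -> bool:
    """True when the most recent min_samples readings are strictly increasing."""
    recent = lag_samples[-min_samples:]
    if len(recent) < min_samples:
        return False
    return all(earlier < later for earlier, later in zip(recent, recent[1:]))
```

A steady non-zero lag is often acceptable; a lag that keeps growing across consecutive samples suggests the PDP is not keeping up with the topic.
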
Key metrics for Policy XACML PDP
...
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of policy-xacml-pdp | Yes | No | Exposed by policy-pap consolidated healthcheck. Also exposed by the XACML healthcheck API |
Policy Deployment counter totalPoliciesCount | Yes | No | XACML PDP statistics API |
Policy execution error counter totalErrorCount | Yes | No | |
Policy execution success counter by type permitDecisionsCount | Yes | No | |
Latency | No | No | Time taken for an incoming event to be processed via the XACML policies. |
Kafka consumer lag | No | No | Can be implemented external to the Policy Fwk. Monitor Kafka consumer lag increase for kafka/dmaap-message-router topics related to XACML |
SSL certificate expiry time | No | No | Can be done outside the scope of Policy Fwk |