...
Metric | Prometheus Query |
---|---|
Memory usage | rate(jvm_memory_bytes_used[30s])*100 |
CPU Usage | rate(process_cpu_seconds_total[30s])*100 |
JVM threads | jvm_threads_current |
Process uptime | time() - process_start_time_seconds |
Garbage Collectors | GCs per second: rate(jvm_gc_collection_seconds_count[1m]); Avg GC time: rate(jvm_gc_collection_seconds_sum[1m]) / rate(jvm_gc_collection_seconds_count[1m]) |
Note: SSL certificate expiry is a key metric to alert on; however, it can be handled outside the scope of the Policy Framework.
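
The garbage-collector and uptime rows above come down to simple arithmetic on raw samples. The following is a minimal Python sketch of that arithmetic, assuming two scrapes of the same JVM counters. The `GcSample` type and all sample values are hypothetical, and the simplified rate ignores the counter resets and extrapolation that the PromQL `rate()` function handles.

```python
from dataclasses import dataclass


@dataclass
class GcSample:
    """One scrape of the JVM GC counters (hypothetical container type)."""
    timestamp: float    # scrape time in seconds
    seconds_sum: float  # value of jvm_gc_collection_seconds_sum
    count: float        # value of jvm_gc_collection_seconds_count


def per_second_rate(old: float, new: float, elapsed: float) -> float:
    """Simplified stand-in for rate(): counter increase divided by elapsed seconds."""
    if elapsed <= 0:
        raise ValueError("samples must be ordered in time")
    return (new - old) / elapsed


def gc_summary(first: GcSample, second: GcSample) -> dict:
    """Return GCs per second and average seconds per GC between two scrapes."""
    elapsed = second.timestamp - first.timestamp
    gcs_per_second = per_second_rate(first.count, second.count, elapsed)
    gc_time_per_second = per_second_rate(first.seconds_sum, second.seconds_sum, elapsed)
    # Average GC time is time spent in GC divided by the number of GCs.
    avg_gc_seconds = gc_time_per_second / gcs_per_second if gcs_per_second > 0 else 0.0
    return {"gcs_per_second": gcs_per_second, "avg_gc_seconds": avg_gc_seconds}


def uptime_seconds(now: float, process_start_time_seconds: float) -> float:
    """Uptime is the current time minus the recorded process start time."""
    return now - process_start_time_seconds
```

For example, `gc_summary(GcSample(0.0, 1.0, 10), GcSample(60.0, 1.6, 22))` describes 12 collections taking 0.6 seconds in total over one minute.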
Key metrics for Policy API
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of policy-api service | Yes | No | Exposed by policy-api healthcheck and policy-pap consolidated healthcheck. |
Latency | No | No | To be implemented for all CRUD endpoints exposed by policy-api. Sample s3p numbers for policy-api stress tests. |
Successful API request counter | No | No | Prometheus query for the number of successful API calls per minute. |
Failed API request counter | No | No | Prometheus query for the number of API calls per minute with a non-2xx status code. |
SSL certificate expiry time | No | No | Can be done outside the scope of the Policy Framework. |
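
The successful and failed request counters in the table above are not implemented yet. As an illustration only, the following Python sketch shows the per-minute bucketing they describe. It assumes that "failed" means any status code outside the 2xx family, and that requests arrive as hypothetical `(epoch_seconds, http_status)` pairs. It is not the policy-api implementation.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def classify(status: int) -> str:
    """Treat 2xx as success and everything else as failure (assumed definition)."""
    return "success" if 200 <= status <= 299 else "failure"


def calls_per_minute(requests: Iterable[Tuple[float, int]]) -> Dict[int, Counter]:
    """Group (epoch_seconds, http_status) pairs into one-minute buckets.

    Returns a mapping from the start of each minute (epoch seconds) to a
    Counter with "success" and "failure" totals for that minute.
    """
    buckets: Dict[int, Counter] = {}
    for timestamp, status in requests:
        minute_start = int(timestamp // 60) * 60
        buckets.setdefault(minute_start, Counter())[classify(status)] += 1
    return buckets
```

In a real service these totals would normally be kept as monotonically increasing counters and the per-minute view derived at query time; the explicit buckets here only make the arithmetic visible.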
Key metrics for Policy PAP
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of the policy-pap service | Yes | No | policy-pap healthcheck API |
Status of PDPs as registered with policy-pap | Yes | No | policy-pap consolidated healthcheck API |
Successful API request counter | No | No | To be implemented for all the endpoints exposed by policy-pap. Sample s3p numbers for policy-pap stress tests. |
Failed API request counter | No | No | To be implemented for all the endpoints exposed by policy-pap. Number of API calls per minute with a non-2xx status code. |
Latency | No | No | To be implemented for all the endpoints exposed by policy-pap. |
Policy deployment statistics: policyDeployFailureCount | Yes | No | Sample: |
SSL certificate expiry time | No | No | Can be done outside the scope of the Policy Framework. |
Key metrics for Policy Distribution
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of the policy-distribution service | Yes | No | Exposed by policy-distribution healthcheck and consolidated policy-pap healthcheck. |
Successful API request counter | No | No | To be implemented for all the endpoints exposed by policy-distribution. Sample s3p numbers for policy-distribution stress tests. |
Failed API request counter | No | No | To be implemented for all the endpoints exposed by policy-distribution. Number of API calls per minute with a non-2xx status code. |
Latency | No | No | To be implemented for all the endpoints exposed by policy-distribution. |
Policy distribution statistics: distributions | Yes | No | |
SSL certificate expiry time | No | No | Can be done outside the scope of the Policy Framework. |
Key metrics for Policy APEX PDP
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of policy-apex-pdp | Yes | No | Exposed by policy-apex-pdp healthcheck and policy-pap consolidated healthcheck. |
Policy Deployment counter (per apex-pdp instance): policyDeployCount | Yes | No | Exposed by policy-pap statistics. |
Policy Execution counter (per apex-pdp instance): number of policies executed. Note: the statistics currently display APEX policy counters. | No | No | |
Engine count: can be inferred from the size of the JSON array "engineStats" | Yes | No | |
Engine availability details (by engineID, per apex-pdp instance). engineTimestamp: timestamp at which the statistics were recorded; eventCount: number of APEX events processed | Yes | No | |
Count of events processed (per engine thread, per apex-pdp instance): number of incoming trigger events processed by policy-apex-pdp. Note: the statistics currently display the APEX event counters processed by the engine. | No | No | |
Latency | No | No | Time taken by a TOSCA policy to process an incoming network trigger event. Note: the statistics currently display the execution time for processing an APEX policy. |
Kafka consumer lag | No | No | Can be implemented outside the Policy Framework. Monitor for increasing Kafka consumer lag on the kafka/dmaap-message-router topics related to apex-pdp. |
SSL certificate expiry time | No | No | Can be done outside the scope of the Policy Framework. |
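
The engine count row above is derived rather than reported directly. A minimal Python sketch of that derivation follows, assuming the statistics payload is a JSON object with an "engineStats" array whose entries carry `engineTimestamp` and `eventCount`. The exact payload shape is an assumption, and the sample values in the usage note are invented.

```python
import json


def engine_summary(stats_json: str) -> dict:
    """Derive engine-level figures from an apex-pdp statistics payload.

    The payload layout (a top-level "engineStats" array of objects with
    "eventCount") is assumed for illustration.
    """
    stats = json.loads(stats_json)
    engines = stats.get("engineStats") or []
    return {
        # Engine count is the size of the "engineStats" array.
        "engine_count": len(engines),
        # Sum of APEX events processed across all engines of this instance.
        "total_event_count": sum(e.get("eventCount", 0) for e in engines),
    }
```

For a payload such as `{"engineStats": [{"engineTimestamp": 1, "eventCount": 5}, {"engineTimestamp": 2, "eventCount": 7}]}`, the function reports two engines and twelve events in total.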
Key metrics for Policy Drools PDP
...
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of policy-drools-pdp | Yes | No | Exposed by policy-drools-pdp healthcheck and policy-pap consolidated healthcheck. |
Policy Deployment counter (per drools-pdp instance): policyDeployCount | Yes | No | Sample: |
Policy Execution counter (per drools-pdp instance): policyExecutedCount | Yes | No | |
Latency | No | No | Time taken for an incoming event to be processed by the drools controller. |
Count of Drools facts | No | No | An ever-increasing number of Drools facts can lead to an out-of-memory error. |
Kafka consumer lag | No | No | Can be implemented outside the Policy Framework. Monitor for increasing Kafka consumer lag on the kafka/dmaap-message-router topics related to drools. |
SSL certificate expiry time | No | No | Can be done outside the scope of the Policy Framework. |
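
Kafka consumer lag, as listed in the table above, is the gap between the latest offset written to a partition and the offset the consumer group has committed. The Python sketch below shows the calculation and a simple "lag keeps increasing" check. It assumes the offsets have already been collected by some external tool; the function names and the partition keys are hypothetical.

```python
from typing import Dict, List


def consumer_lag(end_offsets: Dict[str, int], committed_offsets: Dict[str, int]) -> Dict[str, int]:
    """Per-partition lag: log end offset minus committed offset, floored at zero.

    A partition with no committed offset is treated as committed at 0.
    """
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def lag_is_increasing(total_lags: List[int]) -> bool:
    """True when every observation is strictly larger than the previous one."""
    return len(total_lags) >= 2 and all(
        later > earlier for earlier, later in zip(total_lags, total_lags[1:])
    )
```

An alert would typically fire on a sustained increase over several observations rather than on a single high reading, since a short burst of incoming events is normal.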
Key metrics for Policy XACML PDP
...
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of policy-xacml-pdp | Yes | No | Exposed by policy-pap consolidated healthcheck and by the XACML healthcheck API. |
Policy Deployment counter: totalPoliciesCount | Yes | No | XACML PDP statistics API |
Policy execution error counter: totalErrorCount | Yes | No | |
Policy execution success counter by type: permitDecisionsCount | Yes | No | |
Latency | No | No | Time taken for an incoming event to be processed via the XACML policies. |
Kafka consumer lag | No | No | Can be implemented outside the Policy Framework. Monitor for increasing Kafka consumer lag on the kafka/dmaap-message-router topics related to XACML. |
SSL certificate expiry time | No | No | Can be done outside the scope of the Policy Framework. |