Policy Framework: Key metrics to monitor

Context

Collecting application metrics is the first step towards gaining insight into Policy Fwk services and infrastructure from the point of view of availability, performance, reliability and scalability.

The goal of monitoring is to meet the following operational needs:

Monitoring via dashboards: Provide a visual aid to display health and key metrics for use by OPS.
Alerting: Either something is broken and the issue must be addressed immediately, or something might break soon and proactive measures must be taken to avoid it.
Conducting retrospective analysis: Rich information that is readily available to better troubleshoot issues.
Analyzing trends: How fast is usage growing? What does the incoming traffic look like? This helps assess scaling needs to meet forecasted demand.

Policy Framework Key Metrics

The principles outlined in the Four Golden Signals developed by Google Site Reliability Engineers have been adopted to define the key metrics for Policy Fwk components: API, PAP, Policy-Distribution, Policy-DB, PDPs (APEX, Drools, XACML).

Request Rate - Number of requests per second served by Policy services, i.e. by API and PAP, and number of requests/events per second processed by the PDPs.
Errors - Number of those requests/events that fail.
Latency/Duration (expressed as a time interval) - Amount of time those requests take and, for PDPs, the relevant metrics denoting event processing times.
Saturation - Degree of fullness or % utilization of a service, emphasizing the resources that are most constrained: CPU, memory, I/O, and custom metrics by domain.
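
As an illustration of the four signals above, the following minimal sketch summarizes one observation window of requests. The `Request` record, the `golden_signals` function and the choice of a 95th percentile for latency are assumptions made for this example; they are not part of the Policy Fwk code base.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape, for illustration only)."""
    status: int        # HTTP status code returned to the caller
    duration_s: float  # time taken to serve the request, in seconds


def golden_signals(requests, window_s, cpu_busy_s):
    """Summarize one observation window into the four golden signals."""
    total = len(requests)
    # Any status outside the 2xx family is counted as an error.
    errors = sum(1 for r in requests if not 200 <= r.status < 300)
    durations = sorted(r.duration_s for r in requests)
    if durations:
        # Nearest-rank 95th percentile of the observed durations.
        rank = max(1, math.ceil(0.95 * total))
        p95 = durations[rank - 1]
    else:
        p95 = 0.0
    return {
        "request_rate_per_s": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "latency_p95_s": p95,
        # CPU is used here as the single saturation resource.
        "cpu_saturation_pct": 100.0 * cpu_busy_s / window_s,
    }
```

In practice these values would be derived by Prometheus from exported counters and histograms rather than computed from raw request records; the sketch only shows what each signal measures.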

System Metrics that apply to all Policy components

These metrics have been available and exposed via a Prometheus endpoint since the Istanbul release.

Note: Standard metrics are already exposed for Policy DB (MariaDB) via common charts.

Metric	Prometheus Query
Memory usage	jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"} * 100
CPU Usage	rate(process_cpu_seconds_total[30s])*100
JVM threads	jvm_threads_current jvm_threads_daemon
Process uptime	time() - process_start_time_seconds
Garbage Collectors	GCs per second: rate(jvm_gc_collection_seconds_count[1m]); Avg GC time: rate(jvm_gc_collection_seconds_sum[1m]) / rate(jvm_gc_collection_seconds_count[1m])
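
The counter-based queries in the table rely on per-second rates. The sketch below shows the underlying arithmetic on two counter samples, and how an average GC time follows from the ratio of two rates. It is a simplification: Prometheus' own `rate()` also extrapolates to the window boundaries, which is omitted here, and the function names are hypothetical.

```python
def per_second_rate(first, last):
    """Simplified rate between two (timestamp_s, counter_value) samples."""
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    # A counter that went down is treated as a restart from zero.
    increase = v1 - v0 if v1 >= v0 else v1
    return increase / (t1 - t0)


def cpu_usage_pct(first, last):
    """CPU seconds consumed per wall-clock second, as a percentage of one core."""
    return per_second_rate(first, last) * 100


def avg_gc_time_s(sum_first, sum_last, count_first, count_last):
    """Average duration of one GC: rate of GC seconds over rate of GC runs."""
    count_rate = per_second_rate(count_first, count_last)
    if count_rate == 0:
        return 0.0  # no collections happened in the window
    return per_second_rate(sum_first, sum_last) / count_rate
```

For example, a process whose CPU counter grows from 10 s to 25 s over a 30 s window is using about half of one core.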

Key metrics for Policy API

Metric	Metric available?	Exposed via Prometheus endpoint?	Comment
Availability of the policy-api service	Yes	No	Exposed by policy-api healthcheck and policy-pap consolidated healthcheck.
Latency	No	No	To be implemented for all CRUD endpoints exposed by policy-api. Sample s3p numbers for policy-api stress tests.
Successful API request counter	No	No	Prometheus query for the number of successful API calls per minute
Failed API request counter	No	No	Prometheus query for the number of API calls with non-2xx status codes per minute
SSL certificate expiry time	No	No	Can be done outside the scope of Policy Fwk
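
The successful and failed request counters in the table are not implemented yet. As a rough sketch of what they would count, the function below buckets API calls per minute and treats any status outside the 2xx family as a failure. The input shape and the name `count_by_outcome` are assumptions for illustration.

```python
from collections import Counter


def count_by_outcome(calls):
    """Count calls per (minute_start_epoch_s, outcome).

    `calls` is an iterable of (epoch_seconds, http_status) tuples.
    """
    counts = Counter()
    for ts, status in calls:
        minute = int(ts) // 60 * 60  # start of the minute the call falls in
        outcome = "success" if 200 <= status < 300 else "failure"
        counts[(minute, outcome)] += 1
    return counts
```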

Key metrics for Policy PAP

Metric	Metric available?	Exposed via Prometheus endpoint?	Comment
Availability of the policy-pap service	Yes	No	policy-pap healthcheck API
Status of PDPs as registered with policy-pap	Yes	No	policy-pap consolidated healthcheck API
Successful API request counter	No	No	To be implemented for all the endpoints exposed by policy-pap. Sample s3p numbers for policy-pap stress tests.
Failed API request counter	No	No	To be implemented for all the endpoints exposed by policy-pap. Number of API calls with non-2xx status codes per minute
Latency	No	No	To be implemented for all the endpoints exposed by policy-pap.
Policy deployment statistics: policyDeployFailureCount, policyDeploySuccessCount, totalPolicyDeployCount	Yes	No	Sample: GET /policy/pap/v1/statistics { "code": 200, "policyDeployFailureCount": 0, "policyDeploySuccessCount": 0, "policyDownloadFailureCount": 0, "policyDownloadSuccessCount": 0, "totalPdpCount": 0, "totalPdpGroupCount": 0, "totalPolicyDeployCount": 0, "totalPolicyDownloadCount": 0 }
SSL certificate expiry time	No	No	Can be done outside the scope of Policy Fwk
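
The deployment counters returned by the policy-pap statistics endpoint can be turned into a single failure ratio suitable for a dashboard or an alert. The sketch below assumes a response body with the fields shown in the sample for this table; the function name is hypothetical.

```python
import json


def deploy_failure_ratio(stats_json):
    """Fraction of policy deployments that failed, from a PAP statistics body."""
    stats = json.loads(stats_json)
    total = stats["totalPolicyDeployCount"]
    if total == 0:
        return 0.0  # nothing deployed yet, so nothing has failed
    return stats["policyDeployFailureCount"] / total
```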

Key metrics for Policy Distribution

Metric	Metric available?	Exposed via Prometheus endpoint?	Comment
Availability of the policy-distribution service	Yes	No	Exposed by policy-distribution healthcheck and consolidated policy-pap healthcheck
Successful API request counter	No	No	To be implemented for all the endpoints exposed by policy-distribution. Sample s3p numbers for policy-distribution stress tests.
Failed API request counter	No	No	To be implemented for all the endpoints exposed by policy-distribution. Number of API calls with non-2xx status codes per minute
Latency	No	No	To be implemented for all the endpoints exposed by policy-distribution.
Policy distribution statistics: distributions, distribution_complete_ok, distribution_complete_fail, downloads, downloads_ok, downloads_error	Yes	No	policy-distribution REST endpoint samples
SSL certificate expiry time	No	No	Can be done outside the scope of Policy Fwk
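
SSL certificate expiry is listed as something that can be monitored outside Policy Fwk. One possible way to express it as a number is the remaining lifetime in days, sketched below. The `not_after` string is assumed to use the format that Python's `ssl` module reports for a peer certificate's `notAfter` field; how the certificate is obtained is left out.

```python
import ssl

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now_epoch_s):
    """Days left before a certificate expires (negative once it has expired).

    `not_after` is a string such as "Jan  5 09:34:43 2018 GMT".
    """
    expiry_epoch_s = ssl.cert_time_to_seconds(not_after)
    return (expiry_epoch_s - now_epoch_s) / SECONDS_PER_DAY
```

An alert could then fire when the returned value drops below a chosen threshold, for example 30 days.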

Key metrics for Policy APEX PDP

Metric	Metric available?	Exposed via Prometheus endpoint?	Comment
Availability of policy-apex-pdp	Yes	No	Exposed by policy-apex-pdp healthcheck and policy-pap consolidated healthcheck.
Policy Deployment counter (per apex-pdp instance): policyDeployCount, policyDeploySuccessCount, policyDeployFailCount	Yes	No	Exposed by policy-pap statistics: GET /policy/pap/v1/statistics/defaultGroup/apex { "defaultGroup": { "apex": [ { "pdpInstanceId": "devdev-policy-apex-pdp-0", "timeStamp": "2021-09-07T20:10:52.242Z", "pdpGroupName": "defaultGroup", "pdpSubGroupName": "apex", "policyDeployCount": 2, "policyDeploySuccessCount": 2, "policyDeployFailCount": 0, "policyExecutedCount": 0, "policyExecutedSuccessCount": 0, "policyExecutedFailCount": 0, "engineStats": [ { "engineId": "NSOApexEngine-0:0.0.1", "engineWorkerState": "READY", "engineTimeStamp": 1630550345549, "eventCount": 0, "lastExecutionTime": 0, "averageExecutionTime": 0, "upTime": 0, "lastEnterTime": 0, "lastStart": 1630550345549 }, ...... ] } ] } }
Policy Execution counter (per apex-pdp instance): # of policies executed, # of policies executed with success status, # of policies executed with a failure status. Note: the stats currently display APEX policy counters.	No	No
Engine count (can be inferred from the size of the JSON array "engineStats")	Yes	No
Engine availability details (by engineId per apex-pdp instance): engineTimeStamp: timestamp at which the statistics were recorded; engineWorkerState: possible values defined in AxEngineState; upTime: time elapsed since the policy engine was started; lastStart: time at which the policy engine was last started	Yes	No
Count of events processed (per engine thread, per apex-pdp instance): # of incoming trigger events processed by policy-apex-pdp, # of incoming trigger events processed successfully by policy-apex-pdp, # of incoming trigger events processed by policy-apex-pdp that resulted in a failure. Note: the stats currently display APEX event counters processed by the engine.	No	No
Latency	No	No	Time taken by a TOSCA policy to process an incoming network trigger event. Note: the stats currently display the execution time for processing an APEX policy.
Kafka consumer lag	No	No	Can be implemented outside of the Policy Fwk. Monitor increases in Kafka consumer lag for the kafka/dmaap-message-router topics related to apex-pdp.
SSL certificate expiry time	No	No	Can be done outside the scope of Policy Fwk
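
The engine count and engine availability rows can be derived from the "engineStats" array in the policy-pap statistics response. The following sketch assumes a response shaped like the sample for this table; the function name and the summary layout are hypothetical.

```python
import json
from collections import Counter


def engine_summary(stats_json, group="defaultGroup", subgroup="apex"):
    """Per apex-pdp instance: number of engines and how many are in each state."""
    stats = json.loads(stats_json)
    summary = {}
    for pdp in stats[group][subgroup]:
        engines = pdp.get("engineStats", [])
        summary[pdp["pdpInstanceId"]] = {
            # Engine count is simply the size of the engineStats array.
            "engine_count": len(engines),
            "states": dict(Counter(e["engineWorkerState"] for e in engines)),
        }
    return summary
```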

Key metrics for Policy Drools PDP

Note: Drools PDP counters are exposed on a per control-loop implementation basis.

Metric	Metric available?	Exposed via Prometheus endpoint?	Comment
Availability of policy-drools-pdp	Yes	No	Exposed by policy-drools-pdp healthcheck and policy-pap consolidated healthcheck. Also available through the telemetry feature-lifecycle status API: http://localhost:9696/policy/pdp/engine> get /policy/pdp/engine/lifecycle/state HTTP/1.1 200 OK Content-Length: 8 Content-Type: application/json Date: Thu, 11 Nov 2021 16:36:13 GMT Server: Jetty(9.4.33.v20201020) "ACTIVE"
Policy Deployment counter (per drools-pdp instance): policyDeployCount, policyDeploySuccessCount, policyDeployFailCount	Yes	No	Sample: GET /policy/pap/v1/statistics/defaultGroup/drools { "defaultGroup":{ "drools":[ { "pdpInstanceId":"dev-policy-drools-pdp-0", "timeStamp":"2021-09-07T20:09:34.160Z", "pdpGroupName":"defaultGroup", "pdpSubGroupName":"drools", "policyDeployCount":54, "policyDeploySuccessCount":54, "policyDeployFailCount":0, "policyExecutedCount":1, "policyExecutedSuccessCount":1, "policyExecutedFailCount":0, "engineStats":[ ] } ] } }
Policy Execution counter (per drools-pdp instance): policyExecutedCount, policyExecutedSuccessCount, policyExecutedFailCount	Yes	No
Latency	No	No	Time taken for an incoming event to be processed by drools controller.
Count of Drools facts	No	No	An ever-increasing number of Drools facts can lead to an out-of-memory error.
Kafka consumer lag	No	No	Can be implemented outside of the Policy Fwk. Monitor increases in Kafka consumer lag for the kafka/dmaap-message-router topics related to drools.
SSL certificate expiry time	No	No	Can be done outside the scope of Policy Fwk
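
Kafka consumer lag, mentioned in the PDP tables, is the gap between the latest offset written to a partition and the offset the consumer has committed. The sketch below shows only that arithmetic on plain dictionaries. It does not call any Kafka client, the function names are hypothetical, and treating a partition without a committed offset as committed at 0 is an assumption of this example.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Return (lag per partition, total lag) from partition -> offset mappings."""
    per_partition = {}
    for partition, end in end_offsets.items():
        # Assumption: a partition with no committed offset counts from 0.
        committed = committed_offsets.get(partition, 0)
        per_partition[partition] = max(0, end - committed)
    return per_partition, sum(per_partition.values())


def lag_is_increasing(lag_samples):
    """True when every successive total-lag sample is larger than the last."""
    return len(lag_samples) >= 2 and all(
        later > earlier for earlier, later in zip(lag_samples, lag_samples[1:])
    )
```

A steadily increasing total lag suggests the PDP is consuming events more slowly than they arrive.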

Key metrics for Policy XACML PDP

TODO: The statistics exposed can be more granular

Metric	Metric available?	Exposed via Prometheus endpoint?	Comment
Availability of policy-xacml-pdp	Yes	No	Exposed by policy-pap consolidated healthcheck. Also exposed by the XACML healthcheck API: GET /policy/pdpx/v1/healthcheck ~ $ curl --location --request GET 'http://policy-xacml-pdp.fra-fireants-dev.svc.cluster.local:6969/policy/pdpx/v1/healthcheck' --header 'Authorization: Basic ****' { "name": "Policy Xacml PDP", "url": "self", "healthy": true, "code": 200, "message": "alive" }
Policy Deployment counter: totalPoliciesCount, totalPolicyTypesCount	Yes	No	XACML PDP statistics API: GET /policy/pdpx/v1/statistics ~ $ curl --location --request GET 'http://policy-xacml-pdp.fra-fireants-dev.svc.cluster.local:6969/policy/pdpx/v1/statistics' --header 'Authorization: Basic ****' { "code": 200, "totalPolicyTypesCount": 18, "totalPoliciesCount": 1, "totalErrorCount": 0, "permitDecisionsCount": 0, "denyDecisionsCount": 0, "indeterminantDecisionsCount": 0, "notApplicableDecisionsCount": 1 }
Policy execution error counter: totalErrorCount	Yes	No
Policy execution success counter by type: permitDecisionsCount, denyDecisionsCount, indeterminantDecisionsCount, notApplicableDecisionsCount	Yes	No
Latency	No	No	Time taken for an incoming event to be processed via the XACML policies.
Kafka consumer lag	No	No	Can be implemented outside of the Policy Fwk. Monitor increases in Kafka consumer lag for the kafka/dmaap-message-router topics related to XACML.
SSL certificate expiry time	No	No	Can be done outside the scope of Policy Fwk
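
The per-type decision counters from the XACML PDP statistics API can be summarized as shares of all decisions, which makes shifts in the decision mix easier to spot. The sketch below assumes a response body with the fields shown in the statistics sample for this table; the function name is hypothetical.

```python
import json

# Field names as they appear in the XACML PDP statistics sample.
DECISION_FIELDS = (
    "permitDecisionsCount",
    "denyDecisionsCount",
    "indeterminantDecisionsCount",
    "notApplicableDecisionsCount",
)


def decision_shares(stats_json):
    """Share of each decision type among all decisions (0.0 when there are none)."""
    stats = json.loads(stats_json)
    counts = {field: stats.get(field, 0) for field in DECISION_FIELDS}
    total = sum(counts.values())
    return {
        field: (count / total if total else 0.0)
        for field, count in counts.items()
    }
```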