Page Comparison

...

The principles outlined in the Four Golden Signals, developed by Google Site Reliability Engineers, have been adopted to define the key metrics for the Policy Framework components: API, PAP, Policy-Distribution, Policy-DB, and the PDPs (APEX, Drools, XACML).

Request Rate - Number of requests per second served by the Policy services (API, PAP), and number of requests/events per second processed by the PDPs.
Errors - Number of those requests/events that fail.
Latency/Duration (expressed as a time interval) - Amount of time those requests take; for PDPs, the relevant metrics are the event processing times.
Saturation - Degree of fullness or % utilization of a service, emphasizing the resources that are most constrained: CPU, memory, I/O, and custom metrics by domain.
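
As an illustration of the four signals listed above, the following minimal sketch derives them from a batch of request records. The record shape, the function name and the capacity parameters are hypothetical and are not part of any Policy Framework API; they only make the definitions concrete.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestRecord:
    """One served request or processed event (hypothetical record shape)."""
    duration_seconds: float
    succeeded: bool


def golden_signals(records: List[RequestRecord], window_seconds: float,
                   used_capacity: float, total_capacity: float) -> Dict[str, float]:
    """Summarise one observation window into the four golden signals."""
    total = len(records)
    failed = sum(1 for r in records if not r.succeeded)
    avg_latency = sum(r.duration_seconds for r in records) / total if total else 0.0
    return {
        "request_rate_per_second": total / window_seconds,
        "errors": failed,
        "avg_latency_seconds": avg_latency,
        # Saturation as % utilization of the most constrained resource.
        "saturation_percent": 100.0 * used_capacity / total_capacity,
    }


# Illustrative data only: three requests observed in a 60 second window,
# with 512 of 2048 units (for example MiB of memory) in use.
sample = [RequestRecord(0.2, True), RequestRecord(0.4, True), RequestRecord(0.6, False)]
signals = golden_signals(sample, window_seconds=60.0,
                         used_capacity=512.0, total_capacity=2048.0)
```

In practice these values would come from the metrics endpoints described in the sections below rather than from in-memory records.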

System Metrics that apply to all Policy components

These metrics have been available and exposed via a Prometheus endpoint since the Istanbul release.

Note: Standard metrics are already exposed for Policy DB (MariaDB) via common charts.

Metric	Prometheus Query
Memory usage	rate(jvm_memory_bytes_used[30s])*100
CPU Usage	rate(process_cpu_seconds_total[30s])*100
JVM threads	jvm_threads_current jvm_threads_daemon
Process uptime	process_start_time_seconds
Garbage Collectors	GCs per second: rate(jvm_gc_collection_seconds_count[1m]); Avg GC time: rate(jvm_gc_collection_seconds_sum[1m]) / rate(jvm_gc_collection_seconds_count[1m])
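
To show what the garbage collector queries compute, here is a small worked sketch. It assumes two scrapes of the cumulative jvm_gc_collection_seconds_sum and jvm_gc_collection_seconds_count counters taken 60 seconds apart; the sample values are invented, and simple_rate is a simplified stand-in for PromQL's rate() that ignores counter resets and extrapolation.

```python
from typing import List, Tuple

# (unix timestamp in seconds, cumulative counter value)
Sample = Tuple[float, float]


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase between the first and last sample.

    Simplification: PromQL's rate() also handles counter resets and
    extrapolates to the window boundaries; this sketch does neither.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Invented scrape values for a 1 minute window.
gc_seconds_sum = [(0.0, 1.50), (60.0, 1.80)]       # total seconds spent in GC
gc_seconds_count = [(0.0, 100.0), (60.0, 112.0)]   # total number of collections

# Collections per second: rate of the _count counter.
gcs_per_second = simple_rate(gc_seconds_count)

# Average time per collection: rate(_sum) / rate(_count).
avg_gc_time_seconds = simple_rate(gc_seconds_sum) / simple_rate(gc_seconds_count)
```

With these numbers there are 12 collections in the window taking roughly 0.3 seconds in total, so the average collection time is about 25 milliseconds.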

Note: SSL certificate expiry is a key metric to alert on; however, it can be handled outside the scope of the Policy Framework.

Key metrics for Policy API

Metric	Metric available?	Exposed via Prometheus endpoint?	Comment
Availability of policy-api service	Yes	Yes	Exposed by policy-api healthcheck and policy-pap consolidated healthcheck.
Latency	Yes	Yes	To be implemented for all CRUD endpoints exposed by policy-api. Sample s3p numbers for policy-api stress tests.
Successful API request counter	Yes		Prometheus query for number of successful API calls per minute
Failed API request counter	Yes		Prometheus query for number of API calls with non 20* family of status codes per minute
SSL certificate expiry time	No	No
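
The successful and failed request counters are described per minute, split on whether the status code belongs to the 20* family. A minimal sketch of that bucketing follows, assuming the input is a list of (unix timestamp, HTTP status) pairs. That input shape and the function name are hypothetical, and this is not the Prometheus query itself.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def calls_per_minute(calls: Iterable[Tuple[float, int]]) -> Dict[int, Counter]:
    """Bucket (unix_timestamp_seconds, http_status) pairs by minute.

    A call counts as a success when its status code starts with "20",
    and as a failure otherwise.
    """
    buckets: Dict[int, Counter] = {}
    for timestamp, status in calls:
        minute = int(timestamp // 60)
        outcome = "success" if str(status).startswith("20") else "failure"
        buckets.setdefault(minute, Counter())[outcome] += 1
    return buckets


# Invented observations: three calls in the first minute, two in the second.
observed = [(0.0, 200), (12.5, 201), (30.0, 404), (61.0, 200), (95.0, 500)]
per_minute = calls_per_minute(observed)
```

Here the first minute holds two successes and one failure, and the second minute holds one of each.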

Key metrics for Policy PAP

Metric	Metric available?	Exposed via Prometheus endpoint?	Comment
Availability of the policy-pap service	Yes	No	policy-pap healthcheck API
Status of PDPs as registered with policy-pap	Yes	No	policy-pap consolidated healthcheck API
Successful API request counter	Yes		To be implemented for all the endpoints exposed by policy-pap. Sample s3p numbers for policy-pap stress tests.
Failed API request counter	Yes		To be implemented for all the endpoints exposed by policy-pap. Number of API calls with non 200 family of status codes per minute.
Latency	Yes	Yes	To be implemented for all the endpoints exposed by policy-pap.
Policy deployment statistics: policyDeployFailureCount, policyDeploySuccessCount, totalPolicyDeployCount	Yes	Yes	Sample: GET /policy/pap/v1/statistics

{
    "code": 200,
    "policyDeployFailureCount": 0,
    "policyDeploySuccessCount": 0,
    "policyDownloadFailureCount": 0,
    "policyDownloadSuccessCount": 0,
    "totalPdpCount": 0,
    "totalPdpGroupCount": 0,
    "totalPolicyDeployCount": 0,
    "totalPolicyDownloadCount": 0
}

SSL certificate expiry time	No
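
The policy deployment statistics in the GET /policy/pap/v1/statistics sample can be turned into a failure ratio that is convenient to alert on. The sketch below assumes the response body has the shape of that sample and that totalPolicyDeployCount is the right denominator; the helper name is hypothetical.

```python
import json
from typing import Optional


def deploy_failure_ratio(stats: dict) -> Optional[float]:
    """Fraction of policy deployments that failed, or None if none were attempted."""
    total = stats["totalPolicyDeployCount"]
    if total == 0:
        return None  # avoid dividing by zero on a freshly started PAP
    return stats["policyDeployFailureCount"] / total


# Body copied from the statistics sample (all counters are zero there).
response_body = """
{
    "code": 200,
    "policyDeployFailureCount": 0,
    "policyDeploySuccessCount": 0,
    "policyDownloadFailureCount": 0,
    "policyDownloadSuccessCount": 0,
    "totalPdpCount": 0,
    "totalPdpGroupCount": 0,
    "totalPolicyDeployCount": 0,
    "totalPolicyDownloadCount": 0
}
"""
ratio = deploy_failure_ratio(json.loads(response_body))

# Invented counters, to show a non-trivial result.
example_ratio = deploy_failure_ratio(
    {"totalPolicyDeployCount": 4, "policyDeployFailureCount": 1}
)
```

Returning None for an empty counter set keeps "no deployments yet" distinct from "no failures".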

Key metrics for Policy Distribution

Metric	Metric available?	Exposed via Prometheus endpoint?	Comment
Availability of the policy-distribution service	Yes	Yes	Exposed by policy-distribution healthcheck and consolidated policy-pap healthcheck
Successful API request counter	Yes	Yes	To be implemented for all the endpoints exposed by policy-distribution. Sample s3p numbers for policy-distribution stress tests.
Failed API request counter	Yes	Yes	To be implemented for all the endpoints exposed by policy-distribution. Number of API calls with non 200 family of status codes per minute.
Latency	Yes	Yes	To be implemented for all the endpoints exposed by policy-distribution.
Policy distribution statistics: distributions, distribution_complete_ok, distribution_complete_fail, downloads, downloads_ok, downloads_error	Yes	Yes	policy-distribution Rest Endpoint Samples

Key metrics for Policy APEX PDP

Metric	Metric available?	Exposed via Prometheus endpoint?	Comment
Availability of policy-apex-pdp	Yes	Yes	Exposed by policy-apex-pdp healthcheck and policy-pap consolidated healthcheck.
TOSCA Policy Deployment counter (per apex-pdp instance): policyDeployCount, policyDeploySuccessCount, policyDeployFailCount	Yes	No
Engine count	Yes	No	Can be inferred from the size of the JSON array object "engineStats".
Engine availability details	Yes		Exposed by policy-pap statistics. Sample: GET /policy/pap/v1/statistics/defaultGroup/apex

{
  "defaultGroup": {
    "apex": [
      {
        "pdpInstanceId": "devdev-policy-apex-pdp-0",
        "timeStamp": "2021-09-07T20:10:52.242Z",
        "pdpGroupName": "defaultGroup",
        "pdpSubGroupName": "apex",
        "policyDeployCount": 2,
        "policyDeploySuccessCount": 2,
        "policyDeployFailCount": 0,
        "policyExecutedCount": 0,
        "policyExecutedSuccessCount": 0,
        "policyExecutedFailCount": 0,
        "engineStats": [
          {
            "engineId": "NSOApexEngine-0:0.0.1",
            "engineWorkerState": "READY",
            "engineTimeStamp": 1630550345549,
            "eventCount": 0,
            "lastExecutionTime": 0,
            "averageExecutionTime": 0,
            "upTime": 0,
            "lastEnterTime": 0,
            "lastStart": 1630550345549
          },
          ......
        ]
      }
    ]
  }
}

TOSCA Policy Execution counter (per apex-pdp instance): # of policies executed, # of policies executed with success status, # of policies executed with a failure status. *Note: the stats currently display APEX policy counters.	Yes	Yes
Engine stats (by engineID per apex-pdp instance): engineTimestamp (timestamp at which the statistics were recorded); eventCount (number of APEX events processed); engineWorkerState (possible values defined in AxEngineState); upTime (time that has elapsed since the policy engine was started); averageExecutionTime (average time taken to process an APEX policy); lastExecutionTime (time taken to process the last APEX policy); lastStart (time at which the policy engine was last started, uptime is derived from this metric)	Yes	Yes
Count of events processed (per engine thread, per apex-pdp instance): # of incoming trigger events processed by policy-apex-pdp, # of incoming trigger events processed successfully by policy-apex-pdp, # of incoming trigger events processed by policy-apex-pdp that resulted in a failure. *Note: the stats currently display APEX event counters processed by the engine.	No	No
Latency	Yes	Yes	Time taken for processing an incoming APEX event. *Note: the stats currently display the execution time for processing an APEX policy, which is a measure of system saturation and is sufficient.
Kafka consumer lag	No		Can be implemented outside of the Policy Framework. Monitor Kafka consumer lag increase for kafka/dmaap-message-router topics related to apex-pdp.
SSL certificate expiry time (wherever applicable)	No	No
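
The engine count is described as the size of the "engineStats" array, and each entry carries an engineWorkerState. A minimal sketch of that inference follows, assuming a response shaped like the GET /policy/pap/v1/statistics/defaultGroup/apex sample. The payload is trimmed to the fields used here, and the instance id and function name are hypothetical.

```python
import json
from collections import Counter
from typing import Dict

# Trimmed payload in the shape of the apex statistics sample.
apex_stats = json.loads("""
{
  "defaultGroup": {
    "apex": [
      {
        "pdpInstanceId": "example-apex-pdp-0",
        "engineStats": [
          {"engineId": "NSOApexEngine-0:0.0.1", "engineWorkerState": "READY"}
        ]
      }
    ]
  }
}
""")


def engine_summary(stats: dict, group: str = "defaultGroup",
                   subgroup: str = "apex") -> Dict[str, dict]:
    """Per PDP instance: number of engines and a count of engines per state."""
    summary = {}
    for pdp in stats[group][subgroup]:
        engines = pdp["engineStats"]
        summary[pdp["pdpInstanceId"]] = {
            "engineCount": len(engines),  # size of the engineStats array
            "states": dict(Counter(e["engineWorkerState"] for e in engines)),
        }
    return summary


summary = engine_summary(apex_stats)
```

An instance whose state map contains anything other than READY would be a natural candidate for an availability alert.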

Key metrics for Policy Drools PDP

*Note: Drools PDP counters are exposed on a per control loop implementation basis.

Metric	Metric available?	Exposed via Prometheus endpoint?	Comment
Availability of policy-drools-pdp	Yes	No	Exposed by policy-drools-pdp healthcheck and policy-pap consolidated healthcheck. Sample: Telemetry feature-lifecycle status API

http://localhost:9696/policy/pdp/engine> get /policy/pdp/engine/lifecycle/state
HTTP/1.1 200 OK
Content-Length: 8
Content-Type: application/json
Date: Thu, 11 Nov 2021 16:36:13 GMT
Server: Jetty(9.4.33.v20201020)

"ACTIVE"

Policy Deployment counter (per drools-pdp instance): policyDeployCount, policyDeploySuccessCount, policyDeployFailCount	Yes	No	Sample: GET /policy/pap/v1/statistics/defaultGroup/drools

{
   "defaultGroup":{
      "drools":[
         {
            "pdpInstanceId":"dev-policy-drools-pdp-0",
            "timeStamp":"2021-09-07T20:09:34.160Z",
            "pdpGroupName":"defaultGroup",
            "pdpSubGroupName":"drools",
            "policyDeployCount":54,
            "policyDeploySuccessCount":54,
            "policyDeployFailCount":0,
            "policyExecutedCount":1,
            "policyExecutedSuccessCount":1,
            "policyExecutedFailCount":0,
            "engineStats":[

            ]
         }
      ]
   }
}

Policy Execution counter (per drools-pdp instance): policyExecutedCount, policyExecutedSuccessCount, policyExecutedFailCount	Yes	No
Latency	No		Time taken for an incoming event to be processed by the Drools controller.
Count of Drools facts	No		An ever-increasing number of Drools facts can lead to an out-of-memory condition.
Kafka consumer lag	No		Can be implemented external to the Policy Framework. Monitor Kafka consumer lag increase for kafka/dmaap-message-router topics related to Drools.
SSL certificate expiry time	No	No
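
The statistics sample reports cumulative counters together with a timeStamp, so a per-minute rate has to be derived from two snapshots. A minimal sketch of that calculation follows; the first snapshot reuses values from the drools sample, while the second snapshot and the helper names are invented for illustration.

```python
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse the ISO-8601 UTC format used by the statistics sample."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f%z")


def per_minute_delta(earlier: dict, later: dict, field: str) -> float:
    """Average increase of a cumulative counter per minute between two snapshots."""
    elapsed = parse_timestamp(later["timeStamp"]) - parse_timestamp(earlier["timeStamp"])
    return (later[field] - earlier[field]) / (elapsed.total_seconds() / 60)


# First snapshot: values taken from the drools sample.
first = {"timeStamp": "2021-09-07T20:09:34.160Z", "policyExecutedCount": 1}
# Second snapshot: hypothetical values five minutes later.
second = {"timeStamp": "2021-09-07T20:14:34.160Z", "policyExecutedCount": 11}

executions_per_minute = per_minute_delta(first, second, "policyExecutedCount")
```

The same helper applies to policyExecutedFailCount or policyDeployFailCount when a failure rate is wanted instead.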

Key metrics for Policy XACML PDP

TODO: The statistics exposed can be more granular

Metric	Metric available?	Exposed via Prometheus endpoint?	Comment
Availability of policy-xacml-pdp	Yes	No	Exposed by the policy-pap consolidated healthcheck and by the XACML healthcheck API. Sample: GET /policy/pdpx/v1/healthcheck

~ $ curl --location --request GET 'http://policy-xacml-pdp.fra-fireants-dev.svc.cluster.local:6969/policy/pdpx/v1/healthcheck' --header 'Authorization: Basic ****'
{
  "name": "Policy Xacml PDP",
  "url": "self",
  "healthy": true,
  "code": 200,
  "message": "alive"
}

Policy Deployment counter: totalPoliciesCount, totalPolicyTypesCount	Yes	No	XACML PDP statistics API. Sample: GET /policy/pdpx/v1/statistics

~ $ curl --location --request GET 'http://policy-xacml-pdp.fra-fireants-dev.svc.cluster.local:6969/policy/pdpx/v1/statistics' --header 'Authorization: Basic ****'
{
  "code": 200,
  "totalPolicyTypesCount": 18,
  "totalPoliciesCount": 1,
  "totalErrorCount": 0,
  "permitDecisionsCount": 0,
  "denyDecisionsCount": 0,
  "indeterminantDecisionsCount": 0,
  "notApplicableDecisionsCount": 1
}

Policy execution error counter: totalErrorCount	Yes	No
Policy execution success counter by type: permitDecisionsCount, denyDecisionsCount, indeterminantDecisionsCount, notApplicableDecisionsCount	Yes	No
Latency	No		Time taken for an incoming event to be processed via the XACML policies.
Kafka consumer lag	No		Can be implemented external to the Policy Framework. Monitor Kafka consumer lag increase for kafka/dmaap-message-router topics related to XACML.
SSL certificate expiry time	No	No
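
The per-type decision counters from GET /policy/pdpx/v1/statistics can be summarised as shares of all decisions, which gives a more granular view than the raw totals. The sketch below assumes a response shaped like that sample; the helper name is hypothetical.

```python
from typing import Dict

# Field names as they appear in the statistics sample.
DECISION_FIELDS = (
    "permitDecisionsCount",
    "denyDecisionsCount",
    "indeterminantDecisionsCount",
    "notApplicableDecisionsCount",
)


def decision_shares(stats: dict) -> Dict[str, float]:
    """Share of each decision type among all decisions (all 0.0 if none were made)."""
    total = sum(stats[field] for field in DECISION_FIELDS)
    if total == 0:
        return {field: 0.0 for field in DECISION_FIELDS}
    return {field: stats[field] / total for field in DECISION_FIELDS}


# Counter values copied from the statistics sample.
xacml_stats = {
    "code": 200,
    "totalPolicyTypesCount": 18,
    "totalPoliciesCount": 1,
    "totalErrorCount": 0,
    "permitDecisionsCount": 0,
    "denyDecisionsCount": 0,
    "indeterminantDecisionsCount": 0,
    "notApplicableDecisionsCount": 1,
}
shares = decision_shares(xacml_stats)
```

For the sample values the only decision recorded is a "not applicable" one, so that type accounts for the entire share.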

Version	Old Version 4	New Version Current
Changes made by	Rashmi Pujar	Rashmi Pujar
Saved on	Nov 11, 2021	Apr 04, 2022
