Note: Standard metrics are already exposed for Policy DB (MariaDB) via common charts.

Metric | Prometheus Query
Memory usage | rate(jvm_memory_bytes_used[30s])*100
CPU usage | rate(process_cpu_seconds_total[30s])*100
JVM threads | jvm_threads_current, jvm_threads_daemon
Process uptime | time() - process_start_time_seconds
Garbage collectors | GCs per second: rate(jvm_gc_collection_seconds_count[1m])
Garbage collectors | Avg GC time: rate(jvm_gc_collection_seconds_sum[1m]) / rate(jvm_gc_collection_seconds_count[1m])
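
The garbage collector figures reduce to simple arithmetic on two scrapes of the same counters: the collection rate is the increase in jvm_gc_collection_seconds_count divided by the window length, and the average GC time is the increase in jvm_gc_collection_seconds_sum divided by the increase in the count. The following is a minimal sketch of that arithmetic; the function name and the sample values are illustrative assumptions, not data from a real deployment.

```python
def gc_rates(sum_start, count_start, sum_end, count_end, window_seconds):
    """Return (collections per second, average seconds per collection).

    Inputs are two samples of the GC counters taken window_seconds apart.
    """
    collections = count_end - count_start   # increase of ..._seconds_count
    gc_seconds = sum_end - sum_start        # increase of ..._seconds_sum
    per_second = collections / window_seconds
    # Guard against a window in which no collection happened.
    avg_time = gc_seconds / collections if collections else 0.0
    return per_second, avg_time


# Hypothetical samples: 6 collections taking 0.6 s in total over a 60 s window.
per_second, avg_time = gc_rates(12.0, 300, 12.6, 306, 60)
print(per_second, avg_time)
```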

Key metrics for Policy API

Metric | Metric available? | Exposed via Prometheus endpoint? | Comment
Availability of policy-api service | Yes | No | Exposed by policy-api healthcheck and policy-pap consolidated healthcheck.
Latency | No | No | To be implemented for all CRUD endpoints exposed by policy-api. Sample s3p numbers for policy-api stress tests.
Request rate (API requests per minute): successful API request counter | No | No | Prometheus query for the number of successful API calls per minute
Failure rate (API errors per minute): failed API request counter | No | No | Prometheus query for the number of API calls with non-2xx status codes per minute
SSL certificate expiry time | No | No | Can be done outside the scope of Policy Fwk
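
The request-rate and failure-rate counters above are not implemented yet. As a rough illustration of what they would measure, the sketch below buckets requests by minute and splits them into 2xx responses and everything else. The function name, the tuple layout, and the sample data are assumptions made for this example and do not come from policy-api.

```python
from collections import Counter


def per_minute_counts(requests):
    """Group (epoch_seconds, http_status) tuples into per-minute counters.

    Returns {minute_index: Counter({"success": n, "failure": m})}.
    """
    buckets = {}
    for timestamp, status in requests:
        minute = int(timestamp // 60)
        # Any status outside the 2xx family counts as a failure.
        outcome = "success" if 200 <= status < 300 else "failure"
        buckets.setdefault(minute, Counter())[outcome] += 1
    return buckets


# Hypothetical request log: three calls in minute 0, two in minute 1.
sample = [(0, 200), (10, 201), (59, 404), (61, 500), (75, 200)]
counts = per_minute_counts(sample)
print(counts)
```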

Key metrics for Policy PAP

Metric | Metric available? | Exposed via Prometheus endpoint? | Comment
Availability of the policy-pap service | Yes | No | policy-pap healthcheck API
Status of PDPs as registered with policy-pap | Yes | No | policy-pap consolidated healthcheck API
Request rate (API requests per minute): successful API request counter | No | No | To be implemented for all the endpoints exposed by policy-pap. Sample s3p numbers for policy-pap stress tests.
Failure rate (API errors per minute): failed API request counter | No | No | To be implemented for all the endpoints exposed by policy-pap. Number of API calls with non-2xx status codes per minute.
Latency | No | No | To be implemented for all the endpoints exposed by policy-pap.

Policy deployment statistics (policyDeployFailureCount, policyDeploySuccessCount, totalPolicyDeployCount) | Yes | No | Sample response of GET /policy/pap/v1/statistics:
{
    "code": 200,
    "policyDeployFailureCount": 0,
    "policyDeploySuccessCount": 0,
    "policyDownloadFailureCount": 0,
    "policyDownloadSuccessCount": 0,
    "totalPdpCount": 0,
    "totalPdpGroupCount": 0,
    "totalPolicyDeployCount": 0,
    "totalPolicyDownloadCount": 0
}


SSL certificate expiry time | No | No | Can be done outside the scope of Policy Fwk
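
The policy-pap statistics response can be turned into a deployment failure ratio with a few lines of code. The sketch below assumes the JSON shape of the GET /policy/pap/v1/statistics sample shown above; the function name and the second, non-zero input are hypothetical.

```python
import json


def deploy_failure_ratio(stats):
    """Return failed deployments / total deployments, or None if none ran."""
    total = stats["totalPolicyDeployCount"]
    if total == 0:
        return None  # avoid dividing by zero on a fresh installation
    return stats["policyDeployFailureCount"] / total


# Subset of the sample response from this page (all counters are zero).
sample = json.loads("""
{
    "code": 200,
    "policyDeployFailureCount": 0,
    "policyDeploySuccessCount": 0,
    "totalPolicyDeployCount": 0
}
""")
print(deploy_failure_ratio(sample))

# Hypothetical counters: 1 failure out of 4 deployments.
print(deploy_failure_ratio({"policyDeployFailureCount": 1,
                            "totalPolicyDeployCount": 4}))
```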

Key metrics for Policy Distribution

Metric | Metric available? | Exposed via Prometheus endpoint? | Comment
Availability of the policy-distribution service | Yes | No | Exposed by policy-distribution healthcheck and consolidated policy-pap healthcheck
Request rate (API requests per minute): successful API request counter | No | No | To be implemented for all the endpoints exposed by policy-distribution. Sample s3p numbers for policy-distribution stress tests.
Failure rate (API errors per minute): failed API request counter | No | No | To be implemented for all the endpoints exposed by policy-distribution. Number of API calls with non-2xx status codes per minute.
Latency | No | No | To be implemented for all the endpoints exposed by policy-distribution.
Policy distribution statistics (distributions, distribution_complete_ok, distribution_complete_fail, downloads, downloads_ok, downloads_error) | Yes | No |
SSL certificate expiry time | No | No | Can be done outside the scope of Policy Fwk
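
The distribution statistics listed above are available but not yet exposed through a Prometheus endpoint. As a sketch of what exposing them could look like, the function below renders a dictionary of counters in the Prometheus text exposition format. The metric prefix, the function name, and the counter values are assumptions for illustration only; this is not how policy-distribution is implemented.

```python
def to_prometheus_text(stats, prefix="policy_distribution_"):
    """Render counters as Prometheus text exposition lines.

    prefix is a hypothetical namespace chosen for this example.
    """
    lines = []
    for name in sorted(stats):
        metric = prefix + name
        lines.append(f"# TYPE {metric} counter")
        lines.append(f"{metric} {stats[name]}")
    return "\n".join(lines) + "\n"


# Hypothetical counter values keyed by the statistic names from the table.
stats = {
    "distributions": 5,
    "distribution_complete_ok": 4,
    "distribution_complete_fail": 1,
    "downloads": 3,
    "downloads_ok": 3,
    "downloads_error": 0,
}
text = to_prometheus_text(stats)
print(text)
```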

Key metrics for Policy APEX PDP

Metric | Metric available? | Exposed via Prometheus endpoint? | Comment
Availability of policy-apex-pdp | Yes | No | Exposed by policy-apex-pdp healthcheck and policy-pap consolidated healthcheck.
Policy Deployment counter (per apex-pdp instance): policyDeployCount, policyDeploySuccessCount, policyDeployFailCount | Yes | No | Exposed by policy-pap statistics. Sample response of GET /policy/pap/v1/statistics/defaultGroup/apex:
{
  "defaultGroup": {
    "apex": [
      {
        "pdpInstanceId": "devdev-policy-apex-pdp-0",
        "timeStamp": "2021-09-07T20:10:52.242Z",
        "pdpGroupName": "defaultGroup",
        "pdpSubGroupName": "apex",
        "policyDeployCount": 2,
        "policyDeploySuccessCount": 2,
        "policyDeployFailCount": 0,
        "policyExecutedCount": 0,
        "policyExecutedSuccessCount": 0,
        "policyExecutedFailCount": 0,
        "engineStats": [
          {
            "engineId": "NSOApexEngine-0:0.0.1",
            "engineWorkerState": "READY",
            "engineTimeStamp": 1630550345549,
            "eventCount": 0,
            "lastExecutionTime": 0,
            "averageExecutionTime": 0,
            "upTime": 0,
            "lastEnterTime": 0,
            "lastStart": 1630550345549
          },
          ......
        ]
      }
    ]
  }
}






Policy Execution counter (per apex-pdp instance): number of policies executed, number of policies executed with a success status, number of policies executed with a failure status. *Note: the stats currently display APEX policy counters. | No | No |
Engine count (can be inferred from the size of the JSON array "engineStats") | Yes | No |
Engine availability details (by engineId per apex-pdp instance): engineTimeStamp (timestamp at which the statistics were recorded), engineWorkerState (possible values are defined in AxEngineState), upTime (time elapsed since the policy engine was started), lastStart (time at which the policy engine was last started) | Yes | No |
Count of events processed (per engine thread, per apex-pdp instance): number of incoming trigger events processed by policy-apex-pdp, number processed successfully, number that resulted in a failure. *Note: the stats currently display APEX event counters processed by the engine. | No | No |
Latency | No | No | Time taken by the policy to process an incoming network trigger event. *Note: the stats currently display the execution time for processing an APEX policy.
Kafka consumer lag | No | No | Can be implemented outside of the Policy Fwk. Monitor increases in Kafka consumer lag for kafka/dmaap-message-router topics related to apex-pdp.
SSL certificate expiry time | No | No | Can be done outside the scope of Policy Fwk
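
The engine count and engine state can be read directly from the "engineStats" array in the policy-pap statistics response. The sketch below assumes the JSON shape of the GET /policy/pap/v1/statistics/defaultGroup/apex sample on this page; the function name is hypothetical, and the input is trimmed to the single engine entry that the sample shows in full.

```python
import json


def engine_summary(stats):
    """Map pdpInstanceId -> (engine count, list of engineWorkerState values)."""
    summary = {}
    for subgroups in stats.values():            # e.g. "defaultGroup"
        for instances in subgroups.values():    # e.g. "apex"
            for pdp in instances:
                engines = pdp.get("engineStats", [])
                states = [engine["engineWorkerState"] for engine in engines]
                summary[pdp["pdpInstanceId"]] = (len(engines), states)
    return summary


# Trimmed copy of the sample response, keeping only the fields used here.
sample = json.loads("""
{
  "defaultGroup": {
    "apex": [
      {
        "pdpInstanceId": "devdev-policy-apex-pdp-0",
        "engineStats": [
          {"engineId": "NSOApexEngine-0:0.0.1", "engineWorkerState": "READY"}
        ]
      }
    ]
  }
}
""")
summary = engine_summary(sample)
print(summary)
```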

Key metrics for Policy Drools PDP

*Note: The Drools PDP counters are exposed per control loop implementation.

Metric | Metric available? | Exposed via Prometheus endpoint? | Comment
Availability of policy-drools-pdp | Yes | No | Exposed by policy-drools-pdp healthcheck and policy-pap consolidated healthcheck. Sample output of the telemetry feature-lifecycle status API:
http://localhost:9696/policy/pdp/engine> get /policy/pdp/engine/lifecycle/state
HTTP/1.1 200 OK
Content-Length: 8
Content-Type: application/json
Date: Thu, 11 Nov 2021 16:36:13 GMT
Server: Jetty(9.4.33.v20201020)

"ACTIVE"


Policy Deployment counter (per drools-pdp instance): policyDeployCount, policyDeploySuccessCount, policyDeployFailCount | Yes | No | Sample response of GET /policy/pap/v1/statistics/defaultGroup/drools:
{
   "defaultGroup":{
      "drools":[
         {
            "pdpInstanceId":"dev-policy-drools-pdp-0",
            "timeStamp":"2021-09-07T20:09:34.160Z",
            "pdpGroupName":"defaultGroup",
            "pdpSubGroupName":"drools",
            "policyDeployCount":54,
            "policyDeploySuccessCount":54,
            "policyDeployFailCount":0,
            "policyExecutedCount":1,
            "policyExecutedSuccessCount":1,
            "policyExecutedFailCount":0,
            "engineStats":[

            ]
         }
      ]
   }
}


Policy Execution counter (per drools-pdp instance): policyExecutedCount, policyExecutedSuccessCount, policyExecutedFailCount | Yes | No |
Latency | No | No | Time taken for an incoming event to be processed by the Drools controller.
Count of Drools facts | No | No | An ever-increasing number of Drools facts can lead to an out-of-memory error.
Kafka consumer lag | No | No | Can be implemented outside of the Policy Fwk. Monitor increases in Kafka consumer lag for kafka/dmaap-message-router topics related to drools.
SSL certificate expiry time | No | No | Can be done outside the scope of Policy Fwk
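
Kafka consumer lag is the gap between the latest offset written to each partition and the offset the consumer group has committed. The sketch below shows that calculation and a simple check for lag that keeps growing across consecutive samples. The function names and the offsets are illustrative assumptions; in practice the offsets would come from Kafka tooling outside the Policy Framework.

```python
def total_lag(end_offsets, committed_offsets):
    """Sum over partitions of (log end offset - committed offset)."""
    return sum(
        # Treat a partition with no committed offset as committed at 0.
        max(end_offsets[p] - committed_offsets.get(p, 0), 0)
        for p in end_offsets
    )


def lag_is_increasing(lag_samples):
    """True if each lag sample is strictly greater than the previous one."""
    return len(lag_samples) > 1 and all(
        later > earlier for earlier, later in zip(lag_samples, lag_samples[1:])
    )


# Hypothetical offsets for a two-partition topic.
lag = total_lag({0: 100, 1: 50}, {0: 90, 1: 50})
print(lag)
print(lag_is_increasing([10, 25, 40]))
```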

Key metrics for Policy XACML PDP

TODO: The exposed statistics could be made more granular.

Metric | Metric available? | Exposed via Prometheus endpoint? | Comment
Availability of policy-xacml-pdp | Yes | No | Exposed by the policy-pap consolidated healthcheck and by the XACML healthcheck API. Sample response of GET /policy/pdpx/v1/healthcheck:
~ $ curl --location --request GET 'http://policy-xacml-pdp.fra-fireants-dev.svc.cluster.local:6969/policy/pdpx/v1/healthcheck' --header 'Authorization: Basic ****'
{
  "name": "Policy Xacml PDP",
  "url": "self",
  "healthy": true,
  "code": 200,
  "message": "alive"
}


Policy Deployment counter: totalPoliciesCount, totalPolicyTypesCount | Yes | No | XACML PDP statistics API. Sample response of GET /policy/pdpx/v1/statistics:
~ $ curl --location --request GET 'http://policy-xacml-pdp.fra-fireants-dev.svc.cluster.local:6969/policy/pdpx/v1/statistics' --header 'Authorization: Basic ****'
{
  "code": 200,
  "totalPolicyTypesCount": 18,
  "totalPoliciesCount": 1,
  "totalErrorCount": 0,
  "permitDecisionsCount": 0,
  "denyDecisionsCount": 0,
  "indeterminantDecisionsCount": 0,
  "notApplicableDecisionsCount": 1
}


Policy execution error counter: totalErrorCount | Yes | No |
Policy execution success counter by type: permitDecisionsCount, denyDecisionsCount, indeterminantDecisionsCount, notApplicableDecisionsCount | Yes | No |
Latency | No | No | Time taken for an incoming event to be processed via the XACML policies.
Kafka consumer lag | No | No | Can be implemented outside of the Policy Fwk. Monitor increases in Kafka consumer lag for kafka/dmaap-message-router topics related to XACML.
SSL certificate expiry time | No | No | Can be done outside the scope of Policy Fwk
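
The per-type decision counters from the XACML PDP statistics API can be combined into a total and a share per decision type. The sketch below assumes the JSON shape of the GET /policy/pdpx/v1/statistics sample on this page and uses the field names exactly as the API returns them; the function name is hypothetical.

```python
DECISION_KEYS = [
    "permitDecisionsCount",
    "denyDecisionsCount",
    "indeterminantDecisionsCount",  # spelled as in the API response
    "notApplicableDecisionsCount",
]


def decision_breakdown(stats):
    """Return (total decisions, {counter name: share of total})."""
    total = sum(stats[key] for key in DECISION_KEYS)
    shares = {
        key: (stats[key] / total if total else 0.0) for key in DECISION_KEYS
    }
    return total, shares


# Counter values copied from the sample response on this page.
sample = {
    "permitDecisionsCount": 0,
    "denyDecisionsCount": 0,
    "indeterminantDecisionsCount": 0,
    "notApplicableDecisionsCount": 1,
}
total, shares = decision_breakdown(sample)
print(total, shares)
```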