Context
Collecting application metrics is the first step towards gaining insight into the Policy Framework services and infrastructure from the point of view of availability, performance, reliability and scalability.
Monitoring is meant to meet the following operational needs:
- Monitoring via dashboards: Provide a visual aid to display health and key metrics for use by OPS.
- Alerting: Notify when something is broken and must be addressed immediately, or when something might break soon so that proactive measures can be taken to avoid it.
- Conducting retrospective analysis: Keep rich information readily available to better troubleshoot issues.
- Analyzing trends: How fast is usage growing? What does the incoming traffic look like? This helps assess scaling needs to meet forecasted demand.
Policy Framework Key Metrics
The principles outlined in the Four Golden Signals, developed by Google Site Reliability Engineers, have been adopted to define the key metrics for the Policy Framework components: API, PAP, Policy-DB, PDPs (APEX, Drools, XACML).
- Request Rate - Number of requests per second served by the Policy services (API, PAP), and number of requests/events per second processed by the PDPs.
- Errors - Number of those requests/events that fail.
- Latency/Duration (expressed as a time interval) - Amount of time those requests take; for PDPs, the relevant metrics are the event processing times.
- Saturation - Degree of fullness or % utilization of a service, emphasizing the resources that are most constrained: CPU, memory, I/O, and custom metrics by domain.
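As a rough illustration of how the first three signals could be derived from raw request records, here is a minimal sketch. The `Request` record and the `golden_signals` function are hypothetical names introduced for this example and are not part of the Policy Framework. Saturation is left out because it comes from resource metrics rather than from request records.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served request (not a Policy Framework type)."""
    timestamp: float  # seconds since epoch when the request completed
    status: int       # HTTP status code returned
    duration: float   # seconds taken to serve the request


def golden_signals(requests, window_seconds):
    """Summarise request rate, errors and latency for one observation window."""
    total = len(requests)
    # Anything outside the 2xx family is counted as a failure.
    errors = sum(1 for r in requests if not 200 <= r.status < 300)
    durations = [r.duration for r in requests]
    return {
        "request_rate_per_s": total / window_seconds,
        "error_rate_per_s": errors / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "mean_latency_s": sum(durations) / total if total else 0.0,
        "max_latency_s": max(durations) if durations else 0.0,
    }
```

In practice a monitoring system such as Prometheus computes these values from counters and histograms; the sketch only shows what each signal measures.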
System Metrics that apply to all Policy components
Metric | Prometheus Query | Metric available? | Exposed via Prometheus? | Comment |
---|---|---|---|---|
Memory usage | jvm_memory_bytes_used | Yes | Yes | Available in Istanbul |
CPU Usage | process_cpu_seconds_total | Yes | Yes | |
JVM threads | jvm_threads_current | Yes | Yes | |
Process uptime | time() - process_start_time_seconds | Yes | Yes | |
Garbage Collectors | GCs per second: rate(jvm_gc_collection_seconds_count[1m]); Avg GC time: rate(jvm_gc_collection_seconds_sum[1m]) / rate(jvm_gc_collection_seconds_count[1m]) | Yes | Yes | |
Note: Standard metrics are already exposed for Policy DB (MariaDB) via common charts.
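The garbage-collection queries in the table rely on counter rates: GCs per second is the rate of the `_count` counter, and the average GC time is the rate of the `_sum` counter divided by the rate of the `_count` counter. The sketch below reproduces that arithmetic on two scraped samples. It is a simplification: `simple_rate` and `gc_summary` are hypothetical helpers, and unlike the PromQL `rate()` function they neither handle counter resets nor extrapolate to the window boundaries.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample.

    Simplification: ignores counter resets and window-boundary extrapolation,
    both of which the PromQL rate() function takes into account.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing timestamps")
    return (v1 - v0) / (t1 - t0)


def gc_summary(count_samples, sum_samples):
    """Return (GCs per second, average seconds per GC) for one window."""
    gcs_per_second = simple_rate(count_samples)
    gc_seconds_per_second = simple_rate(sum_samples)
    # Guard against a window in which no collection happened.
    avg_gc_seconds = gc_seconds_per_second / gcs_per_second if gcs_per_second else 0.0
    return gcs_per_second, avg_gc_seconds
```

For example, if the collection count grows from 100 to 106 and the accumulated GC time from 2.0 s to 2.3 s over 60 seconds, `gc_summary` yields roughly 0.1 collections per second and an average of about 0.05 s per collection.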
Key metrics for Policy API
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of policy-api service | Yes | No | Exposed by policy-api healthcheck |
Latency | No | No | To be implemented for all CRUD endpoints exposed by policy-api. Sample s3p numbers for policy-api stress tests. |
Request rate (API requests per minute) | No | No | Number of API calls per minute |
Failure rate (API errors per minute) | No | No | Number of API calls per minute that return a non-2xx status code |
SSL certificate expiry time | No | No | |
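Request rate and failure rate are both derivable from a single request counter labelled by status code. The following is a minimal, standard-library-only sketch of such a counter rendered in the Prometheus text exposition format. The `RequestMetrics` class, the metric name `http_requests_total` and the `/example` endpoint are assumptions made for illustration; this is not how policy-api is instrumented, and a real service would more likely use an existing Prometheus client library.

```python
from collections import Counter


class RequestMetrics:
    """In-memory request counter rendered in the Prometheus text format."""

    def __init__(self, name="http_requests_total"):
        self.name = name          # hypothetical metric name
        self.counts = Counter()   # keyed by (method, endpoint, status)

    def observe(self, method, endpoint, status):
        """Record one completed request."""
        self.counts[(method, endpoint, str(status))] += 1

    def render(self):
        """Return the counter as text that a Prometheus server could scrape."""
        lines = [
            f"# HELP {self.name} Total HTTP requests served.",
            f"# TYPE {self.name} counter",
        ]
        for (method, endpoint, status), value in sorted(self.counts.items()):
            labels = f'method="{method}",endpoint="{endpoint}",status="{status}"'
            lines.append(f"{self.name}{{{labels}}} {value}")
        return "\n".join(lines) + "\n"
```

With a counter of this shape, requests per minute could be queried as `sum(rate(http_requests_total[1m])) * 60`, and errors per minute by adding the label matcher `status!~"2.."`.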
Key metrics for Policy PAP
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of the policy-pap service | Yes | No | policy-pap healthcheck API |
Status of PDPs as registered with policy-pap | Yes | No | policy-pap consolidated healthcheck API |
Request rate (API requests per minute) | No | No | To be implemented for all the endpoints exposed by policy-pap. Sample s3p numbers for policy-pap stress tests. |
Failure rate (API errors per minute) | No | No | To be implemented for all the endpoints exposed by policy-pap. Number of API calls per minute that return a non-2xx status code |
Policy deployment statistics policyDeployFailureCount | Yes | No | Sample: |
Latency | No | No | To be implemented for all the endpoints exposed by policy-pap. |
SSL certificate expiry time | No | No | HTTPS is disabled for the entire Policy Framework |
Key metrics for Policy APEX PDP
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of policy-apex-pdp | Yes | No | Exposed by policy-pap consolidated healthcheck. |
Policy Deployment counter (per apex-pdp instance) policyDeployCount | Yes | No | Exposed by policy-pap statistics |
Policy Execution counter (per apex-pdp instance): number of policies executed. Note: the stats currently display APEX policy counters | No | No | |
Engine count: can be inferred from the size of the JSON array "engineStats" | Yes | No | |
Engine availability details (by engineID, per apex-pdp instance). engineTimestamp: timestamp at which the statistics were recorded | Yes | No | |
Count of events processed (per engine thread, per apex-pdp instance): number of incoming trigger events processed by policy-apex-pdp. Note: the stats currently display the APEX event counters processed by the engine | No | No | |
Latency | No | No | Time taken by the policy to process an incoming network trigger event. Note: the stats currently display the execution time for processing an APEX policy. |
Kafka consumer lag | No | No | Can be implemented outside of the Policy Framework. Monitor increases in Kafka consumer lag for the kafka/dmaap-message-router topics related to apex-pdp. |
SSL certificate expiry time (wherever applicable) | No | No | HTTPS is disabled for the entire Policy Framework |
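The table notes that the engine count can be inferred from the size of the "engineStats" JSON array, and that each entry carries an engineTimestamp. A minimal sketch of that inference follows. The payload layout is an assumption: the key spelling `engineId`, the millisecond epoch unit for `engineTimestamp`, and the `max_age_ms` staleness threshold are hypothetical and should be checked against the actual statistics response.

```python
import json


def engine_summary(stats_json, now_ms, max_age_ms=120_000):
    """Count engines and list those whose statistics look stale.

    Assumes a payload with an "engineStats" array whose items carry
    "engineId" and "engineTimestamp" (epoch milliseconds); both the key
    names and the unit are assumptions made for this sketch.
    """
    stats = json.loads(stats_json)
    engines = stats.get("engineStats", [])
    stale = [
        engine["engineId"]
        for engine in engines
        if now_ms - engine["engineTimestamp"] > max_age_ms
    ]
    return {"engine_count": len(engines), "stale_engines": stale}
```

An engine whose timestamp stops advancing is a hint that it is no longer reporting, which is why the sketch reports stale engines alongside the count.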
Key metrics for Policy Drools PDP
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of policy-drools-pdp | Yes | No | Exposed by policy-pap consolidated healthcheck. |
Policy Deployment counter (per drools-pdp instance) policyDeployCount | Yes | No | Sample: |
Policy Execution counter (per drools-pdp instance) policyExecutedCount | Yes | No | |
Latency | No | No | Time taken for an incoming event to be processed by drools controller. |
Count of Drools facts | No | No | An ever-increasing number of Drools facts can lead to an out-of-memory error. |
Kafka consumer lag | No | No | Can be implemented outside of the Policy Framework. Monitor increases in Kafka consumer lag for the kafka/dmaap-message-router topics related to Drools. |
SSL certificate expiry time | No | No | |
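Kafka consumer lag, listed for each PDP type, is the gap between the latest offset written to a partition and the offset the consumer group has committed. The sketch below shows that calculation and a simple check for a steadily growing lag. The function names and the plain-dictionary inputs are assumptions for illustration; in a real deployment the offsets would come from a Kafka client or a lag exporter.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: latest offset minus committed offset, floored at 0.

    Both arguments map partition number to offset. A partition with no
    committed offset is treated as committed at 0 (an assumption).
    """
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def lag_is_growing(total_lag_history):
    """True if total lag strictly increased across every consecutive sample."""
    pairs = zip(total_lag_history, total_lag_history[1:])
    return len(total_lag_history) >= 2 and all(b > a for a, b in pairs)
```

A lag that rises across several consecutive samples suggests the PDP is consuming events more slowly than they are produced, which is the condition the table recommends monitoring.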
Key metrics for Policy XACML PDP
Metric | Metric available? | Exposed via Prometheus endpoint? | Comment |
---|---|---|---|
Availability of policy-xacml-pdp | Yes | No | Exposed by the policy-pap consolidated healthcheck and also by the XACML API. |
Policy Deployment counter totalPoliciesCount | Yes | No | XACML PDP statistics API |
Policy execution error counter totalErrorCount | Yes | No | |
Policy execution success counter by type permitDecisionsCount | Yes | No | |
Latency | No | No | Time taken for an incoming event to be processed via the XACML policies. |
Kafka consumer lag | No | No | Can be implemented outside of the Policy Framework. Monitor increases in Kafka consumer lag for the kafka/dmaap-message-router topics related to XACML. |
SSL certificate expiry time | No | No | |