Open Telemetry/Micrometer Tracing Investigation

This page collates the findings of an investigation into the impact of implementing message tracing in Policy CLAMP. The study focuses on adding the OpenTelemetry and Micrometer libraries to achieve full, configurable tracing.

Integrating tracing into a Spring Boot application offers several advantages that significantly enhance observability, troubleshooting, and performance optimisation within the application and the entire system.

What is Distributed Tracing?

Distributed tracing in Spring Boot refers to the practice of monitoring and tracing the flow of requests as they propagate through a distributed system composed of multiple interconnected services or components. It helps in understanding how requests traverse different parts of the system, allowing for visualisation of their path and identification of performance bottlenecks or issues across various services.

Key Components:

  1. Spans and Traces:

    • Span: Represents a unit of work or an operation within a service. Spans are linked together to form a trace.

    • Trace: A collection of spans representing the journey of a request as it moves through multiple services.

  2. Context Propagation:

    • Mechanisms to pass contextual information (like trace IDs, span IDs, etc.) between services to maintain continuity and correlation of traces.

  3. Instrumentation:

    • Integrating tracing libraries or frameworks (e.g., OpenTelemetry, Zipkin, Jaeger) into Spring Boot applications and services to generate traces and capture relevant metadata.

  4. Trace Collectors and Exporters:

    • Components responsible for collecting trace data generated by services and exporting it to backend systems for storage, analysis, and visualisation.
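
The relationship between spans and traces described above can be sketched with a small data model. This is an illustration only, assuming a simplified span with a trace ID, a span ID and an optional parent span ID; the class and method names (TraceModelSketch, trace, childrenOf, sampleSpans) and the span names are hypothetical and are not the OpenTelemetry or Micrometer types.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative data model only; these are NOT the OpenTelemetry or Micrometer types.
final class TraceModelSketch {

    // A span: one unit of work inside one service.
    static final class Span {
        final String traceId;       // shared by every span in the same trace
        final String spanId;        // unique to this span
        final String parentSpanId;  // null for the root span
        final String name;
        final long durationMillis;

        Span(String traceId, String spanId, String parentSpanId, String name, long durationMillis) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentSpanId = parentSpanId;
            this.name = name;
            this.durationMillis = durationMillis;
        }
    }

    // A trace is the set of collected spans that share a trace ID.
    static List<Span> trace(List<Span> collected, String traceId) {
        List<Span> result = new ArrayList<>();
        for (Span s : collected) {
            if (s.traceId.equals(traceId)) {
                result.add(s);
            }
        }
        return result;
    }

    // Direct children of a span, found through the parent span ID link.
    static List<Span> childrenOf(List<Span> trace, Span parent) {
        List<Span> result = new ArrayList<>();
        for (Span s : trace) {
            if (parent.spanId.equals(s.parentSpanId)) {
                result.add(s);
            }
        }
        return result;
    }

    // Hypothetical spans from two unrelated requests, as a collector might receive them.
    static List<Span> sampleSpans() {
        List<Span> spans = new ArrayList<>();
        spans.add(new Span("t1", "a", null, "HTTP POST /example", 120));
        spans.add(new Span("t1", "b", "a", "database insert", 35));
        spans.add(new Span("t1", "c", "a", "kafka publish", 20));
        spans.add(new Span("t2", "d", null, "HTTP GET /health", 3));
        return spans;
    }

    public static void main(String[] args) {
        List<Span> t1 = trace(sampleSpans(), "t1");
        Span root = t1.get(0);
        System.out.println(root.name + " has " + childrenOf(t1, root).size() + " child spans");
    }
}
```

The trace ID is what lets a backend group spans from different services into one request journey, while the parent span ID is what lets it draw them as a tree.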

How it Works in Spring Boot:

  1. Instrumentation: Developers add tracing instrumentation to their Spring Boot applications using libraries or frameworks designed for distributed tracing.

  2. Propagation of Context:

    • During request handling, tracing context (trace IDs, span IDs) is propagated between services, ensuring continuity and correlation of traces.

  3. Span Creation:

    • Spans are created to represent various operations or actions within the Spring Boot application, such as handling HTTP requests, database queries, or other custom operations.

  4. Trace Visualisation and Analysis:

    • Traces collected from different services are aggregated and visualised in tracing systems (e.g., Jaeger UI, Zipkin UI) to provide a complete picture of request flows, latency, and dependencies across the distributed system.
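
Step 2 above, propagation of context, can be illustrated with the W3C Trace Context traceparent header, which carries a version, a 32-character hexadecimal trace ID, a 16-character hexadecimal parent span ID and a flags byte. The following is a minimal sketch, assuming headers are held in a plain map; the class TraceContextSketch and its methods are hypothetical, and in practice the tracing library performs this step itself. The sketch does not implement every validation rule of the specification (for example, it does not reject all-zero IDs).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; real propagation is done by the tracing library.
final class TraceContextSketch {
    static final String HEADER = "traceparent";

    final String traceId;   // 32 lowercase hex characters
    final String spanId;    // 16 lowercase hex characters
    final boolean sampled;  // lowest bit of the flags byte

    TraceContextSketch(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    // Sender side: write the current context into the outgoing headers.
    void inject(Map<String, String> headers) {
        headers.put(HEADER, "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00"));
    }

    // Receiver side: read the caller's context, or return null if it is absent or malformed.
    static TraceContextSketch extract(Map<String, String> headers) {
        String value = headers.get(HEADER);
        if (value == null) {
            return null;
        }
        String[] parts = value.split("-");
        if (parts.length != 4 || !parts[0].equals("00")) {
            return null;
        }
        if (!parts[1].matches("[0-9a-f]{32}")
                || !parts[2].matches("[0-9a-f]{16}")
                || !parts[3].matches("[0-9a-f]{2}")) {
            return null;
        }
        boolean sampledFlag = (Integer.parseInt(parts[3], 16) & 1) == 1;
        return new TraceContextSketch(parts[1], parts[2], sampledFlag);
    }

    // A new span in the same trace keeps the trace ID and gets its own span ID.
    TraceContextSketch childWithSpanId(String newSpanId) {
        return new TraceContextSketch(traceId, newSpanId, sampled);
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        new TraceContextSketch("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true)
                .inject(headers);
        TraceContextSketch received = extract(headers);
        System.out.println("Continuing trace " + received.traceId);
    }
}
```

Because the receiving service reuses the trace ID and records the incoming span ID as its parent, spans created on both sides end up correlated in the same trace.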

Benefits

  1. End-to-End Visibility:

    • Tracing provides visibility into the entire flow of requests across various microservices or components, allowing you to understand how requests propagate and interact through the system.

  2. Performance Monitoring:

    • Detailed tracing metrics enable performance monitoring, helping identify bottlenecks, latency issues, or areas for optimisation within the application.

  3. Root Cause Analysis:

    • Traces aid in root cause analysis during incidents or errors by providing a detailed sequence of events leading to an issue, making debugging and troubleshooting more efficient.

  4. Dependency Mapping:

    • Tracing helps in visualising dependencies between different services or components, facilitating better understanding of interactions and potential points of failure.

  5. Resource Optimisation:

    • Insights from tracing data assist in optimising resource allocation, scaling strategies, and load balancing based on real usage patterns.

  6. Service-Level Monitoring:

    • Monitoring and enforcing service-level agreements (SLAs) become more efficient by tracking performance metrics of individual services within a distributed system.

  7. Fine-Grained Metrics:

    • Tracing provides fine-grained metrics on individual transactions or requests, aiding in granular analysis and optimisation efforts.

  8. Application Insights and Analytics:

    • Trace data serves as a valuable source for analytics and business insights, allowing for a deeper understanding of user behaviours and application usage patterns.

  9. Real-Time Monitoring and Alerting:

    • Tracing systems often integrate with monitoring tools to offer real-time visibility into application performance, enabling proactive issue resolution and alerting mechanisms.

  10. Standardised Observability:

    • Integrating tracing aligns with standardised observability practices, making it easier to collaborate across teams and maintain a consistent approach to monitoring and troubleshooting.

  11. Third-Party Integration:

    • Tracing solutions often support integration with various third-party tools, allowing interoperability and data sharing across different monitoring platforms.



By incorporating tracing into your Spring Boot application, you gain a comprehensive understanding of its behaviour, performance bottlenecks, and interactions within a distributed system. This fosters a proactive approach to monitoring, troubleshooting, and optimising your application's performance and reliability.

Implementation Approaches

There are a number of different approaches that can be taken with Micrometer and OpenTelemetry to add tracing to your applications. Some of the Kafka-related approaches that were tried during this investigation were early implementations that were not well documented. Tracing of HTTP requests, on the other hand, was quite straightforward to implement. This section describes some of the approaches, along with the difficulties that arise when using each of them.

Automatic Instrumentation with Java Agent

What would seem to be one of the most straightforward approaches to adding tracing to your applications is the use of the Java agent supplied by OpenTelemetry. This is a zero-code approach: no changes to the code inside your application are required. All you need to do is download the agent jar and run it alongside your Java application like so:

java -javaagent:path/to/opentelemetry-javaagent.jar \
  -Dotel.service.name=your-service-name \
  -Dotel.traces.exporter=zipkin \
  -jar myapp.jar

Besides the system properties shown above, there are several other ways to provide parameters such as the exporter type and the name of the service, for example through environment variables.

One consideration here is how large a footprint this method has on your application.

The footprint of the OpenTelemetry agent alongside your Java application can vary based on several factors, including the volume of telemetry data collected, the configurations set, and the resources available on the host machine. Here are some key considerations regarding its footprint:

Overhead:

  • Resource Utilisation: The OpenTelemetry agent itself is lightweight, but it may consume additional memory and CPU resources depending on the instrumentation, exporters used, and the volume of data collected.

  • Instrumentation Overhead: The impact largely depends on the instrumented libraries and frameworks. Instrumenting extensive libraries or frequently called methods might impose a slightly higher overhead.

Configuration:

  • Sampling Rate: Configuring a high sampling rate might increase resource consumption as more data will be collected and exported.

  • Exporters: Different exporters (Jaeger, Zipkin, etc.) might have different resource footprints. For instance, exporting to a remote service might incur network overhead.
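
To make the effect of the sampling rate concrete, the following is a minimal sketch of ratio-based sampling, assuming the decision is derived from the trace ID so that every service makes the same choice for the same trace. The class RatioSamplerSketch is hypothetical and simplified; it is not the sampler shipped with OpenTelemetry, which is configured through the agent's own settings rather than written by hand.

```java
// Hypothetical, simplified sampler for illustration; not the OpenTelemetry implementation.
final class RatioSamplerSketch {
    private final double ratio;
    private final long threshold;

    RatioSamplerSketch(double ratio) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be between 0.0 and 1.0");
        }
        this.ratio = ratio;
        // Map the ratio onto the range of non-negative long values.
        this.threshold = (long) (ratio * Long.MAX_VALUE);
    }

    // Decide from the trace ID alone, so the decision is repeatable across services.
    boolean shouldSample(String traceIdHex) {
        if (ratio >= 1.0) {
            return true;   // keep everything
        }
        // Treat the low 64 bits (last 16 hex characters) of the trace ID as a pseudo-random value.
        String low = traceIdHex.substring(traceIdHex.length() - 16);
        long value = Long.parseUnsignedLong(low, 16) & Long.MAX_VALUE; // clear the sign bit
        return value < threshold;
    }

    public static void main(String[] args) {
        RatioSamplerSketch sampler = new RatioSamplerSketch(0.1);
        java.util.Random random = new java.util.Random();
        int kept = 0;
        int total = 10_000;
        for (int i = 0; i < total; i++) {
            String traceId = String.format("%016x%016x", random.nextLong(), random.nextLong());
            if (sampler.shouldSample(traceId)) {
                kept++;
            }
        }
        // Roughly one trace in ten is kept, so roughly one tenth of the spans are exported.
        System.out.println("Kept " + kept + " of " + total + " traces");
    }
}
```

Lowering the ratio reduces the number of spans that must be serialised and exported, which is where most of the variable cost of tracing comes from; the trade-off is that rare problems are less likely to be captured.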

Monitoring:

  • Monitoring Impact: Constantly monitoring the agent itself might consume additional resources. However, modern observability tools are designed to minimise this impact.

Optimisation:

  • Optimising Configurations: Properly configuring the agent to sample only necessary data and tuning settings can help reduce its footprint.

Mitigation:

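  • Scaling and Isolation: The Java agent runs inside the application's JVM, so it cannot itself be moved to a separate instance. The components that receive the exported data (for example, a collector or the tracing backend) can be run on separate instances, especially in production environments, so that their resource consumption is isolated and managed independently of your application.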

Testing:

  • Load Testing: Conduct load tests or performance tests to evaluate the specific impact of the OpenTelemetry agent on your application in a setup similar to the production environment.

Summary:

The OpenTelemetry agent itself is designed to have minimal overhead, but the actual impact depends on various factors like instrumentation, configurations, and the volume of data collected. It's advisable to start with default settings, monitor resource utilisation, and adjust configurations based on the observed impact to strike a balance between observability and resource consumption.

Automatic Instrumentation in Spring Boot

The opentelemetry-spring-boot-starter module auto-configures OpenTelemetry instrumentation for spring-web, spring-webmvc, and spring-webflux. It leverages Spring Aspect Oriented Programming, dependency injection, and bean post-processing to trace Spring applications. To include all of these features, use the opentelemetry-spring-boot-starter dependency. This module was not investigated in detail because the version of Spring Boot currently used in ACM and the policy framework is only compatible with version 1.25.0-alpha of opentelemetry-spring-boot-starter. Alpha versions can be problematic for users because of vulnerabilities and unreliable features. In addition, this starter does not seem to support Kafka clients.

Manual Approaches