In this blog, we discuss Spring's new built-in support for observability and how to make the best use of it. In earlier versions of Spring and Spring Boot, you had to wire in Micrometer for metrics and Spring Cloud Sleuth (the predecessor of Micrometer Tracing) for distributed tracing as separate add-ons. Starting with Spring Framework 6 and Spring Boot 3.0, observability capabilities are built into the framework and woven through it.
Observability is the ability to understand the internal state of a system just by looking at its external outputs.
Observability acts like the dashboard in an airplane's cockpit: it gives you a top-level view of key performance indicators, a bird's-eye view of how the system is working as a sum of its parts.
It is the first screen you look at when you are trying to analyze a running system.
Since everybody wants to monitor their system's performance, it would be nice to have something that can be instrumented automatically, or with minimal effort, and that then watches the internal dynamics of our Spring Boot system: it records the function calls on the call stack and the time they take, and it also records this “tracing” information between the system components.
Since these records have a causal ordering, i.e. they follow the ordered sequence of function calls being made, they give us a blueprint of how the system performs along its pathways.
It should also show us aggregated information and let us drill down into the details to pinpoint problems and errors. This helps us ascertain the efficiency of our system and detect and solve issues, so that we can achieve optimal performance.
Spring Framework 6 and Spring Boot 3 now have built-in support for observability in Spring applications. Both now contain numerous auto-configurations for improved metrics with Micrometer and new distributed tracing support with Micrometer Tracing (formerly Spring Cloud Sleuth). Application logs, metrics, and distributed traces are interwoven to give you a holistic view of the internal functioning of your applications.
All the tracing information propagated to enable distributed tracing scenarios is based on the W3C Trace Context propagation standard (https://www.w3.org/TR/trace-context/).
Spring now introduces a new API, the Micrometer Observation API, which lets you instrument your code once, through a single API, and get multiple benefits out of it (e.g. metrics, tracing, logging).
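To make the W3C Trace Context propagation mentioned above more concrete, here is a minimal sketch of how the `traceparent` header defined by that standard can be parsed. The record name `TraceParent` and its methods are made up for this illustration and are not part of Spring or Micrometer; the sketch only handles the version 00 layout (version, trace id, parent id, flags).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parsed form of a W3C Trace Context "traceparent" header value.
record TraceParent(String version, String traceId, String parentId, String flags) {

    // Layout: version-traceid-parentid-flags, lowercase hex, separated by dashes.
    private static final Pattern FORMAT = Pattern.compile(
            "^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        // The standard treats all-zero trace ids and parent ids as invalid.
        if (m.group(2).equals("0".repeat(32)) || m.group(3).equals("0".repeat(16))) {
            throw new IllegalArgumentException("All-zero id in traceparent: " + header);
        }
        return new TraceParent(m.group(1), m.group(2), m.group(3), m.group(4));
    }

    // Bit 0 of the flags byte is the "sampled" flag.
    boolean sampled() {
        return (Integer.parseInt(flags, 16) & 0x01) == 1;
    }
}
```

For instance, `TraceParent.parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")` yields the trace id `4bf92f3577b34da6a3ce929d0e0e4736`, the parent id `00f067aa0ba902b7`, and `sampled()` returns true. In a Spring Boot application you would normally not parse this header yourself; the sketch only shows what travels between services.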
The three pillars of observability are logs, metrics, and traces; these are the foundational concepts on which observability is built.
Spring Boot and this article frequently use terms like Span, Trace, and Service.
You can wrap any operation in a "Span", which has a unique span_id, contains timing information, and can carry additional user-defined metadata (such as key-value pairs). For example, a span can mark the boundary between two services: when serviceA calls serviceB, a span is created between them, and later, when serviceB calls serviceC, another span is created between them; we then trace those spans.
So a Span is just a basic operation, and a Trace is a tree of Spans with causal ordering. By causal ordering we mean that Span2 is caused by Span1, i.e. Span1 does something that brings Span2 into the picture. This follows from your application design, where one function or service calls another, and so on, to complete a business process.
Each span has a parent_id, a start and end time, a duration, and metadata.
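The span and trace vocabulary above can be sketched as a small data model. This is only an illustration of the idea that a trace is a tree of spans linked by parent ids; the names `SketchSpan` and `TraceTree` are invented for this example and are not Micrometer types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one timed operation. parentId is null for the root span.
record SketchSpan(String spanId, String parentId, String name, long startMicros, long endMicros) {
    long durationMicros() {
        return endMicros - startMicros;
    }
}

final class TraceTree {
    // Render the spans of one trace as an indented tree, parents before children.
    static List<String> render(List<SketchSpan> spans) {
        List<String> out = new ArrayList<>();
        for (SketchSpan s : spans) {
            if (s.parentId() == null) {
                walk(s, spans, 0, out);
            }
        }
        return out;
    }

    // Depth-first walk: a child is any span whose parentId equals this span's id.
    private static void walk(SketchSpan span, List<SketchSpan> all, int depth, List<String> out) {
        out.add("  ".repeat(depth) + span.name() + " (" + span.durationMicros() + " us)");
        for (SketchSpan child : all) {
            if (span.spanId().equals(child.parentId())) {
                walk(child, all, depth + 1, out);
            }
        }
    }
}
```

With three spans where serviceA (0 to 900 us) calls serviceB (100 to 700 us), which calls serviceC (200 to 500 us), `TraceTree.render` returns the lines `serviceA (900 us)`, `  serviceB (600 us)`, and `    serviceC (300 us)`, which is the causal ordering described above.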
You can integrate with analytical tools like Grafana, Prometheus, Kibana, or any tool of your choice to get dashboards that show latency and throughput (with errors and successes).
If your users complain that your system is unresponsive/slow, or if you want to know which services are running slow/creating a bottleneck, you would want to visualize the latency and then drill down to the concrete operation that is the culprit.
You can select a service and then see its traces and check more details about a particular trace. It also depicts any dependencies/parent-child relationship between services, e.g. the services called by a service during its operation, in the form of a node tree.
In the latency graph, there are dots joined by lines: the lines show the overall trends, while the dots show individual requests. To see more details about a particular request, just hover over its dot to show an exemplar.
Clicking on any dot will show you details about that trace/exemplar.
You can search the tracing data, to know more details about the events that caused a particular behavior.
Spring Boot 3.2's observability tags error logs with relevant tags, and you can search for specific error outcomes or traces using those tags in the trace search form. From the results, you can see the hierarchy, or call stack, of services in the "Service and Operation" section.
Clicking on a particular trace gives you all the details recorded for that trace (e.g. all the details of an API call): data such as the duration of a connection, the order in which services were called, the milliseconds spent in each service in the call hierarchy, etc.
Apart from functions and API calls, database layer calls are also traced.
You also get JVM stats on CPU usage, heap/non-heap memory used, processes, garbage collector etc.
You can also search all of the raw logs that match a specific trace_id. This comes in handy if you want to find the stack trace that culminated in an error. You can also check the raw logs related to an anomaly or error to examine the minute details, and then click back to the summarized metrics easily.
So, from Metrics to Traces to Logs: that's the level of granular detail you can watch using Spring Boot 3.2's built-in observability. There's more: if you only have the raw logs, you can easily jump to the relevant metrics for that span.
Distributed tracing is a technique used to track and observe application requests as they move from front end to back end, and further through distributed systems or other microservice environments. For example, your shopping cart application calling your bank's payment gateway is a candidate for distributed tracing. Distributed tracing is now built into Spring Boot 3 and Spring Framework 6, so there is no need to manually instrument traces for each of the third-party services that your application uses. Micrometer detects dependencies and orchestrates tracing for these other services too.
Nobody wants to sit in front of a dashboard all day long, waiting for something to happen. So, Spring Boot observability has a built-in system for triggering alerts.
You can easily set alerts and if anything goes beyond a specified threshold value, you will get a notification email/message.
Next, we discuss some internals of the observability system. Each service/module feeds its metrics data into Micrometer, and Micrometer aggregates them to show us analytics.
Each service's log entries are tagged with a unique identifier, the trace_id, which helps in drilling down to details.
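As a rough illustration of how a trace_id lets you drill down, the sketch below filters raw log lines by trace id. The log layout with a `traceId=<id>` token is an assumption made for this example (your actual log pattern may differ), and `TraceLogFilter` is a hypothetical helper, not part of Spring Boot.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class TraceLogFilter {
    // Keep only the lines that carry the given trace id.
    // Assumes each line contains a whitespace-separated "traceId=<id>" token (hypothetical format).
    static List<String> linesForTrace(List<String> logLines, String traceId) {
        String token = "traceId=" + traceId;
        return logLines.stream()
                .filter(line -> Arrays.asList(line.split("\\s+")).contains(token))
                .collect(Collectors.toList());
    }
}
```

Given lines from several requests, `linesForTrace(logs, "abc123")` returns only the entries that belong to the request with trace id `abc123`, in their original order. A log search tool does the same thing at scale.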
Spring now has a built-in Micrometer Observation API, which is used to orchestrate observability. With the Micrometer Observation API, you instrument your code once and get multiple benefits out of it.
This simple code will give you all your traces, spans, and logs.
Internally, Micrometer uses handlers for almost every possible event. The ObservationHandler interface has methods like onStart, onError, onStop, and supportsContext. You can implement these methods to introduce your desired custom behavior.
e.g.
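The handler idea can be sketched without any library dependency. The types below (`SketchHandler`, `SketchObservation`, `RecordingHandler`) are simplified stand-ins invented for this illustration; they only mirror the onStart/onError/onStop lifecycle named above and are not Micrometer's actual `ObservationRegistry` and `ObservationHandler` types, whose signatures differ.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the lifecycle callbacks (not Micrometer's real interface).
interface SketchHandler {
    default void onStart(String name) {}
    default void onError(String name, Throwable error) {}
    default void onStop(String name, long elapsedNanos) {}
}

final class SketchObservation {
    private final List<SketchHandler> handlers = new ArrayList<>();

    SketchObservation register(SketchHandler handler) {
        handlers.add(handler);
        return this;
    }

    // Instrument once: every registered handler sees the same lifecycle events.
    void observe(String name, Runnable work) {
        long start = System.nanoTime();
        handlers.forEach(h -> h.onStart(name));
        try {
            work.run();
        } catch (RuntimeException e) {
            handlers.forEach(h -> h.onError(name, e));
            throw e;
        } finally {
            long elapsed = System.nanoTime() - start;
            handlers.forEach(h -> h.onStop(name, elapsed));
        }
    }
}

// Example custom handler: records the order of the events it receives.
final class RecordingHandler implements SketchHandler {
    final List<String> events = new ArrayList<>();

    @Override public void onStart(String name) { events.add("start:" + name); }
    @Override public void onError(String name, Throwable error) { events.add("error:" + name); }
    @Override public void onStop(String name, long elapsedNanos) { events.add("stop:" + name); }
}
```

The design point is that the instrumented code calls `observe` once, while metrics, tracing, and logging can each be a separate handler registered on the same observation. A failing operation produces start, error, and stop events in that order, and the exception still reaches the caller.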
Once your tracing instrumentation is in place, Micrometer gives you the flexibility to choose your preferred analytics handler, as well as your tracing handlers and logging solution. For example, you could use the OpenTelemetry tracer for your tracing handlers instead of Micrometer's own tracing API.
Unlogged shows statistics on method execution time, such as the average execution time and the standard deviation, when you replay a method. This indicates the latency that your method could introduce into your overall flow.
For multiple replays, Unlogged generates statistics such as the number of calls, the average, and the standard deviation in microseconds. The measurements are fine-grained down to the microsecond, so you can detect even minuscule changes in method performance after your code changes.
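To show what such replay statistics mean, here is a minimal sketch that computes the number of calls, the average, and the standard deviation over a set of timings in microseconds. It is not Unlogged's implementation; `ReplayStats` is a hypothetical helper, and using the population (rather than sample) standard deviation is an assumption.

```java
final class ReplayStats {
    // Number of recorded calls.
    static int count(long[] micros) {
        return micros.length;
    }

    // Arithmetic mean of the timings, in microseconds.
    static double mean(long[] micros) {
        double sum = 0;
        for (long m : micros) {
            sum += m;
        }
        return sum / micros.length;
    }

    // Population standard deviation of the timings, in microseconds (assumed variant).
    static double stdDev(long[] micros) {
        double avg = mean(micros);
        double squaredDiffs = 0;
        for (long m : micros) {
            squaredDiffs += (m - avg) * (m - avg);
        }
        return Math.sqrt(squaredDiffs / micros.length);
    }
}
```

For the timings 2, 4, 4, 4, 5, 5, 7, and 9 microseconds, the average is 5.0 and the standard deviation is 2.0. A growing average after a code change suggests the method got slower, while a growing standard deviation suggests its latency became less predictable.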