Distributed traces are key to accurately observing distributed systems and microservices environments; however, traces must be sampled to keep storage costs manageable. Head-based and tail-based sampling methods have been used for many years, but both have limitations. This is why Dynatrace is researching a new approach, partial trace sampling, to fill the gaps left by those canonical sampling methods.
What is distributed tracing?
The complexity we have reached with distributed systems and microservice architectures makes observability necessary to maintain healthy software. Next to metrics, events, and logs, distributed traces are an essential type of telemetry data that gives you a complete picture of your software environment and performance for end-to-end observability.
Distributed traces track and observe service requests as they flow through distributed systems and go from one service to another. Thanks to trace data, you can understand microservices environments in a way that is not possible manually. You can understand where failures happen and why.
Every “step” in a trace is referred to as a span. The “root span” is the first span in a trace, and a “child span” is any subsequent span.
The more information you can collect and store about your system, the more accurate the analysis’s results will be.
So, why do we need trace sampling?
Since storage and processing resources are not unlimited, sampling enables us to collect as much useful information as possible without storing too much.
But how do you choose which traces to sample?
There are two canonical ways to do this: head-based and tail-based.
Head-Based Trace Sampling
In head-based sampling, the decision is made randomly while the root span is processed. It’s fast and simple to get up and running and has little impact on application performance. For some systems, random sampling can provide sufficient visibility; for more complex systems, however, it can lead to poor coverage. For example, a request that follows a rare path may never be sampled because the decision is made before anything about the trace is known.
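The mechanics described above can be sketched in a few lines. This is a minimal illustration, not a production sampler; the function name and the 10% rate are assumptions for the example.

```python
import random

def head_based_decision(sample_rate: float) -> bool:
    """Decide at the root span, before the request runs, whether to
    record this trace. Every child span inherits the decision, so no
    buffering is needed and the runtime overhead is negligible."""
    return random.random() < sample_rate

# Hypothetical usage: keep roughly 10% of all traces.
keep_trace = head_based_decision(0.10)
```

The trade-off is visible in the code: the decision uses no information about the trace itself, which is exactly why rare paths can be missed.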
Tail-Based Trace Sampling
In tail-based sampling, the decision is made when the request has been completed, and all information about the trace has been collected. This method has the benefit of more intelligent sampling since rarer traces can be collected just as often as more common ones. However, incomplete traces must be buffered on a collector service as the decision can only be made upon completion, causing significant communication and memory overhead.
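The buffering cost mentioned above is easy to see in a sketch. This is a simplified illustration under assumed names; the keep-traces-with-errors policy is just one example rule, and real collectors support many policies.

```python
from collections import defaultdict

class TailSampler:
    """Buffer every span until its trace completes, then decide.
    The buffer is the source of the memory overhead: all spans of
    all in-flight traces must be held until a decision is possible."""

    def __init__(self):
        self.buffer = defaultdict(list)  # trace_id -> list of spans

    def on_span(self, trace_id, span):
        # Every span is stored, sampled or not, until the trace ends.
        self.buffer[trace_id].append(span)

    def on_trace_complete(self, trace_id):
        spans = self.buffer.pop(trace_id)
        # Illustrative policy: keep the trace if any span errored.
        keep = any(s.get("error") for s in spans)
        return spans if keep else None
```

Note that `on_span` must route all spans of one trace to the same sampler instance, which is where the scaling complexity comes from.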
What is the issue with canonical trace sampling methods?
These canonical methods have served observability well for a long time. However, neither is a perfect way of doing the job.
In head-based sampling, you often do not know at the root span whether a trace is rare or common. Because the sampling decision is random, frequently called requests are sampled very often while rarer requests have a lower chance of being captured, so you cannot guarantee high coverage. This also makes it challenging to choose an appropriate sampling rate, since you must consider the data-collection constraints at the backend as well.
As mentioned, tail-based sampling has significant additional network and memory costs because of the required preprocessing. In addition, if the collector service needs to be scalable, additional complexity is introduced as the spans of the same trace have to be routed to the same collector instance. Adding or removing instances may even lead to unintended information loss.
What is the alternative?
To ensure that rarer parts of a trace are sampled as often as more common ones, Dynatrace research proposes a flexible sampling method called “partial trace sampling”.
In partial trace sampling, the sampling rate may vary depending on how frequently a given part of the trace is called.
For example, the head of the trace is sampled less frequently because it is called very often. The deeper parts of the trace that are called less often will be sampled more frequently. This ensures a more balanced sampling of spans across the board, which is less likely to happen with other approaches.
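One way to realize the frequency-dependent rates described above is to shrink an operation's sampling rate as it is observed more often. This is a minimal sketch under assumptions of our own: the class name, the per-operation `target` budget, and the specific decay rule are illustrative, not the method from the research paper.

```python
from collections import Counter

class PartialSampler:
    """Per-span sampling rates: frequently seen operations get a low
    rate, rarely seen ones a rate near 1.0. `target` is an assumed
    budget of spans to keep per operation."""

    def __init__(self, target: int = 100):
        self.target = target
        self.seen = Counter()  # how often each operation was observed

    def rate(self, operation: str) -> float:
        self.seen[operation] += 1
        # The rate decays as an operation is seen more often, so a
        # rarely called deep span is kept almost every time while a
        # hot root span is kept only occasionally.
        return min(1.0, self.target / self.seen[operation])
```

With this rule, a root span observed thousands of times ends up with a small rate, while a deep span observed a handful of times keeps a rate of 1.0, giving the balanced coverage described above.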
Are partially sampled traces useful?
In contrast to head- and tail-based sampling, varying the sampling rate across the spans of the same trace often means that only fragments of a trace are collected, which makes analysis more challenging.
While consistent span sampling maximizes the probability of capturing all spans of a trace by sharing the same random number for all sampling decisions within a trace, many traces are only partially sampled due to the differing span sampling rates. However, the information in incomplete traces is also valuable because many queries do not need the full trace and only consider specific branches. For example, estimating how often one backend service calls another does not require any information about the frontend.
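The shared-random-number idea behind consistent span sampling can be shown in a short sketch. The function name and the example rates are assumptions for illustration; the key property is that one value `r` drives every decision in the trace.

```python
import random

def sample_trace(span_rates: dict) -> dict:
    """Consistent span sampling sketch: draw one random number r per
    trace and reuse it for every span. A span with rate p is kept
    iff r < p, so whenever a low-rate span is kept, every span with
    a higher rate in the same trace is kept as well."""
    r = random.random()  # shared across the whole trace
    return {name: r < p for name, p in span_rates.items()}

# Hypothetical trace: a common frontend span sampled at 10% and a
# rarer backend span sampled at 90%.
decisions = sample_trace({"frontend": 0.1, "backend": 0.9})
```

Because the decisions are nested rather than independent, the sampled spans of a trace form a connected fragment as often as possible, which is exactly what keeps partial traces useful for branch-level queries like the backend-to-backend call frequency mentioned above.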
What is the current state of research?
These ideas have been formulated in our research paper and are being used to define the new OpenTelemetry sampling specification — but it’s still a work in progress. Nonetheless, you can look at the proof of concept published on GitHub.
Looking for answers?
Start a new discussion or ask for help in our Q&A forum.
Go to forum