
What is telemetry data?

Telemetry data refers to the automatic recording and transmission of data from remote or distributed sources to a central IT system for monitoring and analysis. Accordingly, it's pivotal in ensuring IT systems meet service-level targets for availability and performance.

The concept of telemetry dates back to the 19th century, when it described the transmission of meteorological data, such as temperature and pressure readings, over telegraph wires. In the IT world, telemetry data is used to monitor the operational state of equipment and programs.

How telemetry data relates to IT

Telemetry data is a critical element of observability, a discipline that measures a system's internal states by examining its outputs. It's a crucial concept in modern IT and software engineering, particularly when complex, distributed systems and microservices architectures are involved. Most observability systems focus on three types of telemetry: logs, metrics, and traces.

Logs

Logs are detailed, time-stamped records of events within a system or application. These records are essential for understanding the historical sequence of events and are used to monitor, diagnose, and troubleshoot issues. They capture errors, warnings, informational messages, and debugging details, providing a chronological record of a system's activity.

Logs are usually the first telemetry data engineers look at when something goes wrong. They help identify error messages and the sequence of function or method calls that led to an error. Additionally, logs provide insights into the system's operational state, revealing patterns such as frequent errors that might not be apparent in other metrics.
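
To make this concrete, here is a minimal sketch of time-stamped, severity-tagged log records, assuming Python's standard logging module; the service name, messages, and simulated timeout are illustrative rather than taken from any real system.

    import logging

    # Emit time-stamped records with a severity level, logger name, and message.
    logging.basicConfig(
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        level=logging.INFO,
    )
    logger = logging.getLogger("product-service")  # hypothetical service name

    def fetch_product(product_id):
        logger.info("fetching product %s", product_id)
        try:
            # Simulate a failing database call for illustration.
            raise TimeoutError("product database did not respond within 2s")
        except TimeoutError:
            # logger.exception records the error message plus the full stack trace.
            logger.exception("timeout while querying the product database")

    fetch_product("sku-42")

Read in sequence, records like these give the chronological picture of what the service did and where it failed.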

Metrics

Metrics measure a system's state at a point in time. Examples include CPU usage, memory consumption, disk I/O, and network traffic. Metrics are typically collected and stored as time-series data, with each data point carrying a time stamp. They allow teams to continuously monitor system performance and health, and they help identify performance bottlenecks, capacity issues, and other anomalies that could affect stability and user experience.

By analyzing metrics over time, teams can identify trends such as increasing resource usage or error rates to aid in proactive maintenance.

Metrics, such as CPU usage levels, are often used to trigger alerts when a certain threshold is reached. They also provide useful business insights, such as response times to user commands. Often used in conjunction with other forms of telemetry, metrics show what's happening, while logs and traces show why it's happening.
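
As a minimal sketch of metrics as time-series data and threshold-based alerting, the snippet below uses made-up CPU usage samples and an illustrative 90% alert threshold.

    import time

    # Each sample is a (timestamp, value) pair; values are hypothetical CPU usage percentages.
    cpu_usage_series = [
        (time.time() - 120, 42.0),
        (time.time() - 60, 71.5),
        (time.time(), 93.2),
    ]

    CPU_ALERT_THRESHOLD = 90.0  # illustrative alerting threshold

    def breaches(series, threshold):
        """Return the samples that cross the threshold and would trigger an alert."""
        return [(ts, value) for ts, value in series if value >= threshold]

    alerts = breaches(cpu_usage_series, CPU_ALERT_THRESHOLD)
    if alerts:
        print(f"ALERT: CPU usage exceeded {CPU_ALERT_THRESHOLD}% in {len(alerts)} sample(s)")

In practice, a monitoring agent would collect and evaluate samples like these continuously rather than in a one-off script.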

Traces

Traces are detailed records of a request as it travels through the components of a distributed system. They show how different services interact to enable a better understanding of system behavior.

Traces capture a request's lifecycle from entering the system until a result is returned, documenting steps, services, and component interactions. A trace is composed of multiple spans, each representing the actions of a service or component. Distributed tracing collects and correlates spans across services to create an end-to-end view of the request's path. This helps administrators and developers identify performance bottlenecks such as slow services, inefficient database queries, or network latency issues.
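
The span structure can be sketched with a simple data class; the field names and timings below are generic illustrations, not the schema of any particular tracing library.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Span:
        """One unit of work within a trace."""
        trace_id: str
        span_id: str
        parent_span_id: Optional[str]
        service: str
        operation: str
        start_ms: float
        end_ms: float

        @property
        def duration_ms(self) -> float:
            return self.end_ms - self.start_ms

    # Hypothetical spans collected for a single request across three components.
    trace = [
        Span("t1", "s1", None, "frontend", "GET /product/42", 0.0, 480.0),
        Span("t1", "s2", "s1", "product-service", "load product", 20.0, 460.0),
        Span("t1", "s3", "s2", "product-db", "SELECT product details", 30.0, 430.0),
    ]

    # Grouping spans by trace_id and comparing durations points to the bottleneck.
    slowest = max(trace, key=lambda s: s.duration_ms)
    print(f"slowest span: {slowest.service} / {slowest.operation} ({slowest.duration_ms:.0f} ms)")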

When an error or performance issue occurs, traces provide a detailed view of the events leading up to the problem. By tracing user requests, administrators can understand how system performance affects user experience. Traces also help detect developing performance and availability problems for proactive maintenance.

How telemetry data can benefit operations

These three key telemetry elements work together to help administrators solve problems. For example, say an operations team receives an automated alert about slow load times for product pages on the company's website. Metrics reveal that the slowdown started at a specific time and correlates with an increase in the number of concurrent users.

Logs show many timeout errors when the system tries to fetch data from the product database. They also indicate slower-than-usual query execution times. Meanwhile, distributed traces reveal the database service consumes significant time responding to queries about product details. Dependency traces also show the product service calls an external recommendation service, which occasionally has delays.

By correlating telemetry data, the team can quickly identify the root cause of the problem: slow database queries exacerbated by delays in the recommendation service. In response, they optimize the database query by adding indexes or rewriting it. They also implement caching to reduce database reads and add fallback logic to minimize delays caused by the recommendation service.

The result is a significant reduction in response times for product page requests. The outcome is confirmed by log data showing the number of timeout errors decreased and query execution times improved. New traces show database service performance has improved, and the request path is more efficient.
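
A minimal sketch of the correlation step in this scenario might look like the following, assuming the logs and spans have already been collected and parsed; the timestamps, service names, and latency threshold are all illustrative.

    from datetime import datetime, timedelta

    # Hypothetical, already-parsed telemetry records from the incident window.
    logs = [
        {"ts": datetime(2024, 5, 1, 10, 2), "level": "ERROR", "msg": "timeout fetching product data"},
        {"ts": datetime(2024, 5, 1, 10, 3), "level": "ERROR", "msg": "timeout fetching product data"},
    ]
    spans = [
        {"ts": datetime(2024, 5, 1, 10, 2), "service": "product-db", "duration_ms": 1800},
        {"ts": datetime(2024, 5, 1, 10, 2), "service": "recommendation-service", "duration_ms": 950},
    ]

    incident_start = datetime(2024, 5, 1, 10, 0)
    window = timedelta(minutes=10)

    def within_window(record):
        return incident_start <= record["ts"] <= incident_start + window

    # Pull together error logs and slow spans from the same window to narrow down the cause.
    error_logs = [l for l in logs if within_window(l) and l["level"] == "ERROR"]
    slow_spans = [s for s in spans if within_window(s) and s["duration_ms"] > 500]

    print(f"{len(error_logs)} timeout errors; slow services: {[s['service'] for s in slow_spans]}")

Observability platforms automate this kind of cross-signal correlation at much larger scale, but the underlying idea is the same.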

Potential challenges associated with telemetry data

Despite its many benefits, managing telemetry data also brings some challenges.

Data overload. The high data volumes that complex distributed systems generate can create confusion as administrators try to locate a needle in a haystack of telemetry records.

Security vulnerabilities. Telemetry data may inadvertently include sensitive information, such as user credentials and personally identifiable information, and sending it unencrypted over a network exposes that information to interception. Logs and traces can also contain sensitive details about system configurations, user activities, and application logic.

Interoperability issues. Integrating telemetry systems with existing infrastructure — and legacy systems in particular — can be challenging due to incompatible data formats, such as JSON, XML, and proprietary alternatives. Inconsistent data schemas, such as differing field names, types, and structures, can complicate correlating telemetry data across systems.

Cost. Setting up and maintaining telemetry systems can be expensive, especially for small and medium-sized organizations.

The importance of a data lakehouse for storing critical telemetry data

Telemetry data has enabled organizations to deploy and manage distributed systems of unprecedented complexity, and as technology evolves, it will only become more integral to driving the next wave of IT innovation. This makes storing telemetry data securely all the more critical. That’s where a data lakehouse, such as Dynatrace Grail, can help.

A data lakehouse combines the flexibility and cost-efficiency of a data lake with the contextual, high-speed querying capabilities of a data warehouse. Dynatrace Grail is a database designed for observability and security data. It provides a single, unified storage solution for telemetry data, including logs, metrics, traces, and events. All data stored in the Grail data lakehouse is interconnected within a real-time model that reflects the topology and dependencies of the monitored environment.