What are golden signals?

Golden signals are four key metrics — latency, traffic, errors, and saturation — for monitoring IT systems' health and performance. They offer a vital framework for efficient ITOps monitoring and management.

These signals are crucial for delivering high-performing software solutions. They combine to swiftly identify IT issues, maintain system reliability, and ensure a consistent, positive user experience.

The four golden signals and how they work

Golden metrics contribute to effective IT observability because they provide a clear and actionable view of system performance and health. By continuously monitoring the golden signals, ITOps teams gain valuable insights into the overall health of their IT systems and can take proactive measures to ensure reliability and performance.

Here's a closer look at how each of the four golden signals works.

1. Latency

Latency measures the time taken for a system to process a request. This includes the time from when a request is received until the response is sent back.

How latency works

Monitoring latency involves tracking the response time of requests. This can be done at various levels, such as application-level response time, database query time, or network latency. Tools and monitoring systems record these timings and often calculate percentiles (e.g., 95th percentile latency) to understand both typical performance and the slow tail that averages hide.
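
As a minimal sketch of the percentile calculation described above, the following Python snippet computes nearest-rank percentiles over a small list of response times. The percentile helper and the sample values are illustrative assumptions, not the output of any particular monitoring tool. Production systems often estimate percentiles from histograms over streaming data instead of sorting raw samples.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical response times in milliseconds for ten requests.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 900]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f"p50={p50} ms, p95={p95} ms")
```

With these sample values, the median stays at 13 ms while the 95th percentile lands on the 900 ms outlier, which is the kind of tail behavior an average alone would blur.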

How latency affects performance

High latency, also called lag, can reveal performance bottlenecks, resource contention, or inefficient processing. Low latency helps applications run faster and more smoothly.

2. Traffic

Traffic measures the volume of requests or transactions that the system is handling. It can be quantified in terms of requests per second, transactions per second, or throughput.

How traffic works

Traffic monitoring involves counting the number of incoming requests or operations over time. This data can be visualized through graphs or histograms to detect patterns, spikes, or drops.
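
Counting requests over time can be sketched as bucketing arrival timestamps into one-second windows. The requests_per_second function and the arrival times below are hypothetical and only illustrate the idea. The sketch assumes non-negative timestamps measured in seconds.

```python
from collections import Counter

def requests_per_second(timestamps):
    """Count requests in each one-second bucket, keyed by the whole second."""
    # int() truncates, so 1.2 and 1.3 both fall into bucket 1.
    return Counter(int(ts) for ts in timestamps)

# Hypothetical request arrival times, in seconds since the start of a window.
arrivals = [0.1, 0.4, 0.9, 1.2, 1.3, 3.5]

rps = requests_per_second(arrivals)
# A Counter returns 0 for missing keys, so idle seconds show up as 0.
series = [rps[second] for second in range(4)]
print(series)
```

The resulting series is [3, 2, 0, 1]. The zero in the third position is the kind of sudden drop that a traffic graph makes visible.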

How traffic affects performance

Monitoring traffic is crucial for spotting high traffic levels that can strain the system and degrade performance. Conversely, a sudden decrease in traffic may signal a system issue or a problem affecting users.

3. Errors

The errors signal tracks the number or rate of failed requests in a system, such as HTTP 500 errors, timeouts, and other application-specific failures.

How errors work

Error tracking involves logging and monitoring error events. Metrics are collected on the number and types of errors and can be aggregated over time or filtered by error type.
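
The aggregation described above might look like the following sketch, which derives an error rate and a per-type count from a list of HTTP status codes. Treating only 5xx responses as failures is an assumption made for this example, and the error_summary function and its sample data are hypothetical. Timeouts and application-specific failures would need their own signals.

```python
from collections import Counter

def error_summary(status_codes):
    """Return (error_rate, counts_by_code), treating HTTP 5xx as failures."""
    failures = [code for code in status_codes if 500 <= code <= 599]
    rate = len(failures) / len(status_codes) if status_codes else 0.0
    return rate, Counter(failures)

# Hypothetical status codes for ten requests.
statuses = [200, 200, 500, 200, 503, 200, 200, 500, 200, 200]

rate, by_code = error_summary(statuses)
print(f"error rate: {rate:.0%}")
print(dict(by_code))
```

For this sample, three of the ten requests failed, so the error rate is 30%, with two 500 responses and one 503.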

How errors affect performance

Monitoring errors helps identify and address problems that could degrade service reliability or user satisfaction. A high error rate often signifies bugs, misconfigurations, or system faults.

4. Saturation

Saturation assesses how much of the system's resources (such as CPU, memory, disk, or network) are in use. It indicates how close the system is to reaching its capacity limits.

How saturation works

Saturation metrics are collected by system performance counters or by monitoring tools that track resource usage. These metrics are often visualized in dashboards to provide insights into resource utilization.
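
As a rough illustration, saturation can be expressed as the fraction of each resource's capacity that is in use. The readings, the saturated_resources helper, and the 80% warning threshold below are all assumptions chosen for this sketch. In practice the numbers would come from operating system counters or a monitoring agent.

```python
def utilization(used, capacity):
    """Fraction of a resource in use, from 0.0 (idle) to 1.0 (full)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return used / capacity

def saturated_resources(readings, threshold=0.8):
    """Names of resources whose utilization is at or above the threshold."""
    return sorted(
        name
        for name, (used, capacity) in readings.items()
        if utilization(used, capacity) >= threshold
    )

# Hypothetical (used, capacity) readings: CPU cores, memory in GB, disk in GB.
readings = {
    "cpu": (3.4, 4.0),
    "memory": (6.0, 16.0),
    "disk": (450, 500),
}

print(saturated_resources(readings))
```

With these readings, CPU (85%) and disk (90%) cross the assumed threshold, while memory (37.5%) still has headroom.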

How saturation affects performance

Monitoring saturation helps in scaling resources or optimizing usage to prevent overload. High saturation levels suggest resources are heavily used, which can lead to performance degradation or system failure if limits are reached.

Why are the four golden signals important?

The four golden signals provide a streamlined approach to monitoring IT systems' health, performance, and reliability. A spike in latency or errors can indicate a problem, and efficient troubleshooting enables issues to be diagnosed and resolved quickly. Proactive management allows ITOps to anticipate and address issues before they impact users.

Golden signals are also critical for ITOps and other teams for the following reasons:

Performance and reliability

These signals offer a high-level view of a system's health, which helps ITOps teams quickly assess the overall state of IT infrastructure and services. Continuously monitoring latency, traffic, errors, and saturation allows ITOps to maintain a balance between performance and resource utilization.

Informed decision making

The four golden signals offer actionable insights that can inform scaling, capacity planning, and resource allocation decisions. For example, high traffic combined with high latency might suggest the need for additional resources or optimizations.

User experience

Problems surfaced by the golden signals, such as high latency or frequent errors, directly affect how users interact with your service. Keeping these metrics in check ensures a smoother and more reliable user experience.

Service-level objectives (SLOs) and agreements

The four golden signals provide the necessary metrics to ensure a service meets its performance and reliability targets. Meeting SLOs and honoring service-level agreements (SLAs) is crucial for maintaining customer trust and satisfaction.
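
To make the link between the signals and an SLO concrete, the sketch below checks an availability target against request and error counts. The 99.9% target, the slo_report function, and the request numbers are hypothetical. Real SLOs are defined per service and are usually evaluated over a rolling time window.

```python
def slo_report(total_requests, failed_requests, target=0.999):
    """Return (availability, target_met) for a simple availability SLO."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    availability = (total_requests - failed_requests) / total_requests
    return availability, availability >= target

# Hypothetical counts from the traffic and errors signals for one window.
availability, met = slo_report(total_requests=100_000, failed_requests=50)
print(f"availability={availability:.4%}, target met: {met}")

# The same traffic with more failures misses the assumed 99.9% target.
_, met_after_incident = slo_report(total_requests=100_000, failed_requests=200)
print(f"target met after incident: {met_after_incident}")
```

In this example, 50 failures out of 100,000 requests gives 99.95% availability and meets the target, while 200 failures gives 99.8% and does not.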

Limitations of the four golden signals

While they are invaluable for monitoring IT systems' health and performance, relying solely on the four golden signals can have several drawbacks.

The four golden signals offer a limited scope that may exclude application-specific metrics, and they can leave contextual blind spots that lead to misinterpretation of metrics. Threshold-setting challenges can also result in alert fatigue in highly variable environments where what constitutes "normal" can change frequently.
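
The threshold-setting problem can be illustrated by comparing a fixed limit with one derived from recent history. The sketch below uses the mean plus three standard deviations as a baseline, which is one common heuristic and an assumption here, not a method the golden signals prescribe. The traffic samples, the function names, and the static limit of 120 are hypothetical.

```python
import statistics

def dynamic_threshold(history, k=3.0):
    """Mean of recent values plus k population standard deviations."""
    if not history:
        raise ValueError("history must not be empty")
    return statistics.mean(history) + k * statistics.pstdev(history)

def is_anomalous(history, value, k=3.0):
    """True when value exceeds the baseline derived from history."""
    return value > dynamic_threshold(history, k)

# Hypothetical requests-per-second samples for two periods of the day.
daytime_rps = [100, 110, 90, 105, 95]
nighttime_rps = [10, 12, 8, 11, 9]

STATIC_LIMIT = 120  # assumed fixed alert threshold, for comparison

day_alert = is_anomalous(daytime_rps, 118)
night_alert = is_anomalous(nighttime_rps, 20)
static_night_alert = 20 > STATIC_LIMIT
print(day_alert, night_alert, static_night_alert)
```

A value of 118 is within the normal daytime range, so it raises no alert. A value of 20 at night is double the usual load and is flagged by the history-based baseline, yet the fixed limit of 120 would never notice it.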

Additional limitations of the four golden signals include the following:

Complexity in distributed systems

It can be difficult to correlate these signals across multiple distributed services and components. This complicates the troubleshooting process and can lead to loss of granularity or important details when aggregating these signals.

Deep focus on operational metrics

The four golden signals emphasize operational aspects but may not address other concerns, such as security, compliance, or key performance indicators related to business outcomes.

Reactive in nature

Golden metrics often highlight issues after they've started to impact the system (i.e., detection lag). They may not provide early warnings or predictive insights to prevent issues before they affect users.

Potential for misleading indicators

The four golden signals can lead to false positives or negatives. For example, a traffic increase might not be an issue if the system is designed to handle high loads. Alternatively, a single metric viewed in isolation, such as healthy-looking latency, might mask underlying issues such as high error rates or resource saturation.
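
One way a single signal can mislead is when failed requests return quickly and pull the overall latency down. The following sketch, built on hypothetical request data and a hypothetical summarize helper, compares latency across all requests with latency for successful requests only. It assumes that responses with a status below 500 count as successful and that at least one request succeeded.

```python
import statistics

def summarize(requests):
    """Summarize a list of (http_status, latency_ms) tuples."""
    all_latencies = [ms for _, ms in requests]
    ok_latencies = [ms for status, ms in requests if status < 500]
    failed = len(requests) - len(ok_latencies)
    return {
        "mean_latency_ms": statistics.mean(all_latencies),
        "mean_ok_latency_ms": statistics.mean(ok_latencies),
        "error_rate": failed / len(requests),
    }

# Hypothetical sample: failures return in a few milliseconds.
requests = [
    (200, 100), (200, 110), (500, 5),
    (500, 4), (200, 105), (500, 6),
]

summary = summarize(requests)
print(summary)
```

Here the overall mean latency is 55 ms, which looks healthy, but successful requests average 105 ms and half of all requests fail. Reading latency together with the error rate exposes the problem.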

Resource intensive

Golden metrics require instrumentation for accurate data collection and monitoring, which can introduce overhead and complexity. These signals can also generate a large volume of data, requiring efficient storage and analysis mechanisms.

Neglecting low-level details

The four golden signals might not capture detailed performance metrics, such as database query performance or memory usage, which require logs and traces for deeper insight.

Interdependencies and integration

Golden metrics may not capture issues arising from dependencies between services or from third-party integrations, and they may not provide end-to-end visibility.

End-to-end observability and monitoring with Dynatrace

As cloud-native and hybrid cloud environments have become increasingly complex, root causes of performance issues are becoming more difficult to pinpoint and resolve quickly. The four golden signals' limitations underscore the importance of using a broader set of metrics for end-to-end observability and monitoring. This includes additional metrics, logs, traces, and contextual information to achieve a more comprehensive and accurate picture of IT system health and performance.

The Dynatrace platform enables organizations to take a unified approach to observability. With end-to-end observability, security, and business analytics in one platform, organizations can resolve issues quickly and proactively at scale. The platform combines predictive, causal, and generative AI capabilities with automation to deliver root-cause analysis, context-aware insights, and predictive operations to streamline productivity.

Explore how your organization can benefit from enhancing its operations and user experience with the Dynatrace Observability content hub.