Securing production environment resilience is top of mind for many organizations in the wake of widespread outages caused by a routine software update in July 2024. Here's how Dynatrace ensures our OneAgent releases protect our customers' IT environments at every stage of the software development lifecycle.
Modern observability and security require comprehensive access to your hosts, processes, services, and applications to monitor system performance, conduct live debugging, and ensure application security protection. This level of access enables advanced capabilities such as runtime instrumentation and detailed diagnostics. While these techniques are powerful, they can pose risks to production environment resilience if not managed properly, as demonstrated by the widespread software outages in July.
At Dynatrace, we’ve implemented a thorough and industry-proven approach to developing OneAgent® that minimizes such risks. Our approach encompasses all stages of the software development lifecycle, focusing on safeguarding OneAgent integration with your systems. Through rigorous testing, dependency management, continuous monitoring, and phased rollouts, we’ve prioritized the development of OneAgent with the highest possible reliability and security standards.
By adhering to these stringent processes, we design OneAgent to operate smoothly and securely, minimizing the likelihood of disruptions and giving you greater confidence in your system’s security.
Dynatrace OneAgent: Quick overview
Dynatrace OneAgent is a unified monitoring solution deployed across customers’ IT environments. It automatically discovers and monitors each host’s applications, services, processes, and infrastructure components. It injects monitoring code into your applications without manual configuration, which provides continuous and comprehensive monitoring and security. OneAgent provides end-to-end visibility, capturing real-time performance data and detailed metrics on CPU, memory, disk, network, and processes.
Safety measures across the entire software lifecycle
From development to rollout and production, we’ve developed safeguards to ensure production environment resilience at each phase of the software development lifecycle, preventing problems in your systems during updates. Let’s dive into the details.
End-to-end safety measures in the software development process
- Separation of concerns: Each OneAgent component is designed to perform only one specific function. Critical and impactful components are minimized and undergo regular detailed reviews. Changes are introduced on a controlled schedule, typically once a week, to reduce the risk of affecting customer systems.
- Dependency reduction: OneAgent development teams minimize the use of third-party or open source dependencies. Any required dependencies are thoroughly tested and fixed to specific versions, minimizing the risk of introducing untested code.
- Rigorous testing: Engineers conduct extensive unit and integration tests on all code changes, covering individual functions and OneAgent performance with real-world applications. These tests are run on all supported operating systems and versions to enhance reliability.
- Hardening phase: Before release, OneAgent undergoes a month-long hardening phase, during which repeated tests are conducted to uncover hidden issues. Our developers double-check these tests daily. The software is also deployed to internal test applications for real-world validation.
- Artifact signing: OneAgent binaries are signed to prevent unauthorized changes. The deployment steps guide you through verifying the signature during installation, and from this point on, any OneAgent updates are verified automatically to ensure the integrity of the OneAgent.
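To make the dependency-reduction point concrete, here is a minimal sketch of a check that every dependency is fixed to one exact version. It is an illustration only, not Dynatrace's tooling: the `name==version` manifest format, the function name `unpinned_dependencies`, and the sample entries are all assumptions made for this example.

```python
import re

# Hypothetical manifest format: one "name==version" entry per line.
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+==\d+(\.\d+)*$")


def unpinned_dependencies(manifest: str) -> list[str]:
    """Return manifest entries that are not fixed to one exact version."""
    problems = []
    for raw in manifest.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not PINNED.match(line):
            problems.append(line)
    return problems


# Illustrative entries only; not an actual OneAgent dependency list.
manifest = """
zlib==1.3.1
openssl>=3.0   # floating range, flagged
libcurl        # no version at all, flagged
"""
print(unpinned_dependencies(manifest))  # ['openssl>=3.0', 'libcurl']
```

A check like this could run in a build pipeline so that a floating version range fails the build before any untested code is pulled in.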
Reduced likelihood of failures in the OneAgent rollout process
- Pre-rollout check: After the hardening phase, we thoroughly review each new OneAgent version with all teams to identify any known issues or concerns. We only proceed with a phased rollout when confident in a release.
- Phased and controlled rollout: Each rollout is carefully staged, starting with internal environments, then moving to proofs-of-concept (POCs), trials, new customer environments, and finally to the broader customer base. At each stage, we monitor OneAgent to confirm that it behaves and functions as expected.
- Customer control: You can manually update OneAgent versions, prioritizing updates for critical applications or host groups while enabling auto-updates for less critical areas. You can also schedule updates during maintenance windows to minimize disruption.
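The staged rollout described in this list can be sketched as a simple gate: a version moves to the next stage only if the current stage looks healthy. This is a simplified illustration under stated assumptions, not the actual rollout system. The stage identifiers mirror the order given above, while `run_phased_rollout`, `demo_check`, and the version string are hypothetical names introduced for this example.

```python
from typing import Callable

# Stage order taken from the rollout description above.
STAGES = ["internal", "poc", "trial", "new_customers", "all_customers"]


def run_phased_rollout(
    version: str, is_healthy: Callable[[str, str], bool]
) -> list[str]:
    """Promote a version stage by stage, halting at the first unhealthy stage.

    Returns the stages that completed successfully.
    """
    completed = []
    for stage in STAGES:
        if not is_healthy(version, stage):
            break  # stop here; later stages never receive this version
        completed.append(stage)
    return completed


# Hypothetical health check: pretend a problem surfaces in the trial stage.
def demo_check(version: str, stage: str) -> bool:
    return stage != "trial"


print(run_phased_rollout("1.0.0", demo_check))  # ['internal', 'poc']
```

The point of the design is that the broadest audience comes last, so a problem found in an early stage never reaches the wider customer base.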
Stay ahead of potential failures with production self-monitoring
- 24/7 fully automated monitoring: We continuously monitor critical statistics, such as the number of connected OneAgents per technology, to swiftly address any significant issues. We also collect and analyze warning and severe log events to proactively address potential problems before they escalate into incidents.
- Real-time health insights: Dynatrace provides immediate insights into your environment, helping you quickly identify the root cause of any issues. This enables you to clearly understand the health of your OneAgent deployment across your entire environment.
- Automated issue collection: Details of potential issues are automatically collected and can be sent to Dynatrace for further analysis. Additionally, manual diagnostics can be performed to gather more specific information if needed.
- Proactive analysis and auto-remediation: When a problem is reported, regardless of its severity, we assess its potential impact on our customers’ environments and take appropriate action as needed. This might include adding safety checks or rolling out an updated version to address the issue.
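As a rough illustration of watching a statistic such as the number of connected OneAgents per technology, the sketch below compares two snapshots and flags any technology whose count fell sharply. The 20% threshold, the function name `significant_drops`, and the sample numbers are assumptions chosen for this example, not values used by Dynatrace.

```python
def significant_drops(
    previous: dict[str, int], current: dict[str, int], max_drop: float = 0.2
) -> dict[str, float]:
    """Return technologies whose connected-agent count fell by more than max_drop."""
    alerts = {}
    for tech, before in previous.items():
        if before == 0:
            continue  # no baseline to compare against
        after = current.get(tech, 0)
        drop = (before - after) / before
        if drop > max_drop:
            alerts[tech] = round(drop, 2)
    return alerts


# Illustrative snapshots of connected agents per technology.
previous = {"java": 1000, "nodejs": 400, "dotnet": 250}
current = {"java": 990, "nodejs": 280, "dotnet": 250}
print(significant_drops(previous, current))  # {'nodejs': 0.3}
```

Comparing relative drops rather than absolute counts keeps the same threshold meaningful for both widely and sparsely deployed technologies.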
Ensure system stability with our zero-impact policy
Dynatrace OneAgent is built with a focus on reliability and security, aiming to keep your systems stable and protected. We proactively approach potential risks, implementing rigorous standards across the entire software development lifecycle and maintaining continuous, real-time monitoring. This comprehensive strategy is designed to reduce the likelihood of disruptions, offering you greater confidence in the safety and resilience of your IT environment.
While no system is entirely free from potential risks, our approach minimizes such risks and provides robust protection for your systems, helping to ensure smoother and more reliable operation.
For complete details about Dynatrace OneAgent, go to Dynatrace Documentation.
Interested in innovating along with us? Explore the OneAgent careers page.
Gain insights into how Dynatrace developers develop observability code, continuously test across all supported environments, and avoid situations like the CrowdStrike software outages in July.