
A Kubernetes platform engineering strategy tames Kubernetes complexity

With Kubernetes, it’s easy for organizations to miss the forest for the trees. Although Kubernetes has become an industry standard, its complexity can cause major headaches as organizations scale. This complexity prompted PicPay, Latin America’s largest digital wallet and now a bank, to adopt a Kubernetes platform engineering approach with unified observability and security at its center.

After a decade of helping companies manage container orchestration, Kubernetes, the open source container platform, has established itself as a mature enterprise technology. According to the Cloud Native Computing Foundation (CNCF), 84% of organizations are using or evaluating Kubernetes, up from 81% in 2022.

But challenges remain when it comes to Kubernetes complexity. The average deployment now spans 20 clusters running 10 or more software elements across clouds and data centers. In fact, 76% of technology leaders say the dynamic nature of Kubernetes makes it more difficult to maintain visibility of their infrastructure compared with traditional technology stacks.

With Kubernetes, it’s easy for many organizations to miss the forest for the trees. While its open source attributes have helped establish it as the industry standard for deploying, managing, and scaling container-based applications, its complexity rapidly increases as Kubernetes deployments begin to scale.

This was the case for PicPay, a financial services app in Brazil. I spoke with Martin Spier, PicPay’s VP of Engineering, about the challenges PicPay experienced and the Kubernetes platform engineering strategy his team adopted in response.

Overcoming Kubernetes complexity

With more than 35 million active users, PicPay is experiencing substantial growth: In 2023, its total payment volume jumped 40% to R$271 billion, while revenue reached a record R$3.5 billion.

The company receives tens of thousands of requests per second on its edge layer and sees hundreds of millions of events per hour on its analytics layer. To manage this data, PicPay has more than a hundred Kubernetes clusters running tens of thousands of pods and over a thousand microservices. “We ended up with allegedly the largest cluster in Latin America,” Spier said, “which isn’t a great thing.”

Moreover, observability of the company’s increasingly complex Kubernetes environment was lagging. “Our development teams relied heavily on logs to understand what was going on with our systems,” he said. This created problems with both visibility and scalability. Different teams were using different solutions to achieve the same results, creating a fragmented IT stack. In addition, the logs-heavy approach to analysis made scaling processes complex and costly. By over-rotating on log analysis, Spier and his team were missing the value, cost savings, and productivity that come from having metrics, traces, and logs all in one place and in context.

To address these problems, Spier and his team needed to simplify. “We decided to break up the big cluster into smaller ones and create a standardization to provide that managed infrastructure for everyone,” Spier said. The company’s goal was to standardize observability and prevent common problems, such as Java applications or pods running out of memory, and users requesting resources and then either barely using them or using 100% of them.
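To make the resource problem concrete, here is a minimal sketch of how a platform team might flag pods that barely use what they request or that run at the limit of it. It is an illustration only, not PicPay's or Dynatrace's method: the `PodUsage` type, the `classify` function, the thresholds, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
UNDERUSED_BELOW = 0.20   # pod uses less than 20% of its memory request
SATURATED_ABOVE = 0.95   # pod uses 95% or more of its memory request


@dataclass
class PodUsage:
    name: str
    memory_request_mib: float  # memory the pod asked for
    memory_used_mib: float     # memory the pod was observed using


def classify(pod: PodUsage) -> str:
    """Label a pod by the share of its requested memory it actually uses."""
    if pod.memory_request_mib <= 0:
        return "no-request"
    ratio = pod.memory_used_mib / pod.memory_request_mib
    if ratio < UNDERUSED_BELOW:
        return "underused"
    if ratio >= SATURATED_ABOVE:
        return "saturated"
    return "ok"


# Made-up sample data, not real measurements.
pods = [
    PodUsage("pod-a", 2048, 150),
    PodUsage("pod-b", 1024, 1020),
    PodUsage("pod-c", 512, 300),
]

report = {pod.name: classify(pod) for pod in pods}
for name, label in report.items():
    print(f"{name}: {label}")
```

In practice the usage figures would come from whatever metrics source the platform standardizes on, which is one reason a single, shared observability layer makes this kind of check easier to apply across every team.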

Taking a strategic Kubernetes platform engineering approach

Spier noted that keeping Kubernetes simple requires a strategic approach. He pointed to the shift from DevOps to platform engineering, or as he calls it, Foundation Engineering.

“Software is built in layers,” Spier explained. “And these layers tend to be similar. But if every team is left to define their own tools and stacks, you end up duplicating a lot of the work. Platform engineering looks to bring in a unified toolset.”

Spier also noted that trying to support all possible use cases is another common pitfall in navigating Kubernetes complexity. “For example, if most teams run Java, it might not make sense trying to support an outlier. Instead, you’re better off creating the best solution for the common use case.”

Ultimately, the sooner companies start controlling complexity using a standardized Kubernetes platform engineering approach, the better. “Start before you have multiple, competing platforms and you have to go through a really unpleasant migration,” Spier suggested.

Unified observability is a team sport for taming Kubernetes complexity

To implement its Kubernetes platform engineering strategy, PicPay needed observability of the big picture. The team also needed to integrate the value and context of metrics and traces into its log monitoring scheme in a single place. To achieve this, Spier and his team turned to Dynatrace.

PicPay’s partnership with Dynatrace has enabled Spier and his team to accelerate their platform engineering efforts, which has resulted in several key benefits, including the following:

  • Metrics, traces, and logs in one place. The Dynatrace platform integrates all data in one place and in context.
  • Immediate entry. Dynatrace supports most tools and languages out of the box.
  • Automated notifications. The solution offers automatic alerts and the ability to create alert baselines.
  • Ease of use. All relevant data is in the same place under a single control plane with a unified view.
  • Complete visibility for Kubernetes. Since all PicPay workloads run on Kubernetes, the company can detect issues before they happen.
  • Cost efficiency. By consolidating many tools into a single platform solution, PicPay saves on costs and maintenance, preserving developer time for innovation.

To hear my full conversation with Martin Spier from PicPay, watch the customer story: Taming K8s Complexity.