$ cat post/chmod-seven-seven-seven-/-i-diff-the-past-against-now-/-the-pipeline-knows.md
chmod seven seven seven / I diff the past against now / the pipeline knows
Title: September 28, 2020 - A Platform Engineer’s Perspective
September 28, 2020. I woke up to the usual stream of emails from my DevOps and SRE teams, each with its own flavor of complexity. Today felt like a good day to reflect on what it means to be a platform engineer in this era.
The tech world was buzzing around me. Instagram’s bizarre passport number incident had made headlines, as had Nvidia’s announced acquisition of Arm and the passing of Supreme Court Justice Ruth Bader Ginsburg. But for me, today was about digging into a thorny problem that had been bugging our development teams for weeks: performance issues in our internal developer portal.
The Problem
Our platform engineering team had recently launched Backstage, an internal developer portal that aimed to unify and streamline the way developers interact with our services. It was like a central hub where they could find documentation, track bugs, view release notes, and even spin up environments—all in one place. However, as usage increased, we started to notice some unexpected hiccups.
One of my team members, Alex, walked into my office looking concerned. “Hey Brandon,” he said, “we’ve got an issue with the performance metrics. Some users are reporting slowdowns, and we can’t figure out where the bottleneck is.”
I took a look at the dashboard and saw some red flags—CPU usage was spiking, and there were a few services showing high latency. I knew it was time to roll up my sleeves.
The Investigation
The first step was to dig into the logs. Alex had already done some basic filtering, but we needed more context. We decided to use kubectl to get an overview of the containerized applications running on our Kubernetes cluster. With a few commands, I quickly saw that one service, backstage-backend, was hogging resources.
I started by looking at CPU and memory usage with kubectl top pod, but those numbers didn’t tell us everything we needed to know. We needed deeper insight into what these services were actually doing. That’s where eBPF came in handy: it lets you attach small, sandboxed programs to kernel hooks, so you can observe a running process without changing its code or restarting it.
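The triage above boils down to a few kubectl commands. A sketch of the session, where the namespace and resource names are assumptions rather than our actual cluster layout:

```shell
# List pods sorted by CPU to find the resource hog (requires metrics-server).
kubectl top pod -n developer-portal --sort-by=cpu

# Check recent events, restart counts, and resource limits on the suspect.
kubectl describe deploy/backstage-backend -n developer-portal

# Tail recent logs for slow-query or timeout warnings.
kubectl logs deploy/backstage-backend -n developer-portal --since=15m | grep -iE 'slow|timeout'
```

kubectl top only confirms that a pod is busy, not why; that gap is what pushed us toward kernel-level tracing.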
I dove into tracing with bpftrace, a frontend that compiles short scripts into eBPF probes, focusing on the backstage-backend service. The output was overwhelming at first, but after a bit of filtering we spotted a pattern in the stack traces: frequent, repetitive database queries were causing the delays.
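A bpftrace one-liner along these lines, assuming the backend runs as a node process (bpftrace needs root and a reasonably recent kernel):

```shell
# Sample user-space stacks at 99 Hz for the Node.js backend process.
# Hot stacks ending in database-driver frames point at query churn.
sudo bpftrace -e 'profile:hz:99 /comm == "node"/ { @[ustack] = count(); }'
```

Press Ctrl-C to stop sampling; bpftrace then prints each distinct stack with its hit count, so the most frequent code paths stand out.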
The Solution
We decided to optimize the backend logic to reduce the number of unnecessary database calls. This involved refactoring some of our services and improving their caching mechanisms. We also looked at ways to distribute the load more evenly across our cluster by tweaking Kubernetes pod specifications.
One of the most rewarding parts was seeing the immediate impact of these changes. The performance metrics started to normalize, and the user reports of slowdowns dwindled. It felt great to have made a tangible difference in such a short time.
Reflection
As I sat back after resolving this issue, I couldn’t help but think about how far platform engineering had come. Just a few years ago, dealing with these kinds of issues would have been more fragmented and less cohesive. With tools like eBPF, Argo CD, and Flux, we now have the means to diagnose and fix problems faster and more effectively.
But as the era of platform engineering continues to evolve, so do our challenges. The complexity of managing a Kubernetes cluster is real, and there’s always room for improvement. The recent popularity of eBPF, while exciting, also comes with its own learning curve.
Looking Ahead
As I look ahead, I’m excited about the future of platform engineering. Tools like Argo CD and Flux are maturing, making our infrastructure more resilient and easier to manage. However, we still face the challenge of keeping up with the rapid pace of change in technology.
One thing is certain—my job won’t get any less interesting. The tech landscape continues to evolve, and so do the problems we need to solve. But that’s what keeps it fun!
This day was just one piece of a much larger puzzle, but it felt like a good moment to take stock. As platform engineers, our work is never done, but that’s part of what makes it so fulfilling.
Until next time, keep pushing the boundaries and solving those tricky problems.