$ cat post/a-ticket-unopened-/-i-wrote-it-and-forgot-why-/-uptime-was-the-proof.md
a ticket unopened / I wrote it and forgot why / uptime was the proof
Title: February 21, 2022: A Day in the Life of a Platform Engineer
It was another chilly February morning when I woke up and logged into my terminal. The world outside was already buzzing with news: headlines about Google's search dominance supposedly faltering, and Russia recognizing breakaway regions in eastern Ukraine, a full invasion looking imminent. But for me, as a platform engineer, the day began much the same way as every other: a few cups of coffee and a deep dive into the technical weeds.
Today, I was working on improving our service mesh capabilities. We were moving toward Istio as the backbone of our microservices architecture, which meant a lot of time spent configuring sidecars, managing traffic splits, and optimizing latency. Our setup handled petabytes of traffic each day, so every tweak could have significant implications.
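To give a flavor of the traffic-split work, here's a minimal Python sketch that emits the kind of Istio VirtualService we were dealing with. The `checkout` host, subset names, and weights are all placeholders, not our actual config:

```python
# Minimal sketch of a weighted Istio traffic split.
# Host and subset names are hypothetical placeholders.
import yaml  # PyYAML

def virtual_service(host: str, stable_weight: int, canary_weight: int) -> dict:
    """Build an Istio VirtualService that splits traffic between two subsets."""
    assert stable_weight + canary_weight == 100, "weights must sum to 100"
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": host},
        "spec": {
            "hosts": [host],
            "http": [{
                "route": [
                    {"destination": {"host": host, "subset": "stable"},
                     "weight": stable_weight},
                    {"destination": {"host": host, "subset": "canary"},
                     "weight": canary_weight},
                ],
            }],
        },
    }

if __name__ == "__main__":
    # Send 5% of traffic to the canary subset of a hypothetical service.
    print(yaml.safe_dump(virtual_service("checkout", 95, 5), sort_keys=False))
```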
One of my tasks was to debug a strange issue where some services were seeing higher latencies in certain regions. I started by reviewing the logs, but they offered no clear clues. It wasn't until I pulled out strace and perf that I finally saw it: an excessive number of system calls, traced back to a misconfigured sidecar proxy.
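The strace step looked roughly like the sketch below: attach to the sidecar process, sample syscalls for a few seconds, then read the per-syscall summary. The PID and sampling window are placeholders, and attaching requires ptrace privileges on the node:

```python
# Rough sketch: sample a process's system calls with `strace -c`.
# strace prints its summary table to stderr when it detaches on SIGINT.
import signal
import subprocess
import time

def sample_syscalls(pid: int, seconds: float = 5.0) -> str:
    """Attach `strace -c -f` to a PID for a while and return the summary."""
    proc = subprocess.Popen(
        ["strace", "-c", "-f", "-p", str(pid)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
    )
    time.sleep(seconds)
    proc.send_signal(signal.SIGINT)  # strace detaches and prints its table
    _, summary = proc.communicate(timeout=10)
    return summary

if __name__ == "__main__":
    print(sample_syscalls(12345))  # hypothetical sidecar PID
```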
I quickly fixed the issue and deployed the changes, hoping for a smooth ride. But fate had other plans. The change caused some unexpected behavior in our load balancers. After a few rounds of debugging, I found that the health checks weren't properly configured: they were timing out too frequently, marking healthy backends as unhealthy and triggering unnecessary rerouting of traffic.
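The lesson distilled into a simple rule: if a probe's timeout sits below the endpoint's real tail latency, healthy backends will flap. Here's a small sanity-check sketch encoding that reasoning; the thresholds are rules of thumb I'd use, not anything from a load balancer's docs:

```python
# Sanity-check a health check config against observed tail latency.
# Thresholds below are illustrative rules of thumb, not vendor guidance.
from dataclasses import dataclass

@dataclass
class HealthCheck:
    timeout_s: float          # how long a single probe may take
    interval_s: float         # how often probes fire
    unhealthy_threshold: int  # consecutive failures before ejection

def lint_health_check(hc: HealthCheck, p99_latency_s: float) -> list[str]:
    problems = []
    if hc.timeout_s <= p99_latency_s:
        problems.append(
            f"timeout {hc.timeout_s}s <= p99 latency {p99_latency_s}s: "
            "healthy backends will intermittently fail probes"
        )
    if hc.timeout_s >= hc.interval_s:
        problems.append("timeout should be shorter than the probe interval")
    if hc.unhealthy_threshold < 3:
        problems.append("threshold < 3 lets a single slow probe eject a backend")
    return problems

if __name__ == "__main__":
    # Hypothetical misconfiguration: a 1s timeout against a ~1.2s p99.
    for warning in lint_health_check(HealthCheck(1.0, 5.0, 2), p99_latency_s=1.2):
        print("WARN:", warning)
```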
The incident led to an interesting argument with my team lead about how we handle these kinds of issues. Should we push risky changes during peak hours or wait for a maintenance window? In the end, we scheduled the rollback and reconfiguration for the next night, ensuring minimal impact on our users.
The day went on, and I found myself spending more time than usual on calls about FinOps and cloud cost pressure. We're moving toward a more automated approach to managing costs, using tools like AWS Cost Explorer and Google Cloud's billing API to track and optimize spend. It was sobering to see how much our infrastructure had grown over the years, and how much we needed these kinds of metrics to stay lean.
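For the AWS side, the automation boils down to queries like the sketch below against the Cost Explorer API via boto3. The date range and the per-service grouping are illustrative choices, not our actual pipeline:

```python
# Minimal sketch: pull spend grouped by AWS service from Cost Explorer.
# Dates and grouping are illustrative; requires ce:GetCostAndUsage permission.
import boto3

def spend_by_service(start: str, end: str) -> dict[str, float]:
    """Return cost per AWS service over [start, end), dates as YYYY-MM-DD."""
    ce = boto3.client("ce")
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    totals: dict[str, float] = {}
    for period in resp["ResultsByTime"]:
        for group in period["Groups"]:
            service = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[service] = totals.get(service, 0.0) + amount
    return totals

if __name__ == "__main__":
    top = sorted(spend_by_service("2022-01-01", "2022-02-01").items(),
                 key=lambda kv: -kv[1])[:10]
    for service, cost in top:
        print(f"{service:40s} ${cost:,.2f}")
```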
In the evening, I attended a DORA (DevOps Research and Assessment) metrics workshop. We discussed metrics like Lead Time, Deployment Frequency, and Change Failure Rate. These are standard in our industry by now, but it still felt good to see them validated by data. The discussion sparked some heated debate about whether continuous integration alone was enough or whether we needed to invest more in automated testing and monitoring.
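A back-of-the-envelope version of those metrics fits in a few lines, computed here from a hypothetical list of deployment records using the usual DORA framing: lead time from commit to deploy, deploys per day, and change failure rate as the share of deploys needing remediation:

```python
# Back-of-the-envelope DORA metrics over hypothetical deployment records.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Deploy:
    committed_at: datetime
    deployed_at: datetime
    failed: bool  # needed a rollback, hotfix, or patch

def dora_metrics(deploys: list[Deploy], window_days: int):
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    median_lead = sorted(lead_times)[len(lead_times) // 2]
    frequency = len(deploys) / window_days  # deploys per day
    failure_rate = sum(d.failed for d in deploys) / len(deploys)
    return median_lead, frequency, failure_rate

if __name__ == "__main__":
    now = datetime(2022, 2, 21)
    sample = [
        Deploy(now - timedelta(days=3, hours=6), now - timedelta(days=3), False),
        Deploy(now - timedelta(days=2, hours=2), now - timedelta(days=2), True),
        Deploy(now - timedelta(days=1, hours=4), now - timedelta(days=1), False),
    ]
    lead, freq, cfr = dora_metrics(sample, window_days=7)
    print(f"median lead time: {lead}, deploys/day: {freq:.2f}, CFR: {cfr:.0%}")
```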
As I wrapped up my day, I realized that despite all the chaos happening outside, this day had been productive. I’d learned a few things, fixed an issue that could have been quite disruptive, and continued our journey towards better platform engineering practices.
And yet, as I lay down to sleep, the headlines of the day kept replaying in my mind: Google's supposed search troubles, the looming war in Ukraine, and the rest of the breaking news. It was a stark reminder that while we're busy building our tech stacks, real-world events are affecting countless people.
In the midst of it all, I’m grateful for the quiet moments where I can focus on solving problems and improving our platform. Because even in the face of global turmoil, the work we do every day has tangible impacts—both good and bad—that ripple out into the world.