$ cat post/strace-on-the-wire-/-the-service-mesh-confused-us-all-/-the-repo-holds-it-all.md
strace on the wire / the service mesh confused us all / the repo holds it all
Debugging a Nightmare with DORA Metrics
August 26, 2024. Another typical day in the life of an engineer, but today feels like one of those days when everything that can go wrong does. I’m sitting at my desk, staring at a DORA dashboard that’s showing some alarming trends. The Deployment Frequency has dropped, the Lead Time for Changes is through the roof, and our Change Failure Rate is spiking again. It’s a perfect storm of developer frustration and operational chaos.
The last few weeks have been a whirlwind. We’ve seen the explosion of AI/LLM infrastructure, with everyone jumping on board after ChatGPT. Platform engineering has become mainstream, and the CNCF landscape feels overwhelming as I try to keep up with the latest in Kubernetes, OpenTelemetry, and Jaeger. WebAssembly is finally making its way into server-side apps, but the learning curve is steep. FinOps and cloud cost pressure are real, and our team is constantly under scrutiny for spending.
One piece of news that caught my eye recently came from AnandTech, which announced it was winding down operations. It’s always humbling to see how others approach their work and the problems they face. Running a successful tech publication seems like a daunting task, even compared to our day-to-day struggles.
Anyway, back to the issue at hand. We’ve got a critical app that’s been throwing errors left and right, and it’s causing outages. I decide to dig into the logs, hoping to find some clues. The server-side logging is set up with WebOps, which is great for visibility but adds another layer of complexity.
I spend hours poring over logs and digging through Prometheus metrics and Grafana dashboards to get a better understanding of what’s going on. It turns out that one of our microservices was hitting an API rate limit imposed by a third-party provider. We hadn’t anticipated the load, and now it’s causing major disruptions. I’ve got to find a way to either get the rate limit raised or implement retries in a smarter way.
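For the curious, "smarter retries" usually means capped exponential backoff with jitter, plus respecting whatever wait time the provider suggests. Here is a minimal sketch of that idea, not our actual service code. `RateLimitError`, `call_with_backoff`, and the default delays are hypothetical names and values I made up for illustration; the real client would raise its own exception when it sees an HTTP 429.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised by a client when the provider answers HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        # Seconds the provider asked us to wait, if it told us (assumption).
        self.retry_after = retry_after


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on RateLimitError with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide on a fallback
            # Exponential ceiling: base, 2*base, 4*base, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter spreads retries so callers do not hit the API in lockstep.
            delay = rng() * ceiling
            # Honor the provider's hint when it asks for a longer wait.
            if err.retry_after is not None:
                delay = max(delay, err.retry_after)
            sleep(delay)
```

The `sleep` and `rng` parameters are only there so the function can be exercised without real waiting or randomness. The jitter matters more than it looks: without it, every instance of the service retries at the same moment and trips the limit all over again.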
The frustration builds as I argue with a colleague about whether we should invest time in rewriting parts of our service using WebAssembly or stick with what we have. The consensus is that the existing codebase works, but it’s not ideal for this use case. I feel like I’m stuck between a rock and a hard place.
Meanwhile, the team is complaining about their lack of developer experience tools. We’re still in the phase where our CI/CD pipeline is more of an afterthought than a strategic asset. The idea of automating more aspects of our workflow feels daunting but necessary. I spend some time researching DevOps tools like Spinnaker and GitLab CI, trying to find something that fits our needs without breaking the bank.
FinOps is always on my mind too. We’re constantly being asked about cloud costs, and it’s a reminder that every line of code we write has real financial implications. I start looking into cost optimization tools like Turbot and Cost Management APIs provided by cloud providers to better manage our expenses.
DORA metrics are part of our regular review process now. It’s great to have data-driven insights on how we’re performing, but it can also be demotivating when the numbers aren’t where they should be. I try not to let that get to me too much and focus instead on what we can do to improve.
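It helps me to remember that these numbers are just arithmetic over deployment records. Below is a rough sketch of how the three metrics I keep staring at could be computed. It is not how our dashboard actually does it; the `Deployment` record, its field names, and `dora_summary` are hypothetical, and I am assuming each deployment carries a commit time, a deploy time, and a failure flag.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    committed_at: datetime  # when the change was committed (assumed field)
    deployed_at: datetime   # when it reached production (assumed field)
    failed: bool            # whether it caused a failure needing remediation


def dora_summary(deployments, window_days):
    """Return deployment frequency, median lead time, and change failure rate."""
    if not deployments:
        return {"deploys_per_day": 0.0,
                "median_lead_time": None,
                "change_failure_rate": None}
    # Lead time for changes: commit to production, one timedelta per deployment.
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = sum(1 for d in deployments if d.failed)
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": failures / len(deployments),
    }
```

I use the median for lead time here because one change that sat in review for two weeks would otherwise drag the average and make the whole team look slower than it is.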
By the end of the day, I’ve made some progress—implemented a fallback mechanism for handling rate limits, streamlined our CI/CD pipeline, and started a cost optimization project. But there’s still so much more to tackle. It feels like every time I think we’re making headway, something else crops up that needs attention.
As I close my laptop and start the commute home, I’m reminded of how much this job can be both challenging and rewarding. Debugging these systems, optimizing our processes, and improving developer experience—these are the day-to-day battles that make it all worthwhile. And even on days like today, when everything seems to be going wrong, there’s always a glimmer of hope that tomorrow will bring new opportunities for growth and improvement.
This is my reality check for August 26, 2024. A mix of struggles, successes, and the ever-present drive to keep pushing forward.