$ cat post/grep-through-the-dark-log-/-the-socket-never-closed-right-/-i-kept-the-bash-script.md

grep through the dark log / the socket never closed right / I kept the bash script


Debugging the Daylight

February 6, 2023

Last week was a rollercoaster of tech updates and discussions. The AI/LLM explosion post-ChatGPT is just one part of the larger tech ecosystem that continues to evolve rapidly. Platform engineering and FinOps are becoming more mainstream, and I found myself deep in both as we pushed our infrastructure further.

The AI/LLM Infrastructure Dive

Ever since ChatGPT hit the scene, everyone from startups to Fortune 500s is scrambling to figure out how to integrate large language models (LLMs) into their operations. At my current company, we’re no exception. We’ve been evaluating various frameworks and libraries for integrating these models into our applications, but the infrastructure costs are staggering.

We’ve had some interesting discussions around whether it makes sense to build a custom LLM layer or just use a managed service like Anthropic’s Claude or OpenAI’s API. The managed services offer a lot of convenience and save on ops overhead, but they come with their own set of limitations, especially when you need fine-grained control over the models.
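Most of those cost conversations start with back-of-the-envelope math. Here’s a minimal sketch of the kind of estimate we run; the model names and per-1K-token prices are hypothetical placeholders, not any vendor’s actual pricing:

```python
# Back-of-the-envelope LLM cost estimator.
# NOTE: prices and model names below are HYPOTHETICAL placeholders,
# not real vendor pricing.
PRICES_PER_1K_TOKENS = {
    "managed-large": {"input": 0.003, "output": 0.015},
    "managed-small": {"input": 0.0005, "output": 0.0015},
}

def monthly_cost(model, requests_per_day, in_tokens, out_tokens, days=30):
    """Estimate monthly spend for a given traffic profile."""
    p = PRICES_PER_1K_TOKENS[model]
    per_request = (in_tokens / 1000) * p["input"] + (out_tokens / 1000) * p["output"]
    return per_request * requests_per_day * days

# 50k requests/day, ~800 prompt tokens and ~300 completion tokens each
# comes out to roughly $10,350/month at these placeholder prices.
print(round(monthly_cost("managed-large", 50_000, 800, 300), 2))
```

Even with made-up numbers, plugging in your own traffic profile makes it obvious why "staggering" is the right word.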

Platform Engineering and FinOps

Platform engineering is definitely mainstream now. We’re focusing on building self-service infrastructure that developers can use without needing to spin up entire teams just for infrastructure work. This means we’re constantly evaluating tools like Grafana for monitoring, Flux for GitOps, and Turbonomic for cost optimization.

One of the big challenges with platform engineering is ensuring that everyone uses these tools effectively while still providing enough flexibility. It’s a fine line between making things too rigid (which stifles creativity) and leaving things so open that nothing gets done.

FinOps is another area where we’re seeing a lot of traction. With cloud providers like AWS and Azure, the cost can quickly spiral out of control if you’re not careful. We’ve implemented tools to track costs in real-time and set up alerts when spending hits certain thresholds. This has helped us identify areas where we could optimize our spend without compromising on performance.
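The alerting logic itself is simple; the hard part is wiring it to real billing data. A minimal sketch of the threshold check (the budget numbers are hypothetical, and a real setup would pull current spend from the cloud provider’s billing API):

```python
# FinOps-style spend alert check. Thresholds are fractions of the
# monthly budget; real setups would fetch current_spend from the
# cloud billing API rather than hard-coding it.
def check_thresholds(current_spend, monthly_budget, thresholds=(0.5, 0.8, 1.0)):
    """Return the budget fractions the current spend has crossed."""
    used = current_spend / monthly_budget
    return [t for t in thresholds if used >= t]

# Hypothetical numbers: $8,600 spent against a $10,000 budget means
# the 50% and 80% alerts have fired, but not the 100% alert.
alerts = check_thresholds(current_spend=8_600, monthly_budget=10_000)
```

Each crossed threshold maps to a notification (Slack, PagerDuty, email); keeping the check pure like this makes it trivial to test.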

WebAssembly on the Server

WebAssembly (Wasm) is gaining traction as a way to run server-side code more efficiently. I was playing around with Wasm last week, specifically looking at how it might be used for small microservices, or even parts of larger applications where performance is critical and isolation is paramount.

One thing that stood out to me was the potential for Wasm to reduce our attack surface by running certain components in a sandboxed environment. We’re still evaluating the trade-offs between running everything natively and using Wasm, especially given the current state of tooling and support.

Developer Experience and DORA Metrics

On the developer experience side, we’ve been looking at ways to streamline workflows and reduce bottlenecks. I spent some time setting up a CI/CD pipeline for our latest project, and it was eye-opening how many different tools integrate seamlessly. Finding the right balance between automation and human oversight is still an art, though.

DORA metrics (Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Time to Restore Service) are becoming standard in teams like ours. We’ve adopted them as a way to measure our DevOps maturity and identify areas for improvement. It’s been interesting to see how these metrics can drive behavior changes within the team.
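Three of the four metrics fall straight out of your deploy history. Here’s a toy sketch of how we compute them (the deploy records are made-up data, and field names like `merged`/`deployed` are my own, not from any particular tool):

```python
from datetime import datetime
from statistics import mean

# Hypothetical deploy records: when the change merged, when it shipped,
# and whether it caused a failure in production.
deploys = [
    {"merged": datetime(2023, 2, 1, 9),  "deployed": datetime(2023, 2, 1, 15), "failed": False},
    {"merged": datetime(2023, 2, 2, 10), "deployed": datetime(2023, 2, 2, 12), "failed": True},
    {"merged": datetime(2023, 2, 3, 11), "deployed": datetime(2023, 2, 3, 14), "failed": False},
    {"merged": datetime(2023, 2, 4, 8),  "deployed": datetime(2023, 2, 4, 16), "failed": False},
]

def dora_summary(deploys, period_days=7):
    """Compute three of the four DORA metrics from deploy records."""
    lead_times = [(d["deployed"] - d["merged"]).total_seconds() / 3600 for d in deploys]
    return {
        "deploy_frequency_per_day": len(deploys) / period_days,
        "lead_time_hours": mean(lead_times),
        "change_failure_rate": sum(d["failed"] for d in deploys) / len(deploys),
    }
```

Time to Restore Service needs incident data rather than deploy data, which is why it’s usually the last one teams manage to automate.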

A Real-World Debugging Session

One of the most challenging problems I hit this week was an intermittent failure in one of our microservices, and the logs weren’t providing much insight. After a couple of hours tracing requests through Jaeger, we finally identified the root cause: a memory leak in the Python code.

The fix involved breaking the references that were keeping stale objects alive, so the garbage collector could actually reclaim them. It was a good reminder that even after years of working with these systems, there’s always something new to learn.

Wrapping Up

Tech is moving at breakneck speed right now, and it’s easy to get caught up in the hype cycle. But for me, the real satisfaction comes from solving complex problems and seeing tangible improvements in our operations. Whether it’s optimizing cloud spend with FinOps tools or debugging memory leaks in a microservice, there’s always something new to tackle.

So here’s to another week of coding, learning, and pushing the boundaries of what we can achieve with technology!


That’s my take for today. Hope this gives you some flavor of what goes on behind the scenes!