$ cat post/ps-aux-at-midnight-/-the-deploy-went-sideways-fast-/-disk-full-on-impact.md

ps aux at midnight / the deploy went sideways fast / disk full on impact


Title: Living in the Era of Overwhelming Cloud Costs


October 2, 2023 has been a day filled with reminders that we live in interesting times. The tech world is awash with changes—AI/LLM infrastructure seems to be everywhere you look, platform engineering is mainstreaming like never before, and the CNCF landscape feels as overwhelming as ever. On top of all this, WebAssembly on the server side continues to gather momentum, developer experience has become a first-class discipline, and FinOps and cloud cost pressure are real conversations in every corner.

Today, I woke up to yet another high-cost alert from our cloud provider, a reminder that our spend has serious financial implications. This isn’t just about saving money; it’s about keeping the platform sustainable for everyone involved. DORA metrics are everywhere now, and they’re not just for startups; they apply to all of us.

The Cost Game

Let’s talk specifics: we’ve been working hard to manage our costs, but every month brings a new challenge. Earlier this week, we had an urgent meeting with the finance team to dissect our AWS spend. It turns out that a few misconfigured Auto Scaling groups were gobbling up resources at an alarming rate. Each group was running instances 24/7, even when traffic was down, simply because it had never been properly tuned.
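As a rough illustration of the kind of check that would have caught those groups earlier, here is a minimal Python sketch. The `GroupSnapshot` fields, the sample numbers, and the 20% CPU threshold are all assumptions made up for this example; this is not our actual tooling or data, and a real version would pull these values from the provider's metrics.

```python
from dataclasses import dataclass


@dataclass
class GroupSnapshot:
    """Hypothetical summary of one scaling group over a review window."""
    name: str
    min_size: int
    max_size: int
    avg_instances: float    # average running instances over the window
    avg_cpu_percent: float  # average CPU utilization over the window


def find_untuned_groups(groups, cpu_threshold=20.0):
    """Return names of groups that sit at full capacity while mostly idle.

    A group is flagged when its average instance count is at max_size
    (it never scales in) and its average CPU is below cpu_threshold.
    """
    flagged = []
    for g in groups:
        pinned_at_max = g.avg_instances >= g.max_size
        mostly_idle = g.avg_cpu_percent < cpu_threshold
        if pinned_at_max and mostly_idle:
            flagged.append(g.name)
    return flagged


# Illustrative data only: "batch" never scales in and is nearly idle.
sample = [
    GroupSnapshot("api", min_size=2, max_size=10, avg_instances=3.5, avg_cpu_percent=55.0),
    GroupSnapshot("batch", min_size=8, max_size=8, avg_instances=8.0, avg_cpu_percent=6.0),
]
print(find_untuned_groups(sample))  # prints ['batch']
```

The threshold is deliberately a parameter: what counts as "mostly idle" depends on the workload, so it is something to agree on with the team rather than hard-code.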

To address this, we’re implementing more robust monitoring and alerting on our cost management dashboard. This involves setting up detailed metrics in CloudWatch to track the utilization of all our services. We’ve also introduced a new CI/CD pipeline that rolls out scaling configuration driven by actual demand, rather than just maxing out capacity for safety.
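To make "based on actual demand" concrete, here is a small sketch of a proportional scaling rule. It has the same shape as the formula documented for the Kubernetes Horizontal Pod Autoscaler; I am not claiming it is exactly what our pipeline or any AWS scaling policy computes. The function name and parameters are hypothetical.

```python
import math


def desired_capacity(current, metric_value, target_value, min_size, max_size):
    """Size the fleet so the observed metric lands near the target.

    desired = ceil(current * metric_value / target_value), then clamped
    to the [min_size, max_size] range so we never scale to zero or
    beyond the agreed ceiling.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = math.ceil(current * metric_value / target_value)
    return max(min_size, min(max_size, raw))


# 10 instances at 15% CPU with a 60% target: scale in to 3.
print(desired_capacity(10, 15.0, 60.0, min_size=2, max_size=20))  # prints 3
```

The clamp matters as much as the formula: the minimum protects availability when traffic drops to nothing, and the maximum is the cost guardrail.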

Developer Experience

On the developer experience front, I’ve been wrestling with the fact that our platform engineers are increasingly being asked to balance the needs of development velocity against cost efficiency. We’re using tools like Tilt and Skaffold to help developers work more efficiently locally, but we need to ensure they don’t end up overprovisioning resources on their laptops either.

One interesting idea I’ve been playing with is to integrate a tool that profiles local environments and suggests the most efficient setups for different projects. This way, we can guide developers toward better practices without micromanaging them. It’s all about empowering them while keeping an eye on the financials.

Platform Engineering

Speaking of platform engineering, our team has been diving deep into WebAssembly (Wasm) as a potential solution to some of our server-side challenges. We’re exploring how Wasm can help us reduce resource usage and improve performance in certain microservices. For example, one of my engineers is working on porting parts of our image processing pipeline to run in the browser using Wasm, which would take that work off our servers.

This shift isn’t without its downsides: there are still quite a few edge cases we need to handle. Even so, the potential benefits are significant. We’re also looking at how we can better document these trade-offs and provide clear guidance for developers considering using Wasm.

FinOps and Sustainability

FinOps is really coming into its own as a discipline here. It’s not just about cutting costs; it’s about making sure every dollar spent aligns with our business goals. We’re working on implementing more sophisticated cost optimization strategies, such as Reserved Instances, Spot Instances, and autoscaling.
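The reserved-versus-on-demand decision comes down to simple arithmetic, sketched below. The hourly rates are made-up numbers for illustration, not real AWS prices, and the model assumes a reservation is billed for every hour of the month whether or not the instance runs.

```python
HOURS_PER_MONTH = 730  # common billing approximation: 8760 hours / 12


def break_even_utilization(on_demand_hourly, reserved_hourly):
    """Fraction of the month an instance must run for a reservation to pay off."""
    return reserved_hourly / on_demand_hourly


def cheaper_option(hours_used, on_demand_hourly, reserved_hourly):
    """Compare one month of on-demand usage against a full-month reservation."""
    on_demand_cost = hours_used * on_demand_hourly
    reserved_cost = HOURS_PER_MONTH * reserved_hourly  # paid even when idle
    return "reserved" if reserved_cost < on_demand_cost else "on-demand"


# Hypothetical rates: $0.10/h on demand, $0.06/h effective reserved rate.
print(cheaper_option(730, 0.10, 0.06))  # prints reserved
print(cheaper_option(200, 0.10, 0.06))  # prints on-demand
```

With these example rates the break-even sits at roughly 60% utilization, which is why steady baseline load suits reservations and bursty load does not.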

One particularly fun challenge came when we accidentally left an expensive Amazon Elastic Kubernetes Service (EKS) cluster running without any workloads for days. Once we spotted the mistake we fixed it quickly, but it serves as a stark reminder that human error can be costly. Since then, I’ve been advocating for more automation in our setup and teardown processes.
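The sort of automation I have in mind could start with something as small as the check below. It is a sketch under assumptions: the `(name, running_workloads, last_active)` tuples are a hypothetical inventory format, and the 12-hour idle limit is an arbitrary example value. It only reports candidates; whether to tear anything down automatically is a separate decision.

```python
from datetime import datetime, timedelta


def idle_clusters(clusters, now, max_idle=timedelta(hours=12)):
    """Return names of clusters with no workloads for longer than max_idle.

    clusters: iterable of (name, running_workloads, last_active) tuples,
    where last_active is the datetime a workload was last seen running.
    """
    flagged = []
    for name, running_workloads, last_active in clusters:
        if running_workloads == 0 and now - last_active > max_idle:
            flagged.append(name)
    return flagged


# Illustrative inventory: "experiment" has been empty for three days.
now = datetime(2023, 10, 2, 9, 0)
inventory = [
    ("prod", 42, now),
    ("experiment", 0, now - timedelta(days=3)),
    ("staging", 0, now - timedelta(hours=2)),
]
print(idle_clusters(inventory, now))  # prints ['experiment']
```

Passing `now` in explicitly keeps the function easy to test and avoids surprises around time zones.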

Conclusion

Living through this era of tech is both exhilarating and challenging. The balance between delivering value to customers and managing costs is a constant dance. With AI/LLM infrastructure expanding, platform engineering becoming more mainstream, and FinOps gaining traction, there are endless opportunities—and challenges—to navigate.

For now, I’ll be focusing on refining our cost management tools, pushing Wasm into production, and ensuring that every dollar spent benefits us in the long run. Stay tuned for more updates!


This is a snapshot of my journey through these exciting times. The tech landscape continues to evolve rapidly, but by staying grounded and focused on real problems, we can continue to deliver value while keeping an eye on our financials.