$ cat post/packet-loss-at-dawn-/-the-thread-pool-was-too-shallow-/-it-ran-in-the-dark.md

16MAY22

packet loss at dawn / the thread pool was too shallow / it ran in the dark

Title: When Serverless Meets FinOps: A Costly Lesson in Platform Engineering

May 16, 2022. Another Monday with the usual grind of code reviews and technical debt. Today, though, I woke up with more of a hangover than usual, both metaphorical and literal. The metaphorical one comes from the last sprint’s unmerged PRs, which still needed attention after another all-nighter.

The literal one is just my usual morning cold from the office AC, but it felt like a lot today. A quick look at my calendar revealed an extra meeting with our FinOps team, a group I’ve come to respect for their keen insight into what our platform does and doesn’t deliver. They’re not just accountants in suits anymore; they’re the stewards of our cloud budget, and boy did we need their help.

The Setup

Over the past few months, we’ve been riding the wave of serverless hype. Our platform has expanded to include a diverse set of AWS Lambda functions, each with its own purpose. We were starting to see the benefits (a more scalable architecture, lower operational overhead), but as always, cost was a concern.

Among our key metrics are the four DORA (DevOps Research and Assessment) measures: deployment frequency, lead time for changes, mean time to recover from failures, and change failure rate. We’ve hit some impressive numbers, but FinOps had their eye on something more granular: cloud spending per function. The costs were rising faster than we anticipated.

The Revelation

During our meeting with the FinOps team, they pointed out a critical flaw in how we were tracking costs. Our billing was fragmented across multiple services, making it difficult to pinpoint where the expenses went. They suggested implementing a centralized cost management solution using AWS Budgets and Cost Explorer, which would help us visualize and manage spending better.
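To make "spending per function" concrete, here is a minimal sketch of the kind of query Cost Explorer can answer through its `get_cost_and_usage` API. It assumes every function carries a cost-allocation tag, here hypothetically named `function`, that has been activated for billing. The helper names are mine, and pagination is ignored for brevity.

```python
from collections import defaultdict
from datetime import date


def build_cost_query(start: date, end: date, tag_key: str = "function") -> dict:
    """Build keyword arguments for Cost Explorer's get_cost_and_usage call.

    tag_key is a hypothetical cost-allocation tag; it must be activated
    for billing before Cost Explorer can group by it.
    """
    return {
        # The End date is exclusive in Cost Explorer.
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        "Filter": {"Dimensions": {"Key": "SERVICE", "Values": ["AWS Lambda"]}},
        "GroupBy": [{"Type": "TAG", "Key": tag_key}],
    }


def total_by_group(response: dict) -> dict:
    """Sum UnblendedCost per group key across every day in the response."""
    totals = defaultdict(float)
    for day in response.get("ResultsByTime", []):
        for group in day.get("Groups", []):
            key = group["Keys"][0]
            totals[key] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    return dict(totals)


def fetch_lambda_costs(start: date, end: date) -> dict:
    """Run the query. Needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("ce")
    return total_by_group(client.get_cost_and_usage(**build_cost_query(start, end)))
```

Splitting the request builder from the API call keeps the interesting part, the grouping and filtering, easy to review with the FinOps team without touching a live account.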

But that’s not all. They also raised concerns about certain Lambda functions that seemed particularly costly. One in particular stood out: our daily data processing function. It was running more often than it needed to and burning resources during off-peak hours. This wasn’t a bug, but rather an unintended consequence of our auto-scaling settings.

The Fix

Armed with this information, I dove into the configuration. First step: audit the Lambda functions to make sure they were properly configured for cost optimization. That meant tweaking the memory allocation and timeout settings, and adjusting the auto-scaling policies to reduce unnecessary invocations during off-peak hours.
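Why memory and invocation count matter so much comes down to simple arithmetic: Lambda bills compute in GB-seconds plus a per-request fee. The sketch below is a rough estimator, not our actual numbers. The prices are assumed x86 list prices for us-east-1 and should be checked against the current pricing page, the free tier is ignored, and the workload figures are hypothetical.

```python
# Assumed list prices (x86, us-east-1); verify for your region and architecture.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_MILLION_REQUESTS = 0.20


def monthly_cost(invocations: int, avg_duration_ms: float, memory_mb: int) -> float:
    """Estimate monthly Lambda spend in USD, ignoring the free tier."""
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    compute = gb_seconds * PRICE_PER_GB_SECOND
    requests = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    return compute + requests


# Hypothetical workload: a 5-minute job firing hourly at 2 GB,
# versus the same job firing once a day at 1 GB (30-day month).
before = monthly_cost(invocations=720, avg_duration_ms=300_000, memory_mb=2048)
after = monthly_cost(invocations=30, avg_duration_ms=300_000, memory_mb=1024)
print(f"before: ${before:.2f}  after: ${after:.2f}")
```

One caveat: Lambda allocates CPU in proportion to memory, so cutting memory can lengthen duration and eat into the savings. It is worth measuring duration after each change rather than assuming it stays flat.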

The process was painstaking but rewarding. Each change brought us closer to achieving a more efficient platform. I even took it a step further by introducing a monitoring solution using CloudWatch Metrics and Alarms, which would alert me if any function was consuming more than the allocated budget.
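For a sense of what such an alarm looks like, here is a minimal sketch built around CloudWatch's `put_metric_alarm` call. CloudWatch alarms watch metrics rather than dollars, so this one uses hourly invocation count as a proxy for spend; the function name, threshold, and SNS topic ARN are all hypothetical, and dollar-based thresholds would live in AWS Budgets instead.

```python
def invocation_alarm(function_name: str, max_per_hour: int, topic_arn: str) -> dict:
    """Keyword arguments for CloudWatch put_metric_alarm on Lambda invocations."""
    return {
        "AlarmName": f"{function_name}-invocations-high",
        "Namespace": "AWS/Lambda",
        "MetricName": "Invocations",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 3600,  # one-hour window, in seconds
        "EvaluationPeriods": 1,
        "Threshold": float(max_per_hour),
        "ComparisonOperator": "GreaterThanThreshold",
        # An hour with no invocations should not trip the alarm.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **invocation_alarm("daily-data-processing", 2, "arn:aws:sns:...")
#   )
```

A similar alarm on the `Duration` metric would catch functions that start running long, which is the other half of the GB-seconds bill.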

The Learning

This experience taught me two valuable lessons:

Platform Engineering is More Than Just Code: As we shift towards cloud-native architectures, platform engineering becomes increasingly important. It’s not just about writing code; it’s about ensuring that your infrastructure scales effectively and efficiently while adhering to cost constraints.
Collaboration is Key: FinOps has become a crucial partner in our development process. They bring a unique perspective on the financial aspects of our platform, which helps us make informed decisions.

Looking Forward

As we continue to scale our platform, I’m excited about integrating more tools like AWS Budgets and Cost Explorer. It’s comforting knowing that we’re not just chasing DORA metrics or performance benchmarks; we’re also ensuring that our costs stay in check. The serverless architecture has opened up new possibilities for us, but with great power comes great responsibility.

Today’s meeting was a reminder that while technology can be complex, the solutions often lie in simple, collaborative efforts. As I sit here typing this, I’m already thinking about how we can further optimize our platform to meet both our operational and financial goals.