$ cat post/the-rollback-succeeded-/-the-incident-taught-us-the-most-/-i-left-a-comment.md

the rollback succeeded / the incident taught us the most / I left a comment


A Day in the Life of a Platform Engineer: October 31, 2016


Alright, let’s get into it. I woke up to another busy day as an engineering manager whose responsibilities span coding and platform ops. The day started like any other: coffee, emails, and a quick check-in with the team on Slack.

The morning was shaping up well until I got an alert about our Kubernetes cluster: one of the nodes had gone down unexpectedly. Our infrastructure had been growing steadily over the last few months, and we’ve been running Kubernetes for most of that time, so we knew how to handle this kind of incident. That didn’t make it any less stressful.

I quickly checked the logs and realized it was a resource issue—specifically, memory pressure on the node. The container running our application was eating up all the available RAM, causing the node to crash. We had set up monitoring with Prometheus and Grafana, so we knew exactly what was going on when it happened.

Using our Jenkins pipeline, I pushed out a quick fix to bump up the resource limits for that specific container, which stabilized things pretty quickly. The team appreciated having these tools in place; they made the recovery process much smoother. This also led me to reflect on how far we’ve come since those early days of hand-rolling monitoring and alerting.
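For readers who haven’t done this before, here is a rough sketch of what a memory bump like that can look like as a patch body. It is not our actual pipeline step: the container name `app` and the `512Mi`/`1Gi` values are made up for illustration, and the helper `memory_limit_patch` is hypothetical. The only thing taken from Kubernetes itself is the shape of the `resources.requests` and `resources.limits` fields on a container.

```python
import json


def memory_limit_patch(container_name, request, limit):
    """Build a patch body that sets memory request/limit on one container.

    The container is matched by name inside the Deployment's pod template.
    All argument values used below are illustrative assumptions.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container_name,
                            "resources": {
                                "requests": {"memory": request},
                                "limits": {"memory": limit},
                            },
                        }
                    ]
                }
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical container name and sizes, not values from the incident.
    print(json.dumps(memory_limit_patch("app", "512Mi", "1Gi"), indent=2))
```

The printed JSON could be passed to something like `kubectl patch deployment <name> -p '<json>'`, or checked into the manifest that the pipeline applies. Setting a request alongside the limit also lets the scheduler account for the memory the container is expected to need.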

After addressing the immediate issue, I spent some time reviewing our recent spike in new user sign-ups from an advertising campaign. We had set up a few A/B tests using Google Optimize, but we needed to better understand the infrastructure impact before scaling much further. The team was discussing how to scale out our services and whether it made sense to adopt a service mesh.

As we talked about potential solutions, I couldn’t help but think back to the Google hiring test that had been making waves on Hacker News just a few days earlier. It’s crazy how quickly tech evolves, and yet these foundational challenges still exist. Ensuring our platform can handle sudden spikes while maintaining performance and availability is something we face every day.

Speaking of infrastructure, I spent some time with the team discussing the benefits of adopting Terraform for our provisioning needs. We’ve been using Ansible for a while now, but as our infrastructure grew more complex, it became harder to manage. Terraform seemed like a natural progression. One thing that kept coming up was whether we should also consider CloudFormation, especially given AWS’s growing dominance.

While the team was in favor of Terraform, I argued for staying agnostic and using it only where it made sense, rather than making a platform-wide switch. This led to an interesting debate about vendor lock-in versus flexibility, which is something that always comes up when we make such significant changes.

Later on, someone brought up the recent DDoS attack against Dyn. It was a stark reminder of how vulnerable our services can be if not properly secured. We decided to take some time today to review and update our security practices, including implementing rate limiting and ensuring all external endpoints were configured securely.
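To make the rate-limiting idea concrete, here is a minimal token-bucket sketch. It is one common way to do rate limiting, not necessarily what we ended up deploying, and the class name `TokenBucket` and its parameters are my own choices for the example. Each client gets a bucket that refills at `rate` tokens per second up to `capacity`; a request is allowed only if a token is available.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens/second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if available, else return False."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time that has passed, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=5, capacity=10)  # illustrative numbers only
    allowed = sum(bucket.allow() for _ in range(20))
    print(f"{allowed} of 20 back-to-back requests allowed")
```

In practice a limiter like this usually sits at the edge (load balancer or API gateway) and is keyed per client IP or API key; the in-process version above is just to show the mechanics.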

In between meetings, I caught wind that a colleague had been working on setting up a serverless architecture using AWS Lambda and API Gateway. It was fascinating to see how much interest this space is generating. While we didn’t have any immediate plans to switch to serverless, it’s always good to keep an eye on emerging technologies.

As the day wound down, I reflected on how much our tech stack keeps evolving. We’re always learning and adapting, but more importantly, we’re building a platform that can serve our growing user base while staying resilient against future challenges. It’s a challenging role, and a rewarding one, and it keeps things interesting.


That’s how a typical day for me goes these days. The tech landscape is always changing, and there’s always something new to learn or adapt to. But at the end of the day, I love seeing the impact our work has on making the services we build more robust and reliable.