Debugging DevOps with DORA Metrics and LLMs: A Personal Journey
April 15, 2024. I’m sitting in my office, looking at a wall of monitors showing dashboards and metrics from our platforms. It’s been an interesting few months since the post-ChatGPT explosion in AI/LLM infrastructure took off. Platform engineering is no longer just a buzzword; it’s a discipline that demands real attention.
Today, I want to reflect on something we’ve been grappling with at work: using DORA metrics to optimize our DevOps processes and how this has intersected with the rise of large language models (LLMs).
The Context
We’ve all heard about Meta’s Llama 3, a massive open-weight model that’s pushing the boundaries of what’s possible. Meanwhile, Google is under fire over claims that its search quality is slipping against new competitors. Stories like Equinox.space and “Mario meets Pareto” keep making the rounds on Hacker News. Every day, I find myself thinking about these stories as a reflection of the current state of tech.
The Debugging Session
Last week, we had a crisis in one of our critical services. Our main service experienced an unexpected downtime, and it took us several hours to identify and fix the issue. This got me thinking—how can we prevent such issues from happening again? How do we leverage new technologies like LLMs to improve our DevOps practices?
DORA Metrics
One of the things that stood out during this incident was how slow our deployment process had become. The DORA research, widely cited across our industry, shows that high-performing teams deploy more frequently and with lower change failure rates. We were not meeting those benchmarks.
We decided to make changes guided by these metrics. Our goals: cut our Lead Time for Changes by 30%, increase our Deployment Frequency, and bring down our Mean Time to Recovery (MTTR). It was going to be a challenge, but we had a new tool in our arsenal: large language models.
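For concreteness, here is a minimal sketch of how those numbers can be computed from per-deployment records. The record shape and sample data are made up for illustration; in practice you would pull these fields from your CI/CD and incident-management systems.

```python
from datetime import datetime, timedelta

# Hypothetical deployment records: when the change was committed, when it
# shipped, whether it caused a failure, and when service was restored.
deploys = [
    {"commit_at": datetime(2024, 4, 1, 9), "deployed_at": datetime(2024, 4, 1, 15),
     "failed": False, "restored_at": None},
    {"commit_at": datetime(2024, 4, 2, 10), "deployed_at": datetime(2024, 4, 3, 10),
     "failed": True, "restored_at": datetime(2024, 4, 3, 12)},
    {"commit_at": datetime(2024, 4, 4, 8), "deployed_at": datetime(2024, 4, 4, 20),
     "failed": False, "restored_at": None},
]

def dora_metrics(deploys, window_days=7):
    """Compute the four DORA key metrics over a window of deployment records."""
    lead_times = [d["deployed_at"] - d["commit_at"] for d in deploys]
    avg_lead = sum(lead_times, timedelta()) / len(lead_times)
    freq = len(deploys) / window_days  # deployments per day
    failures = [d for d in deploys if d["failed"]]
    cfr = len(failures) / len(deploys)  # change failure rate
    restores = [d["restored_at"] - d["deployed_at"] for d in failures]
    mttr = sum(restores, timedelta()) / len(restores) if restores else timedelta()
    return {"lead_time": avg_lead, "deploy_freq_per_day": freq,
            "change_failure_rate": cfr, "mttr": mttr}
```

Even a toy version like this makes the 30% target concrete: you can re-run it week over week and watch whether the lead-time number actually moves.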
LLMs in DevOps
I started experimenting with GPT-4 through ChatGPT. We’ve been using it for various tasks, like code review and issue triage. It’s impressive how quickly the model can understand complex issues and surface insights we might miss. However, we were skeptical about fully automating our release processes with it.
To test its capabilities, I prompted it to draft deployment automation scripts. The model generated a basic script that was surprisingly well-structured. But when I tried it in production, it broke on environment-specific context and edge cases the model hadn’t accounted for. This taught me a valuable lesson: LLMs are fantastic at generating code snippets, but they still need human oversight.
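That lesson generalizes into a guardrail pattern. The sketch below is a hypothetical wrapper (not our actual tooling): never execute an LLM-generated shell script directly; run a crude deny-list check and a syntax-only `bash -n` pass first, and gate real execution behind an explicit human approval callback.

```python
import subprocess

def safe_apply(script: str, approve) -> bool:
    """Run an LLM-generated shell script only after checks and human sign-off.

    `approve` is a callback (e.g. a prompt in a chat-ops channel) that must
    return True before anything executes for real.
    """
    # 1. Static sanity check: refuse obviously dangerous commands outright.
    banned = ("rm -rf /", "mkfs", ":(){", "dd if=")
    if any(tok in script for tok in banned):
        return False
    # 2. Parse-only pass: `bash -n` catches syntax errors without executing.
    check = subprocess.run(["bash", "-n"], input=script, text=True,
                           capture_output=True)
    if check.returncode != 0:
        return False
    # 3. A human reviews the script text before it ever runs.
    if not approve(script):
        return False
    subprocess.run(["bash", "-c", script], check=True)
    return True
```

The deny-list is deliberately naive; the point is the shape of the pipeline, with the human approval step as the last gate rather than an afterthought.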
Debugging the Issue
During the incident, we spent hours debugging the issue. We had logs, metrics, and monitoring tools, but it was still difficult to pinpoint the exact root cause. That’s when I remembered a recent talk on using LLMs for anomaly detection in DevOps. Could an LLM help us identify issues faster?
I decided to run excerpts of our logs through GPT-4. The model quickly surfaced patterns we had missed and suggested potential causes. It helped us narrow down the issue, which ultimately led us to a misconfigured network rule. This was a significant win for human–machine collaboration.
Lessons Learned
This experience reinforced several things:
- Human Oversight is Crucial: While LLMs are powerful tools, they can’t replace experienced engineers who understand the context of their work.
- Continuous Improvement: We need to continuously monitor our DevOps processes using metrics like DORA and adjust accordingly.
- Tooling Diversity: A mix of human intelligence and automation tools is necessary for effective problem-solving.
Conclusion
As I sit here reflecting, I’m glad we managed to mitigate the incident. The journey to improving our DevOps practices has been challenging but rewarding. We’re now more focused on optimizing our processes and leveraging new technologies like LLMs in a balanced way. It’s a work in progress, but it feels good to be part of this evolution.
This reflection is just one small part of the broader landscape of tech in 2024. The stories on Hacker News inspire us to keep pushing boundaries and learning from each other. Whether we’re talking about Meta’s Llamas or credit card rewards programs, there’s always something new to explore.