$ cat post/apt-get-from-the-past-/-the-abstraction-leaked-everywhere-/-a-ghost-in-the-pipe.md
apt-get from the past / the abstraction leaked everywhere / a ghost in the pipe
Title: The Great Kubernetes Cleanup: A Case Study
April 5, 2021 - The other day, our team faced a classic problem: our cluster was a mess. It had been running for three years and had grown like a wild jungle of pods, services, deployments, and statefulsets. The infrastructure looked like someone had randomly thrown together a thousand toy blocks without any rhyme or reason.
I’ve been on the fence about this for a while now—should we just leave it as is? After all, everything is working, right? But then again, managing a cluster that way isn’t scalable and makes life hard when something goes wrong. I decided to take matters into my own hands.
The Setup
Our Kubernetes cluster was running on AKS (Azure Kubernetes Service). We had a bunch of microservices built in Go and Python, along with some legacy applications written in Java and .NET. Every now and then, we’d get requests from developers who wanted new pods or services spun up for their latest project. So, naturally, over the years, the cluster became a sprawling labyrinth.
The Plan
I decided to tackle this issue head-on by cleaning things up systematically. Here’s what I did:
- Identify and Tag: I went through each namespace and tagged everything according to its purpose or last known user.
- Gather Metrics: I used tools like Prometheus and Grafana to monitor the cluster’s health and performance.
- Define a Cleanup Strategy: We needed a plan that would help us keep the cluster clean going forward.
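For the "Identify and Tag" step, one lightweight approach is to record ownership directly as namespace labels, so later audits become label queries instead of archaeology. A minimal sketch with hypothetical names (the label keys and the `billing-legacy` namespace are illustrative, not from our actual cluster):

```yaml
# Hypothetical namespace manifest; label keys and values are examples
apiVersion: v1
kind: Namespace
metadata:
  name: billing-legacy
  labels:
    owner: team-billing
    purpose: legacy-migration
    last-reviewed: "2021-04"
```

Once labels like these exist, `kubectl get namespaces -l owner=team-billing` turns "whose is this?" into a one-liner.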
The Cleanup
Step 1: Identify Orphaned Resources
I wrote a script to identify pods, services, deployments, statefulsets, and other resources that had not been used for over six months. This was done by cross-referencing Kubernetes metadata with our internal git repositories and release notes.
```bash
# Example bash snippet for identifying orphaned pods
# Non-production pods across all namespaces
kubectl get pods -A -o json | jq -r '.items[] | select(.metadata.labels.environment != "production") | .metadata.name' | sort -u > current_pods.txt
# Tokenize recent commit subjects so pod names can be matched against them
git log --pretty=format:'%s' --since='6 months ago' --no-merges | tr -c '[:alnum:]-' '\n' | sort -u > git_log.txt
# Pods never mentioned in a recent commit are cleanup candidates
comm -23 current_pods.txt git_log.txt
```
This script helped us identify resources that could be safely removed. After reviewing the output, we found a few outdated services and deployments.
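The heavy lifting here is `comm -23`, which prints lines unique to the first of two sorted inputs. A self-contained illustration with made-up pod names:

```shell
# comm -23 emits lines that appear only in the first (sorted) file
printf 'api-pod\nbilling-pod\ncache-pod\n' > a.txt
printf 'billing-pod\n' > b.txt
comm -23 a.txt b.txt   # prints api-pod and cache-pod
```

Both inputs must already be sorted, which is why the snippets above pipe everything through `sort -u` first.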
Step 2: Consolidate Namespaces
We had namespaces galore, most of which were empty or only used for specific short-lived projects. I decided to consolidate them into more manageable chunks. This not only made it easier to manage but also helped reduce namespace sprawl.
```bash
# Example bash snippet for deleting unused namespaces
kubectl get namespaces -o json | jq -r '.items[].metadata.name' | sort > all_namespaces.txt
# Namespaces that still contain pods are considered in use
kubectl get pods -A -o json | jq -r '.items[].metadata.namespace' | sort -u > active_namespaces.txt
# Diff the two lists, skipping system namespaces
for ns in $(comm -23 all_namespaces.txt active_namespaces.txt | grep -Ev '^(default|kube-.*)$'); do
  echo "Deleting namespace: $ns"
  kubectl delete namespace "$ns"   # no --force: let finalizers clean up properly
done
```
Step 3: Set Up Governance Rules
To prevent future sprawl, we implemented a set of best practices and governance rules. We adopted Argo CD and Flux for GitOps-driven continuous delivery, which helped ensure that any change to the cluster followed a defined, reviewable process.
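Governance is easiest to enforce when it lives in the cluster itself rather than in a wiki page. A sketch of the kind of per-namespace guardrail we mean; the numbers are illustrative, not our actual quotas:

```yaml
# Hypothetical ResourceQuota: caps what any one team namespace can consume
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-billing
spec:
  hard:
    pods: "50"
    requests.cpu: "20"
    requests.memory: 64Gi
```

Committing manifests like this to the GitOps repo means the CD tooling will flag or revert any drift from the agreed limits.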
The Results
After about two weeks of meticulous cleanup, the cluster looked much cleaner. Our monitoring metrics showed improved performance, and developers found it easier to deploy new services without running into naming conflicts or resource issues.
Lessons Learned
- Automation is Key: Scripting the cleanup process saved me a lot of time and ensured consistency.
- Documentation Helps: Keeping track of what each namespace or service was for helped in making informed decisions about which ones to keep, modify, or delete.
- Continuous Improvement: Even after the initial cleanup, we need to stay vigilant and regularly review our cluster’s health.
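To make the "stay vigilant" part concrete, the orphan report from Step 1 can be scheduled instead of run by hand. A hedged sketch, assuming the script is baked into an image; the image name, script path, and service account are hypothetical:

```yaml
# Hypothetical CronJob that re-runs the orphan report every Monday morning
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cluster-audit
spec:
  schedule: "0 6 * * 1"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cluster-auditor   # needs read access to pods and namespaces
          containers:
          - name: audit
            image: registry.example.com/cluster-audit:latest
            command: ["/scripts/find-orphans.sh"]
          restartPolicy: OnFailure
```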
The Takeaway
Cleaning up a Kubernetes cluster is like decluttering your living room—necessary but not always fun. But doing it right can make managing infrastructure much easier in the long run. I learned that while it’s easy to let things slide, taking the time to clean up now saves headaches down the line.
As we move forward with more SRE and platform engineering roles, this kind of thorough cleanup is crucial for maintaining a healthy, scalable environment. And who knows? Maybe next year, we can do another round of improvements and make it even better!