
Debugging Kubernetes Complexity: A Real-Life Example


November 2, 2020. Another quiet morning in the ops trenches, with a fresh cup of coffee and my laptop set to debug mode. Today, I face the complexity beast again—this time, it’s a tangled web of services and deployments on our Kubernetes cluster.

The Setup

We’re running a microservices architecture using Istio for service mesh, ArgoCD for CI/CD, and Flux for GitOps. Everything was humming along smoothly until one fine morning when I received an alert: “Pod ‘user-service’ is crashing.” Great. Let’s dive in.

Step 1: The Alert

The first step was to check the pod logs. They pointed to a NullPointerException in our user service, which serves as the front door for all user-related API calls. But the stack trace alone felt too generic; I needed more context. So I did what any good engineer does: added some logging and redeployed.

Step 2: Logging and Redeployment

I added logs to catch the exact point where the NullPointerException occurred. After redeploying, I waited for another crash. Sure enough, it happened again! This time, I had something more useful in my logs:

2020-11-02 10:34:56.123 [user-service] ERROR - Exception caught at UserServiceImpl.getUserDetails
java.lang.NullPointerException: null
        at com.example.service.UserServiceImpl.getUserDetails(UserServiceImpl.java:123)
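The instrumentation I added looked roughly like this. It's a sketch, not the real service code: User, ExternalApi, and the field names here are minimal stand-ins for the actual classes.

```java
import java.util.logging.Logger;

// Hypothetical stand-ins for the real service classes.
class User {
    String name;
    User(String name) { this.name = name; }
}

interface ExternalApi {
    User getUserDetails(String userId);
}

class UserServiceImpl {
    private static final Logger LOG = Logger.getLogger(UserServiceImpl.class.getName());
    private final ExternalApi externalApi;

    UserServiceImpl(ExternalApi externalApi) {
        this.externalApi = externalApi;
    }

    public User getUserDetails(String userId) {
        LOG.info("Fetching details for userId=" + userId);
        User userDetails = externalApi.getUserDetails(userId);
        // Extra logging around the external call to pinpoint exactly
        // where the NPE originates.
        LOG.info("External API returned: " + userDetails);
        LOG.info("Resolved user name: " + userDetails.name); // NPE surfaces here if the call returned null
        return userDetails;
    }
}
```

With the log lines bracketing the external call, the crash site is unambiguous: if "External API returned: null" prints right before the exception, the problem lies outside our own logic.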

Step 3: Code Inspection

With the log pointing to line 123, I inspected the code. It was a simple method call that should never have thrown an NPE under normal circumstances. I ran through all possible conditions and found nothing out of the ordinary.

Step 4: Istio Tracing

Since this service is part of our Istio mesh, I decided to enable tracing for it. After turning on tracing in the mesh configuration and restarting the pod, I waited for a call that would trigger a trace.
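For reference, the change looked roughly like this. Treat it as a sketch rather than our exact manifest: the precise fields vary by Istio version, and the 100% sampling rate is a debugging-only setting.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0   # sample every request while debugging
```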

Finally, I had my trace! It showed the full request flow: a request entering our user-service and, within it, the outbound call to the external API, ending in the NullPointerException. The trace made it clear that the external service was returning null data.

Step 5: External Service Issue

So it wasn't just our code; the external service we depend on was the culprit. I reached out to the team responsible for the external API and reported the issue. They were quick to respond and assured me they would investigate.

Step 6: Temporary Fix

In the meantime, since our user-service was critical, I decided to add a null check around the method call and deployed a temporary fix:

public User getUserDetails(String userId) {
    User userDetails = externalApi.getUserDetails(userId);
    // Fall back to an empty User when the external API returns null
    return Optional.ofNullable(userDetails).orElse(new User());
}

This ensured that we wouldn’t get an NPE even if the external service failed to provide data.
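A quick way to convince yourself the guard behaves as intended, again as a sketch: User and ExternalApi below are minimal stand-ins for the real classes, not the production code.

```java
import java.util.Optional;

// Hypothetical stand-ins for the real service classes.
class User {
    final String id;
    User() { this("<empty>"); }          // the fallback "empty" user
    User(String id) { this.id = id; }
}

interface ExternalApi {
    User getUserDetails(String userId);
}

class UserService {
    private final ExternalApi externalApi;

    UserService(ExternalApi externalApi) { this.externalApi = externalApi; }

    public User getUserDetails(String userId) {
        User userDetails = externalApi.getUserDetails(userId);
        // A null response from the external service now becomes an
        // empty User instead of propagating as an NPE.
        return Optional.ofNullable(userDetails).orElse(new User());
    }
}
```

The trade-off is that callers now see an empty User instead of an error, which is acceptable as a stopgap but hides the upstream failure; surfacing an explicit error to callers would be the longer-term fix.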

Reflections

Debugging in Kubernetes can be frustrating, especially when you have a complex setup like ours. It’s moments like these that make me appreciate tools like Istio and Jaeger for tracing, and Flux for managing our deployments reliably. But at the end of the day, it’s always back to basics—logging, debugging, and understanding the flow.

As I sit here, I’m reminded of Brian Kernighan’s line: “Debugging is twice as hard as writing the code in the first place.” And that’s true even with all the modern tools at our disposal. But hey, that’s what makes it interesting!

