$ cat post/the-firewall-dropped-it-/-a-grep-through-ten-years-of-logs-/-the-pod-restarted.md

the firewall dropped it / a grep through ten years of logs / the pod restarted


Debugging a Production Wasm Bug: A Lesson in Real-Time Debugging

July 29, 2024

Today’s entry is about an unexpected challenge that popped up while working on our serverless WebAssembly (Wasm) platform. It’s the kind of problem that reminds you why debugging at scale can be such a headache.

The Setup

At work, we’ve been diving deep into Wasm for backend services because of its portability and performance benefits. We’ve been using Cloudflare Workers as one of our main platforms; it runs Wasm modules in V8 isolates at the edge, not in the browser. Extending this into heavier server-side execution started as a fun side project: something that could handle real, heavy lifting.

The Problem

A few days ago, we started seeing strange behavior from one of our Wasm services. It was supposed to process large payloads and return a response almost instantly, but instead, it would just hang indefinitely. This wasn’t happening for every request; only certain ones—specifically, those with larger data sizes.

Initial Hunches

At first, I suspected some kind of memory leak or timeout issue, given the size of the payloads involved. The Wasm code was running in a restricted environment, so it couldn’t just use more resources willy-nilly. I spent hours tracing through the Rust code, trying to figure out where the bottleneck might be.

Debugging Tools

To get some visibility into what was going on, I turned to various debugging tools. I first reached for wasm-gc, but that tool only strips dead code from the binary; it shrank the module a bit and told me nothing about why some requests were hanging while others worked fine.

I also tried the wasi-sdk alongside wabt, whose utilities (like wasm-objdump and wasm-interp) gave better insight into the module’s structure and memory layout. But even with these tools, I couldn’t pin down exactly what was causing the issue.

The Breakthrough

Then it hit me—I needed a different approach. Rather than just looking at the Rust code or the Wasm bytecode directly, I decided to go back to first principles: how was the data being passed into and out of the Wasm module? Was there something in the payload itself that could be causing issues?

Using some basic logging, I started tracing the flow of data. The logs showed that for larger payloads, the input buffer seemed to get stuck somewhere—specifically at a certain byte size. But why would this only happen occasionally?
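The logging itself was nothing fancy. A minimal sketch of the idea, with a hypothetical function name and stage labels (in production this was just eprintln!-style tracing around the host-to-Wasm copy):

```rust
/// Log the size of a buffer at a named stage of the request path and
/// return it, so call sites can correlate the inbound and outbound sizes.
fn trace_buffer(stage: &str, buf: &[u8]) -> usize {
    eprintln!("{stage}: {} bytes", buf.len());
    buf.len()
}

fn main() {
    let payload = vec![0u8; 65_536];
    let inbound = trace_buffer("host->wasm", &payload);
    // ... call into the Wasm module here, then log again on the way out ...
    let outbound = trace_buffer("wasm->host", &payload);
    // For healthy requests the sizes matched; for the stuck ones the
    // outbound log line never appeared at all.
    assert_eq!(inbound, outbound);
}
```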

The Real Cause

After more digging, it turned out to be an encoding issue. Our service expected UTF-8 encoded strings, but some clients were sending payloads in other encodings (like ISO-8859-1), where characters such as é are single bytes in the 0x80–0xFF range rather than valid UTF-8 sequences. The Wasm module choked on those bytes because its decoding logic never handled the invalid-UTF-8 case.
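To see the failure in miniature, suppose a client sends "café" as ISO-8859-1: the é arrives as the lone byte 0xE9, which Rust’s UTF-8 validation rejects. A small sketch (not our production code):

```rust
fn main() {
    // "café" in ISO-8859-1: 'é' is the single byte 0xE9. In UTF-8,
    // 0xE9 announces a three-byte sequence, so on its own it's invalid.
    let latin1: &[u8] = &[b'c', b'a', b'f', 0xE9];
    match std::str::from_utf8(latin1) {
        Ok(s) => println!("valid UTF-8: {s}"),
        Err(e) => println!("invalid UTF-8, valid up to byte {}", e.valid_up_to()),
    }
    // prints: invalid UTF-8, valid up to byte 3
}
```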

Once I identified this, fixing it was straightforward: adding proper encoding checks and conversions. But the lesson here is how important it is to think through all possible edge cases, especially when dealing with untrusted data.
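The fix boiled down to validating first and converting on failure. A minimal sketch, assuming the fallback encoding is ISO-8859-1 (the function name is hypothetical, not our production API):

```rust
/// Decode a payload as UTF-8, falling back to ISO-8859-1 when validation
/// fails. In ISO-8859-1 every byte maps 1:1 to the Unicode code point of
/// the same value, so the fallback is a direct byte-to-char mapping.
fn decode_payload(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(s) => s.to_owned(),
        Err(_) => bytes.iter().map(|&b| b as char).collect(),
    }
}

fn main() {
    let utf8 = "café".as_bytes();                   // é = 0xC3 0xA9
    let latin1: &[u8] = &[b'c', b'a', b'f', 0xE9];  // é = 0xE9
    assert_eq!(decode_payload(utf8), "café");
    assert_eq!(decode_payload(latin1), "café");
}
```

If you need to handle more than one fallback encoding, a crate like encoding_rs is the usual choice; this std-only version covers just the Latin-1 case.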

The Aftermath

Updating the service to handle different encodings was a small but crucial fix. It taught me that no matter how much you plan or test in dev environments, real-world production issues often reveal blind spots you didn’t anticipate.

This incident also highlighted the importance of robust logging and debugging tools when working with serverless Wasm services. While we now have more powerful tools available (like the wasi-sdk and its associated utilities), it’s still critical to think through potential edge cases and validate assumptions.

Conclusion

Debugging at scale can be a real nightmare, but it’s also incredibly rewarding. This experience reminded me that even with the latest technologies, there are still fundamental principles of software engineering that stand the test of time, like validating untrusted input at every boundary and testing beyond the happy path.

In the world of AI/LLM infrastructure and FinOps pressures, these lessons become more critical than ever. Happy coding!