$ cat post/nmap-on-the-lan-/-a-port-scan-echoes-back-now-/-the-cron-still-fires.md

23MAY05

nmap on the lan / a port scan echoes back now / the cron still fires

Debugging a Dilemma with Python and Xen

May 23, 2005 was just another day in the life of an ops guy, right? Well, not exactly. At that time, I was working on a web infrastructure project where we were using Xen as our hypervisor and Python for automation scripts. It felt like every month brought new changes to how we approached devops, and May 2005 was no different.

The Setup

We were running a series of e-commerce websites, each hosted in separate Xen virtual machines (VMs). Our goal was to streamline the setup process by automating everything with Python scripts. We had a solid infrastructure up and running, but every now and then, we’d hit some nasty bugs that made me want to pull out my hair.

One day, I woke up to an email from our support team: “VM 12345 is down. Can you look into it?” Of course, the usual suspects were not to blame—no sudden power outage or hardware failure this time. It was something more subtle and elusive.

The Symptoms

The VM appeared to be in a paused state with no obvious errors in the logs. SSH wouldn’t connect; all I could do was watch as the CPU usage dropped to zero. After some quick googling, it looked like others had run into this before: Xen’s xenstored service was likely dead or misbehaving.

I dug deeper into the logs and noticed a few suspicious entries. xenstore-write was failing intermittently, a sign that xenstored was becoming unresponsive. I decided to set up some traps in our Python script to catch these errors and restart xenstored if necessary. Simple enough, right?
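To make "catch these errors" a bit more concrete, here is a minimal sketch of a probe that treats a hung command the same as a failed one. The function name `probe_ok` and the timeout value are my own illustration, not part of the original script, and which command to probe with is an assumption; a read-only client such as `xenstore-ls` would be a plausible candidate.

```python
import subprocess


def probe_ok(cmd, timeout=5.0):
    """Return True if cmd exits with status 0 within `timeout` seconds.

    `cmd` is an argument list, e.g. a read-only xenstore client command
    (which one to use is an assumption; the post does not say).
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The command hung: treat an unresponsive daemon as a failure.
        return False
    except OSError:
        # The probe binary is missing or not executable.
        return False
    return result.returncode == 0
```

The timeout matters here because a client talking to an unresponsive daemon may block rather than exit with an error, so checking the exit code alone could miss exactly the failure described above.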

The Script

Here’s a snippet of the Python code I ended up with:

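import subprocess
import time

CHECK_INTERVAL = 30  # seconds between checks; an arbitrary placeholder value

def check_something_fails():
    # Placeholder for the real health check, which isn't shown here.
    # It should return True when xenstored looks dead or unresponsive.
    return False

def restart_xenstored():
    try:
        subprocess.check_call(['service', 'xenstored', 'restart'])
        print("Restarted xenstored")
    except (subprocess.CalledProcessError, OSError) as e:
        # CalledProcessError: non-zero exit; OSError: 'service' could not be run
        print(f"Failed to restart xenstored: {e}")

while True:
    # Do some checks and logging
    if check_something_fails():
        restart_xenstored()
    time.sleep(CHECK_INTERVAL)  # pause so the loop doesn't spin the CPU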

I thought this was a neat little script. If xenstored died, it would be restarted automatically. Easy peasy lemon squeezy.
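One thing a loop like this does not guard against is a restart storm: if xenstored keeps failing, the watchdog keeps restarting it and hides the real problem. Below is a rough sketch of one way to cap that, using a sliding-window limiter. The names `RestartLimiter` and `watch`, and all the default numbers, are hypothetical and not taken from the original script.

```python
import time
from collections import deque


class RestartLimiter:
    """Allow at most `max_restarts` within any `window` seconds."""

    def __init__(self, max_restarts=3, window=300.0, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window = window
        self.clock = clock
        self._stamps = deque()  # times of recent permitted restarts

    def allow(self):
        now = self.clock()
        # Forget restarts that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_restarts:
            return False
        self._stamps.append(now)
        return True


def watch(is_failing, restart, limiter, interval=30.0,
          sleep=time.sleep, max_iterations=None):
    """Poll is_failing(); restart when it is True and the limiter permits.

    Returns the number of restarts performed. max_iterations=None
    means run forever.
    """
    restarts = 0
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        if is_failing():
            if limiter.allow():
                restart()
                restarts += 1
            else:
                print("Restart limit reached; leaving it for a human to inspect")
        sleep(interval)
        iterations += 1
    return restarts
```

The clock and sleep functions are passed in so the loop can be exercised with a fake clock instead of waiting in real time. In real use, `is_failing` and `restart` would be the health check and the restart function from the script above.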

The Problem

But as I continued to monitor the VMs, another issue cropped up: not all of them were behaving the same way. Some machines were fine for days on end; others seemed to have a one-in-ten shot at crashing. This inconsistency was driving me crazy.

I started digging through our configuration files and noticed something odd—some VMs had different versions of Python installed. Could this be causing the issue? I decided to run some tests:

import platform

def get_python_version():
    return platform.python_version()

print(get_python_version())

Sure enough, there was a difference in Python version between the VMs that worked flawlessly and those that had issues. A bit of research led me to realize that older versions of Python could cause compatibility issues with newer versions of Xen.
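Comparing versions by eye gets tedious past a handful of machines. Assuming the output of the snippet above has been collected into a dict of host name to version string (how it gets collected is left out, and the helper names here are made up for illustration), a small sketch for spotting the odd ones out might look like this:

```python
from collections import defaultdict


def group_by_version(versions):
    """Map each version string to the sorted list of hosts reporting it."""
    groups = defaultdict(list)
    for host, version in versions.items():
        groups[version].append(host)
    return {v: sorted(hosts) for v, hosts in groups.items()}


def outliers(versions):
    """Return hosts whose version differs from the most common one."""
    groups = group_by_version(versions)
    if not groups:
        return []
    # Ties are broken by version string so the result is deterministic.
    majority = max(groups, key=lambda v: (len(groups[v]), v))
    return sorted(
        host
        for version, hosts in groups.items()
        if version != majority
        for host in hosts
    )
```

Called with made-up data such as `{"vm-a": "2.4.1", "vm-b": "2.3.5", "vm-c": "2.4.1"}`, `outliers` would single out `vm-b`. The most common version is only a heuristic for "the expected one"; it says nothing about which version is actually correct.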

The Fix

With this information, I decided to standardize all our Python environments across the board. It wasn’t easy—some scripts were hardcoded to specific Python versions. But after a few long nights of refactoring and testing, we finally got everything working as expected.
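Finding the scripts that were tied to a specific interpreter is the kind of chore that is easy to automate. As a hedged sketch, assuming "hardcoded" means a versioned interpreter in the shebang line (the post does not say exactly how the scripts were pinned, and the function names are hypothetical), a scan could look like this:

```python
import re
from pathlib import Path

# Matches interpreter names such as python2.3 or python2 in a shebang line.
_VERSIONED = re.compile(r"python(\d+(?:\.\d+)?)")


def pinned_version(script_path):
    """Return the version pinned in the script's shebang, or None."""
    with open(script_path, "r", errors="replace") as f:
        first_line = f.readline()
    if not first_line.startswith("#!"):
        return None
    match = _VERSIONED.search(first_line)
    return match.group(1) if match else None


def find_pinned_scripts(root):
    """Yield (path, version) for each *.py file under root with a versioned shebang."""
    for path in sorted(Path(root).rglob("*.py")):
        version = pinned_version(path)
        if version is not None:
            yield path, version
```

This only catches shebang pinning. Scripts that check the interpreter version at runtime or call a versioned binary from inside the code would need a different search.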

In the end, I wrote up a post about what I learned from this experience:

Title: Standardizing Python Environments in Xen VMs

In our efforts to automate infrastructure with Python scripts on Xen VMs, we encountered an interesting issue where some VMs would periodically fail due to xenstored service issues. Through careful debugging and testing, we discovered that the problem lay in differences between Python versions. By standardizing all Python environments across our Xen setup, we were able to resolve these recurring issues.

Lessons Learned

This experience taught me a few valuable lessons:

Consistency is key: Standardize your environment as much as possible.
Debugging is a process: Don’t rush to a solution without thorough investigation.
Keep learning and adapting: Technology evolves, and so should your approach.

In the grand scheme of things, this might seem like a small issue compared to today’s infrastructure challenges, but it solidified my understanding that every problem is an opportunity to learn something new.

That’s what May 23, 2005 was all about for me. A day spent wrestling with bugs and emerging technologies, just like any other day in ops.