
Debugging a Nightmarish Nginx Configuration


January 27, 2005. It's been quite the start to the year, with Firefox 1.0 out in the wild and Google hiring like mad to fuel its growth. The sysadmin role is evolving too: more scripting, more automation. I find myself knee-deep in Perl scripts and Python glue code rather than spending all my time on hardware and low-level systems work.

Today, I’ve got a particularly gnarly problem to tackle: an Nginx configuration gone bad. Let’s dive into the details of this mess.

The Setup

We have a few dozen servers running our web application, with Nginx as the frontend reverse proxy. Each server is configured identically, but tonight one of them is throwing errors left and right, apparently stuck looping on requests for static files, and the entire service has ground to a halt.

The Symptoms

When I first log in via SSH, I can see the CPU pinned at 100% on the affected server. Nothing else is consuming much CPU or memory, and the disk is idle. Tailing the Nginx error log (/var/log/nginx/error.log), I find it filling with these messages:

2005/01/27 23:45:08 [error] 698#0: *15045 connect() failed (111: Connection refused) while connecting to upstream, client: X.X.X.X, server: example.com, request: "GET /static/css/style.css HTTP/1.1", upstream: "http://127.0.0.1:8080/static/css/style.css", host: "example.com"

It seems like Nginx is trying to connect to itself on port 8080 for some reason.
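Before touching the config, a few lines of Python glue can confirm that every failure points at the same upstream. A minimal sketch; the regex is my own guess at the message shape based on the log line above, not an official format specification:

```python
import re
from collections import Counter

# Pull the upstream URL out of error-log lines like the one above.
UPSTREAM_RE = re.compile(r'upstream: "(?P<upstream>http://[^"]+)"')

def count_failed_upstreams(lines):
    """Tally which upstream URLs show up in connection-refused log lines."""
    counts = Counter()
    for line in lines:
        if "Connection refused" not in line:
            continue
        m = UPSTREAM_RE.search(line)
        if m:
            counts[m.group("upstream")] += 1
    return counts

if __name__ == "__main__":
    # Inline sample here; in practice, pass open("/var/log/nginx/error.log").
    sample = [
        '2005/01/27 23:45:08 [error] 698#0: *15045 connect() failed '
        '(111: Connection refused) while connecting to upstream, '
        'upstream: "http://127.0.0.1:8080/static/css/style.css"',
    ]
    for upstream, n in count_failed_upstreams(sample).most_common():
        print(n, upstream)
```

Run against tonight's log, every single refused connection is aimed at 127.0.0.1:8080.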

The Investigation

I start by checking the Nginx configuration file (/etc/nginx/nginx.conf):

http {
    upstream backend {
        server 127.0.0.1:8080;
    }

    server {
        listen 80 default_server;
        root /var/www/html;

        location /static {
            proxy_pass http://backend;
        }
    }
}

Everything looks correct at first glance, but the error message is clearly indicating a problem with http://127.0.0.1:8080. Could it be that something else on the server is listening on port 8080? I use netstat to check:

$ sudo netstat -tlnp | grep 8080
tcp        0      0 127.0.0.1:8080          0.0.0.0:*               LISTEN      698/nginx

No other daemon is squatting on the port: Nginx itself is the listener on 8080. So why is it trying to connect back to itself, and why are those connects being refused?
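netstat answers who holds the socket; a small Python probe answers whether connects are actually being accepted at this moment, which is handy when refusals come and go. A sketch, demonstrated against a throwaway local listener rather than the production port:

```python
import socket

def port_accepting(host, port, timeout=1.0):
    """Return True if a TCP connect to host:port is accepted right now."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError and timeouts
        return False

if __name__ == "__main__":
    # Demo against a throwaway listener instead of the real 8080.
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))  # let the OS pick a free port
    srv.listen(1)
    port = srv.getsockname()[1]
    print(port_accepting("127.0.0.1", port))  # True: listener present
    srv.close()
    print(port_accepting("127.0.0.1", port))  # False: connection refused
```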

The Eureka Moment

Then it hits me: this server also hosts a development environment for our web application. We recently added a local development vhost to serve requests from IDEs and other dev tools, configured with its own copy of the backend upstream definition, and it was never properly cleaned up after testing.

I navigate to /etc/nginx/conf.d/ and find another configuration file for the development proxy:

upstream backend {
    server 127.0.0.1:8080;
}

server {
    listen 8080;
    location /static {
        root /var/www/dev/html;
    }
}

Sure enough, this is the culprit: the production upstream points at 127.0.0.1:8080, and this leftover dev vhost is what is listening there. Every request for /static was being proxied straight back into Nginx's own development server, so production traffic looped through the proxy into the dev document root, doubling the work for each request, and under tonight's load the workers could no longer keep up, which is where the refused connections came from.
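A leftover like this is exactly what a few lines of the Python glue code I've been writing lately can catch: scan the conf tree for upstream names defined more than once. A sketch (the regex is a rough approximation of nginx syntax, not a real parser, and the demo uses throwaway temp files rather than /etc/nginx):

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path

# Rough match for an nginx "upstream <name> {" definition line.
UPSTREAM_DEF = re.compile(r"^\s*upstream\s+(\S+)\s*\{", re.MULTILINE)

def find_duplicate_upstreams(conf_dir):
    """Map upstream name -> files defining it, keeping names defined twice or more."""
    seen = defaultdict(list)
    for path in sorted(Path(conf_dir).rglob("*.conf")):
        for name in UPSTREAM_DEF.findall(path.read_text()):
            seen[name].append(path.name)
    return {name: files for name, files in seen.items() if len(files) > 1}

if __name__ == "__main__":
    # Demo against temp files rather than the real conf tree.
    tmp = Path(tempfile.mkdtemp())
    (tmp / "prod.conf").write_text("upstream backend {\n    server 127.0.0.1:8080;\n}\n")
    (tmp / "dev.conf").write_text("upstream backend {\n    server 127.0.0.1:8080;\n}\n")
    print(find_duplicate_upstreams(tmp))  # {'backend': ['dev.conf', 'prod.conf']}
```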

Fixing the Problem

With this understanding, I point the production upstream at the port the application server actually listens on:

upstream backend {
    server 127.0.0.1:8081; # the real application backend; 8080 belongs to the dev vhost
}

server {
    listen 80 default_server;
    root /var/www/html;

    location /static {
        proxy_pass http://backend;
    }
}

I also clean up the development configuration. Its upstream block duplicated the production definition and was never referenced, since the dev vhost serves files straight from disk, so it gets deleted rather than renamed:

server {
    listen 8080;
    location /static {
        root /var/www/dev/html;
    }
}

After making these changes, nginx -t comes back clean and I reload Nginx. The server starts behaving as expected: CPU usage drops back to normal, and static files are served correctly.

Lessons Learned

This experience really highlighted the importance of keeping our configurations clean and consistent. It’s easy for small details like port numbers to slip through the cracks during development, especially when multiple environments share similar setups. As we continue to scale and add more automation, these kinds of issues can become even harder to debug.

Tomorrow, I’ll write a script to ensure that such discrepancies are caught early on. Maybe it’s time to automate our Nginx configuration validation too!
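Sketching a starting point for that script while the bug is fresh: the check that would have caught tonight's mess is flagging any loopback upstream port that one of Nginx's own vhosts listens on. Another rough regex pass, assumed rather than a real parser; nginx -t remains the real authority:

```python
import re

# Regex approximations of nginx "listen" and loopback upstream "server" lines.
LISTEN_RE = re.compile(r"^\s*listen\s+(?:[\d.]+:)?(\d+)", re.MULTILINE)
LOOPBACK_UPSTREAM_RE = re.compile(r"^\s*server\s+127\.0\.0\.1:(\d+)", re.MULTILINE)

def self_proxy_ports(conf_texts):
    """Ports that appear both as a listen port and as a loopback upstream server."""
    listens, upstreams = set(), set()
    for text in conf_texts:
        listens.update(LISTEN_RE.findall(text))
        upstreams.update(LOOPBACK_UPSTREAM_RE.findall(text))
    return sorted(listens & upstreams)

if __name__ == "__main__":
    confs = [
        "upstream backend {\n    server 127.0.0.1:8080;\n}\n"
        "server {\n    listen 80 default_server;\n}\n",
        "server {\n    listen 8080;\n}\n",  # the forgotten dev vhost
    ]
    print(self_proxy_ports(confs))  # ['8080'] -- Nginx would proxy to itself
```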


This was a particularly challenging bug, but it also taught me the value of keeping track of every detail in my configurations. As more of our infrastructure ends up managed by scripts and version control, lessons like this can save a lot of headaches down the line.