
Incident Report: Site Outage 07 October

Multitasking got the best of me today when I distractedly decided to upgrade my web host server from Ubuntu 20.04 to 24.04.

At or around 11:15am EST, I brazenly went forth on my attempt to update my AWS instance. As I was running apt update I realized that I should create at least one backup of the instance, and took a snapshot. It’s a good thing I did, or I wouldn’t be here today.

The initial upgrade from 20.04 to 22.04 went smoothly. So, I decided (with far more confidence than I had any right to) that I would go ahead and update all the way to 24.04. Did I take another snapshot? No, no I did not. Had I not been smack in the middle of building an analytics dashboard in R and trying to give feedback on a strategy document, I might have thought to. Oh well.

When I was going through the process of upgrading to 24.04, I got a message that told me that I had different ssh configuration files on my host machine and the remote update server. There were a bunch of options to choose from. I carefully took the time to read each one and analyze the differences before I made a decision about how to resolve the conflict.

JK, I just hit OK and replaced the local one with the shiny new one.

Folks, don’t do that.
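For anyone facing the same prompt, here is a minimal sketch of what "analyze the differences" could look like: a few lines of Python that print a unified diff between the config you already have and the one the upgrade wants to install. The function name config_diff and the sample file contents are made up for illustration. They are not my real settings, and on a real server you would read both versions from disk (which paths to read is an assumption left to you).

```python
import difflib

def config_diff(local_text, incoming_text):
    """Return unified-diff lines showing what the incoming file would change."""
    return list(difflib.unified_diff(
        local_text.splitlines(),
        incoming_text.splitlines(),
        fromfile="local sshd_config",
        tofile="incoming sshd_config",
        lineterm="",
    ))

# Hypothetical file contents, for illustration only.
local = "Port 22\nPasswordAuthentication no\nPubkeyAuthentication yes\n"
incoming = "Port 22\n#PasswordAuthentication yes\n#PubkeyAuthentication yes\n"

for line in config_diff(local, incoming):
    print(line)
```

Lines starting with a minus sign are settings that would disappear if you accept the new file, and those are the ones worth a second look before hitting OK.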

I’m not skilled enough with ssh to know whether or not there was a better way to fix things when I suddenly could no longer get into my web server. How do you get into a cloud server when you can’t ssh in? Rather than try and figure it out, I just decided to recreate and migrate the site. If you were on the M60 with me today, I promise I wasn’t hacking or anything, I was just trying to figure out whether or not my web server config files needed to be updated with my new static IP address.

By the time I got home, I had successfully created a new instance from the original snapshot with a new static IP address. I was able to confirm that my domains were successfully pointing to the new machine, but the site was still broken. Then, I had to go to a bunch of back-to-back meetings, and wasn’t able to do the literal last step, which was to reopen the port for HTTPS to actually allow web traffic back in.
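A quick way to tell "DNS is fine but the port is closed" apart from other kinds of broken is to try a plain TCP connection to the HTTPS port (443). The sketch below is a rough check, not the exact thing I ran. The helper name port_is_open is made up, and the host in the usage comment is a placeholder rather than my real domain.

```python
import socket

def port_is_open(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False

# Usage (placeholder host, swap in your own domain):
#   port_is_open("example.com")
```

A False here only says that nothing answered on that port from where you ran the check. It doesn't say why, so the firewall rules are still the next place to look.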

I wouldn’t normally write an incident report for my personal website. It doesn’t really matter to anyone other than me, but today marked a really meaningful part of my journey as a developer: for the first time, I knew I would be able to fix it on my own.

I’ve been running my site for over ten years now. It’s migrated from a hosted WordPress site to a managed self-hosted site on Azure and now to me running it myself on AWS with a server that I manage. And even though sometimes I end up in rough situations like these, I surprised myself today. Maybe it’s because I’ve been using Ubuntu as my desktop operating system as well as my web host, but today I finally got to experience the satisfaction that came with taking down – and then bringing back up – my own site, without having to ask for help.

That rules.