Monday, February 14, 2011

When in doubt, reboot? Not Unix boxes

Last week, I wrote a little item titled "Nine traits of the veteran Unix admin." Had I known that a few hundred thousand people would read it in just a few days, I might have put on a clean shirt or something. Regardless, I'm sorry I stopped at nine. I had at least fifteen, but it was already a long post.

The most interesting aspect of the feedback I received was that the vast majority of readers agreed with me on just about every point -- with the exception of the first and last items: sudo and reboots. (A few folks also hammered me for supposedly leaving out vi in favor of vim -- I did include it! Others decided that because I referenced Perl briefly, I must never have used bash, or some such nonsense.)


I want to take a closer look at the reboot issue. It's a hot spot for all server admins, but to Unix geeks, it's a deeper issue -- probably because Windows admins use reboots as one of their first troubleshooting steps, while it's one of the last for the Unix team.

Here's the reality: Server reboots should be rare -- very rare. I cited kernel updates and hardware replacement as the two leading causes of reboots in the Unix world. Some have mentioned significant security risks in not rebooting servers, but that's nonsense. If there's a security risk present in a service or application, a patch can be applied without requiring a reboot. If the security risk is present in a kernel module, it's generally possible to unload the module, apply a patch, and reload the module. Yes, as I said, you need to reboot if there's a security risk in the kernel. Otherwise, there's no real reason to reboot a Unix box.

Some argued that other risks arise if you don't reboot, such as the possibility that certain critical services aren't set to start at boot, which can cause problems later. This is true, but it shouldn't be an issue if you're a good admin. Forgetting to set service startup parameters is a rookie mistake. Naturally, if you're building the box and it's not in production, you can do all the reboot tests you want without adverse effects. That's just good practice.
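You don't need a reboot to find out which services would fail to come back. Here's a rough sketch in Python, assuming a SysV-style box where `chkconfig --list` prints one line per service with `N:on` or `N:off` for each runlevel. The function names and the sample output are mine, purely for illustration, and the list of running services is something you'd gather yourself.

```python
def enabled_at_runlevel(chkconfig_output, runlevel=3):
    """Return the set of services marked 'on' for the given runlevel.

    Assumes lines shaped like:
        sshd   0:off 1:off 2:on 3:on 4:on 5:on 6:off
    Lines that don't match (blank lines, xinetd entries) are ignored.
    """
    marker = "%d:on" % runlevel
    enabled = set()
    for line in chkconfig_output.splitlines():
        fields = line.split()
        if len(fields) > 1 and marker in fields[1:]:
            enabled.add(fields[0])
    return enabled


def not_enabled_at_boot(running, enabled):
    """Return services that are running now but won't start at boot."""
    return sorted(set(running) - set(enabled))


# Hypothetical sample output, not captured from a real system.
SAMPLE = """\
sshd            0:off 1:off 2:on  3:on  4:on  5:on  6:off
httpd           0:off 1:off 2:off 3:off 4:off 5:off 6:off
"""

enabled = enabled_at_runlevel(SAMPLE, runlevel=3)
print(not_enabled_at_boot(["sshd", "httpd"], enabled))  # prints ['httpd']
```

On a systemd box the same comparison applies, but the input would come from `systemctl list-unit-files` and would need a different parser.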

But there's another side: Those who consider reboots to be a worthy troubleshooting step are going to get themselves in trouble sooner rather than later. Let's say a Unix box has gone wonky. A few services that were running will no longer start, maybe with a segfault, and other oddities abound.

If you shrug and reboot the box after looking around for a few minutes, you may have missed the fact that a junior admin inadvertently deleted /boot and some portions of /etc and /usr/lib64 due to a runaway script they were writing. That's what was causing the segfaults and the wonky behavior. But since you rebooted the server without digging into the problem, you've made it much worse, and you'll soon boot a rescue image -- with all kinds of ponderous work awaiting you -- while a production server is down.
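A thirty-second sanity check before you reach for the reboot would have caught this. Here's a minimal sketch, assuming Python is still usable on the box. The path list simply mirrors the directories in the example above, and `missing_or_empty` is a name I made up for illustration.

```python
import os

# Illustrative list only; adjust it for the system in question.
CRITICAL_PATHS = ["/boot", "/etc", "/usr/lib64"]


def missing_or_empty(paths):
    """Return the paths that are absent, or that are directories with no entries."""
    problems = []
    for path in paths:
        if not os.path.exists(path):
            problems.append(path)
            continue
        if os.path.isdir(path):
            try:
                if not os.listdir(path):
                    problems.append(path)
            except OSError:
                # Unreadable directory: can't tell, so don't flag it here.
                pass
    return problems


if __name__ == "__main__":
    for path in missing_or_empty(CRITICAL_PATHS):
        print("WARNING: %s is missing or empty -- do not reboot yet" % path)
```

This only checks that the directories exist and contain something. It won't notice a partially gutted /etc, so treat a clean result as a starting point rather than an all-clear.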

This is but one significant reason reboots in the Unix world should be extremely rare. Rather than a troubleshooting step, they're a Hail Mary approach to server administration. In short, nobody ever fixed a problem caused by a full /var partition by rebooting the box. (And don't give me any pedantic nonsense about open filehandles -- you know what I mean.)
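For the pedants: here's a rough sketch of how you might look at both the full partition and the open-filehandle case without rebooting anything. It assumes a Linux box with /proc mounted, where the symlinks under /proc/<pid>/fd end in " (deleted)" once a file has been unlinked but is still held open. The function names are mine, and you'll generally need root to see other users' processes.

```python
import os
import shutil


def percent_used(path):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def deleted_open_files(proc_root="/proc"):
    """Return (pid, target) pairs for descriptors that point at deleted files."""
    found = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return found  # no /proc here, nothing to report
    for pid in entries:
        if not pid.isdigit():
            continue  # skip non-process entries such as "self"
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target.endswith(" (deleted)"):
                found.append((int(pid), target))
    return found


if __name__ == "__main__":
    print("/var is %.1f%% full" % percent_used("/var"))
    for pid, target in deleted_open_files():
        print("pid %d still holds %s" % (pid, target))
```

If a process turns up holding a huge deleted log, restarting that one service frees the space. The box itself stays up.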

In many cases, it's extremely important not to reboot, because the key to fixing the problem is present on the system before the reboot, but will not be immediately available after. The problem will recur, and if the only known solution is to reboot, then the problem will never be fixed unless or until someone decides not to reboot and instead tries to find the root of the problem. Unfortunately, that's not as common as it should be. Face it -- a bad stick of RAM cares not a whit about system uptime or when the box was last booted. It'll cause problems no matter what.

The next time you're looking at a problem and someone says, "Hey, let's just reboot the thing," make sure you've exhausted every other possibility before you send it to init 6. The time and pain you save will definitely be your own.
