Ubuntu Server Slowly Stops Responding
Recently an Ubuntu 8.04 “Hardy Heron” server I have run for years started having random hangs in which services stop responding, one at a time. This is a server for our home network and runs a ton of services: SMB, NFS, and AFP file sharing; IMAP and SMTP mail; a media server; DHCP; DNS; MySQL… you name it. It has been so reliable that a reboot for any reason other than a power failure outlasting the UPS has been exceptional; it has gone more than a year at a time without rebooting. But lately this Ubuntu server slowly stops responding, and over a period of an hour or two essentially comes to a standstill.
Early in the progression of the problem, if I already had a terminal window open I could still do most things; it was responsive and nothing seemed wrong. But nobody’s laptop or iPad or smartphone could get a DHCP assignment. Email wasn’t being delivered. DNS lookups no longer resolved. I couldn’t log in, even as root from the console. The only choice, apparently, was to reboot. I learned about the magic SysRq keyboard codes to cleanly shut down a system even when it is hung, and started shopping for new hardware.
These problems became more frequent, but were always inconsistent. Failure was progressive over a period of a few hours. Usually email would stop being delivered first, then DHCP, then DNS.
There were no obvious clues that anything was wrong, except that /var/log/syslog stopped being updated at some point during this process; instead of normal entries, “-- MARK --” lines were written to it at 20-minute intervals (the default behavior of syslogd when it receives nothing to log).
So that was a clue: syslogd wasn’t getting anything written to it. Finally today I got a few search hits by focusing on syslog as a keyword and found a few useful posts in various places:
One post, “When syslogd hangs”, was particularly useful, and it references another post in the same thread as the first link above: “syslogd hang / livelock (was: 2.6.10-rc3, syslogd hangs then processes get stuck in schedule_timeout)”.
All of these implied that the problem was that somehow (and I have yet to explain this) syslogd was getting hung. In normal operation it reads from /dev/log, a Unix-domain socket, and distributes all the log messages written there to the various destinations it is configured for. If syslogd hangs, /dev/log’s buffer stops being drained and eventually fills up, and the many programs on the system that write syslog messages then block because their writes can no longer complete.
I tested this theory on my system by stopping syslogd (“killall -STOP syslogd”) and observed something very close to the behavior I have been wondering about for the last few months. So that appears to be the underlying problem.
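The killall -STOP test above requires root and a live syslogd. The same blocking mechanism can be sketched self-contained with a FIFO standing in for /dev/log (a rough analogy only, since /dev/log is really a Unix-domain socket, and every name below is illustrative):

```shell
#!/bin/sh
# Rough, self-contained analogy for the failure mode, using a FIFO in
# place of /dev/log: once the reader (standing in for syslogd) stops
# consuming, writers block as soon as the kernel buffer fills.
fifo=$(mktemp -u) && mkfifo "$fifo"

cat "$fifo" > /dev/null &     # the "syslogd" reader
reader=$!
exec 3> "$fifo"               # hold a write end so cat's open completes
kill -STOP "$reader"          # simulate syslogd hanging

# Write more than the pipe buffer (64 KiB on Linux); the writer blocks,
# just as processes calling syslog(3) did on the server.
head -c 200000 /dev/zero > "$fifo" &
writer=$!
sleep 2
writer_blocked=no
if kill -0 "$writer" 2>/dev/null ; then
    writer_blocked=yes
    echo "writer is blocked"
fi

kill -CONT "$reader"          # "restart" the reader: the backlog drains
wait "$writer"
echo "writer finished"
exec 3>&-                     # close our write end so cat sees EOF
kill "$reader" 2>/dev/null || true
rm -f "$fifo"
```

On the real server the writer side was every daemon calling syslog(3); here the blocked head process plays that role.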
For Now, a Brute Force Solution (but it works!)
The real solution, of course, is to diagnose what is causing syslogd to hang. Not being quite that pure at the moment, I took a shortcut hinted at in some of the links above: a simple shell script that detects when syslog logging is no longer completing, kills syslogd, and restarts it. Just to cover all the bases, I also removed sysklogd and replaced it with rsyslog:
# apt-get purge sysklogd
# apt-get install rsyslog
Here is the script that checks whether syslog is stuck, which I placed at /usr/local/bin/test-syslog-restart. Under the hang condition, the initial “logger” command will block, so I put it in the background and check whether it is still running 20 seconds later. Adjust the last sleep command to set the check interval; as written the cycle is only about 1 minute, which is probably excessive, but for now I want to know quickly when something is happening.
#!/bin/bash
# This script should be started and backgrounded. It will loop forever
# checking whether the logger command blocks, implying that the rsyslogd
# daemon has hung. If so, it will restart rsyslogd.
prog=`basename $0`
while true ; do
    logger -p syslog.info -t $prog "Testing if syslog is blocking" &
    pid=$!
    sleep 20
    if [ "`ps -p $pid -o pid h`" = "$pid" ] ; then
        killall -9 rsyslogd
        killall -9 rklogd
        sleep 3
        /etc/init.d/rsyslog restart
        logger -p syslog.crit -t $prog "rsyslogd was blocking; restarted"
    fi
    sleep 40
done
It is critical that once the script detects that logging is hung, it not do anything that depends on further log messages being delivered. The normal system startup scripts are not reliable on that point, so I mercilessly kill the old syslogd-related processes directly with the unblockable KILL (-9) signal, and then restart them.
This script is started from /etc/rc.local as follows:
if [ -x /usr/local/bin/test-syslog-restart ] ; then
    echo "Starting syslog hang check script"
    /usr/local/bin/test-syslog-restart &
fi
Note that this isn’t completely reliable, since something could kill the script itself.
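One partial guard against that (my own suggestion, not something from the setup described above) would be a cron entry that relaunches the checker if it ever dies; the /etc/cron.d path and the pgrep pattern below are assumptions:

```shell
# Hypothetical /etc/cron.d/test-syslog-restart entry: every five minutes,
# relaunch the checker if no instance is running. The bracketed pattern
# keeps pgrep -f from matching this cron command line itself, and output
# is redirected so cron does not hold the job's pipes open.
*/5 * * * * root pgrep -f '[t]est-syslog-restart' > /dev/null || /usr/local/bin/test-syslog-restart > /dev/null 2>&1 &
```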
I’ve tested this by sending rsyslogd the STOP signal, effectively hanging it. Within about 2 minutes the /dev/log buffer appeared to fill up, and the logger command in the script did not complete in the allotted 20 seconds. As a result the stopped rsyslogd daemons were killed, new ones were started, and everything went back to normal.
Watchdog Doesn’t Work Either, For the Same Reason
One avenue I had pursued was using the “watchdog” package with a custom C program to detect the hang and trigger a reboot, but it never worked. After inspecting the watchdog source code, the reason became clear: the first thing watchdog does when triggered to reboot is write a syslog message. So watchdog simply becomes another stuck process, unable to do anything while it waits for its syslog write to complete. Recompiling watchdog not to use syslog is another option, but I did not want to open that can of worms when something simpler (if less elegant) works, and I didn’t want to reboot the system if something less drastic could restore operation.
To me it seems pretty astounding that the reliability of a server OS could be dependent on one simple problem. Some day I might look into why syslogd hung in the first place, but if this works fine I am unlikely to do so :-). If anyone else has similar experiences and, even better, superior solutions, please mention that in the comments.
Here is an update 9 days later. I’ve been monitoring /var/log/syslog for the script’s message that it restarted rsyslog, and it happened for the first time at 01:46 today. The early symptoms of non-responsiveness were a DHCP-triggered DNS update (adding a forward map for an internal system) timing out, followed by a sequence of invalid query packets from avahi-daemon. In any case, the script successfully killed and restarted rsyslogd, and without looking at /var/log/syslog I would never have known the difference. Unfortunately this means the underlying problem is still there: substituting rsyslog for sysklogd did not fix it.