
#9 watchdog can stall when logging to syslog

open
nobody
None
5
2013-03-18
2013-03-18
Brian Kroth
No

Hi, I've recently noticed that a problem with syslog can stall watchdog when it's run with -v.

The situation is as follows:

watchdog.conf contains
file = /var/log/messages
change = 600
logtick = 150

and is run with -v so that /var/log/messages (according to the syslog-ng.conf rules) receives a log message from watchdog, and probably from many other sources, at least once every 10 minutes. The intention is to check for both syslog and I/O problems.
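As a rough illustration of what the file/change pair above is meant to achieve, here is a minimal Python sketch of a staleness check. It assumes the check boils down to comparing the file's modification time against the change interval; the function name file_is_stale and the temporary file standing in for /var/log/messages are hypothetical, and the real daemon's logic may differ.

```python
import os
import tempfile
import time


def file_is_stale(path, change, now=None):
    """Return True if `path` was last modified more than `change` seconds ago."""
    if now is None:
        now = time.time()
    mtime = os.stat(path).st_mtime
    return (now - mtime) > change


# A temporary file stands in for /var/log/messages (assumption for the demo).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name

now = time.time()
os.utime(demo_path, (now - 900, now - 900))  # pretend the last write was 15 minutes ago
stale_result = file_is_stale(demo_path, change=600, now=now)
os.utime(demo_path, (now - 60, now - 60))    # pretend the last write was 1 minute ago
fresh_result = file_is_stale(demo_path, change=600, now=now)
os.remove(demo_path)
print(stale_result, fresh_result)
```

The point of the report is that a check like this only helps if the process performing it is still running its main loop.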

If for some reason syslog(-ng) stops reading from the /dev/log socket (this can be simulated with "pkill -STOP -f syslog" or see [1] for a real world example), then after a short period the /dev/log socket buffer will fill up and new clients will block while attempting to write to it - including watchdog. This will happen when /dev/log is opened with either SOCK_DGRAM or SOCK_STREAM - it doesn't matter which (though other clients are not affected as quickly with SOCK_STREAM). The result is that watchdog is blocked so never gets a chance to check /var/log/messages for a lack of changes and subsequently restart the machine. Below is an example of the last line of strace output of watchdog confirming this:
Process 6760 attached - interrupt to quit
sendto(0, "<30>Mar 15 10:20:12 watchdog[676"..., 57, MSG_NOSIGNAL, NULL, 0
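The buffer-filling behaviour can be reproduced in miniature without touching the real /dev/log. The sketch below is an assumption-laden stand-in: a Unix datagram socketpair replaces /dev/log, the reader end plays a stopped syslog daemon that never calls recv(), and the writer is put into non-blocking mode so that the point where a blocking client would hang shows up as an error instead. The helper name fill_until_full and the message text are made up for the demo.

```python
import socket


def fill_until_full(sender, payload, limit=100_000):
    """Send datagrams without blocking; return how many were accepted
    before the kernel refused one (or `limit` if it never did)."""
    sender.setblocking(False)
    sent = 0
    while sent < limit:
        try:
            sender.send(payload)
        except OSError:
            # EAGAIN/EWOULDBLOCK (BlockingIOError), or ENOBUFS on some systems.
            # A blocking socket would be stuck inside send() at this point.
            break
        sent += 1
    return sent


# `reader` plays the stopped syslog daemon: it exists but never reads.
writer, reader = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
message = b"<30>Mar 15 10:20:12 watchdog[0000]: still alive"
accepted = fill_until_full(writer, message)
print(f"kernel accepted {accepted} messages before refusing more")
writer.close()
reader.close()
```

The exact count depends on the platform's socket buffer sizes, so no particular number should be expected.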

Using a process check doesn't appear to help. My thought is that watchdog should instead be changed to use something like syslog-async [2], or at least that the man page should mention this edge case.
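To make the idea concrete, here is a minimal Python sketch of a logger that never blocks its caller. It is not syslog-async's actual implementation, only an illustration of the general approach as I understand it: send in non-blocking mode, keep a bounded queue of unsent lines, and retry later. The class name NonBlockingLogger, the queue size, and the socketpair standing in for /dev/log are all assumptions.

```python
import collections
import socket


class NonBlockingLogger:
    """Queue log lines and push them to a socket without ever blocking."""

    def __init__(self, sock, max_pending=100):
        sock.setblocking(False)
        self.sock = sock
        # Oldest entries are discarded once the queue is full.
        self.pending = collections.deque(maxlen=max_pending)

    def log(self, line):
        self.pending.append(line)
        self.flush()

    def flush(self):
        """Send as many queued lines as the socket accepts right now."""
        while self.pending:
            try:
                self.sock.send(self.pending[0])
            except OSError:
                return  # receiver is not draining; try again later
            self.pending.popleft()


writer, reader = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
logger = NonBlockingLogger(writer, max_pending=100)

# The reader is "stopped": nothing is consumed, yet log() always returns.
for i in range(5000):
    logger.log(b"watchdog: still alive %d" % i)
pending_while_stalled = len(logger.pending)

# The reader "resumes": drain everything, then let the logger catch up.
reader.setblocking(False)
while True:
    try:
        reader.recv(4096)
    except BlockingIOError:
        break
logger.flush()
pending_after_drain = len(logger.pending)
print(pending_while_stalled, pending_after_drain)
writer.close()
reader.close()
```

The trade-off is that messages can be lost while the receiver is stalled, which seems acceptable for a daemon whose main job is to keep running its checks.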

Let me know if you have any questions or need any more details.

Thanks,
Brian

[1] http://serverfault.com/questions/101028/can-remote-logging-with-syslog-ng-hang-my-application/358042#358042
[2] http://thekelleys.org.uk/syslog-async/READ-ME

Discussion

  • Paul Crawford

    Paul Crawford - 2013-03-18

    Thanks for reporting this. I am currently making a lot of changes to the watchdog daemon code which will eventually appear on git, etc, at V6.xx and maybe something like the async syslog could be included.

    However, my first observation is you are probably not using any watchdog hardware? If so the blocking of the daemon should trigger a hardware reset anyway. Without hardware support, the watchdog daemon is vulnerable to its own bugs/problems as well as kernel panic, etc.
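    The reasoning can be shown with a toy model. This is only a simulation in Python, not a driver interface: the class name SimulatedWatchdogTimer and the 60-second timeout are assumptions chosen for illustration. The timer resets the machine unless something keeps refreshing it, so a daemon stuck in a blocked write stops refreshing and the reset follows.

```python
class SimulatedWatchdogTimer:
    """Toy model of a watchdog timer that must be refreshed periodically."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.last_kick = now

    def kick(self, now):
        # The daemon calls this on every pass through its main loop.
        self.last_kick = now

    def would_reset(self, now):
        # True once the timer has gone unrefreshed for longer than `timeout`.
        return (now - self.last_kick) > self.timeout


timer = SimulatedWatchdogTimer(timeout=60)
for t in (10, 20, 30):
    timer.kick(now=t)                    # daemon is healthy and looping

healthy = timer.would_reset(now=40)      # only 10 s since the last kick
stalled = timer.would_reset(now=200)     # daemon blocked since t=30
print(healthy, stalled)
```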

    Try installing the lm-sensors package and running sensors-detect to find out what hardware that uses, as a lot of the system monitor chips also have a watchdog. Then take a look in the likes of /etc/modprobe.d/blacklist-watchdog.conf to see if the same chip is mentioned. If so, add the driver listed in /etc/modprobe.d/blacklist-watchdog.conf to /etc/modules (or modprobe it) and you should then have hardware support.

    If that fails (i.e. no supported hardware) then at least try the softdog software timer module, which presents /dev/watchdog.

    Regards,
    Paul

     
  • Brian Kroth

    Brian Kroth - 2013-03-18

    *sigh*

    You're right. That conf snippet was incomplete, but I had completely forgotten about the softdog module. I've mostly only ever used watchdog on physical machines, which all have watchdog hardware (usually IPMI). It was added to our VMs recently as yet another belt and suspenders way of detecting semi-broken machines (the kernels are already configured to panic and reboot on various I/O problems). Since it has been years since I used the softdog module, I only went as far as making watchdog trigger on syslog, load, and a handful of other software-only events. I'll definitely be looking into adding that soon.

    Anyways, I think syslog-async might still be useful. Thanks for considering it and thanks for catching my mistake.

    Cheers,
    Brian

     
  • Paul Crawford

    Paul Crawford - 2013-03-18

    No problem, I hope that helps.

    In our own experience of VM problems, you may have to resort to an external watchdog approach (i.e. run the watchdog on the host, not inside the VM) to look at the internal VM activity.

    We say "belt and braces" over here, you might want to look up suspenders on a UK site to see why that is amusing :)
    Regards,
    Paul

     
