[Openipmi-developer] IPMI SEL vs. watchdog timeouts
From: Christian T. <ct...@fl...> - 2024-01-04 21:23:51
Hi everyone, hi Corey,

(I hope everyone had a good holiday season and made it into 2024 in good health!)

You might remember that I was chasing mysterious watchdog reboots that showed no specific issues on the serial console or in the SEL. In December we stumbled over an insight that gave us a valuable clue.

We had been running a reduced watchdog timeout (60 seconds, with systemd signalling every 20 seconds). I *think* this may have led to either false positives *OR* simply shadowed lockups/stalls that the kernel might have reported, had its detectors been given more time to find and report them before the watchdog reset the machine.

We have now increased our watchdog timeouts to 5 minutes and have even decided to remove watchdogs from KVM hosts (keeping them enabled on routers, backup servers and Ceph servers, as those will not cause service interruptions when the watchdog triggers).

We have not seen an actual lockup or stall in the last 3 weeks yet, but I think we did solve a significant part of the mystery. Maybe reporting it here records it for posterity and helps someone else in the future.

Thanks for the help so far!

Christian

-- 
Christian Theune · ct...@fl... · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick