in commit 12583e81eaa093dc1224df08c7de62541142c6c2, a max timeout was added of 254 seconds for the watchdog device. this is weird as the commit is marked as "readability changes", so i'm not sure why it's including functional changes here as well. the later commit 1eee507a1fb7eb6a13a11816ed999b0271f3c613 raised the limit to 600 seconds, but the underlying problems haven't been addressed.
specifically, the problems are:
- the watchdog-timeout is silently clamped to the max value
- no other values (such as interval) are adjusted (even silently)
- if a large timeout is used and the interval is also picked that is larger than the max, then the daemon never pets the hardware in time
example when using the 5.14 release:
watchdog-timeout = 1000
interval = 400
watchdog-timeout is silently set to 254 seconds, but the watchdog daemon itself sleeps for 400 seconds, so the system always reboots after 254 seconds.
since there's no reason i can see to have a max value at all, it should be dropped. if the user tries to pick a time that the device/driver cannot support, it usually returns an error which the daemon will log, and the user can adjust their configs accordingly. the software limit here is entirely artificial.
Dear Mike,
The max value was put in as a sanity check as I thought that was well above the sort of time interval that anyone would want to use, but I guess you have proven me wrong.
Just now the watchdog read back the time interval that was set by the hardware driver and this is logged to syslog, but no attempt is made to rationalise the configured polling interval with that value.
What do you feel is the best behaviour?
(1) Just do what is configured (so if the driver sets a shorter interval then you get the spurious reboot situation you already reported)
(2) Force the polling interval to be, say, 2 seconds less then the hardware timeout set (which may just least to a spurious reboot as well if the 'precision' of the timer is less than 1 second)
Regards,
Paul
i think the max value should be dropped entirely. i'm not sure it is useful in any form since it's purely an artificial limit imposed by the watchdog software. i've worked with hardware in the past that def had higher timeouts than these :).
doing a sanity check against the driver state sounds good ... i'd like to think that watchdog drivers are all good in this regard, but who knows :). maybe approach this incrementally ?
(1) log an error when the config value doesn't match what the driver returned
(2) if the driver value is less than the config value, abort immediately so the user can rectify it
(3) wait for user reports to come back and re-evaluate behavior then
i'm not sure trying to dynamically recover (e.g. going through all the config options like "internal" and making them smaller than the driver value) is feasible or worth the effort. custom watchdog hook scripts might be written assuming the config values were used exactly as-is ... when that assumption is violated, it might be a while before people notice or even figure out why. logging is great, but there are always people who either don't read or read too late (and waste time debugging), so i tend to lean towards aborting in these cases.
Dear Mike,
If you can pull the latest code from this GIT repository, compile, and try it out it would be helpful.
I have removed the silent limit and put in a test & warning if the configured value is not double the interval. While that may be conservative, the watchdog code can get in to situations where the main loop takes around twice the interval, though generally the watchdog device should be written to a bit more frequently.
I am not very happy about simply disabling the watchdog if that were to fail, I think it better to find out with a reboot when testing or driver updates than when the machine is unattended and some fault develops. Not everyone is diligent in checking their log files!
Regards,
Paul
git changes look pretty straight forward to me. i'll ask the original reporter to give it a try for sanity sake.
you can also delete {MAX,MIN}_WD_TIMEOUT from include/extern.h since they're not used.
Dear Mike,
Thanks, let me know how that testing goes.
Currently my own "experimental" version has range-checking on reading most values in the config file, and that will explicitly tell you when things are out of range based on such constants. I know you feel there should be no limits, but it some times helps to have something tested when reading files to catch typing mistakes, etc, even if it is set to a much larger value than feasible (e.g. several hours). So I will think about this a bit more before deciding what to do with those constants.
Regards,
Paul
the original reporter didn't get back to me. i think the git repo as-is is OK.
Should be closed now, last aspect with commit [fea9ef]
Related
Commit: [fea9ef]
Closing as fixed some time ago