watchdog-devel Mailing List for Watchdog
Brought to you by:
meskes,
paulcrawford
From: Paul C. <ps...@sa...> - 2021-10-14 14:08:09
Dear Josef,

> Coverity report shows a memory leak in watchdog-5.16/src/run-as-child.c:102
> with `realloc()` call. Issue is that realloc may return `NULL` when there
> is an error causing `opt` to be a null pointer and losing the pointer to
> the memory that was allocated by `strdup()` or reallocated by `realloc()`.
>
> Steps to Reproduce:
> 1. watchdog needs to be run with the verbose flag
> 2. watchdog needs to receive test/repair arguments of sufficient length to
>    cause a ENOMEM or another error that may be triggered by realloc
>
> I've prepared a patch with a possible solution (in attachment).
>
> It would be great to have it part of upstream source code.

I have applied your patch to the current master branch and all seems OK on my own basic test. Hopefully you can do a git-pull and verify all is well in your own tests.

Regards,
Paul

> Best regards
>
> Josef Ridky
> Senior Software Engineer
> Core Services - System management team
> Red Hat Czech, s.r.o.
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: 0002-mem-leak-verbose.patch
> Type: text/x-patch
> Size: 758 bytes
> Desc: not available

--
Dr. Paul S. Crawford
c/o Satellite Station
University of Dundee
Small's Wynd, Dundee, DD1 4HN
Email: ps...@sa...
Tel: +44 (0)1382 38 4687
The University of Dundee is a Scottish Registered Charity, No. SC015096

From: Josef Ř. <jr...@re...> - 2021-10-14 09:03:09
Hi,

Coverity report shows a memory leak in watchdog-5.16/src/run-as-child.c:102 with `realloc()` call. Issue is that realloc may return `NULL` when there is an error causing `opt` to be a null pointer and losing the pointer to the memory that was allocated by `strdup()` or reallocated by `realloc()`.

Steps to Reproduce:
1. watchdog needs to be run with the verbose flag
2. watchdog needs to receive test/repair arguments of sufficient length to cause a ENOMEM or another error that may be triggered by realloc

I've prepared a patch with a possible solution (in attachment).

It would be great to have it part of upstream source code.

Best regards

Josef Ridky
Senior Software Engineer
Core Services - System management team
Red Hat Czech, s.r.o.

From: Paul C. <ps...@sa...> - 2021-01-20 23:28:51
>> My devices are headless, so a poweroff/halt when the temperature is too
>> high is not very helpful. I was wondering whether watchdog could kill
>> all processes, then wait for a configurable time to allow the machine to
>> cool down, and then reboot. E.g.
>>
>>     temperature-reboot-wait = 600
>
> Actually this doesn't sound silly at all. It would be a good feature to
> have I would say. However, I have no spare cycles at the moment to
> implement it. So if you are willing to put time into it, I'd appreciate
> a pull request or a patch.

On one hand it seems reasonable, but equally it also suggests the machine is not properly cooled in the first place if a CPU load can overheat it!

Previously it was (or at least I understood) the scenario that overheating would be the result of a hardware fault: for example the CPU fan failing, or the room air conditioning failing. In that case it really is a shutdown step to prevent permanent damage as it needs physical intervention to fix things.

Have you got more information about the sort of situation (e.g. hardware set up) where a simple high process load would overheat the machine?

Regards,
Paul

From: Michael M. <me...@de...> - 2021-01-20 13:45:07
Hi Nils,

> I found release tarballs for watchdog on SourceForge, but I can't see any
> signatures. How do I confirm their integrity?
>
> E.g.
>
> $ sha256sum watchdog-5.16.tar.gz
> b8e7c070e1b72aee2663bdc13b5cc39f76c9232669cfbb1ac0adc7275a3b019d

Nowadays SourceForge is but a backup release site. The Debian archives always get the latest version at the same time as the release. Or in other words, getting the sources from the Debian sites will give you a verified release tarball.

Hope this helps,
Michael

--
Michael Meskes
Michael at Fam-Meskes dot De
Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org

From: Michael M. <me...@de...> - 2021-01-20 13:30:42
> do need indeed 5.16. Unfortunately that version is much trickier to
> build as a debian package on my old system (no SystemD, no
> debhelper-compat).

Systemd is not necessary and debhelper should be available from backports. Or is your system too old to have a debhelper backport for the current version? It should not be too much hassle to change the debian files back to the older packaging, though.

Best,
Michael

From: Michael M. <me...@de...> - 2021-01-20 13:28:02
> I just started to use your Linux Watchdog on embedded devices (PC Engine
> ALIX/APU) with Debian. Great work, super helpful, much appreciated!

Glad to hear that.

> I have a feature request. Apologies if this is silly, or has been
> discussed before, please point me in the right direction.
>
> My devices are headless, so a poweroff/halt when the temperature is too
> high is not very helpful. I was wondering whether watchdog could kill
> all processes, then wait for a configurable time to allow the machine to
> cool down, and then reboot. E.g.
>
>     temperature-reboot-wait = 600

Actually this doesn't sound silly at all. It would be a good feature to have I would say. However, I have no spare cycles at the moment to implement it. So if you are willing to put time into it, I'd appreciate a pull request or a patch.

Thanks,
Michael

From: Nils T. <nil...@de...> - 2021-01-18 11:21:38
On 17/01/2021 22:09, Nils Toedtmann wrote:
> Since when does watchdog's "min-memory" account for buffers & cache as per
> https://www.crawford-space.co.uk/old_psc/watchdog/watchdog-configure.html#Memory_Test ?
>
> Was it commit a0cf26 from 2019/07/09 "Compute usable memory from
> free+buffers+cache", so 5.16? And older versions just use MemFree from
> /proc/meminfo
>
> Just checking what I need to upgrade to :-D

After some dumbing down of debian/{control,rules}, I managed to upgrade to 5.15, but (as expected) that has the same issue. So it looks like I do need indeed 5.16. Unfortunately that version is much trickier to build as a debian package on my old system (no SystemD, no debhelper-compat).

Instead, I might resort to a 'test-binary' script to do that check.

Best,
Nils

--
Nils Toedtmann | Chief IoT Architect, co-founder & director
https://www.demandlogic.co.uk/
+44 (0) 7821 817722

From: Nils T. <nil...@de...> - 2021-01-18 09:54:07
Hi

I found release tarballs for watchdog on SourceForge, but I can't see any signatures. How do I confirm their integrity?

E.g.

$ sha256sum watchdog-5.16.tar.gz
b8e7c070e1b72aee2663bdc13b5cc39f76c9232669cfbb1ac0adc7275a3b019d

TA, best
/nils

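The manual `sha256sum` check in the message above can also be scripted. The following is a minimal Python sketch, assuming a reference digest has been obtained from a source you already trust; the hash quoted in the message is only what `sha256sum` printed locally, not a published value. The function names `sha256_of_file` and `digest_matches` are illustrative and not part of watchdog. A matching digest only shows the file equals the reference copy; it is not a substitute for a cryptographic signature.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_matches(path, expected_hex):
    """Compare the file's digest with a trusted reference digest (hex string)."""
    actual = sha256_of_file(path)
    # Normalise whitespace and case of the reference before comparing.
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

A call such as `digest_matches("watchdog-5.16.tar.gz", reference)` would then return `True` only when the local tarball hashes to the trusted `reference` value.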
From: Nils T. <nil...@de...> - 2021-01-17 22:39:22
Hi

Since when does watchdog's "min-memory" account for buffers & cache as per https://www.crawford-space.co.uk/old_psc/watchdog/watchdog-configure.html#Memory_Test ?

Was it commit a0cf26 from 2019/07/09 "Compute usable memory from free+buffers+cache", so 5.16? And older versions just use MemFree from /proc/meminfo

Just checking what I need to upgrade to :-D

Kind regards,
Nils

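To make the difference between the two readings concrete, here is a small Python sketch that contrasts "MemFree only" with "free + buffers + cache" on `/proc/meminfo`-style text. It is based only on the commit title quoted above; the exact fields and arithmetic used by the daemon are an assumption, and the function names are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; the value is stored as given.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def free_only_kb(fields):
    """Older-style reading (as described in the message): MemFree alone."""
    return fields.get("MemFree", 0)


def usable_kb(fields):
    """Assumed newer reading: MemFree + Buffers + Cached, in kB."""
    return (fields.get("MemFree", 0)
            + fields.get("Buffers", 0)
            + fields.get("Cached", 0))
```

On a live system the input would come from `open("/proc/meminfo").read()`. A machine with a large page cache can show a small `MemFree` while `usable_kb` is comfortably large, which is the situation the question is about.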
From: Nils T. <nil...@de...> - 2021-01-17 20:47:19
Hi

I just started to use your Linux Watchdog on embedded devices (PC Engine ALIX/APU) with Debian. Great work, super helpful, much appreciated!

I have a feature request. Apologies if this is silly, or has been discussed before, please point me in the right direction.

My devices are headless, so a poweroff/halt when the temperature is too high is not very helpful. I was wondering whether watchdog could kill all processes, then wait for a configurable time to allow the machine to cool down, and then reboot. E.g.

    temperature-reboot-wait = 600

(Note that for reasons that are too embarrassing to lay out, I currently use watchdog 5.12. But it looks like 5.15 is no different in this regard)

Kind regards,
Nils

From: Michael M. <me...@de...> - 2020-04-14 11:59:47
Hi,

> I'm currently using a raspberry pi with Void Linux. I chose to go with
> the musl version, which means the last release of watchdog can't be
> built for that system. I'd like to update the version of watchdog
> currently available in their repos to something that can be built for
> both musl and glibc, and it seems to be possible in the current version.
> Unfortunately, they tend to only accept released software, so packaging
> a version that works with musl and glibc would require a new release.
>
> Do you think that there is a new release in the project's future? Are
> there any bug fixes or patch revisions still necessary? I would like to
> help, if possible.

There's only one thing missing for a release and that's my time. Sorry about the delay, there are a couple small things to be done before we can release but nothing major. I will try to prioritize this.

Michael

From: <er...@di...> - 2020-04-14 04:36:23
Hey there!

I'm currently using a raspberry pi with Void Linux. I chose to go with the musl version, which means the last release of watchdog can't be built for that system. I'd like to update the version of watchdog currently available in their repos to something that can be built for both musl and glibc, and it seems to be possible in the current version. Unfortunately, they tend to only accept released software, so packaging a version that works with musl and glibc would require a new release.

Do you think that there is a new release in the project's future? Are there any bug fixes or patch revisions still necessary? I would like to help, if possible.

Thank you very much,
Érico Nogueira

From: Paul C. <ps...@sa...> - 2020-01-30 12:27:15
Dear Marco,

> I think I know where our misunderstanding is.
>
> Let us assume two scenarios, in both watchdog is triggered. Both have a
> sigterm_delay of 5m and all processes are ended cleanly, say 30 seconds
> after the SIGTERM signal, the only difference is init and systemd.
>
> The system would be restarted directly with init, but would wait 5
> minutes with systemd since systemd ignores the SIGTERM.

No, sending SIGTERM to the init process only tells it to stop respawning processes that die. It does not initiate a reboot.

There is a specific signal for systemd that will reboot, but this is not used by the watchdog daemon (as it would result in systemd shutting down the watchdog and so not allowing the use of the hardware reset approach).

> So maybe there should be a workaround to identify systemd, or generally
> an option to ignore cgroup processes since they completely ignore
> SIGTERM and SIGKILL, which, to my knowledge, should be possible with
> the /sys/fs/cgroup filesystem.
>
> So maybe it would be good to have an option to ignore cgroup processes and
> just reset the system if only cgroup processes are left, because this is
> what happens in the end anyway. To prevent this unnecessary timeout.

Complexity is often the enemy of reliability!

The watchdog is designed to assume that things will not shut down normally (otherwise it is probably not needed in the first place) and makes an attempt to stop processes the nice way. But ultimately it is a lot of complexity to work out what is running and so on, compared to simply forcing a hard reset.

The default of 5s was thought to be sufficient for most cases, as a screwed-up system may have been sick for a lot longer before the watchdog is triggered. Few people will need/want a longer setting, but it is possible.

If you need a fast & brutal reset for some special case you can have a monitored process return error code 254 and it will act as if the reset button was pressed (subject to minimum driver reset time).

Regards,
Paul

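As a rough illustration of the exit-code approach mentioned above, a monitoring script could choose its exit status like this. This is only a sketch: the value 254 comes from the message, while the constant and function names are invented here, and how such a script is hooked into the daemon's configuration is not covered in this thread.

```python
HARD_RESET_CODE = 254  # value quoted in the message above
OK_CODE = 0            # conventional "all is well" exit status


def exit_status(needs_hard_reset: bool) -> int:
    """Pick the process exit status a monitoring script would return."""
    return HARD_RESET_CODE if needs_hard_reset else OK_CODE
```

A script built on this would finish with something like `sys.exit(exit_status(problem_detected))`, where `problem_detected` is whatever site-specific check decides that an immediate reset is warranted.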
From: <goo...@is...> - 2020-01-25 15:23:44
Dear Paul,

I think I know where our misunderstanding is.

Let us assume two scenarios, in both watchdog is triggered. Both have a sigterm_delay of 5m and all processes are ended cleanly, say 30 seconds after the SIGTERM signal, the only difference is init and systemd.

The system would be restarted directly with init, but would wait 5 minutes with systemd since systemd ignores the SIGTERM.

So maybe there should be a workaround to identify systemd, or generally an option to ignore cgroup processes since they completely ignore SIGTERM and SIGKILL, which, to my knowledge, should be possible with the /sys/fs/cgroup filesystem.

So maybe it would be good to have an option to ignore cgroup processes and just reset the system if only cgroup processes are left, because this is what happens in the end anyway. To prevent this unnecessary timeout.

Thank you for your effort and time!

On 24.01.20 at 14:36, Paul Crawford wrote:
> Dear Marco,
>
>> Sorry for the misunderstanding, my bad English doesn't help.
>
> No problem, I am not any good with other languages anyway!
>
>> I try it again.
>>
>> I try to explain it like I understand it thanks to your help.
>>
>> With sigterm_delay set to 500, my watchdog gets triggered and tries to
>> reboot the system so it sends SIGTERMs to all processes. Let's assume
>> they all have done a clean shutdown after 120s except for systemd
>> because it's in a cgroup, so in my understanding it should not be
>> needed to wait the rest of the sigterm_delay. And as far as I
>> understood you it should then try to set the reset time as short as
>> possible.
>
> The watchdog does not monitor other processes for exiting, it simply sends
> the SIGTERM / SIGKILL signals in a pre-set sequence.
>
> There is a configurable delay between the 2nd SIGTERM and the 1st
> SIGKILL to give the administrator some control over the timing in case
> you have a process you know to be slow in properly shutting down (for
> example a VM powering down internally).
>
> But if you configure a long delay it will simply wait for a long time
> before it goes to the "end game" of SIGKILL and then file system unmount
> before trying to use the timer to force a hard reset.
>
> Regards,
> Paul

From: Paul C. <ps...@sa...> - 2020-01-24 14:04:17
Dear Marco,

> Sorry for the misunderstanding, my bad English doesn't help.

No problem, I am not any good with other languages anyway!

> I try it again.
>
> I try to explain it like I understand it thanks to your help.
>
> With sigterm_delay set to 500, my watchdog gets triggered and tries to
> reboot the system so it sends SIGTERMs to all processes. Let's assume
> they all have done a clean shutdown after 120s except for systemd
> because it's in a cgroup, so in my understanding it should not be needed
> to wait the rest of the sigterm_delay. And as far as I understood you it
> should then try to set the reset time as short as possible.

The watchdog does not monitor other processes for exiting, it simply sends the SIGTERM / SIGKILL signals in a pre-set sequence.

There is a configurable delay between the 2nd SIGTERM and the 1st SIGKILL to give the administrator some control over the timing in case you have a process you know to be slow in properly shutting down (for example a VM powering down internally).

But if you configure a long delay it will simply wait for a long time before it goes to the "end game" of SIGKILL and then file system unmount before trying to use the timer to force a hard reset.

Regards,
Paul

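The fixed sequence described in this message can be laid out as a simple timeline. The Python sketch below only models the ordering and the configured delay as stated above; it sends no signals. The gap between the first and second SIGTERM (`term_gap`) is a hypothetical parameter, and because the thread does not give timings for the final steps they are shown at the same offset as the SIGKILL.

```python
from typing import List, Tuple


def shutdown_timeline(sigterm_delay: float,
                      term_gap: float = 1.0) -> List[Tuple[float, str]]:
    """Return (seconds since trigger, action) pairs for the pre-set sequence.

    sigterm_delay: configured wait between the 2nd SIGTERM and the 1st SIGKILL.
    term_gap: assumed spacing between the two SIGTERMs (hypothetical value).
    """
    second_term = term_gap
    first_kill = second_term + sigterm_delay
    return [
        (0.0, "send SIGTERM (1st)"),
        (second_term, "send SIGTERM (2nd)"),
        # Nothing here depends on whether processes have already exited,
        # so the full delay always elapses before the next step.
        (first_kill, "send SIGKILL"),
        (first_kill, "unmount file systems"),
        (first_kill, "use hardware timer to force a reset"),
    ]
```

With `sigterm_delay=500` the SIGKILL step sits 500 seconds after the second SIGTERM even if every ordinary process had exited after 120 seconds, which matches the behaviour reported in the thread.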
From: <goo...@is...> - 2020-01-17 17:27:20
Dear Paul

Sorry for the misunderstanding, my bad English doesn't help.

I try it again.

I try to explain it like I understand it thanks to your help.

With sigterm_delay set to 500, my watchdog gets triggered and tries to reboot the system so it sends SIGTERMs to all processes. Let's assume they all have done a clean shutdown after 120s except for systemd because it's in a cgroup, so in my understanding it should not be needed to wait the rest of the sigterm_delay. And as far as I understood you it should then try to set the reset time as short as possible.

I hope I got it clearer, thanks for your patience!

Best regards
Marco

> Sent: Friday, 17 January 2020 at 13:36
> From: "Paul Crawford" <ps...@sa...>
> To: wat...@li...
> Subject: Re: [Watchdog-devel] sigterm-delay is always waiting
>
> Dear Marco,
>
> > I am sorry for my HTML response, first time I use a mailing list and my web client got me, shame on me.
> >
> > But back to the topic.
> >
> > I see your points about systemd and cgroups, this makes sense, thanks for clarification.
> > To sum it up, watchdog sends SIGTERM/SIGKILL to all the processes it can and then tries to set the reset timer as short as possible.
> > If it should be that way, how does it determine when to do this and how can I debug why my watchdog is waiting the whole sigterm_delay? Or what am I missing?
>
> The sigterm delay is a timer between sending out SIGTERM to "politely"
> ask processes to stop and then the sending out of SIGKILL just before the
> hardware timer is used to trigger a reboot. So what you configure for
> this will normally be used (except for occasions when the watchdog tries
> a hard reset or has locked up and the hardware times-out to the same
> effect).
>
> My apologies if I do not understand your problem correctly, but if you
> have a delay configured, for example, 5 seconds, why do you expect it to do
> something before this time is over?
>
> Regards,
> Paul
>
> > Thanks for your patience
> >
> > Best regards
> > Marco

From: Paul C. <ps...@sa...> - 2020-01-17 13:06:31
Dear Marco,

> I am sorry for my HTML response, first time I use a mailing list and my web client got me, shame on me.
>
> But back to the topic.
>
> I see your points about systemd and cgroups, this makes sense, thanks for clarification.
> To sum it up, watchdog sends SIGTERM/SIGKILL to all the processes it can and then tries to set the reset timer as short as possible.
> If it should be that way, how does it determine when to do this and how can I debug why my watchdog is waiting the whole sigterm_delay? Or what am I missing?

The sigterm delay is a timer between sending out SIGTERM to "politely" ask processes to stop and then the sending out of SIGKILL just before the hardware timer is used to trigger a reboot. So what you configure for this will normally be used (except for occasions when the watchdog tries a hard reset or has locked up and the hardware times-out to the same effect).

My apologies if I do not understand your problem correctly, but if you have a delay configured, for example, 5 seconds, why do you expect it to do something before this time is over?

Regards,
Paul

> Thanks for your patience
>
> Best regards
> Marco

From: <goo...@is...> - 2020-01-14 16:10:11
Dear Paul

I am sorry for my HTML response, first time I use a mailing list and my web client got me, shame on me.

But back to the topic.

I see your points about systemd and cgroups, this makes sense, thanks for clarification.
To sum it up, watchdog sends SIGTERM/SIGKILL to all the processes it can and then tries to set the reset timer as short as possible.
If it should be that way, how does it determine when to do this and how can I debug why my watchdog is waiting the whole sigterm_delay? Or what am I missing?

Thanks for your patience

Best regards
Marco

Sent: Wednesday, 04 December 2019 at 09:48
From: goo...@is...
To: "Paul Crawford" <ps...@sa...>
Cc: wat...@li...
Subject: Re: [Watchdog-devel] sigterm-delay is always waiting

Dear Paul

Thanks for the clarification.

To sum it up, watchdog sends SIGTERM/SIGKILL to all the processes it can and then tries to set the reset timer as short as possible.
If it should be that way how can I debug why my watchdog is waiting the whole sigterm_delay? Or what am I missing?

Thanks for your work and time helping me.

Best regards
Marco

Sent: Friday, 29 November 2019 at 17:51
From: "Paul Crawford" <ps...@sa...>
To: goo...@is...
Cc: wat...@li...
Subject: Re: Aw: Re: sigterm-delay is always waiting

Dear Marco,

> Thanks for your great response, I understand now the workflow much better.

No problem, hopefully it helps.

>> So I'm guessing your machine is not using rsyslog for logging, is that
>> deliberate? The killall5 log shows the systemd process
> It's just a testing setup and rsyslog was an easy method to trigger the watchdog.

OK, that would work.

> If I got you right, you say it's systemd itself that is shielding itself to protect it from SIGTERM and SIGKILL, is there a workaround for this?

It is a bit more complicated than that. There are two things a bit different for the watchdog daemon under systemd compared to the older init approaches.

First is that many daemons run by systemd are run using the cgroups feature and this can stop signals killing them. However, it is usually a good feature as it allows the system to contain processes better (e.g. memory use, etc) and to identify and shut down detached child processes in a way that is hard to do without cgroups. But it means an orderly shutdown with the watchdog is less likely (at least, being able to unmount file systems).

Second is that systemd has a mechanism to kill it and reboot the machine using additional signals, for example:

> SIGRTMIN+6
> Reboots the machine via kexec, starts the kexec.target unit. This is mostly equivalent to systemctl start kexec.target --job-mode=replace-irreversibly.

Now it looks as if the watchdog should simply trigger a system reboot this way, but that has two more important issues:

- It assumes the machine is behaving "normally" and that may not be the case at all!

- During a normal reboot the hardware driver is stopped and unloaded (often even with 'nowayout' enabled) by the kernel, so you may not have that backup to ensure a reboot if something like a HDD driver, etc, is locked and halting the machine.

The way the watchdog tries to get round both the systemd reluctance to respond as past init processes did, and the issue of a machine hanging after it starts shutdown, is to use the hardware timer to issue a reset signal. So what it does is try to shut down as far as possible, and then set the hardware timeout to as short a time as practical to force a reset of the system.

Regards,
Paul

From: <goo...@is...> - 2019-12-04 08:48:28
Dear Paul

Thanks for the clarification.

To sum it up, watchdog SIGTERM/SIGKILL all process he can and then try to set the reset timer as short as possible.
If it should be that way how can i debug why my watchdog is waiting the whole sigterm_delay? Or what i am missing?

Thanks for your work and time helping me.

Best regards
Marco

From: Paul C. <ps...@sa...> - 2019-11-29 17:23:22
|
Dear Marco,

> Thanks for your great response, I understand the workflow much better now.

No problem, hopefully it helps.

>> So I'm guessing your machine is not using rsyslog for logging, is that
>> deliberate? The killall5 log shows the systemd process
> It's just a testing setup and rsyslog was an easy method to trigger the watchdog.

OK, that would work.

> If I understood you correctly, you are saying that systemd is shielding itself from SIGTERM and SIGKILL. Is there a workaround for this?

It is a bit more complicated than that. There are two things that are a bit different for the watchdog daemon under systemd compared to the older init approaches.

First, many daemons run by systemd are run using the cgroups feature, and this can stop signals killing them. It is usually a good feature, as it allows the system to contain processes better (e.g. memory use) and to identify and shut down detached child processes in a way that is hard to do without cgroups. But it means an orderly shutdown with the watchdog is less likely (at least as far as being able to unmount file systems).

Second, systemd has a mechanism to kill it and reboot the machine using additional signals, for example:

> SIGRTMIN+6
> Reboots the machine via kexec, starts the kexec.target unit. This is mostly equivalent to systemctl start kexec.target --job-mode=replace-irreversibly.

It might look as if the watchdog should simply trigger a system reboot this way, but that has two important issues:

- It assumes the machine is behaving "normally", and that may not be the case at all!

- During a normal reboot the hardware driver is stopped and unloaded by the kernel (often even with 'nowayout' enabled), so you may not have that backup to ensure a reboot if something like an HDD driver is locked up and halting the machine.

The way the watchdog tries to get round both systemd's reluctance to respond as past init processes did, and the issue of a machine hanging after it starts shutting down, is to use the hardware timer to issue a reset signal. So it tries to shut down as far as possible, and then sets the hardware timeout to as short a time as practical to force a reset of the system.

Regards,
Paul
--
Dr. Paul S. Crawford
c/o Satellite Station
University of Dundee
Small's Wynd, Dundee, DD1 4HN
Email: ps...@sa...
Tel: +44 (0)1382 38 4687
The University of Dundee is a Scottish Registered Charity, No. SC015096 |
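The last step described above, shortening the hardware timeout so the timer forces a reset, can be sketched in Python. This is only an illustration of the Linux watchdog ioctl interface, not the daemon's own code (which is written in C). The helper names `iowr` and `set_watchdog_timeout` are made up for this sketch, and the ioctl encoding shown is the generic Linux one used on x86 and ARM; some other architectures encode the direction bits differently.

```python
import fcntl
import struct

# Direction bits from the generic Linux ioctl encoding (x86, ARM).
# Assumption: other architectures such as MIPS or PowerPC differ.
_IOC_WRITE = 1
_IOC_READ = 2


def iowr(ioc_type: str, nr: int, size: int) -> int:
    """Build a read/write ioctl request number, like the C macro _IOWR."""
    direction = _IOC_READ | _IOC_WRITE
    return (direction << 30) | (size << 16) | (ord(ioc_type) << 8) | nr


# <linux/watchdog.h> defines WDIOC_SETTIMEOUT as _IOWR('W', 6, int).
WDIOC_SETTIMEOUT = iowr("W", 6, struct.calcsize("i"))


def set_watchdog_timeout(seconds: int, device: str = "/dev/watchdog") -> int:
    """Request a new timeout and return the value the driver actually set.

    WARNING: opening the device arms the timer, and closing it without
    writing the magic 'V' character leaves it running, so calling this
    is expected to reset the machine once the timeout expires.
    """
    with open(device, "wb", buffering=0) as wd:
        buf = bytearray(struct.pack("i", seconds))
        fcntl.ioctl(wd, WDIOC_SETTIMEOUT, buf, True)  # driver writes back the real value
        return struct.unpack("i", buf)[0]
```

Only the request number is computed when the module is loaded; `set_watchdog_timeout` is never called automatically. The device can normally be opened by one process at a time, so the call would fail while the watchdog daemon itself still holds `/dev/watchdog`.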
From: <goo...@is...> - 2019-11-28 10:11:16
|
Hi Paul

Thanks for your great response, I understand the workflow much better now.

> So I'm guessing your machine is not using rsyslog for logging, is that
> deliberate? The killall5 log shows the systemd process

It's just a testing setup and rsyslog was an easy method to trigger the watchdog.

If I understood you correctly, you are saying that systemd is shielding itself from SIGTERM and SIGKILL. Is there a workaround for this?

best regards,
Marco |
From: Paul C. <ps...@sa...> - 2019-11-25 13:28:10
|
> Hi all
>
> It seems watchdog always waits for the full sigterm-delay, even when that does not seem necessary to me.

There is always a short delay to make sure processes sent the SIGTERM signal are able to exit in an orderly manner. The watchdog has no idea whether this is strictly necessary, as it knows only the list of running processes it should try to shut down for an orderly reboot.

> The setup:
> Debian 9.11, watchdog 5.15-2 with the softdog kernel module
> Linux testing 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux
>
> I only added a sigterm-delay, a pidfile and, for testing, verbose to the watchdog config.
> But the logging stopped when the restart was triggered, even when I started watchdog in the foreground with nohup and in verbose mode.

Logging normally stops at some point because the syslog daemon has been killed! There is an attempt to separate "system" PIDs from user ones, to delay the end of logging and so provide more detail on what was happening, but sooner or later the logging service has to be killed.

> So I compiled watchdog from git, because the Debian version is quite old and, if I am correct, has no killall5 logging.
>
> I then tried the same with the new watchdog (compiled from watchdog-code as of 15-11-19) and ran it with the same config.
> Now it generated the killall5 logs, but to be honest I'm not quite sure how to interpret them.

The killall5 log file reports the contents of /proc/$PID/stat for each running process $PID, the format of which is covered in, for example, this man page:

http://man7.org/linux/man-pages/man5/proc.5.html

The idea is that it gives you some insight into what was running at the time the watchdog was triggered, so you have some idea of what might have caused it (for example, high load averages or something else that is not easily attributed to any one fault).
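As a rough aid for reading such a log, the sketch below splits one /proc/$PID/stat line into a few of the fields documented in proc(5). It assumes each log entry is a raw, unmodified stat line, which has not been checked against the actual killall5.log layout; the function name `parse_stat_line` and the sample line are invented for illustration.

```python
def parse_stat_line(line: str) -> dict:
    """Split one /proc/<pid>/stat line into a few leading fields."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or ')' characters, so use the outermost pair.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    rest = line[close_paren + 1:].split()  # rest[0] is field 3 of proc(5)
    return {
        "pid": int(line[:open_paren]),
        "comm": line[open_paren + 1:close_paren],
        "state": rest[0],               # field 3: R, S, D, Z, ...
        "ppid": int(rest[1]),           # field 4: parent PID
        "utime_ticks": int(rest[11]),   # field 14: user-mode CPU time
        "stime_ticks": int(rest[12]),   # field 15: kernel-mode CPU time
    }


# Made-up sample line, truncated after the fields used above.
SAMPLE_STAT = "412 (systemd-journal) S 1 412 412 0 -1 4194560 1200 0 3 0 57 21 0 0 20 0 1 0 1234"

if __name__ == "__main__":
    print(parse_stat_line(SAMPLE_STAT))
```

The state and CPU-time fields are usually the quickest to scan: a process stuck in state D (uninterruptible sleep) or with unusually large tick counts is a natural first suspect.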
> starting command
> nohup watchdog -c /etc/watchdog.conf -F -v -v >> watchdog.log 2>&1 &
>
> watchdog stdout and stderr
> https://pastebin.com/HHyDq8eE
>
> killall5 log
> https://pastebin.com/KCXHJ5b2
>
> Can someone give me a hint on how I could debug this issue further?

What is it you want to debug? The reason for the reboot is given in the stdout file:

> cannot open /var/run/rsyslogd.pid (errno = 2 = 'No such file or directory')

On my Ubuntu 18.04 machine that file exists OK:

> $ ls -la /var/run/rsyslogd.pid
> -rw-r--r-- 1 root root 4 Nov 24 22:34 /var/run/rsyslogd.pid

So I'm guessing your machine is not using rsyslog for logging, is that deliberate? The killall5 log shows the systemd process /lib/systemd/systemd-journald is running (same on my 18.04 box).

The end of the stdout log shows, in brief summary, what it tried to kill:

> watchdog: Opened dump file /var/log/watchdog/killall5.log
> watchdog: Closed dump file
> watchdog: sent signal 15 to 23 of 84 processes
> watchdog: Opened dump file /var/log/watchdog/killall5.log
> watchdog: Closed dump file
> watchdog: sent signal 15 to 18 of 80 processes
> watchdog: Opened dump file /var/log/watchdog/killall5.log
> watchdog: Closed dump file
> watchdog: sent signal 9 to 1 of 61 processes
> watchdog: Opened dump file /var/log/watchdog/killall5.log
> watchdog: Closed dump file
> watchdog: sent signal 9 to 1 of 61 processes

So there were 84 processes at the start of shutdown, of which 23 were deemed "user" and so were sent SIGTERM. A second later 18 "user" ones were still running (now out of 80), so they were sent SIGTERM again. Then, after the (configurable) delay, only one out of 61 remained to be sent SIGKILL, and it seems it ignored that (typically a systemd issue: being in a cgroup shields it from the watchdog's signal).

Regards,
Paul
--
Dr. Paul S. Crawford
c/o Satellite Station
University of Dundee
Small's Wynd, Dundee, DD1 4HN
Email: ps...@sa...
Tel: +44 (0)1382 38 4687
The University of Dundee is a Scottish Registered Charity, No. SC015096 |
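The reading of the "sent signal" lines given above can be reproduced mechanically. The following is a small sketch that extracts each kill round from the verbose output; the regular expression is derived only from the log lines quoted in this thread, and the name `summarize_kill_rounds` is made up for this example.

```python
import re
import signal

# Pattern taken from the quoted lines, e.g.
# "watchdog: sent signal 15 to 23 of 84 processes"
SENT_RE = re.compile(r"sent signal (\d+) to (\d+) of (\d+) processes")


def summarize_kill_rounds(log_text: str) -> list:
    """Return (signal name, processes signalled, processes running) per round."""
    rounds = []
    for match in SENT_RE.finditer(log_text):
        signum, targeted, total = (int(group) for group in match.groups())
        rounds.append((signal.Signals(signum).name, targeted, total))
    return rounds


# Excerpt copied from the stdout log quoted in the message above.
LOG_EXCERPT = """\
watchdog: sent signal 15 to 23 of 84 processes
watchdog: sent signal 15 to 18 of 80 processes
watchdog: sent signal 9 to 1 of 61 processes
watchdog: sent signal 9 to 1 of 61 processes
"""

if __name__ == "__main__":
    for name, targeted, total in summarize_kill_rounds(LOG_EXCERPT):
        print(f"{name}: {targeted} of {total} processes")
```

A final round where the same count is signalled with SIGKILL twice, as here, points at a process that did not exit on SIGKILL, rather than at the delay setting itself.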
From: <goo...@is...> - 2019-11-19 08:14:54
|
Hi all

It seems watchdog always waits for the full sigterm-delay, even when that does not seem necessary to me.

The setup:
Debian 9.11, watchdog 5.15-2 with the softdog kernel module
Linux testing 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux

I only added a sigterm-delay, a pidfile and, for testing, verbose to the watchdog config.
But the logging stopped when the restart was triggered, even when I started watchdog in the foreground with nohup and in verbose mode.

So I compiled watchdog from git, because the Debian version is quite old and, if I am correct, has no killall5 logging.

I then tried the same with the new watchdog (compiled from watchdog-code as of 15-11-19) and ran it with the same config.
Now it generated the killall5 logs, but to be honest I'm not quite sure how to interpret them.

starting command
nohup watchdog -c /etc/watchdog.conf -F -v -v >> watchdog.log 2>&1 &

watchdog stdout and stderr
https://pastebin.com/HHyDq8eE

killall5 log
https://pastebin.com/KCXHJ5b2

Can someone give me a hint on how I could debug this issue further?

Thanks in advance and with best regards
Marco |
From: Maxim D. <md...@gm...> - 2019-07-10 20:19:04
|
It seems everything works, thanks! The reboot was triggered instantly.

By the way, I noticed that watchdog has not had a stable release for a long time. Could you make one once all the problems with the memory test are resolved?

Wed, 10 Jul 2019 at 11:26, Paul Crawford <ps...@sa...>:

> Dear Maxim,
>
> > Just tested a git head. The calculation of free memory is great, but I
> > can't get a reboot. The same configuration on a stable version
> > worked. Am I doing something wrong? Maybe something has changed in the
> > processing of test results?
>
> There was a non-obvious bug where the error-repair timer was reused
> between the two types of memory test so one was 'repairing' the other,
> hence no reboot.
>
> That is fixed, also I made no-memory a non-repairable type of error
> (like watchdog of old when all errors were acted upon immediately) as
> realistically it is not something that is likely to go away.
>
> I guess some might feel that is too aggressive as they might want to
> ride out a momentary shortage of memory, but equally that is the sort of
> thing where the OOM killer is likely to jump in anyway and possibly
> screw things up in a non-obvious manner.
>
> Let me know how you get on with this update.
>
> Regards,
> Paul
>
> > In logs I see this:
> >
> > Integer 'verbose' found = 1
> > starting daemon (5.15):
> > Jul 10 04:53:31 pikvm watchdog[20576]: watchdog: Integer 'verbose' found = 1
> > int=1s realtime=yes sync=no load=24,18,12 soft=no
> > memory: minimum pages = 150000 free, 0 allocatable, max swap 0 (4096 byte pages)
> > ping: no machine to check
> > file: no file to check
> > pidfile: no server process to check
> > interface: no interface to check
> > temperature: no sensors to check
> > no test binary files
> > no repair binary files
> > error retry time-out = 60 seconds
> > repair attempts = 1
> > alive=/dev/watchdog heartbeat=[none] to=root no_act=no force=no
> > watchdog now set to 15 seconds
> > hardware watchdog identity: Broadcom BCM2835 Watchdog timer
> >
> > ... and repeated this:
> >
> > memory available 354036 kB is less than 150000 pages
> > still alive after 175 interval(s)
> > current load is 0 0 0
> > currently there are 354036 kB usable memory and 0 of 0 swap used
> >
> > My config:
> >
> > min-memory = 150000
> > max-load-1 = 24
> > max-load-5 = 18
> > max-load-15 = 12
> > watchdog-device = /dev/watchdog
> > watchdog-timeout = 15
> > interval = 1
> > realtime = yes
> > priority = 1
> >
> > Setup: ./configure --prefix=/usr --sbindir=/usr/bin --mandir=/usr/share/man --sysconfdir=/etc --localstatedir=/var --with-pidfile=/run/watchdog.pid --with-ka_pidfile=/run/wd_keepalive.pid --disable-nfs
> >
> > Tue, 9 Jul 2019 at 17:11, Paul Crawford <ps...@sa...>:
> >
> >     Dear Maxim,
> >
> >     > 1) Memory free computed from: MemFree + Buffers + Cached
> >     >
> >     > 3) Add an option for maximum swap usage so the system can reboot if swap
> >     > has grown too big due to a memory leak, etc. Same idea but something like:
> >
> >     Both (1) & (3) are now done, but the changes (2) needed for simpler
> >     specification of memory thresholds (instead of page count) will have to
> >     wait for another day.
> >
> >     If you can build from current GIT please test this and see if it is
> >     doing more or less what you want. Also if you have any ideas for a
> >     better memory checking approach it would be worth considering.
> >
> >     Regards,
> >     Paul
>
> --
> Dr. Paul S. Crawford
> c/o Satellite Station
> University of Dundee
> Small's Wynd, Dundee, DD1 4HN
> Email: ps...@sa...
> Tel: +44 (0)1382 38 4687
> The University of Dundee is a Scottish Registered Charity, No. SC015096 |
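The arithmetic behind the quoted log line "memory available 354036 kB is less than 150000 pages" can be worked through in a short sketch. It uses the formula from point (1) in the thread (MemFree + Buffers + Cached) and the 4096-byte page size reported in the log; it is not the daemon's actual C implementation. The function names are invented, and the individual meminfo values in the sample are made up so that they sum to the 354036 kB seen in the log.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def free_kb(info: dict) -> int:
    """Free memory as described in the thread: MemFree + Buffers + Cached."""
    return info["MemFree"] + info["Buffers"] + info["Cached"]


def below_min_memory(info: dict, min_pages: int, page_size: int = 4096) -> bool:
    """True when free memory is under the min-memory threshold (given in pages)."""
    threshold_kb = min_pages * page_size // 1024
    return free_kb(info) < threshold_kb


# Hypothetical values chosen to add up to the 354036 kB from the log.
SAMPLE_MEMINFO = """\
MemTotal:         948304 kB
MemFree:          100000 kB
Buffers:           54036 kB
Cached:           200000 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE_MEMINFO)
    print(free_kb(info), below_min_memory(info, min_pages=150000))
```

With min-memory = 150000 and 4096-byte pages the threshold is 600000 kB, so 354036 kB of free memory counts as a failure, which matches the repeated log message in the test above.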