Re: [Mon-devel] mon "fork bomb"
From: <den...@or...> - 2011-11-18 10:50:48
Hello,

We had the same problem some years ago. We fixed it and I sent a patch, but it seems it was never integrated into the source.

The problem is that the 'exec' call used to run a monitor or an alert script sometimes *does* return. It should not. This is described in the 'exec' documentation, which even warns about putting anything other than an exit after an exec:

    exec LIST
    exec PROGRAM LIST
        The "exec" function executes a system command and never returns--
        use "system" instead of "exec" if you want it to return. It fails
        and returns false only if the command does not exist and it is
        executed directly instead of via your system's command shell (see
        below).

        Since it's a common mistake to use "exec" instead of "system",
        Perl warns you if there is a following statement which isn't
        "die", "warn", or "exit" (if "-w" is set - but you always do that).

The 'exec' is called in a forked process and should replace that forked process with the monitor or the alert. The 'exec' can fail (and thus return) when some resources are missing (we were short on memory when this happened). It can also happen if you delete a monitor or alert script, because mon looks for the scripts only at startup.

In the 'mon' code, there is a 'return' after the call to 'exec'. That's bad. Because the child must never continue past the 'exec', the only thing that may follow it is an 'exit'. Without the 'exit', the code simply returns to the main loop and we have a forked copy of mon running in parallel with the parent one.

Here is the patch we wrote to solve this issue.

diff -Naur mon-1.2.0/mon mon-1.2.0-bugexec/mon
--- mon-1.2.0/mon	2009-12-16 13:46:33.808765600 +0000
+++ mon-1.2.0-bugexec/mon	2009-12-16 13:53:04.613773000 +0000
@@ -3611,9 +3611,12 @@
 	if (!exec @args) {
-	    syslog ('err', "could not exec '@args': $!");
+	    syslog ('err', "run_monitor: could not exec '@args': $!");
 	    exit (1);
 	}
+	syslog ('err',
+	    "run_monitor: (unreachable!) could not exec '@args': $!");
+	exit(1);
     }
     $sref->{"_last_check"} = scalar (time);
@@ -5078,10 +5081,12 @@
 	}
 	if (!exec @execargs) {
-	    syslog ('err', "could not exec alert $alert: $!");
-	    return undef;
+	    syslog ('err', "call_alert: could not exec alert $alert: $!");
+	    exit(1);
 	}
-	exit;
+	syslog ('err',
+	    "call_alert: (unreachable!) could not exec alert $alert: $!");
+	exit(1);
     }

I hope this will help.

Denis Choulette
den...@or...
Equant France
CS&O/ITD/France Customer Application Management/Transversal Activities/Service Assurance & Financial Management
BP 91235
Rue de la Touche Lambert
Bâtiment 7
F35512 Cesson Sévigné Cedex
France
Phone: +33 (0) 2 23 28 41 09
Fax: +33 (0) 2 23 28 45 83
http://www.equant.com

-----Original Message-----
From: Anders Synstad [mailto:and...@ba...]
Sent: 17 November 2011 10:38
To: mon...@li...
Subject: [Mon-devel] mon "fork bomb"

Hello,

We've been running mon for a decade now, and it's been working great. However, in the last month we've started to run into a problem.

The best explanation I have is that mon "fork bombs", and the load goes through the roof. I've only been able to do a dump of ps before we had to reboot the system, and it lists:

365 of these:

root   607  0.0  3.0 225952 124644 ?  S  13:36  0:00 /local/bin/perl /local/etc/mon/mon -l -f -c /etc/mon/mon.cf -P /var/run/mon.pid
root  3044  2.6  3.3 225952 137156 ?  D  13:38  0:09 /local/bin/perl /local/etc/mon/mon -l -f -c /etc/mon/mon.cf -P /var/run/mon.pid

and 1684 of these:

root  3043  0.7  0.0      0      0 ?  Z  13:38  0:02 [mon] <defunct>

This week it has happened 3 times already. This is something I haven't seen in the past.
During normal use, it doesn't seem to be overloaded:

[root@mon03.osl mon]# free
             total       used       free     shared    buffers     cached
Mem:       4040056    1654504    2385552          0      68576    1119588
-/+ buffers/cache:     466340    3573716
Swap:      4192944          0    4192944
[root@mon03.osl mon]# uptime
 10:10am  up  1:11,  5 users,  load average: 3.11, 3.05, 2.77

I'm still trying to debug whenever this happens, but there are limits to how long we can debug each time, as monitoring is down during this period, and debugging with a triple-digit load is tedious. I haven't found any indicators in any log files either.

Does anyone have any ideas what could be causing this?

-- 
Anders Synstad
Basefarm AS

_______________________________________________
Mon-devel mailing list
Mon...@li...
https://lists.sourceforge.net/lists/listinfo/mon-devel
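
The fork/exec pitfall described in the reply at the top of this thread can be tried out in isolation. What follows is only a minimal sketch of the pattern the patch enforces, not code taken from mon: the helper name spawn_and_wait, the path /nonexistent/monitor-script, and the exit code 127 are all assumptions made for illustration. The child tries to exec a program that does not exist, so exec returns, and the child then exits at once instead of falling back into the caller's code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of mon): fork a child, try to exec @cmd,
# and make sure the child can never continue running the caller's code.
sub spawn_and_wait {
    my @cmd = @_;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child. Perl reports the failure itself under "use warnings";
        # silence that here because we report it ourselves just below.
        no warnings 'exec';

        # The block form runs the program directly, without a shell, so a
        # missing program makes exec itself fail and return false.
        exec { $cmd[0] } @cmd;

        # Reached only when exec failed. Report and leave immediately;
        # returning here would leave a second copy of the caller running.
        warn "could not exec '@cmd': $!\n";
        exit(127);    # 127 is an arbitrary choice for this sketch
    }

    # Parent: reap the child so it does not linger as a <defunct> process.
    waitpid($pid, 0);
    return $? >> 8;
}

# An assumed path that should not exist, and the running perl as a program
# that certainly does.
my $missing = spawn_and_wait('/nonexistent/monitor-script');
my $ok      = spawn_and_wait($^X, '-e', 'exit 0');

print "missing script: child exit status $missing\n";
print "working script: child exit status $ok\n";
```

With the exit in place, the failed child ends with status 127 and the parent carries on alone. Replacing the exit(127) with a plain return would let the child come back out of spawn_and_wait and run the rest of the script a second time, which is the same mechanism that left extra copies of mon in the main loop.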