From: Buchan M. <bg...@st...> - 2009-05-21 08:23:11
On Wednesday 20 May 2009 15:12:09 Glenn Attwood wrote:
> We are having an issue with devmon 3.1b1 and hobbit/xymon 4.2.0-dfsg10
> on an Ubuntu 8.04 machine. After a while (as long as 6 days, as short
> as 36 hours) it looks like it stalls waiting on for hobbit/xymon.
>
> output from "devmon -f -vv":
> [09-05-13@16:43:32] Getting device status from hobbit at localhost:1984
> Can't use an undefined value as a HASH reference at
> /opt/devmon-0.3.1-beta1/modules/dm_snmp.pm line 405, <$__ANONIO__> line
> 52616.
> [09-05-13@17:25:44] Shutting down (we shut it down at this point)
>
> strace on the master:
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> read(12, "", 4096) = 0
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> read(12, "", 4096) = 0
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> read(12, "", 4096) = 0
>
> strace on one of the children:
> --- SIGALRM (Alarm clock) @ 0 (0) ---
> sigreturn() = ? (mask now [])
> rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
> rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
> rt_sigprocmask(SIG_BLOCK, [ALRM], [], 8) = 0
> rt_sigaction(SIGALRM, {SIG_DFL}, {0x80b4ad0, [], 0}, 8) = 0
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> kill(29920, SIG_0) = 0
> rt_sigprocmask(SIG_BLOCK, [ALRM], [], 8) = 0
> rt_sigaction(SIGALRM, {SIG_DFL}, {SIG_DFL}, 8) = 0
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> rt_sigprocmask(SIG_BLOCK, [ALRM], [], 8) = 0
> rt_sigaction(SIGALRM, {0x80b4ad0, [], 0}, {SIG_DFL}, 8) = 0
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> alarm(15) = 0
> read(16, 0x8983d20, 4096) = ? ERESTARTSYS (To be restarted)
>
> Any suggestions for fixes or further troubleshooting?

This is a known issue ... but one that is (for me) quite difficult to fix. Of the 3 production boxes and 2 dev boxes/workstations we have running devmon, the most often we see it is about once in 5 days (but usually there are more than 2 weeks between occurrences, and this is on the production boxes).
It looks like the socketpair for communication between the master process and the worker process gives up, and I think this leaves the worker process in a loop it will not return from. Once there are no more worker processes with working communication, devmon just waits for them to answer, and they never respond.

I think there is a relatively easy way to prevent this ... but as I can't try it and easily see if it has helped, I haven't got very far on resolving this myself. If the relatively easy way doesn't work, there is a more difficult way as well ...

I'll try and send you a patch to try in the next few days.

Regards,
Buchan
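
For illustration only (this is not devmon code, which is Perl, and it is not the patch mentioned above): a minimal Python sketch of how a master process could tell a closed socketpair apart from a worker that is merely slow. It assumes that the zero-byte read() in the master's strace comes from the master's end of the socketpair, which the trace alone does not confirm. The function name poll_worker and its return values are made up for this example.

```python
import select
import socket


def poll_worker(sock, timeout=1.0):
    """Wait up to `timeout` seconds for a reply on `sock`.

    Returns ("data", bytes), ("timeout", b"") or ("closed", b"").
    Hypothetical helper for illustration; not part of devmon.
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        # Nothing arrived in time; the worker may just be slow.
        return ("timeout", b"")
    chunk = sock.recv(4096)
    if chunk == b"":
        # A zero-byte read means the peer closed its end. Polling
        # again will never yield data, so the caller should stop
        # waiting on this worker (for example, replace it).
        return ("closed", b"")
    return ("data", chunk)


if __name__ == "__main__":
    master_end, worker_end = socket.socketpair()
    worker_end.sendall(b"result\n")
    print(poll_worker(master_end, 0.1))
    worker_end.close()
    print(poll_worker(master_end, 0.1))
    master_end.close()
```

The point of the three-way result is that a master which treats "closed" the same as "timeout" would keep polling a dead channel indefinitely, which is at least consistent with the repeated select/read pattern in the master's strace.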