From: Buchan M. <bg...@st...> - 2010-03-02 08:14:30
|
On Friday, 26 February 2010 01:20:41 Colin Coe wrote: > Hi all > > I've been seeing this for a while also. > > It happened again today so rather than just restart I'm going to do > some testing. > > In /var/log/devmon/devmon.log I see > --- > [10-02-26@05:10:59] Starting snmp queries > [10-02-26@05:10:59] Getting device status from hobbit at 127.0.0.1:1984 > [10-02-26@05:11:00] Performing test logic > [10-02-26@05:11:01] Done with test logic > [10-02-26@05:11:01] Sending messages to display server > [10-02-26@05:11:01] Done sending messages > [10-02-26@05:11:01] Sleeping for 58 seconds. > [10-02-26@05:11:59] Starting snmp queries > [10-02-26@05:11:59] Getting device status from hobbit at 127.0.0.1:1984 > [10-02-26@05:12:01] Performing test logic > [10-02-26@05:12:01] Done with test logic > [10-02-26@05:12:01] Sending messages to display server > [10-02-26@05:12:01] Done sending messages > [10-02-26@05:12:01] Sleeping for 58 seconds. > [10-02-26@05:13:00] Starting snmp queries > [10-02-26@05:13:00] Getting device status from hobbit at 127.0.0.1:1984 > --- > > 5:13AM is when devmon last reported in to xymon, currently 7:52AM. > > Using this scriptlet, I've straced the devmon processes. > --- > for I in `ps -ef | awk '/devmon/ && !/awk/ {print $2}'`; do > echo "About to 'strace' PID $I" > echo "-----------" > strace -tfp $I > echo "---------" > done > About to 'strace' PID 24357 > ----------- > Process 24357 attached - interrupt to quit > 07:56:02 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) > 07:56:02 read(11, "", 4096) = 0 > 07:56:02 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) > 07:56:02 read(11, "", 4096) = 0 > -- snip -- > 07:56:05 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) > 07:56:05 read(11, "", 4096) = 0 > 07:56:05 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) > 07:56:05 read(11, "", 4096) = 0 > 07:56:05 select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...> > Process 24357 detached > --------- [...] > --- > # ps -ef | grep devmon > root 8973 16821 0 08:14 pts/1 00:00:00 grep devmon > devmon 24357 1 1 Feb22 ? 01:14:11 devmon[master] > devmon 24359 24357 0 Feb22 ? 00:04:02 devmon > devmon 24360 24357 0 Feb22 ? 00:04:01 devmon > devmon 24361 24357 0 Feb22 ? 00:02:57 devmon > devmon 24362 24357 0 Feb22 ? 00:04:01 devmon > devmon 24363 24357 0 Feb22 ? 00:02:53 devmon > devmon 24365 24357 0 Feb22 ? 00:02:55 devmon > devmon 24366 24357 0 Feb22 ? 00:04:13 devmon > devmon 24367 24357 0 Feb22 ? 00:02:26 devmon > devmon 24368 24357 0 Feb22 ? 00:02:52 devmon > > 08:14:10 HOST=sw02.hpdms USER=root > # ls -l /proc/24357/fd > total 0 > lrwx------ 1 devmon devmon 64 Feb 26 08:14 0 -> /dev/null > lrwx------ 1 devmon devmon 64 Feb 26 08:14 1 -> /dev/null > lrwx------ 1 devmon devmon 64 Feb 26 08:14 10 -> socket:[5679517] > lrwx------ 1 devmon devmon 64 Feb 26 08:14 11 -> socket:[5679519] > lrwx------ 1 devmon devmon 64 Feb 26 08:14 12 -> socket:[5679521] > lrwx------ 1 devmon devmon 64 Feb 26 08:14 13 -> socket:[5679523] > lrwx------ 1 devmon devmon 64 Feb 26 08:14 14 -> socket:[5679525] > lrwx------ 1 devmon devmon 64 Feb 26 08:14 15 -> socket:[5679527] > lrwx------ 1 devmon devmon 64 Feb 26 08:14 2 -> /dev/null > l-wx------ 1 devmon devmon 64 Feb 26 08:14 3 -> > /var/log/devmon/devmon.log.1 lr-x------ 1 devmon devmon 64 Feb 26 08:14 4 > -> /usr/share/devmon/templates lr-x------ 1 devmon devmon 64 Feb 26 08:14 > 5 -> > /usr/share/devmon/templates/ironport-asyncos > lrwx------ 1 devmon devmon 64 Feb 26 08:14 6 -> socket:[5679509] > lrwx------ 1 devmon devmon 64 Feb 26 08:14 7 -> socket:[5679511] > lrwx------ 1 devmon devmon 64 Feb 26 08:14 8 -> socket:[5679513] > lrwx------ 1 devmon devmon 64 Feb 26 08:14 9 -> socket:[5679515] > --- > > >From the logs, strace output and viewing the source, I believe that > > the master process is stuck on the child with file descriptor 11. Yep, for some reason the socketpair breaks down, and the existing timeout wasn't catching it. I've been working on some changes to try and address this, but I got distracted by real work, and haven't yet got a solution that works well. If you are interested, I can post more details, and a patch against current svn. > Interestingly, fd 12 is shown in 'ls' above but not in the preceding > 'ps'. Maybe I'm missing something... Without knowing where the sockets are connected, it's a bit difficult to tell. > I going to leave devmon in this state for a while to do further > testing but if anyone has any ideas they want me to try, I'll happily > oblige. There is a bug filed on this, feel free to add what you know, I will try and get around to attaching a patch to that bug once I have something that works for me. Regards, Buchan |