From: Colin C. <col...@gm...> - 2010-03-03 23:10:36
|
Doh, this should have gone to the list. Sorry. CC ---------- Forwarded message ---------- From: Colin Coe <col...@gm...> Date: Wed, Mar 3, 2010 at 1:32 PM Subject: Re: [Devmon] devmon keeps going purple - long post To: Buchan Milne <bg...@st...> This may be totally unrelated (and perfectly acceptable) but please consider the following output (newlines added for readability): --- # ps -ef | grep devmon devmon 2686 1 1 13:00 ? 00:00:04 devmon[master] devmon 2688 2686 0 13:00 ? 00:00:00 devmon[client-01] devmon 2689 2686 0 13:00 ? 00:00:00 devmon[client-02] devmon 2690 2686 0 13:00 ? 00:00:00 devmon[client-03] devmon 2691 2686 0 13:00 ? 00:00:00 devmon[client-04] devmon 2692 2686 0 13:00 ? 00:00:00 devmon[client-05] devmon 2693 2686 0 13:00 ? 00:00:00 devmon[client-06] devmon 2694 2686 0 13:00 ? 00:00:00 devmon[client-07] devmon 2695 2686 0 13:00 ? 00:00:00 devmon[client-08] devmon 2696 2686 0 13:00 ? 00:00:00 devmon[client-09] devmon 2697 2686 0 13:00 ? 00:00:00 devmon[client-10] root 2808 29484 0 13:05 pts/3 00:00:00 grep devmon # lsof -u devmon | grep socket devmon 2686 devmon 6u unix 0xffff8800000b1940 9610112 socket devmon 2686 devmon 7u unix 0xffff8800000b1100 9610114 socket devmon 2686 devmon 8u unix 0xffff88016fa740c0 9610116 socket devmon 2686 devmon 9u unix 0xffff88016fa756c0 9610118 socket devmon 2686 devmon 10u unix 0xffff88016fa74bc0 9610120 socket devmon 2686 devmon 11u unix 0xffff88016fa74900 9610122 socket devmon 2686 devmon 12u unix 0xffff88016fa74e80 9610124 socket devmon 2686 devmon 13u unix 0xffff880169395480 9610126 socket devmon 2686 devmon 14u unix 0xffff88016f8a9100 9610128 socket devmon 2686 devmon 15u unix 0xffff88016f8a8e40 9610130 socket devmon 2688 devmon 7u unix 0xffff8800000b0600 9610113 socket devmon 2689 devmon 6u unix 0xffff8800000b1940 9610112 socket devmon 2689 devmon 8u unix 0xffff88016fa74640 9610115 socket devmon 2690 devmon 6u unix 0xffff8800000b1940 9610112 socket devmon 2690 devmon 7u unix 0xffff8800000b1100 9610114 socket devmon 2690 devmon 9u unix 0xffff88016fa75980 9610117 socket devmon 2691 devmon 6u unix 0xffff8800000b1940 9610112 socket devmon 2691 devmon 7u unix 0xffff8800000b1100 9610114 socket devmon 2691 devmon 8u unix 0xffff88016fa740c0 9610116 socket devmon 2691 devmon 10u unix 0xffff88016fa75140 9610119 socket devmon 2692 devmon 6u unix 0xffff8800000b1940 9610112 socket devmon 2692 devmon 7u unix 0xffff8800000b1100 9610114 socket devmon 2692 devmon 8u unix 0xffff88016fa740c0 9610116 socket devmon 2692 devmon 9u unix 0xffff88016fa756c0 9610118 socket devmon 2692 devmon 11u unix 0xffff88016fa75400 9610121 socket devmon 2693 devmon 6u unix 0xffff8800000b1940 9610112 socket devmon 2693 devmon 7u unix 0xffff8800000b1100 9610114 socket devmon 2693 devmon 8u unix 0xffff88016fa740c0 9610116 socket devmon 2693 devmon 9u unix 0xffff88016fa756c0 9610118 socket devmon 2693 devmon 10u unix 0xffff88016fa74bc0 9610120 socket devmon 2693 devmon 12u unix 0xffff88016fa74380 9610123 socket devmon 2694 devmon 6u unix 0xffff8800000b1940 9610112 socket devmon 2694 devmon 7u unix 0xffff8800000b1100 9610114 socket devmon 2694 devmon 8u unix 0xffff88016fa740c0 9610116 socket devmon 2694 devmon 9u unix 0xffff88016fa756c0 9610118 socket devmon 2694 devmon 10u unix 0xffff88016fa74bc0 9610120 socket devmon 2694 devmon 11u unix 0xffff88016fa74900 9610122 socket devmon 2694 devmon 13u unix 0xffff88016a7f4080 9610125 socket devmon 2695 devmon 6u unix 0xffff8800000b1940 9610112 socket devmon 2695 devmon 7u unix 0xffff8800000b1100 9610114 socket devmon 2695 devmon 8u unix 0xffff88016fa740c0 9610116 socket devmon 2695 devmon 9u unix 0xffff88016fa756c0 9610118 socket devmon 2695 devmon 10u unix 0xffff88016fa74bc0 9610120 socket devmon 2695 devmon 11u unix 0xffff88016fa74900 9610122 socket devmon 2695 devmon 12u unix 0xffff88016fa74e80 9610124 socket devmon 2695 devmon 14u unix 0xffff880169395a00 9610127 socket devmon 2696 devmon 6u unix 0xffff8800000b1940 9610112 socket devmon 2696 devmon 7u unix 0xffff8800000b1100 9610114 socket devmon 2696 devmon 8u unix 0xffff88016fa740c0 9610116 socket devmon 2696 devmon 9u unix 0xffff88016fa756c0 9610118 socket devmon 2696 devmon 10u unix 0xffff88016fa74bc0 9610120 socket devmon 2696 devmon 11u unix 0xffff88016fa74900 9610122 socket devmon 2696 devmon 12u unix 0xffff88016fa74e80 9610124 socket devmon 2696 devmon 13u unix 0xffff880169395480 9610126 socket devmon 2696 devmon 15u unix 0xffff88016f8a8340 9610129 socket devmon 2697 devmon 6u unix 0xffff8800000b1940 9610112 socket devmon 2697 devmon 7u unix 0xffff8800000b1100 9610114 socket devmon 2697 devmon 8u unix 0xffff88016fa740c0 9610116 socket devmon 2697 devmon 9u unix 0xffff88016fa756c0 9610118 socket devmon 2697 devmon 10u unix 0xffff88016fa74bc0 9610120 socket devmon 2697 devmon 11u unix 0xffff88016fa74900 9610122 socket devmon 2697 devmon 12u unix 0xffff88016fa74e80 9610124 socket devmon 2697 devmon 13u unix 0xffff880169395480 9610126 socket devmon 2697 devmon 14u unix 0xffff88016f8a9100 9610128 socket devmon 2697 devmon 16u unix 0xffff88016f8a8600 9610131 socket --- It seems that the first devmon client process (first fork) has a single socket. Each subsequent devmon client process has an additional socket. Am I correct in thinking that all of the children should have a single socket? If I'm not completely off my rocker, then could you consider this patch as it makes /proc/pid/fd and lsof output cleaner. ----- --- dm_snmp.pm.old 2010-03-03 13:18:24.000000000 +0800 +++ dm_snmp.pm 2010-03-03 13:15:16.000000000 +0800 @@ -356,7 +356,10 @@ # Child code here $g{'parent'} = 0; # We arent the parent any more... close $g{'forks'}{$num}{'CS'}; # Same as above - $0 = 'devmon'; # Remove our 'master' tag + for my $ichild (1..$g{'numforks'}) { + close $g{'forks'}{$ichild}{'CS'} unless $ichild == $num; + } + $0 = sprintf("devmon[client-%02d]", $num); # Remove our 'master' tag fork_sub($num); # Enter our neverending query loop exit; # We should never get here, but just in case } ----- With the patch applied, my lsof output looks like (again, newlines added for readability): --- devmon 3007 devmon 6u unix 0xffff88016fa74900 9614757 socket devmon 3007 devmon 7u unix 0xffff88016fa74bc0 9614759 socket devmon 3007 devmon 8u unix 0xffff88016fa756c0 9614761 socket devmon 3007 devmon 9u unix 0xffff88016fa740c0 9614763 socket devmon 3007 devmon 10u unix 0xffff8800000b1100 9614765 socket devmon 3007 devmon 11u unix 0xffff8800000b1940 9614767 socket devmon 3007 devmon 12u unix 0xffff8801684ac100 9614769 socket devmon 3007 devmon 13u unix 0xffff8801684ad9c0 9614771 socket devmon 3007 devmon 14u unix 0xffff88016917a3c0 9614773 socket devmon 3007 devmon 15u unix 0xffff88016917a680 9614775 socket devmon 3009 devmon 7u unix 0xffff88016fa74380 9614758 socket devmon 3010 devmon 8u unix 0xffff88016fa75400 9614760 socket devmon 3011 devmon 9u unix 0xffff88016fa75140 9614762 socket devmon 3012 devmon 10u unix 0xffff88016fa75980 9614764 socket devmon 3013 devmon 11u unix 0xffff88016fa74640 9614766 socket devmon 3014 devmon 12u unix 0xffff8800000b0600 9614768 socket devmon 3015 devmon 13u unix 0xffff8801684adc80 9614770 socket devmon 3016 devmon 14u unix 0xffff8801684ac3c0 9614772 socket devmon 3017 devmon 15u unix 0xffff88016917ac00 9614774 socket devmon 3018 devmon 16u unix 0xffff88016917b9c0 9614776 socket --- Thanks CC On Wed, Mar 3, 2010 at 7:52 AM, Colin Coe <col...@gm...> wrote: > On Tue, Mar 2, 2010 at 4:14 PM, Buchan Milne <bg...@st...> wrote: >> On Friday, 26 February 2010 01:20:41 Colin Coe wrote: >>> Hi all >>> >>> I've been seeing this for a while also. >>> >>> It happened again today so rather than just restart I'm going to do >>> some testing. >>> >>> In /var/log/devmon/devmon.log I see >>> --- >>> [10-02-26@05:10:59] Starting snmp queries >>> [10-02-26@05:10:59] Getting device status from hobbit at 127.0.0.1:1984 >>> [10-02-26@05:11:00] Performing test logic >>> [10-02-26@05:11:01] Done with test logic >>> [10-02-26@05:11:01] Sending messages to display server >>> [10-02-26@05:11:01] Done sending messages >>> [10-02-26@05:11:01] Sleeping for 58 seconds. >>> [10-02-26@05:11:59] Starting snmp queries >>> [10-02-26@05:11:59] Getting device status from hobbit at 127.0.0.1:1984 >>> [10-02-26@05:12:01] Performing test logic >>> [10-02-26@05:12:01] Done with test logic >>> [10-02-26@05:12:01] Sending messages to display server >>> [10-02-26@05:12:01] Done sending messages >>> [10-02-26@05:12:01] Sleeping for 58 seconds. >>> [10-02-26@05:13:00] Starting snmp queries >>> [10-02-26@05:13:00] Getting device status from hobbit at 127.0.0.1:1984 >>> --- >>> >>> 5:13AM is when devmon last reported in to xymon, currently 7:52AM. >>> >>> Using this scriptlet, I've straced the devmon processes. >>> --- >>> for I in `ps -ef | awk '/devmon/ && !/awk/ {print $2}'`; do >>> echo "About to 'strace' PID $I" >>> echo "-----------" >>> strace -tfp $I >>> echo "---------" >>> done >>> About to 'strace' PID 24357 >>> ----------- >>> Process 24357 attached - interrupt to quit >>> 07:56:02 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) >>> 07:56:02 read(11, "", 4096) = 0 >>> 07:56:02 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) >>> 07:56:02 read(11, "", 4096) = 0 >>> -- snip -- >>> 07:56:05 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) >>> 07:56:05 read(11, "", 4096) = 0 >>> 07:56:05 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) >>> 07:56:05 read(11, "", 4096) = 0 >>> 07:56:05 select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...> >>> Process 24357 detached >>> --------- >> >> [...] >> >>> --- >>> # ps -ef | grep devmon >>> root 8973 16821 0 08:14 pts/1 00:00:00 grep devmon >>> devmon 24357 1 1 Feb22 ? 01:14:11 devmon[master] >>> devmon 24359 24357 0 Feb22 ? 00:04:02 devmon >>> devmon 24360 24357 0 Feb22 ? 00:04:01 devmon >>> devmon 24361 24357 0 Feb22 ? 00:02:57 devmon >>> devmon 24362 24357 0 Feb22 ? 00:04:01 devmon >>> devmon 24363 24357 0 Feb22 ? 00:02:53 devmon >>> devmon 24365 24357 0 Feb22 ? 00:02:55 devmon >>> devmon 24366 24357 0 Feb22 ? 00:04:13 devmon >>> devmon 24367 24357 0 Feb22 ? 00:02:26 devmon >>> devmon 24368 24357 0 Feb22 ? 00:02:52 devmon >>> >>> 08:14:10 HOST=sw02.hpdms USER=root >>> # ls -l /proc/24357/fd >>> total 0 >>> lrwx------ 1 devmon devmon 64 Feb 26 08:14 0 -> /dev/null >>> lrwx------ 1 devmon devmon 64 Feb 26 08:14 1 -> /dev/null >>> lrwx------ 1 devmon devmon 64 Feb 26 08:14 10 -> socket:[5679517] >>> lrwx------ 1 devmon devmon 64 Feb 26 08:14 11 -> socket:[5679519] >>> lrwx------ 1 devmon devmon 64 Feb 26 08:14 12 -> socket:[5679521] >>> lrwx------ 1 devmon devmon 64 Feb 26 08:14 13 -> socket:[5679523] >>> lrwx------ 1 devmon devmon 64 Feb 26 08:14 14 -> socket:[5679525] >>> lrwx------ 1 devmon devmon 64 Feb 26 08:14 15 -> socket:[5679527] >>> lrwx------ 1 devmon devmon 64 Feb 26 08:14 2 -> /dev/null >>> l-wx------ 1 devmon devmon 64 Feb 26 08:14 3 -> >>> /var/log/devmon/devmon.log.1 lr-x------ 1 devmon devmon 64 Feb 26 08:14 4 >>> -> /usr/share/devmon/templates lr-x------ 1 devmon devmon 64 Feb 26 08:14 >>> 5 -> >>> /usr/share/devmon/templates/ironport-asyncos >>> lrwx------ 1 devmon devmon 64 Feb 26 08:14 6 -> socket:[5679509] >>> lrwx------ 1 devmon devmon 64 Feb 26 08:14 7 -> socket:[5679511] >>> lrwx------ 1 devmon devmon 64 Feb 26 08:14 8 -> socket:[5679513] >>> lrwx------ 1 devmon devmon 64 Feb 26 08:14 9 -> socket:[5679515] >>> --- >>> >>> >From the logs, strace output and viewing the source, I believe that >>> >>> the master process is stuck on the child with file descriptor 11. >> >> Yep, for some reason the socketpair breaks down, and the existing timeout >> wasn't catching it. I've been working on some changes to try and address this, >> but I got distracted by real work, and haven't yet got a solution that works >> well. >> >> If you are interested, I can post more details, and a patch against current >> svn. >> >>> Interestingly, fd 12 is shown in 'ls' above but not in the preceding >>> 'ps'. Maybe I'm missing something... >> >> Without knowing where the sockets are connected, it's a bit difficult to tell. >> >>> I going to leave devmon in this state for a while to do further >>> testing but if anyone has any ideas they want me to try, I'll happily >>> oblige. >> >> There is a bug filed on this, feel free to add what you know, I will try and >> get around to attaching a patch to that bug once I have something that works >> for me. >> >> Regards, >> Buchan >> > > Hi > > Devmon hasn't hung yet so I'm still waiting to see what happens. > However, can you advise if there is any specific debugging you'd like > me to do when devmon does eventually hang? > > Out of curiousity, how do I tell which fd connects to which socket? > > I'm very interested in more details and a look at the current patch against svn. > > Thanks > > CC > > -- > RHCE#805007969328369 > -- RHCE#805007969328369 -- RHCE#805007969328369 |