Re: [Apcupsd-users] apcupsd 3.14.0 ether/net loses sync?
From: Adam K. <akr...@ro...> - 2007-03-25 18:51:43
Jan Ceuleers wrote:
> Adam Kropelin wrote:
>
>>> [root@skr03 root]# strace -p 1016
>>> read(8, <unfinished ...>
>>> [root@skr03 root]# kill -SIGCONT 1016
>>
>> Stuck on a read() call... That's amusing. How long did you let it sit
>> before interrupting the strace? What does 'lsof -p 1016' say?
>
> A Suitably Long Time(tm). That is: about a minute.
>
> Just did it again; left it running for 5 mins; no difference.
>
> [root@skr03 root]# lsof -p 1016
> COMMAND  PID USER  FD TYPE     DEVICE    SIZE  NODE NAME
> apcupsd 1016 root cwd  DIR        3,1    1024     2 /
> apcupsd 1016 root rtd  DIR        3,1    1024     2 /
> apcupsd 1016 root txt  REG        3,1   95576 32808 /sbin/apcupsd
> apcupsd 1016 root mem  REG        3,1  494262 24343 /lib/ld-2.2.4.so
> apcupsd 1016 root mem  REG        3,1  531552 72880 /lib/i686/libpthread-0.9.so
> apcupsd 1016 root mem  REG        3,1  420462 48822 /usr/lib/libstdc++-3-libc6.2-2-2.10.0.so
> apcupsd 1016 root mem  REG        3,1  626466 72883 /lib/i686/libm-2.2.4.so
> apcupsd 1016 root mem  REG        3,1 5792809 72881 /lib/i686/libc-2.2.4.so
> apcupsd 1016 root  0r  CHR        1,3         32041 /dev/null
> apcupsd 1016 root  1r  CHR        1,3         32041 /dev/null
> apcupsd 1016 root  2r  CHR        1,3         32041 /dev/null
> apcupsd 1016 root  3u  REG        0,8    9357  2672 /var/log/apcupsd.events
> apcupsd 1016 root  4r FIFO        0,5          2684 pipe
> apcupsd 1016 root  5w FIFO        0,5          2684 pipe
> apcupsd 1016 root  6u IPv4       2691           TCP *:3551 (LISTEN)
> apcupsd 1016 root  7u unix 0xc6cabc20          2692 socket
> apcupsd 1016 root  8u IPv4      11953           TCP skr03.xperim.be:3512->penta.xperim.be:3551 (ESTABLISHED)

Ok, that clinches it. fd #8 is the NIS client socket open to penta, and
we're stuck in read() on it.

I can see one way that would happen, although it should be very hard to
hit. If penta crashes after skr03 sends a request, but before it
transmits the response, skr03 would end up stuck forever in read(). This
is a very narrow window to hit, but for some reason you are hitting it
regularly. (Do you happen to have NETTIME set very low? That would make
it more likely, though the window would still be very small.)
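
For illustration only, the failure mode described above (a request goes out, the peer never answers, and the client sits in read() indefinitely) and one common guard against it, a receive timeout, can be sketched in Python. This is not apcupsd code and not necessarily the fix that will be sent; `silent_server`, `request_with_timeout`, the `b"status"` payload, and the 0.5 second timeout are all names and values made up for the sketch.

```python
import socket
import threading


def silent_server(listener, stop):
    """Hypothetical peer: accept one connection, read the request, never reply."""
    conn, _ = listener.accept()
    with conn:
        conn.recv(1024)   # consume the request
        stop.wait()       # keep the connection open without sending anything


def request_with_timeout(host, port, payload, timeout):
    """Send payload and wait at most `timeout` seconds for a reply.

    Returns the reply bytes, or None if the peer stayed silent.
    Without the timeout, recv() would block indefinitely here.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        try:
            return sock.recv(1024)
        except socket.timeout:
            return None


def demo():
    """Run the client against the silent peer on an ephemeral local port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    stop = threading.Event()
    worker = threading.Thread(
        target=silent_server, args=(listener, stop), daemon=True
    )
    worker.start()
    try:
        port = listener.getsockname()[1]
        # b"status" is a placeholder request, not the real wire format.
        return request_with_timeout("127.0.0.1", port, b"status", 0.5)
    finally:
        stop.set()
        worker.join(timeout=2)
        listener.close()


if __name__ == "__main__":
    print(demo())
```

With the timeout in place the client gives up and can retry or report a communications loss instead of hanging; the right timeout value would depend on the polling interval in use.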

I have been able to successfully reproduce the hang on my setup by
inserting some sleep() calls on the client to make the window wide
enough to hit easily. I'll work up a fix and send it to you for testing.

Thanks for the debugging help!

--Adam