From: Daniel P. <da...@po...> - 2009-11-30 13:29:13
Carlo Marcelo Arenas Belon wrote:
> On Mon, Nov 30, 2009 at 08:12:34AM +0000, Daniel Pocock wrote:
>
>> Carlo Marcelo Arenas Belon wrote:
>>
>>> On Sun, Nov 29, 2009 at 10:57:01AM +0000, Carlo Marcelo Arenas Belon wrote:
>>>
>>>> On Tue, Nov 24, 2009 at 06:03:51PM -0800, Bernard Li wrote:
>>>>
>>>>> Please help us test on as many OS/archs as possible, as this would go
>>>>> GA quite immediately ;-)
>>>>>
>>>> FreeBSD is not able to return any XML data through TCP/8649 (tested with
>>>> FreeBSD 8.0 amd64).
>>>>
>>> the problem wasn't actually the TCP/8649 service but the fact that gmond
>>> was going into an infinite loop after sending the first metric update.
>>>
>>> the issue was tracked down to r2043 and a 3.1.5 development package with
>>> that patch reverted is available for testing from :
>>>
>>> http://sajino.sajinet.com.pe/ganglia/ganglia-3.1.5.2101.tar.gz
>>>
>> Did you see this issue with 3.1.3 or 3.1.4? They both contain the same
>> patch.
>>
>
> Both 3.1.3 and 3.1.4 should have the same problem, but haven't been able to
> test 3.1.3 since it is no longer available. (FreeBSD 8 was just released a
> couple of days ago anyway). 3.1.4 shows the same behavior at least there
> and the "fixed" package seems to also work fine with OpenBSD 4.4 amd64,
> NetBSD 4 i386 and DragonFlyBSD 2.4.1 i386 and amd64 (after also patched
> with r2124 to workaround BUG245).
>
>>>> DragonFlyBSD fails to build but a 3.2 version of ganglia which includes
>>>> fixes for that fails with the same TCP issue than FreeBSD and so this
>>>> issue might be affecting other BSD as well.
>>>>
>>> confirmed also to be affecting OpenBSD (tested with OpenBSD 4.5 amd64)
>>> but considering the nature of the "fix" wouldn't be surprised if other
>>> configurations were also affected.
>>>
>> Are you proposing a fix or just revert the change?
>>
>
> Your call, even though a fix for this feature will be probably preferred as
> there is nothing special about the BSD for them to be affected and it might
> be that the problem is therefore more generic.
>

It may be that this bug is revealing a more serious issue in the way
initialisation is done, so I would prefer to know the real cause rather
than just revert the change that forces the problem to show itself.

> At least a revert would be needed for 3.1 as this accounts for a regression
> but haven't done so either waiting for you to first revert it on trunk and
> then decide on how to proceed from there depending on how critical this
> feature was for the release.
>

I agree that it is a regression, but reverting it may cause the real
culprit to remain hidden. I'd rather hold the release while we look
more closely.

>> The change has been working on Linux, Solaris and Cygwin.
>>
>
> Other than just doing a manual bisect (using git instead of svn here would
> had been useful) to find where the problem was introduced and validate that
> reverting it corrects the problem haven't done much analysis of it, but the
> fact that it broke in such a strange way (was indeed expecting the culprit
> to be somewhere else, specially considering all recent changes in the
> networking and the fact that it seemed originally to be triggered by a TCP
> request) probably points to a bigger issue which just happens to have not
> been visible on the configurations used to test Linux, Solaris and Cygwin,
> specially considering how pervasive it was (broke all BSD I had access to
> test, at least)
>

Can you provide output from strace/truss and also a stack trace from the
point where it is in the infinite loop?

There is a good reason for moving the daemonize code the way I did - an
alternative would be to daemonize, but make the original process hang
around until the daemon process has entered the main loop.
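
The alternative mentioned above (daemonize, but keep the original process
around until the daemon has reached its main loop) is often done with a pipe
between parent and child. What follows is only a minimal sketch of that
general technique, not gmond code: the names daemonize_and_wait, demo_init_ok,
demo_init_fail and demo_loop are hypothetical, and the usual second fork,
chdir and stdio redirection are left out for brevity.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from the gmond source): fork a daemon child and
 * keep the original process alive until the child reports that its
 * initialisation finished. Returns 0 in the parent if the child signalled
 * readiness, -1 otherwise. Never returns in the child. */
int daemonize_and_wait(int (*init)(void), void (*main_loop)(void))
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: detach from the session, then initialise. */
        close(fds[0]);
        if (setsid() < 0 || init() != 0)
            _exit(EXIT_FAILURE);   /* pipe closes with nothing written */

        /* Tell the waiting parent we are about to enter the main loop. */
        char ok = 1;
        if (write(fds[1], &ok, 1) != 1)
            _exit(EXIT_FAILURE);
        close(fds[1]);

        main_loop();
        _exit(EXIT_SUCCESS);
    }

    /* Parent: block until one byte arrives or the write end is closed. */
    close(fds[1]);
    char status = 0;
    ssize_t n;
    do {
        n = read(fds[0], &status, 1);
    } while (n < 0 && errno == EINTR);
    close(fds[0]);

    if (n == 1 && status == 1)
        return 0;                  /* daemon is up; parent may exit now */

    waitpid(pid, NULL, 0);         /* startup failed: reap the child */
    return -1;
}

/* Stand-ins for real initialisation and main-loop code (assumptions). */
int demo_init_ok(void)   { return 0; }
int demo_init_fail(void) { return -1; }
void demo_loop(void)     { /* a real daemon would loop here */ }
```

With this arrangement the launching process can return a non-zero exit
status when startup fails, so an init script sees the failure, while the
daemon still detaches before it does its initialisation.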