From: Emmanuel C. <ma...@fr...> - 2010-05-10 11:03:52
Hi Francis,

Yes, you can send me the distributed virtual database log to double-check. But as Hedera did not catch the JGroups messages, I don't expect the virtual database to have executed any handler. Don't wait for my analysis of the logs; go ahead with the new experiment.

Thanks again for your feedback,
Emmanuel

> Yes, it is enabled. Also, I'm going to try the latest Hedera and will
> let you know the result. Do you need me to check the other loggers
> before I run the test?
>
> Thanks,
>
> Seby.
>
> -----Original Message-----
> From: seq...@li...
> [mailto:seq...@li...] On Behalf Of Emmanuel Cecchet
> Sent: Saturday, May 08, 2010 3:54 PM
> To: Sequoia general mailing list
> Cc: seq...@li...
> Subject: Re: [Sequoia] Failure detection
>
> Hi Francis,
>
> Do you have the traces with
> log4j.logger.org.continuent.sequoia.controller.virtualdatabase set to
> DEBUG?
>
> Could you also try with the latest version of Hedera?
>
> Sorry for the lag in the responses; I have been swamped since I'm back!
>
> Emmanuel
>
> > > Hello Emmanuel,
> > >
> > > Yes, all were in debug.
> > > Here is the snippet:
> > >
> > > ######################################
> > > # Hedera group communication loggers #
> > > ######################################
> > > # Hedera channels test #
> > > log4j.logger.test.org.continuent.hedera.channel=DEBUG, Console, Filetrace
> > > log4j.additivity.test.org.continuent.hedera.channel=false
> > > # Hedera adapters #
> > > log4j.logger.org.continuent.hedera.adapters=DEBUG, Console, Filetrace
> > > log4j.additivity.org.continuent.hedera.adapters=false
> > > # Hedera factories #
> > > log4j.logger.org.continuent.hedera.factory=DEBUG, Console, Filetrace
> > > log4j.additivity.org.continuent.hedera.factory=false
> > > # Hedera channels #
> > > log4j.logger.org.continuent.hedera.channel=DEBUG, Console, Filetrace
> > > log4j.additivity.org.continuent.hedera.channel=false
> > > # Hedera Group Membership Service #
> > > log4j.logger.org.continuent.hedera.gms=DEBUG, Console, Filetrace
> > > log4j.additivity.org.continuent.hedera.gms=false
> > > # JGroups
> > > log4j.logger.org.jgroups=DEBUG, Console, Filetrace
> > > log4j.additivity.org.jgroups=false
> > > # JGroups protocols
> > > log4j.logger.org.jgroups.protocols=DEBUG, Console, Filetrace
> > > log4j.additivity.org.jgroups.protocols=false
> > > ######################################
> > >
> > > I have the distributed logs for the same time frame. Let me know if you
> > > need those.
> > >
> > > No, Hedera was not updated.
> > >
> > > Thanks,
> > > Seby.
> > >
> > > -----Original Message-----
> > > From: seq...@li...
> > > [mailto:seq...@li...] On Behalf Of Emmanuel Cecchet
> > > Sent: Tuesday, May 04, 2010 6:20 AM
> > > To: Sequoia general mailing list
> > > Cc: seq...@li...
> > > Subject: Re: [Sequoia] Failure detection
> > >
> > > Hi Seby,
> > >
> > > When JGroups reported the MERGE messages in the log, did you have Hedera
> > > DEBUG logs enabled too? If that is the case, the message was never
> > > handled by Hedera, which is a problem.
> > > The new view should have been
> > > installed anyway by the view synchrony layer, and Hedera should at least
> > > catch that.
> > > Can you confirm that the Hedera logs are enabled?
> > > Could you also set the Distributed Virtual Database logs to DEBUG?
> > > Did you try to update Hedera to a newer version?
> > >
> > > Thanks,
> > > Emmanuel
> > >
> > >> Hi Emmanuel,
> > >>
> > >> Do you need more logs on this? Please let me know.
> > >>
> > >> Thanks,
> > >> Seby.
> > >>
> > >> -----Original Message-----
> > >> From: seq...@li...
> > >> [mailto:seq...@li...] On Behalf Of Francis, Seby
> > >> Sent: Monday, March 29, 2010 1:51 PM
> > >> To: Sequoia general mailing list
> > >> Cc: seq...@li...
> > >> Subject: Re: [Sequoia] Failure detection
> > >>
> > >> Hi Emmanuel,
> > >>
> > >> I've tried a different JGroups configuration and now I can see in the
> > >> logs that the groups are merging. But for some reason, Sequoia never
> > >> shows that they are merged, i.e., when I run 'show controllers' on the
> > >> console I see only that particular host. Below is the snippet from one
> > >> of the hosts. I see something similar on the other host, showing the
> > >> merge. Let me know if you would like to see the debug logs for this
> > >> time frame.
> > >>
> > >> 2010-03-29 06:59:45,683 DEBUG jgroups.protocols.VERIFY_SUSPECT diff=1507, mbr 10.0.0.33:35974 is dead (passing up SUSPECT event)
> > >> 2010-03-29 06:59:45,687 DEBUG continuent.hedera.gms JGroups reported suspected member:10.0.0.33:35974
> > >> 2010-03-29 06:59:45,688 DEBUG continuent.hedera.gms Member(address=/10.0.0.33:35974, uid=db2) leaves Group(gid=db2).
> > >>
> > >> 2010-03-29 06:59:45,868 INFO controller.requestmanager.cleanup Waiting 30000ms for client of controller 562949953421312 to failover
> > >> 2010-03-29 07:00:15,875 INFO controller.requestmanager.cleanup Cleanup for controller 562949953421312 failure is completed.
> > >>
> > >> -----
> > >> 2010-03-29 07:03:14,725 DEBUG protocols.pbcast.GMS I (10.0.0.23:49731) will be the leader.
> > >> Starting the merge task for [10.0.0.33:35974, 10.0.0.23:49731]
> > >> 2010-03-29 07:03:14,726 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 running merge task, coordinators are [10.0.0.33:35974, 10.0.0.23:49731]
> > >> 2010-03-29 07:03:14,730 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 sending MERGE_REQ to [10.0.0.33:35974, 10.0.0.23:49731]
> > >> 2010-03-29 07:03:14,746 DEBUG jgroups.protocols.UDP sending msg to 10.0.0.23:49731, src=10.0.0.23:49731, headers are GMS: GmsHeader[MERGE_RSP]: view=[10.0.0.23:49731|2] [10.0.0.23:49731], digest=10.0.0.23:49731: [44 : 47 (47)], merge_rejected=false, merge_id=[10.0.0.23:49731|1269860594727], UNICAST: [UNICAST: DATA, seqno=4], UDP: [channel_name=db2]
> > >> 2010-03-29 07:03:14,748 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 responded to 10.0.0.23:49731, merge_id=[10.0.0.23:49731|1269860594727]
> > >> 2010-03-29 07:03:14,766 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 expects 2 responses, so far got 2 responses
> > >> 2010-03-29 07:03:14,766 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 collected 2 merge response(s) in 36 ms
> > >> 2010-03-29 07:03:14,772 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 computed new merged view that will be MergeView::[10.0.0.23:49731|3] [10.0.0.23:49731, 10.0.0.33:35974], subgroups=[[10.0.0.23:49731|2] [10.0.0.23:49731], [10.0.0.33:35974|2] [10.0.0.33:35974]]
> > >> 2010-03-29 07:03:14,773 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 is sending merge view [10.0.0.23:49731|3] to coordinators [10.0.0.33:35974, 10.0.0.23:49731]
> > >>
> > >> Seby.
> > >>
> > >> -----Original Message-----
> > >> From: seq...@li...
> > >> [mailto:seq...@li...] On Behalf Of Emmanuel Cecchet
> > >> Sent: Wednesday, March 24, 2010 10:41 AM
> > >> To: Sequoia general mailing list
> > >> Cc: seq...@li...
> > >> Subject: Re: [Sequoia] Failure detection
> > >>
> > >> Hi Seby,
> > >>
> > >> Sorry for the late reply, I have been very busy these past days.
> > >> This seems to be a JGroups issue that could probably be better answered
> > >> by Bela Ban on the JGroups mailing list. I have seen emails on the list
> > >> these past days from people having a similar problem.
> > >> I would recommend that you post an email on the JGroups mailing list
> > >> with your JGroups configuration and the messages you see about MERGE
> > >> failing.
> > >>
> > >> Keep me posted,
> > >> Emmanuel
> > >>
> > >>> Also, here is the error which I see in the logs:
> > >>>
> > >>> 2010-03-22 08:31:15,912 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 expects 2 responses, so far got 1 responses
> > >>> 2010-03-22 08:31:15,913 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 waiting 382 msecs for merge responses
> > >>> 2010-03-22 08:31:16,313 DEBUG protocols.pbcast.GMS At 10.10.10.23:39729 cancelling merge due to timer timeout (5000 ms)
> > >>> 2010-03-22 08:31:16,314 DEBUG protocols.pbcast.GMS cancelling merge (merge_id=[10.10.10.23:39729|1269261071286])
> > >>> 2010-03-22 08:31:16,316 DEBUG protocols.pbcast.GMS resumed ViewHandler
> > >>> 2010-03-22 08:31:16,317 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 expects 2 responses, so far got 0 responses
> > >>> 2010-03-22 08:31:16,317 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 collected 0 merge response(s) in 5027 ms
> > >>> 2010-03-22 08:31:16,318 WARN protocols.pbcast.GMS Merge aborted. Merge leader did not get MergeData from all subgroup coordinators [10.10.10.33:38822, 10.10.10.23:39729]
> > >>>
> > >>> -----Original Message-----
> > >>> From: Francis, Seby
> > >>> Sent: Monday, March 22, 2010 1:03 PM
> > >>> To: 'Sequoia general mailing list'
> > >>> Cc: seq...@li...
> > >>> Subject: RE: [Sequoia] Failure detection
> > >>>
> > >>> Hi Emmanuel,
> > >>>
> > >>> I've updated my JGroups to the version which you mentioned,
> > >>> but I still see the issue with merging the groups.
> > >>> One of the controllers lost track after the failure and won't merge.
> > >>> Can you please give me a hand figuring out where it goes wrong? I have
> > >>> the debug logs. Shall I send the logs as a zip file?
> > >>>
> > >>> Thanks,
> > >>> Seby.
> > >>>
> > >>> -----Original Message-----
> > >>> From: seq...@li...
> > >>> [mailto:seq...@li...] On Behalf Of Emmanuel Cecchet
> > >>> Sent: Thursday, March 18, 2010 10:22 PM
> > >>> To: Sequoia general mailing list
> > >>> Cc: seq...@li...
> > >>> Subject: Re: [Sequoia] Failure detection
> > >>>
> > >>> Hi Seby,
> > >>>
> > >>> I looked into the mailing list archive and this version of JGroups has a
> > >>> number of significant bugs. An issue was filed
> > >>> (http://forge.continuent.org/jira/browse/SEQUOIA-1130) and I fixed it
> > >>> for Sequoia 4. Just using a drop-in replacement of the JGroups core jar
> > >>> for Sequoia 2.10.10 might work. You might have to update the Hedera jars
> > >>> as well, but it could work with the old ones too.
> > >>>
> > >>> Let me know if the upgrade does not work.
> > >>> Emmanuel
> > >>>
> > >>>> Thanks for your support!!
> > >>>>
> > >>>> I'm using jgroups-core.jar version 2.4.2, which came with
> > >>>> "sequoia-2.10.10". My Solaris test servers have only a single interface
> > >>>> and I'm using the same IP for both group and db/client communications.
> > >>>> I ran a test again removing "STATE_TRANSFER" and attached the logs. At
> > >>>> around 13:36, I took the host1 interface down and brought it up around
> > >>>> 13:38. After I brought the interface up, when I ran show controllers
> > >>>> on the console, host1 showed both controllers while host2 showed only
> > >>>> its own name in the member list.
> > >>>>
> > >>>> Regards,
> > >>>> Seby.
> > >>>>
> > >>>> -----Original Message-----
> > >>>> Hi Seby,
> > >>>>
> > >>>> Welcome to the wonderful world of group communications!
> > >>>>> I've tried various FD options and could not get it working when one
> > >>>>> of the hosts fails. I can see the message 'A leaving group' on live
> > >>>>> controller B when I shut down the interface of A. This works as
> > >>>>> expected, and the virtual db is still accessible/writable as
> > >>>>> controller B is alive. But when I bring the interface on A back up,
> > >>>>> controller A shows (show controllers) that the virtual-db is hosted
> > >>>>> by controllers A & B, while controller B just shows B. And the data
> > >>>>> inserted into the vdb hosted by controller B is NOT being replayed
> > >>>>> on A. This will cause inconsistencies in the data between the
> > >>>>> virtual-dbs. Is there a way we can disable the backend if the network
> > >>>>> goes down, so that I can recover the db using the backup?
> > >>>>
> > >>>> There is a problem with your group communication configuration if
> > >>>> controllers have different views of the group. That should not happen.
> > >>>>
> > >>>>> I've also noticed that in some cases, if I take one of the host
> > >>>>> interfaces down, both of them think that the other controller failed.
> > >>>>> This will also create issues. In my case, I only have two controllers
> > >>>>> hosted. Is it possible to ping a network gateway? That way the
> > >>>>> controller knows that it is the one which failed and can disable the
> > >>>>> backend.
> > >>>>
> > >>>> The best solution is to use the same interface for group communication
> > >>>> and client/database communications. If you use a dedicated network for
> > >>>> group communications and this network fails, you will end up with a
> > >>>> network partition, and this is very bad.
> > >>>> If all communications go
> > >>>> through the same interface, when it goes down, all communications are
> > >>>> down and the controller will not be able to serve stale data.
> > >>>>
> > >>>> You don't need STATE_TRANSFER, as Sequoia has its own state transfer
> > >>>> protocol when a new member joins a group. Which version of JGroups are
> > >>>> you using? Could you send me the logs with the JGroups messages that
> > >>>> you see on each controller by activating them in log4j.properties? I
> > >>>> would need the initial sequence when you start the cluster and the
> > >>>> messages you see when the failure is detected and when the failed
> > >>>> controller joins back. There might be a problem with the timeout
> > >>>> settings of the different components of the stack.
> > >>>>
> > >>>> Keep me posted with your findings,
> > >>>>
> > >>>> Emmanuel
> >
> > --
> > Emmanuel Cecchet
> > FTO @ Frog Thinker
> > Open Source Development & Consulting
> > --
> > Web: http://www.frogthinker.org
> > email: ma...@fr...
> > Skype: emmanuel_cecchet
> >
> > _______________________________________________
> > Sequoia mailing list
> > Se...@li...
> > http://forge.continuent.org/mailman/listinfo/sequoia
>
> _______________________________________________
> Sequoia mailing list
> Se...@li...
> http://forge.continuent.org/mailman/listinfo/sequoia

--
Emmanuel Cecchet
FTO @ Frog Thinker
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: ma...@fr...
Skype: emmanuel_cecchet
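The merge failure that runs through this thread is JGroups' merge protocol timing out ("cancelling merge due to timer timeout (5000 ms)") before every subgroup coordinator answers the MERGE_REQ. For orientation, a minimal JGroups 2.4-era UDP stack showing where the protocols named in these logs (GMS, MERGE2, FD, FD_SOCK, VERIFY_SUSPECT, UNICAST) sit might look like the sketch below. All attribute values are illustrative, not the configuration used in this thread, and STATE_TRANSFER is omitted because, as noted above, Sequoia performs its own state transfer.

```xml
<!-- Sketch of a JGroups 2.4-era UDP stack (illustrative values only). -->
<config>
  <UDP mcast_addr="228.8.8.8" mcast_port="45566" ip_ttl="8" />
  <!-- Initial discovery of existing members -->
  <PING timeout="2000" num_initial_members="2" />
  <!-- After a partition heals, subgroup coordinators look for each other
       and run the merge task seen in the GMS logs above -->
  <MERGE2 min_interval="5000" max_interval="10000" />
  <!-- Failure detection: TCP-socket-based and heartbeat-based -->
  <FD_SOCK />
  <FD timeout="2500" max_tries="5" shun="true" />
  <!-- Double-checks a SUSPECT before GMS excludes the member -->
  <VERIFY_SUSPECT timeout="1500" />
  <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800" />
  <UNICAST timeout="600,1200,2400" />
  <pbcast.STABLE desired_avg_gossip="20000" />
  <!-- Group membership: installs views and coordinates merges -->
  <pbcast.GMS print_local_addr="true" join_timeout="3000" shun="true" />
</config>
```

The timeouts interact: FD's timeout * max_tries plus VERIFY_SUSPECT's timeout bound how long a dead member goes undetected, and MERGE2's intervals bound how long two healed subgroups wait before attempting the merge that was timing out here.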