From: Emmanuel C. <ma...@fr...> - 2010-05-01 04:01:50
Hi Francis,

> Are you back?
>

Not yet, I am in Sydney right now and I am flying back home (it should take about 24 hours). I should be back online on Monday.

Talk to you soon
Emmanuel

> Seby.
> -----Original Message-----
> From: Francis, Seby
> Sent: Monday, April 05, 2010 11:26 PM
> To: Sequoia general mailing list
> Cc: seq...@li...
> Subject: RE: [Sequoia] Failure detection
>
> Hi Emmanuel,
>
> Do you need more logs on this? Please let me know.
>
> Thanks,
> Seby.
>
> -----Original Message-----
> From: seq...@li... [mailto:seq...@li...] On Behalf Of Francis, Seby
> Sent: Monday, March 29, 2010 1:51 PM
> To: Sequoia general mailing list
> Cc: seq...@li...
> Subject: Re: [Sequoia] Failure detection
>
> Hi Emmanuel,
>
> I've tried different JGroups configurations and now I can see in the logs that the groups are merging. But for some reason, Sequoia never shows that it is merged, i.e., when I run 'show controllers' on the console I see only that particular host. Below is a snippet from one of the hosts. I see similar output on the other host showing the merge. Let me know if you would like to see the debug logs for that time-frame.
>
> 2010-03-29 06:59:45,683 DEBUG jgroups.protocols.VERIFY_SUSPECT diff=1507, mbr 10.0.0.33:35974 is dead (passing up SUSPECT event)
> 2010-03-29 06:59:45,687 DEBUG continuent.hedera.gms JGroups reported suspected member:10.0.0.33:35974
> 2010-03-29 06:59:45,688 DEBUG continuent.hedera.gms Member(address=/10.0.0.33:35974, uid=db2) leaves Group(gid=db2).
>
> 2010-03-29 06:59:45,868 INFO controller.requestmanager.cleanup Waiting 30000ms for client of controller 562949953421312 to failover
> 2010-03-29 07:00:15,875 INFO controller.requestmanager.cleanup Cleanup for controller 562949953421312 failure is completed.
>
> -----
> 2010-03-29 07:03:14,725 DEBUG protocols.pbcast.GMS I (10.0.0.23:49731) will be the leader. Starting the merge task for [10.0.0.33:35974, 10.0.0.23:49731]
> 2010-03-29 07:03:14,726 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 running merge task, coordinators are [10.0.0.33:35974, 10.0.0.23:49731]
> 2010-03-29 07:03:14,730 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 sending MERGE_REQ to [10.0.0.33:35974, 10.0.0.23:49731]
> 2010-03-29 07:03:14,746 DEBUG jgroups.protocols.UDP sending msg to 10.0.0.23:49731, src=10.0.0.23:49731, headers are GMS: GmsHeader[MERGE_RSP]: view=[10.0.0.23:49731|2] [10.0.0.23:49731], digest=10.0.0.23:49731: [44 : 47 (47)], merge_rejected=false, merge_id=[10.0.0.23:49731|1269860594727], UNICAST: [UNICAST: DATA, seqno=4], UDP: [channel_name=db2]
> 2010-03-29 07:03:14,748 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 responded to 10.0.0.23:49731, merge_id=[10.0.0.23:49731|1269860594727]
> 2010-03-29 07:03:14,766 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 expects 2 responses, so far got 2 responses
> 2010-03-29 07:03:14,766 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 collected 2 merge response(s) in 36 ms
> 2010-03-29 07:03:14,772 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 computed new merged view that will be MergeView::[10.0.0.23:49731|3] [10.0.0.23:49731, 10.0.0.33:35974], subgroups=[[10.0.0.23:49731|2] [10.0.0.23:49731], [10.0.0.33:35974|2] [10.0.0.33:35974]]
> 2010-03-29 07:03:14,773 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 is sending merge view [10.0.0.23:49731|3] to coordinators [10.0.0.33:35974, 10.0.0.23:49731
>
> Seby.
>
> -----Original Message-----
> From: seq...@li... [mailto:seq...@li...] On Behalf Of Emmanuel Cecchet
> Sent: Wednesday, March 24, 2010 10:41 AM
> To: Sequoia general mailing list
> Cc: seq...@li...
> Subject: Re: [Sequoia] Failure detection
>
> Hi Seby,
>
> Sorry for the late reply, I have been very busy these past days.
> This seems to be a JGroups issue that could probably be better answered
> by Bela Ban on the JGroups mailing list. I have seen emails these past
> days on the list with people having a similar problem.
> I would recommend that you post an email on the JGroups mailing list
> with your JGroups configuration and the messages you see regarding MERGE
> failing.
>
> Keep me posted
> Emmanuel
>
>
>> Also, here is the error which I see from the logs:
>>
>> 2010-03-22 08:31:15,912 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 expects 2 responses, so far got 1 responses
>> 2010-03-22 08:31:15,913 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 waiting 382 msecs for merge responses
>> 2010-03-22 08:31:16,313 DEBUG protocols.pbcast.GMS At 10.10.10.23:39729 cancelling merge due to timer timeout (5000 ms)
>> 2010-03-22 08:31:16,314 DEBUG protocols.pbcast.GMS cancelling merge (merge_id=[10.10.10.23:39729|1269261071286])
>> 2010-03-22 08:31:16,316 DEBUG protocols.pbcast.GMS resumed ViewHandler
>> 2010-03-22 08:31:16,317 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 expects 2 responses, so far got 0 responses
>> 2010-03-22 08:31:16,317 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 collected 0 merge response(s) in 5027 ms
>> 2010-03-22 08:31:16,318 WARN protocols.pbcast.GMS Merge aborted. Merge leader did not get MergeData from all subgroup coordinators [10.10.10.33:38822, 10.10.10.23:39729]
>>
>> -----Original Message-----
>> From: Francis, Seby
>> Sent: Monday, March 22, 2010 1:03 PM
>> To: 'Sequoia general mailing list'
>> Cc: seq...@li...
>> Subject: RE: [Sequoia] Failure detection
>>
>> Hi Emmanuel,
>>
>> I've updated my JGroups to the version which you mentioned, but I still see the issue with merging the groups. One of the controllers lost track after the failure and won't merge. Can you please give me a hand to figure out where it goes wrong? I have the debug logs. Shall I send them as a zip file?
>>
>> Thanks,
>> Seby.
>>
>> -----Original Message-----
>> From: seq...@li... [mailto:seq...@li...] On Behalf Of Emmanuel Cecchet
>> Sent: Thursday, March 18, 2010 10:22 PM
>> To: Sequoia general mailing list
>> Cc: seq...@li...
>> Subject: Re: [Sequoia] Failure detection
>>
>> Hi Seby,
>>
>> I looked into the mailing list archive and this version of JGroups has a
>> number of significant bugs. An issue was filed
>> (http://forge.continuent.org/jira/browse/SEQUOIA-1130) and I fixed it
>> for Sequoia 4. Just using a drop-in replacement for JGroups core for
>> Sequoia 2.10.10 might work. You might have to update Hedera jars as well
>> but that could work with the old one too.
>>
>> Let me know if the upgrade does not work
>> Emmanuel
>>
>>
>>
>>> Thanks for your support!!
>>>
>>> I'm using jgroups-core.jar Version 2.4.2 which came with
>>> "sequoia-2.10.10". My Solaris test servers have only a single interface
>>> and I'm using the same IP for both group & db/client communications. I
>>> ran a test again removing "*STATE_TRANSFER*" and attached the logs. At
>>> around 13:36, I took the host1 interface down and opened it around
>>> 13:38. After I opened the interface, and when I ran the show
>>> controllers on console, host1 showed both controllers while host2
>>> showed its own name in the member list.
>>>
>>> Regards,
>>>
>>> Seby.
>>>
>>> -----Original Message-----
>>> Hi Seby,
>>>
>>> Welcome to the wonderful world of group communications!
>>>
>>>
>>>
>>>> I've tried various FD options and could not get it working when one
>>>> of the hosts fails. I can see the message 'A leaving group' on live
>>>> controller B when I shut down the interface of A. This is working as
>>>> expected and the virtual db is still accessible/writable as the
>>>> controller B is alive. But when I open the interface on A, the
>>>> controller A shows (show controllers) that the virtual-db is hosted by
>>>> controllers A & B while controller B just shows B. And the data
>>>> inserted into the vdb hosted by controller B is NOT being played on A.
>>>> This will cause inconsistencies in the data between the virtual-dbs.
>>>> Is there a way we can disable the backend if the network goes down,
>>>> so that I can recover the db using the backup?
>>>
>>>
>>> There is a problem with your group communication configuration if
>>> controllers have different views of the group. That should not happen.
>>>
>>>
>>>
>>>> I've also noticed that in some cases, if I take one of the host
>>>> interfaces down, both of them think that the other controller failed.
>>>> This will also create issues. In my case, I only have two controllers
>>>> hosted. Is it possible to ping a network gateway? That way the
>>>> controller knows that it is the one which failed and can disable the
>>>> backend.
>>>
>>>
>>> The best solution is to use the same interface for group communication
>>> and client/database communications. If you use a dedicated network for
>>> group communications and this network fails, you will end up with a
>>> network partition and this is very bad. If all communications go
>>> through the same interface, when it goes down, all communications are
>>> down and the controller will not be able to serve stale data.
>>>
>>> You don't need STATE_TRANSFER as Sequoia has its own state transfer
>>> protocol when a new member joins a group. Which version of JGroups are
>>> you using? Could you send me the log with JGroups messages that you
>>> see on each controller by activating them in log4j.properties. I would
>>> need the initial sequence when you start the cluster and the messages
>>> you see when the failure is detected and when the failed controller
>>> joins back. There might be a problem with the timeout settings of the
>>> different components of the stack.
>>>
>>> Keep me posted with your findings
>>>
>>> Emmanuel
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Sequoia mailing list
>>> Se...@li...
>>> http://forge.continuent.org/mailman/listinfo/sequoia
>>>
>>>
>>
>>
>
>
>

--
Emmanuel Cecchet
FTO @ Frog Thinker
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: ma...@fr...
Skype: emmanuel_cecchet
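
The gateway check asked about in the quoted thread (ping a network gateway so that a controller can tell whether it is the one that got cut off) can be sketched as a small decision function. This is only an illustration of the reasoning, not part of Sequoia, Hedera, or JGroups: the names `Verdict`, `classify`, and `should_disable_backend` are hypothetical, and how the two reachability probes are actually performed is deliberately left out.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"              # peer still visible, nothing to do
    PEER_FAILED = "peer_failed"      # gateway reachable, peer gone
    SELF_ISOLATED = "self_isolated"  # gateway unreachable too: we are cut off


def classify(peer_reachable: bool, gateway_reachable: bool) -> Verdict:
    """Decide which side of a two-controller split this node is on.

    Hypothetical helper: both inputs are the results of probes done elsewhere.
    """
    if peer_reachable:
        return Verdict.HEALTHY
    if gateway_reachable:
        return Verdict.PEER_FAILED
    return Verdict.SELF_ISOLATED


def should_disable_backend(verdict: Verdict) -> bool:
    # Only the isolated controller stops serving, so it cannot hand out stale data.
    return verdict is Verdict.SELF_ISOLATED
```

In the scenario from the thread, the host whose interface is down sees neither its peer nor the gateway, so `classify(False, False)` marks it as isolated and its backend would be disabled, while the surviving host still reaches the gateway, gets `PEER_FAILED`, and keeps serving.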