From: Emmanuel C. <ma...@fr...> - 2010-05-08 19:54:33
Hi Francis,

Do you have the traces with log4j.logger.org.continuent.sequoia.controller.virtualdatabase set to DEBUG? Could you also try with the latest version of Hedera?
Sorry for the lag in the responses, I have been swamped since I got back!

Emmanuel
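(For reference, the traces Emmanuel asks about can be enabled with log4j.properties entries that follow the same pattern as the Hedera loggers quoted below; the Console and Filetrace appender names are taken from that existing configuration, so treat this as a sketch rather than the exact file used here.)

  # Sequoia distributed virtual database logger (sketch, same pattern as the Hedera loggers below)
  log4j.logger.org.continuent.sequoia.controller.virtualdatabase=DEBUG, Console, Filetrace
  log4j.additivity.org.continuent.sequoia.controller.virtualdatabase=false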
> Hello Emmanuel,
>
> Yes, all were in debug. Here is the snippet:
>
> ######################################
> # Hedera group communication loggers #
> ######################################
> # Hedera channels test #
> log4j.logger.test.org.continuent.hedera.channel=DEBUG, Console, Filetrace
> log4j.additivity.test.org.continuent.hedera.channel=false
> # Hedera adapters #
> log4j.logger.org.continuent.hedera.adapters=DEBUG, Console, Filetrace
> log4j.additivity.org.continuent.hedera.adapters=false
> # Hedera factories #
> log4j.logger.org.continuent.hedera.factory=DEBUG, Console, Filetrace
> log4j.additivity.org.continuent.hedera.factory=false
> # Hedera channels #
> log4j.logger.org.continuent.hedera.channel=DEBUG, Console, Filetrace
> log4j.additivity.org.continuent.hedera.channel=false
> # Hedera Group Membership Service #
> log4j.logger.org.continuent.hedera.gms=DEBUG, Console, Filetrace
> log4j.additivity.org.continuent.hedera.gms=false
> # JGroups
> log4j.logger.org.jgroups=DEBUG, Console, Filetrace
> log4j.additivity.org.jgroups=false
> # JGroups protocols
> log4j.logger.org.jgroups.protocols=DEBUG, Console, Filetrace
> log4j.additivity.org.jgroups.protocols=false
> ######################################
>
> I have the distributed logs for the same time-frame. Let me know if you need them.
>
> No, Hedera was not updated.
>
> Thanks,
> Seby.
>
> -----Original Message-----
> From: seq...@li... [mailto:seq...@li...] On Behalf Of Emmanuel Cecchet
> Sent: Tuesday, May 04, 2010 6:20 AM
> To: Sequoia general mailing list
> Cc: seq...@li...
> Subject: Re: [Sequoia] Failure detection
>
> Hi Seby,
>
> When JGroups reported the MERGE messages in the log, did you have Hedera
> DEBUG logs enabled too? If that is the case, the message was never
> handled by Hedera, which is a problem. The new view should have been
> installed anyway by the view synchrony layer and Hedera should at least
> catch that.
> Can you confirm that the Hedera logs are enabled?
> Could you also set the Distributed Virtual Database logs to DEBUG?
> Did you try to update Hedera to a newer version?
>
> Thanks
> Emmanuel
>
>> Hi Emmanuel,
>>
>> Do you need more logs on this? Please let me know.
>>
>> Thanks,
>> Seby.
>>
>> -----Original Message-----
>> From: seq...@li... [mailto:seq...@li...] On Behalf Of Francis, Seby
>> Sent: Monday, March 29, 2010 1:51 PM
>> To: Sequoia general mailing list
>> Cc: seq...@li...
>> Subject: Re: [Sequoia] Failure detection
>>
>> Hi Emmanuel,
>>
>> I've tried a different JGroups configuration and now I can see in the logs that the groups are merging. But for some reason, Sequoia never shows that it is merged, i.e. when I run 'show controllers' on the console I see only that particular host. Below is the snippet from one of the hosts; I see something similar on the other host showing the merge. Let me know if you would like to see the debug logs during the time-frame.
>>
>> 2010-03-29 06:59:45,683 DEBUG jgroups.protocols.VERIFY_SUSPECT diff=1507, mbr 10.0.0.33:35974 is dead (passing up SUSPECT event)
>> 2010-03-29 06:59:45,687 DEBUG continuent.hedera.gms JGroups reported suspected member:10.0.0.33:35974
>> 2010-03-29 06:59:45,688 DEBUG continuent.hedera.gms Member(address=/10.0.0.33:35974, uid=db2) leaves Group(gid=db2).
>>
>> 2010-03-29 06:59:45,868 INFO controller.requestmanager.cleanup Waiting 30000ms for client of controller 562949953421312 to failover
>> 2010-03-29 07:00:15,875 INFO controller.requestmanager.cleanup Cleanup for controller 562949953421312 failure is completed.
>>
>> -----
>> 2010-03-29 07:03:14,725 DEBUG protocols.pbcast.GMS I (10.0.0.23:49731) will be the leader. Starting the merge task for [10.0.0.33:35974, 10.0.0.23:49731]
>> 2010-03-29 07:03:14,726 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 running merge task, coordinators are [10.0.0.33:35974, 10.0.0.23:49731]
>> 2010-03-29 07:03:14,730 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 sending MERGE_REQ to [10.0.0.33:35974, 10.0.0.23:49731]
>> 2010-03-29 07:03:14,746 DEBUG jgroups.protocols.UDP sending msg to 10.0.0.23:49731, src=10.0.0.23:49731, headers are GMS: GmsHeader[MERGE_RSP]: view=[10.0.0.23:49731|2] [10.0.0.23:49731], digest=10.0.0.23:49731: [44 : 47 (47)], merge_rejected=false, merge_id=[10.0.0.23:49731|1269860594727], UNICAST: [UNICAST: DATA, seqno=4], UDP: [channel_name=db2]
>> 2010-03-29 07:03:14,748 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 responded to 10.0.0.23:49731, merge_id=[10.0.0.23:49731|1269860594727]
>> 2010-03-29 07:03:14,766 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 expects 2 responses, so far got 2 responses
>> 2010-03-29 07:03:14,766 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 collected 2 merge response(s) in 36 ms
>> 2010-03-29 07:03:14,772 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 computed new merged view that will be MergeView::[10.0.0.23:49731|3] [10.0.0.23:49731, 10.0.0.33:35974], subgroups=[[10.0.0.23:49731|2] [10.0.0.23:49731], [10.0.0.33:35974|2] [10.0.0.33:35974]]
>> 2010-03-29 07:03:14,773 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 is sending merge view [10.0.0.23:49731|3] to coordinators [10.0.0.33:35974, 10.0.0.23:49731
>>
>> Seby.
>>
>> -----Original Message-----
>> From: seq...@li... [mailto:seq...@li...] On Behalf Of Emmanuel Cecchet
>> Sent: Wednesday, March 24, 2010 10:41 AM
>> To: Sequoia general mailing list
>> Cc: seq...@li...
>> Subject: Re: [Sequoia] Failure detection
>>
>> Hi Seby,
>>
>> Sorry for the late reply, I have been very busy these past days.
>> This seems to be a JGroups issue that could probably be better answered
>> by Bela Ban on the JGroups mailing list. I have seen emails these past
>> days on the list from people having similar problems.
>> I would recommend that you post an email on the JGroups mailing list
>> with your JGroups configuration and the messages you see regarding MERGE
>> failing.
>>
>> Keep me posted
>> Emmanuel
>>
>>
>>> Also, here is the error which I see from the logs:
>>>
>>> 2010-03-22 08:31:15,912 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 expects 2 responses, so far got 1 responses
>>> 2010-03-22 08:31:15,913 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 waiting 382 msecs for merge responses
>>> 2010-03-22 08:31:16,313 DEBUG protocols.pbcast.GMS At 10.10.10.23:39729 cancelling merge due to timer timeout (5000 ms)
>>> 2010-03-22 08:31:16,314 DEBUG protocols.pbcast.GMS cancelling merge (merge_id=[10.10.10.23:39729|1269261071286])
>>> 2010-03-22 08:31:16,316 DEBUG protocols.pbcast.GMS resumed ViewHandler
>>> 2010-03-22 08:31:16,317 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 expects 2 responses, so far got 0 responses
>>> 2010-03-22 08:31:16,317 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 collected 0 merge response(s) in 5027 ms
>>> 2010-03-22 08:31:16,318 WARN protocols.pbcast.GMS Merge aborted. Merge leader did not get MergeData from all subgroup coordinators [10.10.10.33:38822, 10.10.10.23:39729]
>>>
>>> -----Original Message-----
>>> From: Francis, Seby
>>> Sent: Monday, March 22, 2010 1:03 PM
>>> To: 'Sequoia general mailing list'
>>> Cc: seq...@li...
>>> Subject: RE: [Sequoia] Failure detection
>>>
>>> Hi Emmanuel,
>>>
>>> I've updated my JGroups to the version you mentioned, but I still see the issue with merging the groups. One of the controllers lost track after the failure and won't merge. Can you please give me a hand to figure out where it goes wrong? I have the debug logs. Shall I send the logs as a zip file?
>>>
>>> Thanks,
>>> Seby.
>>>
>>> -----Original Message-----
>>> From: seq...@li... [mailto:seq...@li...] On Behalf Of Emmanuel Cecchet
>>> Sent: Thursday, March 18, 2010 10:22 PM
>>> To: Sequoia general mailing list
>>> Cc: seq...@li...
>>> Subject: Re: [Sequoia] Failure detection
>>>
>>> Hi Seby,
>>>
>>> I looked into the mailing list archive and this version of JGroups has a
>>> number of significant bugs. An issue was filed
>>> (http://forge.continuent.org/jira/browse/SEQUOIA-1130) and I fixed it
>>> for Sequoia 4. Just using a drop-in replacement of the JGroups core jar for
>>> Sequoia 2.10.10 might work. You might have to update the Hedera jars as well,
>>> but it could work with the old ones too.
>>>
>>> Let me know if the upgrade does not work
>>> Emmanuel
>>>
>>>
>>>> Thanks for your support!!
>>>>
>>>> I'm using jgroups-core.jar version 2.4.2, which came with
>>>> "sequoia-2.10.10". My Solaris test servers have only a single interface
>>>> and I'm using the same IP for both group and db/client communications. I
>>>> ran a test again after removing STATE_TRANSFER and attached the logs. At
>>>> around 13:36 I took the host1 interface down and brought it back up around
>>>> 13:38. After I brought the interface back up and ran 'show controllers'
>>>> on the console, host1 showed both controllers while host2 showed only
>>>> its own name in the member list.
>>>>
>>>> Regards,
>>>>
>>>> Seby.
>>>>
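(Aside, for readers following the thread: a JGroups 2.x stack with STATE_TRANSFER removed, as Seby describes above, would look roughly like the sketch below. The protocol list and timeout values are illustrative assumptions, not Seby's actual configuration file; the only point being made is that pbcast.STATE_TRANSFER is absent because Sequoia performs its own state transfer when a controller joins.)

  <config>
    <UDP mcast_addr="228.8.8.8" mcast_port="45566" ip_ttl="32"/>
    <PING timeout="2000" num_initial_members="2"/>
    <MERGE2 min_interval="5000" max_interval="10000"/>
    <FD_SOCK/>
    <FD timeout="2500" max_tries="5" shun="true"/>
    <VERIFY_SUSPECT timeout="1500"/>
    <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800"/>
    <UNICAST timeout="600,1200,2400"/>
    <pbcast.STABLE desired_avg_gossip="20000"/>
    <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true"/>
    <FRAG2 frag_size="8192"/>
    <!-- pbcast.STATE_TRANSFER intentionally omitted: Sequoia handles state transfer itself -->
  </config>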
>>>> -----Original Message-----
>>>> Hi Seby,
>>>>
>>>> Welcome to the wonderful world of group communications!
>>>>
>>>>> I've tried various FD options and could not get it working when one
>>>>> of the hosts fails. I can see the message 'A leaving group' on live
>>>>> controller B when I shut down the interface of A. This is working as
>>>>> expected and the virtual db is still accessible/writable as
>>>>> controller B is alive. But when I open the interface on A,
>>>>> controller A shows (show controllers) that the virtual-db is hosted by
>>>>> controllers A & B while controller B just shows B. And the data
>>>>> inserted into the vdb hosted by controller B is NOT being played on A.
>>>>> This will cause inconsistencies in the data between the virtual-dbs.
>>>>> Is there a way we can disable the backend if the network goes down,
>>>>> so that I can recover the db using the backup?
>>>>
>>>> There is a problem with your group communication configuration if
>>>> controllers have different views of the group. That should not happen.
>>>>
>>>>> I've also noticed that in some cases, if I take one of the host
>>>>> interfaces down, both of them think that the other controller failed.
>>>>> This will also create issues. In my case, I only have two controllers
>>>>> hosted. Is it possible to ping a network gateway? That way the
>>>>> controller knows that it is the one that failed and can disable the
>>>>> backend.
>>>>
>>>> The best solution is to use the same interface for group communication
>>>> and client/database communications. If you use a dedicated network for
>>>> group communications and this network fails, you will end up with a
>>>> network partition and this is very bad. If all communications go
>>>> through the same interface, when it goes down, all communications are
>>>> down and the controller will not be able to serve stale data.
>>>>
>>>> You don't need STATE_TRANSFER as Sequoia has its own state transfer
>>>> protocol when a new member joins a group. Which version of JGroups are
>>>> you using? Could you send me the logs with the JGroups messages that you
>>>> see on each controller by activating them in log4j.properties? I would
>>>> need the initial sequence when you start the cluster and the messages
>>>> you see when the failure is detected and when the failed controller
>>>> joins back. There might be a problem with the timeout settings of the
>>>> different components of the stack.
>>>>
>>>> Keep me posted with your findings
>>>>
>>>> Emmanuel
>>>>
>>>> ------------------------------------------------------------------------
>>>>

--
Emmanuel Cecchet
FTO @ Frog Thinker
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: ma...@fr...
Skype: emmanuel_cecchet
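(Editorial note on the same-interface advice in the quoted thread above: with a JGroups UDP transport this is usually done by pinning bind_addr to the IP address that the controller also uses for client and database traffic, rather than letting JGroups pick an interface on its own. The address and multicast settings below are illustrative only, not taken from this setup.)

  <!-- bind group communication to the same IP the controller uses for clients/databases (illustrative) -->
  <UDP bind_addr="10.0.0.23"
       mcast_addr="228.8.8.8"
       mcast_port="45566"
       ip_ttl="32"/>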