From: Emmanuel C. <ma...@fr...> - 2010-05-04 10:20:15
Hi Seby,

When JGroups reported the MERGE messages in the log, did you have Hedera
DEBUG logs enabled too? If that is the case, the message was never handled
by Hedera, which is a problem. The new view should have been installed
anyway by the view synchrony layer, and Hedera should at least catch that.

Can you confirm that the Hedera logs are enabled? Could you also set the
Distributed Virtual Database logs to DEBUG?
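The DEBUG levels are set in log4j.properties on each controller. A minimal
sketch, using the category names that appear in your own log excerpt below
(these are assumptions; please check the exact categories against the
log4j.properties shipped with your Sequoia version):

  # Hedera group communication traces (assumed category name)
  log4j.logger.continuent.hedera=DEBUG
  # Controller / distributed virtual database traces (assumed category name)
  log4j.logger.controller=DEBUG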
Did you try to update Hedera to a newer version?

Thanks
Emmanuel

> Hi Emmanuel,
>
> Do you need more logs on this? Please let me know.
>
> Thanks,
> Seby.
>
> -----Original Message-----
> From: seq...@li... [mailto:seq...@li...] On Behalf Of Francis, Seby
> Sent: Monday, March 29, 2010 1:51 PM
> To: Sequoia general mailing list
> Cc: seq...@li...
> Subject: Re: [Sequoia] Failure detection
>
> Hi Emmanuel,
>
> I've tried a different JGroups configuration and now I can see in the
> logs that the groups are merging. But for some reason, Sequoia never
> shows that they have merged, i.e. when I run 'show controllers' on the
> console I see only that particular host. Below is a snippet from one of
> the hosts; I see something similar on the other host showing the merge.
> Let me know if you would like to see the debug logs for that time frame.
>
> 2010-03-29 06:59:45,683 DEBUG jgroups.protocols.VERIFY_SUSPECT diff=1507, mbr 10.0.0.33:35974 is dead (passing up SUSPECT event)
> 2010-03-29 06:59:45,687 DEBUG continuent.hedera.gms JGroups reported suspected member:10.0.0.33:35974
> 2010-03-29 06:59:45,688 DEBUG continuent.hedera.gms Member(address=/10.0.0.33:35974, uid=db2) leaves Group(gid=db2).
>
> 2010-03-29 06:59:45,868 INFO controller.requestmanager.cleanup Waiting 30000ms for client of controller 562949953421312 to failover
> 2010-03-29 07:00:15,875 INFO controller.requestmanager.cleanup Cleanup for controller 562949953421312 failure is completed.
>
> -----
> 2010-03-29 07:03:14,725 DEBUG protocols.pbcast.GMS I (10.0.0.23:49731) will be the leader. Starting the merge task for [10.0.0.33:35974, 10.0.0.23:49731]
> 2010-03-29 07:03:14,726 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 running merge task, coordinators are [10.0.0.33:35974, 10.0.0.23:49731]
> 2010-03-29 07:03:14,730 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 sending MERGE_REQ to [10.0.0.33:35974, 10.0.0.23:49731]
> 2010-03-29 07:03:14,746 DEBUG jgroups.protocols.UDP sending msg to 10.0.0.23:49731, src=10.0.0.23:49731, headers are GMS: GmsHeader[MERGE_RSP]: view=[10.0.0.23:49731|2] [10.0.0.23:49731], digest=10.0.0.23:49731: [44 : 47 (47)], merge_rejected=false, merge_id=[10.0.0.23:49731|1269860594727], UNICAST: [UNICAST: DATA, seqno=4], UDP: [channel_name=db2]
> 2010-03-29 07:03:14,748 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 responded to 10.0.0.23:49731, merge_id=[10.0.0.23:49731|1269860594727]
> 2010-03-29 07:03:14,766 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 expects 2 responses, so far got 2 responses
> 2010-03-29 07:03:14,766 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 collected 2 merge response(s) in 36 ms
> 2010-03-29 07:03:14,772 DEBUG protocols.pbcast.GMS Merge leader 10.0.0.23:49731 computed new merged view that will be MergeView::[10.0.0.23:49731|3] [10.0.0.23:49731, 10.0.0.33:35974], subgroups=[[10.0.0.23:49731|2] [10.0.0.23:49731], [10.0.0.33:35974|2] [10.0.0.33:35974]]
> 2010-03-29 07:03:14,773 DEBUG protocols.pbcast.GMS 10.0.0.23:49731 is sending merge view [10.0.0.23:49731|3] to coordinators [10.0.0.33:35974, 10.0.0.23:49731
>
> Seby.
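The last GMS line above shows JGroups computing and sending the MergeView,
so the merge does complete at the JGroups level. That MergeView is exactly
the event Hedera has to pick up and turn into a new group view for
Sequoia. For reference, a minimal sketch of how a plain JGroups 2.4
application observes it (illustrative only, not Hedera's actual code; the
stack file name is a placeholder and "db2" matches the channel_name in
your logs):

  import org.jgroups.JChannel;
  import org.jgroups.MergeView;
  import org.jgroups.ReceiverAdapter;
  import org.jgroups.View;

  public class MergeWatcher extends ReceiverAdapter {
      // Called by JGroups for every installed view; after a network
      // partition heals, the view arrives as a MergeView.
      public void viewAccepted(View view) {
          if (view instanceof MergeView) {
              MergeView merge = (MergeView) view;
              // The subgroups are the views that existed during the partition.
              System.out.println("Merged view: " + merge
                  + ", subgroups: " + merge.getSubgroups());
          } else {
              System.out.println("New view: " + view);
          }
      }

      public static void main(String[] args) throws Exception {
          JChannel channel = new JChannel("udp.xml"); // placeholder stack config
          channel.setReceiver(new MergeWatcher());
          channel.connect("db2");
      }
  }

If the Hedera DEBUG log stays silent while GMS prints the MergeView, the
event is being lost between these two layers.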
> -----Original Message-----
> From: seq...@li... [mailto:seq...@li...] On Behalf Of Emmanuel Cecchet
> Sent: Wednesday, March 24, 2010 10:41 AM
> To: Sequoia general mailing list
> Cc: seq...@li...
> Subject: Re: [Sequoia] Failure detection
>
> Hi Seby,
>
> Sorry for the late reply, I have been very busy these past days.
> This seems to be a JGroups issue that could probably be better answered
> by Bela Ban on the JGroups mailing list. I have seen emails on the list
> these past days from people having a similar problem.
> I would recommend that you post an email on the JGroups mailing list
> with your JGroups configuration and the messages you see about MERGE
> failing.
>
> Keep me posted
> Emmanuel
>
>> Also, here is the error which I see in the logs:
>>
>> 2010-03-22 08:31:15,912 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 expects 2 responses, so far got 1 responses
>> 2010-03-22 08:31:15,913 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 waiting 382 msecs for merge responses
>> 2010-03-22 08:31:16,313 DEBUG protocols.pbcast.GMS At 10.10.10.23:39729 cancelling merge due to timer timeout (5000 ms)
>> 2010-03-22 08:31:16,314 DEBUG protocols.pbcast.GMS cancelling merge (merge_id=[10.10.10.23:39729|1269261071286])
>> 2010-03-22 08:31:16,316 DEBUG protocols.pbcast.GMS resumed ViewHandler
>> 2010-03-22 08:31:16,317 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 expects 2 responses, so far got 0 responses
>> 2010-03-22 08:31:16,317 DEBUG protocols.pbcast.GMS Merge leader 10.10.10.23:39729 collected 0 merge response(s) in 5027 ms
>> 2010-03-22 08:31:16,318 WARN protocols.pbcast.GMS Merge aborted. Merge leader did not get MergeData from all subgroup coordinators [10.10.10.33:38822, 10.10.10.23:39729]
>>
>> -----Original Message-----
>> From: Francis, Seby
>> Sent: Monday, March 22, 2010 1:03 PM
>> To: 'Sequoia general mailing list'
>> Cc: seq...@li...
>> Subject: RE: [Sequoia] Failure detection
>>
>> Hi Emmanuel,
>>
>> I've updated my JGroups to the version you mentioned, but I still see
>> the issue with merging the groups. One of the controllers loses track
>> after the failure and won't merge. Can you please give me a hand
>> figuring out where it goes wrong? I have the debug logs; shall I send
>> them as a zip file?
>>
>> Thanks,
>> Seby.
>>
>> -----Original Message-----
>> From: seq...@li... [mailto:seq...@li...] On Behalf Of Emmanuel Cecchet
>> Sent: Thursday, March 18, 2010 10:22 PM
>> To: Sequoia general mailing list
>> Cc: seq...@li...
>> Subject: Re: [Sequoia] Failure detection
>>
>> Hi Seby,
>>
>> I looked into the mailing list archive and this version of JGroups has
>> a number of significant bugs. An issue was filed
>> (http://forge.continuent.org/jira/browse/SEQUOIA-1130) and I fixed it
>> for Sequoia 4. Just using a drop-in replacement of the JGroups core jar
>> for Sequoia 2.10.10 might work. You might have to update the Hedera
>> jars as well, but it could work with the old ones too.
>>
>> Let me know if the upgrade does not work
>> Emmanuel
>>
>>> Thanks for your support!!
>>>
>>> I'm using jgroups-core.jar version 2.4.2 which came with
>>> "sequoia-2.10.10". My Solaris test servers have only a single
>>> interface and I'm using the same IP for both group and db/client
>>> communications. I ran a test again removing "STATE_TRANSFER" and
>>> attached the logs. At around 13:36, I took the host1 interface down
>>> and brought it back up around 13:38. After that, when I ran 'show
>>> controllers' on the console, host1 showed both controllers while
>>> host2 showed only its own name in the member list.
>>>
>>> Regards,
>>> Seby.
>>>
>>> -----Original Message-----
>>> Hi Seby,
>>>
>>> Welcome to the wonderful world of group communications!
>>>
>>>> I've tried various FD options and could not get it working when one
>>>> of the hosts fails. I can see the message 'A leaving group' on live
>>>> controller B when I shut down the interface of A. This works as
>>>> expected and the virtual db is still accessible/writable as
>>>> controller B is alive. But when I bring the interface on A back up,
>>>> controller A shows (show controllers) that the virtual db is hosted
>>>> by controllers A & B while controller B just shows B. And the data
>>>> inserted into the vdb hosted by controller B is NOT being replayed
>>>> on A. This will cause inconsistencies in the data between the
>>>> virtual dbs. Is there a way we can disable the backend if the
>>>> network goes down, so that I can recover the db using the backup?
>>>
>>> There is a problem with your group communication configuration if
>>> controllers have different views of the group. That should not happen.
>>>
>>>> I've also noticed that in some cases, if I take one of the host
>>>> interfaces down, both of them think that the other controller
>>>> failed. This will also create issues. In my case, I only have two
>>>> controllers hosted. Is it possible to ping a network gateway? That
>>>> way the controller knows that it is the one which failed and can
>>>> disable the backend.
>>>
>>> The best solution is to use the same interface for group
>>> communication and client/database communications. If you use a
>>> dedicated network for group communications and this network fails,
>>> you will end up with a network partition and this is very bad. If all
>>> communications go through the same interface, when it goes down, all
>>> communications are down and the controller will not be able to serve
>>> stale data.
>>>
>>> You don't need STATE_TRANSFER as Sequoia has its own state transfer
>>> protocol when a new member joins a group. Which version of JGroups
>>> are you using? Could you send me the logs with the JGroups messages
>>> that you see on each controller by activating them in
>>> log4j.properties? I would need the initial sequence when you start
>>> the cluster and the messages you see when the failure is detected and
>>> when the failed controller joins back. There might be a problem with
>>> the timeout settings of the different components of the stack.
>>>
>>> Keep me posted with your findings
>>>
>>> Emmanuel

--
Emmanuel Cecchet
FTO @ Frog Thinker
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: ma...@fr...
Skype: emmanuel_cecchet
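PS: since the timeout settings came up again above, for reference this is
roughly what a JGroups 2.4-era UDP stack without STATE_TRANSFER looks
like. The protocol list matches what your logs show (FD, VERIFY_SUSPECT,
GMS merge task), but every value here is an illustrative default, not a
recommendation for your network, and the exact attribute names should be
checked against the JGroups version you deploy:

  <config>
      <UDP mcast_addr="228.8.8.8" mcast_port="45566" ip_ttl="32"/>
      <PING timeout="2000" num_initial_members="2"/>
      <!-- MERGE2 periodically looks for subgroups to merge after a partition -->
      <MERGE2 min_interval="5000" max_interval="10000"/>
      <FD_SOCK/>
      <!-- FD drives how fast a dead member is suspected -->
      <FD timeout="2500" max_tries="5" shun="true"/>
      <VERIFY_SUSPECT timeout="1500"/>
      <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800"/>
      <UNICAST timeout="600,1200,2400"/>
      <pbcast.STABLE desired_avg_gossip="20000"/>
      <!-- merge_timeout bounds the merge task seen in your GMS logs -->
      <pbcast.GMS print_local_addr="true" join_timeout="5000"
                  join_retry_timeout="2000" merge_timeout="10000" shun="true"/>
  </config>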