From: Ted D. <tdu...@ve...> - 2008-04-24 23:42:07
You are correct. Generally all the servers in a cluster need to keep this state handy in some form or other, so this isn't an issue (usually). The state can be minute, since all the master needs to keep is the zxid corresponding to the last time it was up to date; a quick stat scan of the children will tell you which ones are newer than that.

Note that the problem is actually worse than you say, since you only ever get a single notification from a single ZooKeeper watch. That means you will be notified when the first change happens, but if a second change happens before you look (and set a watch) again, you won't know about it. This double-update issue means you need to scan for changes even when you already know which change you were notified about. If keeping state is a problem, your master can keep that state in ZooKeeper. :-)

The other suggestion, a shared configuration file that everybody can update, suffers from similar issues but can probably be made to work. I find that I prefer to have a single writer for each file so that ZooKeeper will keep things straight for me. I would rather not have to do the dance of trying to write a specific version and then, on failure, applying the same transaction to the new data. I appreciate that the idiom is possible, but I would rather simplify matters.
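As a concrete illustration of this rescan-and-rewatch pattern, here is a minimal sketch against the current org.apache.zookeeper Java API. The /slaves path, the SlaveScanner class and its methods are illustrative assumptions, not katta or ZooKeeper code.

// Minimal sketch (not katta code): on every NodeChildrenChanged event,
// re-read the children with a new watch and diff against the last known
// set, so that changes that happened between notifications are not lost.
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class SlaveScanner {

    private final ZooKeeper zk;
    private Set<String> knownSlaves = new HashSet<String>();

    public SlaveScanner(ZooKeeper zk) {
        this.zk = zk;
    }

    // Call once at startup and again from the connection's Watcher
    // whenever a NodeChildrenChanged event arrives for /slaves.
    public synchronized void rescan() throws KeeperException, InterruptedException {
        // 'true' re-registers the default watch together with the read.
        List<String> current = zk.getChildren("/slaves", true);

        Set<String> added = new HashSet<String>(current);
        added.removeAll(knownSlaves);            // slaves that appeared

        Set<String> removed = new HashSet<String>(knownSlaves);
        removed.removeAll(current);              // ephemeral nodes that vanished

        // ... assign shards to 'added', redistribute shards of 'removed' ...
        // Alternatively, keep the last-seen zxid and stat each child with
        // zk.exists(path, false), comparing Stat.getMzxid() against it, as
        // suggested above.

        knownSlaves = new HashSet<String>(current);
    }
}

The point is that the diff (or the zxid comparison) is driven by the small amount of state the master already keeps, so nothing is lost when several changes collapse into a single notification.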
On 4/24/08 4:30 PM, "Stefan Groschupf" <sg...@10...> wrote:

> Hi Ted,
> I like this idea a lot, though I noticed you get a notification on the
> parent folder.
> For example, adding a node "/slaves/server1" triggers a notification on
> "/slaves".
> Which basically means my master needs to know the children of "/slaves"
> from before in order to identify the new node in the "/slaves" path. Which
> means the master has a kind of state.
> Or is there anything I'm missing?
> Thanks
> Stefan
>
> On Apr 22, 2008, at 3:28 PM, Ted Dunning wrote:
>
>> I think I would use a redundant representation where there is
>>
>> * a directory with ephemeral files, one per slave
>>
>> * one directory per slave containing references to the shards it should
>> serve
>>
>> * one directory per shard with ephemeral files per slave serving the
>> shard
>>
>> Suppose the whole system is rooted at /katta; these directories would be
>> /katta/slaves, /katta/shard-assignments, /katta/shard-servers. The purpose
>> of /katta/slaves is to allow a quick inventory of all live servers and to
>> have a single location where notifications of slave death or birth can be
>> done. The purpose of /katta/shard-assignments is to allow the slaves to
>> each have a directory on which they can listen for new shards or old
>> shards that are being taken away. The purpose of /katta/shard-servers is
>> so that it is easy to find under-served shards whenever a slave goes down
>> or comes up. I think that this structure is such that no locks need ever
>> be used to ensure consistency.
>>
>> One implementation that I have had pretty good luck with is to build a
>> stateless REST layer that does complex manipulations on a ZooKeeper
>> storage layer. This lets me think in terms of large-scale operations (get
>> a batch of work) and inherently allows easy access from a variety of
>> different languages. Since the REST layer is stateless, it can be scaled
>> effortlessly.
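To make the proposed layout concrete, here is a rough master-side sketch under the same assumptions (org.apache.zookeeper Java API, hypothetical class and method names); it shows only the ZooKeeper bookkeeping, not the decision of which shard goes to which slave.

// Hypothetical master-side helper for the /katta layout proposed above.
import java.util.Arrays;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class KattaLayout {

    private static final byte[] NO_DATA = new byte[0];
    private final ZooKeeper zk;

    public KattaLayout(ZooKeeper zk) {
        this.zk = zk;
    }

    // Create the persistent skeleton once, ignoring "already exists" errors.
    public void bootstrap() throws KeeperException, InterruptedException {
        for (String path : Arrays.asList("/katta", "/katta/slaves",
                "/katta/shard-assignments", "/katta/shard-servers")) {
            try {
                zk.create(path, NO_DATA, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                        CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException e) {
                // fine, already created
            }
        }
    }

    // Assign a shard to a slave: the slave's watch on its own assignment
    // directory fires, and the slave can then open the shard and register
    // under /katta/shard-servers/<shard>.
    public void assignShard(String slave, String shard)
            throws KeeperException, InterruptedException {
        String slaveDir = "/katta/shard-assignments/" + slave;
        try {
            zk.create(slaveDir, NO_DATA, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException e) {
            // directory already present
        }
        zk.create(slaveDir + "/" + shard, NO_DATA,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Shards with no children under /katta/shard-servers are under-served.
    public List<String> serversOf(String shard)
            throws KeeperException, InterruptedException {
        return zk.getChildren("/katta/shard-servers/" + shard, false);
    }
}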
>> On 4/22/08 3:15 PM, "Stefan Groschupf" <sg...@10...> wrote:
>>
>>> As mentioned, we plan to build a distributed Lucene serving grid system:
>>> one virtual index composed of many Lucene indexes (shards).
>>> Each slave should serve a set of shards. The master needs to assign
>>> shards to a slave when it becomes available or a new index is deployed.
>>> A slave needs to get a notification when a new shard is assigned to it.
>>> The master needs to get notifications in case a slave dies or for some
>>> reason cannot serve a shard.
>>> The master also needs to know all the shards a slave serves.
>>> How would you recommend structuring this information?
>>> One file per slave, updated by master and slave, containing all shard
>>> metadata?
>>> Or one folder per node, with each shard as one file?
>>> Or...?
>>>
>>> Thanks a lot!
>>> Stefan
>>>
>>> In our case we want to manage shards assign
>>> On Apr 22, 2008, at 2:55 PM, Mahadev Konar wrote:
>>>
>>>> Thanks Ted for the reasons you listed. I would just like to clarify
>>>> that the 1MB limit is the limit per node. ZooKeeper keeps its data tree
>>>> in memory, so theoretically you are just limited by the amount of
>>>> memory on the machines you are running ZooKeeper on. Most of our users
>>>> use ZooKeeper for failure detection of slave nodes, master discovery
>>>> and also for assigning workloads among nodes. They haven't had any
>>>> memory issues.
>>>>
>>>> Regards
>>>> Mahadev
>>>>
>>>>> -----Original Message-----
>>>>> From: zoo...@li... [mailto:zookeeper-user-bo...@li...] On Behalf Of Ted Dunning
>>>>> Sent: Tuesday, April 22, 2008 2:47 PM
>>>>> To: Stefan Groschupf; zoo...@li...
>>>>> Subject: Re: [Zookeeper-user] zookeeper vs. heartbeat messages
>>>>>
>>>>> I would definitely recommend ZooKeeper over heartbeats. Here are my
>>>>> reasons:
>>>>>
>>>>> A) It works. Anything you implement using heartbeats will not work
>>>>> initially, and you will (should) always worry about that
>>>>> implementation because it won't get as much testing.
>>>>>
>>>>> B) It works. Using ephemeral files is much easier than building a good
>>>>> heartbeat API. In particular, heartbeat always implies some sort of
>>>>> fail-over, which is trivial to implement well in ZooKeeper and very
>>>>> hard to implement (correctly) by hand.
>>>>>
>>>>> C) It works. Heartbeat architectures often lead logically to a
>>>>> spiraling number of heartbeats being exchanged. With ZooKeeper, your
>>>>> server will be talking to ZooKeeper, and everybody else will have
>>>>> low-latency updates in case of a problem (if they want).
>>>>>
>>>>> Regarding the amount of data, I doubt you will have a problem if you
>>>>> keep your ZooKeeper files reasonably sized. There is an imposed limit
>>>>> of 1MB, but would you be going anywhere near that?
>>>>>
>>>>> On 4/22/08 2:31 PM, "Stefan Groschupf" <sg...@10...> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I'm new to ZooKeeper and I work on a system that requires classical
>>>>>> master-slave communication similar to Hadoop DFS or HBase.
>>>>>> Though in my case the master needs to know much faster than in other
>>>>>> cases if a slave crashes.
>>>>>> So I wonder if it would make sense to use ZooKeeper instead of
>>>>>> classical heartbeat messages.
>>>>>> Basically the master needs to know if a new slave becomes available
>>>>>> and what kind of data it serves.
>>>>>>
>>>>>> Are there any limitations where the amount of data stored within
>>>>>> ZooKeeper becomes an issue?
>>>>>> Would you recommend using ZooKeeper over heartbeat messages?
>>>>>>
>>>>>> Thanks for any hints.
>>>>>> Stefan
>>>>
>>>
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> 101tec Inc.
>>> Menlo Park, California, USA
>>> http://www.101tec.com
>>
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 101tec Inc.
> Menlo Park, California, USA
> http://www.101tec.com
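For completeness, a matching slave-side sketch under the same assumptions (illustrative names, current org.apache.zookeeper API). The ephemeral node under /katta/slaves is what replaces the heartbeat Stefan asked about: when the slave's session dies, the node disappears and the master's watch on /katta/slaves fires.

// Hypothetical slave-side registration; not katta code.
import java.io.IOException;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class Slave implements Watcher {

    private static final byte[] NO_DATA = new byte[0];
    private final ZooKeeper zk;
    private final String name;

    public Slave(String connectString, String name) throws IOException {
        // The session timeout bounds how quickly the master learns of a
        // crash: when the session expires, the ephemeral nodes vanish.
        this.zk = new ZooKeeper(connectString, 10000, this);
        this.name = name;
    }

    public void register() throws KeeperException, InterruptedException {
        // Ephemeral node: disappears automatically if this slave dies.
        zk.create("/katta/slaves/" + name, NO_DATA,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        // Watch our own assignment directory (created by the master) for
        // shards handed to us.
        zk.getChildren("/katta/shard-assignments/" + name, true);
    }

    // Announce that we now serve a shard, again ephemerally (assumes the
    // master created /katta/shard-servers/<shard>).
    public void serveShard(String shard) throws KeeperException, InterruptedException {
        zk.create("/katta/shard-servers/" + shard + "/" + name, NO_DATA,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            try {
                // Rescan and re-watch, as discussed at the top of the thread.
                List<String> shards =
                        zk.getChildren("/katta/shard-assignments/" + name, true);
                // ... open newly assigned shards, drop removed ones ...
            } catch (KeeperException e) {
                // handle or retry
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}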