From: Flavio J. <fpj...@ya...> - 2008-06-26 10:53:57
|
Hi Martin,

Have you observed any session expiration in your logs? What is the value of tickTime you are using?

-Flavio

----- Original Message ----
From: Martin Schaaf <ms...@10...>
To: zoo...@li...
Sent: Thursday, June 26, 2008 12:05:21 PM
Subject: [Zookeeper-user] Lost connection

Hi,

After some days of running our application that uses zookeeper over 10 servers, we get a connection loss exception when writing into zookeeper.

Caused by: com.yahoo.zookeeper.KeeperException: KeeperErrorCode = ConnectionLoss
    at com.yahoo.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:634)
    at net.sf.katta.zk.ZKClient.getChildren(ZKClient.java:321)
    ... 3 more

We use zookeeper 2.2.0. We can reach the servers all the time. Is this a known problem, and how can we find out what the cause is?

Thanks in advance for every answer.

Bye,
Martin. |
From: Martin S. <ms...@10...> - 2008-06-26 10:05:24
|
Hi,

After some days of running our application that uses zookeeper over 10 servers, we get a connection loss exception when writing into zookeeper.

Caused by: com.yahoo.zookeeper.KeeperException: KeeperErrorCode = ConnectionLoss
    at com.yahoo.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:634)
    at net.sf.katta.zk.ZKClient.getChildren(ZKClient.java:321)
    ... 3 more

We use zookeeper 2.2.0. We can reach the servers all the time. Is this a known problem, and how can we find out what the cause is?

Thanks in advance for every answer.

Bye,
Martin. |
From: Benjamin R. <br...@ya...> - 2008-06-20 07:29:44
|
Thanx Shane! You are correct: the line below the if block should be moved up. I've opened issue ZOOKEEPER-49 for it. The fix and test are included in the ZOOKEEPER-48 patch.

ben

Shane Mingins wrote:
> Hi
>
> Still exploring the ACL stuff in Zookeeper. Tried using setACL for a
> path but get InvalidACL error thrown .... looking at pRequest in
> PrepRequestProcessor ... and in particular these lines ...
>
>     SetACLRequest setAclRequest = new SetACLRequest();
>     if (!fixupACL(request.authInfo, setAclRequest.getAcl())) {
>         throw new KeeperException(Code.InvalidACL);
>     }
>
> a new SetACLRequest will return a null when called in fixupACL
> returning false and throwing the exception .... as far as I can see.
>
> Help??
>
> Cheers, Shane |
From: Benjamin R. <br...@ya...> - 2008-06-20 07:01:15
|
You understand it correctly. The confusion is the result of a bug you have found. I've filed a bug report with a patch for your problem: https://issues.apache.org/jira/browse/ZOOKEEPER-48.

Since there weren't any authenticated ids on the channel, when CREATOR_ALL_ACL was used the ACL was empty. (Only authenticated ids qualify as CREATOR ids.)

Thanx for the report.

ben

Shane Mingins wrote:
> Hi
>
> Something that confuses me a little .... this fails when I have
> commented out the line adding authentication info. Without that line
> it seems that no ACL is added to the node and I would have thought
> that invalid?
>
>     ZooKeeper zk = null;
>     zk = createClient();
>
>     // zk.addAuthInfo("digest", "ben:passwd".getBytes());
>
>     zk.create("/ben2", new byte[0], Ids.CREATOR_ALL_ACL, 0, this, results);
>     zk.close();
>
>     zk = createClient();
>     zk.addAuthInfo("digest", "ben:passwd2".getBytes());
>
>     try {
>         zk.getData("/ben2", false, new Stat());
>         fail("Should have received a permission error");
>     } catch (KeeperException e) {
>         assertEquals(Code.NoAuth, e.getCode());
>     }
>
>     zk.close();
>
> So is using CREATOR_ALL_ACL without adding authentication information
> valid? I was thinking that perhaps it should not be, but I am hoping
> someone can fill me in a bit
>
> Thanks
> Shane |
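To make the explanation above concrete, here is a minimal sketch of how a creator placeholder in an ACL could be expanded against the ids authenticated on a connection. It is not the ZooKeeper source: `AclFixupSketch`, `CREATOR`, `expand` and `isValid` are hypothetical names, and ACL entries are plain strings rather than ZooKeeper objects. It assumes only what the reply states, namely that authenticated ids alone qualify as CREATOR ids, so the ACL comes out empty when no authentication info was added.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch; entries are "scheme:id" strings, not ZooKeeper ACL objects. */
final class AclFixupSketch {
    /** Placeholder standing for "every id authenticated on this connection". */
    static final String CREATOR = "auth:";

    /** Replace each CREATOR placeholder with the connection's authenticated ids. */
    static List<String> expand(List<String> requested, List<String> authenticatedIds) {
        List<String> expanded = new ArrayList<>();
        for (String entry : requested) {
            if (CREATOR.equals(entry)) {
                // Adds nothing at all when nobody has authenticated.
                expanded.addAll(authenticatedIds);
            } else {
                expanded.add(entry);
            }
        }
        return expanded;
    }

    /** An ACL that expands to nothing names nobody, so this sketch rejects it. */
    static boolean isValid(List<String> expanded) {
        return !expanded.isEmpty();
    }

    public static void main(String[] args) {
        List<String> creatorAll = Collections.singletonList(CREATOR);

        // No authentication info was added: the ACL expands to an empty list.
        List<String> anonymous = expand(creatorAll, Collections.<String>emptyList());
        System.out.println(anonymous + " valid=" + isValid(anonymous));

        // One digest id was authenticated first: the ACL names that id.
        List<String> authed = expand(creatorAll, Arrays.asList("digest:ben"));
        System.out.println(authed + " valid=" + isValid(authed));
    }
}
```

Rejecting the empty result is one plausible way to close the gap Shane ran into; the thread does not say whether the ZOOKEEPER-48 patch takes exactly this approach.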
From: Shane M. <smi...@el...> - 2008-06-19 00:24:45
|
Hi

Still exploring the ACL stuff in Zookeeper. Tried using setACL for a path but get InvalidACL error thrown .... looking at pRequest in PrepRequestProcessor ... and in particular these lines ...

    SetACLRequest setAclRequest = new SetACLRequest();
    if (!fixupACL(request.authInfo, setAclRequest.getAcl())) {
        throw new KeeperException(Code.InvalidACL);
    }

a new SetACLRequest will return a null when called in fixupACL returning false and throwing the exception .... as far as I can see.

Help??

Cheers, Shane

Shane Mingins
ELC Technologies (TM)
1921 State Street
Santa Barbara, CA 93101

Phone: +64 4 568 6684
Mobile: +64 21 435 586
Email: smi...@el...
AIM: ShaneMingins
Skype: shane.mingins

(866) 863-7365 Tel - Santa Barbara Office
(866) 893-1902 Fax - Santa Barbara Office

+44 020 7504 1346 Tel - London Office
+44 020 7504 1347 Fax - London Office

http://www.elctech.com

--------------------------------------------------------------------
Privacy and Confidentiality Notice:
The information contained in this electronic mail message is intended for the named recipient(s) only. It may contain privileged and confidential information. If you are not an intended recipient, you must not copy, forward, distribute or take any action in reliance on it. If you have received this electronic mail message in |
From: Shane M. <smi...@el...> - 2008-06-18 00:59:16
|
Hi

Something that confuses me a little .... this fails when I have commented out the line adding authentication info. Without that line it seems that no ACL is added to the node and I would have thought that invalid?

    ZooKeeper zk = null;
    zk = createClient();

    // zk.addAuthInfo("digest", "ben:passwd".getBytes());

    zk.create("/ben2", new byte[0], Ids.CREATOR_ALL_ACL, 0, this, results);
    zk.close();

    zk = createClient();
    zk.addAuthInfo("digest", "ben:passwd2".getBytes());

    try {
        zk.getData("/ben2", false, new Stat());
        fail("Should have received a permission error");
    } catch (KeeperException e) {
        assertEquals(Code.NoAuth, e.getCode());
    }

    zk.close();

So is using CREATOR_ALL_ACL without adding authentication information valid? I was thinking that perhaps it should not be, but I am hoping someone can fill me in a bit

Thanks
Shane

Shane Mingins
ELC Technologies (TM)
1921 State Street
Santa Barbara, CA 93101

Phone: +64 4 568 6684
Mobile: +64 21 435 586
Email: smi...@el...
AIM: ShaneMingins
Skype: shane.mingins

(866) 863-7365 Tel - Santa Barbara Office
(866) 893-1902 Fax - Santa Barbara Office

+44 020 7504 1346 Tel - London Office
+44 020 7504 1347 Fax - London Office

http://www.elctech.com

--------------------------------------------------------------------
Privacy and Confidentiality Notice:
The information contained in this electronic mail message is intended for the named recipient(s) only. It may contain privileged and confidential information. If you are not an intended recipient, you must not copy, forward, distribute or take any action in reliance on it. If you have received this electronic mail message in |
From: Jacob L. <jy...@ya...> - 2008-06-17 17:42:17
|
Ben described the general outline of the protocol we implemented, which is an improvement on the recipe to avoid a herd effect every time the leader changes. This improvement was actually suggested by Runping Qi of Yahoo!. The recipe protocol requires all clients to recompute whether they are now the leader when the current leader either relinquishes leadership or disconnects. The improved protocol guarantees that only one client will need to recompute.

Here's the algorithm:

A persistent ZNode is used as the parent of one or more ephemeral ZNodes. These ephemeral ZNodes represent the bids of different clients to become the leader.

When a client wants to bid to become the leader, it creates an ephemeral sequence node and records the sequence number. Then, to compute whether it is the leader, the client scans backwards from the sequence number it was assigned down to 0, to find any preceding bids. If a preceding bid is found, the client places a watch on that ZNode, so it is informed when that ZNode is deleted. The deletion represents the owner client relinquishing leadership or disconnecting.

When the watch event is received by a client, it scans backwards from its assigned sequence number to 0, to find a preceding bid. If none is found, then this client is now the leader. If a preceding bid is found, the client places a new watch on that ZNode, and waits again.

Note that this protocol handles the situation when the current leader disconnects or abdicates, as well as the situation where a preceding-bid but non-leader client disconnects. In both cases, only one client gets a watch notification, so no herd effect is observed.

Please ask if you need more details. This protocol will be part of the client library I'm implementing -- however do not get your hopes up too high, because at this time I do not know whether the library will be released outside of Yahoo!.

--Jacob

-----Original Message-----
From: zoo...@li... [mailto:zoo...@li...] On Behalf Of Benjamin Reed
Sent: Tuesday, June 17, 2008 10:27 AM
To: zoo...@li...
Subject: Re: [Zookeeper-user] Leader Election

Good point. The recipe we show guarantees there will be a single leader elected, but only the leader knows it. Jacob Levy has been implementing a client library to do leader election, so he should really chime in here, but just in case he doesn't: I believe Jacob's solution was for the leader to create an ephemeral znode called LEADER with its id as the data when it becomes the leader, and then delete the node before relinquishing leadership. The other nodes then watch for the existence of the LEADER znode to see leadership changes.

ben

On Tuesday 17 June 2008 09:28:39 Avinash Lakshman wrote:
> Hi All
>
> I am trying to write a simple leader election module and I have 5 nodes A,
> B, C, D and E amongst which I need to elect a leader. Now I am following
> the example using SEQUENCE flags and trying to use the technique where the
> herd effect can be done away with. So I have A create a znode L-1, B create
> znode L-2 .... and E create znode L-5. After this I have L-2 watch L-1, L-3
> watch L-2 etc. Let us assume A was elected leader. When A dies B should
> automatically become the leader and this seems to be working. What I need
> to know is how do C, D and E know about this? Do I need another mechanism
> to disseminate this information? I ask because not all znodes are being
> watched, i.e. C, D and E are not watching for L-1 which is the znode created
> by A. So how will they learn as to who the new leader is since no watch
> event will be triggered at their end.
>
> Thanks in advance
> Avinash |
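The algorithm Jacob describes can be sketched without a ZooKeeper server. The code below is an in-memory simulation under stated assumptions, not his client library: `ElectionSketch` and its methods are hypothetical names, a `TreeMap` stands in for the ephemeral sequence children of the persistent parent znode, and a second map stands in for the watches. It shows the property the message claims: deleting any bid notifies at most one other client.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;
import java.util.TreeMap;

/** In-memory simulation of the predecessor-watch election; no ZooKeeper calls. */
final class ElectionSketch {
    // sequence number -> client name; each entry stands for one bid znode
    private final TreeMap<Integer, String> bids = new TreeMap<>();
    // watched sequence number -> sequence number of the bid watching it
    private final Map<Integer, Integer> watches = new HashMap<>();
    private int nextSequence = 0;

    /** Create a bid, work out its role, and return its sequence number. */
    int bid(String client) {
        int seq = nextSequence++;
        bids.put(seq, client);
        recompute(seq);
        return seq;
    }

    /** Scan backwards from seq - 1 down to 0 for the nearest surviving bid. */
    OptionalInt precedingBid(int seq) {
        for (int s = seq - 1; s >= 0; s--) {
            if (bids.containsKey(s)) {
                return OptionalInt.of(s);
            }
        }
        return OptionalInt.empty();
    }

    /** Leader if no bid precedes this one; otherwise watch the preceding bid. */
    boolean recompute(int seq) {
        OptionalInt preceding = precedingBid(seq);
        if (preceding.isPresent()) {
            watches.put(preceding.getAsInt(), seq);
            return false;
        }
        return true;
    }

    /** Delete a bid (abdication or disconnect); return the clients notified. */
    List<String> delete(int seq) {
        bids.remove(seq);
        // A watch set by the departing client disappears along with it.
        watches.values().removeIf(watcher -> watcher == seq);
        List<String> notified = new ArrayList<>();
        Integer watcher = watches.remove(seq);
        if (watcher != null) {
            notified.add(bids.get(watcher));
            recompute(watcher); // the single notified client rescans
        }
        return notified;
    }

    /** The client holding the lowest surviving bid, or null if there is none. */
    String leader() {
        Map.Entry<Integer, String> first = bids.firstEntry();
        return first == null ? null : first.getValue();
    }

    public static void main(String[] args) {
        ElectionSketch election = new ElectionSketch();
        int a = election.bid("A");
        int b = election.bid("B");
        election.bid("C");

        // B (not the leader) disconnects: only C, which watched B, is notified.
        System.out.println(election.delete(b) + " leader=" + election.leader());
        // A (the leader) disconnects: again only C is notified, and C now leads.
        System.out.println(election.delete(a) + " leader=" + election.leader());
    }
}
```

Because each client watches only its nearest surviving predecessor, every bid has at most one watcher, which is why a single map entry per watched bid is enough in this sketch.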
From: Benjamin R. <br...@ya...> - 2008-06-17 17:27:13
|
Good point. The recipe we show guarantees there will be a single leader elected, but only the leader knows it. Jacob Levy has been implementing a client library to do leader election, so he should really chime in here, but just in case he doesn't: I believe Jacob's solution was for the leader to create an ephemeral znode called LEADER with its id as the data when it becomes the leader, and then delete the node before relinquishing leadership. The other nodes then watch for the existence of the LEADER znode to see leadership changes.

ben

On Tuesday 17 June 2008 09:28:39 Avinash Lakshman wrote:
> Hi All
>
> I am trying to write a simple leader election module and I have 5 nodes A,
> B, C, D and E amongst which I need to elect a leader. Now I am following
> the example using SEQUENCE flags and trying to use the technique where the
> herd effect can be done away with. So I have A create a znode L-1, B create
> znode L-2 .... and E create znode L-5. After this I have L-2 watch L-1, L-3
> watch L-2 etc. Let us assume A was elected leader. When A dies B should
> automatically become the leader and this seems to be working. What I need
> to know is how do C, D and E know about this? Do I need another mechanism
> to disseminate this information? I ask because not all znodes are being
> watched, i.e. C, D and E are not watching for L-1 which is the znode created
> by A. So how will they learn as to who the new leader is since no watch
> event will be triggered at their end.
>
> Thanks in advance
> Avinash |
From: Avinash L. <avi...@gm...> - 2008-06-17 16:28:32
|
Hi All

I am trying to write a simple leader election module and I have 5 nodes A, B, C, D and E amongst which I need to elect a leader. Now I am following the example using SEQUENCE flags and trying to use the technique where the herd effect can be done away with. So I have A create a znode L-1, B create znode L-2 .... and E create znode L-5. After this I have L-2 watch L-1, L-3 watch L-2 etc. Let us assume A was elected leader. When A dies B should automatically become the leader and this seems to be working.

What I need to know is how do C, D and E know about this? Do I need another mechanism to disseminate this information? I ask because not all znodes are being watched, i.e. C, D and E are not watching for L-1 which is the znode created by A. So how will they learn as to who the new leader is since no watch event will be triggered at their end.

Thanks in advance
Avinash |
From: Benjamin R. <br...@ya...> - 2008-06-17 15:08:54
|
The client library is in charge of preventing expirations. If an expiration happens, it is probably because of dead servers or network problems that caused a spike in latency. Increasing your session timeout helps prevent this.

If/when it happens, you need to create another ZooKeeper object to reconnect to ZooKeeper. If you had any ephemeral nodes, they will be gone. For applications that are just reading things from ZooKeeper or updating status znodes, the recovery is very simple. Master applications that set up complex ZooKeeper subtrees with ephemeral znodes at initialization need to rerun their initialization logic. From ZooKeeper's point of view, the application has restarted when a new ZooKeeper object is created.

ben

On Tuesday 17 June 2008 00:34:58 Avinash Lakshman wrote:
> How do I prevent my session from timing out? I get this exception:
>
> Priming connection to java.nio.channels.SocketChannel[connected local=/10.16.138.101:8352 remote=fool.xyz.com/10.18.39.211:5001]
> WARN - Closing:
> java.io.IOException: Session Expired
>     at com.yahoo.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:406)
>     at com.yahoo.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:492)
>     at com.yahoo.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:705)
> ERROR - from SendThread
> java.lang.NoSuchMethodError: org.apache.log4j.Logger.isTraceEnabled()Z
>     at com.yahoo.zookeeper.server.ZooTrace.isTraceEnabled(ZooTrace.java:63)
>     at com.yahoo.zookeeper.server.ZooTrace.logTraceMessage(ZooTrace.java:67)
>     at com.yahoo.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:732)
>
> Why would this happen and how can I prevent this from happening? How should
> the software react to this situation?
>
> Avinash |
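The recovery pattern Ben outlines (build a new client object, then rerun initialization because ephemeral nodes are gone) might look roughly like the sketch below. Everything in it is a stand-in so that the example runs on its own: `FakeSession` is a hypothetical in-memory handle, not the ZooKeeper class, and the expiration is signalled with a plain `IllegalStateException` rather than any real client exception.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

/** Sketch of "new handle + rerun initialization" after a session expires. */
final class SessionRecoverySketch {
    /** Hypothetical stand-in for a client handle; it is NOT the ZooKeeper class. */
    static final class FakeSession {
        final Set<String> ephemerals = new HashSet<>();
        private boolean expired = false;

        void createEphemeral(String path) {
            if (expired) {
                throw new IllegalStateException("Session Expired");
            }
            ephemerals.add(path);
        }

        /** Simulates the servers expiring the session: ephemeral nodes vanish. */
        void expire() {
            expired = true;
            ephemerals.clear();
        }
    }

    private final Consumer<FakeSession> initialization;
    private FakeSession session;
    private int initializationRuns = 0;

    SessionRecoverySketch(Consumer<FakeSession> initialization) {
        this.initialization = initialization;
        connect();
    }

    /** Build a fresh handle and rerun initialization, as if the app restarted. */
    private void connect() {
        session = new FakeSession();
        initialization.accept(session);
        initializationRuns++;
    }

    /** Run an operation; on expiration, reconnect and retry it once. */
    void run(Consumer<FakeSession> operation) {
        try {
            operation.accept(session);
        } catch (IllegalStateException expired) {
            connect();
            operation.accept(session);
        }
    }

    FakeSession session() {
        return session;
    }

    int initializationRuns() {
        return initializationRuns;
    }

    public static void main(String[] args) {
        SessionRecoverySketch app =
                new SessionRecoverySketch(s -> s.createEphemeral("/workers/w1"));
        app.session().expire(); // e.g. after a latency spike
        app.run(s -> s.createEphemeral("/status/w1"));
        System.out.println(app.session().ephemerals
                + " initializationRuns=" + app.initializationRuns());
    }
}
```

Retrying the failed operation once is a simplification that is only safe when the operation can be repeated; a real application would decide that case by case.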
From: Benjamin R. <br...@ya...> - 2008-06-17 14:59:17
|
You can do this. We have only done preliminary testing with that configuration. You may need to increase the tickTime if the pipe between A and B has high latency and you start seeing timeouts.

There is a bit of a gotcha that you need to keep in mind. Currently clients select servers at random to connect to, so if you aren't careful your performance will actually start going down, since two sevenths of the clients in A will connect to servers in B and five sevenths of the clients in B will connect to A. We haven't yet put latency awareness into the client code.

So, the best thing to do would be to configure the clients in B to only connect to the servers in B and the clients in A to only connect to the servers in A. This will get you the performance that you are looking for, but it will mean that if both servers in B go down, the clients in B will see ZooKeeper as unavailable even though the cluster is still running. I've opened an issue, https://issues.apache.org/jira/browse/ZOOKEEPER-46, to address this.

ben

On Monday 16 June 2008 16:01:59 Avinash Lakshman wrote:
>
> How would you guys recommend we set up the Zookeeper cluster across data
> centers? We have a system that runs in a cluster spread across 2 data
> centers A and B and want to have it integrated with Zookeeper. So far we
> had Zookeeper cluster set up across a 5 node cluster in data center A. Is
> it ok to set up 2 nodes out of the current Zookeeper cluster in data center
> B so that the nodes in data center B may read from the local instances in
> data center B? What would you guys recommend?
>
> Please advise
>
> Avinash |
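The fractions in Ben's reply follow from uniform random server selection over an ensemble of 5 servers in A plus 2 in B. The small sketch below works through that arithmetic and the availability remark. `CrossDataCenterSketch` and its method names are hypothetical, and the majority rule in `hasMajority` is an assumption about how the ensemble decides it is still running; the reply itself only says the cluster stays up when both B servers are down.

```java
/** Arithmetic behind random server selection across two data centers. */
final class CrossDataCenterSketch {
    /** Fraction of clients that land on a remote server when picking uniformly. */
    static double remoteFraction(int localServers, int remoteServers) {
        return (double) remoteServers / (localServers + remoteServers);
    }

    /** Assumed rule: the ensemble keeps running while a strict majority is up. */
    static boolean hasMajority(int ensembleSize, int serversUp) {
        return serversUp > ensembleSize / 2;
    }

    public static void main(String[] args) {
        int serversInA = 5;
        int serversInB = 2;

        // Clients in A reach across to B with probability 2/7.
        System.out.println("A -> B: " + remoteFraction(serversInA, serversInB));
        // Clients in B reach across to A with probability 5/7.
        System.out.println("B -> A: " + remoteFraction(serversInB, serversInA));

        // Both B servers down: 5 of 7 remain, so the ensemble is still up,
        // but clients pinned to B's servers have nothing left to connect to.
        System.out.println("ensemble up: "
                + hasMajority(serversInA + serversInB, serversInA));
    }
}
```

Pinning clients to their local servers removes the cross-data-center hops at the price described in the reply: the B clients lose service as soon as both local servers fail.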
From: Avinash L. <avi...@gm...> - 2008-06-17 14:43:45
|
How do I prevent my session from timing out? I get this exception:

Priming connection to java.nio.channels.SocketChannel[connected local=/10.16.138.101:8352 remote=fool.xyz.com/10.18.39.211:5001]
WARN - Closing:
java.io.IOException: Session Expired
    at com.yahoo.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:406)
    at com.yahoo.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:492)
    at com.yahoo.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:705)
ERROR - from SendThread
java.lang.NoSuchMethodError: org.apache.log4j.Logger.isTraceEnabled()Z
    at com.yahoo.zookeeper.server.ZooTrace.isTraceEnabled(ZooTrace.java:63)
    at com.yahoo.zookeeper.server.ZooTrace.logTraceMessage(ZooTrace.java:67)
    at com.yahoo.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:732)

I have set my session timeout to 5000ms in the constructor of the Zookeeper instance. I have my Zookeeper instance running in colo A while the clients that are using Zookeeper are located in colo B and C. Is this ok? Why would this happen and how can I prevent this from happening? How should the software react to this situation?

Thanks in advance
Avinash |
From: Avinash L. <avi...@gm...> - 2008-06-17 07:34:52
|
How do I prevent my session from timing out? I get this exception:

Priming connection to java.nio.channels.SocketChannel[connected local=/10.16.138.101:8352 remote=fool.xyz.com/10.18.39.211:5001]
WARN - Closing:
java.io.IOException: Session Expired
    at com.yahoo.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:406)
    at com.yahoo.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:492)
    at com.yahoo.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:705)
ERROR - from SendThread
java.lang.NoSuchMethodError: org.apache.log4j.Logger.isTraceEnabled()Z
    at com.yahoo.zookeeper.server.ZooTrace.isTraceEnabled(ZooTrace.java:63)
    at com.yahoo.zookeeper.server.ZooTrace.logTraceMessage(ZooTrace.java:67)
    at com.yahoo.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:732)

Why would this happen and how can I prevent this from happening? How should the software react to this situation?

Avinash |
From: Avinash L. <avi...@gm...> - 2008-06-16 23:01:54
|
Hi All

How would you guys recommend we set up the Zookeeper cluster across data centers? We have a system that runs in a cluster spread across 2 data centers A and B and want to have it integrated with Zookeeper. So far we had Zookeeper cluster set up across a 5 node cluster in data center A. Is it ok to set up 2 nodes out of the current Zookeeper cluster in data center B so that the nodes in data center B may read from the local instances in data center B? What would you guys recommend?

Please advise

Avinash |
From: Chris D. <ch...@pe...> - 2008-06-10 18:26:14
|
Hi -- Thanks for this enlightening thread! Having not had the time to study the code in detail, I too have been uncertain as to whether Paxos was or was not at the core of ZooKeeper's coordination guarantees. One reason I was unclear was the small parenthetical comment on: http://zookeeper.wiki.sourceforge.net/ZooKeeperGuarantees which states "(This is called the _monotonicity condition_ in Paxos.)" Without additional information, it's not clear whether this just notes a minor point of similarity between two different algorithms, or implies that Paxos is part of ZooKeeper. If you could include some of the recent mailing list discussion on the Wiki, I suspect it would help future readers. Chris. P.S. Glad to hear you're joining as an Apache project. :-) -- GPG Key ID: 366A375B GPG Key Fingerprint: 485E 5041 17E1 E2BB C263 E4DE C8E3 FA36 366A 375B |
From: Flavio J. <fpj...@ya...> - 2008-06-10 16:11:48
|
Ben and I were discussing this thread, so we ended up writing a reply together. This is an interesting discussion indeed, and we appreciate your interest in learning more about the implementation of zookeeper. Here is another attempt to explain to you the differences between our algorithm and Paxos.

The Paxos Multi-Decree protocol basically consists of running parallel instances of the Synod protocol (you probably know this one, but just for completeness the paper is called "The Part-Time Parliament", and can be found on Lamport's web page). The original Synod protocol, which is what we are calling Paxos, proceeds in three phases, just like a three-phase commit protocol. Our protocol, instead, has only two phases, just like a two-phase commit protocol. Of course, for Paxos, we can ignore the first phase in runs in which we have a single proposer, as we can run phase 1 for multiple instances at a time, which is what Ben previously called Multi-Paxos, I believe. The trick with skipping phase 1 is to deal with leader switching. However, if we have a run with multiple proposers, operating simultaneously or not, then we have to run phase 1 at least for the instances that haven't been committed.

The ZooKeeper protocol does not. The reason why we don't is twofold. First, we assume FIFO channels. (FIFO meaning that if a packet is received from the channel, all previously sent messages will have been delivered. If a packet is lost in the channel, all subsequent packets will be lost. TCP is a FIFO channel.) Paxos doesn't assume such a channel, and it is a rather practical assumption that simplifies the protocol a lot. Second, there can be at most one leader (proposer) at any time, and we guarantee this by making sure that a quorum of replicas recognize the leader as a leader by committing to an epoch change. This change in epoch also allows us to get unique zxids since the epoch forms part of the zxid. Followers (they are both acceptors and learners in Paxos terminology) have a FIFO channel to a single leader, so there can only be a single active leader. As a result, we can skip phase 1 of Paxos completely, and also during recovery we can skip all the uncommitted zxids of the epoch of the previous leader. Since messages can be received out of order and even lost with Paxos, it is possible to have gaps in the sequence of instances, and these instances have to be decided when a new proposer arises. The conclusion is that by making stronger assumptions for the system, we are able to use a simpler algorithm that works truly in two phases.

One difference we find interesting is that Paxos embeds recovery into the protocol. According to the algorithm, a new proposer just has to start one phase 1 for each instance that it believes hasn't been committed yet. If such an instance has been committed, then there is no problem as the value can't change once it is committed. With the ZooKeeper protocol, we have to run an auxiliary protocol to make sure that new leaders are up to date with respect to operations that have been committed, but because of the FIFO assumption, we know that the replica with the latest transaction id has the latest committed state. Again, strengthening the assumptions for the system enables a simpler solution.

Oh and don't get distracted by the leader election algorithm. Our protocol assumes there is one, but it's not part of the broadcast protocol. The leader election algorithm can easily change. There are actually two different ones in the sources, and one of them doesn't even use notifications.

-Flavio and Ben

----- Original Message ----
From: Evan Jones <ev...@MI...>
To: Benjamin Reed <br...@ya...>
Cc: zoo...@li...
Sent: Tuesday, June 10, 2008 2:50:26 AM
Subject: Re: [Zookeeper-user] Does ZooKeeper includes Paxos ?

This is a pretty academic debate, but I'm sorry: I'm a graduate student, I can't resist. Also: In case it isn't clear, I don't intend this as a criticism of Zookeeper: I think it is great, and am likely to use it for a research project I am working on at the moment. I just want to make sure I understand its implementation.

Summary: I am still unconvinced. I still think the Zookeeper protocol is basically Paxos.

On Jun 7, 2008, at 2:33 , Benjamin Reed wrote:
> There are some basic elements that all atomic broadcast protocols
> have, but I don't think that makes them all the same.

Fair enough: I have had this same debate with others, so perhaps I hold a minority opinion: All the consensus protocols tend to have the same 2 rounds in order to reach a decision, in the failure recovery case, and 1 round in the failure free case. They also tend to have the same need for sequence numbers in the messages, and voting, etc. While some of the details differ, the gist of all of them seems pretty similar to me. Hence, I consider all of them roughly equivalent.

Part of the confusion may be that "Paxos" as defined by Lamport isn't really useful by itself: it is a protocol for deciding a single value. You need to add a bunch of tweaks to the base algorithm to produce a useful state machine replication system. These tweaks can all be done in slightly different ways.

> A big difference between us and Paxos is that we never do phase1.

By Phase 1 you mean the prepare/ack round in Paxos? Except the Zookeeper protocol *does* do this: They are equivalent to the "notification" messages sent by the Leader election algorithm. The notification messages and the acknowledgements contain the same information as would be exchanged by Paxos phase 1 messages.

> Now you could say that the Propose/ACK phase is straight from Paxos,
> but it is also part of classic 2 phase commit or pretty much any
> other such protocol.

Right: Any distributed consensus protocol will use voting like this. As you say, a Multi-Paxos implementation will have the exact same number of messages as Zookeeper does, with almost exactly the same message contents.

> Looking at ZooKeeper messaging will not give you insight into Paxos
> unfortunately. We take advantage of FIFO channels (TCP), which
> allows us to assume ordering;

How do you handle the cases where a TCP connection breaks and is then re-established? This can violate the FIFO assumption. Is there code to handle this case that I can look at?

> we also make sure that there will never be two messages with the
> same zxid, which is why we do not need something like phase1 of Paxos;

Paxos makes the same assumption: there cannot be two messages with the same proposal number.

> We think it is a cool protocol, and it's something that is really
> quite intuitive and easy to explain. (We have been pulling a paper
> together on it. I'll put up a presentation we give on the ZooKeeper
> site.) A cursory glance may make you think Paxos, but it really isn't.

I look forward to reading a description of the protocol. Reverse engineering the protocol from the source code may not be the most reliable way to figure it out. I *think* I know how Zookeeper's algorithm works, but I could easily be mistaken in some edge cases.

Evan

--
Evan Jones
http://evanjones.ca/ |
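Two statements in the reply above lend themselves to a small illustration: the epoch forms part of the zxid, and the replica with the latest transaction id holds the latest committed state. The sketch below is not the ZooKeeper source. `ZxidSketch` and its methods are hypothetical, and two details are assumptions made only for the example: packing the epoch into the high 32 bits of a `long` with a counter in the low 32 bits, and starting the counter at 0 when the epoch changes.

```java
/** Sketch of epoch-based transaction ids; the bit layout is an assumption. */
final class ZxidSketch {
    /** Pack an epoch and a per-epoch counter into one comparable id. */
    static long zxid(int epoch, int counter) {
        return ((long) epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    static int epochOf(long zxid) {
        return (int) (zxid >>> 32);
    }

    static int counterOf(long zxid) {
        return (int) zxid;
    }

    /** A new leader starts a new epoch, so its ids cannot collide with old ones. */
    static long firstOfNextEpoch(long lastSeen) {
        return zxid(epochOf(lastSeen) + 1, 0);
    }

    /** Index of the replica whose last logged zxid is the highest. */
    static int mostUpToDate(long[] lastZxids) {
        int best = 0;
        for (int i = 1; i < lastZxids.length; i++) {
            if (lastZxids[i] > lastZxids[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Any id from epoch 2 orders after every id from epoch 1.
        System.out.println(zxid(1, 5) < zxid(2, 0));

        long[] lastZxids = { zxid(1, 7), zxid(2, 1), zxid(1, 9) };
        int chosen = mostUpToDate(lastZxids);
        System.out.println("most up-to-date replica: " + chosen);

        long next = firstOfNextEpoch(lastZxids[chosen]);
        System.out.println("next epoch " + epochOf(next) + ", counter " + counterOf(next));
    }
}
```

Comparing the packed values as plain `long`s orders first by epoch and then by counter, which is the only property the recovery argument in the message relies on.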
From: Evan J. <ev...@MI...> - 2008-06-10 00:50:31
|
This is a pretty academic debate, but I'm sorry: I'm a graduate student, I can't resist. Also: In case it isn't clear, I don't intend this as a criticism of Zookeeper: I think it is great, and am likely to use it for a research project I am working on at the moment. I just want to make sure I understand its implementation.

Summary: I am still unconvinced. I still think the Zookeeper protocol is basically Paxos.

On Jun 7, 2008, at 2:33 , Benjamin Reed wrote:
> There are some basic elements that all atomic broadcast protocols
> have, but I don't think that makes them all the same.

Fair enough: I have had this same debate with others, so perhaps I hold a minority opinion: All the consensus protocols tend to have the same 2 rounds in order to reach a decision, in the failure recovery case, and 1 round in the failure free case. They also tend to have the same need for sequence numbers in the messages, and voting, etc. While some of the details differ, the gist of all of them seems pretty similar to me. Hence, I consider all of them roughly equivalent.

Part of the confusion may be that "Paxos" as defined by Lamport isn't really useful by itself: it is a protocol for deciding a single value. You need to add a bunch of tweaks to the base algorithm to produce a useful state machine replication system. These tweaks can all be done in slightly different ways.

> A big difference between us and Paxos is that we never do phase1.

By Phase 1 you mean the prepare/ack round in Paxos? Except the Zookeeper protocol *does* do this: They are equivalent to the "notification" messages sent by the Leader election algorithm. The notification messages and the acknowledgements contain the same information as would be exchanged by Paxos phase 1 messages.

> Now you could say that the Propose/ACK phase is straight from Paxos,
> but it is also part of classic 2 phase commit or pretty much any
> other such protocol.

Right: Any distributed consensus protocol will use voting like this. As you say, a Multi-Paxos implementation will have the exact same number of messages as Zookeeper does, with almost exactly the same message contents.

> Looking at ZooKeeper messaging will not give you insight into Paxos
> unfortunately. We take advantage of FIFO channels (TCP), which
> allows us to assume ordering;

How do you handle the cases where a TCP connection breaks and is then re-established? This can violate the FIFO assumption. Is there code to handle this case that I can look at?

> we also make sure that there will never be two messages with the
> same zxid, which is why we do not need something like phase1 of Paxos;

Paxos makes the same assumption: there cannot be two messages with the same proposal number.

> We think it is a cool protocol, and it's something that is really
> quite intuitive and easy to explain. (We have been pulling a paper
> together on it. I'll put up a presentation we give on the ZooKeeper
> site.) A cursory glance may make you think Paxos, but it really isn't.

I look forward to reading a description of the protocol. Reverse engineering the protocol from the source code may not be the most reliable way to figure it out. I *think* I know how Zookeeper's algorithm works, but I could easily be mistaken in some edge cases.

Evan

--
Evan Jones
http://evanjones.ca/ |
From: Benjamin R. <br...@ya...> - 2008-06-09 16:57:55
|
We are starting the process of moving to Apache. We have been approved as a subproject of Hadoop and are in the process of moving subversion over. We will also be moving over the documentation and tracker items. For now, the only thing set up on Apache is Jira, the bug tracking system. Please report new bugs there. The link is: https://issues.apache.org/jira/browse/ZOOKEEPER Open bugs on sourceforge will be moved over to Jira. Congratulations to Evan Jones for having the first three real bugs :) (I opened Jira issues for the problems he reported.) thanx ben |
From: Benjamin R. <br...@ya...> - 2008-06-07 06:34:01
|
There are some basic elements that all atomic broadcast protocols have, but I don't think that makes them all the same. There actually are two parts to the atomic broadcast protocol: the establishment of the leader, and the leading phase. A big difference between us and Paxos is that we never do phase1. You are correct that the NEW_LEADER proposal is a key reason (but not the only reason) we can skip phase1, but it isn't Paxos and we never deliver any messages with that proposal. Multi-Paxos does something kind of similar, but they still have to deal with some phase1 issues. The basic protocol of ZooKeeper is certainly not Paxos. The logic for the basic protocol is mostly in the Leader and Follower classes. It is extremely simple: Leader -Proposal-> Followers -ACK-> Leader -COMMIT-> Followers Now you could say that the Propose/ACK phase is straight from Paxos, but it is also part of classic 2 phase commit or pretty much any other such protocol. (To be honest the inspiration came from 2 phase commit. The challenge was to get around the recovery problems of 2 phase commit. Of course we aren't 2 phase commit either since we do not have to worry about aborts, which made things simpler for us.) Looking at ZooKeeper messaging will not give you insight into Paxos unfortunately. We take advantage of FIFO channels (TCP), which allows us to assume ordering; we also make sure that there will never be two messages with the same zxid, which is why we do not need something like phase1 of Paxos; we maintain strict ordering; we also use the well known concept of epochs to ensure unique message ids. (The zxid is a <epoch, proposal> pair.) We think it is a cool protocol, and it's something that is really quite intuitive and easy to explain. (We have been pulling a paper together on it. I'll put up a presentation we give on the ZooKeeper site.) A cursory glance may make you think Paxos, but it really isn't. A similar thing happens with ZooKeeper and Chubby. 
People seem to lump them together after a cursory glance, but ZooKeeper is not a lock service: requests never block each other, clients never block each other, and we don't even deliver notifications synchronously; our reads are local fast reads; even operationally, ZooKeeper clients connect to followers whereas Chubby clients always connect to the master (this gets back to the fast read thing). The biggest thing we have in common is the hierarchical namespace, but hey, I'm a filesystem guy; what other kind of namespace is there? :) Putting everything together, ZooKeeper is used differently than Chubby. They may both be used for the same problem, but the resulting solutions will be different. ben ----- Original Message ---- From: Evan Jones <ev...@MI...> To: Benjamin Reed <br...@ya...> Cc: zoo...@li... Sent: Friday, June 6, 2008 8:33:29 PM Subject: Re: [Zookeeper-user] Does ZooKeeper includes Paxos ? On Jun 6, 2008, at 17:42 , Benjamin Reed wrote: > No. Internally we use an atomic broadcast protocol (which is what > Paxos would provide us) to keep the replicas in sync, but the > protocol is much simpler than Paxos. After reading QuorumPeer, I think it is safe to say that Zookeeper implements Paxos. The mapping between your code and Lamport's terminology: Zookeeper -> Lamport (Paxos) Notification (FastLeaderElection.java) -> Prepare request NEWLEADER (Leader.java) -> Accept message (zxid, leader id) -> "proposal number" It seems more or less equivalent to me, and there are no obvious problems with it that I can find. Unrelated observations from browsing the source: 1. syncLimit has slightly different comments in the class header, and inline with the variable. 2. ResponderThread: It reads "state" from the enclosing class. This variable is not volatile. My weak understanding of the Java memory model is that without doing a "synchronizing operation," the JVM would be free to optimize away the access of this "state" variable. 
That is, the ResponderThread could get "stuck" in an old state. I could be wrong because this stuff is ridiculously hard. See the FAQ: http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile The accesses to the currentVote volatile variable probably gives you the memory barrier guarantees you need, except when LEADING. In that case, the ResponderThread never touches another volatile, so in theory the JVM could read state once and by happy with it forever after. Will it ever actually do that? I would be surprised. However, I think you would be safer making state volatile, so ResponderThread would *definitely* see a change immediately. FastLeaderElection.java line 224: The part of the condition after && is not needed: This is the else branch of an if statement, where the condition is exactly the first part. Hence, the part after && *must* be true. FastLeaderElection.java: I think it also has accesses of a set of variables between threads with no synchronization or volatiles, such as logicalclock. Evan -- Evan Jones http://evanjones.ca/ |
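Ben's description above (a zxid that is an <epoch, proposal> pair, and a Leader -Proposal-> Followers -ACK-> Leader -COMMIT-> Followers flow) can be sketched roughly as follows. This is only an illustration, not the code in the Leader and Follower classes: the class names ZxidSketch and ProposalSketch are hypothetical, the packing of the epoch into the high 32 bits of a long is an assumption, and so is the rule that a strict majority of ACKs is what triggers the COMMIT.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: packs an <epoch, proposal counter> pair into one long. */
final class ZxidSketch {
    private ZxidSketch() {}

    // Assumption: epoch in the high 32 bits, proposal counter in the low 32 bits.
    static long make(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    static long counterOf(long zxid) {
        return zxid & 0xFFFFFFFFL;
    }

    /** A newly established leader bumps the epoch and restarts the counter,
     *  so it can never reuse a zxid issued under an earlier leader. */
    static long firstOfNextEpoch(long lastSeenZxid) {
        return make(epochOf(lastSeenZxid) + 1, 0);
    }
}

/** Hypothetical tracker for one outstanding proposal on the leader side. */
final class ProposalSketch {
    private final long zxid;
    private final int ensembleSize;
    private final Set<Integer> acks = new HashSet<>();

    ProposalSketch(long zxid, int ensembleSize) {
        this.zxid = zxid;
        this.ensembleSize = ensembleSize;
    }

    long zxid() {
        return zxid;
    }

    /** Records an ACK from a server; returns true once a strict majority
     *  has acknowledged (assumed here to be the point where COMMIT is sent). */
    boolean ack(int serverId) {
        acks.add(serverId); // a repeated ACK from the same server counts once
        return acks.size() > ensembleSize / 2;
    }
}
```

With this packing, comparing two zxids as plain longs orders them first by epoch and then by proposal counter, which is one way to get the strict ordering and uniqueness the message describes.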
From: Evan J. <ev...@MI...> - 2008-06-07 03:33:40
|
On Jun 6, 2008, at 17:42 , Benjamin Reed wrote: > No. Internally we use an atomic broadcast protocol (which is what > Paxos would provide us) to keep the replicas in sync, but the > protocol is much simpler than Paxos. After reading QuorumPeer, I think it is safe to say that Zookeeper implements Paxos. The mapping between your code and Lamport's terminology: Zookeeper -> Lamport (Paxos) Notification (FastLeaderElection.java) -> Prepare request NEWLEADER (Leader.java) -> Accept message (zxid, leader id) -> "proposal number" It seems more or less equivalent to me, and there are no obvious problems with it that I can find. Unrelated observations from browsing the source: 1. syncLimit has slightly different comments in the class header, and inline with the variable. 2. ResponderThread: It reads "state" from the enclosing class. This variable is not volatile. My weak understanding of the Java memory model is that without doing a "synchronizing operation," the JVM would be free to optimize away the access of this "state" variable. That is, the ResponderThread could get "stuck" in an old state. I could be wrong because this stuff is ridiculously hard. See the FAQ: http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile The accesses to the currentVote volatile variable probably give you the memory barrier guarantees you need, except when LEADING. In that case, the ResponderThread never touches another volatile, so in theory the JVM could read state once and be happy with it forever after. Will it ever actually do that? I would be surprised. However, I think you would be safer making state volatile, so ResponderThread would *definitely* see a change immediately. FastLeaderElection.java line 224: The part of the condition after && is not needed: This is the else branch of an if statement, where the condition is exactly the first part. Hence, the part after && *must* be true. 
FastLeaderElection.java: I think it also has accesses of a set of variables between threads with no synchronization or volatiles, such as logicalclock. Evan -- Evan Jones http://evanjones.ca/ |
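The suggestion about making "state" volatile can be illustrated with a small sketch. This is not the QuorumPeer code: PeerStateSketch, its State enum and awaitState are hypothetical names used only to show the pattern of one thread polling a field that another thread writes.

```java
/** Hypothetical stand-in for a peer whose state is polled by a responder thread. */
final class PeerStateSketch {
    enum State { LOOKING, FOLLOWING, LEADING }

    // volatile: a write by one thread is guaranteed to become visible to
    // subsequent reads of this field in other threads. Without it, the JIT
    // is allowed to hoist the read out of the polling loop below.
    private volatile State state = State.LOOKING;

    void setState(State newState) {
        state = newState;
    }

    State getState() {
        return state;
    }

    /** Polls until the field equals target or the timeout elapses;
     *  returns whether the target state was observed. */
    boolean awaitState(State target, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (state != target) {          // re-reads the volatile field each pass
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.yield();
        }
        return true;
    }
}
```

A caller might start a thread that eventually calls setState(State.LEADING) and have another thread call awaitState(State.LEADING, 5000). With the volatile modifier the waiting thread must observe the change; without it, the Java memory model does not promise that.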
From: Avinash L. <avi...@gm...> - 2008-06-07 01:22:55
|
Actually it was my bad. I figured out what I was doing incorrectly. All is good. Thanks Avinash On Fri, Jun 6, 2008 at 1:59 PM, Benjamin Reed <br...@ya...> wrote: > Are you sure there are no exceptions being thrown? So after the create the > path is not there? Is it possible the create is succeeding but the session > is timing out and the node is being deleted? > > ben > > ----- Original Message ---- > From: Avinash Lakshman <avi...@gm...> > To: zoo...@li... > Sent: Friday, June 6, 2008 1:49:16 AM > Subject: [Zookeeper-user] create() failure > > When I do execute the folloiwing code : > > String pathCreated = zk.create(createPath, new byte[0], > Ids.OPEN_ACL_UNSAFE, (CreateFlags.SEQUENCE | CreateFlags.EPHEMERAL) ); > > > It seems to fail silently w/o reporting any errors. I mean I have a log > statement after this line which never shows up. However it works if I > execute it via the debugger. Any ideas as to what might be happening? > > Thanks > A > |
From: Evan J. <ev...@MI...> - 2008-06-06 22:42:27
|
On Jun 6, 2008, at 17:42 , Benjamin Reed wrote: > No. Internally we use an atomic broadcast protocol (which is what > Paxos would provide us) to keep the replicas in sync, but the > protocol is much simpler than Paxos. Fundamentally, I am assuming that the protocol you use must look similar to Paxos. All the distributed consensus protocols I have studied look pretty similar. Where are the leader election/atomic broadcast parts of the protocol implemented? Judging by file names I am assuming that QuorumPeer is the "root" of the protocol? Evan Jones -- Evan Jones http://evanjones.ca/ |
From: Benjamin R. <br...@ya...> - 2008-06-06 21:42:12
|
No. Internally we use an atomic broadcast protocol (which is what Paxos would provide us) to keep the replicas in sync, but the protocol is much simpler than Paxos. We take advantage of the fact that we use FIFO channels (TCP) and always use a leader, along with some operational rules, which results in a much simpler high performance protocol. ben Chen Zhuan wrote: > Hello, > > I was wondering whether ZooKeeper includes Paxos in its code. Can you > tell me which part of the code is about the Paxos? Maybe ZooKeeper may > provide a good chance for people to see how to implement Paxos > correctly and efficiently. > Thank you. > |
From: Ted D. <ted...@gm...> - 2008-06-06 21:33:33
|
I am not sure that the full Paxos algorithm is applicable to zookeeper. Much more important internally is the ability to nominate a single cluster leader, which is simpler than Paxos. Once there is a cluster leader, then all updates go through that single host which, in turn, makes any client based algorithms vastly simpler. For instance, naive configuration nomination in zookeeper consists of proposers writing their proposed values to a well known ephemeral file with a watch placed on that file. If the file exists before the proposal, the proposer knows they have lost the race. If the file does not exist, then it will be created (atomically) and they will know they have won the proposal. Readers can read the file at any time and will know if any proposal has been accepted. Moreover, proposers will be notified via their watch if the accepted proposer loses their connection, which will cause another proposer to propose a value. This suffers from the horde effect, but there is a more nuanced solution on the wiki that does not. This is vastly simpler than the Paxos algorithm, of course, due to the availability of atomic create, ephemeral files and other capabilities of zookeeper. Have you looked at the examples section of the wiki? On Thu, May 29, 2008 at 6:22 PM, Chen Zhuan <cz...@gm...> wrote: > Hello, > > I was wondering whether ZooKeeper includes Paxos in its code. Can you tell > me which part of the code is about the Paxos? Maybe ZooKeeper may provide a > good chance for people to see how to implement Paxos correctly and > efficiently. > Thank you. > -- ted |
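The naive nomination scheme Ted describes can be sketched without a running ensemble. The class below is an in-memory stand-in, not the ZooKeeper client API: NominationSketch and its methods are hypothetical, the atomic create is modelled with ConcurrentMap.putIfAbsent, and ephemeral behaviour is modelled by dropping a session's nodes when the session is lost. Watches are not modelled; a losing proposer here would simply retry after learning that the winner's session is gone.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** In-memory stand-in for a znode namespace (NOT the ZooKeeper client API). */
final class NominationSketch {
    /** An ephemeral node remembers which session created it. */
    private static final class Node {
        final String ownerSession;
        final String value;

        Node(String ownerSession, String value) {
            this.ownerSession = ownerSession;
            this.value = value;
        }
    }

    private final ConcurrentMap<String, Node> nodes = new ConcurrentHashMap<>();

    /** Atomic create: returns true if this proposer created the node (won the
     *  race), false if the node already existed (lost the race). */
    boolean propose(String path, String sessionId, String value) {
        return nodes.putIfAbsent(path, new Node(sessionId, value)) == null;
    }

    /** Readers can look at the node at any time to see the accepted value. */
    Optional<String> read(String path) {
        Node node = nodes.get(path);
        return node == null ? Optional.empty() : Optional.of(node.value);
    }

    /** Models a lost connection: every ephemeral node of the session vanishes,
     *  which lets another proposer win the next round. */
    void sessionLost(String sessionId) {
        nodes.values().removeIf(node -> node.ownerSession.equals(sessionId));
    }
}
```

Because every proposer retries against the same path when the winner disappears, this sketch has the same horde effect the message mentions.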
From: Mahadev K. <ma...@ya...> - 2008-06-06 18:48:34
|
Hi Avinash, Can you post sample code to explain what you are doing? It's hard to see from what you explained why zookeeper would fail. As for session timeout, it means that your session will be expired after 5000ms if your zookeeper client goes away. Mahadev ________________________________________ From: zoo...@li... [mailto:zoo...@li...] On Behalf Of Avinash Lakshman Sent: Friday, June 06, 2008 8:58 AM To: Flavio Junqueira Cc: zoo...@li... Subject: Re: [Zookeeper-user] create() failure These are sequence of interactions with ZooKeeper: (1) Instantiate Zookeeper and perform the checks for a bunch of znodes. All this works as expected. (2) Spawn a thread which tries to perform leader election. This is where I am trying to create the znode with the sequence numbers. I am assuming that the client connected successfully because the checks in step 1 went through fine. Also I have Zookeeper run in a cluster of 5 nodes which are all in the same rack and in the same data center. Another question I have is if I set the seesion timeout parameter to 5000 ms in the ZooKeeper ctor does that mean my session is invalidated after 5000 ms or is kept alive after every 5000 ms. What is the significance of the session timeout parameter? Thanks Avinash On Fri, Jun 6, 2008 at 1:55 AM, Flavio Junqueira <fpj...@ya...> wrote: It could be because your client has not connected properly to a ZooKeeper server. Have you taken a look at the logs on the server side to make sure that a session has been created? Could you give me a little more detail about your setup? -Flavio ----- Original Message ---- From: Avinash Lakshman <avi...@gm...> To: zoo...@li... Sent: Friday, June 6, 2008 10:49:16 AM Subject: [Zookeeper-user] create() failure When I do execute the folloiwing code : String pathCreated = zk.create(createPath, new byte[0], Ids.OPEN_ACL_UNSAFE, (CreateFlags.SEQUENCE | CreateFlags.EPHEMERAL) ); It seems to fail silently w/o reporting any errors. 
I mean I have a log statement after this line which never shows up. However it works if I execute it via the debugger. Any ideas as to what might be happening? Thanks A |
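Mahadev's answer about the session timeout can be made concrete with a tiny model. This is a sketch under assumptions, not the server's session tracking code: SessionSketch is a hypothetical class, and it assumes that any message heard from the client (including a heartbeat) restarts the countdown, so a session only expires after the client has been silent for longer than the timeout.

```java
/** Hypothetical model of a server-side view of one client session. */
final class SessionSketch {
    private final long timeoutMillis;
    private long lastHeardMillis;

    SessionSketch(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastHeardMillis = nowMillis;
    }

    /** Assumption: any request or heartbeat from the client extends the session. */
    void touch(long nowMillis) {
        lastHeardMillis = nowMillis;
    }

    /** The session is expired only once the client has been silent for
     *  longer than the timeout; a live client keeps it alive indefinitely. */
    boolean isExpired(long nowMillis) {
        return nowMillis - lastHeardMillis > timeoutMillis;
    }
}
```

Read this way, a 5000 ms timeout does not invalidate the session 5000 ms after it was created. It is the amount of silence tolerated before the session, and any ephemeral znodes it created, are removed.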