[jgroups-dev] completion of join-with-state-transfer needs to be coordinated by the coordinator
From: Newcomb, Michael-P. <Mic...@gd...> - 2008-04-21 19:41:39
I'm running some concurrent startup tests on one machine and I'm seeing some interesting issues. Here is how join-with-state-transfer looks to me on an established group. A new member wants to join, so:

1. the new member pings and finds the group
2. the new member sends a join to the coordinator
3. the coordinator starts a flush
4. the coordinator sends a join response
5a. the new member installs the view
6a. the new member responds with a view_ack
7a. the new member gets the state
8a. the new member stops the flush
5b. the coordinator issues a view change
6b. all members install the view

(The letters denote steps that happen concurrently.)

Several problems can happen at step 4:

1. the new member can send his view_ack *before* the coordinator starts waiting for it
2. the new member can complete the state transfer *before* all existing members have installed the new view, and the existing members can drop his stop_flush call

When the new member receives his join response, he immediately starts finishing JChannel.connect() (steps 5a-8a) while the coordinator concurrently moves on to steps 5b-6b. That is why the view_ack can arrive before the coordinator starts waiting for it, and why the state transfer can complete before the rest of the members have installed the new view. In the latter case, the existing members drop the stop_flush message.

I'm seeing problem 2 occur: I added code to print all messages that NAKACK discarded because the sender wasn't a member, and I did indeed see a STOP_FLUSH arrive before the new view was installed.

To solve the first problem, I think the new member should not install the view from the JoinRsp, but should wait for the coordinator to give him the view along with everyone else. Perhaps the JoinRsp just delivers the view id that the new member should be waiting for?
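
To make the first proposal concrete, here is a rough Java sketch. It is not JGroups code: DeferredJoiner, onJoinRsp and onView are hypothetical names, and it assumes the coordinator starts collecting acks before it sends out the view. The JoinRsp carries only the view id to wait for, and the view_ack goes out only once the coordinator's view with that id (or a later one) actually arrives.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not JGroups code: a joiner that defers view installation. */
public class DeferredJoiner {
    private long expectedViewId = -1;   // set from the (hypothetical) JoinRsp
    private long installedViewId = -1;  // -1 means no view installed yet
    private final List<Long> acksSent = new ArrayList<>();

    /** The JoinRsp only tells the joiner which view id to wait for. */
    public synchronized void onJoinRsp(long viewId) {
        expectedViewId = viewId;
    }

    /**
     * Called when a view sent by the coordinator arrives.
     * Returns true if the view was installed and acknowledged.
     */
    public synchronized boolean onView(long viewId) {
        if (expectedViewId < 0 || viewId < expectedViewId) {
            return false; // join not accepted yet, or a stale view: no install, no ack
        }
        installedViewId = viewId;
        acksSent.add(viewId); // stands in for sending view_ack to the coordinator
        return true;
    }

    public synchronized long installedViewId() {
        return installedViewId;
    }

    public synchronized List<Long> acksSent() {
        return new ArrayList<>(acksSent);
    }
}
```

With this shape the joiner acks at the same point in the protocol as the existing members, so there is no separate early ack for the coordinator to miss.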
The second problem would still exist, because the new member could still get his view *before* everyone else has installed theirs, and thus send out the stop_flush before everyone has installed the new view (the one that includes the new member).

To solve this problem, instead of the new member passing down RESUME in JChannel.stopFlush, couldn't the coordinator wait until all view_acks have been received and then unicast RESUME to the new member? The new member would then behave essentially the same way and call onResume(), which is what he does anyway. Also, JChannel.stopFlush would block until it was unblocked by the RESUME. This might need to be a new 'stopFlush' method, so that external calls to JChannel.stopFlush would still send RESUME down.

Thanks,
Michael
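
P.S. A rough Java sketch of the coordinator-driven RESUME idea, under stated assumptions. None of this is JGroups code: CoordinatedResume, AckTracker, Joiner and stopFlushAfterJoin are hypothetical names, and the Runnable stands in for unicasting RESUME to the new member. The coordinator fires RESUME only after every member of the new view has acked, and the joiner's variant of stopFlush blocks until that RESUME arrives or a timeout expires.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch, not JGroups code. */
public class CoordinatedResume {

    /** Coordinator side: tracks outstanding view_acks for one view. */
    public static class AckTracker {
        private final Set<String> pending;
        private final Runnable sendResume; // stands in for unicasting RESUME
        private boolean fired = false;

        public AckTracker(Set<String> members, Runnable sendResume) {
            this.pending = new HashSet<>(members);
            this.sendResume = sendResume;
        }

        /** Records a view_ack; sends RESUME exactly once, when none are pending. */
        public synchronized void ack(String member) {
            pending.remove(member);
            if (pending.isEmpty() && !fired) {
                fired = true;
                sendResume.run();
            }
        }
    }

    /** Joiner side: the new stopFlush variant waits instead of sending RESUME. */
    public static class Joiner {
        private final CountDownLatch resumed = new CountDownLatch(1);

        /** Invoked when the coordinator's RESUME arrives. */
        public void onResume() {
            resumed.countDown();
        }

        /** Blocks until RESUME arrives; returns false on timeout or interrupt. */
        public boolean stopFlushAfterJoin(long timeoutMs) {
            try {
                return resumed.await(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

The timeout is only there so a lost RESUME cannot hang the joiner forever; what should happen on timeout is left open here.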