From: Thomas Guyot-S. <Th...@za...> - 2005-11-28 16:33:54
Hi Steve,

I tried your patch over the weekend and it didn't fix the problem. I don't have debug logs, however, and no more time to work on this. I'll just lower the timeout so it doesn't affect the failover process much. Thanks for your help.

Thomas Guyot-Sionnest, Systems Administrator
Tel: (514) 842-7054
Fax: (514) 221-3395
Email: th...@za...

> -----Original Message-----
> From: Steve Dobbelstein [mailto:st...@us...]
> Sent: November 21, 2005 17:48
> To: Thomas Guyot-Sionnest
> Cc: evm...@li...; evm...@li...
> Subject: RE: [Evms-cluster] Evms unable to import a deported container
>
> "Thomas Guyot-Sionnest" <Th...@za...> wrote on 11/21/2005 02:55:32 PM:
>
> > Hi Steve,
>
> Hi, Thomas.
>
> > Since I sent you the last mail on this issue I did not hear anything
> > about it.
>
> Sorry for not getting back to you. Too many things going on. Too many
> distractions.
>
> > Since then I got evms working with heartbeat 2, but I still see in
> > rare cases a 10 minute timeout in cluster operations.
> >
> > I assume that it's the same timeout as in the log I sent in my
> > previous mail, which is defined at line 106 in engine/remote.c:
> >
> > #define REQUEST_TIMEOUT 600 /* in seconds */
> >
> > Is there any reason to have such a long timeout? Can it safely be set
> > to 60, or even 30 seconds, to prevent blocking the whole failover
> > process?
>
> The reason was that I didn't have any experience knowing what a typical
> lag time might be, so I just picked a large value that I thought would
> be safe. Since you know the performance of your systems, you can set it
> to a lower number that you think is safe. 60 or 30 seconds sounds
> reasonable.
>
> > I included the relevant part of the log sent previously at the end of
> > this e-mail for reference.
>
> Thanks for the snippet from the log. It gives me some clues so that I
> think I know what is happening.
>
> It looks like you have run into a timing issue that then exposes some
> bugs in the code. The log shows that the Engine started its
> remote_open_engine() function and blocked in the middle of the function.
> Three seconds later the Engine got notified that a node xserve-test2
> joined the cluster. It also says that after the join the membership had
> one node. That means that there were zero nodes in the membership when
> remote_open_engine() ran.
>
> Looking at remote_open_engine() (in engine/remote.c), it sets:
>
>     response_count = membership->num_entries - 1;
>
> (The response count is one less than the number of nodes in the
> membership because it doesn't send a message to itself.) Since
> membership->num_entries was zero, that means response_count was set to
> -1. The code then falls into a loop that waits for the responses to
> come in:
>
>     while ((response_count != 0) && (rc == 0)) {
>
> As you can see, there will be no responses coming back, since none were
> sent, and response_count will never go to zero. There is a check for
> the time-out within the loop. It breaks out of the loop when the
> time-out expires, which is what you are seeing.
>
> My guess is that most of the time the membership is available before
> remote_open_engine() runs and response_count gets set correctly. On
> occasion the membership arrives late and causes the code to fall into
> the bug described above.
>
> You can try the attached patch, which will set the response count
> correctly.
> (See attached file: response_count.patch)
>
> However, I suspect that even with the response count set correctly the
> code may still fail. The Engine does not handle the dynamic joining and
> leaving of nodes very well. In the case above, the Engine will proceed
> with zero nodes. When the Engine gets notified that another node has
> joined the cluster it simply adds that node to its own record of the
> membership.
> But it doesn't handle establishing a connection to the new node. The
> Engine is currently coded assuming a static membership. The handling of
> dynamic joins and leaves was going to be added "later". Looks like
> "later" is "now". The code to handle dynamic joins and leaves is not
> trivial and will require some thought and testing, as you can imagine.
> I will put it on my list of things to do. Stay tuned.
>
> > Except for this issue, everything works perfectly with heartbeat 2. I
> > greatly appreciate your support on previous issues.
>
> Glad to hear it. Glad to help.
>
> Steve D.