From: Thomas Guyot-S. <Th...@za...> - 2005-11-28 16:33:54
Hi Steve,

I tried your patch over the weekend and it didn't fix the problem. I don't have debug logs, however, and no more time to work on this. I'll just lower the timeout so it doesn't affect the failover process much. Thanks for your help.

Thomas Guyot-Sionnest, Systems Administrator
Tel: (514) 842-7054
Fax: (514) 221-3395
Email: th...@za...

> -----Original Message-----
> From: Steve Dobbelstein [mailto:st...@us...]
> Sent: November 21, 2005 17:48
> To: Thomas Guyot-Sionnest
> Cc: evm...@li...; evm...@li...
> Subject: RE: [Evms-cluster] Evms unable to import a deported container
>
> "Thomas Guyot-Sionnest" <Th...@za...> wrote on 11/21/2005 02:55:32 PM:
>
> > Hi Steve,
>
> Hi, Thomas.
>
> > Since I sent you the last mail on this issue I did not hear anything
> > about it.
>
> Sorry for not getting back to you. Too many things going on. Too many
> distractions.
>
> > Since then I got evms working with heartbeat 2, but I still see in
> > rare cases a 10 minute timeout in cluster operations.
> >
> > I assume that it's the same timeout as in the log I sent in my
> > previous mail, which is defined at line 106 in engine/remote.c:
> >
> > #define REQUEST_TIMEOUT 600 /* in seconds */
> >
> > Is there any reason to have such a long timeout? Can it safely be set
> > to 60, or even 30 seconds, to prevent blocking the whole failover
> > process?
>
> The reason was that I didn't have any experience knowing what a typical
> lag time might be, so I just picked a large value that I thought would
> be safe. Since you know the performance of your systems, you can set it
> to a lower number that you think is safe. 60 or 30 seconds sounds
> reasonable.
>
> > I included the relevant part of the log sent previously at the end of
> > this e-mail for reference.
>
> Thanks for the snippet from the log. It gives me some clues so that I
> think I know what is happening.
>
> It looks like you have run into a timing issue that then exposes some
> bugs in the code. The log shows that the Engine started its
> remote_open_engine() function and blocked in the middle of the function.
> Three seconds later the Engine got notified that a node xserve-test2
> joined the cluster. It also says that after the join the membership had
> one node. That means that there were zero nodes in the membership when
> remote_open_engine() ran.
>
> Looking at remote_open_engine() (in engine/remote.c), it sets:
>
>     response_count = membership->num_entries - 1;
>
> (The response count is one less than the number of nodes in the
> membership because it doesn't send a message to itself.) Since
> membership->num_entries was zero, that means response_count was set to
> -1. The code then falls into a loop that waits for the responses to
> come in:
>
>     while ((response_count != 0) && (rc == 0)) {
>
> As you can see, there will be no responses coming back, since none were
> sent, and response_count will never go to zero. There is a check for
> the time-out within the loop. It breaks out of the loop when the
> time-out expires, which is what you are seeing.
>
> My guess is that most of the time the membership is available before
> remote_open_engine() runs and response_count gets set correctly. On
> occasion the membership arrives late and causes the code to fall into
> the bug described above.
>
> You can try the attached patch, which will set the response count
> correctly.
> (See attached file: response_count.patch)
>
> However, I suspect that even with the response count set correctly the
> code may still fail. The Engine does not handle the dynamic joining and
> leaving of nodes very well. In the case above, the Engine will proceed
> with zero nodes. When the Engine gets notified that another node has
> joined the cluster it simply adds that node to its own record of the
> membership.
> But it doesn't handle establishing a connection to the new node. The
> Engine is currently coded assuming a static membership. The handling of
> dynamic joins and leaves was going to be added "later". Looks like
> "later" is "now". The code to handle dynamic joins and leaves is not
> trivial and will require some thought and testing, as you can imagine.
> I will put it on my list of things to do. Stay tuned.
>
> > Except for this issue, everything works perfectly with heartbeat 2. I
> > greatly appreciate your support on previous issues.
>
> Glad to hear it. Glad to help.
>
> Steve D.