From: Steve D. <st...@us...> - 2004-02-10 18:55:22
Stefan Beck wrote:
>
> new Problem:
> when running 'evmsgui -d everything', I won't be able to select a node
> to administer. The select box is empty ("No items were found matching
> the selected criteria")!!! This is reproducible, too.

I can think of two reasons why the node list would be empty. One is
that there is no membership available. However, if that were true, the
GUI would log a message saying that evms_get_node_list() returned an
error. I don't see that message in any of the logs. The second reason
is that there is a membership but it doesn't have any entries. This is
possible if the CCM is reporting no active members. The EVMS Engine
registers with the HA plug-in for "deltas" in the membership. That
means it gets notifications when nodes join or leave the cluster rather
than getting a whole new membership every time the membership changes.
If all nodes have left, the Engine can be left holding a membership
that has no entries. (There is a sketch of this check below.)

> Setting the engine log-level to everything in evms.conf yields the
> same.

Yes. The debug level can be set via evms.conf or by the "-d" switch on
evmsgui. Using the "-d" switch overrides the setting in evms.conf.
(There is an evms.conf excerpt below.)

> This holds true for all debuglevels (except "default"), that I have
> tried (extra, warning, ...).

This is odd behavior. The debug level should have no effect on whether
you can administer another node in the cluster. It may be that with
more logging, e.g., the debug level set to "everything", the logging
slows down the code enough to affect the behavior. But if that were
true, setting the debug level to "warning" should not cause the error,
since "warning" is more restrictive than "default", i.e., fewer
messages are logged. Puzzling.

> So the attached logfiles are with log-level=default.

Unfortunately, I would need at least "entry_exit" to be able to trace
the flow through the functions and get an idea of what is happening.

> Node gfs2 is active and has all cluster volumes active:
>
> Volume Name: /dev/evms/rack/home
> Volume Name: /dev/evms/rack/sync
>
> and the local volumes:
> Volume Name: /dev/evms/sda1
> Volume Name: /dev/evms/sda2
> Volume Name: /dev/evms/sda3
>
>
> 1. gfs2: evmsgui -d everything (no nodes to select, see above)
>
> 2. gfs2: evmsgui => administer => node gfs1
> It took a long time, but succeeded (for the first time!?)
> maybe I haven't been waiting long enough the other times I tried this?
> Sorry if it has been my fault.

Switching to a different node can take some time. What is done under
the covers is to close the Engine on the current node and then open
the Engine on the other node (sketched below). Opening the Engine
means going through the discovery process on the other node. The
discovery logic is handled by the node where the user is; it talks
over the wire to the other node to read and write the disks. The added
network latency adds to the already time-consuming discovery process.
Sometimes it looks like the discovery process is hung when it is just
taking a long time. Sometimes the discovery process is indeed hung due
to some communication failure with the other node.

> 3. gfs1: evmsgui (no nodes to select, even with loglevel=default)
>
> => Problem (?)
> why can't I administer running evmsgui on the passive node ?
> Is this the way it should work?

You should be able to administer any active node from another active
node as long as the nodes are in the same membership. Are both nodes
up and members of the cluster?
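Before I get to the hang you describe below, a few concrete references.
The check on the node list boils down to something like this sketch. It
is from memory of the 2.x Engine API -- evms_get_node_list() is the
real entry point, but treat the header name, the flags argument, and
the node_list_t field names as assumptions and check evms.h for the
exact declarations:

    #include <stdio.h>
    #include <evms.h>   /* assumed header name */

    /* Sketch: telling "no membership available" apart from "a
     * membership with no entries". */
    static void show_nodes(void)
    {
        node_list_t *nodes = NULL;
        int rc = evms_get_node_list(0, &nodes);  /* 0: no search
                                                  * flags (assumed) */
        if (rc != 0) {
            /* Reason 1: no membership available at all.  The GUI
             * logs this case as an evms_get_node_list() error. */
            printf("evms_get_node_list() failed: %d\n", rc);
            return;
        }

        if (nodes->count == 0) {
            /* Reason 2: a membership exists but has no entries,
             * e.g. the CCM is reporting no active members. */
            printf("Membership has no entries.\n");
        } else {
            unsigned int i;
            for (i = 0; i < nodes->count; i++) {
                printf("Node: %s\n", nodes->node_info[i].node_name);
            }
        }

        evms_free(nodes);  /* assumed: Engine-allocated lists are
                            * freed with evms_free() */
    }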
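For the debug level, the relevant piece of evms.conf looks roughly like
this excerpt -- I am writing it from memory, so check your installed
copy for the exact option names:

    # /etc/evms.conf (excerpt)
    engine {
        # One of: default, warning, extra, entry_exit, everything, ...
        debug_level = entry_exit;
    }

As noted above, the "-d" switch on evmsgui overrides this setting.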
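And to make the node switch concrete, what evmsgui does under the
covers is roughly the following. The signature of evms_open_engine()
is also from memory of the 2.x API, so treat the exact parameter list
as an assumption:

    #include <evms.h>   /* assumed header name */

    /* Sketch: switching administration to another node.  The fail
     * over case discussed below uses the same open call with the
     * ENGINE_READWRITE_CRITICAL mode, which shuts down any other
     * running instance of the Engine. */
    static int switch_to_node(char *node_name)
    {
        /* Close the Engine on the current node... */
        evms_close_engine();

        /* ...then open it against the other node, e.g. "gfs1".
         * This reruns the whole discovery process, reading the disks
         * over the wire, which is why it can look hung when it is
         * merely slow. */
        return evms_open_engine(node_name,
                                ENGINE_READWRITE,
                                NULL,   /* UI callbacks (assumed
                                         * optional) */
                                NULL,   /* debug level: NULL = use
                                         * evms.conf / -d (assumed) */
                                NULL);  /* log file: NULL = default
                                         * (assumed) */
    }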
> tried steps 2,3 for a few times ...
>
> Hope this helps.
>
> Tell me if you need more info.
>
> regards
> Stefan
>
> Latest update (just before pressing the "send button" for this email),
> I managed to reproduce the "hang":
>
> After a takeover to gfs1, which worked perfectly:
>
> starting evmsgui on gfs1 => administer => node gfs2:
>
> after one minute I get a popup:
> ------------------------------------------------------------------------
> Feb 10 09:26:08 Engine: Another process urgently needs the Engine.
> Please save your changes or quit now. This process will self destruct
> in 30 seconds.
>
> Feb 10 09:26:18 Engine: Another process urgently needs the Engine.
> Please save your changes or quit now. This process will self destruct
> in 20 seconds.
>
> Feb 10 09:26:28 Engine: Another process urgently needs the Engine.
> Please save your changes or quit now. This process will self destruct
> in 10 seconds.
>
> Feb 10 09:26:38 Engine: Self destruct sequence initiated.
> ------------------------------------------------------------------------

The other "process that urgently needs the Engine" is the evms_failover
script being kicked off to do a fail over. It opens the EVMS Engine in
the ENGINE_READWRITE_CRITICAL mode, which makes the Engine shut down
any other instance of the Engine that is running, even if it is running
on another node.

> The status line showed "discovering regions" and the progress bar
> went from left to right.

Your instance of the Engine self destructed because of the
evms_failover script's critical need to run the Engine. The discovery
process was stopped. Your entire evmsgui session should have exited. I
noticed that the log says there was a segfault. evmsgui runs with
several threads. My guess is that the thread that was doing discovery
segfaulted during the self destruct sequence, and the remaining threads
kept running instead of exiting as they should. That brings up the
question of what segfaulted. Hard to tell from the log.

> Maybe the evmsd interfered with the evmsgui ?

The HA plug-in running under evmsd monitors the membership. If the
membership changes, it launches the evms_failover script. Since the
fail over sequence was initiated, it makes me think that perhaps the
cluster membership is not that stable. On our test cluster I have seen
the fail over sequence initiated when one daemon was started. Somehow
it thought the other node was dead, so it stonith'ed it and started
running evms_failover before I could start the daemon on the other
node. On rarer occasions I have been working on one node only to have
it go dead, and then discovered that the other node had killed it for
some reason. Granted, I am running a test system and things can be
flaky.

Do you have the hardware to do stonith? If so, a fail over would be
somewhat obvious since one node would be rebooted. If not, the fail
over would not be very obvious. It would be good to make sure that both
nodes are members of the cluster and that the membership is stable.
(I'm looking into modifying the evmsccm utility to display more
information about the membership.)

My guess is that the fail over is kicking in, which is why you cannot
configure gfs2 from gfs1 and vice versa.

I realize these are not definitive answers to your questions. I hope
they get us going in the right direction to solve your problems.

Steve D.