From: Steve D. <st...@us...> - 2004-02-10 18:55:22
Stefan Beck wrote:
>
> new Problem:
> when running 'evmsgui -d everything', I won't be able to select a node
> to administer. The select box is empty ("No items were found matching
> the selected criteria")!!! This is reproducible, too.

I can think of two reasons why the node list would be empty. One is
that there is no membership available. However, if that were true, the
GUI would log a message saying that evms_get_node_list() returned an
error. I don't see that message in any of the logs. The second reason
is that there is a membership but it doesn't have any entries. This is
possible if the CCM is reporting no active members. The EVMS Engine
registers with the HA plug-in for "deltas" in the membership. That
means it gets notifications when nodes join or leave the cluster rather
than getting a whole new membership every time the membership changes.
If all nodes have left, the Engine can be left holding a membership
that has no entries. (There is a sketch of this check below.)

> Setting the engine log-level to everything in evms.conf yields the
> same.

Yes. The debug level can be set via evms.conf or by the "-d" switch on
evmsgui. Using the "-d" switch overrides the setting in evms.conf.
(There is an evms.conf excerpt below.)

> This holds true for all debuglevels (except "default"), that I have
> tried (extra, warning, ...).

This is odd behavior. The debug level should have no effect on whether
you can administer another node in the cluster. It may be that with
more logging, e.g., the debug level set to "everything", the logging
slows down the code enough to affect the behavior. But if that were
true, setting the debug level to "warning" should not cause the error,
since "warning" is more restrictive than "default", i.e., fewer
messages are logged. Puzzling.

> So the attached logfiles are with log-level=default.

Unfortunately, I would need at least "entry_exit" to be able to trace
the flow through the functions and get an idea of what is happening.

> Node gfs2 is active and has all cluster volumes active:
>
> Volume Name: /dev/evms/rack/home
> Volume Name: /dev/evms/rack/sync
>
> and the local volumes:
> Volume Name: /dev/evms/sda1
> Volume Name: /dev/evms/sda2
> Volume Name: /dev/evms/sda3
>
>
> 1. gfs2: evmsgui -d everything (no nodes to select, see above)
>
> 2. gfs2: evmsgui => administer => node gfs1
> It took a long time, but succeeded (for the first time!?)
> maybe I haven't been waiting long enough the other times I tried this?
> Sorry if it has been my fault.

Switching to a different node can take some time. What is done under
the covers is to close the Engine on the current node and then open
the Engine on the other node (sketched below). Opening the Engine
means going through the discovery process on the other node. The
discovery logic is handled by the node where the user is; it talks
over the wire to the other node to read and write the disks. The added
network latency adds to the already time-consuming discovery process.
Sometimes it looks like the discovery process is hung when it is just
taking a long time. Sometimes the discovery process is indeed hung due
to some communication failure with the other node.

> 3. gfs1: evmsgui (no nodes to select, even with loglevel=default)
>
> => Problem (?)
> why can't I administer running evmsgui on the passive node ?
> Is this the way it should work?

You should be able to administer any active node from another active
node as long as the nodes are in the same membership. Are both nodes
up and members of the cluster?
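Before I get to the hang you describe below, a few concrete references.
The check on the node list boils down to something like this sketch. It
is from memory of the 2.x Engine API -- evms_get_node_list() is the
real entry point, but treat the header name, the flags argument, and
the node_list_t field names as assumptions and check evms.h for the
exact declarations:

    #include <stdio.h>
    #include <evms.h>   /* assumed header name */

    /* Sketch: telling "no membership available" apart from "a
     * membership with no entries". */
    static void show_nodes(void)
    {
        node_list_t *nodes = NULL;
        int rc = evms_get_node_list(0, &nodes);  /* 0: no search
                                                  * flags (assumed) */
        if (rc != 0) {
            /* Reason 1: no membership available at all.  The GUI
             * logs this case as an evms_get_node_list() error. */
            printf("evms_get_node_list() failed: %d\n", rc);
            return;
        }

        if (nodes->count == 0) {
            /* Reason 2: a membership exists but has no entries,
             * e.g. the CCM is reporting no active members. */
            printf("Membership has no entries.\n");
        } else {
            unsigned int i;
            for (i = 0; i < nodes->count; i++) {
                printf("Node: %s\n", nodes->node_info[i].node_name);
            }
        }

        evms_free(nodes);  /* assumed: Engine-allocated lists are
                            * freed with evms_free() */
    }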
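For the debug level, the relevant piece of evms.conf looks roughly like
this excerpt -- I am writing it from memory, so check your installed
copy for the exact option names:

    # /etc/evms.conf (excerpt)
    engine {
        # One of: default, warning, extra, entry_exit, everything, ...
        debug_level = entry_exit;
    }

As noted above, the "-d" switch on evmsgui overrides this setting.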
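And to make the node switch concrete, what evmsgui does under the
covers is roughly the following. The signature of evms_open_engine()
is also from memory of the 2.x API, so treat the exact parameter list
as an assumption:

    #include <evms.h>   /* assumed header name */

    /* Sketch: switching administration to another node.  The fail
     * over case discussed below uses the same open call with the
     * ENGINE_READWRITE_CRITICAL mode, which shuts down any other
     * running instance of the Engine. */
    static int switch_to_node(char *node_name)
    {
        /* Close the Engine on the current node... */
        evms_close_engine();

        /* ...then open it against the other node, e.g. "gfs1".
         * This reruns the whole discovery process, reading the disks
         * over the wire, which is why it can look hung when it is
         * merely slow. */
        return evms_open_engine(node_name,
                                ENGINE_READWRITE,
                                NULL,   /* UI callbacks (assumed
                                         * optional) */
                                NULL,   /* debug level: NULL = use
                                         * evms.conf / -d (assumed) */
                                NULL);  /* log file: NULL = default
                                         * (assumed) */
    }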
> tried steps 2,3 for a few times ...
>
> Hope this helps.
>
> Tell me if you need more info.
>
> regards
> Stefan
>
> Latest update (just before pressing the "send button" for this email),
> I managed to reproduce the "hang":
>
> After a takeover to gfs1, which worked perfectly:
>
> starting evmsgui on gfs1 => administer => node gfs2:
>
> after one minute I get a popup:
> ------------------------------------------------------------------------
> Feb 10 09:26:08 Engine: Another process urgently needs the Engine.
> Please save your changes or quit now. This process will self destruct
> in 30 seconds.
>
> Feb 10 09:26:18 Engine: Another process urgently needs the Engine.
> Please save your changes or quit now. This process will self destruct
> in 20 seconds.
>
> Feb 10 09:26:28 Engine: Another process urgently needs the Engine.
> Please save your changes or quit now. This process will self destruct
> in 10 seconds.
>
> Feb 10 09:26:38 Engine: Self destruct sequence initiated.
> ------------------------------------------------------------------------

The other "process that urgently needs the Engine" is the evms_failover
script being kicked off to do a fail over. It opens the EVMS Engine in
the ENGINE_READWRITE_CRITICAL mode, which makes the Engine shut down
any other instance of the Engine that is running, even if it is running
on another node.

> The status line showed "discovering regions" and the progress bar
> went from left to right.

Your instance of the Engine self destructed because of the
evms_failover script's critical need to run the Engine. The discovery
process was stopped. Your entire evmsgui session should have exited. I
noticed that the log says there was a segfault. evmsgui runs with
several threads. My guess is that the thread that was doing discovery
segfaulted during the self destruct sequence, and the remaining threads
kept running instead of exiting as they should. That brings up the
question of what segfaulted. Hard to tell from the log.

> Maybe the evmsd interfered with the evmsgui ?

The HA plug-in running under evmsd monitors the membership. If the
membership changes, it launches the evms_failover script. Since the
fail over sequence was initiated, it makes me think that perhaps the
cluster membership is not that stable. On our test cluster I have seen
the fail over sequence initiated when one daemon was started. Somehow
it thought the other node was dead, so it stonith'ed it and started
running evms_failover before I could start the daemon on the other
node. On rarer occasions I have been working on one node only to have
it go dead, and then discovered that the other node had killed it for
some reason. Granted, I am running a test system and things can be
flaky.

Do you have the hardware to do stonith? If so, a fail over would be
somewhat obvious since one node would be rebooted. If not, the fail
over would not be very obvious. It would be good to make sure that both
nodes are members of the cluster and that the membership is stable.
(I'm looking into modifying the evmsccm utility to display more
information about the membership.)

My guess is that the fail over is kicking in, which is why you cannot
configure gfs2 from gfs1 and vice versa.

I realize these are not definitive answers to your questions. I hope
they get us going in the right direction to solve your problems.

Steve D.