From: Marc G. <gr...@at...> - 2012-11-20 08:42:09
Jorge, let's first start with fencing. You are using ipmilan for fencing. I didn't evaluate that agent with rhel6, so let's start by fixing this issue. Try the following:

com-chroot /sbin/fence_ipmilan -h

Send me the output. There might be some libs missing.

The clvmd behaviour is very strange. Try to stay with locking_type=2 or locking_type=3. Then rebuild the initrd and reboot. If clvmd stays at 100% CPU, kill it and start it again manually with the -d flag. Send me the output. Perhaps we will see something from there.

Regards
Marc.

Am 19.11.2012 15:39, schrieb Jorge Silva:
> Marc
>
> Hi, np, thanks for helping. The /var/run/cman* files are there. I will
> disable the clustered flag on the second volume. Even more disturbing:
> after the last email I sent you, I went from a state where clvmd was
> behaving normally (not 100%) and I could access clustered volumes. I
> rebooted to verify that everything was functioning - but I am now
> back to the state where clvmd is running at 100% - back to where we
> started (can't access clustered volumes).
>
> locking_type=0
> [root@bwccs302 ~]# vgs
> WARNING: Locking disabled. Be careful! This could corrupt your metadata.
> VG #PV #LV #SN Attr VSize VFree
> VG_DATA1 1 2 0 wz--n- 64.00g 4.20g
> vg_osroot 1 1 0 wz--n- 60.00g 0
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 23207 root 2 -18 101m 23m 3176 S *99.9* 0.1 0:05.82 clvmd
>
> lrwxrwxrwx 1 root root 41 Nov 16 16:41 /var/run/cman_admin ->
> /var/comoonics/chroot//var/run/cman_admin
> lrwxrwxrwx 1 root root 42 Nov 16 16:41 /var/run/cman_client ->
> /var/comoonics/chroot//var/run/cman_client
>
> locking_type=3
> [root@bwccs302 ~]# service clvmd restart
> Restarting clvmd: [ OK ]
> [root@bwccs302 ~]# vgs
> cluster request failed: Invalid argument
> Can't get lock for VG_DATA1
> cluster request failed: Invalid argument
> Can't get lock for vg_osroot
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 23829 root 2 -18 167m 24m 3268 S *99.8* 0.1 0:31.29 clvm
>
> As far as the shutdown goes - with the two nodes up, once I issue the
> shutdown on node1, the shutdown proceeds to the point shown in the
> screenshots I sent (deactivating cluster services). On node2 I notice:
>
> corosync[16648]: [TOTEM ] A processor failed, forming new configuration.
>
> It attempts to fence node1. After 3 unsuccessful attempts, it locks up.
> Node1 stays stuck (screen dump I sent). I do a tcpdump and I see the
> two nodes are still sending multicast messages, and until I reset
> node1, node2 will stay in a locked state with no access... this is the
> last set of messages I see:
>
> fenced[16784]: fence smc01b dev 0.0 agent fence_ipmilan result: error
> from agent
> fenced[16784]: fence smc01b failed
>
> After 3 attempts, as fencing failed, the cluster locks up until I have
> reset the node.
> I suspect there is another issue at play here, as I can manually fence
> a node using fence_node x (I will continue to dig into this; I have
> tried fenced -q and messagebus with the same result).
>
> Thanks
> Jorge
>
> On Mon, Nov 19, 2012 at 3:01 AM, Marc Grimme <gr...@at...
> <mailto:gr...@at...>> wrote:
>
> Hi Jorge,
> sorry for the delay, but I was quite busy over the last days.
> Nevertheless, I don't understand the problem yet.
> Let's first start at the point I think could lead to problems
> during shutdown and friends.
> Are the control files in /var/run/cman* being created by the
> bootsr initscript, or do you still have to create them manually?
> If they are not created I would still be very interested in the > output of > bash -x /etc/init.d/bootsr start > after a node has been started. > > If not we need to dig deeper into the problems during shutdown. > I would then also change the clustered flag for the other volume > group. > Again as long as you don't change the size it wont hurt. > And it's only for better understanding the problem. > > Another command I'd like to see is a cman_tool services on the > other node (say node 2) while the shutdown node is being stuck > (say node 1). > > Thanks Marc. > Am 15.11.2012 19:08, schrieb Jorge Silva: >> Marc >> >> Hi, I believe the problem is related to the clsuter services not >> shutting down. init 0, will not work with 1 or more nodes, init >> 6 will only work when 1 node is present. When more than 1 node >> is present the node with the init 6 will have to be fenced as it >> will not shut down. I believe the cluster components aren't >> shutting down (this also happens with init 6 when more than one >> node is present) - I still see cluster traffic on the network, >> this is periodic. >> >> 12:42:00.547615 IP 172.17.62.12.hpoms-dps-lstn > >> 229.192.0.2.netsupport: UDP, length 119 >> >> At the point that the system will not shut down, it still is a >> cluster member and there is still cluster traffic. >> >> 1 node : >> [root@bwccs302 ~]# init 0 >> >> Can't connect to default. Skipping. >> Shutting down Cluster Module - cluster monitor: [ OK ] >> Shutting down ricci: [ OK ] >> Shutting down Avahi daemon: [ OK ] >> Shutting down oddjobd: [ OK ] >> Stopping saslauthd: [ OK ] >> Stopping sshd: [ OK ] >> Shutting down sm-client: [ OK ] >> Shutting down sendmail: [ OK ] >> Stopping imsd via sshd: [ OK ] >> Stopping snmpd: [ OK ] >> Stopping crond: [ OK ] >> Stopping HAL daemon: [ OK ] >> Shutting down ntpd: [ OK ] >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" now active >> [ OK ] >> Signaling clvmd to exit [ OK ] >> clvmd terminated[ OK ] >> Stopping lldpad: [ OK ] >> Stopping system message bus: [ OK ] >> Stopping multipathd daemon: [ OK ] >> Stopping rpcbind: [ OK ] >> Stopping auditd: [ OK ] >> Stopping nslcd: [ OK ] >> Shutting down system logger: [ OK ] >> Stopping sssd: [ OK ] >> Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ] >> Stopping gfs2 dependent services Starting clvmd: >> Activating VG(s): 2 logical volume(s) in volume group >> "VG_SDATA" now active >> 1 logical volume(s) in volume group "vg_osroot" now active >> [ OK ] >> osr(notice) ..bindmounts.. [ OK ] >> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in >> volume group "vg_osroot" unmonitored >> [ OK ] >> Sending all processes the TERM signal... [ OK ] >> Sending all processes the KILL signal... [ OK ] >> Saving random seed: [ OK ] >> Syncing hardware clock to system time [ OK ] >> Turning off quotas: quotaoff: Cannot change state of GFS2 quota. >> quotaoff: Cannot change state of GFS2 quota. >> [FAILED] >> Unmounting file systems: [ OK ] >> init: Re-executing /sbin/init >> Halting system... >> osr(notice) Scanning for Bootparameters... 
>> osr(notice) Starting ATIX exitrd >> osr(notice) Comoonics-Release >> osr(notice) comoonics Community Release 5.0 (Gumpn) >> osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 >> 15:09:53 $ >> osr(debug) Calling cmd /sbin/halt -d -p >> osr(notice) Preparing chrootcp: cannot stat >> `/mnt/newroot/dev/initctl': No such file or directory >> [ OK ] >> osr(notice) com-realhalt: detected distribution: rhel6, clutype: >> gfs, rootfs: gfs2 >> osr(notice) Restarting init process in chroot[ OK ] >> osr(notice) Moving dev filesystem[ OK ] >> osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys >> /mnt/newroot/proc) >> osr(notice) Umounting /mnt/newroot/sys[ OK ] >> osr(notice) Umounting /mnt/newroot/proc[ OK ] >> osr(notice) Umounting filesystems in oldroot >> (/mnt/newroot/var/run /mnt/newroot/var/lock /mnt/newroot/.cdsl.local) >> osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing >> /sbin/init >> [ OK ] >> osr(notice) Umounting /mnt/newroot/var/lock[ OK ] >> osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] >> osr(notice) Umounting oldroot /mnt/newroot[ OK ] >> osr(notice) Breakpoint "halt_umountoldroot" detected forking a shell >> bash: no job control in this shell >> >> Type help to get more information.. >> Type exit to continue work.. >> ------------------------------------------------------------- >> >> comoonics 1 > cman_tool: unknown option cman_tool >> comoonics 2 > comoonics 2 > Version: 6.2.0 >> Config Version: 1 >> Cluster Name: ProdCluster01 >> Cluster Id: 11454 >> Cluster Member: Yes >> Cluster Generation: 4 >> Membership state: Cluster-Member >> Nodes: 1 >> Expected votes: 4 >> Quorum device votes: 3 >> Total votes: 4 >> Node votes: 1 >> Quorum: 3 >> Active subsystems: 10 >> Flags: >> Ports Bound: 0 11 178 >> Node name: smc01b >> Node ID: 2 >> Multicast addresses: 229.192.0.2 >> Node addresses: 172.17.62.12 >> comoonics 3 > fence domain >> member count 1 >> victim count 0 >> victim now 0 >> master nodeid 2 >> wait state none >> members 2 >> >> dlm lockspaces >> name clvmd >> id 0x4104eefa >> flags 0x00000000 >> change member 1 joined 1 remove 0 failed 0 seq 1,1 >> members 2 >> >> comoonics 4 > bash: exitt: command not found >> comoonics 5 > exit >> osr(notice) Back to work.. >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" now active >> >> It hung at the point above - so I re-ran with the edit set -x in >> line 207. >> 1 -node: >> [root@bwccs302 ~]# init 0 >> [root@bwccs302 ~ >> Can't connect to default. Skipping. >> Shutting down Cluster Module - cluster monitor: [ OK ] >> Shutting down ricci: [ OK ] >> Shutting down Avahi daemon: [ OK ] >> Shutting down oddjobd: [ OK ] >> Stopping saslauthd: [ OK ] >> Stopping sshd: [ OK ] >> Shutting down sm-client: [ OK ] >> Shutting down sendmail: [ OK ] >> Stopping imsd via sshd: [ OK ] >> Stopping snmpd: [ OK ] >> Stopping crond: [ OK ] >> Stopping HAL daemon: [ OK ] >> Shutting down ntpd: [ OK ] >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" n ow active >> [ OK ] >> Signaling clvmd to exit [ OK ] >> clvmd terminated[ OK ] >> Stopping lldpad: [ OK ] >> Stopping system message bus: [ OK ] >> Stopping multipathd daemon: [ OK ] >> Stopping rpcbind: [ OK ] >> Stopping auditd: [ OK ] >> Stopping nslcd: [ OK ] >> Shutting down system logger: [ OK ] >> Stopping sssd: [ OK ] >> Stopping gfs dependent services osr(notice) ..bindmounts.. 
[ OK ] >> Stopping gfs2 dependent services Starting clvmd: >> Activating VG(s): 1 logical volume(s) in volume group >> "vg_osroot" now active >> 2 logical volume(s) in volume group "VG_SDATA" now active >> [ OK ] >> osr(notice) ..bindmounts.. [ OK ] >> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in >> volume group "vg_ osroot" unmonitored >> [ OK ] >> Sending all processes the TERM signal... [ OK ] >> Sending all processes the KILL signal... [ OK ] >> Saving random seed: [ OK ] >> Syncing hardware clock to system time [ OK ] >> Turning off quotas: quotaoff: Cannot change state of GFS2 quota. >> quotaoff: Cannot change state of GFS2 quota. >> [FAILED] >> Unmounting file systems: [ OK ] >> init: Re-executing /sbin/init >> Halting system... >> osr(notice) Scanning for Bootparameters... >> osr(notice) Starting ATIX exitrd >> osr(notice) Comoonics-Release >> osr(notice) comoonics Community Release 5.0 (Gumpn) >> osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 >> 15:09:53 $ >> osr(notice) Preparing chrootcp: cannot stat >> `/mnt/newroot/dev/initctl': No such file or directory [ OK ] >> osr(notice) com-realhalt: detected distribution: rhel6, clutype: >> gfs, rootfs: gfs2 >> osr(notice) Restarting init process in chroot[ OK ] >> osr(notice) Moving dev filesystem[ OK ] >> osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys >> /mnt/newroot/proc) >> osr(notice) Umounting /mnt/newroot/sys[ OK ] >> osr(notice) Umounting /mnt/newroot/proc[ OK ] >> osr(notice) Umounting filesystems in oldroot >> (/mnt/newroot/var/run /mnt/newroot/var/lock /mnt/newroot/.cdsl.local) >> osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing >> /sbin/init [ OK ] >> osr(notice) Umounting /mnt/newroot/var/lock[ OK ] >> osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] >> osr(notice) Umounting oldroot /mnt/newroot[ OK ] >> + clusterfs_services_stop '' '' 0 >> ++ repository_get_value rootfs >> +++ repository_normalize_value rootfs >> ++ local key=rootfs >> ++ local default= >> ++ local repository= >> ++ '[' -z '' ']' >> ++ repository=comoonics >> ++ local value= >> ++ '[' -f /var/cache/comoonics-repository/comoonics.rootfs ']' >> +++ cat /var/cache/comoonics-repository/comoonics.rootfs >> ++ value=gfs2 >> ++ echo gfs2 >> ++ return 0 >> + local rootfs=gfs2 >> + gfs2_services_stop '' '' 0 >> + local chroot_path= >> + local lock_method= >> + local lvm_sup=0 >> + '[' -n 0 ']' >> + '[' 0 -eq 0 ']' >> + /etc/init.d/clvmd stop >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" now active >> >> with 2 nodes + quorate when init 6 is issued: >> >> [root@bwccs304 ~]# init 6 >> [root@bwccs304 ~ >> Can't connect to default. Skipping. 
>> Shutting down Cluster Module - cluster monitor: [ OK ] >> Shutting down ricci: [ OK ] >> Shutting down Avahi daemon: [ OK ] >> Shutting down oddjobd: [ OK ] >> Stopping saslauthd: [ OK ] >> Stopping sshd: [ OK ] >> Shutting down sm-client: [ OK ] >> Shutting down sendmail: [ OK ] >> Stopping imsd via sshd: [ OK ] >> Stopping snmpd: [ OK ] >> Stopping crond: [ OK ] >> Stopping HAL daemon: [ OK ] >> Shutting down ntpd: [ OK ] >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" now active >> [ OK ] >> Signaling clvmd to exit [ OK ] >> clvmd terminated[ OK ] >> Stopping lldpad: [ OK ] >> Stopping system message bus: [ OK ] >> Stopping multipathd daemon: [ OK ] >> Stopping rpcbind: [ OK ] >> Stopping auditd: [ OK ] >> Stopping nslcd: [ OK ] >> Shutting down system logger: [ OK ] >> Stopping sssd: [ OK ] >> Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ] >> Stopping gfs2 dependent services Starting clvmd: >> Activating VG(s): 1 logical volume(s) in volume group >> "vg_osroot" now active >> 2 logical volume(s) in volume group "VG_SDATA" now active >> [ OK ] >> osr(notice) ..bindmounts.. [ OK ] >> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in >> volume group "vg_osroot" unmonitored >> [ OK ] >> Sending all processes the TERM signal... [ OK ] >> qdiskd[15713]: Unregistering quorum device. >> >> Sending all processes the KILL signal... dlm: clvmd: no userland >> control daemon, stopping lockspace >> dlm: OSRoot: no userland control daemon, stopping lockspace >> [ OK ] >> - stops here and will not die... Still have full cluster coms >> >> Thanks >> jorge >> >> On Tue, Nov 13, 2012 at 9:32 AM, Marc Grimme <gr...@at... >> <mailto:gr...@at...>> wrote: >> >> Hi Jorge, >> because of the "init 0". >> Please issue the following commands prior to init 0. >> # Make it a little more chatty >> $ com-chroot setparameter debug >> # Break after before cluster will be stopped >> $ com-chroot setparameter step halt_umountoldroot >> >> Then issue a init 0. >> This should lead you to a breakpoint during shutdown >> (hopefully, cause sometimes the console gets confused). >> In side the breakpoint issue: >> $ cman_tool status >> $ cman_tool services >> # Continue shutdown >> $ exit >> Then send me the output. >> >> If this fails also do as follows: >> $ com-chroot vi com-realhalt.sh >> # go to line 207 (before clusterfs_services_stop) is called >> and add a set -x >> $ init 0 >> >> Send the output. >> Thanks Marc. >> >> ----- Original Message ----- >> From: "Jorge Silva" <me...@je... <mailto:me...@je...>> >> To: "Marc Grimme" <gr...@at... <mailto:gr...@at...>> >> Cc: ope...@li... >> <mailto:ope...@li...> >> Sent: Tuesday, November 13, 2012 3:22:37 PM >> Subject: Re: Problem with VG activation clvmd runs at 100% >> >> Marc >> >> >> Hi, thanks for the info, it helps. I have also noticed that >> gfs2 entries in the fstab get ignored on boot, I have added >> in rc.local. I have done a bit more digging and the issue I >> described below: >> >> >> "I am still a bit stuck when nodes with gfs2 mounted don't >> restart if instructed to do so, but I will read some more." >> >> >> If I issue a init 6 on a nodes they will restart. If I issue >> init 0, then I have the problem the node start to shut down, >> but will stay in the cluster. I have to shut it off, it will >> not shut down, this is the log. >> >> >> >> [root@bwccs304 ~]# init 0 >> >> >> Can't connect to default. Skipping. 
>> Shutting down Cluster Module - cluster monitor: [ OK ] >> Shutting down ricci: [ OK ] >> Shutting down oddjobd: [ OK ] >> Stopping saslauthd: [ OK ] >> Stopping sshd: [ OK ] >> Shutting down sm-client: [ OK ] >> Shutting down sendmail: [ OK ] >> Stopping imsd via sshd: [ OK ] >> Stopping snmpd: [ OK ] >> Stopping crond: [ OK ] >> Stopping HAL daemon: [ OK ] >> Stopping nscd: [ OK ] >> Shutting down ntpd: [ OK ] >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" now active >> [ OK ] >> Signaling clvmd to exit [ OK ] >> clvmd terminated[ OK ] >> Stopping lldpad: [ OK ] >> Stopping system message bus: [ OK ] >> Stopping multipathd daemon: [ OK ] >> Stopping rpcbind: [ OK ] >> Stopping auditd: [ OK ] >> Stopping nslcd: [ OK ] >> Shutting down system logger: [ OK ] >> Stopping sssd: [ OK ] >> Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ] >> Stopping gfs2 dependent services Starting clvmd: >> Activating VG(s): 2 logical volume(s) in volume group >> "VG_SDATA" now active >> 1 logical volume(s) in volume group "vg_osroot" now active >> [ OK ] >> osr(notice) ..bindmounts.. [ OK ] >> Stopping monitoring for VG VG_SDATA: 1 logical volume(s) in >> volume group "VG_SDATA" unmonitored >> [ OK ] >> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in >> volume group "vg_osroot" unmonitored >> [ OK ] >> Sending all processes the TERM signal... [ OK ] >> Sending all processes the KILL signal... [ OK ] >> Saving random seed: [ OK ] >> Syncing hardware clock to system time [ OK ] >> Turning off quotas: quotaoff: Cannot change state of GFS2 quota. >> quotaoff: Cannot change state of GFS2 quota. >> [FAILED] >> Unmounting file systems: [ OK ] >> init: Re-executing /sbin/init >> Halting system... >> osr(notice) Scanning for Bootparameters... >> osr(notice) Starting ATIX exitrd >> osr(notice) Comoonics-Release >> osr(notice) comoonics Community Release 5.0 (Gumpn) >> osr(notice) Internal Version $Revision: 1.18 $ $Date: >> 2011-02-11 15:09:53 $ >> osr(notice) Preparing chrootcp: cannot stat >> `/mnt/newroot/dev/initctl': No such file or directory >> [ OK ] >> osr(notice) com-realhalt: detected distribution: rhel6, >> clutype: gfs, rootfs: gfs2 >> osr(notice) Restarting init process in chroot[ OK ] >> osr(notice) Moving dev filesystem[ OK ] >> osr(notice) Umounting filesystems in oldroot ( >> /mnt/newroot/sys /mnt/newroot/proc) >> osr(notice) Umounting /mnt/newroot/sys[ OK ] >> osr(notice) Umounting /mnt/newroot/proc[ OK ] >> osr(notice) Umounting filesystems in oldroot >> (/mnt/newroot/var/run /mnt/newroot/var/lock >> /mnt/newroot/.cdsl.local) >> osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing >> /sbin/init >> [ OK ] >> osr(notice) Umounting /mnt/newroot/var/lock[ OK ] >> osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] >> osr(notice) Umounting oldroot /mnt/newroot[ OK ] >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" now active >> >> >> >> >> >> On Tue, Nov 13, 2012 at 2:43 AM, Marc Grimme < gr...@at... >> <mailto:gr...@at...> > wrote: >> >> >> Jorge, >> you don't need to be doubtful about the fact that the volume >> group for the root file system is not flagged as clustered. >> This has no implications whatsoever on the gfs2 file system. >> >> It will only be a problem whenever the lvm settings of the >> vg_osroot change (size, number of lvs etc.). 
>>
>> Nevertheless, while thinking about your problem, I think I have an
>> idea of how to fix this so that the root vg can be clustered as well.
>> I will provide new packages in the next days that should deal with
>> the problem.
>>
>> Keep in mind that there is a difference between cman_tool
>> services and the lvm usage.
>> clvmd only uses the locktable "clvmd" shown by cman_tool
>> services; the other locktables are relevant to the file
>> systems and other services (fenced, rgmanager, ..). This is a
>> completely different use case.
>>
>> Try to elaborate a bit more on this point:
>>
>> "I am still a bit stuck when nodes with gfs2 mounted don't
>> restart if instructed to do so, but I will read some more."
>> What do you mean by it? How does this happen? This sounds
>> like something you should have a look at.
>>
>>
>> "One thing that I can confirm is
>> osr(notice): Detecting nodeid & nodename
>> This does not always display the correct info, but it doesn't
>> seem to be a problem either?"
>>
>> You should always look at the nodeid; the nodename is (more or
>> less) only descriptive and might not be set as expected. But
>> the nodeid should always be consistent. Does this help?
>>
>> About your notes (I only take the relevant ones):
>>
>> 1. osr(notice): Creating clusterfiles /var/run/cman_admin
>> /var/run/cman_client.. [OK]
>> This message should not be misleading; it only tells you that
>> these control files are being created inside the ramdisk.
>> This has nothing to do with these files on your root file
>> system. Nevertheless, /etc/init.d/bootsr should take over this
>> part and create the files. Please send me another
>> bash -x /etc/init.d/bootsr start
>> output - taken while those files are missing.
>>
>> 2. vgs
>>
>> VG #PV #LV #SN Attr VSize VFree
>> VG_SDATA 1 2 0 wz--nc 1000.00g 0
>> vg_osroot 1 1 0 wz--n- 60.00g 0
>>
>> This is perfectly ok. It only means the vg is not
>> clustered. But the filesystem IS. The two have no
>> connection.
>>
>> Hope this helps.
>> Let me know about the open issues.
>>
>> Regards
>>
>> Marc.
>>
>>
>> ----- Original Message -----
>> From: "Jorge Silva" < me...@je... <mailto:me...@je...> >
>> To: "Marc Grimme" < gr...@at... <mailto:gr...@at...> >
>>
>> Sent: Tuesday, November 13, 2012 2:15:23 AM
>> Subject: Re: Problem with VG activation clvmd runs at 100%
>>
>>
>> Marc
>>
>>
>> Hi - I believe I have solved my problem, with your help, thank you.
>> Yet, I'm not sure how I caused it - but the root volume group, as
>> you pointed out, had the clustered attribute (and I must have done
>> something silly along the way). I re-installed from scratch (see
>> notes below) and then, just to prove that it is a problem, I changed
>> the attribute of the root VG (vgchange -cy) and rebooted, and I ran
>> into trouble. I changed it back and it is fine, so that does cause
>> problems on start-up. I'm not sure I understand why, as there is an
>> active quorum for clvmd to join and take part in.
>>
>>
>> Despite it not being marked as a clustered volume, cman_tool
>> services shows it as such, but clvmd status doesn't? Is it
>> safe to write to it with multiple nodes mounted?
>>
>>
>> I am still a bit stuck when nodes with gfs2 mounted don't
>> restart if instructed to do so, but I will read some more.
>>
>>
>> One thing that I can confirm is
>> osr(notice): Detecting nodeid & nodename
>>
>>
>> This does not always display the correct info, but it doesn't
>> seem to be a problem either?
>>
>>
>> Thanks
>> Jorge
>>
>>
>> Notes:
>> I decided to start over: I blew away the rootfs and started from
>> scratch as per the website. My assumption is that I edited something
>> and messed it up (I did look at a lot of the scripts to try to
>> "figure out and fix" the problem; I can send the history if you
>> want, or I can edit and contribute).
>>
>>
>> I rebooted the server and I had an issue - I didn't disable selinux,
>> so I had to intervene in the boot stage. That completed, but I
>> noticed that:
>>
>>
>> osr(notice): Starting network configuration for lo0 [OK]
>> osr(notice): Detecting nodeid & nodename
>>
>>
>> is blank, but somehow the correct nodeid and name were deduced.
>>
>>
>> I had to rebuild the ramdisk so that the disabled selinux was picked
>> up. I also added the following:
>>
>> yum install pciutils - the mkinitrd warned about this, so I installed it.
>> I also installed:
>> yum install cluster-snmp
>> yum install rgmanager
>> in lvm
>>
>>
>> On this reboot I noticed that, despite this message
>>
>> sr(notice): Creating clusterfiles /var/run/cman_admin
>> /var/run/cman_client.. [OK]
>>
>>
>> Starting clvmd: dlm: Using TCP for communications
>>
>>
>> Activating VG(s): File descriptor 3 (/dev/console) leaked on
>> vgchange invocation. Parent PID 15995: /bin/bash
>> File descriptor 4 (/dev/console) leaked on vgchange
>> invocation. Parent PID 15995: /bin/bash
>> Skipping clustered volume group VG_SDATA
>> 1 logical volume(s) in volume group "vg_osroot" now active
>>
>>
>> the links weren't created, and I did this manually:
>>
>>
>> ln -sf /var/comoonics/chroot//var/run/cman_admin
>> /var/run/cman_admin
>> ln -sf /var/comoonics/chroot//var/run/cman_client
>> /var/run/cman_client
>>
>>
>> I could then get clusterstatus etc., and clvmd was running ok.
>>
>>
>> I looked in /etc/lvm/lvm.conf and locking_type = 4 ?
>>
>>
>> I then issued
>>
>>
>> lvmconf --enable-cluster - and this changed /etc/lvm/lvm.conf to
>> locking_type = 3.
>>
>>
>> vgscan correctly showed the clustered volumes and was working ok.
>>
>>
>> I did not rebuild the ramdisk (I can confirm that the lvm.conf in the
>> ramdisk has locking_type=4). I have rebooted and everything is working.
>>
>> Starting clvmd: dlm: Using TCP for communications
>>
>>
>> Activating VG(s): File descriptor 3 (/dev/console) leaked on
>> vgchange invocation. Parent PID 15983: /bin/bash
>> File descriptor 4 (/dev/console) leaked on vgchange
>> invocation. Parent PID 15983: /bin/bash
>> Skipping clustered volume group VG_SDATA
>> 1 logical volume(s) in volume group "vg_osroot" now active
>>
>>
>> I have rebooted a number of times and am confident that things are ok.
>>
>>
>> I decided to add two other nodes to the mix, and I can confirm that
>> every time a new node is added these files are missing:
>>
>>
>> /var/run/cman_admin
>> /var/run/cman_client
>>
>> But I can see from the logs:
>>
>>
>> osr(notice): Creating clusterfiles /var/run/cman_admin
>> /var/run/cman_client.. [OK]
>>
>>
>> Despite the above message, also, the information below is not
>> always detected, but still the nodeid etc. is correct...
>>
>>
>> osr(notice): Detecting nodeid & nodename
>>
>>
>> So now I have 3 nodes in the cluster and things look ok:
>>
>>
>> [root@bwccs302 ~]# cman_tool services
>> fence domain
>> member count 3
>> victim count 0
>> victim now 0
>> master nodeid 2
>> wait state none
>> members 2 3 4
>>
>>
>> dlm lockspaces
>> name home
>> id 0xf8ee17aa
>> flags 0x00000008 fs_reg
>> change member 3 joined 1 remove 0 failed 0 seq 3,3
>> members 2 3 4
>>
>>
>> name clvmd
>> id 0x4104eefa
>> flags 0x00000000
>> change member 3 joined 1 remove 0 failed 0 seq 15,15
>> members 2 3 4
>>
>>
>> name OSRoot
>> id 0xab5404ad
>> flags 0x00000008 fs_reg
>> change member 3 joined 1 remove 0 failed 0 seq 7,7
>> members 2 3 4
>>
>>
>> gfs mountgroups
>> name home
>> id 0x686e3fc4
>> flags 0x00000048 mounted
>> change member 3 joined 1 remove 0 failed 0 seq 3,3
>> members 2 3 4
>>
>>
>> name OSRoot
>> id 0x659f7afe
>> flags 0x00000048 mounted
>> change member 3 joined 1 remove 0 failed 0 seq 7,7
>> members 2 3 4
>>
>>
>> service clvmd status
>> clvmd (pid 25771) is running...
>> Clustered Volume Groups: VG_SDATA
>> Active clustered Logical Volumes: LV_HOME LV_DEVDB
>>
>>
>> It doesn't believe that the root file system is clustered, despite
>> the output from the above:
>>
>>
>> [root@bwccs302 ~]# vgs
>> VG #PV #LV #SN Attr VSize VFree
>> VG_SDATA 1 2 0 wz--nc 1000.00g 0
>> vg_osroot 1 1 0 wz--n- 60.00g 0
>>
>>
>> The above got me thinking about what you wanted me to do to disable
>> the clustered flag on the root volume - with it left on I was having
>> problems (not sure how it got turned on).
>>
>>
>> With everything working ok, I remade the ramdisk, and lvm.conf now
>> has locking_type = 3.
>>
>>
>> The systems start up and things look ok.
>>
>

-- 
Marc Grimme
E-Mail: grimme( at )atix.de

ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 |
85716 Unterschleissheim | www.atix.de | www.comoonics.org

Registergericht: Amtsgericht Muenchen, Registernummer: HRB 168930,
USt.-Id.: DE209485962 | Vorstand: Marc Grimme, Mark Hlawatschek,
Thomas Merz (Vors.) | Vorsitzender des Aufsichtsrats: Dr. Martin Buss
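
For readers working through the same setup, here is the fence-agent check from the top of the thread collected into one place. It is only a sketch: the ldd step and the placeholder BMC address/credentials are illustrative assumptions, not commands the participants actually ran.

# 1. Does the agent start inside the comoonics chroot at all?
$ com-chroot /sbin/fence_ipmilan -h

# 2. Are any shared libraries missing inside the chroot (if ldd is available there)?
$ com-chroot ldd /sbin/fence_ipmilan | grep "not found"

# 3. Can the agent reach the peer's BMC? (address and credentials are placeholders)
$ com-chroot /sbin/fence_ipmilan -a <bmc-ip-of-smc01b> -l <user> -p <password> -o status

# 4. Does manual fencing through the cluster stack still work, as Jorge reports?
$ fence_node smc01b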
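
Likewise, a minimal sketch of the clvmd/locking steps Marc suggests (stay with locking_type 2 or 3, rebuild the initrd, restart clvmd with debugging). The exact comoonics initrd rebuild command is not named in the thread, so it is only referenced in a comment; lvmconf and clvmd -d are standard LVM2 tools.

# Which locking type is configured on the root file system?
$ grep locking_type /etc/lvm/lvm.conf

# Switch to cluster locking via clvmd (sets locking_type = 3)
$ lvmconf --enable-cluster

# Rebuild the open-shared-root initrd with the same comoonics mkinitrd wrapper
# used at install time, so the ramdisk carries the matching lvm.conf, then reboot.

# If clvmd still spins at 100% CPU, stop it and start it manually with debugging:
$ service clvmd stop || kill $(pidof clvmd)
$ clvmd -d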
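
Finally, since much of the thread turns on whether a volume group carries the clustered flag, a short sketch of how to inspect and change it. vgchange -cy appears in the thread; -cn is its counterpart for clearing the flag.

# A trailing 'c' in the Attr column (e.g. wz--nc) marks a clustered VG
$ vgs -o vg_name,vg_attr

# Clear the clustered flag on the root VG; if cluster locking is not available
# at that moment, LVM may require temporarily relaxing the locking type.
$ vgchange -cn vg_osroot

# Set the flag again later if desired
$ vgchange -cy vg_osroot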