From: Marc G. <gr...@at...> - 2012-11-20 21:57:25
Jorge,

please send me the following information: /etc/cluster/cluster.conf (for the fencing configuration).

When you started clvmd with the -d option, it looks like locking_type=0. Are you sure that locking_type is 3 there?

WARNING: Locking disabled. Be careful! This could corrupt your metadata.

Another option would be to start clvmd under strace and send me the trace:

strace -t -T -o /tmp/clvmd-strace.out clvmd

Perhaps I can see some odd things from there.
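
Independently of that, it might be worth comparing the lvm.conf on the root filesystem with the copy baked into the initrd, since the boot environment keeps its own. A rough sketch (untested; the initrd filename is a guess, use whatever your bootloader actually loads):

grep locking_type /etc/lvm/lvm.conf
mkdir /tmp/initrd.d && cd /tmp/initrd.d
zcat /boot/initrd_sr-$(uname -r).img | cpio -idm '*etc/lvm/lvm.conf'
grep locking_type etc/lvm/lvm.conf

If the two values differ, that would fit what we see: clvmd starting with one locking type and the tools running with another.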
Regards Marc.

On 20.11.2012 14:35, Jorge Silva wrote:
> Marc
>
> Hi, I have confirmed that locking_type=3, rebuilt the initrd and
> rebooted; attached is the boot log. clvmd -d:
>
> [root@bwccs302 ~]# clvmd -d
> CLVMD[560ec7a0]: Nov 20 08:30:43 CLVMD started
> CLVMD[560ec7a0]: Nov 20 08:30:43 Connected to CMAN
> CLVMD[560ec7a0]: Nov 20 08:30:43 CMAN initialisation complete
> CLVMD[560ec7a0]: Nov 20 08:30:43 Opened existing DLM lockspace for CLVMD.
> CLVMD[560ec7a0]: Nov 20 08:30:43 DLM initialisation complete
> CLVMD[560ec7a0]: Nov 20 08:30:43 Cluster ready, doing some more initialisation
> CLVMD[560ec7a0]: Nov 20 08:30:43 starting LVM thread
> CLVMD[560eb700]: Nov 20 08:30:43 LVM thread function started
> WARNING: Locking disabled. Be careful! This could corrupt your metadata.
> CLVMD[560eb700]: Nov 20 08:30:43 Sub thread ready for work.
> CLVMD[560ec7a0]: Nov 20 08:30:43 clvmd ready for work
> CLVMD[560eb700]: Nov 20 08:30:43 LVM thread waiting for work
> CLVMD[560ec7a0]: Nov 20 08:30:43 Using timeout of 60 seconds
>
> Output from top:
>   PID USER PR NI VIRT RES SHR  S %CPU %MEM TIME+   COMMAND
> 27261 root 20  0 101m 23m 3176 S 99.7  0.1 0:58.69 clvmd
>
> ipmilan:
> [root@bwccs302 ~]# com-chroot /sbin/fence_ipmilan -h
> bash: /sbin/fence_ipmilan: No such file or directory
>
> I checked; it is in /usr/sbin/fence_ipmilan:
>
> [root@bwccs302 ~]# com-chroot /usr/sbin/fence_ipmilan -h
> usage: fence_ipmilan <options>
>    -A <authtype>  IPMI Lan Auth type (md5, password, or none)
>    -a <ipaddr>    IPMI Lan IP to talk to
>    -i <ipaddr>    IPMI Lan IP to talk to (deprecated, use -a)
>    -p <password>  Password (if required) to control power on IPMI device
>    -P             Use Lanplus
>    -S <path>      Script to retrieve password (if required)
>    -l <login>     Username/Login (if required) to control power on IPMI device
>    -L <privlvl>   IPMI privilege level. Defaults to ADMINISTRATOR.
>                   See ipmitool(1) for more info.
>    -o <op>        Operation to perform.
>                   Valid operations: on, off, reboot, status, diag, list or monitor
>    -t <timeout>   Timeout (sec) for IPMI operation (default 20)
>    -T <timeout>   Wait X seconds after on/off operation
>    -f <timeout>   Wait X seconds before fencing is started
>    -C <cipher>    Ciphersuite to use (same as ipmitool -C parameter)
>    -M <method>    Method to fence (onoff or cycle; default onoff)
>    -V             Print version and exit
>    -v             Verbose mode
>
> If no options are specified, the following options will be read
> from standard input (one per line):
>
>    auth=<auth>            Same as -A
>    ipaddr=<#>             Same as -a
>    passwd=<pass>          Same as -p
>    passwd_script=<path>   Same as -S
>    lanplus                Same as -P
>    login=<login>          Same as -l
>    option=<op>            Same as -o
>    operation=<op>         Same as -o
>    action=<op>            Same as -o
>    delay=<seconds>        Same as -f
>    timeout=<timeout>      Same as -t
>    power_wait=<time>      Same as -T
>    cipher=<cipher>        Same as -C
>    method=<method>        Same as -M
>    privlvl=<privlvl>      Same as -L
>    verbose                Same as -v
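>
> (Since fenced feeds the agent these options over stdin, a manual
> status check can be driven the same way; the address and credentials
> below are placeholders, not my real ones:
>
> /usr/sbin/fence_ipmilan <<EOF
> ipaddr=10.0.0.1
> login=admin
> passwd=secret
> lanplus
> action=status
> EOF
> )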
>
> On Tue, Nov 20, 2012 at 3:41 AM, Marc Grimme <gr...@at...> wrote:
>
> Jorge,
> let's first start with fencing.
> You are using ipmilan for fencing. I didn't evaluate the agent with
> rhel6, so let's start fixing this issue. Try the following:
>
> com-chroot /sbin/fence_ipmilan -h
>
> Send me the output. There might be some libs missing.
>
> The clvmd behaviour is very strange. Try to stay with locking_type=2
> or locking_type=3. Then rebuild an initrd and reboot.
> If clvmd stays at 100% CPU, kill it and start it again manually with
> the -d flag. Send me the output. Perhaps we see something from there.
>
> Regards Marc.
> On 19.11.2012 15:39, Jorge Silva wrote:
>> Marc
>>
>> Hi, no problem, thanks for helping. The /var/run/cman* files are
>> there. I will disable the clustered flag on the second volume group.
>> Even more disturbing: after the last email I sent you, I went from a
>> state where clvmd was behaving normally (not 100%) and I could
>> access clustered volumes. I rebooted to verify that everything was
>> functioning, but I am now back to the state where clvmd is running
>> at 100% and I can't access clustered volumes; back to where we
>> started.
>>
>> locking_type=0
>> [root@bwccs302 ~]# vgs
>> WARNING: Locking disabled. Be careful! This could corrupt your metadata.
>>   VG        #PV #LV #SN Attr   VSize  VFree
>>   VG_DATA1    1   2   0 wz--n- 64.00g 4.20g
>>   vg_osroot   1   1   0 wz--n- 60.00g     0
>>
>>   PID USER PR NI  VIRT RES SHR  S %CPU   %MEM TIME+   COMMAND
>> 23207 root  2 -18 101m 23m 3176 S *99.9*  0.1 0:05.82 clvmd
>>
>> lrwxrwxrwx 1 root root 41 Nov 16 16:41 /var/run/cman_admin ->
>> /var/comoonics/chroot//var/run/cman_admin
>> lrwxrwxrwx 1 root root 42 Nov 16 16:41 /var/run/cman_client ->
>> /var/comoonics/chroot//var/run/cman_client
>>
>> locking_type=3
>> [root@bwccs302 ~]# service clvmd restart
>> Restarting clvmd: [ OK ]
>> [root@bwccs302 ~]# vgs
>> cluster request failed: Invalid argument
>> Can't get lock for VG_DATA1
>> cluster request failed: Invalid argument
>> Can't get lock for vg_osroot
>>   PID USER PR NI  VIRT RES SHR  S %CPU   %MEM TIME+   COMMAND
>> 23829 root  2 -18 167m 24m 3268 S *99.8*  0.1 0:31.29 clvm
>>
>> As far as the shutdown: with the two nodes up, once I issue the
>> shutdown on node1, the shutdown proceeds to the point where I sent
>> the screenshots (deactivating cluster services). On node2 I notice:
>>
>> corosync[16648]: [TOTEM ] A processor failed, forming new configuration.
>>
>> It attempts to fence node1. After 3 unsuccessful attempts, it locks
>> up. Node1 stays stuck (screen dump I sent). I do a tcpdump and I see
>> the two nodes are still sending multicast messages, and until I
>> reset node1, node2 will stay in a locked state with no access...
>> this is the last set of messages I see:
>>
>> fenced[16784]: fence smc01b dev 0.0 agent fence_ipmilan result: error from agent
>> fenced[16784]: fence smc01b failed
>>
>> After 3 attempts, as fencing failed, the cluster locks up until I
>> have reset the node. I suspect there is another issue at play here,
>> as I can manually fence a node using fence_node x (I will continue
>> to dig into this; I have tried fenced -q and messagebus with the
>> same result).
>>
>> Thanks
>> Jorge
>>
>> On Mon, Nov 19, 2012 at 3:01 AM, Marc Grimme <gr...@at...> wrote:
>>
>> Hi Jorge,
>> sorry for the delay, but I was quite busy during the last days.
>> Nevertheless, I don't understand the problem.
>> Let's first start at the point I think could lead to problems during
>> shutdown. Are the control files in /var/run/cman* being created by
>> the bootsr initscript, or do you still have to create them manually?
>> If they are not created I would still be very interested in the
>> output of
>> bash -x /etc/init.d/bootsr start
>> after a node has been started.
>>
>> If not, we need to dig deeper into the problems during shutdown.
>> I would then also change the clustered flag for the other volume
>> group. Again, as long as you don't change the size it won't hurt,
>> and it's only for better understanding the problem.
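>>
>> (Changing the flag would be something like the following; just a
>> sketch, and the --config override should only be needed when clvmd
>> is not running at the time:
>>
>> vgchange -cn VG_DATA1 --config 'global {locking_type = 0}'
>> # and later, to set it again:
>> vgchange -cy VG_DATA1
>> )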
>>
>> Another command I'd like to see is a cman_tool services on the other
>> node (say node 2) while the shutting-down node (say node 1) is stuck.
>>
>> Thanks Marc.
>> On 15.11.2012 19:08, Jorge Silva wrote:
>>> Marc
>>>
>>> Hi, I believe the problem is related to the cluster services not
>>> shutting down. init 0 will not work with 1 or more nodes; init 6
>>> will only work when 1 node is present. When more than 1 node is
>>> present, the node with the init 6 will have to be fenced, as it
>>> will not shut down. I believe the cluster components aren't
>>> shutting down (this also happens with init 6 when more than one
>>> node is present); I still see cluster traffic on the network, and
>>> it is periodic:
>>>
>>> 12:42:00.547615 IP 172.17.62.12.hpoms-dps-lstn > 229.192.0.2.netsupport: UDP, length 119
>>>
>>> At the point where the system will not shut down, it is still a
>>> cluster member and there is still cluster traffic.
>>>
>>> 1 node:
>>> [root@bwccs302 ~]# init 0
>>>
>>> Can't connect to default. Skipping.
>>> Shutting down Cluster Module - cluster monitor: [ OK ]
>>> Shutting down ricci: [ OK ]
>>> Shutting down Avahi daemon: [ OK ]
>>> Shutting down oddjobd: [ OK ]
>>> Stopping saslauthd: [ OK ]
>>> Stopping sshd: [ OK ]
>>> Shutting down sm-client: [ OK ]
>>> Shutting down sendmail: [ OK ]
>>> Stopping imsd via sshd: [ OK ]
>>> Stopping snmpd: [ OK ]
>>> Stopping crond: [ OK ]
>>> Stopping HAL daemon: [ OK ]
>>> Shutting down ntpd: [ OK ]
>>> Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active
>>> [ OK ]
>>> Signaling clvmd to exit [ OK ]
>>> clvmd terminated [ OK ]
>>> Stopping lldpad: [ OK ]
>>> Stopping system message bus: [ OK ]
>>> Stopping multipathd daemon: [ OK ]
>>> Stopping rpcbind: [ OK ]
>>> Stopping auditd: [ OK ]
>>> Stopping nslcd: [ OK ]
>>> Shutting down system logger: [ OK ]
>>> Stopping sssd: [ OK ]
>>> Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ]
>>> Stopping gfs2 dependent services Starting clvmd:
>>> Activating VG(s): 2 logical volume(s) in volume group "VG_SDATA" now active
>>> 1 logical volume(s) in volume group "vg_osroot" now active
>>> [ OK ]
>>> osr(notice) ..bindmounts.. [ OK ]
>>> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume group "vg_osroot" unmonitored
>>> [ OK ]
>>> Sending all processes the TERM signal... [ OK ]
>>> Sending all processes the KILL signal... [ OK ]
>>> Saving random seed: [ OK ]
>>> Syncing hardware clock to system time [ OK ]
>>> Turning off quotas: quotaoff: Cannot change state of GFS2 quota.
>>> quotaoff: Cannot change state of GFS2 quota.
>>> [FAILED]
>>> Unmounting file systems: [ OK ]
>>> init: Re-executing /sbin/init
>>> Halting system...
>>> osr(notice) Scanning for Bootparameters...
>>> osr(notice) Starting ATIX exitrd
>>> osr(notice) Comoonics-Release
>>> osr(notice) comoonics Community Release 5.0 (Gumpn)
>>> osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 15:09:53 $
>>> osr(debug) Calling cmd /sbin/halt -d -p
>>> osr(notice) Preparing chroot
>>> cp: cannot stat `/mnt/newroot/dev/initctl': No such file or directory
>>> [ OK ]
>>> osr(notice) com-realhalt: detected distribution: rhel6, clutype: gfs, rootfs: gfs2
>>> osr(notice) Restarting init process in chroot [ OK ]
>>> osr(notice) Moving dev filesystem [ OK ]
>>> osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys /mnt/newroot/proc)
>>> osr(notice) Umounting /mnt/newroot/sys [ OK ]
>>> osr(notice) Umounting /mnt/newroot/proc [ OK ]
>>> osr(notice) Umounting filesystems in oldroot (/mnt/newroot/var/run /mnt/newroot/var/lock /mnt/newroot/.cdsl.local)
>>> osr(notice) Umounting /mnt/newroot/var/run
>>> init: Re-executing /sbin/init
>>> [ OK ]
>>> osr(notice) Umounting /mnt/newroot/var/lock [ OK ]
>>> osr(notice) Umounting /mnt/newroot/.cdsl.local [ OK ]
>>> osr(notice) Umounting oldroot /mnt/newroot [ OK ]
>>> osr(notice) Breakpoint "halt_umountoldroot" detected, forking a shell
>>> bash: no job control in this shell
>>>
>>> Type help to get more information..
>>> Type exit to continue work..
>>> -------------------------------------------------------------
>>>
>>> comoonics 1 > cman_tool: unknown option cman_tool
>>> comoonics 2 >
>>> comoonics 2 > Version: 6.2.0
>>> Config Version: 1
>>> Cluster Name: ProdCluster01
>>> Cluster Id: 11454
>>> Cluster Member: Yes
>>> Cluster Generation: 4
>>> Membership state: Cluster-Member
>>> Nodes: 1
>>> Expected votes: 4
>>> Quorum device votes: 3
>>> Total votes: 4
>>> Node votes: 1
>>> Quorum: 3
>>> Active subsystems: 10
>>> Flags:
>>> Ports Bound: 0 11 178
>>> Node name: smc01b
>>> Node ID: 2
>>> Multicast addresses: 229.192.0.2
>>> Node addresses: 172.17.62.12
>>> comoonics 3 > fence domain
>>> member count 1
>>> victim count 0
>>> victim now 0
>>> master nodeid 2
>>> wait state none
>>> members 2
>>>
>>> dlm lockspaces
>>> name clvmd
>>> id 0x4104eefa
>>> flags 0x00000000
>>> change member 1 joined 1 remove 0 failed 0 seq 1,1
>>> members 2
>>>
>>> comoonics 4 > bash: exitt: command not found
>>> comoonics 5 > exit
>>> osr(notice) Back to work..
>>> Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active
>>>
>>> It hung at the point above, so I re-ran it with set -x added at line 207.
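>>>
>>> (Reading the cman_tool status above: cman computes quorum as
>>> expected_votes/2 + 1 = 4/2 + 1 = 3, and the lone node still holds
>>> its own vote plus the quorum device's, 1 + 3 = 4 >= 3. So the node
>>> stays quorate entirely on its own, which is why the cluster stack
>>> is still fully alive this late in the shutdown.)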
>>>
>>> 1 node:
>>> [root@bwccs302 ~]# init 0
>>>
>>> Can't connect to default. Skipping.
>>> Shutting down Cluster Module - cluster monitor: [ OK ]
>>> Shutting down ricci: [ OK ]
>>> Shutting down Avahi daemon: [ OK ]
>>> Shutting down oddjobd: [ OK ]
>>> Stopping saslauthd: [ OK ]
>>> Stopping sshd: [ OK ]
>>> Shutting down sm-client: [ OK ]
>>> Shutting down sendmail: [ OK ]
>>> Stopping imsd via sshd: [ OK ]
>>> Stopping snmpd: [ OK ]
>>> Stopping crond: [ OK ]
>>> Stopping HAL daemon: [ OK ]
>>> Shutting down ntpd: [ OK ]
>>> Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active
>>> [ OK ]
>>> Signaling clvmd to exit [ OK ]
>>> clvmd terminated [ OK ]
>>> Stopping lldpad: [ OK ]
>>> Stopping system message bus: [ OK ]
>>> Stopping multipathd daemon: [ OK ]
>>> Stopping rpcbind: [ OK ]
>>> Stopping auditd: [ OK ]
>>> Stopping nslcd: [ OK ]
>>> Shutting down system logger: [ OK ]
>>> Stopping sssd: [ OK ]
>>> Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ]
>>> Stopping gfs2 dependent services Starting clvmd:
>>> Activating VG(s): 1 logical volume(s) in volume group "vg_osroot" now active
>>> 2 logical volume(s) in volume group "VG_SDATA" now active
>>> [ OK ]
>>> osr(notice) ..bindmounts.. [ OK ]
>>> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume group "vg_osroot" unmonitored
>>> [ OK ]
>>> Sending all processes the TERM signal... [ OK ]
>>> Sending all processes the KILL signal... [ OK ]
>>> Saving random seed: [ OK ]
>>> Syncing hardware clock to system time [ OK ]
>>> Turning off quotas: quotaoff: Cannot change state of GFS2 quota.
>>> quotaoff: Cannot change state of GFS2 quota.
>>> [FAILED]
>>> Unmounting file systems: [ OK ]
>>> init: Re-executing /sbin/init
>>> Halting system...
>>> osr(notice) Scanning for Bootparameters...
>>> osr(notice) Starting ATIX exitrd
>>> osr(notice) Comoonics-Release
>>> osr(notice) comoonics Community Release 5.0 (Gumpn)
>>> osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 15:09:53 $
>>> osr(notice) Preparing chroot
>>> cp: cannot stat `/mnt/newroot/dev/initctl': No such file or directory
>>> [ OK ]
>>> osr(notice) com-realhalt: detected distribution: rhel6, clutype: gfs, rootfs: gfs2
>>> osr(notice) Restarting init process in chroot [ OK ]
>>> osr(notice) Moving dev filesystem [ OK ]
>>> osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys /mnt/newroot/proc)
>>> osr(notice) Umounting /mnt/newroot/sys [ OK ]
>>> osr(notice) Umounting /mnt/newroot/proc [ OK ]
>>> osr(notice) Umounting filesystems in oldroot (/mnt/newroot/var/run /mnt/newroot/var/lock /mnt/newroot/.cdsl.local)
>>> osr(notice) Umounting /mnt/newroot/var/run
>>> init: Re-executing /sbin/init
>>> [ OK ]
>>> osr(notice) Umounting /mnt/newroot/var/lock [ OK ]
>>> osr(notice) Umounting /mnt/newroot/.cdsl.local [ OK ]
>>> osr(notice) Umounting oldroot /mnt/newroot [ OK ]
>>> + clusterfs_services_stop '' '' 0
>>> ++ repository_get_value rootfs
>>> +++ repository_normalize_value rootfs
>>> ++ local key=rootfs
>>> ++ local default=
>>> ++ local repository=
>>> ++ '[' -z '' ']'
>>> ++ repository=comoonics
>>> ++ local value=
>>> ++ '[' -f /var/cache/comoonics-repository/comoonics.rootfs ']'
>>> +++ cat /var/cache/comoonics-repository/comoonics.rootfs
>>> ++ value=gfs2
>>> ++ echo gfs2
>>> ++ return 0
>>> + local rootfs=gfs2
>>> + gfs2_services_stop '' '' 0
>>> + local chroot_path=
>>> + local lock_method=
>>> + local lvm_sup=0
>>> + '[' -n 0 ']'
>>> + '[' 0 -eq 0 ']'
>>> + /etc/init.d/clvmd stop
>>> Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active
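>>>
>>> (So the hang is inside /etc/init.d/clvmd stop, at the VG
>>> deactivation. If it gets stuck there again I may try deactivating
>>> the VG while bypassing the cluster locking; an untested idea, and
>>> presumably only sane because nothing has the LVs open any more at
>>> that point:
>>>
>>> vgchange -an --config 'global {locking_type = 0}' VG_SDATA
>>> )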
>>>
>>> With 2 nodes + quorate, when init 6 is issued:
>>>
>>> [root@bwccs304 ~]# init 6
>>>
>>> Can't connect to default. Skipping.
>>> Shutting down Cluster Module - cluster monitor: [ OK ]
>>> Shutting down ricci: [ OK ]
>>> Shutting down Avahi daemon: [ OK ]
>>> Shutting down oddjobd: [ OK ]
>>> Stopping saslauthd: [ OK ]
>>> Stopping sshd: [ OK ]
>>> Shutting down sm-client: [ OK ]
>>> Shutting down sendmail: [ OK ]
>>> Stopping imsd via sshd: [ OK ]
>>> Stopping snmpd: [ OK ]
>>> Stopping crond: [ OK ]
>>> Stopping HAL daemon: [ OK ]
>>> Shutting down ntpd: [ OK ]
>>> Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active
>>> [ OK ]
>>> Signaling clvmd to exit [ OK ]
>>> clvmd terminated [ OK ]
>>> Stopping lldpad: [ OK ]
>>> Stopping system message bus: [ OK ]
>>> Stopping multipathd daemon: [ OK ]
>>> Stopping rpcbind: [ OK ]
>>> Stopping auditd: [ OK ]
>>> Stopping nslcd: [ OK ]
>>> Shutting down system logger: [ OK ]
>>> Stopping sssd: [ OK ]
>>> Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ]
>>> Stopping gfs2 dependent services Starting clvmd:
>>> Activating VG(s): 1 logical volume(s) in volume group "vg_osroot" now active
>>> 2 logical volume(s) in volume group "VG_SDATA" now active
>>> [ OK ]
>>> osr(notice) ..bindmounts.. [ OK ]
>>> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume group "vg_osroot" unmonitored
>>> [ OK ]
>>> Sending all processes the TERM signal... [ OK ]
>>> qdiskd[15713]: Unregistering quorum device.
>>>
>>> Sending all processes the KILL signal... dlm: clvmd: no userland control daemon, stopping lockspace
>>> dlm: OSRoot: no userland control daemon, stopping lockspace
>>> [ OK ]
>>>
>>> It stops here and will not die... I still have full cluster comms.
>>>
>>> Thanks
>>> jorge
>>>
>>> On Tue, Nov 13, 2012 at 9:32 AM, Marc Grimme <gr...@at...> wrote:
>>>
>>> Hi Jorge,
>>> regarding the "init 0":
>>> please issue the following commands prior to init 0.
>>> # Make it a little more chatty
>>> $ com-chroot setparameter debug
>>> # Break before the cluster will be stopped
>>> $ com-chroot setparameter step halt_umountoldroot
>>>
>>> Then issue an init 0.
>>> This should lead you to a breakpoint during shutdown (hopefully,
>>> because sometimes the console gets confused).
>>> Inside the breakpoint, issue:
>>> $ cman_tool status
>>> $ cman_tool services
>>> # Continue shutdown
>>> $ exit
>>> Then send me the output.
>>>
>>> If this fails, also do as follows:
>>> $ com-chroot vi com-realhalt.sh
>>> # go to line 207 (just before clusterfs_services_stop is called) and add a set -x
>>> $ init 0
>>>
>>> Send the output.
>>> Thanks Marc.
>>>
>>> ----- Original Message -----
>>> From: "Jorge Silva" <me...@je...>
>>> To: "Marc Grimme" <gr...@at...>
>>> Cc: ope...@li...
>>> Sent: Tuesday, November 13, 2012 3:22:37 PM
>>> Subject: Re: Problem with VG activation clvmd runs at 100%
>>>
>>> Marc
>>>
>>> Hi, thanks for the info, it helps. I have also noticed that gfs2
>>> entries in the fstab get ignored on boot, so I have added the
>>> mounts to rc.local.
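>>>
>>> (For reference, the kind of line I mean in /etc/rc.local; the
>>> device path and mount point are examples from my layout:
>>>
>>> mount -t gfs2 /dev/VG_SDATA/LV_HOME /home
>>> )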
>>>
>>> I have done a bit more digging on the issue I described below:
>>>
>>> "I am still a bit stuck when nodes with gfs2 mounted don't restart
>>> if instructed to do so, but I will read some more."
>>>
>>> If I issue an init 6 on a node, it will restart. If I issue init 0,
>>> then I have the problem: the node starts to shut down but stays in
>>> the cluster. It will not shut down, and I have to power it off.
>>> This is the log:
>>>
>>> [root@bwccs304 ~]# init 0
>>>
>>> Can't connect to default. Skipping.
>>> Shutting down Cluster Module - cluster monitor: [ OK ]
>>> Shutting down ricci: [ OK ]
>>> Shutting down oddjobd: [ OK ]
>>> Stopping saslauthd: [ OK ]
>>> Stopping sshd: [ OK ]
>>> Shutting down sm-client: [ OK ]
>>> Shutting down sendmail: [ OK ]
>>> Stopping imsd via sshd: [ OK ]
>>> Stopping snmpd: [ OK ]
>>> Stopping crond: [ OK ]
>>> Stopping HAL daemon: [ OK ]
>>> Stopping nscd: [ OK ]
>>> Shutting down ntpd: [ OK ]
>>> Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active
>>> [ OK ]
>>> Signaling clvmd to exit [ OK ]
>>> clvmd terminated [ OK ]
>>> Stopping lldpad: [ OK ]
>>> Stopping system message bus: [ OK ]
>>> Stopping multipathd daemon: [ OK ]
>>> Stopping rpcbind: [ OK ]
>>> Stopping auditd: [ OK ]
>>> Stopping nslcd: [ OK ]
>>> Shutting down system logger: [ OK ]
>>> Stopping sssd: [ OK ]
>>> Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ]
>>> Stopping gfs2 dependent services Starting clvmd:
>>> Activating VG(s): 2 logical volume(s) in volume group "VG_SDATA" now active
>>> 1 logical volume(s) in volume group "vg_osroot" now active
>>> [ OK ]
>>> osr(notice) ..bindmounts.. [ OK ]
>>> Stopping monitoring for VG VG_SDATA: 1 logical volume(s) in volume group "VG_SDATA" unmonitored
>>> [ OK ]
>>> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume group "vg_osroot" unmonitored
>>> [ OK ]
>>> Sending all processes the TERM signal... [ OK ]
>>> Sending all processes the KILL signal... [ OK ]
>>> Saving random seed: [ OK ]
>>> Syncing hardware clock to system time [ OK ]
>>> Turning off quotas: quotaoff: Cannot change state of GFS2 quota.
>>> quotaoff: Cannot change state of GFS2 quota.
>>> [FAILED]
>>> Unmounting file systems: [ OK ]
>>> init: Re-executing /sbin/init
>>> Halting system...
>>> osr(notice) Scanning for Bootparameters...
>>> osr(notice) Starting ATIX exitrd
>>> osr(notice) Comoonics-Release
>>> osr(notice) comoonics Community Release 5.0 (Gumpn)
>>> osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 15:09:53 $
>>> osr(notice) Preparing chroot
>>> cp: cannot stat `/mnt/newroot/dev/initctl': No such file or directory
>>> [ OK ]
>>> osr(notice) com-realhalt: detected distribution: rhel6, clutype: gfs, rootfs: gfs2
>>> osr(notice) Restarting init process in chroot [ OK ]
>>> osr(notice) Moving dev filesystem [ OK ]
>>> osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys /mnt/newroot/proc)
>>> osr(notice) Umounting /mnt/newroot/sys [ OK ]
>>> osr(notice) Umounting /mnt/newroot/proc [ OK ]
>>> osr(notice) Umounting filesystems in oldroot (/mnt/newroot/var/run /mnt/newroot/var/lock /mnt/newroot/.cdsl.local)
>>> osr(notice) Umounting /mnt/newroot/var/run
>>> init: Re-executing /sbin/init
>>> [ OK ]
>>> osr(notice) Umounting /mnt/newroot/var/lock [ OK ]
>>> osr(notice) Umounting /mnt/newroot/.cdsl.local [ OK ]
>>> osr(notice) Umounting oldroot /mnt/newroot [ OK ]
>>> Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active
>>>
>>> On Tue, Nov 13, 2012 at 2:43 AM, Marc Grimme <gr...@at...> wrote:
>>>
>>> Jorge,
>>> you don't need to worry about the fact that the volume group for
>>> the root file system is not flagged as clustered. This has no
>>> implications whatsoever for the gfs2 file system.
>>>
>>> It will only be a problem whenever the lvm settings of vg_osroot
>>> change (size, number of lvs etc.).
>>>
>>> Nevertheless, while thinking about your problem I think I had an
>>> idea of how to fix it so that the root vg can be clustered as well.
>>> I will provide new packages in the next few days that should deal
>>> with the problem.
>>>
>>> Keep in mind that there is a difference between cman_tool services
>>> and the lvm usage. clvmd only uses the locktable clvmd shown by
>>> cman_tool services; the other locktables are relevant to the file
>>> systems and other services (fenced, rgmanager, ..). This is a
>>> completely different use case.
>>>
>>> Please elaborate a bit more on this:
>>>
>>> "I am still a bit stuck when nodes with gfs2 mounted don't restart
>>> if instructed to do so, but I will read some more."
>>>
>>> What do you mean by it? How does this happen? This sounds like
>>> something you should have a look at.
>>>
>>> "One thing that I can confirm is
>>> osr(notice): Detecting nodeid & nodename
>>> This does not always display the correct info, but it doesn't seem
>>> to be a problem either?"
>>>
>>> You should always look at the nodeid; the nodename is (more or
>>> less) only descriptive and might not be set as expected. But the
>>> nodeid should always be consistent. Does this help?
>>>
>>> About your notes (I only take the relevant ones):
>>>
>>> 1. osr(notice): Creating clusterfiles /var/run/cman_admin /var/run/cman_client.. [OK]
>>> This message may be misleading: it only tells you that these
>>> control files are being created inside the ramdisk. It has nothing
>>> to do with these files on your root file system. Nevertheless,
>>> /etc/init.d/bootsr should take over this part and create the files.
>>> Please send me another
>>> bash -x /etc/init.d/bootsr start
>>> output, taken while those files do not exist.
>>>
>>> 2. vgs
>>>
>>> VG        #PV #LV #SN Attr   VSize    VFree
>>> VG_SDATA    1   2   0 wz--nc 1000.00g     0
>>> vg_osroot   1   1   0 wz--n-   60.00g     0
>>>
>>> This is perfectly ok. It only means the vg is not clustered. But
>>> the filesystem IS. The two are not connected.
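>>>
>>> (The clustered flag is the sixth character of the vg_attr column,
>>> 'c' when set:
>>>
>>> vgs -o vg_name,vg_attr
>>> # VG_SDATA   wz--nc   <- clustered
>>> # vg_osroot  wz--n-   <- not clustered
>>> )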
>>>
>>> Hope this helps.
>>> Let me know about the open issues.
>>>
>>> Regards
>>>
>>> Marc.
>>>
>>> ----- Original Message -----
>>> From: "Jorge Silva" <me...@je...>
>>> To: "Marc Grimme" <gr...@at...>
>>> Sent: Tuesday, November 13, 2012 2:15:23 AM
>>> Subject: Re: Problem with VG activation clvmd runs at 100%
>>>
>>> Marc
>>>
>>> Hi - I believe I have solved my problem, with your help, thank you.
>>> Yet I'm not sure how I caused it: the root volume group, as you
>>> pointed out, had the clustered attribute (and I must have done
>>> something silly along the way). I re-installed from scratch (see
>>> notes below), and then, just to prove that it is a problem, I
>>> changed the attribute of the rootfs with vgchange -cy and rebooted,
>>> and I ran into trouble; I changed it back and it is fine. So that
>>> does cause problems on start-up. I'm not sure I understand why, as
>>> there is an active quorum for clvmd to join and take part in..
>>>
>>> Despite it not being marked as a clustered volume, cman_tool
>>> services shows it as such, but clvmd status doesn't? Is it safe to
>>> write to it with multiple nodes mounted?
>>>
>>> I am still a bit stuck when nodes with gfs2 mounted don't restart
>>> if instructed to do so, but I will read some more.
>>>
>>> One thing that I can confirm is
>>> osr(notice): Detecting nodeid & nodename
>>> does not always display the correct info, but it doesn't seem to be
>>> a problem either?
>>>
>>> Thanks
>>> Jorge
>>>
>>> Notes:
>>> I decided to start from scratch: I blew away the rootfs and
>>> reinstalled as per the website. My assumption is that I had edited
>>> something and messed it up (I did look at a lot of the scripts to
>>> try to "figure out and fix" the problem; I can send the history if
>>> you want, or I can edit and contribute).
>>>
>>> I rebooted the server and I had an issue: I didn't disable selinux,
>>> so I had to intervene in the boot stage. That completed, but I
>>> noticed that:
>>>
>>> osr(notice): Starting network configuration for lo0 [OK]
>>> osr(notice): Detecting nodeid & nodename
>>>
>>> is blank, but somehow the correct nodeid and name were deduced.
>>>
>>> I had to rebuild the ramdisk to make the selinux fix stick. I also
>>> installed some packages:
>>> yum install pciutils (mkinitrd warned about this, so I installed it)
>>> yum install cluster-snmp
>>> yum install rgmanager
>>>
>>> On this reboot I noticed that, despite this message
>>>
>>> osr(notice): Creating clusterfiles /var/run/cman_admin /var/run/cman_client.. [OK]
>>>
>>> Starting clvmd: dlm: Using TCP for communications
>>>
>>> Activating VG(s): File descriptor 3 (/dev/console) leaked on vgchange invocation. Parent PID 15995: /bin/bash
>>> File descriptor 4 (/dev/console) leaked on vgchange invocation. Parent PID 15995: /bin/bash
>>> Skipping clustered volume group VG_SDATA
>>> 1 logical volume(s) in volume group "vg_osroot" now active
>>>
>>> the links weren't created, so I created them manually:
>>>
>>> ln -sf /var/comoonics/chroot//var/run/cman_admin /var/run/cman_admin
>>> ln -sf /var/comoonics/chroot//var/run/cman_client /var/run/cman_client
>>>
>>> I could then get clusterstatus etc., and clvmd was running ok.
>>>
>>> I looked in /etc/lvm/lvm.conf and locking_type = 4?
>>>
>>> I then issued
>>>
>>> lvmconf --enable-cluster
>>>
>>> and this changed /etc/lvm/lvm.conf to locking_type = 3.
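>>>
>>> (Quick sanity check after that change; note that the ramdisk keeps
>>> its own copy of lvm.conf until it is rebuilt:
>>>
>>> grep locking_type /etc/lvm/lvm.conf
>>> # locking_type = 3
>>> )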
>>>
>>> vgscan correctly showed the clustered volumes and was working ok.
>>>
>>> I did not rebuild the ramdisk (I can confirm that the lvm.conf in
>>> the ramdisk has locking_type=4); I have rebooted and everything is
>>> working:
>>>
>>> Starting clvmd: dlm: Using TCP for communications
>>>
>>> Activating VG(s): File descriptor 3 (/dev/console) leaked on vgchange invocation. Parent PID 15983: /bin/bash
>>> File descriptor 4 (/dev/console) leaked on vgchange invocation. Parent PID 15983: /bin/bash
>>> Skipping clustered volume group VG_SDATA
>>> 1 logical volume(s) in volume group "vg_osroot" now active
>>>
>>> I have rebooted a number of times and am confident that things are ok.
>>>
>>> I decided to add two other nodes to the mix, and I can confirm that
>>> every time a new node is added these files are missing:
>>>
>>> /var/run/cman_admin
>>> /var/run/cman_client
>>>
>>> But I can see from the logs:
>>>
>>> osr(notice): Creating clusterfiles /var/run/cman_admin /var/run/cman_client.. [OK]
>>>
>>> Despite the above message, the information below is also not always
>>> detected, but the nodeid etc. is still correct...
>>>
>>> osr(notice): Detecting nodeid & nodename
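>>>
>>> (As a stopgap until bootsr creates them reliably, I recreate the
>>> links on each node with something like:
>>>
>>> for f in cman_admin cman_client; do
>>>     ln -sf /var/comoonics/chroot/var/run/$f /var/run/$f
>>> done
>>> )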
>>>
>>> So now I have 3 nodes in the cluster and things look ok:
>>>
>>> [root@bwccs302 ~]# cman_tool services
>>> fence domain
>>> member count 3
>>> victim count 0
>>> victim now 0
>>> master nodeid 2
>>> wait state none
>>> members 2 3 4
>>>
>>> dlm lockspaces
>>> name home
>>> id 0xf8ee17aa
>>> flags 0x00000008 fs_reg
>>> change member 3 joined 1 remove 0 failed 0 seq 3,3
>>> members 2 3 4
>>>
>>> name clvmd
>>> id 0x4104eefa
>>> flags 0x00000000
>>> change member 3 joined 1 remove 0 failed 0 seq 15,15
>>> members 2 3 4
>>>
>>> name OSRoot
>>> id 0xab5404ad
>>> flags 0x00000008 fs_reg
>>> change member 3 joined 1 remove 0 failed 0 seq 7,7
>>> members 2 3 4
>>>
>>> gfs mountgroups
>>> name home
>>> id 0x686e3fc4
>>> flags 0x00000048 mounted
>>> change member 3 joined 1 remove 0 failed 0 seq 3,3
>>> members 2 3 4
>>>
>>> name OSRoot
>>> id 0x659f7afe
>>> flags 0x00000048 mounted
>>> change member 3 joined 1 remove 0 failed 0 seq 7,7
>>> members 2 3 4
>>>
>>> service clvmd status
>>> clvmd (pid 25771) is running...
>>> Clustered Volume Groups: VG_SDATA
>>> Active clustered Logical Volumes: LV_HOME LV_DEVDB
>>>
>>> It doesn't believe that the root file-system is clustered, despite
>>> the output from the above:
>>>
>>> [root@bwccs302 ~]# vgs
>>> VG        #PV #LV #SN Attr   VSize    VFree
>>> VG_SDATA    1   2   0 wz--nc 1000.00g     0
>>> vg_osroot   1   1   0 wz--n-   60.00g     0
>>>
>>> The above got me thinking about what you wanted me to do to disable
>>> the clustered flag on the root volume; with it left on I was having
>>> problems (not sure how it got turned on).
>>>
>>> With everything working ok, I remade the ramdisk, and the ramdisk
>>> lvm.conf now has locking_type=3.
>>>
>>> The systems start up and things look ok.