From: Marc G. <gr...@at...> - 2012-11-20 21:57:25
Jorge,

please send me the following information: /etc/cluster/cluster.conf (for the fencing configuration).

When you started clvmd with the -d option, it looks like locking_type=0. Are you sure that locking_type is 3 there?

WARNING: Locking disabled. Be careful! This could corrupt your metadata.

Another option would be to start clvmd under strace and send me the trace:

strace -t -T -o /tmp/clvmd-strace.out clvmd

Perhaps I can see some odd things from there.
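
Independently of that, it might be worth comparing the lvm.conf on the root filesystem with the copy baked into the initrd, since the boot environment keeps its own. A rough sketch (untested; the initrd filename is a guess, use whatever your bootloader actually loads):

grep locking_type /etc/lvm/lvm.conf
mkdir /tmp/initrd.d && cd /tmp/initrd.d
zcat /boot/initrd_sr-$(uname -r).img | cpio -idm '*etc/lvm/lvm.conf'
grep locking_type etc/lvm/lvm.conf

If the two values differ, that would fit what we see: clvmd starting with one locking type and the tools running with another.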
Regards Marc.

On 20.11.2012 14:35, Jorge Silva wrote:
> Marc
>
> Hi, I have confirmed that locking_type=3, rebuilt the initrd and
> rebooted; attached is the boot log. clvmd -d:
>
> [root@bwccs302 ~]# clvmd -d
> CLVMD[560ec7a0]: Nov 20 08:30:43 CLVMD started
> CLVMD[560ec7a0]: Nov 20 08:30:43 Connected to CMAN
> CLVMD[560ec7a0]: Nov 20 08:30:43 CMAN initialisation complete
> CLVMD[560ec7a0]: Nov 20 08:30:43 Opened existing DLM lockspace for CLVMD.
> CLVMD[560ec7a0]: Nov 20 08:30:43 DLM initialisation complete
> CLVMD[560ec7a0]: Nov 20 08:30:43 Cluster ready, doing some more initialisation
> CLVMD[560ec7a0]: Nov 20 08:30:43 starting LVM thread
> CLVMD[560eb700]: Nov 20 08:30:43 LVM thread function started
> WARNING: Locking disabled. Be careful! This could corrupt your metadata.
> CLVMD[560eb700]: Nov 20 08:30:43 Sub thread ready for work.
> CLVMD[560ec7a0]: Nov 20 08:30:43 clvmd ready for work
> CLVMD[560eb700]: Nov 20 08:30:43 LVM thread waiting for work
> CLVMD[560ec7a0]: Nov 20 08:30:43 Using timeout of 60 seconds
>
> Output from top:
>   PID USER PR NI VIRT RES SHR  S %CPU %MEM TIME+   COMMAND
> 27261 root 20  0 101m 23m 3176 S 99.7  0.1 0:58.69 clvmd
>
> ipmilan:
> [root@bwccs302 ~]# com-chroot /sbin/fence_ipmilan -h
> bash: /sbin/fence_ipmilan: No such file or directory
>
> I checked; it is in /usr/sbin/fence_ipmilan:
>
> [root@bwccs302 ~]# com-chroot /usr/sbin/fence_ipmilan -h
> usage: fence_ipmilan <options>
>    -A <authtype>  IPMI Lan Auth type (md5, password, or none)
>    -a <ipaddr>    IPMI Lan IP to talk to
>    -i <ipaddr>    IPMI Lan IP to talk to (deprecated, use -a)
>    -p <password>  Password (if required) to control power on IPMI device
>    -P             Use Lanplus
>    -S <path>      Script to retrieve password (if required)
>    -l <login>     Username/Login (if required) to control power on IPMI device
>    -L <privlvl>   IPMI privilege level. Defaults to ADMINISTRATOR.
>                   See ipmitool(1) for more info.
>    -o <op>        Operation to perform.
>                   Valid operations: on, off, reboot, status, diag, list or monitor
>    -t <timeout>   Timeout (sec) for IPMI operation (default 20)
>    -T <timeout>   Wait X seconds after on/off operation
>    -f <timeout>   Wait X seconds before fencing is started
>    -C <cipher>    Ciphersuite to use (same as ipmitool -C parameter)
>    -M <method>    Method to fence (onoff or cycle; default onoff)
>    -V             Print version and exit
>    -v             Verbose mode
>
> If no options are specified, the following options will be read
> from standard input (one per line):
>
>    auth=<auth>            Same as -A
>    ipaddr=<#>             Same as -a
>    passwd=<pass>          Same as -p
>    passwd_script=<path>   Same as -S
>    lanplus                Same as -P
>    login=<login>          Same as -l
>    option=<op>            Same as -o
>    operation=<op>         Same as -o
>    action=<op>            Same as -o
>    delay=<seconds>        Same as -f
>    timeout=<timeout>      Same as -t
>    power_wait=<time>      Same as -T
>    cipher=<cipher>        Same as -C
>    method=<method>        Same as -M
>    privlvl=<privlvl>      Same as -L
>    verbose                Same as -v
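>
> (Since fenced feeds the agent these options over stdin, a manual
> status check can be driven the same way; the address and credentials
> below are placeholders, not my real ones:
>
> /usr/sbin/fence_ipmilan <<EOF
> ipaddr=10.0.0.1
> login=admin
> passwd=secret
> lanplus
> action=status
> EOF
> )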
>
> On Tue, Nov 20, 2012 at 3:41 AM, Marc Grimme <gr...@at...> wrote:
>
> Jorge,
> let's first start with fencing.
> You are using ipmilan for fencing. I didn't evaluate the agent with
> rhel6, so let's start fixing this issue. Try the following:
>
> com-chroot /sbin/fence_ipmilan -h
>
> Send me the output. There might be some libs missing.
>
> The clvmd behaviour is very strange. Try to stay with locking_type=2
> or locking_type=3. Then rebuild an initrd and reboot.
> If clvmd stays at 100% CPU, kill it and start it again manually with
> the -d flag. Send me the output. Perhaps we see something from there.
>
> Regards Marc.
> On 19.11.2012 15:39, Jorge Silva wrote:
>> Marc
>>
>> Hi, no problem, thanks for helping. The /var/run/cman* files are
>> there. I will disable the clustered flag on the second volume group.
>> Even more disturbing: after the last email I sent you, I went from a
>> state where clvmd was behaving normally (not 100%) and I could
>> access clustered volumes. I rebooted to verify that everything was
>> functioning, but I am now back to the state where clvmd is running
>> at 100% and I can't access clustered volumes; back to where we
>> started.
>>
>> locking_type=0
>> [root@bwccs302 ~]# vgs
>> WARNING: Locking disabled. Be careful! This could corrupt your metadata.
>>   VG        #PV #LV #SN Attr   VSize  VFree
>>   VG_DATA1    1   2   0 wz--n- 64.00g 4.20g
>>   vg_osroot   1   1   0 wz--n- 60.00g     0
>>
>>   PID USER PR NI  VIRT RES SHR  S %CPU   %MEM TIME+   COMMAND
>> 23207 root  2 -18 101m 23m 3176 S *99.9*  0.1 0:05.82 clvmd
>>
>> lrwxrwxrwx 1 root root 41 Nov 16 16:41 /var/run/cman_admin ->
>> /var/comoonics/chroot//var/run/cman_admin
>> lrwxrwxrwx 1 root root 42 Nov 16 16:41 /var/run/cman_client ->
>> /var/comoonics/chroot//var/run/cman_client
>>
>> locking_type=3
>> [root@bwccs302 ~]# service clvmd restart
>> Restarting clvmd: [ OK ]
>> [root@bwccs302 ~]# vgs
>> cluster request failed: Invalid argument
>> Can't get lock for VG_DATA1
>> cluster request failed: Invalid argument
>> Can't get lock for vg_osroot
>>   PID USER PR NI  VIRT RES SHR  S %CPU   %MEM TIME+   COMMAND
>> 23829 root  2 -18 167m 24m 3268 S *99.8*  0.1 0:31.29 clvm
>>
>> As far as the shutdown: with the two nodes up, once I issue the
>> shutdown on node1, the shutdown proceeds to the point where I sent
>> the screenshots (deactivating cluster services). On node2 I notice:
>>
>> corosync[16648]: [TOTEM ] A processor failed, forming new configuration.
>>
>> It attempts to fence node1. After 3 unsuccessful attempts, it locks
>> up. Node1 stays stuck (screen dump I sent). I do a tcpdump and I see
>> the two nodes are still sending multicast messages, and until I
>> reset node1, node2 will stay in a locked state with no access...
>> this is the last set of messages I see:
>>
>> fenced[16784]: fence smc01b dev 0.0 agent fence_ipmilan result: error from agent
>> fenced[16784]: fence smc01b failed
>>
>> After 3 attempts, as fencing failed, the cluster locks up until I
>> have reset the node. I suspect there is another issue at play here,
>> as I can manually fence a node using fence_node x (I will continue
>> to dig into this; I have tried fenced -q and messagebus with the
>> same result).
>>
>> Thanks
>> Jorge
>>
>> On Mon, Nov 19, 2012 at 3:01 AM, Marc Grimme <gr...@at...> wrote:
>>
>> Hi Jorge,
>> sorry for the delay, but I was quite busy during the last days.
>> Nevertheless, I don't understand the problem.
>> Let's first start at the point I think could lead to problems during
>> shutdown. Are the control files in /var/run/cman* being created by
>> the bootsr initscript, or do you still have to create them manually?
>> If they are not created I would still be very interested in the
>> output of
>> bash -x /etc/init.d/bootsr start
>> after a node has been started.
>>
>> If not, we need to dig deeper into the problems during shutdown.
>> I would then also change the clustered flag for the other volume
>> group. Again, as long as you don't change the size it won't hurt,
>> and it's only for better understanding the problem.
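>>
>> (Changing the flag would be something like the following; just a
>> sketch, and the --config override should only be needed when clvmd
>> is not running at the time:
>>
>> vgchange -cn VG_DATA1 --config 'global {locking_type = 0}'
>> # and later, to set it again:
>> vgchange -cy VG_DATA1
>> )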
>>
>> Another command I'd like to see is a cman_tool services on the other
>> node (say node 2) while the shutting-down node (say node 1) is stuck.
>>
>> Thanks Marc.
>> On 15.11.2012 19:08, Jorge Silva wrote:
>>> Marc
>>>
>>> Hi, I believe the problem is related to the cluster services not
>>> shutting down. init 0 will not work with 1 or more nodes; init 6
>>> will only work when 1 node is present. When more than 1 node is
>>> present, the node with the init 6 will have to be fenced, as it
>>> will not shut down. I believe the cluster components aren't
>>> shutting down (this also happens with init 6 when more than one
>>> node is present); I still see cluster traffic on the network, and
>>> it is periodic:
>>>
>>> 12:42:00.547615 IP 172.17.62.12.hpoms-dps-lstn > 229.192.0.2.netsupport: UDP, length 119
>>>
>>> At the point where the system will not shut down, it is still a
>>> cluster member and there is still cluster traffic.
>>>
>>> 1 node:
>>> [root@bwccs302 ~]# init 0
>>>
>>> Can't connect to default. Skipping.
>>> Shutting down Cluster Module - cluster monitor: [ OK ]
>>> Shutting down ricci: [ OK ]
>>> Shutting down Avahi daemon: [ OK ]
>>> Shutting down oddjobd: [ OK ]
>>> Stopping saslauthd: [ OK ]
>>> Stopping sshd: [ OK ]
>>> Shutting down sm-client: [ OK ]
>>> Shutting down sendmail: [ OK ]
>>> Stopping imsd via sshd: [ OK ]
>>> Stopping snmpd: [ OK ]
>>> Stopping crond: [ OK ]
>>> Stopping HAL daemon: [ OK ]
>>> Shutting down ntpd: [ OK ]
>>> Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active
>>> [ OK ]
>>> Signaling clvmd to exit [ OK ]
>>> clvmd terminated [ OK ]
>>> Stopping lldpad: [ OK ]
>>> Stopping system message bus: [ OK ]
>>> Stopping multipathd daemon: [ OK ]
>>> Stopping rpcbind: [ OK ]
>>> Stopping auditd: [ OK ]
>>> Stopping nslcd: [ OK ]
>>> Shutting down system logger: [ OK ]
>>> Stopping sssd: [ OK ]
>>> Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ]
>>> Stopping gfs2 dependent services Starting clvmd:
>>> Activating VG(s): 2 logical volume(s) in volume group "VG_SDATA" now active
>>> 1 logical volume(s) in volume group "vg_osroot" now active
>>> [ OK ]
>>> osr(notice) ..bindmounts.. [ OK ]
>>> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume group "vg_osroot" unmonitored
>>> [ OK ]
>>> Sending all processes the TERM signal... [ OK ]
>>> Sending all processes the KILL signal... [ OK ]
>>> Saving random seed: [ OK ]
>>> Syncing hardware clock to system time [ OK ]
>>> Turning off quotas: quotaoff: Cannot change state of GFS2 quota.
>>> quotaoff: Cannot change state of GFS2 quota.
>>> [FAILED]
>>> Unmounting file systems: [ OK ]
>>> init: Re-executing /sbin/init
>>> Halting system...
>>> osr(notice) Scanning for Bootparameters...
>>> osr(notice) Starting ATIX exitrd
>>> osr(notice) Comoonics-Release
>>> osr(notice) comoonics Community Release 5.0 (Gumpn)
>>> osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 15:09:53 $
>>> osr(debug) Calling cmd /sbin/halt -d -p
>>> osr(notice) Preparing chroot
>>> cp: cannot stat `/mnt/newroot/dev/initctl': No such file or directory
>>> [ OK ]
>>> osr(notice) com-realhalt: detected distribution: rhel6, clutype: gfs, rootfs: gfs2
>>> osr(notice) Restarting init process in chroot [ OK ]
>>> osr(notice) Moving dev filesystem [ OK ]
>>> osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys /mnt/newroot/proc)
>>> osr(notice) Umounting /mnt/newroot/sys [ OK ]
>>> osr(notice) Umounting /mnt/newroot/proc [ OK ]
>>> osr(notice) Umounting filesystems in oldroot (/mnt/newroot/var/run /mnt/newroot/var/lock /mnt/newroot/.cdsl.local)
>>> osr(notice) Umounting /mnt/newroot/var/run
>>> init: Re-executing /sbin/init
>>> [ OK ]
>>> osr(notice) Umounting /mnt/newroot/var/lock [ OK ]
>>> osr(notice) Umounting /mnt/newroot/.cdsl.local [ OK ]
>>> osr(notice) Umounting oldroot /mnt/newroot [ OK ]
>>> osr(notice) Breakpoint "halt_umountoldroot" detected, forking a shell
>>> bash: no job control in this shell
>>>
>>> Type help to get more information..
>>> Type exit to continue work..
>>> -------------------------------------------------------------
>>>
>>> comoonics 1 > cman_tool: unknown option cman_tool
>>> comoonics 2 >
>>> comoonics 2 > Version: 6.2.0
>>> Config Version: 1
>>> Cluster Name: ProdCluster01
>>> Cluster Id: 11454
>>> Cluster Member: Yes
>>> Cluster Generation: 4
>>> Membership state: Cluster-Member
>>> Nodes: 1
>>> Expected votes: 4
>>> Quorum device votes: 3
>>> Total votes: 4
>>> Node votes: 1
>>> Quorum: 3
>>> Active subsystems: 10
>>> Flags:
>>> Ports Bound: 0 11 178
>>> Node name: smc01b
>>> Node ID: 2
>>> Multicast addresses: 229.192.0.2
>>> Node addresses: 172.17.62.12
>>> comoonics 3 > fence domain
>>> member count 1
>>> victim count 0
>>> victim now 0
>>> master nodeid 2
>>> wait state none
>>> members 2
>>>
>>> dlm lockspaces
>>> name clvmd
>>> id 0x4104eefa
>>> flags 0x00000000
>>> change member 1 joined 1 remove 0 failed 0 seq 1,1
>>> members 2
>>>
>>> comoonics 4 > bash: exitt: command not found
>>> comoonics 5 > exit
>>> osr(notice) Back to work..
>>> Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active
>>>
>>> It hung at the point above, so I re-ran it with set -x added at line 207.
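>>>
>>> (Reading the cman_tool status above: cman computes quorum as
>>> expected_votes/2 + 1 = 4/2 + 1 = 3, and the lone node still holds
>>> its own vote plus the quorum device's, 1 + 3 = 4 >= 3. So the node
>>> stays quorate entirely on its own, which is why the cluster stack
>>> is still fully alive this late in the shutdown.)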
>>>
>>> 1 node:
>>> [root@bwccs302 ~]# init 0
>>>
>>> Can't connect to default. Skipping.
>>> Shutting down Cluster Module - cluster monitor: [ OK ]
>>> Shutting down ricci: [ OK ]
>>> Shutting down Avahi daemon: [ OK ]
>>> Shutting down oddjobd: [ OK ]
>>> Stopping saslauthd: [ OK ]
>>> Stopping sshd: [ OK ]
>>> Shutting down sm-client: [ OK ]
>>> Shutting down sendmail: [ OK ]
>>> Stopping imsd via sshd: [ OK ]
>>> Stopping snmpd: [ OK ]
>>> Stopping crond: [ OK ]
>>> Stopping HAL daemon: [ OK ]
>>> Shutting down ntpd: [ OK ]
>>> Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active
>>> [ OK ]
>>> Signaling clvmd to exit [ OK ]
>>> clvmd terminated [ OK ]
>>> Stopping lldpad: [ OK ]
>>> Stopping system message bus: [ OK ]
>>> Stopping multipathd daemon: [ OK ]
>>> Stopping rpcbind: [ OK ]
>>> Stopping auditd: [ OK ]
>>> Stopping nslcd: [ OK ]
>>> Shutting down system logger: [ OK ]
>>> Stopping sssd: [ OK ]
>>> Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ]
>>> Stopping gfs2 dependent services Starting clvmd:
>>> Activating VG(s): 1 logical volume(s) in volume group "vg_osroot" now active
>>> 2 logical volume(s) in volume group "VG_SDATA" now active
>>> [ OK ]
>>> osr(notice) ..bindmounts.. [ OK ]
>>> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume group "vg_osroot" unmonitored
>>> [ OK ]
>>> Sending all processes the TERM signal... [ OK ]
>>> Sending all processes the KILL signal... [ OK ]
>>> Saving random seed: [ OK ]
>>> Syncing hardware clock to system time [ OK ]
>>> Turning off quotas: quotaoff: Cannot change state of GFS2 quota.
>>> quotaoff: Cannot change state of GFS2 quota.
>>> [FAILED]
>>> Unmounting file systems: [ OK ]
>>> init: Re-executing /sbin/init
>>> Halting system...
>>> osr(notice) Scanning for Bootparameters...
>>> osr(notice) Starting ATIX exitrd
>>> osr(notice) Comoonics-Release
>>> osr(notice) comoonics Community Release 5.0 (Gumpn)
>>> osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 15:09:53 $
>>> osr(notice) Preparing chroot
>>> cp: cannot stat `/mnt/newroot/dev/initctl': No such file or directory
>>> [ OK ]
>>> osr(notice) com-realhalt: detected distribution: rhel6, clutype: gfs, rootfs: gfs2
>>> osr(notice) Restarting init process in chroot [ OK ]
>>> osr(notice) Moving dev filesystem [ OK ]
>>> osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys /mnt/newroot/proc)
>>> osr(notice) Umounting /mnt/newroot/sys [ OK ]
>>> osr(notice) Umounting /mnt/newroot/proc [ OK ]
>>> osr(notice) Umounting filesystems in oldroot (/mnt/newroot/var/run /mnt/newroot/var/lock /mnt/newroot/.cdsl.local)
>>> osr(notice) Umounting /mnt/newroot/var/run
>>> init: Re-executing /sbin/init
>>> [ OK ]
>>> osr(notice) Umounting /mnt/newroot/var/lock [ OK ]
>>> osr(notice) Umounting /mnt/newroot/.cdsl.local [ OK ]
>>> osr(notice) Umounting oldroot /mnt/newroot [ OK ]
>>> + clusterfs_services_stop '' '' 0
>>> ++ repository_get_value rootfs
>>> +++ repository_normalize_value rootfs
>>> ++ local key=rootfs
>>> ++ local default=
>>> ++ local repository=
>>> ++ '[' -z '' ']'
>>> ++ repository=comoonics
>>> ++ local value=
>>> ++ '[' -f /var/cache/comoonics-repository/comoonics.rootfs ']'
>>> +++ cat /var/cache/comoonics-repository/comoonics.rootfs
>>> ++ value=gfs2
>>> ++ echo gfs2
>>> ++ return 0
>>> + local rootfs=gfs2
>>> + gfs2_services_stop '' '' 0
>>> + local chroot_path=
>>> + local lock_method=
>>> + local lvm_sup=0
>>> + '[' -n 0 ']'
>>> + '[' 0 -eq 0 ']'
>>> + /etc/init.d/clvmd stop
>>> Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active
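>>>
>>> (So the hang is inside /etc/init.d/clvmd stop, at the VG
>>> deactivation. If it gets stuck there again I may try deactivating
>>> the VG while bypassing the cluster locking; an untested idea, and
>>> presumably only sane because nothing has the LVs open any more at
>>> that point:
>>>
>>> vgchange -an --config 'global {locking_type = 0}' VG_SDATA
>>> )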
>>>
>>> With 2 nodes + quorate, when init 6 is issued:
>>>
>>> [root@bwccs304 ~]# init 6
>>>
>>> Can't connect to default. Skipping.
>>> Shutting down Cluster Module - cluster monitor: [ OK ]
>>> Shutting down ricci: [ OK ]
>>> Shutting down Avahi daemon: [ OK ]
>>> Shutting down oddjobd: [ OK ]
>>> Stopping saslauthd: [ OK ]
>>> Stopping sshd: [ OK ]
>>> Shutting down sm-client: [ OK ]
>>> Shutting down sendmail: [ OK ]
>>> Stopping imsd via sshd: [ OK ]
>>> Stopping snmpd: [ OK ]
>>> Stopping crond: [ OK ]
>>> Stopping HAL daemon: [ OK ]
>>> Shutting down ntpd: [ OK ]
>>> Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active
>>> [ OK ]
>>> Signaling clvmd to exit [ OK ]
>>> clvmd terminated [ OK ]
>>> Stopping lldpad: [ OK ]
>>> Stopping system message bus: [ OK ]
>>> Stopping multipathd daemon: [ OK ]
>>> Stopping rpcbind: [ OK ]
>>> Stopping auditd: [ OK ]
>>> Stopping nslcd: [ OK ]
>>> Shutting down system logger: [ OK ]
>>> Stopping sssd: [ OK ]
>>> Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ]
>>> Stopping gfs2 dependent services Starting clvmd:
>>> Activating VG(s): 1 logical volume(s) in volume group "vg_osroot" now active
>>> 2 logical volume(s) in volume group "VG_SDATA" now active
>>> [ OK ]
>>> osr(notice) ..bindmounts.. [ OK ]
>>> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume group "vg_osroot" unmonitored
>>> [ OK ]
>>> Sending all processes the TERM signal... [ OK ]
>>> qdiskd[15713]: Unregistering quorum device.
>>>
>>> Sending all processes the KILL signal... dlm: clvmd: no userland control daemon, stopping lockspace
>>> dlm: OSRoot: no userland control daemon, stopping lockspace
>>> [ OK ]
>>>
>>> It stops here and will not die... I still have full cluster comms.
>>>
>>> Thanks
>>> jorge
>>>
>>> On Tue, Nov 13, 2012 at 9:32 AM, Marc Grimme <gr...@at...> wrote:
>>>
>>> Hi Jorge,
>>> regarding the "init 0":
>>> please issue the following commands prior to init 0.
>>> # Make it a little more chatty
>>> $ com-chroot setparameter debug
>>> # Break before the cluster will be stopped
>>> $ com-chroot setparameter step halt_umountoldroot
>>>
>>> Then issue an init 0.
>>> This should lead you to a breakpoint during shutdown (hopefully,
>>> because sometimes the console gets confused).
>>> Inside the breakpoint, issue:
>>> $ cman_tool status
>>> $ cman_tool services
>>> # Continue shutdown
>>> $ exit
>>> Then send me the output.
>>>
>>> If this fails, also do as follows:
>>> $ com-chroot vi com-realhalt.sh
>>> # go to line 207 (just before clusterfs_services_stop is called) and add a set -x
>>> $ init 0
>>>
>>> Send the output.
>>> Thanks Marc.
>>>
>>> ----- Original Message -----
>>> From: "Jorge Silva" <me...@je...>
>>> To: "Marc Grimme" <gr...@at...>
>>> Cc: ope...@li...
>>> Sent: Tuesday, November 13, 2012 3:22:37 PM
>>> Subject: Re: Problem with VG activation clvmd runs at 100%
>>>
>>> Marc
>>>
>>> Hi, thanks for the info, it helps. I have also noticed that gfs2
>>> entries in the fstab get ignored on boot, so I have added the
>>> mounts to rc.local.
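>>>
>>> (For reference, the kind of line I mean in /etc/rc.local; the
>>> device path and mount point are examples from my layout:
>>>
>>> mount -t gfs2 /dev/VG_SDATA/LV_HOME /home
>>> )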
>>>
>>> I have done a bit more digging on the issue I described below:
>>>
>>> "I am still a bit stuck when nodes with gfs2 mounted don't restart
>>> if instructed to do so, but I will read some more."
>>>
>>> If I issue an init 6 on a node, it will restart. If I issue init 0,
>>> then I have the problem: the node starts to shut down but stays in
>>> the cluster. It will not shut down, and I have to power it off.
>>> This is the log:
>>>
>>> [root@bwccs304 ~]# init 0
>>>
>>> Can't connect to default. Skipping.
>>> Shutting down Cluster Module - cluster monitor: [ OK ]
>>> Shutting down ricci: [ OK ]
>>> Shutting down oddjobd: [ OK ]
>>> Stopping saslauthd: [ OK ]
>>> Stopping sshd: [ OK ]
>>> Shutting down sm-client: [ OK ]
>>> Shutting down sendmail: [ OK ]
>>> Stopping imsd via sshd: [ OK ]
>>> Stopping snmpd: [ OK ]
>>> Stopping crond: [ OK ]
>>> Stopping HAL daemon: [ OK ]
>>> Stopping nscd: [ OK ]
>>> Shutting down ntpd: [ OK ]
>>> Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active
>>> [ OK ]
>>> Signaling clvmd to exit [ OK ]
>>> clvmd terminated [ OK ]
>>> Stopping lldpad: [ OK ]
>>> Stopping system message bus: [ OK ]
>>> Stopping multipathd daemon: [ OK ]
>>> Stopping rpcbind: [ OK ]
>>> Stopping auditd: [ OK ]
>>> Stopping nslcd: [ OK ]
>>> Shutting down system logger: [ OK ]
>>> Stopping sssd: [ OK ]
>>> Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ]
>>> Stopping gfs2 dependent services Starting clvmd:
>>> Activating VG(s): 2 logical volume(s) in volume group "VG_SDATA" now active
>>> 1 logical volume(s) in volume group "vg_osroot" now active
>>> [ OK ]
>>> osr(notice) ..bindmounts.. [ OK ]
>>> Stopping monitoring for VG VG_SDATA: 1 logical volume(s) in volume group "VG_SDATA" unmonitored
>>> [ OK ]
>>> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume group "vg_osroot" unmonitored
>>> [ OK ]
>>> Sending all processes the TERM signal... [ OK ]
>>> Sending all processes the KILL signal... [ OK ]
>>> Saving random seed: [ OK ]
>>> Syncing hardware clock to system time [ OK ]
>>> Turning off quotas: quotaoff: Cannot change state of GFS2 quota.
>>> quotaoff: Cannot change state of GFS2 quota.
>>> [FAILED]
>>> Unmounting file systems: [ OK ]
>>> init: Re-executing /sbin/init
>>> Halting system...
>>> osr(notice) Scanning for Bootparameters...
>>> osr(notice) Starting ATIX exitrd
>>> osr(notice) Comoonics-Release
>>> osr(notice) comoonics Community Release 5.0 (Gumpn)
>>> osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 15:09:53 $
>>> osr(notice) Preparing chroot
>>> cp: cannot stat `/mnt/newroot/dev/initctl': No such file or directory
>>> [ OK ]
>>> osr(notice) com-realhalt: detected distribution: rhel6, clutype: gfs, rootfs: gfs2
>>> osr(notice) Restarting init process in chroot [ OK ]
>>> osr(notice) Moving dev filesystem [ OK ]
>>> osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys /mnt/newroot/proc)
>>> osr(notice) Umounting /mnt/newroot/sys [ OK ]
>>> osr(notice) Umounting /mnt/newroot/proc [ OK ]
>>> osr(notice) Umounting filesystems in oldroot (/mnt/newroot/var/run /mnt/newroot/var/lock /mnt/newroot/.cdsl.local)
>>> osr(notice) Umounting /mnt/newroot/var/run
>>> init: Re-executing /sbin/init
>>> [ OK ]
>>> osr(notice) Umounting /mnt/newroot/var/lock [ OK ]
>>> osr(notice) Umounting /mnt/newroot/.cdsl.local [ OK ]
>>> osr(notice) Umounting oldroot /mnt/newroot [ OK ]
>>> Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active
>>>
>>> On Tue, Nov 13, 2012 at 2:43 AM, Marc Grimme <gr...@at...> wrote:
>>>
>>> Jorge,
>>> you don't need to worry about the fact that the volume group for
>>> the root file system is not flagged as clustered. This has no
>>> implications whatsoever for the gfs2 file system.
>>>
>>> It will only be a problem whenever the lvm settings of vg_osroot
>>> change (size, number of lvs etc.).
>>>
>>> Nevertheless, while thinking about your problem I think I had an
>>> idea of how to fix it so that the root vg can be clustered as well.
>>> I will provide new packages in the next few days that should deal
>>> with the problem.
>>>
>>> Keep in mind that there is a difference between cman_tool services
>>> and the lvm usage. clvmd only uses the locktable clvmd shown by
>>> cman_tool services; the other locktables are relevant to the file
>>> systems and other services (fenced, rgmanager, ..). This is a
>>> completely different use case.
>>>
>>> Please elaborate a bit more on this:
>>>
>>> "I am still a bit stuck when nodes with gfs2 mounted don't restart
>>> if instructed to do so, but I will read some more."
>>>
>>> What do you mean by it? How does this happen? This sounds like
>>> something you should have a look at.
>>>
>>> "One thing that I can confirm is
>>> osr(notice): Detecting nodeid & nodename
>>> This does not always display the correct info, but it doesn't seem
>>> to be a problem either?"
>>>
>>> You should always look at the nodeid; the nodename is (more or
>>> less) only descriptive and might not be set as expected. But the
>>> nodeid should always be consistent. Does this help?
>>>
>>> About your notes (I only take the relevant ones):
>>>
>>> 1. osr(notice): Creating clusterfiles /var/run/cman_admin /var/run/cman_client.. [OK]
>>> This message may be misleading: it only tells you that these
>>> control files are being created inside the ramdisk. It has nothing
>>> to do with these files on your root file system. Nevertheless,
>>> /etc/init.d/bootsr should take over this part and create the files.
>>> Please send me another
>>> bash -x /etc/init.d/bootsr start
>>> output, taken while those files do not exist.
>>>
>>> 2. vgs
>>>
>>> VG        #PV #LV #SN Attr   VSize    VFree
>>> VG_SDATA    1   2   0 wz--nc 1000.00g     0
>>> vg_osroot   1   1   0 wz--n-   60.00g     0
>>>
>>> This is perfectly ok. It only means the vg is not clustered. But
>>> the filesystem IS. The two are not connected.
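>>>
>>> (The clustered flag is the sixth character of the vg_attr column,
>>> 'c' when set:
>>>
>>> vgs -o vg_name,vg_attr
>>> # VG_SDATA   wz--nc   <- clustered
>>> # vg_osroot  wz--n-   <- not clustered
>>> )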
>>>
>>> Hope this helps.
>>> Let me know about the open issues.
>>>
>>> Regards
>>>
>>> Marc.
>>>
>>> ----- Original Message -----
>>> From: "Jorge Silva" <me...@je...>
>>> To: "Marc Grimme" <gr...@at...>
>>> Sent: Tuesday, November 13, 2012 2:15:23 AM
>>> Subject: Re: Problem with VG activation clvmd runs at 100%
>>>
>>> Marc
>>>
>>> Hi - I believe I have solved my problem, with your help, thank you.
>>> Yet I'm not sure how I caused it: the root volume group, as you
>>> pointed out, had the clustered attribute (and I must have done
>>> something silly along the way). I re-installed from scratch (see
>>> notes below), and then, just to prove that it is a problem, I
>>> changed the attribute of the rootfs with vgchange -cy and rebooted,
>>> and I ran into trouble; I changed it back and it is fine. So that
>>> does cause problems on start-up. I'm not sure I understand why, as
>>> there is an active quorum for clvmd to join and take part in..
>>>
>>> Despite it not being marked as a clustered volume, cman_tool
>>> services shows it as such, but clvmd status doesn't? Is it safe to
>>> write to it with multiple nodes mounted?
>>>
>>> I am still a bit stuck when nodes with gfs2 mounted don't restart
>>> if instructed to do so, but I will read some more.
>>>
>>> One thing that I can confirm is
>>> osr(notice): Detecting nodeid & nodename
>>> does not always display the correct info, but it doesn't seem to be
>>> a problem either?
>>>
>>> Thanks
>>> Jorge
>>>
>>> Notes:
>>> I decided to start from scratch: I blew away the rootfs and
>>> reinstalled as per the website. My assumption is that I had edited
>>> something and messed it up (I did look at a lot of the scripts to
>>> try to "figure out and fix" the problem; I can send the history if
>>> you want, or I can edit and contribute).
>>>
>>> I rebooted the server and I had an issue: I didn't disable selinux,
>>> so I had to intervene in the boot stage. That completed, but I
>>> noticed that:
>>>
>>> osr(notice): Starting network configuration for lo0 [OK]
>>> osr(notice): Detecting nodeid & nodename
>>>
>>> is blank, but somehow the correct nodeid and name were deduced.
>>>
>>> I had to rebuild the ramdisk to make the selinux fix stick. I also
>>> installed some packages:
>>> yum install pciutils (mkinitrd warned about this, so I installed it)
>>> yum install cluster-snmp
>>> yum install rgmanager
>>>
>>> On this reboot I noticed that, despite this message
>>>
>>> osr(notice): Creating clusterfiles /var/run/cman_admin /var/run/cman_client.. [OK]
>>>
>>> Starting clvmd: dlm: Using TCP for communications
>>>
>>> Activating VG(s): File descriptor 3 (/dev/console) leaked on vgchange invocation. Parent PID 15995: /bin/bash
>>> File descriptor 4 (/dev/console) leaked on vgchange invocation. Parent PID 15995: /bin/bash
>>> Skipping clustered volume group VG_SDATA
>>> 1 logical volume(s) in volume group "vg_osroot" now active
>>>
>>> the links weren't created, so I created them manually:
>>>
>>> ln -sf /var/comoonics/chroot//var/run/cman_admin /var/run/cman_admin
>>> ln -sf /var/comoonics/chroot//var/run/cman_client /var/run/cman_client
>>>
>>> I could then get clusterstatus etc., and clvmd was running ok.
>>>
>>> I looked in /etc/lvm/lvm.conf and locking_type = 4?
>>>
>>> I then issued
>>>
>>> lvmconf --enable-cluster
>>>
>>> and this changed /etc/lvm/lvm.conf to locking_type = 3.
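>>>
>>> (Quick sanity check after that change; note that the ramdisk keeps
>>> its own copy of lvm.conf until it is rebuilt:
>>>
>>> grep locking_type /etc/lvm/lvm.conf
>>> # locking_type = 3
>>> )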
>>>
>>> vgscan correctly showed the clustered volumes and was working ok.
>>>
>>> I did not rebuild the ramdisk (I can confirm that the lvm.conf in
>>> the ramdisk has locking_type=4); I have rebooted and everything is
>>> working:
>>>
>>> Starting clvmd: dlm: Using TCP for communications
>>>
>>> Activating VG(s): File descriptor 3 (/dev/console) leaked on vgchange invocation. Parent PID 15983: /bin/bash
>>> File descriptor 4 (/dev/console) leaked on vgchange invocation. Parent PID 15983: /bin/bash
>>> Skipping clustered volume group VG_SDATA
>>> 1 logical volume(s) in volume group "vg_osroot" now active
>>>
>>> I have rebooted a number of times and am confident that things are ok.
>>>
>>> I decided to add two other nodes to the mix, and I can confirm that
>>> every time a new node is added these files are missing:
>>>
>>> /var/run/cman_admin
>>> /var/run/cman_client
>>>
>>> But I can see from the logs:
>>>
>>> osr(notice): Creating clusterfiles /var/run/cman_admin /var/run/cman_client.. [OK]
>>>
>>> Despite the above message, the information below is also not always
>>> detected, but the nodeid etc. is still correct...
>>>
>>> osr(notice): Detecting nodeid & nodename
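>>>
>>> (As a stopgap until bootsr creates them reliably, I recreate the
>>> links on each node with something like:
>>>
>>> for f in cman_admin cman_client; do
>>>     ln -sf /var/comoonics/chroot/var/run/$f /var/run/$f
>>> done
>>> )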
>>>
>>> So now I have 3 nodes in the cluster and things look ok:
>>>
>>> [root@bwccs302 ~]# cman_tool services
>>> fence domain
>>> member count 3
>>> victim count 0
>>> victim now 0
>>> master nodeid 2
>>> wait state none
>>> members 2 3 4
>>>
>>> dlm lockspaces
>>> name home
>>> id 0xf8ee17aa
>>> flags 0x00000008 fs_reg
>>> change member 3 joined 1 remove 0 failed 0 seq 3,3
>>> members 2 3 4
>>>
>>> name clvmd
>>> id 0x4104eefa
>>> flags 0x00000000
>>> change member 3 joined 1 remove 0 failed 0 seq 15,15
>>> members 2 3 4
>>>
>>> name OSRoot
>>> id 0xab5404ad
>>> flags 0x00000008 fs_reg
>>> change member 3 joined 1 remove 0 failed 0 seq 7,7
>>> members 2 3 4
>>>
>>> gfs mountgroups
>>> name home
>>> id 0x686e3fc4
>>> flags 0x00000048 mounted
>>> change member 3 joined 1 remove 0 failed 0 seq 3,3
>>> members 2 3 4
>>>
>>> name OSRoot
>>> id 0x659f7afe
>>> flags 0x00000048 mounted
>>> change member 3 joined 1 remove 0 failed 0 seq 7,7
>>> members 2 3 4
>>>
>>> service clvmd status
>>> clvmd (pid 25771) is running...
>>> Clustered Volume Groups: VG_SDATA
>>> Active clustered Logical Volumes: LV_HOME LV_DEVDB
>>>
>>> It doesn't believe that the root file-system is clustered, despite
>>> the output from the above:
>>>
>>> [root@bwccs302 ~]# vgs
>>> VG        #PV #LV #SN Attr   VSize    VFree
>>> VG_SDATA    1   2   0 wz--nc 1000.00g     0
>>> vg_osroot   1   1   0 wz--n-   60.00g     0
>>>
>>> The above got me thinking about what you wanted me to do to disable
>>> the clustered flag on the root volume; with it left on I was having
>>> problems (not sure how it got turned on).
>>>
>>> With everything working ok, I remade the ramdisk, and the ramdisk
>>> lvm.conf now has locking_type=3.
>>>
>>> The systems start up and things look ok.