From: Marc G. <gr...@at...> - 2012-11-20 08:42:09
Jorge, let's first start with fencing. You are using ipmilan for fencing. I didn't evaluate that agent with rhel6, so let's start by fixing this issue. Try the following:

com-chroot /sbin/fence_ipmilan -h

Send me the output. There might be some libs missing.

The clvmd behaviour is very strange. Try to stay with locking_type=2 or locking_type=3. Then rebuild the initrd and reboot. If clvmd stays at 100% CPU, kill it and start it again manually with the -d flag. Send me the output. Perhaps we will see something from there.

Regards
Marc.

Am 19.11.2012 15:39, schrieb Jorge Silva:
> Marc
>
> Hi, np, thanks for helping. The /var/run/cman* files are there. I will
> disable the clustered flag on the second volume. Even more disturbing:
> after the last email I sent you, I went from a state where clvmd was
> behaving normally (not 100%) and I could access clustered volumes. I
> rebooted to verify that everything was functioning - but I am now
> back to the state where clvmd is running at 100% - back to where we
> started (can't access clustered volumes).
>
> locking_type=0
> [root@bwccs302 ~]# vgs
> WARNING: Locking disabled. Be careful! This could corrupt your metadata.
> VG #PV #LV #SN Attr VSize VFree
> VG_DATA1 1 2 0 wz--n- 64.00g 4.20g
> vg_osroot 1 1 0 wz--n- 60.00g 0
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 23207 root 2 -18 101m 23m 3176 S *99.9* 0.1 0:05.82 clvmd
>
> lrwxrwxrwx 1 root root 41 Nov 16 16:41 /var/run/cman_admin ->
> /var/comoonics/chroot//var/run/cman_admin
> lrwxrwxrwx 1 root root 42 Nov 16 16:41 /var/run/cman_client ->
> /var/comoonics/chroot//var/run/cman_client
>
> locking_type=3
> [root@bwccs302 ~]# service clvmd restart
> Restarting clvmd: [ OK ]
> [root@bwccs302 ~]# vgs
> cluster request failed: Invalid argument
> Can't get lock for VG_DATA1
> cluster request failed: Invalid argument
> Can't get lock for vg_osroot
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 23829 root 2 -18 167m 24m 3268 S *99.8* 0.1 0:31.29 clvm
>
> As far as the shutdown goes - with the two nodes up, once I issue the
> shutdown on node1, the shutdown proceeds to the point shown in the
> screenshots I sent (deactivating cluster services). On node2 I notice:
>
> corosync[16648]: [TOTEM ] A processor failed, forming new configuration.
>
> It attempts to fence node1. After 3 unsuccessful attempts, it locks up.
> Node1 stays stuck (screen dump I sent). I do a tcpdump and I see the
> two nodes are still sending multicast messages, and until I reset
> node1, node2 will stay in a locked state with no access... this is the
> last set of messages I see:
>
> fenced[16784]: fence smc01b dev 0.0 agent fence_ipmilan result: error
> from agent
> fenced[16784]: fence smc01b failed
>
> After 3 attempts, as fencing failed, the cluster locks up until I have
> reset the node.
> I suspect there is another issue at play here, as I can manually fence
> a node using fence_node x (I will continue to dig into this; I have
> tried fenced -q and messagebus with the same result).
>
> Thanks
> Jorge
>
> On Mon, Nov 19, 2012 at 3:01 AM, Marc Grimme <gr...@at...
> <mailto:gr...@at...>> wrote:
>
> Hi Jorge,
> sorry for the delay, but I was quite busy over the last days.
> Nevertheless, I don't understand the problem yet.
> Let's first start at the point I think could lead to problems
> during shutdown and friends.
> Are the control files in /var/run/cman* being created by the
> bootsr initscript, or do you still have to create them manually?
> If they are not created I would still be very interested in the > output of > bash -x /etc/init.d/bootsr start > after a node has been started. > > If not we need to dig deeper into the problems during shutdown. > I would then also change the clustered flag for the other volume > group. > Again as long as you don't change the size it wont hurt. > And it's only for better understanding the problem. > > Another command I'd like to see is a cman_tool services on the > other node (say node 2) while the shutdown node is being stuck > (say node 1). > > Thanks Marc. > Am 15.11.2012 19:08, schrieb Jorge Silva: >> Marc >> >> Hi, I believe the problem is related to the clsuter services not >> shutting down. init 0, will not work with 1 or more nodes, init >> 6 will only work when 1 node is present. When more than 1 node >> is present the node with the init 6 will have to be fenced as it >> will not shut down. I believe the cluster components aren't >> shutting down (this also happens with init 6 when more than one >> node is present) - I still see cluster traffic on the network, >> this is periodic. >> >> 12:42:00.547615 IP 172.17.62.12.hpoms-dps-lstn > >> 229.192.0.2.netsupport: UDP, length 119 >> >> At the point that the system will not shut down, it still is a >> cluster member and there is still cluster traffic. >> >> 1 node : >> [root@bwccs302 ~]# init 0 >> >> Can't connect to default. Skipping. >> Shutting down Cluster Module - cluster monitor: [ OK ] >> Shutting down ricci: [ OK ] >> Shutting down Avahi daemon: [ OK ] >> Shutting down oddjobd: [ OK ] >> Stopping saslauthd: [ OK ] >> Stopping sshd: [ OK ] >> Shutting down sm-client: [ OK ] >> Shutting down sendmail: [ OK ] >> Stopping imsd via sshd: [ OK ] >> Stopping snmpd: [ OK ] >> Stopping crond: [ OK ] >> Stopping HAL daemon: [ OK ] >> Shutting down ntpd: [ OK ] >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" now active >> [ OK ] >> Signaling clvmd to exit [ OK ] >> clvmd terminated[ OK ] >> Stopping lldpad: [ OK ] >> Stopping system message bus: [ OK ] >> Stopping multipathd daemon: [ OK ] >> Stopping rpcbind: [ OK ] >> Stopping auditd: [ OK ] >> Stopping nslcd: [ OK ] >> Shutting down system logger: [ OK ] >> Stopping sssd: [ OK ] >> Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ] >> Stopping gfs2 dependent services Starting clvmd: >> Activating VG(s): 2 logical volume(s) in volume group >> "VG_SDATA" now active >> 1 logical volume(s) in volume group "vg_osroot" now active >> [ OK ] >> osr(notice) ..bindmounts.. [ OK ] >> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in >> volume group "vg_osroot" unmonitored >> [ OK ] >> Sending all processes the TERM signal... [ OK ] >> Sending all processes the KILL signal... [ OK ] >> Saving random seed: [ OK ] >> Syncing hardware clock to system time [ OK ] >> Turning off quotas: quotaoff: Cannot change state of GFS2 quota. >> quotaoff: Cannot change state of GFS2 quota. >> [FAILED] >> Unmounting file systems: [ OK ] >> init: Re-executing /sbin/init >> Halting system... >> osr(notice) Scanning for Bootparameters... 
>> osr(notice) Starting ATIX exitrd >> osr(notice) Comoonics-Release >> osr(notice) comoonics Community Release 5.0 (Gumpn) >> osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 >> 15:09:53 $ >> osr(debug) Calling cmd /sbin/halt -d -p >> osr(notice) Preparing chrootcp: cannot stat >> `/mnt/newroot/dev/initctl': No such file or directory >> [ OK ] >> osr(notice) com-realhalt: detected distribution: rhel6, clutype: >> gfs, rootfs: gfs2 >> osr(notice) Restarting init process in chroot[ OK ] >> osr(notice) Moving dev filesystem[ OK ] >> osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys >> /mnt/newroot/proc) >> osr(notice) Umounting /mnt/newroot/sys[ OK ] >> osr(notice) Umounting /mnt/newroot/proc[ OK ] >> osr(notice) Umounting filesystems in oldroot >> (/mnt/newroot/var/run /mnt/newroot/var/lock /mnt/newroot/.cdsl.local) >> osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing >> /sbin/init >> [ OK ] >> osr(notice) Umounting /mnt/newroot/var/lock[ OK ] >> osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] >> osr(notice) Umounting oldroot /mnt/newroot[ OK ] >> osr(notice) Breakpoint "halt_umountoldroot" detected forking a shell >> bash: no job control in this shell >> >> Type help to get more information.. >> Type exit to continue work.. >> ------------------------------------------------------------- >> >> comoonics 1 > cman_tool: unknown option cman_tool >> comoonics 2 > comoonics 2 > Version: 6.2.0 >> Config Version: 1 >> Cluster Name: ProdCluster01 >> Cluster Id: 11454 >> Cluster Member: Yes >> Cluster Generation: 4 >> Membership state: Cluster-Member >> Nodes: 1 >> Expected votes: 4 >> Quorum device votes: 3 >> Total votes: 4 >> Node votes: 1 >> Quorum: 3 >> Active subsystems: 10 >> Flags: >> Ports Bound: 0 11 178 >> Node name: smc01b >> Node ID: 2 >> Multicast addresses: 229.192.0.2 >> Node addresses: 172.17.62.12 >> comoonics 3 > fence domain >> member count 1 >> victim count 0 >> victim now 0 >> master nodeid 2 >> wait state none >> members 2 >> >> dlm lockspaces >> name clvmd >> id 0x4104eefa >> flags 0x00000000 >> change member 1 joined 1 remove 0 failed 0 seq 1,1 >> members 2 >> >> comoonics 4 > bash: exitt: command not found >> comoonics 5 > exit >> osr(notice) Back to work.. >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" now active >> >> It hung at the point above - so I re-ran with the edit set -x in >> line 207. >> 1 -node: >> [root@bwccs302 ~]# init 0 >> [root@bwccs302 ~ >> Can't connect to default. Skipping. >> Shutting down Cluster Module - cluster monitor: [ OK ] >> Shutting down ricci: [ OK ] >> Shutting down Avahi daemon: [ OK ] >> Shutting down oddjobd: [ OK ] >> Stopping saslauthd: [ OK ] >> Stopping sshd: [ OK ] >> Shutting down sm-client: [ OK ] >> Shutting down sendmail: [ OK ] >> Stopping imsd via sshd: [ OK ] >> Stopping snmpd: [ OK ] >> Stopping crond: [ OK ] >> Stopping HAL daemon: [ OK ] >> Shutting down ntpd: [ OK ] >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" n ow active >> [ OK ] >> Signaling clvmd to exit [ OK ] >> clvmd terminated[ OK ] >> Stopping lldpad: [ OK ] >> Stopping system message bus: [ OK ] >> Stopping multipathd daemon: [ OK ] >> Stopping rpcbind: [ OK ] >> Stopping auditd: [ OK ] >> Stopping nslcd: [ OK ] >> Shutting down system logger: [ OK ] >> Stopping sssd: [ OK ] >> Stopping gfs dependent services osr(notice) ..bindmounts.. 
[ OK ] >> Stopping gfs2 dependent services Starting clvmd: >> Activating VG(s): 1 logical volume(s) in volume group >> "vg_osroot" now active >> 2 logical volume(s) in volume group "VG_SDATA" now active >> [ OK ] >> osr(notice) ..bindmounts.. [ OK ] >> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in >> volume group "vg_ osroot" unmonitored >> [ OK ] >> Sending all processes the TERM signal... [ OK ] >> Sending all processes the KILL signal... [ OK ] >> Saving random seed: [ OK ] >> Syncing hardware clock to system time [ OK ] >> Turning off quotas: quotaoff: Cannot change state of GFS2 quota. >> quotaoff: Cannot change state of GFS2 quota. >> [FAILED] >> Unmounting file systems: [ OK ] >> init: Re-executing /sbin/init >> Halting system... >> osr(notice) Scanning for Bootparameters... >> osr(notice) Starting ATIX exitrd >> osr(notice) Comoonics-Release >> osr(notice) comoonics Community Release 5.0 (Gumpn) >> osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 >> 15:09:53 $ >> osr(notice) Preparing chrootcp: cannot stat >> `/mnt/newroot/dev/initctl': No such file or directory [ OK ] >> osr(notice) com-realhalt: detected distribution: rhel6, clutype: >> gfs, rootfs: gfs2 >> osr(notice) Restarting init process in chroot[ OK ] >> osr(notice) Moving dev filesystem[ OK ] >> osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys >> /mnt/newroot/proc) >> osr(notice) Umounting /mnt/newroot/sys[ OK ] >> osr(notice) Umounting /mnt/newroot/proc[ OK ] >> osr(notice) Umounting filesystems in oldroot >> (/mnt/newroot/var/run /mnt/newroot/var/lock /mnt/newroot/.cdsl.local) >> osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing >> /sbin/init [ OK ] >> osr(notice) Umounting /mnt/newroot/var/lock[ OK ] >> osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] >> osr(notice) Umounting oldroot /mnt/newroot[ OK ] >> + clusterfs_services_stop '' '' 0 >> ++ repository_get_value rootfs >> +++ repository_normalize_value rootfs >> ++ local key=rootfs >> ++ local default= >> ++ local repository= >> ++ '[' -z '' ']' >> ++ repository=comoonics >> ++ local value= >> ++ '[' -f /var/cache/comoonics-repository/comoonics.rootfs ']' >> +++ cat /var/cache/comoonics-repository/comoonics.rootfs >> ++ value=gfs2 >> ++ echo gfs2 >> ++ return 0 >> + local rootfs=gfs2 >> + gfs2_services_stop '' '' 0 >> + local chroot_path= >> + local lock_method= >> + local lvm_sup=0 >> + '[' -n 0 ']' >> + '[' 0 -eq 0 ']' >> + /etc/init.d/clvmd stop >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" now active >> >> with 2 nodes + quorate when init 6 is issued: >> >> [root@bwccs304 ~]# init 6 >> [root@bwccs304 ~ >> Can't connect to default. Skipping. 
>> Shutting down Cluster Module - cluster monitor: [ OK ] >> Shutting down ricci: [ OK ] >> Shutting down Avahi daemon: [ OK ] >> Shutting down oddjobd: [ OK ] >> Stopping saslauthd: [ OK ] >> Stopping sshd: [ OK ] >> Shutting down sm-client: [ OK ] >> Shutting down sendmail: [ OK ] >> Stopping imsd via sshd: [ OK ] >> Stopping snmpd: [ OK ] >> Stopping crond: [ OK ] >> Stopping HAL daemon: [ OK ] >> Shutting down ntpd: [ OK ] >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" now active >> [ OK ] >> Signaling clvmd to exit [ OK ] >> clvmd terminated[ OK ] >> Stopping lldpad: [ OK ] >> Stopping system message bus: [ OK ] >> Stopping multipathd daemon: [ OK ] >> Stopping rpcbind: [ OK ] >> Stopping auditd: [ OK ] >> Stopping nslcd: [ OK ] >> Shutting down system logger: [ OK ] >> Stopping sssd: [ OK ] >> Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ] >> Stopping gfs2 dependent services Starting clvmd: >> Activating VG(s): 1 logical volume(s) in volume group >> "vg_osroot" now active >> 2 logical volume(s) in volume group "VG_SDATA" now active >> [ OK ] >> osr(notice) ..bindmounts.. [ OK ] >> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in >> volume group "vg_osroot" unmonitored >> [ OK ] >> Sending all processes the TERM signal... [ OK ] >> qdiskd[15713]: Unregistering quorum device. >> >> Sending all processes the KILL signal... dlm: clvmd: no userland >> control daemon, stopping lockspace >> dlm: OSRoot: no userland control daemon, stopping lockspace >> [ OK ] >> - stops here and will not die... Still have full cluster coms >> >> Thanks >> jorge >> >> On Tue, Nov 13, 2012 at 9:32 AM, Marc Grimme <gr...@at... >> <mailto:gr...@at...>> wrote: >> >> Hi Jorge, >> because of the "init 0". >> Please issue the following commands prior to init 0. >> # Make it a little more chatty >> $ com-chroot setparameter debug >> # Break after before cluster will be stopped >> $ com-chroot setparameter step halt_umountoldroot >> >> Then issue a init 0. >> This should lead you to a breakpoint during shutdown >> (hopefully, cause sometimes the console gets confused). >> In side the breakpoint issue: >> $ cman_tool status >> $ cman_tool services >> # Continue shutdown >> $ exit >> Then send me the output. >> >> If this fails also do as follows: >> $ com-chroot vi com-realhalt.sh >> # go to line 207 (before clusterfs_services_stop) is called >> and add a set -x >> $ init 0 >> >> Send the output. >> Thanks Marc. >> >> ----- Original Message ----- >> From: "Jorge Silva" <me...@je... <mailto:me...@je...>> >> To: "Marc Grimme" <gr...@at... <mailto:gr...@at...>> >> Cc: ope...@li... >> <mailto:ope...@li...> >> Sent: Tuesday, November 13, 2012 3:22:37 PM >> Subject: Re: Problem with VG activation clvmd runs at 100% >> >> Marc >> >> >> Hi, thanks for the info, it helps. I have also noticed that >> gfs2 entries in the fstab get ignored on boot, I have added >> in rc.local. I have done a bit more digging and the issue I >> described below: >> >> >> "I am still a bit stuck when nodes with gfs2 mounted don't >> restart if instructed to do so, but I will read some more." >> >> >> If I issue a init 6 on a nodes they will restart. If I issue >> init 0, then I have the problem the node start to shut down, >> but will stay in the cluster. I have to shut it off, it will >> not shut down, this is the log. >> >> >> >> [root@bwccs304 ~]# init 0 >> >> >> Can't connect to default. Skipping. 
>> Shutting down Cluster Module - cluster monitor: [ OK ] >> Shutting down ricci: [ OK ] >> Shutting down oddjobd: [ OK ] >> Stopping saslauthd: [ OK ] >> Stopping sshd: [ OK ] >> Shutting down sm-client: [ OK ] >> Shutting down sendmail: [ OK ] >> Stopping imsd via sshd: [ OK ] >> Stopping snmpd: [ OK ] >> Stopping crond: [ OK ] >> Stopping HAL daemon: [ OK ] >> Stopping nscd: [ OK ] >> Shutting down ntpd: [ OK ] >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" now active >> [ OK ] >> Signaling clvmd to exit [ OK ] >> clvmd terminated[ OK ] >> Stopping lldpad: [ OK ] >> Stopping system message bus: [ OK ] >> Stopping multipathd daemon: [ OK ] >> Stopping rpcbind: [ OK ] >> Stopping auditd: [ OK ] >> Stopping nslcd: [ OK ] >> Shutting down system logger: [ OK ] >> Stopping sssd: [ OK ] >> Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ] >> Stopping gfs2 dependent services Starting clvmd: >> Activating VG(s): 2 logical volume(s) in volume group >> "VG_SDATA" now active >> 1 logical volume(s) in volume group "vg_osroot" now active >> [ OK ] >> osr(notice) ..bindmounts.. [ OK ] >> Stopping monitoring for VG VG_SDATA: 1 logical volume(s) in >> volume group "VG_SDATA" unmonitored >> [ OK ] >> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in >> volume group "vg_osroot" unmonitored >> [ OK ] >> Sending all processes the TERM signal... [ OK ] >> Sending all processes the KILL signal... [ OK ] >> Saving random seed: [ OK ] >> Syncing hardware clock to system time [ OK ] >> Turning off quotas: quotaoff: Cannot change state of GFS2 quota. >> quotaoff: Cannot change state of GFS2 quota. >> [FAILED] >> Unmounting file systems: [ OK ] >> init: Re-executing /sbin/init >> Halting system... >> osr(notice) Scanning for Bootparameters... >> osr(notice) Starting ATIX exitrd >> osr(notice) Comoonics-Release >> osr(notice) comoonics Community Release 5.0 (Gumpn) >> osr(notice) Internal Version $Revision: 1.18 $ $Date: >> 2011-02-11 15:09:53 $ >> osr(notice) Preparing chrootcp: cannot stat >> `/mnt/newroot/dev/initctl': No such file or directory >> [ OK ] >> osr(notice) com-realhalt: detected distribution: rhel6, >> clutype: gfs, rootfs: gfs2 >> osr(notice) Restarting init process in chroot[ OK ] >> osr(notice) Moving dev filesystem[ OK ] >> osr(notice) Umounting filesystems in oldroot ( >> /mnt/newroot/sys /mnt/newroot/proc) >> osr(notice) Umounting /mnt/newroot/sys[ OK ] >> osr(notice) Umounting /mnt/newroot/proc[ OK ] >> osr(notice) Umounting filesystems in oldroot >> (/mnt/newroot/var/run /mnt/newroot/var/lock >> /mnt/newroot/.cdsl.local) >> osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing >> /sbin/init >> [ OK ] >> osr(notice) Umounting /mnt/newroot/var/lock[ OK ] >> osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] >> osr(notice) Umounting oldroot /mnt/newroot[ OK ] >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" now active >> >> >> >> >> >> On Tue, Nov 13, 2012 at 2:43 AM, Marc Grimme < gr...@at... >> <mailto:gr...@at...> > wrote: >> >> >> Jorge, >> you don't need to be doubtful about the fact that the volume >> group for the root file system is not flagged as clustered. >> This has no implications whatsoever on the gfs2 file system. >> >> It will only be a problem whenever the lvm settings of the >> vg_osroot change (size, number of lvs etc.). 
>>
>> Nevertheless, while thinking about your problem, I think I have an
>> idea of how to fix this so that the root vg can be clustered as well.
>> I will provide new packages in the next days that should deal with
>> the problem.
>>
>> Keep in mind that there is a difference between cman_tool
>> services and the lvm usage.
>> clvmd only uses the locktable "clvmd" shown by cman_tool
>> services; the other locktables are relevant to the file
>> systems and other services (fenced, rgmanager, ..). This is a
>> completely different use case.
>>
>> Try to elaborate a bit more on this point:
>>
>> "I am still a bit stuck when nodes with gfs2 mounted don't
>> restart if instructed to do so, but I will read some more."
>> What do you mean by it? How does this happen? This sounds
>> like something you should have a look at.
>>
>>
>> "One thing that I can confirm is
>> osr(notice): Detecting nodeid & nodename
>> This does not always display the correct info, but it doesn't
>> seem to be a problem either?"
>>
>> You should always look at the nodeid; the nodename is (more or
>> less) only descriptive and might not be set as expected. But
>> the nodeid should always be consistent. Does this help?
>>
>> About your notes (I only take the relevant ones):
>>
>> 1. osr(notice): Creating clusterfiles /var/run/cman_admin
>> /var/run/cman_client.. [OK]
>> This message should not be misleading; it only tells you that
>> these control files are being created inside the ramdisk.
>> This has nothing to do with these files on your root file
>> system. Nevertheless, /etc/init.d/bootsr should take over this
>> part and create the files. Please send me another
>> bash -x /etc/init.d/bootsr start
>> output - taken while those files are missing.
>>
>> 2. vgs
>>
>> VG #PV #LV #SN Attr VSize VFree
>> VG_SDATA 1 2 0 wz--nc 1000.00g 0
>> vg_osroot 1 1 0 wz--n- 60.00g 0
>>
>> This is perfectly ok. It only means the vg is not
>> clustered. But the filesystem IS. The two have no
>> connection.
>>
>> Hope this helps.
>> Let me know about the open issues.
>>
>> Regards
>>
>> Marc.
>>
>>
>> ----- Original Message -----
>> From: "Jorge Silva" < me...@je... <mailto:me...@je...> >
>> To: "Marc Grimme" < gr...@at... <mailto:gr...@at...> >
>>
>> Sent: Tuesday, November 13, 2012 2:15:23 AM
>> Subject: Re: Problem with VG activation clvmd runs at 100%
>>
>>
>> Marc
>>
>>
>> Hi - I believe I have solved my problem, with your help, thank you.
>> Yet, I'm not sure how I caused it - but the root volume group, as
>> you pointed out, had the clustered attribute (and I must have done
>> something silly along the way). I re-installed from scratch (see
>> notes below) and then, just to prove that it is a problem, I changed
>> the attribute of the root VG (vgchange -cy) and rebooted, and I ran
>> into trouble. I changed it back and it is fine, so that does cause
>> problems on start-up. I'm not sure I understand why, as there is an
>> active quorum for clvmd to join and take part in.
>>
>>
>> Despite it not being marked as a clustered volume, cman_tool
>> services shows it as such, but clvmd status doesn't? Is it
>> safe to write to it with multiple nodes mounted?
>>
>>
>> I am still a bit stuck when nodes with gfs2 mounted don't
>> restart if instructed to do so, but I will read some more.
>>
>>
>> One thing that I can confirm is
>> osr(notice): Detecting nodeid & nodename
>>
>>
>> This does not always display the correct info, but it doesn't
>> seem to be a problem either?
>>
>>
>> Thanks
>> Jorge
>>
>>
>> Notes:
>> I decided to start over: I blew away the rootfs and started from
>> scratch as per the website. My assumption is that I edited something
>> and messed it up (I did look at a lot of the scripts to try to
>> "figure out and fix" the problem; I can send the history if you
>> want, or I can edit and contribute).
>>
>>
>> I rebooted the server and I had an issue - I didn't disable selinux,
>> so I had to intervene in the boot stage. That completed, but I
>> noticed that:
>>
>>
>> osr(notice): Starting network configuration for lo0 [OK]
>> osr(notice): Detecting nodeid & nodename
>>
>>
>> is blank, but somehow the correct nodeid and name were deduced.
>>
>>
>> I had to rebuild the ramdisk so that the disabled selinux was picked
>> up. I also added the following:
>>
>> yum install pciutils - the mkinitrd warned about this, so I installed it.
>> I also installed:
>> yum install cluster-snmp
>> yum install rgmanager
>> in lvm
>>
>>
>> On this reboot I noticed that, despite this message
>>
>> sr(notice): Creating clusterfiles /var/run/cman_admin
>> /var/run/cman_client.. [OK]
>>
>>
>> Starting clvmd: dlm: Using TCP for communications
>>
>>
>> Activating VG(s): File descriptor 3 (/dev/console) leaked on
>> vgchange invocation. Parent PID 15995: /bin/bash
>> File descriptor 4 (/dev/console) leaked on vgchange
>> invocation. Parent PID 15995: /bin/bash
>> Skipping clustered volume group VG_SDATA
>> 1 logical volume(s) in volume group "vg_osroot" now active
>>
>>
>> the links weren't created, and I did this manually:
>>
>>
>> ln -sf /var/comoonics/chroot//var/run/cman_admin
>> /var/run/cman_admin
>> ln -sf /var/comoonics/chroot//var/run/cman_client
>> /var/run/cman_client
>>
>>
>> I could then get clusterstatus etc., and clvmd was running ok.
>>
>>
>> I looked in /etc/lvm/lvm.conf and locking_type = 4 ?
>>
>>
>> I then issued
>>
>>
>> lvmconf --enable-cluster - and this changed /etc/lvm/lvm.conf to
>> locking_type = 3.
>>
>>
>> vgscan correctly showed the clustered volumes and was working ok.
>>
>>
>> I did not rebuild the ramdisk (I can confirm that the lvm.conf in the
>> ramdisk has locking_type=4). I have rebooted and everything is working.
>>
>> Starting clvmd: dlm: Using TCP for communications
>>
>>
>> Activating VG(s): File descriptor 3 (/dev/console) leaked on
>> vgchange invocation. Parent PID 15983: /bin/bash
>> File descriptor 4 (/dev/console) leaked on vgchange
>> invocation. Parent PID 15983: /bin/bash
>> Skipping clustered volume group VG_SDATA
>> 1 logical volume(s) in volume group "vg_osroot" now active
>>
>>
>> I have rebooted a number of times and am confident that things are ok.
>>
>>
>> I decided to add two other nodes to the mix, and I can confirm that
>> every time a new node is added these files are missing:
>>
>>
>> /var/run/cman_admin
>> /var/run/cman_client
>>
>> But I can see from the logs:
>>
>>
>> osr(notice): Creating clusterfiles /var/run/cman_admin
>> /var/run/cman_client.. [OK]
>>
>>
>> Despite the above message, also, the information below is not
>> always detected, but still the nodeid etc. is correct...
>>
>>
>> osr(notice): Detecting nodeid & nodename
>>
>>
>> So now I have 3 nodes in the cluster and things look ok:
>>
>>
>> [root@bwccs302 ~]# cman_tool services
>> fence domain
>> member count 3
>> victim count 0
>> victim now 0
>> master nodeid 2
>> wait state none
>> members 2 3 4
>>
>>
>> dlm lockspaces
>> name home
>> id 0xf8ee17aa
>> flags 0x00000008 fs_reg
>> change member 3 joined 1 remove 0 failed 0 seq 3,3
>> members 2 3 4
>>
>>
>> name clvmd
>> id 0x4104eefa
>> flags 0x00000000
>> change member 3 joined 1 remove 0 failed 0 seq 15,15
>> members 2 3 4
>>
>>
>> name OSRoot
>> id 0xab5404ad
>> flags 0x00000008 fs_reg
>> change member 3 joined 1 remove 0 failed 0 seq 7,7
>> members 2 3 4
>>
>>
>> gfs mountgroups
>> name home
>> id 0x686e3fc4
>> flags 0x00000048 mounted
>> change member 3 joined 1 remove 0 failed 0 seq 3,3
>> members 2 3 4
>>
>>
>> name OSRoot
>> id 0x659f7afe
>> flags 0x00000048 mounted
>> change member 3 joined 1 remove 0 failed 0 seq 7,7
>> members 2 3 4
>>
>>
>> service clvmd status
>> clvmd (pid 25771) is running...
>> Clustered Volume Groups: VG_SDATA
>> Active clustered Logical Volumes: LV_HOME LV_DEVDB
>>
>>
>> It doesn't believe that the root file system is clustered, despite
>> the output from the above:
>>
>>
>> [root@bwccs302 ~]# vgs
>> VG #PV #LV #SN Attr VSize VFree
>> VG_SDATA 1 2 0 wz--nc 1000.00g 0
>> vg_osroot 1 1 0 wz--n- 60.00g 0
>>
>>
>> The above got me thinking about what you wanted me to do to disable
>> the clustered flag on the root volume - with it left on I was having
>> problems (not sure how it got turned on).
>>
>>
>> With everything working ok, I remade the ramdisk, and lvm.conf now
>> has locking_type = 3.
>>
>>
>> The systems start up and things look ok.
>>
>

-- 
Marc Grimme
E-Mail: grimme( at )atix.de

ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 |
85716 Unterschleissheim | www.atix.de | www.comoonics.org

Registergericht: Amtsgericht Muenchen, Registernummer: HRB 168930,
USt.-Id.: DE209485962 | Vorstand: Marc Grimme, Mark Hlawatschek,
Thomas Merz (Vors.) | Vorsitzender des Aufsichtsrats: Dr. Martin Buss
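
For readers working through the same setup, here is the fence-agent check from the top of the thread collected into one place. It is only a sketch: the ldd step and the placeholder BMC address/credentials are illustrative assumptions, not commands the participants actually ran.

# 1. Does the agent start inside the comoonics chroot at all?
$ com-chroot /sbin/fence_ipmilan -h

# 2. Are any shared libraries missing inside the chroot (if ldd is available there)?
$ com-chroot ldd /sbin/fence_ipmilan | grep "not found"

# 3. Can the agent reach the peer's BMC? (address and credentials are placeholders)
$ com-chroot /sbin/fence_ipmilan -a <bmc-ip-of-smc01b> -l <user> -p <password> -o status

# 4. Does manual fencing through the cluster stack still work, as Jorge reports?
$ fence_node smc01b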
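
Likewise, a minimal sketch of the clvmd/locking steps Marc suggests (stay with locking_type 2 or 3, rebuild the initrd, restart clvmd with debugging). The exact comoonics initrd rebuild command is not named in the thread, so it is only referenced in a comment; lvmconf and clvmd -d are standard LVM2 tools.

# Which locking type is configured on the root file system?
$ grep locking_type /etc/lvm/lvm.conf

# Switch to cluster locking via clvmd (sets locking_type = 3)
$ lvmconf --enable-cluster

# Rebuild the open-shared-root initrd with the same comoonics mkinitrd wrapper
# used at install time, so the ramdisk carries the matching lvm.conf, then reboot.

# If clvmd still spins at 100% CPU, stop it and start it manually with debugging:
$ service clvmd stop || kill $(pidof clvmd)
$ clvmd -d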
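
Finally, since much of the thread turns on whether a volume group carries the clustered flag, a short sketch of how to inspect and change it. vgchange -cy appears in the thread; -cn is its counterpart for clearing the flag.

# A trailing 'c' in the Attr column (e.g. wz--nc) marks a clustered VG
$ vgs -o vg_name,vg_attr

# Clear the clustered flag on the root VG; if cluster locking is not available
# at that moment, LVM may require temporarily relaxing the locking type.
$ vgchange -cn vg_osroot

# Set the flag again later if desired
$ vgchange -cy vg_osroot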