From: Jorge S. <me...@je...> - 2012-11-15 18:31:40
Marc,

Hi, I believe the problem is related to the cluster services not shutting down. "init 0" will not work with 1 or more nodes; "init 6" will only work when 1 node is present. When more than 1 node is present, the node given the "init 6" has to be fenced, as it will not shut down. I believe the cluster components aren't shutting down (this also happens with "init 6" when more than one node is present) - I still see periodic cluster traffic on the network:

12:42:00.547615 IP 172.17.62.12.hpoms-dps-lstn > 229.192.0.2.netsupport: UDP, length 119

At the point where the system will not shut down, it is still a cluster member and there is still cluster traffic.

1 node:

[root@bwccs302 ~]# init 0
Can't connect to default. Skipping.
Shutting down Cluster Module - cluster monitor: [ OK ]
Shutting down ricci: [ OK ]
Shutting down Avahi daemon: [ OK ]
Shutting down oddjobd: [ OK ]
Stopping saslauthd: [ OK ]
Stopping sshd: [ OK ]
Shutting down sm-client: [ OK ]
Shutting down sendmail: [ OK ]
Stopping imsd via sshd: [ OK ]
Stopping snmpd: [ OK ]
Stopping crond: [ OK ]
Stopping HAL daemon: [ OK ]
Shutting down ntpd: [ OK ]
Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active
[ OK ]
Signaling clvmd to exit [ OK ]
clvmd terminated[ OK ]
Stopping lldpad: [ OK ]
Stopping system message bus: [ OK ]
Stopping multipathd daemon: [ OK ]
Stopping rpcbind: [ OK ]
Stopping auditd: [ OK ]
Stopping nslcd: [ OK ]
Shutting down system logger: [ OK ]
Stopping sssd: [ OK ]
Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ]
Stopping gfs2 dependent services Starting clvmd:
Activating VG(s): 2 logical volume(s) in volume group "VG_SDATA" now active
1 logical volume(s) in volume group "vg_osroot" now active
[ OK ]
osr(notice) ..bindmounts.. [ OK ]
Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume group "vg_osroot" unmonitored
[ OK ]
Sending all processes the TERM signal... [ OK ]
Sending all processes the KILL signal... [ OK ]
Saving random seed: [ OK ]
Syncing hardware clock to system time [ OK ]
Turning off quotas: quotaoff: Cannot change state of GFS2 quota.
quotaoff: Cannot change state of GFS2 quota.
[FAILED]
Unmounting file systems: [ OK ]
init: Re-executing /sbin/init
Halting system...
osr(notice) Scanning for Bootparameters...
osr(notice) Starting ATIX exitrd
osr(notice) Comoonics-Release
osr(notice) comoonics Community Release 5.0 (Gumpn)
osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 15:09:53 $
osr(debug) Calling cmd /sbin/halt -d -p
osr(notice) Preparing chrootcp: cannot stat `/mnt/newroot/dev/initctl': No such file or directory
[ OK ]
osr(notice) com-realhalt: detected distribution: rhel6, clutype: gfs, rootfs: gfs2
osr(notice) Restarting init process in chroot[ OK ]
osr(notice) Moving dev filesystem[ OK ]
osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys /mnt/newroot/proc)
osr(notice) Umounting /mnt/newroot/sys[ OK ]
osr(notice) Umounting /mnt/newroot/proc[ OK ]
osr(notice) Umounting filesystems in oldroot (/mnt/newroot/var/run /mnt/newroot/var/lock /mnt/newroot/.cdsl.local)
osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing /sbin/init
[ OK ]
osr(notice) Umounting /mnt/newroot/var/lock[ OK ]
osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ]
osr(notice) Umounting oldroot /mnt/newroot[ OK ]
osr(notice) Breakpoint "halt_umountoldroot" detected forking a shell
bash: no job control in this shell
Type help to get more information..
Type exit to continue work..
-------------------------------------------------------------
comoonics 1 > cman_tool: unknown option cman_tool
comoonics 2 >
comoonics 2 >
Version: 6.2.0
Config Version: 1
Cluster Name: ProdCluster01
Cluster Id: 11454
Cluster Member: Yes
Cluster Generation: 4
Membership state: Cluster-Member
Nodes: 1
Expected votes: 4
Quorum device votes: 3
Total votes: 4
Node votes: 1
Quorum: 3
Active subsystems: 10
Flags:
Ports Bound: 0 11 178
Node name: smc01b
Node ID: 2
Multicast addresses: 229.192.0.2
Node addresses: 172.17.62.12
comoonics 3 >
fence domain
member count  1
victim count  0
victim now    0
master nodeid 2
wait state    none
members       2

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000000
change        member 1 joined 1 remove 0 failed 0 seq 1,1
members       2

comoonics 4 > bash: exitt: command not found
comoonics 5 > exit
osr(notice) Back to work..
Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active

It hung at the point above, so I re-ran with the suggested edit (set -x at line 207).

1 node:

[root@bwccs302 ~]# init 0
[root@bwccs302 ~
Can't connect to default. Skipping.
Shutting down Cluster Module - cluster monitor: [ OK ]
Shutting down ricci: [ OK ]
Shutting down Avahi daemon: [ OK ]
Shutting down oddjobd: [ OK ]
Stopping saslauthd: [ OK ]
Stopping sshd: [ OK ]
Shutting down sm-client: [ OK ]
Shutting down sendmail: [ OK ]
Stopping imsd via sshd: [ OK ]
Stopping snmpd: [ OK ]
Stopping crond: [ OK ]
Stopping HAL daemon: [ OK ]
Shutting down ntpd: [ OK ]
Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active
[ OK ]
Signaling clvmd to exit [ OK ]
clvmd terminated[ OK ]
Stopping lldpad: [ OK ]
Stopping system message bus: [ OK ]
Stopping multipathd daemon: [ OK ]
Stopping rpcbind: [ OK ]
Stopping auditd: [ OK ]
Stopping nslcd: [ OK ]
Shutting down system logger: [ OK ]
Stopping sssd: [ OK ]
Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ]
Stopping gfs2 dependent services Starting clvmd:
Activating VG(s): 1 logical volume(s) in volume group "vg_osroot" now active
2 logical volume(s) in volume group "VG_SDATA" now active
[ OK ]
osr(notice) ..bindmounts.. [ OK ]
Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume group "vg_osroot" unmonitored
[ OK ]
Sending all processes the TERM signal... [ OK ]
Sending all processes the KILL signal... [ OK ]
Saving random seed: [ OK ]
Syncing hardware clock to system time [ OK ]
Turning off quotas: quotaoff: Cannot change state of GFS2 quota.
quotaoff: Cannot change state of GFS2 quota.
[FAILED]
Unmounting file systems: [ OK ]
init: Re-executing /sbin/init
Halting system...
osr(notice) Scanning for Bootparameters...
osr(notice) Starting ATIX exitrd
osr(notice) Comoonics-Release
osr(notice) comoonics Community Release 5.0 (Gumpn)
osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 15:09:53 $
osr(notice) Preparing chrootcp: cannot stat `/mnt/newroot/dev/initctl': No such file or directory
[ OK ]
osr(notice) com-realhalt: detected distribution: rhel6, clutype: gfs, rootfs: gfs2
osr(notice) Restarting init process in chroot[ OK ]
osr(notice) Moving dev filesystem[ OK ]
osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys /mnt/newroot/proc)
osr(notice) Umounting /mnt/newroot/sys[ OK ]
osr(notice) Umounting /mnt/newroot/proc[ OK ]
osr(notice) Umounting filesystems in oldroot (/mnt/newroot/var/run /mnt/newroot/var/lock /mnt/newroot/.cdsl.local)
osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing /sbin/init
[ OK ]
osr(notice) Umounting /mnt/newroot/var/lock[ OK ]
osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ]
osr(notice) Umounting oldroot /mnt/newroot[ OK ]
+ clusterfs_services_stop '' '' 0
++ repository_get_value rootfs
+++ repository_normalize_value rootfs
++ local key=rootfs
++ local default=
++ local repository=
++ '[' -z '' ']'
++ repository=comoonics
++ local value=
++ '[' -f /var/cache/comoonics-repository/comoonics.rootfs ']'
+++ cat /var/cache/comoonics-repository/comoonics.rootfs
++ value=gfs2
++ echo gfs2
++ return 0
+ local rootfs=gfs2
+ gfs2_services_stop '' '' 0
+ local chroot_path=
+ local lock_method=
+ local lvm_sup=0
+ '[' -n 0 ']'
+ '[' 0 -eq 0 ']'
+ /etc/init.d/clvmd stop
Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active

With 2 nodes and quorate, when init 6 is issued:

[root@bwccs304 ~]# init 6
[root@bwccs304 ~
Can't connect to default. Skipping.
Shutting down Cluster Module - cluster monitor: [ OK ]
Shutting down ricci: [ OK ]
Shutting down Avahi daemon: [ OK ]
Shutting down oddjobd: [ OK ]
Stopping saslauthd: [ OK ]
Stopping sshd: [ OK ]
Shutting down sm-client: [ OK ]
Shutting down sendmail: [ OK ]
Stopping imsd via sshd: [ OK ]
Stopping snmpd: [ OK ]
Stopping crond: [ OK ]
Stopping HAL daemon: [ OK ]
Shutting down ntpd: [ OK ]
Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active
[ OK ]
Signaling clvmd to exit [ OK ]
clvmd terminated[ OK ]
Stopping lldpad: [ OK ]
Stopping system message bus: [ OK ]
Stopping multipathd daemon: [ OK ]
Stopping rpcbind: [ OK ]
Stopping auditd: [ OK ]
Stopping nslcd: [ OK ]
Shutting down system logger: [ OK ]
Stopping sssd: [ OK ]
Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ]
Stopping gfs2 dependent services Starting clvmd:
Activating VG(s): 1 logical volume(s) in volume group "vg_osroot" now active
2 logical volume(s) in volume group "VG_SDATA" now active
[ OK ]
osr(notice) ..bindmounts.. [ OK ]
Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume group "vg_osroot" unmonitored
[ OK ]
Sending all processes the TERM signal... [ OK ]
qdiskd[15713]: Unregistering quorum device.
Sending all processes the KILL signal...
dlm: clvmd: no userland control daemon, stopping lockspace
dlm: OSRoot: no userland control daemon, stopping lockspace
[ OK ]

It stops here and will not die... I still have full cluster communication.

Thanks
jorge
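(For reference, one way to confirm from a surviving node that the hung node is still an active cluster member is to check membership and watch for its multicast traffic. This is only a sketch; the interface name is an example and will differ per setup.)

# check membership and quorum as seen by a node that is still up
$ cman_tool status
$ cman_tool nodes
# watch for cluster multicast traffic coming from the hung node
$ tcpdump -ni eth0 udp and host 229.192.0.2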
On Tue, Nov 13, 2012 at 9:32 AM, Marc Grimme <gr...@at...> wrote:
> Hi Jorge,
> because of the "init 0".
> Please issue the following commands prior to init 0.
> # Make it a little more chatty
> $ com-chroot setparameter debug
> # Break before the cluster will be stopped
> $ com-chroot setparameter step halt_umountoldroot
>
> Then issue an init 0.
> This should lead you to a breakpoint during shutdown (hopefully, because
> sometimes the console gets confused).
> Inside the breakpoint issue:
> $ cman_tool status
> $ cman_tool services
> # Continue shutdown
> $ exit
> Then send me the output.
>
> If this fails also do as follows:
> $ com-chroot vi com-realhalt.sh
> # go to line 207 (before clusterfs_services_stop is called) and add a set -x
> $ init 0
>
> Send the output.
> Thanks Marc.
>
> ----- Original Message -----
> From: "Jorge Silva" <me...@je...>
> To: "Marc Grimme" <gr...@at...>
> Cc: ope...@li...
> Sent: Tuesday, November 13, 2012 3:22:37 PM
> Subject: Re: Problem with VG activation clvmd runs at 100%
>
> Marc
>
> Hi, thanks for the info, it helps. I have also noticed that gfs2 entries
> in the fstab get ignored on boot, so I have added them in rc.local. I have
> done a bit more digging into the issue I described below:
>
> "I am still a bit stuck when nodes with gfs2 mounted don't restart if
> instructed to do so, but I will read some more."
>
> If I issue an init 6 on a node it will restart. If I issue init 0, then
> I have the problem: the node starts to shut down, but stays in the
> cluster. I have to power it off, as it will not shut down. This is the log:
>
> [root@bwccs304 ~]# init 0
>
> Can't connect to default. Skipping.
> Shutting down Cluster Module - cluster monitor: [ OK ]
> Shutting down ricci: [ OK ]
> Shutting down oddjobd: [ OK ]
> Stopping saslauthd: [ OK ]
> Stopping sshd: [ OK ]
> Shutting down sm-client: [ OK ]
> Shutting down sendmail: [ OK ]
> Stopping imsd via sshd: [ OK ]
> Stopping snmpd: [ OK ]
> Stopping crond: [ OK ]
> Stopping HAL daemon: [ OK ]
> Stopping nscd: [ OK ]
> Shutting down ntpd: [ OK ]
> Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active
> [ OK ]
> Signaling clvmd to exit [ OK ]
> clvmd terminated[ OK ]
> Stopping lldpad: [ OK ]
> Stopping system message bus: [ OK ]
> Stopping multipathd daemon: [ OK ]
> Stopping rpcbind: [ OK ]
> Stopping auditd: [ OK ]
> Stopping nslcd: [ OK ]
> Shutting down system logger: [ OK ]
> Stopping sssd: [ OK ]
> Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ]
> Stopping gfs2 dependent services Starting clvmd:
> Activating VG(s): 2 logical volume(s) in volume group "VG_SDATA" now active
> 1 logical volume(s) in volume group "vg_osroot" now active
> [ OK ]
> osr(notice) ..bindmounts.. [ OK ]
> Stopping monitoring for VG VG_SDATA: 1 logical volume(s) in volume group "VG_SDATA" unmonitored
> [ OK ]
> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume group "vg_osroot" unmonitored
> [ OK ]
> Sending all processes the TERM signal... [ OK ]
> Sending all processes the KILL signal... [ OK ]
> Saving random seed: [ OK ]
> Syncing hardware clock to system time [ OK ]
> Turning off quotas: quotaoff: Cannot change state of GFS2 quota.
> quotaoff: Cannot change state of GFS2 quota.
> [FAILED]
> Unmounting file systems: [ OK ]
> init: Re-executing /sbin/init
> Halting system...
> osr(notice) Scanning for Bootparameters...
> osr(notice) Starting ATIX exitrd
> osr(notice) Comoonics-Release
> osr(notice) comoonics Community Release 5.0 (Gumpn)
> osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 15:09:53 $
> osr(notice) Preparing chrootcp: cannot stat `/mnt/newroot/dev/initctl': No such file or directory
> [ OK ]
> osr(notice) com-realhalt: detected distribution: rhel6, clutype: gfs, rootfs: gfs2
> osr(notice) Restarting init process in chroot[ OK ]
> osr(notice) Moving dev filesystem[ OK ]
> osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys /mnt/newroot/proc)
> osr(notice) Umounting /mnt/newroot/sys[ OK ]
> osr(notice) Umounting /mnt/newroot/proc[ OK ]
> osr(notice) Umounting filesystems in oldroot (/mnt/newroot/var/run /mnt/newroot/var/lock /mnt/newroot/.cdsl.local)
> osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing /sbin/init
> [ OK ]
> osr(notice) Umounting /mnt/newroot/var/lock[ OK ]
> osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ]
> osr(notice) Umounting oldroot /mnt/newroot[ OK ]
> Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active
>
>
> On Tue, Nov 13, 2012 at 2:43 AM, Marc Grimme <gr...@at...> wrote:
>
> Jorge,
> you don't need to worry about the fact that the volume group for the
> root file system is not flagged as clustered. This has no implications
> whatsoever for the gfs2 file system.
>
> It will only be a problem whenever the lvm settings of vg_osroot
> change (size, number of lvs, etc.).
>
> Nevertheless, while thinking about your problem I think I have an idea of
> how to fix it so that the root vg can be clustered as well. I will provide
> new packages in the next few days that should deal with the problem.
>
> Keep in mind that there is a difference between cman_tool services and the
> lvm usage:
> clvmd only uses the lockspace "clvmd" shown by cman_tool services; the
> other lockspaces are relevant to the file systems and other services
> (fenced, rgmanager, ..). This is a completely different use case.
>
> Try to elaborate a bit more on the statement
>
> "I am still a bit stuck when nodes with gfs2 mounted don't restart if
> instructed to do so, but I will read some more."
> What do you mean by it? How does this happen? This sounds like something
> you should have a look at.
>
> "One thing that I can confirm is
> osr(notice): Detecting nodeid & nodename
> This does not always display the correct info, but it doesn't seem to be a
> problem either?"
>
> You should always look at the nodeid; the nodename is (more or less) only
> descriptive and might not be set as expected. But the nodeid should always
> be consistent. Does this help?
>
> About your notes (I only take the relevant ones):
>
> 1. osr(notice): Creating clusterfiles /var/run/cman_admin /var/run/cman_client.. [OK]
> This message should not be misleading; it only says that these control
> files are being created inside the ramdisk. This has nothing to do with
> these files on your root file system. Nevertheless, /etc/init.d/bootsr
> should take over this part and create the files. Please send me another
> bash -x /etc/init.d/bootsr start
> output, specifically from when those files do not exist.
>
> 2. vgs
>
> VG        #PV #LV #SN Attr   VSize    VFree
> VG_SDATA    1   2   0 wz--nc 1000.00g    0
> vg_osroot   1   1   0 wz--n-   60.00g    0
>
> This is perfectly ok. It only means the vg is not clustered. But the
> filesystem IS. These do not have any connection.
>
> Hope this helps.
> Let me know about the open issues.
>
> Regards
>
> Marc.
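(For reference, the clustered flag Marc refers to is the sixth character of the VG attribute string and can be inspected and toggled with standard LVM tools. This is only a sketch using the volume group names from this thread; toggling the flag normally requires clvmd to be running and the cluster to be quorate.)

# a 'c' in the sixth attribute character means the VG is clustered
$ vgs -o vg_name,vg_attr
  VG        Attr
  VG_SDATA  wz--nc
  vg_osroot wz--n-
# set or clear the clustered flag
$ vgchange -cy vg_osroot
$ vgchange -cn vg_osroot
# if cluster locking is not available, the flag is commonly cleared with a locking override
$ vgchange -cn vg_osroot --config 'global { locking_type = 0 }'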
>
> ----- Original Message -----
> From: "Jorge Silva" <me...@je...>
> To: "Marc Grimme" <gr...@at...>
> Sent: Tuesday, November 13, 2012 2:15:23 AM
> Subject: Re: Problem with VG activation clvmd runs at 100%
>
> Marc
>
> Hi - I believe I have solved my problem, with your help, thank you. Yet,
> I'm not sure how I caused it - but the root volume group, as you pointed
> out, had the clustered attribute (and I must have done something silly
> along the way). I re-installed from scratch (see notes below) and then,
> just to prove that it is a problem, I changed the attribute of the rootfs
> with vgchange -cy and rebooted, and I ran into trouble. I changed it back
> and it is fine, so that does cause problems on start-up. I'm not sure I
> understand why, as there is an active quorum for clvmd to join and take
> part in.
>
> Despite it not being marked as a cluster volume, cman_tool services shows
> it as such, but clvmd status doesn't? Is it safe to write to it with
> multiple nodes mounted?
>
> I am still a bit stuck when nodes with gfs2 mounted don't restart if
> instructed to do so, but I will read some more.
>
> One thing that I can confirm is
> osr(notice): Detecting nodeid & nodename
>
> This does not always display the correct info, but it doesn't seem to be a
> problem either?
>
> Thanks
> Jorge
>
> Notes:
> I decided to start from scratch, so I blew away the rootfs and reinstalled
> as per the website. My assumption is that I had edited something and
> messed it up (I did look at a lot of the scripts to try to "figure out
> and fix" the problem; I can send the history if you want, or I can edit
> and contribute).
>
> I rebooted the server and I had an issue - I hadn't disabled selinux, so I
> had to intervene in the boot stage. That completed, but I noticed that:
>
> osr(notice): Starting network configuration for lo0 [OK]
> osr(notice): Detecting nodeid & nodename
>
> is blank, but somehow the correct nodeid and name were deduced.
>
> I had to rebuild the ram disk to fix the disabled selinux. I also added
> the following:
> yum install pciutils - the mkinitrd warned about this, so I installed it.
> I also installed:
> yum install cluster-snmp
> yum install rgmanager
> in lvm
>
> On this reboot I noticed that, despite this message
>
> osr(notice): Creating clusterfiles /var/run/cman_admin /var/run/cman_client.. [OK]
>
> Starting clvmd: dlm: Using TCP for communications
>
> Activating VG(s): File descriptor 3 (/dev/console) leaked on vgchange invocation. Parent PID 15995: /bin/bash
> File descriptor 4 (/dev/console) leaked on vgchange invocation. Parent PID 15995: /bin/bash
> Skipping clustered volume group VG_SDATA
> 1 logical volume(s) in volume group "vg_osroot" now active
>
> the links weren't created, and I created them manually:
>
> ln -sf /var/comoonics/chroot//var/run/cman_admin /var/run/cman_admin
> ln -sf /var/comoonics/chroot//var/run/cman_client /var/run/cman_client
>
> I could then get clusterstatus etc., and clvmd was running ok.
>
> I looked in /etc/lvm/lvm.conf and locking_type = 4 ?
>
> I then issued
>
> lvmconf --enable-cluster - and this changed /etc/lvm/lvm.conf to locking_type = 3.
>
> vgscan correctly showed the clustered volumes and was working ok.
>
> I did not rebuild the ramdisk (I can confirm that the lvm.conf in the
> ramdisk has locking_type=4). I have rebooted and everything is working.
>
> Starting clvmd: dlm: Using TCP for communications
>
> Activating VG(s): File descriptor 3 (/dev/console) leaked on vgchange invocation. Parent PID 15983: /bin/bash
> File descriptor 4 (/dev/console) leaked on vgchange invocation. Parent PID 15983: /bin/bash
> Skipping clustered volume group VG_SDATA
> 1 logical volume(s) in volume group "vg_osroot" now active
>
> I have rebooted a number of times and am confident that things are ok.
>
> I decided to add two other nodes to the mix, and I can confirm that
> every time a new node is added these files are missing:
>
> /var/run/cman_admin
> /var/run/cman_client
>
> But I can see from the logs:
>
> osr(notice): Creating clusterfiles /var/run/cman_admin /var/run/cman_client.. [OK]
>
> Despite the above message, also, the information below is not always
> detected, but still the nodeid etc. is correct...
>
> osr(notice): Detecting nodeid & nodename
>
> So now I have 3 nodes in the cluster and things look ok:
>
> [root@bwccs302 ~]# cman_tool services
> fence domain
> member count  3
> victim count  0
> victim now    0
> master nodeid 2
> wait state    none
> members       2 3 4
>
> dlm lockspaces
> name          home
> id            0xf8ee17aa
> flags         0x00000008 fs_reg
> change        member 3 joined 1 remove 0 failed 0 seq 3,3
> members       2 3 4
>
> name          clvmd
> id            0x4104eefa
> flags         0x00000000
> change        member 3 joined 1 remove 0 failed 0 seq 15,15
> members       2 3 4
>
> name          OSRoot
> id            0xab5404ad
> flags         0x00000008 fs_reg
> change        member 3 joined 1 remove 0 failed 0 seq 7,7
> members       2 3 4
>
> gfs mountgroups
> name          home
> id            0x686e3fc4
> flags         0x00000048 mounted
> change        member 3 joined 1 remove 0 failed 0 seq 3,3
> members       2 3 4
>
> name          OSRoot
> id            0x659f7afe
> flags         0x00000048 mounted
> change        member 3 joined 1 remove 0 failed 0 seq 7,7
> members       2 3 4
>
> service clvmd status
> clvmd (pid 25771) is running...
> Clustered Volume Groups: VG_SDATA
> Active clustered Logical Volumes: LV_HOME LV_DEVDB
>
> It doesn't believe that the root file system is clustered, despite the
> output from the above:
>
> [root@bwccs302 ~]# vgs
> VG        #PV #LV #SN Attr   VSize    VFree
> VG_SDATA    1   2   0 wz--nc 1000.00g    0
> vg_osroot   1   1   0 wz--n-   60.00g    0
>
> The above got me thinking about what you wanted me to do to disable the
> clustered flag on the root volume - with it left on I was having problems
> (not sure how it got turned on).
>
> With everything working ok, I remade the ramdisk and now lvm.conf has locking_type = 3.
>
> The systems start up and things look ok.
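(Since the thread keeps coming back to the locking_type in /etc/lvm/lvm.conf versus the copy inside the shared-root ramdisk, here is a rough way to compare the two. The initrd file name below is an assumption and will differ per installation; the image is assumed to be a gzipped cpio archive, as is usual on RHEL 6.)

# locking_type on the installed root file system
$ grep locking_type /etc/lvm/lvm.conf
# unpack the initrd into a scratch directory and check the copy inside it
$ mkdir /tmp/initrd-check && cd /tmp/initrd-check
$ zcat /boot/initrd_sr-$(uname -r).img | cpio -idm --quiet
$ grep locking_type etc/lvm/lvm.conf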