From: Marcio T. <mar...@nu...> - 2013-12-10 01:11:36
|
Hi, I am experimenting with an OpenSharedRoot 5.0 cluster running RHEL 6.3. I’ve noticed that sometimes when a node is booting, it fails to mount the cluster root and breaks into a root shell for troubleshooting. I would like to disable this. I have done some digging around in the root shell and I think I’ve determined where in the scripts the failure occurs, and it appears that the troubleshooting root shell is there by design. It also appears there might be ways to configure both the NFS timeout and retry values, to reduce the likelihood of the problem, or even to change the shell that gets executed. I imagine I can configure those things somewhere, but I do not know where. Before I just hack the scripts in the initrd, I thought I would ask here for recommendations.

Here is what the screen looks like just prior to the root shell:

Osr(notice): Starting service rcpbind
NET: Registered protocol family 10
lo: Disabled Privacy Extensions
[OK]
Osr(notice): Starting service rpc.statd [OK]
Osr(notice): Mounting ginseng-nfs:/export/cluster-root on /mnt/newroot…
Mount.nfs: Connection timed out
[FAILED]
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
;@node30:~[root@node30/]#

The problem is that I can now view all the local disks on the system as root, which is problematic in our environment. If I type “exit”, I get the following:

Osr(notice): Back to work..

So this led me to believe this behavior is by design. I started grepping for strings in the root shell and found a couple of promising scripts. I believe the error originally happens in “clusterfs_mount” in “/etc/clusterfs-lib.sh”; there appears to be a timeout value and a number of retries in there. I might be able to mitigate the problem if I could set the number of retries and the timeout to an extremely large value, so that the system would either eventually boot or simply appear to be hung. One solution would be to do that, but I don’t know which config file adjusts those values.

On to the second option. I found that the “Back to work” message is printed by “breakp” in /etc/boot-lib.sh. There is a $shell variable in there. If I could change that to either 1) something that prompts for a root password, or 2) something that hangs or reboots the system, that would be good too. Again, I do not know which config file adjusts those parameters.

In short, I would like to disable the troubleshooting shell. Ideally the system would just hang or prompt for the root password. In a similar vein, it would be nice to be able to disable the interactive boot option, namely the functionality behind:

Osr(notice): Press ‘I’ to enter interactive startup

In GRUB this feature can be disabled, and it would be nice to disable it in OSR as well. Any guidance on these questions would be very much appreciated.

Thank you,
— Marcio |
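A note on the second option above: the thread does not say whether the $shell variable in /etc/boot-lib.sh can be overridden from a config file, so the lines below are only a hypothetical sketch of what a safer replacement shell could look like. Pointing $shell at /sbin/sulogin would cover the "prompt for root password" variant; the script sketched here covers the "never expose local disks, just reboot" variant. The path /sbin/osr-noshell.sh and the 60-second delay are invented for the example.

#!/bin/sh
# /sbin/osr-noshell.sh (hypothetical): replacement for the troubleshooting shell.
# Instead of dropping to an unauthenticated bash, log the failure, wait a bit,
# then force a reboot so the local disks are never reachable from the console.
echo "osr(notice): cluster root mount failed; rebooting in 60 seconds" >&2
sleep 60
reboot -f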
From: Marc G. <gr...@at...> - 2013-08-01 06:39:40
|
Hi Marcio,

looks like you did the right thing, just a little bit clumsily. So this is how to make it better (at least in my eyes):

1. Only the file

If you just want to add a specific file:

echo /lib/modules/2.6.32-279.22.1.el6.centos.plus.x86_64/kernel/drivers/dca/dca.ko >> /etc/comoonics/bootimage/file.initrd.d/dca.list

Rebuild your initrd (with -C to rebuild the cache). The cache files are a bad place, as they will be cleared after each update of the comoonics tools.

2. The rpm

Nevertheless, I believe the module comes from a specific rpm, right? If so, you can also add that rpm to the list of rpms being included in the initrd. This works as follows:

echo <rpmname> >> /etc/comoonics/bootimage/rpm.initrd.d/dca.list

Rebuild your initrd (with -C to rebuild the cache).

Hope that makes things clearer.

Marc.

----- Original Message -----
From: "Marcio Teixeira" <mar...@nu...>
To: ope...@li...
Sent: Wednesday, July 31, 2013 11:37:35 PM
Subject: [OSR-users] Trying to add kernel module to initrd

Hello, I've recently added a new node to an RHEL6 OpenSharedRoot 5 cluster that is much more modern than the rest of the cluster. In particular, this node has a network adapter which is different from what the other nodes are using, and it would not boot using the initrd image of my cluster. I eventually found out the "igb" and "dca" modules were failing to load because of missing kernel module files. I needed a way to tell "mkinitrd" to include the correct files in the initrd image. I experimented with "mkinitrd -M igb -M dca", but this did not work. I also tried to preload the modules prior to running "mkinitrd", under the assumption that "mkinitrd" would package whatever modules were currently in use by the kernel (I'm not sure whether this assumption is correct or not). This did not help either. I eventually got it to work by manually injecting the full path of the missing module into the file cache and re-running "mkinitrd". E.g.:

mkinitrd /boot/initrd_sr-$(uname -r).img $(uname -r)
echo /lib/modules/2.6.32-279.22.1.el6.centos.plus.x86_64/kernel/drivers/dca/dca.ko >> /var/cache/comoonics-bootimage/file-list.txt
mkinitrd /boot/initrd_sr-$(uname -r).img $(uname -r)

While this works, it seems a bit clumsy. Can anyone offer some guidance on the correct way to tell "mkinitrd" to package the files for the "dca" module without having to run mkinitrd twice?

Thank you,
Marcio Teixeira |
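Put together, Marc's persistent approach would presumably look like the commands below. The dca.list file names are arbitrary, <rpmname> is whatever package rpm -qf reports for the module file, and passing -C to mkinitrd in exactly this position is inferred from the "rebuild with -C to rebuild the cache" remark rather than spelled out in the thread.

# file-based: list the module file under file.initrd.d
echo /lib/modules/$(uname -r)/kernel/drivers/dca/dca.ko >> /etc/comoonics/bootimage/file.initrd.d/dca.list

# or rpm-based: list the providing package under rpm.initrd.d
rpm -qf /lib/modules/$(uname -r)/kernel/drivers/dca/dca.ko   # find <rpmname>
echo <rpmname> >> /etc/comoonics/bootimage/rpm.initrd.d/dca.list

# rebuild the OSR initrd once, regenerating the cache
mkinitrd -C /boot/initrd_sr-$(uname -r).img $(uname -r)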
From: Marcio T. <mar...@nu...> - 2013-07-31 21:54:30
|
Hello, I've recently added a new node to an RHEL6 OpenSharedRoot 5 cluster that is much more modern than the rest of the cluster. In particular, this node has a network adapter which is different from what the other nodes are using, and it would not boot using the initrd image of my cluster. I eventually found out the "igb" and "dca" modules were failing to load because of missing kernel module files. I needed a way to tell "mkinitrd" to include the correct files in the initrd image. I experimented with "mkinitrd -M igb -M dca", but this did not work. I also tried to preload the modules prior to running "mkinitrd", under the assumption that "mkinitrd" would package whatever modules were currently in use by the kernel (I'm not sure whether this assumption is correct or not). This did not help either. I eventually got it to work by manually injecting the full path of the missing module into the file cache and re-running "mkinitrd". E.g.:

mkinitrd /boot/initrd_sr-$(uname -r).img $(uname -r)
echo /lib/modules/2.6.32-279.22.1.el6.centos.plus.x86_64/kernel/drivers/dca/dca.ko >> /var/cache/comoonics-bootimage/file-list.txt
mkinitrd /boot/initrd_sr-$(uname -r).img $(uname -r)

While this works, it seems a bit clumsy. Can anyone offer some guidance on the correct way to tell "mkinitrd" to package the files for the "dca" module without having to run mkinitrd twice?

Thank you,
Marcio Teixeira |
From: Marc G. <gr...@at...> - 2012-12-14 20:53:42
|
Hi Marcio,

the tool you're looking for is now called com-cdslinvadm, as it is used not only for creating the infrastructure but also for managing it. Hope that helps. Let me know if you have further questions.

Marc.

----- Original Message -----
From: "Marcio Teixeira" <mar...@nu...>
To: ope...@li...
Sent: Thursday, December 13, 2012 9:24:20 PM
Subject: [OSR-users] What package provides com-mkcdslinfrastructure in 5.0 release?

Hello, I am trying to rebuild an OpenSharedRoot cluster that was running Fedora 10 and an older version of OpenSharedRoot, this time under RHEL 6 using the new 5.0 release, and I've hit a stumbling block. I cannot find "com-mkcdslinfrastructure" in the new release. In my previous install, this was provided by "comoonics-cdsl-py-0.2-11.noarch", but the new "comoonics-cdsl-py-5.0-3_rhel6.noarch" does not provide it.

-- Marcio |
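For anyone verifying this on a 5.0 node, ordinary rpm queries should confirm the rename; the package name below is the one quoted in the message, and nothing here is OSR-specific tooling.

which com-cdslinvadm                          # confirm the renamed tool is installed
rpm -ql comoonics-cdsl-py | grep -i cdsl      # list the files shipped by the cdsl package
rpm -qf $(which com-cdslinvadm)               # map the binary back to its providing rpm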
From: Marcio T. <mar...@nu...> - 2012-12-13 21:26:03
|
Hello, I am trying to rebuild an OpenSharedRoot cluster that was running Fedora 10 and an older version of OpenSharedRoot, this time under RHEL 6 using the new 5.0 release, and I've hit a stumbling block. I cannot find "com-mkcdslinfrastructure" in the new release. In my previous install, this was provided by "comoonics-cdsl-py-0.2-11.noarch", but the new "comoonics-cdsl-py-5.0-3_rhel6.noarch" does not provide it. -- Marcio |
From: Marcio T. <mar...@nu...> - 2012-12-13 20:49:18
|
Hello, I am trying to rebuild an OpenSharedRoot cluster that was running Fedora 10 and an older version of OpenSharedRoot, this time under RHEL 6 using the new 5.0 release, and I've hit a stumbling block. I cannot find "com-mkcdslinfrastructure" in the new release. In my previous install, this was provided by "comoonics-cdsl-py-0.2-11.noarch", but the new "comoonics-cdsl-py-5.0-3_rhel6.noarch" does not provide it. -- Marcio |
From: Marc G. <gr...@at...> - 2012-11-20 21:57:25
|
Jorge, please send me the following information: /etc/cluster/cluster.conf (for the fencing configuration). When you started clvmd with -d option, it looks like locking_type=0. Are you sure that locking_type is 3 there? WARNING: Locking disabled. Be careful! This could corrupt your metadata. Another option would be to start clvmd with strace and send me the trace: strace -t -T -o /tmp/clvmd-strace.out clvmd Perhaps I can see some odd things from there. Regards Marc. Am 20.11.2012 14:35, schrieb Jorge Silva: > Marc > > Hi, I have confirmed that the locking_type=3 rebuilt initrd and > reboot, attatched is the boot log. clvmd -d : > > [root@bwccs302 ~]# clvmd -d > CLVMD[560ec7a0]: Nov 20 08:30:43 CLVMD started > CLVMD[560ec7a0]: Nov 20 08:30:43 Connected to CMAN > CLVMD[560ec7a0]: Nov 20 08:30:43 CMAN initialisation complete > CLVMD[560ec7a0]: Nov 20 08:30:43 Opened existing DLM lockspace for CLVMD. > CLVMD[560ec7a0]: Nov 20 08:30:43 DLM initialisation complete > CLVMD[560ec7a0]: Nov 20 08:30:43 Cluster ready, doing some more > initialisation > CLVMD[560ec7a0]: Nov 20 08:30:43 starting LVM thread > CLVMD[560eb700]: Nov 20 08:30:43 LVM thread function started > WARNING: Locking disabled. Be careful! This could corrupt your metadata. > CLVMD[560eb700]: Nov 20 08:30:43 Sub thread ready for work. > CLVMD[560ec7a0]: Nov 20 08:30:43 clvmd ready for work > CLVMD[560eb700]: Nov 20 08:30:43 LVM thread waiting for work > CLVMD[560ec7a0]: Nov 20 08:30:43 Using timeout of 60 seconds > > Output from top: > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 27261 root 20 0 101m 23m 3176 S 99.7 0.1 0:58.69 clvmd > > > ipmilan : > [root@bwccs302 ~]# com-chroot /sbin/fence_ipmilan -h > bash: /sbin/fence_ipmilan: No such file or directory > > I checked in /usr/sbin/fence_ipmilan > > [root@bwccs302 ~]# com-chroot /usr/sbin/fence_ipmilan -h > usage: fence_ipmilan <options> > -A <authtype> IPMI Lan Auth type (md5, password, or none) > -a <ipaddr> IPMI Lan IP to talk to > -i <ipaddr> IPMI Lan IP to talk to (deprecated, use -a) > -p <password> Password (if required) to control power on > IPMI device > -P Use Lanplus > -S <path> Script to retrieve password (if required) > -l <login> Username/Login (if required) to control power > on IPMI device > -L <privlvl> IPMI privilege level. Defaults to ADMINISTRATOR. > See ipmitool(1) for more info. > -o <op> Operation to perform. > Valid operations: on, off, reboot, status, > diag, list or monitor > -t <timeout> Timeout (sec) for IPMI operation (default 20) > -T <timeout> Wait X seconds after on/off operation > -f <timeout> Wait X seconds before fencing is started > -C <cipher> Ciphersuite to use (same as ipmitool -C parameter) > -M <method> Method to fence (onoff or cycle (default onoff) > -V Print version and exit > -v Verbose mode > > If no options are specified, the following options will be read > from standard input (one per line): > > auth=<auth> Same as -A > ipaddr=<#> Same as -a > passwd=<pass> Same as -p > passwd_script=<path> Same as -S > lanplus Same as -P > login=<login> Same as -l > option=<op> Same as -o > operation=<op> Same as -o > action=<op> Same as -o > delay=<seconds> Same as -f > timeout=<timeout> Same as -t > power_wait=<time> Same as -T > cipher=<cipher> Same as -C > method=<method> Same as -M > privlvl=<privlvl> Same as -L > verbose Same as -v > > On Tue, Nov 20, 2012 at 3:41 AM, Marc Grimme <gr...@at... > <mailto:gr...@at...>> wrote: > > Jorge, > let's first start with fencing. > You are using ipmilan for fencing. 
I didn't evaluate the agent > with rhel6. > So let's start fixing this issue. > Try the following: > com-chroot /sbin/fence_ipmilan -h > > Send me the output. There might some libs missing. > > The clvmd is very strange. Try to stay with locking_type=2 or > locking_type=3. > Then rebuild an initrd and reboot. > If clvmd stays with 100% CPU kill it and start it again manually > with -d flag. Send me the output. Perhaps we see something from there. > > Regards Marc. > Am 19.11.2012 15:39, schrieb Jorge Silva: >> Marc >> >> Hi, np, thanks for helping. The /var/run/cman* are there. I will >> disable the clustered flag on the second volume. Even more >> disturbing is after the last email i sent you I went from a state >> where clvmd was behaving normally (not 100%), I could access >> clustered volumes. I rebooted to verify the that everything was >> functioning - but I am now back to the state where clvmd is >> running at 100% - back to where we started (can't access >> clustered volumes). >> >> locking-type=0 >> [root@bwccs302 ~]# vgs >> WARNING: Locking disabled. Be careful! This could corrupt your >> metadata. >> VG #PV #LV #SN Attr VSize VFree >> VG_DATA1 1 2 0 wz--n- 64.00g 4.20g >> vg_osroot 1 1 0 wz--n- 60.00g 0 >> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> 23207 root 2 -18 101m 23m 3176 S *99.9* 0.1 0:05.82 clvmd >> >> lrwxrwxrwx 1 root root 41 Nov 16 16:41 /var/run/cman_admin -> >> /var/comoonics/chroot//var/run/cman_admin >> lrwxrwxrwx 1 root root 42 Nov 16 16:41 /var/run/cman_client -> >> /var/comoonics/chroot//var/run/cman_client >> >> locking_type=3 >> [root@bwccs302 ~]# service clvmd restart >> Restarting clvmd: [ OK ] >> [root@bwccs302 ~]# vgs >> cluster request failed: Invalid argument >> Can't get lock for VG_DATA1 >> cluster request failed: Invalid argument >> Can't get lock for vg_osroot >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> 23829 root 2 -18 167m 24m 3268 S *99.8* 0.1 0:31.29 clvm >> >> >> >> As far as the shutdown - with the two nodes up once I issue the >> shutdown on node1, the shutdown proceeds to the point where I >> sent the screenshots (deactivating cluster services) - on node2, >> I notice - >> >> corosync[16648]: [TOTEM ] A processor failed, forming new >> configuration. It attempts to fence >> >> node1. After 3 unsuccessful attempts, it locks up. Node1 stays >> stuck (screen dump I sent) I do a tcp dump and I see the two >> nodes are still sending multicast messages and until I reset >> node1, node2 will stay in a locked state with no access... this >> is the last set of messages I see : >> >> fenced[16784]: fence smc01b dev 0.0 agent fence_ipmilan result: >> error from agent >> fenced[16784]: fence smc01b failed >> >> After 3 attempts as fencing failed the cluster locks up till I >> have reset the node. >> I suspect there is another issue at play here as I can manually >> fence a nodes using fence_node x ( I will continue to dig into >> this I have tried fenced -q or messagebus with the same result) >> >> Thanks >> Jorge >> >> >> On Mon, Nov 19, 2012 at 3:01 AM, Marc Grimme <gr...@at... >> <mailto:gr...@at...>> wrote: >> >> Hi Jorge, >> sorry for the delay but I was quite busy on the last days. >> Nevertheless I'm don't understand the problem. >> Let's first start at the point I think could lead to problems >> during shutdown and friends. >> Are the control files in /var/run/cman* being created from >> the bootsr initscript or do you still have to create them >> manually. 
>> If they are not created I would still be very interested in >> the output of >> bash -x /etc/init.d/bootsr start >> after a node has been started. >> >> If not we need to dig deeper into the problems during shutdown. >> I would then also change the clustered flag for the other >> volume group. >> Again as long as you don't change the size it wont hurt. >> And it's only for better understanding the problem. >> >> Another command I'd like to see is a cman_tool services on >> the other node (say node 2) while the shutdown node is being >> stuck (say node 1). >> >> Thanks Marc. >> Am 15.11.2012 19:08, schrieb Jorge Silva: >>> Marc >>> >>> Hi, I believe the problem is related to the clsuter services >>> not shutting down. init 0, will not work with 1 or more >>> nodes, init 6 will only work when 1 node is present. When >>> more than 1 node is present the node with the init 6 will >>> have to be fenced as it will not shut down. I believe the >>> cluster components aren't shutting down (this also happens >>> with init 6 when more than one node is present) - I still >>> see cluster traffic on the network, this is periodic. >>> >>> 12:42:00.547615 IP 172.17.62.12.hpoms-dps-lstn > >>> 229.192.0.2.netsupport: UDP, length 119 >>> >>> At the point that the system will not shut down, it still is >>> a cluster member and there is still cluster traffic. >>> >>> 1 node : >>> [root@bwccs302 ~]# init 0 >>> >>> Can't connect to default. Skipping. >>> Shutting down Cluster Module - cluster monitor: [ OK ] >>> Shutting down ricci: [ OK ] >>> Shutting down Avahi daemon: [ OK ] >>> Shutting down oddjobd: [ OK ] >>> Stopping saslauthd: [ OK ] >>> Stopping sshd: [ OK ] >>> Shutting down sm-client: [ OK ] >>> Shutting down sendmail: [ OK ] >>> Stopping imsd via sshd: [ OK ] >>> Stopping snmpd: [ OK ] >>> Stopping crond: [ OK ] >>> Stopping HAL daemon: [ OK ] >>> Shutting down ntpd: [ OK ] >>> Deactivating clustered VG(s): 0 logical volume(s) in >>> volume group "VG_SDATA" now active >>> [ OK ] >>> Signaling clvmd to exit [ OK ] >>> clvmd terminated[ OK ] >>> Stopping lldpad: [ OK ] >>> Stopping system message bus: [ OK ] >>> Stopping multipathd daemon: [ OK ] >>> Stopping rpcbind: [ OK ] >>> Stopping auditd: [ OK ] >>> Stopping nslcd: [ OK ] >>> Shutting down system logger: [ OK ] >>> Stopping sssd: [ OK ] >>> Stopping gfs dependent services osr(notice) ..bindmounts.. [ >>> OK ] >>> Stopping gfs2 dependent services Starting clvmd: >>> Activating VG(s): 2 logical volume(s) in volume group >>> "VG_SDATA" now active >>> 1 logical volume(s) in volume group "vg_osroot" now active >>> [ OK ] >>> osr(notice) ..bindmounts.. [ OK ] >>> Stopping monitoring for VG vg_osroot: 1 logical volume(s) >>> in volume group "vg_osroot" unmonitored >>> [ OK ] >>> Sending all processes the TERM signal... [ OK ] >>> Sending all processes the KILL signal... [ OK ] >>> Saving random seed: [ OK ] >>> Syncing hardware clock to system time [ OK ] >>> Turning off quotas: quotaoff: Cannot change state of GFS2 >>> quota. >>> quotaoff: Cannot change state of GFS2 quota. >>> [FAILED] >>> Unmounting file systems: [ OK ] >>> init: Re-executing /sbin/init >>> Halting system... >>> osr(notice) Scanning for Bootparameters... 
>>> osr(notice) Starting ATIX exitrd >>> osr(notice) Comoonics-Release >>> osr(notice) comoonics Community Release 5.0 (Gumpn) >>> osr(notice) Internal Version $Revision: 1.18 $ $Date: >>> 2011-02-11 15:09:53 $ >>> osr(debug) Calling cmd /sbin/halt -d -p >>> osr(notice) Preparing chrootcp: cannot stat >>> `/mnt/newroot/dev/initctl': No such file or directory >>> [ OK ] >>> osr(notice) com-realhalt: detected distribution: rhel6, >>> clutype: gfs, rootfs: gfs2 >>> osr(notice) Restarting init process in chroot[ OK ] >>> osr(notice) Moving dev filesystem[ OK ] >>> osr(notice) Umounting filesystems in oldroot ( >>> /mnt/newroot/sys /mnt/newroot/proc) >>> osr(notice) Umounting /mnt/newroot/sys[ OK ] >>> osr(notice) Umounting /mnt/newroot/proc[ OK ] >>> osr(notice) Umounting filesystems in oldroot >>> (/mnt/newroot/var/run /mnt/newroot/var/lock >>> /mnt/newroot/.cdsl.local) >>> osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing >>> /sbin/init >>> [ OK ] >>> osr(notice) Umounting /mnt/newroot/var/lock[ OK ] >>> osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] >>> osr(notice) Umounting oldroot /mnt/newroot[ OK ] >>> osr(notice) Breakpoint "halt_umountoldroot" detected forking >>> a shell >>> bash: no job control in this shell >>> >>> Type help to get more information.. >>> Type exit to continue work.. >>> ------------------------------------------------------------- >>> >>> comoonics 1 > cman_tool: unknown option cman_tool >>> comoonics 2 > comoonics 2 > Version: 6.2.0 >>> Config Version: 1 >>> Cluster Name: ProdCluster01 >>> Cluster Id: 11454 >>> Cluster Member: Yes >>> Cluster Generation: 4 >>> Membership state: Cluster-Member >>> Nodes: 1 >>> Expected votes: 4 >>> Quorum device votes: 3 >>> Total votes: 4 >>> Node votes: 1 >>> Quorum: 3 >>> Active subsystems: 10 >>> Flags: >>> Ports Bound: 0 11 178 >>> Node name: smc01b >>> Node ID: 2 >>> Multicast addresses: 229.192.0.2 >>> Node addresses: 172.17.62.12 >>> comoonics 3 > fence domain >>> member count 1 >>> victim count 0 >>> victim now 0 >>> master nodeid 2 >>> wait state none >>> members 2 >>> >>> dlm lockspaces >>> name clvmd >>> id 0x4104eefa >>> flags 0x00000000 >>> change member 1 joined 1 remove 0 failed 0 seq 1,1 >>> members 2 >>> >>> comoonics 4 > bash: exitt: command not found >>> comoonics 5 > exit >>> osr(notice) Back to work.. >>> Deactivating clustered VG(s): 0 logical volume(s) in >>> volume group "VG_SDATA" now active >>> >>> It hung at the point above - so I re-ran with the edit set >>> -x in line 207. >>> 1 -node: >>> [root@bwccs302 ~]# init 0 >>> [root@bwccs302 ~ >>> Can't connect to default. Skipping. 
>>> Shutting down Cluster Module - cluster monitor: [ OK ] >>> Shutting down ricci: [ OK ] >>> Shutting down Avahi daemon: [ OK ] >>> Shutting down oddjobd: [ OK ] >>> Stopping saslauthd: [ OK ] >>> Stopping sshd: [ OK ] >>> Shutting down sm-client: [ OK ] >>> Shutting down sendmail: [ OK ] >>> Stopping imsd via sshd: [ OK ] >>> Stopping snmpd: [ OK ] >>> Stopping crond: [ OK ] >>> Stopping HAL daemon: [ OK ] >>> Shutting down ntpd: [ OK ] >>> Deactivating clustered VG(s): 0 logical volume(s) in >>> volume group "VG_SDATA" n ow active >>> [ OK ] >>> Signaling clvmd to exit [ OK ] >>> clvmd terminated[ OK ] >>> Stopping lldpad: [ OK ] >>> Stopping system message bus: [ OK ] >>> Stopping multipathd daemon: [ OK ] >>> Stopping rpcbind: [ OK ] >>> Stopping auditd: [ OK ] >>> Stopping nslcd: [ OK ] >>> Shutting down system logger: [ OK ] >>> Stopping sssd: [ OK ] >>> Stopping gfs dependent services osr(notice) ..bindmounts.. [ >>> OK ] >>> Stopping gfs2 dependent services Starting clvmd: >>> Activating VG(s): 1 logical volume(s) in volume group >>> "vg_osroot" now active >>> 2 logical volume(s) in volume group "VG_SDATA" now active >>> [ OK ] >>> osr(notice) ..bindmounts.. [ OK ] >>> Stopping monitoring for VG vg_osroot: 1 logical volume(s) >>> in volume group "vg_ osroot" unmonitored >>> [ OK ] >>> Sending all processes the TERM signal... [ OK ] >>> Sending all processes the KILL signal... [ OK ] >>> Saving random seed: [ OK ] >>> Syncing hardware clock to system time [ OK ] >>> Turning off quotas: quotaoff: Cannot change state of GFS2 >>> quota. >>> quotaoff: Cannot change state of GFS2 quota. >>> [FAILED] >>> Unmounting file systems: [ OK ] >>> init: Re-executing /sbin/init >>> Halting system... >>> osr(notice) Scanning for Bootparameters... >>> osr(notice) Starting ATIX exitrd >>> osr(notice) Comoonics-Release >>> osr(notice) comoonics Community Release 5.0 (Gumpn) >>> osr(notice) Internal Version $Revision: 1.18 $ $Date: >>> 2011-02-11 15:09:53 $ >>> osr(notice) Preparing chrootcp: cannot stat >>> `/mnt/newroot/dev/initctl': No such file or directory [ OK ] >>> osr(notice) com-realhalt: detected distribution: rhel6, >>> clutype: gfs, rootfs: gfs2 >>> osr(notice) Restarting init process in chroot[ OK ] >>> osr(notice) Moving dev filesystem[ OK ] >>> osr(notice) Umounting filesystems in oldroot ( >>> /mnt/newroot/sys /mnt/newroot/proc) >>> osr(notice) Umounting /mnt/newroot/sys[ OK ] >>> osr(notice) Umounting /mnt/newroot/proc[ OK ] >>> osr(notice) Umounting filesystems in oldroot >>> (/mnt/newroot/var/run /mnt/newroot/var/lock >>> /mnt/newroot/.cdsl.local) >>> osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing >>> /sbin/init [ OK ] >>> osr(notice) Umounting /mnt/newroot/var/lock[ OK ] >>> osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] >>> osr(notice) Umounting oldroot /mnt/newroot[ OK ] >>> + clusterfs_services_stop '' '' 0 >>> ++ repository_get_value rootfs >>> +++ repository_normalize_value rootfs >>> ++ local key=rootfs >>> ++ local default= >>> ++ local repository= >>> ++ '[' -z '' ']' >>> ++ repository=comoonics >>> ++ local value= >>> ++ '[' -f /var/cache/comoonics-repository/comoonics.rootfs ']' >>> +++ cat /var/cache/comoonics-repository/comoonics.rootfs >>> ++ value=gfs2 >>> ++ echo gfs2 >>> ++ return 0 >>> + local rootfs=gfs2 >>> + gfs2_services_stop '' '' 0 >>> + local chroot_path= >>> + local lock_method= >>> + local lvm_sup=0 >>> + '[' -n 0 ']' >>> + '[' 0 -eq 0 ']' >>> + /etc/init.d/clvmd stop >>> Deactivating clustered VG(s): 0 logical volume(s) in >>> volume 
group "VG_SDATA" now active >>> >>> with 2 nodes + quorate when init 6 is issued: >>> >>> [root@bwccs304 ~]# init 6 >>> [root@bwccs304 ~ >>> Can't connect to default. Skipping. >>> Shutting down Cluster Module - cluster monitor: [ OK ] >>> Shutting down ricci: [ OK ] >>> Shutting down Avahi daemon: [ OK ] >>> Shutting down oddjobd: [ OK ] >>> Stopping saslauthd: [ OK ] >>> Stopping sshd: [ OK ] >>> Shutting down sm-client: [ OK ] >>> Shutting down sendmail: [ OK ] >>> Stopping imsd via sshd: [ OK ] >>> Stopping snmpd: [ OK ] >>> Stopping crond: [ OK ] >>> Stopping HAL daemon: [ OK ] >>> Shutting down ntpd: [ OK ] >>> Deactivating clustered VG(s): 0 logical volume(s) in >>> volume group "VG_SDATA" now active >>> [ OK ] >>> Signaling clvmd to exit [ OK ] >>> clvmd terminated[ OK ] >>> Stopping lldpad: [ OK ] >>> Stopping system message bus: [ OK ] >>> Stopping multipathd daemon: [ OK ] >>> Stopping rpcbind: [ OK ] >>> Stopping auditd: [ OK ] >>> Stopping nslcd: [ OK ] >>> Shutting down system logger: [ OK ] >>> Stopping sssd: [ OK ] >>> Stopping gfs dependent services osr(notice) ..bindmounts.. [ >>> OK ] >>> Stopping gfs2 dependent services Starting clvmd: >>> Activating VG(s): 1 logical volume(s) in volume group >>> "vg_osroot" now active >>> 2 logical volume(s) in volume group "VG_SDATA" now active >>> [ OK ] >>> osr(notice) ..bindmounts.. [ OK ] >>> Stopping monitoring for VG vg_osroot: 1 logical volume(s) >>> in volume group "vg_osroot" unmonitored >>> [ OK ] >>> Sending all processes the TERM signal... [ OK ] >>> qdiskd[15713]: Unregistering quorum device. >>> >>> Sending all processes the KILL signal... dlm: clvmd: no >>> userland control daemon, stopping lockspace >>> dlm: OSRoot: no userland control daemon, stopping lockspace >>> [ OK ] >>> - stops here and will not die... Still have full cluster coms >>> >>> Thanks >>> jorge >>> >>> On Tue, Nov 13, 2012 at 9:32 AM, Marc Grimme <gr...@at... >>> <mailto:gr...@at...>> wrote: >>> >>> Hi Jorge, >>> because of the "init 0". >>> Please issue the following commands prior to init 0. >>> # Make it a little more chatty >>> $ com-chroot setparameter debug >>> # Break after before cluster will be stopped >>> $ com-chroot setparameter step halt_umountoldroot >>> >>> Then issue a init 0. >>> This should lead you to a breakpoint during shutdown >>> (hopefully, cause sometimes the console gets confused). >>> In side the breakpoint issue: >>> $ cman_tool status >>> $ cman_tool services >>> # Continue shutdown >>> $ exit >>> Then send me the output. >>> >>> If this fails also do as follows: >>> $ com-chroot vi com-realhalt.sh >>> # go to line 207 (before clusterfs_services_stop) is >>> called and add a set -x >>> $ init 0 >>> >>> Send the output. >>> Thanks Marc. >>> >>> ----- Original Message ----- >>> From: "Jorge Silva" <me...@je... >>> <mailto:me...@je...>> >>> To: "Marc Grimme" <gr...@at... <mailto:gr...@at...>> >>> Cc: ope...@li... >>> <mailto:ope...@li...> >>> Sent: Tuesday, November 13, 2012 3:22:37 PM >>> Subject: Re: Problem with VG activation clvmd runs at 100% >>> >>> Marc >>> >>> >>> Hi, thanks for the info, it helps. I have also noticed >>> that gfs2 entries in the fstab get ignored on boot, I >>> have added in rc.local. I have done a bit more digging >>> and the issue I described below: >>> >>> >>> "I am still a bit stuck when nodes with gfs2 mounted >>> don't restart if instructed to do so, but I will read >>> some more." >>> >>> >>> If I issue a init 6 on a nodes they will restart. 
If I >>> issue init 0, then I have the problem the node start to >>> shut down, but will stay in the cluster. I have to shut >>> it off, it will not shut down, this is the log. >>> >>> >>> >>> [root@bwccs304 ~]# init 0 >>> >>> >>> Can't connect to default. Skipping. >>> Shutting down Cluster Module - cluster monitor: [ OK ] >>> Shutting down ricci: [ OK ] >>> Shutting down oddjobd: [ OK ] >>> Stopping saslauthd: [ OK ] >>> Stopping sshd: [ OK ] >>> Shutting down sm-client: [ OK ] >>> Shutting down sendmail: [ OK ] >>> Stopping imsd via sshd: [ OK ] >>> Stopping snmpd: [ OK ] >>> Stopping crond: [ OK ] >>> Stopping HAL daemon: [ OK ] >>> Stopping nscd: [ OK ] >>> Shutting down ntpd: [ OK ] >>> Deactivating clustered VG(s): 0 logical volume(s) in >>> volume group "VG_SDATA" now active >>> [ OK ] >>> Signaling clvmd to exit [ OK ] >>> clvmd terminated[ OK ] >>> Stopping lldpad: [ OK ] >>> Stopping system message bus: [ OK ] >>> Stopping multipathd daemon: [ OK ] >>> Stopping rpcbind: [ OK ] >>> Stopping auditd: [ OK ] >>> Stopping nslcd: [ OK ] >>> Shutting down system logger: [ OK ] >>> Stopping sssd: [ OK ] >>> Stopping gfs dependent services osr(notice) >>> ..bindmounts.. [ OK ] >>> Stopping gfs2 dependent services Starting clvmd: >>> Activating VG(s): 2 logical volume(s) in volume group >>> "VG_SDATA" now active >>> 1 logical volume(s) in volume group "vg_osroot" now active >>> [ OK ] >>> osr(notice) ..bindmounts.. [ OK ] >>> Stopping monitoring for VG VG_SDATA: 1 logical volume(s) >>> in volume group "VG_SDATA" unmonitored >>> [ OK ] >>> Stopping monitoring for VG vg_osroot: 1 logical >>> volume(s) in volume group "vg_osroot" unmonitored >>> [ OK ] >>> Sending all processes the TERM signal... [ OK ] >>> Sending all processes the KILL signal... [ OK ] >>> Saving random seed: [ OK ] >>> Syncing hardware clock to system time [ OK ] >>> Turning off quotas: quotaoff: Cannot change state of >>> GFS2 quota. >>> quotaoff: Cannot change state of GFS2 quota. >>> [FAILED] >>> Unmounting file systems: [ OK ] >>> init: Re-executing /sbin/init >>> Halting system... >>> osr(notice) Scanning for Bootparameters... >>> osr(notice) Starting ATIX exitrd >>> osr(notice) Comoonics-Release >>> osr(notice) comoonics Community Release 5.0 (Gumpn) >>> osr(notice) Internal Version $Revision: 1.18 $ $Date: >>> 2011-02-11 15:09:53 $ >>> osr(notice) Preparing chrootcp: cannot stat >>> `/mnt/newroot/dev/initctl': No such file or directory >>> [ OK ] >>> osr(notice) com-realhalt: detected distribution: rhel6, >>> clutype: gfs, rootfs: gfs2 >>> osr(notice) Restarting init process in chroot[ OK ] >>> osr(notice) Moving dev filesystem[ OK ] >>> osr(notice) Umounting filesystems in oldroot ( >>> /mnt/newroot/sys /mnt/newroot/proc) >>> osr(notice) Umounting /mnt/newroot/sys[ OK ] >>> osr(notice) Umounting /mnt/newroot/proc[ OK ] >>> osr(notice) Umounting filesystems in oldroot >>> (/mnt/newroot/var/run /mnt/newroot/var/lock >>> /mnt/newroot/.cdsl.local) >>> osr(notice) Umounting /mnt/newroot/var/runinit: >>> Re-executing /sbin/init >>> [ OK ] >>> osr(notice) Umounting /mnt/newroot/var/lock[ OK ] >>> osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] >>> osr(notice) Umounting oldroot /mnt/newroot[ OK ] >>> Deactivating clustered VG(s): 0 logical volume(s) in >>> volume group "VG_SDATA" now active >>> >>> >>> >>> >>> >>> On Tue, Nov 13, 2012 at 2:43 AM, Marc Grimme < >>> gr...@at... 
<mailto:gr...@at...> > wrote: >>> >>> >>> Jorge, >>> you don't need to be doubtful about the fact that the >>> volume group for the root file system is not flagged as >>> clustered. This has no implications whatsoever on the >>> gfs2 file system. >>> >>> It will only be a problem whenever the lvm settings of >>> the vg_osroot change (size, number of lvs etc.). >>> >>> Nevertheless while thinking about your problem I think I >>> had the idea on how to fix this problem on being able to >>> have the root vg clustered also. I will provide new >>> packages in the next days that should deal with the problem. >>> >>> Keep in mind that there is a difference between >>> cman_tool services and the lvm usage. >>> clvmd only uses the locktable clvmd shown by cman_tool >>> services and the other locktables are relevant to the >>> file systems and other services (fenced, rgmanager, ..). >>> This is a complete different use case. >>> >>> Try to elaborate a bit more on the fact >>> >>> "I am still a bit stuck when nodes with gfs2 mounted >>> don't restart if instructed to do so, but I will read >>> some more." >>> What do you mean with it? How does this happen? This >>> sounds like something you should have a look at. >>> >>> >>> "Once thing that I can confirm is >>> osr(notice): Detecting nodeid & nodename >>> This does not always display the correct info, but it >>> doesn't seem to be a problem either ?" >>> >>> You should always look at the nodeid the nodename is >>> (more or less) only descriptive and might not be set as >>> expected. But the nodeid should always be consistent. >>> Does this help? >>> >>> About your notes (I only take the relevant ones): >>> >>> 1. osr(notice): Creating clusterfiles >>> /var/run/cman_admin /var/run/cman_client.. [OK] >>> This message should not be misleading but only tells the >>> these control files are being created inside the >>> ramdisk. This has nothing to do with these files on your >>> root file system. Nevertheless /etc/init.d/bootsr should >>> take over this part and create the files. Please send me >>> another >>> bash -x /etc/init.d/bootsr start >>> output. Please when those files are not existant. >>> >>> 2. vgs >>> >>> VG #PV #LV #SN Attr VSize VFree >>> VG_SDATA 1 2 0 wz--nc 1000.00g 0 >>> vg_osroot 1 1 0 wz--n- 60.00g 0 >>> >>> This is perfectly ok. This only means the vg is not >>> clustered. But the filesystem IS. This does not have any >>> connection. >>> >>> Hope this helps. >>> Let me know about the open issues. >>> >>> Regards >>> >>> Marc. >>> >>> >>> ----- Original Message ----- >>> From: "Jorge Silva" < me...@je... >>> <mailto:me...@je...> > >>> To: "Marc Grimme" < gr...@at... <mailto:gr...@at...> > >>> >>> Sent: Tuesday, November 13, 2012 2:15:23 AM >>> Subject: Re: Problem with VG activation clvmd runs at 100% >>> >>> >>> Marc >>> >>> >>> Hi - I believe I have solved my problem, with your help, >>> thank you. Yet, I'm not sure how I caused it - but the >>> root volume group as you pointed out had the clustered >>> attribute(and I had to have done something silly along >>> the way). I re-installed from scratch see notes below >>> and then just to prove that is a problem, I changed the >>> attribute of the rootfs- vgchange -cy and rebooted and I >>> ran into trouble, I changed it back and it is fine so >>> that does cause problems on start-up, I'm not sure I >>> understand why as there is an active quorum for the clvm >>> to join and take part.. 
>>> >>> >>> Despite it not being marked as a cluster volume >>> cman_tool services show it as being, but clvmd status >>> doesn't ? Is it safe to write to it with multiple nodes >>> mounted? >>> >>> >>> I am still a bit stuck when nodes with gfs2 mounted >>> don't restart if instructed to do so, but I will read >>> some more. >>> >>> >>> >>> >>> Once thing that I can confirm is >>> osr(notice): Detecting nodeid & nodename >>> >>> >>> This does not always display the correct info, but it >>> doesn't seem to be a problem either ? >>> >>> >>> >>> >>> Thanks >>> Jorge >>> >>> >>> Notes: >>> I decided to start from scratch and I blew away the >>> rootfs and started from scratch as per the website. My >>> assumption - that I edited something and messed it up (I >>> did look at a lot of the scripts to try to "figure out >>> and fix" the problem, I can send the history if you want >>> or I can edit and contribute). >>> >>> >>> I rebooted the server and I had an issue - I didn't >>> disable selinux so I had to intervene in the boot stage. >>> That completed, but I noticed that : >>> >>> >>> >>> osr(notice): Starting network configuration for lo0 [OK] >>> osr(notice): Detecting nodeid & nodename >>> >>> >>> Is blank, but somehow the correct nodeid and name was >>> deduced. >>> >>> >>> I had to rebuild the ram disk to fix the selinux >>> disabled. I also added the following >>> >>> yum install pciutils - the mkinitrd warned about this >>> so, I installed it. >>> I also installed : >>> yum install cluster-snmp >>> yum install rgmanager >>> in lvm >>> >>> >>> On this reboot I noticed that despite this message >>> >>> sr(notice): Creating clusterfiles /var/run/cman_admin >>> /var/run/cman_client.. [OK] >>> >>> >>> Starting clvmd: dlm: Using TCP for communications >>> >>> >>> Activating VG(s): File descriptor 3 (/dev/console) >>> leaked on vgchange invocation. Parent PID 15995: /bin/bash >>> File descriptor 4 (/dev/console) leaked on vgchange >>> invocation. Parent PID 15995: /bin/bash >>> Skipping clustered volume group VG_SDATA >>> 1 logical volume(s) in volume group "vg_osroot" now active >>> >>> >>> the links weren't created and I did this manually >>> >>> >>> >>> ln -sf /var/comoonics/chroot//var/run/cman_admin >>> /var/run/cman_admin >>> ln -sf /var/comoonics/chroot//var/run/cman_client >>> /var/run/cman_client >>> >>> >>> I could then get clusterstatus etc, and clvmd was running ok >>> >>> >>> I looked in /etc/lvm/lvm.conf and locking_type = 4 ? >>> >>> >>> I then issued >>> >>> >>> lvmconf --enable cluster - and this changed >>> /etc/lvm/lvm.conf locking_type = 3. >>> >>> >>> vgscan correctly showed up clusterd volumes and was >>> working ok. >>> >>> >>> >>> >>> I did not rebuild the ramdisk (I can confirm that the >>> lvm .conf in the ramdisk has locking_type=4) I have >>> rebooted and everything is working. >>> >>> Starting clvmd: dlm: Using TCP for communications >>> >>> >>> Activating VG(s): File descriptor 3 (/dev/console) >>> leaked on vgchange invocation. Parent PID 15983: /bin/bash >>> File descriptor 4 (/dev/console) leaked on vgchange >>> invocation. 
Parent PID 15983: /bin/bash >>> Skipping clustered volume group VG_SDATA >>> 1 logical volume(s) in volume group "vg_osroot" now active >>> >>> >>> >>> >>> >>> >>> I have rebooted a number of times and am confident that >>> things are ok, >>> >>> >>> I decided to add two other nodes to the mix and I can >>> confirm that everytime a new node is added these files >>> are missing : >>> >>> >>> /var/run/cman_admin >>> /var/run/cman_client >>> But I can see from the logs: >>> >>> >>> >>> osr(notice): Creating clusterfiles /var/run/cman_admin >>> /var/run/cman_client.. [OK] >>> >>> >>> despite the above message, also, the information below >>> is not always detected, but still the nodeid etc is >>> correct... >>> >>> >>> osr(notice): Detecting nodeid & nodename >>> >>> >>> >>> >>> So now I have 3 nodes in the cluster and things look ok: >>> >>> >>> >>> [root@bwccs302 ~]# cman_tool services >>> fence domain >>> member count 3 >>> victim count 0 >>> victim now 0 >>> master nodeid 2 >>> wait state none >>> members 2 3 4 >>> >>> >>> dlm lockspaces >>> name home >>> id 0xf8ee17aa >>> flags 0x00000008 fs_reg >>> change member 3 joined 1 remove 0 failed 0 seq 3,3 >>> members 2 3 4 >>> >>> >>> name clvmd >>> id 0x4104eefa >>> flags 0x00000000 >>> change member 3 joined 1 remove 0 failed 0 seq 15,15 >>> members 2 3 4 >>> >>> >>> name OSRoot >>> id 0xab5404ad >>> flags 0x00000008 fs_reg >>> change member 3 joined 1 remove 0 failed 0 seq 7,7 >>> members 2 3 4 >>> >>> >>> gfs mountgroups >>> name home >>> id 0x686e3fc4 >>> flags 0x00000048 mounted >>> change member 3 joined 1 remove 0 failed 0 seq 3,3 >>> members 2 3 4 >>> >>> >>> name OSRoot >>> id 0x659f7afe >>> flags 0x00000048 mounted >>> change member 3 joined 1 remove 0 failed 0 seq 7,7 >>> members 2 3 4 >>> >>> >>> >>> service clvmd status >>> clvmd (pid 25771) is running... >>> Clustered Volume Groups: VG_SDATA >>> Active clustered Logical Volumes: LV_HOME LV_DEVDB >>> >>> >>> it doesn't believe that the root file-system is >>> clustered despite the output from the above. >>> >>> >>> >>> [root@bwccs302 ~]# vgs >>> VG #PV #LV #SN Attr VSize VFree >>> VG_SDATA 1 2 0 wz--nc 1000.00g 0 >>> vg_osroot 1 1 0 wz--n- 60.00g 0 >>> >>> >>> The above got me thinking on what you wanted me to do to >>> diable the clusterd flag on the root volume - with it >>> left on I was having problems (not sure how it got >>> turned) on. >>> >>> >>> With everything working ok, I remade ramdisk and now >>> lvm.conf=3.. >>> >>> >>> The systems start up and things look ok. >>> >>> >> >> |
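One recurring point in the thread above is that the locking_type clvmd sees inside the initrd can differ from the one on the shared root (Jorge found locking_type=4 in the ramdisk while the root filesystem had 3). A quick way to compare and realign the two is sketched below; it assumes com-chroot can run an arbitrary command the way Marc's fence_ipmilan example does, that the boot chroot carries its own /etc/lvm/lvm.conf, and that mkinitrd takes -C as described in the earlier messages.

grep -E '^[[:space:]]*locking_type' /etc/lvm/lvm.conf              # shared root
com-chroot grep -E '^[[:space:]]*locking_type' /etc/lvm/lvm.conf   # boot chroot (assumption)

lvmconf --enable-cluster                                           # sets locking_type=3 on the root fs
mkinitrd -C /boot/initrd_sr-$(uname -r).img $(uname -r)            # rebuild so both copies agree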
From: Jorge S. <me...@je...> - 2012-11-20 14:00:51
|
osr(notice) Mounosr(notice): Creating devices ting dev [OK] cess group service v1.01 _ _____ _____ __ abase access v1.01 / \|_ _|_ _\ \/ / rvice / _ \ | | | | \ / B.01.01 / ___ \| | | | / \ rvice 2.90 /_/ \_\_| |___/_/\_\ orum service v0.1 ADVANCED TECHNOLOGY FOR INDIVIDUAL SUCCESS Engine exiting with statu Open Shared Root Cluster www.open-sharedroot.org www.atix.de Welcome to CentOS osr(notice): Starting udev daemon.. osr(notice): Starting ATIX initrd osr(notice): Comoonics-Release osr(notice): comoonics Community Release 5.0 (Gumpn) osr(notice): Internal Version $Revision: 1.113 $ $Date: 2011-02-28 09:02:11 $ osr(notice): Builddate: Tue Nov 20 08:26:37 EST 2012 *********************************** Welcome to comooniosr(notice): Mounting Proc-FScs Community Release 5.0 (Gumpn) Startin [PASSED] g Open Shared Roosr(notice): Mounting Sys-FSot Boot Process Date: Tue Nov 20 08:26:37 EST 2012 *********************************** osr(notice): Mounting dev [PASSED] osr(notice): Creating devices [OK] osr(notice): Kernel-version: 2.6.32-279.14.1.el6.x86_64 osr(notice): Scanning for Bootparameters... osr(notice): Loading USB Modules.. [OK] osr(notice): Press 'I' to enter interactive startup. osr(notice): osr(notice): Validating cluster configuration. [OK] osr(notice): Detecting Hardware osr(notice): Removing loaded modules Buffer I/O error on device sdd, logical block 0 Buffer I/O error on device sde, logical block 0 Buffer I/O error on device sdf, logical block 0 Buffer I/O error on device sdj, logical block 0 Buffer I/O error on device sdk, logical block 0 Buffer I/O error on device sdl, logical block 0 Buffer I/O error on device sde, logical block 0 Buffer I/O error on device sdd, logical block 0 Buffer I/O error on device sdf, logical block 0 Buffer I/O error on device sdk, logical block 0 end_request: I/O error, dev sdj, sector 0 end_request: I/O error, dev sdl, sector 0 end_request: I/O error, dev sdd, sector 0 end_request: I/O error, dev sde, sector 0 end_request: I/O error, dev sdf, sector 0 end_request: I/O error, dev sdj, sector 0 end_request: I/O error, dev sdl, sector 0 end_request: I/O error, dev sdk, sector 0 end_request: I/O error, dev sdd, sector 0 end_request: I/O error, dev sde, sector 0 end_request: I/O error, dev sdf, sector 0 end_request: I/O error, dev sdj, sector 0 end_request: I/O error, dev sdk, sector 0 end_request: I/O error, dev sdl, sector 0 end_request: I/O error, dev sde, sector 0 end_request: I/O error, dev sdd, sector 0 end_request: I/O error, dev sdf, sector 0 end_request: I/O error, dev sdk, sector 0 end_request: I/O error, dev sdj, sector 0 end_request: I/O error, dev sdl, sector 0 end_request: I/O error, dev sde, sector 0 end_request: I/O error, dev sdd, sector 0 end_request: I/O error, dev sdf, sector 0 end_request: I/O error, dev sdj, sector 0 end_request: I/O error, dev sdl, sector 0 end_request: I/O error, dev sdk, sector 0 end_request: I/O error, dev sdd, sector 0 end_request: I/O error, dev sde, sector 0 end_request: I/O error, dev sdf, sector 0 end_request: I/O error, dev sdl, sector 0 end_request: I/O error, dev sdk, sector 0 end_request: I/O error, dev sdj, sector 0 osr(notice): Starting network configuration for lo0 end_request: I/O error, dev sdd, sector 0 end_request: I/O error, dev sde, sector 0 end_request: I/O error, dev sdf, sector 0 end_request: I/O error, dev sdl, sector 0 end_request: I/O error, dev sdk, sector 0 end_request: I/O error, dev sdj, sector 0 osr(notice): Detecting nodeid & nodename nodeid: 2, nodename: smc01b [OK] end_request: I/O 
error, dev sdd, sector 0 end_request: I/O error, dev sde, sector 0 end_request: I/O error, dev sdf, sector 0 end_request: I/O error, dev sdl, sector 0 end_request: I/O error, dev sdj, sector 0 end_request: I/O error, dev sdk, sector 0 osr(notice): Scanning other parameters [OK] osr(notice): Loading USB Modules.. [OK] osr(notice): Loading sysctl.. [OK] osr(notice): Creating network configuration for eth0eth0: error fetching interface information: Device not found eth0: error fetching interface information: Device not found [OK] osr(notice): Creating network configuration for eth1eth1: error fetching interface information: Device not found eth1: error fetching interface information: Device not found [OK] osr(notice): Creating network configuration for bond0bond0: error fetching interface information: Device not found bond0: error fetching interface information: Device not found [OK] osr(notice): Creating network configuration for bond0.1762bond0.1762: error fetching interface information: Device not found bond0.1762: error fetching interface information: Device not found [OK] osr(notice): Powering up eth0.. [OK] osr(notice): Powering up eth1.. [OK] osr(notice): Powering up bond0.. bonding: cannot add bond bond0; already exists udevd-work[3739]: error changing netif name 'eth1' to 'eth0': Device or resource busy osr(notice): Powering up bond0.1762.. osr(warn): Could not detect syslog type either no syslog installed in initrd or no syslog bootimage package installed. osr(notice): Loading device mapper modules osr(notice): Starting scsi... osr(notice): Loading scsi_disk Module... [OK] osr(notice): Loading sg.ko module [OK] udevd-work[12907]: error changing netif name 'eth1' to 'eth0': Device or resource busy osr(notice): Loading dm-mutltipath.ko module [OK] osr(notice): Setting up Multipathcreate: osbootp (3600a0b800068636e00001990509b723e) undef IBM,1726-4xx FAStT size=1.0G features='1 queue_if_no_path' hwhandler='1 rdac' wp=undef |-+- policy='round-robin 0' prio=6 status=undef | |- 0:0:0:0 sda 8:0 undef ready running | `- 1:0:0:0 sdg 8:96 undef ready running `-+- policy='round-robin 0' prio=1 status=undef |- 0:0:1:0 sdd 8:48 undef ghost running `- 1:0:1:0 s [OK] dj 8:144 undef ghost running create: osrootp (3600a0b800068636e0000199a509b8818) undef IBM,1726-4xx FAStT size=60G features='1 queue_if_no_path' hwhandler='1 rdac' wp=undef |-+- policy='round-robin 0' prio=6 status=undef | |- 0:0:0:1 sdb 8:16 undef ready running | `- 1:0:0:1 sdh 8:112 undef ready running `-+- policy='round-robin 0' prio=1 status=undef |- 0:0:1:1 sde 8:64 undef ghost running `- 1:0:1:1 sdk 8:160 undef ghost running create: mpathd (3600a0b800068636e000003464b8002c5) undef IBM,1726-4xx FAStT size=60G features='1 queue_if_no_path' hwhandler='1 rdac' wp=undef |-+- policy='round-robin 0' prio=6 status=undef | |- 0:0:0:4 sdc 8:32 undef ready running | `- 1:0:0:4 sdi 8:128 undef ready running `-+- policy='round-robin 0' prio=1 status=undef |- 0:0:1:4 sdf 8:80 undef ghost running `- 1:0:1:4 sdl 8:176 undef ghost running osr(notice): Restarting udev udevd-work[13666]: error changing netif name 'eth1' to 'eth0': Device or resource busy [OK] osr(notice): Scanning for volume groups [OK] osr(notice): Making device nodes [OK] osr(notice): Activating volume group vg_osroot [OK] osr(notice): Building comoonics chroot environment./linuxrc.generic.sh Creating chroot environment.. ./linuxrc.generic.sh falling back to default values.. 
[OK] osr(notice): Checking for all nodes to be availableSkipping because cluster ProdCluster01 has less or more the two nodes. [OK] osr(notice): Setting clock : Tue Nov 20 08:27:32 EST 2012 [OK] Starting cluster: Checking if cluster has been disabled at boot... [ OK ] Checking Network Manager... [ OK ] Global setup... [ OK ] Loading kernel modules... [ OK ] Mounting configfs... [ OK ] Starting cluster: Checking if cluster has been disabled at boot... [ OK ] Checking Network Manager... [ OK ] Global setup... [ OK ] Loading kernel modules... [ OK ] Mounting configfs... [ OK ] Starting cman... corosync[15535]: [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service. corosync[15535]: [MAIN ] Corosync built-in features: nss dbus rdma snmp corosync[15535]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf corosync[15535]: [MAIN ] Successfully parsed cman config corosync[15535]: [TOTEM ] Initializing transport (UDP/IP Multicast). corosync[15535]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0). corosync[15535]: [TOTEM ] The network interface [172.17.62.12] is now up. corosync[15535]: [QUORUM] Using quorum provider quorum_cman corosync[15535]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 corosync[15535]: [CMAN ] CMAN 3.0.12.1 (built Aug 21 2012 21:33:38) started corosync[15535]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90 corosync[15535]: [SERV ] Service engine loaded: openais checkpoint service B.01.01 corosync[15535]: [SERV ] Service engine loaded: corosync extended virtual synchrony service corosync[15535]: [SERV ] Service engine loaded: corosync configuration service corosync[15535]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 corosync[15535]: [SERV ] Service engine loaded: corosync cluster config database access v1.01 corosync[15535]: [SERV ] Service engine loaded: corosync profile loading service corosync[15535]: [QUORUM] Using quorum provider quorum_cman corosync[15535]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 corosync[15535]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine. corosync[15535]: [TOTEM ] A processor joined or left the membership and a new membership was formed. corosync[15535]: [QUORUM] Members[1]: 2 corosync[15535]: [QUORUM] Members[1]: 2 corosync[15535]: [CPG ] chosen downlist: sender r(0) ip(172.17.62.12) ; members(old:0 left:0) corosync[15535]: [MAIN ] Completed service synchronization, ready to provide service. [ OK ] Starting qdiskd... qdiskd[15588]: Quorum Partition: /dev/block/253:4 Label: ProdCluster01 qdiskd[15588]: Quorum Daemon Initializing qdiskd[15588]: Heuristic: '/bin/ping 172.17.62.1 -c1 -t1' UP qdiskd[15588]: Heuristic: '/bin/ping 172.17.62.10 -c1 -t1' UP qdiskd[15588]: Heuristic: 'ip addr | grep bond0 | grep -q UP' UP corosync[15535]: [CMAN ] quorum device registered qdiskd[15588]: Initial score 3/3 qdiskd[15588]: Initialization complete qdiskd[15588]: Score sufficient for master operation (3/3; required=2); upgrading qdiskd[15588]: Assuming master role corosync[15535]: [CMAN ] quorum regained, resuming activity corosync[15535]: [QUORUM] This node is within the primary component and will provide service. corosync[15535]: [QUORUM] Members[1]: 2 [ OK ] Waiting for quorum... [ OK ] Starting fenced... [ OK ] Starting dlm_controld... 
fenced[15750]: fenced 3.0.12.1 started dlm_controld[15766]: dlm_controld 3.0.12.1 started [ OK ] Starting gfs_controld... [ OK ] gfs_controld[15831]: gfs_controld 3.0.12.1 started Unfencing self... [ OK ] Joining fence domain... [ OK ] osr(notice): Creating clusterfiles /var/run/cman_admin /var/run/cman_client.. [OK] Starting clvmd: dlm: Using TCP for communications Activating VG(s): File descriptor 3 (/dev/console) leaked on vgchange invocation. Parent PID 15991: /bin/bash File descriptor 4 (/dev/console) leaked on vgchange invocation. Parent PID 15991: /bin/bash 2 logical volume(s) in volume group "VG_DATA1" now active 1 logical volume(s) in volume group "vg_osroot" now active [ OK ] osr(notice): Mounting /dev/vg_osroot/lv_root on /mnt/newroot... osr(notice): Mounting /mnt/newroot/.cdsl.local on /mnt/newroot/.cluster/cdsl/2.. [OK] osr(notice): Mounting /mnt/newroot/.cluster/cdsl/2/var/run on /mnt/newroot/var/run... [OK] osr(notice): Mounting /mnt/newroot/.cluster/cdsl/2/var/lock on /mnt/newroot/var/lock... [OK] osr(notice): Moving chroot environment /var/comoonics/chroot to /mnt/newroot [OK] osr(notice): Writing information ... [OK] osr(notice): Writing information to /dev/.initramfs ... [OK] osr(notice): Writing xtab.. [OK] osr(notice): Writing xrootfs.. [OK] osr(notice): Writing xkillall_procs.. [OK] osr(notice): Mounting the device file system [OK] osr(notice): Cleaning up initrd ... [OK] osr(notice): osr(notice): Restart services in newroot ... osr(notice): Restarting scsi services in newroot /mnt/newroot osr(notice): Creating clusterfiles /var/run/cman_admin /var/run/cman_client.. [OK] Signaling clvmd to exit [ OK ] clvmd terminated[ OK ] osr(notice): Umounting /mnt/newroot/proc [OK] osr(warn): Could not detect syslog type either no syslog installed in initrd or no syslog bootimage package installed. osr(notice): Setting postsettings in initrd.. osr(notice): Starting init-process (/bin/bash)... Returned from linuxrc.generic.sh. osr(notice): ********************************************************************** osr(notice): comoonics generic switchroot osr(notice): newroot=/mnt/newroot chrootneeded=0 osr(notice): Mounting filesystem /proc in /mnt/newroot.. [OK] osr(notice): Mounting filesystem /sys in /mnt/newroot.. [OK] osr(notice): Umounting /sys [OK] osr(notice): Umounting /proc/bus/usb [OK] osr(notice): Umounting /proc [OK] osr(notice): Cleaning up... 
warning: can't open /etc/fstab: No such file or directory Welcome to CentOS Starting udev: udevd-work[22597]: error changing netif name 'eth1' to 'eth5': Device or resource busy udevd-work[22596]: error changing netif name 'eth0' to 'eth5': Device or resource busy [ OK ] Setting hostname bwccs302.blackwatchcc.com: [ OK ] + '[' -x /sbin/lvm ']' + action 'Setting up Logical Volume Management:' /sbin/lvm vgchange -a y --sysinit + local STRING rc + STRING='Setting up Logical Volume Management:' + echo -n 'Setting up Logical Volume Management: ' Setting up Logical Volume Management: + shift + /sbin/lvm vgchange -a y --sysinit 2 logical volume(s) in volume group "VG_DATA1" now active 1 logical volume(s) in volume group "vg_osroot" now active + success 'Setting up Logical Volume Management:' + '[' serial '!=' verbose -a -z '' ']' + echo_success + '[' serial = color ']' + echo -n '[' [+ '[' serial = color ']' + echo -n ' OK ' OK + '[' serial = color ']' + echo -n ']' ]+ echo -ne '\r' + return 0 + return 0 + rc=0 + echo + return 0 + '[' -f /etc/crypttab ']' + init_crypto 0 + local have_random dst src key opt mode owner params makeswap skip arg opt + local param value rc ret mke2fs mdir prompt mount_point + ret=0 + have_random=0 + read dst src key opt + return 0 + '[' -f /fastboot ']' + strstr 'ro root=/dev/vg_osroot/lv_root quiet console=tty0 console=ttyS1,115200n8' fastboot + '[' 'ro root=/dev/vg_osroot/lv_root quiet console=tty0 console=ttyS1,115200n8' = 'ro root=/dev/vg_osroot/lv_root quiet console=tty0 console=ttyS1,115200n8' ']' + return 1 + '[' -f /fsckoptions ']' + '[' -f /forcefsck ']' + strstr 'ro root=/dev/vg_osroot/lv_root quiet console=tty0 console=ttyS1,115200n8' forcefsck + '[' 'ro root=/dev/vg_osroot/lv_root quiet console=tty0 console=ttyS1,115200n8' = 'ro root=/dev/vg_osroot/lv_root quiet console=tty0 console=ttyS1,115200n8' ']' + return 1 + '[' -f /.autofsck ']' + '[' serial = color ']' + fsckoptions='-V ' + READONLY= + '[' -f /etc/sysconfig/readonly-root ']' + . /etc/sysconfig/readonly-root ++ READONLY=no ++ TEMPORARY_STATE=no ++ RW_MOUNT=/var/lib/stateless/writable ++ RW_LABEL=stateless-rw ++ RW_OPTIONS= ++ STATE_LABEL=stateless-state ++ STATE_MOUNT=/var/lib/stateless/state ++ STATE_OPTIONS= + strstr 'ro root=/dev/vg_osroot/lv_root quiet console=tty0 console=ttyS1,115200n8' readonlyroot + '[' 'ro root=/dev/vg_osroot/lv_root quiet console=tty0 console=ttyS1,115200n8' = 'ro root=/dev/vg_osroot/lv_root quiet console=tty0 console=ttyS1,115200n8' ']' + return 1 + strstr 'ro root=/dev/vg_osroot/lv_root quiet console=tty0 console=ttyS1,115200n8' noreadonlyroot + '[' 'ro root=/dev/vg_osroot/lv_root quiet console=tty0 console=ttyS1,115200n8' = 'ro root=/dev/vg_osroot/lv_root quiet console=tty0 console=ttyS1,115200n8' ']' + return 1 + '[' no = yes -o no = yes ']' + [[ -V != *\ \-\y* ]] + fsckoptions='-a -V ' + _RUN_QUOTACHECK=0 + '[' -f /forcequotacheck ']' + strstr 'ro root=/dev/vg_osroot/lv_root quiet console=tty0 console=ttyS1,115200n8' forcequotacheck + '[' 'ro root=/dev/vg_osroot/lv_root quiet console=tty0 console=ttyS1,115200n8' = 'ro root=/dev/vg_osroot/lv_root quiet console=tty0 console=ttyS1,115200n8' ']' + return 1 + '[' -z '' -a no '!=' yes ']' + STRING='Checking filesystems' + echo Checking filesystems Checking filesystems + fsck -T -t noopts=_netdev -A -a -V Checking all file systems. 
[/sbin/fsck.ext4 (1) -- /boot] fsck.ext4 -a /dev/mapper/osbootpp1 /dev/mapper/osbootpp1: clean, 56/51744 files, 61630/206829 blocks + rc=0 + '[' 0 -eq 0 ']' + success 'Checking filesystems' + '[' serial '!=' verbose -a -z '' ']' + echo_success + '[' serial = color ']' + echo -n '[' [+ '[' serial = color ']' + echo -n ' OK ' OK + '[' serial = color ']' + echo -n ']' ]+ echo -ne '\r' + return 0 + return 0 + echo + '[' 0 -gt 1 ']' + '[' 0 -eq 1 ']' + update_boot_stage RCmountfs + '[' -x /bin/plymouth ']' + /bin/plymouth --update=RCmountfs + return 0 + remount_needed + local state oldifs + '[' no = yes ']' ++ LC_ALL=C ++ awk '/ \/ / && ($3 !~ /rootfs/) { print $4 }' /proc/mounts + state=rw,noatime,hostdata=jid=0,localflocks + oldifs=' ' + IFS=, + for opt in '$state' + '[' rw = rw ']' + IFS=' ' + return 1 + '[' -n '' ']' + '[' -n '' -a no '!=' yes ']' + '[' no '!=' yes ']' + rm -f /etc/mtab~ /etc/mtab~~ + mount -f / + mount -f /proc + mount -f /sys + mount -f /dev/pts + mount -f /dev/shm + mount -f /proc/bus/usb + '[' no '!=' yes ']' + action 'Mounting local filesystems: ' mount -a -t nonfs,nfs4,smbfs,ncpfs,cifs,gfs,gfs2 -O no_netdev + local STRING rc + STRING='Mounting local filesystems: ' + echo -n 'Mounting local filesystems: ' Mounting local filesystems: + shift + mount -a -t nonfs,nfs4,smbfs,ncpfs,cifs,gfs,gfs2 -O no_netdev mount: devpts already mounted or /dev/pts busy mount: according to mtab, /dev/pts is already mounted on /dev/pts + failure 'Mounting local filesystems: ' + local rc=96 + '[' serial '!=' verbose -a -z '' ']' + echo_failure + '[' serial = color ']' + echo -n '[' [+ '[' serial = color ']' + echo -n FAILED FAILED+ '[' serial = color ']' + echo -n ']' ]+ echo -ne '\r' + return 1 + '[' -x /bin/plymouth ']' + /bin/plymouth --details + return 96 + rc=96 + echo + return 96 + '[' X0 = X1 -a -x /sbin/quotacheck ']' + '[' -x /sbin/quotaon ']' + action 'Enabling local filesystem quotas: ' /sbin/quotaon -aug + local STRING rc + STRING='Enabling local filesystem quotas: ' + echo -n 'Enabling local filesystem quotas: ' Enabling local filesystem quotas: + shift + /sbin/quotaon -aug quotaon: Cannot change state of GFS2 quota. quotaon: Cannot change state of GFS2 quota. + failure 'Enabling local filesystem quotas: ' + local rc=2 + '[' serial '!=' verbose -a -z '' ']' + echo_failure + '[' serial = color ']' + echo -n '[' [+ '[' serial = color ']' + echo -n FAILED FAILED+ '[' serial = color ']' + echo -n ']' ]+ echo -ne '\r' + return 1 + '[' -x /bin/plymouth ']' + /bin/plymouth --details + return 2 + rc=2 + echo + return 2 + '[' -n '' -a no '!=' yes ']' + '[' -d /etc/selinux -a no '!=' yes ']' + '[' -f /.autorelabel ']' + '[' -f /var/lib/random-seed ']' + cat /var/lib/random-seed + '[' no '!=' yes ']' + chmod 600 /var/lib/random-seed + dd if=/dev/urandom of=/var/lib/random-seed count=1 bs=512 + '[' -f /etc/crypttab ']' + init_crypto 1 + local have_random dst src key opt mode owner params makeswap skip arg opt + local param value rc ret mke2fs mdir prompt mount_point + ret=0 + have_random=1 + read dst src key opt + return 0 + '[' -f /.unconfigured ']' + rm -f /fastboot /fsckoptions /forcefsck /.autofsck /forcequotacheck /halt /poweroff /.suspended + _NEED_XFILES= + '[' -f /var/run/utmpx -o -f /var/log/wtmpx ']' + rm -rf '/var/lock/cvs/*' '/var/run/screen/*' + find /var/lock /var/run '!' 
-type d -exec rm -f '{}' ';' + rm -f /var/lib/rpm/__db.001 /var/lib/rpm/__db.002 /var/lib/rpm/__db.003 /var/lib/rpm/__db.004 + rm -f /var/gdm/.gdmfifo + '[' yes '!=' no ']' + touch /var/log/wtmp + plymouth watch-keystroke --command 'touch /var/run/confirm' --keys=Ii + chgrp utmp /var/run/utmp /var/log/wtmp + chmod 0664 /var/run/utmp /var/log/wtmp + '[' -n '' ']' + '[' -n '' ']' + '[' -n '' ']' + rm -f '/tmp/.X*-lock' '/tmp/.lock.*' /tmp/.gdm_socket '/tmp/.s.PGSQL.*' + rm -rf '/tmp/.X*-unix' /tmp/.ICE-unix /tmp/.font-unix '/tmp/hsperfdata_*' '/tmp/kde-*' '/tmp/ksocket-*' '/tmp/mc-*' '/tmp/mcop-*' '/tmp/orbit-*' '/tmp/scrollkeeper-*' '/tmp/ssh-*' /dev/.in_sysinit + mkdir -m 1777 -p /tmp/.ICE-unix + chown root:root /tmp/.ICE-unix + '[' -n '' ']' + update_boot_stage RCswap + '[' -x /bin/plymouth ']' + /bin/plymouth --update=RCswap + return 0 + action 'Enabling /etc/fstab swaps: ' swapon -a -e + local STRING rc + STRING='Enabling /etc/fstab swaps: ' + echo -n 'Enabling /etc/fstab swaps: ' Enabling /etc/fstab swaps: + shift + swapon -a -e + success 'Enabling /etc/fstab swaps: ' + '[' serial '!=' verbose -a -z '' ']' + echo_success + '[' serial = color ']' + echo -n '[' [+ '[' serial = color ']' + echo -n ' OK ' OK + '[' serial = color ']' + echo -n ']' ]+ echo -ne '\r' + return 0 + return 0 + rc=0 + echo + return 0 + '[' no = yes ']' + /bin/mount -t binfmt_misc none /proc/sys/fs/binfmt_misc + '[' -x /usr/sbin/system-config-network-cmd ']' + '[' -f /var/log/dmesg ']' + mv -f /var/log/dmesg /var/log/dmesg.old + dmesg -s 131072 + touch /.autofsck + '[' yes '!=' no ']' + plymouth --ignore-keystroke=Ii + strstr 'ro root=/dev/vg_osroot/lv_root quiet console=tty0 console=ttyS1,115200n8' confirm + '[' 'ro root=/dev/vg_osroot/lv_root quiet console=tty0 console=ttyS1,115200n8' = 'ro root=/dev/vg_osroot/lv_root quiet console=tty0 console=ttyS1,115200n8' ']' + return 1 + '[' -x /bin/plymouth ']' + /bin/plymouth --sysinit init: rcS main process (22533) terminated with status 1 Entering non-interactive startup Starting monitoring for VG VG_DATA1: 2 logical volume(s) in volume group "VG_DATA1" monitored [ OK ] Starting monitoring for VG vg_osroot: 1 logical volume(s) in volume group "vg_osroot" monitored [ OK ] Starting cgconfig service: [ OK ] Starting multipathd daemon: [ OK ] Bringing up loopback interface: [ OK ] Bringing up interface br0: [ OK ] Bringing up interface bond0.1710: [ OK ] Bringing up interface bond0.1720: [ OK ] Bringing up interface bond0.1730: [ OK ] Bringing up interface bond0.1761: [ OK ] Starting auditd: [ OK ] updateing bootsr.. [ OK ] Mounting filesystems to chroot subdirs [ OK ] Updating chroot environment [ OK ] Starting gfs dependent services osr(notice) ..socketfiles.. [ OK ] Starting gfs2 dependent services osr(notice) ..socketfiles.. osr(notice) ..bindmounts.. [ OK ] Starting portreserve: (not starting, no services registered) Starting system logger: [ OK ] Starting rpcbind: [ OK ] [ OK ] sssd: [ OK ] Starting lldpad: [ OK ] Starting system message bus: [ OK ] Starting Avahi daemon... 
[ OK ] Starting clvmd: Activating VG(s): cluster request failed: Invalid argument Can't get lock for VG_DATA1 cluster request failed: Invalid argument Can't get lock for vg_osroot [FAILED] Starting imsd via sshd: [ OK ] Starting nslcd: [ OK ] Mounting other filesystems: mount: devpts already mounted or /dev/pts busy mount: according to mtab, /dev/pts is already mounted on /dev/pts [FAILED] Starting HAL daemon: [ OK ] Retrigger failed udev events[ OK ] Server address not specified in /etc/sysconfig/netconsole Starting snmpd: [ OK ] Starting sshd: [ OK ] Starting ntpd: [ OK ] Starting sendmail: [ OK ] Starting sm-client: [ OK ] Starting ksm: [ OK ] Starting ksmtuned: [ OK ] Starting crond: [ OK ] Starting jexec servicesStarting libvirtd daemon: [ OK ] Starting oddjobd: [ OK ] Starting Cluster Module - cluster monitor: Setting verbosity level to LogBasic [ OK ] Starting ricci: [ OK ] CentOS release 6.3 (Final) Kernel 2.6.32-279.14.1.el6.x86_64 on an x86_64 bwccs302.blackwatchcc.com login: root Password: Last login: Mon Nov 19 13:08:19 on ttyS1 [root@bwccs302 ~]# top top - 08:29:12 up 2 min, 1 user, load average: 0.98, 0.40, 0.15 Tasks: 434 total, 1 running, 430 sleeping, 0 stopped, 3 zombie Cpu(s): 2.5%us, 3.9%sy, 0.0%ni, 93.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 24731500k total, 777344k used, 23954156k free, 3692k buffers Swap: 0k total, 0k used, 0k free, 274560k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 26013 root 20 0 167m 24m 3268 S 99.9 0.1 0:29.63 clvmd |
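The boot log above ends with clvmd pinned near 100% CPU and "cluster request failed: Invalid argument" for both volume groups; the discussion that follows in this thread links these symptoms to the LVM locking setup (the clustered flag on the root volume group and mismatched locking_type settings). A rough, hedged sketch for comparing the locking_type on the booted root filesystem with the copy inside the comoonics chroot, assuming lvm.conf sits at its usual path in both places and that grep is available inside the chroot:

# locking_type on the booted root filesystem
grep locking_type /etc/lvm/lvm.conf
# locking_type inside the comoonics chroot used by the initrd
# (assumption: grep and /etc/lvm/lvm.conf exist inside the chroot)
com-chroot grep locking_type /etc/lvm/lvm.conf
# lvmconf --enable-cluster sets locking_type = 3 (clustered locking via clvmd).
# If the two values differ, rebuild the initrd so both copies agree, reboot,
# then re-check:
cman_tool services     # the clvmd lockspace should be listed
service clvmd status
vgs                    # should no longer report "cluster request failed"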
From: Marc G. <gr...@at...> - 2012-11-20 08:42:09
|
Jorge, let's first start with fencing. You are using ipmilan for fencing, and I haven't evaluated that agent with rhel6, so let's start by fixing this issue. Try the following: com-chroot /sbin/fence_ipmilan -h Send me the output; there might be some libs missing. The clvmd behaviour is very strange. Try to stay with locking_type=2 or locking_type=3, then rebuild the initrd and reboot. If clvmd still stays at 100% CPU, kill it and start it again manually with the -d flag and send me the output; perhaps we will see something from there. A short sketch of these steps follows right after this message. Regards Marc. Am 19.11.2012 15:39, schrieb Jorge Silva: > Marc > > Hi, np, thanks for helping. The /var/run/cman* are there. I will > disable the clustered flag on the second volume. Even more disturbing > is after the last email i sent you I went from a state where clvmd was > behaving normally (not 100%), I could access clustered volumes. I > rebooted to verify the that everything was functioning - but I am now > back to the state where clvmd is running at 100% - back to where we > started (can't access clustered volumes). > > locking-type=0 > [root@bwccs302 ~]# vgs > WARNING: Locking disabled. Be careful! This could corrupt your metadata. > VG #PV #LV #SN Attr VSize VFree > VG_DATA1 1 2 0 wz--n- 64.00g 4.20g > vg_osroot 1 1 0 wz--n- 60.00g 0 > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 23207 root 2 -18 101m 23m 3176 S *99.9* 0.1 0:05.82 clvmd > > lrwxrwxrwx 1 root root 41 Nov 16 16:41 /var/run/cman_admin -> > /var/comoonics/chroot//var/run/cman_admin > lrwxrwxrwx 1 root root 42 Nov 16 16:41 /var/run/cman_client -> > /var/comoonics/chroot//var/run/cman_client > > locking_type=3 > [root@bwccs302 ~]# service clvmd restart > Restarting clvmd: [ OK ] > [root@bwccs302 ~]# vgs > cluster request failed: Invalid argument > Can't get lock for VG_DATA1 > cluster request failed: Invalid argument > Can't get lock for vg_osroot > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 23829 root 2 -18 167m 24m 3268 S *99.8* 0.1 0:31.29 clvm > > > > As far as the shutdown - with the two nodes up once I issue the > shutdown on node1, the shutdown proceeds to the point where I sent the > screenshots (deactivating cluster services) - on node2, I notice - > > corosync[16648]: [TOTEM ] A processor failed, forming new > configuration. It attempts to fence > > node1. After 3 unsuccessful attempts, it locks up. Node1 stays stuck > (screen dump I sent) I do a tcp dump and I see the two nodes are still > sending multicast messages and until I reset node1, node2 will stay in > a locked state with no access... this is the last set of messages I see : > > fenced[16784]: fence smc01b dev 0.0 agent fence_ipmilan result: error > from agent > fenced[16784]: fence smc01b failed > > After 3 attempts as fencing failed the cluster locks up till I have > reset the node. > I suspect there is another issue at play here as I can manually fence > a nodes using fence_node x ( I will continue to dig into this I have > tried fenced -q or messagebus with the same result) > > Thanks > Jorge > > > On Mon, Nov 19, 2012 at 3:01 AM, Marc Grimme <gr...@at... > <mailto:gr...@at...>> wrote: > > Hi Jorge, > sorry for the delay but I was quite busy on the last days. > Nevertheless I'm don't understand the problem. > Let's first start at the point I think could lead to problems > during shutdown and friends. > Are the control files in /var/run/cman* being created from the > bootsr initscript or do you still have to create them manually.
> If they are not created I would still be very interested in the > output of > bash -x /etc/init.d/bootsr start > after a node has been started. > > If not we need to dig deeper into the problems during shutdown. > I would then also change the clustered flag for the other volume > group. > Again as long as you don't change the size it wont hurt. > And it's only for better understanding the problem. > > Another command I'd like to see is a cman_tool services on the > other node (say node 2) while the shutdown node is being stuck > (say node 1). > > Thanks Marc. > Am 15.11.2012 19:08, schrieb Jorge Silva: >> Marc >> >> Hi, I believe the problem is related to the clsuter services not >> shutting down. init 0, will not work with 1 or more nodes, init >> 6 will only work when 1 node is present. When more than 1 node >> is present the node with the init 6 will have to be fenced as it >> will not shut down. I believe the cluster components aren't >> shutting down (this also happens with init 6 when more than one >> node is present) - I still see cluster traffic on the network, >> this is periodic. >> >> 12:42:00.547615 IP 172.17.62.12.hpoms-dps-lstn > >> 229.192.0.2.netsupport: UDP, length 119 >> >> At the point that the system will not shut down, it still is a >> cluster member and there is still cluster traffic. >> >> 1 node : >> [root@bwccs302 ~]# init 0 >> >> Can't connect to default. Skipping. >> Shutting down Cluster Module - cluster monitor: [ OK ] >> Shutting down ricci: [ OK ] >> Shutting down Avahi daemon: [ OK ] >> Shutting down oddjobd: [ OK ] >> Stopping saslauthd: [ OK ] >> Stopping sshd: [ OK ] >> Shutting down sm-client: [ OK ] >> Shutting down sendmail: [ OK ] >> Stopping imsd via sshd: [ OK ] >> Stopping snmpd: [ OK ] >> Stopping crond: [ OK ] >> Stopping HAL daemon: [ OK ] >> Shutting down ntpd: [ OK ] >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" now active >> [ OK ] >> Signaling clvmd to exit [ OK ] >> clvmd terminated[ OK ] >> Stopping lldpad: [ OK ] >> Stopping system message bus: [ OK ] >> Stopping multipathd daemon: [ OK ] >> Stopping rpcbind: [ OK ] >> Stopping auditd: [ OK ] >> Stopping nslcd: [ OK ] >> Shutting down system logger: [ OK ] >> Stopping sssd: [ OK ] >> Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ] >> Stopping gfs2 dependent services Starting clvmd: >> Activating VG(s): 2 logical volume(s) in volume group >> "VG_SDATA" now active >> 1 logical volume(s) in volume group "vg_osroot" now active >> [ OK ] >> osr(notice) ..bindmounts.. [ OK ] >> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in >> volume group "vg_osroot" unmonitored >> [ OK ] >> Sending all processes the TERM signal... [ OK ] >> Sending all processes the KILL signal... [ OK ] >> Saving random seed: [ OK ] >> Syncing hardware clock to system time [ OK ] >> Turning off quotas: quotaoff: Cannot change state of GFS2 quota. >> quotaoff: Cannot change state of GFS2 quota. >> [FAILED] >> Unmounting file systems: [ OK ] >> init: Re-executing /sbin/init >> Halting system... >> osr(notice) Scanning for Bootparameters... 
>> osr(notice) Starting ATIX exitrd >> osr(notice) Comoonics-Release >> osr(notice) comoonics Community Release 5.0 (Gumpn) >> osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 >> 15:09:53 $ >> osr(debug) Calling cmd /sbin/halt -d -p >> osr(notice) Preparing chrootcp: cannot stat >> `/mnt/newroot/dev/initctl': No such file or directory >> [ OK ] >> osr(notice) com-realhalt: detected distribution: rhel6, clutype: >> gfs, rootfs: gfs2 >> osr(notice) Restarting init process in chroot[ OK ] >> osr(notice) Moving dev filesystem[ OK ] >> osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys >> /mnt/newroot/proc) >> osr(notice) Umounting /mnt/newroot/sys[ OK ] >> osr(notice) Umounting /mnt/newroot/proc[ OK ] >> osr(notice) Umounting filesystems in oldroot >> (/mnt/newroot/var/run /mnt/newroot/var/lock /mnt/newroot/.cdsl.local) >> osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing >> /sbin/init >> [ OK ] >> osr(notice) Umounting /mnt/newroot/var/lock[ OK ] >> osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] >> osr(notice) Umounting oldroot /mnt/newroot[ OK ] >> osr(notice) Breakpoint "halt_umountoldroot" detected forking a shell >> bash: no job control in this shell >> >> Type help to get more information.. >> Type exit to continue work.. >> ------------------------------------------------------------- >> >> comoonics 1 > cman_tool: unknown option cman_tool >> comoonics 2 > comoonics 2 > Version: 6.2.0 >> Config Version: 1 >> Cluster Name: ProdCluster01 >> Cluster Id: 11454 >> Cluster Member: Yes >> Cluster Generation: 4 >> Membership state: Cluster-Member >> Nodes: 1 >> Expected votes: 4 >> Quorum device votes: 3 >> Total votes: 4 >> Node votes: 1 >> Quorum: 3 >> Active subsystems: 10 >> Flags: >> Ports Bound: 0 11 178 >> Node name: smc01b >> Node ID: 2 >> Multicast addresses: 229.192.0.2 >> Node addresses: 172.17.62.12 >> comoonics 3 > fence domain >> member count 1 >> victim count 0 >> victim now 0 >> master nodeid 2 >> wait state none >> members 2 >> >> dlm lockspaces >> name clvmd >> id 0x4104eefa >> flags 0x00000000 >> change member 1 joined 1 remove 0 failed 0 seq 1,1 >> members 2 >> >> comoonics 4 > bash: exitt: command not found >> comoonics 5 > exit >> osr(notice) Back to work.. >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" now active >> >> It hung at the point above - so I re-ran with the edit set -x in >> line 207. >> 1 -node: >> [root@bwccs302 ~]# init 0 >> [root@bwccs302 ~ >> Can't connect to default. Skipping. >> Shutting down Cluster Module - cluster monitor: [ OK ] >> Shutting down ricci: [ OK ] >> Shutting down Avahi daemon: [ OK ] >> Shutting down oddjobd: [ OK ] >> Stopping saslauthd: [ OK ] >> Stopping sshd: [ OK ] >> Shutting down sm-client: [ OK ] >> Shutting down sendmail: [ OK ] >> Stopping imsd via sshd: [ OK ] >> Stopping snmpd: [ OK ] >> Stopping crond: [ OK ] >> Stopping HAL daemon: [ OK ] >> Shutting down ntpd: [ OK ] >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" n ow active >> [ OK ] >> Signaling clvmd to exit [ OK ] >> clvmd terminated[ OK ] >> Stopping lldpad: [ OK ] >> Stopping system message bus: [ OK ] >> Stopping multipathd daemon: [ OK ] >> Stopping rpcbind: [ OK ] >> Stopping auditd: [ OK ] >> Stopping nslcd: [ OK ] >> Shutting down system logger: [ OK ] >> Stopping sssd: [ OK ] >> Stopping gfs dependent services osr(notice) ..bindmounts.. 
[ OK ] >> Stopping gfs2 dependent services Starting clvmd: >> Activating VG(s): 1 logical volume(s) in volume group >> "vg_osroot" now active >> 2 logical volume(s) in volume group "VG_SDATA" now active >> [ OK ] >> osr(notice) ..bindmounts.. [ OK ] >> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in >> volume group "vg_ osroot" unmonitored >> [ OK ] >> Sending all processes the TERM signal... [ OK ] >> Sending all processes the KILL signal... [ OK ] >> Saving random seed: [ OK ] >> Syncing hardware clock to system time [ OK ] >> Turning off quotas: quotaoff: Cannot change state of GFS2 quota. >> quotaoff: Cannot change state of GFS2 quota. >> [FAILED] >> Unmounting file systems: [ OK ] >> init: Re-executing /sbin/init >> Halting system... >> osr(notice) Scanning for Bootparameters... >> osr(notice) Starting ATIX exitrd >> osr(notice) Comoonics-Release >> osr(notice) comoonics Community Release 5.0 (Gumpn) >> osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 >> 15:09:53 $ >> osr(notice) Preparing chrootcp: cannot stat >> `/mnt/newroot/dev/initctl': No such file or directory [ OK ] >> osr(notice) com-realhalt: detected distribution: rhel6, clutype: >> gfs, rootfs: gfs2 >> osr(notice) Restarting init process in chroot[ OK ] >> osr(notice) Moving dev filesystem[ OK ] >> osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys >> /mnt/newroot/proc) >> osr(notice) Umounting /mnt/newroot/sys[ OK ] >> osr(notice) Umounting /mnt/newroot/proc[ OK ] >> osr(notice) Umounting filesystems in oldroot >> (/mnt/newroot/var/run /mnt/newroot/var/lock /mnt/newroot/.cdsl.local) >> osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing >> /sbin/init [ OK ] >> osr(notice) Umounting /mnt/newroot/var/lock[ OK ] >> osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] >> osr(notice) Umounting oldroot /mnt/newroot[ OK ] >> + clusterfs_services_stop '' '' 0 >> ++ repository_get_value rootfs >> +++ repository_normalize_value rootfs >> ++ local key=rootfs >> ++ local default= >> ++ local repository= >> ++ '[' -z '' ']' >> ++ repository=comoonics >> ++ local value= >> ++ '[' -f /var/cache/comoonics-repository/comoonics.rootfs ']' >> +++ cat /var/cache/comoonics-repository/comoonics.rootfs >> ++ value=gfs2 >> ++ echo gfs2 >> ++ return 0 >> + local rootfs=gfs2 >> + gfs2_services_stop '' '' 0 >> + local chroot_path= >> + local lock_method= >> + local lvm_sup=0 >> + '[' -n 0 ']' >> + '[' 0 -eq 0 ']' >> + /etc/init.d/clvmd stop >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" now active >> >> with 2 nodes + quorate when init 6 is issued: >> >> [root@bwccs304 ~]# init 6 >> [root@bwccs304 ~ >> Can't connect to default. Skipping. 
>> Shutting down Cluster Module - cluster monitor: [ OK ] >> Shutting down ricci: [ OK ] >> Shutting down Avahi daemon: [ OK ] >> Shutting down oddjobd: [ OK ] >> Stopping saslauthd: [ OK ] >> Stopping sshd: [ OK ] >> Shutting down sm-client: [ OK ] >> Shutting down sendmail: [ OK ] >> Stopping imsd via sshd: [ OK ] >> Stopping snmpd: [ OK ] >> Stopping crond: [ OK ] >> Stopping HAL daemon: [ OK ] >> Shutting down ntpd: [ OK ] >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" now active >> [ OK ] >> Signaling clvmd to exit [ OK ] >> clvmd terminated[ OK ] >> Stopping lldpad: [ OK ] >> Stopping system message bus: [ OK ] >> Stopping multipathd daemon: [ OK ] >> Stopping rpcbind: [ OK ] >> Stopping auditd: [ OK ] >> Stopping nslcd: [ OK ] >> Shutting down system logger: [ OK ] >> Stopping sssd: [ OK ] >> Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ] >> Stopping gfs2 dependent services Starting clvmd: >> Activating VG(s): 1 logical volume(s) in volume group >> "vg_osroot" now active >> 2 logical volume(s) in volume group "VG_SDATA" now active >> [ OK ] >> osr(notice) ..bindmounts.. [ OK ] >> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in >> volume group "vg_osroot" unmonitored >> [ OK ] >> Sending all processes the TERM signal... [ OK ] >> qdiskd[15713]: Unregistering quorum device. >> >> Sending all processes the KILL signal... dlm: clvmd: no userland >> control daemon, stopping lockspace >> dlm: OSRoot: no userland control daemon, stopping lockspace >> [ OK ] >> - stops here and will not die... Still have full cluster coms >> >> Thanks >> jorge >> >> On Tue, Nov 13, 2012 at 9:32 AM, Marc Grimme <gr...@at... >> <mailto:gr...@at...>> wrote: >> >> Hi Jorge, >> because of the "init 0". >> Please issue the following commands prior to init 0. >> # Make it a little more chatty >> $ com-chroot setparameter debug >> # Break after before cluster will be stopped >> $ com-chroot setparameter step halt_umountoldroot >> >> Then issue a init 0. >> This should lead you to a breakpoint during shutdown >> (hopefully, cause sometimes the console gets confused). >> In side the breakpoint issue: >> $ cman_tool status >> $ cman_tool services >> # Continue shutdown >> $ exit >> Then send me the output. >> >> If this fails also do as follows: >> $ com-chroot vi com-realhalt.sh >> # go to line 207 (before clusterfs_services_stop) is called >> and add a set -x >> $ init 0 >> >> Send the output. >> Thanks Marc. >> >> ----- Original Message ----- >> From: "Jorge Silva" <me...@je... <mailto:me...@je...>> >> To: "Marc Grimme" <gr...@at... <mailto:gr...@at...>> >> Cc: ope...@li... >> <mailto:ope...@li...> >> Sent: Tuesday, November 13, 2012 3:22:37 PM >> Subject: Re: Problem with VG activation clvmd runs at 100% >> >> Marc >> >> >> Hi, thanks for the info, it helps. I have also noticed that >> gfs2 entries in the fstab get ignored on boot, I have added >> in rc.local. I have done a bit more digging and the issue I >> described below: >> >> >> "I am still a bit stuck when nodes with gfs2 mounted don't >> restart if instructed to do so, but I will read some more." >> >> >> If I issue a init 6 on a nodes they will restart. If I issue >> init 0, then I have the problem the node start to shut down, >> but will stay in the cluster. I have to shut it off, it will >> not shut down, this is the log. >> >> >> >> [root@bwccs304 ~]# init 0 >> >> >> Can't connect to default. Skipping. 
>> Shutting down Cluster Module - cluster monitor: [ OK ] >> Shutting down ricci: [ OK ] >> Shutting down oddjobd: [ OK ] >> Stopping saslauthd: [ OK ] >> Stopping sshd: [ OK ] >> Shutting down sm-client: [ OK ] >> Shutting down sendmail: [ OK ] >> Stopping imsd via sshd: [ OK ] >> Stopping snmpd: [ OK ] >> Stopping crond: [ OK ] >> Stopping HAL daemon: [ OK ] >> Stopping nscd: [ OK ] >> Shutting down ntpd: [ OK ] >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" now active >> [ OK ] >> Signaling clvmd to exit [ OK ] >> clvmd terminated[ OK ] >> Stopping lldpad: [ OK ] >> Stopping system message bus: [ OK ] >> Stopping multipathd daemon: [ OK ] >> Stopping rpcbind: [ OK ] >> Stopping auditd: [ OK ] >> Stopping nslcd: [ OK ] >> Shutting down system logger: [ OK ] >> Stopping sssd: [ OK ] >> Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ] >> Stopping gfs2 dependent services Starting clvmd: >> Activating VG(s): 2 logical volume(s) in volume group >> "VG_SDATA" now active >> 1 logical volume(s) in volume group "vg_osroot" now active >> [ OK ] >> osr(notice) ..bindmounts.. [ OK ] >> Stopping monitoring for VG VG_SDATA: 1 logical volume(s) in >> volume group "VG_SDATA" unmonitored >> [ OK ] >> Stopping monitoring for VG vg_osroot: 1 logical volume(s) in >> volume group "vg_osroot" unmonitored >> [ OK ] >> Sending all processes the TERM signal... [ OK ] >> Sending all processes the KILL signal... [ OK ] >> Saving random seed: [ OK ] >> Syncing hardware clock to system time [ OK ] >> Turning off quotas: quotaoff: Cannot change state of GFS2 quota. >> quotaoff: Cannot change state of GFS2 quota. >> [FAILED] >> Unmounting file systems: [ OK ] >> init: Re-executing /sbin/init >> Halting system... >> osr(notice) Scanning for Bootparameters... >> osr(notice) Starting ATIX exitrd >> osr(notice) Comoonics-Release >> osr(notice) comoonics Community Release 5.0 (Gumpn) >> osr(notice) Internal Version $Revision: 1.18 $ $Date: >> 2011-02-11 15:09:53 $ >> osr(notice) Preparing chrootcp: cannot stat >> `/mnt/newroot/dev/initctl': No such file or directory >> [ OK ] >> osr(notice) com-realhalt: detected distribution: rhel6, >> clutype: gfs, rootfs: gfs2 >> osr(notice) Restarting init process in chroot[ OK ] >> osr(notice) Moving dev filesystem[ OK ] >> osr(notice) Umounting filesystems in oldroot ( >> /mnt/newroot/sys /mnt/newroot/proc) >> osr(notice) Umounting /mnt/newroot/sys[ OK ] >> osr(notice) Umounting /mnt/newroot/proc[ OK ] >> osr(notice) Umounting filesystems in oldroot >> (/mnt/newroot/var/run /mnt/newroot/var/lock >> /mnt/newroot/.cdsl.local) >> osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing >> /sbin/init >> [ OK ] >> osr(notice) Umounting /mnt/newroot/var/lock[ OK ] >> osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] >> osr(notice) Umounting oldroot /mnt/newroot[ OK ] >> Deactivating clustered VG(s): 0 logical volume(s) in volume >> group "VG_SDATA" now active >> >> >> >> >> >> On Tue, Nov 13, 2012 at 2:43 AM, Marc Grimme < gr...@at... >> <mailto:gr...@at...> > wrote: >> >> >> Jorge, >> you don't need to be doubtful about the fact that the volume >> group for the root file system is not flagged as clustered. >> This has no implications whatsoever on the gfs2 file system. >> >> It will only be a problem whenever the lvm settings of the >> vg_osroot change (size, number of lvs etc.). 
>> >> Nevertheless while thinking about your problem I think I had >> the idea on how to fix this problem on being able to have the >> root vg clustered also. I will provide new packages in the >> next days that should deal with the problem. >> >> Keep in mind that there is a difference between cman_tool >> services and the lvm usage. >> clvmd only uses the locktable clvmd shown by cman_tool >> services and the other locktables are relevant to the file >> systems and other services (fenced, rgmanager, ..). This is a >> complete different use case. >> >> Try to elaborate a bit more on the fact >> >> "I am still a bit stuck when nodes with gfs2 mounted don't >> restart if instructed to do so, but I will read some more." >> What do you mean with it? How does this happen? This sounds >> like something you should have a look at. >> >> >> "Once thing that I can confirm is >> osr(notice): Detecting nodeid & nodename >> This does not always display the correct info, but it doesn't >> seem to be a problem either ?" >> >> You should always look at the nodeid the nodename is (more or >> less) only descriptive and might not be set as expected. But >> the nodeid should always be consistent. Does this help? >> >> About your notes (I only take the relevant ones): >> >> 1. osr(notice): Creating clusterfiles /var/run/cman_admin >> /var/run/cman_client.. [OK] >> This message should not be misleading but only tells the >> these control files are being created inside the ramdisk. >> This has nothing to do with these files on your root file >> system. Nevertheless /etc/init.d/bootsr should take over this >> part and create the files. Please send me another >> bash -x /etc/init.d/bootsr start >> output. Please when those files are not existant. >> >> 2. vgs >> >> VG #PV #LV #SN Attr VSize VFree >> VG_SDATA 1 2 0 wz--nc 1000.00g 0 >> vg_osroot 1 1 0 wz--n- 60.00g 0 >> >> This is perfectly ok. This only means the vg is not >> clustered. But the filesystem IS. This does not have any >> connection. >> >> Hope this helps. >> Let me know about the open issues. >> >> Regards >> >> Marc. >> >> >> ----- Original Message ----- >> From: "Jorge Silva" < me...@je... <mailto:me...@je...> > >> To: "Marc Grimme" < gr...@at... <mailto:gr...@at...> > >> >> Sent: Tuesday, November 13, 2012 2:15:23 AM >> Subject: Re: Problem with VG activation clvmd runs at 100% >> >> >> Marc >> >> >> Hi - I believe I have solved my problem, with your help, >> thank you. Yet, I'm not sure how I caused it - but the root >> volume group as you pointed out had the clustered >> attribute(and I had to have done something silly along the >> way). I re-installed from scratch see notes below and then >> just to prove that is a problem, I changed the attribute of >> the rootfs- vgchange -cy and rebooted and I ran into trouble, >> I changed it back and it is fine so that does cause problems >> on start-up, I'm not sure I understand why as there is an >> active quorum for the clvm to join and take part.. >> >> >> Despite it not being marked as a cluster volume cman_tool >> services show it as being, but clvmd status doesn't ? Is it >> safe to write to it with multiple nodes mounted? >> >> >> I am still a bit stuck when nodes with gfs2 mounted don't >> restart if instructed to do so, but I will read some more. >> >> >> >> >> Once thing that I can confirm is >> osr(notice): Detecting nodeid & nodename >> >> >> This does not always display the correct info, but it doesn't >> seem to be a problem either ? 
>> >> >> >> >> Thanks >> Jorge >> >> >> Notes: >> I decided to start from scratch and I blew away the rootfs >> and started from scratch as per the website. My assumption - >> that I edited something and messed it up (I did look at a lot >> of the scripts to try to "figure out and fix" the problem, I >> can send the history if you want or I can edit and contribute). >> >> >> I rebooted the server and I had an issue - I didn't disable >> selinux so I had to intervene in the boot stage. That >> completed, but I noticed that : >> >> >> >> osr(notice): Starting network configuration for lo0 [OK] >> osr(notice): Detecting nodeid & nodename >> >> >> Is blank, but somehow the correct nodeid and name was deduced. >> >> >> I had to rebuild the ram disk to fix the selinux disabled. I >> also added the following >> >> yum install pciutils - the mkinitrd warned about this so, I >> installed it. >> I also installed : >> yum install cluster-snmp >> yum install rgmanager >> in lvm >> >> >> On this reboot I noticed that despite this message >> >> sr(notice): Creating clusterfiles /var/run/cman_admin >> /var/run/cman_client.. [OK] >> >> >> Starting clvmd: dlm: Using TCP for communications >> >> >> Activating VG(s): File descriptor 3 (/dev/console) leaked on >> vgchange invocation. Parent PID 15995: /bin/bash >> File descriptor 4 (/dev/console) leaked on vgchange >> invocation. Parent PID 15995: /bin/bash >> Skipping clustered volume group VG_SDATA >> 1 logical volume(s) in volume group "vg_osroot" now active >> >> >> the links weren't created and I did this manually >> >> >> >> ln -sf /var/comoonics/chroot//var/run/cman_admin >> /var/run/cman_admin >> ln -sf /var/comoonics/chroot//var/run/cman_client >> /var/run/cman_client >> >> >> I could then get clusterstatus etc, and clvmd was running ok >> >> >> I looked in /etc/lvm/lvm.conf and locking_type = 4 ? >> >> >> I then issued >> >> >> lvmconf --enable cluster - and this changed /etc/lvm/lvm.conf >> locking_type = 3. >> >> >> vgscan correctly showed up clusterd volumes and was working ok. >> >> >> >> >> I did not rebuild the ramdisk (I can confirm that the lvm >> .conf in the ramdisk has locking_type=4) I have rebooted and >> everything is working. >> >> Starting clvmd: dlm: Using TCP for communications >> >> >> Activating VG(s): File descriptor 3 (/dev/console) leaked on >> vgchange invocation. Parent PID 15983: /bin/bash >> File descriptor 4 (/dev/console) leaked on vgchange >> invocation. Parent PID 15983: /bin/bash >> Skipping clustered volume group VG_SDATA >> 1 logical volume(s) in volume group "vg_osroot" now active >> >> >> >> >> >> >> I have rebooted a number of times and am confident that >> things are ok, >> >> >> I decided to add two other nodes to the mix and I can confirm >> that everytime a new node is added these files are missing : >> >> >> /var/run/cman_admin >> /var/run/cman_client >> But I can see from the logs: >> >> >> >> osr(notice): Creating clusterfiles /var/run/cman_admin >> /var/run/cman_client.. [OK] >> >> >> despite the above message, also, the information below is not >> always detected, but still the nodeid etc is correct... 
>> >> >> osr(notice): Detecting nodeid & nodename >> >> >> >> >> So now I have 3 nodes in the cluster and things look ok: >> >> >> >> [root@bwccs302 ~]# cman_tool services >> fence domain >> member count 3 >> victim count 0 >> victim now 0 >> master nodeid 2 >> wait state none >> members 2 3 4 >> >> >> dlm lockspaces >> name home >> id 0xf8ee17aa >> flags 0x00000008 fs_reg >> change member 3 joined 1 remove 0 failed 0 seq 3,3 >> members 2 3 4 >> >> >> name clvmd >> id 0x4104eefa >> flags 0x00000000 >> change member 3 joined 1 remove 0 failed 0 seq 15,15 >> members 2 3 4 >> >> >> name OSRoot >> id 0xab5404ad >> flags 0x00000008 fs_reg >> change member 3 joined 1 remove 0 failed 0 seq 7,7 >> members 2 3 4 >> >> >> gfs mountgroups >> name home >> id 0x686e3fc4 >> flags 0x00000048 mounted >> change member 3 joined 1 remove 0 failed 0 seq 3,3 >> members 2 3 4 >> >> >> name OSRoot >> id 0x659f7afe >> flags 0x00000048 mounted >> change member 3 joined 1 remove 0 failed 0 seq 7,7 >> members 2 3 4 >> >> >> >> service clvmd status >> clvmd (pid 25771) is running... >> Clustered Volume Groups: VG_SDATA >> Active clustered Logical Volumes: LV_HOME LV_DEVDB >> >> >> it doesn't believe that the root file-system is clustered >> despite the output from the above. >> >> >> >> [root@bwccs302 ~]# vgs >> VG #PV #LV #SN Attr VSize VFree >> VG_SDATA 1 2 0 wz--nc 1000.00g 0 >> vg_osroot 1 1 0 wz--n- 60.00g 0 >> >> >> The above got me thinking on what you wanted me to do to >> diable the clusterd flag on the root volume - with it left on >> I was having problems (not sure how it got turned) on. >> >> >> With everything working ok, I remade ramdisk and now lvm.conf=3.. >> >> >> The systems start up and things look ok. >> >> > > > -- Marc Grimme E-Mail: grimme( at )atix.de ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 | 85716 Unterschleissheim | www.atix.de | www.comoonics.org Registergericht: Amtsgericht Muenchen, Registernummer: HRB 168930, USt.-Id.: DE209485962 | Vorstand: Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.) | Vorsitzender des Aufsichtsrats: Dr. Martin Buss |
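A hedged sketch of the debugging steps suggested at the top of the message above; the mkinitrd image name is only an illustration and has to match the initrd referenced in your grub.conf:

# 1) Check the fence agent and its libraries inside the initrd chroot:
com-chroot /sbin/fence_ipmilan -h    # a usage message means the agent and its libs are present
# 2) Switch to clustered LVM locking and rebuild the initrd so the lvm.conf
#    inside it matches the one on the root filesystem:
lvmconf --enable-cluster             # sets locking_type = 3 in /etc/lvm/lvm.conf
mkinitrd -f /boot/initrd_sr-$(uname -r).img $(uname -r)   # example image name only
reboot
# 3) If clvmd still spins at 100% CPU after the reboot, restart it in the
#    foreground with debugging and capture the output for the list:
killall clvmd
clvmd -d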
From: Marc G. <gr...@at...> - 2012-11-19 08:02:24
|
Hi Jorge, sorry for the delay, but I was quite busy over the last few days. Nevertheless, I still don't understand the problem. Let's first start at the point I think could lead to problems during shutdown and friends. Are the control files in /var/run/cman* being created by the bootsr initscript, or do you still have to create them manually? If they are not created, I would still be very interested in the output of bash -x /etc/init.d/bootsr start after a node has been started. If not, we need to dig deeper into the problems during shutdown. I would then also change the clustered flag for the other volume group; again, as long as you don't change the size it won't hurt, and it's only for better understanding the problem. Another command I'd like to see is cman_tool services on the other node (say node 2) while the shutting-down node (say node 1) is stuck. A short sketch of how to check and recreate the control files follows right after this message. Thanks Marc. Am 15.11.2012 19:08, schrieb Jorge Silva: > Marc > > Hi, I believe the problem is related to the clsuter services not > shutting down. init 0, will not work with 1 or more nodes, init 6 > will only work when 1 node is present. When more than 1 node is > present the node with the init 6 will have to be fenced as it will > not shut down. I believe the cluster components aren't shutting down > (this also happens with init 6 when more than one node is present) - > I still see cluster traffic on the network, this is periodic. > > 12:42:00.547615 IP 172.17.62.12.hpoms-dps-lstn > > 229.192.0.2.netsupport: UDP, length 119 > > At the point that the system will not shut down, it still is a cluster > member and there is still cluster traffic. > > 1 node : > [root@bwccs302 ~]# init 0 > > Can't connect to default. Skipping. > Shutting down Cluster Module - cluster monitor: [ OK ] > Shutting down ricci: [ OK ] > Shutting down Avahi daemon: [ OK ] > Shutting down oddjobd: [ OK ] > Stopping saslauthd: [ OK ] > Stopping sshd: [ OK ] > Shutting down sm-client: [ OK ] > Shutting down sendmail: [ OK ] > Stopping imsd via sshd: [ OK ] > Stopping snmpd: [ OK ] > Stopping crond: [ OK ] > Stopping HAL daemon: [ OK ] > Shutting down ntpd: [ OK ] > Deactivating clustered VG(s): 0 logical volume(s) in volume group > "VG_SDATA" now active > [ OK ] > Signaling clvmd to exit [ OK ] > clvmd terminated[ OK ] > Stopping lldpad: [ OK ] > Stopping system message bus: [ OK ] > Stopping multipathd daemon: [ OK ] > Stopping rpcbind: [ OK ] > Stopping auditd: [ OK ] > Stopping nslcd: [ OK ] > Shutting down system logger: [ OK ] > Stopping sssd: [ OK ] > Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ] > Stopping gfs2 dependent services Starting clvmd: > Activating VG(s): 2 logical volume(s) in volume group "VG_SDATA" now > active > 1 logical volume(s) in volume group "vg_osroot" now active > [ OK ] > osr(notice) ..bindmounts.. [ OK ] > Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume > group "vg_osroot" unmonitored > [ OK ] > Sending all processes the TERM signal... [ OK ] > Sending all processes the KILL signal... [ OK ] > Saving random seed: [ OK ] > Syncing hardware clock to system time [ OK ] > Turning off quotas: quotaoff: Cannot change state of GFS2 quota. > quotaoff: Cannot change state of GFS2 quota. > [FAILED] > Unmounting file systems: [ OK ] > init: Re-executing /sbin/init > Halting system... > osr(notice) Scanning for Bootparameters... 
> osr(notice) Starting ATIX exitrd > osr(notice) Comoonics-Release > osr(notice) comoonics Community Release 5.0 (Gumpn) > osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 > 15:09:53 $ > osr(debug) Calling cmd /sbin/halt -d -p > osr(notice) Preparing chrootcp: cannot stat > `/mnt/newroot/dev/initctl': No such file or directory > [ OK ] > osr(notice) com-realhalt: detected distribution: rhel6, clutype: gfs, > rootfs: gfs2 > osr(notice) Restarting init process in chroot[ OK ] > osr(notice) Moving dev filesystem[ OK ] > osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys > /mnt/newroot/proc) > osr(notice) Umounting /mnt/newroot/sys[ OK ] > osr(notice) Umounting /mnt/newroot/proc[ OK ] > osr(notice) Umounting filesystems in oldroot (/mnt/newroot/var/run > /mnt/newroot/var/lock /mnt/newroot/.cdsl.local) > osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing /sbin/init > [ OK ] > osr(notice) Umounting /mnt/newroot/var/lock[ OK ] > osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] > osr(notice) Umounting oldroot /mnt/newroot[ OK ] > osr(notice) Breakpoint "halt_umountoldroot" detected forking a shell > bash: no job control in this shell > > Type help to get more information.. > Type exit to continue work.. > ------------------------------------------------------------- > > comoonics 1 > cman_tool: unknown option cman_tool > comoonics 2 > comoonics 2 > Version: 6.2.0 > Config Version: 1 > Cluster Name: ProdCluster01 > Cluster Id: 11454 > Cluster Member: Yes > Cluster Generation: 4 > Membership state: Cluster-Member > Nodes: 1 > Expected votes: 4 > Quorum device votes: 3 > Total votes: 4 > Node votes: 1 > Quorum: 3 > Active subsystems: 10 > Flags: > Ports Bound: 0 11 178 > Node name: smc01b > Node ID: 2 > Multicast addresses: 229.192.0.2 > Node addresses: 172.17.62.12 > comoonics 3 > fence domain > member count 1 > victim count 0 > victim now 0 > master nodeid 2 > wait state none > members 2 > > dlm lockspaces > name clvmd > id 0x4104eefa > flags 0x00000000 > change member 1 joined 1 remove 0 failed 0 seq 1,1 > members 2 > > comoonics 4 > bash: exitt: command not found > comoonics 5 > exit > osr(notice) Back to work.. > Deactivating clustered VG(s): 0 logical volume(s) in volume group > "VG_SDATA" now active > > It hung at the point above - so I re-ran with the edit set -x in line 207. > 1 -node: > [root@bwccs302 ~]# init 0 > [root@bwccs302 ~ > Can't connect to default. Skipping. > Shutting down Cluster Module - cluster monitor: [ OK ] > Shutting down ricci: [ OK ] > Shutting down Avahi daemon: [ OK ] > Shutting down oddjobd: [ OK ] > Stopping saslauthd: [ OK ] > Stopping sshd: [ OK ] > Shutting down sm-client: [ OK ] > Shutting down sendmail: [ OK ] > Stopping imsd via sshd: [ OK ] > Stopping snmpd: [ OK ] > Stopping crond: [ OK ] > Stopping HAL daemon: [ OK ] > Shutting down ntpd: [ OK ] > Deactivating clustered VG(s): 0 logical volume(s) in volume group > "VG_SDATA" n ow active > [ OK ] > Signaling clvmd to exit [ OK ] > clvmd terminated[ OK ] > Stopping lldpad: [ OK ] > Stopping system message bus: [ OK ] > Stopping multipathd daemon: [ OK ] > Stopping rpcbind: [ OK ] > Stopping auditd: [ OK ] > Stopping nslcd: [ OK ] > Shutting down system logger: [ OK ] > Stopping sssd: [ OK ] > Stopping gfs dependent services osr(notice) ..bindmounts.. 
[ OK ] > Stopping gfs2 dependent services Starting clvmd: > Activating VG(s): 1 logical volume(s) in volume group "vg_osroot" > now active > 2 logical volume(s) in volume group "VG_SDATA" now active > [ OK ] > osr(notice) ..bindmounts.. [ OK ] > Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume > group "vg_ osroot" unmonitored > [ OK ] > Sending all processes the TERM signal... [ OK ] > Sending all processes the KILL signal... [ OK ] > Saving random seed: [ OK ] > Syncing hardware clock to system time [ OK ] > Turning off quotas: quotaoff: Cannot change state of GFS2 quota. > quotaoff: Cannot change state of GFS2 quota. > [FAILED] > Unmounting file systems: [ OK ] > init: Re-executing /sbin/init > Halting system... > osr(notice) Scanning for Bootparameters... > osr(notice) Starting ATIX exitrd > osr(notice) Comoonics-Release > osr(notice) comoonics Community Release 5.0 (Gumpn) > osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 > 15:09:53 $ > osr(notice) Preparing chrootcp: cannot stat > `/mnt/newroot/dev/initctl': No such file or directory [ OK ] > osr(notice) com-realhalt: detected distribution: rhel6, clutype: gfs, > rootfs: gfs2 > osr(notice) Restarting init process in chroot[ OK ] > osr(notice) Moving dev filesystem[ OK ] > osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys > /mnt/newroot/proc) > osr(notice) Umounting /mnt/newroot/sys[ OK ] > osr(notice) Umounting /mnt/newroot/proc[ OK ] > osr(notice) Umounting filesystems in oldroot (/mnt/newroot/var/run > /mnt/newroot/var/lock /mnt/newroot/.cdsl.local) > osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing > /sbin/init [ OK ] > osr(notice) Umounting /mnt/newroot/var/lock[ OK ] > osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] > osr(notice) Umounting oldroot /mnt/newroot[ OK ] > + clusterfs_services_stop '' '' 0 > ++ repository_get_value rootfs > +++ repository_normalize_value rootfs > ++ local key=rootfs > ++ local default= > ++ local repository= > ++ '[' -z '' ']' > ++ repository=comoonics > ++ local value= > ++ '[' -f /var/cache/comoonics-repository/comoonics.rootfs ']' > +++ cat /var/cache/comoonics-repository/comoonics.rootfs > ++ value=gfs2 > ++ echo gfs2 > ++ return 0 > + local rootfs=gfs2 > + gfs2_services_stop '' '' 0 > + local chroot_path= > + local lock_method= > + local lvm_sup=0 > + '[' -n 0 ']' > + '[' 0 -eq 0 ']' > + /etc/init.d/clvmd stop > Deactivating clustered VG(s): 0 logical volume(s) in volume group > "VG_SDATA" now active > > with 2 nodes + quorate when init 6 is issued: > > [root@bwccs304 ~]# init 6 > [root@bwccs304 ~ > Can't connect to default. Skipping. > Shutting down Cluster Module - cluster monitor: [ OK ] > Shutting down ricci: [ OK ] > Shutting down Avahi daemon: [ OK ] > Shutting down oddjobd: [ OK ] > Stopping saslauthd: [ OK ] > Stopping sshd: [ OK ] > Shutting down sm-client: [ OK ] > Shutting down sendmail: [ OK ] > Stopping imsd via sshd: [ OK ] > Stopping snmpd: [ OK ] > Stopping crond: [ OK ] > Stopping HAL daemon: [ OK ] > Shutting down ntpd: [ OK ] > Deactivating clustered VG(s): 0 logical volume(s) in volume group > "VG_SDATA" now active > [ OK ] > Signaling clvmd to exit [ OK ] > clvmd terminated[ OK ] > Stopping lldpad: [ OK ] > Stopping system message bus: [ OK ] > Stopping multipathd daemon: [ OK ] > Stopping rpcbind: [ OK ] > Stopping auditd: [ OK ] > Stopping nslcd: [ OK ] > Shutting down system logger: [ OK ] > Stopping sssd: [ OK ] > Stopping gfs dependent services osr(notice) ..bindmounts.. 
[ OK ] > Stopping gfs2 dependent services Starting clvmd: > Activating VG(s): 1 logical volume(s) in volume group "vg_osroot" > now active > 2 logical volume(s) in volume group "VG_SDATA" now active > [ OK ] > osr(notice) ..bindmounts.. [ OK ] > Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume > group "vg_osroot" unmonitored > [ OK ] > Sending all processes the TERM signal... [ OK ] > qdiskd[15713]: Unregistering quorum device. > > Sending all processes the KILL signal... dlm: clvmd: no userland > control daemon, stopping lockspace > dlm: OSRoot: no userland control daemon, stopping lockspace > [ OK ] > - stops here and will not die... Still have full cluster coms > > Thanks > jorge > > On Tue, Nov 13, 2012 at 9:32 AM, Marc Grimme <gr...@at... > <mailto:gr...@at...>> wrote: > > Hi Jorge, > because of the "init 0". > Please issue the following commands prior to init 0. > # Make it a little more chatty > $ com-chroot setparameter debug > # Break after before cluster will be stopped > $ com-chroot setparameter step halt_umountoldroot > > Then issue a init 0. > This should lead you to a breakpoint during shutdown (hopefully, > cause sometimes the console gets confused). > In side the breakpoint issue: > $ cman_tool status > $ cman_tool services > # Continue shutdown > $ exit > Then send me the output. > > If this fails also do as follows: > $ com-chroot vi com-realhalt.sh > # go to line 207 (before clusterfs_services_stop) is called and > add a set -x > $ init 0 > > Send the output. > Thanks Marc. > > ----- Original Message ----- > From: "Jorge Silva" <me...@je... <mailto:me...@je...>> > To: "Marc Grimme" <gr...@at... <mailto:gr...@at...>> > Cc: ope...@li... > <mailto:ope...@li...> > Sent: Tuesday, November 13, 2012 3:22:37 PM > Subject: Re: Problem with VG activation clvmd runs at 100% > > Marc > > > Hi, thanks for the info, it helps. I have also noticed that gfs2 > entries in the fstab get ignored on boot, I have added in > rc.local. I have done a bit more digging and the issue I described > below: > > > "I am still a bit stuck when nodes with gfs2 mounted don't restart > if instructed to do so, but I will read some more." > > > If I issue a init 6 on a nodes they will restart. If I issue init > 0, then I have the problem the node start to shut down, but will > stay in the cluster. I have to shut it off, it will not shut down, > this is the log. > > > > [root@bwccs304 ~]# init 0 > > > Can't connect to default. Skipping. > Shutting down Cluster Module - cluster monitor: [ OK ] > Shutting down ricci: [ OK ] > Shutting down oddjobd: [ OK ] > Stopping saslauthd: [ OK ] > Stopping sshd: [ OK ] > Shutting down sm-client: [ OK ] > Shutting down sendmail: [ OK ] > Stopping imsd via sshd: [ OK ] > Stopping snmpd: [ OK ] > Stopping crond: [ OK ] > Stopping HAL daemon: [ OK ] > Stopping nscd: [ OK ] > Shutting down ntpd: [ OK ] > Deactivating clustered VG(s): 0 logical volume(s) in volume group > "VG_SDATA" now active > [ OK ] > Signaling clvmd to exit [ OK ] > clvmd terminated[ OK ] > Stopping lldpad: [ OK ] > Stopping system message bus: [ OK ] > Stopping multipathd daemon: [ OK ] > Stopping rpcbind: [ OK ] > Stopping auditd: [ OK ] > Stopping nslcd: [ OK ] > Shutting down system logger: [ OK ] > Stopping sssd: [ OK ] > Stopping gfs dependent services osr(notice) ..bindmounts.. 
[ OK ] > Stopping gfs2 dependent services Starting clvmd: > Activating VG(s): 2 logical volume(s) in volume group "VG_SDATA" > now active > 1 logical volume(s) in volume group "vg_osroot" now active > [ OK ] > osr(notice) ..bindmounts.. [ OK ] > Stopping monitoring for VG VG_SDATA: 1 logical volume(s) in volume > group "VG_SDATA" unmonitored > [ OK ] > Stopping monitoring for VG vg_osroot: 1 logical volume(s) in > volume group "vg_osroot" unmonitored > [ OK ] > Sending all processes the TERM signal... [ OK ] > Sending all processes the KILL signal... [ OK ] > Saving random seed: [ OK ] > Syncing hardware clock to system time [ OK ] > Turning off quotas: quotaoff: Cannot change state of GFS2 quota. > quotaoff: Cannot change state of GFS2 quota. > [FAILED] > Unmounting file systems: [ OK ] > init: Re-executing /sbin/init > Halting system... > osr(notice) Scanning for Bootparameters... > osr(notice) Starting ATIX exitrd > osr(notice) Comoonics-Release > osr(notice) comoonics Community Release 5.0 (Gumpn) > osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 > 15:09:53 $ > osr(notice) Preparing chrootcp: cannot stat > `/mnt/newroot/dev/initctl': No such file or directory > [ OK ] > osr(notice) com-realhalt: detected distribution: rhel6, clutype: > gfs, rootfs: gfs2 > osr(notice) Restarting init process in chroot[ OK ] > osr(notice) Moving dev filesystem[ OK ] > osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys > /mnt/newroot/proc) > osr(notice) Umounting /mnt/newroot/sys[ OK ] > osr(notice) Umounting /mnt/newroot/proc[ OK ] > osr(notice) Umounting filesystems in oldroot (/mnt/newroot/var/run > /mnt/newroot/var/lock /mnt/newroot/.cdsl.local) > osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing > /sbin/init > [ OK ] > osr(notice) Umounting /mnt/newroot/var/lock[ OK ] > osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] > osr(notice) Umounting oldroot /mnt/newroot[ OK ] > Deactivating clustered VG(s): 0 logical volume(s) in volume group > "VG_SDATA" now active > > > > > > On Tue, Nov 13, 2012 at 2:43 AM, Marc Grimme < gr...@at... > <mailto:gr...@at...> > wrote: > > > Jorge, > you don't need to be doubtful about the fact that the volume group > for the root file system is not flagged as clustered. This has no > implications whatsoever on the gfs2 file system. > > It will only be a problem whenever the lvm settings of the > vg_osroot change (size, number of lvs etc.). > > Nevertheless while thinking about your problem I think I had the > idea on how to fix this problem on being able to have the root vg > clustered also. I will provide new packages in the next days that > should deal with the problem. > > Keep in mind that there is a difference between cman_tool services > and the lvm usage. > clvmd only uses the locktable clvmd shown by cman_tool services > and the other locktables are relevant to the file systems and > other services (fenced, rgmanager, ..). This is a complete > different use case. > > Try to elaborate a bit more on the fact > > "I am still a bit stuck when nodes with gfs2 mounted don't restart > if instructed to do so, but I will read some more." > What do you mean with it? How does this happen? This sounds like > something you should have a look at. > > > "Once thing that I can confirm is > osr(notice): Detecting nodeid & nodename > This does not always display the correct info, but it doesn't seem > to be a problem either ?" 
> > You should always look at the nodeid the nodename is (more or > less) only descriptive and might not be set as expected. But the > nodeid should always be consistent. Does this help? > > About your notes (I only take the relevant ones): > > 1. osr(notice): Creating clusterfiles /var/run/cman_admin > /var/run/cman_client.. [OK] > This message should not be misleading but only tells the these > control files are being created inside the ramdisk. This has > nothing to do with these files on your root file system. > Nevertheless /etc/init.d/bootsr should take over this part and > create the files. Please send me another > bash -x /etc/init.d/bootsr start > output. Please when those files are not existant. > > 2. vgs > > VG #PV #LV #SN Attr VSize VFree > VG_SDATA 1 2 0 wz--nc 1000.00g 0 > vg_osroot 1 1 0 wz--n- 60.00g 0 > > This is perfectly ok. This only means the vg is not clustered. But > the filesystem IS. This does not have any connection. > > Hope this helps. > Let me know about the open issues. > > Regards > > Marc. > > > ----- Original Message ----- > From: "Jorge Silva" < me...@je... <mailto:me...@je...> > > To: "Marc Grimme" < gr...@at... <mailto:gr...@at...> > > > Sent: Tuesday, November 13, 2012 2:15:23 AM > Subject: Re: Problem with VG activation clvmd runs at 100% > > > Marc > > > Hi - I believe I have solved my problem, with your help, thank > you. Yet, I'm not sure how I caused it - but the root volume group > as you pointed out had the clustered attribute(and I had to have > done something silly along the way). I re-installed from scratch > see notes below and then just to prove that is a problem, I > changed the attribute of the rootfs- vgchange -cy and rebooted and > I ran into trouble, I changed it back and it is fine so that does > cause problems on start-up, I'm not sure I understand why as there > is an active quorum for the clvm to join and take part.. > > > Despite it not being marked as a cluster volume cman_tool services > show it as being, but clvmd status doesn't ? Is it safe to write > to it with multiple nodes mounted? > > > I am still a bit stuck when nodes with gfs2 mounted don't restart > if instructed to do so, but I will read some more. > > > > > Once thing that I can confirm is > osr(notice): Detecting nodeid & nodename > > > This does not always display the correct info, but it doesn't seem > to be a problem either ? > > > > > Thanks > Jorge > > > Notes: > I decided to start from scratch and I blew away the rootfs and > started from scratch as per the website. My assumption - that I > edited something and messed it up (I did look at a lot of the > scripts to try to "figure out and fix" the problem, I can send the > history if you want or I can edit and contribute). > > > I rebooted the server and I had an issue - I didn't disable > selinux so I had to intervene in the boot stage. That completed, > but I noticed that : > > > > osr(notice): Starting network configuration for lo0 [OK] > osr(notice): Detecting nodeid & nodename > > > Is blank, but somehow the correct nodeid and name was deduced. > > > I had to rebuild the ram disk to fix the selinux disabled. I also > added the following > > yum install pciutils - the mkinitrd warned about this so, I > installed it. > I also installed : > yum install cluster-snmp > yum install rgmanager > in lvm > > > On this reboot I noticed that despite this message > > sr(notice): Creating clusterfiles /var/run/cman_admin > /var/run/cman_client.. 
[OK] > > > Starting clvmd: dlm: Using TCP for communications > > > Activating VG(s): File descriptor 3 (/dev/console) leaked on > vgchange invocation. Parent PID 15995: /bin/bash > File descriptor 4 (/dev/console) leaked on vgchange invocation. > Parent PID 15995: /bin/bash > Skipping clustered volume group VG_SDATA > 1 logical volume(s) in volume group "vg_osroot" now active > > > the links weren't created and I did this manually > > > > ln -sf /var/comoonics/chroot//var/run/cman_admin /var/run/cman_admin > ln -sf /var/comoonics/chroot//var/run/cman_client /var/run/cman_client > > > I could then get clusterstatus etc, and clvmd was running ok > > > I looked in /etc/lvm/lvm.conf and locking_type = 4 ? > > > I then issued > > > lvmconf --enable cluster - and this changed /etc/lvm/lvm.conf > locking_type = 3. > > > vgscan correctly showed up clusterd volumes and was working ok. > > > > > I did not rebuild the ramdisk (I can confirm that the lvm .conf in > the ramdisk has locking_type=4) I have rebooted and everything is > working. > > Starting clvmd: dlm: Using TCP for communications > > > Activating VG(s): File descriptor 3 (/dev/console) leaked on > vgchange invocation. Parent PID 15983: /bin/bash > File descriptor 4 (/dev/console) leaked on vgchange invocation. > Parent PID 15983: /bin/bash > Skipping clustered volume group VG_SDATA > 1 logical volume(s) in volume group "vg_osroot" now active > > > > > > > I have rebooted a number of times and am confident that things are ok, > > > I decided to add two other nodes to the mix and I can confirm that > everytime a new node is added these files are missing : > > > /var/run/cman_admin > /var/run/cman_client > But I can see from the logs: > > > > osr(notice): Creating clusterfiles /var/run/cman_admin > /var/run/cman_client.. [OK] > > > despite the above message, also, the information below is not > always detected, but still the nodeid etc is correct... > > > osr(notice): Detecting nodeid & nodename > > > > > So now I have 3 nodes in the cluster and things look ok: > > > > [root@bwccs302 ~]# cman_tool services > fence domain > member count 3 > victim count 0 > victim now 0 > master nodeid 2 > wait state none > members 2 3 4 > > > dlm lockspaces > name home > id 0xf8ee17aa > flags 0x00000008 fs_reg > change member 3 joined 1 remove 0 failed 0 seq 3,3 > members 2 3 4 > > > name clvmd > id 0x4104eefa > flags 0x00000000 > change member 3 joined 1 remove 0 failed 0 seq 15,15 > members 2 3 4 > > > name OSRoot > id 0xab5404ad > flags 0x00000008 fs_reg > change member 3 joined 1 remove 0 failed 0 seq 7,7 > members 2 3 4 > > > gfs mountgroups > name home > id 0x686e3fc4 > flags 0x00000048 mounted > change member 3 joined 1 remove 0 failed 0 seq 3,3 > members 2 3 4 > > > name OSRoot > id 0x659f7afe > flags 0x00000048 mounted > change member 3 joined 1 remove 0 failed 0 seq 7,7 > members 2 3 4 > > > > service clvmd status > clvmd (pid 25771) is running... > Clustered Volume Groups: VG_SDATA > Active clustered Logical Volumes: LV_HOME LV_DEVDB > > > it doesn't believe that the root file-system is clustered despite > the output from the above. > > > > [root@bwccs302 ~]# vgs > VG #PV #LV #SN Attr VSize VFree > VG_SDATA 1 2 0 wz--nc 1000.00g 0 > vg_osroot 1 1 0 wz--n- 60.00g 0 > > > The above got me thinking on what you wanted me to do to diable > the clusterd flag on the root volume - with it left on I was > having problems (not sure how it got turned) on. > > > With everything working ok, I remade ramdisk and now lvm.conf=3.. 
> > > The systems start up and things look ok. > > -- Marc Grimme Tel: +49 (0)89 452 35 38-140 Fax: +49 (0)89 452 35 38-290 E-Mail: gr...@at... ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 | 85716 Unterschleissheim | www.atix.de | www.comoonics.org Registergericht: Amtsgericht Muenchen, Registernummer: HRB 168930, USt.-Id.: DE209485962 | Vorstand: Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.) | Vorsitzender des Aufsichtsrats: Dr. Martin Buss |
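The manual workaround quoted above (symlinking the cman control files out of the comoonics chroot into /var/run) can be collected into a small script for as long as /etc/init.d/bootsr does not create them reliably. This is only a sketch of what is already described in the thread; the chroot path is the one shown above and the script name is hypothetical.

#!/bin/bash
# fix-cman-links.sh (hypothetical name): recreate the cman control-file
# symlinks that /etc/init.d/bootsr is expected to set up after boot.
CHROOT=/var/comoonics/chroot
for f in cman_admin cman_client; do
    if [ ! -e "/var/run/$f" ]; then
        ln -sf "$CHROOT/var/run/$f" "/var/run/$f"
        echo "linked /var/run/$f -> $CHROOT/var/run/$f"
    fi
done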
From: Jorge S. <me...@je...> - 2012-11-15 18:31:40
|
Marc Hi, I believe the problem is related to the cluster services not shutting down. init 0 does not work with one or more nodes; init 6 only works when a single node is present. When more than one node is present, the node where init 6 was issued has to be fenced, as it will not shut down. I believe the cluster components aren't shutting down (this also happens with init 6 when more than one node is present) - I still see periodic cluster traffic on the network. 12:42:00.547615 IP 172.17.62.12.hpoms-dps-lstn > 229.192.0.2.netsupport: UDP, length 119 At the point where the system will not shut down, it is still a cluster member and there is still cluster traffic. 1 node : [root@bwccs302 ~]# init 0 Can't connect to default. Skipping. Shutting down Cluster Module - cluster monitor: [ OK ] Shutting down ricci: [ OK ] Shutting down Avahi daemon: [ OK ] Shutting down oddjobd: [ OK ] Stopping saslauthd: [ OK ] Stopping sshd: [ OK ] Shutting down sm-client: [ OK ] Shutting down sendmail: [ OK ] Stopping imsd via sshd: [ OK ] Stopping snmpd: [ OK ] Stopping crond: [ OK ] Stopping HAL daemon: [ OK ] Shutting down ntpd: [ OK ] Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active [ OK ] Signaling clvmd to exit [ OK ] clvmd terminated[ OK ] Stopping lldpad: [ OK ] Stopping system message bus: [ OK ] Stopping multipathd daemon: [ OK ] Stopping rpcbind: [ OK ] Stopping auditd: [ OK ] Stopping nslcd: [ OK ] Shutting down system logger: [ OK ] Stopping sssd: [ OK ] Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ] Stopping gfs2 dependent services Starting clvmd: Activating VG(s): 2 logical volume(s) in volume group "VG_SDATA" now active 1 logical volume(s) in volume group "vg_osroot" now active [ OK ] osr(notice) ..bindmounts.. [ OK ] Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume group "vg_osroot" unmonitored [ OK ] Sending all processes the TERM signal... [ OK ] Sending all processes the KILL signal... [ OK ] Saving random seed: [ OK ] Syncing hardware clock to system time [ OK ] Turning off quotas: quotaoff: Cannot change state of GFS2 quota. quotaoff: Cannot change state of GFS2 quota. [FAILED] Unmounting file systems: [ OK ] init: Re-executing /sbin/init Halting system... osr(notice) Scanning for Bootparameters... osr(notice) Starting ATIX exitrd osr(notice) Comoonics-Release osr(notice) comoonics Community Release 5.0 (Gumpn) osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 15:09:53 $ osr(debug) Calling cmd /sbin/halt -d -p osr(notice) Preparing chrootcp: cannot stat `/mnt/newroot/dev/initctl': No such file or directory [ OK ] osr(notice) com-realhalt: detected distribution: rhel6, clutype: gfs, rootfs: gfs2 osr(notice) Restarting init process in chroot[ OK ] osr(notice) Moving dev filesystem[ OK ] osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys /mnt/newroot/proc) osr(notice) Umounting /mnt/newroot/sys[ OK ] osr(notice) Umounting /mnt/newroot/proc[ OK ] osr(notice) Umounting filesystems in oldroot (/mnt/newroot/var/run /mnt/newroot/var/lock /mnt/newroot/.cdsl.local) osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing /sbin/init [ OK ] osr(notice) Umounting /mnt/newroot/var/lock[ OK ] osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] osr(notice) Umounting oldroot /mnt/newroot[ OK ] osr(notice) Breakpoint "halt_umountoldroot" detected forking a shell bash: no job control in this shell Type help to get more information.. Type exit to continue work.. 
------------------------------------------------------------- comoonics 1 > cman_tool: unknown option cman_tool comoonics 2 > comoonics 2 > Version: 6.2.0 Config Version: 1 Cluster Name: ProdCluster01 Cluster Id: 11454 Cluster Member: Yes Cluster Generation: 4 Membership state: Cluster-Member Nodes: 1 Expected votes: 4 Quorum device votes: 3 Total votes: 4 Node votes: 1 Quorum: 3 Active subsystems: 10 Flags: Ports Bound: 0 11 178 Node name: smc01b Node ID: 2 Multicast addresses: 229.192.0.2 Node addresses: 172.17.62.12 comoonics 3 > fence domain member count 1 victim count 0 victim now 0 master nodeid 2 wait state none members 2 dlm lockspaces name clvmd id 0x4104eefa flags 0x00000000 change member 1 joined 1 remove 0 failed 0 seq 1,1 members 2 comoonics 4 > bash: exitt: command not found comoonics 5 > exit osr(notice) Back to work.. Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active It hung at the point above - so I re-ran with the edit set -x in line 207. 1 -node: [root@bwccs302 ~]# init 0 [root@bwccs302 ~ Can't connect to default. Skipping. Shutting down Cluster Module - cluster monitor: [ OK ] Shutting down ricci: [ OK ] Shutting down Avahi daemon: [ OK ] Shutting down oddjobd: [ OK ] Stopping saslauthd: [ OK ] Stopping sshd: [ OK ] Shutting down sm-client: [ OK ] Shutting down sendmail: [ OK ] Stopping imsd via sshd: [ OK ] Stopping snmpd: [ OK ] Stopping crond: [ OK ] Stopping HAL daemon: [ OK ] Shutting down ntpd: [ OK ] Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" n ow active [ OK ] Signaling clvmd to exit [ OK ] clvmd terminated[ OK ] Stopping lldpad: [ OK ] Stopping system message bus: [ OK ] Stopping multipathd daemon: [ OK ] Stopping rpcbind: [ OK ] Stopping auditd: [ OK ] Stopping nslcd: [ OK ] Shutting down system logger: [ OK ] Stopping sssd: [ OK ] Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ] Stopping gfs2 dependent services Starting clvmd: Activating VG(s): 1 logical volume(s) in volume group "vg_osroot" now active 2 logical volume(s) in volume group "VG_SDATA" now active [ OK ] osr(notice) ..bindmounts.. [ OK ] Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume group "vg_ osroot" unmonitored [ OK ] Sending all processes the TERM signal... [ OK ] Sending all processes the KILL signal... [ OK ] Saving random seed: [ OK ] Syncing hardware clock to system time [ OK ] Turning off quotas: quotaoff: Cannot change state of GFS2 quota. quotaoff: Cannot change state of GFS2 quota. [FAILED] Unmounting file systems: [ OK ] init: Re-executing /sbin/init Halting system... osr(notice) Scanning for Bootparameters... 
osr(notice) Starting ATIX exitrd osr(notice) Comoonics-Release osr(notice) comoonics Community Release 5.0 (Gumpn) osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 15:09:53 $ osr(notice) Preparing chrootcp: cannot stat `/mnt/newroot/dev/initctl': No such file or directory [ OK ] osr(notice) com-realhalt: detected distribution: rhel6, clutype: gfs, rootfs: gfs2 osr(notice) Restarting init process in chroot[ OK ] osr(notice) Moving dev filesystem[ OK ] osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys /mnt/newroot/proc) osr(notice) Umounting /mnt/newroot/sys[ OK ] osr(notice) Umounting /mnt/newroot/proc[ OK ] osr(notice) Umounting filesystems in oldroot (/mnt/newroot/var/run /mnt/newroot/var/lock /mnt/newroot/.cdsl.local) osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing /sbin/init [ OK ] osr(notice) Umounting /mnt/newroot/var/lock[ OK ] osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] osr(notice) Umounting oldroot /mnt/newroot[ OK ] + clusterfs_services_stop '' '' 0 ++ repository_get_value rootfs +++ repository_normalize_value rootfs ++ local key=rootfs ++ local default= ++ local repository= ++ '[' -z '' ']' ++ repository=comoonics ++ local value= ++ '[' -f /var/cache/comoonics-repository/comoonics.rootfs ']' +++ cat /var/cache/comoonics-repository/comoonics.rootfs ++ value=gfs2 ++ echo gfs2 ++ return 0 + local rootfs=gfs2 + gfs2_services_stop '' '' 0 + local chroot_path= + local lock_method= + local lvm_sup=0 + '[' -n 0 ']' + '[' 0 -eq 0 ']' + /etc/init.d/clvmd stop Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active with 2 nodes + quorate when init 6 is issued: [root@bwccs304 ~]# init 6 [root@bwccs304 ~ Can't connect to default. Skipping. Shutting down Cluster Module - cluster monitor: [ OK ] Shutting down ricci: [ OK ] Shutting down Avahi daemon: [ OK ] Shutting down oddjobd: [ OK ] Stopping saslauthd: [ OK ] Stopping sshd: [ OK ] Shutting down sm-client: [ OK ] Shutting down sendmail: [ OK ] Stopping imsd via sshd: [ OK ] Stopping snmpd: [ OK ] Stopping crond: [ OK ] Stopping HAL daemon: [ OK ] Shutting down ntpd: [ OK ] Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active [ OK ] Signaling clvmd to exit [ OK ] clvmd terminated[ OK ] Stopping lldpad: [ OK ] Stopping system message bus: [ OK ] Stopping multipathd daemon: [ OK ] Stopping rpcbind: [ OK ] Stopping auditd: [ OK ] Stopping nslcd: [ OK ] Shutting down system logger: [ OK ] Stopping sssd: [ OK ] Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ] Stopping gfs2 dependent services Starting clvmd: Activating VG(s): 1 logical volume(s) in volume group "vg_osroot" now active 2 logical volume(s) in volume group "VG_SDATA" now active [ OK ] osr(notice) ..bindmounts.. [ OK ] Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume group "vg_osroot" unmonitored [ OK ] Sending all processes the TERM signal... [ OK ] qdiskd[15713]: Unregistering quorum device. Sending all processes the KILL signal... dlm: clvmd: no userland control daemon, stopping lockspace dlm: OSRoot: no userland control daemon, stopping lockspace [ OK ] - stops here and will not die... Still have full cluster coms Thanks jorge On Tue, Nov 13, 2012 at 9:32 AM, Marc Grimme <gr...@at...> wrote: > Hi Jorge, > because of the "init 0". > Please issue the following commands prior to init 0. 
> # Make it a little more chatty > $ com-chroot setparameter debug > # Break after before cluster will be stopped > $ com-chroot setparameter step halt_umountoldroot > > Then issue a init 0. > This should lead you to a breakpoint during shutdown (hopefully, cause > sometimes the console gets confused). > In side the breakpoint issue: > $ cman_tool status > $ cman_tool services > # Continue shutdown > $ exit > Then send me the output. > > If this fails also do as follows: > $ com-chroot vi com-realhalt.sh > # go to line 207 (before clusterfs_services_stop) is called and add a set > -x > $ init 0 > > Send the output. > Thanks Marc. > > ----- Original Message ----- > From: "Jorge Silva" <me...@je...> > To: "Marc Grimme" <gr...@at...> > Cc: ope...@li... > Sent: Tuesday, November 13, 2012 3:22:37 PM > Subject: Re: Problem with VG activation clvmd runs at 100% > > Marc > > > Hi, thanks for the info, it helps. I have also noticed that gfs2 entries > in the fstab get ignored on boot, I have added in rc.local. I have done a > bit more digging and the issue I described below: > > > "I am still a bit stuck when nodes with gfs2 mounted don't restart if > instructed to do so, but I will read some more." > > > If I issue a init 6 on a nodes they will restart. If I issue init 0, then > I have the problem the node start to shut down, but will stay in the > cluster. I have to shut it off, it will not shut down, this is the log. > > > > [root@bwccs304 ~]# init 0 > > > Can't connect to default. Skipping. > Shutting down Cluster Module - cluster monitor: [ OK ] > Shutting down ricci: [ OK ] > Shutting down oddjobd: [ OK ] > Stopping saslauthd: [ OK ] > Stopping sshd: [ OK ] > Shutting down sm-client: [ OK ] > Shutting down sendmail: [ OK ] > Stopping imsd via sshd: [ OK ] > Stopping snmpd: [ OK ] > Stopping crond: [ OK ] > Stopping HAL daemon: [ OK ] > Stopping nscd: [ OK ] > Shutting down ntpd: [ OK ] > Deactivating clustered VG(s): 0 logical volume(s) in volume group > "VG_SDATA" now active > [ OK ] > Signaling clvmd to exit [ OK ] > clvmd terminated[ OK ] > Stopping lldpad: [ OK ] > Stopping system message bus: [ OK ] > Stopping multipathd daemon: [ OK ] > Stopping rpcbind: [ OK ] > Stopping auditd: [ OK ] > Stopping nslcd: [ OK ] > Shutting down system logger: [ OK ] > Stopping sssd: [ OK ] > Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ] > Stopping gfs2 dependent services Starting clvmd: > Activating VG(s): 2 logical volume(s) in volume group "VG_SDATA" now active > 1 logical volume(s) in volume group "vg_osroot" now active > [ OK ] > osr(notice) ..bindmounts.. [ OK ] > Stopping monitoring for VG VG_SDATA: 1 logical volume(s) in volume group > "VG_SDATA" unmonitored > [ OK ] > Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume group > "vg_osroot" unmonitored > [ OK ] > Sending all processes the TERM signal... [ OK ] > Sending all processes the KILL signal... [ OK ] > Saving random seed: [ OK ] > Syncing hardware clock to system time [ OK ] > Turning off quotas: quotaoff: Cannot change state of GFS2 quota. > quotaoff: Cannot change state of GFS2 quota. > [FAILED] > Unmounting file systems: [ OK ] > init: Re-executing /sbin/init > Halting system... > osr(notice) Scanning for Bootparameters... 
> osr(notice) Starting ATIX exitrd > osr(notice) Comoonics-Release > osr(notice) comoonics Community Release 5.0 (Gumpn) > osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 15:09:53 $ > osr(notice) Preparing chrootcp: cannot stat `/mnt/newroot/dev/initctl': No > such file or directory > [ OK ] > osr(notice) com-realhalt: detected distribution: rhel6, clutype: gfs, > rootfs: gfs2 > osr(notice) Restarting init process in chroot[ OK ] > osr(notice) Moving dev filesystem[ OK ] > osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys > /mnt/newroot/proc) > osr(notice) Umounting /mnt/newroot/sys[ OK ] > osr(notice) Umounting /mnt/newroot/proc[ OK ] > osr(notice) Umounting filesystems in oldroot (/mnt/newroot/var/run > /mnt/newroot/var/lock /mnt/newroot/.cdsl.local) > osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing /sbin/init > [ OK ] > osr(notice) Umounting /mnt/newroot/var/lock[ OK ] > osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] > osr(notice) Umounting oldroot /mnt/newroot[ OK ] > Deactivating clustered VG(s): 0 logical volume(s) in volume group > "VG_SDATA" now active > > > > > > On Tue, Nov 13, 2012 at 2:43 AM, Marc Grimme < gr...@at... > wrote: > > > Jorge, > you don't need to be doubtful about the fact that the volume group for the > root file system is not flagged as clustered. This has no implications > whatsoever on the gfs2 file system. > > It will only be a problem whenever the lvm settings of the vg_osroot > change (size, number of lvs etc.). > > Nevertheless while thinking about your problem I think I had the idea on > how to fix this problem on being able to have the root vg clustered also. I > will provide new packages in the next days that should deal with the > problem. > > Keep in mind that there is a difference between cman_tool services and the > lvm usage. > clvmd only uses the locktable clvmd shown by cman_tool services and the > other locktables are relevant to the file systems and other services > (fenced, rgmanager, ..). This is a complete different use case. > > Try to elaborate a bit more on the fact > > "I am still a bit stuck when nodes with gfs2 mounted don't restart if > instructed to do so, but I will read some more." > What do you mean with it? How does this happen? This sounds like something > you should have a look at. > > > "Once thing that I can confirm is > osr(notice): Detecting nodeid & nodename > This does not always display the correct info, but it doesn't seem to be a > problem either ?" > > You should always look at the nodeid the nodename is (more or less) only > descriptive and might not be set as expected. But the nodeid should always > be consistent. Does this help? > > About your notes (I only take the relevant ones): > > 1. osr(notice): Creating clusterfiles /var/run/cman_admin > /var/run/cman_client.. [OK] > This message should not be misleading but only tells the these control > files are being created inside the ramdisk. This has nothing to do with > these files on your root file system. Nevertheless /etc/init.d/bootsr > should take over this part and create the files. Please send me another > bash -x /etc/init.d/bootsr start > output. Please when those files are not existant. > > 2. vgs > > VG #PV #LV #SN Attr VSize VFree > VG_SDATA 1 2 0 wz--nc 1000.00g 0 > vg_osroot 1 1 0 wz--n- 60.00g 0 > > This is perfectly ok. This only means the vg is not clustered. But the > filesystem IS. This does not have any connection. > > Hope this helps. > Let me know about the open issues. > > Regards > > Marc. 
> > > ----- Original Message ----- > From: "Jorge Silva" < me...@je... > > To: "Marc Grimme" < gr...@at... > > > Sent: Tuesday, November 13, 2012 2:15:23 AM > Subject: Re: Problem with VG activation clvmd runs at 100% > > > Marc > > > Hi - I believe I have solved my problem, with your help, thank you. Yet, > I'm not sure how I caused it - but the root volume group as you pointed out > had the clustered attribute(and I had to have done something silly along > the way). I re-installed from scratch see notes below and then just to > prove that is a problem, I changed the attribute of the rootfs- vgchange > -cy and rebooted and I ran into trouble, I changed it back and it is fine > so that does cause problems on start-up, I'm not sure I understand why as > there is an active quorum for the clvm to join and take part.. > > > Despite it not being marked as a cluster volume cman_tool services show it > as being, but clvmd status doesn't ? Is it safe to write to it with > multiple nodes mounted? > > > I am still a bit stuck when nodes with gfs2 mounted don't restart if > instructed to do so, but I will read some more. > > > > > Once thing that I can confirm is > osr(notice): Detecting nodeid & nodename > > > This does not always display the correct info, but it doesn't seem to be a > problem either ? > > > > > Thanks > Jorge > > > Notes: > I decided to start from scratch and I blew away the rootfs and started > from scratch as per the website. My assumption - that I edited something > and messed it up (I did look at a lot of the scripts to try to "figure out > and fix" the problem, I can send the history if you want or I can edit and > contribute). > > > I rebooted the server and I had an issue - I didn't disable selinux so I > had to intervene in the boot stage. That completed, but I noticed that : > > > > osr(notice): Starting network configuration for lo0 [OK] > osr(notice): Detecting nodeid & nodename > > > Is blank, but somehow the correct nodeid and name was deduced. > > > I had to rebuild the ram disk to fix the selinux disabled. I also added > the following > > yum install pciutils - the mkinitrd warned about this so, I installed it. > I also installed : > yum install cluster-snmp > yum install rgmanager > in lvm > > > On this reboot I noticed that despite this message > > sr(notice): Creating clusterfiles /var/run/cman_admin > /var/run/cman_client.. [OK] > > > Starting clvmd: dlm: Using TCP for communications > > > Activating VG(s): File descriptor 3 (/dev/console) leaked on vgchange > invocation. Parent PID 15995: /bin/bash > File descriptor 4 (/dev/console) leaked on vgchange invocation. Parent PID > 15995: /bin/bash > Skipping clustered volume group VG_SDATA > 1 logical volume(s) in volume group "vg_osroot" now active > > > the links weren't created and I did this manually > > > > ln -sf /var/comoonics/chroot//var/run/cman_admin /var/run/cman_admin > ln -sf /var/comoonics/chroot//var/run/cman_client /var/run/cman_client > > > I could then get clusterstatus etc, and clvmd was running ok > > > I looked in /etc/lvm/lvm.conf and locking_type = 4 ? > > > I then issued > > > lvmconf --enable cluster - and this changed /etc/lvm/lvm.conf locking_type > = 3. > > > vgscan correctly showed up clusterd volumes and was working ok. > > > > > I did not rebuild the ramdisk (I can confirm that the lvm .conf in the > ramdisk has locking_type=4) I have rebooted and everything is working. 
> > Starting clvmd: dlm: Using TCP for communications > > > Activating VG(s): File descriptor 3 (/dev/console) leaked on vgchange > invocation. Parent PID 15983: /bin/bash > File descriptor 4 (/dev/console) leaked on vgchange invocation. Parent PID > 15983: /bin/bash > Skipping clustered volume group VG_SDATA > 1 logical volume(s) in volume group "vg_osroot" now active > > > > > > > I have rebooted a number of times and am confident that things are ok, > > > I decided to add two other nodes to the mix and I can confirm that > everytime a new node is added these files are missing : > > > /var/run/cman_admin > /var/run/cman_client > But I can see from the logs: > > > > osr(notice): Creating clusterfiles /var/run/cman_admin > /var/run/cman_client.. [OK] > > > despite the above message, also, the information below is not always > detected, but still the nodeid etc is correct... > > > osr(notice): Detecting nodeid & nodename > > > > > So now I have 3 nodes in the cluster and things look ok: > > > > [root@bwccs302 ~]# cman_tool services > fence domain > member count 3 > victim count 0 > victim now 0 > master nodeid 2 > wait state none > members 2 3 4 > > > dlm lockspaces > name home > id 0xf8ee17aa > flags 0x00000008 fs_reg > change member 3 joined 1 remove 0 failed 0 seq 3,3 > members 2 3 4 > > > name clvmd > id 0x4104eefa > flags 0x00000000 > change member 3 joined 1 remove 0 failed 0 seq 15,15 > members 2 3 4 > > > name OSRoot > id 0xab5404ad > flags 0x00000008 fs_reg > change member 3 joined 1 remove 0 failed 0 seq 7,7 > members 2 3 4 > > > gfs mountgroups > name home > id 0x686e3fc4 > flags 0x00000048 mounted > change member 3 joined 1 remove 0 failed 0 seq 3,3 > members 2 3 4 > > > name OSRoot > id 0x659f7afe > flags 0x00000048 mounted > change member 3 joined 1 remove 0 failed 0 seq 7,7 > members 2 3 4 > > > > service clvmd status > clvmd (pid 25771) is running... > Clustered Volume Groups: VG_SDATA > Active clustered Logical Volumes: LV_HOME LV_DEVDB > > > it doesn't believe that the root file-system is clustered despite the > output from the above. > > > > [root@bwccs302 ~]# vgs > VG #PV #LV #SN Attr VSize VFree > VG_SDATA 1 2 0 wz--nc 1000.00g 0 > vg_osroot 1 1 0 wz--n- 60.00g 0 > > > The above got me thinking on what you wanted me to do to diable the > clusterd flag on the root volume - with it left on I was having problems > (not sure how it got turned) on. > > > With everything working ok, I remade ramdisk and now lvm.conf=3.. > > > The systems start up and things look ok. > > |
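The residual cluster traffic Jorge reports above from the node that refuses to shut down can be confirmed with a plain tcpdump on the cluster interface. A minimal sketch, assuming eth0 is the cluster interface; 229.192.0.2 is the multicast address reported by cman_tool status in the breakpoint session above.

# Show residual cman/corosync multicast traffic from a node that should be down
tcpdump -n -i eth0 udp and host 229.192.0.2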
From: Jorge S. <me...@je...> - 2012-11-13 14:50:53
|
Marc Hi, thanks for the info, it helps. I have also noticed that gfs2 entries in the fstab get ignored on boot, I have added in rc.local. I have done a bit more digging and the issue I described below: "I am still a bit stuck when nodes with gfs2 mounted don't restart if instructed to do so, but I will read some more." If I issue a init 6 on a nodes they will restart. If I issue init 0, then I have the problem the node start to shut down, but will stay in the cluster. I have to shut it off, it will not shut down, this is the log. [root@bwccs304 ~]# init 0 Can't connect to default. Skipping. Shutting down Cluster Module - cluster monitor: [ OK ] Shutting down ricci: [ OK ] Shutting down oddjobd: [ OK ] Stopping saslauthd: [ OK ] Stopping sshd: [ OK ] Shutting down sm-client: [ OK ] Shutting down sendmail: [ OK ] Stopping imsd via sshd: [ OK ] Stopping snmpd: [ OK ] Stopping crond: [ OK ] Stopping HAL daemon: [ OK ] Stopping nscd: [ OK ] Shutting down ntpd: [ OK ] Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active [ OK ] Signaling clvmd to exit [ OK ] clvmd terminated[ OK ] Stopping lldpad: [ OK ] Stopping system message bus: [ OK ] Stopping multipathd daemon: [ OK ] Stopping rpcbind: [ OK ] Stopping auditd: [ OK ] Stopping nslcd: [ OK ] Shutting down system logger: [ OK ] Stopping sssd: [ OK ] Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ] Stopping gfs2 dependent services Starting clvmd: Activating VG(s): 2 logical volume(s) in volume group "VG_SDATA" now active 1 logical volume(s) in volume group "vg_osroot" now active [ OK ] osr(notice) ..bindmounts.. [ OK ] Stopping monitoring for VG VG_SDATA: 1 logical volume(s) in volume group "VG_SDATA" unmonitored [ OK ] Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume group "vg_osroot" unmonitored [ OK ] Sending all processes the TERM signal... [ OK ] Sending all processes the KILL signal... [ OK ] Saving random seed: [ OK ] Syncing hardware clock to system time [ OK ] Turning off quotas: quotaoff: Cannot change state of GFS2 quota. quotaoff: Cannot change state of GFS2 quota. [FAILED] Unmounting file systems: [ OK ] init: Re-executing /sbin/init Halting system... osr(notice) Scanning for Bootparameters... osr(notice) Starting ATIX exitrd osr(notice) Comoonics-Release osr(notice) comoonics Community Release 5.0 (Gumpn) osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 15:09:53 $ osr(notice) Preparing chrootcp: cannot stat `/mnt/newroot/dev/initctl': No such file or directory [ OK ] osr(notice) com-realhalt: detected distribution: rhel6, clutype: gfs, rootfs: gfs2 osr(notice) Restarting init process in chroot[ OK ] osr(notice) Moving dev filesystem[ OK ] osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys /mnt/newroot/proc) osr(notice) Umounting /mnt/newroot/sys[ OK ] osr(notice) Umounting /mnt/newroot/proc[ OK ] osr(notice) Umounting filesystems in oldroot (/mnt/newroot/var/run /mnt/newroot/var/lock /mnt/newroot/.cdsl.local) osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing /sbin/init [ OK ] osr(notice) Umounting /mnt/newroot/var/lock[ OK ] osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] osr(notice) Umounting oldroot /mnt/newroot[ OK ] Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active On Tue, Nov 13, 2012 at 2:43 AM, Marc Grimme <gr...@at...> wrote: > Jorge, > you don't need to be doubtful about the fact that the volume group for the > root file system is not flagged as clustered. 
This has no implications > whatsoever on the gfs2 file system. > > It will only be a problem whenever the lvm settings of the vg_osroot > change (size, number of lvs etc.). > > Nevertheless while thinking about your problem I think I had the idea on > how to fix this problem on being able to have the root vg clustered also. I > will provide new packages in the next days that should deal with the > problem. > > Keep in mind that there is a difference between cman_tool services and the > lvm usage. > clvmd only uses the locktable clvmd shown by cman_tool services and the > other locktables are relevant to the file systems and other services > (fenced, rgmanager, ..). This is a complete different use case. > > Try to elaborate a bit more on the fact > "I am still a bit stuck when nodes with gfs2 mounted don't restart if > instructed to do so, but I will read some more." > What do you mean with it? How does this happen? This sounds like something > you should have a look at. > > "Once thing that I can confirm is > osr(notice): Detecting nodeid & nodename > This does not always display the correct info, but it doesn't seem to be a > problem either ?" > > You should always look at the nodeid the nodename is (more or less) only > descriptive and might not be set as expected. But the nodeid should always > be consistent. Does this help? > > About your notes (I only take the relevant ones): > > 1. osr(notice): Creating clusterfiles /var/run/cman_admin > /var/run/cman_client.. [OK] > This message should not be misleading but only tells the these control > files are being created inside the ramdisk. This has nothing to do with > these files on your root file system. Nevertheless /etc/init.d/bootsr > should take over this part and create the files. Please send me another > bash -x /etc/init.d/bootsr start > output. Please when those files are not existant. > > 2. vgs > VG #PV #LV #SN Attr VSize VFree > VG_SDATA 1 2 0 wz--nc 1000.00g 0 > vg_osroot 1 1 0 wz--n- 60.00g 0 > > This is perfectly ok. This only means the vg is not clustered. But the > filesystem IS. This does not have any connection. > > Hope this helps. > Let me know about the open issues. > > Regards > Marc. > > > ----- Original Message ----- > From: "Jorge Silva" <me...@je...> > To: "Marc Grimme" <gr...@at...> > Sent: Tuesday, November 13, 2012 2:15:23 AM > Subject: Re: Problem with VG activation clvmd runs at 100% > > Marc > > > Hi - I believe I have solved my problem, with your help, thank you. Yet, > I'm not sure how I caused it - but the root volume group as you pointed out > had the clustered attribute(and I had to have done something silly along > the way). I re-installed from scratch see notes below and then just to > prove that is a problem, I changed the attribute of the rootfs- vgchange > -cy and rebooted and I ran into trouble, I changed it back and it is fine > so that does cause problems on start-up, I'm not sure I understand why as > there is an active quorum for the clvm to join and take part.. > > > Despite it not being marked as a cluster volume cman_tool services show it > as being, but clvmd status doesn't ? Is it safe to write to it with > multiple nodes mounted? > > > I am still a bit stuck when nodes with gfs2 mounted don't restart if > instructed to do so, but I will read some more. > > > Once thing that I can confirm is > osr(notice): Detecting nodeid & nodename > > > This does not always display the correct info, but it doesn't seem to be a > problem either ? 
> > > > > Thanks > Jorge > > > Notes: > I decided to start from scratch and I blew away the rootfs and started > from scratch as per the website. My assumption - that I edited something > and messed it up (I did look at a lot of the scripts to try to "figure out > and fix" the problem, I can send the history if you want or I can edit and > contribute). > > > I rebooted the server and I had an issue - I didn't disable selinux so I > had to intervene in the boot stage. That completed, but I noticed that : > > > > osr(notice): Starting network configuration for lo0 [OK] > osr(notice): Detecting nodeid & nodename > > > Is blank, but somehow the correct nodeid and name was deduced. > > > I had to rebuild the ram disk to fix the selinux disabled. I also added > the following > > yum install pciutils - the mkinitrd warned about this so, I installed it. > I also installed : > yum install cluster-snmp > yum install rgmanager > in lvm > > > On this reboot I noticed that despite this message > > sr(notice): Creating clusterfiles /var/run/cman_admin > /var/run/cman_client.. [OK] > > > Starting clvmd: dlm: Using TCP for communications > > > Activating VG(s): File descriptor 3 (/dev/console) leaked on vgchange > invocation. Parent PID 15995: /bin/bash > File descriptor 4 (/dev/console) leaked on vgchange invocation. Parent PID > 15995: /bin/bash > Skipping clustered volume group VG_SDATA > 1 logical volume(s) in volume group "vg_osroot" now active > > > the links weren't created and I did this manually > > > > ln -sf /var/comoonics/chroot//var/run/cman_admin /var/run/cman_admin > ln -sf /var/comoonics/chroot//var/run/cman_client /var/run/cman_client > > > I could then get clusterstatus etc, and clvmd was running ok > > > I looked in /etc/lvm/lvm.conf and locking_type = 4 ? > > > I then issued > > > lvmconf --enable cluster - and this changed /etc/lvm/lvm.conf locking_type > = 3. > > > vgscan correctly showed up clusterd volumes and was working ok. > > > > > I did not rebuild the ramdisk (I can confirm that the lvm .conf in the > ramdisk has locking_type=4) I have rebooted and everything is working. > > Starting clvmd: dlm: Using TCP for communications > > > Activating VG(s): File descriptor 3 (/dev/console) leaked on vgchange > invocation. Parent PID 15983: /bin/bash > File descriptor 4 (/dev/console) leaked on vgchange invocation. Parent PID > 15983: /bin/bash > Skipping clustered volume group VG_SDATA > 1 logical volume(s) in volume group "vg_osroot" now active > > > > > > > I have rebooted a number of times and am confident that things are ok, > > > I decided to add two other nodes to the mix and I can confirm that > everytime a new node is added these files are missing : > > > /var/run/cman_admin > /var/run/cman_client > But I can see from the logs: > > > > osr(notice): Creating clusterfiles /var/run/cman_admin > /var/run/cman_client.. [OK] > > > despite the above message, also, the information below is not always > detected, but still the nodeid etc is correct... 
> > > osr(notice): Detecting nodeid & nodename > > > > > So now I have 3 nodes in the cluster and things look ok: > > > > [root@bwccs302 ~]# cman_tool services > fence domain > member count 3 > victim count 0 > victim now 0 > master nodeid 2 > wait state none > members 2 3 4 > > > dlm lockspaces > name home > id 0xf8ee17aa > flags 0x00000008 fs_reg > change member 3 joined 1 remove 0 failed 0 seq 3,3 > members 2 3 4 > > > name clvmd > id 0x4104eefa > flags 0x00000000 > change member 3 joined 1 remove 0 failed 0 seq 15,15 > members 2 3 4 > > > name OSRoot > id 0xab5404ad > flags 0x00000008 fs_reg > change member 3 joined 1 remove 0 failed 0 seq 7,7 > members 2 3 4 > > > gfs mountgroups > name home > id 0x686e3fc4 > flags 0x00000048 mounted > change member 3 joined 1 remove 0 failed 0 seq 3,3 > members 2 3 4 > > > name OSRoot > id 0x659f7afe > flags 0x00000048 mounted > change member 3 joined 1 remove 0 failed 0 seq 7,7 > members 2 3 4 > > > > service clvmd status > clvmd (pid 25771) is running... > Clustered Volume Groups: VG_SDATA > Active clustered Logical Volumes: LV_HOME LV_DEVDB > > > it doesn't believe that the root file-system is clustered despite the > output from the above. > > > > [root@bwccs302 ~]# vgs > VG #PV #LV #SN Attr VSize VFree > VG_SDATA 1 2 0 wz--nc 1000.00g 0 > vg_osroot 1 1 0 wz--n- 60.00g 0 > > > The above got me thinking on what you wanted me to do to diable the > clusterd flag on the root volume - with it left on I was having problems > (not sure how it got turned) on. > > > With everything working ok, I remade ramdisk and now lvm.conf=3.. > > > The systems start up and things look ok. > |
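Jorge mentions above that gfs2 entries in /etc/fstab are ignored at boot and that he mounts them from rc.local instead. A minimal sketch of such a workaround, under the assumption that this is roughly what was added; the device path and mount point are illustrative only, not taken from this cluster.

# Example /etc/fstab entry (illustrative device and mount point):
#   /dev/VG_SDATA/LV_HOME  /home  gfs2  defaults,noatime  0 0
# Appended to /etc/rc.d/rc.local to pick up gfs2 entries skipped at boot:
mount -a -t gfs2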
From: Marc G. <gr...@at...> - 2012-11-13 14:32:49
|
Hi Jorge, because of the "init 0". Please issue the following commands prior to init 0. # Make it a little more chatty $ com-chroot setparameter debug # Break after before cluster will be stopped $ com-chroot setparameter step halt_umountoldroot Then issue a init 0. This should lead you to a breakpoint during shutdown (hopefully, cause sometimes the console gets confused). In side the breakpoint issue: $ cman_tool status $ cman_tool services # Continue shutdown $ exit Then send me the output. If this fails also do as follows: $ com-chroot vi com-realhalt.sh # go to line 207 (before clusterfs_services_stop) is called and add a set -x $ init 0 Send the output. Thanks Marc. ----- Original Message ----- From: "Jorge Silva" <me...@je...> To: "Marc Grimme" <gr...@at...> Cc: ope...@li... Sent: Tuesday, November 13, 2012 3:22:37 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, thanks for the info, it helps. I have also noticed that gfs2 entries in the fstab get ignored on boot, I have added in rc.local. I have done a bit more digging and the issue I described below: "I am still a bit stuck when nodes with gfs2 mounted don't restart if instructed to do so, but I will read some more." If I issue a init 6 on a nodes they will restart. If I issue init 0, then I have the problem the node start to shut down, but will stay in the cluster. I have to shut it off, it will not shut down, this is the log. [root@bwccs304 ~]# init 0 Can't connect to default. Skipping. Shutting down Cluster Module - cluster monitor: [ OK ] Shutting down ricci: [ OK ] Shutting down oddjobd: [ OK ] Stopping saslauthd: [ OK ] Stopping sshd: [ OK ] Shutting down sm-client: [ OK ] Shutting down sendmail: [ OK ] Stopping imsd via sshd: [ OK ] Stopping snmpd: [ OK ] Stopping crond: [ OK ] Stopping HAL daemon: [ OK ] Stopping nscd: [ OK ] Shutting down ntpd: [ OK ] Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active [ OK ] Signaling clvmd to exit [ OK ] clvmd terminated[ OK ] Stopping lldpad: [ OK ] Stopping system message bus: [ OK ] Stopping multipathd daemon: [ OK ] Stopping rpcbind: [ OK ] Stopping auditd: [ OK ] Stopping nslcd: [ OK ] Shutting down system logger: [ OK ] Stopping sssd: [ OK ] Stopping gfs dependent services osr(notice) ..bindmounts.. [ OK ] Stopping gfs2 dependent services Starting clvmd: Activating VG(s): 2 logical volume(s) in volume group "VG_SDATA" now active 1 logical volume(s) in volume group "vg_osroot" now active [ OK ] osr(notice) ..bindmounts.. [ OK ] Stopping monitoring for VG VG_SDATA: 1 logical volume(s) in volume group "VG_SDATA" unmonitored [ OK ] Stopping monitoring for VG vg_osroot: 1 logical volume(s) in volume group "vg_osroot" unmonitored [ OK ] Sending all processes the TERM signal... [ OK ] Sending all processes the KILL signal... [ OK ] Saving random seed: [ OK ] Syncing hardware clock to system time [ OK ] Turning off quotas: quotaoff: Cannot change state of GFS2 quota. quotaoff: Cannot change state of GFS2 quota. [FAILED] Unmounting file systems: [ OK ] init: Re-executing /sbin/init Halting system... osr(notice) Scanning for Bootparameters... 
osr(notice) Starting ATIX exitrd osr(notice) Comoonics-Release osr(notice) comoonics Community Release 5.0 (Gumpn) osr(notice) Internal Version $Revision: 1.18 $ $Date: 2011-02-11 15:09:53 $ osr(notice) Preparing chrootcp: cannot stat `/mnt/newroot/dev/initctl': No such file or directory [ OK ] osr(notice) com-realhalt: detected distribution: rhel6, clutype: gfs, rootfs: gfs2 osr(notice) Restarting init process in chroot[ OK ] osr(notice) Moving dev filesystem[ OK ] osr(notice) Umounting filesystems in oldroot ( /mnt/newroot/sys /mnt/newroot/proc) osr(notice) Umounting /mnt/newroot/sys[ OK ] osr(notice) Umounting /mnt/newroot/proc[ OK ] osr(notice) Umounting filesystems in oldroot (/mnt/newroot/var/run /mnt/newroot/var/lock /mnt/newroot/.cdsl.local) osr(notice) Umounting /mnt/newroot/var/runinit: Re-executing /sbin/init [ OK ] osr(notice) Umounting /mnt/newroot/var/lock[ OK ] osr(notice) Umounting /mnt/newroot/.cdsl.local[ OK ] osr(notice) Umounting oldroot /mnt/newroot[ OK ] Deactivating clustered VG(s): 0 logical volume(s) in volume group "VG_SDATA" now active On Tue, Nov 13, 2012 at 2:43 AM, Marc Grimme < gr...@at... > wrote: Jorge, you don't need to be doubtful about the fact that the volume group for the root file system is not flagged as clustered. This has no implications whatsoever on the gfs2 file system. It will only be a problem whenever the lvm settings of the vg_osroot change (size, number of lvs etc.). Nevertheless while thinking about your problem I think I had the idea on how to fix this problem on being able to have the root vg clustered also. I will provide new packages in the next days that should deal with the problem. Keep in mind that there is a difference between cman_tool services and the lvm usage. clvmd only uses the locktable clvmd shown by cman_tool services and the other locktables are relevant to the file systems and other services (fenced, rgmanager, ..). This is a complete different use case. Try to elaborate a bit more on the fact "I am still a bit stuck when nodes with gfs2 mounted don't restart if instructed to do so, but I will read some more." What do you mean with it? How does this happen? This sounds like something you should have a look at. "Once thing that I can confirm is osr(notice): Detecting nodeid & nodename This does not always display the correct info, but it doesn't seem to be a problem either ?" You should always look at the nodeid the nodename is (more or less) only descriptive and might not be set as expected. But the nodeid should always be consistent. Does this help? About your notes (I only take the relevant ones): 1. osr(notice): Creating clusterfiles /var/run/cman_admin /var/run/cman_client.. [OK] This message should not be misleading but only tells the these control files are being created inside the ramdisk. This has nothing to do with these files on your root file system. Nevertheless /etc/init.d/bootsr should take over this part and create the files. Please send me another bash -x /etc/init.d/bootsr start output. Please when those files are not existant. 2. vgs VG #PV #LV #SN Attr VSize VFree VG_SDATA 1 2 0 wz--nc 1000.00g 0 vg_osroot 1 1 0 wz--n- 60.00g 0 This is perfectly ok. This only means the vg is not clustered. But the filesystem IS. This does not have any connection. Hope this helps. Let me know about the open issues. Regards Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... 
> Sent: Tuesday, November 13, 2012 2:15:23 AM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi - I believe I have solved my problem, with your help, thank you. Yet, I'm not sure how I caused it - but the root volume group as you pointed out had the clustered attribute(and I had to have done something silly along the way). I re-installed from scratch see notes below and then just to prove that is a problem, I changed the attribute of the rootfs- vgchange -cy and rebooted and I ran into trouble, I changed it back and it is fine so that does cause problems on start-up, I'm not sure I understand why as there is an active quorum for the clvm to join and take part.. Despite it not being marked as a cluster volume cman_tool services show it as being, but clvmd status doesn't ? Is it safe to write to it with multiple nodes mounted? I am still a bit stuck when nodes with gfs2 mounted don't restart if instructed to do so, but I will read some more. Once thing that I can confirm is osr(notice): Detecting nodeid & nodename This does not always display the correct info, but it doesn't seem to be a problem either ? Thanks Jorge Notes: I decided to start from scratch and I blew away the rootfs and started from scratch as per the website. My assumption - that I edited something and messed it up (I did look at a lot of the scripts to try to "figure out and fix" the problem, I can send the history if you want or I can edit and contribute). I rebooted the server and I had an issue - I didn't disable selinux so I had to intervene in the boot stage. That completed, but I noticed that : osr(notice): Starting network configuration for lo0 [OK] osr(notice): Detecting nodeid & nodename Is blank, but somehow the correct nodeid and name was deduced. I had to rebuild the ram disk to fix the selinux disabled. I also added the following yum install pciutils - the mkinitrd warned about this so, I installed it. I also installed : yum install cluster-snmp yum install rgmanager in lvm On this reboot I noticed that despite this message sr(notice): Creating clusterfiles /var/run/cman_admin /var/run/cman_client.. [OK] Starting clvmd: dlm: Using TCP for communications Activating VG(s): File descriptor 3 (/dev/console) leaked on vgchange invocation. Parent PID 15995: /bin/bash File descriptor 4 (/dev/console) leaked on vgchange invocation. Parent PID 15995: /bin/bash Skipping clustered volume group VG_SDATA 1 logical volume(s) in volume group "vg_osroot" now active the links weren't created and I did this manually ln -sf /var/comoonics/chroot//var/run/cman_admin /var/run/cman_admin ln -sf /var/comoonics/chroot//var/run/cman_client /var/run/cman_client I could then get clusterstatus etc, and clvmd was running ok I looked in /etc/lvm/lvm.conf and locking_type = 4 ? I then issued lvmconf --enable cluster - and this changed /etc/lvm/lvm.conf locking_type = 3. vgscan correctly showed up clusterd volumes and was working ok. I did not rebuild the ramdisk (I can confirm that the lvm .conf in the ramdisk has locking_type=4) I have rebooted and everything is working. Starting clvmd: dlm: Using TCP for communications Activating VG(s): File descriptor 3 (/dev/console) leaked on vgchange invocation. Parent PID 15983: /bin/bash File descriptor 4 (/dev/console) leaked on vgchange invocation. 
Parent PID 15983: /bin/bash Skipping clustered volume group VG_SDATA 1 logical volume(s) in volume group "vg_osroot" now active I have rebooted a number of times and am confident that things are ok, I decided to add two other nodes to the mix and I can confirm that everytime a new node is added these files are missing : /var/run/cman_admin /var/run/cman_client But I can see from the logs: osr(notice): Creating clusterfiles /var/run/cman_admin /var/run/cman_client.. [OK] despite the above message, also, the information below is not always detected, but still the nodeid etc is correct... osr(notice): Detecting nodeid & nodename So now I have 3 nodes in the cluster and things look ok: [root@bwccs302 ~]# cman_tool services fence domain member count 3 victim count 0 victim now 0 master nodeid 2 wait state none members 2 3 4 dlm lockspaces name home id 0xf8ee17aa flags 0x00000008 fs_reg change member 3 joined 1 remove 0 failed 0 seq 3,3 members 2 3 4 name clvmd id 0x4104eefa flags 0x00000000 change member 3 joined 1 remove 0 failed 0 seq 15,15 members 2 3 4 name OSRoot id 0xab5404ad flags 0x00000008 fs_reg change member 3 joined 1 remove 0 failed 0 seq 7,7 members 2 3 4 gfs mountgroups name home id 0x686e3fc4 flags 0x00000048 mounted change member 3 joined 1 remove 0 failed 0 seq 3,3 members 2 3 4 name OSRoot id 0x659f7afe flags 0x00000048 mounted change member 3 joined 1 remove 0 failed 0 seq 7,7 members 2 3 4 service clvmd status clvmd (pid 25771) is running... Clustered Volume Groups: VG_SDATA Active clustered Logical Volumes: LV_HOME LV_DEVDB it doesn't believe that the root file-system is clustered despite the output from the above. [root@bwccs302 ~]# vgs VG #PV #LV #SN Attr VSize VFree VG_SDATA 1 2 0 wz--nc 1000.00g 0 vg_osroot 1 1 0 wz--n- 60.00g 0 The above got me thinking on what you wanted me to do to diable the clusterd flag on the root volume - with it left on I was having problems (not sure how it got turned) on. With everything working ok, I remade ramdisk and now lvm.conf=3.. The systems start up and things look ok. |
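For convenience, the debugging procedure Marc describes in the message above, collected into one copy-and-paste block; all commands are taken from his message, nothing new is assumed.

# More verbose initrd logging and a breakpoint before the cluster is stopped
com-chroot setparameter debug
com-chroot setparameter step halt_umountoldroot
init 0
# In the breakpoint shell that appears during shutdown:
cman_tool status
cman_tool services
exit                # continue the shutdown
# Fallback if the breakpoint does not appear: add "set -x" around line 207
# of com-realhalt.sh (before clusterfs_services_stop is called), then retry
com-chroot vi com-realhalt.sh
init 0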
From: Marc G. <gr...@at...> - 2012-11-13 07:44:05
|
Jorge, you don't need to be doubtful about the fact that the volume group for the root file system is not flagged as clustered. This has no implications whatsoever on the gfs2 file system. It will only be a problem whenever the lvm settings of vg_osroot change (size, number of lvs etc.). Nevertheless, while thinking about your problem I think I had an idea on how to fix it so that the root vg can be clustered as well. I will provide new packages in the next days that should deal with the problem. Keep in mind that there is a difference between cman_tool services and the lvm usage. clvmd only uses the locktable clvmd shown by cman_tool services; the other locktables are relevant to the file systems and other services (fenced, rgmanager, ..). This is a completely different use case. Try to elaborate a bit more on the statement "I am still a bit stuck when nodes with gfs2 mounted don't restart if instructed to do so, but I will read some more." What do you mean by it? How does this happen? This sounds like something you should have a look at. "One thing that I can confirm is osr(notice): Detecting nodeid & nodename This does not always display the correct info, but it doesn't seem to be a problem either ?" You should always look at the nodeid; the nodename is (more or less) only descriptive and might not be set as expected. But the nodeid should always be consistent. Does this help? About your notes (I only take the relevant ones): 1. osr(notice): Creating clusterfiles /var/run/cman_admin /var/run/cman_client.. [OK] This message should not be misleading; it only tells you that these control files are being created inside the ramdisk. This has nothing to do with these files on your root file system. Nevertheless /etc/init.d/bootsr should take over this part and create the files. Please send me another bash -x /etc/init.d/bootsr start output, taken when those files do not exist. 2. vgs VG #PV #LV #SN Attr VSize VFree VG_SDATA 1 2 0 wz--nc 1000.00g 0 vg_osroot 1 1 0 wz--n- 60.00g 0 This is perfectly ok. It only means the vg is not clustered. But the filesystem IS. The two have no connection. Hope this helps. Let me know about the open issues. Regards Marc. ----- Original Message ----- From: "Jorge Silva" <me...@je...> To: "Marc Grimme" <gr...@at...> Sent: Tuesday, November 13, 2012 2:15:23 AM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi - I believe I have solved my problem, with your help, thank you. Yet, I'm not sure how I caused it - but the root volume group, as you pointed out, had the clustered attribute (and I must have done something silly along the way). I re-installed from scratch (see notes below) and then, just to prove that it is a problem, I changed the attribute of the rootfs (vgchange -cy) and rebooted and I ran into trouble; I changed it back and it is fine, so that does cause problems on start-up. I'm not sure I understand why, as there is an active quorum for the clvm to join and take part.. Despite it not being marked as a clustered volume, cman_tool services shows it as being, but clvmd status doesn't ? Is it safe to write to it with multiple nodes mounted? I am still a bit stuck when nodes with gfs2 mounted don't restart if instructed to do so, but I will read some more. One thing that I can confirm is osr(notice): Detecting nodeid & nodename This does not always display the correct info, but it doesn't seem to be a problem either ? 
Thanks Jorge Notes: I decided to start from scratch and I blew away the rootfs and started from scratch as per the website. My assumption - that I edited something and messed it up (I did look at a lot of the scripts to try to "figure out and fix" the problem, I can send the history if you want or I can edit and contribute). I rebooted the server and I had an issue - I didn't disable selinux so I had to intervene in the boot stage. That completed, but I noticed that : osr(notice): Starting network configuration for lo0 [OK] osr(notice): Detecting nodeid & nodename Is blank, but somehow the correct nodeid and name was deduced. I had to rebuild the ram disk to fix the selinux disabled. I also added the following yum install pciutils - the mkinitrd warned about this so, I installed it. I also installed : yum install cluster-snmp yum install rgmanager in lvm On this reboot I noticed that despite this message sr(notice): Creating clusterfiles /var/run/cman_admin /var/run/cman_client.. [OK] Starting clvmd: dlm: Using TCP for communications Activating VG(s): File descriptor 3 (/dev/console) leaked on vgchange invocation. Parent PID 15995: /bin/bash File descriptor 4 (/dev/console) leaked on vgchange invocation. Parent PID 15995: /bin/bash Skipping clustered volume group VG_SDATA 1 logical volume(s) in volume group "vg_osroot" now active the links weren't created and I did this manually ln -sf /var/comoonics/chroot//var/run/cman_admin /var/run/cman_admin ln -sf /var/comoonics/chroot//var/run/cman_client /var/run/cman_client I could then get clusterstatus etc, and clvmd was running ok I looked in /etc/lvm/lvm.conf and locking_type = 4 ? I then issued lvmconf --enable cluster - and this changed /etc/lvm/lvm.conf locking_type = 3. vgscan correctly showed up clusterd volumes and was working ok. I did not rebuild the ramdisk (I can confirm that the lvm .conf in the ramdisk has locking_type=4) I have rebooted and everything is working. Starting clvmd: dlm: Using TCP for communications Activating VG(s): File descriptor 3 (/dev/console) leaked on vgchange invocation. Parent PID 15983: /bin/bash File descriptor 4 (/dev/console) leaked on vgchange invocation. Parent PID 15983: /bin/bash Skipping clustered volume group VG_SDATA 1 logical volume(s) in volume group "vg_osroot" now active I have rebooted a number of times and am confident that things are ok, I decided to add two other nodes to the mix and I can confirm that everytime a new node is added these files are missing : /var/run/cman_admin /var/run/cman_client But I can see from the logs: osr(notice): Creating clusterfiles /var/run/cman_admin /var/run/cman_client.. [OK] despite the above message, also, the information below is not always detected, but still the nodeid etc is correct... 
osr(notice): Detecting nodeid & nodename So now I have 3 nodes in the cluster and things look ok: [root@bwccs302 ~]# cman_tool services fence domain member count 3 victim count 0 victim now 0 master nodeid 2 wait state none members 2 3 4 dlm lockspaces name home id 0xf8ee17aa flags 0x00000008 fs_reg change member 3 joined 1 remove 0 failed 0 seq 3,3 members 2 3 4 name clvmd id 0x4104eefa flags 0x00000000 change member 3 joined 1 remove 0 failed 0 seq 15,15 members 2 3 4 name OSRoot id 0xab5404ad flags 0x00000008 fs_reg change member 3 joined 1 remove 0 failed 0 seq 7,7 members 2 3 4 gfs mountgroups name home id 0x686e3fc4 flags 0x00000048 mounted change member 3 joined 1 remove 0 failed 0 seq 3,3 members 2 3 4 name OSRoot id 0x659f7afe flags 0x00000048 mounted change member 3 joined 1 remove 0 failed 0 seq 7,7 members 2 3 4 service clvmd status clvmd (pid 25771) is running... Clustered Volume Groups: VG_SDATA Active clustered Logical Volumes: LV_HOME LV_DEVDB it doesn't believe that the root file-system is clustered despite the output from the above. [root@bwccs302 ~]# vgs VG #PV #LV #SN Attr VSize VFree VG_SDATA 1 2 0 wz--nc 1000.00g 0 vg_osroot 1 1 0 wz--n- 60.00g 0 The above got me thinking on what you wanted me to do to diable the clusterd flag on the root volume - with it left on I was having problems (not sure how it got turned) on. With everything working ok, I remade ramdisk and now lvm.conf=3.. The systems start up and things look ok. |
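The clustered flag Marc refers to is the sixth character of the VG attribute string ('c' in "wz--nc" for VG_SDATA, '-' in "wz--n-" for vg_osroot). A quick way to list which volume groups carry it, using only standard LVM and awk:

# Print every VG whose attribute string ends in 'c', i.e. the clustered ones
vgs --noheadings -o vg_name,vg_attr | awk '$2 ~ /c$/ {print $1, "is clustered"}'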
From: Marc G. <gr...@at...> - 2012-11-12 19:14:09
|
Jorge, you did everything ok. We seem to just flag the volume group for the root filesystem to "non" clustered. Then everything should work just fine. This can only be done with the command I send you. Why is it not working? Thanks Marc. ----- Original Message ----- From: "Jorge Silva" <me...@je...> To: "Marc Grimme" <gr...@at...> Sent: Monday, November 12, 2012 7:52:51 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, so to be clear - in RH5.X to do an openshared cluster the rootfs was of gfs type for all the nodes to boot and operate. Has this changed? I was following the general howto when I set this up and set up the root vg as gfs2 type - I can confirm that my root vg is of gfs2 type, if this is not needed i can re-start the process.. I can't run the command run the lvm --config command. Thanks jorge On Mon, Nov 12, 2012 at 1:37 PM, Marc Grimme < gr...@at... > wrote: So I think I have it. The root filesystem volumegroup is clustered. Which "shouldn't really be so". So try the following command: lvm --config 'global { locking_type=0 }' vgchange -cn VG_OSROOT then /etc/init.d/bootsr start /etc/init.d/clvmd start What happens? Regards Marc. P.S. if you still have problems send me again bash -x /etc/init.d/bootsr start. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 7:15:02 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, here it is. [root@bwccs302 ~]# ls -l /var total 152 drwxr-xr-x 3 root root 3864 Nov 8 15:14 account lrwxrwxrwx 1 root root 24 Nov 8 15:14 cache -> ../.cdsl.local/var/cache drwxr-xr-x 3 root root 3864 Nov 9 12:49 comoonics drwxr-xr-x 2 root root 3864 Feb 22 2012 cvs drwxr-xr-x 4 root root 3864 Jul 31 15:43 db drwxr-xr-x 3 root root 3864 Jul 30 19:36 empty drwxr-xr-x 2 root root 3864 Sep 23 2011 games drwxr-xr-x 30 root root 3864 Nov 11 16:24 lib lrwxrwxrwx 1 root root 24 Nov 8 15:14 local -> ../.cdsl.local/var/local drwxrwxr-x 5 root lock 3864 Nov 12 10:34 lock lrwxrwxrwx 1 root root 22 Nov 8 15:14 log -> ../.cdsl.local/var/log lrwxrwxrwx 1 root root 10 Jul 30 19:29 mail -> spool/mail drwxr-xr-x 2 root root 3864 Sep 23 2011 nis drwxr-xr-x 2 root root 3864 Sep 23 2011 opt drwxr-xr-x 2 root root 3864 Sep 23 2011 preserve drwxr-xr-x 28 root root 3864 Nov 12 12:50 run lrwxrwxrwx 1 root root 24 Nov 8 15:14 spool -> ../.cdsl.local/var/spool lrwxrwxrwx 1 root root 22 Nov 8 15:14 tmp -> ../.cdsl.local/var/tmp drwxr-xr-x 6 root root 3864 Nov 11 11:51 www drwxr-xr-x 2 root root 3864 Sep 23 2011 yp On Mon, Nov 12, 2012 at 1:10 PM, Marc Grimme < gr...@at... > wrote: So it looks like the cluster communication files $chrot/var/run/cman_admin, $chroot/var/run/cman_client or not symlinked to /var/run. This is the responsibility of /etc/init.d/bootsr. So I'd like to see the output of the following two commands: ls -l /var bash -x /etc/init.d/bootsr start Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... 
> Sent: Monday, November 12, 2012 6:54:53 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, your help is much appreciated, I have restarted the server again and here is the output com-chroot ls -l /var/run total 28 drwxr-xr-x 2 root root 40 Nov 12 12:48 cluster -rw-r--r-- 1 root root 6 Nov 12 12:48 cman.pid srw------- 1 root root 0 Nov 12 12:48 cman_admin srw-rw---- 1 root root 0 Nov 12 12:48 cman_client drwxr-xr-x 2 root root 40 Nov 12 12:50 console -rw-r----- 1 root root 6 Nov 12 12:48 corosync.pid -rw-r--r-- 1 root root 6 Nov 12 12:49 dlm_controld.pid drwxr-xr-x 2 root root 40 Nov 12 12:50 faillock -rw-r--r-- 1 root root 6 Nov 12 12:49 fenced.pid -rw-r--r-- 1 root root 6 Nov 12 12:49 gfs_controld.pid -rw-r--r-- 1 root root 5 Nov 12 12:48 qdiskd.pid drwxr-xr-x 2 root root 40 Nov 12 12:50 sepermit -rw-r--r-- 1 root root 6 Nov 12 12:50 sshd.pid Thanks Jorge On Mon, Nov 12, 2012 at 12:49 PM, Marc Grimme < gr...@at... > wrote: We are getting there. Now I'd like to see the output of this command: com-chroot ls -l /var/run Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 5:46:58 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, sorry, here it is total 308 drwxr-xr-x 28 root root 3864 Nov 12 11:00 . drwxr-xr-x 16 root root 3864 Nov 11 11:51 .. -rw-r--r-- 1 root root 6 Nov 12 10:58 auditd.pid drwxr-xr-x 2 avahi avahi 3864 Nov 12 10:58 avahi-daemon srwxr-xr-x 1 root root 0 Nov 12 10:58 clumond.sock drwx--x--x 4 root root 3864 Nov 11 18:01 cluster drwxr-xr-x 2 root root 3864 Nov 2 19:51 console drwxr-xr-x 2 root root 3864 Nov 12 11:00 ConsoleKit -rw-r--r-- 1 root root 6 Nov 12 11:00 console-kit-daemon.pid -rw-r--r-- 1 root root 6 Nov 12 10:58 crond.pid ---------- 1 root root 0 Nov 12 10:58 cron.reboot drwxr-xr-x 2 root root 3864 Nov 12 10:58 dbus drwxr-xr-x 2 root root 3864 Apr 16 2012 faillock drwx------ 2 haldaemon haldaemon 3864 Jul 19 2011 hald -rw-r--r-- 1 root root 6 Nov 12 10:58 haldaemon.pid drwxr-xr-x 3 root root 3864 Jul 30 20:34 heartbeat drwx--x--- 2 root apache 3864 Nov 11 17:04 httpd -rw-r--r-- 1 root root 6 Nov 12 10:58 ksmtune.pid drwxr-xr-x 5 root root 3864 Nov 12 10:58 libvirt -rw-r--r-- 1 root root 5 Nov 12 10:58 libvirtd.pid -rw------- 1 root root 32 Nov 12 10:58 lldpad.pid drwx------ 2 root root 3864 Nov 12 10:58 lvm drwx------ 2 root root 3864 Jun 22 08:32 mdadm -rw-r--r-- 1 root root 6 Nov 12 10:58 messagebus.pid -rw-r--r-- 1 root root 5 Nov 12 10:58 modclusterd.pid -rw-r--r-- 1 root root 5 Nov 12 10:57 multipathd.pid srwx------ 1 root root 0 Nov 12 10:57 multipathd.sock drwxr-xr-x 2 mysql mysql 3864 Nov 11 16:11 mysqld drwxrwxr-x 2 root root 3864 Sep 17 05:55 netreport drwxr-xr-x 2 root root 3864 Jul 23 02:09 net-snmp drwxr-xr-x 2 root root 3864 Nov 10 17:55 nscd drwxr-xr-x 2 nslcd root 3864 Nov 10 11:57 nslcd -rw-r--r-- 1 root root 5 Nov 12 10:58 ntpd.pid -rw------- 1 root root 5 Nov 12 10:58 oddjobd.pid drwxr-xr-x 2 root root 3864 Dec 8 2011 plymouth drwxr-xr-x 5 root root 3864 Nov 8 14:57 pm-utils drwxr-xr-x 2 root root 3864 Apr 3 2012 portreserve drwxr-xr-x 2 root root 3864 Aug 22 2010 ppp drwxr-xr-x 2 radvd radvd 3864 Nov 11 2010 radvd -rw-r--r-- 1 root root 5 Nov 12 10:58 ricci.pid -r--r--r-- 1 root root 0 Nov 12 10:58 rpcbind.lock -rw-r--r-- 1 root root 6 Nov 12 10:58 rpcbind.pid srw-rw-rw- 1 root root 0 Nov 12 10:58 rpcbind.sock drwxr-xr-x 2 root root 3864 Nov 12 10:58 saslauthd -rw------- 1 root smmsp 34 Nov 12 10:58 sendmail.pid 
drwxr-xr-x 2 root root 3864 Apr 16 2012 sepermit drwxr-xr-x 2 root root 3864 Jun 22 03:51 setrans -rw-r--r-- 1 smmsp smmsp 50 Nov 12 10:58 sm-client.pid -rw------- 1 root root 6 Nov 12 10:58 snmpd.pid -rw-r--r-- 1 root root 6 Nov 12 10:58 sshd.pid -rw------- 1 root root 6 Nov 12 10:58 sssd.pid -rw------- 1 root root 6 Nov 12 10:58 syslogd.pid -rw-rw-r-- 1 root utmp 3456 Nov 12 11:00 utmp drwxr-xr-x 2 root root 3864 Jun 22 06:05 wpa_supplicant On Mon, Nov 12, 2012 at 11:40 AM, Marc Grimme < gr...@at... > wrote: ls -l /var/run is missing, right? Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 5:38:21 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, cluster.conf, is attatched. I did have it running yesterday, but then things went pear-shaped. /etc/cdsltab bind /.cluster/cdsl/%(nodeid)s/var/run /var/run __initrd bind /.cluster/cdsl/%(nodeid)s/var/lock /var/lock __initrd ls -l /var/run rootfs / rootfs rw 0 0 udev /dev devtmpfs rw,nosuid,relatime,size=12337912k,nr_inodes=3084478,mode=755 0 0 /dev/pts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0 tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0 none /var/comoonics/chroot tmpfs rw,relatime 0 0 none /var/comoonics/chroot/dev tmpfs rw,relatime 0 0 none /var/comoonics/chroot/dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0 proc /var/comoonics/chroot/proc proc rw,relatime 0 0 sysfs /var/comoonics/chroot/sys sysfs rw,relatime 0 0 none /var/comoonics/chroot/sys/kernel/config configfs rw,relatime 0 0 /dev/dm-7 / gfs2 rw,noatime,hostdata=jid=0,localflocks 0 0 /dev/dm-7 /.cdsl.local gfs2 rw,noatime,hostdata=jid=0,localflocks 0 0 /dev/dm-7 /var/run gfs2 rw,noatime,hostdata=jid=0,localflocks 0 0 /dev/dm-7 /var/lock gfs2 rw,noatime,hostdata=jid=0,localflocks 0 0 proc /proc proc rw,relatime 0 0 sysfs /sys sysfs rw,relatime 0 0 /proc/bus/usb /proc/bus/usb usbfs rw,relatime 0 0 /dev/mapper/osbootpp1 /boot ext4 rw,relatime,barrier=1,data=ordered 0 0 none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0 cgroup /cgroup/cpuset cgroup rw,relatime,cpuset 0 0 cgroup /cgroup/cpu cgroup rw,relatime,cpu 0 0 cgroup /cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0 cgroup /cgroup/memory cgroup rw,relatime,memory 0 0 cgroup /cgroup/devices cgroup rw,relatime,devices 0 0 cgroup /cgroup/freezer cgroup rw,relatime,freezer 0 0 cgroup /cgroup/net_cls cgroup rw,relatime,net_cls 0 0 cgroup /cgroup/blkio cgroup rw,relatime,blkio 0 0 On Mon, Nov 12, 2012 at 11:21 AM, Marc Grimme < gr...@at... > wrote: Ok we need to look deeper into things. Could you send me /etc/cluster/cluster.conf, and - if there - /etc/cdsltab. Also the output of the following commands * cat /proc/mounts * ls -l /var/run Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 5:04:42 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc I have a boot log with the set +x (in text), much easier than jpgs.. and attatched is lvm.conf Thanks Jorge On Mon, Nov 12, 2012 at 10:53 AM, Marc Grimme < gr...@at... > wrote: Ok, one last thing. Would you send me your /etc/lvm/lvm.conf? Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... 
> Sent: Monday, November 12, 2012 4:50:19 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, please see below: [root@bwccs302 ~]# rpm -qa comoonics* comoonics-base-py-5.0-2_rhel6.noarch comoonics-bootimage-listfiles-rhel6-fencelib-5.0-1_rhel6.noarch comoonics-cdsl-py-5.0-3_rhel6.noarch comoonics-imsd-py-5.0-1_rhel6.noarch comoonics-cluster-py-5.0-2_rhel6.noarch comoonics-bootimage-listfiles-rhel6-5.0-4_rhel6.noarch comoonics-bootimage-imsd-5.0-5_rhel6.noarch comoonics-bootimage-listfiles-firmware-5.0-2_rhel6.noarch comoonics-release-5.0-2_rhel6.noarch comoonics-tools-py-5.0-2_rhel6.noarch comoonics-bootimage-initscripts-5.0-10_rhel6.noarch comoonics-imsd-plugins-py-5.0-1_rhel6.noarch comoonics-bootimage-extras-network-5.0-2_rhel6.noarch comoonics-cluster-tools-py-5.0-3_rhel6.noarch comoonics-bootimage-5.0-19_rhel6.noarch comoonics-bootimage-listfiles-rhel6-gfs2-5.0-3_rhel6.noarch comoonics-bootimage-extras-localconfigs-5.0-9_rhel6.noarch comoonics-bootimage-listfiles-all-5.0-4_rhel6.noarch comoonics-bootimage-extras-dm-multipath-rhel6-5.0-2_rhel6.noarch I only have 1 node up as I don't want to cause even more problems.. [root@bwccs302 ~]# cman_tool services fence domain member count 1 victim count 0 victim now 0 master nodeid 2 wait state none members 2 dlm lockspaces name OSRoot id 0xab5404ad flags 0x00000008 fs_reg change member 1 joined 1 remove 0 failed 0 seq 1,1 members 2 gfs mountgroups name OSRoot id 0x659f7afe flags 0x00000048 mounted change member 1 joined 1 remove 0 failed 0 seq 1,1 members 2 clvmd -d - I have verified lvm.conf locking_type=3 [root@bwccs302 ~]# clvmd -d CLVMD[f724c7a0]: Nov 12 10:41:44 CLVMD started CLVMD[f724c7a0]: Nov 12 10:41:44 Connected to CMAN CLVMD[f724c7a0]: Nov 12 10:41:44 CMAN initialisation complete CLVMD[f724c7a0]: Nov 12 10:41:44 Opened existing DLM lockspace for CLVMD. CLVMD[f724c7a0]: Nov 12 10:41:44 DLM initialisation complete CLVMD[f724c7a0]: Nov 12 10:41:44 Cluster ready, doing some more initialisation CLVMD[f724c7a0]: Nov 12 10:41:44 starting LVM thread CLVMD[f724b700]: Nov 12 10:41:44 LVM thread function started WARNING: Locking disabled. Be careful! This could corrupt your metadata. CLVMD[f724b700]: Nov 12 10:41:44 getting initial lock for T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gHE0iUJ3VIj83R3UAVmFkCUuijPMJaxxG CLVMD[f724b700]: Nov 12 10:41:44 sync_lock: 'T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gHE0iUJ3VIj83R3UAVmFkCUuijPMJaxxG' mode:1 flags=1 CLVMD[f724b700]: Nov 12 10:41:44 hold_lock. lock at 1 failed: Invalid argument CLVMD[f724b700]: Nov 12 10:41:44 Failed to hold lock T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gHE0iUJ3VIj83R3UAVmFkCUuijPMJaxxG CLVMD[f724b700]: Nov 12 10:41:44 getting initial lock for T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gaC97Wx4e1LI52FZ0CiZfDmFnTxY64eh2 CLVMD[f724b700]: Nov 12 10:41:44 sync_lock: 'T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gaC97Wx4e1LI52FZ0CiZfDmFnTxY64eh2' mode:1 flags=1 CLVMD[f724b700]: Nov 12 10:41:44 hold_lock. lock at 1 failed: Invalid argument CLVMD[f724b700]: Nov 12 10:41:44 Failed to hold lock T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gaC97Wx4e1LI52FZ0CiZfDmFnTxY64eh2 CLVMD[f724b700]: Nov 12 10:41:44 Sub thread ready for work. CLVMD[f724b700]: Nov 12 10:41:44 LVM thread waiting for work CLVMD[f724c7a0]: Nov 12 10:41:44 clvmd ready for work CLVMD[f724c7a0]: Nov 12 10:41:44 Using timeout of 60 seconds On Mon, Nov 12, 2012 at 10:33 AM, Marc Grimme < gr...@at... 
> wrote: Jorge, I would like to have a look at the following command outputs: rpm -qa comoonics* For all running nodes: cman_tool services ps axfwww | grep clvm Then try to start the clvmd manually and make it little chatty: clvmd -d And send me the output. Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 4:19:37 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, I have serial console logging (appologies for not getting this done)and I can see all the cluster services starting as expected and properly. When I log on [root@bwccs302 ~]# clustat Cluster Status for ProdCluster01 @ Mon Nov 12 10:17:26 2012 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ smc01a 1 Offline smc01b 2 Online, Local smc01c 3 Offline smc01d 4 Offline /dev/block/253:4 0 Online, Quorum Disk [root@bwccs302 ~]# vgdisplay cluster request failed: Invalid argument Can't get lock for VG_OSROOT cluster request failed: Invalid argument Can't get lock for vg_osroot [root@bwccs302 ~]# ls -al /dev/VG_OSROOT/LV_* lrwxrwxrwx 1 root root 7 Nov 12 10:14 /dev/VG_OSROOT/LV_ROOT -> ../dm-8 lrwxrwxrwx 1 root root 7 Nov 12 10:14 /dev/VG_OSROOT/LV_SWAP -> ../dm-9 On Mon, Nov 12, 2012 at 9:57 AM, Jorge Silva < me...@je... > wrote: Marc Hi, the problem is I believe I had set up an openshared root where more than one node boots up off a GFS2 fs. I believe I managed to get the system to that state, when I sent you the first screenshot Where the cluster would get started and clustered volumes would get detected in the comoonics boot stage. My problem was that once the boot process switched to the gfs fs, none of the clustered volumes were visible via any vg commands and any vg* commands failed - if I looked in the /dev/VG, I could see all the clustered volumes, however clvmd would be running at 100% any attempt to start it would fail. however all cluster resoureces and commands would work, cman_tool nodes would list nodes etc.. I have just rebuild the ramdisk and I notice that none of the cluster services has started, which is even more odd.. Thanks Jorge On Mon, Nov 12, 2012 at 9:12 AM, Marc Grimme < gr...@at... > wrote: This looks perfectly ok. The failed after activation of the vgs is because there are clustered vgs present (which again is perfectly ok). Then the bootup continues as expected as can be seen in the logs. I think I don't understand the problem you are talking about. Perhaps you could try to explain your problem in more detail. Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 3:04:03 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, thanks for you help, I got rid of some of the clustered volumes for clarity, appologies for the unorthodox screen log, I must get console logging done via serial... I booted into emergency mode, and ls -l /etc/rc3.d/S* lrwxrwxrwx 1 root root 11 Nov 7 17:21 /etc/rc3.d/S99local -> ../rc.local.comoonics Edited rc.sysinit Line, on mine is line 205 and continued. Attached is output from set-x Thanks Jorge On Mon, Nov 12, 2012 at 6:01 AM, Marc Grimme < gr...@at... > wrote: Hi Jorge, try to boot the cluster into emergency mode by adding a "1" to the boot prompt. With this you should end up in a console. 
Then issue the following commands and send me the output: ls -l /etc/rc3.d/S* Also add the following line before lvm is started (rc.sysinit Line 199): + set -x + Then we should see more at the next bootup. Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 4:43:22 AM Subject: Problem with VG activation clvmd runs at 100% Marc Hi, apologise for not getting back to you and it has been some time since we communicated. I am an Equity derivatives trader, and at the time I was helping a friend set up a trading Equity platform as a proof of concept, it was pretty low priority and was more of a learning tool for me, so I didn't spend too much time on it . I was forced to upgrade recently as this has moved from proof of concept to the next step. I apologise for bothering, but I have spent the last few days trying to get an OSR cluster running on Centos 6.3 +gfs2 and I believe I am almost there, but I am stuck, I am unsure what is going on. The cluster seems to be working ok, but climbed is running at 100% and I can restart it and still the same result. Attached is a screen shot of the final phase of booting showing the error. The cluster is quorate and shut-down to works OK. Thanks Jorge an output of vgscan: vgscan connect() failed on local socket: No such file or directory Internal cluster locking initialisation failed. WARNING: Falling back to local file-based locking. Volume Groups with the clustered attribute will be inaccessible. Reading all physical volumes. This may take a while... Skipping clustered volume group VG_OSROOT Found volume group "VG_DBDISKS" using metadata type lvm2 Skipping clustered volume group VG_SDATA Found volume group "vg_osroot" using metadata type lvm2 These are the lvm2 : lvm2-2.02.95-10.el6_3.2.x86_64 lvm2-cluster-2.02.95-10.el6_3.2.x86_64 I think these are what is causing the problem, but I'm not sure... 
lrwxrwxrwx 1 root root 41 Nov 11 21:54 /var/run/cman_admin -> /var/comoonics/chroot//var/run/cman_admin lrwxrwxrwx 1 root root 42 Nov 11 21:54 /var/run/cman_client -> /var/comoonics/chroot//var/run/cman_client I have tried the re-ordering /etc/cdsltab, it currently is : bind /.cluster/cdsl/%(nodeid)s/var/run /var/run __initrd bind /.cluster/cdsl/%(nodeid)s/var/lock /var/lock __initrd I have tried : rm -fr /var/cache/comoonics-bootimage/* ;rm -fr /var/cache/comoonics-repository/*; mkinitrd -V /boot/initrd_sr-$(uname -r).img $(uname -r) My cluster conf for the nodes look like: <clusternode name="smc01b" nodeid="2" votes="1"> <multicast addr="229.192.0.2" interface="bond0.1762"/> <fence> <method name="single"> <device ipaddr="172.17.50.16" name="ipmi"/> </method> </fence> <com_info> <eth mac="00:30:48:F0:10:54" master="bond0" name="eth0" slave="yes"/> <eth mac="00:30:48:F0:10:55" master="bond0" name="eth1" slave="yes"/> <eth name="bond0"> <properties> <property name="BONDING_OPTS">BONDING_OPTS="miimon=100 mode=4 xmit_hash_policy=2 "</property> </properties> </eth> <eth name="bond0.1762" ip="172.17.60.12" mask="255.255.255.0" gateway=""> <properties> <property name="VLAN">VLAN="yes"</property> </properties> </eth> </com_info> </clusternode> I have tried re-installing the packages below: comoonics-base-py-5.0-2_rhel6.noarch comoonics-bootimage-listfiles-rhel6-fencelib-5.0-1_rhel6.noarch comoonics-cdsl-py-5.0-3_rhel6.noarch comoonics-imsd-py-5.0-1_rhel6.noarch comoonics-cluster-py-5.0-2_rhel6.noarch comoonics-bootimage-listfiles-rhel6-5.0-4_rhel6.noarch comoonics-bootimage-imsd-5.0-5_rhel6.noarch comoonics-bootimage-listfiles-firmware-5.0-2_rhel6.noarch comoonics-release-5.0-2_rhel6.noarch comoonics-tools-py-5.0-2_rhel6.noarch comoonics-bootimage-initscripts-5.0-10_rhel6.noarch comoonics-imsd-plugins-py-5.0-1_rhel6.noarch comoonics-bootimage-extras-network-5.0-2_rhel6.noarch comoonics-cluster-tools-py-5.0-3_rhel6.noarch comoonics-bootimage-5.0-19_rhel6.noarch comoonics-bootimage-listfiles-rhel6-gfs2-5.0-3_rhel6.noarch comoonics-bootimage-extras-localconfigs-5.0-9_rhel6.noarch comoonics-bootimage-listfiles-all-5.0-4_rhel6.noarch |
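A compact way to run the checks Marc asks for in this thread, collected in one place. The paths and tools (com-chroot, bootsr, the /var/comoonics/chroot location) are the ones quoted above; treat this as a sketch of the verification, not official OSR documentation.

  # the cman sockets as seen inside the comoonics boot chroot
  com-chroot ls -l /var/run/cman_admin /var/run/cman_client

  # the booted root should see symlinks pointing into the chroot, e.g.
  #   /var/run/cman_client -> /var/comoonics/chroot/var/run/cman_client
  ls -l /var/run/cman_admin /var/run/cman_client

  # bootsr creates these links; run it with tracing if they are missing
  bash -x /etc/init.d/bootsr start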
From: Marc G. <gr...@at...> - 2012-11-12 18:37:31
|
So I think I have it. The root filesystem volume group is clustered, which shouldn't really be so. So try the following command:

lvm --config 'global { locking_type=0 }' vgchange -cn VG_OSROOT

then

/etc/init.d/bootsr start
/etc/init.d/clvmd start

What happens?

Regards Marc.

P.S. If you still have problems, send me again the output of: bash -x /etc/init.d/bootsr start |
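For reference, the usual things to confirm when clvmd spins at 100% or vg* commands fail with lock errors, as reported in this thread. These are standard LVM/cluster checks, a sketch rather than an exhaustive procedure:

  # clustered locking means locking_type = 3 in /etc/lvm/lvm.conf
  grep -E '^[[:space:]]*locking_type' /etc/lvm/lvm.conf

  # the clvmd DLM lockspace should exist and list the expected members
  cman_tool services

  # is a clvmd process already running (and possibly stuck)?
  ps axfwww | grep '[c]lvmd'

  # run it in the foreground with debug output to see failed lock requests directly
  clvmd -d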
From: Marc G. <gr...@at...> - 2012-11-12 18:10:22
|
So it looks like the cluster communication files $chroot/var/run/cman_admin and $chroot/var/run/cman_client are not symlinked to /var/run. This is the responsibility of /etc/init.d/bootsr. So I'd like to see the output of the following two commands:

ls -l /var
bash -x /etc/init.d/bootsr start

Thanks Marc. |
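Since the cdsltab entries and the initrd rebuild come up repeatedly above, a short sketch of how to confirm the node-local /var/run and /var/lock bind mounts and to rebuild the OSR initrd afterwards. The cache paths and the mkinitrd wrapper call are taken from the thread; verify them against your installation.

  # the node-local bind mounts declared for the initrd
  cat /etc/cdsltab

  # confirm they were actually applied on this node
  grep -E ' /var/(run|lock) ' /proc/mounts

  # rebuild the shared-root initrd after changing cluster.conf or cdsltab
  rm -fr /var/cache/comoonics-bootimage/* /var/cache/comoonics-repository/*
  mkinitrd -V /boot/initrd_sr-$(uname -r).img $(uname -r)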
From: Marc G. <gr...@at...> - 2012-11-12 17:49:26
|
We are getting there. Now I'd like to see the output of this command:

com-chroot ls -l /var/run

Thanks Marc. |
/etc/cdsltab bind /.cluster/cdsl/%(nodeid)s/var/run /var/run __initrd bind /.cluster/cdsl/%(nodeid)s/var/lock /var/lock __initrd ls -l /var/run rootfs / rootfs rw 0 0 udev /dev devtmpfs rw,nosuid,relatime,size=12337912k,nr_inodes=3084478,mode=755 0 0 /dev/pts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0 tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0 none /var/comoonics/chroot tmpfs rw,relatime 0 0 none /var/comoonics/chroot/dev tmpfs rw,relatime 0 0 none /var/comoonics/chroot/dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0 proc /var/comoonics/chroot/proc proc rw,relatime 0 0 sysfs /var/comoonics/chroot/sys sysfs rw,relatime 0 0 none /var/comoonics/chroot/sys/kernel/config configfs rw,relatime 0 0 /dev/dm-7 / gfs2 rw,noatime,hostdata=jid=0,localflocks 0 0 /dev/dm-7 /.cdsl.local gfs2 rw,noatime,hostdata=jid=0,localflocks 0 0 /dev/dm-7 /var/run gfs2 rw,noatime,hostdata=jid=0,localflocks 0 0 /dev/dm-7 /var/lock gfs2 rw,noatime,hostdata=jid=0,localflocks 0 0 proc /proc proc rw,relatime 0 0 sysfs /sys sysfs rw,relatime 0 0 /proc/bus/usb /proc/bus/usb usbfs rw,relatime 0 0 /dev/mapper/osbootpp1 /boot ext4 rw,relatime,barrier=1,data=ordered 0 0 none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0 cgroup /cgroup/cpuset cgroup rw,relatime,cpuset 0 0 cgroup /cgroup/cpu cgroup rw,relatime,cpu 0 0 cgroup /cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0 cgroup /cgroup/memory cgroup rw,relatime,memory 0 0 cgroup /cgroup/devices cgroup rw,relatime,devices 0 0 cgroup /cgroup/freezer cgroup rw,relatime,freezer 0 0 cgroup /cgroup/net_cls cgroup rw,relatime,net_cls 0 0 cgroup /cgroup/blkio cgroup rw,relatime,blkio 0 0 On Mon, Nov 12, 2012 at 11:21 AM, Marc Grimme < gr...@at... > wrote: Ok we need to look deeper into things. Could you send me /etc/cluster/cluster.conf, and - if there - /etc/cdsltab. Also the output of the following commands * cat /proc/mounts * ls -l /var/run Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 5:04:42 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc I have a boot log with the set +x (in text), much easier than jpgs.. and attatched is lvm.conf Thanks Jorge On Mon, Nov 12, 2012 at 10:53 AM, Marc Grimme < gr...@at... > wrote: Ok, one last thing. Would you send me your /etc/lvm/lvm.conf? Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... 
> Sent: Monday, November 12, 2012 4:50:19 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, please see below: [root@bwccs302 ~]# rpm -qa comoonics* comoonics-base-py-5.0-2_rhel6.noarch comoonics-bootimage-listfiles-rhel6-fencelib-5.0-1_rhel6.noarch comoonics-cdsl-py-5.0-3_rhel6.noarch comoonics-imsd-py-5.0-1_rhel6.noarch comoonics-cluster-py-5.0-2_rhel6.noarch comoonics-bootimage-listfiles-rhel6-5.0-4_rhel6.noarch comoonics-bootimage-imsd-5.0-5_rhel6.noarch comoonics-bootimage-listfiles-firmware-5.0-2_rhel6.noarch comoonics-release-5.0-2_rhel6.noarch comoonics-tools-py-5.0-2_rhel6.noarch comoonics-bootimage-initscripts-5.0-10_rhel6.noarch comoonics-imsd-plugins-py-5.0-1_rhel6.noarch comoonics-bootimage-extras-network-5.0-2_rhel6.noarch comoonics-cluster-tools-py-5.0-3_rhel6.noarch comoonics-bootimage-5.0-19_rhel6.noarch comoonics-bootimage-listfiles-rhel6-gfs2-5.0-3_rhel6.noarch comoonics-bootimage-extras-localconfigs-5.0-9_rhel6.noarch comoonics-bootimage-listfiles-all-5.0-4_rhel6.noarch comoonics-bootimage-extras-dm-multipath-rhel6-5.0-2_rhel6.noarch I only have 1 node up as I don't want to cause even more problems.. [root@bwccs302 ~]# cman_tool services fence domain member count 1 victim count 0 victim now 0 master nodeid 2 wait state none members 2 dlm lockspaces name OSRoot id 0xab5404ad flags 0x00000008 fs_reg change member 1 joined 1 remove 0 failed 0 seq 1,1 members 2 gfs mountgroups name OSRoot id 0x659f7afe flags 0x00000048 mounted change member 1 joined 1 remove 0 failed 0 seq 1,1 members 2 clvmd -d - I have verified lvm.conf locking_type=3 [root@bwccs302 ~]# clvmd -d CLVMD[f724c7a0]: Nov 12 10:41:44 CLVMD started CLVMD[f724c7a0]: Nov 12 10:41:44 Connected to CMAN CLVMD[f724c7a0]: Nov 12 10:41:44 CMAN initialisation complete CLVMD[f724c7a0]: Nov 12 10:41:44 Opened existing DLM lockspace for CLVMD. CLVMD[f724c7a0]: Nov 12 10:41:44 DLM initialisation complete CLVMD[f724c7a0]: Nov 12 10:41:44 Cluster ready, doing some more initialisation CLVMD[f724c7a0]: Nov 12 10:41:44 starting LVM thread CLVMD[f724b700]: Nov 12 10:41:44 LVM thread function started WARNING: Locking disabled. Be careful! This could corrupt your metadata. CLVMD[f724b700]: Nov 12 10:41:44 getting initial lock for T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gHE0iUJ3VIj83R3UAVmFkCUuijPMJaxxG CLVMD[f724b700]: Nov 12 10:41:44 sync_lock: 'T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gHE0iUJ3VIj83R3UAVmFkCUuijPMJaxxG' mode:1 flags=1 CLVMD[f724b700]: Nov 12 10:41:44 hold_lock. lock at 1 failed: Invalid argument CLVMD[f724b700]: Nov 12 10:41:44 Failed to hold lock T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gHE0iUJ3VIj83R3UAVmFkCUuijPMJaxxG CLVMD[f724b700]: Nov 12 10:41:44 getting initial lock for T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gaC97Wx4e1LI52FZ0CiZfDmFnTxY64eh2 CLVMD[f724b700]: Nov 12 10:41:44 sync_lock: 'T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gaC97Wx4e1LI52FZ0CiZfDmFnTxY64eh2' mode:1 flags=1 CLVMD[f724b700]: Nov 12 10:41:44 hold_lock. lock at 1 failed: Invalid argument CLVMD[f724b700]: Nov 12 10:41:44 Failed to hold lock T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gaC97Wx4e1LI52FZ0CiZfDmFnTxY64eh2 CLVMD[f724b700]: Nov 12 10:41:44 Sub thread ready for work. CLVMD[f724b700]: Nov 12 10:41:44 LVM thread waiting for work CLVMD[f724c7a0]: Nov 12 10:41:44 clvmd ready for work CLVMD[f724c7a0]: Nov 12 10:41:44 Using timeout of 60 seconds On Mon, Nov 12, 2012 at 10:33 AM, Marc Grimme < gr...@at... 
> wrote: Jorge, I would like to have a look at the following command outputs: rpm -qa comoonics* For all running nodes: cman_tool services ps axfwww | grep clvm Then try to start the clvmd manually and make it little chatty: clvmd -d And send me the output. Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 4:19:37 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, I have serial console logging (appologies for not getting this done)and I can see all the cluster services starting as expected and properly. When I log on [root@bwccs302 ~]# clustat Cluster Status for ProdCluster01 @ Mon Nov 12 10:17:26 2012 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ smc01a 1 Offline smc01b 2 Online, Local smc01c 3 Offline smc01d 4 Offline /dev/block/253:4 0 Online, Quorum Disk [root@bwccs302 ~]# vgdisplay cluster request failed: Invalid argument Can't get lock for VG_OSROOT cluster request failed: Invalid argument Can't get lock for vg_osroot [root@bwccs302 ~]# ls -al /dev/VG_OSROOT/LV_* lrwxrwxrwx 1 root root 7 Nov 12 10:14 /dev/VG_OSROOT/LV_ROOT -> ../dm-8 lrwxrwxrwx 1 root root 7 Nov 12 10:14 /dev/VG_OSROOT/LV_SWAP -> ../dm-9 On Mon, Nov 12, 2012 at 9:57 AM, Jorge Silva < me...@je... > wrote: Marc Hi, the problem is I believe I had set up an openshared root where more than one node boots up off a GFS2 fs. I believe I managed to get the system to that state, when I sent you the first screenshot Where the cluster would get started and clustered volumes would get detected in the comoonics boot stage. My problem was that once the boot process switched to the gfs fs, none of the clustered volumes were visible via any vg commands and any vg* commands failed - if I looked in the /dev/VG, I could see all the clustered volumes, however clvmd would be running at 100% any attempt to start it would fail. however all cluster resoureces and commands would work, cman_tool nodes would list nodes etc.. I have just rebuild the ramdisk and I notice that none of the cluster services has started, which is even more odd.. Thanks Jorge On Mon, Nov 12, 2012 at 9:12 AM, Marc Grimme < gr...@at... > wrote: This looks perfectly ok. The failed after activation of the vgs is because there are clustered vgs present (which again is perfectly ok). Then the bootup continues as expected as can be seen in the logs. I think I don't understand the problem you are talking about. Perhaps you could try to explain your problem in more detail. Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 3:04:03 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, thanks for you help, I got rid of some of the clustered volumes for clarity, appologies for the unorthodox screen log, I must get console logging done via serial... I booted into emergency mode, and ls -l /etc/rc3.d/S* lrwxrwxrwx 1 root root 11 Nov 7 17:21 /etc/rc3.d/S99local -> ../rc.local.comoonics Edited rc.sysinit Line, on mine is line 205 and continued. Attached is output from set-x Thanks Jorge On Mon, Nov 12, 2012 at 6:01 AM, Marc Grimme < gr...@at... > wrote: Hi Jorge, try to boot the cluster into emergency mode by adding a "1" to the boot prompt. With this you should end up in a console. 
Then issue the following commands and send me the output: ls -l /etc/rc3.d/S* Also add the following line before lvm is started (rc.sysinit Line 199): + set -x + Then we should see more at the next bootup. Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 4:43:22 AM Subject: Problem with VG activation clvmd runs at 100% Marc Hi, apologise for not getting back to you and it has been some time since we communicated. I am an Equity derivatives trader, and at the time I was helping a friend set up a trading Equity platform as a proof of concept, it was pretty low priority and was more of a learning tool for me, so I didn't spend too much time on it . I was forced to upgrade recently as this has moved from proof of concept to the next step. I apologise for bothering, but I have spent the last few days trying to get an OSR cluster running on Centos 6.3 +gfs2 and I believe I am almost there, but I am stuck, I am unsure what is going on. The cluster seems to be working ok, but climbed is running at 100% and I can restart it and still the same result. Attached is a screen shot of the final phase of booting showing the error. The cluster is quorate and shut-down to works OK. Thanks Jorge an output of vgscan: vgscan connect() failed on local socket: No such file or directory Internal cluster locking initialisation failed. WARNING: Falling back to local file-based locking. Volume Groups with the clustered attribute will be inaccessible. Reading all physical volumes. This may take a while... Skipping clustered volume group VG_OSROOT Found volume group "VG_DBDISKS" using metadata type lvm2 Skipping clustered volume group VG_SDATA Found volume group "vg_osroot" using metadata type lvm2 These are the lvm2 : lvm2-2.02.95-10.el6_3.2.x86_64 lvm2-cluster-2.02.95-10.el6_3.2.x86_64 I think these are what is causing the problem, but I'm not sure... 
|
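The point of asking for both listings is to see whether the cman_client/cman_admin sockets that cman clients (clvmd among them) use are visible from the root filesystem at all. A hypothetical side-by-side check, reusing the commands from this thread:

  ls -l /var/run | grep cman
  com-chroot ls -l /var/run | grep cman

The host listing quoted above contains no cman_* entries, so if the chroot listing does show them, the bind mount of the CDSL /var/run over the original directory would be a plausible explanation for the "Invalid argument" lock failures reported by clvmd.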
From: Marc G. <gr...@at...> - 2012-11-12 16:40:35
|
ls -l /var/run is missing, right? Marc. ----- Original Message ----- From: "Jorge Silva" <me...@je...> To: "Marc Grimme" <gr...@at...> Sent: Monday, November 12, 2012 5:38:21 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, cluster.conf, is attatched. I did have it running yesterday, but then things went pear-shaped. /etc/cdsltab bind /.cluster/cdsl/%(nodeid)s/var/run /var/run __initrd bind /.cluster/cdsl/%(nodeid)s/var/lock /var/lock __initrd ls -l /var/run rootfs / rootfs rw 0 0 udev /dev devtmpfs rw,nosuid,relatime,size=12337912k,nr_inodes=3084478,mode=755 0 0 /dev/pts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0 tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0 none /var/comoonics/chroot tmpfs rw,relatime 0 0 none /var/comoonics/chroot/dev tmpfs rw,relatime 0 0 none /var/comoonics/chroot/dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0 proc /var/comoonics/chroot/proc proc rw,relatime 0 0 sysfs /var/comoonics/chroot/sys sysfs rw,relatime 0 0 none /var/comoonics/chroot/sys/kernel/config configfs rw,relatime 0 0 /dev/dm-7 / gfs2 rw,noatime,hostdata=jid=0,localflocks 0 0 /dev/dm-7 /.cdsl.local gfs2 rw,noatime,hostdata=jid=0,localflocks 0 0 /dev/dm-7 /var/run gfs2 rw,noatime,hostdata=jid=0,localflocks 0 0 /dev/dm-7 /var/lock gfs2 rw,noatime,hostdata=jid=0,localflocks 0 0 proc /proc proc rw,relatime 0 0 sysfs /sys sysfs rw,relatime 0 0 /proc/bus/usb /proc/bus/usb usbfs rw,relatime 0 0 /dev/mapper/osbootpp1 /boot ext4 rw,relatime,barrier=1,data=ordered 0 0 none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0 cgroup /cgroup/cpuset cgroup rw,relatime,cpuset 0 0 cgroup /cgroup/cpu cgroup rw,relatime,cpu 0 0 cgroup /cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0 cgroup /cgroup/memory cgroup rw,relatime,memory 0 0 cgroup /cgroup/devices cgroup rw,relatime,devices 0 0 cgroup /cgroup/freezer cgroup rw,relatime,freezer 0 0 cgroup /cgroup/net_cls cgroup rw,relatime,net_cls 0 0 cgroup /cgroup/blkio cgroup rw,relatime,blkio 0 0 On Mon, Nov 12, 2012 at 11:21 AM, Marc Grimme < gr...@at... > wrote: Ok we need to look deeper into things. Could you send me /etc/cluster/cluster.conf, and - if there - /etc/cdsltab. Also the output of the following commands * cat /proc/mounts * ls -l /var/run Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 5:04:42 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc I have a boot log with the set +x (in text), much easier than jpgs.. and attatched is lvm.conf Thanks Jorge On Mon, Nov 12, 2012 at 10:53 AM, Marc Grimme < gr...@at... > wrote: Ok, one last thing. Would you send me your /etc/lvm/lvm.conf? Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... 
> Sent: Monday, November 12, 2012 4:50:19 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, please see below: [root@bwccs302 ~]# rpm -qa comoonics* comoonics-base-py-5.0-2_rhel6.noarch comoonics-bootimage-listfiles-rhel6-fencelib-5.0-1_rhel6.noarch comoonics-cdsl-py-5.0-3_rhel6.noarch comoonics-imsd-py-5.0-1_rhel6.noarch comoonics-cluster-py-5.0-2_rhel6.noarch comoonics-bootimage-listfiles-rhel6-5.0-4_rhel6.noarch comoonics-bootimage-imsd-5.0-5_rhel6.noarch comoonics-bootimage-listfiles-firmware-5.0-2_rhel6.noarch comoonics-release-5.0-2_rhel6.noarch comoonics-tools-py-5.0-2_rhel6.noarch comoonics-bootimage-initscripts-5.0-10_rhel6.noarch comoonics-imsd-plugins-py-5.0-1_rhel6.noarch comoonics-bootimage-extras-network-5.0-2_rhel6.noarch comoonics-cluster-tools-py-5.0-3_rhel6.noarch comoonics-bootimage-5.0-19_rhel6.noarch comoonics-bootimage-listfiles-rhel6-gfs2-5.0-3_rhel6.noarch comoonics-bootimage-extras-localconfigs-5.0-9_rhel6.noarch comoonics-bootimage-listfiles-all-5.0-4_rhel6.noarch comoonics-bootimage-extras-dm-multipath-rhel6-5.0-2_rhel6.noarch I only have 1 node up as I don't want to cause even more problems.. [root@bwccs302 ~]# cman_tool services fence domain member count 1 victim count 0 victim now 0 master nodeid 2 wait state none members 2 dlm lockspaces name OSRoot id 0xab5404ad flags 0x00000008 fs_reg change member 1 joined 1 remove 0 failed 0 seq 1,1 members 2 gfs mountgroups name OSRoot id 0x659f7afe flags 0x00000048 mounted change member 1 joined 1 remove 0 failed 0 seq 1,1 members 2 clvmd -d - I have verified lvm.conf locking_type=3 [root@bwccs302 ~]# clvmd -d CLVMD[f724c7a0]: Nov 12 10:41:44 CLVMD started CLVMD[f724c7a0]: Nov 12 10:41:44 Connected to CMAN CLVMD[f724c7a0]: Nov 12 10:41:44 CMAN initialisation complete CLVMD[f724c7a0]: Nov 12 10:41:44 Opened existing DLM lockspace for CLVMD. CLVMD[f724c7a0]: Nov 12 10:41:44 DLM initialisation complete CLVMD[f724c7a0]: Nov 12 10:41:44 Cluster ready, doing some more initialisation CLVMD[f724c7a0]: Nov 12 10:41:44 starting LVM thread CLVMD[f724b700]: Nov 12 10:41:44 LVM thread function started WARNING: Locking disabled. Be careful! This could corrupt your metadata. CLVMD[f724b700]: Nov 12 10:41:44 getting initial lock for T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gHE0iUJ3VIj83R3UAVmFkCUuijPMJaxxG CLVMD[f724b700]: Nov 12 10:41:44 sync_lock: 'T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gHE0iUJ3VIj83R3UAVmFkCUuijPMJaxxG' mode:1 flags=1 CLVMD[f724b700]: Nov 12 10:41:44 hold_lock. lock at 1 failed: Invalid argument CLVMD[f724b700]: Nov 12 10:41:44 Failed to hold lock T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gHE0iUJ3VIj83R3UAVmFkCUuijPMJaxxG CLVMD[f724b700]: Nov 12 10:41:44 getting initial lock for T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gaC97Wx4e1LI52FZ0CiZfDmFnTxY64eh2 CLVMD[f724b700]: Nov 12 10:41:44 sync_lock: 'T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gaC97Wx4e1LI52FZ0CiZfDmFnTxY64eh2' mode:1 flags=1 CLVMD[f724b700]: Nov 12 10:41:44 hold_lock. lock at 1 failed: Invalid argument CLVMD[f724b700]: Nov 12 10:41:44 Failed to hold lock T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gaC97Wx4e1LI52FZ0CiZfDmFnTxY64eh2 CLVMD[f724b700]: Nov 12 10:41:44 Sub thread ready for work. CLVMD[f724b700]: Nov 12 10:41:44 LVM thread waiting for work CLVMD[f724c7a0]: Nov 12 10:41:44 clvmd ready for work CLVMD[f724c7a0]: Nov 12 10:41:44 Using timeout of 60 seconds On Mon, Nov 12, 2012 at 10:33 AM, Marc Grimme < gr...@at... 
> wrote: Jorge, I would like to have a look at the following command outputs: rpm -qa comoonics* For all running nodes: cman_tool services ps axfwww | grep clvm Then try to start the clvmd manually and make it little chatty: clvmd -d And send me the output. Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 4:19:37 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, I have serial console logging (appologies for not getting this done)and I can see all the cluster services starting as expected and properly. When I log on [root@bwccs302 ~]# clustat Cluster Status for ProdCluster01 @ Mon Nov 12 10:17:26 2012 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ smc01a 1 Offline smc01b 2 Online, Local smc01c 3 Offline smc01d 4 Offline /dev/block/253:4 0 Online, Quorum Disk [root@bwccs302 ~]# vgdisplay cluster request failed: Invalid argument Can't get lock for VG_OSROOT cluster request failed: Invalid argument Can't get lock for vg_osroot [root@bwccs302 ~]# ls -al /dev/VG_OSROOT/LV_* lrwxrwxrwx 1 root root 7 Nov 12 10:14 /dev/VG_OSROOT/LV_ROOT -> ../dm-8 lrwxrwxrwx 1 root root 7 Nov 12 10:14 /dev/VG_OSROOT/LV_SWAP -> ../dm-9 On Mon, Nov 12, 2012 at 9:57 AM, Jorge Silva < me...@je... > wrote: Marc Hi, the problem is I believe I had set up an openshared root where more than one node boots up off a GFS2 fs. I believe I managed to get the system to that state, when I sent you the first screenshot Where the cluster would get started and clustered volumes would get detected in the comoonics boot stage. My problem was that once the boot process switched to the gfs fs, none of the clustered volumes were visible via any vg commands and any vg* commands failed - if I looked in the /dev/VG, I could see all the clustered volumes, however clvmd would be running at 100% any attempt to start it would fail. however all cluster resoureces and commands would work, cman_tool nodes would list nodes etc.. I have just rebuild the ramdisk and I notice that none of the cluster services has started, which is even more odd.. Thanks Jorge On Mon, Nov 12, 2012 at 9:12 AM, Marc Grimme < gr...@at... > wrote: This looks perfectly ok. The failed after activation of the vgs is because there are clustered vgs present (which again is perfectly ok). Then the bootup continues as expected as can be seen in the logs. I think I don't understand the problem you are talking about. Perhaps you could try to explain your problem in more detail. Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 3:04:03 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, thanks for you help, I got rid of some of the clustered volumes for clarity, appologies for the unorthodox screen log, I must get console logging done via serial... I booted into emergency mode, and ls -l /etc/rc3.d/S* lrwxrwxrwx 1 root root 11 Nov 7 17:21 /etc/rc3.d/S99local -> ../rc.local.comoonics Edited rc.sysinit Line, on mine is line 205 and continued. Attached is output from set-x Thanks Jorge On Mon, Nov 12, 2012 at 6:01 AM, Marc Grimme < gr...@at... > wrote: Hi Jorge, try to boot the cluster into emergency mode by adding a "1" to the boot prompt. With this you should end up in a console. 
Then issue the following commands and send me the output: ls -l /etc/rc3.d/S* Also add the following line before lvm is started (rc.sysinit Line 199): + set -x + Then we should see more at the next bootup. Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 4:43:22 AM Subject: Problem with VG activation clvmd runs at 100% Marc Hi, apologise for not getting back to you and it has been some time since we communicated. I am an Equity derivatives trader, and at the time I was helping a friend set up a trading Equity platform as a proof of concept, it was pretty low priority and was more of a learning tool for me, so I didn't spend too much time on it . I was forced to upgrade recently as this has moved from proof of concept to the next step. I apologise for bothering, but I have spent the last few days trying to get an OSR cluster running on Centos 6.3 +gfs2 and I believe I am almost there, but I am stuck, I am unsure what is going on. The cluster seems to be working ok, but climbed is running at 100% and I can restart it and still the same result. Attached is a screen shot of the final phase of booting showing the error. The cluster is quorate and shut-down to works OK. Thanks Jorge an output of vgscan: vgscan connect() failed on local socket: No such file or directory Internal cluster locking initialisation failed. WARNING: Falling back to local file-based locking. Volume Groups with the clustered attribute will be inaccessible. Reading all physical volumes. This may take a while... Skipping clustered volume group VG_OSROOT Found volume group "VG_DBDISKS" using metadata type lvm2 Skipping clustered volume group VG_SDATA Found volume group "vg_osroot" using metadata type lvm2 These are the lvm2 : lvm2-2.02.95-10.el6_3.2.x86_64 lvm2-cluster-2.02.95-10.el6_3.2.x86_64 I think these are what is causing the problem, but I'm not sure... 
|
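What was pasted under "ls -l /var/run" here is actually /proc/mounts, which is what Marc points out in his reply. A minimal sketch for grabbing both, plus the filesystem /var/run is really served from (GNU stat assumed):

  ls -al /var/run
  grep -E ' /var/(run|lock) ' /proc/mounts
  stat -f -c %T /var/run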
From: Marc G. <gr...@at...> - 2012-11-12 16:21:59
|
Ok we need to look deeper into things. Could you send me /etc/cluster/cluster.conf, and - if there - /etc/cdsltab. Also the output of the following commands * cat /proc/mounts * ls -l /var/run Marc. ----- Original Message ----- From: "Jorge Silva" <me...@je...> To: "Marc Grimme" <gr...@at...> Sent: Monday, November 12, 2012 5:04:42 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc I have a boot log with the set +x (in text), much easier than jpgs.. and attatched is lvm.conf Thanks Jorge On Mon, Nov 12, 2012 at 10:53 AM, Marc Grimme < gr...@at... > wrote: Ok, one last thing. Would you send me your /etc/lvm/lvm.conf? Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 4:50:19 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, please see below: [root@bwccs302 ~]# rpm -qa comoonics* comoonics-base-py-5.0-2_rhel6.noarch comoonics-bootimage-listfiles-rhel6-fencelib-5.0-1_rhel6.noarch comoonics-cdsl-py-5.0-3_rhel6.noarch comoonics-imsd-py-5.0-1_rhel6.noarch comoonics-cluster-py-5.0-2_rhel6.noarch comoonics-bootimage-listfiles-rhel6-5.0-4_rhel6.noarch comoonics-bootimage-imsd-5.0-5_rhel6.noarch comoonics-bootimage-listfiles-firmware-5.0-2_rhel6.noarch comoonics-release-5.0-2_rhel6.noarch comoonics-tools-py-5.0-2_rhel6.noarch comoonics-bootimage-initscripts-5.0-10_rhel6.noarch comoonics-imsd-plugins-py-5.0-1_rhel6.noarch comoonics-bootimage-extras-network-5.0-2_rhel6.noarch comoonics-cluster-tools-py-5.0-3_rhel6.noarch comoonics-bootimage-5.0-19_rhel6.noarch comoonics-bootimage-listfiles-rhel6-gfs2-5.0-3_rhel6.noarch comoonics-bootimage-extras-localconfigs-5.0-9_rhel6.noarch comoonics-bootimage-listfiles-all-5.0-4_rhel6.noarch comoonics-bootimage-extras-dm-multipath-rhel6-5.0-2_rhel6.noarch I only have 1 node up as I don't want to cause even more problems.. [root@bwccs302 ~]# cman_tool services fence domain member count 1 victim count 0 victim now 0 master nodeid 2 wait state none members 2 dlm lockspaces name OSRoot id 0xab5404ad flags 0x00000008 fs_reg change member 1 joined 1 remove 0 failed 0 seq 1,1 members 2 gfs mountgroups name OSRoot id 0x659f7afe flags 0x00000048 mounted change member 1 joined 1 remove 0 failed 0 seq 1,1 members 2 clvmd -d - I have verified lvm.conf locking_type=3 [root@bwccs302 ~]# clvmd -d CLVMD[f724c7a0]: Nov 12 10:41:44 CLVMD started CLVMD[f724c7a0]: Nov 12 10:41:44 Connected to CMAN CLVMD[f724c7a0]: Nov 12 10:41:44 CMAN initialisation complete CLVMD[f724c7a0]: Nov 12 10:41:44 Opened existing DLM lockspace for CLVMD. CLVMD[f724c7a0]: Nov 12 10:41:44 DLM initialisation complete CLVMD[f724c7a0]: Nov 12 10:41:44 Cluster ready, doing some more initialisation CLVMD[f724c7a0]: Nov 12 10:41:44 starting LVM thread CLVMD[f724b700]: Nov 12 10:41:44 LVM thread function started WARNING: Locking disabled. Be careful! This could corrupt your metadata. CLVMD[f724b700]: Nov 12 10:41:44 getting initial lock for T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gHE0iUJ3VIj83R3UAVmFkCUuijPMJaxxG CLVMD[f724b700]: Nov 12 10:41:44 sync_lock: 'T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gHE0iUJ3VIj83R3UAVmFkCUuijPMJaxxG' mode:1 flags=1 CLVMD[f724b700]: Nov 12 10:41:44 hold_lock. 
lock at 1 failed: Invalid argument CLVMD[f724b700]: Nov 12 10:41:44 Failed to hold lock T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gHE0iUJ3VIj83R3UAVmFkCUuijPMJaxxG CLVMD[f724b700]: Nov 12 10:41:44 getting initial lock for T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gaC97Wx4e1LI52FZ0CiZfDmFnTxY64eh2 CLVMD[f724b700]: Nov 12 10:41:44 sync_lock: 'T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gaC97Wx4e1LI52FZ0CiZfDmFnTxY64eh2' mode:1 flags=1 CLVMD[f724b700]: Nov 12 10:41:44 hold_lock. lock at 1 failed: Invalid argument CLVMD[f724b700]: Nov 12 10:41:44 Failed to hold lock T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gaC97Wx4e1LI52FZ0CiZfDmFnTxY64eh2 CLVMD[f724b700]: Nov 12 10:41:44 Sub thread ready for work. CLVMD[f724b700]: Nov 12 10:41:44 LVM thread waiting for work CLVMD[f724c7a0]: Nov 12 10:41:44 clvmd ready for work CLVMD[f724c7a0]: Nov 12 10:41:44 Using timeout of 60 seconds On Mon, Nov 12, 2012 at 10:33 AM, Marc Grimme < gr...@at... > wrote: Jorge, I would like to have a look at the following command outputs: rpm -qa comoonics* For all running nodes: cman_tool services ps axfwww | grep clvm Then try to start the clvmd manually and make it little chatty: clvmd -d And send me the output. Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 4:19:37 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, I have serial console logging (appologies for not getting this done)and I can see all the cluster services starting as expected and properly. When I log on [root@bwccs302 ~]# clustat Cluster Status for ProdCluster01 @ Mon Nov 12 10:17:26 2012 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ smc01a 1 Offline smc01b 2 Online, Local smc01c 3 Offline smc01d 4 Offline /dev/block/253:4 0 Online, Quorum Disk [root@bwccs302 ~]# vgdisplay cluster request failed: Invalid argument Can't get lock for VG_OSROOT cluster request failed: Invalid argument Can't get lock for vg_osroot [root@bwccs302 ~]# ls -al /dev/VG_OSROOT/LV_* lrwxrwxrwx 1 root root 7 Nov 12 10:14 /dev/VG_OSROOT/LV_ROOT -> ../dm-8 lrwxrwxrwx 1 root root 7 Nov 12 10:14 /dev/VG_OSROOT/LV_SWAP -> ../dm-9 On Mon, Nov 12, 2012 at 9:57 AM, Jorge Silva < me...@je... > wrote: Marc Hi, the problem is I believe I had set up an openshared root where more than one node boots up off a GFS2 fs. I believe I managed to get the system to that state, when I sent you the first screenshot Where the cluster would get started and clustered volumes would get detected in the comoonics boot stage. My problem was that once the boot process switched to the gfs fs, none of the clustered volumes were visible via any vg commands and any vg* commands failed - if I looked in the /dev/VG, I could see all the clustered volumes, however clvmd would be running at 100% any attempt to start it would fail. however all cluster resoureces and commands would work, cman_tool nodes would list nodes etc.. I have just rebuild the ramdisk and I notice that none of the cluster services has started, which is even more odd.. Thanks Jorge On Mon, Nov 12, 2012 at 9:12 AM, Marc Grimme < gr...@at... > wrote: This looks perfectly ok. The failed after activation of the vgs is because there are clustered vgs present (which again is perfectly ok). Then the bootup continues as expected as can be seen in the logs. I think I don't understand the problem you are talking about. Perhaps you could try to explain your problem in more detail. Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... 
> To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 3:04:03 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, thanks for you help, I got rid of some of the clustered volumes for clarity, appologies for the unorthodox screen log, I must get console logging done via serial... I booted into emergency mode, and ls -l /etc/rc3.d/S* lrwxrwxrwx 1 root root 11 Nov 7 17:21 /etc/rc3.d/S99local -> ../rc.local.comoonics Edited rc.sysinit Line, on mine is line 205 and continued. Attached is output from set-x Thanks Jorge On Mon, Nov 12, 2012 at 6:01 AM, Marc Grimme < gr...@at... > wrote: Hi Jorge, try to boot the cluster into emergency mode by adding a "1" to the boot prompt. With this you should end up in a console. Then issue the following commands and send me the output: ls -l /etc/rc3.d/S* Also add the following line before lvm is started (rc.sysinit Line 199): + set -x + Then we should see more at the next bootup. Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 4:43:22 AM Subject: Problem with VG activation clvmd runs at 100% Marc Hi, apologise for not getting back to you and it has been some time since we communicated. I am an Equity derivatives trader, and at the time I was helping a friend set up a trading Equity platform as a proof of concept, it was pretty low priority and was more of a learning tool for me, so I didn't spend too much time on it . I was forced to upgrade recently as this has moved from proof of concept to the next step. I apologise for bothering, but I have spent the last few days trying to get an OSR cluster running on Centos 6.3 +gfs2 and I believe I am almost there, but I am stuck, I am unsure what is going on. The cluster seems to be working ok, but climbed is running at 100% and I can restart it and still the same result. Attached is a screen shot of the final phase of booting showing the error. The cluster is quorate and shut-down to works OK. Thanks Jorge an output of vgscan: vgscan connect() failed on local socket: No such file or directory Internal cluster locking initialisation failed. WARNING: Falling back to local file-based locking. Volume Groups with the clustered attribute will be inaccessible. Reading all physical volumes. This may take a while... Skipping clustered volume group VG_OSROOT Found volume group "VG_DBDISKS" using metadata type lvm2 Skipping clustered volume group VG_SDATA Found volume group "vg_osroot" using metadata type lvm2 These are the lvm2 : lvm2-2.02.95-10.el6_3.2.x86_64 lvm2-cluster-2.02.95-10.el6_3.2.x86_64 I think these are what is causing the problem, but I'm not sure... 
|
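For anyone reproducing this kind of debugging session, the files and command output requested here can be collected into a single report. A rough sketch; the output path is just a placeholder:

  {
    echo '== cluster.conf =='; cat /etc/cluster/cluster.conf
    echo '== cdsltab ==';      cat /etc/cdsltab
    echo '== mounts ==';       cat /proc/mounts
    echo '== /var/run ==';     ls -l /var/run
  } > /tmp/osr-diag.txt 2>&1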
From: Marc G. <gr...@at...> - 2012-11-12 15:54:04
|
Ok, one last thing. Would you send me your /etc/lvm/lvm.conf? Thanks Marc. ----- Original Message ----- From: "Jorge Silva" <me...@je...> To: "Marc Grimme" <gr...@at...> Sent: Monday, November 12, 2012 4:50:19 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, please see below: [root@bwccs302 ~]# rpm -qa comoonics* comoonics-base-py-5.0-2_rhel6.noarch comoonics-bootimage-listfiles-rhel6-fencelib-5.0-1_rhel6.noarch comoonics-cdsl-py-5.0-3_rhel6.noarch comoonics-imsd-py-5.0-1_rhel6.noarch comoonics-cluster-py-5.0-2_rhel6.noarch comoonics-bootimage-listfiles-rhel6-5.0-4_rhel6.noarch comoonics-bootimage-imsd-5.0-5_rhel6.noarch comoonics-bootimage-listfiles-firmware-5.0-2_rhel6.noarch comoonics-release-5.0-2_rhel6.noarch comoonics-tools-py-5.0-2_rhel6.noarch comoonics-bootimage-initscripts-5.0-10_rhel6.noarch comoonics-imsd-plugins-py-5.0-1_rhel6.noarch comoonics-bootimage-extras-network-5.0-2_rhel6.noarch comoonics-cluster-tools-py-5.0-3_rhel6.noarch comoonics-bootimage-5.0-19_rhel6.noarch comoonics-bootimage-listfiles-rhel6-gfs2-5.0-3_rhel6.noarch comoonics-bootimage-extras-localconfigs-5.0-9_rhel6.noarch comoonics-bootimage-listfiles-all-5.0-4_rhel6.noarch comoonics-bootimage-extras-dm-multipath-rhel6-5.0-2_rhel6.noarch I only have 1 node up as I don't want to cause even more problems.. [root@bwccs302 ~]# cman_tool services fence domain member count 1 victim count 0 victim now 0 master nodeid 2 wait state none members 2 dlm lockspaces name OSRoot id 0xab5404ad flags 0x00000008 fs_reg change member 1 joined 1 remove 0 failed 0 seq 1,1 members 2 gfs mountgroups name OSRoot id 0x659f7afe flags 0x00000048 mounted change member 1 joined 1 remove 0 failed 0 seq 1,1 members 2 clvmd -d - I have verified lvm.conf locking_type=3 [root@bwccs302 ~]# clvmd -d CLVMD[f724c7a0]: Nov 12 10:41:44 CLVMD started CLVMD[f724c7a0]: Nov 12 10:41:44 Connected to CMAN CLVMD[f724c7a0]: Nov 12 10:41:44 CMAN initialisation complete CLVMD[f724c7a0]: Nov 12 10:41:44 Opened existing DLM lockspace for CLVMD. CLVMD[f724c7a0]: Nov 12 10:41:44 DLM initialisation complete CLVMD[f724c7a0]: Nov 12 10:41:44 Cluster ready, doing some more initialisation CLVMD[f724c7a0]: Nov 12 10:41:44 starting LVM thread CLVMD[f724b700]: Nov 12 10:41:44 LVM thread function started WARNING: Locking disabled. Be careful! This could corrupt your metadata. CLVMD[f724b700]: Nov 12 10:41:44 getting initial lock for T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gHE0iUJ3VIj83R3UAVmFkCUuijPMJaxxG CLVMD[f724b700]: Nov 12 10:41:44 sync_lock: 'T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gHE0iUJ3VIj83R3UAVmFkCUuijPMJaxxG' mode:1 flags=1 CLVMD[f724b700]: Nov 12 10:41:44 hold_lock. lock at 1 failed: Invalid argument CLVMD[f724b700]: Nov 12 10:41:44 Failed to hold lock T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gHE0iUJ3VIj83R3UAVmFkCUuijPMJaxxG CLVMD[f724b700]: Nov 12 10:41:44 getting initial lock for T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gaC97Wx4e1LI52FZ0CiZfDmFnTxY64eh2 CLVMD[f724b700]: Nov 12 10:41:44 sync_lock: 'T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gaC97Wx4e1LI52FZ0CiZfDmFnTxY64eh2' mode:1 flags=1 CLVMD[f724b700]: Nov 12 10:41:44 hold_lock. lock at 1 failed: Invalid argument CLVMD[f724b700]: Nov 12 10:41:44 Failed to hold lock T80WbeSaEpLbxfJKk9yfaeLc4HRSXD4gaC97Wx4e1LI52FZ0CiZfDmFnTxY64eh2 CLVMD[f724b700]: Nov 12 10:41:44 Sub thread ready for work. CLVMD[f724b700]: Nov 12 10:41:44 LVM thread waiting for work CLVMD[f724c7a0]: Nov 12 10:41:44 clvmd ready for work CLVMD[f724c7a0]: Nov 12 10:41:44 Using timeout of 60 seconds On Mon, Nov 12, 2012 at 10:33 AM, Marc Grimme < gr...@at... 
> wrote: Jorge, I would like to have a look at the following command outputs: rpm -qa comoonics* For all running nodes: cman_tool services ps axfwww | grep clvm Then try to start the clvmd manually and make it little chatty: clvmd -d And send me the output. Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 4:19:37 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, I have serial console logging (appologies for not getting this done)and I can see all the cluster services starting as expected and properly. When I log on [root@bwccs302 ~]# clustat Cluster Status for ProdCluster01 @ Mon Nov 12 10:17:26 2012 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ smc01a 1 Offline smc01b 2 Online, Local smc01c 3 Offline smc01d 4 Offline /dev/block/253:4 0 Online, Quorum Disk [root@bwccs302 ~]# vgdisplay cluster request failed: Invalid argument Can't get lock for VG_OSROOT cluster request failed: Invalid argument Can't get lock for vg_osroot [root@bwccs302 ~]# ls -al /dev/VG_OSROOT/LV_* lrwxrwxrwx 1 root root 7 Nov 12 10:14 /dev/VG_OSROOT/LV_ROOT -> ../dm-8 lrwxrwxrwx 1 root root 7 Nov 12 10:14 /dev/VG_OSROOT/LV_SWAP -> ../dm-9 On Mon, Nov 12, 2012 at 9:57 AM, Jorge Silva < me...@je... > wrote: Marc Hi, the problem is I believe I had set up an openshared root where more than one node boots up off a GFS2 fs. I believe I managed to get the system to that state, when I sent you the first screenshot Where the cluster would get started and clustered volumes would get detected in the comoonics boot stage. My problem was that once the boot process switched to the gfs fs, none of the clustered volumes were visible via any vg commands and any vg* commands failed - if I looked in the /dev/VG, I could see all the clustered volumes, however clvmd would be running at 100% any attempt to start it would fail. however all cluster resoureces and commands would work, cman_tool nodes would list nodes etc.. I have just rebuild the ramdisk and I notice that none of the cluster services has started, which is even more odd.. Thanks Jorge On Mon, Nov 12, 2012 at 9:12 AM, Marc Grimme < gr...@at... > wrote: This looks perfectly ok. The failed after activation of the vgs is because there are clustered vgs present (which again is perfectly ok). Then the bootup continues as expected as can be seen in the logs. I think I don't understand the problem you are talking about. Perhaps you could try to explain your problem in more detail. Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 3:04:03 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, thanks for you help, I got rid of some of the clustered volumes for clarity, appologies for the unorthodox screen log, I must get console logging done via serial... I booted into emergency mode, and ls -l /etc/rc3.d/S* lrwxrwxrwx 1 root root 11 Nov 7 17:21 /etc/rc3.d/S99local -> ../rc.local.comoonics Edited rc.sysinit Line, on mine is line 205 and continued. Attached is output from set-x Thanks Jorge On Mon, Nov 12, 2012 at 6:01 AM, Marc Grimme < gr...@at... > wrote: Hi Jorge, try to boot the cluster into emergency mode by adding a "1" to the boot prompt. With this you should end up in a console. 
Then issue the following commands and send me the output: ls -l /etc/rc3.d/S* Also add the following line before lvm is started (rc.sysinit Line 199): + set -x + Then we should see more at the next bootup. Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 4:43:22 AM Subject: Problem with VG activation clvmd runs at 100% Marc Hi, apologise for not getting back to you and it has been some time since we communicated. I am an Equity derivatives trader, and at the time I was helping a friend set up a trading Equity platform as a proof of concept, it was pretty low priority and was more of a learning tool for me, so I didn't spend too much time on it . I was forced to upgrade recently as this has moved from proof of concept to the next step. I apologise for bothering, but I have spent the last few days trying to get an OSR cluster running on Centos 6.3 +gfs2 and I believe I am almost there, but I am stuck, I am unsure what is going on. The cluster seems to be working ok, but climbed is running at 100% and I can restart it and still the same result. Attached is a screen shot of the final phase of booting showing the error. The cluster is quorate and shut-down to works OK. Thanks Jorge an output of vgscan: vgscan connect() failed on local socket: No such file or directory Internal cluster locking initialisation failed. WARNING: Falling back to local file-based locking. Volume Groups with the clustered attribute will be inaccessible. Reading all physical volumes. This may take a while... Skipping clustered volume group VG_OSROOT Found volume group "VG_DBDISKS" using metadata type lvm2 Skipping clustered volume group VG_SDATA Found volume group "vg_osroot" using metadata type lvm2 These are the lvm2 : lvm2-2.02.95-10.el6_3.2.x86_64 lvm2-cluster-2.02.95-10.el6_3.2.x86_64 I think these are what is causing the problem, but I'm not sure... 
|
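The odd part of the clvmd -d output above is the "WARNING: Locking disabled" line even though locking_type=3 is set, which is why the lvm.conf is being requested. A sketch for checking which locking settings are actually in effect on the root filesystem and inside the comoonics chroot (paths as used elsewhere in this thread):

  grep -E '^[[:space:]]*locking_(type|library)' /etc/lvm/lvm.conf
  com-chroot grep -E '^[[:space:]]*locking_(type|library)' /etc/lvm/lvm.conf

If the two disagree, the initrd probably carries a stale lvm.conf, and regenerating it (as already tried above with mkinitrd) would be the step to revisit.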
From: Marc G. <gr...@at...> - 2012-11-12 15:33:42
|
Jorge, I would like to have a look at the following command outputs: rpm -qa comoonics* For all running nodes: cman_tool services ps axfwww | grep clvm Then try to start the clvmd manually and make it little chatty: clvmd -d And send me the output. Thanks Marc. ----- Original Message ----- From: "Jorge Silva" <me...@je...> To: "Marc Grimme" <gr...@at...> Sent: Monday, November 12, 2012 4:19:37 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, I have serial console logging (appologies for not getting this done)and I can see all the cluster services starting as expected and properly. When I log on [root@bwccs302 ~]# clustat Cluster Status for ProdCluster01 @ Mon Nov 12 10:17:26 2012 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ smc01a 1 Offline smc01b 2 Online, Local smc01c 3 Offline smc01d 4 Offline /dev/block/253:4 0 Online, Quorum Disk [root@bwccs302 ~]# vgdisplay cluster request failed: Invalid argument Can't get lock for VG_OSROOT cluster request failed: Invalid argument Can't get lock for vg_osroot [root@bwccs302 ~]# ls -al /dev/VG_OSROOT/LV_* lrwxrwxrwx 1 root root 7 Nov 12 10:14 /dev/VG_OSROOT/LV_ROOT -> ../dm-8 lrwxrwxrwx 1 root root 7 Nov 12 10:14 /dev/VG_OSROOT/LV_SWAP -> ../dm-9 On Mon, Nov 12, 2012 at 9:57 AM, Jorge Silva < me...@je... > wrote: Marc Hi, the problem is I believe I had set up an openshared root where more than one node boots up off a GFS2 fs. I believe I managed to get the system to that state, when I sent you the first screenshot Where the cluster would get started and clustered volumes would get detected in the comoonics boot stage. My problem was that once the boot process switched to the gfs fs, none of the clustered volumes were visible via any vg commands and any vg* commands failed - if I looked in the /dev/VG, I could see all the clustered volumes, however clvmd would be running at 100% any attempt to start it would fail. however all cluster resoureces and commands would work, cman_tool nodes would list nodes etc.. I have just rebuild the ramdisk and I notice that none of the cluster services has started, which is even more odd.. Thanks Jorge On Mon, Nov 12, 2012 at 9:12 AM, Marc Grimme < gr...@at... > wrote: This looks perfectly ok. The failed after activation of the vgs is because there are clustered vgs present (which again is perfectly ok). Then the bootup continues as expected as can be seen in the logs. I think I don't understand the problem you are talking about. Perhaps you could try to explain your problem in more detail. Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 3:04:03 PM Subject: Re: Problem with VG activation clvmd runs at 100% Marc Hi, thanks for you help, I got rid of some of the clustered volumes for clarity, appologies for the unorthodox screen log, I must get console logging done via serial... I booted into emergency mode, and ls -l /etc/rc3.d/S* lrwxrwxrwx 1 root root 11 Nov 7 17:21 /etc/rc3.d/S99local -> ../rc.local.comoonics Edited rc.sysinit Line, on mine is line 205 and continued. Attached is output from set-x Thanks Jorge On Mon, Nov 12, 2012 at 6:01 AM, Marc Grimme < gr...@at... > wrote: Hi Jorge, try to boot the cluster into emergency mode by adding a "1" to the boot prompt. With this you should end up in a console. 
Then issue the following commands and send me the output: ls -l /etc/rc3.d/S* Also add the following line before lvm is started (rc.sysinit Line 199): + set -x + Then we should see more at the next bootup. Thanks Marc. ----- Original Message ----- From: "Jorge Silva" < me...@je... > To: "Marc Grimme" < gr...@at... > Sent: Monday, November 12, 2012 4:43:22 AM Subject: Problem with VG activation clvmd runs at 100% Marc Hi, apologise for not getting back to you and it has been some time since we communicated. I am an Equity derivatives trader, and at the time I was helping a friend set up a trading Equity platform as a proof of concept, it was pretty low priority and was more of a learning tool for me, so I didn't spend too much time on it . I was forced to upgrade recently as this has moved from proof of concept to the next step. I apologise for bothering, but I have spent the last few days trying to get an OSR cluster running on Centos 6.3 +gfs2 and I believe I am almost there, but I am stuck, I am unsure what is going on. The cluster seems to be working ok, but climbed is running at 100% and I can restart it and still the same result. Attached is a screen shot of the final phase of booting showing the error. The cluster is quorate and shut-down to works OK. Thanks Jorge an output of vgscan: vgscan connect() failed on local socket: No such file or directory Internal cluster locking initialisation failed. WARNING: Falling back to local file-based locking. Volume Groups with the clustered attribute will be inaccessible. Reading all physical volumes. This may take a while... Skipping clustered volume group VG_OSROOT Found volume group "VG_DBDISKS" using metadata type lvm2 Skipping clustered volume group VG_SDATA Found volume group "vg_osroot" using metadata type lvm2 These are the lvm2 : lvm2-2.02.95-10.el6_3.2.x86_64 lvm2-cluster-2.02.95-10.el6_3.2.x86_64 I think these are what is causing the problem, but I'm not sure... 
|
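If more than one node were up, the same data would be wanted from each of them. A rough sketch of gathering it over ssh, using the node names from the cluster.conf quoted in this thread purely as placeholders and assuming root ssh access between nodes (the [c] in the grep pattern just keeps grep from matching its own process):

  for n in smc01a smc01b smc01c smc01d; do
    echo "==== $n ===="
    ssh root@$n 'rpm -qa "comoonics*"; cman_tool services; ps axfwww | grep "[c]lvm"'
  done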
From: Marc G. <gr...@at...> - 2012-11-12 14:12:21
|
This looks perfectly OK. The FAILED after activation of the VGs appears because clustered VGs are present (which again is perfectly OK). The bootup then continues as expected, as can be seen in the logs. I don't think I understand the problem you are talking about; perhaps you could explain it in more detail.
Thanks
Marc.

----- Original Message -----
From: "Jorge Silva" <me...@je...>
To: "Marc Grimme" <gr...@at...>
Sent: Monday, November 12, 2012 3:04:03 PM
Subject: Re: Problem with VG activation clvmd runs at 100%

Marc

Hi, thanks for your help. I got rid of some of the clustered volumes for clarity; apologies for the unorthodox screen log, I must get console logging set up via serial...

I booted into emergency mode, and ls -l /etc/rc3.d/S* gives:

lrwxrwxrwx 1 root root 11 Nov 7 17:21 /etc/rc3.d/S99local -> ../rc.local.comoonics

I edited rc.sysinit (on mine the line is 205) and continued. Attached is the output from set -x.

Thanks
Jorge
|
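Marc's point that "Skipping clustered volume group ..." is expected behaviour can be checked directly, since the clustered flag is a per-VG attribute. The commands below are an illustrative sketch only, not taken from the thread: the VG name is reused from the vgscan output above, and the --config override is the commonly documented way to touch a clustered VG while clvmd is unavailable, so use it with care.

  # Show which VGs carry the clustered flag ('c' in the sixth attr column).
  vgs -o vg_name,vg_attr

  # Clear the clustered flag on a VG that should not be clustered; with clvmd
  # down, LVM has to be told to skip locking for this one command.
  vgchange -cn VG_SDATA --config 'global {locking_type = 0}'

  # Set it back once cman and clvmd are healthy again.
  service clvmd start
  vgchange -cy VG_SDATA

  # If clvmd itself spins at 100% CPU, running it in the foreground with debug
  # output often shows what it is waiting for (on this lvm2 version, -d without
  # a value should mean debug to stderr in the foreground).
  service clvmd stop
  clvmd -d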
From: Marc G. <gr...@at...> - 2012-11-12 11:36:31
|
Hi Jorge,
try to boot the cluster into emergency mode by adding a "1" to the boot prompt. With this you should end up in a console.

Then issue the following commands and send me the output:

ls -l /etc/rc3.d/S*

Also add the following line before lvm is started (rc.sysinit line 199):

+ set -x +

Then we should see more at the next bootup.
Thanks
Marc.
|
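For readers unfamiliar with Marc's "adding a '1' to the boot prompt": on RHEL 6 with GRUB legacy this appends runlevel 1 to the kernel command line, so the node stops at a single-user console instead of starting the full runlevel 3 service set. A sketch of the one-off edit; the kernel version string and the rest of the arguments are placeholders for whatever your menu entry already contains:

  # At the GRUB menu: press 'e' on the entry, select the kernel line, press 'e'
  # again, append " 1" to the existing arguments, press Enter, then 'b' to boot.
  #
  # Example of the edited kernel line (placeholder version and arguments):
  kernel /vmlinuz-2.6.32-279.el6.x86_64 ro <existing arguments> 1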
From: Marc G. <gr...@at...> - 2011-12-04 13:16:11
|
Hello everyone,

just a short note to make you aware that we are also posting information on Twitter. Anybody who is interested can follow us at https://twitter.com/#!/OpenSharedroot. We'll try to give more information on background developments in both the com.oonics project and Open-Sharedroot, and we'll post community information as well as information on professional usage of the com.oonics Enterprise IT Platform. So just follow us to get more information.

Have fun
Marc.

______________________________________________________________________________
Marc Grimme
E-Mail: gr...@at...
ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 | 85716 Unterschleissheim | www.atix.de
Buy Enterprise Linux online: www.linux-subscriptions.com
Court of registration: Amtsgericht München, registration number: HRB 168930, VAT ID: DE209485962 | Executive board: Thomas Merz (chairman), Marc Grimme, Mark Hlawatschek, Jan R. Bergrath | Chairman of the supervisory board: Dr. Martin Buss
|