From: Marc G. <gr...@at...> - 2009-06-08 09:19:31
Hi Klaus,

On Thursday 28 May 2009 09:17:04 Klaus Steinberger wrote:
> Hi,
>
> because of some testing of a samba/ctdb Cluster on top of OSR I tweaked
> around with the Posix Locking Rate Limit of gfs_controld. The default
> for plock_rate_limit is "100", which is quite low.
>
> I raised the limit now on one of my SL 5.3/GFS OSR clusters (a virtual
> one) to 10000, which seems to be higher than the real rate (I measured
> up to 2200 / sec using ping_pong).
>
> I see a very interesting side effect:
>
> The startup of nodes seems to be quite a bit faster now. Especially the
> ever slow udev startup after changing root runs like a charm.
>
> So maybe somebody could try to confirm that?
>
> To raise plock_rate_limit put the following line into
> /etc/cluster/cluster.conf:
>
> <gfs_controld plock_rate_limit="10000"/>
>
> Please be aware that gfs_controld cannot be restarted or reconfigured on
> a running node! A node has to be rebooted to change plock_rate_limit.
>
> Sincerely,
> Klaus

I've looked at your idea and could confirm it as well. Although I couldn't get udev to take very long, I turned to a test program we've written for such purposes and got _VERY_ impressive results. The program just creates files that are flocked like this:

int flock_files() {
  char* filename;
  int fd;
  int i;
  int count;
  for (i=0; i<count; i++) {
    filename=malloc(50*sizeof(char));
    sprintf(filename, "%s/test-%i-%i", dir, pnumber, i);
    //printf("filename: %s/test-%i-%i\n", dir, pnumber, i);
    fd=open(filename,O_SYNC|O_RDWR|O_CREAT,0644);
    flock(fd, LOCK_EX);
    //close(fd);
  }
}

These are the results on a two-node GFS cluster virtualized on Xen. The short summary: with default settings, creating 10000 fcntl locks with 10 processes takes ~20 seconds; with plock_rate_limit=10000 it takes ~2 seconds.

Description:

1.1 Default plock_rate_limit:

<cluster config_version='3' name='axqad106'>
  <!-- <gfs_controld plock_rate_limit="10000"/>-->
  <clusternodes>
  ...
</cluster>

1.2.1 Node1:

time /atix/projects/com.oonics/nashead2004/management/comoonics-benchmarks/write_files /tmp/test 100 1000 10 5
Process 10 fcntl-locking 1000 files

real    0m19.497s
user    0m0.008s
sys     0m0.064s

1.2.2 Node2:

time /atix/projects/com.oonics/nashead2004/management/comoonics-benchmarks/write_files /tmp/test 100 1000 10 5
Process 10 fcntl-locking 1000 files

real    0m19.974s
user    0m0.000s
sys     0m0.044s

2.1 plock_rate_limit=10000

<cluster config_version='2' name='axqad106'>
  <gfs_controld plock_rate_limit="10000"/>
  <clusternodes>
  ...
</cluster>

2.2.1 Node1

time /atix/projects/com.oonics/nashead2004/management/comoonics-benchmarks/write_files /tmp/test 100 1000 10 5
Process 10 fcntl-locking 1000 files

real    0m2.205s
user    0m0.000s
sys     0m0.080s

2.2.2 Node2

time /atix/projects/com.oonics/nashead2004/management/comoonics-benchmarks/write_files /tmp/test 100 1000 10 5
Process 10 fcntl-locking 1000 files

real    0m2.217s
user    0m0.000s
sys     0m0.068s

That's quite something (a factor of roughly 10), I think. The consequence would be a recommendation for applications that use a huge number of posix/fcntl locks (e.g. samba -> tdb files) to experiment with this setting.

Again, thanks very much for pointing this out.

--
Gruss / Regards,

Marc Grimme
http://www.atix.de/ http://www.open-sharedroot.org/
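The flock_files() fragment above is not compilable as posted: count is never set, dir and pnumber come from surrounding code that was not included, and the headers are missing. For readers who want to reproduce the test without the comoonics-benchmarks tree, here is a minimal self-contained sketch of the same idea. The command-line interface (directory, process number, file count) is an assumption for illustration, not the interface of the real write_files tool, and note that the fragment uses BSD flock() even though the timing output speaks of fcntl locks.

    /* Self-contained sketch of Marc's flock test: create <count> files in
     * <dir> and take an exclusive flock() on each one.  The files are left
     * open and locked on purpose so the locks persist for the duration of
     * the run.  The argv interface here is an assumption for illustration. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/file.h>

    static int flock_files(const char *dir, int pnumber, int count)
    {
        char filename[256];
        int i;

        for (i = 0; i < count; i++) {
            int fd;

            snprintf(filename, sizeof(filename), "%s/test-%i-%i", dir, pnumber, i);
            fd = open(filename, O_SYNC | O_RDWR | O_CREAT, 0644);
            if (fd < 0) {
                perror(filename);
                return -1;
            }
            if (flock(fd, LOCK_EX) < 0) {   /* exclusive lock, held until exit */
                perror("flock");
                close(fd);
                return -1;
            }
        }
        return 0;
    }

    int main(int argc, char **argv)
    {
        if (argc != 4) {
            fprintf(stderr, "usage: %s <dir> <process-number> <count>\n", argv[0]);
            return 1;
        }
        return flock_files(argv[1], atoi(argv[2]), atoi(argv[3])) ? 1 : 0;
    }

Running one instance per process on each node against a directory on the shared filesystem approximates what the write_files invocations in the measurements above do.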
From: Stefano E. <ste...@so...> - 2009-06-04 15:57:27
Hi,

I don't have the directory /dev/fd, but I do have /dev/fd0; however, I created the symlink:

ln -s /proc/self/fd /dev/fd

The service now starts on IP address 10.43.100.204 and can also relocate to clu02 if node clu01 goes down, but when the service starts I read in the messages log file:

Jun 4 16:46:11 clu01 clurgmgrd[15051]: <notice> Starting disabled service service:RHTTPD
Jun 4 16:46:12 clu01 in.rdiscd[15537]: setsockopt (IP_ADD_MEMBERSHIP): Address already in use
Jun 4 16:46:12 clu01 in.rdiscd[15537]: Failed joining addresses
Jun 4 16:46:13 clu01 clurgmgrd[15051]: <notice> Service service:RHTTPD started

How do I make the symlink I created permanent? After creating the symlink, do I have to build a new initrd?

Thanks.

Ing. Stefano Elmopi
Gruppo Darco - Area ICT Sistemi
Via Ostiense 131/L Corpo B, 00154 Roma

cell. 3466147165
tel. 0657060500
email:ste...@so...

> Message: 2
> Date: Mon, 1 Jun 2009 22:02:30 +0200
> From: Mark Hlawatschek <hla...@at...>
> Subject: Re: [OSR-users] Relocate a service
> To: ope...@li...
> Message-ID: <200...@at...>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Stefano,
>
> the errors indicate, that the fd devices have not been configured
> correctly.
> udev is responsible for doing this in RHEL.
> You can workaround this issue by creating the following symlink:
> ln -s /proc/self/fd /dev/fd
>
> Hope this helps !
>
> Mark
>
>> rg_test test /etc/cluster/cluster.conf start service RHTTPD
>> Running in test mode.
>> Starting RHTTPD...
>> /usr/share/cluster/ip.sh: line 583: /dev/fd/62: No such file or
>> directory
>> /usr/share/cluster/ip.sh: line 673: /dev/fd/62: No such file or
>> directory
>> Failed to start RHTTPD
>> /usr/share/cluster/ip.sh: line 583: /dev/fd/61: No such file or
From: Mark H. <hla...@at...> - 2009-06-04 10:11:15
|
Hi Stefano, your changes are breaking the logic of the ip.sh resource agent. for example: your changes: #CHANGED !!! /sbin/ip -o -f inet addr | awk '{print $1,$2,$4}' | while read idx dev ifaddr; do isSlave $dev if [ $? -ne 2 ]; then continue fi idx=${idx/:/} echo $dev ${ifaddr/\/*/} ${ifaddr/*\//} #done < <(/sbin/ip -o -f inet addr | awk '{print $1,$2,$4}') done In the while loop, the redirection operator < <(cmd) provides the stdin for the read command. Please note, that the redirection requires the /dev/fd/XX files. (See my previous mail) To verify the redirection mechanism try something like this: # cat < <(ls -l /etc/) -Mark On Wednesday 03 June 2009 14:48:59 Stefano Elmopi wrote: > Hi Mark, > > I changed two lines of the script /usr/share/cluster/ip.sh, > I have attached the script and the lines that I have changed are > immediately below the written CHANGED. > Now the service httpd start on the new ip address (10.43.100.204), and > if the nodo_1 goes down, > the service is relocated on nodo_2. > when I start the service, in the log messages I have: > > Jun 3 13:56:58 clu01 clurgmgrd[14899]: <notice> Starting disabled > service service:RHTTPD > Jun 3 13:56:59 clu01 in.rdiscd[15391]: setsockopt > (IP_ADD_MEMBERSHIP): Address already in use > Jun 3 13:56:59 clu01 in.rdiscd[15391]: Failed joining addresses > Jun 3 13:57:00 clu01 clurgmgrd[14899]: <notice> Service > service:RHTTPD started > > but despite this, the service httpd works. > I hope that the information that I am writing you, you are useful. > > > Bye. > > > > > > Ing. Stefano Elmopi > Gruppo Darco - Area ICT Sistemi > Via Ostiense 131/L Corpo B, 00154 Roma > > cell. 3466147165 > tel. 0657060500 > email:ste...@so... > > Il giorno 01/giu/09, alle ore 11:40, Stefano Elmopi ha scritto: > > Hi Mark, > > > > my cluster.conf is: > > > > <?xml version="1.0"?> > > <cluster config_version="5" name="cluOCFS2" type="ocfs2"> > > > > <cman expected_votes="1" two_node="1"/> > > > > <clusternodes> > > > > <clusternode name="clu01" votes="1" nodeid="1"> > > <com_info> > > <syslog name="clu01"/> > > <rootvolume name="/dev/sda2" fstype="ocfs2"/> > > <eth name="eth0" ip="10.43.100.203" > > mac="00:15:60:56:75:FD"/> > > <fenceackserver user="root" passwd="test123"/> > > </com_info> > > </clusternode> > > > > <clusternode name="clu02" votes="1" nodeid="2"> > > <com_info> > > <syslog name="clu01"/> > > <rootvolume name="/dev/sda2" fstype="ocfs2"/> > > <eth name="eth0" ip="10.43.105.15" > > mac="00:15:60:56:77:11"/> > > <fenceackserver user="root" passwd="test123"/> > > </com_info> > > </clusternode> > > > > <rm log_level="7" log_facility="local4"> > > <failoverdomains> > > <failoverdomain name="failover" ordered="1"> > > <failoverdomainnode name="clu01" > > priority="1"/> > > <failoverdomainnode name="clu02" > > priority="2"/> > > </failoverdomain> > > </failoverdomains> > > <resources> > > <ip address="10.43.100.204" monitor_link="1"/> > > <script file="/etc/init.d/httpd" > > name="rhttpd"/> > > </resources> > > <service autostart="0" domain="failover" > > name="RHTTPD"> > > <ip ref="10.43.100.204"/> > > <script ref="rhttpd"/> > > </service> > > </rm> > > > > </clusternodes> > > > > </cluster> > > > > and I added the line from your email: > > > > local4.debug /var/log/rgmanager.log to /etc/syslog.conf > > > > then I rebooted syslog but in the file rgmanager.log is logged only > > when CMAN start, > > while rgmanager is logged only in the file /va/log/messages but > > there is no additional information. 
> > Perhaps additional information can come from this tool, I hope: > > > > rg_test test /etc/cluster/cluster.conf start service RHTTPD > > Running in test mode. > > Starting RHTTPD... > > /usr/share/cluster/ip.sh: line 583: /dev/fd/62: No such file or > > directory > > /usr/share/cluster/ip.sh: line 673: /dev/fd/62: No such file or > > directory > > Failed to start RHTTPD > > /usr/share/cluster/ip.sh: line 583: /dev/fd/61: No such file or > > directory > > +++ Memory table dump +++ > > 0xb77306e4 (8 bytes) allocation trace: > > 0xb7734e74 (8 bytes) allocation trace: > > 0xb774aa6c (16 bytes) allocation trace: > > 0xb774b8d0 (16 bytes) allocation trace: > > 0xb77357f0 (16 bytes) allocation trace: > > 0xb774a9f4 (52 bytes) allocation trace: > > 0xb7741194 (912 bytes) allocation trace: > > --- End Memory table dump --- > > > > > > > > > > Bye > > > > > > Ing. Stefano Elmopi > > Gruppo Darco - Area ICT Sistemi > > Via Ostiense 131/L Corpo B, 00154 Roma > > > > cell. 3466147165 > > tel. 0657060500 > > email:ste...@so... > > > > Il giorno 28/mag/09, alle ore 21:00, Marc Grimme ha scritto: > >> On Thursday 28 May 2009 17:14:49 Stefano Elmopi wrote: > >>> Hi Mark, > >>> > >>> I have changed the service element from: > >>> > >>> <service autostart="0" domain="failover" name="RHTTPD"> > >>> <ip ref="10.43.100.204"/> > >>> <script ref="/etc/init.d/httpd"/> > >>> </service> > >>> > >>> to: > >>> > >>> <service autostart="0" domain="failover" name="RHTTPD"> > >>> <ip ref="10.43.100.204"/> > >>> <script ref="httpd"/> > >>> </service> > >>> > >>> but does not change the result, if I type clusvcadm -e RHTTPD the > >>> service fails and the messeges log: > >>> > >>> May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Starting disabled > >>> service service:RHTTPD > >>> May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> start on ip > >>> "10.43.100.204" returned 1 (generic error) > >> > >> Hmm, you could extend logging by catching debug messages from > >> rgmanager by > >> adding the line > >> local4.debug /var/log/rgmanager.log > >> to /etc/syslog.conf then restart syslog. > >> See if you can get more information from this file. > >> > >>> May 28 15:09:00 clu01 clurgmgrd[15046]: <warning> #68: Failed to > >>> start > >>> service:RHTTPD; return value: 1 > >>> May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Stopping service > >>> service:RHTTPD > >>> May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Service > >>> service:RHTTPD is recovering > >>> May 28 15:09:00 clu01 clurgmgrd[15046]: <warning> #71: Relocating > >>> failed service service:RHTTPD > >>> May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Service > >>> service:RHTTPD is stopped > >>> > >>> a consideration, when rgmanager start, I should not ping the IP > >>> address 10.43.100.204 ?? > >>> > >>> the result of tool rg_test is: > >>> > >>> [root@clu01 ~]# rg_test test /etc/cluster/cluster.conf > >>> Running in test mode. 
> >>> Loaded 22 resource rules > >>> === Resources List === > >>> Resource type: script > >>> Agent: script.sh > >>> Attributes: > >>> name = httpd [ primary unique ] > >>> file = /etc/init.d/httpd [ unique required ] > >>> service_name [ inherit("service%name") ] > >>> > >>> Resource type: ip > >>> Instances: 1/1 > >>> Agent: ip.sh > >>> Attributes: > >>> address = 10.43.100.204 [ primary unique ] > >>> monitor_link = 1 > >>> nfslock [ inherit("service%nfslock") ] > >>> > >>> Resource type: service [INLINE] > >>> Instances: 1/1 > >>> Agent: service.sh > >>> Attributes: > >>> name = RHTTPD [ primary unique required ] > >>> domain = failover [ reconfig ] > >>> autostart = 0 [ reconfig ] > >>> hardrecovery = 0 [ reconfig ] > >>> exclusive = 0 [ reconfig ] > >>> nfslock = 0 > >>> recovery = restart [ reconfig ] > >>> depend_mode = hard > >>> max_restarts = 0 > >>> restart_expire_time = 0 > >>> > >>> === Resource Tree === > >>> service { > >>> name = "RHTTPD"; > >>> domain = "failover"; > >>> autostart = "0"; > >>> hardrecovery = "0"; > >>> exclusive = "0"; > >>> nfslock = "0"; > >>> recovery = "restart"; > >>> depend_mode = "hard"; > >>> max_restarts = "0"; > >>> restart_expire_time = "0"; > >>> ip { > >>> address = "10.43.100.204"; > >>> monitor_link = "1"; > >>> nfslock = "0"; > >>> } > >>> script { > >>> name = "httpd"; > >>> file = "/etc/init.d/httpd"; > >>> service_name = "RHTTPD"; > >>> } > >>> } > >>> === Failover Domains === > >>> Failover domain: failover > >>> Flags: Ordered > >>> Node clu01 (id 1, priority 1) > >>> Node clu02 (id 2, priority 2) > >>> === Event Triggers === > >>> Event Priority Level 100: > >>> Name: Default > >>> (Any event) > >>> File: /usr/share/cluster/default_event_script.sl > >>> +++ Memory table dump +++ > >>> 0xb77756e4 (8 bytes) allocation trace: > >>> 0xb7779e74 (8 bytes) allocation trace: > >>> 0xb778fce4 (52 bytes) allocation trace: > >>> --- End Memory table dump --- > >>> > >>> > >>> if I add the line: > >>> > >>> <eth name="eth1" ip="10.43.100.204" mac="00:15:60:56:75:FC"/> > >>> > >>> to section <com_info> of the clu01, the service start: > >>> > >>> /etc/init.d/rgmanager start > >>> Starting Cluster Service Manager: [ OK ] > >>> > >>> the log is: > >>> > >>> May 28 16:59:21 clu01 kernel: dlm: Using TCP for communications > >>> May 28 16:59:30 clu01 clurgmgrd[15209]: <notice> Resource Group > >>> Manager Starting > >>> May 28 16:59:31 clu01 clurgmgrd: [15209]: <err> Failed to remove > >>> 10.43.100.204 > >>> May 28 16:59:31 clu01 clurgmgrd[15209]: <notice> stop on ip > >>> "10.43.100.204" returned 1 (generic error) > >> > >> That's clear. This ip is already setup by the bootprocess. So it > >> cannot be > >> setup. > >> > >>> clustat > >>> Cluster Status for cluOCFS2 @ Thu May 28 17:00:22 2009 > >>> Member Status: Quorate > >>> > >>> Member Name ID > >>> Status > >>> ------ ---- > >>> ---- > >>> ------ > >>> clu01 > >>> 1 Online, Local, rgmanager > >>> clu02 > >>> 2 Offline > >>> > >>> Service Name > >>> Owner (Last) > >>> State > >>> ------- ---- > >>> ----- ------ > >>> ----- > >>> service:RHTTPD > >>> (none) > >>> disabled > >>> > >>> and: > >>> > >>> clusvcadm -e RHTTPD > >>> Local machine trying to enable service:RHTTPD...Success > >>> service:RHTTPD is now running on clu01 > >>> > >>> but in this case the service does not relocate with the same ip !! > >>> > >>> > >>> > >>> Bye > >>> > >>> > >>> > >>> > >>> > >>> Ing. Stefano Elmopi > >>> Gruppo Darco - Area ICT Sistemi > >>> Via Ostiense 131/L Corpo B, 00154 Roma > >>> > >>> cell. 
3466147165 > >>> tel. 0657060500 > >>> email:ste...@so... > >> > >> -- > >> Gruss / Regards, > >> > >> Marc Grimme > >> http://www.atix.de/ http://www.open-sharedroot.org/ |
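A short aside on Mark's explanation above: bash implements <(cmd) and < <(cmd) by handing the command a pathname of the form /dev/fd/NN that refers to one end of a pipe, which is why ip.sh fails exactly at those redirections when /dev/fd is missing. The same mechanism can be demonstrated with a few lines of C (hypothetical demo code, not taken from ip.sh or the cluster packages):

    /* Demonstration of the /dev/fd mechanism that bash process substitution
     * relies on: put data into a pipe, then read it back by opening the
     * pipe's /dev/fd/<n> pathname -- the same kind of name bash passes for
     * <(cmd).  Hypothetical demo code, not part of ip.sh. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>

    int main(void)
    {
        int pipefd[2];
        char path[32];
        char buf[64];
        const char *msg = "hello via /dev/fd\n";
        ssize_t n;

        if (pipe(pipefd) < 0) {
            perror("pipe");
            return 1;
        }
        write(pipefd[1], msg, strlen(msg));
        close(pipefd[1]);               /* reader will see EOF after the data */

        /* Name the read end of the pipe by path, as bash does for <(cmd). */
        snprintf(path, sizeof(path), "/dev/fd/%d", pipefd[0]);

        int fd = open(path, O_RDONLY);  /* fails if /dev/fd is missing */
        if (fd < 0) {
            perror(path);
            return 1;
        }
        n = read(fd, buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            fputs(buf, stdout);
        }
        close(fd);
        close(pipefd[0]);
        return 0;
    }

If /dev/fd does not exist (as on Stefano's nodes before the ln -s /proc/self/fd /dev/fd workaround), the open() above fails with "No such file or directory" -- the same error ip.sh prints at lines 583 and 673.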
From: Stefano E. <ste...@so...> - 2009-06-03 12:49:06
|
Hi Mark, I changed two lines of the script /usr/share/cluster/ip.sh, I have attached the script and the lines that I have changed are immediately below the written CHANGED. Now the service httpd start on the new ip address (10.43.100.204), and if the nodo_1 goes down, the service is relocated on nodo_2. when I start the service, in the log messages I have: Jun 3 13:56:58 clu01 clurgmgrd[14899]: <notice> Starting disabled service service:RHTTPD Jun 3 13:56:59 clu01 in.rdiscd[15391]: setsockopt (IP_ADD_MEMBERSHIP): Address already in use Jun 3 13:56:59 clu01 in.rdiscd[15391]: Failed joining addresses Jun 3 13:57:00 clu01 clurgmgrd[14899]: <notice> Service service:RHTTPD started but despite this, the service httpd works. I hope that the information that I am writing you, you are useful. Bye. Ing. Stefano Elmopi Gruppo Darco - Area ICT Sistemi Via Ostiense 131/L Corpo B, 00154 Roma cell. 3466147165 tel. 0657060500 email:ste...@so... Il giorno 01/giu/09, alle ore 11:40, Stefano Elmopi ha scritto: > > > Hi Mark, > > my cluster.conf is: > > <?xml version="1.0"?> > <cluster config_version="5" name="cluOCFS2" type="ocfs2"> > > <cman expected_votes="1" two_node="1"/> > > <clusternodes> > > <clusternode name="clu01" votes="1" nodeid="1"> > <com_info> > <syslog name="clu01"/> > <rootvolume name="/dev/sda2" fstype="ocfs2"/> > <eth name="eth0" ip="10.43.100.203" > mac="00:15:60:56:75:FD"/> > <fenceackserver user="root" passwd="test123"/> > </com_info> > </clusternode> > > <clusternode name="clu02" votes="1" nodeid="2"> > <com_info> > <syslog name="clu01"/> > <rootvolume name="/dev/sda2" fstype="ocfs2"/> > <eth name="eth0" ip="10.43.105.15" > mac="00:15:60:56:77:11"/> > <fenceackserver user="root" passwd="test123"/> > </com_info> > </clusternode> > > <rm log_level="7" log_facility="local4"> > <failoverdomains> > <failoverdomain name="failover" ordered="1"> > <failoverdomainnode name="clu01" > priority="1"/> > <failoverdomainnode name="clu02" > priority="2"/> > </failoverdomain> > </failoverdomains> > <resources> > <ip address="10.43.100.204" monitor_link="1"/> > <script file="/etc/init.d/httpd" > name="rhttpd"/> > </resources> > <service autostart="0" domain="failover" > name="RHTTPD"> > <ip ref="10.43.100.204"/> > <script ref="rhttpd"/> > </service> > </rm> > > </clusternodes> > > </cluster> > > and I added the line from your email: > > local4.debug /var/log/rgmanager.log to /etc/syslog.conf > > then I rebooted syslog but in the file rgmanager.log is logged only > when CMAN start, > while rgmanager is logged only in the file /va/log/messages but > there is no additional information. > Perhaps additional information can come from this tool, I hope: > > rg_test test /etc/cluster/cluster.conf start service RHTTPD > Running in test mode. > Starting RHTTPD... > /usr/share/cluster/ip.sh: line 583: /dev/fd/62: No such file or > directory > /usr/share/cluster/ip.sh: line 673: /dev/fd/62: No such file or > directory > Failed to start RHTTPD > /usr/share/cluster/ip.sh: line 583: /dev/fd/61: No such file or > directory > +++ Memory table dump +++ > 0xb77306e4 (8 bytes) allocation trace: > 0xb7734e74 (8 bytes) allocation trace: > 0xb774aa6c (16 bytes) allocation trace: > 0xb774b8d0 (16 bytes) allocation trace: > 0xb77357f0 (16 bytes) allocation trace: > 0xb774a9f4 (52 bytes) allocation trace: > 0xb7741194 (912 bytes) allocation trace: > --- End Memory table dump --- > > > > > Bye > > > Ing. Stefano Elmopi > Gruppo Darco - Area ICT Sistemi > Via Ostiense 131/L Corpo B, 00154 Roma > > cell. 3466147165 > tel. 
0657060500 > email:ste...@so... > > Il giorno 28/mag/09, alle ore 21:00, Marc Grimme ha scritto: > >> On Thursday 28 May 2009 17:14:49 Stefano Elmopi wrote: >>> Hi Mark, >>> >>> I have changed the service element from: >>> >>> <service autostart="0" domain="failover" name="RHTTPD"> >>> <ip ref="10.43.100.204"/> >>> <script ref="/etc/init.d/httpd"/> >>> </service> >>> >>> to: >>> >>> <service autostart="0" domain="failover" name="RHTTPD"> >>> <ip ref="10.43.100.204"/> >>> <script ref="httpd"/> >>> </service> >>> >>> but does not change the result, if I type clusvcadm -e RHTTPD the >>> service fails and the messeges log: >>> >>> May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Starting disabled >>> service service:RHTTPD >>> May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> start on ip >>> "10.43.100.204" returned 1 (generic error) >> Hmm, you could extend logging by catching debug messages from >> rgmanager by >> adding the line >> local4.debug /var/log/rgmanager.log >> to /etc/syslog.conf then restart syslog. >> See if you can get more information from this file. >> >>> May 28 15:09:00 clu01 clurgmgrd[15046]: <warning> #68: Failed to >>> start >>> service:RHTTPD; return value: 1 >>> May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Stopping service >>> service:RHTTPD >>> May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Service >>> service:RHTTPD is recovering >>> May 28 15:09:00 clu01 clurgmgrd[15046]: <warning> #71: Relocating >>> failed service service:RHTTPD >>> May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Service >>> service:RHTTPD is stopped >>> >>> a consideration, when rgmanager start, I should not ping the IP >>> address 10.43.100.204 ?? >>> >>> the result of tool rg_test is: >>> >>> [root@clu01 ~]# rg_test test /etc/cluster/cluster.conf >>> Running in test mode. 
>>> Loaded 22 resource rules >>> === Resources List === >>> Resource type: script >>> Agent: script.sh >>> Attributes: >>> name = httpd [ primary unique ] >>> file = /etc/init.d/httpd [ unique required ] >>> service_name [ inherit("service%name") ] >>> >>> Resource type: ip >>> Instances: 1/1 >>> Agent: ip.sh >>> Attributes: >>> address = 10.43.100.204 [ primary unique ] >>> monitor_link = 1 >>> nfslock [ inherit("service%nfslock") ] >>> >>> Resource type: service [INLINE] >>> Instances: 1/1 >>> Agent: service.sh >>> Attributes: >>> name = RHTTPD [ primary unique required ] >>> domain = failover [ reconfig ] >>> autostart = 0 [ reconfig ] >>> hardrecovery = 0 [ reconfig ] >>> exclusive = 0 [ reconfig ] >>> nfslock = 0 >>> recovery = restart [ reconfig ] >>> depend_mode = hard >>> max_restarts = 0 >>> restart_expire_time = 0 >>> >>> === Resource Tree === >>> service { >>> name = "RHTTPD"; >>> domain = "failover"; >>> autostart = "0"; >>> hardrecovery = "0"; >>> exclusive = "0"; >>> nfslock = "0"; >>> recovery = "restart"; >>> depend_mode = "hard"; >>> max_restarts = "0"; >>> restart_expire_time = "0"; >>> ip { >>> address = "10.43.100.204"; >>> monitor_link = "1"; >>> nfslock = "0"; >>> } >>> script { >>> name = "httpd"; >>> file = "/etc/init.d/httpd"; >>> service_name = "RHTTPD"; >>> } >>> } >>> === Failover Domains === >>> Failover domain: failover >>> Flags: Ordered >>> Node clu01 (id 1, priority 1) >>> Node clu02 (id 2, priority 2) >>> === Event Triggers === >>> Event Priority Level 100: >>> Name: Default >>> (Any event) >>> File: /usr/share/cluster/default_event_script.sl >>> +++ Memory table dump +++ >>> 0xb77756e4 (8 bytes) allocation trace: >>> 0xb7779e74 (8 bytes) allocation trace: >>> 0xb778fce4 (52 bytes) allocation trace: >>> --- End Memory table dump --- >>> >>> >>> if I add the line: >>> >>> <eth name="eth1" ip="10.43.100.204" mac="00:15:60:56:75:FC"/> >>> >>> to section <com_info> of the clu01, the service start: >>> >>> /etc/init.d/rgmanager start >>> Starting Cluster Service Manager: [ OK ] >>> >>> the log is: >>> >>> May 28 16:59:21 clu01 kernel: dlm: Using TCP for communications >>> May 28 16:59:30 clu01 clurgmgrd[15209]: <notice> Resource Group >>> Manager Starting >>> May 28 16:59:31 clu01 clurgmgrd: [15209]: <err> Failed to remove >>> 10.43.100.204 >>> May 28 16:59:31 clu01 clurgmgrd[15209]: <notice> stop on ip >>> "10.43.100.204" returned 1 (generic error) >> That's clear. This ip is already setup by the bootprocess. So it >> cannot be >> setup. >>> >>> clustat >>> Cluster Status for cluOCFS2 @ Thu May 28 17:00:22 2009 >>> Member Status: Quorate >>> >>> Member Name ID >>> Status >>> ------ ---- >>> ---- >>> ------ >>> clu01 >>> 1 Online, Local, rgmanager >>> clu02 >>> 2 Offline >>> >>> Service Name >>> Owner (Last) >>> State >>> ------- ---- >>> ----- ------ >>> ----- >>> service:RHTTPD >>> (none) >>> disabled >>> >>> and: >>> >>> clusvcadm -e RHTTPD >>> Local machine trying to enable service:RHTTPD...Success >>> service:RHTTPD is now running on clu01 >>> >>> but in this case the service does not relocate with the same ip !! >>> >>> >>> >>> Bye >>> >>> >>> >>> >>> >>> Ing. Stefano Elmopi >>> Gruppo Darco - Area ICT Sistemi >>> Via Ostiense 131/L Corpo B, 00154 Roma >>> >>> cell. 3466147165 >>> tel. 0657060500 >>> email:ste...@so... >> >> >> >> -- >> Gruss / Regards, >> >> Marc Grimme >> http://www.atix.de/ http://www.open-sharedroot.org/ >> > |
From: Mark H. <hla...@at...> - 2009-06-01 20:02:42
Hi Stefano,

the errors indicate that the fd devices have not been configured correctly. udev is responsible for doing this in RHEL. You can work around the issue by creating the following symlink:

ln -s /proc/self/fd /dev/fd

Hope this helps!

Mark

> rg_test test /etc/cluster/cluster.conf start service RHTTPD
> Running in test mode.
> Starting RHTTPD...
> /usr/share/cluster/ip.sh: line 583: /dev/fd/62: No such file or
> directory
> /usr/share/cluster/ip.sh: line 673: /dev/fd/62: No such file or
> directory
> Failed to start RHTTPD
> /usr/share/cluster/ip.sh: line 583: /dev/fd/61: No such file or
From: Marc G. <gr...@at...> - 2009-06-01 10:10:16
Hi Dan,

On Friday 29 May 2009 17:57:32 Dan Magenheimer wrote:
> Hi Marc --
>
> Thanks for your reply on the cdsl's.
>
> I'm about to start another round of setting up an
> OSR for Xen/EL5/ocfs2. I have my current one
> working and booting 8 virtual nodes. However
> I made enough tweaks along the way that I want
> to ensure I can reproduce it.
>
> Last time, I used the following rpm versions:
>
> comoonics-bootimage-1.4-21.noarch.rpm
> comoonics-bootimage-extras-ocfs2-0.1-3.noarch.rpm
> comoonics-bootimage-initscripts-1.4-9.rhel5.noarch.rpm
> comoonics-bootimage-listfiles-1.3-8.el5.noarch.rpm
> comoonics-bootimage-listfiles-all-0.1-5.noarch.rpm
> comoonics-bootimage-listfiles-rhel-0.1-3.noarch.rpm
> comoonics-bootimage-listfiles-rhel5-0.1-3.noarch.rpm
> comoonics-cdsl-py-0.2-12.noarch.rpm
> comoonics-cluster-py-0.1-17.noarch.rpm
> comoonics-cs-py-0.1-56.noarch.rpm
> comoonics-pythonosfix-py-0.1-2.noarch.rpm
> SysVinit-comoonics-2.86-14.atix.1.i386.rpm
>
> Are there any newer versions of these? (And is
> there any changelog list or other mechanism I
> can use generally to check for newer versions?
> I want to download rpm's, not use yum or up2date.)

First, no: the version with the el5 detection patches has not been built as an rpm yet, but it will be in the next version. I think I will release those fixes within the next week, when we have finished our first Q&A round on the beta channel.

About changelogs: there are always the rpm changelogs. In addition, every new major version will get release notes (automatically). This is new, so I don't know whether we will have it fully in 4.5, which is on Q&A right now.

> The version above still has the problem with the
> /etc/redhat-release not parsing the Oracle version
> string. Is this fixed somewhere? I'm currently
> just patching it manually. Also, I'm wondering if
> you might have taken my suggestion of using
> "modprobe -q" so that ugly FATAL messages aren't
> printed for modules that might already be built
> into the kernel. (Specifically, xennet and xenblk)

Ok, I'll keep it in mind. I hadn't completely registered it ;-), but now I have. I have to think about it, but I don't think it's a bad idea. I'll keep you up to date on this.

> Thanks,
> Dan

Regards

--
Gruss / Regards,

Marc Grimme
http://www.atix.de/ http://www.open-sharedroot.org/
From: Stefano E. <ste...@so...> - 2009-06-01 09:40:06
|
Hi Mark, my cluster.conf is: <?xml version="1.0"?> <cluster config_version="5" name="cluOCFS2" type="ocfs2"> <cman expected_votes="1" two_node="1"/> <clusternodes> <clusternode name="clu01" votes="1" nodeid="1"> <com_info> <syslog name="clu01"/> <rootvolume name="/dev/sda2" fstype="ocfs2"/> <eth name="eth0" ip="10.43.100.203" mac="00:15:60:56:75:FD"/> <fenceackserver user="root" passwd="test123"/> </com_info> </clusternode> <clusternode name="clu02" votes="1" nodeid="2"> <com_info> <syslog name="clu01"/> <rootvolume name="/dev/sda2" fstype="ocfs2"/> <eth name="eth0" ip="10.43.105.15" mac="00:15:60:56:77:11"/> <fenceackserver user="root" passwd="test123"/> </com_info> </clusternode> <rm log_level="7" log_facility="local4"> <failoverdomains> <failoverdomain name="failover" ordered="1"> <failoverdomainnode name="clu01" priority="1"/> <failoverdomainnode name="clu02" priority="2"/> </failoverdomain> </failoverdomains> <resources> <ip address="10.43.100.204" monitor_link="1"/> <script file="/etc/init.d/httpd" name="rhttpd"/> </resources> <service autostart="0" domain="failover" name="RHTTPD"> <ip ref="10.43.100.204"/> <script ref="rhttpd"/> </service> </rm> </clusternodes> </cluster> and I added the line from your email: local4.debug /var/log/rgmanager.log to /etc/syslog.conf then I rebooted syslog but in the file rgmanager.log is logged only when CMAN start, while rgmanager is logged only in the file /va/log/messages but there is no additional information. Perhaps additional information can come from this tool, I hope: rg_test test /etc/cluster/cluster.conf start service RHTTPD Running in test mode. Starting RHTTPD... /usr/share/cluster/ip.sh: line 583: /dev/fd/62: No such file or directory /usr/share/cluster/ip.sh: line 673: /dev/fd/62: No such file or directory Failed to start RHTTPD /usr/share/cluster/ip.sh: line 583: /dev/fd/61: No such file or directory +++ Memory table dump +++ 0xb77306e4 (8 bytes) allocation trace: 0xb7734e74 (8 bytes) allocation trace: 0xb774aa6c (16 bytes) allocation trace: 0xb774b8d0 (16 bytes) allocation trace: 0xb77357f0 (16 bytes) allocation trace: 0xb774a9f4 (52 bytes) allocation trace: 0xb7741194 (912 bytes) allocation trace: --- End Memory table dump --- Bye Ing. Stefano Elmopi Gruppo Darco - Area ICT Sistemi Via Ostiense 131/L Corpo B, 00154 Roma cell. 3466147165 tel. 0657060500 email:ste...@so... Il giorno 28/mag/09, alle ore 21:00, Marc Grimme ha scritto: > On Thursday 28 May 2009 17:14:49 Stefano Elmopi wrote: >> Hi Mark, >> >> I have changed the service element from: >> >> <service autostart="0" domain="failover" name="RHTTPD"> >> <ip ref="10.43.100.204"/> >> <script ref="/etc/init.d/httpd"/> >> </service> >> >> to: >> >> <service autostart="0" domain="failover" name="RHTTPD"> >> <ip ref="10.43.100.204"/> >> <script ref="httpd"/> >> </service> >> >> but does not change the result, if I type clusvcadm -e RHTTPD the >> service fails and the messeges log: >> >> May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Starting disabled >> service service:RHTTPD >> May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> start on ip >> "10.43.100.204" returned 1 (generic error) > Hmm, you could extend logging by catching debug messages from > rgmanager by > adding the line > local4.debug /var/log/rgmanager.log > to /etc/syslog.conf then restart syslog. > See if you can get more information from this file. 
> >> May 28 15:09:00 clu01 clurgmgrd[15046]: <warning> #68: Failed to >> start >> service:RHTTPD; return value: 1 >> May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Stopping service >> service:RHTTPD >> May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Service >> service:RHTTPD is recovering >> May 28 15:09:00 clu01 clurgmgrd[15046]: <warning> #71: Relocating >> failed service service:RHTTPD >> May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Service >> service:RHTTPD is stopped >> >> a consideration, when rgmanager start, I should not ping the IP >> address 10.43.100.204 ?? >> >> the result of tool rg_test is: >> >> [root@clu01 ~]# rg_test test /etc/cluster/cluster.conf >> Running in test mode. >> Loaded 22 resource rules >> === Resources List === >> Resource type: script >> Agent: script.sh >> Attributes: >> name = httpd [ primary unique ] >> file = /etc/init.d/httpd [ unique required ] >> service_name [ inherit("service%name") ] >> >> Resource type: ip >> Instances: 1/1 >> Agent: ip.sh >> Attributes: >> address = 10.43.100.204 [ primary unique ] >> monitor_link = 1 >> nfslock [ inherit("service%nfslock") ] >> >> Resource type: service [INLINE] >> Instances: 1/1 >> Agent: service.sh >> Attributes: >> name = RHTTPD [ primary unique required ] >> domain = failover [ reconfig ] >> autostart = 0 [ reconfig ] >> hardrecovery = 0 [ reconfig ] >> exclusive = 0 [ reconfig ] >> nfslock = 0 >> recovery = restart [ reconfig ] >> depend_mode = hard >> max_restarts = 0 >> restart_expire_time = 0 >> >> === Resource Tree === >> service { >> name = "RHTTPD"; >> domain = "failover"; >> autostart = "0"; >> hardrecovery = "0"; >> exclusive = "0"; >> nfslock = "0"; >> recovery = "restart"; >> depend_mode = "hard"; >> max_restarts = "0"; >> restart_expire_time = "0"; >> ip { >> address = "10.43.100.204"; >> monitor_link = "1"; >> nfslock = "0"; >> } >> script { >> name = "httpd"; >> file = "/etc/init.d/httpd"; >> service_name = "RHTTPD"; >> } >> } >> === Failover Domains === >> Failover domain: failover >> Flags: Ordered >> Node clu01 (id 1, priority 1) >> Node clu02 (id 2, priority 2) >> === Event Triggers === >> Event Priority Level 100: >> Name: Default >> (Any event) >> File: /usr/share/cluster/default_event_script.sl >> +++ Memory table dump +++ >> 0xb77756e4 (8 bytes) allocation trace: >> 0xb7779e74 (8 bytes) allocation trace: >> 0xb778fce4 (52 bytes) allocation trace: >> --- End Memory table dump --- >> >> >> if I add the line: >> >> <eth name="eth1" ip="10.43.100.204" mac="00:15:60:56:75:FC"/> >> >> to section <com_info> of the clu01, the service start: >> >> /etc/init.d/rgmanager start >> Starting Cluster Service Manager: [ OK ] >> >> the log is: >> >> May 28 16:59:21 clu01 kernel: dlm: Using TCP for communications >> May 28 16:59:30 clu01 clurgmgrd[15209]: <notice> Resource Group >> Manager Starting >> May 28 16:59:31 clu01 clurgmgrd: [15209]: <err> Failed to remove >> 10.43.100.204 >> May 28 16:59:31 clu01 clurgmgrd[15209]: <notice> stop on ip >> "10.43.100.204" returned 1 (generic error) > That's clear. This ip is already setup by the bootprocess. So it > cannot be > setup. 
>> >> clustat >> Cluster Status for cluOCFS2 @ Thu May 28 17:00:22 2009 >> Member Status: Quorate >> >> Member Name ID >> Status >> ------ ---- ---- >> ------ >> clu01 >> 1 Online, Local, rgmanager >> clu02 >> 2 Offline >> >> Service Name >> Owner (Last) >> State >> ------- ---- >> ----- ------ >> ----- >> service:RHTTPD >> (none) >> disabled >> >> and: >> >> clusvcadm -e RHTTPD >> Local machine trying to enable service:RHTTPD...Success >> service:RHTTPD is now running on clu01 >> >> but in this case the service does not relocate with the same ip !! >> >> >> >> Bye >> >> >> >> >> >> Ing. Stefano Elmopi >> Gruppo Darco - Area ICT Sistemi >> Via Ostiense 131/L Corpo B, 00154 Roma >> >> cell. 3466147165 >> tel. 0657060500 >> email:ste...@so... > > > > -- > Gruss / Regards, > > Marc Grimme > http://www.atix.de/ http://www.open-sharedroot.org/ > |
From: Dan M. <dan...@or...> - 2009-05-29 16:28:01
Hi Marc --

Thanks for your reply on the cdsl's.

I'm about to start another round of setting up an OSR for Xen/EL5/ocfs2. I have my current one working and booting 8 virtual nodes. However, I made enough tweaks along the way that I want to ensure I can reproduce it.

Last time, I used the following rpm versions:

comoonics-bootimage-1.4-21.noarch.rpm
comoonics-bootimage-extras-ocfs2-0.1-3.noarch.rpm
comoonics-bootimage-initscripts-1.4-9.rhel5.noarch.rpm
comoonics-bootimage-listfiles-1.3-8.el5.noarch.rpm
comoonics-bootimage-listfiles-all-0.1-5.noarch.rpm
comoonics-bootimage-listfiles-rhel-0.1-3.noarch.rpm
comoonics-bootimage-listfiles-rhel5-0.1-3.noarch.rpm
comoonics-cdsl-py-0.2-12.noarch.rpm
comoonics-cluster-py-0.1-17.noarch.rpm
comoonics-cs-py-0.1-56.noarch.rpm
comoonics-pythonosfix-py-0.1-2.noarch.rpm
SysVinit-comoonics-2.86-14.atix.1.i386.rpm

Are there any newer versions of these? (And is there any changelog list or other mechanism I can use generally to check for newer versions? I want to download rpm's, not use yum or up2date.)

The version above still has the problem with /etc/redhat-release not parsing the Oracle version string. Is this fixed somewhere? I'm currently just patching it manually. Also, I'm wondering if you might have taken my suggestion of using "modprobe -q" so that ugly FATAL messages aren't printed for modules that might already be built into the kernel. (Specifically, xennet and xenblk.)

Thanks,
Dan

> -----Original Message-----
> From: Marc Grimme [mailto:gr...@at...]
> Sent: Thursday, May 28, 2009 1:05 PM
> To: ope...@li...
> Cc: Dan Magenheimer
> Subject: Re: [OSR-users] What other cdsl's to configure?
>
> On Thursday 28 May 2009 18:53:56 Dan Magenheimer wrote:
> > The howto (at least the RHEL5 OCFS2 one I am using) provides
> > the minimum number of cdsl's necessary to boot the system.
> > I'd be interested in suggestions on what other cdsl's should
> > be set up to make a usable sharedroot system.
> >
> > For example, I see that /root is shared, which is likely
> > problematic if root logs in on multiple nodes.
> >
> > Should /home be shared or cdsl?
> >
> > Should any other directories be cdsl?
>
> Good question. I feel the answer is more personal than technical.
> But the idea behind the whole shared root thing is: share every file
> as long as there is a good reason; if there is no good reason, make it
> host dependent.
> I know this doesn't hold completely, because /var is host dependent
> and _ONLY_ /var/lib is reshared. But there might be a /var/www ...
>
> I'd say this is up to you.
>
> But I haven't seen anybody make /root or /home host dependent.
>
> --
> Gruss / Regards,
>
> Marc Grimme
> http://www.atix.de/ http://www.open-sharedroot.org/
From: Marc G. <gr...@at...> - 2009-05-28 19:14:01
Hi,

this sounds like interesting information. I'll try to validate it on Tuesday next week; then I'll have better access to my test clusters. I'll keep you informed.

Thanks
Marc.

On Thursday 28 May 2009 09:17:04 Klaus Steinberger wrote:
> Hi,
>
> because of some testing of a samba/ctdb Cluster on top of OSR I tweaked
> around with the Posix Locking Rate Limit of gfs_controld. The default
> for plock_rate_limit is "100", which is quite low.
>
> I raised the limit now on one of my SL 5.3/GFS OSR clusters (a virtual
> one) to 10000, which seems to be higher than the real rate (I measured
> up to 2200 / sec using ping_pong).
>
> I see a very interesting side effect:
>
> The startup of nodes seems to be quite a bit faster now. Especially the
> ever slow udev startup after changing root runs like a charm.
>
> So maybe somebody could try to confirm that?
>
> To raise plock_rate_limit put the following line into
> /etc/cluster/cluster.conf:
>
> <gfs_controld plock_rate_limit="10000"/>
>
> Please be aware that gfs_controld cannot be restarted or reconfigured on
> a running node! A node has to be rebooted to change plock_rate_limit.
>
> Sincerely,
> Klaus

--
Gruss / Regards,

Marc Grimme
http://www.atix.de/ http://www.open-sharedroot.org/
From: Marc G. <gr...@at...> - 2009-05-28 19:06:50
|
On Thursday 28 May 2009 17:14:49 Stefano Elmopi wrote: > Hi Mark, > > I have changed the service element from: > > <service autostart="0" domain="failover" name="RHTTPD"> > <ip ref="10.43.100.204"/> > <script ref="/etc/init.d/httpd"/> > </service> > > to: > > <service autostart="0" domain="failover" name="RHTTPD"> > <ip ref="10.43.100.204"/> > <script ref="httpd"/> > </service> > > but does not change the result, if I type clusvcadm -e RHTTPD the > service fails and the messeges log: > > May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Starting disabled > service service:RHTTPD > May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> start on ip > "10.43.100.204" returned 1 (generic error) Hmm, you could extend logging by catching debug messages from rgmanager by adding the line local4.debug /var/log/rgmanager.log to /etc/syslog.conf then restart syslog. See if you can get more information from this file. > May 28 15:09:00 clu01 clurgmgrd[15046]: <warning> #68: Failed to start > service:RHTTPD; return value: 1 > May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Stopping service > service:RHTTPD > May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Service > service:RHTTPD is recovering > May 28 15:09:00 clu01 clurgmgrd[15046]: <warning> #71: Relocating > failed service service:RHTTPD > May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Service > service:RHTTPD is stopped > > a consideration, when rgmanager start, I should not ping the IP > address 10.43.100.204 ?? > > the result of tool rg_test is: > > [root@clu01 ~]# rg_test test /etc/cluster/cluster.conf > Running in test mode. > Loaded 22 resource rules > === Resources List === > Resource type: script > Agent: script.sh > Attributes: > name = httpd [ primary unique ] > file = /etc/init.d/httpd [ unique required ] > service_name [ inherit("service%name") ] > > Resource type: ip > Instances: 1/1 > Agent: ip.sh > Attributes: > address = 10.43.100.204 [ primary unique ] > monitor_link = 1 > nfslock [ inherit("service%nfslock") ] > > Resource type: service [INLINE] > Instances: 1/1 > Agent: service.sh > Attributes: > name = RHTTPD [ primary unique required ] > domain = failover [ reconfig ] > autostart = 0 [ reconfig ] > hardrecovery = 0 [ reconfig ] > exclusive = 0 [ reconfig ] > nfslock = 0 > recovery = restart [ reconfig ] > depend_mode = hard > max_restarts = 0 > restart_expire_time = 0 > > === Resource Tree === > service { > name = "RHTTPD"; > domain = "failover"; > autostart = "0"; > hardrecovery = "0"; > exclusive = "0"; > nfslock = "0"; > recovery = "restart"; > depend_mode = "hard"; > max_restarts = "0"; > restart_expire_time = "0"; > ip { > address = "10.43.100.204"; > monitor_link = "1"; > nfslock = "0"; > } > script { > name = "httpd"; > file = "/etc/init.d/httpd"; > service_name = "RHTTPD"; > } > } > === Failover Domains === > Failover domain: failover > Flags: Ordered > Node clu01 (id 1, priority 1) > Node clu02 (id 2, priority 2) > === Event Triggers === > Event Priority Level 100: > Name: Default > (Any event) > File: /usr/share/cluster/default_event_script.sl > +++ Memory table dump +++ > 0xb77756e4 (8 bytes) allocation trace: > 0xb7779e74 (8 bytes) allocation trace: > 0xb778fce4 (52 bytes) allocation trace: > --- End Memory table dump --- > > > if I add the line: > > <eth name="eth1" ip="10.43.100.204" mac="00:15:60:56:75:FC"/> > > to section <com_info> of the clu01, the service start: > > /etc/init.d/rgmanager start > Starting Cluster Service Manager: [ OK ] > > the log is: > > May 28 16:59:21 clu01 kernel: dlm: Using TCP for 
communications > May 28 16:59:30 clu01 clurgmgrd[15209]: <notice> Resource Group > Manager Starting > May 28 16:59:31 clu01 clurgmgrd: [15209]: <err> Failed to remove > 10.43.100.204 > May 28 16:59:31 clu01 clurgmgrd[15209]: <notice> stop on ip > "10.43.100.204" returned 1 (generic error) That's clear. This ip is already setup by the bootprocess. So it cannot be setup. > > clustat > Cluster Status for cluOCFS2 @ Thu May 28 17:00:22 2009 > Member Status: Quorate > > Member Name ID > Status > ------ ---- ---- > ------ > clu01 > 1 Online, Local, rgmanager > clu02 > 2 Offline > > Service Name > Owner (Last) State > ------- ---- > ----- ------ ----- > service:RHTTPD > (none) > disabled > > and: > > clusvcadm -e RHTTPD > Local machine trying to enable service:RHTTPD...Success > service:RHTTPD is now running on clu01 > > but in this case the service does not relocate with the same ip !! > > > > Bye > > > > > > Ing. Stefano Elmopi > Gruppo Darco - Area ICT Sistemi > Via Ostiense 131/L Corpo B, 00154 Roma > > cell. 3466147165 > tel. 0657060500 > email:ste...@so... -- Gruss / Regards, Marc Grimme http://www.atix.de/ http://www.open-sharedroot.org/ |
From: Marc G. <gr...@at...> - 2009-05-28 19:05:14
On Thursday 28 May 2009 18:53:56 Dan Magenheimer wrote:
> The howto (at least the RHEL5 OCFS2 one I am using) provides
> the minimum number of cdsl's necessary to boot the system.
> I'd be interested in suggestions on what other cdsl's should
> be set up to make a usable sharedroot system.
>
> For example, I see that /root is shared, which is likely
> problematic if root logs in on multiple nodes.
>
> Should /home be shared or cdsl?
>
> Should any other directories be cdsl?

Good question. I feel the answer is more personal than technical. But the idea behind the whole shared root thing is: share every file as long as there is a good reason; if there is no good reason, make it host dependent. I know this doesn't hold completely, because /var is host dependent and _ONLY_ /var/lib is reshared. But there might be a /var/www ...

I'd say this is up to you.

But I haven't seen anybody make /root or /home host dependent.

--
Gruss / Regards,

Marc Grimme
http://www.atix.de/ http://www.open-sharedroot.org/
From: Dan M. <dan...@or...> - 2009-05-28 16:54:26
The howto (at least the RHEL5 OCFS2 one I am using) provides the minimum number of cdsl's necessary to boot the system. I'd be interested in suggestions on what other cdsl's should be set up to make a usable sharedroot system.

For example, I see that /root is shared, which is likely problematic if root logs in on multiple nodes.

Should /home be shared or cdsl?

Should any other directories be cdsl?

Thanks,
Dan
From: Stefano E. <ste...@so...> - 2009-05-28 15:14:55
|
Hi Mark, I have changed the service element from: <service autostart="0" domain="failover" name="RHTTPD"> <ip ref="10.43.100.204"/> <script ref="/etc/init.d/httpd"/> </service> to: <service autostart="0" domain="failover" name="RHTTPD"> <ip ref="10.43.100.204"/> <script ref="httpd"/> </service> but does not change the result, if I type clusvcadm -e RHTTPD the service fails and the messeges log: May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Starting disabled service service:RHTTPD May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> start on ip "10.43.100.204" returned 1 (generic error) May 28 15:09:00 clu01 clurgmgrd[15046]: <warning> #68: Failed to start service:RHTTPD; return value: 1 May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Stopping service service:RHTTPD May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Service service:RHTTPD is recovering May 28 15:09:00 clu01 clurgmgrd[15046]: <warning> #71: Relocating failed service service:RHTTPD May 28 15:09:00 clu01 clurgmgrd[15046]: <notice> Service service:RHTTPD is stopped a consideration, when rgmanager start, I should not ping the IP address 10.43.100.204 ?? the result of tool rg_test is: [root@clu01 ~]# rg_test test /etc/cluster/cluster.conf Running in test mode. Loaded 22 resource rules === Resources List === Resource type: script Agent: script.sh Attributes: name = httpd [ primary unique ] file = /etc/init.d/httpd [ unique required ] service_name [ inherit("service%name") ] Resource type: ip Instances: 1/1 Agent: ip.sh Attributes: address = 10.43.100.204 [ primary unique ] monitor_link = 1 nfslock [ inherit("service%nfslock") ] Resource type: service [INLINE] Instances: 1/1 Agent: service.sh Attributes: name = RHTTPD [ primary unique required ] domain = failover [ reconfig ] autostart = 0 [ reconfig ] hardrecovery = 0 [ reconfig ] exclusive = 0 [ reconfig ] nfslock = 0 recovery = restart [ reconfig ] depend_mode = hard max_restarts = 0 restart_expire_time = 0 === Resource Tree === service { name = "RHTTPD"; domain = "failover"; autostart = "0"; hardrecovery = "0"; exclusive = "0"; nfslock = "0"; recovery = "restart"; depend_mode = "hard"; max_restarts = "0"; restart_expire_time = "0"; ip { address = "10.43.100.204"; monitor_link = "1"; nfslock = "0"; } script { name = "httpd"; file = "/etc/init.d/httpd"; service_name = "RHTTPD"; } } === Failover Domains === Failover domain: failover Flags: Ordered Node clu01 (id 1, priority 1) Node clu02 (id 2, priority 2) === Event Triggers === Event Priority Level 100: Name: Default (Any event) File: /usr/share/cluster/default_event_script.sl +++ Memory table dump +++ 0xb77756e4 (8 bytes) allocation trace: 0xb7779e74 (8 bytes) allocation trace: 0xb778fce4 (52 bytes) allocation trace: --- End Memory table dump --- if I add the line: <eth name="eth1" ip="10.43.100.204" mac="00:15:60:56:75:FC"/> to section <com_info> of the clu01, the service start: /etc/init.d/rgmanager start Starting Cluster Service Manager: [ OK ] the log is: May 28 16:59:21 clu01 kernel: dlm: Using TCP for communications May 28 16:59:30 clu01 clurgmgrd[15209]: <notice> Resource Group Manager Starting May 28 16:59:31 clu01 clurgmgrd: [15209]: <err> Failed to remove 10.43.100.204 May 28 16:59:31 clu01 clurgmgrd[15209]: <notice> stop on ip "10.43.100.204" returned 1 (generic error) clustat Cluster Status for cluOCFS2 @ Thu May 28 17:00:22 2009 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ clu01 1 Online, Local, rgmanager clu02 2 Offline Service Name Owner (Last) State ------- ---- ----- ------ ----- 
service:RHTTPD (none) disabled and: clusvcadm -e RHTTPD Local machine trying to enable service:RHTTPD...Success service:RHTTPD is now running on clu01 but in this case the service does not relocate with the same ip !! Bye Ing. Stefano Elmopi Gruppo Darco - Area ICT Sistemi Via Ostiense 131/L Corpo B, 00154 Roma cell. 3466147165 tel. 0657060500 email:ste...@so... |
From: Klaus S. <kla...@Ph...> - 2009-05-28 07:17:11
Hi,

because of some testing of a samba/ctdb cluster on top of OSR, I tweaked around with the posix locking rate limit of gfs_controld. The default for plock_rate_limit is "100", which is quite low.

I raised the limit on one of my SL 5.3/GFS OSR clusters (a virtual one) to 10000, which seems to be higher than the real rate (I measured up to 2200/sec using ping_pong).

I see a very interesting side effect: the startup of nodes seems to be quite a bit faster now. Especially the ever-slow udev startup after changing root now runs like a charm.

So maybe somebody could try to confirm that?

To raise plock_rate_limit, put the following line into /etc/cluster/cluster.conf:

<gfs_controld plock_rate_limit="10000"/>

Please be aware that gfs_controld cannot be restarted or reconfigured on a running node! A node has to be rebooted to change plock_rate_limit.

Sincerely,
Klaus
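For context, the 2200 locks/sec figure above comes from ping_pong, the fcntl byte-range locking test commonly used with ctdb. A rough single-node sketch of that kind of rate measurement is shown below; it is illustrative only, the file path and defaults are assumptions, and the real ping_pong bounces the lock between processes running on different cluster nodes.

    /* Rough sketch of an fcntl (posix) lock rate measurement: repeatedly
     * lock and unlock one byte of a file and report locks per second.
     * Illustrative only -- the default path is an assumption; point it at
     * a file on the shared cluster filesystem. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <time.h>

    static void lock_byte(int fd, short type)
    {
        struct flock fl;

        memset(&fl, 0, sizeof(fl));
        fl.l_type = type;            /* F_WRLCK to lock, F_UNLCK to release */
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 1;
        if (fcntl(fd, F_SETLKW, &fl) < 0) {
            perror("fcntl");
            exit(1);
        }
    }

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/mnt/shared/plock_test";
        int iterations = argc > 2 ? atoi(argv[2]) : 10000;
        int i, fd;
        time_t start, elapsed;

        fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
            perror(path);
            return 1;
        }
        start = time(NULL);
        for (i = 0; i < iterations; i++) {
            lock_byte(fd, F_WRLCK);
            lock_byte(fd, F_UNLCK);
        }
        elapsed = time(NULL) - start;
        if (elapsed < 1)
            elapsed = 1;             /* avoid division by zero on fast runs */
        printf("%d lock/unlock cycles in %ld s (~%ld locks/sec)\n",
               iterations, (long)elapsed, (long)(iterations / elapsed));
        close(fd);
        return 0;
    }

On GFS the fcntl locks in such a loop are handled through gfs_controld, which is why plock_rate_limit caps the rate a test like this can reach.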
From: Marc G. <gr...@at...> - 2009-05-27 16:43:28
|
On Wednesday 27 May 2009 17:45:12 Stefano Elmopi wrote: > Hi Marc, > > you don't worry about the late of the your response, I read that you > were away from base...... > I hope for some days of rest !!! > > I want to relocate a service from one node to another, keeping the > same ip. > My cluster.conf is: > > <?xml version="1.0"?> > <cluster config_version="5" name="cluOCFS2" type="ocfs2"> > > <cman expected_votes="1" two_node="1"/> > > <clusternodes> > > <clusternode name="clu01" votes="1" nodeid="1"> > <com_info> > <syslog name="clu01"/> > <rootvolume name="/dev/sda2" fstype="ocfs2"/> > <eth name="eth0" ip="10.43.100.203" > mac="00:15:60:56:75:FD"/> > <fenceackserver user="root" passwd="test123"/> > </com_info> > </clusternode> > > <clusternode name="clu02" votes="1" nodeid="2"> > <com_info> > <syslog name="clu01"/> > <rootvolume name="/dev/sda2" fstype="ocfs2"/> > <eth name="eth0" ip="10.43.105.15" > mac="00:15:60:56:77:11"/> > <eth name="eth1" ip="10.43.105.25" > mac="00:15:60:56:77:10"/> > <fenceackserver user="root" passwd="test123"/> > </com_info> > </clusternode> > > <rm log_level="7" log_facility="local4"> > <failoverdomains> > <failoverdomain name="failover" ordered="1"> > <failoverdomainnode name="clu01" > priority="1"/> > <failoverdomainnode name="clu02" > priority="2"/> > </failoverdomain> > </failoverdomains> > <resources> > <ip address="10.43.100.204" monitor_link="1"/> > <script file="/etc/init.d/httpd" name="httpd"/> > </resources> > <service autostart="0" domain="failover" name="RHTTPD"> > <ip ref="10.43.100.204"/> > <script ref="/etc/init.d/httpd"/> > </service> > </rm> > > </clusternodes> > > </cluster> > > > I am starting CMAN and RGMAN and everything is ok. > > clustat: > Cluster Status for cluOCFS2 @ Wed May 27 16:52:50 2009 > Member Status: Quorate > > Member Name ID > Status > ------ ---- ---- > ------ > clu01 > 1 Online, Local, rgmanager > clu02 > 2 Offline > > Service Name > Owner (Last) State > ------- ---- > ----- ------ ----- > service:RHTTPD > (none) > disabled > > > Then I type: > > clusvcadm -e RHTTPD > Local machine trying to enable service:RHTTPD...Failure > > and the log messages is: > > May 27 16:53:09 clu01 clurgmgrd[16576]: <notice> Starting disabled > service service:RHTTPD > May 27 16:53:09 clu01 clurgmgrd[16576]: <notice> start on ip > "10.43.100.204" returned 1 (generic error) > May 27 16:53:09 clu01 clurgmgrd[16576]: <warning> #68: Failed to start > service:RHTTPD; return value: 1 > May 27 16:53:09 clu01 clurgmgrd[16576]: <notice> Stopping service > service:RHTTPD > May 27 16:53:09 clu01 clurgmgrd[16576]: <notice> Service > service:RHTTPD is recovering > May 27 16:53:09 clu01 clurgmgrd[16576]: <warning> #71: Relocating > failed service service:RHTTPD > May 27 16:53:09 clu01 clurgmgrd[16576]: <notice> Service > service:RHTTPD is stopped > > Thanks > > > > Ing. Stefano Elmopi > Gruppo Darco - Area ICT Sistemi > Via Ostiense 131/L Corpo B, 00154 Roma > > cell. 3466147165 > tel. 0657060500 > email:ste...@so... Try to change the service element as follows: <service autostart="0" domain="failover" name="RHTTPD"> <ip ref="10.43.100.204"/> <script ref="httpd"/> </service> What tells you rgmanager now? Also try to get used to the rg_test tool. That helps. Try e.g. rg_test /etc/cluster/cluster.conf .. -- Gruss / Regards, Marc Grimme http://www.atix.de/ http://www.open-sharedroot.org/ |
From: Stefano E. <ste...@so...> - 2009-05-27 15:45:27
|
Hi Marc, you don't worry about the late of the your response, I read that you were away from base...... I hope for some days of rest !!! I want to relocate a service from one node to another, keeping the same ip. My cluster.conf is: <?xml version="1.0"?> <cluster config_version="5" name="cluOCFS2" type="ocfs2"> <cman expected_votes="1" two_node="1"/> <clusternodes> <clusternode name="clu01" votes="1" nodeid="1"> <com_info> <syslog name="clu01"/> <rootvolume name="/dev/sda2" fstype="ocfs2"/> <eth name="eth0" ip="10.43.100.203" mac="00:15:60:56:75:FD"/> <fenceackserver user="root" passwd="test123"/> </com_info> </clusternode> <clusternode name="clu02" votes="1" nodeid="2"> <com_info> <syslog name="clu01"/> <rootvolume name="/dev/sda2" fstype="ocfs2"/> <eth name="eth0" ip="10.43.105.15" mac="00:15:60:56:77:11"/> <eth name="eth1" ip="10.43.105.25" mac="00:15:60:56:77:10"/> <fenceackserver user="root" passwd="test123"/> </com_info> </clusternode> <rm log_level="7" log_facility="local4"> <failoverdomains> <failoverdomain name="failover" ordered="1"> <failoverdomainnode name="clu01" priority="1"/> <failoverdomainnode name="clu02" priority="2"/> </failoverdomain> </failoverdomains> <resources> <ip address="10.43.100.204" monitor_link="1"/> <script file="/etc/init.d/httpd" name="httpd"/> </resources> <service autostart="0" domain="failover" name="RHTTPD"> <ip ref="10.43.100.204"/> <script ref="/etc/init.d/httpd"/> </service> </rm> </clusternodes> </cluster> I am starting CMAN and RGMAN and everything is ok. clustat: Cluster Status for cluOCFS2 @ Wed May 27 16:52:50 2009 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ clu01 1 Online, Local, rgmanager clu02 2 Offline Service Name Owner (Last) State ------- ---- ----- ------ ----- service:RHTTPD (none) disabled Then I type: clusvcadm -e RHTTPD Local machine trying to enable service:RHTTPD...Failure and the log messages is: May 27 16:53:09 clu01 clurgmgrd[16576]: <notice> Starting disabled service service:RHTTPD May 27 16:53:09 clu01 clurgmgrd[16576]: <notice> start on ip "10.43.100.204" returned 1 (generic error) May 27 16:53:09 clu01 clurgmgrd[16576]: <warning> #68: Failed to start service:RHTTPD; return value: 1 May 27 16:53:09 clu01 clurgmgrd[16576]: <notice> Stopping service service:RHTTPD May 27 16:53:09 clu01 clurgmgrd[16576]: <notice> Service service:RHTTPD is recovering May 27 16:53:09 clu01 clurgmgrd[16576]: <warning> #71: Relocating failed service service:RHTTPD May 27 16:53:09 clu01 clurgmgrd[16576]: <notice> Service service:RHTTPD is stopped Thanks Ing. Stefano Elmopi Gruppo Darco - Area ICT Sistemi Via Ostiense 131/L Corpo B, 00154 Roma cell. 3466147165 tel. 0657060500 email:ste...@so... |
From: Marc G. <gr...@at...> - 2009-05-26 14:29:50
|
Hi Stefano,

sorry for the late response.

On Wednesday 13 May 2009 10:51:04 Stefano Elmopi wrote:
> Hi Marc,
>
> very very thank for your answer.......... I was becoming crazy !!!
> I tried and now it works.
> I take advantage of this email to ask for other informations.
>
> - It's possible configure two network interfaces of the server in order to
>   have two interfaces Bond, for example bond0.11 and bond0.22,
>   one for communication intra-cluster and the other for the service
>   configured on the cluster ??

Yes, see http://www.open-sharedroot.org/documentation/administrators-handbook/part-viii-cluster-administration/
That should be self-explanatory; both the VLAN and the non-VLAN setups are described.
If you want to use VLANs you also need to install the rpm comoonics-bootimage-extras-network
and build a new initrd, otherwise vconfig is not added to the initrd.

> - The ip address configured for the service given by the cluster, for
>   my files cluster.conf:
>
>   <resources>
>       <ip address="10.43.100.204" monitor_link="1"/>
>       <script file="/etc/init.d/httpd" name="httpd"/>
>   </resources>
>   <service autostart="0" domain="failover" name="HTTPD">
>       <ip ref="10.43.100.204"/>
>       <script ref="httpd"/>
>   </service>
>
>   I have to configure on a node of the cluster
>
>   <clusternode name="clu01" votes="1" nodeid="1">
>       <com_info>
>           <syslog name="clu01"/>
>           <rootvolume name="/dev/sda2" fstype="ocfs2"/>
>           <eth name="eth0" ip="10.43.100.203" mac="00:15:60:56:75:FD"/>
>           <eth name="eth1" ip="10.43.100.204" mac="00:15:60:56:75:FC"/>
>       </com_info>
>   </clusternode>
>
>   If I want to relocate the service on the node_2 when the node_1 fails,
>   I must also configure the ip (10.43.100.204) in the section <com_info>
>   of the node_2 ??

No, the IP to be failed over needs to be set up exclusively in the resources section.

> - It's possible to configure a service, for example httpd,
>   to be active simultaneously on multiple nodes instead
>   of having it running on one node and move it when the node falls ??

Yes, but you will need a loadbalancer "in front" of or "integrated in" the cluster.

--
Gruss / Regards,

Marc Grimme
http://www.atix.de/ http://www.open-sharedroot.org/ |
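Since the thread is about moving a service between nodes while keeping its IP, the usual clusvcadm calls are worth having at hand. These are standard rgmanager commands and not specific to this setup; member and service names are the ones used above:

# Enable the service on a specific member
clusvcadm -e RHTTPD -m clu01

# Relocate the running service to the other node; the <ip> resource moves with it
clusvcadm -r RHTTPD -m clu02

# Disable the service again
clusvcadm -d RHTTPD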
From: Stefano E. <ste...@so...> - 2009-05-13 08:51:23
|
Hi Marc, very very thank for your answer.......... I was becoming crazy !!! I tried and now it works. I take advantage of this email to ask for other informations. - It's possible configure two network interfaces of the server in order to have two interfaces Bond, for example bond0.11 and bond0.22, one for communication intra-cluster and the other for the service configured on the cluster ?? - The ip address configured for the service given by the cluster, for my files cluster.conf: <resources> <ip address="10.43.100.204" monitor_link="1"/> <script file="/etc/init.d/httpd" name="httpd"/> </resources> <service autostart="0" domain="failover" name="HTTPD"> <ip ref="10.43.100.204"/> <script ref="httpd"/> </service> I have to configure on a node of the cluster <clusternode name="clu01" votes="1" nodeid="1"> <com_info> <syslog name="clu01"/> <rootvolume name="/dev/sda2" fstype="ocfs2"/> <eth name="eth0" ip="10.43.100.203" mac="00:15:60:56:75:FD"/> <eth name="eth1" ip="10.43.100.204" mac="00:15:60:56:75:FC"/> </com_info> </clusternode> If I want to relocate the service on the node_2 when the node_1 fails, I must also configure the ip (10.43.100.204) in the section <com_info> of the node_2 ?? - It's possible to configure a service, for example httpd, to be active simultaneously on multiple nodes instead of having it running on one node and move it when the node falls ?? Thanks Ing. Stefano Elmopi Gruppo Darco - Area ICT Sistemi Via Ostiense 131/L Corpo B, 00154 Roma cell. 3466147165 tel. 0657060500 email:ste...@so... Il giorno 11/mag/09, alle ore 21:50, Marc Grimme ha scritto: > Hi Stefano, > On Monday 11 May 2009 18:49:44 Stefano Elmopi wrote: >> Hi Marc, >> >> I had read in the guide "Installing and Configuring a Shared Root >> Cluster" >> that you can make the bond with interfaces, I'm trying to do it but I >> am having problems. 
>> On the server in normal boot, without Open-Shared: >> >> ifconfig >> bond0 Link encap:Ethernet HWaddr 00:15:60:56:75:FD >> inet addr:10.43.100.203 Bcast:10.43.255.255 Mask: >> 255.255.0.0 >> inet6 addr: fe80::215:60ff:fe56:75fd/64 Scope:Link >> UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1 >> RX packets:30897 errors:0 dropped:0 overruns:0 frame:0 >> TX packets:332 errors:0 dropped:0 overruns:0 carrier:0 >> collisions:0 txqueuelen:0 >> RX bytes:2688920 (2.5 MiB) TX bytes:38900 (37.9 KiB) >> >> eth0 Link encap:Ethernet HWaddr 00:15:60:56:75:FD >> inet6 addr: fe80::215:60ff:fe56:75fd/64 Scope:Link >> UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 >> RX packets:15657 errors:0 dropped:0 overruns:0 frame:0 >> TX packets:326 errors:0 dropped:0 overruns:0 carrier:0 >> collisions:0 txqueuelen:1000 >> RX bytes:1363172 (1.3 MiB) TX bytes:38408 (37.5 KiB) >> Interrupt:217 >> >> eth1 Link encap:Ethernet HWaddr 00:15:60:56:75:FD >> inet6 addr: fe80::215:60ff:fe56:75fd/64 Scope:Link >> UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 >> RX packets:15240 errors:0 dropped:0 overruns:0 frame:0 >> TX packets:6 errors:0 dropped:0 overruns:0 carrier:0 >> collisions:0 txqueuelen:1000 >> RX bytes:1325748 (1.2 MiB) TX bytes:492 (492.0 b) >> Interrupt:225 >> >> >> >> My cluster.conf is: >> >> <?xml version="1.0"?> >> <cluster config_version="5" name="cluOCFS2" type="ocfs2"> >> >> <cman expected_votes="1" two_node="1"/> >> >> <clusternodes> >> >> <clusternode name="clu01" votes="1" nodeid="1"> >> <com_info> >> <syslog name="clu01"/> >> <rootvolume name="/dev/sda2" fstype="ocfs2"/> >> <eth name="eth0" mac="00:15:60:56:75:FD" master=bond0 >> slave=yes/> >> <eth name="eth1" mac="00:15:60:56:75:FC" master=bond0 >> slave=yes/> >> <eth name="bond0" ip="10.43.100.203" >> mask="255.255.0.0"/> >> <fenceackserver user="root" passwd="test123"/> >> </com_info> >> </clusternode> >> >> <clusternode name="clu02" votes="1" nodeid="2"> >> <com_info> >> <syslog name="clu01"/> >> <rootvolume name="/dev/sda2" fstype="ocfs2"/> >> <eth name="eth0" ip="10.43.100.187" >> mac="00:15:60:56:77:11"/> >> <fenceackserver user="root" passwd="test123"/> >> </com_info> >> </clusternode> >> >> <rm log_level="7" log_facility="local4"> >> <failoverdomains> >> <failoverdomain name="failover" ordered="0"> >> <failoverdomainnode name="clu01" >> priority="1"/> >> </failoverdomain> >> </failoverdomains> >> <resources> >> <ip address="10.43.100.203" >> monitor_link="1"/> >> <script file="/etc/init.d/httpd" >> name="httpd"/> >> </resources> >> <service autostart="0" domain="failover" >> name="HTTPD"> >> <ip ref="10.43.100.203"/> >> <script ref="httpd"/> >> </service> >> </rm> >> >> </clusternodes> >> >> </cluster> >> >> when Open-Shared start, it stopped saying that could not validate >> cluster configuration, >> and I can not understand what the parameter of the cluster.conf is >> not >> validated. >> I think that my bond configuration is not correct. >> >> Thanks > Looks good but you missed a tiny little thing. You need to enclose > the xml > attributes with ' or ". Means in line 12: > <eth name="eth0" mac="00:15:60:56:75:FD" master=bond0 slave=yes/> > => > <eth name="eth0" mac="00:15:60:56:75:FD" master="bond0" slave="yes"/> > dito for line 13. > > When I use your cluster.conf I can see it as follows: > [marc@generix3 ~]$ com-queryclusterconf -f /tmp/cluster.conf nodeids > Traceback (most recent call last): > File "/usr/bin/com-queryclusterconf", line 97, in ? 
> doc = reader.fromStream(file) > File "/usr/lib64/python2.4/site-packages/_xmlplus/dom/ext/reader/ > Sax2.py", > line 372, in fromStream > self.parser.parse(s) > File "/usr/lib64/python2.4/site-packages/_xmlplus/sax/ > expatreader.py", line > 109, in parse > xmlreader.IncrementalParser.parse(self, source) > File "/usr/lib64/python2.4/site-packages/_xmlplus/sax/ > xmlreader.py", line > 123, in parse > self.feed(buffer) > File "/usr/lib64/python2.4/site-packages/_xmlplus/sax/ > expatreader.py", line > 220, in feed > self._err_handler.fatalError(exc) > File "/usr/lib64/python2.4/site-packages/_xmlplus/dom/ext/reader/ > Sax2.py", > line 340, in fatalError > raise exception > xml.sax._exceptions.SAXParseException: <fdopen>:12:62: not well-formed > (invalid token) > when I change line12/13 like said it looks ok: > [marc@generix3 ~]$ com-queryclusterconf -f /tmp/cluster.conf nodeids > 2 1 > > Hope that helps keep going ;-) . > Regards Marc. > > -- > Gruss / Regards, > > Marc Grimme > http://www.atix.de/ http://www.open-sharedroot.org/ > > |
From: Dan M. <dan...@or...> - 2009-05-12 15:45:45
|
Sorry to take a while to respond to this. I decided to fall back to a 2.6.18-92
(EL5u2) kernel. I still have to customize it, but thought it would be closer to
your tested environment. I've worked around a few problems and eventually got an
EL5u2 (with modified kernel) OCFS2 OSR to boot.

First, I still had to go back and patch up the boot-lib.sh file to properly
recognize "enterprise linux enterprise linux"... is there an rpm on download.atix
that has that fixed yet?

Next, I am still seeing many "FATAL" modprobes. Perhaps the usages of modprobe in
your scripts should use the "-q" option? Anyway, none of the FATAL messages is
really fatal I think.

When I used the mkinitrd -l option, one of the DLM modules did not find its way
into the initrd. (Console output read "unknown filesystem type 'ocfs2_dlmfs'".)
With no -l option this problem went away... and since the 2.6.18 config file I am
using has fewer modules, I didn't run into the xen bug I saw before that failed to
properly load large ramdisks.

Last, in working through the above failed boots, failure always drops into a bash
shell instead of a rescueshell.

I've attached console output of a failed boot (not the final successful boot)...
let me know if you need more.

Thanks,
Dan

P.S. My rpm list is the same as before except I used bootimage-1.4-21 instead of 1.4.19.

> -----Original Message-----
> From: Marc Grimme [mailto:gr...@at...]
> Sent: Wednesday, May 06, 2009 1:40 AM
> To: Dan Magenheimer
> Cc: ope...@li...
> Subject: Re: [OSR-users] New Preview RPMs for next Release Candidate of
> comoonics-bootimage
>
> On Tuesday 05 May 2009 01:29:28 Dan Magenheimer wrote:
> > Hi Marc --
> >
> > I've worked through the OSR setup process again in my
> > environment and am documenting it.
> >
> > For the most part, it is working, but intermittently.
> > Booting some nodes works fine once and then fails the
> > next time with no changes. One common failure appears
> > to be due to a failure in "Detecting nodeid & nodename..."
> That's very strange perhaps something has not stabilized.
> Could you start with com-debug and sent the output?
>
> > But a problem I've seen with this new cluster: When
> > I have a problem (such as the above "Detecting..."),
> > the boot process on this cluster no longer falls into
> > a rescue shell but instead into a bash shell. So I
> > can't look at the repository.
> That's also very strange that should never happen ;-) . Is this only
> while "Detecting.." or at any other cases?
> Again logs would be very interesting.
>
> > One difference I've used with this cluster is that I did
> > the com.../mkinitrd with your new "-l" option. I wonder
> > if maybe this option is failing to copy a shared library
> > or something else the rescueshell needs so the rescushell
> > fails to work?
> No I don't think so. The -l option reduces the amount of
> modules loaded into initrd. That's it. So I would doubt it.
>
> > Thanks,
> > Dan
> Thanks Marc.
>
> --
> Gruss / Regards,
>
> Marc Grimme
> http://www.atix.de/ http://www.open-sharedroot.org/
>
|
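Two small checks related to the points above. Whether the ocfs2_dlmfs module actually made it into a given initrd can be verified by listing the archive; the image path below is only a placeholder for whatever image your mkinitrd run produced. The modprobe line just illustrates what a quiet "-q" call looks like; whether the OSR scripts should adopt it is for the maintainers to decide:

# List the contents of a gzip-compressed initrd and look for the ocfs2 bits
# (replace /boot/initrd-osr.img with the image you actually built)
zcat /boot/initrd-osr.img | cpio -it | grep -i ocfs2

# modprobe -q stays silent when a module does not exist, instead of printing FATAL
modprobe -q ocfs2_dlmfs || echo "ocfs2_dlmfs not available as a module (maybe built in)"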
From: Marc G. <gr...@at...> - 2009-05-11 19:52:36
|
Hi Stefano, On Monday 11 May 2009 18:49:44 Stefano Elmopi wrote: > Hi Marc, > > I had read in the guide "Installing and Configuring a Shared Root > Cluster" > that you can make the bond with interfaces, I'm trying to do it but I > am having problems. > On the server in normal boot, without Open-Shared: > > ifconfig > bond0 Link encap:Ethernet HWaddr 00:15:60:56:75:FD > inet addr:10.43.100.203 Bcast:10.43.255.255 Mask: > 255.255.0.0 > inet6 addr: fe80::215:60ff:fe56:75fd/64 Scope:Link > UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1 > RX packets:30897 errors:0 dropped:0 overruns:0 frame:0 > TX packets:332 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:2688920 (2.5 MiB) TX bytes:38900 (37.9 KiB) > > eth0 Link encap:Ethernet HWaddr 00:15:60:56:75:FD > inet6 addr: fe80::215:60ff:fe56:75fd/64 Scope:Link > UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 > RX packets:15657 errors:0 dropped:0 overruns:0 frame:0 > TX packets:326 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:1363172 (1.3 MiB) TX bytes:38408 (37.5 KiB) > Interrupt:217 > > eth1 Link encap:Ethernet HWaddr 00:15:60:56:75:FD > inet6 addr: fe80::215:60ff:fe56:75fd/64 Scope:Link > UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 > RX packets:15240 errors:0 dropped:0 overruns:0 frame:0 > TX packets:6 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:1325748 (1.2 MiB) TX bytes:492 (492.0 b) > Interrupt:225 > > > > My cluster.conf is: > > <?xml version="1.0"?> > <cluster config_version="5" name="cluOCFS2" type="ocfs2"> > > <cman expected_votes="1" two_node="1"/> > > <clusternodes> > > <clusternode name="clu01" votes="1" nodeid="1"> > <com_info> > <syslog name="clu01"/> > <rootvolume name="/dev/sda2" fstype="ocfs2"/> > <eth name="eth0" mac="00:15:60:56:75:FD" master=bond0 > slave=yes/> > <eth name="eth1" mac="00:15:60:56:75:FC" master=bond0 > slave=yes/> > <eth name="bond0" ip="10.43.100.203" mask="255.255.0.0"/> > <fenceackserver user="root" passwd="test123"/> > </com_info> > </clusternode> > > <clusternode name="clu02" votes="1" nodeid="2"> > <com_info> > <syslog name="clu01"/> > <rootvolume name="/dev/sda2" fstype="ocfs2"/> > <eth name="eth0" ip="10.43.100.187" > mac="00:15:60:56:77:11"/> > <fenceackserver user="root" passwd="test123"/> > </com_info> > </clusternode> > > <rm log_level="7" log_facility="local4"> > <failoverdomains> > <failoverdomain name="failover" ordered="0"> > <failoverdomainnode name="clu01" > priority="1"/> > </failoverdomain> > </failoverdomains> > <resources> > <ip address="10.43.100.203" monitor_link="1"/> > <script file="/etc/init.d/httpd" name="httpd"/> > </resources> > <service autostart="0" domain="failover" name="HTTPD"> > <ip ref="10.43.100.203"/> > <script ref="httpd"/> > </service> > </rm> > > </clusternodes> > > </cluster> > > when Open-Shared start, it stopped saying that could not validate > cluster configuration, > and I can not understand what the parameter of the cluster.conf is not > validated. > I think that my bond configuration is not correct. > > Thanks Looks good but you missed a tiny little thing. You need to enclose the xml attributes with ' or ". Means in line 12: <eth name="eth0" mac="00:15:60:56:75:FD" master=bond0 slave=yes/> => <eth name="eth0" mac="00:15:60:56:75:FD" master="bond0" slave="yes"/> dito for line 13. 
When I use your cluster.conf I can see it as follows: [marc@generix3 ~]$ com-queryclusterconf -f /tmp/cluster.conf nodeids Traceback (most recent call last): File "/usr/bin/com-queryclusterconf", line 97, in ? doc = reader.fromStream(file) File "/usr/lib64/python2.4/site-packages/_xmlplus/dom/ext/reader/Sax2.py", line 372, in fromStream self.parser.parse(s) File "/usr/lib64/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse xmlreader.IncrementalParser.parse(self, source) File "/usr/lib64/python2.4/site-packages/_xmlplus/sax/xmlreader.py", line 123, in parse self.feed(buffer) File "/usr/lib64/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed self._err_handler.fatalError(exc) File "/usr/lib64/python2.4/site-packages/_xmlplus/dom/ext/reader/Sax2.py", line 340, in fatalError raise exception xml.sax._exceptions.SAXParseException: <fdopen>:12:62: not well-formed (invalid token) when I change line12/13 like said it looks ok: [marc@generix3 ~]$ com-queryclusterconf -f /tmp/cluster.conf nodeids 2 1 Hope that helps keep going ;-) . Regards Marc. -- Gruss / Regards, Marc Grimme http://www.atix.de/ http://www.open-sharedroot.org/ |
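A plain XML well-formedness check catches unquoted attribute values like these before a new initrd is built. xmllint (from libxml2) is a generic tool and not part of the OSR packages, so this is just a convenient shortcut alongside the check Marc shows above:

# Fails with a parse error plus line/column if an attribute value is unquoted
xmllint --noout /etc/cluster/cluster.conf && echo "cluster.conf is well-formed"

# The same kind of check via the OSR query tool, as used in this thread
com-queryclusterconf -f /etc/cluster/cluster.conf nodeids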
From: Stefano E. <ste...@so...> - 2009-05-11 16:49:53
|
Hi Marc, I had read in the guide "Installing and Configuring a Shared Root Cluster" that you can make the bond with interfaces, I'm trying to do it but I am having problems. On the server in normal boot, without Open-Shared: ifconfig bond0 Link encap:Ethernet HWaddr 00:15:60:56:75:FD inet addr:10.43.100.203 Bcast:10.43.255.255 Mask: 255.255.0.0 inet6 addr: fe80::215:60ff:fe56:75fd/64 Scope:Link UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1 RX packets:30897 errors:0 dropped:0 overruns:0 frame:0 TX packets:332 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:2688920 (2.5 MiB) TX bytes:38900 (37.9 KiB) eth0 Link encap:Ethernet HWaddr 00:15:60:56:75:FD inet6 addr: fe80::215:60ff:fe56:75fd/64 Scope:Link UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 RX packets:15657 errors:0 dropped:0 overruns:0 frame:0 TX packets:326 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:1363172 (1.3 MiB) TX bytes:38408 (37.5 KiB) Interrupt:217 eth1 Link encap:Ethernet HWaddr 00:15:60:56:75:FD inet6 addr: fe80::215:60ff:fe56:75fd/64 Scope:Link UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 RX packets:15240 errors:0 dropped:0 overruns:0 frame:0 TX packets:6 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:1325748 (1.2 MiB) TX bytes:492 (492.0 b) Interrupt:225 My cluster.conf is: <?xml version="1.0"?> <cluster config_version="5" name="cluOCFS2" type="ocfs2"> <cman expected_votes="1" two_node="1"/> <clusternodes> <clusternode name="clu01" votes="1" nodeid="1"> <com_info> <syslog name="clu01"/> <rootvolume name="/dev/sda2" fstype="ocfs2"/> <eth name="eth0" mac="00:15:60:56:75:FD" master=bond0 slave=yes/> <eth name="eth1" mac="00:15:60:56:75:FC" master=bond0 slave=yes/> <eth name="bond0" ip="10.43.100.203" mask="255.255.0.0"/> <fenceackserver user="root" passwd="test123"/> </com_info> </clusternode> <clusternode name="clu02" votes="1" nodeid="2"> <com_info> <syslog name="clu01"/> <rootvolume name="/dev/sda2" fstype="ocfs2"/> <eth name="eth0" ip="10.43.100.187" mac="00:15:60:56:77:11"/> <fenceackserver user="root" passwd="test123"/> </com_info> </clusternode> <rm log_level="7" log_facility="local4"> <failoverdomains> <failoverdomain name="failover" ordered="0"> <failoverdomainnode name="clu01" priority="1"/> </failoverdomain> </failoverdomains> <resources> <ip address="10.43.100.203" monitor_link="1"/> <script file="/etc/init.d/httpd" name="httpd"/> </resources> <service autostart="0" domain="failover" name="HTTPD"> <ip ref="10.43.100.203"/> <script ref="httpd"/> </service> </rm> </clusternodes> </cluster> when Open-Shared start, it stopped saying that could not validate cluster configuration, and I can not understand what the parameter of the cluster.conf is not validated. I think that my bond configuration is not correct. Thanks Ing. Stefano Elmopi Gruppo Darco - Area ICT Sistemi Via Ostiense 131/L Corpo B, 00154 Roma cell. 3466147165 tel. 0657060500 email:ste...@so... |
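Independent of the cluster.conf question, the state of the bond itself can be checked on the running system; /proc/net/bonding is provided by the standard Linux bonding driver, so this is generic rather than OSR-specific:

# Shows the bonding mode, link state and the enslaved interfaces (eth0, eth1)
cat /proc/net/bonding/bond0

# MAC addresses as the kernel sees them, to compare against the mac= entries in cluster.conf
ip -o link show | grep -i ether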
From: Marc G. <gr...@at...> - 2009-05-06 07:40:26
|
On Tuesday 05 May 2009 01:29:28 Dan Magenheimer wrote:
> Hi Marc --
>
> I've worked through the OSR setup process again in my
> environment and am documenting it.
>
> For the most part, it is working, but intermittently.
> Booting some nodes works fine once and then fails the
> next time with no changes. One common failure appears
> to be due to a failure in "Detecting nodeid & nodename..."

That's very strange, perhaps something has not stabilized.
Could you start with com-debug and send the output?

> But a problem I've seen with this new cluster: When
> I have a problem (such as the above "Detecting..."),
> the boot process on this cluster no longer falls into
> a rescue shell but instead into a bash shell. So I
> can't look at the repository.

That's also very strange, that should never happen ;-) . Does this happen only
during "Detecting.." or also in other cases?
Again, logs would be very interesting.

> One difference I've used with this cluster is that I did
> the com.../mkinitrd with your new "-l" option. I wonder
> if maybe this option is failing to copy a shared library
> or something else the rescueshell needs so the rescushell
> fails to work?

No, I don't think so. The -l option only reduces the number of modules loaded
into the initrd. That's it. So I would doubt it.

> Thanks,
> Dan

Thanks, Marc.

--
Gruss / Regards,

Marc Grimme
http://www.atix.de/ http://www.open-sharedroot.org/ |
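For the "send the output" part: on a Xen setup the guest's boot and initrd messages can be captured from dom0, which is usually easier than copying them by hand. The only assumption here is that the node is a Xen domU named node1; com-debug itself is whatever debug switch Marc refers to above and is not spelled out further in this thread:

# From dom0: attach to the guest console and keep a copy of the boot messages
xm console node1 | tee /tmp/node1-boot.log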
From: Dan M. <dan...@or...> - 2009-05-04 23:30:16
|
Hi Marc -- I've worked through the OSR setup process again in my environment and am documenting it. For the most part, it is working, but intermittently. Booting some nodes works fine once and then fails the next time with no changes. One common failure appears to be due to a failure in "Detecting nodeid & nodename..." But a problem I've seen with this new cluster: When I have a problem (such as the above "Detecting..."), the boot process on this cluster no longer falls into a rescue shell but instead into a bash shell. So I can't look at the repository. One difference I've used with this cluster is that I did the com.../mkinitrd with your new "-l" option. I wonder if maybe this option is failing to copy a shared library or something else the rescueshell needs so the rescushell fails to work? Thanks, Dan > -----Original Message----- > From: Dan Magenheimer > Sent: Thursday, April 30, 2009 10:44 AM > To: Marc Grimme; ope...@li... > Subject: RE: [OSR-users] New Preview RPMs for next Release > Candidate of > comoonics-bootimage > > > > > I think you are probably working with a xen-linux guest with > > > kernel version 2.6.18? I think the patch that causes > > > this problem came after linux-2.6.19 in this Linux commit: > > > > > > http://www.kernel.org/hg/index.cgi/linux-2.6/rev/710f6c6bd06c > > I read from this patch that heartbeat=local is needed if you > > want to use ocfs2 > > as local filesystem. That's the reason why heartbeat is set > > to be local. But > > that implies not to being able to share this filesystem with > > more nodes. > > > > But when you want to use it concurrently you must not use > > heartbeat=local. > > > > Or did I read something wrong? > > I agree that "heartbeat=local" is confusing. > > The ocfs2 team tells me that "heartbeat=local" is used > for a cluster where the heartbeat is on the same disk > as the filesystem. For non-clustered mounts, > "heartbeat=none" is used. There was going to be > a "heartbeat=global" to designate one volume as the > heartbeat device for all volumes, but that patch > was shelved. > > Unrelated, I often randomly get the following message > when trying to boot an OSR el5-ocfs2 image. When I > reboot (without changing anything at all) it boots fine. > All of my config files look fine (and they work fine > on reboot). Maybe there is a race somewhere? > > Thanks, > Dan > > The nodeid for this node could not be detected. This usually > means the MAC-Addresses > specified in the cluster configuration could not be matched > to any MAC-Adresses of this node. > > You can either fix this and build a new initrd or hardset > the nodeid via "setparameter nodeid <number>" > |
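When nodeid detection fails intermittently, it can help to compare what the node actually reports against what the initrd's cluster configuration expects; both checks below are generic and match the kind of ifconfig output shown elsewhere in this thread:

# MAC addresses of all interfaces on this node (the HWaddr column)
ifconfig -a | grep -i hwaddr

# mac= attributes the cluster configuration expects for its nodes
grep -o 'mac="[^"]*"' /etc/cluster/cluster.conf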
From: Dan M. <dan...@or...> - 2009-04-30 16:44:16
|
> > I think you are probably working with a xen-linux guest with
> > kernel version 2.6.18? I think the patch that causes
> > this problem came after linux-2.6.19 in this Linux commit:
> >
> > http://www.kernel.org/hg/index.cgi/linux-2.6/rev/710f6c6bd06c
> I read from this patch that heartbeat=local is needed if you
> want to use ocfs2 as local filesystem. That's the reason why heartbeat is set
> to be local. But that implies not to being able to share this filesystem with
> more nodes.
>
> But when you want to use it concurrently you must not use heartbeat=local.
>
> Or did I read something wrong?

I agree that "heartbeat=local" is confusing.

The ocfs2 team tells me that "heartbeat=local" is used
for a cluster where the heartbeat is on the same disk
as the filesystem. For non-clustered mounts,
"heartbeat=none" is used. There was going to be
a "heartbeat=global" to designate one volume as the
heartbeat device for all volumes, but that patch
was shelved.

Unrelated, I often randomly get the following message
when trying to boot an OSR el5-ocfs2 image. When I
reboot (without changing anything at all) it boots fine.
All of my config files look fine (and they work fine
on reboot). Maybe there is a race somewhere?

Thanks,
Dan

The nodeid for this node could not be detected. This usually means the MAC-Addresses
specified in the cluster configuration could not be matched to any MAC-Adresses of this node.

You can either fix this and build a new initrd or hardset the nodeid via "setparameter nodeid <number>" |
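To make the distinction above concrete, the heartbeat choice shows up as an ocfs2 mount option. The device and mount points below are only illustrative (the device matches the rootvolume examples earlier in the thread), and whether your volume was formatted for cluster or local use still has to match the option you pick:

# Clustered mount with on-disk heartbeat (o2cb cluster stack running)
mount -t ocfs2 -o heartbeat=local /dev/sda2 /mnt/shared

# Non-clustered, single-node mount
mount -t ocfs2 -o heartbeat=none /dev/sda2 /mnt/local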
From: Marc G. <gr...@at...> - 2009-04-30 15:03:37
|
On Wednesday 29 April 2009 15:26:31 Dan Magenheimer wrote:
> > > FYI, the FATAL messages appear to be harmless and I have
> > > now successfully booted to a login prompt, then booted
> > > a second and third VM using the OSR also to login prompts!
> >
> > Great. Sound good.
> >
> > May I ask what kind of FATAL messages you get? The ones
> > during module loading?
>
> Yes, they were module loading errors. Even though they
> report as FATAL, I think since all of those modules are
> compiled-in, they are not really fatal.
>
> > > I *think* an ocfs2 kernel patch is required, or possibly
> > > a "-o heartbeat=local" option might need to be specified.
> >
> > This I didn't really understand. Cause for me it worked
> > without something like this.
>
> I think you are probably working with a xen-linux guest with
> kernel version 2.6.18? I think the patch that causes
> this problem came after linux-2.6.19 in this Linux commit:
>
> http://www.kernel.org/hg/index.cgi/linux-2.6/rev/710f6c6bd06c

I read from this patch that heartbeat=local is needed if you want to use ocfs2
as a local filesystem. That's the reason why heartbeat is set to local. But that
implies not being able to share this filesystem with more nodes.

But when you want to use it concurrently you must not use heartbeat=local.

Or did I read something wrong?

Regards, Marc.

> > > Let me reproduce everything and check on that. If you
> > > have any changed/newer rpm's I should use, please
> > > send them or send download URLs. Otherwise, I will
> > > use this list and hack in any changes needed.
> >
> > I only know of the one with the distribution detection. This
> > will go upstream with the next release (comoonics-bootimage-1.4-21).
>
> OK, if you could send me email when it is on download.atix.de,
> I would appreciate it.
>
> Thanks,
> Dan

--
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/ http://www.open-sharedroot.org/

ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 |
85716 Unterschleissheim | www.atix.de | www.open-sharedroot.org

Registergericht: Amtsgericht Muenchen, Registernummer: HRB 168930,
USt.-Id.: DE209485962 | Vorstand: Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.) |
Vorsitzender des Aufsichtsrats: Dr. Martin Buss |