Thread: [SSI-devel] OpenSSI 1.9.2 on Fedora Core 3 instability
From: <fx...@ma...> - 2006-09-05 16:56:16
Hello All,

I'm experiencing a lot of problems (stability and performance) with an OpenSSI 1.9.2 cluster on Fedora Core 3 (it's a 5-node cluster). I first asked on ssic-linux-users, but Roger Tsang suggested I try this list.

The problem seems to be related to the cfs_async process, which is very CPU-bound and gives the entire cluster bad performance. The scenario is a 5-node cluster used to build C code, so lots of read/write I/O on small files. The I/O performance of the SSI cluster is incredibly poor, and the time (and CPU power) spent on I/O management has a great impact on the overall compile time: for example, a build that takes about 8 minutes on a "real" 4-CPU hyperthreaded server takes the same time on my 5-node cluster, with 28 CPUs in total and make -j 20, when I expected much better performance. A simple file copy on the same CFS filesystem takes ages.

This is my cluster configuration:

node1 (init): 4-way 3.16 GHz Xeon, 1 MB L2 cache, 8 GB RAM
node2 (init): 2x dual-core 2.80 GHz, 2 MB L2 cache, 12 GB RAM
node3: 2-way 3.00 GHz, 1 MB L2 cache, 8 GB RAM (PXE boot)
node4: 2-way 3.00 GHz, 1 MB L2 cache, 8 GB RAM (PXE boot)
node5: 2-way 3.00 GHz, 1 MB L2 cache, 2 GB RAM (PXE boot)
I/O for node1 and node2: 5 TB SAN, connected over a 2 Gb/s FC link
interconnect: dedicated 1 Gb/s switch, connected to the interconnect NIC of each node
HA-LVS for load-balanced connections over the public interface (each node has two Gigabit NICs, one for the public network and one for the interconnect)

It's potentially a very, very good cluster.

On the other hand, failover doesn't work at all. I have 2 init nodes connected to a SAN via FC controllers; the 2 init nodes are the failover masters, but if node1 goes down (for example), node2 and every other node in the cluster print the message "ipcnamserver completed" on the console and the whole cluster is completely stuck (the only remedy is to power all the servers off and on again).
Why is the SSI kernel compiled with support for only 4 GB of RAM? (There are servers in my cluster with a lot more memory, but only 4 GB is used.)

Below you can read the whole story. Can anyone give me some help debugging the CFS performance and failover problems?

John Steinman, Roger suggested I contact you: can you help me? (Let me know if you need any other info or logs or whatever.)

My goal is to put in place a farm for compiling software (is this a correct use of OpenSSI?)

Thanks in advance
Pete

------------------------------------------------------------------------------

Hi Pete,

I'm not running into these problems. Perhaps SSI-1.9.3 (to be released) will give you a better experience regarding CFS performance. Maybe this is SMP-related, since I am running UP.

Ask John Steinman (on the devel list) if he can reproduce your CFS problems on his SMP cluster.

Roger

On 9/1/06, fx...@ma... <fx...@ma...> wrote:
> Hello Roger,
>
> Another update: I ran 4 processes, from nodes 1, 2, 4 and 5.
>
> These processes were:
>
> Node1: scp copy from a server (a repository of 20 GB of data) to SAN partition 1
> Node2: scp copy from another server (a repository of 8.6 GB of data) to SAN partition 2
> Node4: tar extraction of an uncompressed archive of about 30 GB of data (on SAN partition 2)
> Node5: cp of a 30 GB file from SAN partition 1 to SAN partition 2
>
> Node1's load climbed progressively until it reached 995 (I'm using the webview tool); then node1 hung completely (unreachable from the network). Node2 started failover, but it didn't work: I saw the 'ipcnamserver completed' message on every console, and the whole cluster was completely down (though still pingable from the network).
>
> This seems to confirm my theory about the cfs_async processes (they load up the server, which may also explain the bad I/O performance) before the crash.
>
> What do you think?
>
> Regards
> Pete.
> > On Thursday, August 31, 2006, at 07:28AM, Roger Tsang <rog...@gm...> wrote:
> >
> > >Hi,
> > >
> > >Yup, I am using a dedicated network segment for ICS. At the moment mine are directly
> > >connected, gigE full-duplex with MTU 6800, though I think I would get similar numbers
> > >when connected to my gigE switch at MTU 1500. Both my ICS interfaces are using the
> > >latest Yukon Marvell driver from SysKonnect.
> > >
> > >1. /etc/clustertab tells you which interface is on the ICS.
> > >2. Have you done ttcp raw network speed tests on the ICS? Does your network support
> > >jumbo frames? Check the health of your network with netstat. Try connecting two nodes
> > >directly, without your switch. Do you get the same results?
> > >3. You are copying from node1 to node1 on the same filesystem in your test below and
> > >getting 3 MB/s, correct?
> > >
> > >Roger
> > >
> > >On 8/30/06, fx...@ma... <fx...@ma...> wrote:
> > >> Hello Roger,
> > >>
> > >> From your test it seems that your CFS works well; does your cluster use a dedicated
> > >> network segment for the interconnect? If yes, what's the link speed?
> > >>
> > >> The second NIC (eth1) of every server is connected to a dedicated 10/100/1000 switch;
> > >> all the cards are e1000, and the logs report a 1000 Mb/s full-duplex link, but when I
> > >> write from any node to /home, for example, I get an average speed of 700-800 KB/s
> > >> (sar -n DEV shows that the interconnect network segment is not busy at all).
> > >>
> > >> Where should I look for the problem?
> > >>
> > >> The entire cluster is now installed from scratch, and node1 and node2 are directly
> > >> connected to a SAN (before the SSI installation, hdparm reported a transfer rate of
> > >> about 120 MB/s).
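Roger's point 2, checking raw network speed independently of CFS, can be sketched with a minimal TCP throughput probe. This is not the ttcp tool itself, just an illustrative stand-in run over the loopback; on the cluster you would run ttcp (or a receiver like this) on one node's ICS address and the sender on another node:

```python
# Minimal raw TCP throughput probe in the spirit of ttcp -- a sketch,
# not the real tool. Receiver and sender run in one process over the
# loopback; on a cluster, bind the receiver to a node's ICS address.
import socket
import threading
import time

CHUNK = 64 * 1024
TOTAL = 16 * 1024 * 1024  # 16 MiB test payload

def receive_all(listener, result):
    conn, _ = listener.accept()
    got = 0
    while True:
        data = conn.recv(CHUNK)
        if not data:
            break
        got += len(data)
    conn.close()
    result["bytes"] = got

listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

result = {}
rx = threading.Thread(target=receive_all, args=(listener, result))
rx.start()

tx = socket.create_connection(("127.0.0.1", port))
start = time.time()
sent = 0
payload = b"\x00" * CHUNK
while sent < TOTAL:
    tx.sendall(payload)
    sent += len(payload)
tx.close()
rx.join()
listener.close()

elapsed = time.time() - start
print("%d bytes in %.3fs -> %.1f MB/s"
      % (result["bytes"], elapsed, result["bytes"] / 1e6 / elapsed))
```

If a probe like this between two ICS interfaces shows wire-speed throughput while CFS copies crawl at 2-3 MB/s, the bottleneck is above the network layer.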
> >>
> >> This is my fstab:
> >>
> >> # This file is edited by fstab-sync - see 'man fstab-sync' for details
> >> UUID=df880e73-72d8-4ffe-a397-bb4ab4aa95a5  /  ext3  chard,defaults,node=1:2  1 1
> >> LABEL=/boot     /boot           ext3    defaults,node=1                    1 2
> >> LABEL=/home1    /home1          ext3    chard,defaults,node=1:2            1 2
> >> LABEL=/home2    /home2          ext3    chard,defaults,node=1:2            1 2
> >> LABEL=/shared   /shared         ext3    chard,defaults,node=1:2            1 2
> >> none            /dev/pts        devpts  gid=5,mode=620,node=*              0 0
> >> #none           /dev/shm        tmpfs   defaults                           0 0
> >> none            /proc           proc    defaults,node=*                    0 0
> >> none            /sys            sysfs   defaults,node=*                    0 0
> >> /dev/sdd2       swap            swap    defaults,node=1                    0 0
> >> /dev/sdd2       swap            swap    defaults,node=2                    0 0
> >> /dev/sda1       swap            swap    defaults,node=3                    0 0
> >> /dev/sda1       swap            swap    defaults,node=4                    0 0
> >> /dev/sda8       swap            swap    defaults,node=5                    0 0
> >> /dev/hda        /media/cdrom    auto    pamconsole,ro,exec,noauto,managed  0 0
> >> /dev/fd0        /media/floppy   auto    pamconsole,exec,noauto,managed     0 0
> >>
> >> /, /home1, /home2, /shared and /boot are on the SAN;
> >> node2 has its own boot device (and boots properly).
> >>
> >> I've added another node to the cluster, with the same configuration as the other
> >> ones (so it's now 5 nodes).
> >>
> >> I've just tried, from node1 (with the same result on all the other nodes, and no
> >> other users connected), to copy a directory named "test" (about 4.6 GB) to "test.1".
> >> It took 26 minutes! (as you can see from my cut and paste below)
> >>
> >> [root@node1-public work1]# time cp -r test test1.1
> >>
> >> real    26m5.864s
> >> user    0m0.491s
> >> sys     0m23.288s
> >> [root@node1-public work1]#
> >>
> >> During the transfer, I noticed that the interconnect segment was not busy at all.
> >> Why does this happen? CFS works via the interconnect, right?
> >>
> >> The behaviour is very strange: the copy starts at a good speed, then begins to slow
> >> down for no reason; often the copy stalls completely...
> >>
> >> Any idea?
> >>
> >> Thanks in advance for your kind answer
> >> Pete
> >>
> >> On Tuesday, August 29, 2006, at 01:13AM, Roger Tsang <rog...@gm...> wrote:
> >>
> >> >About your CFS performance problem: I don't run into it on my 2-node cluster, and
> >> >my cluster is not half as powerful as yours - just SATA 7200 rpm disks and UP
> >> >machines. When copying whole directories of about 1 GB each into another directory
> >> >on the same filesystem (chard mount), I get the numbers below. I know it's a crude
> >> >test, but it clearly doesn't slow down to 2-3 MB/s on my cluster.
> >> >
> >> >File I/O on a hard mount is slower than on a soft mount because hard mounts
> >> >guarantee data has been written, to support filesystem failover.
> >> >
> >> >Copy operation on just one node:
> >> >real    0m37.734s
> >> >user    0m0.093s
> >> >sys     0m3.515s
> >> >
> >> >Copy operation while another copy runs on the 2nd node at the same time:
> >> >real    0m57.153s
> >> >user    0m0.092s
> >> >sys     0m3.448s
> >> >
> >> >It doesn't slow down to 2-3 MB/s.
> >> >
> >> >I also have QoS (HTB+SFQ) on the ICS network interfaces, putting things like ICMP
> >> >and UDP ICS-related traffic at highest priority. Maybe that helps.
> >> >
> >> >Roger
> >> >
> >> >On 8/28/06, fx...@ma... <fx...@ma...> wrote:
> >> >> Hello guys,
> >> >>
> >> >> (First of all, sorry for my bad English; I will do the best I can.)
> >> >>
> >> >> Four weeks ago I installed OpenSSI 1.9.2 on Fedora Core 3, following all the
> >> >> instructions in the various README.* files.
> >> >>
> >> >> Initially I had a lot of difficulties because the documentation is not up to date
> >> >> for FC3 and SSI 1.9.2 (especially regarding DRBD), but in the end I got a 4-node
> >> >> SSI cluster installed, and I have some problems I'd like to submit to you folks.
> >> >>
> >> >> The cluster has 4 nodes:
> >> >>
> >> >> node1 (init): 4-way 3.16 GHz Xeon, 1 MB L2 cache, 8 GB RAM
> >> >> node2 (init): 2x dual-core 2.80 GHz, 2 MB L2 cache, 12 GB RAM
> >> >> node3: 2-way 3.00 GHz, 1 MB L2 cache, 8 GB RAM
> >> >> node4: 2-way 3.00 GHz, 1 MB L2 cache, 8 GB RAM
> >> >>
> >> >> Each node has 2 NICs, one connected to the "public" network and the other
> >> >> connected to the "interconnect" network segment.
> >> >>
> >> >> HA-CVIP is configured on the 2 primary init nodes, the whole cluster is seen from
> >> >> the public network as a single IP address, and connection load balancing works well.
> >> >>
> >> >> Each network segment is 1 Gb/s full duplex, and the interconnect network is of
> >> >> course a dedicated segment connected to a private switch (a 1 Gb/s 3Com switch).
> >> >>
> >> >> A 4.0 TB SAN will be connected to the init nodes to provide root and home failover
> >> >> (at the moment, the cluster is configured with root failover but without the SAN
> >> >> attached, so if a failover occurs the whole cluster goes down).
> >> >>
> >> >> I've noticed some strange behaviour, and I'm wondering if some of you folks can help me:
> >> >>
> >> >> 1) The I/O of the entire cluster is quite slow; if more than 1 user does massive
> >> >> I/O, read or write (for example, a cvs checkout of a 3 GB module), the performance
> >> >> of the entire cluster is affected. I've done a lot of tests, and the results
> >> >> suggest that I/O through CFS is quite slow: for example, if I do an scp copy from
> >> >> the public network, the copy is load-balanced by CVIP to one of the 4 nodes and I
> >> >> get a throughput of 40-50 MB/s; if another user starts a concurrent scp, the
> >> >> network transfer drops to 2-3 MB/s...
To exclude a network problem, I logged into the cluster and tried a cp from one directory to another: the transfer rate is about 25-30 MB/s, but if I add another cp (or any other I/O) during the copy, the throughput drops to 2-3 MB/s and the entire cluster is completely stuck. And the disks on the init nodes are Ultra320, 15000 rpm, so I expect better performance.
> >> >>
> >> >> 2) If I use clusternode_shutdown -t0 -h -N2 now (for example, but the behaviour is
> >> >> the same on all nodes), the node's kernel panics. No problem with
> >> >> clusternode_shutdown -t0 -r -Nxxx.
> >> >>
> >> >> 3) Process load balancing does not seem to spread the load equally across the
> >> >> nodes; on my cluster, node1 (the init node) is always much more loaded than the others.
> >> >>
> >> >> 4) Randomly, one of the 2 normal (non-init) nodes panics while joining the cluster.
> >> >>
> >> >> 5) Java and cvs pserver processes do not migrate at all (in dmesg I see a message
> >> >> like "the process has exited" or something similar); CVS migrates but stops working
> >> >> (a socket migration problem?).
> >> >>
> >> >> 6) Why is the kernel not compiled with big-memory support? Is there some technical reason?
> >> >>
> >> >> Thanks to anyone who can help.
> >> >>
> >> >> Regards, Pete
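For reference, the `time cp -r` transcript quoted earlier (4.6 GB in 26m5.864s) works out to roughly the same 2-3 MB/s figure Roger asks about in his question 3. A quick back-of-the-envelope check:

```python
# Effective throughput of the copy quoted above:
# a 4.6 GB directory copied in 26m5.864s (the "real" time from `time cp -r`).
size_mib = 4.6 * 1024          # directory size in MiB, as reported
elapsed_s = 26 * 60 + 5.864    # 26m5.864s in seconds

throughput = size_mib / elapsed_s
print("%.1f MiB/s" % throughput)   # ~3.0 MiB/s
```

That is two orders of magnitude below both the ~120 MB/s hdparm baseline and the Gigabit interconnect, which is consistent with the bottleneck being in CFS rather than in the disks or the network.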
From: Karl M. <km...@gm...> - 2006-09-05 19:21:08
Pete,

I would highly suggest moving to 1.9.3 ASAP. Between 1.9.2 and 1.9.3 a bug was found where untarring a large file would hang the system or take a very long time to finish. This bug has been fixed, and it sounds like exactly what you are hitting. It relates to cfs_writepages.

-Karl
From: Vladimir R. <one...@gm...> - 2006-09-05 20:58:26
The fix mentioned by Karl was discussed in the thread "cfs_writepages():Is it a bug?" started on 6/21/2006 and was submitted to the ssi source tree in July 2006 - cfs/write.c ver. 1.19-1.22 Vladimir On 9/5/06, Karl Merritts <km...@gm...> wrote: > Pete, > I would highly suggest moving to 1.9.3 asap. Between 1.9.2 and 1.9.3 there > was a bug found where an untar of a large file would hang the system/take a > long time to finish. This bug has been fixed and it sounds like that is > exactly what you are hitting currently. Relates to cfs_writepages. > > -Karl > > > On 9/5/06, fx...@ma... < fx...@ma...> wrote: > > Hello All > > > > I'm experiencing a lots of problem (stability, performance) in a Fedora > Core 1.9.2 SSI cluster (it's a 5 nodes, cluster). > > > > I'm first asking to ssic-linux-users but Roger Tsang ask me to try in this > list. > > > > The problem, seems to be related to cfs_async process, that is very CPU > bound (and give to the entire cluster bad performance): the scenario is a 5 > node cluster used to build C code (so, a lots of read/write I/O of small > files); the problem is that the I/O performance of SSI cluster is incredibly > poor and the time (and CPU power) spent for I/O management have a great > impact on the overall compile time (for example, the compile time on a > "real" 4 CPU hyperthread is about 8 minuts, on my 5 node cluster, with an > overall of 28 CPUs, using make -j 20 I got the same time of the "single > server"... and I expect very better performance. > > > > A simple file copy on the same CFS filesystems, takes age. 
> > > > This is my cluster configuration: > > > > node1 (init): 4-way 3,16 GHz Xeon with 1 MB L2 cache and 8 GB RAM > > node2 (init): 2x2 core 2.80 GHz 2 MB L2 cache and 12 GB RAM > > node3: 2-way 3.00 GHz 1 MB L2 cache and 8 GB RAM (PXE boot) > > node4: 2-way 3.00 GHz 1 MB L2 cache and 8 GB RAM (PXE boot) > > node5: 2-way 3.00 GHz 1 MB L2 cache and 2 GB RAM (PXE boot) > > I/O for node1 and node2: SAN, 5 TB, 2 GB/s link connected via FC > > interconnect: 1 GB/s dedicated switch, connected to the interconnect NIC > of each node. > > HA-LVS for load balanced connection over the public interface (each node > have 2 GB/s nic, one for public network, one for interconnect). > > > > It's potentially a very very good cluster. > > > > On the other hand, the failover doesn't work at all; I have 2 init node > connected to a SAN via FC controller; the 2 init node are master for > failvoer but, if node1 go down (for example) the node2, and all the node in > the cluster, print a message in console "ipcnamserver completed" and all the > cluster is completly stuck (just power off all the server, and power on > again). > > > > Why the SSI kernel is compiled with only 4GB RAM support? (In my cluster, > there are servers with a lot of memory but only 4 GB is used) > > > > Below you can read all the story, anyone that can give me some help to > debug CFS performance and failover problem? > > > > John Steinman, Roger suggest me to contact you: can you help me? (Let me > know if need any other info or logs or whaever). > > > > My goal is to putting in place a farm in order to compile software (is > this the correct utilization of openSSI?) > > > > Thanks in advance > > Pete > > > > > > > > > ------------------------------------------------------------------------------ > > Hi Pete, > > > > I'm not running into these problems. Perhaps SSI-1.9.3 (to be > > released) will give you a better experience regarding CFS performance. > > Maybe this is SMP related since I am running UP. 
> > > > Ask John Steinman (on the devel list) if he can reproduce your CFS > > problems on his SMP cluster. > > > > Roger > > > > > > On 9/1/06, fx...@ma... < fx...@ma...> wrote: > > > Hello Roger > > > > > > another update. I run 4 process, from node 1,2,4,5. > > > > > > These process are: > > > > > > Node1: scp copy from a server, it's a repository of 20 GB of data to san > partition n.1 > > > Node2: scp copy from another server, it's a repository of 8.6 GB data to > san partition n.2 > > > Node4: TAR extraction of a tar archive (not compressed) of about 30 GB > of data (on san partition n.2) > > > Node5: cp from SAN partition 1 to SAN partition 2 of a 30 GB file. > > > > > > The node1, start to load, progressively, untile he reach a load of 995 > (I'm using the webview tool); then, node1 completely stuck (is unreacheable > from the network): Node2, start failover, but, is not working (I can see the > 'ipcnamserver completed' message on all console, but all the cluster is > completely down (but is still pingable from the network). > > > > > > Seems that my theory about cfs_async processes is verified (start to > load the server; maybe this is the reason for the bad I/O performance) and, > then crash. > > > > > > > > > What do you think? > > > > > > > > > Regards > > > Pete. > > > > > > On Thursday, August 31, 2006, at 07:28AM, Roger Tsang < > rog...@gm...> wrote: > > > > > > >Hi, > > > > > > > >Yup I am using dedicated network segment for ICS. At the moment mine > > > >are directly connected and gigE full-duplex and MTU 6800 though I > > > >think I will get similar numbers when connected to my gigE switch and > > > >MTU 1500. Both my ICS interfaces are using the same latest Yukon > > > >Marvell drive from syskonnect. > > > > > > > >1. /etc/clustertab tells you which interface is on the ICS. > > > >2. Have you done ttcp raw network speed tests on the ICS? Does your > > > >network support jumbo frames? Check for health of your network with > > > >netstat. 
> > > >Try connecting two nodes directly, without your switch. Are
> > > >you getting the same results?
> > > >3. You are copying from node1 to node1 on the same filesystem in your
> > > >test below and getting 3 MB/sec, correct?
> > > >
> > > >Roger
> > > >
> > > >On 8/30/06, fx...@ma... <fx...@ma...> wrote:
> > > >> Hello Roger,
> > > >>
> > > >> Your test shows that your CFS works well. Does your cluster use a dedicated network segment for the interconnect? If yes, what's the link speed?
> > > >>
> > > >> The 2nd (eth1) network card of each of my servers is connected to a dedicated 10/100/1000 switch; all the cards are e1000, and the link I see in my logs is 1000 Mb/s full duplex, but when I try to write from any node to /home, for example, I get an average speed of 700-800 KB/s... (`sar -n DEV` shows me that the interconnect network segment is not busy at all.)
> > > >>
> > > >> Where can I look for the problem?
> > > >>
> > > >> The entire cluster is now installed from scratch, and node1 and node2 are directly connected to a SAN (before the SSI installation, hdparm told me that the transfer rate is about 120 MB/s).
> > > >>
> > > >> This is my fstab:
> > > >>
> > > >> # This file is edited by fstab-sync - see 'man fstab-sync' for details
> > > >> UUID=df880e73-72d8-4ffe-a397-bb4ab4aa95a5 /  ext3  chard,defaults,node=1:2  1 1
> > > >> LABEL=/boot    /boot          ext3    defaults,node=1                    1 2
> > > >> LABEL=/home1   /home1         ext3    chard,defaults,node=1:2            1 2
> > > >> LABEL=/home2   /home2         ext3    chard,defaults,node=1:2            1 2
> > > >> LABEL=/shared  /shared        ext3    chard,defaults,node=1:2            1 2
> > > >> none           /dev/pts       devpts  gid=5,mode=620,node=*              0 0
> > > >> #none          /dev/shm       tmpfs   defaults                           0 0
> > > >> none           /proc          proc    defaults,node=*                    0 0
> > > >> none           /sys           sysfs   defaults,node=*                    0 0
> > > >> /dev/sdd2      swap           swap    defaults,node=1                    0 0
> > > >> /dev/sdd2      swap           swap    defaults,node=2                    0 0
> > > >> /dev/sda1      swap           swap    defaults,node=3                    0 0
> > > >> /dev/sda1      swap           swap    defaults,node=4                    0 0
> > > >> /dev/sda8      swap           swap    defaults,node=5                    0 0
> > > >> /dev/hda       /media/cdrom   auto    pamconsole,ro,exec,noauto,managed  0 0
> > > >> /dev/fd0       /media/floppy  auto    pamconsole,exec,noauto,managed     0 0
> > > >>
> > > >> /, /home1, /home2, /shared and /boot are on the SAN.
> > > >> node2 has its own boot device (and boots properly).
> > > >>
> > > >> I've added another node to the cluster, with the same configuration as the other ones (so now it's 5 nodes).
> > > >>
> > > >> I've just tried, from node1 (with the same result on all the other nodes, and note that no other users were connected), to copy a directory named "test" (about 4.6 GB) to "test1.1". It took 26 minutes! (as you can see from my cut-and-paste below)
> > > >>
> > > >> [root@node1-public work1]# time cp -r test test1.1
> > > >>
> > > >> real    26m5.864s
> > > >> user    0m0.491s
> > > >> sys     0m23.288s
> > > >> [root@node1-public work1]#
> > > >>
> > > >> During the transfer, I noticed that the interconnect segment is not busy at all. Why does this happen? CFS works via the interconnect, right?
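[For reference, the copy test quoted above works out to about 3 MB/s, the same ballpark figure discussed elsewhere in the thread. The arithmetic, with the size and elapsed (real) time taken from the test above:]

```shell
# Effective throughput of the copy above: ~4.6 GB in 26m5.864s of real time.
awk 'BEGIN {
    mb   = 4.6 * 1024        # directory size in MB
    secs = 26 * 60 + 5.864   # elapsed (real) time in seconds
    printf "%.1f MB/s\n", mb / secs
}'
# prints: 3.0 MB/s
```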
> > > >>
> > > >> The behaviour is very strange; the copy starts at a good speed, then begins, for no apparent reason, to slow down; often the copy is completely stalled...
> > > >>
> > > >> Any ideas?
> > > >>
> > > >> Thanks in advance for your kind answer
> > > >> Pete
> > > >>
> > > >> On Tuesday, August 29, 2006, at 01:13AM, Roger Tsang <rog...@gm...> wrote:
> > > >>
> > > >> >About your CFS performance problem: I don't run into the same problem
> > > >> >on my 2-node cluster, and my cluster is not half as powerful as yours -
> > > >> >just SATA 7200 rpm disks and UPs. When copying whole directories
> > > >> >of about 1 GB each into another directory on the same filesystem (chard
> > > >> >mount), I get the following. I know it's a rather crude test, but it
> > > >> >clearly doesn't slow down to 2-3 MB/sec on my cluster.
> > > >> >
> > > >> >File I/O on a hard mount is slower than on a soft mount because hard
> > > >> >mounts guarantee data has been written, to support filesystem failover.
> > > >> >
> > > >> >Copy operation on just one node:
> > > >> >real    0m37.734s
> > > >> >user    0m0.093s
> > > >> >sys     0m3.515s
> > > >> >
> > > >> >Copy operation when there is another copy operation on the 2nd node at
> > > >> >the same time:
> > > >> >real    0m57.153s
> > > >> >user    0m0.092s
> > > >> >sys     0m3.448s
> > > >> >
> > > >> >It doesn't slow down to 2-3 MB/sec.
> > > >> >
> > > >> >I also have QoS (HTB+SFQ) on the ICS network interfaces, putting things
> > > >> >like ICMP and UDP ICS-related traffic at highest priority. Maybe that
> > > >> >helps.
> > > >> >
> > > >> >Roger
> > > >> >
> > > >> >
> > > >> >On 8/28/06, fx...@ma... <fx...@ma...> wrote:
> > > >> >> Hello guys,
> > > >> >>
> > > >> >> (First of all, sorry for my bad English; I will do the best I can.)
> > > >> >>
> > > >> >> Four weeks ago I installed OpenSSI 1.9.2 on Fedora Core 3, following all the instructions in the various README.* files.
> > > >> >>
> > > >> >> Initially I had a lot of difficulties because the documentation is not updated for FC3 and SSI 1.9.2 (especially regarding DRBD), but in the end I got a 4-node SSI cluster installed, and I have some problems I would like to submit to you folks.
> > > >> >>
> > > >> >> The cluster has 4 nodes:
> > > >> >>
> > > >> >> node1 (init): 4-way 3.16 GHz Xeon with 1 MB L2 cache and 8 GB RAM
> > > >> >> node2 (init): 2x2-core 2.80 GHz with 2 MB L2 cache and 12 GB RAM
> > > >> >> node3: 2-way 3.00 GHz with 1 MB L2 cache and 8 GB RAM
> > > >> >> node4: 2-way 3.00 GHz with 1 MB L2 cache and 8 GB RAM
> > > >> >>
> > > >> >> Each node has 2 NICs, one connected to the "public" network and the other one connected to the "interconnect" network segment.
> > > >> >>
> > > >> >> HA-CVIP is configured on the 2 primary init nodes; the whole cluster is seen from the public network as a single IP address, and connection load balancing is working well.
> > > >> >>
> > > >> >> Each network segment is full-duplex 1 Gb/s, and the interconnect network is, of course, a dedicated segment connected to a private switch (a 3Com 1 Gb/s switch).
> > > >> >>
> > > >> >> A 4.0 TB SAN will be connected to the init nodes in order to have root and home failover (at the moment, the cluster is configured with root failover but without the SAN attached, so if failover occurs the whole cluster will go down).
> > > >> >>
> > > >> >> I've noticed some strange behaviour, and I'm wondering if some of you folks can help me:
> > > >> >>
> > > >> >> 1) The I/O of the entire cluster is quite slow; if more than one user does some massive I/O, read or write (for example, a CVS checkout of a 3 GB module), the performance of the entire cluster is affected. I've done a lot of tests, and the results suggest that I/O through CFS is quite slow. For example, if I try an scp copy from the public network, my scp copy is load-balanced by CVIP to one of the 4 nodes and I get a throughput of 40-50 MB/s; if another user tries the same concurrent scp, the network transfer drops to 2-3 MB/s. To exclude a network problem, I logged into the cluster and tried a cp from one directory to another: the transfer rate is about 25-30 MB/s, but if I add another cp (or whatever other I/O) during the copy, the throughput drops to 2-3 MB/s and the entire cluster is completely stuck. And the disks on the init nodes are Ultra320, 15000 rpm disks, so I expect better performance.
> > > >> >>
> > > >> >> 2) If I use clusternode_shutdown -t0 -h -N2 now (for example, but the behaviour is the same on all the nodes), the node's kernel panics. No problem with clusternode_shutdown -t0 -r -Nxxx.
> > > >> >>
> > > >> >> 3) Process load balancing doesn't seem to balance the load equally across all the nodes; on my cluster, node1 (the init node) is always much more loaded than the other nodes.
> > > >> >>
> > > >> >> 4) Randomly, one of the 2 normal (non-init) nodes panics while joining the cluster.
> > > >> >>
> > > >> >> 5) Java and cvs pserver processes do not migrate at all (in dmesg I see a message like "the process has exited" or something like that); cvs migrates but no longer works (a socket migration problem?).
> > > >> >>
> > > >> >> 6) Why is the kernel not compiled with big memory support? Is there some technical reason?
> > > >> >>
> > > >> >> Thanks to anyone who can help me.
> > > >> >>
> > > >> >> Regards,
> > > >> >> Pete
>
> -------------------------------------------------------------------------
> Using Tomcat but need to do more? Need to support web services, security?
> Get stuff done quickly with pre-integrated technology to make your job easier
> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
> _______________________________________________
> ssic-linux-devel mailing list
> ssi...@li...
> https://lists.sourceforge.net/lists/listinfo/ssic-linux-devel
|
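[A note for anyone trying to reproduce numbers like those above: a synced write test, run once against a local (non-CFS) disk and once against a CFS mount such as /home1, separates raw disk speed from CFS overhead. A minimal sketch, assuming GNU dd (for `conv=fsync`); the target path is a placeholder, and the MB/s figures dd reports are what you compare:]

```shell
# Write 64 MiB with an fsync at the end and let dd report the rate.
# TARGET is a placeholder - point it first at a local disk, then at a
# CFS mount (e.g. /home1/probe), and compare the rates dd prints.
target="${TARGET:-/tmp/ssi-io-probe}"
result=$(dd if=/dev/zero of="$target" bs=1M count=64 conv=fsync 2>&1)
rm -f "$target"
echo "$result"
```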
From: John S. <joh...@gm...> - 2006-09-08 18:37:04
|
Pete, I'm not sure if I can help you with this one, or if it has been fixed, but let's see if anyone else might be able to point you in the right direction. I'm cc'ing the devel list.

- John

> A question: failover is not working for me. When node1 goes down, node2
> apparently performs the failover action but, at the end, I see the message
> "ipcnameserver completed" and nothing happens; node2 is stuck, and so is
> the whole cluster. Where can I look to understand what is happening?
>
> Thanks
> Pete

--
John F. Steinman
|