Thread: [SSI-devel] OpenSSI 1.9.2 on Fedora Core 3 instability
From: <fx...@ma...> - 2006-09-05 16:56:16
Hello All,

I'm experiencing a lot of problems (stability and performance) with an OpenSSI 1.9.2 cluster on Fedora Core 3 (it's a 5-node cluster). I first asked on ssic-linux-users, but Roger Tsang suggested I try this list.

The problem seems to be related to the cfs_async process, which is very CPU-bound and gives the entire cluster bad performance. The scenario is a 5-node cluster used to build C code, so lots of read/write I/O on small files. The I/O performance of the SSI cluster is incredibly poor, and the time (and CPU power) spent on I/O management has a great impact on the overall compile time: for example, a build that takes about 8 minutes on a "real" 4-CPU hyperthreaded server takes the same time on my 5-node cluster, with 28 CPUs in total and make -j 20, when I expected much better performance. A simple file copy on the same CFS filesystem takes ages.

This is my cluster configuration:

node1 (init): 4-way 3.16 GHz Xeon, 1 MB L2 cache, 8 GB RAM
node2 (init): 2x dual-core 2.80 GHz, 2 MB L2 cache, 12 GB RAM
node3: 2-way 3.00 GHz, 1 MB L2 cache, 8 GB RAM (PXE boot)
node4: 2-way 3.00 GHz, 1 MB L2 cache, 8 GB RAM (PXE boot)
node5: 2-way 3.00 GHz, 1 MB L2 cache, 2 GB RAM (PXE boot)
I/O for node1 and node2: 5 TB SAN, connected over a 2 Gb/s FC link
interconnect: dedicated 1 Gb/s switch, connected to the interconnect NIC of each node
HA-LVS for load-balanced connections over the public interface (each node has two Gigabit NICs, one for the public network and one for the interconnect)

It's potentially a very, very good cluster.

On the other hand, failover doesn't work at all. I have 2 init nodes connected to a SAN via FC controllers; the 2 init nodes are the failover masters, but if node1 goes down (for example), node2 and every other node in the cluster print the message "ipcnamserver completed" on the console and the whole cluster is completely stuck (the only remedy is to power all the servers off and on again).
Why is the SSI kernel compiled with support for only 4 GB of RAM? (There are servers in my cluster with a lot more memory, but only 4 GB is used.)

Below you can read the whole story. Can anyone give me some help debugging the CFS performance and failover problems?

John Steinman, Roger suggested I contact you: can you help me? (Let me know if you need any other info or logs or whatever.)

My goal is to put in place a farm for compiling software (is this a correct use of OpenSSI?)

Thanks in advance
Pete

------------------------------------------------------------------------------

Hi Pete,

I'm not running into these problems. Perhaps SSI-1.9.3 (to be released) will give you a better experience regarding CFS performance. Maybe this is SMP-related, since I am running UP.

Ask John Steinman (on the devel list) if he can reproduce your CFS problems on his SMP cluster.

Roger

On 9/1/06, fx...@ma... <fx...@ma...> wrote:
> Hello Roger,
>
> Another update: I ran 4 processes, from nodes 1, 2, 4 and 5.
>
> These processes were:
>
> Node1: scp copy from a server (a repository of 20 GB of data) to SAN partition 1
> Node2: scp copy from another server (a repository of 8.6 GB of data) to SAN partition 2
> Node4: tar extraction of an uncompressed archive of about 30 GB of data (on SAN partition 2)
> Node5: cp of a 30 GB file from SAN partition 1 to SAN partition 2
>
> Node1's load climbed progressively until it reached 995 (I'm using the webview tool); then node1 hung completely (unreachable from the network). Node2 started failover, but it didn't work: I saw the 'ipcnamserver completed' message on every console, and the whole cluster was completely down (though still pingable from the network).
>
> This seems to confirm my theory about the cfs_async processes (they load up the server, which may also explain the bad I/O performance) before the crash.
>
> What do you think?
>
> Regards
> Pete.
> > On Thursday, August 31, 2006, at 07:28AM, Roger Tsang <rog...@gm...> wrote:
> >
> > >Hi,
> > >
> > >Yup, I am using a dedicated network segment for ICS. At the moment mine are directly
> > >connected, gigE full-duplex with MTU 6800, though I think I would get similar numbers
> > >when connected to my gigE switch at MTU 1500. Both my ICS interfaces are using the
> > >latest Yukon Marvell driver from SysKonnect.
> > >
> > >1. /etc/clustertab tells you which interface is on the ICS.
> > >2. Have you done ttcp raw network speed tests on the ICS? Does your network support
> > >jumbo frames? Check the health of your network with netstat. Try connecting two nodes
> > >directly, without your switch. Do you get the same results?
> > >3. You are copying from node1 to node1 on the same filesystem in your test below and
> > >getting 3 MB/s, correct?
> > >
> > >Roger
> > >
> > >On 8/30/06, fx...@ma... <fx...@ma...> wrote:
> > >> Hello Roger,
> > >>
> > >> From your test it seems that your CFS works well; does your cluster use a dedicated
> > >> network segment for the interconnect? If yes, what's the link speed?
> > >>
> > >> The second NIC (eth1) of every server is connected to a dedicated 10/100/1000 switch;
> > >> all the cards are e1000, and the logs report a 1000 Mb/s full-duplex link, but when I
> > >> write from any node to /home, for example, I get an average speed of 700-800 KB/s
> > >> (sar -n DEV shows that the interconnect network segment is not busy at all).
> > >>
> > >> Where should I look for the problem?
> > >>
> > >> The entire cluster is now installed from scratch, and node1 and node2 are directly
> > >> connected to a SAN (before the SSI installation, hdparm reported a transfer rate of
> > >> about 120 MB/s).
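Roger's point 2, checking raw network speed independently of CFS, can be sketched with a minimal TCP throughput probe. This is not the ttcp tool itself, just an illustrative stand-in run over the loopback; on the cluster you would run ttcp (or a receiver like this) on one node's ICS address and the sender on another node:

```python
# Minimal raw TCP throughput probe in the spirit of ttcp -- a sketch,
# not the real tool. Receiver and sender run in one process over the
# loopback; on a cluster, bind the receiver to a node's ICS address.
import socket
import threading
import time

CHUNK = 64 * 1024
TOTAL = 16 * 1024 * 1024  # 16 MiB test payload

def receive_all(listener, result):
    conn, _ = listener.accept()
    got = 0
    while True:
        data = conn.recv(CHUNK)
        if not data:
            break
        got += len(data)
    conn.close()
    result["bytes"] = got

listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

result = {}
rx = threading.Thread(target=receive_all, args=(listener, result))
rx.start()

tx = socket.create_connection(("127.0.0.1", port))
start = time.time()
sent = 0
payload = b"\x00" * CHUNK
while sent < TOTAL:
    tx.sendall(payload)
    sent += len(payload)
tx.close()
rx.join()
listener.close()

elapsed = time.time() - start
print("%d bytes in %.3fs -> %.1f MB/s"
      % (result["bytes"], elapsed, result["bytes"] / 1e6 / elapsed))
```

If a probe like this between two ICS interfaces shows wire-speed throughput while CFS copies crawl at 2-3 MB/s, the bottleneck is above the network layer.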
> >>
> >> This is my fstab:
> >>
> >> # This file is edited by fstab-sync - see 'man fstab-sync' for details
> >> UUID=df880e73-72d8-4ffe-a397-bb4ab4aa95a5  /  ext3  chard,defaults,node=1:2  1 1
> >> LABEL=/boot     /boot           ext3    defaults,node=1                    1 2
> >> LABEL=/home1    /home1          ext3    chard,defaults,node=1:2            1 2
> >> LABEL=/home2    /home2          ext3    chard,defaults,node=1:2            1 2
> >> LABEL=/shared   /shared         ext3    chard,defaults,node=1:2            1 2
> >> none            /dev/pts        devpts  gid=5,mode=620,node=*              0 0
> >> #none           /dev/shm        tmpfs   defaults                           0 0
> >> none            /proc           proc    defaults,node=*                    0 0
> >> none            /sys            sysfs   defaults,node=*                    0 0
> >> /dev/sdd2       swap            swap    defaults,node=1                    0 0
> >> /dev/sdd2       swap            swap    defaults,node=2                    0 0
> >> /dev/sda1       swap            swap    defaults,node=3                    0 0
> >> /dev/sda1       swap            swap    defaults,node=4                    0 0
> >> /dev/sda8       swap            swap    defaults,node=5                    0 0
> >> /dev/hda        /media/cdrom    auto    pamconsole,ro,exec,noauto,managed  0 0
> >> /dev/fd0        /media/floppy   auto    pamconsole,exec,noauto,managed     0 0
> >>
> >> /, /home1, /home2, /shared and /boot are on the SAN;
> >> node2 has its own boot device (and boots properly).
> >>
> >> I've added another node to the cluster, with the same configuration as the other
> >> ones (so it's now 5 nodes).
> >>
> >> I've just tried, from node1 (with the same result on all the other nodes, and no
> >> other users connected), to copy a directory named "test" (about 4.6 GB) to "test.1".
> >> It took 26 minutes! (as you can see from my cut and paste below)
> >>
> >> [root@node1-public work1]# time cp -r test test1.1
> >>
> >> real    26m5.864s
> >> user    0m0.491s
> >> sys     0m23.288s
> >> [root@node1-public work1]#
> >>
> >> During the transfer, I noticed that the interconnect segment was not busy at all.
> >> Why does this happen? CFS works via the interconnect, right?
> >>
> >> The behaviour is very strange: the copy starts at a good speed, then begins to slow
> >> down for no reason; often the copy stalls completely...
> >>
> >> Any idea?
> >>
> >> Thanks in advance for your kind answer
> >> Pete
> >>
> >> On Tuesday, August 29, 2006, at 01:13AM, Roger Tsang <rog...@gm...> wrote:
> >>
> >> >About your CFS performance problem: I don't run into it on my 2-node cluster, and
> >> >my cluster is not half as powerful as yours - just SATA 7200 rpm disks and UP
> >> >machines. When copying whole directories of about 1 GB each into another directory
> >> >on the same filesystem (chard mount), I get the numbers below. I know it's a crude
> >> >test, but it clearly doesn't slow down to 2-3 MB/s on my cluster.
> >> >
> >> >File I/O on a hard mount is slower than on a soft mount because hard mounts
> >> >guarantee data has been written, to support filesystem failover.
> >> >
> >> >Copy operation on just one node:
> >> >real    0m37.734s
> >> >user    0m0.093s
> >> >sys     0m3.515s
> >> >
> >> >Copy operation while another copy runs on the 2nd node at the same time:
> >> >real    0m57.153s
> >> >user    0m0.092s
> >> >sys     0m3.448s
> >> >
> >> >It doesn't slow down to 2-3 MB/s.
> >> >
> >> >I also have QoS (HTB+SFQ) on the ICS network interfaces, putting things like ICMP
> >> >and UDP ICS-related traffic at highest priority. Maybe that helps.
> >> >
> >> >Roger
> >> >
> >> >On 8/28/06, fx...@ma... <fx...@ma...> wrote:
> >> >> Hello guys,
> >> >>
> >> >> (First of all, sorry for my bad English; I will do the best I can.)
> >> >>
> >> >> Four weeks ago I installed OpenSSI 1.9.2 on Fedora Core 3, following all the
> >> >> instructions in the various README.* files.
> >> >>
> >> >> Initially I had a lot of difficulties because the documentation is not up to date
> >> >> for FC3 and SSI 1.9.2 (especially regarding DRBD), but in the end I got a 4-node
> >> >> SSI cluster installed, and I have some problems I'd like to submit to you folks.
> >> >>
> >> >> The cluster has 4 nodes:
> >> >>
> >> >> node1 (init): 4-way 3.16 GHz Xeon, 1 MB L2 cache, 8 GB RAM
> >> >> node2 (init): 2x dual-core 2.80 GHz, 2 MB L2 cache, 12 GB RAM
> >> >> node3: 2-way 3.00 GHz, 1 MB L2 cache, 8 GB RAM
> >> >> node4: 2-way 3.00 GHz, 1 MB L2 cache, 8 GB RAM
> >> >>
> >> >> Each node has 2 NICs, one connected to the "public" network and the other
> >> >> connected to the "interconnect" network segment.
> >> >>
> >> >> HA-CVIP is configured on the 2 primary init nodes, the whole cluster is seen from
> >> >> the public network as a single IP address, and connection load balancing works well.
> >> >>
> >> >> Each network segment is 1 Gb/s full duplex, and the interconnect network is of
> >> >> course a dedicated segment connected to a private switch (a 1 Gb/s 3Com switch).
> >> >>
> >> >> A 4.0 TB SAN will be connected to the init nodes to provide root and home failover
> >> >> (at the moment, the cluster is configured with root failover but without the SAN
> >> >> attached, so if a failover occurs the whole cluster goes down).
> >> >>
> >> >> I've noticed some strange behaviour, and I'm wondering if some of you folks can help me:
> >> >>
> >> >> 1) The I/O of the entire cluster is quite slow; if more than 1 user does massive
> >> >> I/O, read or write (for example, a cvs checkout of a 3 GB module), the performance
> >> >> of the entire cluster is affected. I've done a lot of tests, and the results
> >> >> suggest that I/O through CFS is quite slow: for example, if I do an scp copy from
> >> >> the public network, the copy is load-balanced by CVIP to one of the 4 nodes and I
> >> >> get a throughput of 40-50 MB/s; if another user starts a concurrent scp, the
> >> >> network transfer drops to 2-3 MB/s...
To exclude a network problem, I logged into the cluster and tried a cp from one directory to another: the transfer rate is about 25-30 MB/s, but if I add another cp (or any other I/O) during the copy, the throughput drops to 2-3 MB/s and the entire cluster is completely stuck. And the disks on the init nodes are Ultra320, 15000 rpm, so I expect better performance.
> >> >>
> >> >> 2) If I use clusternode_shutdown -t0 -h -N2 now (for example, but the behaviour is
> >> >> the same on all nodes), the node's kernel panics. No problem with
> >> >> clusternode_shutdown -t0 -r -Nxxx.
> >> >>
> >> >> 3) Process load balancing does not seem to spread the load equally across the
> >> >> nodes; on my cluster, node1 (the init node) is always much more loaded than the others.
> >> >>
> >> >> 4) Randomly, one of the 2 normal (non-init) nodes panics while joining the cluster.
> >> >>
> >> >> 5) Java and cvs pserver processes do not migrate at all (in dmesg I see a message
> >> >> like "the process has exited" or something similar); CVS migrates but stops working
> >> >> (a socket migration problem?).
> >> >>
> >> >> 6) Why is the kernel not compiled with big-memory support? Is there some technical reason?
> >> >>
> >> >> Thanks to anyone who can help.
> >> >>
> >> >> Regards, Pete
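For reference, the `time cp -r` transcript quoted earlier (4.6 GB in 26m5.864s) works out to roughly the same 2-3 MB/s figure Roger asks about in his question 3. A quick back-of-the-envelope check:

```python
# Effective throughput of the copy quoted above:
# a 4.6 GB directory copied in 26m5.864s (the "real" time from `time cp -r`).
size_mib = 4.6 * 1024          # directory size in MiB, as reported
elapsed_s = 26 * 60 + 5.864    # 26m5.864s in seconds

throughput = size_mib / elapsed_s
print("%.1f MiB/s" % throughput)   # ~3.0 MiB/s
```

That is two orders of magnitude below both the ~120 MB/s hdparm baseline and the Gigabit interconnect, which is consistent with the bottleneck being in CFS rather than in the disks or the network.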
From: Karl M. <km...@gm...> - 2006-09-05 19:21:08
Pete,

I would highly suggest moving to 1.9.3 ASAP. Between 1.9.2 and 1.9.3 a bug was found where untarring a large file would hang the system or take a very long time to finish. This bug has been fixed, and it sounds like exactly what you are hitting. It relates to cfs_writepages.

-Karl
From: Vladimir R. <one...@gm...> - 2006-09-05 20:58:26
The fix mentioned by Karl was discussed in the thread "cfs_writepages():Is it a bug?" started on 6/21/2006 and was submitted to the ssi source tree in July 2006 - cfs/write.c ver. 1.19-1.22 Vladimir On 9/5/06, Karl Merritts <km...@gm...> wrote: > Pete, > I would highly suggest moving to 1.9.3 asap. Between 1.9.2 and 1.9.3 there > was a bug found where an untar of a large file would hang the system/take a > long time to finish. This bug has been fixed and it sounds like that is > exactly what you are hitting currently. Relates to cfs_writepages. > > -Karl > > > On 9/5/06, fx...@ma... < fx...@ma...> wrote: > > Hello All > > > > I'm experiencing a lots of problem (stability, performance) in a Fedora > Core 1.9.2 SSI cluster (it's a 5 nodes, cluster). > > > > I'm first asking to ssic-linux-users but Roger Tsang ask me to try in this > list. > > > > The problem, seems to be related to cfs_async process, that is very CPU > bound (and give to the entire cluster bad performance): the scenario is a 5 > node cluster used to build C code (so, a lots of read/write I/O of small > files); the problem is that the I/O performance of SSI cluster is incredibly > poor and the time (and CPU power) spent for I/O management have a great > impact on the overall compile time (for example, the compile time on a > "real" 4 CPU hyperthread is about 8 minuts, on my 5 node cluster, with an > overall of 28 CPUs, using make -j 20 I got the same time of the "single > server"... and I expect very better performance. > > > > A simple file copy on the same CFS filesystems, takes age. 
> > > > This is my cluster configuration: > > > > node1 (init): 4-way 3,16 GHz Xeon with 1 MB L2 cache and 8 GB RAM > > node2 (init): 2x2 core 2.80 GHz 2 MB L2 cache and 12 GB RAM > > node3: 2-way 3.00 GHz 1 MB L2 cache and 8 GB RAM (PXE boot) > > node4: 2-way 3.00 GHz 1 MB L2 cache and 8 GB RAM (PXE boot) > > node5: 2-way 3.00 GHz 1 MB L2 cache and 2 GB RAM (PXE boot) > > I/O for node1 and node2: SAN, 5 TB, 2 GB/s link connected via FC > > interconnect: 1 GB/s dedicated switch, connected to the interconnect NIC > of each node. > > HA-LVS for load balanced connection over the public interface (each node > have 2 GB/s nic, one for public network, one for interconnect). > > > > It's potentially a very very good cluster. > > > > On the other hand, the failover doesn't work at all; I have 2 init node > connected to a SAN via FC controller; the 2 init node are master for > failvoer but, if node1 go down (for example) the node2, and all the node in > the cluster, print a message in console "ipcnamserver completed" and all the > cluster is completly stuck (just power off all the server, and power on > again). > > > > Why the SSI kernel is compiled with only 4GB RAM support? (In my cluster, > there are servers with a lot of memory but only 4 GB is used) > > > > Below you can read all the story, anyone that can give me some help to > debug CFS performance and failover problem? > > > > John Steinman, Roger suggest me to contact you: can you help me? (Let me > know if need any other info or logs or whaever). > > > > My goal is to putting in place a farm in order to compile software (is > this the correct utilization of openSSI?) > > > > Thanks in advance > > Pete > > > > > > > > > ------------------------------------------------------------------------------ > > Hi Pete, > > > > I'm not running into these problems. Perhaps SSI-1.9.3 (to be > > released) will give you a better experience regarding CFS performance. > > Maybe this is SMP related since I am running UP. 
> > > > Ask John Steinman (on the devel list) if he can reproduce your CFS > > problems on his SMP cluster. > > > > Roger > > > > > > On 9/1/06, fx...@ma... < fx...@ma...> wrote: > > > Hello Roger > > > > > > another update. I run 4 process, from node 1,2,4,5. > > > > > > These process are: > > > > > > Node1: scp copy from a server, it's a repository of 20 GB of data to san > partition n.1 > > > Node2: scp copy from another server, it's a repository of 8.6 GB data to > san partition n.2 > > > Node4: TAR extraction of a tar archive (not compressed) of about 30 GB > of data (on san partition n.2) > > > Node5: cp from SAN partition 1 to SAN partition 2 of a 30 GB file. > > > > > > The node1, start to load, progressively, untile he reach a load of 995 > (I'm using the webview tool); then, node1 completely stuck (is unreacheable > from the network): Node2, start failover, but, is not working (I can see the > 'ipcnamserver completed' message on all console, but all the cluster is > completely down (but is still pingable from the network). > > > > > > Seems that my theory about cfs_async processes is verified (start to > load the server; maybe this is the reason for the bad I/O performance) and, > then crash. > > > > > > > > > What do you think? > > > > > > > > > Regards > > > Pete. > > > > > > On Thursday, August 31, 2006, at 07:28AM, Roger Tsang < > rog...@gm...> wrote: > > > > > > >Hi, > > > > > > > >Yup I am using dedicated network segment for ICS. At the moment mine > > > >are directly connected and gigE full-duplex and MTU 6800 though I > > > >think I will get similar numbers when connected to my gigE switch and > > > >MTU 1500. Both my ICS interfaces are using the same latest Yukon > > > >Marvell drive from syskonnect. > > > > > > > >1. /etc/clustertab tells you which interface is on the ICS. > > > >2. Have you done ttcp raw network speed tests on the ICS? Does your > > > >network support jumbo frames? Check for health of your network with > > > >netstat. 
> > > >Try connecting two nodes directly, without your switch. Are
> > > >you getting the same results?
> > > >3. You are copying from node1 to node1 on the same filesystem in your
> > > >test below and getting 3 MB/sec, correct?
> > > >
> > > >Roger
> > > >
> > > >On 8/30/06, fx...@ma... <fx...@ma...> wrote:
> > > >> Hello Roger,
> > > >>
> > > >> Your test shows that your CFS works well. Does your cluster use a dedicated network segment for the interconnect? If yes, what's the link speed?
> > > >>
> > > >> The 2nd (eth1) network card of each of my servers is connected to a dedicated 10/100/1000 switch; all the cards are e1000, and the link I see in my logs is 1000 Mb/s full duplex, but when I try to write from any node to /home, for example, I get an average speed of 700-800 KB/s... (`sar -n DEV` shows me that the interconnect network segment is not busy at all.)
> > > >>
> > > >> Where can I look for the problem?
> > > >>
> > > >> The entire cluster is now installed from scratch, and node1 and node2 are directly connected to a SAN (before the SSI installation, hdparm told me that the transfer rate is about 120 MB/s).
> > > >>
> > > >> This is my fstab:
> > > >>
> > > >> # This file is edited by fstab-sync - see 'man fstab-sync' for details
> > > >> UUID=df880e73-72d8-4ffe-a397-bb4ab4aa95a5 /  ext3  chard,defaults,node=1:2  1 1
> > > >> LABEL=/boot    /boot          ext3    defaults,node=1                    1 2
> > > >> LABEL=/home1   /home1         ext3    chard,defaults,node=1:2            1 2
> > > >> LABEL=/home2   /home2         ext3    chard,defaults,node=1:2            1 2
> > > >> LABEL=/shared  /shared        ext3    chard,defaults,node=1:2            1 2
> > > >> none           /dev/pts       devpts  gid=5,mode=620,node=*              0 0
> > > >> #none          /dev/shm       tmpfs   defaults                           0 0
> > > >> none           /proc          proc    defaults,node=*                    0 0
> > > >> none           /sys           sysfs   defaults,node=*                    0 0
> > > >> /dev/sdd2      swap           swap    defaults,node=1                    0 0
> > > >> /dev/sdd2      swap           swap    defaults,node=2                    0 0
> > > >> /dev/sda1      swap           swap    defaults,node=3                    0 0
> > > >> /dev/sda1      swap           swap    defaults,node=4                    0 0
> > > >> /dev/sda8      swap           swap    defaults,node=5                    0 0
> > > >> /dev/hda       /media/cdrom   auto    pamconsole,ro,exec,noauto,managed  0 0
> > > >> /dev/fd0       /media/floppy  auto    pamconsole,exec,noauto,managed     0 0
> > > >>
> > > >> /, /home1, /home2, /shared and /boot are on the SAN.
> > > >> node2 has its own boot device (and boots properly).
> > > >>
> > > >> I've added another node to the cluster, with the same configuration as the other ones (so now it's 5 nodes).
> > > >>
> > > >> I've just tried, from node1 (with the same result on all the other nodes, and note that no other users were connected), to copy a directory named "test" (about 4.6 GB) to "test1.1". It took 26 minutes! (as you can see from my cut-and-paste below)
> > > >>
> > > >> [root@node1-public work1]# time cp -r test test1.1
> > > >>
> > > >> real    26m5.864s
> > > >> user    0m0.491s
> > > >> sys     0m23.288s
> > > >> [root@node1-public work1]#
> > > >>
> > > >> During the transfer, I noticed that the interconnect segment is not busy at all. Why does this happen? CFS works via the interconnect, right?
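[For reference, the copy test quoted above works out to about 3 MB/s, the same ballpark figure discussed elsewhere in the thread. The arithmetic, with the size and elapsed (real) time taken from the test above:]

```shell
# Effective throughput of the copy above: ~4.6 GB in 26m5.864s of real time.
awk 'BEGIN {
    mb   = 4.6 * 1024        # directory size in MB
    secs = 26 * 60 + 5.864   # elapsed (real) time in seconds
    printf "%.1f MB/s\n", mb / secs
}'
# prints: 3.0 MB/s
```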
> > > >>
> > > >> The behaviour is very strange; the copy starts at a good speed, then begins, for no apparent reason, to slow down; often the copy is completely stalled...
> > > >>
> > > >> Any ideas?
> > > >>
> > > >> Thanks in advance for your kind answer
> > > >> Pete
> > > >>
> > > >> On Tuesday, August 29, 2006, at 01:13AM, Roger Tsang <rog...@gm...> wrote:
> > > >>
> > > >> >About your CFS performance problem: I don't run into the same problem
> > > >> >on my 2-node cluster, and my cluster is not half as powerful as yours -
> > > >> >just SATA 7200 rpm disks and UPs. When copying whole directories
> > > >> >of about 1 GB each into another directory on the same filesystem (chard
> > > >> >mount), I get the following. I know it's a rather crude test, but it
> > > >> >clearly doesn't slow down to 2-3 MB/sec on my cluster.
> > > >> >
> > > >> >File I/O on a hard mount is slower than on a soft mount because hard
> > > >> >mounts guarantee data has been written, to support filesystem failover.
> > > >> >
> > > >> >Copy operation on just one node:
> > > >> >real    0m37.734s
> > > >> >user    0m0.093s
> > > >> >sys     0m3.515s
> > > >> >
> > > >> >Copy operation when there is another copy operation on the 2nd node at
> > > >> >the same time:
> > > >> >real    0m57.153s
> > > >> >user    0m0.092s
> > > >> >sys     0m3.448s
> > > >> >
> > > >> >It doesn't slow down to 2-3 MB/sec.
> > > >> >
> > > >> >I also have QoS (HTB+SFQ) on the ICS network interfaces, putting things
> > > >> >like ICMP and UDP ICS-related traffic at highest priority. Maybe that
> > > >> >helps.
> > > >> >
> > > >> >Roger
> > > >> >
> > > >> >
> > > >> >On 8/28/06, fx...@ma... <fx...@ma...> wrote:
> > > >> >> Hello guys,
> > > >> >>
> > > >> >> (First of all, sorry for my bad English; I will do the best I can.)
> > > >> >>
> > > >> >> Four weeks ago I installed OpenSSI 1.9.2 on Fedora Core 3, following all the instructions in the various README.* files.
> > > >> >>
> > > >> >> Initially I had a lot of difficulties because the documentation is not updated for FC3 and SSI 1.9.2 (especially regarding DRBD), but in the end I got a 4-node SSI cluster installed, and I have some problems I would like to submit to you folks.
> > > >> >>
> > > >> >> The cluster has 4 nodes:
> > > >> >>
> > > >> >> node1 (init): 4-way 3.16 GHz Xeon with 1 MB L2 cache and 8 GB RAM
> > > >> >> node2 (init): 2x2-core 2.80 GHz with 2 MB L2 cache and 12 GB RAM
> > > >> >> node3: 2-way 3.00 GHz with 1 MB L2 cache and 8 GB RAM
> > > >> >> node4: 2-way 3.00 GHz with 1 MB L2 cache and 8 GB RAM
> > > >> >>
> > > >> >> Each node has 2 NICs, one connected to the "public" network and the other one connected to the "interconnect" network segment.
> > > >> >>
> > > >> >> HA-CVIP is configured on the 2 primary init nodes; the whole cluster is seen from the public network as a single IP address, and connection load balancing is working well.
> > > >> >>
> > > >> >> Each network segment is full-duplex 1 Gb/s, and the interconnect network is, of course, a dedicated segment connected to a private switch (a 3Com 1 Gb/s switch).
> > > >> >>
> > > >> >> A 4.0 TB SAN will be connected to the init nodes in order to have root and home failover (at the moment, the cluster is configured with root failover but without the SAN attached, so if failover occurs the whole cluster will go down).
> > > >> >>
> > > >> >> I've noticed some strange behaviour, and I'm wondering if some of you folks can help me:
> > > >> >>
> > > >> >> 1) The I/O of the entire cluster is quite slow; if more than one user does some massive I/O, read or write (for example, a CVS checkout of a 3 GB module), the performance of the entire cluster is affected. I've done a lot of tests, and the results suggest that I/O through CFS is quite slow. For example, if I try an scp copy from the public network, my scp copy is load-balanced by CVIP to one of the 4 nodes and I get a throughput of 40-50 MB/s; if another user tries the same concurrent scp, the network transfer drops to 2-3 MB/s. To exclude a network problem, I logged into the cluster and tried a cp from one directory to another: the transfer rate is about 25-30 MB/s, but if I add another cp (or whatever other I/O) during the copy, the throughput drops to 2-3 MB/s and the entire cluster is completely stuck. And the disks on the init nodes are Ultra320, 15000 rpm disks, so I expect better performance.
> > > >> >>
> > > >> >> 2) If I use clusternode_shutdown -t0 -h -N2 now (for example, but the behaviour is the same on all the nodes), the node's kernel panics. No problem with clusternode_shutdown -t0 -r -Nxxx.
> > > >> >>
> > > >> >> 3) Process load balancing doesn't seem to balance the load equally across all the nodes; on my cluster, node1 (the init node) is always much more loaded than the other nodes.
> > > >> >>
> > > >> >> 4) Randomly, one of the 2 normal (non-init) nodes panics while joining the cluster.
> > > >> >>
> > > >> >> 5) Java and cvs pserver processes do not migrate at all (in dmesg I see a message like "the process has exited" or something like that); cvs migrates but no longer works (a socket migration problem?).
> > > >> >>
> > > >> >> 6) Why is the kernel not compiled with big memory support? Is there some technical reason?
> > > >> >>
> > > >> >> Thanks to anyone who can help me.
> > > >> >>
> > > >> >> Regards,
> > > >> >> Pete
>
> -------------------------------------------------------------------------
> Using Tomcat but need to do more? Need to support web services, security?
> Get stuff done quickly with pre-integrated technology to make your job easier
> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
> _______________________________________________
> ssic-linux-devel mailing list
> ssi...@li...
> https://lists.sourceforge.net/lists/listinfo/ssic-linux-devel
|
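[A note for anyone trying to reproduce numbers like those above: a synced write test, run once against a local (non-CFS) disk and once against a CFS mount such as /home1, separates raw disk speed from CFS overhead. A minimal sketch, assuming GNU dd (for `conv=fsync`); the target path is a placeholder, and the MB/s figures dd reports are what you compare:]

```shell
# Write 64 MiB with an fsync at the end and let dd report the rate.
# TARGET is a placeholder - point it first at a local disk, then at a
# CFS mount (e.g. /home1/probe), and compare the rates dd prints.
target="${TARGET:-/tmp/ssi-io-probe}"
result=$(dd if=/dev/zero of="$target" bs=1M count=64 conv=fsync 2>&1)
rm -f "$target"
echo "$result"
```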
From: John S. <joh...@gm...> - 2006-09-08 18:37:04
|
Pete, I'm not sure if I can help you with this one, or if it has been fixed, but let's see if anyone else might be able to point you in the right direction. I'm cc'ing the devel list.

- John

> A question: failover is not working for me. When node1 goes down, node2
> apparently performs the failover action but, at the end, I see the message
> "ipcnameserver completed" and nothing happens; node2 is stuck, and so is
> the whole cluster. Where can I look to understand what is happening?
>
> Thanks
> Pete

--
John F. Steinman
|