Re: [Fwd: Re: [SSI-users] frozen cluster when process migrating?]
From: Roger T. <rog...@gm...> - 2005-11-17 02:50:50
Hi Maurice,

The latest sk98lin driver for kernel 2.4/2.6 is at least v8.24.1.3. You
would want to download the driver from syskonnect.com; if that address
doesn't work, try an online search for "SysKonnect".

Roger

On 11/16/05, Maurice Libes <Mau...@co...> wrote:
> Roger Tsang wrote:
> > Hi Maurice,
> >
> > Looks like you have found the problem.
>
> Hmm, let's say I have identified where it comes from, but I don't know
> why. If I send you a kernel debugging file, will you get better
> information in order to understand what the problem is with this card?
>
> I am nevertheless unsatisfied, since I haven't entirely solved the
> problem (running at gigabit speed). For the moment I run the cluster
> on my old 100 Mb switch.
>
> > Are you using the latest sk98lin driver from SysKonnect? I bet the
> > driver that comes with your card is outdated.
>
> No, I use the sk98lin driver that comes with the OpenSSI distribution ;-)
> (Is there a newer version of sk98lin in openssi-1.9?)
>
> /lib/modules/2.4.22-1.2199.nptl-ssi-686-smp/kernel/drivers/net/sk98lin/sk98lin.o
>
> #define BOOT_STRING "sk98lin: Network Device Driver v6.15\n" \
>                     "(C)Copyright 1999-2003 Marvell(R)."
> #define VER_STRING  "6.15"
>
> I will find and try a newer version...
>
> Does anybody on the list use this sk98lin driver with openSSI-1.2 on a
> gigabit switch? I am surprised to be alone in this case.
>
> ML
>
> > Roger
> >
> > On 11/14/05, Maurice Libes <Mau...@co...> wrote:
> > > Roger Tsang wrote:
> > > > Hi,
> > > >
> > > > I think one way of making sure the process doesn't jump back
> > > > and forth between node1 and node4, which may be causing your
> > > > problems, is to use the `migrate` command in your performance
> > > > test. Surely the migrate command will not cause your process to
> > > > jump back to node1 - as long as symphonie is not on the
> > > > loadlevel list.
> > >
> > > Hi... there is some news. I made some tests with an old 100 Mb/s
> > > switch. When I replace my brand-new, well-performing Netgear
> > > gigabit switch with an old 100 Mb switch, the problems change:
> > >
> > > 1. Migration times of the symphonie process are still long, but
> > > they seem to be constant across the nodes and normal for a
> > > 100 Mb/s link.
> > >
> > > E.g. for a process of 325 MB it takes about 35 s, which is
> > > approximately normal for a speed of 90 Mb/s (I measured speeds of
> > > 90-95 Mb/s between nodes with ttcp): 325*8 / 90 = 29 s
> > > (=> is my flow calculation correct?)
> > >
> > > I migrated the symphonie process between every node (with the
> > > 100 Mb switch) and this gave, every time, migration times of
> > > 30-35 s in all directions. So it's better and more regular than
> > > the 15 minutes I get with my gigabit switch ;-)
> > >
> > > 1131965796 mig   :pid 67797(symphonie) -> node 4 mem 14680  my load 41 node4 load 26
> > > 1131965821 mig:  :pid 67797(symphonie) <- node 1 mem 404636 my load 1  node1 load 11
> > >
> > > (The transfer from node 1 to 4 is now 25 s.)
> > >
> > > And in the opposite direction, from node 4 to 1 (39 s):
> > >
> > > 1131965900 mig   :pid 67797(symphonie) -> node 1 mem 404620 my load 98 node1 load 6
> > > 1131965939 mig:  :pid 67797(symphonie) <- node 4 mem 13980  my load 0  node4 load 29
> > >
> > > What do you think of this?
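As a sanity check of the arithmetic above (a sketch only; it uses the
decimal megabytes the thread's own estimate uses, and the 600 Mbit/s
figure is the ttcp measurement quoted later for the gigabit switch):

    # expected transfer time: size [MB] * 8 / throughput [Mbit/s]
    awk 'BEGIN {
        mb = 325                                        # process size from top
        printf "at  90 Mbit/s: %4.1f s\n", mb * 8 / 90  # -> 28.9 s
        printf "at  95 Mbit/s: %4.1f s\n", mb * 8 / 95  # -> 27.4 s
        printf "at 600 Mbit/s: %4.1f s\n", mb * 8 / 600 # ->  4.3 s
    }'

So ~29 s at 90 Mbit/s is correct, the measured 25-39 s is in the
expected range for the 100 Mb switch, and at the ~600 Mbit/s the
gigabit link actually delivers, the same image should move in roughly
4-5 s, which matches the "5-6 seconds" expectation cited later in the
thread.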
> > > Since a friend of mine has the same Netgear gigabit switch on an
> > > openSSI cluster, I guess it could be a problem related to the
> > > sk98lin NIC driver at gigabit speed? Or the kernel? Strange,
> > > since there are no error messages related to the network:
> > >
> > > - netstat -i seems OK (see below)
> > > - ttcp gives network speeds of ~600 Mb/s (rather low for a
> > >   gigabit private network) but speeds of 90-93 Mb/s with the old
> > >   100 Mb/s switch
> > > - I looked into /proc/net/sk98lin/eth0 and everything seemed
> > >   normal
> > >
> > > I would like to send kernel debugging information to you, but I
> > > have never done that and don't know how; I don't know how to use
> > > kdb.
> > >
> > > - Is the openssi kernel prepared for debugging purposes?
> > > - Is there a specific package to get on Debian?
> > > - Do I just reboot with kdb=on among the boot line parameters on
> > >   each node? And then what? What must I do to get the debug
> > >   lines? I didn't find the kdb command.
> > >
> > > Sorry to ask you that... is there a howto for kernel debugging?
> > >
> > > Thanks
> > >
> > > ML
> > >
> > > > Roger
> > > >
> > > > PS. Thanks. Postcard, wine? It is too much. :-)
> > > >
> > > > On 11/10/05, Maurice Libes <Mau...@co...> wrote:
> > > > > Roger Tsang wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Maybe you have saturated node1's bandwidth with the
> > > > > > migrations going on.
> > > > >
> > > > > Hmm, I don't think so (there was only one process running,
> > > > > taking 320 MB of RAM and 50% CPU)... but to eliminate this
> > > > > possibility I will force one migration again with node 1 free
> > > > > of load, right now, below.
> > > > >
> > > > > > I assume you were migrating multiple instances of symphonie?
> > > > >
> > > > > No, just ONE! (There was one big process locked on node 1
> > > > > (the init node), and another one (symphonie) on node 4, for
> > > > > which I forced the migration towards node 1, 2 or 3.)
> > > > >
> > > > > > What is the traffic on node1's network card
> > > > > > before/during/after symphonie migration from node1? Is
> > > > > > there enough remaining bandwidth/cpu/interrupts on node1 to
> > > > > > support your symphonie migration?
> > > > > >
> > > > > > Can you not run anything on the cluster that can
> > > > > > load-balance when testing the migration problem?
> > > > >
> > > > > Yes, I am making this test right now (nothing on node 1), and
> > > > > I launch symphonie on node 4...
> > > > >
> > > > > I type `loadlevel -p 265684` and there is no need to force a
> > > > > migration; it is loadleveled some seconds later, because
> > > > > node 1 is better:
> > > > >
> > > > > 1131646876 loadlb:pid 265684(symphonie) <- node 4 mem 12924 my load 0  node4 load 72
> > > > >
> > > > > 1131646852 loadbl:pid 265684(symphonie) -> node 1 mem 74028 my load 96 node1 load 11
> > > > >
> > > > > 1131646852-1131646876 = 24 s (better than 180 s or a total
> > > > > freeze, but not totally satisfactory; it should take 5-6
> > > > > seconds)
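The timestamp subtraction above can be scripted. A minimal sketch,
assuming the /proc/cluster/loadlog format shown in this thread (epoch
seconds in the first field, "->" on the departure line logged by the
source node, "<-" on the arrival line logged by the destination node);
migtime.sh is a hypothetical helper, not an OpenSSI tool:

    #!/bin/sh
    # Usage: migtime.sh <pid> <src-node> <dst-node>
    pid=$1 src=$2 dst=$3
    # last departure of <pid> logged on the source node
    t0=$(onnode "$src" cat /proc/cluster/loadlog |
         awk -v p="$pid" '$0 ~ ":pid " p && / -> / { t = $1 } END { print t }')
    # last arrival of <pid> logged on the destination node
    t1=$(onnode "$dst" cat /proc/cluster/loadlog |
         awk -v p="$pid" '$0 ~ ":pid " p && / <- / { t = $1 } END { print t }')
    echo "pid $pid: $((t1 - t0)) s from node $src to node $dst"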
> > > > > The process is taking 325 MB of RAM, and swap space is not
> > > > > affected:
> > > > >
> > > > >    PID NODE USER PR NI VIRT RES  SHR S %CPU    TIME+ COMMAND
> > > > > 265684    1 root 25  0 325m 323m 524 R 99.9 30:17.41 symphonie
> > > > >
> > > > > root@comclust5:~# free
> > > > >              total    used    free  shared buffers  cached
> > > > > Mem:       1025816 1012068   13748       0   49504  200804
> > > > > -/+ buffers/cache:  761760  264056
> > > > > Swap:      4096564   58688 4037876
> > > > >
> > > > > Here is the ttcp test while the process is running on node 1:
> > > > >
> > > > > nttcp -T -n 819200 -r comclust5
> > > > >        Bytes Real s  CPU s Real-MBit/s CPU-MBit/s   Calls Real-C/s CPU-C/s
> > > > > l-939524096   47.27  42.50    567.9044   631.6128 1172745 24810.69 27594.0
> > > > > 1-939524096   47.27  12.39    567.8948  2166.5493  819200 17330.77 66117.8
> > > > >
> > > > > nttcp -T -n 819200 comclust5
> > > > >        Bytes Real s  CPU s Real-MBit/s CPU-MBit/s   Calls Real-C/s CPU-C/s
> > > > > l-939524096   42.45  18.23    632.4208  1472.4929  819200 19299.95 44936.9
> > > > > 1-939524096   42.45  35.63    632.3907   753.3973 1817301 42812.68 51004.8
> > > > >
> > > > > It seems to be 560-632 Mbit/s in both directions.
> > > > >
> > > > > > Try running one instance of symphonie on node1. Then make
> > > > > > it migrate to node2.
> > > > >
> > > > > Yes, I do all you ask... symphonie is on node 1 (since the
> > > > > migration above):
> > > > >
> > > > > cat /proc/265684/where
> > > > > 1
> > > > >
> > > > > migrate 4 265684
> > > > >
> > > > > ==> Here it is... we go to hell... it's frozen! (Look until
> > > > > the end.)
> > > > >
> > > > > $ top
> > > > > ==> nothing
> > > > >
> > > > > $ onnode 4 cat /proc/cluster/loadlog | grep sympho
> > > > >
> > > > > [This is the last line logged, and it doesn't correspond to
> > > > > the migrate operation:]
> > > > >
> > > > > 1131646852 loadbl:pid 265684(symphonie) -> node 1 mem 74028 my load 96 node1 load 11
> > > > >
> > > > > ssh comcluster -l root
> > > > > Password:*****
> > > > > => no answer (frozen)
> > > > >
> > > > > > See how fast/slow that is and how much traffic it took by
> > > > > > looking at `netstat -i` output.
> > > > >
> > > > > At the same time (netstat -i on node 4; on node 1 I can't):
> > > > >
> > > > > root@comclust4# netstat -i
> > > > > Kernel Interface table
> > > > > Iface  MTU   Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
> > > > > eth0    1500 0   63443696      0      0      0 64300702      0      0      0 BMRU
> > > > > lo     16436 0      30658      0      0      0    30658      0      0      0 LRU
> > > > >
> > > > > Good or not?
> > > > >
> > > > > It's 20:24... 12 minutes later the migrate command is still
> > > > > in flight... I still don't have control of the cluster (no
> > > > > top, ps, w or ssh).
> > > > >
> > > > > Be patient... it comes. 16 minutes later I have control at
> > > > > the keyboard; top succeeds and displays. It is interesting:
> > > > > symphonie hasn't migrated to node 4; at the end of these 16
> > > > > minutes it is still on node 1.
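One limitation of `netstat -i` here is that it counts packets, not
bytes. On a 2.4 kernel the byte counters live in /proc/net/dev, which
makes it easier to see how much data a migration attempt actually
pushed through. A sketch (eth0 and the 60 s window are assumptions;
since the freeze can take a node's shell with it, snapshot whichever
node still answers, as Maurice did with node 4):

    #!/bin/sh
    # Snapshot a node's eth0 byte counters before and after a forced
    # migration. In /proc/net/dev, after "eth0:" the 1st field is RX
    # bytes and the 9th is TX bytes.
    snap() {
        onnode "$1" cat /proc/net/dev |
        awk '/eth0:/ { sub(/.*:/, ""); print "rx="$1, "tx="$9 }'
    }
    echo "node 4 before: $(snap 4)"
    migrate 4 265684            # the forced migration under test
    sleep 60                    # arbitrary settling time
    echo "node 4 after:  $(snap 4)"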
> > > > > Look:
> > > > >
> > > > > onnode 1
> > > > > 1131650025 mig   :pid 265684(symphonie) -> node 4 mem 15064 my load 41 node4 load 26
> > > > > 1131651008 loadlb:pid 265684(symphonie) <- node 4 mem 13480 my load 0  node4 load 58
> > > > >
> > > > > onnode 4
> > > > > 1131650982 mig:  :pid 265684(symphonie) <- node 1 mem 63840 my load 0  node1 load 26
> > > > > 1131650985 loadbl:pid 265684(symphonie) -> node 1 mem 63860 my load 67 node1 load 25
> > > > >
> > > > > 1131650982-1131650025 = 957 s = 16 minutes to reach node 4.
> > > > >
> > > > > Then 3 s later (1131650985) it is loadbalanced again towards
> > > > > node 1, where it arrives 23 s later (the same as in the first
> > > > > test above): 1131651008-1131650985 = 23 s.
> > > > >
> > > > > To summarize my tests:
> > > > >
> > > > > i) From node 1 to nodes 2, 3, 4: it freezes for about 16
> > > > >    minutes, and the process doesn't stay on the node I
> > > > >    migrate it to.
> > > > > ii) Among nodes 2, 3, 4: it takes about 180 s.
> > > > > iii) From nodes 2, 3, 4 to node 1: the best time, 25 s.
> > > > >
> > > > > onnode 1 netstat -i
> > > > > Kernel Interface table
> > > > > Iface  MTU   Met     RX-OK RX-ERR RX-DRP RX-OVR     TX-OK TX-ERR TX-DRP TX-OVR Flg
> > > > > eth0    1500 0  233943933      0      0      0 228886098      0      0      0 BMRU
> > > > > eth1    1500 0    3478874      0      0      0   5859896      0      0      0 BMRU
> > > > > lo     16436 0      13882      0      0      0     13882      0      0      0 LRU
> > > > >
> > > > > I think you have all the tests and symptoms now.
> > > > >
> > > > > > How about if you had node1 and node2 directly connected
> > > > > > (without a switch) during these tests?
> > > > >
> > > > > Hmm... a friend of mine has the same Netgear 8-port gigabit
> > > > > switch, with openSSI, and my process takes 5 s on his cluster
> > > > > (on FC2). I bought a new switch last week (the same Netgear
> > > > > model, but new): it is the same!
> > > > >
> > > > > But... I will try tomorrow with a crossover cable directly
> > > > > between nodes 1 and 4. I will also send you the result of kdb
> > > > > when done...
> > > > >
> > > > > And what about the sk98lin driver for my NIC, a 3Com 3C2000T?
> > > > >
> > > > > Many thanks
> > > > >
> > > > > ML
> > > > >
> > > > > PS: and... if we solve this problem (I fear it will be a
> > > > > silly thing when we find it), please give me your postal
> > > > > addresses. I will be pleased to send you a postcard from
> > > > > Marseille as thanks... and if you like French red wine, a
> > > > > bottle of good wine?
> > > > >
> > > > > > Roger
> > > > > >
> > > > > > On 11/8/05, Maurice Libes <Mau...@co...> wrote:
> > > > > > > Roger Tsang wrote:
> > > > > > > > Hi Maurice,
> > > > > > > >
> > > > > > > > Have you tried monitoring the network?
> > > > > > > >
> > > > > > > > There is one thing you can do to get more information.
> > > > > > > > If you have a serial console, you can enable kdb (boot
> > > > > > > > with kdb=on) or `echo 1 > /proc/sys/kernel/kdb`, place
> > > > > > > > all nodes into kdb (Ctrl-a-a on the console) when this
> > > > > > > > happens, and send the developers (Laura or me) your
> > > > > > > > "bta A" dump - preferably bzip'ed. Then we can tell you
> > > > > > > > exactly what froze.
> > > > > > > >
> > > > > > > > Roger
> > > > > > >
> > > > > > > OK, thanks for your help. I will try it and send the
> > > > > > > information you need back to you (not before Thursday).
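For anyone following along, Roger's kdb recipe above amounts to the
following on each node (a sketch; it assumes the OpenSSI kernel was
built with kdb support, which is exactly the "is the kernel prepared
for debugging" question asked earlier, and that a serial console is
attached and its output is being captured):

    # enable kdb, either at boot time by appending "kdb=on" to the
    # kernel command line, or at runtime:
    echo 1 > /proc/sys/kernel/kdb

    # when the freeze occurs, break into the debugger from the serial
    # console with Ctrl-a-a, then dump backtraces of all processes:
    #   kdb> bta A
    # and resume the kernel afterwards:
    #   kdb> go

    # compress the captured serial log before mailing it in:
    bzip2 node1-kdb.log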
> > > > > > > Concerning the monitoring of the network, I have used
> > > > > > > nttcp on Debian... you will find enclosed the results of
> > > > > > > the test between nodes 1 and 2 (diag.txt).
> > > > > > >
> > > > > > > John Hughes said they seem normal... about 600 Mb/s in
> > > > > > > both directions.
> > > > > > >
> > > > > > > I can also send the logs from /proc/cluster/loadlog from
> > > > > > > each node. We see the processes being loadbalanced; maybe
> > > > > > > it is of some interest to you.
> > > > > > >
> > > > > > > To summarize my problem, it seems to me that:
> > > > > > >
> > > > > > > 1. Migration time is very long (often about 180 seconds)
> > > > > > > when migrating from node 1 to 2, 3, 4 (or among 2, 3, 4).
> > > > > > > That's long, but it succeeds.
> > > > > > >
> > > > > > > 2. But when migrating from node 1 to nodes 2, 3 or 4, the
> > > > > > > system is often blocked (I can't type commands from the
> > > > > > > procps package: top, ps, w...) and I must reboot the
> > > > > > > destination node in order to return to nominal conditions.
> > > > > > >
> > > > > > > Look at an example right now, if I force the migration
> > > > > > > (pid 198028 is on node 2):
> > > > > > >
> > > > > > > root@comclust5:~# migrate 1 198028
> > > > > > >
> > > > > > > onnode 2 cat /proc/cluster/loadlog | grep symph
> > > > > > > 1131470437 mig   :pid 198028(symphonie) -> node 1 mem 53292 my load 96 node1 load 41
> > > > > > >
> > > > > > > root@comclust5:~# top
> > > > > > > ==> nothing (for 10 minutes)
> > > > > > >
> > > > > > > root@comclust4:~# onnode 1 cat /proc/cluster/loadlog | grep symphoni
> > > > > > > 1131470462 mig:  :pid 198028(symphonie) <- node 2 mem 13368 my load 25 node2 load 31
> > > > > > > (seems to have reached node 1, 25 s later; that's good)
> > > > > > >
> > > > > > > 1131470463 loadbl:pid 198028(symphonie) -> node 2 mem 12948 my load 53 node2 load 56
> > > > > > > (then it leaves node 1 immediately due to loadbalancing;
> > > > > > > there's something running on node 1)
> > > > > > >
> > > > > > > During this time the top command is frozen...
> > > > > > >
> > > > > > > ... 18 minutes later (1131471541), `top` displays on the
> > > > > > > screen. I find the process on node 2 again; I never saw
> > > > > > > it on node 1.
> > > > > > >
> > > > > > > onnode 2 cat /proc/cluster/loadlog | grep symph
> > > > > > > 1131470437 mig   :pid 198028(symphonie) -> node 1 mem 53292 my load 96 node1 load 41
> > > > > > > 1131471541 loadlb:pid 198028(symphonie) <- node 1 mem 53092 my load 1  node1 load 81
> > > > > > >
> > > > > > > If I understand the path of this process, the last logs
> > > > > > > show it leaving node 1 at 1131470463 and reaching node 2
> > > > > > > at 1131471541 (1078 s = 18 minutes).
> > > > > > >
> > > > > > > ========== other log: migration from node 2 to 4 ==========
> > > > > > >
> > > > > > > 1131472179 mig   :pid 198028(symphonie) -> node 4 mem 51940  my load 96 node4 load 26
> > > > > > > 1131472332 mig:  :pid 198028(symphonie) <- node 2 mem 227792 my load 0  node2 load 27
> > > > > > >
> > > > > > > Migration time: 1131472332-1131472179 = 153 s, about 3
> > > > > > > minutes.
> > > > > > >
> > > > > > > 180 s for a process of 325 MB on a gigabit private
> > > > > > > network running at about 600 Mbit/s:
> > > > > > >
> > > > > > > 198028  4 gatti 25 0 325m 323m 512 R 99.9 1752:12 symphonie
> > > > > > >
> > > > > > > I don't understand what's wrong. Maybe I have done
> > > > > > > something wrong? Or I have bad hardware (I plan to buy
> > > > > > > some better computers, and I will see the improvement)?
> > > > > > > A bad module? A bad driver? I don't know.
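Rather than reconstructing times from loadlog after the fact, a forced
migration can be timed directly with the `migrate` command and
/proc/<pid>/where, both used earlier in this thread. A sketch only: it
assumes `migrate` may block (hence the background job), the one-second
poll and 20-minute timeout are arbitrary, and if the freeze also
stalls /proc reads, the script will stall with it:

    #!/bin/sh
    # Usage: timemig.sh <target-node> <pid>   (hypothetical helper)
    target=$1 pid=$2
    start=$(date +%s)
    migrate "$target" "$pid" &
    while [ "$(cat /proc/$pid/where 2>/dev/null)" != "$target" ]; do
        sleep 1
        [ $(( $(date +%s) - start )) -gt 1200 ] && {
            echo "gave up after 20 minutes"; exit 1; }
    done
    echo "pid $pid reached node $target after $(( $(date +%s) - start )) s"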
> > > > > > > Nota: a friend of mine has an openSSI cluster (on FC3,
> > > > > > > with better computers, P4 at 3.2 GHz, and the same 1 Gb
> > > > > > > switch) and doesn't have this problem! (With the same
> > > > > > > computing process, "symphonie".)
> > > > > > >
> > > > > > > ML
> > > > > > >
> > > > > > > > On 11/8/05, Maurice Libes <Mau...@co...> wrote:
> > > > > > > > > Mulyadi Santosa wrote:
> > > > > > > > > > Dear Maurice
> > > > > > > > > >
> > > > > > > > > > > thanks for your help and analysis..
> > > > > > > > > > > i really don't know why there is such a long time
> > > > > > > > > > > for the process migration between some of my
> > > > > > > > > > > nodes (from node 1 towards nodes 2 3 4)
> > > > > > > > > >
> > > > > > > > > > Previously, you said you were running a big
> > > > > > > > > > application; how "big" is it? Can you tell us the
> > > > > > > > > > size of the application, and how big the virtual
> > > > > > > > > > size is (+ dynamic libraries + heap)? You can use
> > > > > > > > > > "pmap" to see it.
> > > > > > > > >
> > > > > > > > > Here are two of these computing processes (symphonie,
> > > > > > > > > bio_mars.exe) and their occupied RAM... 350 MB:
> > > > > > > > >
> > > > > > > > > Tasks: 130 total, 3 running, 127 sleeping, 0 stopped, 0 zombie
> > > > > > > > >
> > > > > > > > >    PID NODE USER  PR NI VIRT RES  SHR S %CPU    TIME+ COMMAND
> > > > > > > > > 198028    3 gatti 25  0 328m 321m 13m R 99.4  1302:32 symphonie
> > > > > > > > > 264695    1 faure 25  0 353m 251m 23m R 98.7  1512:52 bio_mars.exe
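Mulyadi's pmap suggestion, applied concretely (a sketch; pid 198028 is
the symphonie instance from the listing above, and pmap ships in
Debian's procps package):

    # per-mapping breakdown (binary, shared libraries, heap, stack);
    # the last line is the total virtual size
    pmap 198028 | tail -1

    # or the same totals straight from /proc, with no extra tools;
    # /proc/<pid>/statm reports sizes in 4 KB pages
    awk '{ printf "VmSize: %.0f MB  resident: %.0f MB\n",
           $1 * 4096 / 1048576, $2 * 4096 / 1048576 }' /proc/198028/statm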
> > > > > > > > > > > (node 1 is a recent machine, a Dell Precision at 2.8 GHz with 1 GB RAM)
> > > > > > > > > > > (nodes 2 3 4 are old Dell PIII at 1 GHz with 512 MB RAM)
> > > > > > > > > >
> > > > > > > > > > On node 1 itself, when you run the process alone
> > > > > > > > > > (well, along with the necessary daemons and kernel
> > > > > > > > > > threads, of course), how much RAM do you use? If
> > > > > > > > > > the application swaps, how much swap space does it
> > > > > > > > > > use?
> > > > > > > > >
> > > > > > > > > Here is the "free" command output on node 1:
> > > > > > > > >
> > > > > > > > > root@comclust5:~# free
> > > > > > > > >              total    used    free  shared buffers  cached
> > > > > > > > > Mem:       1025816 1012828   12988       0    5660  400496
> > > > > > > > > -/+ buffers/cache:  606672  419144
> > > > > > > > > Swap:      4096564  184556 3912008
> > > > > > > > >
> > > > > > > > > I created a swap space on each node, but swap is not
> > > > > > > > > used much on any of them:
> > > > > > > > >
> > > > > > > > > root@comclust5:~# onnode 2 free
> > > > > > > > >              total    used    free  shared buffers  cached
> > > > > > > > > Mem:        509788  296508  213280       0     160   47588
> > > > > > > > > -/+ buffers/cache:  248760  261028
> > > > > > > > > Swap:       522104    1168  520936
> > > > > > > > >
> > > > > > > > > root@comclust5:~# onnode 3 free
> > > > > > > > >              total    used    free  shared buffers  cached
> > > > > > > > > Mem:        509788  504120    5668       0     160  108696
> > > > > > > > > -/+ buffers/cache:  395264  114524
> > > > > > > > > Swap:      1469908   19204 1450704
> > > > > > > > >
> > > > > > > > > root@comclust5:~# onnode 4 free
> > > > > > > > >              total    used    free  shared buffers  cached
> > > > > > > > > Mem:       1026356  434208  592148       0     160  231624
> > > > > > > > > -/+ buffers/cache:  202424  823932
> > > > > > > > > Swap:      1469908       0 1469908
> > > > > > > > >
> > > > > > > > > > > note that there's no problem with a little
> > > > > > > > > > > benchmark: a big loop with awk, or plenty of
> > > > > > > > > > > mp32ogg processes
> > > > > > > > > >
> > > > > > > > > > OK, here comes my prediction. Page migration is
> > > > > > > > > > still on the way, but since you said it is "big",
> > > > > > > > > > the pages are still "in flight". Note that when
> > > > > > > > > > pages arrive, they still need to be allocated first
> > > > > > > > > > (possibly in blocking style, as alloc_pages()
> > > > > > > > > > usually does).
> > > > > > > > > >
> > > > > > > > > > Maurice, maybe you can compare it to your
> > > > > > > > > > experience with oM? Now you will get a clearer
> > > > > > > > > > picture of the difference between these two (oM and
> > > > > > > > > > openSSI). Since openSSI implements full process
> > > > > > > > > > image migration, be prepared to watch a longer
> > > > > > > > > > interval during process migration. The term
> > > > > > > > > > "longer" here is relative; it could be a bit, or
> > > > > > > > > > waaayyyy longer.
> > > > > > > > >
> > > > > > > > > Yes, I noticed that in normal conditions the process
> > > > > > > > > migration time was longer than in oM... but in my
> > > > > > > > > case one cannot say it is long or longer: it simply
> > > > > > > > > freezes all subsequent commands (no more top, ps, w
> > > > > > > > > for 10, 20, 30 minutes)... it is not "in flight" ;-)
> > > > > > > > >
> > > > > > > > > I can now reproduce my problem, but I still don't
> > > > > > > > > know how to solve it:
> > > > > > > > >
> > > > > > > > > i) When my big processes migrate from nodes 2, 3, 4
> > > > > > > > > to node 1 (the init node), there is no problem.
> > > > > > > > > ii) When processes migrate among nodes 2, 3, 4, there
> > > > > > > > > is no problem.
> > > > > > > > >
> > > > > > > > > But,
> > > > > > > > > iii) when one of these processes migrates from node 1
> > > > > > > > > towards nodes 2, 3, 4, the problem occurs: the
> > > > > > > > > process seems to leave node 1 (I see an entry in
> > > > > > > > > /proc/cluster/loadlog) but never reaches the
> > > > > > > > > destination node (no entry in the destination node's
> > > > > > > > > /proc/cluster/loadlog). I must reboot the destination
> > > > > > > > > node to get back to nominal conditions (the process
> > > > > > > > > comes back or stays on node 1).
> > > > > > > > >
> > > > > > > > > Maybe a network problem?
> > > > > > > > > But why? My 5 NIC cards are new (3Com 3C2000 gigabit,
> > > > > > > > > with sk98lin drivers on Debian), as are my Netgear
> > > > > > > > > 8-port gigabit switches.
> > > > > > > > >
> > > > > > > > > Any ideas?
> > > > > > > > >
> > > > > > > > > ML
> > > > > > > > >
> > > > > > > > > > My suggestion for the openSSI developers is to
> > > > > > > > > > implement something like differential page
> > > > > > > > > > migration based on remote demand paging: a page is
> > > > > > > > > > migrated on demand, and only those pages which were
> > > > > > > > > > recently dirtied. I got lost when tracing the
> > > > > > > > > > internal openSSI code handling this stuff, so any
> > > > > > > > > > hints are welcome here.
> > > > > > > > > >
> > > > > > > > > > regards
> > > > > > > > > >
> > > > > > > > > > Mulyadi
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Maurice Libes
> > > > > > > > > Tel : +33 (04) 91 82 93 25   Centre d'Oceanologie de Marseille
> > > > > > > > > Fax : +33 (04) 91 82 65 48   UMS2196 CNRS - Campus de Luminy, Case 901
> > > > > > > > > mailto:mau...@co...          F-13288 Marseille cedex 9
> > > > > > > > > Annuaire : http://annuaire.univ-aix.fr/showuser.php?uid=libes
> > > > > > >
> > > > > > > --
> > > > > > > Maurice Libes
> > > > > > > Tel : +33 (04) 91 82 93 25   Centre d'Oceanologie de Marseille
> > > > > > > Fax : +33 (04) 91 82 65 48   UMS2196 CNRS - Campus de Luminy, Case 901
> > > > > > > mailto:mau...@co...          F-13288 Marseille cedex 9
> > > > > > > Annuaire : http://annuaire.univ-aix.fr/showuser.php?uid=libes
> > > > > > > ttcp between node 2 (comclust2) and node 1 (comclust5), the init node
> > > > > > > UDP protocol
> > > > > > >
> > > > > > > root@comclust2:~# nttcp -T -n 819200 -u comclust5
> > > > > > >        Bytes Real s CPU s Real-MBit/s CPU-MBit/s  Calls Real-C/s  CPU-C/s
> > > > > > > l-939524096   37.72 18.43    711.7045  1456.5136 819203 21719.58  44449.4
> > > > > > > 1-1158230016  37.72  6.50    665.2546  3860.5997 765806 20301.99 117816.3
> > > > > > >
> > > > > > > root@comclust2:~# nttcp -T -n 819200 -u -r comclust5
> > > > > > >        Bytes Real s CPU s Real-MBit/s CPU-MBit/s  Calls Real-C/s  CPU-C/s
> > > > > > > l-941424640   56.87 23.64    471.7538  1134.8706 818737 14396.80  34633.5
> > > > > > > 1-939524096   56.87 25.45    472.0197  1054.7562 819203 14404.95  32188.7
> > > > > > >
> > > > > > > ttcp analysis between node 2 and node 1 (init node)
> > > > > > > **TCP protocol**
> > > > > > >
> > > > > > > root@comclust2:~# nttcp -T -n 819200 comclust5
> > > > > > >        Bytes Real s CPU s Real-MBit/s CPU-MBit/s   Calls Real-C/s CPU-C/s
> > > > > > > l-939524096   42.73 18.72    628.1504  1433.9501  819200 19169.63 43760.7
> > > > > > > 1-939524096   42.74 37.26    628.1237   720.4387 1540193 36039.64 41336.4
> > > > > > >
> > > > > > > root@comclust2:~# nttcp -T -n 819200 -r comclust5
> > > > > > >        Bytes Real s CPU s Real-MBit/s CPU-MBit/s   Calls Real-C/s CPU-C/s
> > > > > > > l-939524096   47.05 43.27    570.5730   620.3731 1099498 23370.38 25410.2
> > > > > > > 1-939524096   47.05  9.05    570.5689  2966.1376  819200 17412.38 90519.3
> > > > > > >
> > > > > > > ============================================================
> > > > > > > nttcp server on comclust4
> > > > > > > test from init node comclust5
> > > > > > >
> > > > > > > root@comclust5:~# nttcp -T -n 819200 comclust4
> > > > > > >        Bytes Real s CPU s Real-MBit/s CPU-MBit/s   Calls Real-C/s CPU-C/s
> > > > > > > l-939524096   47.55 10.45    564.4774  2568.7603  819200 17226.48 78392.3
> > > > > > > 1-939524096   47.69 42.68    562.8214   628.9491 1158740 24294.99 27149.5
> > > > > > >
> > > > > > > root@comclust5:~# nttcp -T -n 819200 -r comclust4
> > > > > > >        Bytes Real s CPU s Real-MBit/s CPU-MBit/s   Calls Real-C/s CPU-C/s
> > > > > > > l-939524096   43.17 36.08    621.7619   744.0007 1558115 36089.74 43185.0
> > > > > > > 1-939524096   43.17 19.99    621.7936  1342.8487  819200 18975.63 40980.5
> > > > > > >
> > > > > > > in UDP
> > > > > > >
> > > > > > > root@comclust5:~# nttcp -T -n 819200 -u comclust4
> > > > > > >        Bytes Real s CPU s Real-MBit/s CPU-MBit/s  Calls Real-C/s  CPU-C/s
> > > > > > > l-939524096   45.14 22.94    594.7364  1170.1633 819203 18149.98  35710.7
> > > > > > > 1-939913216   45.13 29.90    594.6789   897.6733 819106 18148.18  27394.8
> > > > > > >
> > > > > > > root@comclust5:~# nttcp -T -n 819200 -u -r comclust4
> > > > > > >        Bytes Real s CPU s Real-MBit/s CPU-MBit/s  Calls Real-C/s  CPU-C/s
> > > > > > > l-1242050560  37.72  6.82    647.4045  3581.1340 745342 19757.24 109287.7
> > > > > > > 1-939524096   37.72 17.88    711.6422  1501.3169 819203 21717.67  45816.7
>
> --
> Maurice Libes
> Tel : +33 (04) 91 82 93 25   Centre d'Oceanologie de Marseille
> Fax : +33 (04) 91 82 65 48   UMS2196 CNRS - Campus de Luminy, Case 901
> mailto:mau...@co...          F-13288 Marseille cedex 9
> Annuaire : http://annuaire.univ-aix.fr/showuser.php?uid=libes