Message archive (messages per month):

| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 2001 |     |     |     |     |     |     |     |     |     | 25  |     | 22  |
| 2002 | 13  | 22  | 39  | 10  | 26  | 23  | 38  | 20  | 27  | 76  | 32  | 11  |
| 2003 | 8   | 23  | 12  | 39  | 1   | 48  | 35  | 15  | 60  | 27  | 9   | 32  |
| 2004 | 8   | 16  | 40  | 25  | 12  | 33  | 49  | 39  | 26  | 47  | 26  | 36  |
| 2005 | 29  | 15  | 22  | 1   | 8   | 32  | 11  | 17  | 9   | 7   | 15  |     |
From: Daniel G. <dg...@cp...> - 2005-11-21 22:22:49
Hi All,

Is anybody using Unified Parallel C on BProc? Has anybody implemented GASNet on BProc?

Regards,
Daniel

--
Dr. Daniel Gruner                 dg...@ch...
Dept. of Chemistry                dan...@ut...
University of Toronto             phone: (416)-978-8689
80 St. George Street              fax: (416)-978-5325
Toronto, ON M5S 3H6, Canada       finger for PGP public key

From: Maurice H. <ma...@ha...> - 2005-11-19 22:41:58
Florent Calvayrac wrote:
> This is probably a well known issue, but since we installed Clustermatic
> (one version behind the latest) the latency of our Myrinet network has
> gone way up, from 3-8 microseconds to about 100 for any type of MPI call.
> I am sure I did not install the TCP/IP version (since at most two jobs
> per card seem to run at the same time), and I suspect the problem comes
> from dynamic routing.
>
> I tried to download and compile the router given on the Clustermatic home
> page but failed because of a missing "gm_undocumented_calls.h" and other
> errors.

I suggest you try running the "MX" drivers rather than the older GM-based MPICH (gmpich).

--
With our best regards,
Maurice W. Hilarius
Hard Data Ltd

From: Ron S. <rse...@ha...> - 2005-11-19 18:45:13
>> nodeup : No premove function for nodeinfo
>>          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> I believe I've seen this before when the stage2 image and front-end
> had bproc mismatches, i.e. I had {re-}built bproc on the front-end
> w/o generating a new stage2 image. Could that be possible in your case?

I built everything 'stock' from the Clustermatic RPMs and even double-checked by rebuilding the stage 2 image after making any config changes, and putting the new stage 2 onto the PXE server. I'm building the stage 2 with the command:

[root@novac root]# beoboot -2 -i -v -o /tmp/phase2

On 11/17/05, Joshua Aune <lu...@ln...> wrote:
> On Wed, 2005-11-16 at 23:05 -0600, Ron Senykoff wrote:
>
> Do you have anything coming out of the serial console (or vga console if
> not configured with serial) on the node?

Yes... any node I try (I've tried several to make sure it wasn't bad hardware) comes up with the following (.219 is the server, .220 starts the node range):

boot: Server IP address: 192.168.46.219
boot: My IP address    : 192.168.26.222
boot: starting bpslave: bpslave -d -i -v 192.168.46.219 2223
bpslave: connecting to 192.168.46.219:2223
bpslave: IO daemon started; pid=17
bpslave: connection to 192.168.46.219:2223 up and running
bpslave: Setting node number to 2

I have now moved my install back to Clustermatic 4 on Red Hat 9 (hoping that i386 builds may remedy the situation) and I get the exact same message on the node, but I get 'Signal 4' instead of 'Signal 11' in /var/log/clustermatic/node.X:

[root@novac root]# tail /var/log/clustermatic/node.2
vmadlib : loaded /lib/libnss_bproc.so.2 (size=25043;id=0,0;mode=100755)
vmadlib : loaded /usr/lib/libbproc.so.4.0.0 (size=21388;id=0,0;mode=100755)
vmadlib : loaded /usr/lib/libstdc++.so.5.0.3 (size=710608;id=0,0;mode=100755)
nodeup : Plugin vmadlib returned status 0 (ok)
nodeup : No premove function for nodeinfo
nodeup : Starting 1 child processes.
nodeup : Finished creating child processes.
nodeup : I/O error talking to child
nodeup : Child process for node 2 died with signal 4
nodeup : Node setup returned status 1

This was another complete from-scratch rebuild: put Red Hat 9 minimal + dev packages on the laptop, installed Clustermatic, rebooted into the new kernel, built phase 2, put that on the PXE server...

Any help is much appreciated. TIA!

-Ron

From: Joshua A. <lu...@ln...> - 2005-11-18 17:22:41
On Wed, 2005-11-16 at 23:05 -0600, Ron Senykoff wrote:

Do you have anything coming out of the serial console (or vga console if not configured with serial) on the node?

> Could it be the difference between the laptop and the nodes? They are
> both i586, correct? Any help is greatly appreciated.
>
> [root@novac root]# tail /var/log/clustermatic/node.0
> vmadlib : loaded /lib/libnss_bproc.so.2 (size=44922;id=0,0;mode=100755)
> vmadlib : loaded /usr/lib/libbproc.so.4.0.0 (size=40697;id=0,0;mode=100755)
> vmadlib : loaded /usr/lib/libstdc++.so.5.0.5 (size=732372;id=0,0;mode=100755)
> nodeup : Plugin vmadlib returned status 0 (ok)
> nodeup : No premove function for nodeinfo
> nodeup : Starting 1 child processes.
> nodeup : Finished creating child processes.
> nodeup : I/O error talking to child
> nodeup : Child process for node 0 died with signal 11
> nodeup : Node setup returned status 1
>
> [root@novac root]# cat /etc/clustermatic/config
> interface eth0
>
> #master 192.168.46.219
> master novac
>
> iprange 0 192.168.46.20 192.168.46.35                      # Nodes 0-8 have addresses from this range.
>
> bootfile /var/clustermatic/boot.img
>
> librariesfrombinary /bin/sleep /bin/ps /bin/ping /bin/ls   # get libc, resolver
> libraries /usr/lib/libstdc++* /usr/lib64/libstdc++*        # C++ support
> libraries /usr/lib/libbproc.so* /usr/lib64/libbproc.so*    # BProc, of course.
> libraries /lib/libnss_bproc* /lib64/libnss_bproc*          # BProc resolver
>
> node 0 00:60:EF:21:61:AF
> node 00:60:EF:21:63:16

From: Florent C. <Flo...@un...> - 2005-11-18 15:43:21
Hi to all,

This is probably a well known issue, but since we installed Clustermatic (one version behind the latest) the latency of our Myrinet network has gone way up, from 3-8 microseconds to about 100 for any type of MPI call. I am sure I did not install the TCP/IP version (since at most two jobs per card seem to run at the same time), and I suspect the problem comes from dynamic routing.

I tried to download and compile the router given on the Clustermatic home page but failed because of a missing "gm_undocumented_calls.h" and other errors.

Is anyone aware of this problem? Is it worth upgrading to the latest version of Clustermatic, given that this would probably involve several days of work? (We have some local specificities, and all our programs run fine except for one, which is latency-bound.)

Thanks in advance

--
Florent Calvayrac | Tel : 06 64 31 43 86 | http://www.jackywulf.com
Directeur du SC Informatique Ressources Num. de l'Universite du Maine
Lab. de Physique de l'Etat Condense UMR-CNRS 6087
Inst. de Rech. en Ingenierie Molec. et Matx Fonctionnels FR CNRS 2575

From: Daryl W. G. <dw...@la...> - 2005-11-18 15:12:30
> Date: Wed, 16 Nov 2005 23:05:16 -0600
> From: Ron Senykoff <rse...@gm...>
> To: bpr...@li...
> Subject: [BProc] clustermatic: node dies with signal 11
>
> I am working on building a cluster of thin-clients for demo purposes
> (32 nodes). I'm truly stuck, and these 8-week courses are killing us
> grad students trying to get solid projects completed...
>
> Using PXE the nodes load stage2 of clustermatic fine. However, the
> nodes can't connect correctly to the server (a laptop).
>
> A bit of info on the hardware:
> laptop = PII w/ 64MB of RAM (I reduced it to be identical to the clients)
> nodes  = AMD K6-2 w/ 64 MB of RAM + DiskOnChip (we want to put the
>          successful build onto that, so no network booting)
>
> Could it be the difference between the laptop and the nodes? They are
> both i586, correct? Any help is greatly appreciated.
>
> [root@novac root]# tail /var/log/clustermatic/node.0
> vmadlib : loaded /lib/libnss_bproc.so.2 (size=44922;id=0,0;mode=100755)
> vmadlib : loaded /usr/lib/libbproc.so.4.0.0 (size=40697;id=0,0;mode=100755)
> vmadlib : loaded /usr/lib/libstdc++.so.5.0.5 (size=732372;id=0,0;mode=100755)
> nodeup : Plugin vmadlib returned status 0 (ok)
> nodeup : No premove function for nodeinfo
>          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I believe I've seen this before when the stage2 image and front-end had bproc mismatches, i.e. I had {re-}built bproc on the front-end w/o generating a new stage2 image. Could that be possible in your case?

Daryl

> nodeup : Starting 1 child processes.
> nodeup : Finished creating child processes.
> nodeup : I/O error talking to child
> nodeup : Child process for node 0 died with signal 11
> nodeup : Node setup returned status 1
>
> [root@novac root]# cat /etc/clustermatic/config
> interface eth0
>
> #master 192.168.46.219
> master novac
>
> iprange 0 192.168.46.20 192.168.46.35                      # Nodes 0-8 have addresses from this range.
>
> bootfile /var/clustermatic/boot.img
>
> librariesfrombinary /bin/sleep /bin/ps /bin/ping /bin/ls   # get libc, resolver
> libraries /usr/lib/libstdc++* /usr/lib64/libstdc++*        # C++ support
> libraries /usr/lib/libbproc.so* /usr/lib64/libbproc.so*    # BProc, of course.
> libraries /lib/libnss_bproc* /lib64/libnss_bproc*          # BProc resolver
>
> node 0 00:60:EF:21:61:AF
> node 00:60:EF:21:63:16

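A minimal sketch of that rebuild-and-redeploy step, reusing the beoboot invocation Ron quotes elsewhere in this thread; the image path, the PXE host name, and the tftp directory are placeholders, not details from the original posts:

    # Regenerate the stage-2 image after (re)building bproc on the front end
    # (same beoboot invocation Ron uses elsewhere in this thread).
    beoboot -2 -i -v -o /tmp/phase2

    # Copy the fresh image(s) to wherever the PXE/tftp server expects them.
    # Host name and directory below are assumptions about the local setup.
    scp /tmp/phase2* pxeserver:/tftpboot/
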
From: Ron S. <rse...@gm...> - 2005-11-17 05:05:21
I am working on building a cluster of thin-clients for demo purposes (32 nodes). I'm truly stuck, and these 8-week courses are killing us grad students trying to get solid projects completed...

Using PXE the nodes load stage2 of clustermatic fine. However, the nodes can't connect correctly to the server (a laptop).

A bit of info on the hardware:
laptop = PII w/ 64MB of RAM (I reduced it to be identical to the clients)
nodes  = AMD K6-2 w/ 64 MB of RAM + DiskOnChip (we want to put the successful build onto that, so no network booting)

Could it be the difference between the laptop and the nodes? They are both i586, correct? Any help is greatly appreciated.

[root@novac root]# tail /var/log/clustermatic/node.0
vmadlib : loaded /lib/libnss_bproc.so.2 (size=44922;id=0,0;mode=100755)
vmadlib : loaded /usr/lib/libbproc.so.4.0.0 (size=40697;id=0,0;mode=100755)
vmadlib : loaded /usr/lib/libstdc++.so.5.0.5 (size=732372;id=0,0;mode=100755)
nodeup : Plugin vmadlib returned status 0 (ok)
nodeup : No premove function for nodeinfo
nodeup : Starting 1 child processes.
nodeup : Finished creating child processes.
nodeup : I/O error talking to child
nodeup : Child process for node 0 died with signal 11
nodeup : Node setup returned status 1

[root@novac root]# cat /etc/clustermatic/config
interface eth0

#master 192.168.46.219
master novac

iprange 0 192.168.46.20 192.168.46.35                      # Nodes 0-8 have addresses from this range.

bootfile /var/clustermatic/boot.img

librariesfrombinary /bin/sleep /bin/ps /bin/ping /bin/ls   # get libc, resolver
libraries /usr/lib/libstdc++* /usr/lib64/libstdc++*        # C++ support
libraries /usr/lib/libbproc.so* /usr/lib64/libbproc.so*    # BProc, of course.
libraries /lib/libnss_bproc* /lib64/libnss_bproc*          # BProc resolver

node 0 00:60:EF:21:61:AF
node 00:60:EF:21:63:16

From: Joshua A. <lu...@ln...> - 2005-11-14 17:37:53
I know that LSF has been integrated to work with BProc as well, but I have just used bjs.

On Mon, 2005-11-14 at 11:51 -0500, Daniel Gruner wrote:
> Hi Patrice,
>
> I use bjs, which is the batch scheduler that comes with cm5. I use
> version 1.6. It is pretty simple minded, but it works quite well.
> The only issue with it - and it may not be an issue for you at all -
> is that the batch system gives you nodes, rather than processors.
> In other words, if your nodes are dual-cpu you need to run 2 processes
> on the node in order to take advantage of the machine. We do this for
> serial jobs by submitting 2 processes in the script that the user
> submits to bjs. For mpi codes this is not a problem at all - you
> simply tell mpirun to pack the processes on the nodes (you need to use
> the mpirun that is supplied with clustermatic, since it is the only
> one that is bproc-aware).
>
> It would be really nice if someone went through the trouble of porting
> SGE or OpenPBS or something else to bproc, but in the meantime bjs
> does the job.
>
> Regards,
> Daniel
>
> On Mon, Nov 14, 2005 at 11:18:59AM +0100, patrice descourt wrote:
> > Hello
> >
> > I use Clustermatic 5 right now on a 16 diskless nodes cluster and look
> > for a good Queueing and Batch job submission system to run with
> > Clustermatic 5
> >
> > Does anyone know one ?
> >
> > Thx a lot
> >
> > P.Descourt
> >
> > U650 INSERM,
> > Laboratoire de Traitement d'Information Medicale (LaTIM),
> > Equipe 'Quantification en Tomographie d'Emission',
> > CHU Morvan,
> > 5 avenue Foch,
> > 29609 Brest, France

From: Daniel G. <dg...@cp...> - 2005-11-14 16:51:45
Hi Patrice,

I use bjs, which is the batch scheduler that comes with cm5. I use version 1.6. It is pretty simple minded, but it works quite well. The only issue with it - and it may not be an issue for you at all - is that the batch system gives you nodes, rather than processors. In other words, if your nodes are dual-cpu you need to run 2 processes on the node in order to take advantage of the machine. We do this for serial jobs by submitting 2 processes in the script that the user submits to bjs. For mpi codes this is not a problem at all - you simply tell mpirun to pack the processes on the nodes (you need to use the mpirun that is supplied with clustermatic, since it is the only one that is bproc-aware).

It would be really nice if someone went through the trouble of porting SGE or OpenPBS or something else to bproc, but in the meantime bjs does the job.

Regards,
Daniel

On Mon, Nov 14, 2005 at 11:18:59AM +0100, patrice descourt wrote:
> Hello
>
> I use Clustermatic 5 right now on a 16 diskless nodes cluster and look
> for a good Queueing and Batch job submission system to run with
> Clustermatic 5
>
> Does anyone know one ?
>
> Thx a lot
>
> P.Descourt
>
> U650 INSERM,
> Laboratoire de Traitement d'Information Medicale (LaTIM),
> Equipe 'Quantification en Tomographie d'Emission',
> CHU Morvan,
> 5 avenue Foch,
> 29609 Brest, France

--
Dr. Daniel Gruner                 dg...@ch...
Dept. of Chemistry                dan...@ut...
University of Toronto             phone: (416)-978-8689
80 St. George Street              fax: (416)-978-5325
Toronto, ON M5S 3H6, Canada       finger for PGP public key

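As a concrete illustration of the two-processes-per-node approach Daniel describes for serial jobs, here is a minimal sketch of a script a user might hand to bjs on dual-CPU nodes. The program name is a placeholder, and the NODES environment variable and the use of bpsh to place the processes are assumptions about a typical Clustermatic/bjs setup, not something stated in the original message:

    #!/bin/sh
    # Hypothetical bjs job script: run two copies of a serial program on the
    # single allocated node so both CPUs are used, then wait for both.
    # NODES (comma-separated list of allocated node numbers) is assumed to be
    # provided by the bjs/BProc environment.
    NODE=$(echo "$NODES" | cut -d, -f1)
    bpsh "$NODE" ./serial_job input1 &
    bpsh "$NODE" ./serial_job input2 &
    wait

The same idea extends to any CPU count: since bjs hands out whole nodes, the job script is responsible for filling the CPUs it was given.
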
From: patrice d. <pat...@un...> - 2005-11-14 10:24:14
Hello

I use Clustermatic 5 right now on a 16 diskless nodes cluster and look for a good Queueing and Batch job submission system to run with Clustermatic 5.

Does anyone know one?

Thx a lot

P.Descourt

U650 INSERM,
Laboratoire de Traitement d'Information Medicale (LaTIM),
Equipe 'Quantification en Tomographie d'Emission',
CHU Morvan,
5 avenue Foch,
29609 Brest, France

From: Joshua A. <lu...@ln...> - 2005-11-11 22:53:16
On Fri, 2005-11-11 at 16:24 -0500, Jeff Palmucci wrote:
> I'm having the exact same problem as mentioned below (previously on this
> list). Was there any resolution?

I have some patches for BProc4-pre and linux-2.6.14... Maybe that will help. I am hoping to get these put on the sourceforge page eventually. Anyone that is interested, contact me off list for a pre-release.

> -----------------------------------------
>
> From: Motu <motudai@gm...>
> Re: Clustermatic's kernel-2.6.9-cm46: NMI lockup on CPU1, CPU3
> 2005-08-09 16:56
>
> Hello,
> I am working with a machine with 2 250 GB SATA hard drives, 2 dual-core
> Opteron 2.2 processors, 2 Broadcom NetXtreme 5704 gigabit ethernet
> network cards, and a Tyan S2881 motherboard.
> When I try to boot with kernel version 2.6.9-cm46 (the one provided with
> Clustermatic 5)--as opposed to the 2.6.11-FC4smp that came with the
> Fedora Core 4 system--I get NMI lockup messages on both CPUs.
> The following is what the tail end of the message looks like:
>
> > Code: 80 3f 00 7e f9 e9 8b fd ff ff e8 1e bf ec ff e9 a1 fd ff ff
>
> Console shuts up...
>
> > NMI Watchdog detected LOCKUP on CPU1, registers:
> > CPU1
> > Modules Linked in:
> > Pid:0, comm:swapper Not tainted 2.6.9-cm46
> > RIP:0010:[] NMI Watchdog detected LOCKUP on CPU3, registers:
> > CPU3
> > Modules Linked in:
> > Pid:0, comm:swapper Not tainted 2.6.9-cm46
> > RIP:0010:[]{scheduler_tick+385}
>
> Does anyone know the cause of this problem? Any way to deduce it?
> The web seems to be useless: everyone with an NMI lockup fixed the
> problem with a newer kernel, but in my case I need to use Clustermatic's
> kernel. Is there any way out?
>
> Prabhas

From: Jeff P. <jpa...@ma...> - 2005-11-11 21:24:59
I'm having the exact same problem as mentioned below (previously on this list). Was there any resolution?

-----------------------------------------

From: Motu <motudai@gm...>
Re: Clustermatic's kernel-2.6.9-cm46: NMI lockup on CPU1, CPU3
2005-08-09 16:56

Hello,

I am working with a machine with 2 250 GB SATA hard drives, 2 dual-core Opteron 2.2 processors, 2 Broadcom NetXtreme 5704 gigabit ethernet network cards, and a Tyan S2881 motherboard. When I try to boot with kernel version 2.6.9-cm46 (the one provided with Clustermatic 5)--as opposed to the 2.6.11-FC4smp that came with the Fedora Core 4 system--I get NMI lockup messages on both CPUs. The following is what the tail end of the message looks like:

> Code: 80 3f 00 7e f9 e9 8b fd ff ff e8 1e bf ec ff e9 a1 fd ff ff

Console shuts up...

> NMI Watchdog detected LOCKUP on CPU1, registers:
> CPU1
> Modules Linked in:
> Pid:0, comm:swapper Not tainted 2.6.9-cm46
> RIP:0010:[] NMI Watchdog detected LOCKUP on CPU3, registers:
> CPU3
> Modules Linked in:
> Pid:0, comm:swapper Not tainted 2.6.9-cm46
> RIP:0010:[]{scheduler_tick+385}

Does anyone know the cause of this problem? Any way to deduce it? The web seems to be useless: everyone with an NMI lockup fixed the problem with a newer kernel, but in my case I need to use Clustermatic's kernel. Is there any way out?

Prabhas

From: patrice d. <pat...@un...> - 2005-11-10 14:25:12
Hello

I set up an AMD64 cluster with 27 AMD64 diskless nodes with Clustermatic 5 and intend to use Clubmask 06b2 as a scheduler with Maui and Ganglia, but when I try to compile the distribution via "python setup.py install" I get a lot of error messages in the compile phase:

gcc -DNDEBUG -g -O3 -Wall -Wstrict-prototypes -fPIC -I/usr/local/include/python2.2 -c src/bproc/bprocmodule.c -o build/temp.linux-i686-2.2/bprocmodule.o -ggdb
src/bproc/bprocmodule.c: In function `bp_masteraddr':
src/bproc/bprocmodule.c:330: warning: implicit declaration of function `bproc_masteraddr'
src/bproc/bprocmodule.c: In function `bp_nodeinfo':
src/bproc/bprocmodule.c:462: error: incompatible type for argument 1 of `makeip'
src/bproc/bprocmodule.c: In function `bp_nodelist':
src/bproc/bprocmodule.c:479: warning: passing argument 1 of `bproc_nodelist' from incompatible pointer type
src/bproc/bprocmodule.c:492: error: incompatible type for argument 1 of `makeip'
src/bproc/bprocmodule.c: In function `bp_nodenumber':
src/bproc/bprocmodule.c:528: warning: implicit declaration of function `bproc_nodenumber'
src/bproc/bprocmodule.c: In function `bp_nodesetstatus':
src/bproc/bprocmodule.c:601: error: `BPROC_NODE_NSTATES' undeclared (first use in this function)
src/bproc/bprocmodule.c:601: error: (Each undeclared identifier is reported only once
src/bproc/bprocmodule.c:601: error: for each function it appears in.)
src/bproc/bprocmodule.c:606: warning: passing argument 2 of `bproc_nodesetstatus' makes pointer from integer without a cast
src/bproc/bprocmodule.c: In function `bp_nodestatus':
src/bproc/bprocmodule.c:629: warning: implicit declaration of function `bproc_nodeup'
src/bproc/bprocmodule.c: In function `initbproc':
src/bproc/bprocmodule.c:869: error: `bproc_node_down' undeclared (first use in this function)
src/bproc/bprocmodule.c:870: error: `bproc_node_unavailable' undeclared (first use in this function)
src/bproc/bprocmodule.c:871: error: `bproc_node_error' undeclared (first use in this function)
src/bproc/bprocmodule.c:872: error: `bproc_node_up' undeclared (first use in this function)
src/bproc/bprocmodule.c:873: error: `bproc_node_reboot' undeclared (first use in this function)
src/bproc/bprocmodule.c:874: error: `bproc_node_pwroff' undeclared (first use in this function)
error: command 'gcc' failed with exit status 1

Is it because I do not use Clustermatic 4, or because my Python version (2.2.3) is not the right one needed to compile Clubmask?

Thanks for your help

P.Descourt

U650 INSERM,
Laboratoire de Traitement d'Information Medicale (LATIM)
Equipe 'Quantification en Tomographie d'Emission'
CHU Morvan, Bâtiment 2Bis (I3S),
5 avenue Foch,
29609 Brest, France

From: Dale H. <ro...@ma...> - 2005-11-01 23:12:03
On 2005-11-01 at 16:01, Joshua Aune <lu...@ln...> elucidated:
> Is anyone using the uniprocessor kernels that come with the clustermatic
> distribution?

I wasn't when I was using bproc. I rolled my own kernels for an SMP box.

--
Dale Harris  ro...@ma...  /.-)

From: Joshua A. <lu...@ln...> - 2005-11-01 22:59:23
Is anyone using the uniprocessor kernels that come with the Clustermatic distribution?

Just curious.

Josh

From: Joshua A. <lu...@ln...> - 2005-10-21 16:04:32
On Fri, 2005-10-21 at 03:20 -0400, Shobana Ravi wrote:
> I am working on the Nimbus 4 system based on Clustermatic on a 15 node
> cluster of Opterons. I am having intermittent problems running MPI
> applications on this.
>
> While trying to run the ASCI benchmark sPPM, 4 times out of 5, I get
> errors that look like this:
>
> p3_16410: p4_error: interrupt SIGSEGV: 11
>
> If anybody could help me with this problem, or give me pointers as to what
> to explore/debug, I would greatly appreciate it.

Build your app with debugging symbols, get the task on the node to dump a core file when it crashes, copy the core file back to the host node, and analyze it using gdb.

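A minimal sketch of that workflow, assuming the stock BProc utilities (bpsh/bpcp) and a generic MPICH-style build; the node number, file names, and core-file locations are placeholders, not details from the original thread:

    # Rebuild with debug symbols and without optimization (adjust to your build system).
    mpicc -g -O0 -o sppm.debug sppm.c

    # Sanity-check one rank directly on a compute node with core dumps enabled;
    # the ulimit applies only to this shell and its children, and the binary
    # and working directory must be visible on the node.
    bpsh 3 sh -c 'ulimit -c unlimited && ./sppm.debug'

    # After the crash, copy the core file back to the front end
    # (the core file's name and location depend on the node's core_pattern).
    bpcp 3:core /tmp/core.node3

    # Inspect the backtrace with the matching debug binary.
    gdb ./sppm.debug /tmp/core.node3
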
From: Shobana R. <sh...@cs...> - 2005-10-21 07:23:49
I am working on the Nimbus 4 system based on Clustermatic on a 15 node cluster of Opterons. I am having intermittent problems running MPI applications on it.

While trying to run the ASCI benchmark sPPM, 4 times out of 5 I get errors that look like this:

p3_16410: p4_error: interrupt SIGSEGV: 11
p3_16410: (23.234375) net_send: could not write to fd=4, errno = 32
p4_16412: (33.445312) net_send: could not write to fd=4, errno = 32
p11_16426: (33.089844) net_send: could not write to fd=4, errno = 32
p2_16408: (33.558594) net_send: could not write to fd=4, errno = 32
p9_16422: (33.207031) net_send: could not write to fd=4, errno = 32
p4_error: latest msg from perror: Broken pipe
p4_16412: p4_error: net_send write: -1
p4_error: latest msg from perror: Broken pipe
p11_16426: p4_error: net_send write: -1
p4_error: latest msg from perror: Broken pipe
p2_16408: p4_error: net_send write: -1
p4_error: latest msg from perror: Broken pipe
p9_16422: p4_error: net_send write: -1
p12_16428: (34.140625) net_send: could not write to fd=4, errno = 32
p4_error: latest msg from perror: Broken pipe
bm_list_16405: (71.308594) net_send: could not write to fd=4, errno = 9
p4_error: latest msg from perror: Bad file descriptor
bm_list_16405: p4_error: net_send write: -1

... and so on. This happens often but not always. Each time, the node on which the seg fault occurs is different. The exact same code works fine on other clusters.

My environment settings are:

BEOWULF_JOB_MAP=0:1:2:3:4:5:6:7:8:9:10:11:12:13:14
NP=15

Unfortunately, I don't have the sources for the MPICH. The installed version is in a directory named mpich-gnu-lila-p4-1.2.5..10. I tried compiling MPICH 1.2.7 for this, but I didn't find the patches to get it working with Nimbus.

If anybody could help me with this problem, or give me pointers as to what to explore/debug, I would greatly appreciate it.

Regards,
Shobana

From: Joshua A. <lu...@ln...> - 2005-10-19 15:06:06
Attached are the somewhat cleaned up kexec-in-bproc patches.

On Tue, 2005-10-18 at 12:19 -0600, Joshua Aune wrote:
> I have some working test patches to beoboot that swap out kmonte for
> kexec. So far I have booted one node using it and would love some
> further testing.
>
> Benefits of kexec:
>   more arch support (works with x86_64 on stage1!)
>   is now in the kernel
>   works with SMP kernels
>
> This has allowed me to use a single kernel build for both stage1 and
> stage2 images (yay!).
>
> These patches depend on mkelfImage and a statically built kexec binary.
>
> If you are interested, let me know. I'd post them but they will
> probably be changing over the next little while as I start cleaning
> them up.
>
> Josh

From: Joshua A. <lu...@ln...> - 2005-10-18 18:18:47
I have some working test patches to beoboot that swap out kmonte for kexec. So far I have booted one node using it and would love some further testing.

Benefits of kexec:
  more arch support (works with x86_64 on stage1!)
  is now in the kernel
  works with SMP kernels

This has allowed me to use a single kernel build for both stage1 and stage2 images (yay!).

These patches depend on mkelfImage and a statically built kexec binary.

If you are interested, let me know. I'd post them but they will probably be changing over the next little while as I start cleaning them up.

Josh

From: Michal J. <mi...@ha...> - 2005-10-14 03:34:12
On Thu, Oct 13, 2005 at 05:22:52PM -0600, Joshua Aune wrote:
> The attached patch adds a check and
> error message.

Is the variation "dependancy" <-> "dependency" in the error message there to distinguish between cases? Pretty subtle. :-)

Michal

From: Joshua A. <lu...@ln...> - 2005-10-13 23:22:23
Just had a fun error where the stage 2 image couldn't resolve a module dependancy and was seg faulting. The attached patch adds a check and error message.

Who do I submit patches to?

Thanks,
Josh

From: Julian S. <ju...@va...> - 2005-10-06 02:47:18
I'm running Valgrind on a BProc cluster. Every now and again V looks at /proc/self/maps to make sure it hasn't lost track of the process' address space layout.

Mostly this is fine. However, mappings which started out referring to "/dev/zero" seem to get renamed to "/dev/zero (deleted)", possibly as a result of process migration (not sure). Also, the mapping's device and inode numbers change.

This isn't harmful, but I am curious to know what's going on. It doesn't happen to anonymous mappings, which is a bit strange.

J

From: Andrew P. <ap...@ro...> - 2005-09-27 16:41:36
Greg,

Yes, I am using bjs. I've now added the (-s numberofseconds) argument and that did it. Thank you very much for your help!

- Andrew

On Sep 27, 2005, at 11:50 AM, Greg Watson wrote:
> Are you using bjs to allocate the nodes? The default allocation
> time is 1 second I think.
>
> Greg
>
> On Sep 27, 2005, at 8:34 AM, Andrew Pitre wrote:
>
>> I'm having trouble getting mpi programs to execute for >= 1 sec.
>> I have a simple program that loops, prints the execution time then
>> quits.
>>
>> When the loop count is increased to where the execution time is
>> greater than or about 1 sec, the program fails with the following
>> messages:
>> "mpirun: error: child process (rank=0; node=0) exited abnormally.
>> mpirun: error: aborting."
>>
>> Replacing the loop with a sleep() statement has a similar effect,
>> processes can sleep for any amount of time < 1 sec, e.g. sleep(.999999)
>> is ok, but if sleep(1) is called the program fails with the above error.
>>
>> I've tried adjusting the pingtimeout with settings 30, 3000, and
>> 30000, without success.
>>
>> The environment is Clustermatic 5 with a custom compiled 2.6.9
>> kernel and bproc4.0.0pre8 module on Opteron processors. This
>> problem does not appear on a LAM based non-bproc cluster with the
>> same source code.
>>
>> Any help with this will be greatly appreciated.
>>
>> - Andrew

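For anyone hitting the same symptom, a sketch of what the fix can look like on the command line; the bjssub command name and the -n (node count) flag are assumptions about a typical bjs installation, while -s is the seconds argument mentioned above:

    # Hypothetical bjs submission holding the allocated nodes for one hour
    # instead of the very short default allocation.
    bjssub -n 4 -s 3600 mpirun -np 4 ./mpi_test
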
From: Greg W. <gw...@la...> - 2005-09-27 15:50:49
Are you using bjs to allocate the nodes? The default allocation time is 1 second, I think.

Greg

On Sep 27, 2005, at 8:34 AM, Andrew Pitre wrote:
> I'm having trouble getting mpi programs to execute for >= 1 sec. I
> have a simple program that loops, prints the execution time then
> quits.
>
> When the loop count is increased to where the execution time is
> greater than or about 1 sec, the program fails with the following
> messages:
> "mpirun: error: child process (rank=0; node=0) exited abnormally.
> mpirun: error: aborting."
>
> Replacing the loop with a sleep() statement has a similar effect,
> processes can sleep for any amount of time < 1 sec, e.g. sleep(.999999)
> is ok, but if sleep(1) is called the program fails with the above error.
>
> I've tried adjusting the pingtimeout with settings 30, 3000, and
> 30000, without success.
>
> The environment is Clustermatic 5 with a custom compiled 2.6.9
> kernel and bproc4.0.0pre8 module on Opteron processors. This
> problem does not appear on a LAM based non-bproc cluster with the
> same source code.
>
> Any help with this will be greatly appreciated.
>
> - Andrew

From: Andrew P. <ap...@ro...> - 2005-09-27 14:35:15
I'm having trouble getting mpi programs to execute for >= 1 sec. I have a simple program that loops, prints the execution time, then quits.

When the loop count is increased to where the execution time is greater than or about 1 sec, the program fails with the following messages:

"mpirun: error: child process (rank=0; node=0) exited abnormally.
mpirun: error: aborting."

Replacing the loop with a sleep() statement has a similar effect: processes can sleep for any amount of time < 1 sec, e.g. sleep(.999999) is ok, but if sleep(1) is called the program fails with the above error.

I've tried adjusting the pingtimeout with settings 30, 3000, and 30000, without success.

The environment is Clustermatic 5 with a custom compiled 2.6.9 kernel and the bproc4.0.0pre8 module on Opteron processors. This problem does not appear on a LAM based non-bproc cluster with the same source code.

Any help with this will be greatly appreciated.

- Andrew
