From: Erik H. <er...@he...> - 2005-09-23 16:37:20

On 9/22/05, Jeff Rasmussen <jra...@ln...> wrote:
> Does anyone know of a way to run bproc in a mixed 32/64 bit
> architecture?
>
> I need a 64bit host with 64bit kernel to be able to boot both 64bit and
> 32bit nodes. Anyone know if this is possible with the current software
> stack?

It's not... and for good reason. If you look under the covers (VMADump
process layout and message packing details), x86 and x86_64 are almost as
different as x86 and ppc64. The register files are different, memory space
details are different, etc. A 32 bit process dumped on x86_64 is not
directly undumpable on x86 and vice versa.

- Erik

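A quick way to confirm that a master and its slaves really are the same
architecture (the precondition Erik describes) is to compare uname output.
This is a minimal sketch, assuming bpsh is on your PATH, nodes 0 and 1
exist, and uname is present on the slaves' minimal filesystem:

    # Compare the master's architecture with each slave's; any mismatch
    # means VMADump migration between them will not work.
    master_arch=$(uname -m)
    for n in 0 1; do
        node_arch=$(bpsh "$n" uname -m)
        echo "node $n: $node_arch"
        [ "$node_arch" = "$master_arch" ] || echo "node $n differs from master ($master_arch)"
    done
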
From: Jeff R. <jra...@ln...> - 2005-09-22 20:17:35

Does anyone know of a way to run bproc in a mixed 32/64 bit architecture?

I need a 64bit host with 64bit kernel to be able to boot both 64bit and
32bit nodes. Anyone know if this is possible with the current software
stack?

Thanks,
--
Jeff Rasmussen

From: Tomasz P. <bp...@o2...> - 2005-09-20 23:57:57

Hello,

> Daniel Gruner wrote:
> Actually, all you need to do is install the package mpich-1.2.5.2-3, for
> which the .src.rpm is included in the cm5 distribution. Just recompile
> the mpich-1.2.5.2-3.src.rpm on your system, and install it.
>
> The only trick afterwards is that you MUST use the /usr/bin/mpirun that
> is part of the cmtools package, which seems not to be in the cm5
> distribution. cmtools-1.4 is in the sourceforge.net/projects/bproc page,
> and you can get it from there and build it for your system.

I finally rebuilt the mpich-1.2.5.3.src.rpm (delivered with CM5) and
installed it (of course, I removed the standard CM5 version first), but I
still don't have any Fortran compiler in /usr/mpich-p4/bin.

Any suggestions on what I can do to compile a parallel Fortran program?

Thank you in advance for any advice.

--
Tomasz Perlik

V FDK
Faculty of Physics, Astronomy and Informatics
Nicolaus Copernicus University, Toruń, POLAND

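The missing mpif77/mpif90 wrappers usually mean that mpich's configure did
not find a Fortran compiler when the RPM was rebuilt, in which case the
Fortran bindings are quietly skipped. A quick hedged check, assuming the
installed package is simply named "mpich" and rpm/which are available:

    # Does the rebuilt package contain the Fortran wrapper scripts at all?
    rpm -ql mpich | grep -i mpif

    # Was a Fortran compiler present at rebuild time? If not, install one
    # (e.g. g77), rebuild the src.rpm, and the wrappers should appear.
    which g77 gfortran 2>/dev/null
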
From: kumanan c. <ku...@gm...> - 2005-09-14 10:43:52

Can anyone refer me to Beowulf documentation, both basic and in-depth, that
would help me implement it on Linux?

From: Greg W. <gw...@la...> - 2005-09-02 02:25:29

The machines file is not used. Instead, mpirun will allocate nodes that are
available to you for execution (as shown by bpstat) or nodes that you
specify using the NODES environment variable.

Greg

On Sep 1, 2005, at 4:53 PM, Andrew Pitre wrote:
> I've setup bproc on my development cluster (1 master, 1 slave).
> Commands execute successfully on the slave using bpsh.
>
> The clustermatic 5 version of mpich seems to be installed and working,
> however, I don't seem to be able to alter the machines file that mpich
> uses to allocate processes to nodes.
>
> I've tried editing:
>   /usr/mpich-p4/share/machines.LINUX
> and
>   /util/machines/machines.LINUX
> to contain the IP addresses of my two nodes:
>   10.0.1.100
>   10.0.1.101
>
> > mpirun -np 2 mpiTest
> Not enough nodes to allocate all processes
>
> alternately
>
> > mpirun -np 1 mpiTest
> Test successful on process number: 0
> elapsed_time = 8.3e-05
>
> Can anyone tell me how to edit the machines file, where it is located,
> and how to run on the master node. Can mpich do round robin process
> allocation within this machines file?
>
> - Andrew

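For reference, a minimal sketch of what Greg describes, assuming nodes 0
and 1 show as up and available to you in bpstat and that the mpirun on your
PATH is the BProc-aware one from cmtools. The comma-separated node list for
NODES is my assumption about the expected syntax:

    # Let mpirun pick nodes that bpstat shows as available to you:
    mpirun -np 2 mpiTest

    # Or pin the job to specific node numbers via the NODES variable:
    NODES=0,1 mpirun -np 2 mpiTest
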
From: Andrew P. <ap...@ro...> - 2005-09-01 22:54:08

I've setup bproc on my development cluster (1 master, 1 slave). Commands
execute successfully on the slave using bpsh.

The clustermatic 5 version of mpich seems to be installed and working,
however, I don't seem to be able to alter the machines file that mpich uses
to allocate processes to nodes.

I've tried editing:
  /usr/mpich-p4/share/machines.LINUX
and
  /util/machines/machines.LINUX
to contain the IP addresses of my two nodes:
  10.0.1.100
  10.0.1.101

> mpirun -np 2 mpiTest
Not enough nodes to allocate all processes

alternately

> mpirun -np 1 mpiTest
Test successful on process number: 0
elapsed_time = 8.3e-05

Can anyone tell me how to edit the machines file, where it is located, and
how to run on the master node? Can mpich do round robin process allocation
within this machines file?

- Andrew

From: Daniel G. <dg...@cp...> - 2005-08-31 21:52:21

Hi,

Is anybody using v9fs file systems with bproc? If so, how well does it
work? I am looking for an alternative to NFS, which I suspect of causing
crashes on my master node (dual Opteron, running Clustermatic 5 on top of
Fedora Core 2).

Daniel

--
Dr. Daniel Gruner                            dg...@ch...
Dept. of Chemistry                           dan...@ut...
University of Toronto                        phone: (416)-978-8689
80 St. George Street                         fax: (416)-978-5325
Toronto, ON M5S 3H6, Canada                  finger for PGP public key

From: Daryl W. G. <dw...@la...> - 2005-08-31 14:17:22

> Date: Tue, 30 Aug 2005 12:13:42 -0400
> From: Daniel Gruner <dg...@cp...>
> To: ro...@or...
> Cc: BProc users list <bpr...@li...>
> Subject: Re: [BProc] Reply to : mpif77 and mpif90 in Clustermatic 5
>
> Actually, all you need to do is install the package mpich-1.2.5.2-3, for
> which the .src.rpm is included in the cm5 distribution. Just recompile
> the mpich-1.2.5.2-3.src.rpm on your system, and install it.

FWIW, LA-MPI (http://public.lanl.gov/lampi/) also contains support for
BProc, but again you must use their mpirun to launch your job. This code is
being deprecated, however, as the LA-MPI guys have joined forces with LAM
(and others) to create Open MPI (http://www.open-mpi.org/) which,
presumably, will also have BProc support from the get-go.

Daryl

From: Daniel G. <dg...@cp...> - 2005-08-30 16:14:03

Actually, all you need to do is install the package mpich-1.2.5.2-3, for
which the .src.rpm is included in the cm5 distribution. Just recompile the
mpich-1.2.5.2-3.src.rpm on your system, and install it.

The only trick afterwards is that you MUST use the /usr/bin/mpirun that is
part of the cmtools package, which seems not to be in the cm5 distribution.
cmtools-1.4 is in the sourceforge.net/projects/bproc page, and you can get
it from there and build it for your system.

If you need the cmtools source package let me know and I can send it to you.

Regards,
Daniel

On Tue, Aug 30, 2005 at 05:15:19PM +0100, ro...@or... wrote:
> I have the same problem. I use Clustermatic 5 for our cluster and some
> programs (MPQC) require a fortran compiler. Clustermatic 5 does not have
> any, so I took another mpich that had all the compilers present (mpich-p4,
> I don't remember from where, but it should not be hard to find via
> google). When I e-mailed the staff at clustermatic (www.clustermatic.org)
> they said that this could be a possible approach; however, one has to
> change the libmpich.a to the one supplied with the clustermatic5. But if
> this does not work one has to e-mail them, because one then has to try the
> second approach, which involves recompiling the whole thing from the
> beginning. This has to be done to get the compilers working for bproc type
> systems.
>
> I hope this can be of some help to you.
>
> /Sincerely
> Robert Eklund, Ph D

--
Dr. Daniel Gruner                            dg...@ch...
Dept. of Chemistry                           dan...@ut...
University of Toronto                        phone: (416)-978-8689
80 St. George Street                         fax: (416)-978-5325
Toronto, ON M5S 3H6, Canada                  finger for PGP public key

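A condensed sketch of the steps Daniel describes, assuming an RPM-based
system with rpmbuild available; the binary RPM output path is a typical
default and may differ on your machine, and cmtools is built separately
following its own instructions:

    # Rebuild the mpich source RPM shipped with cm5 and install the result.
    rpmbuild --rebuild mpich-1.2.5.2-3.src.rpm
    rpm -Uvh /usr/src/redhat/RPMS/*/mpich-1.2.5.2-3.*.rpm

    # After building and installing cmtools-1.4 from
    # sourceforge.net/projects/bproc, confirm that the mpirun you invoke
    # is the cmtools one:
    which mpirun
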
From: <ro...@or...> - 2005-08-30 15:17:33

I have the same problem. I use Clustermatic 5 for our cluster and some
programs (MPQC) require a fortran compiler. Clustermatic 5 does not have
any, so I took another mpich that had all the compilers present (mpich-p4,
I don't remember from where, but it should not be hard to find via google).
When I e-mailed the staff at clustermatic (www.clustermatic.org) they said
that this could be a possible approach; however, one has to change the
libmpich.a to the one supplied with the clustermatic5. But if this does not
work one has to e-mail them, because one then has to try the second
approach, which involves recompiling the whole thing from the beginning.
This has to be done to get the compilers working for bproc type systems.

I hope this can be of some help to you.

/Sincerely
Robert Eklund, Ph D

Quoting bpr...@li...:
> Date: Tue, 30 Aug 2005 02:57:36 -0400
> From: Tomasz Perlik <bp...@o2...>
> To: bpr...@li...
> Subject: [BProc] mpif77 and mpif90 in Clustermatic 5
>
> Hello,
>
> I try to compile my program in fortran on Clustermatic 5 and I don't have
> compilers such as mpif77 and mpif90. I remember that in Clustermatic 4 I
> used mpif77. Why? Maybe I do something wrong or there is another way?
>
> Any suggestion is appreciated.
>
> --
> Tomasz Perlik
>
> V FDK
> Faculty of Physics, Astronomy and Informatics
> Nicolaus Copernicus University, Toruń, POLAND

From: Tomasz P. <bp...@o2...> - 2005-08-30 00:52:23

Hello,

I try to compile my program in fortran on Clustermatic 5 and I don't have
compilers such as mpif77 and mpif90. I remember that in Clustermatic 4 I
used mpif77. Why? Maybe I do something wrong or there is another way?

Any suggestion is appreciated.

--
Tomasz Perlik

V FDK
Faculty of Physics, Astronomy and Informatics
Nicolaus Copernicus University, Toruń, POLAND

From: Jeff R. <jra...@ln...> - 2005-08-26 14:40:22

Adding:

  kmod mtdcore mtdchar chipreg gen_probe jedec_probe cfi_cmdset_0002 amd76xrom

in place of your kernel modules should fix this. I currently have this
working on the hdama (I believe this is the board you are using).

Jeff

On Thu, 2005-08-25 at 22:27 -0500, Rene Salmon wrote:
> Hi list,
>
> We have some compute nodes with linux bios on them and we are trying to
> get a Bproc kernel with MTD support to boot on these nodes. The idea is
> that once we get this working we can use the lbflash and cmos_util tools
> to change or update the settings in the BIOS.
>
> Any ideas as to what I am missing? Does anyone have a node_up.conf file
> that works that I can use as a reference?
>
> Thanks
> Rene

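Read as node_up.conf directives (one "kmod" line per module, loaded in the
order listed), Jeff's list would look something like the sketch below. The
one-module-per-line reading and the load order are my interpretation of his
message, so treat this as a starting point rather than a known-good config:

    # this is for LINUXBIOS / MTD (replaces the earlier mtdchar/amd76xrom lines)
    kmod mtdcore
    kmod mtdchar
    kmod chipreg
    kmod gen_probe
    kmod jedec_probe
    kmod cfi_cmdset_0002
    kmod amd76xrom
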
From: Joshua A. <lu...@ln...> - 2005-08-26 07:26:30

> Any ideas as to what I am missing? Does anyone have a node_up.conf file
> that works that I can use as a reference?

I am by no means an MTD expert, but in the past I have had MTD drivers be
very picky about the order that they are loaded in. It looks like you
aren't calling out all the modules in node_up.conf. I would start by
calling out all the mtd drivers in the specific order you want them loaded
in. As for your specific errors, I have no clue.

FWIW, you don't have to have MTD working in order to use cmos_util, just to
actually reflash the bios.

What mobo are you using? I may have a node_up.conf lying around for it.

Josh

From: Rene S. <rs...@tu...> - 2005-08-26 03:27:44

Hi list,

We have some compute nodes with linux bios on them and we are trying to get
a Bproc kernel with MTD support to boot on these nodes. The idea is that
once we get this working we can use the lbflash and cmos_util tools to
change or update the settings in the BIOS.

I think I am missing something here. The head node boots fine and I can
load the MTD modules just fine.

lsmod:
------
amd76xrom               5632  0
chipreg                 4992  3 jedec_probe,cfi_probe,amd76xrom
map_funcs               3328  1 amd76xrom
mtdcore                 9616  4 mtdpart,amd76xrom,mtdchar

So what I did was add this to /etc/clustermatic/node_up.conf;
here are the last few lines of that:
-----------------------------------------
# Put the file system together
plugin setupfs                  # File system - requires kmod....
plugin miscfiles /dev/null /dev/zero /dev/ptmx /dev/mem
plugin miscfiles /etc/localtime /etc/ld.so.cache /tmp   # copy files
plugin miscfiles /etc/clustermatic/nsswitch.conf>/etc/nsswitch.conf

plugin vmadlib                  # Setup shared libraries.
plugin nodeinfo                 # Make note of information about this node

# this is for nfs
kmod nfs
kmod lockd
kmod sunrpc
# this is for LINUXBIOS
kmod mtdchar
kmod amd76xrom

When I boot the nodes up I get this boot message:
-----------------------------------------------------------
boot: RARP: eth0 00:50:45:5C:A0:5C -> 10.0.0.4/255.0.0.0
boot: RARP: bproc=2223; file=4711; file=/var/clustermatic/boot.img
boot: Server IP address: 10.0.0.2
boot: My IP address    : 10.0.0.4
boot: starting bpslave: bpslave -devi 10.0.0.2 2223
bpslave: IO daemon started; pid=209
bpslave: Starting new slave 0
bpslave-0: Connecting to 10.0.0.2:2223...
bpslave-0: Connection to 10.0.0.2:2223 up and running
bpslave-0: Setting node number to 1
bpslave: Master sets:
bpslave:   0: 10.0.0.2 2223
bcm5700: eth1 NIC Link is DOWN
nodeup : ****amd76xrom amd76xrom_init_one(): Unable to register resource
0xffb00000-0xffffffff - kernel bug?
** This is node amd76xrom window : 500000 at ffb00000
1 ******
amd76xrom amd76xrom_init_one(): Unable to register resource
0xffc00000-0xffffffff - kernel bug?
amd76xrom window : 400000 at ffc00000
amd76xrom amd76xrom_init_one(): Unable to register resource
0xffff0000-0xffffffff - kernel bug?
amd76xrom window : 10000 at ffff0000
nodeup : Node setup completed successfully.

Any ideas as to what I am missing? Does anyone have a node_up.conf file
that works that I can use as a reference?

Thanks
Rene

From: Motu <mo...@gm...> - 2005-08-11 02:20:06

Hello all,

Does anyone have any information on whether the cvs builds of bproc will be
usable with the 2.6.11 kernel?

Prabhas

From: Marcel A. <m_a...@ho...> - 2005-08-10 20:20:28

I'm using a Beowulf cluster, but found that many scripts don't work, though
they do if directly executed. Even this simple command fails:

  $ bpsh 1 /bin/bash -c 'env'
  /bin/bash: env: command not found

I want to avoid including the full path to /every/ executable in the script
(as in bpsh 1 /bin/bash -c '/usr/bin/env'). Any suggestion?

Marcel

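Since the full /usr/bin/env path works, the failure may be a PATH problem
in the shell started on the slave rather than a missing binary. One
possible workaround (a sketch, not a verified fix; the directory list and
script path are only examples) is to set PATH explicitly inside the
command:

    # Give the remote shell an explicit PATH before running anything:
    bpsh 1 /bin/bash -c 'export PATH=/bin:/usr/bin:/usr/local/bin; env'

    # Same idea for a whole script (hypothetical path):
    bpsh 1 /bin/bash -c 'export PATH=/bin:/usr/bin; /path/to/myscript.sh'
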
From: Motu <mo...@gm...> - 2005-08-09 21:56:06

Hello,

I am working with a machine with 2 250 GB SATA hard drives, 2 dual-core
Opteron 2.2 processors, 2 Broadcom NetXtreme 5704 gigabit ethernet network
cards, and a Tyan S2881 motherboard.

When I try to boot with the kernel version 2.6.9-cm46 (the one provided
with clustermatic 5) - as opposed to the 2.6.11-FC4smp that came with the
Fedora Core 4 system - I get NMI lockup messages on both the CPUs. The
following is what the tail end of the message looks like:

> Code: 80 3f 00 7e f9 e9 8b fd ff ff e8 1e bf ec ff e9 a1 fd ff ff
> Console shuts up...
>
> NMI Watchdog detected LOCKUP on CPU1, registers:
> CPU1
> Modules Linked in:
> Pid:0, comm:swapper Not tainted 2.6.9-cm46
> RIP:0010:[<ffffffff8032451b7>] NMI Watchdog detected LOCKUP on CPU3, registers:
> CPU3
> Modules Linked in:
> Pid:0, comm:swapper Not tainted 2.6.9-cm46
> RIP:0010:[<ffffffff8032451b7>]<fffffffff80136a21>{schedulertick+385}

Does anyone know the cause of this problem? Any way to deduce it? The web
seems to be useless: everyone with an NMI lockup fixed the problem with a
newer kernel, but in my case I need to use clustermatic's kernel. Is there
any way out?

Prabhas

From: Sun Y. <sy...@pl...> - 2005-08-06 01:47:42

The lock-up only occurs if the SIGXCPU signal is sent to a process group;
killing the processes in the group individually with SIGXCPU seems fine.

Regards,
Yi Sun

-----Original Message-----
From: Sun Yi
Sent: Tuesday, August 02, 2005 5:13 PM
To: 'bpr...@li...'
Subject: Bproc front end host locked up

I just ran a test sending SIGXCPU to a process group, and expected BProc
would forward the signal to the processes on the remote node; however,
after issuing "kill -SIGXCPU -process_group_id", the master node got locked
up and a reboot had to be performed.

I'm not sure if this is a known issue; any input will be appreciated.

Regards,
Yi Sun

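A sketch of the workaround Yi describes, signalling each member of the
group individually instead of the group as a whole. The ps options assume a
procps-style ps, and the process group id shown is a made-up example:

    # Send SIGXCPU to every process in group $PGID one at a time, instead of
    # "kill -SIGXCPU -$PGID", which triggers the lock-up described above.
    PGID=12345   # example process group id
    for pid in $(ps -eo pid,pgid | awk -v g="$PGID" '$2 == g {print $1}'); do
        kill -XCPU "$pid"
    done
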
From: Peter L. <lar...@um...> - 2005-08-04 16:52:59

On Aug 2, 2005, at 7:04 PM, adrian wrote:
> On 8/1/05, Peter Larkowski <lar...@um...> wrote:
>> I'm setting up a smallish (24 node, 48 processor) cluster to run a
>> software package called dacapo (atomic simulation package).
>
> Peter,
> I would be willing to take a look at the python code for you, and
> hopefully shed a little light on what the python code is doing. Have you
> tried using the python debugger? From your explanation, it sounds like
> perfect clustermatic material to me.

I've done some more testing, and I think I can get around my original
problem at least to some extent, but a larger problem has presented itself.

Our code is run from python scripts, many of which read in very large input
files (> 1GB). On a traditional cluster, the jobs get run on the first node
in the list of nodes assigned to that job. On our bproc cluster, all the
python is getting executed on the head node, and a couple of jobs running
simultaneously brings the head node to its knees.

Is there a way to execute python scripts on the slave nodes, and then have
them execute the compute binaries via mpirun? I've read about some bproc
python bindings, but they seem very old and don't compile against
4.0.0pre8. Is there a way to start the job on the head node, and then
migrate it to the slave node, or something along those lines?

Thanks,
Peter

From: Sun Y. <sy...@pl...> - 2005-08-02 23:09:33

Erik,

Thanks for asking. It is BProc 4, Linux kernel 2.6 on AMD64.

Regards,
Yi Sun

-----Original Message-----
From: Erik Hendriks [mailto:eah...@gm...]
Sent: Tuesday, August 02, 2005 5:46 PM
To: Sun Yi
Cc: BPr...@li...
Subject: Re: [BProc] syslog message

What version of Linux and BProc are you using?

- Erik

From: Erik H. <eah...@gm...> - 2005-08-02 21:47:34

What version of Linux and BProc are you using?

- Erik

On 7/27/05, Sun Yi <sy...@pl...> wrote:
> Linux syslog contains a lot of messages as shown below, any idea?
>
>   kernel: bproc: ghost: signal: signr == 0
>
> When they are present, the linux system may hang, and changing ownership
> of a node may fail.
>
> Thanks in advance.
>
> Yi Sun

From: Sun Y. <sy...@pl...> - 2005-08-02 21:13:21

I just ran a test sending SIGXCPU to a process group, and expected BProc
would forward the signal to the processes on the remote node; however,
after issuing "kill -SIGXCPU -process_group_id", the master node got locked
up and a reboot had to be performed.

I'm not sure if this is a known issue; any input will be appreciated.

Regards,
Yi Sun

From: Peter L. <lar...@um...> - 2005-08-01 16:25:51

Hello:

I'm setting up a smallish (24 node, 48 processor) cluster to run a software
package called dacapo (atomic simulation package). This package consists of
a binary (a pile of fortran compiled against mpi) that calculates the
wavefunctions, and some python modules to control what the binaries
calculate. I have the cluster set up with clustermatic 5, and the binary
executes fine with mpirun. What is supposed to happen after the
wavefunctions are calculated (an iterative solution) is that python gets a
signal and does the next thing in the script (geometry step, finish, or
whatever the script says to do next), but this never happens and the job
just runs forever, or until bjs effectively "kills" it.

If I kill the job myself, the python script on the head node just runs
forever (it doesn't ever seem to figure out that dacapo has died).

I've played around with various ways of running this software (bpsh'ing the
python script to a node first and then having dacapo execute - this doesn't
work at all, which is I guess expected) and I've messed with
/proc/sys/bproc/shell_hack, etc. Just running the python script on the head
node and having it call a shell script which executes mpirun is the closest
I get, but it does what I describe above.

I realize the people on this list probably don't have experience with this
software, and I'm not even sure I understand the internals completely
anyway (I've dug into some of the python and I'll admit some of it starts
to border on magic to me), but the basic question here is: does this sound
like the kind of thing that bproc just doesn't handle well yet? I'm
starting to get the feeling I should cut my losses and just set up
independent nodes with shared ssh keys, lam-mpi or mpich, and either
openpbs or pbs-pro. We have access to a cluster set up this way, and the
software does work fine there, but this clustermatic setup is nifty from an
admin standpoint so I'd love to make it work. We need this cluster running,
though, so I'm starting to think I should abort and go the other way now.
What do you guys think? Sorry for the long message, but I thought I should
describe my problems as thoroughly as I can.

Thanks for your input.

-Peter

From: Sun Y. <sy...@pl...> - 2005-07-27 23:04:18

Linux syslog contains a lot of messages as shown below, any idea?

  kernel: bproc: ghost: signal: signr == 0

When they are present, the linux system may hang, and changing ownership of
a node may fail.

Thanks in advance.

Yi Sun

From: Rene S. <rs...@tu...> - 2005-07-27 02:18:56

Hi List,

The problem is somewhat reproducible. The cluster has crashed a couple of
times in the past few days with similar messages. Following is the syslog
entry right before the crash for today. The cluster consists of 40 dual
opteron nodes.

Here is what we do to reproduce the crash:

1) The cluster boots fine and all the nodes come up, no problem.
2) We then proceed to queue up some mpi and non-mpi jobs to run.
3) After a few hours of running the jobs the cluster becomes unresponsive,
   meaning we can't get a prompt at the console, we can't ssh in, we can't
   get anywhere or do anything other than power cycle the whole thing.

Is there any way to get more debug info out of bproc so that I can get a
clue as to where to start looking at what might be causing this problem?

Thank you for any advice/help on this.

Rene

> Jul 26 18:14:58 aresnfs-frontend.ccsnfs.edu kernel: bproc: ghost: signal: signr == 0
> Jul 26 18:14:58 aresnfs-frontend.ccsnfs.edu last message repeated 577 times
> Jul 26 18:14:58 aresnfs-frontend.ccsnfs.edu kernel: proc: ghproc: ghost: siproc: ghosproc: ghost: sigproc: gproc: ghost: sigproc: ghost: sigproc: ghost: sigproc: ghost: sigproc: ghost: siproc: ghost: sigproc: ghostproc: ghost: sigproc: ghost: sigproc: ghost: sigproc: ghost: sigproc: ghost: sigproc: ghost: sigproc: ghostproc: ghost: sigproc: ghost: signal: signr == 0
> Jul 26 18:14:58 aresnfs-frontend.ccsnfs.edu kernel: bproc: ghost: signal: signr == 0
> Jul 26 18:14:59 aresnfs-frontend.ccsnfs.edu last message repeated 3631 times
> Jul 26 18:14:59 aresnfs-frontend.ccsnfs.edu kernel: proc: ghost: signal: signr == 0
> Jul 26 18:14:59 aresnfs-frontend.ccsnfs.edu kernel: bproc: ghost: signal: signr == 0
> Jul 26 18:14:59 aresnfs-frontend.ccsnfs.edu last message repeated 3639 times
> Jul 26 18:14:59 aresnfs-frontend.ccsnfs.edu kernel: proc: ghost: signal: signr == 0
> Jul 26 18:14:59 aresnfs-frontend.ccsnfs.edu kernel: bproc: ghost: signal: signr == 0

Erik Hendriks wrote:
> I've never seen a problem like that. Is it reproducible?
>
> - Erik
