From: <er...@he...> - 2004-11-30 16:52:35
On Mon, Nov 29, 2004 at 01:18:36PM -0700, Joshua Bernstein wrote:
> Actually I'm using the official Scyld release from Penguin Computing
> (version 29cz3). bpsh -v reports 3.1.9
>
> I'd be curious to know if 32-bit migration with a 64-bit kernel works
> in the Open Source version...

It does. If the Scyld one is not open source, they're in violation of the license on my code. My code (which forms the basis of their code) is GPLed. Their version must be GPLed as well.

- Erik
From: Daniel G. <dg...@ti...> - 2004-11-29 21:24:24
Boy, that is an old version! I don't think that any work is being done on that version of BProc. CM5 is it, as far as I am concerned... :-)

Daniel

On Mon, Nov 29, 2004 at 01:18:36PM -0700, Joshua Bernstein wrote:
> Actually I'm using the official Scyld release from Penguin Computing
> (version 29cz3). bpsh -v reports 3.1.9
>
> I'd be curious to know if 32-bit migration with a 64-bit kernel works
> in the Open Source version...
>
> -Josh
From: Joshua B. <bj...@en...> - 2004-11-29 20:18:39
Actually I'm using the official Scyld release from Penguin Computing (version 29cz3). bpsh -v reports 3.1.9.

I'd be curious to know if 32-bit migration with a 64-bit kernel works in the Open Source version...

-Josh

On Nov 29, 2004, at 6:19 AM, er...@he... wrote:
> It took me a while to add mixed 64/32 bit support but I think this
> should work now. Which BProc are you using?
>
> - erik
From: Daniel G. <dg...@ti...> - 2004-11-29 18:53:29
On Mon, Nov 29, 2004 at 06:19:55AM -0700, er...@he... wrote:
> It took me a while to add mixed 64/32 bit support but I think this
> should work now. Which BProc are you using?
>
> - erik

Hi Erik,

Thanks for this. The problem turned out to be simply that I wasn't specifying a time limit on the bjssub command line, so the job was getting killed after 1 sec.

However, your mixed 64/32 support may come in handy for running commercial applications. Have you ever tried gridMathematica? I am getting a demo version, which I will try to configure for one of my clusters. I am using clustermatic 5 now, both on athlon and opteron. Works great!

Regards,
Daniel
From: <er...@he...> - 2004-11-29 18:36:02
On Sat, Nov 20, 2004 at 01:33:38AM -0700, Joshua Bernstein wrote:
> The bad news here is that you're using MPI so this workaround doesn't
> work. The best thing is to make sure your code is 64-bit...

It took me a while to add mixed 64/32 bit support but I think this should work now. Which BProc are you using?

- erik
From: Peter J. <Pet...@ut...> - 2004-11-25 01:35:58
Hi,

This is regarding a problem with a new Intel Pro/1000MT NIC. The problem is that the e1000 module will not load. The new NIC's PCI id and the correct module are in the phase 1 boot media (floppy), but when I try to boot off the floppy I get the error message:

  /modules/e1000.o : init_module: No such device
  boot: module install failed
  boot: failed command: insmod /modules/e1000.o

I hadn't seen this at first, sorry about that Thomas.

thanks again,
Peter
From: Tryggvi E. <Try...@CT...> - 2004-11-22 16:34:34
Hello,

How do bproc users deal with the master going down? Is there any way of having a standby machine assume the role of master, if it detects the master going offline?

In my case, I have lots of relatively small, quick calculations to perform, and ship these out to the nodes to carry out. The nodes respond with the results back to the master. If the nodes don't respond within some timeout period (maybe the node has gone down), this will be noticed, and the calculation can be re-sent to a (different) node.

If the master goes away, the effect would be the same as none of the nodes responding, and if a 'standby spare master' could come online at that time, the lost calculations could be re-sent to the available nodes. In this simple configuration, there are no long-running processes that need to be migrated or saved, and the total loss would be only some delay while some calculations would have to be repeated.

Have you bproc-users had a similar problem -- and found a solution :-) -- ?

Many thanks,

Tryggvi EDWALD, Software engineer, CTBTO/IDC/WI/SI
Comprehensive Nuclear Test Ban Treaty Organization
VIENNA, AUSTRIA
From: Daniel G. <dg...@ti...> - 2004-11-21 19:11:30
Hi Greg,

Thanks for your instructions. It seems that the problem was in fact not having specified a time limit. Once I did that there was no problem submitting mpi jobs. I thought that since the policy I am using does not enforce time limits, there would not be much need for specifying the limits explicitly.

What are the chances of getting the little bjs manual updated a bit? (sorry, I just had to ask...)

Regards,
Daniel
From: Joshua B. <bj...@en...> - 2004-11-20 08:33:48
Daniel,

Typically this is because the compute nodes are unable to execute the binary file. I have seen this on my cluster when the code I'm attempting to run is compiled as 32-bit code under an Opteron platform. The reason for this is that BPROC cannot migrate 32-bit code under a 64-bit kernel. While this may sound bad, it's not as bad as it sounds. You can still get code to run by submitting it under a 64-bit program, namely a shell, like sh.

For this example, assume that the binary named a.out is a 32-bit binary. So something like

  $ bpsh <node> a.out

will fail, but something like

  $ bpsh <node> sh -c ./a.out

will work, because sh is a 64-bit binary and that is actually what gets migrated, not the actual 32-bit binary, a.out.

The bad news here is that you're using MPI, so this workaround doesn't work. The best thing is to make sure your code is 64-bit...

Hope this helps.

-Joshua Bernstein
Systems Analyst
University of Arizona
Tucson, Arizona, USA
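A minimal shell sketch of the check and workaround described above (node number 0 and the ./a.out path are placeholders; file(1) is simply one standard way to confirm a binary's word size):

  $ file ./a.out            # confirm the binary really is a 32-bit ELF image
  $ bpsh 0 ./a.out          # fails: a 32-bit image cannot be migrated under the 64-bit kernel
  $ bpsh 0 sh -c ./a.out    # works: the 64-bit sh is migrated and then starts a.out on the node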
From: Greg W. <gw...@la...> - 2004-11-20 03:40:42
You could try running bjssub interactively using the command:

  bjssub -i -s 1000 -n 4 bash

When you get a prompt, check which nodes are allocated:

  echo $NODES

and check they have been allocated to you using bpstat.

Then try running your script. mpirun should use $NODES rather than the -np argument, but you might want to use -np 4 just to be certain.

Another thing to check is that you've specified -s 1000 (or some number of seconds greater than the expected run length of the job). I think the default is 1 second, which might be causing the problem.

Greg
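Put together, Greg's interactive check might look like this (the mpirun line is the one Daniel posts elsewhere in this thread; the job path is his example, not a requirement):

  $ bjssub -i -s 1000 -n 4 bash    # ask bjs for 4 nodes for 1000 seconds, interactive shell
  $ echo $NODES                    # the nodes bjs allocated
  $ bpstat                         # confirm those nodes are up and assigned to you
  $ /usr/bin/mpirun -P -np 4 -- /home/dgruner/myjob >& /home/dgruner/myjob.out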
From: Daniel G. <dg...@ti...> - 2004-11-20 01:41:11
When I am not running under bjs, I typically use:

  /usr/bin/mpirun -P -np 4 -- /home/dgruner/myjob >& /home/dgruner/myjob.out

When I try to do it for bjs, I typically have been trying to submit a script like:

  ----------------------
  #!/bin/bash

  NN=$NODES
  echo "Running job on nodes $NN\\n"

  /usr/bin/mpirun -P -np 1 -s -- /home/dgruner/ifranco/dynamics >& /home/dgruner/ifranco/dynamics.out
  ----------------------

Now, a similar script, but for a serial job (i.e. non-mpi), works just fine, but when I try the mpi job it gives me the iofwd: message.

Daniel

On Fri, Nov 19, 2004 at 03:36:26PM -0700, Greg Watson wrote:
> What command are you using?
From: Greg W. <gw...@la...> - 2004-11-19 22:36:42
What command are you using?

greg

On Nov 19, 2004, at 2:37 PM, Daniel Gruner wrote:
> I am having problems running mpi codes (using /usr/bin/mpirun)
> under the bjs batch system. I am running clustermatic 5 on an
> Opteron cluster.
From: Daniel G. <dg...@ti...> - 2004-11-19 21:37:34
Hi

I am having problems running mpi codes (using /usr/bin/mpirun) under the bjs batch system. I am running clustermatic 5 on an Opteron cluster. My mpi jobs work fine when submitted directly from the shell (i.e. not under bjs). When I try to run it under bjs I get the message: "iofwd: Child process exited abnormally" from every process.

Does anyone know how I am supposed to run these jobs?

Thanks,
Daniel
--
Dr. Daniel Gruner               dg...@ti...
Dept. of Chemistry              dan...@ut...
University of Toronto           phone: (416)-978-8689
80 St. George Street            fax: (416)-978-5325
Toronto, ON M5S 3H6, Canada     finger for PGP public key
From: Dale H. <ro...@ma...> - 2004-11-18 17:45:13
Has anyone ever noticed bizarre process times for processes on bproc? This would be for 3.2.6. For example:

  62 rodmur 18119 18010  0 Oct02 pts/23 00:00:07 [poy]
  66 rodmur 18120 18026  0 Oct02 pts/23 00:00:07 [poy]
  79 rodmur 18121 18106  0 Sep04 pts/23 00:00:05 [poy]
  63 rodmur 18122 18005  0 Oct02 pts/23 00:00:07 [poy]
  71 rodmur 18123 18075  0 Sep04 pts/23 00:00:05 [poy]
  77 rodmur 18124 18104  0 Sep04 pts/23 00:00:06 [poy]

I just started these processes but it claims they have been running since Sept 4th?

--
Dale Harris
ro...@ma...
/.-)
From: Thomas E. <eck...@gm...> - 2004-11-17 09:50:34
On Wed, 17 Nov 2004, Peter Jones wrote:
> yes, when I boot the node off the floppy, I get the message "No usable
> network interfaces found." No complaints about the driver.

ok, as the "-v" switch suggested by Erik is not available for verbose beoboot output in your version, we need to check by hand (the same steps are spelled out as commands below):

- mount the floppy
- copy the initrd.img from the floppy to e.g. ~/initrd.img.gz and decompress it: gzip -d ~/initrd.img.gz
- loop-mount the image, e.g. mount -o loop,ro ~/initrd.img /mnt/foo
- 1st: check the content of /mnt/foo/config.boot; is your added pci-id in there?
- 2nd: check if the e1000.o module is in /mnt/foo/modules/

> I have tested the card with Windows XP and it works OK. To test that the new
> slave is OK, I tried including the driver for the 3com (3c59x) interface
> onboard the Tyan MB in the phase 1 floppy. In that case the node boots and
> the master happily adds the new slave to the cluster, through the 3com LAN
> of course. So the Intel NIC works and the new slave works, but the e1000
> driver is not loaded nor is there an attempt to load it, so far as I can see.

if you already added the new node via the 3com-card you could additionally try loading the e1000-module on that booted node (copy over with bpcp and load with bpsh) to make sure the module will work once the "No usable network ..."-issue is solved.

Thomas

--
Reality continues to ruin my life. -- Calvin
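The same by-hand check as plain commands (the mount points /mnt/floppy and /mnt/foo, and the initrd.img file name, follow Thomas's description and may differ on your floppy):

  mount /dev/fd0 /mnt/floppy                  # mount the phase 1 floppy
  cp /mnt/floppy/initrd.img ~/initrd.img.gz   # pull the compressed initrd off it
  gzip -d ~/initrd.img.gz                     # decompress to ~/initrd.img
  mount -o loop,ro ~/initrd.img /mnt/foo      # loop-mount the image read-only
  grep 0x1026 /mnt/foo/config.boot            # is the added pci id really in there?
  ls /mnt/foo/modules/                        # is e1000.o among the modules?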
From: Peter J. <Pet...@ut...> - 2004-11-17 06:52:50
Hi,

yes, when I boot the node off the floppy, I get the message "No usable network interfaces found." No complaints about the driver.

I have tested the card with Windows XP and it works OK. To test that the new slave is OK, I tried including the driver for the 3com (3c59x) interface onboard the Tyan MB in the phase 1 floppy. In that case the node boots and the master happily adds the new slave to the cluster, through the 3com LAN of course. So the Intel NIC works and the new slave works, but the e1000 driver is not loaded, nor is there an attempt to load it, so far as I can see.

thanks,
Peter
From: Thomas E. <eck...@gm...> - 2004-11-16 14:20:28
On Tue, 16 Nov 2004, Peter Jones wrote:
> thanks Erik and Thomas for your help. Yes all the slaves have the same
> hardware. I'm not sure I understand however. I have edited config.boot and
> added a new line "pci 0x8086 0x1026 e1000".

my fault, i should have read your post more carefully (and then i would have realized that you already added the pci-id to config.boot).

so what happens exactly:

- is bproc complaining about "No usable network interfaces found."?
- is the e1000 driver complaining (maybe the e1000.o you are using does not support your NIC)?
- something other?

thomas
From: Peter J. <Pet...@ut...> - 2004-11-16 12:51:31
----- Original Message -----
From: Thomas Eckert <eck...@gm...>
Date: Tuesday, November 16, 2004 8:38 pm
Subject: Re: [BProc] CM3 PCI IDs

> The other slaves have _identical_ hardware? The e1000-chips change
> frequently (and thus the pci-id) and the "new id" may not be in your
> pci-id-list. To check that you could compare the "lspci -n"-output of a
> working slave with your trouble-machine and add the new ID to
> /etc/beowulf/config.boot.
>
> Thomas

hi,

thanks Erik and Thomas for your help. Yes all the slaves have the same hardware. I'm not sure I understand however. I have edited config.boot and added a new line "pci 0x8086 0x1026 e1000". This is the pci id of the new intel NIC, which I read off the console of the new slave after trying to boot phase 1 off a (newly created) floppy. To me, the slave is not getting the information that it should load the e1000 driver into the NIC with pci id 0x8086 0x1026, presumably because that info is not getting from config.boot to the phase 1 boot image when I create the boot floppy with:

  beoboot -1 -k /boot/vmlinuz-2.4.19-lanl.22beoboot -f -o /dev/fd0

thanks,
Peter
From: Thomas E. <eck...@gm...> - 2004-11-16 09:42:15
On Tue, 16 Nov 2004, Peter Jones wrote:
(...)
> If that's what you mean? I am pretty sure the e1000 driver is being included
> in the phase 1 image because it's the only bootmodule in config.boot and all
> the other slaves boot off the floppy OK. I have looked at the slave console
> and it's not loading the e1000 driver. It simply lists the device ID and
> gives the message to edit config.boot.

The other slaves have _identical_ hardware? The e1000-chips change frequently (and thus the pci-id) and the "new id" may not be in your pci-id-list. To check that you could compare the "lspci -n" output of a working slave with your trouble-machine and add the new ID to /etc/beowulf/config.boot.

Thomas
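Spelled out, Thomas's comparison might look like this (whether lspci is available on a booted slave via bpsh is an assumption; the config.boot line and beoboot invocation are the ones already used in this thread):

  # ids seen on a slave that already boots and joins the cluster
  bpsh <working-node> lspci -n

  # compare against the id printed on the failing node's console (0x8086 0x1026 here),
  # make sure that id is in /etc/beowulf/config.boot:
  #   pci 0x8086 0x1026 e1000
  # then rebuild the phase 1 floppy so the change is actually picked up:
  beoboot -1 -k /boot/vmlinuz-2.4.19-lanl.22beoboot -f -o /dev/fd0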
From: Peter J. <Pet...@ut...> - 2004-11-16 06:49:10
----- Original Message -----
From: er...@he...
Date: Tuesday, November 16, 2004 6:30 am
Subject: Re: [BProc] CM3 PCI IDs

> Are you sure the driver was included in the phase 1 image? You can
> run beoboot with '-v' to verify the driver got sucked in. Aside from
> that, I would look at the slave's console to see if it's loading and
> failing or what.
>
> - Erik

Hi,

thanks very much for the reply. When I run beoboot with '-v' I only get the version information:

  -bash-2.05b# beoboot -v
  beoboot cm.1.4

If that's what you mean? I am pretty sure the e1000 driver is being included in the phase 1 image because it's the only bootmodule in config.boot and all the other slaves boot off the floppy OK. I have looked at the slave console and it's not loading the e1000 driver. It simply lists the device ID and gives the message to edit config.boot.

thanks again,
Peter Jones
From: <er...@he...> - 2004-11-15 21:39:12
On Sun, Nov 14, 2004 at 11:56:33AM -0200, Gustavo Gobi Martinelli wrote:
> All,
>
> I am writing something about BPROC but I didn't find any history of when
> and how it started.
>
> Does anyone know where?

I first had the idea about 5 years ago when I was indirectly working for NASA at the Goddard Space Flight Center (GSFC). The basic idea was to use something in the parent's process tree to represent and control remote processes. The process migration stuff came a little later. I believe it was Dan Ridge who prompted the inclusion of process migration. He told me that he really wanted a system with an rfork system call. rfork meant remote fork in this case, not the BSD rfork system call.

The first test bed for BProc was a vanilla 64 node cluster at GSFC. It provided an easy platform to test out a lot of the kernel stuff. Dan Ridge was the first person to try building a cluster with really light weight nodes based on BProc. That work was done at the NASA Office of Inspector General (OIG). I think that was the first production use of BProc. Most of the ideas for building light weight nodes with BProc came out of his work there.

Both of us joined Scyld Computing and worked on 'Scyld Beowulf', which was the first attempt at taking these ideas and polishing them enough to make a turn-key cluster management system out of it. If you look at Clustermatic today, the beoboot piece is the most significant open source piece to come out of that effort. It got us to the point of really having zero software installed on the nodes. The root file system was a ramdisk whose contents were sent from the front end when the node booted.

This led to us doing a few stunts. For example, at ALS in 2000(?) we installed the Scyld distribution we'd created on a single front end machine. Then put a bootable CD in every machine in their 'email garden' and turned the whole thing into a cluster. We ran a few little demos on the cluster like a parallel mandelbrot browser. Then we rebooted the machines and it was an email garden again. This was possible because we never touched the local disks on the machines.

I left Scyld due to irreconcilable differences with the management. Since then I moved to Los Alamos National Laboratory (LANL). Ron Minnich (from LANL) had been a guinea pig for some early versions of the Scyld software. Ron built the first cluster that combined LinuxBIOS and BProc. Using LinuxBIOS he was able to place the phase 1 boot image into the flash on the slave nodes in the cluster. This led to a whole new level of reliability for the slave nodes. LinuxBIOS got rid of all the stupid stuff people deal with on commercial bioses. (e.g. No keyboard, press F1 to continue.)

After I got to LANL, I started to periodically release "Clustermatic". Clustermatic combined some of the open source pieces of the Scyld distribution and new development on BProc. The name "clustermatic" was kind of a joke at the time but it stuck. We were talking about Ron Popeil, Ronco and products like the vegematic over lunch one day. Since our team lead was also named Ron, we thought we should have the Clustermatic.

In the last two years, Clustermatic has gotten a lot of traction at the lab. There are now two 1024+ node clusters (and many smaller ones) running Clustermatic at LANL.

Anyway, there's a rough history from my point of view.

- Erik
From: <er...@he...> - 2004-11-15 20:57:51
On Fri, Nov 12, 2004 at 03:33:32PM +1100, Peter Jones wrote:
> I recently tried to add a new node. However the PCI ID of the new Intel
> Pro/1000MT gigabit NIC is not on the list in config.boot. Editing
> config.boot and adding a line for the new id (pci 0x8086 0x1026 e1000)
> and creating a new floppy phase 1 boot disc did not fix this problem.

Are you sure the driver was included in the phase 1 image? You can run beoboot with '-v' to verify the driver got sucked in. Aside from that, I would look at the slave's console to see if it's loading and failing or what.

- Erik
From: Gustavo G. M. <gu...@ma...> - 2004-11-14 13:56:37
All,

I am writing something about BPROC but I didn't find any history of when and how it started.

Does anyone know where?

--
Atenciosamente,
Gustavo Gobi Martinelli
Linux User# 270627
From: Peter J. <Pet...@ut...> - 2004-11-12 04:33:35
Hello,

I have a small cluster (Tyan MBs, Dual Athlons, e1000 NICs, floppy boot) running Clustermatic 3. I recently tried to add a new node. However, the PCI ID of the new Intel Pro/1000MT gigabit NIC is not on the list in config.boot. Editing config.boot and adding a line for the new id (pci 0x8086 0x1026 e1000) and creating a new floppy phase 1 boot disc did not fix this problem. Does anyone know how to fix this?

thanks,
Peter Jones
From: <er...@he...> - 2004-11-05 17:27:48
On Thu, Oct 28, 2004 at 01:48:47PM -0400, Ted Sariyski wrote:
> I'll work later to set up bjs. I have a more serious problem now.
> Sometimes, not always, a small job like pi3 kills nodes. In the example
> below 0, 1 and 4 were already killed in the same way:
>
>   #> ~/mpi_examples/64> mpirun -d --p4 -np 8 --nper 2 /u/ted/mpi_examples/64/pi3 .
>   Bogus number of cpus on node 2: 0
>   Bogus number of cpus on node 3: 0
>   Bogus number of cpus on node 5: 0
>   Bogus number of cpus on node 6: 0
>   Bogus number of cpus on node 7: 0
>   Bogus number of cpus on node 8: 0
>   Bogus number of cpus on node 9: 0
>   Bogus number of cpus on node 10: 0
>   Bogus number of cpus on node 11: 0
>   Bogus number of cpus on node 12: 0
>   Bogus number of cpus on node 13: 0
>   Bogus number of cpus on node 14: 0
>   listen: 192.168.0.101 42017
>   6: Slave node died
>   6: Slave node died
>   Not all processes started, aborting.
>
> On a node console there is something like (different for different nodes):
>
>   Pid: 197, comm: init Not tainted 2.6.7
>   RIP: ...
>   Call Trace: <...> {bproc: do_recv_proc_stub+496}
>
> or on another node: {bproc: masq_add_proc+496}.
>
> There is nothing in the node log files. What does 'Bogus number of cpus
> on node' mean? Help, please?
> Thanks, Ted

Bogus number of CPUs means that the 'nodeinfo' part of node setup either didn't run or didn't do its job correctly. It's supposed to look at the nodes when they come up and make note of it on the front end. That error shouldn't cause any major problems.

Can you send the whole trace and exactly which version of BProc you're using?

- Erik
> >>>> > >>>>Pathching finished without errors: > >>>>#> patch -p1 < ../mpich-1.2.5..10-p4-bproc.patch > >>>>patching file mpid/ch_p4/p4/lib/p4_sock_conn.c > >>>>patching file mpid/ch_p4/p4/lib/p4_sock_sr.c > >>>>patching file mpid/ch_p4/p4/lib/p4_utils.c > >>>>patching file mpid/ch_p4/p4priv.c > >>>>#> patch -p1 < ../mpich-1.2.5..10-totalview.patch > >>>>patching file src/env/initutil.c > >>>> > >>>>Makefile was generated by: > >>>>export CC=gcc > >>>>export FC=pgf77 > >>>>export F90=pgf90 > >>>>export F77=pgf77 > >>>>export RSHCOMMAND=bproc > >>>>./configure --with-device=ch_p4 \ > >>>> --prefix=/usr/local/mpic-p4_1.2.5.2 \ > >>>> --enable-debug \ > >>>> -optcc="-O3" \ > >>>> -c++=g++ \ > >>>> --enable-f90 --enable-f77 \ > >>>> --enable-romio --with-file-system=nfs | tee configure.log > >>>>Compilation finished without errors. > >>>> > >>>>Now mpirun returns: > >>>>#> mpirun -np 2 --p4 /u/ted/mpi_examples/64/pi3 > >>>> Unrecognized argument --p4 ignored. > >>>> p0_6984: p4_error: Path to program is invalid while starting > >>>>/u/ted/mpi_examples/64/pi3 with bproc on xtreme101: -1 > >>>> p4_error: latest msg from perror: No such file or directory > >>>> > >>>> > >>>> > >>>> > >>>Try using /usr/bin/mpirun, rather than whatever is coming up first in > >>>your path (likely the mpirun from the examples directory). The > >>>/usr/bin/mpirun is from cmtools, and it is the only one that works > >>>(at least for me). It will happily take the --p4 argument. > >>> > >>>Daniel > >>> > >>> > >>> > >>> > > > > > > > > > ------------------------------------------------------- > This Newsletter Sponsored by: Macrovision > For reliable Linux application installations, use the industry's leading > setup authoring tool, InstallShield X. Learn more and evaluate > today. http://clk.atdmt.com/MSI/go/ins0030000001msi/direct/01/ > _______________________________________________ > BProc-users mailing list > BPr...@li... > https://lists.sourceforge.net/lists/listinfo/bproc-users |