From: Dale H. <ro...@ma...> - 2005-03-28 22:51:00
I figured out my problem: it was in the setupfs module. I had NFS configured as a module in the kernel and setupfs was not able to modprobe it... I think it'd help to make more liberal use of fflush() so we have a chance of seeing some of these errors. It would have saved me some time diagnosing my problem. Dale
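A quick way to check how NFS ended up configured in a node's kernel, i.e. built in versus a module that setupfs would have to modprobe (a minimal sketch using standard Linux paths and commands, none of them taken from the original post):

    # nfs is listed here once support is built in or the module is loaded
    grep nfs /proc/filesystems

    # if the kernel config file is installed: "y" means built in, "m" means module
    grep CONFIG_NFS_FS= /boot/config-$(uname -r)

    # is the nfs module currently loaded?
    lsmod | grep '^nfs'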
From: Dale H. <ro...@ma...> - 2005-03-28 22:04:09
So I've been putting some printf (or log_print..) calls around the nodeup source to try and figure out what is going on. But as far as I can tell at this point, the postmove for the gm module is exiting just fine. So I'm flummoxed. Dale
From: Dale H. <ro...@ma...> - 2005-03-28 15:49:26
Such as it is... I'm still having this problem. I already have node_up logging cranked up to the max (level 4). Is there some other way to increase logging to perhaps debug this problem? Any suggestions would be appreciated. Dale Harris
From: Dale H. <ro...@ma...> - 2005-03-26 01:31:36
On Fri, Mar 25, 2005 at 04:26:59PM -0800, Dale Harris elucidated: > I did upgrade to GM 2.0.19... could that be part of the problem? Doesn't seem to be the GM driver version. Went back to 2.0.16, which was working before. Attaching serial console output for completeness. Dale
From: Dale H. <ro...@ma...> - 2005-03-26 00:27:05
On Fri, Mar 25, 2005 at 03:31:51PM -0800, Dale Harris elucidated: > Hi, > Having a problem trying to get a bproc cluster going. I can't figure out what is going on. Is there supposed to be a postmove function for GM? Anyone spot what I'm missing? > node log and node_up.conf attached. I did upgrade to GM 2.0.19... could that be part of the problem? Dale
From: Dale H. <ro...@ma...> - 2005-03-25 23:32:08
Hi, Having a problem trying to get a bproc cluster going. I can't figure out what is going on. Is there supposed to be a postmove function for GM? Anyone spot what I'm missing? node log and node_up.conf attached. -- Dale Harris ro...@ma... /.-)
From: Sinan Al-S. <si...@al...> - 2005-03-25 22:36:04
Hello, Setup: We are trying to use Bproc and IB (Mellanox Gen-1) on our 16-node 2.6.6 Linux kernel cluster. Problem: IB seems to work fine on the head node (the vstat command succeeds). However, on the remote node it fails. More details: Tracing showed that sys_clone in the VAPI (the verbs IB API) is failing on the remote nodes; sys_clone() is used to clone a thread. If you have any ideas, kindly let me know. Thanks, Sinan --------------------- [root@unm bin]# bpsh 13 ./vstat 1 HCA found: hca_id=InfiniHost0 Error: Could not retrieve handle to the HCA InfiniHost0 (VAPI_EGEN) [root@unm bin]# --------------------
From: Luke S. <lu...@ac...> - 2005-03-25 16:03:17
Attached is a simple Perl script that I can use to tank the system. The script uses blocking NFS file locking (a great, simple way to coordinate jobs across a cluster), and works fine on other computers. For example, if you spawn a bunch of them at once: for i in `seq 1 8` ; do filelocktest name_of_existing_file & done The last script will finish 8 seconds later, each script taking a turn holding the lock on the file for 1 second. It also works across multiple (non-clustermatic) machines if the name_of_existing_file is on a commonly NFS mounted directory. However, if you try the script on our cluster (where all the nodes have /home NFS mounted and /proc/sys/bproc/shell_hack is off): bpsh 1-30 filelocktest name_of_existing_file It does not run in 30 seconds as expected. The locks are obtained much more slowly than 1/sec, and after a little while the whole system freezes up and dumps the message that I sent earlier. Note that while using ~10 nodes takes much longer than 10 seconds, it usually succeeds after a certain amount of time, and doesn't crash. 30 nodes and more crashes pretty reliably. On another note, our final piece of cluster weirdness that I've detected is also NFS related, though not as important. When I read a file off a master NFS server drive from a node I get 50 MB/s, which is how fast the drive goes (Yay! The 2.4 kernel maxed out at ~20 MB/s over NFS for a single client.) But then I read the same file from the master NFS server again from a different node, now that it is cached on the server, and I get only 10 MB/s. To make certain that I'm not nuts I read the same file over NFS from a non-clustermatic computer and I get 100 MB/s, the legal gigabit limit (Sweet!). Summary: NFS to clustermatic nodes is much slower if the file is cached in the master NFS server. It seems very odd that I'm getting these NFS problems. Shouldn't that be pretty much independent of the bproc changes to the kernel? Would having an NFS server separate from the bproc master fix things?
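The Perl script itself is an attachment and isn't reproduced in the archive, but the behaviour it describes can be sketched with the flock(1) utility from util-linux (a hypothetical stand-in, not the original script; whether it exercises the NFS lock manager in exactly the same way as Perl's flock depends on how the kernel handles flock over NFS):

    #!/bin/sh
    # locktest.sh <existing_file>
    # Hypothetical equivalent of the filelocktest Perl script described above:
    # block until we hold an exclusive lock on the shared file, keep it for
    # one second, then exit (which releases the lock).
    flock "$1" -c 'sleep 1'

Spawning eight of these against one NFS-mounted file, as in the loop above, should complete in roughly eight seconds when locking behaves: for i in `seq 1 8`; do ./locktest.sh name_of_existing_file & done; wait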
From: Daryl W. G. <dw...@la...> - 2005-03-23 00:02:35
I want to advertise a new package that is appearing today on bproc.sf.net, which contains Perl BProc bindings for much of the API. I've tested this under BProc v3/v4 on a combination of RH8, RH9 and FC2/64 (Opteron/P4 arch's). Once installed, you can 'man BProc(3)' to see the details. I'm also including some code snippets in the examples directory to give an idea of the type of things I wanted this XSUB for. N-joy! Daryl P.S. Thanks to Erik for permitting this software to reside on the site!
From: Greg W. <gw...@la...> - 2005-03-14 22:50:56
On Mar 14, 2005, at 2:56 PM, Nguyen, Vinh-Thach wrote: > Hi Greg, > > Thank you very much for the clarification on bproc and bjs. > So I tried this command and it works: > bjssub -p default -i -s 60 'bpsh $NODES eldo p690.cir' > > Without "-s #" option the command above does not work because the > $NODES > is still owned by root. > With "-s #" option the owner is changed to the current user for # > seconds. > After # seconds the application dies because the ownership of $NODES > is changed back > to root. Is it how bjs works ? Yes, the '-s' option specifies the job duration. Use a large number if you expect the job to run for a long time. > On the other hand, in case of dual or quad CPU on node 0 for instance, > if the node 0 is allocated to one user whose application uses only one > CPU, > then other users can not use node 0 any more, until the node is > released, > although other CPUs are available. Is it the expected behavior of > Bproc/bjs ? Yes, allocation is by node, not by CPU. > > It seems Bproc/bjs cluster is built for parallel applications, not > really suitable for applications which can use only single CPU. Each > node > can run only one process, therefore other CPUs of a node can not be > used/shared > in case of applications using single CPU. Is it correct ? Yes, bproc/bjs was primarily designed for parallel applications. However, you can easily share nodes using the 'groups' command. Alternatively, don't use bjs at all and set the node permissions so the nodes are accessible by other users. > > Is there other cluster packages (for dual/quad CPU nodes) more > suitable for > single process single CPU applications ? Or another schedulers besides > bjs > which is more transparent to the users, automatically > distribute/balance applications > to nodes ? LSF can be used with bproc, but it's commercial. Greg
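To make the NODES mechanism concrete, here is a minimal sketch of a job script that bjs could schedule; the script name and the application command are hypothetical, while bjssub, bpsh and the NODES variable are used as described in this thread:

    #!/bin/sh
    # myjob.sh (hypothetical) -- bjs exports NODES as the list of node
    # numbers allocated to this job; run one copy of the application on
    # each of those nodes via bpsh.
    bpsh $NODES ./my_app input.dat

Submitted with something like bjssub -p default -s 3600 ./myjob.sh, the allocation then lasts 3600 seconds, matching Greg's point that '-s' sets the job duration.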
From: Nguyen, Vinh-T. <vt...@in...> - 2005-03-14 21:58:16
Hi Greg, Thank you very much for the clarification on bproc and bjs. So I tried this command and it works: bjssub -p default -i -s 60 'bpsh $NODES eldo p690.cir' Without the "-s #" option the command above does not work because $NODES is still owned by root. With the "-s #" option the owner is changed to the current user for # seconds. After # seconds the application dies because the ownership of $NODES is changed back to root. Is this how bjs works? On the other hand, in the case of dual or quad CPUs on node 0 for instance, if node 0 is allocated to one user whose application uses only one CPU, then other users cannot use node 0 any more until the node is released, although other CPUs are available. Is this the expected behavior of Bproc/bjs? It seems a Bproc/bjs cluster is built for parallel applications, and is not really suitable for applications which can use only a single CPU. Each node can run only one process, therefore the other CPUs of a node cannot be used/shared in the case of applications using a single CPU. Is this correct? Are there other cluster packages (for dual/quad CPU nodes) more suitable for single-process, single-CPU applications? Or another scheduler besides bjs which is more transparent to the users and automatically distributes/balances applications to nodes? Regards, Vinh-Thach Nguyen

-----Original Message----- From: Greg Watson [mailto:gw...@la...] Sent: Sunday, March 13, 2005 7:28 AM To: Nguyen, Vinh-Thach Cc: ha...@no...; bpr...@li... Subject: Re: [BProc] Clustermatic 5, load balancing and process distribution

A common mistake that new bproc users make is to assume that bjs works like other schedulers. In fact, all bjs does is allocate nodes, it does not place jobs on the nodes. This is the responsibility of the script or program that is being scheduled. To facilitate this, bjs sets the environment variable NODES to contain a list of the node numbers that have been allocated to the job. The script or program can then use this variable to place processes onto the nodes by using bpsh or some other means. You could try something like: bjssub -p default -i eldo 'bpsh $NODES p690.cir' The owner/permissions you see on the nodes are correct. Nodes controlled by bjs are normally owned by root and executable only by root. When bjs schedules the nodes to you, it will change the owner of the node to your user id so that you can run the job. Once the job is completed, the owner of the node will return to root. Greg

On Mar 12, 2005, at 6:33 PM, Nguyen, Vinh-Thach wrote: > I've tried bjssub for submitting jobs but somehow they still all run > on the master, > not on the slave nodes. > My test setup has one master and one node, each machine has one CPU. > This is my bjs.conf: > spooldir /var/spool/bjs > policypath /usr/lib64/bjs:/usr/lib/bjs > socketpath /tmp/.bjs > #acctlog /tmp/acct.log > pool default > policy filler > nodes 0-5 > > The bjsstat returns: > Pool: default Nodes (total/up/free): 1/1/1 > ID User Command Requirements > > I submitted many jobs with this command: > bjssub -p default -i eldo p690.cir > and got the JOBID back from the bjs server for each submitted job. > The return of bjsstat still looks like the one above, with no jobs in > the list. > All jobs run on the master. "bpsh 0 ps -e" shows none running on > the node 0. > > The node 0 is owned by root and executable only by root. I already > tried to change the > owner to the user who runs bjssub and also the mode executable to 111 > but every time > I run bjssub they all come back as default, owned by root and > executable only by root. > Is this the main problem? I also tried to submit the jobs as root but > still hopeless. > Again, launching a job explicitly on node 0 with bpsh works fine. > As Vaclav said I'm close to my goal. > I greatly appreciate if someone can give me a help on this. > > Vinh-Thach Nguyen > > From: ha...@no... [mailto:ha...@no...] > Sent: Sat 3/12/2005 07:28 > To: Nguyen, Vinh-Thach > Cc: bpr...@li... > Subject: Re: [BProc] Clustermatic 5, load balancing and process distribution > > > I set up a cluster with Clustermatic 5 and run multiple > > simulations on the master but I do not see the master distribute > > jobs to slave nodes. All jobs run on the master. > Bproc is not OpenMosix. It is not supposed to migrate those jobs by > itself and even cannot. > > Although I'm able to run simulation on slave node by using > explicitly bpsh. > Good. You are close. > > Did I forget something in the configuration? Or Clustermatic 5 does > > not automatically distribute jobs to slave nodes > > or does not have load balancing capability? > With bproc you need some job spooling system (like bjs) to do load > balancing. > Vaclav Hanzl
From: Greg W. <gw...@la...> - 2005-03-13 15:28:18
A common mistake that new bproc users make is to assume that bjs works like other schedulers. In fact, all bjs does is allocate nodes, it does not place jobs on the nodes. This is the responsibility of the script or program that is being scheduled. To facilitate this, bjs sets the environment variable NODES to contain a list of the node numbers that have been allocated to the job. The script or program can then use this variable to place processes onto the nodes by using bpsh or some other means. You could try something like: bjssub -p default -i eldo 'bpsh $NODES p690.cir' The owner/permissions you see on the nodes are correct. Nodes controlled by bjs are normally owned by root and executable only by root. When bjs schedules the nodes to you, it will change the owner of the node to your user id so that you can run the job. Once the job is completed, the owner of the node will return to root. Greg

On Mar 12, 2005, at 6:33 PM, Nguyen, Vinh-Thach wrote: > I've tried bjssub for submitting jobs but somehow they still all run > on the master, > not on the slave nodes. > My test setup has one master and one node, each machine has one CPU. > This is my bjs.conf: > spooldir /var/spool/bjs > policypath /usr/lib64/bjs:/usr/lib/bjs > socketpath /tmp/.bjs > #acctlog /tmp/acct.log > pool default > policy filler > nodes 0-5 > > The bjsstat returns: > Pool: default Nodes (total/up/free): 1/1/1 > ID User Command Requirements > > I submitted many jobs with this command: > bjssub -p default -i eldo p690.cir > and got the JOBID back from the bjs server for each submitted job. > The return of bjsstat still looks like the one above, with no jobs in > the list. > All jobs run on the master. "bpsh 0 ps -e" shows none running on > the node 0. > > The node 0 is owned by root and executable only by root. I already > tried to change the > owner to the user who runs bjssub and also the mode executable to 111 > but every time > I run bjssub they all come back as default, owned by root and > executable only by root. > Is this the main problem? I also tried to submit the jobs as root but > still hopeless. > Again, launching a job explicitly on node 0 with bpsh works fine. > As Vaclav said I'm close to my goal. > I greatly appreciate if someone can give me a help on this. > > Vinh-Thach Nguyen > > From: ha...@no... [mailto:ha...@no...] > Sent: Sat 3/12/2005 07:28 > To: Nguyen, Vinh-Thach > Cc: bpr...@li... > Subject: Re: [BProc] Clustermatic 5, load balancing and process distribution > > > I set up a cluster with Clustermatic 5 and run multiple > > simulations on the master but I do not see the master distribute > > jobs to slave nodes. All jobs run on the master. > Bproc is not OpenMosix. It is not supposed to migrate those jobs by > itself and even cannot. > > Although I'm able to run simulation on slave node by using > explicitly bpsh. > Good. You are close. > > Did I forget something in the configuration? Or Clustermatic 5 does > > not automatically distribute jobs to slave nodes > > or does not have load balancing capability? > With bproc you need some job spooling system (like bjs) to do load > balancing. > Vaclav Hanzl
From: Nguyen, Vinh-T. <vt...@in...> - 2005-03-13 01:35:16
I've tried bjssub for submitting jobs but somehow they still all run on the master, not on the slave nodes. My test setup has one master and one node, each machine has one CPU. This is my bjs.conf:

    spooldir /var/spool/bjs
    policypath /usr/lib64/bjs:/usr/lib/bjs
    socketpath /tmp/.bjs
    #acctlog /tmp/acct.log
    pool default
        policy filler
        nodes 0-5

The bjsstat returns:

    Pool: default Nodes (total/up/free): 1/1/1
    ID User Command Requirements

I submitted many jobs with this command: bjssub -p default -i eldo p690.cir and got the JOBID back from the bjs server for each submitted job. The return of bjsstat still looks like the one above, with no jobs in the list. All jobs run on the master. "bpsh 0 ps -e" shows none running on node 0. Node 0 is owned by root and executable only by root. I already tried to change the owner to the user who runs bjssub and also the mode executable to 111, but every time I run bjssub they all come back as default, owned by root and executable only by root. Is this the main problem? I also tried to submit the jobs as root but still hopeless. Again, launching a job explicitly on node 0 with bpsh works fine. As Vaclav said, I'm close to my goal. I would greatly appreciate it if someone could give me some help on this. Vinh-Thach Nguyen

From: ha...@no... [mailto:ha...@no...] Sent: Sat 3/12/2005 07:28 To: Nguyen, Vinh-Thach Cc: bpr...@li... Subject: Re: [BProc] Clustermatic 5, load balancing and process distribution

> I set up a cluster with Clustermatic 5 and run multiple > simulations on the master but I do not see the master distribute > jobs to slave nodes. All jobs run on the master. Bproc is not OpenMosix. It is not supposed to migrate those jobs by itself and even cannot. > Although I'm able to run simulation on slave node by using explicitly bpsh. Good. You are close. > Did I forget something in the configuration? Or Clustermatic 5 does > not automatically distribute jobs to slave nodes > or does not have load balancing capability? With bproc you need some job spooling system (like bjs) to do load balancing. Vaclav Hanzl
From: <ha...@no...> - 2005-03-12 15:21:02
> I set up a cluster with Clustermatic 5 and run multiple > simulations on the master but I do not see the master distribute > jobs to slave nodes. All jobs run on the master. Bproc is not OpenMosix. It is not supposed to migrate those jobs by itself, and in fact it cannot. > Although I'm able to run simulation on slave node by using explicitly bpsh. Good. You are close. > Did I forget something in the configuration? Or Clustermatic 5 does > not automatically distribute jobs to slave nodes > or does not have load balancing capability? With bproc you need some job spooling system (like bjs) to do load balancing. Vaclav Hanzl
From: Nguyen, Vinh-T. <vt...@in...> - 2005-03-12 01:45:06
Hi, I set up a cluster with Clustermatic 5 and run multiple simulations on the master, but I do not see the master distribute jobs to the slave nodes. All jobs run on the master, although I'm able to run a simulation on a slave node by explicitly using bpsh. Did I forget something in the configuration? Or does Clustermatic 5 not automatically distribute jobs to slave nodes, or not have load balancing capability? Thanks, Vinh-Thach Nguyen
From: Greg W. <gw...@la...> - 2005-03-07 18:04:07
Mitch, Sorry for the delay in getting back to you, I've been traveling for the last week. Answers below. On Feb 28, 2005, at 4:46 PM, Mitch Roberts wrote: > Hi Greg, > Hope the public holiday went well. If you have time could I bother you > for some information please > - Is there some additional documentation for clustermatic 5? I would > like not to bother you too much if there is something that I should have > read ;) Not really. Have you taken a look at the Clustermatic tutorial slides (the CD image is on the web site)? This provides a pretty good overview of how to set up the system. Other than that, the bproc mailing list <bpr...@li...> is a good forum for asking questions. > - trying to get NFS or local disk working from within > /etc/clustermatic/fstab > - what are the supported filesystem types (ext2 or ext3 etc)? Clustermatic is not fussy about filesystems. It should support any format that is available to the kernel. > - can you use a remote NFS server other than the head node? Yes, you should be able to provide the IP address of the server in /etc/clustermatic/fstab, instead of using MASTER. > - do I have to construct the stage 2 boot with the nfs and > ext3 capabilities (see below) > beoboot -2 -i -o /tftpboot/node --plugin XXXXX > where XXXXX is things like NFS, ext3, etc? Provided there's ext3 support in the kernel (or an ext3 module), it should be automatically loaded. NFS is enabled using the 'kmod nfs' line in node_up.conf. You shouldn't need to use the plugin directive. > To date: > We have installed clustermatic 5 and have a 4 node cluster running > happily using PXE boot > Problems today: > - can't mount the SATA internal disks of the nodes as swap or as a > scratch filesystem Are you getting an error message? > Tasks currently working on: > - Installing supermon and integrating into external stats collection > and monitoring > - Trying to hand integrate > beoboot > bproc > bjs > into the new 2.6.10 kernel to better understand how clustermatic works Good luck! We have someone here doing more 2.6 kernel integration. If you get stuck I can give you his contact details. Regards, Greg
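For readers wondering what those two files might contain, here is a rough sketch. The 'kmod nfs' line comes from Greg's answer above, but the exact /etc/clustermatic/fstab field layout, the mount options shown, and the second server address are assumptions modeled on ordinary fstab syntax, so treat this as illustrative only:

    # /etc/clustermatic/fstab (sketch, assumed standard fstab-style fields)
    # MASTER stands for the head node; per Greg's reply, a different NFS
    # server can be named by IP address instead.
    MASTER:/home           /home     nfs  rsize=8192,wsize=8192,nolock  0 0
    192.168.50.2:/scratch  /scratch  nfs  ro,nolock                     0 0

    # /etc/clustermatic/node_up.conf (fragment)
    # load the NFS client module on each node so the mounts above can succeed
    kmod nfs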
From: Rene S. <rs...@tu...> - 2005-03-04 20:02:51
Hi list, We also need a full featured queuing system with support for priorities, preemption, etc. BJS is just not going to cut it. We called LSF a few weeks ago and unfortunately LSF does not support the combination of kernel 2.6/x86_64/bproc-4 at this time. The good news is that they promised to release a supported version in their next upcoming release around April/May. The bad thing is we need lots of $$$$ to get it, which is hard to come by at an EDU site. It would be nice to have a good "full featured" scheduler/queuing system for bproc. After all, this is (at least at our site) the main and only way our users interact with the cluster. Torque looks like it can work. At least it compiles and runs well on our head node, but I can't quite figure out what or how to get it to tie in with bproc/bpsh to get cluster status and run jobs. Rene On Fri, 4 Mar 2005, Greg Watson wrote: > There already is bproc support for LSF, so it's definitely possible to adapt > existing job schedulers to work with bproc. Adding support to Torque should be > relatively straightforward, provided Torque provides this kind of > extensibility. I believe that someone was attempting to add bproc support to > PBS, but I don't know the status of that project. > > Greg > > On Mar 4, 2005, at 9:35 AM, Dale Harris wrote: > > Something I've been thinking about is schedulers on bproc and how the > > situation still really sucks. I have MPI applications which work fine > > on bproc, but as soon as I try to submit them via BJS, they fail... so > > it's unusable. BJS doesn't support LAM, either. And given the general > > lack of documentation and support it doesn't really seem worth working > > on BJS to make it better. So I'm thinking more along the lines of > > turning to Torque and making it bproc aware. There is a much more active > > developer base for Torque, given it's a split off from PBS. One big > > problem, however: I'm not really a very experienced developer. So I'm > > wondering if anyone else has had any thoughts about this. Whether they > > think it's a good path? How hard would it be to do? > > > > One thing that kind of perked my ears recently was that OpenMPI was > > going to be scheduler aware, but I think they only mentioned PBS, which > > I would assume would include Torque. So it would be nice to have bproc, > > OpenMPI and Torque (with Maui even) being all tightly integrated. To > > quote Cartman, it'd be "kickass". > > > > -- > > Dale Harris > > ro...@ma... > > /.-)
From: Greg W. <gw...@la...> - 2005-03-04 16:51:53
There already is bproc support for LSF, so it's definitely possible to adapt existing job schedulers to work with bproc. Adding support to Torque should be relatively straightforward, provided Torque provides this kind of extensibility. I believe that someone was attempting to add bproc support to PBS, but I don't know the status of that project. Greg On Mar 4, 2005, at 9:35 AM, Dale Harris wrote: > Something I've been thinking about is schedulers on bproc and how the > situation still really sucks. I have MPI applications which work fine > on bproc, but as soon as I try to submit them via BJS, they fail... so > it's unusable. BJS doesn't support LAM, either. And given the general > lack of documentation and support it doesn't really seem worth working > on BJS to make it better. So I'm thinking more along the lines of > turning to Torque and making it bproc aware. There is a much more active > developer base for Torque, given it's a split off from PBS. One big > problem, however: I'm not really a very experienced developer. So I'm > wondering if anyone else has had any thoughts about this. Whether they > think it's a good path? How hard would it be to do? > > One thing that kind of perked my ears recently was that OpenMPI was > going to be scheduler aware, but I think they only mentioned PBS, which > I would assume would include Torque. So it would be nice to have bproc, > OpenMPI and Torque (with Maui even) being all tightly integrated. To > quote Cartman, it'd be "kickass". > > -- > Dale Harris > ro...@ma... > /.-)
From: Dale H. <ro...@ma...> - 2005-03-04 16:36:02
Something I've been thinking about is schedulers on bproc and how the situation still really sucks. I have MPI applications which work fine on bproc, but as soon as I try to submit them via BJS, they fail... so it's unusable. BJS doesn't support LAM, either. And given the general lack of documentation and support, it doesn't really seem worth working on BJS to make it better. So I'm thinking more along the lines of turning to Torque and making it bproc aware. There is a much more active developer base for Torque, given it's a split off from PBS. One big problem, however: I'm not really a very experienced developer. So I'm wondering if anyone else has had any thoughts about this. Whether they think it's a good path? How hard would it be to do? One thing that kind of perked my ears recently was that OpenMPI was going to be scheduler aware, but I think they only mentioned PBS, which I would assume would include Torque. So it would be nice to have bproc, OpenMPI and Torque (with Maui even) all tightly integrated. To quote Cartman, it'd be "kickass". -- Dale Harris ro...@ma... /.-)
From: Luke S. <lu...@ac...> - 2005-03-02 19:34:49
I'm using bproc in a clustermatic 5 setup, and am having trouble with NFS. I have an SMP system as the master, and single processor machines as the clients.

    MASTER=192.168.50.1
    # First mount our no-locking, read-only file systems
    for i in $MASTER:/bin $MASTER:/usr/bin $MASTER:/usr/lib $MASTER:/usr/share \
             $MASTER:/usr/X11R6 $MASTER:/opt 192.168.44.232:/usr/local/bin; do
        lcl=`echo $i | perl -e '$a=<>; $a=~/.+:(.+$)/; print $1'`
        echo $lcl
        bpsh -n $NODE /bin/mkdir -p $lcl
        until bpsh -n $NODE /bin/mount -t nfs -o rsize=8192,wsize=8192,nolock,ro $i $lcl ; do
            sleep 3
        done
    done

This seems to work fine if I boot the master with a non-SMP kernel, but not with an SMP kernel. Is there any way to keep the SMP kernel and get the NFS mounts to work automatically on node boot? Luke Scheirer
From: Yinghai L. <yh...@ty...> - 2005-02-24 01:05:40
Why use NFS with bproc? YH

From: bpr...@li... [mailto:bpr...@li...] On Behalf Of bc...@au... Sent: Saturday, February 19, 2005 2:55 PM To: bpr...@li... Subject: [BProc] NFS and bproc

Hello all, I'm running into a problem with clustermatic 5, FC3, and NFS. All nodes NFS mounted successfully. However, I see the node_up in the ps output even though I have no problem accessing /home (exportfs) from n16, and it will not go away.

    root 7554 2268 0 17:16 ? 00:00:00 /bin/sh /etc/clustermatic/node_up 16
    root 7582 7554 0 17:16 ? 00:00:00 bpsh 16 mount -t nfs 172.17.100.1:/home /home
    root 7663 2268 0 17:19 ? 00:00:00 /bin/sh /etc/clustermatic/node_up 16
    root 7691 7663 0 17:19 ? 00:00:00 bpsh 16 mount -t nfs 172.17.100.1:/home /home

Thanks in advance for your help, Brady
From: YhLu <Yh...@ty...> - 2005-02-24 01:03:03
Just use the MPI from the Bproc CM5 CD to run that. YH

From: bc...@au... [mailto:bc...@au...] Sent: Sunday, February 20, 2005 4:15 PM To: bpr...@li... Subject: [BProc] HPL and Bproc

Hello list, Just wondering if there is a special version of HPL for running under clustermatic (BProc). Thanks, Brady
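For anyone looking for a starting point, a rough sketch of how such a run might be submitted. This assumes the MPI shipped on the CM5 CD provides a conventional mpirun with the usual -np flag and that HPL was built as the standard xhpl binary; neither detail comes from this thread, so check the CD's own documentation:

    # hypothetical HPL submission through bjs (process count and runtime made up)
    bjssub -p default -s 7200 'mpirun -np 16 ./xhpl'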
From: Matthew O'C. <moc...@is...> - 2005-02-23 14:54:31
I am using the bproc-4.0.0pre7 for the 2.6.8.1 kernel, and in both that version and the pre8 for 2.6.9, master.c (for bpmaster) uses epoll quite extensively. Since Slackware is its own beast of sorts, I patch the kernel, rebuild, and then compile the bproc services. It is at that time that the epoll issue becomes relevant. I have managed to find an epoll library implementation on a really, REALLY obscure web page, and that makes me think that perhaps I am not using the correct version of bproc for my kernel versions of choice. What's even more nasty is that the only places I have really encountered epoll are on kernel development pages - evidently part of the 2.5 development (which I thought gave rise to the 2.6 release series). Ultimately, the ClusterMatic5 CD has source RPMs that I investigated...and those use epoll as well!! (It also does not seem to carry the epoll library.) So, should I not be attempting to use these 4.0 prerelease versions? Incidentally, they seem to work marvelously on the few tests I have run... ...matt > Message: 1 > Date: Tue, 22 Feb 2005 00:02:57 -0500 > From: Matthew O'Connor <moc...@is...> > To: bpr...@li... > Subject: [BProc] epoll library? > > Hello! > I am installing bproc on Slackware boxes using the 2.6.9 and 2.6.8.1 > kernels from kernel.org (homogeneously of course, depending on which one > happens to be on more machines). The problem I consistently run into is > that libepoll is not found! > > I worked around it by locating the library source on the net and > installing it, and then I had to modify the Makefile in > bproc-???/daemons for bpmaster to link successfully - is there something > I'm missing? The location of the library source seemed obscure and > rather difficult to find, which makes me think I'm missing something and > should not have to do the above steps. Everything works in the end, but > it just seems a bit strange that what seems to be an integral library is > not included with the kernel source... Any thoughts?? > > Thanks!! > ...matt > > Message: 2 > Date: Tue, 22 Feb 2005 08:58:43 -0800 > From: Dale Harris <ro...@ma...> > To: bpr...@li... > Subject: Re: [BProc] epoll library? > > On Tue, Feb 22, 2005 at 12:02:57AM -0500, Matthew O'Connor elucidated: > >>Hello! >>I am installing bproc on Slackware boxes using the 2.6.9 and 2.6.8.1 >>kernels from kernel.org (homogeneously of course, depending on which one >>happens to be on more machines). The problem I consistently run into is >>that libepoll is not found! >> > > I'm not sure what that library would be used for. I have bproc > installed on a Debian system, but I don't have that library anywhere. > > Are you compiling it from scratch? I'd recommend doing that on a > Slackware system. It's what I did on my system, hopefully I'll get > around to making debs for bproc one of these days. > > Dale
From: Dale H. <ro...@ma...> - 2005-02-22 16:58:49
On Tue, Feb 22, 2005 at 12:02:57AM -0500, Matthew O'Connor elucidated: > Hello! > I am installing bproc on Slackware boxes using the 2.6.9 and 2.6.8.1 > kernels from kernel.org (homogeneously of course, depending on which one > happens to be on more machines). The problem I consistently run into is > that libepoll is not found! > I'm not sure what that library would be used for. I have bproc installed on a Debian system, but I don't have that library anywhere. Are you compiling it from scratch? I'd recommend doing that on a Slackware system. It's what I did on my system; hopefully I'll get around to making debs for bproc one of these days. Dale
From: Matthew O'C. <moc...@is...> - 2005-02-22 05:03:11
Hello! I am installing bproc on Slackware boxes using the 2.6.9 and 2.6.8.1 kernels from kernel.org (homogeneously of course, depending on which one happens to be on more machines). The problem I consistently run into is that libepoll is not found! I worked around it by locating the library source on the net and installing it, and then I had to modify the Makefile in bproc-???/daemons for bpmaster to link successfully - is there something I'm missing? The location of the library source seemed obscure and rather difficult to find, which makes me think I'm missing something and should not have to do the above steps. Everything works in the end, but it just seems a bit strange that what seems to be an integral library is not included with the kernel source... Any thoughts?? Thanks!! ...matt