From: Kimitoshi T. <kt...@cl...> - 2004-04-04 19:49:38

er...@he... wrote:
>On Thu, Apr 01, 2004 at 03:04:29AM +0900, Kimitoshi Takahashi wrote:
>> Hi all,
>>
>> From what I read in the Bproc documents, process migration is voluntary,
>> meaning bproc_move() must be called from the process to be moved.
>>
>> The lovely bpsh seems to wrap a non-bproc program and cause the program to move involuntarily,
>> using bproc_vexecmove(), only at the beginning.
>>
>> I'm wondering if there is any way to cause a non-bproc process to move involuntarily at any time,
>> at the user's will.
>>
>> My colleague uses a heterogeneous cluster where the memory sizes on nodes vary.
>> He sometimes wants to move a small process on a large-memory machine
>> before he starts an obviously huge process. He is only using bpsh to start processes.
>>
>> Is it technically feasible to write a bpsh-like tool which always wraps a process on slave nodes,
>> and handles a "move now to where" signal ?
>>
>> How would you deal with the situation my colleague has ?
>
>I think a "wrapper" could take the form of a shared library.  You
>could manually LD_PRELOAD yourself or you could modify bpsh to
>automatically set LD_PRELOAD for the child processes.

I'm afraid I don't fully understand what you meant; I probably need to learn more about the basics of C programming. My guess is that the signal handler is in libc, and you suggested preloading a signal handler which calls bproc_move() when it gets a certain signal. Is that what you meant?

>A signal seems like a good way to get the process's attention but you
>still need another way to tell it where to move to.  I can't think of
>anything easy for that off the top of my head.

How about making it a two-step process:

1. When a process gets a certain signal, it VMAdumps itself to a network stream, and bpmaster stores the dump in a file on the master.
2. You can then manually restart the process, explicitly specifying where it should move.

It's not cool in that the process migration is not peer to peer; rather, it is origin-master-target.
This could also be used as general checkpoint/restart functionality.

I have no talent in programming, though, :-(

Sincerely,
Kimitoshi Takahashi

From: <ha...@no...> - 2004-04-03 20:51:15

Hi Luke,

I never tried Mosix and I never had a 1GB process in my small bproc-based cluster, but I guess bproc users could enjoy more stability simply because bproc does not do many of the complicated things done in Mosix.

I do not know if you are aware of the functionality delivered by bproc - people often expect it to be much more Mosix-like than it is, just because it also migrates processes. On the other hand, it is quite possible that bproc will make former Mosix users happy, because they might be willing to accept slightly bigger mental involvement than before to get more stability than before.

To summarise briefly (either for you or for others reading this in the archives later on), a bproc system usually migrates a process only once, from the master node to one of the slave nodes, just after process creation. The main motivation for migration is simplified cluster setup - slave nodes can contain nearly nothing; they can even be so simple that the process could not be started on them (and has to be migrated there). Or the slaves can be more complete, but the processes might still be created on the master just to make it easier to control them.

Load balancing can still be done with bproc, though the decision about where a process should run is made at process startup only. Balancing decisions must be made by a batch spooling system (and not bproc itself) - we use SGE (which is rather unusual with bproc but works fine); there are others like bjs (which I never tried).

HTH

Vaclav Hanzl

From: Luke P. <lop...@wi...> - 2004-04-03 19:13:24

Hi everyone,

I administer a small cluster of 8 dual-Xeon machines. I support about ten users, who keep the cluster continuously busy with long-running, large-memory jobs. We have been using openMosix for some time now. It is a really cool system, but for us it crashes constantly. I am considering switching to bproc or another load balancing system.

Can anyone comment on the stability of bproc under heavy loads from ~1GB processes?

Thanks
-Luke

From: <sk...@em...> - 2004-04-01 11:25:27

Erik,

I installed clustermatic 4 on a fresh install of RedHat 9 (kernel 2.4.20-6). I'm having a strange problem, though. It seems that I can no longer access the outside internet from the master node when using the bproc-modified kernels. I can access our internal company network, and obviously the closed cluster network, but I can't seem to get past our gateway without using an internal proxy server.

This only occurs when I boot into the CM kernels. Booting with the stock kernels seems to fix the problem, so I don't think it's my network settings. I've got eth0 set up for the cluster, and eth1 connects to the company network.

Any thoughts as to why this might be happening?

-Steve Kearns

From: <er...@he...> - 2004-03-31 20:08:17

On Thu, Apr 01, 2004 at 03:04:29AM +0900, Kimitoshi Takahashi wrote:
> Hi all,
>
> From what I read in the Bproc documents, process migration is voluntary,
> meaning bproc_move() must be called from the process to be moved.
>
> The lovely bpsh seems to wrap a non-bproc program and cause the program to move involuntarily,
> using bproc_vexecmove(), only at the beginning.
>
> I'm wondering if there is any way to cause a non-bproc process to move involuntarily at any time,
> at the user's will.
>
> My colleague uses a heterogeneous cluster where the memory sizes on nodes vary.
> He sometimes wants to move a small process on a large-memory machine
> before he starts an obviously huge process. He is only using bpsh to start processes.
>
> Is it technically feasible to write a bpsh-like tool which always wraps a process on slave nodes,
> and handles a "move now to where" signal ?
>
> How would you deal with the situation my colleague has ?

I think a "wrapper" could take the form of a shared library.  You
could manually LD_PRELOAD yourself or you could modify bpsh to
automatically set LD_PRELOAD for the child processes.

A signal seems like a good way to get the process's attention but you
still need another way to tell it where to move to.  I can't think of
anything easy for that off the top of my head.

- Erik

From: <er...@he...> - 2004-03-31 18:33:15

On Mon, Mar 29, 2004 at 08:01:54PM +0100, Gerben Roest wrote:
> Hi all,
>
> I want to know how to test BProc's functionality. I have a
> cpu-time-consuming little c program, which I want to run on the master and
> hopefully see it running on the slave (I have a 2-node cluster for this to
> test).
>
> On the master the daemons bpmaster and bpslave both run (the master can
> join if he likes) and on another node only bpslave runs.
> Doing a "bpstat" I get:
>
> Node(s)   Status   Mode         User   Group
> 0-1       up       ---x--x--x   root   root
>
> but when I run the program twice, they both run on the master, when I look
> at it with "beotop". I mean, one should run on the master and one on the
> slave, right?
> Or do I have to run one on the master and run the other on the slave, by
> hand? That's not so fancy.. I can do that with normal rsh or ssh as well.

Stuff doesn't automatically move to different nodes on a BProc system.
You need to place the processes on nodes manually.  bpsh is usually an
easy way to do this.  After that a process can move itself using the
BProc API.  (bproc_move(), etc.)

- Erik

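Since placement is manual, a launcher has to decide which node each process goes to. The sketch below shows one simple way to do that, assuming round-robin order over node numbers 0..nnodes-1 and the "bpsh <node> <command>" invocation form. The helper names next_node() and format_bpsh() are made up for this illustration and are not part of BProc.

```c
#include <stdio.h>
#include <stddef.h>

/* Pick the next node in round-robin order; *cursor persists between calls. */
static int next_node(int *cursor, int nnodes)
{
    int node = *cursor;
    *cursor = (*cursor + 1) % nnodes;
    return node;
}

/* Format "bpsh <node> <command>" into out.
 * Returns 0 on success, -1 if the buffer is too small. */
static int format_bpsh(char *out, size_t outlen, int node, const char *command)
{
    int n = snprintf(out, outlen, "bpsh %d %s", node, command);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

With two nodes, successive calls yield command lines for node 0, node 1, node 0 and so on. The resulting string could be handed to a shell. A batch scheduler such as BJS or SGE would normally make this decision instead.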
From: Kimitoshi T. <kt...@cl...> - 2004-03-31 18:04:48

Hi all,

From what I read in the Bproc documents, process migration is voluntary, meaning bproc_move() must be called from the process to be moved.

The lovely bpsh seems to wrap a non-bproc program and cause the program to move involuntarily, using bproc_vexecmove(), only at the beginning.

I'm wondering if there is any way to cause a non-bproc process to move involuntarily at any time, at the user's will.

My colleague uses a heterogeneous cluster where the memory sizes on nodes vary. He sometimes wants to move a small process on a large-memory machine before he starts an obviously huge process. He is only using bpsh to start processes.

Is it technically feasible to write a bpsh-like tool which always wraps a process on slave nodes, and handles a "move now to where" signal ?

How would you deal with the situation my colleague has ?

Sincerely,
Kimitoshi Takahashi

From: Gerben R. <g....@li...> - 2004-03-29 18:02:09

Hi all,

I want to know how to test BProc's functionality. I have a cpu-time-consuming little c program, which I want to run on the master and hopefully see it running on the slave (I have a 2-node cluster for this to test).

On the master the daemons bpmaster and bpslave both run (the master can join if he likes) and on another node only bpslave runs.
Doing a "bpstat" I get:

Node(s)   Status   Mode         User   Group
0-1       up       ---x--x--x   root   root

but when I run the program twice, they both run on the master, when I look at it with "beotop". I mean, one should run on the master and one on the slave, right?
Or do I have to run one on the master and run the other on the slave, by hand? That's not so fancy.. I can do that with normal rsh or ssh as well.

regards,
Gerben Roest.

--
Linvision HPC BV          tel: +31-15-7502310
Elektronicaweg 16 d       fax: +31-15-7502319
2628 XG Delft             g....@li...
The Netherlands           www.linvision.com

From: <er...@he...> - 2004-03-22 18:23:59

On Fri, Mar 19, 2004 at 04:19:29PM -0800, Dale Harris wrote:
> On Fri, Mar 19, 2004 at 11:01:53AM -0700, er...@he... elucidated:
> > On Wed, Mar 17, 2004 at 11:12:08PM -0500, Daniel Gruner wrote:
> > > Thanks, Erik.  It works just fine.  The issue with the interactive
> > > jobs is also fixed.
> > >
> > > I guess it is worth at least a minor revision number...  :-)
> >
> > Ok, it's up on sourceforge as BJS 1.5.  There's only one tiny little
> > change.  I changed "BJS_JOBID" to just JOBID since other schedulers on
> > BProc might want to emulate the interface.
> >
> > http://sourceforge.net/project/showfiles.php?group_id=24453&package_id=55872&release_id=224841
> >
>
>
> Does that version of BJS support 3.2.6?

It doesn't.  A similar patch should be very easy to do though.  I can
try and isolate the bits related to setting JOBID if anybody wants to
take a crack at it.

- Erik

From: Daniel G. <dg...@ti...> - 2004-03-20 03:57:06

On Fri, Mar 19, 2004 at 04:48:56PM -0700, er...@he... wrote:
> On Fri, Mar 19, 2004 at 02:02:32PM -0500, Daniel Gruner wrote:
> > On Fri, Mar 19, 2004 at 11:01:53AM -0700, er...@he... wrote:
> > > On Wed, Mar 17, 2004 at 11:12:08PM -0500, Daniel Gruner wrote:
> > > > Thanks, Erik.  It works just fine.  The issue with the interactive
> > > > jobs is also fixed.
> > > >
> > > > I guess it is worth at least a minor revision number...  :-)
> > >
> > > Ok, it's up on sourceforge as BJS 1.5.  There's only one tiny little
> > > change.  I changed "BJS_JOBID" to just JOBID since other schedulers on
> > > BProc might want to emulate the interface.
> > >
> > > http://sourceforge.net/project/showfiles.php?group_id=24453&package_id=55872&release_id=224841
> > >
> > > - Erik
> >
> > Great!  It seems to be working fine for me.  Both for interactive and for
> > batch jobs.  I will get the latest from sourceforge.
> >
> > I have a question:  How does bjs deal with multiprocessor nodes?  I have so
> > far only run it on clusters of uniprocessors, but I would like to install
> > it on my other clusters, which have dual-cpu machines.
>
> As far as BJS and BProc are concerned "we're renting rooms."  The room
> in this case is the node.  You ask for nodes, if it's got multiple
> CPUs, it's up to you to do something sensible with both.

It would be really nice if we could specify "how big the rooms are", so that the scheduler could send more than one job per machine if it is defined as having several cpus. One way could be to think of repeated nodes out of a list, or a variant on the simple or filler policies that take a variable of "jobsPerNode", or whatever. Is this feasible (easily feasible, I mean)?

>
> > I am using the simple policy, since I don't ever want more processes running
> > on the nodes than there are processors (well, at least for now).  Can you tell
> > me a bit more about the filler policy?
>
> Filler is a simple back filler.  It will let later jobs run first if
> there's a big enough hole.  For example job A is running using 30/32
> nodes and has 100 sec left.  job B wants 32/32.  Job C after that one
> only wants 2 for 90 sec.  "filler" will see that that fits in the hole
> and start it.  "simple" only starts stuff in order.

Ok, neat!  Sounds like the one to use if you have a mixture of serial and parallel jobs.

--
Dr. Daniel Gruner                  dg...@ti...
Dept. of Chemistry                 dan...@ut...
University of Toronto              phone: (416)-978-8689
80 St. George Street               fax:   (416)-978-5325
Toronto, ON M5S 3H6, Canada        finger for PGP public key

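The "repeated nodes out of a list" idea from this message can be illustrated with a small sketch. It assumes a comma-separated node list like the one BJS places in NODES, and a hypothetical slots_per_node setting; nothing like this exists in BJS according to the thread, so the function name and behaviour are purely illustrative.

```c
#include <stdio.h>
#include <stddef.h>

/* Write each node number slots_per_node times, comma-separated, into out.
 * Returns 0 on success, -1 if the buffer is too small. */
static int expand_slots(const int *nodes, size_t nnodes, int slots_per_node,
                        char *out, size_t outlen)
{
    size_t used = 0;

    if (outlen == 0)
        return -1;
    out[0] = '\0';

    for (size_t i = 0; i < nnodes; i++) {
        for (int s = 0; s < slots_per_node; s++) {
            /* Prefix every entry except the first with a comma. */
            int n = snprintf(out + used, outlen - used, "%s%d",
                             used ? "," : "", nodes[i]);
            if (n < 0 || (size_t)n >= outlen - used)
                return -1;
            used += (size_t)n;
        }
    }
    return 0;
}
```

For two dual-CPU nodes 0 and 1 this produces a list with four entries, so a job script that starts one process per list entry would fill both CPUs of each node.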
From: Dale H. <ro...@ma...> - 2004-03-20 00:19:30

On Fri, Mar 19, 2004 at 11:01:53AM -0700, er...@he... elucidated:
> On Wed, Mar 17, 2004 at 11:12:08PM -0500, Daniel Gruner wrote:
> > Thanks, Erik.  It works just fine.  The issue with the interactive
> > jobs is also fixed.
> >
> > I guess it is worth at least a minor revision number...  :-)
>
> Ok, it's up on sourceforge as BJS 1.5.  There's only one tiny little
> change.  I changed "BJS_JOBID" to just JOBID since other schedulers on
> BProc might want to emulate the interface.
>
> http://sourceforge.net/project/showfiles.php?group_id=24453&package_id=55872&release_id=224841
>

Does that version of BJS support 3.2.6?

Dale

From: <er...@he...> - 2004-03-20 00:12:38

On Fri, Mar 19, 2004 at 02:02:32PM -0500, Daniel Gruner wrote:
> On Fri, Mar 19, 2004 at 11:01:53AM -0700, er...@he... wrote:
> > On Wed, Mar 17, 2004 at 11:12:08PM -0500, Daniel Gruner wrote:
> > > Thanks, Erik.  It works just fine.  The issue with the interactive
> > > jobs is also fixed.
> > >
> > > I guess it is worth at least a minor revision number...  :-)
> >
> > Ok, it's up on sourceforge as BJS 1.5.  There's only one tiny little
> > change.  I changed "BJS_JOBID" to just JOBID since other schedulers on
> > BProc might want to emulate the interface.
> >
> > http://sourceforge.net/project/showfiles.php?group_id=24453&package_id=55872&release_id=224841
> >
> > - Erik
>
> Great!  It seems to be working fine for me.  Both for interactive and for
> batch jobs.  I will get the latest from sourceforge.
>
> I have a question:  How does bjs deal with multiprocessor nodes?  I have so
> far only run it on clusters of uniprocessors, but I would like to install
> it on my other clusters, which have dual-cpu machines.

As far as BJS and BProc are concerned "we're renting rooms."  The room
in this case is the node.  You ask for nodes, if it's got multiple
CPUs, it's up to you to do something sensible with both.

> I am using the simple policy, since I don't ever want more processes running
> on the nodes than there are processors (well, at least for now).  Can you tell
> me a bit more about the filler policy?

Filler is a simple back filler.  It will let later jobs run first if
there's a big enough hole.  For example job A is running using 30/32
nodes and has 100 sec left.  job B wants 32/32.  Job C after that one
only wants 2 for 90 sec.  "filler" will see that that fits in the hole
and start it.  "simple" only starts stuff in order.

- Erik

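The backfill decision described in this message boils down to a check on two quantities: how many nodes are idle right now, and how long it will be before the blocked job at the head of the queue could start. The sketch below is not the BJS implementation; struct job and fits_in_hole() are hypothetical names used only to restate the example with jobs A, B and C.

```c
#include <stdbool.h>

/* A queued job's request: node count and run time in seconds. */
struct job {
    int nodes;
    int seconds;
};

/* free_nodes:   nodes idle right now
 * hole_seconds: time until the blocked head-of-queue job could start
 * Returns true if the job can run in the hole without delaying the head job. */
static bool fits_in_hole(const struct job *j, int free_nodes, int hole_seconds)
{
    return j->nodes <= free_nodes && j->seconds <= hole_seconds;
}
```

In the example from the message, job A leaves 2 of 32 nodes free for 100 seconds while job B waits for all 32. Job C (2 nodes, 90 seconds) passes the check and may start early. A job that asks for 2 nodes for 120 seconds would not, because it would still be running when B could otherwise start.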
From: Daniel G. <dg...@ti...> - 2004-03-19 19:02:49

On Fri, Mar 19, 2004 at 11:01:53AM -0700, er...@he... wrote:
> On Wed, Mar 17, 2004 at 11:12:08PM -0500, Daniel Gruner wrote:
> > Thanks, Erik.  It works just fine.  The issue with the interactive
> > jobs is also fixed.
> >
> > I guess it is worth at least a minor revision number...  :-)
>
> Ok, it's up on sourceforge as BJS 1.5.  There's only one tiny little
> change.  I changed "BJS_JOBID" to just JOBID since other schedulers on
> BProc might want to emulate the interface.
>
> http://sourceforge.net/project/showfiles.php?group_id=24453&package_id=55872&release_id=224841
>
> - Erik

Great!  It seems to be working fine for me.  Both for interactive and for batch jobs.  I will get the latest from sourceforge.

I have a question:  How does bjs deal with multiprocessor nodes?  I have so far only run it on clusters of uniprocessors, but I would like to install it on my other clusters, which have dual-cpu machines.

I am using the simple policy, since I don't ever want more processes running on the nodes than there are processors (well, at least for now).  Can you tell me a bit more about the filler policy?

Daniel

--
Dr. Daniel Gruner                  dg...@ti...
Dept. of Chemistry                 dan...@ut...
University of Toronto              phone: (416)-978-8689
80 St. George Street               fax:   (416)-978-5325
Toronto, ON M5S 3H6, Canada        finger for PGP public key

From: <er...@he...> - 2004-03-19 18:28:49

On Wed, Mar 17, 2004 at 11:12:08PM -0500, Daniel Gruner wrote:
> Thanks, Erik.  It works just fine.  The issue with the interactive
> jobs is also fixed.
>
> I guess it is worth at least a minor revision number...  :-)

Ok, it's up on sourceforge as BJS 1.5.  There's only one tiny little
change.  I changed "BJS_JOBID" to just JOBID since other schedulers on
BProc might want to emulate the interface.

http://sourceforge.net/project/showfiles.php?group_id=24453&package_id=55872&release_id=224841

- Erik

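A job that wants a per-job working directory, as requested earlier in the thread, could read the variable as in the sketch below. It assumes JOBID is set by BJS 1.5 as described in this message and falls back to BJS_JOBID from the earlier patch. The "job-<id>" directory naming and both function names are invented for the illustration.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Return the job ID from the environment, or NULL if none is set.
 * JOBID is tried first; BJS_JOBID is the name used by the earlier patch. */
static const char *current_job_id(void)
{
    const char *id = getenv("JOBID");
    if (id == NULL || id[0] == '\0')
        id = getenv("BJS_JOBID");
    return (id != NULL && id[0] != '\0') ? id : NULL;
}

/* Create <base>/job-<id> (hypothetical naming) and store the path in out.
 * Returns 0 on success, -1 on any failure. */
static int make_job_dir(const char *base, char *out, size_t outlen)
{
    const char *id = current_job_id();
    if (id == NULL)
        return -1;

    int n = snprintf(out, outlen, "%s/job-%s", base, id);
    if (n < 0 || (size_t)n >= outlen)
        return -1;

    /* An already existing directory is not treated as an error. */
    if (mkdir(out, 0700) != 0 && errno != EEXIST)
        return -1;
    return 0;
}
```

As the thread notes, job IDs may be re-used after bjs is restarted, so a directory left over from an earlier job with the same ID is possible. The sketch deliberately tolerates an existing directory rather than failing.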
From: Daniel G. <dg...@ti...> - 2004-03-18 04:13:25

Thanks, Erik.  It works just fine.  The issue with the interactive jobs is also fixed.

I guess it is worth at least a minor revision number...  :-)

Regards,
Daniel

On Wed, Mar 17, 2004 at 03:22:23PM -0700, er...@he... wrote:
> On Wed, Mar 17, 2004 at 04:45:55PM -0500, Daniel Gruner wrote:
> > Hi Erik,
> >
> > Would you care to actually send me your proposed diffs?  I have been
> > trying to make sense of your code, but... :-)
>
> Ok, ok :)  I had my head in BJS today anyway for another reason.  It
> turned out to be a little more involved than what I said below because
> bjs wasn't even sending the job ID to bjssub in the interactive case.
> There's a few other bug fixes in there too.  See the ChangeLog part of
> the diff for details on that.  The environment variable is
> "BJS_JOBID".  Note that BJS_JOBIDs may be re-used if bjs is restarted.
>
> - Erik
>
> > On Wed, Mar 17, 2004 at 12:46:15PM -0700, er...@he... wrote:
> > > On Wed, Mar 17, 2004 at 11:36:56AM -0500, Daniel Gruner wrote:
> > > > HI
> > > >
> > > > Is there an environment variable setup by bjs such that the submitted
> > > > script can know its id?  I know the $NODES variable is set by bjs, but
> > > > it would be useful to also have the JOBID set in the environment, so that
> > > > one can set up individual directories based on the JOBID, etc.
> > >
> > > It doesn't.  That's a good idea though.  I think it could be added
> > > with two lines in bjs.c:bjs_job_environment() and the analagous spot
> > > (where NODES is set) in bjssub.c.
> > >
> > > - Erik
> >
>
> Index: ChangeLog
> ===================================================================
> RCS file: /home/repository/bjs/ChangeLog,v
> retrieving revision 1.7
> retrieving revision 1.10
> diff -u -r1.7 -r1.10
> --- ChangeLog 10 Nov 2003 19:48:15 -0000 1.7
> +++ ChangeLog 17 Mar 2004 22:41:45 -0000 1.10
> @@ -1,3 +1,16 @@
> +Changes from 1.4 to
> +
> + * Fixed signal setup for batch mode jobs started by the daemons.
> +   Signal handling options and signal masks are now properly reset to
> +   defaults for the child processes.
> +
> + * Fixed signal handling behavior in the daemon so that the daemon
> +   won't get slain by SIGPIPE.
> +
> + * Fixed a problem with node ranges not getting assigned properly.
> +
> + * Added BJS_JOBID environment variable.
> +
>  Changes from 1.3 to 1.4
>
>   * Updated to work reasonably on x86_64 machines.
> Index: bjs.c
> ===================================================================
> RCS file: /home/repository/bjs/bjs.c,v
> retrieving revision 1.27
> retrieving revision 1.30
> diff -u -r1.27 -r1.30
> --- bjs.c 10 Nov 2003 19:48:15 -0000 1.27
> +++ bjs.c 17 Mar 2004 22:41:45 -0000 1.30
> @@ -24,7 +24,7 @@
>   * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
>   * License for more detail.
>   *
> - * $Id: bjs.c,v 1.27 2003/11/10 19:48:15 hendriks Exp $
> + * $Id: bjs.c,v 1.30 2004/03/17 22:41:45 hendriks Exp $
>   *--------------------------------------------------------------------*/
>  #include <stdio.h>
>  #include <stdlib.h>
> @@ -318,10 +318,10 @@
>
>  static struct bproc_node_set_t clean_set = BPROC_EMPTY_NODESET;
>  void bjs_do_clean(void) {
> +#if 1
>      int i, j, nprocs, killed_one;
>      struct bproc_proc_info_t *plist;
>
> -
>      /* XXX It would be much better if we had the option of killing
>       * only the processes related to the job.  We're going to end up
>       * killing off mon, etc. here. */
> @@ -340,6 +340,9 @@
>
>          if (nprocs > 0) free(plist);
>      } while(killed_one);
> +#else
> +#warning "bjs_do_clean is commented out!!!!"
> +#endif
>
>      bproc_nodeset_free(&clean_set);
>  }
> @@ -435,6 +438,9 @@
>      tmp[len-1] = 0;
>      printf("NODES=\"%s\"\n", tmp); fflush(0);
>      setenv("NODES", tmp, 1);
> +
> +    sprintf(tmp, "%d", j->job_id);
> +    setenv("BJS_JOBID", tmp, 1);
>  }
>
>  static
> @@ -524,8 +530,18 @@
>          return -1;
>      }
>      if (pid == 0) {
> -        /* Close file descriptors for clients */
>          struct list_head *l;
> +        sigset_t sset;
> +
> +        /* Restore signal handling defaults */
> +        signal(SIGCHLD, SIG_DFL);
> +        signal(SIGHUP, SIG_DFL);
> +        signal(SIGPIPE, SIG_IGN);
> +
> +        sigfillset(&sset);
> +        sigprocmask(SIG_UNBLOCK, &sset, 0);
> +
> +        /* Close file descriptors for clients */
>          for (l=clients.next; l != &clients; l = l->next) {
>              struct client_t *c;
>              c = list_entry(l, struct client_t, list);
> @@ -572,6 +588,8 @@
>      char tmp[20];
>
>      sx = sexp_create_list("nodes", 0);
> +    sprintf(tmp, "%d", j->job_id);
> +    sexp_append_atom(sx, tmp);
>      for (l = j->nodes.next; l != &j->nodes; l = l->next) {
>          struct node_alloc_t *n = list_entry(l,struct node_alloc_t,nodes_list);
>          sprintf(tmp, "%d", bjs_node_idx[n->node]->node);
> @@ -1617,61 +1635,6 @@
>      return current_pool;
>  }
>
> -#if 0
> -static
> -int do_nodelist(char *str_, int **numlist_) {
> -    char *end1, *end2, *next, *str;
> -    int num1, num2, i;
> -    int numlist_len = 0;
> -    int *numlist = 0;
> -
> -    for (str = str_; *str; str = next) {
> -        /* Look for a number */
> -        num1 = strtol(str, &end1, 0);
> -        switch (*end1) {
> -        case 0:
> -            num2 = num1;
> -            next = end1;
> -            break;
> -        case ',':
> -            num2 = num1;
> -            next = end1+1;
> -            break;
> -        case '-':
> -            num2 = strtol(end1+1, &end2, 0);
> -            if (end2 == end1+1) {
> -                if (numlist) free(numlist);
> -                return -1;
> -            }
> -            switch (*end2) {
> -            case 0:
> -                next = end2;
> -                break;
> -            case ',':
> -                next = end2+1;
> -                break;
> -            default:
> -                if (numlist) free(numlist);
> -                return -1;
> -            }
> -            break;
> -        default:
> -            if (numlist) free(numlist);
> -            return -1;
> -        }
> -
> -        /* Fill in range */
> -        numlist = realloc_chk(numlist,
> -                              sizeof(int) * (numlist_len + num2 - num1 + 1));
> -        for (i = num1; i <= num2; i++)
> -            numlist[numlist_len++] = i;
> -    }
> -
> -    * numlist_ = numlist;
> -    return numlist_len;
> -}
> -#endif
> -
>  static
>  int config_nodes_callback(struct cmconf *cnf, char **args) {
>      int i, j;
> @@ -1694,14 +1657,12 @@
>              return -1;
>          }
>
> -        /*printf("%d %d %s\n", p->nnodes, ns2.size, args[i]);*/
> -
>          p->nodes = realloc_chk(p->nodes, sizeof(int) * (p->nnodes + ns2.size));
>          for (j=0; j < ns2.size; j++) {
>              /* XXX Do we want to sanity check node numbers at this
>               * point?  It seems that we need to be able to handle
>               * bogus node numbers. */
> -            p->nodes[p->nnodes++] = ns.node[j].node;
> +            p->nodes[p->nnodes++] = ns2.node[j].node;
>              /* We handle machine setup inside config_xfer */
>          }
>          bproc_nodeset_free(&ns);
> @@ -2398,6 +2359,7 @@
>  "Usage: %s [options...]\n"
>  " -h Print this message and exit.\n"
>  " -V Print version information and exit.\n"
> +" -v Increase verbose level.\n"
>  " -C file Read configuration from file (default=%s)\n"
>  , arg0, DEFAULT_CONFIG_FILE);
>  }
> @@ -2474,7 +2436,7 @@
>      sigprocmask(SIG_BLOCK, &sset, 0);
>      signal(SIGCHLD, signal_handler);
>      signal(SIGHUP, signal_handler);
> -    //signal(SIGPIPE, SIG_IGN);
> +    signal(SIGPIPE, SIG_IGN);
>
>      /*-- main select loop ------------------------------------------*/
>      while (1) {
> Index: bjssub.c
> ===================================================================
> RCS file: /home/repository/bjs/bjssub.c,v
> retrieving revision 1.12
> retrieving revision 1.13
> diff -u -r1.12 -r1.13
> --- bjssub.c 19 Sep 2002 20:28:20 -0000 1.12
> +++ bjssub.c 17 Mar 2004 22:41:45 -0000 1.13
> @@ -24,7 +24,7 @@
>   * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
>   * License for more detail.
>   *
> - * $Id: bjssub.c,v 1.12 2002/09/19 20:28:20 hendriks Exp $
> + * $Id: bjssub.c,v 1.13 2004/03/17 22:41:45 hendriks Exp $
>   *--------------------------------------------------------------------*/
>  #include <stdio.h>
>  #include <stdlib.h>
> @@ -292,16 +292,19 @@
>
>      /* Put together the nodes string and stick it in the environment */
>      len = 0;
> -    for (sx = nodesx->list->next; sx; sx=sx->next)
> +    for (sx = nodesx->list->next->next; sx; sx=sx->next)
>          len += strlen(sx->val) + 1;
>      nodesstr = alloca(len);
>      nodesstr[0] = 0;
> -    for (sx = nodesx->list->next; sx; sx=sx->next) {
> +    for (sx = nodesx->list->next->next; sx; sx=sx->next) {
>          strcat(nodesstr, sx->val);
>          if (sx->next) strcat(nodesstr, ",");
>      }
>      setenv("NODES", nodesstr, 1);
>      printf("NODES=%s\n", nodesstr);
> +
> +    setenv("BJS_JOBID", nodesx->list->next->val, 1);
> +    printf("BJS_JOBID=%s\n", nodesx->list->next->val);
>
>      if (pwd && chdir(pwd)) {
>          fprintf(stderr, "chdir(\"%s\"): %s\n", pwd, strerror(errno));
> Index: sexps.txt
> ===================================================================
> RCS file: /home/repository/bjs/sexps.txt,v
> retrieving revision 1.3
> retrieving revision 1.4
> diff -u -r1.3 -r1.4
> --- sexps.txt 17 Sep 2002 03:55:10 -0000 1.3
> +++ sexps.txt 17 Mar 2004 22:41:45 -0000 1.4
> @@ -20,7 +20,7 @@
>
>  Job submission (interactive) responses:
>  (ok ID)
> -(nodes NODE ...)
> +(nodes ID NODE ...)
>  (error MSG)
>
>

--
Dr. Daniel Gruner                  dg...@ti...
Dept. of Chemistry                 dan...@ut...
University of Toronto              phone: (416)-978-8689
80 St. George Street               fax:   (416)-978-5325
Toronto, ON M5S 3H6, Canada        finger for PGP public key

From: <er...@he...> - 2004-03-17 20:09:29

On Wed, Mar 17, 2004 at 11:36:56AM -0500, Daniel Gruner wrote:
> Hi
>
> Is there an environment variable set up by bjs such that the submitted
> script can know its id?  I know the $NODES variable is set by bjs, but
> it would be useful to also have the JOBID set in the environment, so that
> one can set up individual directories based on the JOBID, etc.

It doesn't.  That's a good idea though.  I think it could be added
with two lines in bjs.c:bjs_job_environment() and the analogous spot
(where NODES is set) in bjssub.c.

- Erik

From: Daniel G. <dg...@ti...> - 2004-03-17 16:37:05

Hi

Is there an environment variable set up by bjs such that the submitted script can know its id?  I know the $NODES variable is set by bjs, but it would be useful to also have the JOBID set in the environment, so that one can set up individual directories based on the JOBID, etc.

Daniel

--
Dr. Daniel Gruner                  dg...@ti...
Dept. of Chemistry                 dan...@ut...
University of Toronto              phone: (416)-978-8689
80 St. George Street               fax:   (416)-978-5325
Toronto, ON M5S 3H6, Canada        finger for PGP public key

From: Daniel G. <dg...@ti...> - 2004-03-16 21:52:30

On Tue, Mar 16, 2004 at 02:18:46PM -0700, Michal Jaegermann wrote:
> On Tue, Mar 16, 2004 at 10:24:24AM -0500, Daniel Gruner wrote:
> > Any more suggestions?
>
> Well, my connection got better and I did some searching through
> Google.  It is actually not hard to find quite a few entries
> using "Alpha CACHE INCORRECTLY CONFIGURED" for a search string.
>
> Have a look, in particular, at:
> http://www.cs.helsinki.fi/linux/linux-kernel/2001-09/0400.html
> and a related thread.
>
> You have full sources for a kernel I installed on your machine and
> which was not having that problem.  Do you see in there Alpha
> specific patches with anything which is similar?  In source rpm they
> are separate from original sources and often have somewhat
> descriptive names.  Or anything else there which is involved in
> PCI-to-PCI handling?  These pieces, and most likely all other stuff
> which touches PCI on Alpha, probably apply to what you are using
> right now.  Possibly they need a bit of massaging but really only
> you can tell.  There is a distinct possibility that this would solve
> the problem.

Thanks, Michal.  I will look at these.

>
> BTW - which SCSI driver are you using right now?  If this is not
> sym53c8xx_2, but instead sym53c8xx or ncr53c8xx, then chances are
> that switching to it will help.  What you got from me was using
> sym53c8xx_2, IIRC.  It may be a real problem with PCI bridges
> handling though, or an interplay with what milo sets and expects, and
> then you need to carry over patches which apply.

I thought about this, and I have built new kernels that define the sym53c8xx_2 as included in the kernel, rather than the sym53c8xx that was there originally.  It didn't make a difference...

Daniel

--
Dr. Daniel Gruner                  dg...@ti...
Dept. of Chemistry                 dan...@ut...
University of Toronto              phone: (416)-978-8689
80 St. George Street               fax:   (416)-978-5325
Toronto, ON M5S 3H6, Canada        finger for PGP public key

From: Daniel G. <dg...@ti...> - 2004-03-16 20:15:29

On Tue, Mar 16, 2004 at 12:47:04PM -0700, er...@he... wrote:
> On Mon, Mar 15, 2004 at 03:07:44PM -0500, Daniel Gruner wrote:
> > Well, it works!  I have managed to coerce the cluster into clustermatic 4,
> > with the only modification being what I mention above, i.e. setting
> > CONFIG_ALPHA_LEGACY_START_ADDRESS=y in the config files.  I rebuilt all the
> > rpms, and bingo.
> >
> > The scheduler problem I was having with the earlier version is now gone too.
> > I guess it must have been that particular kernel combined with BProc 3.2.6.
> >
> > I still see one problem, and it has to do with rebooting the nodes.  For some
> > reason milo ends up screwed up when doing a warm reboot (e.g. with bpctl),
> > and complains about CACHE INCORRECTLY CONFIGURED, which ends up in
> > MILO: unknown filesystem on device sda2
> > MILO: Failed to load the kernel
>
> That's gotta be some weird alpha-ism.  I haven't used MILO in a LONG
> time.  I replaced alpha bios with the SRM (and used aboot) on the
> machine where I was using it and life got a lot better.

Unfortunately there is no SRM for the UX (Ruffian) boards...  :-{(
Otherwise we would have done the same as you, a long time ago...

I have tried compiling the kernel specifically for Ruffian, but it still does the same.  It is as if milo itself gets corrupted by Linux, and it fails to see the scsi disks after that (even if it still has the driver for the scsi controller).  I have even tried changing the default driver to SYM53CXX_2, as it used to be in the old version of the kernel I was using, but to no avail.  There must be some other dirty trick...

Daniel

--
Dr. Daniel Gruner                  dg...@ti...
Dept. of Chemistry                 dan...@ut...
University of Toronto              phone: (416)-978-8689
80 St. George Street               fax:   (416)-978-5325
Toronto, ON M5S 3H6, Canada        finger for PGP public key

From: <er...@he...> - 2004-03-16 20:10:07

On Mon, Mar 15, 2004 at 03:07:44PM -0500, Daniel Gruner wrote:
> Well, it works!  I have managed to coerce the cluster into clustermatic 4,
> with the only modification being what I mention above, i.e. setting
> CONFIG_ALPHA_LEGACY_START_ADDRESS=y in the config files.  I rebuilt all the
> rpms, and bingo.
>
> The scheduler problem I was having with the earlier version is now gone too.
> I guess it must have been that particular kernel combined with BProc 3.2.6.
>
> I still see one problem, and it has to do with rebooting the nodes.  For some
> reason milo ends up screwed up when doing a warm reboot (e.g. with bpctl),
> and complains about CACHE INCORRECTLY CONFIGURED, which ends up in
> MILO: unknown filesystem on device sda2
> MILO: Failed to load the kernel

That's gotta be some weird alpha-ism.  I haven't used MILO in a LONG
time.  I replaced alpha bios with the SRM (and used aboot) on the
machine where I was using it and life got a lot better.

> When the machine boots cold, or is reset hard (i.e. with the reset button)
> it just boots, no questions asked.
>
> With the previous kernel, and the same milo, the machine behaves properly,
> and can be rebooted warm.
>
> Other than that, it seems to be working great!

Good.

- Erik

From: Daniel G. <dg...@ti...> - 2004-03-15 20:08:07
|
On Thu, Mar 11, 2004 at 03:29:02PM -0700, er...@he... wrote: > On Thu, Mar 11, 2004 at 05:42:30PM -0500, Daniel Gruner wrote: > > On Thu, Mar 11, 2004 at 02:34:03PM -0700, er...@he... wrote: > > > On Thu, Mar 11, 2004 at 02:38:24PM -0500, Daniel Gruner wrote: > > > > Hi Erik, > > > > > > > > Well, as I am still struggling with the alpha (UX) machines and the > > > > scheduling problem that starves the nodes, I would like to try to rebuild > > > > the Clustermatic 4 kernel, and test it on my machine. The need to rebuild > > > > is due to the fact that my machines use milo, and there is no srm console > > > > for them. I know I need to change one parameter in the config file(s) for > > > > the kernel(s): CONFIG_ALPHA_LEGACY_START_ADDRESS=y. > > > > > > > > Can you tell me what the easiest way to do this is? Do I simply modify the > > > > kernel-2.4.22-alpha.config file that got unpacked from the src.rpm and > > > > then use rpmbuild on the kernel-2.4.spec file? Do I need to have other > > > > packages installed on the machine prior to rebuilding the kernel (e.g. > > > > bproc-4.0.0pre3-1)? > > > > > > That should do it... I'm pretty sure you don't need to have anything > > > funny installed. > > > > In other words, just doing something like > > > > rpmbuild kernel-2.4.spec > > > > after modifying the *alpha.config files should do the trick? > > I will try it, and report back... > > Yeah, > > rpmbuild -ba kernel-2.4.spec > > Modify the .configs in the /usr/src/redhat/SOURCES directory. The > specfile copies from there. You might want to bump the release number > in kernel-2.4.spec too just to avoid confusion. > > - Erik Erik, Well, it works! I have managed to coerce the cluster into clustermatic 4, with the only modification being what I mention above, i.e. setting CONFIG_ALPHA_LEGACY_START_ADDRESS=y in the config files. I rebuilt all the rpms, and bingo. The scheduler problem I was having with the earlier version is now gone too. 
I guess it must have been that particular kernel combined with BProc 3.2.6. I still see one problem, and it has to do with rebooting the nodes. For some reason milo ends up screwed up when doing a warm reboot (e.g. with bpctl), and complains about CACHE INCORRECTLY CONFIGURED, which ends up in MILO: unknown filesystem on device sda2 MILO: Failed to load the kernel When the machine boots cold, or is reset hard (i.e. with the reset button) it just boots, no questions asked. With the previous kernel, and the same milo, the machine behaves properly, and can be rebooted warm. Other than that, it seems to be working great! Regards, Daniel -- Dr. Daniel Gruner dg...@ti... Dept. of Chemistry dan...@ut... University of Toronto phone: (416)-978-8689 80 St. George Street fax: (416)-978-5325 Toronto, ON M5S 3H6, Canada finger for PGP public key |
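The rebuild steps confirmed in this thread (edit the alpha .config files under /usr/src/redhat/SOURCES, then run rpmbuild -ba kernel-2.4.spec) can be sketched roughly as the shell helper below. This is only an illustration under stated assumptions: the function name set_kconfig_y is made up for this sketch, the file glob is a guess based on the kernel-2.4.22-alpha.config name mentioned above, and the rpmbuild step is left commented out because it needs the unpacked src.rpm.

```shell
#!/bin/sh
# Hypothetical helper (not part of Clustermatic or BProc): force one
# kernel config option to "y" in a .config-style file.
# Usage: set_kconfig_y FILE OPTION
set_kconfig_y() {
    file=$1
    opt=$2
    # Drop any existing setting of the option ("OPT=..." or
    # "# OPT is not set"); grep exits non-zero when nothing is left,
    # which is fine here.
    grep -v -e "^${opt}=" -e "^# ${opt} is not set\$" "$file" > "$file.tmp" || true
    # Append the desired setting and replace the original file.
    echo "${opt}=y" >> "$file.tmp"
    mv "$file.tmp" "$file"
}

# Assumed usage on the build machine, following the steps in the thread:
# for cfg in /usr/src/redhat/SOURCES/kernel-2.4.22-alpha*.config; do
#     set_kconfig_y "$cfg" CONFIG_ALPHA_LEGACY_START_ADDRESS
# done
# rpmbuild -ba kernel-2.4.spec
```

Bumping the release number in kernel-2.4.spec before the build, as Erik suggests, keeps the rebuilt packages distinguishable from the stock ones.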
From: <er...@he...> - 2004-03-12 20:32:19
|
On Fri, Mar 12, 2004 at 12:06:12PM -0800, Josh England wrote: > It's GPL'd, so here is the tarball. Maybe this would be a good thing to > include in the regular Bproc distro?? I tossed it in beoboot tree. It'll be in the next tarball. - Erik |
From: YhLu <Yh...@ty...> - 2004-03-12 20:03:28
|
Thanks, Matt. I will wait for the new stack. Before that I will test that in local HD + vapi + (mpich over vapi). YH -----Original Message----- From: Matt L. Leininger [mailto:mll...@hp...] Sent: March 12, 2004 10:36 To: er...@he... Cc: YhLu; bpr...@li... Subject: Re: [BProc] InfiBand and Bproc We are working towards a complete InfiniBand Science Appliance, but it will probably be ~6 months before we have something useful for the folks running BProc in production mode. Our current Bproc IB cluster at Sandia Livermore runs BProc over the GigE network. The main issue has been that each InfiniBand company had its own proprietary software stack. Over the last 6 weeks the InfiniBand community (mainly industry and DOE national labs) have come together and released nearly all the code open source. We are currently setting up an organization to deliver a unified (best of breed) open source InfiniBand stack to the Linux community. This stack will be making its way into the Linux kernel (kernel.org), and into the SuSE and Redhat distros. The official announcement for the OS stack and organization will be released in about 2 weeks and I'll forward it to the BProc mail list. With a single open source stack we can finally have a multi-vendor supported solution for Science Appliance clusters. As Erik mentioned we'll be running BProc via IP over InfiniBand (IPoIB) and eliminate the need for a secondary ethernet network. - Matt On Fri, 2004-03-12 at 08:29, er...@he... wrote: > On Thu, Mar 11, 2004 at 09:12:38PM -0800, YhLu wrote: > > Erik, > > > > Any howto or doc that is talking about using bproc with IB? > > > > I saw there is some plug-in option for Myrinet ... > > I'm not aware of anything like that. I haven't set it up myself. The > clusters that I've seen using IB at this point were still using > ethernet as the management network.
BProc did all its stuff on that > network. There was some stuff whacked in to the BProc setup to get IB > drivers loaded. That was non-trivial since there were a lot of > scripts involved. > > I have seen BProc run using IP over IB on little bladed thing once but > we cheated and had a local linux install on the slave node to get the > IB setup. > > Some of the Sandia California guys are on this list. They have a lot > more experience with this than I do. Maybe they can comment or > recommend how to go about doing this. Matt/Josh/Mitch ??? > > - Erik > > > ------------------------------------------------------- > This SF.Net email is sponsored by: IBM Linux Tutorials > Free Linux tutorial presented by Daniel Robbins, President and CEO of > GenToo technologies. Learn everything from fundamentals to system > administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click > _______________________________________________ > BProc-users mailing list > BPr...@li... > https://lists.sourceforge.net/lists/listinfo/bproc-users -- Matt L. Leininger, Ph.D. Sandia National Laboratory, Livermore CA High Performance Computing and Networking E-mail: ml...@ca... World Wide Web: http://aros.ca.sandia.gov/~mlleinin/Matt.html Office phone: (925) 294-4842 Office fax: (925) 294-2776 Mail Stop: 9915 |
From: Josh E. <jj...@sa...> - 2004-03-12 19:44:50
|
It will be nice when we get things going purely over IB. In the meantime we are using an 'exec' module for Bproc originally developed by LinuxNetworx to run the slew of IB scripts needed for most VAPI stacks. It's GPL'd, so here is the tarball. Maybe this would be a good thing to include in the regular Bproc distro?? -JE On Fri, 2004-03-12 at 10:35, Matt L. Leininger wrote: > We are working towards a complete InfiniBand Science Appliance, but it > will probably be ~6 months before we have something useful for the folks > running BProc in production mode. Our current Bproc IB cluster at > Sandia Livermore runs BProc over the GigE network. > > The main issue has been that each InfiniBand company had its own > proprietary software stack. Over the last 6 weeks the InfiniBand > community (mainly industry and DOE national labs) have come together and > released nearly all the code open source. We are currently setting up > an organization to deliver a unified (best of breed) open source > InfiniBand stack to the Linux community. This stack will be making its > way into the Linux kernel (kernel.org), and into the SuSE and Redhat > distros. The official announcement for the OS stack and organization > will be released in about 2 weeks and I'll forward it to the BProc mail > list. With a single open source stack we can finally have a > multi-vendor supported solution for Science Appliance clusters. As Erik > mentioned we'll be running BProc via IP over InfiniBand (IPoIB) and > eliminate the need for a secondary ethernet network. > > - Matt > > > On Fri, 2004-03-12 at 08:29, er...@he... wrote: > > On Thu, Mar 11, 2004 at 09:12:38PM -0800, YhLu wrote: > > > Erik, > > > > > > Any howto or doc that is talking about using bproc with IB? > > > > > > I saw there is some plug-in option for Myrinet ... > > > > I'm not aware of anything like that. I haven't set it up myself.
The > > clusters that I've seen using IB at this point were still using > > ethernet as the management network. BProc did all its stuff on that > > network. There was some stuff whacked in to the BProc setup to get IB > > drivers loaded. That was non-trivial since there were a lot of > > scripts involved. > > > > I have seen BProc run using IP over IB on little bladed thing once but > > we cheated and had a local linux install on the slave node to get the > > IB setup. > > > > Some of the Sandia California guys are on this list. They have a lot > > more experience with this than I do. Maybe they can comment or > > recommend how to go about doing this. Matt/Josh/Mitch ??? > > > > - Erik |
From: Matt L. L. <mll...@hp...> - 2004-03-12 18:48:11
|
We are working towards a complete InfiniBand Science Appliance, but it will probably be ~6 months before we have something useful for the folks running BProc in production mode. Our current Bproc IB cluster at Sandia Livermore runs BProc over the GigE network. The main issue has been that each InfiniBand company had its own proprietary software stack. Over the last 6 weeks the InfiniBand community (mainly industry and DOE national labs) have come together and released nearly all the code open source. We are currently setting up an organization to deliver a unified (best of breed) open source InfiniBand stack to the Linux community. This stack will be making its way into the Linux kernel (kernel.org), and into the SuSE and Redhat distros. The official announcement for the OS stack and organization will be released in about 2 weeks and I'll forward it to the BProc mail list. With a single open source stack we can finally have a multi-vendor supported solution for Science Appliance clusters. As Erik mentioned we'll be running BProc via IP over InfiniBand (IPoIB) and eliminate the need for a secondary ethernet network. - Matt On Fri, 2004-03-12 at 08:29, er...@he... wrote: > On Thu, Mar 11, 2004 at 09:12:38PM -0800, YhLu wrote: > > Erik, > > > > Any howto or doc that is talking about using bproc with IB? > > > > I saw there is some plug-in option for Myrinet ... > > I'm not aware of anything like that. I haven't set it up myself. The > clusters that I've seen using IB at this point were still using > ethernet as the management network. BProc did all its stuff on that > network. There was some stuff whacked in to the BProc setup to get IB > drivers loaded. That was non-trivial since there were a lot of > scripts involved. > > I have seen BProc run using IP over IB on little bladed thing once but > we cheated and had a local linux install on the slave node to get the > IB setup. > > Some of the Sandia California guys are on this list.
They have a lot > more experience with this than I do. Maybe they can comment or > recommend how to go about doing this. Matt/Josh/Mitch ??? > > - Erik -- Matt L. Leininger, Ph.D. Sandia National Laboratory, Livermore CA High Performance Computing and Networking E-mail: ml...@ca... World Wide Web: http://aros.ca.sandia.gov/~mlleinin/Matt.html Office phone: (925) 294-4842 Office fax: (925) 294-2776 Mail Stop: 9915 |