From: Daniel W. <wi...@ci...> - 2004-07-15 21:00:28
|
Greetings,

After some hunting we discovered (bproc 3.2.6 again) that if some erroneous argument is provided for (at least) the -I option of bpsh, hilarity ensues.

# bpsh 3 -I /dev/null/ hostname
3: Not a directory
# bpsh 3 -I /dev/null hostname
(correct information)

Known feature?

Dan W.

--
-- Daniel Widyono http://www.cis.upenn.edu/~widyono
-- Liniac Project, CIS Dept., SEAS, University of Pennsylvania
-- Mail: CIS Dept, 302 Levine 3330 Walnut St Philadelphia, PA 19104 |
From: <er...@he...> - 2004-07-15 19:30:07
|
On Thu, Jul 15, 2004 at 01:51:59PM -0400, Daniel Widyono wrote:
> Doh. Completely forgot we had instituted a memory limit on the master node
> to keep runaway processes in check. Is there a way to _prevent_ bproc from
> migrating the ulimit?

Currently, no but that seems like a good idea for a feature. It's not clear to me what that feature would look like.

- Erik |
From: Daniel W. <wi...@ci...> - 2004-07-15 17:52:03
|
Doh. Completely forgot we had instituted a memory limit on the master node to keep runaway processes in check. Is there a way to _prevent_ bproc from migrating the ulimit?

The compute nodes are owned completely by the user, so there's no issue about sharing memory with other users on the compute nodes, thus I'd like to keep the limit on the master node but erase it on the compute nodes.

Thanks,
Dan W.

> Normally *should* migrate ulimits along with other process details so
> you should have the same ulimit on a slave node that you had on the
> front end.

--
-- Daniel Widyono http://www.cis.upenn.edu/~widyono
-- Liniac Project, CIS Dept., SEAS, University of Pennsylvania
-- Mail: CIS Dept, 302 Levine 3330 Walnut St Philadelphia, PA 19104 |
From: <er...@he...> - 2004-07-15 17:32:59
|
On Wed, Jul 14, 2004 at 04:17:09PM -0400, Daniel Widyono wrote:
> Greetings,
>
> I couldn't find "limit" in my personal bproc list archive, and
> www.geocrawler.com is refusing connections right now (home of the mailing
> list archives, as per bproc.sf.net). Under bproc 3.2.6:
>
> #!/bin/sh
> # ulimit
> ulimit -m
>
> bpsh 0 ./ulimit shows 512MB (2GB RAM on compute node)
> ssh node0 ./ulimit shows unlimited
>
> Is this a known issue? Is it configurable and I just can't find the
> configuration?

Normally bproc *should* migrate ulimits along with other process details, so you should have the same ulimit on a slave node that you had on the front end. Is this not what it's doing?

- Erik |
From: <er...@he...> - 2004-07-15 17:31:21
|
On Thu, Jul 08, 2004 at 02:49:59PM -0600, Michal Jaegermann wrote:
> Changes in Makefiles mean that vmadump kernel module is installed
> somewhat indirectly. An unfortunate side-effect of that is that
> vmadump.h header is no longer installed in /usr/include/sys.
> This prevents compilation of things like cmtools and beoboot as they
> want to see that header. This is likely harder to notice when
> you have "historical" headers present but hits when you are doing
> that from scratch.
>
> There are likely various ways to get around that. I just split
> an 'install' target in vmadump/Makefile into 'install-headers'
> and 'install', which depend on the first, and in the top Makefile
> added to INSTALL_TARGETS something which does
> '$(MAKE) -C vmadump install-headers' and things are fine again.
>
> BTW - "BuildRequires:" in beboot.spec for beoboot-cm1.9 needs
> both cmtools-devel and bproc-devel as headers from both packages
> are used.
>
> Another note - a default value for LINUX in various spec files
> should really be /lib/modules/$(uname -r)/build instead of
> /usr/src/linux. Nowadays you will not find what is needed in
> the later, and similar, locations while data required for building
> external modules for the current kernel should be there under
> /lib/modules/$(uname -r)/build - be that a symlink or an actual
> directory.

Thanks for finding this stuff. Can you make some patches and send them to me?

- Erik |
From: Daniel W. <wi...@ci...> - 2004-07-14 20:17:15
|
Greetings,

I couldn't find "limit" in my personal bproc list archive, and www.geocrawler.com is refusing connections right now (home of the mailing list archives, as per bproc.sf.net). Under bproc 3.2.6:

#!/bin/sh
# ulimit
ulimit -m

bpsh 0 ./ulimit shows 512MB (2GB RAM on compute node)
ssh node0 ./ulimit shows unlimited

Is this a known issue? Is it configurable and I just can't find the configuration?

Thanks,
Dan W.

--
-- Daniel Widyono http://www.cis.upenn.edu/~widyono
-- Liniac Project, CIS Dept., SEAS, University of Pennsylvania
-- Mail: CIS Dept, 302 Levine 3330 Walnut St Philadelphia, PA 19104 |
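[Editor's sketch] The test script in this message asks the shell for its resident-set limit. The same query can be made from a small C program, which could then be launched through both bpsh and ssh to compare what each path hands to the process. This is only a sketch: it assumes that `ulimit -m` corresponds to RLIMIT_RSS and is reported in 1024-byte units (as bash does), and the helper names `format_rss_limit` and `current_rss_limit` are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: render a limit the way "ulimit -m" is assumed to,
 * i.e. "unlimited" or a count of 1024-byte blocks. */
void format_rss_limit(rlim_t bytes, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);
    if (bytes == RLIM_INFINITY)
        snprintf(buf, buflen, "unlimited");
    else
        snprintf(buf, buflen, "%llu", (unsigned long long)(bytes / 1024));
}

/* Read the soft RLIMIT_RSS of the calling process.
 * Returns 0 on success, -1 if getrlimit() fails. */
int current_rss_limit(char *buf, size_t buflen)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_RSS, &rl) != 0)
        return -1;
    format_rss_limit(rl.rlim_cur, buf, buflen);
    return 0;
}
```

A caller would print the buffer filled in by `current_rss_limit()`; a 512MB limit would come out as 524288 under the 1024-byte-unit assumption.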
From: Michal J. <mi...@ha...> - 2004-07-08 20:50:19
|
Changes in Makefiles mean that the vmadump kernel module is installed somewhat indirectly. An unfortunate side-effect of that is that the vmadump.h header is no longer installed in /usr/include/sys. This prevents compilation of things like cmtools and beoboot as they want to see that header. This is likely harder to notice when you have "historical" headers present but hits when you are doing that from scratch.

There are likely various ways to get around that. I just split the 'install' target in vmadump/Makefile into 'install-headers' and 'install', with the latter depending on the former, and in the top Makefile added to INSTALL_TARGETS something which does '$(MAKE) -C vmadump install-headers' and things are fine again.

BTW - "BuildRequires:" in beboot.spec for beoboot-cm1.9 needs both cmtools-devel and bproc-devel as headers from both packages are used.

Another note - a default value for LINUX in various spec files should really be /lib/modules/$(uname -r)/build instead of /usr/src/linux. Nowadays you will not find what is needed in the latter, and similar, locations while data required for building external modules for the current kernel should be there under /lib/modules/$(uname -r)/build - be that a symlink or an actual directory.

Michal |
From: <er...@he...> - 2004-07-08 18:04:20
|
On Wed, Jul 07, 2004 at 04:49:19PM -0600, Michal Jaegermann wrote:
>
> I believe that I run into another typo in bproc-4.0.0pre5. In dcache.h
> there is the following definition:
>
> struct qstr {
>         unsigned int hash;
>         const unsigned char *name;
>         unsigned int len;
> };
>
> Initializations in kernel/bpfs.c attempt to assign a pointer to a string
> to 'hash'. Therefore it seems to me that the following was really intended:

Yep. They moved those fields around at some point between 2.4 and here. I changed it so it's a stack variable and made the definitions look like this. This should avoid this issue if they decide to move the fields around again. Thanks for pointing that out.

struct qstr name = { .name = name_, .len = strlen(name_), .hash = 0}; /* screw the hash... */

- Erik |
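[Editor's sketch] The point of the designated initializer is that it names each field, so the same source line stays correct whichever way the struct is laid out. The snippet below illustrates that in plain userspace C. Both structs are stand-ins rather than the kernel's headers: `qstr_new` uses the field order quoted in this thread, and `qstr_old` uses the order inferred from the positional initializers that the patch removes. The `make_qstr_*` helper names are invented.

```c
#include <assert.h>
#include <string.h>

/* Stand-in with the field order quoted from dcache.h in this thread. */
struct qstr_new {
    unsigned int hash;
    const unsigned char *name;
    unsigned int len;
};

/* Stand-in with the older order implied by { name_, strlen(name_), 0 }. */
struct qstr_old {
    const unsigned char *name;
    unsigned int len;
    unsigned int hash;
};

/* The initializer text is identical in both helpers; only the struct
 * type differs, and each field still receives the intended value. */
struct qstr_new make_qstr_new(const char *name_)
{
    assert(name_ != NULL);
    struct qstr_new q = {
        .name = (const unsigned char *)name_,
        .len = (unsigned int)strlen(name_),
        .hash = 0,
    };
    return q;
}

struct qstr_old make_qstr_old(const char *name_)
{
    assert(name_ != NULL);
    struct qstr_old q = {
        .name = (const unsigned char *)name_,
        .len = (unsigned int)strlen(name_),
        .hash = 0,
    };
    return q;
}
```

With a positional initializer such as `{ name_, strlen(name_), 0 }`, the new layout would instead try to store the pointer in `hash`, which is the problem reported in the quoted message.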
From: Michal J. <mi...@ha...> - 2004-07-07 22:49:32
|
I believe that I ran into another typo in bproc-4.0.0pre5. In dcache.h there is the following definition:

struct qstr {
        unsigned int hash;
        const unsigned char *name;
        unsigned int len;
};

Initializations in kernel/bpfs.c attempt to assign a pointer to a string to 'hash'. Therefore it seems to me that the following was really intended:

--- bproc-4.0.0pre5/kernel/bpfs.c~ 2004-05-14 13:42:39.000000000 -0600
+++ bproc-4.0.0pre5/kernel/bpfs.c 2004-07-07 16:23:56.007513784 -0600
@@ -1315,7 +1315,7 @@
      * "bproc:" much like the socketfs or pipefs. This will get
      * ignored for other opens in a mounted bpfs file system since
      * those will be mounted on top of something else. */
-    sb->s_root = d_alloc(NULL, &(const struct qstr) { "bproc:", 6, 0 });
+    sb->s_root = d_alloc(NULL, &(const struct qstr) { 0, "bproc:", 6});
     if (!sb->s_root) {
         iput(root_inode);
         return -ENOMEM;
@@ -1525,7 +1525,7 @@
 struct dentry *get_dentry(struct vfsmount *mount, struct inode *inode, char *name_) {
     struct dentry *parent, *dentry;
-    struct qstr name = {name_, strlen(name_), 0}; /* screw the hash... */
+    struct qstr name = {0, name_, strlen(name_)}; /* screw the hash... */

     parent = mount->mnt_root;

Michal |
From: Michal J. <mi...@ha...> - 2004-07-05 21:46:38
|
On Mon, Jul 05, 2004 at 02:15:32PM -0600, Michal Jaegermann wrote:
> On Thu, Jun 17, 2004 at 10:10:19AM -0600, Kevin Russell wrote:
> >
> > The following test would be better:
> >
> > r = getxattr(path, BPROC_STATE_XATTR, info->status, sizeof(info->status));
> > if (r < 0 || r > sizeof(info->status)){
> >     errno = BE_INVALIDNODE;
> >     return -1;
> > }
> >
> > r = getxattr(path, BPROC_ADDR_XATTR, info->status, sizeof(info->addr));
> > if (r < 0 || r > sizeof(info->status)){
>                                  ^^^^
> >     errno = BE_INVALIDNODE;
> >     return -1;
> > }
>
> Should that be in the second case 'sizeof(info->status)' or rather
> 'sizeof(info->addr)'?

As a matter of fact it seems to me now that both 'info->status' in the second group should be 'info->addr'. Otherwise what we got into 'info->status' on the first 'getxattr()' call will be overwritten if the second call succeeds. Am I missing something?

Michal |
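[Editor's sketch] Putting the suggestions in this thread together, the two reads might look like the following, with the second call writing to `addr` and being checked against the size of `addr`. This is not the code that went into bproc. The attribute names, the buffer sizes and the struct are placeholders, and the value 300 for BE_INVALIDNODE is only a guess based on the "Unknown error 300" in the lamboot log quoted elsewhere in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Placeholders for definitions that really live in the bproc headers. */
#define BPROC_STATE_XATTR "bproc.state"   /* assumed name */
#define BPROC_ADDR_XATTR  "bproc.addr"    /* assumed name */
#define BE_INVALIDNODE    300             /* assumed value */

struct node_info_sketch {
    char status[16];   /* assumed size */
    char addr[64];     /* assumed size */
};

/* getxattr() returns the attribute length on success and -1 on error,
 * so a positive return value is the normal case, not a failure. */
int xattr_len_ok(ssize_t r, size_t bufsize)
{
    return r >= 0 && (size_t)r <= bufsize;
}

int nodeinfo_sketch(const char *path, struct node_info_sketch *info)
{
    ssize_t r;

    assert(path != NULL && info != NULL);

    r = getxattr(path, BPROC_STATE_XATTR, info->status, sizeof(info->status));
    if (!xattr_len_ok(r, sizeof(info->status))) {
        errno = BE_INVALIDNODE;
        return -1;
    }

    /* Second read goes into addr and is checked against addr's size,
     * so it cannot clobber the status fetched above. */
    r = getxattr(path, BPROC_ADDR_XATTR, info->addr, sizeof(info->addr));
    if (!xattr_len_ok(r, sizeof(info->addr))) {
        errno = BE_INVALIDNODE;
        return -1;
    }
    return 0;
}
```

Keeping the length check in one helper means the signed-to-unsigned cast is written once, after the negative case has been ruled out.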
From: Michal J. <mi...@ha...> - 2004-07-05 20:15:45
|
On Thu, Jun 17, 2004 at 10:10:19AM -0600, Kevin Russell wrote:
>
> The following test would be better:
>
> r = getxattr(path, BPROC_STATE_XATTR, info->status, sizeof(info->status));
> if (r < 0 || r > sizeof(info->status)){
>     errno = BE_INVALIDNODE;
>     return -1;
> }
>
> r = getxattr(path, BPROC_ADDR_XATTR, info->status, sizeof(info->addr));
> if (r < 0 || r > sizeof(info->status)){
                               ^^^^
>     errno = BE_INVALIDNODE;
>     return -1;
> }

Should that be in the second case 'sizeof(info->status)' or rather 'sizeof(info->addr)'? Even looking through kernel sources does not make that more clear to me.

Also results of 'sizeof()' are unsigned and long so explicit casts on comparisons with ints would be likely good.

Michal |
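[Editor's sketch] The remark about casts can be made concrete. When an int is compared with a sizeof result, the int is converted to the unsigned type first, so a negative value turns into a very large one. In the quoted test the `r < 0` check runs first and guards against that, but the hazard appears as soon as that guard is dropped or reordered. The function names below are invented, and the first function spells out the conversion explicitly so that its behaviour is visible.

```c
#include <stddef.h>

/* What "r > sizeof(buf)" effectively computes for an int r on common
 * platforms: r is converted to size_t before the comparison, so -1
 * becomes the largest size_t value. */
int too_long_unguarded(int r, size_t cap)
{
    return (size_t)r > cap;
}

/* Rule out negative values first, then compare as unsigned. */
int too_long_guarded(int r, size_t cap)
{
    return r >= 0 && (size_t)r > cap;
}
```

For r = -1 and a 16-byte buffer, the unguarded version reports "too long" while the guarded one does not, leaving the error case to be handled separately.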
From: <er...@he...> - 2004-07-01 17:27:27
|
On Wed, Jun 23, 2004 at 10:53:17AM -0500, Luke Palmer wrote:
> Hey Erik (or anyone else that knows),
>
> I desperately need to be able to build my own kernels, and the only
> success story I have heard is with 2.6 kernels. As of pre4, Erik said
> bproc on 2.6 wasn't entirely stable yet. The release notes for pre5
> speak of significant fixes. So, my question is-
>
> Is bproc4.0.0pre5 on 2.6 kernels considered stable?

Hrm. That's a tough question. It's certainly getting better. Beating on it here has revealed a number of bugs. I think that the 3.x is still a little more stable modulo the known shortcomings.

I was going to fix one more thing before releasing another tarball but it's proving more difficult than I had hoped.

- Erik |
From: Maurice H. <ma...@ha...> - 2004-06-24 04:05:47
|
I write today to ask if there is an "official" stance on the use of these. Or at least some recommendations? Platform is Opteron dual board (Tyan S2881/2882) Onboard GbE (Broadcom) etc.. With our best regards, Maurice W. Hilarius Telephone: 01-780-456-9771 Hard Data Ltd. FAX: 01-780-456-9772 11060 - 166 Avenue mailto:ma...@ha... Edmonton, AB, Canada http://www.harddata.com/ T5X 1Y3 |
From: Luke P. <lo...@du...> - 2004-06-23 15:53:20
|
Hey Erik (or anyone else that knows),

I desperately need to be able to build my own kernels, and the only success story I have heard is with 2.6 kernels. As of pre4, Erik said bproc on 2.6 wasn't entirely stable yet. The release notes for pre5 speak of significant fixes. So, my question is-

Is bproc4.0.0pre5 on 2.6 kernels considered stable?

Thanks!
-Luke |
From: <er...@he...> - 2004-06-22 19:35:32
|
On Thu, Jun 17, 2004 at 10:10:19AM -0600, Kevin Russell wrote:
> Running "lamboot -d" gets me the following:
>
> > [...]
> > n-1<29352> ssi:boot:bproc: found master node (head). Skipping checks.
> > n-1<29352> ssi:boot:bproc: n0 node status: up
> > n-1<29352> ssi:boot:bproc: n0 bproc_nodeinfo failed (Unknown error 300)
> > n-1<29352> ssi:boot:bproc: n1 node status: up
> > n-1<29352> ssi:boot:bproc: n1 bproc_nodeinfo failed (Unknown error 300)
> > [...]
>
> Inspecting the bproc code in bproc-4.0.0pre5/clients/bproc.c
> I find the following within the bproc_nodeinfo:
>
> if (getxattr(path, BPROC_STATE_XATTR, info->status, sizeof(info->status))){
>     errno = BE_INVALIDNODE;
>     return -1;
> }
>
> if (getxattr(path, BPROC_ADDR_XATTR, info->status, sizeof(info->addr))){
>     errno = BE_INVALIDNODE;
>     return -1;
> }
>
> It is my understanding that "getxattr" returns the length of
> the extended attribute. In my case, the length is 3 ("up\0").
> The above will evaluate as true and the function will return
> in error. I don't think this is the desired results.
>
> The following test would be better:
>
> r = getxattr(path, BPROC_STATE_XATTR, info->status, sizeof(info->status));
> if (r < 0 || r > sizeof(info->status)){
>     errno = BE_INVALIDNODE;
>     return -1;
> }
>
> r = getxattr(path, BPROC_ADDR_XATTR, info->status, sizeof(info->addr));
> if (r < 0 || r > sizeof(info->status)){
>     errno = BE_INVALIDNODE;
>     return -1;
> }

Thanks. I made the change.

- Erik |
From: Daniel G. <dg...@ti...> - 2004-06-22 15:29:51
|
On Tue, Jun 22, 2004 at 08:19:41AM -0700, Brian Barrett wrote: > I think I'm going to crawl into a corner and sleep for a while. Maybe > that will help me with my utter stupidity. So the n<integer> notation > is overloaded on a LAM/BProc cluster. LAM assigned nodes under it's > run-time environment a node number, always starting from 0. BProc > obviously assigned nodes a node number, with the -1 for MASTER. So on > your output, we have: > > n-1<28809> ssi:boot:bproc: resolved hosts: > n-1<28809> ssi:boot:bproc: n0 192.168.101.1 --> 192.168.101.1 (origin) > n-1<28809> ssi:boot:bproc: n1 192.168.101.100 --> 192.168.101.100 > n-1<28809> ssi:boot:bproc: n2 192.168.101.101 --> 192.168.101.101 > n-1<28809> ssi:boot:bproc: n3 192.168.101.102 --> 192.168.101.102 > n-1<28809> ssi:boot:bproc: n4 192.168.101.103 --> 192.168.101.103 > n-1<28809> ssi:boot:bproc: n5 192.168.101.104 --> 192.168.101.104 > n-1<28809> ssi:boot:bproc: found master node (192.168.101.1). Skipping > checks. > n-1<28809> ssi:boot:bproc: n0 node status: up > n-1<28809> ssi:boot:bproc: n0 access rights not checked. > n-1<28809> ssi:boot:bproc: n1 node status: up > n-1<28809> ssi:boot:bproc: n1 access rights not checked. > n-1<28809> ssi:boot:bproc: n2 node status: up > n-1<28809> ssi:boot:bproc: n2 access rights not checked. > n-1<28809> ssi:boot:bproc: n3 node status: up > n-1<28809> ssi:boot:bproc: n3 access rights not checked. > n-1<28809> ssi:boot:bproc: n4 node status: up > n-1<28809> ssi:boot:bproc: n4 access rights not checked. > > > The list under resolved hosts is of the format: <LAM node number> > <HOSTNAME> <IP>. Because of some BProc-3 behaviors, hostname is > already resolved to IP. The next section is where we look at BProc > node numbers. As you can see, 101.1 was found to be the master, as it > should. So LAM n0 will be BProc n-1, LAM n0 will be BProc n0, etc. > Confused yet? Not to worry. I see what is going on, and as long as LAM is not confused then I am happy with it. 
> > If you run the "lamnodes" command, you should see all 6 machines > listed. Next to the Master node, you should see a notation like > (origin, this_node, no_schedule). Meaning that it was the node used to > boot LAM, it is the node lamnodes is running on, and it has been set > not to be available for scheduling jobs. So aside from the already > mentioned bproc_nodeinfo() tests that we are currently just skipping > right now, I think all looks good. Yep. Here is what lamnodes had to say: racaille:dgruner{105}> lamnodes n0 master:1:no_schedule,origin,this_node n1 n0:1: n2 n1:1: n3 n2:1: n4 n3:1: n5 n4:1: This looks to me like normal behaviour, and I am satisfied that mpi jobs actually run. Thanks a bunch! (and do take a rest, it is important...:-) Daniel > > Brian > > > On Jun 22, 2004, at 7:06 AM, Daniel Gruner wrote: > > > Hi Brian, > > > > The latest version in SVN (as of yesterday evening) seems to work fine! > > I can lamboot, and then mpirun a process compiled with mpif77, and all > > that jazz... > > > > The master is still assigned as n0, but I didn't notice whether the > > actual mpi code is running on the master or not (the job I tested was > > too quick. When the job starts it prints: > > > > racaille:dgruner{132}> mpirun -np 4 ./fpi > > Process 1 of 4 is alive > > Process 0 of 4 is alive > > Process 2 of 4 is alive > > Process 3 of 4 is alive > > > > and lamboot produced the output in the attached file. It looks ok to > > me, but I want to make sure that the mpi jobs are NOT run on the > > master. > > > > Regards, > > Daniel > > > > > > On Sun, Jun 20, 2004 at 08:57:26PM -0700, Brian W. Barrett wrote: > >> Hey all - > >> > >> I think I finally fixed things so LAM really does avoid the > >> bproc_nodeinfo() call on BProc 4 clusters. Changes should be in SVN. > >> Let mw know if you have any problems. I think this leaves us with > >> Luke's master node being identified as n0 problem as the one remaining > >> bug. 
Still not sure why Luke would see that and Daniel wouldn't. > >> *shrug*. > >> > >> Brian > >> > >> -- > >> Brian Barrett > >> LAM/MPI developer and all around nice guy > >> Have a LAM/MPI day: http://www.lam-mpi.org/ > > > > -- > > > > Dr. Daniel Gruner dg...@ti... > > Dept. of Chemistry dan...@ut... > > University of Toronto phone: (416)-978-8689 > > 80 St. George Street fax: (416)-978-5325 > > Toronto, ON M5S 3H6, Canada finger for PGP public key > > <junk> > -- > Brian Barrett > LAM/MPI developer and all around nice guy > Have a LAM/MPI day: http://www.lam-mpi.org/ -- Dr. Daniel Gruner dg...@ti... Dept. of Chemistry dan...@ut... University of Toronto phone: (416)-978-8689 80 St. George Street fax: (416)-978-5325 Toronto, ON M5S 3H6, Canada finger for PGP public key |
From: Brian B. <brb...@la...> - 2004-06-22 15:23:04
|
I think I'm going to crawl into a corner and sleep for a while. Maybe that will help me with my utter stupidity. So the n<integer> notation is overloaded on a LAM/BProc cluster. LAM assigns nodes under its run-time environment a node number, always starting from 0. BProc obviously assigns nodes a node number, with -1 for MASTER. So on your output, we have:

n-1<28809> ssi:boot:bproc: resolved hosts:
n-1<28809> ssi:boot:bproc: n0 192.168.101.1 --> 192.168.101.1 (origin)
n-1<28809> ssi:boot:bproc: n1 192.168.101.100 --> 192.168.101.100
n-1<28809> ssi:boot:bproc: n2 192.168.101.101 --> 192.168.101.101
n-1<28809> ssi:boot:bproc: n3 192.168.101.102 --> 192.168.101.102
n-1<28809> ssi:boot:bproc: n4 192.168.101.103 --> 192.168.101.103
n-1<28809> ssi:boot:bproc: n5 192.168.101.104 --> 192.168.101.104
n-1<28809> ssi:boot:bproc: found master node (192.168.101.1). Skipping checks.
n-1<28809> ssi:boot:bproc: n0 node status: up
n-1<28809> ssi:boot:bproc: n0 access rights not checked.
n-1<28809> ssi:boot:bproc: n1 node status: up
n-1<28809> ssi:boot:bproc: n1 access rights not checked.
n-1<28809> ssi:boot:bproc: n2 node status: up
n-1<28809> ssi:boot:bproc: n2 access rights not checked.
n-1<28809> ssi:boot:bproc: n3 node status: up
n-1<28809> ssi:boot:bproc: n3 access rights not checked.
n-1<28809> ssi:boot:bproc: n4 node status: up
n-1<28809> ssi:boot:bproc: n4 access rights not checked.

The list under resolved hosts is of the format: <LAM node number> <HOSTNAME> <IP>. Because of some BProc-3 behaviors, hostname is already resolved to IP. The next section is where we look at BProc node numbers. As you can see, 101.1 was found to be the master, as it should. So LAM n0 will be BProc n-1, LAM n1 will be BProc n0, etc. Confused yet?

If you run the "lamnodes" command, you should see all 6 machines listed. Next to the Master node, you should see a notation like (origin, this_node, no_schedule), meaning that it was the node used to boot LAM, it is the node lamnodes is running on, and it has been set not to be available for scheduling jobs. So aside from the already mentioned bproc_nodeinfo() tests that we are currently just skipping right now, I think all looks good.

Brian

On Jun 22, 2004, at 7:06 AM, Daniel Gruner wrote:

> Hi Brian,
>
> The latest version in SVN (as of yesterday evening) seems to work fine!
> I can lamboot, and then mpirun a process compiled with mpif77, and all
> that jazz...
>
> The master is still assigned as n0, but I didn't notice whether the
> actual mpi code is running on the master or not (the job I tested was
> too quick. When the job starts it prints:
>
> racaille:dgruner{132}> mpirun -np 4 ./fpi
> Process 1 of 4 is alive
> Process 0 of 4 is alive
> Process 2 of 4 is alive
> Process 3 of 4 is alive
>
> and lamboot produced the output in the attached file. It looks ok to
> me, but I want to make sure that the mpi jobs are NOT run on the
> master.
>
> Regards,
> Daniel
>
>
> On Sun, Jun 20, 2004 at 08:57:26PM -0700, Brian W. Barrett wrote:
>> Hey all -
>>
>> I think I finally fixed things so LAM really does avoid the
>> bproc_nodeinfo() call on BProc 4 clusters. Changes should be in SVN.
>> Let mw know if you have any problems. I think this leaves us with
>> Luke's master node being identified as n0 problem as the one remaining
>> bug. Still not sure why Luke would see that and Daniel wouldn't.
>> *shrug*.
>>
>> Brian
>>
>> --
>> Brian Barrett
>> LAM/MPI developer and all around nice guy
>> Have a LAM/MPI day: http://www.lam-mpi.org/
>
> --
>
> Dr. Daniel Gruner dg...@ti...
> Dept. of Chemistry dan...@ut...
> University of Toronto phone: (416)-978-8689
> 80 St. George Street fax: (416)-978-5325
> Toronto, ON M5S 3H6, Canada finger for PGP public key
> <junk>

--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/ |
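[Editor's sketch] For the cluster in this message, where the master was the boot node and so became LAM n0, the two numbering schemes differ by one: the lamnodes output posted in the thread shows LAM n1 through n5 as BProc n0 through n4. The function below expresses that one case. It is not part of LAM or BProc, the names are invented, and a cluster booted with a different host order would not follow it.

```c
#include <assert.h>

/* BProc's node number for the master, as stated in the message. */
#define BPROC_MASTER_SKETCH (-1)

/* Map a LAM node number to a BProc node number for the listing in this
 * thread: LAM n0 is the master, every other LAM node is shifted by one. */
int lam_to_bproc(int lam_node)
{
    assert(lam_node >= 0);
    return lam_node == 0 ? BPROC_MASTER_SKETCH : lam_node - 1;
}
```

So `lam_to_bproc(0)` gives -1 (the master) and `lam_to_bproc(5)` gives 4, in line with the "n5 n4:1:" entry reported by lamnodes.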
From: Daniel G. <dg...@ti...> - 2004-06-22 14:06:50
|
Hi Brian,

The latest version in SVN (as of yesterday evening) seems to work fine! I can lamboot, and then mpirun a process compiled with mpif77, and all that jazz...

The master is still assigned as n0, but I didn't notice whether the actual mpi code is running on the master or not (the job I tested was too quick). When the job starts it prints:

racaille:dgruner{132}> mpirun -np 4 ./fpi
Process 1 of 4 is alive
Process 0 of 4 is alive
Process 2 of 4 is alive
Process 3 of 4 is alive

and lamboot produced the output in the attached file. It looks ok to me, but I want to make sure that the mpi jobs are NOT run on the master.

Regards,
Daniel

On Sun, Jun 20, 2004 at 08:57:26PM -0700, Brian W. Barrett wrote:
> Hey all -
>
> I think I finally fixed things so LAM really does avoid the
> bproc_nodeinfo() call on BProc 4 clusters. Changes should be in SVN.
> Let mw know if you have any problems. I think this leaves us with
> Luke's master node being identified as n0 problem as the one remaining
> bug. Still not sure why Luke would see that and Daniel wouldn't.
> *shrug*.
>
> Brian
>
> --
> Brian Barrett
> LAM/MPI developer and all around nice guy
> Have a LAM/MPI day: http://www.lam-mpi.org/

--

Dr. Daniel Gruner dg...@ti...
Dept. of Chemistry dan...@ut...
University of Toronto phone: (416)-978-8689
80 St. George Street fax: (416)-978-5325
Toronto, ON M5S 3H6, Canada finger for PGP public key |
From: Luke P. <lo...@du...> - 2004-06-21 15:54:45
|
> Unfortunately not. I assume you are using the autoconf way of building > the packages, right? I don't see any autoconf stuff in there... > Have you tried to use the stock clustermatic 4? Why would you absolutely > need a newer kernel? Well, the primary motivation is that I need to do some non-standard networking stuff in my kernels. However, a newer motivation is that Brian, Dan, and I have discovered some brokenness in bproc_nodeinfo and I was wanting to look into it. I tried the pre3 version and alas, it doesn't compile either. Closer, though. I get stuff like this: /usr/src/linux-2.4.22-cm36/include/linux/kernel.h:60:31: invalid suffix "d5eeb25" on integer constant ...which is a result of #defines in ksyms.ver, I think. I remember running into this a while back, but the solution eludes me. Anyone? By the way, there is a working-ish version of LAM-MPI for bproc in their SVN tree, as of last night. Brian rules... -Luke |
From: Thomas E. <eck...@gm...> - 2004-06-21 08:05:52
|
Luke,

I can only offer a kind of "me too" statement with a little add-on: bproc-4.0.0_pre{4,5} (I'm not sure about _pre3) did not build with 2.4 kernels (result as you describe it) BUT both do with 2.6 kernels (vanilla kernels with only the bproc patch applied). All testing was done with a fairly current Gentoo. My _guess_ would be that testing concentrates on 2.6 kernels at the moment as this is the interesting new stuff (Erik?).

For the LAM side: I've not tested LAM with bproc3 up to now, but with the nightly snapshots of LAM ("lam-7.1a1r9708.tar.gz" in this case) and the bproc-4.0.0_pre5 patch to clients/bproc.c suggested by Kevin Russell a few days ago on bproc-users, it compiles and is capable of "mpirun"ning jobs on x86.

Hope this helps a bit,
Thomas

On Sun, 20 Jun 2004, Luke wrote:

> Daniel,
>
> Thanks for the reply. I'm cc-ing the bproc list on this just in case
> anyone else has ideas.
>
> Unless you know of a download loaction that I don't, all the
> Clustermatic stuff is bproc4.0.0pre3. I wasn't able to mix versions-
> how are you using pre4? I don't have an immediate need to use a newer
> version of bproc, but I do have an immediate need for LAM. I'm going to
> ask Brian if I can help with development.
>
> Here's a rundown of my problems. At the very start of the pre4 build, I
> see:
>
> make -C vmadump vmadump.ko kver
> make[1]: Entering directory `/usr/src/bproc-4.0.0pre4/vmadump'
> gcc -D__KERNEL__
> -I/lib/modules/2.4.25-bproc/build//usr/src/linux-2.4.25-bproc/include
> -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing
> -fno-common -fomit-frame-pointer -pipe -mpreferred-stack-boundary=2
> -march=i686 -DMODULE -DPACKAGE_VERSION='"4.0.0pre4"' -I. -c
> vmadump_common.c
>
> The include line, which is automatically generated, is obviously wrong.
> If I correct it, the build makes sense a bit. The client programs build
> fine. When building vmadump and the kernel modules, I see errors like
> these, repeated many times.
Except the second error, they all have to > do with the variable "current", but I'm afraid I've been unable to > figure out what it is (my knowledge of kernel-ish stuff is quite > little). There are many, many warnings as well, that I have left out. > > vmadump_common.c:679: error: structure has no member named `sighand' > vmadump_common.c:681: error: too few arguments to function > `recalc_sigpending' > vmadump_common.c:707: error: structure has no member named `clear_child_tid' > ghost.c:432: error: structure has no member named `utime' > ghost.c:433: error: structure has no member named `stime' > ghost.c:434: error: structure has no member named `cutime' > ghost.c:435: error: structure has no member named `cstime' > > Any ideas? > > Thanks > -Luke > > Daniel Gruner wrote: > > >Hi Luke, > > > >I am using the stuff from clustermatic. At least the kernels. > >Some of the other packages too, as far as I recall. > > > >Now, I am not working with Fedora on any of my clusters yet, and that > >may be why you are seeing problems. Can you tell me what the build > >problems are? I have built bproc on many different systems, such > >as RH7.2 on alpha, RH7.3 on i386, RH9 on athlon, etc. > > > >Also, in my experience, Fedora 1 is missing a bunch of stuff, such as > >libraries missing from packages, and whatever else. Let me know more > >details, and perhaps I can help. > > > >Now as for success with LAM... that is another story. Brian was still > >working on it, but it seems that some functions in the BProc api are > >either broken or not properly documented. The note you quote below is old, > >and has been revised since to "not working". > > > >Regards, > >Daniel > > > >On Sun, Jun 20, 2004 at 01:36:17AM -0500, Luke Palmer wrote: > > > > > >>Hey Daniel, > >> > >>I have a couple of questions for you. I cannot replicate your success > >>with LAM on bproc. I am hoping the difference is that I am using > >>bproc4.0.0pre3 (from clustermatic). 
> >> > >>You say you are using bproc4.0.0pre4. I was wondering if there is any > >>trick to getting it to build? I am on Fedora Core 1, and the build is > >>quite badly broken in my environment. > >> > >>Please let me know if you can help! > >> > >>Thanks > >>-Luke > >> > >>On Sun, 2004-06-13 at 18:42, Daniel Gruner wrote: > >> > >> > >>>Brian, > >>> > >>>Success!!! (well, at least apparently). The thing lamboots fine, starts > >>>the lamd on all the nodes, and seems to run mpi jobs, so until further > >>>testing it looks like we are in business! > >>> > >>>I agree with you about BProc's lack of documentation, but I still like > >>>the system. It is the only cluster software that makes the cluster > >>>seem like a single system image. I have been running different versions > >>>of it for over 2 years, and I insist on it for all my clusters. It may > >>>be time to hack on the BProc guys for more documentation (Erik...). > >>> > >>>Anyway, thanks for your work on LAM, and let me know if I can be of more > >>>help. > >>> > >>>Regards, > >>>Daniel > >>> > >>> > >>>On Sun, Jun 13, 2004 at 04:01:11PM -0700, Brian Barrett wrote: > >>> > >>> > >>>>On Jun 13, 2004, at 3:32 PM, Daniel Gruner wrote: > >>>> > >>>> > >>>> > >>>>>Ok, here it goes again. The master node is still somehow screwed up, > >>>>>according to lamboot... Here is the output: > >>>>> > >>>>> > >>>><snip> > >>>> > >>>> > >>>> > >>>>>n-1<8638> ssi:boot:bproc: n-1 nodestatus failed (-1) > >>>>>n-1<8638> ssi:boot:bproc: n-1 node status down, failure > >>>>>n-1<8638> ssi:boot:bproc: n0 node status: up > >>>>> > >>>>> > >>>>Well, on the good side, we detect the master node correctly now :). > >>>> > >>>>You know, life would be so much better if BProc documented things like > >>>>that. I added some code to make sure that NODE_MASTER is never used > >>>>for the parameter to bproc_nodestatus or the like. Hopefully, that > >>>>will make life better all around. 
If you could svn up and let me know > >>>>how it goes, I'd appreciate it. Only file changed is > >>>>share/ssi/boot/bproc/src/ssi_boot_bproc.c, so you should be able to svn > >>>>up and run make (without the autogen or configure stuff). > >>>> > >>>>Thanks! > >>>> > >>>>Brian > >>>> > >>>>-- > >>>> Brian Barrett > >>>> LAM/MPI developer and all around nice guy > >>>> Have a LAM/MPI day: http://www.lam-mpi.org/ > >>>> > >>>> > > > > > > > > > > ------------------------------------------------------- > This SF.Net email is sponsored by The 2004 JavaOne(SM) Conference > Learn from the experts at JavaOne(SM), Sun's Worldwide Java Developer > Conference, June 28 - July 1 at the Moscone Center in San Francisco, CA > REGISTER AND SAVE! http://java.sun.com/javaone/sf Priority Code NWMGYKND > _______________________________________________ > BProc-users mailing list > BPr...@li... > https://lists.sourceforge.net/lists/listinfo/bproc-users > > -- It seems like once people grow up, they have no idea what's cool. -- Calvin |
From: Daniel G. <dg...@ti...> - 2004-06-20 21:35:51
|
Hi Luke,

On Sun, Jun 20, 2004 at 02:35:44PM -0500, Luke wrote:
> Daniel,
>
> Thanks for the reply. I'm cc-ing the bproc list on this just in case
> anyone else has ideas.
>
> Unless you know of a download location that I don't, all the
> Clustermatic stuff is bproc4.0.0pre3. I wasn't able to mix versions-
> how are you using pre4? I don't have an immediate need to use a newer
> version of bproc, but I do have an immediate need for LAM. I'm going to
> ask Brian if I can help with development.

I am not using pre4. If I said so somewhere, it was by mistake. Nor am I
sure that I have ever rebuilt the packages.

> Here's a rundown of my problems. At the very start of the pre4 build, I
> see:
>
> make -C vmadump vmadump.ko kver
> make[1]: Entering directory `/usr/src/bproc-4.0.0pre4/vmadump'
> gcc -D__KERNEL__
> -I/lib/modules/2.4.25-bproc/build//usr/src/linux-2.4.25-bproc/include
> -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing
> -fno-common -fomit-frame-pointer -pipe -mpreferred-stack-boundary=2
> -march=i686 -DMODULE -DPACKAGE_VERSION='"4.0.0pre4"' -I. -c
> vmadump_common.c
>
> The include line, which is automatically generated, is obviously wrong.
> If I correct it, the build makes sense a bit. The client programs build
> fine. When building vmadump and the kernel modules, I see errors like
> these, repeated many times. Except the second error, they all have to
> do with the variable "current", but I'm afraid I've been unable to
> figure out what it is (my knowledge of kernel-ish stuff is quite
> little). There are many, many warnings as well, that I have left out.
>
> vmadump_common.c:679: error: structure has no member named `sighand'
> vmadump_common.c:681: error: too few arguments to function
> `recalc_sigpending'
> vmadump_common.c:707: error: structure has no member named `clear_child_tid'
> ghost.c:432: error: structure has no member named `utime'
> ghost.c:433: error: structure has no member named `stime'
> ghost.c:434: error: structure has no member named `cutime'
> ghost.c:435: error: structure has no member named `cstime'
>
> Any ideas?

Unfortunately not. I assume you are using the autoconf way of building
the packages, right? If so, then I can only assume that either pre4 is
"funny", or that Fedora 1 is missing stuff or some stuff has changed on it.

I realize that you would not want to use RH9, since it is not "supported"
anymore, but it may be the way to go for you. Fedora 2 is not an option
either, since it uses 2.6 kernels and these are not yet supported by bproc.

Have you tried to use the stock clustermatic 4? Why would you absolutely
need a newer kernel?

Regards,
Daniel

--
Dr. Daniel Gruner              dg...@ti...
Dept. of Chemistry             dan...@ut...
University of Toronto          phone: (416)-978-8689
80 St. George Street           fax:   (416)-978-5325
Toronto, ON M5S 3H6, Canada    finger for PGP public key |
From: Luke <lo...@du...> - 2004-06-20 19:35:09
|
Daniel,

Thanks for the reply. I'm cc-ing the bproc list on this just in case
anyone else has ideas.

Unless you know of a download location that I don't, all the
Clustermatic stuff is bproc4.0.0pre3. I wasn't able to mix versions-
how are you using pre4? I don't have an immediate need to use a newer
version of bproc, but I do have an immediate need for LAM. I'm going to
ask Brian if I can help with development.

Here's a rundown of my problems. At the very start of the pre4 build, I
see:

make -C vmadump vmadump.ko kver
make[1]: Entering directory `/usr/src/bproc-4.0.0pre4/vmadump'
gcc -D__KERNEL__
-I/lib/modules/2.4.25-bproc/build//usr/src/linux-2.4.25-bproc/include
-Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing
-fno-common -fomit-frame-pointer -pipe -mpreferred-stack-boundary=2
-march=i686 -DMODULE -DPACKAGE_VERSION='"4.0.0pre4"' -I. -c
vmadump_common.c

The include line, which is automatically generated, is obviously wrong.
If I correct it, the build makes sense a bit. The client programs build
fine. When building vmadump and the kernel modules, I see errors like
these, repeated many times. Except the second error, they all have to
do with the variable "current", but I'm afraid I've been unable to
figure out what it is (my knowledge of kernel-ish stuff is quite
little). There are many, many warnings as well, that I have left out.

vmadump_common.c:679: error: structure has no member named `sighand'
vmadump_common.c:681: error: too few arguments to function
`recalc_sigpending'
vmadump_common.c:707: error: structure has no member named `clear_child_tid'
ghost.c:432: error: structure has no member named `utime'
ghost.c:433: error: structure has no member named `stime'
ghost.c:434: error: structure has no member named `cutime'
ghost.c:435: error: structure has no member named `cstime'

Any ideas?

Thanks
-Luke

Daniel Gruner wrote:

>Hi Luke,
>
>I am using the stuff from clustermatic. At least the kernels.
>Some of the other packages too, as far as I recall.
>
>Now, I am not working with Fedora on any of my clusters yet, and that
>may be why you are seeing problems. Can you tell me what the build
>problems are? I have built bproc on many different systems, such
>as RH7.2 on alpha, RH7.3 on i386, RH9 on athlon, etc.
>
>Also, in my experience, Fedora 1 is missing a bunch of stuff, such as
>libraries missing from packages, and whatever else. Let me know more
>details, and perhaps I can help.
>
>Now as for success with LAM... that is another story. Brian was still
>working on it, but it seems that some functions in the BProc api are
>either broken or not properly documented. The note you quote below is old,
>and has been revised since to "not working".
>
>Regards,
>Daniel
>
>On Sun, Jun 20, 2004 at 01:36:17AM -0500, Luke Palmer wrote:
>
>>Hey Daniel,
>>
>>I have a couple of questions for you. I cannot replicate your success
>>with LAM on bproc. I am hoping the difference is that I am using
>>bproc4.0.0pre3 (from clustermatic).
>>
>>You say you are using bproc4.0.0pre4. I was wondering if there is any
>>trick to getting it to build? I am on Fedora Core 1, and the build is
>>quite badly broken in my environment.
>>
>>Please let me know if you can help!
>>
>>Thanks
>>-Luke
>>
>>On Sun, 2004-06-13 at 18:42, Daniel Gruner wrote:
>>
>>>Brian,
>>>
>>>Success!!! (well, at least apparently). The thing lamboots fine, starts
>>>the lamd on all the nodes, and seems to run mpi jobs, so until further
>>>testing it looks like we are in business!
>>>
>>>I agree with you about BProc's lack of documentation, but I still like
>>>the system. It is the only cluster software that makes the cluster
>>>seem like a single system image. I have been running different versions
>>>of it for over 2 years, and I insist on it for all my clusters. It may
>>>be time to hack on the BProc guys for more documentation (Erik...).
>>>
>>>Anyway, thanks for your work on LAM, and let me know if I can be of more
>>>help.
>>>
>>>Regards,
>>>Daniel
>>>
>>>On Sun, Jun 13, 2004 at 04:01:11PM -0700, Brian Barrett wrote:
>>>
>>>>On Jun 13, 2004, at 3:32 PM, Daniel Gruner wrote:
>>>>
>>>>>Ok, here it goes again. The master node is still somehow screwed up,
>>>>>according to lamboot... Here is the output:
>>>>>
>>>><snip>
>>>>
>>>>>n-1<8638> ssi:boot:bproc: n-1 nodestatus failed (-1)
>>>>>n-1<8638> ssi:boot:bproc: n-1 node status down, failure
>>>>>n-1<8638> ssi:boot:bproc: n0 node status: up
>>>>>
>>>>Well, on the good side, we detect the master node correctly now :).
>>>>
>>>>You know, life would be so much better if BProc documented things like
>>>>that. I added some code to make sure that NODE_MASTER is never used
>>>>for the parameter to bproc_nodestatus or the like. Hopefully, that
>>>>will make life better all around. If you could svn up and let me know
>>>>how it goes, I'd appreciate it. Only file changed is
>>>>share/ssi/boot/bproc/src/ssi_boot_bproc.c, so you should be able to svn
>>>>up and run make (without the autogen or configure stuff).
>>>>
>>>>Thanks!
>>>>
>>>>Brian
>>>>
>>>>--
>>>> Brian Barrett
>>>> LAM/MPI developer and all around nice guy
>>>> Have a LAM/MPI day: http://www.lam-mpi.org/ |
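On the question of what "current" is: in the Linux kernel, `current` is a macro that evaluates to a pointer to the task_struct of the running process, so "structure has no member named ..." means the kernel headers being compiled against declare a task_struct without the member the bproc source refers to. The userspace mock below only illustrates that mechanism. Every name beginning with `sketch_` is invented for the example, and the struct layout is an assumption, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

/* Mock of a kernel task structure. The field names are borrowed from the
 * error messages in the thread; the layout itself is invented. */
struct sketch_task {
    long utime;
    long stime;
};

/* Stand-in for "the task that is running right now". */
struct sketch_task sketch_running_task = { 5, 7 };

/* Plays the role of the kernel's `current` macro: it expands to a pointer
 * to the running task's structure. */
#define sketch_current (&sketch_running_task)

/* Sum two accounting fields of a task. This compiles only because
 * struct sketch_task declares both members; writing t->cutime here would
 * fail with "structure has no member named `cutime'". */
long sketch_cpu_time(const struct sketch_task *t)
{
    assert(t != NULL);
    return t->utime + t->stime;
}
```

Calling `sketch_cpu_time(sketch_current)` reads the fields through the macro, the same way kernel code reads `current->utime`. When the source and the headers disagree about which members exist, the usual suspects are a source tree written for a different kernel version or a wrong include path.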
From: Luke P. <lop...@wi...> - 2004-06-20 06:20:11
|
Hi everyone, I can't get bproc4.0.0pre4 or pre5 to come anywhere close to building on Fedora Core 1 with a 2.4.25 kernel. The kernel patch and build work fine; it's just building bproc itself that doesn't work. Can anyone report success doing this? Can anyone offer any tips? Up until now I've been using the prebuilt clustermatic stuff, but I really need a custom kernel... Thanks -Luke |
From: Kevin R. <Kev...@dr...> - 2004-06-17 16:10:32
|
Running "lamboot -d" gets me the following:

> [...]
> n-1<29352> ssi:boot:bproc: found master node (head). Skipping checks.
> n-1<29352> ssi:boot:bproc: n0 node status: up
> n-1<29352> ssi:boot:bproc: n0 bproc_nodeinfo failed (Unknown error 300)
> n-1<29352> ssi:boot:bproc: n1 node status: up
> n-1<29352> ssi:boot:bproc: n1 bproc_nodeinfo failed (Unknown error 300)
> [...]

Inspecting the bproc code in bproc-4.0.0pre5/clients/bproc.c, I find the
following within bproc_nodeinfo:

    if (getxattr(path, BPROC_STATE_XATTR, info->status, sizeof(info->status))){
        errno = BE_INVALIDNODE;
        return -1;
    }
    if (getxattr(path, BPROC_ADDR_XATTR, info->status, sizeof(info->addr))){
        errno = BE_INVALIDNODE;
        return -1;
    }

It is my understanding that "getxattr" returns the length of the extended
attribute. In my case, the length is 3 ("up\0"). Because 3 is nonzero, the
condition above evaluates as true and the function returns an error. I
don't think this is the desired result. The following test would be better:

    /* getxattr returns the number of bytes read, or a negative value on error */
    r = getxattr(path, BPROC_STATE_XATTR, info->status, sizeof(info->status));
    if (r < 0 || r > sizeof(info->status)){
        errno = BE_INVALIDNODE;
        return -1;
    }
    r = getxattr(path, BPROC_ADDR_XATTR, info->status, sizeof(info->addr));
    if (r < 0 || r > sizeof(info->addr)){
        errno = BE_INVALIDNODE;
        return -1;
    }

--
Kevin Russell (DND/DRDC/MES/TDG) Kev...@dr...
(403) 544-4746, DRDC Suffield, Box 4000, Medicine Hat, AB, Canada, T1A 8K6 |
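The return-value convention Kevin describes can be looked at in isolation. On Linux, getxattr(2) returns the number of bytes stored on success and -1 on failure, so a nonzero result is not an error by itself. The helper below is a minimal sketch of the corrected check; `xattr_read_ok` is a hypothetical name that does not exist in BProc, and it only classifies a return value rather than calling getxattr.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>   /* ssize_t */

/*
 * Hypothetical helper (not part of BProc): decide whether a value returned
 * by getxattr(2) indicates success for a buffer of bufsize bytes.
 * Returns 1 for success and 0 for failure.
 */
int xattr_read_ok(ssize_t ret, size_t bufsize)
{
    assert(bufsize > 0);      /* a zero-sized buffer only queries the length */
    if (ret < 0)
        return 0;             /* the call failed; errno holds the reason */
    if ((size_t)ret > bufsize)
        return 0;             /* defensive: more bytes than the buffer holds */
    return 1;                 /* between 0 and bufsize bytes were stored */
}
```

With the 3-byte "up\0" attribute from the report, `xattr_read_ok(3, 16)` reports success, whereas a bare `if (getxattr(...))` test treats the same value as a failure. The upper-bound branch is purely defensive: when the attribute does not fit, getxattr fails with ERANGE instead of returning a larger count.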
From: Daniel G. <dg...@ti...> - 2004-06-13 23:43:29
|
Brian,

Success!!! (well, at least apparently). The thing lamboots fine, starts
the lamd on all the nodes, and seems to run mpi jobs, so until further
testing it looks like we are in business!

I agree with you about BProc's lack of documentation, but I still like
the system. It is the only cluster software that makes the cluster
seem like a single system image. I have been running different versions
of it for over 2 years, and I insist on it for all my clusters. It may
be time to hack on the BProc guys for more documentation (Erik...).

Anyway, thanks for your work on LAM, and let me know if I can be of more
help.

Regards,
Daniel

On Sun, Jun 13, 2004 at 04:01:11PM -0700, Brian Barrett wrote:
> On Jun 13, 2004, at 3:32 PM, Daniel Gruner wrote:
>
> > Ok, here it goes again. The master node is still somehow screwed up,
> > according to lamboot... Here is the output:
>
> <snip>
>
> > n-1<8638> ssi:boot:bproc: n-1 nodestatus failed (-1)
> > n-1<8638> ssi:boot:bproc: n-1 node status down, failure
> > n-1<8638> ssi:boot:bproc: n0 node status: up
>
> Well, on the good side, we detect the master node correctly now :).
>
> You know, life would be so much better if BProc documented things like
> that. I added some code to make sure that NODE_MASTER is never used
> for the parameter to bproc_nodestatus or the like. Hopefully, that
> will make life better all around. If you could svn up and let me know
> how it goes, I'd appreciate it. Only file changed is
> share/ssi/boot/bproc/src/ssi_boot_bproc.c, so you should be able to svn
> up and run make (without the autogen or configure stuff).
>
> Thanks!
>
> Brian
>
> --
> Brian Barrett
> LAM/MPI developer and all around nice guy
> Have a LAM/MPI day: http://www.lam-mpi.org/

--
Dr. Daniel Gruner              dg...@ti...
Dept. of Chemistry             dan...@ut...
University of Toronto          phone: (416)-978-8689
80 St. George Street           fax:   (416)-978-5325
Toronto, ON M5S 3H6, Canada    finger for PGP public key |
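The guard Brian describes, never passing the master node to the status query, can be sketched as follows. This is not the code from ssi_boot_bproc.c. It assumes the master is addressed as node -1 (matching the "n-1" entries in the lamboot output) and uses a callback in place of bproc_nodestatus, whose real signature is not reproduced here. All `sketch_` and `SKETCH_` names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the front end is node -1, as the "n-1" log entries suggest.
 * The real constant is defined by the BProc headers. */
#define SKETCH_NODE_MASTER (-1)

enum sketch_status { SKETCH_DOWN = 0, SKETCH_UP = 1 };

/* Callback standing in for a per-node status query. */
typedef int (*sketch_status_fn)(int node);

/* Report whether a node is usable. The master is never handed to the
 * query: the caller is running on it, so it is treated as up. */
int sketch_node_is_up(int node, sketch_status_fn query)
{
    assert(query != NULL);
    if (node == SKETCH_NODE_MASTER)
        return SKETCH_UP;
    return query(node) == SKETCH_UP ? SKETCH_UP : SKETCH_DOWN;
}

/* Demonstration query: fails for negative node ids, mirroring the
 * "n-1 nodestatus failed (-1)" line, and reports every other node up. */
int sketch_fake_query(int node)
{
    return node < 0 ? -1 : SKETCH_UP;
}
```

With this arrangement `sketch_node_is_up(SKETCH_NODE_MASTER, sketch_fake_query)` succeeds even though the query itself would fail for node -1, which is the behaviour change the lamboot log shows between the two runs.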