From: Nicholas H. <he...@se...> - 2002-02-26 16:39:39

Make sure you have run 'modprobe bproc' to load the kernel module, which is necessary to use either bpmaster or bpslave.

Nic

On Tue, 26 Feb 2002, Ryan Madison wrote:

> Hello,
> I am attempting to setup a pair of boxes using the bproc package. The two machines are Dell Optiplex GX240's, with 512MB RAM, 1.5Ghz CPU's. I have re-compiled, and patched the kernel on each box, checking Beowulf shared process space in the menuconfig.
>
> On the machine that I am designating as the master node, I have bpmaster running. I can even run a bpslave instance on that node, and use bpstat to check the status of it.
>
> On the first slave node, I have followed every step that I followed on the master to set it up, (re-compiling the kernel, patching kernel, compiling and installing the bproc package). When I go to run bpslave <masternode>, I get the error message BProc: function not implemented.
>
> I didn't see any troubleshooting doc regarding this problem, nor did I see a FAQ. Can any of you help me to try and figure out this problem.
>
> -Thanks, RYAN
>
>
> Ryan Madison
> Engineer
> ITX @ Sigma
> rma...@ro...
>
> _______________________________________________
> BProc-users mailing list
> BPr...@li...
> https://lists.sourceforge.net/lists/listinfo/bproc-users
>

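
Nic's suggestion can be checked before starting bpmaster or bpslave. The following is a minimal Python sketch, not part of BProc: it assumes the module shows up under the name bproc and that the text being scanned follows the /proc/modules layout, where the module name is the first whitespace-separated field. The helper name module_loaded is hypothetical.

```python
def module_loaded(name, proc_modules_text):
    """Return True if `name` is listed as a module in /proc/modules-style text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The module name is the first field on each line.
        if fields and fields[0] == name:
            return True
    return False

# Typical use on a Linux node (assumption: the module is named "bproc"):
#   with open("/proc/modules") as f:
#       if not module_loaded("bproc", f.read()):
#           print("bproc is not loaded; run 'modprobe bproc' as root")
```

If the check fails on the slave but passes on the master, that would match the "function not implemented" error Ryan reports.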
From: Ryan M. <rma...@ro...> - 2002-02-26 15:43:14

Hello,

I am attempting to set up a pair of boxes using the bproc package. The two machines are Dell Optiplex GX240s, with 512MB RAM and 1.5GHz CPUs. I have recompiled and patched the kernel on each box, checking "Beowulf shared process space" in the menuconfig.

On the machine that I am designating as the master node, I have bpmaster running. I can even run a bpslave instance on that node and use bpstat to check its status.

On the first slave node, I have followed every step that I followed on the master to set it up (recompiling the kernel, patching the kernel, compiling and installing the bproc package). When I go to run bpslave <masternode>, I get the error message "BProc: function not implemented".

I didn't see any troubleshooting doc regarding this problem, nor did I see a FAQ. Can any of you help me figure out this problem?

-Thanks,
RYAN

Ryan Madison
Engineer
ITX @ Sigma
rma...@ro...

From: Erik A. H. <er...@he...> - 2002-02-15 00:29:53

BProc 3.1.7 is out and in the usual spot:

http://sourceforge.net/project/showfiles.php?group_id=24453&release_id=75172

Release notes and change log follow:

3.1.7 ---------------------------------------------------------------------

BProc has been ported to the PowerPC. It hasn't seen much testing since I don't have a lot in the way of PPC hardware, but the first take is there and it seems to work fine.

Another Linux patch is added to this release to fix an SMP problem in the process ID allocator. This patch should be applied to the kernel in addition to the BProc patch.

There's an important bug fix in the master daemon too.

Changes from 3.1.6 to 3.1.7

* Added PowerPC support!!
* Fixed a possible SIGHUP reconfiguration crash in bpmaster.
* Fixed connection handling bug in bpmaster that caused slave connections to malfunction.
* Added /proc/sys/bproc/proc_pid_map to control process ID mapping in /proc on the slave nodes. 2=map for all (default), 1=map for non-root only, 0=no mapping.
* Added patch for Linux PID allocator bug.

--
Erik Arjan Hendriks Printed On 100 Percent Recycled Electrons
er...@he... Contents may settle during shipment

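
The three proc_pid_map values in the change log can be read as a small decision rule. The Python sketch below only restates that rule; the function and constant names are hypothetical, and treating "non-root" as "uid is not 0" is an assumption, not something the release notes spell out.

```python
PID_MAP_NONE = 0      # 0 = no mapping
PID_MAP_NON_ROOT = 1  # 1 = map for non-root only
PID_MAP_ALL = 2       # 2 = map for all (the stated default)

def pid_mapping_applies(mode, uid):
    """Return True if /proc PID mapping would apply to a user under `mode`."""
    if mode == PID_MAP_ALL:
        return True
    if mode == PID_MAP_NON_ROOT:
        return uid != 0  # assumption: root is uid 0
    if mode == PID_MAP_NONE:
        return False
    raise ValueError(f"unknown proc_pid_map value: {mode}")
```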
From: Erik A. H. <er...@he...> - 2002-02-06 21:23:22

On Wed, Feb 06, 2002 at 03:47:25PM -0500, Nicholas Henke wrote:
> 3.1.6

Hrm.. Ok. Here's something quick and easy to try with bpsh.c:

     /* need timeouts for dead nodes and the like */
     if (late_connections) {
-        tmo.tv_sec = 0;
-        tmo.tv_usec = 50000;
+        tmo.tv_sec = 5;
+        tmo.tv_usec = 0;
     } else {
         tmo.tv_sec = 300; /* completely arbitrary */
         tmo.tv_usec = 0;
     }

That's just cranking up one of the connection timeouts for processes that I've received SIGCHLD for. I don't *think* it should make a big difference, but I've seen too much weirdness out of TCP recently not to try it.

An strace trace of it screwing up might be useful, but strace on my alphas here seems to get confused by all the SIGCHLDs showing up at once.

- Erik
--
Erik Arjan Hendriks Printed On 100 Percent Recycled Electrons
er...@he... Contents may settle during shipment

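
The effect of the patched values can be illustrated outside bpsh. This Python sketch is an analogy only: it assumes the timeout is used for a select()-style wait on a connection, which the struct timeval fields in the diff suggest but the message does not state. The function names are hypothetical; the 5 second and 300 second values come from the diff.

```python
import select
import socket

def connection_timeout(late_connections):
    """Timeout in seconds, mirroring the patched values from the diff."""
    return 5.0 if late_connections else 300.0

def wait_readable(sock, late_connections):
    """Block until `sock` has data to read or the chosen timeout expires."""
    timeout = connection_timeout(late_connections)
    readable, _, _ = select.select([sock], [], [], timeout)
    return bool(readable)
```

With the original 50 ms value, a reply that arrived slightly late would be treated as missing; with 5 seconds, the same reply would still be picked up.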
From: Nicholas H. <he...@se...> - 2002-02-06 20:47:38

3.1.6

Nic

On Wed, 6 Feb 2002, Erik Arjan Hendriks wrote:

> On Wed, Feb 06, 2002 at 03:34:28PM -0500, Nicholas Henke wrote:
> > Hey Erik --
> > Here is a report from our sysadmin. Anything I can do to help
> > track this one down?
>
> What version of bpsh are you using? (bpsh -v) That looks just like an
> IO forwarding problem I thought I fixed in bproc 3.1.3.
>
> I know I used to see that on the machines here but I don't anymore.
>
> - Erik
>
> > [root@admin root]# bpsh -a echo hi|wc -l
> > 45
> > [root@admin root]# bpsh -a echo hi|wc -l
> > 64
> > [root@admin root]# bpsh -a echo hi|wc -l
> > 64
> > [root@admin root]# bpsh -a echo hi|wc -l
> > 44
>

--
Nicholas Henke
Undergraduate - Engineering 2002
--
Senior Architect and Developer
Liniac Project - University of Pennsylvania
http://clubmask.sourceforge.net
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There's nothing like good food, good beer, and a bad girl.

From: Nicholas H. <he...@se...> - 2002-02-06 20:47:13

More info on last bug--

---------- Forwarded message ----------
Date: Wed, 6 Feb 2002 15:44:43 -0500
From: Daniel Widyono <wi...@ci...>
To: Daniel Widyono <wi...@ci...>
Cc: Nicholas Henke <he...@se...>
Subject: Re: bpsh bug

Nic, more info. I've selected different subsets, i.e.

[root@admin root]# bpsh 0-31 hostname|wc -l
32
[root@admin root]# bpsh 0-42 hostname|wc -l
43
[root@admin root]# bpsh 32-63 hostname|wc -l
32
[root@admin root]# bpsh 21-63 hostname|wc -l
43

Those above consistently worked fine. When we get to the 44th node is when things go awry (it takes longer for bpsh to return; the hang is between the return from the 43rd node and the return from the 44th node):

[root@admin root]# bpsh 0-43 hostname|wc -l
44
[root@admin root]# bpsh 0-43 hostname|wc -l
43
[root@admin root]# bpsh 20-63 hostname|wc -l
44
[root@admin root]# bpsh 20-63 hostname|wc -l
43

It looks like there's some bound being hit between using 43 nodes and using 44 nodes. I think it might be on the bproc or clubmask side (if bproc communicates with clubmask somehow), since a simple sleep 5 && echo script works fine (bpsh -a runs on all and returns all nodes); therefore it looks like a timing issue on the server end. Of course, not having read the source code or actually debugged it, this is pure conjecture.

Dan W.

On Wed, Feb 06, 2002 at 03:31:25PM -0500, Daniel Widyono wrote:
> Just rebooted all clients, diagnose -n now reports all 64 idle. bpstat shows
> all 64 up. Here is some sample output:
>
> [root@admin root]# bpsh -a echo hi|wc -l
> 45
> [root@admin root]# bpsh -a echo hi|wc -l
> 64
> [root@admin root]# bpsh -a echo hi|wc -l
> 64
> [root@admin root]# bpsh -a echo hi|wc -l
> 44
>
> bpsh -a hostname yielded 44 replies once: I get node0 through node42 in
> order, then node63. By the way, bpsh -a hostname when showing all nodes,
> shows nodes 43 through 63 out of order. Wacky? Did those hosts get inserted
> into the database incorrectly, get their IP addresses in the wrong order, or
> something?
>
> Dan W.
>
> On Wed, Feb 06, 2002 at 02:34:32PM -0500, Nicholas Henke wrote:
> > Could you redirect the output to a file and send it to me so that I could
> > send it to Erik as a bug report.
>
> --
> -- Daniel Widyono http://www.cis.upenn.edu/~widyono
> -- Linux Cluster Group, CIS Dept., SEAS, University of Pennsylvania
> -- Mail: Rm 556, CIS Dept 200 S 33rd St Philadelphia, PA 19104

--
-- Daniel Widyono http://www.cis.upenn.edu/~widyono
-- Linux Cluster Group, CIS Dept., SEAS, University of Pennsylvania
-- Mail: Rm 556, CIS Dept 200 S 33rd St Philadelphia, PA 19104

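
Counting lines with wc -l shows that replies are missing but not which nodes dropped out. A small Python sketch along these lines could name them. It assumes hostnames follow the nodeN pattern Dan mentions (node0 through node63); the function name missing_nodes is hypothetical.

```python
import re

def missing_nodes(first, last, hostname_lines):
    """Return the sorted node numbers in [first, last] that sent no reply."""
    seen = set()
    for line in hostname_lines:
        match = re.fullmatch(r"node(\d+)", line.strip())
        if match:
            seen.add(int(match.group(1)))
    return sorted(set(range(first, last + 1)) - seen)
```

Feeding it the captured lines of a run such as bpsh -a hostname (for example via subprocess.run(["bpsh", "-a", "hostname"], capture_output=True, text=True).stdout.splitlines()) would show whether the lost replies always come from the same nodes or vary from run to run.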
From: Erik A. H. <er...@he...> - 2002-02-06 20:41:25

On Wed, Feb 06, 2002 at 03:34:28PM -0500, Nicholas Henke wrote:
> Hey Erik --
> Here is a report from our sysadmin. Anything I can do to help
> track this one down?

What version of bpsh are you using? (bpsh -v) That looks just like an IO forwarding problem I thought I fixed in bproc 3.1.3.

I know I used to see that on the machines here but I don't anymore.

- Erik

> [root@admin root]# bpsh -a echo hi|wc -l
> 45
> [root@admin root]# bpsh -a echo hi|wc -l
> 64
> [root@admin root]# bpsh -a echo hi|wc -l
> 64
> [root@admin root]# bpsh -a echo hi|wc -l
> 44

From: Nicholas H. <he...@se...> - 2002-02-06 20:34:40

Hey Erik --
Here is a report from our sysadmin. Anything I can do to help track this one down?

Nic

---------- Forwarded message ----------
Date: Wed, 6 Feb 2002 15:31:25 -0500
From: Daniel Widyono <wi...@ci...>
To: Nicholas Henke <he...@se...>
Cc: Daniel Widyono <wi...@ci...>
Subject: Re: bpsh bug

Just rebooted all clients, diagnose -n now reports all 64 idle. bpstat shows all 64 up. Here is some sample output:

[root@admin root]# bpsh -a echo hi|wc -l
45
[root@admin root]# bpsh -a echo hi|wc -l
64
[root@admin root]# bpsh -a echo hi|wc -l
64
[root@admin root]# bpsh -a echo hi|wc -l
44

bpsh -a hostname yielded 44 replies once: I get node0 through node42 in order, then node63. By the way, bpsh -a hostname when showing all nodes, shows nodes 43 through 63 out of order. Wacky? Did those hosts get inserted into the database incorrectly, get their IP addresses in the wrong order, or something?

Dan W.

On Wed, Feb 06, 2002 at 02:34:32PM -0500, Nicholas Henke wrote:
> Could you redirect the output to a file and send it to me so that I could
> send it to Erik as a bug report.

--
-- Daniel Widyono http://www.cis.upenn.edu/~widyono
-- Linux Cluster Group, CIS Dept., SEAS, University of Pennsylvania
-- Mail: Rm 556, CIS Dept 200 S 33rd St Philadelphia, PA 19104

From: Erik A. H. <er...@he...> - 2002-02-06 16:43:39

On Wed, Feb 06, 2002 at 01:40:13AM -0500, henken wrote:
> Well this is a new one:
> Feb 5 21:54:32 admin /usr/sbin/bpmaster: write(ghost): missing process
> for message type 15 req; to=3,10985 from=1,10985 result=11
> Feb 5 22:11:53 admin /usr/sbin/bpmaster: write(ghost): missing process
> for message type 15 req; to=3,5616 from=1,5616 result=11
> Feb 5 22:17:40 admin /usr/sbin/bpmaster: write(ghost): missing process
> for message type 15 req; to=3,13421 from=1,13421 result=0
> Feb 5 23:50:04 admin /usr/sbin/bpmaster: write(ghost): missing process
> for message type 15 req; to=3,6783 from=1,6783 result=11
> Feb 5 23:53:01 admin /usr/sbin/bpmaster: write(ghost): missing process
> for message type 15 req; to=3,12550 from=1,12550 result=11
> Feb 6 00:04:19 admin /usr/sbin/bpmaster: write(ghost): missing process
> for message type 15 req; to=3,22287 from=1,22287 result=11
> Feb 6 00:12:47 admin /usr/sbin/bpmaster: write(ghost): missing process
> for message type 15 req; to=3,29342 from=1,29342 result=11

This looks to me like the moves are failing somewhere. These are EXIT messages again. Most of them have died with SIGSEGV. Are there any messages in the kernel ring buffer on the remote machine? I suspect there's some confusion going on about whether or not the moves are failing. A message trace showing the move and exit requests and responses would be handy.

Also, are you able to reproduce this reasonably easily? If so, how?

I half expect this to be some kind of vmadump failure that's not getting reported quite right.

- Erik
--
Erik Arjan Hendriks Printed On 100 Percent Recycled Electrons
er...@he... Contents may settle during shipment

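
Erik reads result=11 as death by SIGSEGV and result=0 as a clean exit. One way to make that reading mechanical is to treat the result field as a Unix wait status, as in the Python sketch below. That interpretation is an assumption drawn from Erik's comments, not from the BProc source, and the function name is hypothetical. The os.WIF* helpers are only available on Unix-like systems.

```python
import os
import signal

def describe_result(status):
    """Describe a result value, assuming it is encoded like a wait status."""
    if os.WIFEXITED(status):
        return f"exited with code {os.WEXITSTATUS(status)}"
    if os.WIFSIGNALED(status):
        signum = os.WTERMSIG(status)
        return f"killed by {signal.Signals(signum).name}"
    return f"unrecognized status {status}"
```

Under that assumption, six of the seven log lines quoted in this message describe processes killed by SIGSEGV, and the one with result=0 exited normally.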
From: Erik A. H. <er...@he...> - 2002-02-06 16:27:52

On Tue, Feb 05, 2002 at 08:15:51PM -0500, henken wrote:
> We are in the middle of having fun dealing with a unique request. We have
> a group that wants to be able to execute a job on a single cpu, but the
> entire cluster is made up of duals. Is there a way to configure bproc in
> such a way that it would allow multiple users onto the machine at once?
> The only way we have seen to do this would be to create groups on the fly,
> and add the user to that group before allowing their job to execute, but
> this seems really hackish.

I'm afraid not. You'd need some kind of ACL to do that without creating a special group.

- Erik
--
Erik Arjan Hendriks Printed On 100 Percent Recycled Electrons
er...@he... Contents may settle during shipment

From: henken <he...@se...> - 2002-02-06 06:40:22

Well this is a new one:

Feb 5 21:54:32 admin /usr/sbin/bpmaster: write(ghost): missing process for message type 15 req; to=3,10985 from=1,10985 result=11
Feb 5 22:11:53 admin /usr/sbin/bpmaster: write(ghost): missing process for message type 15 req; to=3,5616 from=1,5616 result=11
Feb 5 22:17:40 admin /usr/sbin/bpmaster: write(ghost): missing process for message type 15 req; to=3,13421 from=1,13421 result=0
Feb 5 23:50:04 admin /usr/sbin/bpmaster: write(ghost): missing process for message type 15 req; to=3,6783 from=1,6783 result=11
Feb 5 23:53:01 admin /usr/sbin/bpmaster: write(ghost): missing process for message type 15 req; to=3,12550 from=1,12550 result=11
Feb 6 00:04:19 admin /usr/sbin/bpmaster: write(ghost): missing process for message type 15 req; to=3,22287 from=1,22287 result=11
Feb 6 00:12:47 admin /usr/sbin/bpmaster: write(ghost): missing process for message type 15 req; to=3,29342 from=1,29342 result=11

and the output from ps -jx

 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
14593 14596 14596 14596 pts/4    24500 S    27659   0:00 -bash
14840 14844 14844 14844 pts/6    17876 S    27659   0:00 -bash
14844 17876 17876 14844 pts/6    17876 S    27659   4:49 top
    1  1153   714   714 ?           -1 S    27659   0:02 /bin/bash /usr/local/cl
    1  1154   714   714 ?           -1 S    27659   0:09 /bin/bash /usr/local/cl
    1  1204   714   714 ?           -1 S    27659   0:09 /bin/bash /usr/local/cl
    1  1610   714   714 ?           -1 S    27659   0:00 /bin/bash /usr/local/cl
    1  1618   714   714 ?           -1 S    27659   0:08 /bin/bash /usr/local/cl
    1  1619   714   714 ?           -1 S    27659   0:11 /bin/bash /usr/local/cl
    1  1770   714   714 ?           -1 S    27659   0:01 /bin/bash /usr/local/cl
    1  1893   714   714 ?           -1 S    27659   0:11 /bin/bash /usr/local/cl
    1  1949   714   714 ?           -1 S    27659   0:00 /bin/bash /usr/local/cl
 1949 10718   714   714 ?           -1 S    27659   0:00 bpsh 9 /home/henken/job
10718 10985   714   714 ?           -1 S    27659   0:00 /home/henken/jobs/bin/n
 1610 27346   714   714 ?           -1 D    27659   0:00 bpsh 28 /home/henken/jo
 1770  5379   714   714 ?           -1 S    27659   0:00 bpsh 0 /home/henken/job
 5379  5616   714   714 ?           -1 S    27659   0:00 /home/henken/jobs/bin/n
 1153 13243   714   714 ?           -1 S    27659   0:00 bpsh 42 /home/henken/jo
13243 13421   714   714 ?           -1 RW   27659   0:00 [noop]
 1618 13048   714   714 ?           -1 S    27659   0:00 bpsh 29 /home/henken/jo
13048 13260   714   714 ?           -1 RW   27659   0:00 [noop]
 1154  6474   714   714 ?           -1 S    27659   0:00 bpsh 43 /home/henken/jo
 6474  6783   714   714 ?           -1 S    27659   0:00 /home/henken/jobs/bin/n
 1204 12251   714   714 ?           -1 S    27659   0:00 bpsh 33 /home/henken/jo
12251 12550   714   714 ?           -1 S    27659   0:00 /home/henken/jobs/bin/n
 1893 22100   714   714 ?           -1 S    27659   0:00 bpsh 7 /home/henken/job
22100 22287   714   714 ?           -1 S    27659   0:00 /home/henken/jobs/bin/n
 1619 29166   714   714 ?           -1 S    27659   0:00 bpsh 1 /home/henken/job
29166 29342   714   714 ?           -1 S    27659   0:00 /home/henken/jobs/bin/n
14596 24500 24500 14596 pts/4    24500 R    27659   0:00 ps -jx

What other info could you use?

Nic

--
Nicholas Henke
Undergraduate - Engineering 2002
--
Senior Architect and Developer
Liniac Project - University of Pennsylvania
http://clubmask.sourceforge.net
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There's nothing like good food, good beer, and a bad girl.

From: henken <he...@se...> - 2002-02-06 01:16:04

We are in the middle of having fun dealing with a unique request. We have a group that wants to be able to execute a job on a single CPU, but the entire cluster is made up of duals. Is there a way to configure bproc so that it would allow multiple users onto the machine at once? The only way we have seen to do this would be to create groups on the fly and add the user to that group before allowing their job to execute, but this seems really hackish.

Thanks!

Nic

--
Nicholas Henke
Undergraduate - Engineering 2002
--
Senior Architect and Developer
Liniac Project - University of Pennsylvania
http://clubmask.sourceforge.net
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There's nothing like good food, good beer, and a bad girl.

From: <ne...@nn...> - 2002-02-05 20:47:23

Erik,

Thanks for your previous prompt reply.

I reloaded RedHat 7.2 fresh with the same options as before. I mounted the CD and did the "rpm -Uvh *" from the RPMS/i386 directory. The modules.dep message came up as expected.

I decided to try to compile a custom kernel at this point, so I went to /usr/src and created a link for "linux" to the linux-2.4.13-lanl.3. I changed into linux and did a "make menuconfig". I turned on show experimental, turned off smp and turned on Beowulf Distributed Process Space. I exited with save and did make clean, make dep and make all. Everything looked o.k.

I then did a make modules, and during the compile of drivers/net/dummy.c there were a whole host of errors. The last 4 on the screen said "boot_cpu_data_R0657d037 undeclared".

I then changed to /mnt/cdrom/RPMS/athlon and did an "rpm -Uvh *" and got another screen full of errors like "file ...sunrpc.o from install of ... conflicts with file from ..."

I then did an "rpm -Uvh --force *" from RPMS/athlon and got 12 errors like "unresolved symbols __down_write_Rc3c866ab, ...". I also got two messages saying that "img"s already existed and the 2.4.7-10/modules.dep not found message.

I then went back to /usr/src/linux and reran the "make modules". I got the same messages as before.

I decided to try to boot the machine. Grub now had six lines (2 smp, 2 lanl.3, 1 beoboot and one 3custom). I tried both lanl.3's and the custom, but the machine would not boot.

Thanks,
Neal Nelson

From: Erik A. H. <er...@he...> - 2002-02-05 17:04:18

On Tue, Feb 05, 2002 at 04:33:04PM +0000, ne...@nn... wrote:
> I am trying to use the software on the clustermatic fall 2001 CD
> and am having some problems. I install a fresh RedHat 7.2 x86 on
> an machine with a Tyan Thunder MB and a single Athlon XP 1600+
> cpu. I specify "custom, everything, grub, ext3, kde, text boot".
> I have hda1 as /boot, an unused hda2 as /usr9, extended as /hda3,
> two swaps as /hda5 and /hda6, and / at about 19 gig as /hda7.
>
> I mount the cdrom, change into the RPMS/athlon diretory and type
> "rpm -Uvh *". The rpm issues a message for bproc-modules saying
> that it can't open 2.4.7-10/modules.dep. The 2.4.7-10 entries have
> apparently been replaced by 2.4.13-lanl.3 entries. Should I worry
> about this?

I don't think so. That modules directory should be removed by the upgrade and that would cause that error.

> There are many more files in the RPMS/i386 directory than in the
> RPMS/Athlon directory. Should I install some/all RPMS's in i386?

The only things in the i686 and athlon directories are packages which have some differentiation based on architecture. That's really just the kernel and anything with dependencies on the kernel.

> Should I install all in i386 and then the 3 from Athlon on top
> of them?

Yes.

> I have both Pentium and Athlon machines here. Can I install/build
> for some binary that will work on both?

The i386 kernel *should* work on just about anything. I've seen cases where it won't (highmem + athlon doesn't work with the x86 kernel). It's probably possible to run an athlon kernel on one machine and have it work with another box running the i686 kernel. The issue there is the differing FPU types. You might get an oops migrating with that version of BProc. This has been fixed but only recently.

> When I install just the 3 files from Athlon I do not get a ..beoboot
> entry on the grub boot screen and bpstat is not found anywhere on
> the machine. When I install all the files from i386 I get a ..beoboot
> line from grub but when I try to use that option the machine will
> not boot. When i386 is installed and I boot without the ..beoboot
> option and try to run "/etc/rc.d/init.d/beowulf start" it fails.

The beoboot kernel is ONLY for phase 1 boot images. It's a stripped-down kernel good only for booting slaves. It doesn't include things like disk or ext2 support. You shouldn't see it on the grub screen.

> The README talks about creating phase 1 boot images. Are those for
> the slave nodes? I am trying to load and configure the master node.
> I plan to start by booting the slave nodes from the distribution CD.

The phase 1 images are only needed if you have to boot from a floppy.

> I have installed all from i386 and then tried to build a custom
> kernel. "make all" seemed to work. "make modules" failed.

*shrug* You're going to have to elaborate on "failed".

- Erik
--
Erik Arjan Hendriks Printed On 100 Percent Recycled Electrons
er...@he... Contents may settle during shipment

From: <ne...@nn...> - 2002-02-05 16:36:32

I am trying to use the software on the clustermatic fall 2001 CD and am having some problems. I install a fresh RedHat 7.2 x86 on a machine with a Tyan Thunder MB and a single Athlon XP 1600+ CPU. I specify "custom, everything, grub, ext3, kde, text boot". I have hda1 as /boot, an unused hda2 as /usr9, extended as /hda3, two swaps as /hda5 and /hda6, and / at about 19 gig as /hda7.

I mount the cdrom, change into the RPMS/athlon directory and type "rpm -Uvh *". The rpm issues a message for bproc-modules saying that it can't open 2.4.7-10/modules.dep. The 2.4.7-10 entries have apparently been replaced by 2.4.13-lanl.3 entries. Should I worry about this?

There are many more files in the RPMS/i386 directory than in the RPMS/Athlon directory. Should I install some/all RPMs in i386? Should I install all in i386 and then the 3 from Athlon on top of them? Should I install some from i386 and some/all from Athlon? If so, which ones?

I have both Pentium and Athlon machines here. Can I install/build for some binary that will work on both?

When I install just the 3 files from Athlon I do not get a ..beoboot entry on the grub boot screen and bpstat is not found anywhere on the machine. When I install all the files from i386 I get a ..beoboot line from grub, but when I try to use that option the machine will not boot. When i386 is installed and I boot without the ..beoboot option and try to run "/etc/rc.d/init.d/beowulf start" it fails.

The README talks about creating phase 1 boot images. Are those for the slave nodes? I am trying to load and configure the master node. I plan to start by booting the slave nodes from the distribution CD.

I have installed all from i386 and then tried to build a custom kernel. "make all" seemed to work. "make modules" failed.

Thanks in advance for your help.

Neal Nelson

From: Erik A. H. <er...@he...> - 2002-02-04 23:28:17

On Sun, Feb 03, 2002 at 12:37:45PM -0500, henken wrote:
> Hey--
> Has there been any luck getting your mpich patches that enable the
> use of bproc released? We would be really interested in that here at Penn.

It's just slow. I'm waiting for some paperwork to get signed by the powers that be. They have already decided that it doesn't contain classified information or unclassified nuclear information. Now we need to jump through the hoops required to say "no, this has no commercial value."

> BTW -- 3.1.6 seesm to be really stable -- I am going to pound on it
> further this week, but most likely it will be the version we use as the
> base for the 0.5 release of Clubmask due out Friday.

Cool.

- Erik
--
Erik Arjan Hendriks Printed On 100 Percent Recycled Electrons
er...@he... Contents may settle during shipment

From: henken <he...@se...> - 2002-02-03 17:37:51

Hey--
Has there been any luck getting your mpich patches that enable the use of bproc released? We would be really interested in that here at Penn.

Thanks -- Nic

BTW -- 3.1.6 seems to be really stable -- I am going to pound on it further this week, but most likely it will be the version we use as the base for the 0.5 release of Clubmask due out Friday.

--
Nicholas Henke
Undergraduate - Engineering 2002
--
Senior Architect and Developer
Liniac Project - University of Pennsylvania
http://clubmask.sourceforge.net
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There's nothing like good food, good beer, and a bad girl.

From: Erik A. H. <er...@he...> - 2002-01-28 22:05:40

On Sat, Jan 26, 2002 at 12:05:38AM -0500, Nicholas Henke wrote:
> Here is the info I could find:
> from /var/log/messages
> Feb 6 23:17:37 master bpmaster: write(ghost): missing process for message type 14 req; to=3,23110 from=1,23110 result=0
> Feb 7 01:47:06 master bpmaster: write(ghost): missing process for message type 14 req; to=3,16070 from=1,16070 result=0
>
> I have attached the greps for each of the pids -- they are quite large as
> the pid rolled around.
>
> BTW -- thanks for all of the help and work, it has made the cluster sooo
> much more stable.

Thanks. Unfortunately, they don't show anything out of the ordinary.

In the process of hunting down this stuff, I found a problem which may be the cause of at least some of what you're seeing. It turns out that VMADump's stack frames are a bit on the porky side and were involved in causing a stack overflow in kernel space. In kernel space, every task gets two pages (8k) allocated to it. The bottom is the task_struct and the stack grows down from the top. These stacks don't grow dynamically like stacks in user space. If you overflow it, you start clobbering task_struct, starting with the BProc stuff which lives at the end of that structure. This is immediately followed by all hell breaking loose.

I believe my stack was overflowing when there were enough interrupts (more than one, I think) during a move. Each one pushed some more stuff on the stack. The interesting thing was that how quickly it happened depended on which network driver I was using. 3c59x did it faster than tulip.

If the overflow was lucky enough to stick a zero in the bproc.ghost entry in task_struct, then those messages about a missing ghost during exit make sense. I got junk in there which caused ps to oops and then hang the whole system.

The fix was pretty straightforward: use less stack space. I think I shaved off about 1k. I haven't been running long enough here to say the problem is gone, but it's probably worth trying. It's all in BProc 3.1.6, which I just threw up on SourceForge.

- Erik

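
The layout Erik describes lends itself to a bit of arithmetic. The Python sketch below models it under stated assumptions: the 8k region (two 4k pages) comes from the message, while the task_struct size and the per-frame sizes used in the example are invented purely for illustration and are not measured values from any kernel.

```python
PAGE_SIZE = 4096
KERNEL_STACK_PAGES = 2  # "two pages (8k)" per task, per the message

def stack_headroom(task_struct_size, frame_sizes):
    """Bytes left before the stack reaches task_struct.

    task_struct sits at the bottom of the region and the stack grows down
    from the top, so a negative result means the stack has overwritten the
    end of task_struct.
    """
    region = PAGE_SIZE * KERNEL_STACK_PAGES
    return region - task_struct_size - sum(frame_sizes)

# Hypothetical numbers: a 1600-byte task_struct, a 5000-byte call chain for
# the move, and nested interrupts adding 900 bytes each.
before_fix = stack_headroom(1600, [5000, 900, 900])  # negative: overflow
after_fix = stack_headroom(1600, [4000, 900, 900])   # about 1k shaved off
```

With these made-up sizes, the first case overflows by 208 bytes and the second leaves 792 bytes free, which shows how trimming roughly 1k from the frames can be the difference between clobbering task_struct and not.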
From: Erik A. H. <er...@he...> - 2002-01-28 21:47:09

BProc 3.1.6 is up on sourceforge.

http://sourceforge.net/project/showfiles.php?group_id=24453&release_id=72421

3.1.6 ---------------------------------------------------------------------

Transparent Remote Exec

This release adds a facility to allow sys_execve on slave nodes to transparently use their ghost to provide a binary image. Essentially, this looks like bproc_rexec(-1, ...) followed by bproc_move(original_slave), except that the process never really leaves the slave and therefore maintains stuff like its current working directory and open files.

This is a big step toward being able to do script-like things on slave nodes. Shells don't quite work like they do on the front end since it's not possible to walk a path and find a binary before exec()ing it. Anything with a full path name on the binary works though.

Hot Reconnect for Slaves

The hot reconnect feature allows slaves to reconnect to the front end without resetting their state. This is *not* a fail-over feature. The old connection needs to be shut down cleanly to avoid loss of messages in flight. Hot reconnect is intended to allow slaves to switch networks after startup.

Bugs

There's also the usual round of bug fixes. See the change log for more info on those.

Changes from 3.1.5 to 3.1.6

* Added bproc_execve which allows processes on slave nodes to perform execve on the front end machine using their ghost process.
* Added a hook in sys_execve to transparently use the ghost for execve if the local execve fails. This behavior can be switched on and off via /proc/sys/bproc/execve_hook.
* Updated kernel patch for new features. Included patch is against 2.4.17.
* Added hot reconnect for slave daemons. The slave can now reconnect to the master at runtime w/o affecting the slave's state. (bpctl --reconnect)
* Added another work-around for another Linux TCP bug. TCP sure does seem awfully broken lately.
* Added "async" versions of the cache management calls.
* Fixed bproc.o so that sysctl table registration can fail w/o causing insmod to fail. This is in case the kernel doesn't support sysctl.
* Fixed master daemon bug which resulted in failure to note a process's new location during a move and resulted in message loops between the kernel and the master daemon.
* Fixed a security hole involving ptrace and setuid binaries on slave nodes.
* Fixed a security hole that caused the master to default to allowing connections from unreserved ports.
* Fixed BProc's CLONE_PARENT, CLONE_THREAD, CLONE_PTRACE handling to some extent. It's still got a few race conditions but at least it does something approaching "correct" at this point.
* Reworked VMADump to reduce its footprint on the caller's kernel stack. Kernel stack overflows have been observed here.

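
The release notes say that a shell on a slave node cannot walk a path to find a binary, but that anything with a full path name works. One way to live within that limit, sketched here in Python as an assumption rather than anything BProc ships, is to resolve the command to an absolute path on the front end before handing it to bpsh. The helper name with_full_path is hypothetical.

```python
import os
import shutil

def with_full_path(argv, search_path=None):
    """Return a copy of argv whose first element is an absolute path.

    The lookup happens on the machine running this code (the front end),
    using PATH unless `search_path` is given.
    """
    if not argv:
        raise ValueError("empty command")
    if os.path.isabs(argv[0]):
        return list(argv)
    resolved = shutil.which(argv[0], path=search_path)
    if resolved is None:
        raise FileNotFoundError(argv[0])
    return [os.path.abspath(resolved)] + list(argv[1:])
```

A caller could then build a command line such as ["bpsh", "9"] + with_full_path(["hostname"]), so the slave never has to search a path itself.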
From: Erik A. H. <er...@he...> - 2002-01-25 17:47:44

On Wed, Jan 23, 2002 at 04:13:33PM -0500, henken wrote:
> Here is the latest error message. I am getting this with 3.1.5 + patch.
> Feb 4 17:55:30 master /usr/sbin/bpmaster: write(ghost): missing process
> for message type 14 req; to=3,7576 from=1,7576 result=0
>
> Here is the relevant ps output
> [henken@master henken]$ ps -jx
> PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND
> 22401 7573 22385 22385 ? -1 S 27659 0:00 bpsh 0
> /home/henken/cvs/jobs/bin/noop
> 7573 7576 22385 22385 ? -1 RW 27659 0:00 [noop]
> 20876 7624 7624 20826 pts/2 7624 R 27659 0:00 ps -jx
>
>
> You can see that the messag is tryin to get to the process that bpsh
> started on the remote node.

Actually, it's an EXIT message from the remote process back to the ghost on the front end.

    message type 14 req; to=3,7576 from=1,7576 result=0

    14   = EXIT
    3    = route to ghost
    7576 = pid of ghost
    1    = route to real (from real in this case)
    7576 = pid of real process
    0    = result - in this case indicating exit status 0 - normal exit.

That's perfectly normal since "noop" exits on the remote node and needs to tell the ghost to do the same. The question is why the process on the front end doesn't seem to be a ghost. Looking at your ps output, it does look like it's probably ghosted since it has no memory space swapped in. It's in the W state. Does "noop" stay in that state pretty much forever? What's in /proc/<pid>/maps for noop at this point?

I really wish I could see a message trace for this. A trace of just the move and exit messages would be pretty useful here if possible. Hopefully that would get the message bulk down far enough to be manageable. You can filter by doing something like:

    bpmaster -d -m - | egrep 'move|exit' > file

Basically, I want to know what the result on the move response for "noop" was in this case. Either it's non-zero and the remote went on anyway for some reason, or there's no error and the ghost screwed up somehow. I don't think it should be a problem, but I could imagine this being caused by the exit and move response getting out of order.

- Erik
--
Erik Arjan Hendriks Printed On 100 Percent Recycled Electrons
er...@he... Contents may settle during shipment

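
Erik's field-by-field reading of the log line can be turned into a small parser for scanning /var/log/messages. This Python sketch is based only on the line format quoted in this thread. The route labels (1 = real, 3 = ghost) come from Erik's explanation; the function name is hypothetical, and message type numbers are left undecoded because the thread shows EXIT as type 14 in one place and type 15 in another.

```python
import re

# Route numbers as explained in the message; other values are unknown here.
ROUTES = {1: "real", 3: "ghost"}

LOG_RE = re.compile(
    r"message type (?P<type>\d+) req; "
    r"to=(?P<to_route>\d+),(?P<to_pid>\d+) "
    r"from=(?P<from_route>\d+),(?P<from_pid>\d+) "
    r"result=(?P<result>-?\d+)"
)

def parse_ghost_message(line):
    """Parse a bpmaster 'missing process' log line; return None if no match."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    f = {key: int(value) for key, value in match.groupdict().items()}
    return {
        "type": f["type"],
        "to": (ROUTES.get(f["to_route"], "unknown"), f["to_pid"]),
        "from": (ROUTES.get(f["from_route"], "unknown"), f["from_pid"]),
        "result": f["result"],
    }
```

Applied to the line quoted in this message, it yields type 14, destination ("ghost", 7576), source ("real", 7576) and result 0, matching the breakdown Erik gives by hand.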
From: henken <he...@se...> - 2002-01-23 21:15:44
|
Here is the latest error message. I am getting this with 3.1.5 + patch.

Feb 4 17:55:30 master /usr/sbin/bpmaster: write(ghost): missing process for message type 14 req; to=3,7576 from=1,7576 result=0

Here is the relevant ps output

[henken@master henken]$ ps -jx
 PPID   PID  PGID   SID TTY   TPGID STAT   UID  TIME COMMAND
22401  7573 22385 22385 ?        -1 S    27659  0:00 bpsh 0 /home/henken/cvs/jobs/bin/noop
 7573  7576 22385 22385 ?        -1 RW   27659  0:00 [noop]
20876  7624  7624 20826 pts/2  7624 R    27659  0:00 ps -jx

You can see that the message is trying to get to the process that bpsh started on the remote node.

Nic
--
Nicholas Henke
Undergraduate - Engineering 2002
--
Senior Architect and Developer
Liniac Project - University of Pennsylvania
http://clubmask.sourceforge.net
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There's nothing like good food, good beer, and a bad girl. |
From: Erik A. H. <er...@he...> - 2002-01-19 00:43:34
|
On Fri, Jan 18, 2002 at 06:37:26PM -0500, henken wrote:
> 2 questions --
>
> 1 -- is it still necessary to call bproc_init?

Nope, that's a no-op that was left in there for compatibility with REALLY old code.

> 2 -- are there any changes to the calls from the docs on the web?

Nope. If there are differences, they're either bugs in the calls or the web page. There are a few typos on the web page that I haven't fixed yet.

- Erik
--
Erik Arjan Hendriks Printed On 100 Percent Recycled Electrons
er...@he... Contents may settle during shipment |
From: henken <he...@se...> - 2002-01-18 23:39:13
|
2 questions --

1 -- is it still necessary to call bproc_init?
2 -- are there any changes to the calls from the docs on the web?

Nic
--
Nicholas Henke
Undergraduate - Engineering 2002
--
Senior Architect and Developer
Liniac Project - University of Pennsylvania
http://clubmask.sourceforge.net
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There's nothing like good food, good beer, and a bad girl. |
From: Nicholas H. <he...@se...> - 2002-01-18 17:33:22
|
Yup -- didn't even notice that, but those are the pids that were hanging. I will try the patch as well -- it should only take a day to tell if it works :-)

Nic

On Fri, 18 Jan 2002, Erik Arjan Hendriks wrote:

> On Thu, Jan 17, 2002 at 08:37:52PM -0500, henken wrote:
> > I have tracked down an error message that shows up every time bpsh hangs:
> > Jan 29 12:57:53 master bpmaster: write(ghost): missing process for message
> > type 14 req; to=3,27978 from=1,27978 result=0
> > Jan 29 16:20:06 master bpmaster: write(ghost): missing process for message
> > type 14 req; to=3,31128 from=1,31128 result=11
>
> Do these PIDs you're seeing match the ones that the master daemon is
> blowing up on? I don't know how they'd be related but it's another
> data point.
>
> You might want to see if this fix to the master daemon's error
> handling helps you out. (apply to daemons/master.c) It's for a
> problem I ran across last week and could result in the master failing
> to note a new process's location. That could certainly lead to the
> above messages and maybe the other problem you're seeing.
>
> (This fix will be in 3.1.6)
>
> diff -u -r1.104 -r1.105
> --- master.c 2 Jan 2002 22:36:30 -0000 1.104
> +++ master.c 11 Jan 2002 03:05:45 -0000 1.105
> @@ -17,7 +17,7 @@
>   * along with this program; if not, write to the Free Software
>   * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
>   *
> - * $Id: master.c,v 1.104 2002/01/02 22:36:30 hendriks Exp $
> + * $Id: master.c,v 1.105 2002/01/11 03:05:45 hendriks Exp $
>   *-----------------------------------------------------------------------*/
>  #include <sys/types.h>
>  #include <sys/stat.h>
> @@ -1446,9 +1446,14 @@
>                  assoc->req = 0; /* clear outstanding request */
>              }
>          } else {
> -            if (req->req.fromtype == BPROC_ROUTE_REAL) {
> -                /* Don't ever generate responses to these in case of slave errors */
> +            if (req->req.fromtype == BPROC_ROUTE_REAL &&
> +                req->req.totype != BPROC_ROUTE_GHOST) {
> +                /* Don't make note of requests to ghosts... We will never
> +                 * have to generate an error response due to ghost
> +                 * disappearance. */
>                  switch(req->req.req) {
> +                /* Don't make note of these because we don't ever want
> +                 * to auto-generate responses to these messages */
>                  case BPROC_GET_STATUS:
>                  case BPROC_PARENT_EXIT:
>                      break;
>
>
> - Erik
>

--
Nicholas Henke
Undergraduate - Engineering 2002
--
Senior Architect and Developer
Liniac Project - University of Pennsylvania
http://clubmask.sourceforge.net
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There's nothing like good food, good beer, and a bad girl. |
From: Erik A. H. <er...@he...> - 2002-01-18 17:27:19
|
On Thu, Jan 17, 2002 at 08:37:52PM -0500, henken wrote:
> I have tracked down an error message that shows up every time bpsh hangs:
> Jan 29 12:57:53 master bpmaster: write(ghost): missing process for message
> type 14 req; to=3,27978 from=1,27978 result=0
> Jan 29 16:20:06 master bpmaster: write(ghost): missing process for message
> type 14 req; to=3,31128 from=1,31128 result=11

Do these PIDs you're seeing match the ones that the master daemon is blowing up on? I don't know how they'd be related but it's another data point.

You might want to see if this fix to the master daemon's error handling helps you out. (apply to daemons/master.c) It's for a problem I ran across last week and could result in the master failing to note a new process's location. That could certainly lead to the above messages and maybe the other problem you're seeing.

(This fix will be in 3.1.6)

diff -u -r1.104 -r1.105
--- master.c 2 Jan 2002 22:36:30 -0000 1.104
+++ master.c 11 Jan 2002 03:05:45 -0000 1.105
@@ -17,7 +17,7 @@
  * along with this program; if not, write to the Free Software
  * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
  *
- * $Id: master.c,v 1.104 2002/01/02 22:36:30 hendriks Exp $
+ * $Id: master.c,v 1.105 2002/01/11 03:05:45 hendriks Exp $
  *-----------------------------------------------------------------------*/
 #include <sys/types.h>
 #include <sys/stat.h>
@@ -1446,9 +1446,14 @@
                 assoc->req = 0; /* clear outstanding request */
             }
         } else {
-            if (req->req.fromtype == BPROC_ROUTE_REAL) {
-                /* Don't ever generate responses to these in case of slave errors */
+            if (req->req.fromtype == BPROC_ROUTE_REAL &&
+                req->req.totype != BPROC_ROUTE_GHOST) {
+                /* Don't make note of requests to ghosts... We will never
+                 * have to generate an error response due to ghost
+                 * disappearance. */
                 switch(req->req.req) {
+                /* Don't make note of these because we don't ever want
+                 * to auto-generate responses to these messages */
                 case BPROC_GET_STATUS:
                 case BPROC_PARENT_EXIT:
                     break;

- Erik
--
Erik Arjan Hendriks Printed On 100 Percent Recycled Electrons
er...@he... Contents may settle during shipment |
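To make the effect of the changed condition easier to see, here is a rough Python model of the guard before and after the patch. It is a sketch under several assumptions and is not code from BProc: the route values 1 and 3 are taken from the `to=3,... from=1,...` fields in the log lines (3 = ghost, 1 = real), request kinds are modelled as plain strings because their numeric values are not shown in the diff, and the `default:` branch of the `switch` is not visible in the hunk, so "noted" here only means "reaches whatever the default branch does", which the patch comments suggest is recording the outstanding request.

```python
# Assumed numeric values, inferred from the log fields quoted in this thread.
BPROC_ROUTE_REAL = 1
BPROC_ROUTE_GHOST = 3

# The two request kinds the switch skips explicitly (names from the diff).
NEVER_NOTED = {"BPROC_GET_STATUS", "BPROC_PARENT_EXIT"}


def noted_before_patch(fromtype: int, totype: int, req: str) -> bool:
    """Guard as in revision 1.104: only the sender's route type is checked."""
    return fromtype == BPROC_ROUTE_REAL and req not in NEVER_NOTED


def noted_after_patch(fromtype: int, totype: int, req: str) -> bool:
    """Guard as in revision 1.105: requests addressed to a ghost are skipped too."""
    return (fromtype == BPROC_ROUTE_REAL
            and totype != BPROC_ROUTE_GHOST
            and req not in NEVER_NOTED)


if __name__ == "__main__":
    # "EXIT" is a hypothetical label for message type 14, not a BProc constant name.
    exit_to_ghost = (BPROC_ROUTE_REAL, BPROC_ROUTE_GHOST, "EXIT")
    print(noted_before_patch(*exit_to_ghost))  # old guard lets it through
    print(noted_after_patch(*exit_to_ghost))   # new guard skips it
```

Under these assumptions, an exit message from a real process to its ghost (the case in the log lines above) passes the old guard but not the new one, while requests between other route types are treated the same by both.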