From: Nicholas H. <he...@se...> - 2003-11-13 22:01:23

On a sourceforge mirror near you:

Name    : Clubmask
Version : 0.6
Release : b1
Group   : Cluster Resource Management and Scheduling
Vendor  : Liniac Project, University of Pennsylvania
License : GPL-2
URL     : http://clubmask.sourceforge.net
Download: http://sourceforge.net/project/showfiles.php?group_id=1316&release_id=197383

What is Clubmask
------------------------------------------------------------------------------
Clubmask is a resource manager designed to allow Bproc-based clusters to enjoy the full scheduling power and configuration of the Maui HPC Scheduler. Clubmask uses a modified version of the Supermon resource monitoring software to gather resource information from the cluster nodes. This information is combined with job submission data and delivered to the Maui scheduler. Maui issues job control commands back to Clubmask, which then starts or stops the job scripts using the Bproc environment.

Clubmask also provides built-in support for a supermon2ganglia translator that allows a standard Ganglia web backend to contact supermon and get XML data that will display through the Ganglia web interface.

Clubmask is currently running on around 10 clusters, varying in size from 8 to 128 nodes, and has been tested up to 5000 jobs.

Notes/warnings on this release:
------------------------------------------------------------------------------
Before upgrading, please make sure to save your /etc/clubmask/clubmask.conf file, as it may get overwritten. There are a few new variables in clubmask.conf, so beware! To use the resource requests, you must be running the latest snapshot of Maui.

Changes since 0.5:
------------------------------------------------------------------------------
- Change the job name from the god-awful absolute timestamp to a more normal "string.number" format, where "string" is an arbitrary job name and "number" is the Nth time that the job name is being used, e.g. root.1, root.2, ...
- Fix cmnodesshknownhosts to get the -n information from the bproc node number that is given as the argument.
- Update to latest supermon APIs.
- Feature Request #790938: add 'cmsubmit -r <resid>' to run a job in a Maui reservation.
- Fixed bug #791396: make sure processes get killed in interactive jobs.
- Make sure bproc is running when starting resource_manager.
- Fix cmsubmit -h; it is now cleaner and easier to understand.
- Add support for resource requirements on the nodes. Swap, mem, disk, qos, reservation, and processors per node are supported now; see cmsubmit -h for more information.
- Add infrastructure for architecture, os, network, and arbitrary features as node resource requests. We do not get this information dynamically yet, so there is no need to let people muck with it.
- Add a supermon_state daemon to manage the node list for supermon; this keeps that logic out of resource_manager.
- Make sure there is at most one 'R' command in the pipeline for down nodes at any given time. There is no sense in asking nodes to revive if they have not responded to the last request yet.
- Clean up setup to perform RPM builds more cleanly.
- Split /etc/clubmask/clubmask.conf into /etc/clubmask/{system,clubmask}.conf so that variables that need user editing live in clubmask.conf and the rest of the system variables live in system.conf. This lets a user update to a newer version of Clubmask and just copy over the old clubmask.conf to restore their configuration.
- Migrate all docs from Docbook XML to LyX/LaTeX. All of the docs -- pdf, html single, and html multiple -- can be generated with a simple 'make' in the docs/ directory.
- Add --secret-key to setup.py args for building maui and clubmask with the same checksum key. This removes the need to edit setup.py when installing clubmask.

Links
-------------
Bproc:          http://bproc.sourceforge.net
Ganglia:        http://ganglia.sourceforge.net
Maui Scheduler: http://www.supercluster.org/maui
Supermon:       http://supermon.sourceforge.net

Cheers~
Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania

From: <er...@he...> - 2003-11-10 21:26:05

On Thu, Nov 06, 2003 at 10:27:37AM -0500, gor...@ph... wrote:
> Never saw this message before;
>
> bproc: iod: no daemon present to forward IO
>
> What does this message mean?

It means there's no daemon :) Normally the bproc master and slave daemons fork to create an I/O daemon. The purpose of that daemon is basically to copy I/O from a remote job to whatever file descriptor used to be on that job's stdout. It's a bit of a hack to make printf() work.

> Subsequent to this message, jobs which are rfork'd out to a slave node die
> as soon as they begin to produce output.
> Jobs which are bpsh'ed out work ok. bproc version 3.2.5 on kernel 2.4.21.

Are you seeing this on the master or slave? I presume the daemon is getting created and dying for some reason. Is there something in particular that you're doing to cause this problem? I haven't seen this one myself. My only guess as far as a cause is concerned would be resource exhaustion of some kind.

- Erik

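The forwarding Erik describes is essentially a copy loop: the I/O daemon reads whatever the remote job writes and pushes it to the descriptor that used to be the job's stdout on the front end. The following is a minimal illustrative sketch, not BProc's actual iod code; the descriptor names are assumptions.

```c
#include <unistd.h>

/* Minimal illustration of an I/O-forwarding loop (not the real bproc iod):
 * copy everything arriving from the remote job's connection to the file
 * descriptor that used to be the job's stdout on the front end.
 * 'remote_fd' and 'saved_stdout_fd' are hypothetical names. */
static int forward_io(int remote_fd, int saved_stdout_fd)
{
    char buf[4096];
    ssize_t n;

    while ((n = read(remote_fd, buf, sizeof(buf))) > 0) {
        ssize_t off = 0;
        while (off < n) {                       /* handle short writes */
            ssize_t w = write(saved_stdout_fd, buf + off, n - off);
            if (w < 0)
                return -1;                      /* front-end fd went away */
            off += w;
        }
    }
    return (n < 0) ? -1 : 0;                    /* 0 on clean EOF */
}
```
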
From: <er...@he...> - 2003-11-10 21:19:39

On Fri, Oct 31, 2003 at 12:08:09AM +0100, J.A. Magallon wrote:
> Hi all...
>
> I would like to try bproc-4.0.0-pre1, but I have a couple questions:
> - What advantages does it offer vs 3.2.6 ?
> - Is it source-compatible, for example, to build MPICH-1.2.5.2 ?

The biggest change is the "bpfs" virtual file system stuff for node status. This completely replaces the old way of getting node status information, which involved talking to the master daemon. The advantage is that it's MUCH faster for doing the kinds of things a scheduler does (chmod, chown, etc.). On our 1024 node system, the scheduler is about 100x faster as a result. Opteron support is also only in 4.0.0pre1 at this point. That should be easy to copy to the other one, though.

There are a bunch of smallish API changes too. The goal there was to clean things up a bit and get rid of some crud. That's definitely a work in progress.

The MPICH hacks are going to require some changes. Somebody here has done them, so it's probably time to create a new tarball of that stuff. This should be done for "Clustermatic 4" in about a week or so.

- Erik

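Erik's description of bpfs boils down to node state being exposed as files, so scheduler operations such as handing a node to a user become plain chmod/chown calls. The sketch below is only illustrative: the /bpfs mount point and per-node path are assumptions, not a documented layout.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Illustrative only: the bpfs mount point and per-node file layout shown
 * here ("/bpfs/<node number>") are assumptions, not a documented API.
 * The point is that node ownership/permissions become ordinary file
 * operations instead of round trips to the master daemon. */
int assign_node_to_user(int node, uid_t uid, gid_t gid)
{
    char path[64];

    snprintf(path, sizeof(path), "/bpfs/%d", node);   /* hypothetical path */

    if (chown(path, uid, gid) != 0)      /* give the node to the job owner */
        return -1;
    if (chmod(path, 0700) != 0)          /* and make it exclusive */
        return -1;
    return 0;
}
```
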
From: <gor...@ph...> - 2003-11-06 15:29:46

Never saw this message before;

bproc: iod: no daemon present to forward IO

What does this message mean? Subsequent to this message, jobs which are rfork'd out to a slave node die as soon as they begin to produce output. Jobs which are bpsh'ed out work ok. bproc version 3.2.5 on kernel 2.4.21.

From: J.A. M. <jam...@ab...> - 2003-10-30 23:08:17

Hi all...

I would like to try bproc-4.0.0-pre1, but I have a couple questions:
- What advantages does it offer vs 3.2.6 ?
- Is it source-compatible, for example, to build MPICH-1.2.5.2 ?

TIA

--
J.A. Magallon <jamagallon()able!es>     \  Software is like sex:
werewolf!able!es                         \  It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.23-pre8-jam2 (gcc 3.3.1 (Mandrake Linux 9.2 3.3.1-4mdk))

From: <er...@he...> - 2003-10-30 15:15:19

On Tue, Oct 28, 2003 at 03:16:10PM -0800, Dale Harris wrote:
> So has anyone done any preliminary work to get bproc working with the
> 2.6 kernel?

Not yet. It's at the top of the to-do list, though. I hope to get to it after Supercomputing '03 (which is about 2 weeks from now)...

- Erik

From: Dale H. <ro...@ma...> - 2003-10-28 23:18:21

So has anyone done any preliminary work to get bproc working with the 2.6 kernel?

--
Dale Harris
ro...@ma...
/.-)

From: MIYOSHI,DENNIS (HP-Loveland,ex1) <den...@hp...> - 2003-10-26 01:28:10

I actually got the Red Hat 9.0 TFTP server to work with PXE. I start it with the supplied service definition in /etc/xinetd.d. I also used the Red Hat 9.0 DHCP supplied RPM.

Best regards,
Dennis E. Miyoshi, PE
Hendrix Release Manager
Hewlett-Packard Company
825 14th Street, S.W., MS E-200
Loveland, CO 80537
(970) 898-6110

-----Original Message-----
From: bpr...@li... [mailto:bpr...@li...] On Behalf Of Larry Baker
Sent: Friday, October 24, 2003 11:33 AM
To: bpr...@li...
Subject: [BProc] Red Hat PXE fixes for Clustermatic 3

My system is a 4 node Linux Beowulf cluster, using the Clustermatic 3 kit. I was not able to get the PXE package that came with Red Hat 8.0 to work (pxe-0.1-33). I tried the newer version that comes with Red Hat 9 (pxe-0.1-36), but it would not work either. I downloaded the source RPMs for the two versions and found they were almost identical. So, I modified the Red Hat 9 PXE package. The changes I made are:

1. The PXE package includes a multicast TFTP server, /usr/sbin/in.mtftpd. However, it does not include a service definition file for mtftp in /etc/xinetd.d. I fixed that. Note: On Red Hat 8 you still have to manually edit /etc/services to add entries for pxe and mtftp. They are already there in Red Hat 9.

2. The linux.0 layer 0 bootstrap sets up the downloaded initrd ram disk as /dev/ram1 (0x0101), but it is mounted on /dev/ram0 (0x0100). Without a kernel command line "root=/dev/ram0" to override it, the boot fails. I changed the code in prepare.c to set up the ram disk as /dev/ram0.

3. By default, pxe-0.1-33 redirects the console to COM1; pxe-0.1-36 does not. The default kernel command line is hard-coded into linux.0. To override it, you must manually enter a replacement at the console. I modified linux.c and download.c to add an optional layer 3 file containing the default kernel command line (like the APPEND option in syslinux/pxelinux).

Unfortunately, PXE can select only one default kernel/initrd/command line combination. This is not a problem for me, since I have a small, homogeneous cluster. Installation instructions are in the HTML file, pxe-0.1-36a.htm.

Larry Baker
US Geological Survey
ba...@us... <mailto:ba...@us...>

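For readers puzzling over item 2, the device numbers 0x0101 and 0x0100 are just the old-style packed major/minor pairs for /dev/ram1 and /dev/ram0 (RAM disk major 1, minors 1 and 0). The fragment below is a hedged illustration of the kind of one-line change described for prepare.c, not the actual Red Hat source; the field name in the comment is hypothetical.

```c
/* Illustrative only -- not the actual Red Hat prepare.c code.
 * The old 16-bit dev_t encoding is (major << 8) | minor, so the RAM disk
 * devices are:
 *   /dev/ram0 -> (1 << 8) | 0 == 0x0100
 *   /dev/ram1 -> (1 << 8) | 1 == 0x0101
 * The fix described above amounts to handing the bootstrap 0x0100 (where
 * the initrd is actually mounted) rather than 0x0101. */
#define RAMDISK_MAJOR 1

static unsigned short ramdisk_dev(int minor)
{
    return (unsigned short)((RAMDISK_MAJOR << 8) | minor);
}

/* e.g. boot_params.root_dev = ramdisk_dev(0);   -- hypothetical field name */
```
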
From: Greg W. <gw...@la...> - 2003-10-22 21:48:07

At 1:46 PM -0600 22/10/03, er...@he... wrote:
> On Wed, Oct 22, 2003 at 06:57:20AM +0200, Francois Thomas wrote:
>> Has someone already tested bproc on IBM Power4 running Linux ?
>> I have seen some notes about ppc support but would bproc work on ppc64 ? Or
>> what would it take to run ?
>> I am running a cluster of SLES8 ppc64 nodes and would like to give a try at
>> bproc.
>
> I haven't tried that one. We haven't got any Power4 machines around
> here. If it's reasonably similar to the 32 bit power pc (I've tried
> G3 and G4) it should be easy to do a port.
>
> - Erik

It's on my todo list when I have some time and when I can get a version of Linux that works on the G5.

Greg

From: <er...@he...> - 2003-10-22 21:21:03

On Tue, Oct 21, 2003 at 12:48:27PM -0400, Nicholas Henke wrote:
> On Tue, 2003-10-21 at 11:54, er...@he... wrote:
> > I got an oops out of the NMI watchdog which was enlightening (or at
> > least indicated which code was at fault). The following patch may
> > have fixed it for me. I say "may have" since I've had some trouble
> > reproducing the problem reliably.
> >
> > This patch turns off "sigbypass" which is a little optimization where
> > a process sending a signal to a ghost doesn't bother the ghost.
> > Instead it just throws a signal forwarding message right on the
> > message queue. I'm not sure how the code is broken. I haven't had
> > time to look into it yet.
> >
> > Please give it a try and let me know if you still see the deadlock.
>
> Thanks for the quick patch, I am running now to see if it deadlocks.
>
> BTW -- did you do anything special to get NMI to dump an oops for you?
> Can you tell me the basic setup -- I am just booting with
> nmi_watchdog=1, and I see the interrupts in /proc/interrupts. Does
> something more need to be done ?

That's all I did.

- Erik

From: <er...@he...> - 2003-10-22 20:49:55

On Wed, Oct 22, 2003 at 06:57:20AM +0200, Francois Thomas wrote:
> Has someone already tested bproc on IBM Power4 running Linux ?
> I have seen some notes about ppc support but would bproc work on ppc64 ? Or
> what would it take to run ?
> I am running a cluster of SLES8 ppc64 nodes and would like to give a try at
> bproc.

I haven't tried that one. We haven't got any Power4 machines around here. If it's reasonably similar to the 32 bit power pc (I've tried G3 and G4) it should be easy to do a port.

- Erik

From: <er...@he...> - 2003-10-22 05:01:30

On Tue, Oct 21, 2003 at 03:13:13PM -0400, Nicholas Henke wrote:
> On Tue, 2003-10-21 at 11:54, er...@he... wrote:
> > I got an oops out of the NMI watchdog which was enlightening (or at
> > least indicated which code was at fault). The following patch may
> > have fixed it for me. I say "may have" since I've had some trouble
> > reproducing the problem reliably.
> >
> > This patch turns off "sigbypass" which is a little optimization where
> > a process sending a signal to a ghost doesn't bother the ghost.
> > Instead it just throws a signal forwarding message right on the
> > message queue. I'm not sure how the code is broken. I haven't had
> > time to look into it yet.
> >
> > Please give it a try and let me know if you still see the deadlock.
> >
> > --- hooks.c   29 Aug 2003 21:46:57 -0000  1.53
> > +++ hooks.c   21 Oct 2003 15:41:44 -0000
> > @@ -314,6 +314,11 @@
> >                * t->sigmask */
> >      struct bproc_krequest_t *req;
> >      struct siginfo tmpinfo;
> > +
> > +    return 0;    /* XXX disable sigbypass for now.
> > +                  * There seems to be something busted or
> > +                  * unsafe about this code... */
> > +
> >      if (!BPROC_ISGHOST(t) || !t->bproc.ghost->sigbypass)
> >          return 0;
>
> Well, that seems to have worked. I just passed 10M iterations, which it
> was unable to do previously. I am now doing 100M just to be an idiot :)

Ok, thanks for the feedback. Now to figure out what's wrong with that code...

- Erik

From: Francois T. <FT...@fr...> - 2003-10-22 04:58:54

Has someone already tested bproc on IBM Power4 running Linux ? I have seen some notes about ppc support, but would bproc work on ppc64 ? Or what would it take to run ? I am running a cluster of SLES8 ppc64 nodes and would like to give bproc a try.

Salutations/Regards.
============================================
Dr. Francois THOMAS, EMEA-PSSC RS/6000 SP Group
Tel : (33)-4-67344061, GSM : (33)-6-83258855
Fax : (33)-4-67346477
ft...@fr..., ICQ# 95392338, http://ft-fr.userv.ibm.com
============================================

From: Nicholas H. <he...@se...> - 2003-10-22 01:21:28

On Tue, 2003-10-21 at 11:54, er...@he... wrote:
> I got an oops out of the NMI watchdog which was enlightening (or at
> least indicated which code was at fault). The following patch may
> have fixed it for me. I say "may have" since I've had some trouble
> reproducing the problem reliably.
>
> This patch turns off "sigbypass" which is a little optimization where
> a process sending a signal to a ghost doesn't bother the ghost.
> Instead it just throws a signal forwarding message right on the
> message queue. I'm not sure how the code is broken. I haven't had
> time to look into it yet.
>
> Please give it a try and let me know if you still see the deadlock.

Thanks for the quick patch, I am running now to see if it deadlocks.

BTW -- did you do anything special to get NMI to dump an oops for you? Can you tell me the basic setup -- I am just booting with nmi_watchdog=1, and I see the interrupts in /proc/interrupts. Does something more need to be done ?

Thanks!
Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania

From: <er...@he...> - 2003-10-21 20:25:19

On Tue, Oct 14, 2003 at 12:30:41PM -0400, Nicholas Henke wrote:
> On Tue, 2003-10-14 at 11:56, er...@he... wrote:
> >
> > Hrm. I'm glad you brought this up. I've recently seen a similar
> > problem. I thought it had something to do with the recent network
> > upgrade we did. Sounds like probably not.
>
> Fun ;/ Thanks for taking a look at it.
>
> > It's pretty mysterious to me. BProc really doesn't do much with
> > interrupts turned off. I've been working on reproducing it more
> > reliably here.
> >
> > I don't know a good way to shake the kernel loose in that case. One
> > thing we were going to try was to instrument it a bit with POST codes
> > or try and poke around in memory a bit with a bus analyzer.
>
> Ok -- way over my head there. I wonder if a hardware watchdog card would
> help - or if that would give the same results as the nmi_watchdog... aka
> nothing.

I got an oops out of the NMI watchdog which was enlightening (or at least indicated which code was at fault). The following patch may have fixed it for me. I say "may have" since I've had some trouble reproducing the problem reliably.

This patch turns off "sigbypass", which is a little optimization where a process sending a signal to a ghost doesn't bother the ghost. Instead it just throws a signal forwarding message right on the message queue. I'm not sure how the code is broken. I haven't had time to look into it yet.

Please give it a try and let me know if you still see the deadlock.

--- hooks.c   29 Aug 2003 21:46:57 -0000  1.53
+++ hooks.c   21 Oct 2003 15:41:44 -0000
@@ -314,6 +314,11 @@
               * t->sigmask */
     struct bproc_krequest_t *req;
     struct siginfo tmpinfo;
+
+    return 0;    /* XXX disable sigbypass for now.
+                  * There seems to be something busted or
+                  * unsafe about this code... */
+
     if (!BPROC_ISGHOST(t) || !t->bproc.ghost->sigbypass)
         return 0;

From: Nicholas H. <he...@se...> - 2003-10-21 19:39:22

On Tue, 2003-10-21 at 11:54, er...@he... wrote:
> I got an oops out of the NMI watchdog which was enlightening (or at
> least indicated which code was at fault). The following patch may
> have fixed it for me. I say "may have" since I've had some trouble
> reproducing the problem reliably.
>
> This patch turns off "sigbypass" which is a little optimization where
> a process sending a signal to a ghost doesn't bother the ghost.
> Instead it just throws a signal forwarding message right on the
> message queue. I'm not sure how the code is broken. I haven't had
> time to look into it yet.
>
> Please give it a try and let me know if you still see the deadlock.
>
> --- hooks.c   29 Aug 2003 21:46:57 -0000  1.53
> +++ hooks.c   21 Oct 2003 15:41:44 -0000
> @@ -314,6 +314,11 @@
>                * t->sigmask */
>      struct bproc_krequest_t *req;
>      struct siginfo tmpinfo;
> +
> +    return 0;    /* XXX disable sigbypass for now.
> +                  * There seems to be something busted or
> +                  * unsafe about this code... */
> +
>      if (!BPROC_ISGHOST(t) || !t->bproc.ghost->sigbypass)
>          return 0;

Well, that seems to have worked. I just passed 10M iterations, which it was unable to do previously. I am now doing 100M just to be an idiot :)

Nice catch~
Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania

From: <er...@he...> - 2003-10-20 16:36:55

On Sun, Oct 19, 2003 at 02:08:51AM -0700, Parshuram Limaye wrote:
> Hi,
>
> I just installed the bproc 4.0pre release on the master and slave node. I
> patched the kernel on both sides, built it, booted it, and then installed
> bproc on both systems. On one I run bpmaster and on the other bpslave.
> The problem is that I am not able to see any /etc/beowulf/node script;
> what I got is just a single file, config, which I modified as:
>
> on master
>   interface eth0 192.168.1.3 255.255.255.0
>   nodes 1
>   iprange 192.168.1.1 192.168.1.1
>
> on slave
>   interface eth1 192.168.1.1 255.255.255.0
>   nodes 1
>   iprange 192.168.1.1 192.168.1.1
>
> When I run bpmaster on the master and bpslave on the slave system I get
> no error, but when running bpstat it gives an error saying
> "bproc_notifier: no such file or directory". How can I verify that the
> node is up, because there is no file named bproc in /var/run/ either?
> Please help in solving this problem.

Most likely you forgot to mount bpfs. See "Node status" in the release notes.

- Erik

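For anyone hitting the same error, mounting bpfs before running bpstat is the usual fix. The snippet below is a hedged sketch: the filesystem type string "bpfs" comes from the 4.0.0pre1 discussion above, but the mount point /bpfs is an assumption; the "Node status" section of the release notes has the exact instructions.

```c
#include <stdio.h>
#include <sys/mount.h>

/* Hedged sketch: mount the bpfs node-status file system so that tools like
 * bpstat can read node state. The mount point "/bpfs" is an assumption;
 * consult the BProc 4.0.0pre1 release notes for the documented location. */
int main(void)
{
    if (mount("none", "/bpfs", "bpfs", 0, NULL) != 0) {
        perror("mount bpfs");
        return 1;
    }
    return 0;
}
```
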
From: Parshuram L. <par...@ya...> - 2003-10-19 09:08:52

Hi,

I just installed the bproc 4.0pre release on the master and slave node. I patched the kernel on both sides, built it, booted it, and then installed bproc on both systems. On one I run bpmaster and on the other bpslave. The problem is that I am not able to see any /etc/beowulf/node script; what I got is just a single file, config, which I modified as:

on master

interface eth0 192.168.1.3 255.255.255.0
nodes 1
iprange 192.168.1.1 192.168.1.1

on slave

interface eth1 192.168.1.1 255.255.255.0
nodes 1
iprange 192.168.1.1 192.168.1.1

When I run bpmaster on the master and bpslave on the slave system I get no error, but when running bpstat it gives an error saying "bproc_notifier: no such file or directory". How can I verify that the node is up, because there is no file named bproc in /var/run/ either? Please help in solving this problem.

Parshuram Limaye

From: Nicholas H. <he...@se...> - 2003-10-17 14:40:45

On Thu, 2003-10-16 at 19:01, er...@he... wrote:
> On Thu, Oct 16, 2003 at 05:23:05PM -0400, Nicholas Henke wrote:
> > Any ideas on how hard it would be to add access control lists to nodes?
> > We are getting pretty hard pressure here to support multiple users per
> > node. I would be doing the coding, just looking for ideas and a sanity
> > check.
>
> Here's my plan on that one:
>
> 1: Wait till I port to 2.6
>
> 2: Then use all the existing POSIX ACL stuff. This should be trivial
> at that point since "bpfs" (the node file system in BProc 4.0.0pre1)
> already has support for arbitrary extended file attributes.
>
> Until then, the UNIX file system-like semantics are pretty limiting
> for that case and it won't be easy to fix.
>
> It might be fairly easy to skip straight to step 2 (on BProc 4, not
> 3.2.6) since there are some ACL patches for Linux 2.4. I haven't
> tried that so I have no idea what the feasibility of that will be.

I looked at the info available on the ACL support for 2.4, and it looks fairly sane. From what I can see, those are the same patches that the Lustre folks are also using. Once I get 4.0pre up on struggles, I am going to look at the bpfs stuff to see what it would take to get this done. I hope to have a better idea in a few weeks :)

Thanks for the info ~
Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania

From: <er...@he...> - 2003-10-16 23:33:46

On Thu, Oct 16, 2003 at 05:23:05PM -0400, Nicholas Henke wrote:
> Any ideas on how hard it would be to add access control lists to nodes?
> We are getting pretty hard pressure here to support multiple users per
> node. I would be doing the coding, just looking for ideas and a sanity
> check.

Here's my plan on that one:

1: Wait till I port to 2.6

2: Then use all the existing POSIX ACL stuff. This should be trivial at that point since "bpfs" (the node file system in BProc 4.0.0pre1) already has support for arbitrary extended file attributes.

Until then, the UNIX file system-like semantics are pretty limiting for that case and it won't be easy to fix.

It might be fairly easy to skip straight to step 2 (on BProc 4, not 3.2.6) since there are some ACL patches for Linux 2.4. I haven't tried that, so I have no idea what the feasibility of that will be.

- Erik

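Since POSIX ACLs on Linux are stored as extended attributes (system.posix_acl_access), attaching one to a bpfs node entry would, in principle, look like any other xattr write. The sketch below only illustrates that mechanism; the /bpfs node path is an assumption, and whether bpfs actually honors the ACL attribute depends on the port Erik describes.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Hedged sketch: store an extended attribute on a bpfs node file, e.g.
 * "/bpfs/0" (hypothetical path). POSIX ACLs live in the
 * "system.posix_acl_access" attribute as a packed binary structure;
 * here we only show the shape of the setxattr() call with an opaque blob. */
int set_node_acl(const char *node_path, const void *acl_blob, size_t len)
{
    if (setxattr(node_path, "system.posix_acl_access", acl_blob, len, 0) != 0) {
        perror("setxattr");
        return -1;
    }
    return 0;
}
```
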
From: Nicholas H. <he...@se...> - 2003-10-16 21:23:14

Any ideas on how hard it would be to add access control lists to nodes? We are getting pretty hard pressure here to support multiple users per node. I would be doing the coding, just looking for ideas and a sanity check.

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania

From: Nicholas H. <he...@se...> - 2003-10-14 16:30:44

On Tue, 2003-10-14 at 11:56, er...@he... wrote:
> Hrm. I'm glad you brought this up. I've recently seen a similar
> problem. I thought it had something to do with the recent network
> upgrade we did. Sounds like probably not.

Fun ;/ Thanks for taking a look at it.

> It's pretty mysterious to me. BProc really doesn't do much with
> interrupts turned off. I've been working on reproducing it more
> reliably here.
>
> I don't know a good way to shake the kernel loose in that case. One
> thing we were going to try was to instrument it a bit with POST codes
> or try and poke around in memory a bit with a bus analyzer.

Ok -- way over my head there. I wonder if a hardware watchdog card would help - or if that would give the same results as the nmi_watchdog... aka nothing.

> Can you resend the attachment? I didn't get it for some reason.

Heh, probably forgot to attach it -- it is now.

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania

From: <er...@he...> - 2003-10-14 16:24:13

On Mon, Oct 13, 2003 at 02:17:16PM -0400, Nicholas Henke wrote:
> Howdy~
> I am back to torturing bproc again, just trying to make sure a kernel
> upgrade is going to be stable. Attached is a tar.gz with a script to run
> remote_fork (.c included). There is a 'NODES=' section at the top to
> edit for your nodes to use.
>
> If you do a './run.sh 10000000' ( 10 million iterations ), at some
> point, usually 1.5 million, the head node will hard lock -- not even
> nmi_watchdog can rescue it.
>
> If you have a way to rescue a kernel from this hard of a lock, I would
> love to know about it, so I could give this bug a whirl, but otherwise I
> am pretty stuck.

Hrm. I'm glad you brought this up. I've recently seen a similar problem. I thought it had something to do with the recent network upgrade we did. Sounds like probably not.

It's pretty mysterious to me. BProc really doesn't do much with interrupts turned off. I've been working on reproducing it more reliably here.

I don't know a good way to shake the kernel loose in that case. One thing we were going to try was to instrument it a bit with POST codes or try and poke around in memory a bit with a bus analyzer.

Can you resend the attachment? I didn't get it for some reason.

- Erik

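The "instrument it a bit with POST codes" idea refers to writing byte values to I/O port 0x80, which POST diagnostic cards (and some boards) latch and display, so you can see how far a suspect code path got even after a hard lock with interrupts off. A minimal illustration of the technique, not taken from the BProc source, might look like this:

```c
#include <stdio.h>
#include <sys/io.h>   /* outb(), ioperm(); x86 Linux, needs root */

/* Illustrative only: emit a progress byte to the legacy POST-code port
 * (0x80). Sprinkling distinct codes through a suspect path shows the last
 * value reached even when the machine locks solid. In BProc's case this
 * would be done from kernel code with the kernel's own outb(); the
 * user-space form below just demonstrates the mechanism. */
int main(void)
{
    if (ioperm(0x80, 1, 1) != 0) {   /* grant access to port 0x80 (root only) */
        perror("ioperm");
        return 1;
    }
    outb(0x42, 0x80);                /* hypothetical progress marker */
    return 0;
}
```
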
From: Nicholas H. <he...@se...> - 2003-10-13 18:17:45

Howdy~

I am back to torturing bproc again, just trying to make sure a kernel upgrade is going to be stable. Attached is a tar.gz with a script to run remote_fork (.c included). There is a 'NODES=' section at the top to edit for your nodes to use.

If you do a './run.sh 10000000' ( 10 million iterations ), at some point, usually 1.5 million, the head node will hard lock -- not even nmi_watchdog can rescue it.

If you have a way to rescue a kernel from this hard of a lock, I would love to know about it, so I could give this bug a whirl, but otherwise I am pretty stuck.

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania

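The attachment itself isn't preserved in the archive, but the test Nicholas describes is easy to picture: a loop that repeatedly migrates a trivial child to a slave node and reaps it. The sketch below is a guess at roughly what remote_fork.c does, assuming the usual fork-like semantics of bproc_rfork() from libbproc; it is not the actual attachment.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/bproc.h>   /* bproc_rfork(); libbproc header */

/* Hedged sketch of a remote-fork stress test (not the original remote_fork.c):
 * repeatedly rfork a do-nothing child onto 'node' and wait for it, so that
 * process creation/teardown between front end and slave is exercised hard. */
int main(int argc, char **argv)
{
    int  node  = (argc > 1) ? atoi(argv[1]) : 0;
    long iters = (argc > 2) ? atol(argv[2]) : 1000000L;

    for (long i = 0; i < iters; i++) {
        pid_t pid = bproc_rfork(node);
        if (pid < 0) {
            perror("bproc_rfork");
            return 1;
        }
        if (pid == 0)
            _exit(0);              /* child: now on the slave node, exit at once */
        waitpid(pid, NULL, 0);     /* parent: reap before the next iteration */
        if (i % 100000 == 0)
            printf("iteration %ld\n", i);
    }
    return 0;
}
```
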
From: Dale H. <ro...@ma...> - 2003-10-09 15:32:36

On Wed, Oct 08, 2003 at 09:59:41PM -0600, Maurice Hilarius elucidated:
> > Hey, of course, this may not be too important for many people out there.
> > But are there any debs for bproc and these clustermatic tools? Yes, I
> > suppose that is something I could do... someday, in my copious time.
>
> There are RPMs, because we built them and put them on our ftp server for
> public access.

Hey Maurice,

Yeah, that's cool. But that's not debs. ;-) Yes, I know I could take alien and convert RPMs to debs, etc, etc. But that just isn't as clean a solution as making the debs from scratch and getting them included into Debian officially.

FYI, Maurice, I'm thinking about this for the old cluster, on which I recently installed Debian, and not demeter, which is still Red Hat and bproc, of course. But who knows, I might install Debian on it some day, too. *shrug* ;-)

--
Dale Harris
ro...@ma...
/.-)
