From: Daniel G. <dg...@ti...> - 2004-03-05 04:15:27
Erik,

Thanks for the comments.  I will do my best to work on the testing, but I
have another question:  Have you tried either clustermatic 3 or 4 with a
machine like mine?  They are UX (ruffian) boards, made by Samsung, with
EV56 @600 MHz.  Perhaps somebody else on the list has tried these?

Thanks,
Daniel

On Thu, Mar 04, 2004 at 06:13:15PM -0700, er...@he... wrote:
> On Thu, Mar 04, 2004 at 02:16:05PM -0500, Daniel Gruner wrote:
> > > Anyway, a few things you can do to try to figure out if that's what's
> > > going on...  (I tried this on our alphas real quick and it doesn't
> > > seem to be happening here.)
> > >
> > > - Comment out this stuff in the slave daemon:
> > >
> > >     /* bump our priority to RT to avoid getting hosed by errant
> > >      * stuff that gets run on our node */
> > >     p.sched_priority = 1;
> > >     if (sched_setscheduler(0, SCHED_FIFO, &p))
> > >         syslog(LOG_NOTICE, "Failed to set real-time scheduling for"
> > >                " slave daemon.\n");
> > >
> > >   and rebuild the slave daemon.  (also reinstall libbpslave.a and
> > >   rebuild the phase 2 boot image if you're using the rest of
> > >   clustermatic)
> >
> > I will try, if I get some time...
> >
> > > - bpsh other stuff (e.g. uptime) to the node while this job is
> > >   running but before the node dies.  Is it responsive or is it just
> > >   completely dead?  Is it really slow?
> >
> > Nothing else runs.  "bpsh 1 uptime" just hangs.  The process table on the
> > master does not get updated, and the waster does not appear on "top" at
> > all.  The node dies (is killed, actually, because it times out) although the
> > waster job manages to finish.  I guess even the bpctl does not get through...
>
> Well, it sounds like the starvation of the slave daemon is more or
> less complete in that case.  It's pretty much got to be a scheduler
> problem.  If commenting out the bit of code above and this tidbit in
> kernel/slave.c fixes it, then that will confirm it.
>
>     /* Knock our priority back to default */
>     current->policy = SCHED_OTHER;
>     current->nice = DEF_NICE;
>     current->rt_priority = 0;
>
> If that fixes it I'm pretty sure that either the kernel you're using
> has some scheduler related patch in it or my code is subtly buggy in a
> way that doesn't crop up on the kernels that I've built.
>
> If you're running as root, you can also try to do
> sched_setscheduler(0, SCHED_OTHER, ...) in your program to try and
> confirm it.
>
> > > - Finally, if everything seems to be working but slow or something
> > >   like that you can up the ping timeout by adding a line like this
> > >   to your /etc/beowulf/config.
> > >
> > >     pingtimeout 120
> > >
> > >   120 is the timeout in seconds.  The default is 30.
> >
> > Not a solution yet...
> >
> > Could it be something to do with the network driver?  I am using eepro100
> > cards.  Here is the list of loaded modules on the nodes:
>
> I seriously doubt it.  You're still getting output from the remote
> process which means the network and TCP are all still working fine.
>
> - Erik

-- 
Dr. Daniel Gruner            dg...@ti...
Dept. of Chemistry           dan...@ut...
University of Toronto        phone:  (416)-978-8689
80 St. George Street         fax:    (416)-978-5325
Toronto, ON  M5S 3H6, Canada finger for PGP public key
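
For anyone following the thread: the sched_setscheduler(0, SCHED_OTHER, ...) check that Erik suggests could look roughly like the sketch below.  This is only an illustration, not code from bproc or clustermatic.  The helper name reset_to_default_scheduling is made up here, and the sketch assumes a Linux/glibc system where SCHED_OTHER takes a static priority of 0.

```c
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of bproc): ask the kernel to put the
 * calling process on the default time-sharing policy.
 * Returns 0 on success, -1 on failure. */
int reset_to_default_scheduling(void)
{
    struct sched_param p;

    memset(&p, 0, sizeof(p));
    p.sched_priority = 0;   /* SCHED_OTHER only accepts priority 0 */

    /* pid 0 means "the calling process" */
    if (sched_setscheduler(0, SCHED_OTHER, &p) != 0) {
        fprintf(stderr, "sched_setscheduler: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}
```

Calling this at the top of main() in the test job, before the compute loop starts, should show whether the job is inheriting a real-time policy from the slave daemon: if the node stays responsive with the call in place, that points at the scheduler rather than the network.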