[BProc] abnormal dying of nodes


Hi

I have run into a strange problem on my Alpha cluster.  It is running
BProc 3.2.6 on Alpha UX machines.  For the most part the cluster behaves
normally and lets me run jobs without trouble.

However, I am testing a fairly short, highly CPU-intensive job, simply to
have a way of submitting many jobs with bjs and to learn how it works.
The job appears to be so CPU-intensive that the node looks dead to the
master (it apparently stops responding), and the node gets reset.  The
job itself actually manages to complete, but the node is reset anyway.
If I run "ps" on the master I don't see the job (or rather, it is listed
as having used no CPU time), nor does it appear in "top".  Is it possible
that the node gets TOO busy with the computation?  I attach two files:
the program itself (waster.cpp) and its output (junk).  The command line
used to run the program was:

	bpsh 1 -I /dev/null ./waster >& junk &

The program ends with:
	[1]    Exit 255                      bpsh 1 -I /dev/null ./waster >& junk
and the node is reset.  From the /var/log/messages file I get:
	Mar  4 11:29:57 racaille bpmaster: ping timeout on slave 1

It looks like the node is too busy computing to even respond to pings...
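
For readers without the attachment, a CPU-bound test of this kind can be
very small.  The sketch below is a hypothetical stand-in and NOT the
attached waster.cpp; the function name burn_cpu and its iteration
parameter are assumptions made for illustration only.  It does nothing
except keep one CPU fully busy in user space for a while.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for a CPU-bound test job; NOT the attached waster.cpp.
// Sums sqrt(i) for i = 1..iterations. Returning the sum gives the loop an
// observable result, so the compiler has no reason to discard the work.
double burn_cpu(std::uint64_t iterations) {
    double sum = 0.0;
    for (std::uint64_t i = 1; i <= iterations; ++i) {
        sum += std::sqrt(static_cast<double>(i));
    }
    return sum;
}
```

Calling burn_cpu from main() with a large iteration count and printing
the result gives a job that performs no I/O and makes no system calls
until it finishes, which is the kind of load described above.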

Any hints at what may be happening are welcome.

I attach my /etc/beowulf/config file, for completeness.

Thanks,
Daniel
-- 

Dr. Daniel Gruner                        dg...@ti...
Dept. of Chemistry                       dan...@ut...
University of Toronto                    phone:  (416)-978-8689
80 St. George Street                     fax:    (416)-978-5325
Toronto, ON  M5S 3H6, Canada             finger for PGP public key