From: Daniel G. <dg...@ti...> - 2004-03-04 17:00:59
Hi,

I have experienced a strange phenomenon on my alpha cluster. It is running bproc 3.2.6, on alpha UX machines. For the most part the cluster behaves quite normally, letting me run jobs and do all the usual things.

However, I am testing a fairly short, highly CPU-intensive job, simply to have a way of submitting many jobs using bjs and to learn how it works. The job appears to be so CPU-intensive that the node looks dead to the master (i.e. it does not respond, or something like that), and the node dies before the job is completed. Well, actually the job manages to complete, but the node is reset anyway. If I do "ps" on the master I don't see the job (actually, it says it has not used up any time), nor do I see it appear in "top". Is it possible that the node gets TOO busy with the computation?

I append two files: the program itself (waster.cpp) and its output (junk). The command line to run the program was:

    bpsh 1 -I /dev/null ./waster >& junk &

The program ends with:

    [1] Exit 255 bpsh 1 -I /dev/null ./waster >& junk

and the node is reset. From the /var/log/messages file I get:

    Mar 4 11:29:57 racaille bpmaster: ping timeout on slave 1

It looks like the node is too busy computing to even respond to pings...

Any hints at what may be happening are welcome. I attach my /etc/beowulf/config file, for completeness.

Thanks,
Daniel

--
Dr. Daniel Gruner             dg...@ti...
Dept. of Chemistry            dan...@ut...
University of Toronto         phone: (416)-978-8689
80 St. George Street          fax: (416)-978-5325
Toronto, ON M5S 3H6, Canada   finger for PGP public key