[BProc] slave problems


Hello,

//** THE SETUP **
I've got a test cluster that works using bproc-3.1.9 (RH7.2 master node)
from the March ClusterMatic CD.  I'm trying to build a new master node
(RH8.0) from source using bproc-3.2.2 with beoboot-lanl.1.3.  Beowulf
starts up clean.

Nodes all boot with linuxbios, so I don't need to muck with a phase 1
kernel.

The phase 2 kernel was built with:
'beoboot -2 -n -o vmlinuz-beoboot'.

//** THE PROBLEM **
When a slave boots, it gets stuck in an infinite loop like this:
while (1) {
// slave issues DHCP request
// slave does arp for master -- master responds
// dhcp serves up the kernel
// new in.tftpd process starts up on master
// slave starts the tftp download and downloads a few blocks
}

I end up with tons of tftp daemons all trying to serve a single node,
and beoserv never receives a RARP.
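
A rough way to put a number on the daemon pileup is to count in.tftpd
entries in process-listing output.  The sketch below is only an
illustration: it assumes input in the style of `ps -eo pid,comm`, the
helper name count_tftpd is made up here, and the sample lines are
hypothetical rather than captured from the master node.

```python
def count_tftpd(ps_lines, name="in.tftpd"):
    """Count processes whose command name equals `name`.

    `ps_lines` is assumed to look like the output of `ps -eo pid,comm`:
    one header line followed by "<pid> <command>" rows.
    """
    count = 0
    for line in ps_lines[1:]:  # skip the "PID COMMAND" header
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[1].strip() == name:
            count += 1
    return count


# Hypothetical sample data, not taken from the real master node.
sample = [
    "  PID COMMAND",
    "  812 xinetd",
    " 1530 in.tftpd",
    " 1534 in.tftpd",
    " 1541 in.tftpd",
    " 1602 beoserv",
]
print(count_tftpd(sample))
```

Running this periodically while a slave boots would show whether the
count keeps climbing, i.e. whether each pass through the loop leaves
another daemon behind.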

This seems unrelated to bproc master problems -- stopping beowulf
produces the same effect.

So the question is:  has anyone seen this before?  What is causing the
slave to continue to issue DHCP requests after the first request
seemingly succeeds?  Everything works fine when using the 3.1.9 master
node.  Is this merely another SUA (Stupid User Artifact) where the
answer should be blindingly obvious?

Thanks for any help,

-JE
-----------------------------------------------
Josh England
Sandia National Laboratory, Livermore, CA
Distributed Information Systems
email: jj...@sa...
phone: (925) 294-2076