From: <er...@he...> - 2004-09-03 19:26:27
On Tue, Aug 24, 2004 at 04:01:23PM -0700, Vipul Deokar wrote:
> Folks,
>
> I have a configuration of 4 identical "compute" nodes (with disks)
> and 1 slightly different "master" node. The master node has a
> slightly more powerful CPU, more RAM, an additional 10/100 NIC
> interface to connect to the extranet, a CDROM drive on secondary
> IDE, and a more recent version of BIOS firmware.
>
> With Red Hat 9 (runlevel 3) and Clustermatic4 i386 RPMs installed
> on computeNode#1 as master, I can build a cluster with the other 4
> nodes as slaves. (I see one node rebooting every 5-6 minutes that I
> need to debug.) I am using this cluster currently.
>
> However, with Red Hat 9 (runlevel 3 or 5; more packages) and
> Clustermatic4 i386 RPMs installed on the "master", I cannot get any
> of the "compute" nodes up as slaves. The node_up script fails on
> all nodes with:
>
>     nodeup : Starting 1 child processes.
>     nodeup : Finished creating child processes.
>     nodeup : I/O error talking to child
>     nodeup : Child process for node 1 died with signal 4
>
> The same config, node_up, and config.boot scripts execute in both
> cluster configuration attempts - one succeeds and the other fails.
> Any insight into why this would happen?

Signal 4 = SIGILL. This usually happens when the front end node is a
newer rev of processor than the back end nodes. Red Hat is pretty good
about installing the best glibc it can on the front end (e.g. the i686
version instead of the i386 one). I usually see something like this
when the slave node is some other CPU type that doesn't qualify as
i686. The issue is really that the library gets loaded on the front
end, and those instructions turn out not to be valid on the
destination node.

My workaround is to downgrade the glibc on the front end to one that
will work on all nodes. In other words, load the i386 one instead of
the i686 one.

- Erik
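
For anyone hitting the same symptom, one way to confirm the mismatch
Erik describes is to compare the CPU feature flags on each machine
with the architecture of the glibc installed on the front end. Below
is a minimal Python sketch of that check (the script name check_i686.py
is just illustrative; it assumes a Linux /proc/cpuinfo and an rpm-based
install, and it uses cmov as the usual distinguishing i686 feature):

    #!/usr/bin/env python
    # check_i686.py -- illustrative helper, not part of Clustermatic/bproc.
    # Reports whether this CPU looks i686-capable (the i686 glibc build
    # relies on the cmov instruction) and which glibc arch rpm installed.
    import os

    def cpu_flags():
        # Collect the CPU feature flags from /proc/cpuinfo.
        flags = set()
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    flags.update(line.split(":", 1)[1].split())
        return flags

    def glibc_arch():
        # Ask rpm for the architecture of the installed glibc package.
        return os.popen("rpm -q --queryformat '%{ARCH}' glibc").read().strip()

    if __name__ == "__main__":
        print("CPU supports cmov (i686-capable):", "cmov" in cpu_flags())
        print("Installed glibc arch:", glibc_arch())

If the front end reports an i686 glibc while a compute node's CPU
lacks cmov (checked by booting that node standalone, for instance),
that library's code will fault with SIGILL when it runs on the node,
which matches the "died with signal 4" message above, and downgrading
to the i386 glibc as Erik suggests should clear it.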