From: <er...@he...> - 2004-09-11 02:31:16
On Wed, Sep 08, 2004 at 10:43:17AM +0200, Peter Englmaier wrote:
> Hi, we had a 'Scyld' Cluster running for years until some disk
> on the master died. Some MPI program and other 'normal' programs
> were running. After the disk crash the bproc system was hanging
> and after a reboot all nodes didn't come up. Finally,
> we installed RH9+Clustermatic 4. Main reason for this: we
> had no good documentation about scyld and the old sysadmin had
> left the institute. Setting up clustermatic was quite easy.
>
> Nodes had scratch disks with three partitions: Scyld boot
> partition (actually an ext2 partition), swap, and /scratch.
> All three partitions were destroyed - I could not mount any of them.
> On about 10 of 22 nodes! Even swap was not recognized as such, although
> fdisk reported all partitions. I suspect something was going wrong
> with the kernel part of bproc. Is this possible?

BProc doesn't get its fingers in any of the kernel's file system stuff,
so I think it's unlikely that BProc would induce file system problems.
I suppose it's possible if something went crazy and started scribbling
on memory. Personally, I've never seen a problem like that with BProc.
I can't speak for Scyld though - they've modified BProc and I don't
know the details of what they've done.

Personally, I would pull out one of the disks to see if it works on
some other standalone system. If it does, then I'd start picking
through the node setup process to see what's going on there.

- Erik
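
As a rough illustration of the standalone disk check suggested above, the sketch below looks for the on-disk signatures that an ext2 partition and a Linux swap partition normally carry. It is only a sketch under stated assumptions: the function names are made up for this example, the 4096-byte page size is assumed (typical for x86), and the device path in the usage note is hypothetical. It reads only and never writes to the device.

```python
import struct

EXT2_SUPERBLOCK_OFFSET = 1024  # the ext2 superblock starts 1 KiB into the partition
EXT2_MAGIC_OFFSET = 56         # offset of the s_magic field inside the superblock
EXT2_MAGIC = 0xEF53            # little-endian 16-bit magic number for ext2/ext3
SWAP_SIGNATURES = (b"SWAPSPACE2", b"SWAP-SPACE")
PAGE_SIZE = 4096               # assumption: 4 KiB pages


def looks_like_ext2(path):
    """Return True if the ext2 magic number is where a superblock would put it."""
    with open(path, "rb") as f:
        f.seek(EXT2_SUPERBLOCK_OFFSET + EXT2_MAGIC_OFFSET)
        raw = f.read(2)
    return len(raw) == 2 and struct.unpack("<H", raw)[0] == EXT2_MAGIC


def looks_like_swap(path, page_size=PAGE_SIZE):
    """Return True if a swap signature sits in the last 10 bytes of the first page."""
    with open(path, "rb") as f:
        f.seek(page_size - 10)
        return f.read(10) in SWAP_SIGNATURES
```

On the test machine this might be called as `looks_like_ext2("/dev/hdb1")` (hypothetical device name, root access needed). If the signatures are present on the other system but the partitions still fail on the node, that points toward the node setup process rather than the disk contents.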