From: Rene S. <rs...@tu...> - 2005-07-27 02:18:56
Hi List,

The problem is somewhat reproducible. The cluster has crashed a couple of times in the past few days with similar messages. Below is the syslog entry from right before today's crash. The cluster consists of 40 dual-Opteron nodes.

Here is what we do to reproduce the crash:

1) The cluster boots fine and all the nodes come up with no problems.
2) We then proceed to queue up some MPI and non-MPI jobs to run.
3) After a few hours of running the jobs, the cluster becomes unresponsive: we can't get a prompt at the console, we can't ssh in, and we can't do anything other than power cycle the whole thing.

Is there any way to get more debug info out of bproc so that I can get a clue as to where to start looking for the cause of this problem?

Thank you for any advice/help on this.

Rene

> Jul 26 18:14:58 aresnfs-frontend.ccsnfs.edu kernel: bproc: ghost: signal: signr
> == 0
> Jul 26 18:14:58 aresnfs-frontend.ccsnfs.edu last message repeated 577 times
> Jul 26 18:14:58 aresnfs-frontend.ccsnfs.edu kernel: proc: ghproc: ghost: siproc:
> ghosproc: ghost: sigproc: gproc: ghost: sigproc: ghost: sigproc: ghost: sigproc
> : ghost: sigproc: ghost: siproc: ghost: sigproc: ghostproc: ghost: sigproc: ghos
> t: sigproc: ghost: sigproc: ghost: sigproc: ghost: sigproc: ghost: sigproc: ghos
> tproc: ghost: sigproc: ghost: signal: signr == 0
> Jul 26 18:14:58 aresnfs-frontend.ccsnfs.edu kernel: bproc: ghost: signal: signr
> == 0
> Jul 26 18:14:59 aresnfs-frontend.ccsnfs.edu last message repeated 3631 times
> Jul 26 18:14:59 aresnfs-frontend.ccsnfs.edu kernel: proc: ghost: signal: signr =
> = 0
> Jul 26 18:14:59 aresnfs-frontend.ccsnfs.edu kernel: bproc: ghost: signal: signr
> == 0
> Jul 26 18:14:59 aresnfs-frontend.ccsnfs.edu last message repeated 3639 times
> Jul 26 18:14:59 aresnfs-frontend.ccsnfs.edu kernel: proc: ghost: signal: signr =
> = 0
> Jul 26 18:14:59 aresnfs-frontend.ccsnfs.edu kernel: bproc: ghost: signal: signr
> == 0

Erik Hendriks wrote:
> I've never seen a problem like that. Is it reproducible?
>
> - Erik