From: Rene S. <rs...@tu...> - 2005-07-21 03:49:29
Hi List,

Our cluster crashed a while ago and rebooted, and now things are back up and running again.

I am looking through the syslogs, trying to figure out what happened or what caused the crash, and I found these messages:

Jul 20 20:35:48 kernel: bproc: ghost: signal: signr == 0
Jul 20 20:35:48 last message repeated 3639 times
Jul 20 20:35:48 kernel: proc: ghost: signal: signr == 0
Jul 20 20:35:48 kernel: bproc: ghost: signal: signr == 0
Jul 20 20:35:48 last message repeated 3639 times
Jul 20 20:35:48 kernel: proc: ghost: signal: signr == 0
Jul 20 20:35:48 kernel: bproc: ghost: signal: signr == 0
Jul 20 20:35:48 last message repeated 3639 times
Jul 20 20:35:48 kernel: proc: ghosproc: ghostproc: ghost: signal: signr == 0
Jul 20 20:35:48 kernel: bproc: ghost: signal: signr == 0

Does anyone have any clues?

Thanks
Rene
From: Erik H. <eah...@gm...> - 2005-07-22 17:22:36
I've never seen a problem like that. Is it reproducible?

- Erik
From: Rene S. <rs...@tu...> - 2005-07-22 18:34:28
Hi,

> I've never seen a problem like that. Is it reproducible?

Not sure. I think it might have to do with the queuing system we are using, Platform LSF. We enabled preemption on the queues, and it seems that right before the crash some of the preempted/suspended jobs did not get their nodes reassigned to them properly. For example, user joe, whose job was suspended on node 0, did not get node 0 reassigned to him: joe's process was running on node 0, but the node still belonged to root and not to joe.

Then the cluster crashed, and now things are working fine. I am watching the logs and so far there is no sign of trouble. Preemption seems to be working correctly now, and nodes get assigned and reassigned properly.

Thanks
Rene
From: Rene S. <rs...@tu...> - 2005-07-27 02:18:56
Hi List,

The problem is somewhat reproducible. The cluster has crashed a couple of times in the past few days with similar messages. Below is the syslog entry from right before today's crash.

The cluster consists of 40 dual-Opteron nodes. Here is what we do to reproduce the crash:

1) The cluster boots fine and all the nodes come up with no problem.
2) We then proceed to queue up some MPI and non-MPI jobs to run.
3) After a few hours of running the jobs, the cluster becomes unresponsive: we can't get a prompt at the console, we can't ssh in, and we can't do anything other than power cycle the whole thing.

Is there any way to get more debug info out of bproc, so that I can get a clue as to where to start looking for what might be causing this problem?

Thank you for any advice or help on this.

Rene

> Jul 26 18:14:58 aresnfs-frontend.ccsnfs.edu kernel: bproc: ghost: signal: signr == 0
> Jul 26 18:14:58 aresnfs-frontend.ccsnfs.edu last message repeated 577 times
> Jul 26 18:14:58 aresnfs-frontend.ccsnfs.edu kernel: proc: ghproc: ghost: siproc: ghosproc: ghost: sigproc: gproc: ghost: sigproc: ghost: sigproc: ghost: sigproc: ghost: sigproc: ghost: siproc: ghost: sigproc: ghostproc: ghost: sigproc: ghost: sigproc: ghost: sigproc: ghost: sigproc: ghost: sigproc: ghost: sigproc: ghostproc: ghost: sigproc: ghost: signal: signr == 0
> Jul 26 18:14:58 aresnfs-frontend.ccsnfs.edu kernel: bproc: ghost: signal: signr == 0
> Jul 26 18:14:59 aresnfs-frontend.ccsnfs.edu last message repeated 3631 times
> Jul 26 18:14:59 aresnfs-frontend.ccsnfs.edu kernel: proc: ghost: signal: signr == 0
> Jul 26 18:14:59 aresnfs-frontend.ccsnfs.edu kernel: bproc: ghost: signal: signr == 0
> Jul 26 18:14:59 aresnfs-frontend.ccsnfs.edu last message repeated 3639 times
> Jul 26 18:14:59 aresnfs-frontend.ccsnfs.edu kernel: proc: ghost: signal: signr == 0
> Jul 26 18:14:59 aresnfs-frontend.ccsnfs.edu kernel: bproc: ghost: signal: signr == 0

Erik Hendriks wrote:
> I've never seen a problem like that. Is it reproducible?
>
> - Erik
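Since the "last message repeated N times" lines hide how many ghost signal messages were actually logged, it may help to total them per second before digging further. The following is only a rough sketch, not a bproc tool: the function name `tally_ghost_messages` and the regular expressions are assumptions made for illustration, and the sample is an abridged excerpt of the Jul 26 log above with wrapped lines rejoined.

```python
import re
from collections import Counter

# Syslog's summary line for suppressed duplicates.
REPEAT_RE = re.compile(r"last message repeated (\d+) times")
# The ghost signal message from this thread, including garbled variants
# that still end in "ghost: signal: signr == 0".
GHOST_RE = re.compile(r"kernel: .*ghost: signal: signr == 0")
# Leading syslog timestamp, with an optional "> " quote prefix.
TS_RE = re.compile(r"^(?:> )?(\w{3} +\d+ \d\d:\d\d:\d\d)")


def tally_ghost_messages(lines):
    """Return a Counter mapping timestamp -> estimated ghost message count.

    A "repeated N times" line is only counted when the message directly
    before it was a ghost signal message.
    """
    counts = Counter()
    previous_was_ghost = False
    for line in lines:
        ts_match = TS_RE.match(line)
        if ts_match is None:
            continue  # not a syslog line (for example, wrapped text)
        timestamp = ts_match.group(1)
        repeat = REPEAT_RE.search(line)
        if repeat and previous_was_ghost:
            counts[timestamp] += int(repeat.group(1))
        elif GHOST_RE.search(line):
            counts[timestamp] += 1
            previous_was_ghost = True
        else:
            previous_was_ghost = False
    return counts


# Abridged excerpt from the Jul 26 syslog quoted in this thread.
SAMPLE = [
    "Jul 26 18:14:58 aresnfs-frontend.ccsnfs.edu kernel: bproc: ghost: signal: signr == 0",
    "Jul 26 18:14:58 aresnfs-frontend.ccsnfs.edu last message repeated 577 times",
    "Jul 26 18:14:58 aresnfs-frontend.ccsnfs.edu kernel: bproc: ghost: signal: signr == 0",
    "Jul 26 18:14:59 aresnfs-frontend.ccsnfs.edu last message repeated 3631 times",
]

for timestamp, count in sorted(tally_ghost_messages(SAMPLE).items()):
    print(timestamp, count)
```

Run against the full syslog (for example, by passing an open file object instead of `SAMPLE`), this should show whether the message rate climbs steadily before each hang or appears all at once, which could help narrow down when the trouble starts relative to LSF preemption events.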