Thread: [SSI-devel] 1.9 icssvr_nanny: spawn_daemon_procerror=-11, will be retried
From: Roger T. <rog...@gm...> - 2005-08-23 00:10:58
Hi,

What does "icssvr_nanny: spawn_daemon_procerror=-11, will be retried" mean? I was able to reproduce this message on node 2 by SSHing to node 2, exiting SSH, and opening a new shell on node 1, timed so that the new shell on node 1 takes node 2's old SSH tty.

-Roger
From: Roger T. <rog...@gm...> - 2005-08-23 00:14:32
Oh, and also the whole cluster hangs soon after, but it recovers when I reboot node 2.

-Roger
From: John B. <joh...@hp...> - 2005-08-23 00:18:46
You are probably hitting the "ladder problem", where the kernels get confused about which node a tty is on and ICS goes back and forth between nodes until the system runs out of memory. I could never figure out how to reproduce it. It may be tied to your earlier problem.

John
From: Roger T. <rog...@gm...> - 2005-08-23 01:05:39
I just fixed the earlier problem... hmm.
From: John B. <joh...@hp...> - 2005-08-23 01:10:17
So, I'm a little confused at the moment, then. Is the "spawn_daemon" problem the only problem you currently have with ttys and rmtfbs?

John
From: Roger T. <rog...@gm...> - 2005-08-23 01:13:59
I got rid of the tty and rfb problem before seeing this, so this might be a new, unrelated problem.
From: Roger T. <rog...@gm...> - 2005-08-23 01:35:43
Just wait. I think this might be related to my rmtfb fix...

Roger
From: Roger T. <rog...@gm...> - 2005-08-23 01:47:02
Could this be related to the previous as_xscribe.c as_do_vma() pg_off assertion failure? I see a path to rmtfb_newcli() that runs on the same node as the process:

as_do_vma() -> reop_import_file() -> rmtfb_getcli_id() -> rmtfb_newcli()

-Roger
From: Roger T. <rog...@gm...> - 2005-08-23 01:33:45
When this happens I see an endless stream of "rmtfb_newcli: fail" printks logged locally before the cluster hangs. "rmtfb_newcli" means some remote node is trying to open something on this node. There are "resource temporarily unavailable" messages on the serial console. Maybe the local process is trying to open the local tty via a remote node... kind of backwards, but not really when you consider that all nodes share the same /dev/pts. What does SSI-1.2 do with /dev/pts?

-Roger
From: John B. <joh...@hp...> - 2005-08-23 01:39:50
The newcli stuff, maybe. I can produce the -11 without it.

What happens is that node 1 thinks the pty is on node 2 and node 2 thinks the pty is on node 1. ssidev_remote_open ping-pongs between both nodes until things break. Something isn't being cleaned up/initialized properly when the pty is reused on a different node.

John
From: John B. <joh...@hp...> - 2005-08-23 01:56:40
I may be confusing you, or I may be confused.

To produce the -11 problem, I ssh'd to node 2, exited, and ssh'd to node 1. There is nothing in doing this that would call as_do_vma(), I'd think. Are you doing something different? If not, what kind of op is calling as_do_vma()?

John
From: Roger T. <rog...@gm...> - 2005-08-23 02:04:20
Nah... I was just citing a possible relationship to as_do_vma(). Oh, I see that's exactly how I trigger the error -11 too. I didn't run into this problem before fixing the rmtfb /dev/tty problem. I'll back out my rmtfb fix and see if I can reproduce -11 by following what you did. The only difference in the cluster's behavior with my rmtfb fix is that the /dev/tty rfb's get cleaned from the svrtable as designed.

-Roger
From: Roger T. <rog...@gm...> - 2005-08-23 02:38:15
By the way, kdb on the remote node never hit ssidev_remote_open(). I can run commands on the local node, even open new ttys, but any commands to the filesystem like `echo > xxx` or `ls` will not return. Shortly after, I run out of resources on all nodes and can't fork new processes. I don't see any immediate indications of memory starvation, either.

-Roger
From: Roger T. <rog...@gm...> - 2005-08-23 02:49:42
Looks like the system was running out of inodes... These numbers are a bit suspicious.

[root@node1 fs]# cat dentry-state
29050 16830 45 0 0 0
[root@node1 fs]# cat inode-state
25552 3007 0 0 0 0 0
[root@node1 fs]# cat inode-nr
25555 3007
From: Roger T. <rog...@gm...> - 2005-08-23 04:56:34
Okay, I think I found the bug. Just change the pty's file op to ssidev_remote_open() and have ssidev_get_inode_server() return i_version.

If my fix is valid, the remaining question is why this problem doesn't always happen on my cluster.

-Roger
From: Roger T. <rog...@gm...> - 2005-08-28 01:59:06
By the way, I hit this bug again after upgrading to 2.6.10-bk2.

Roger
From: Roger T. <rog...@gm...> - 2005-08-28 23:17:27
I'm at 2.6.10-bk5 now. Looks like the problem is stale ino->i_version in the dcache.

Roger
From: Roger T. <rog...@gm...> - 2005-09-17 00:08:29
To conclude, this problem is fixed in CVS.

-Roger