Thread: [SSI-devel] 1.9 icssvr_nanny: spawn_daemon_procerror=-11, will be retried
From: Roger T. <rog...@gm...> - 2005-08-23 00:10:58
Hi,

What does "icssvr_nanny: spawn_daemon_procerror=-11, will be retried" mean? I was able to reproduce this message on node 2 by SSHing to node 2, exiting SSH, and opening a new shell on node 1, timed so that the new shell on node 1 takes node 2's old SSH tty.

-Roger
From: Roger T. <rog...@gm...> - 2005-08-23 00:14:32
Oh, and also the whole cluster hangs soon after, but it recovers when I reboot node 2.

-Roger
From: John B. <joh...@hp...> - 2005-08-23 00:18:46
You are probably hitting the "ladder problem", where the kernels get confused about which node a tty is on and ICS goes back and forth between nodes until the system runs out of memory. I could never figure out how to reproduce it. It may be tied to your earlier problem.

John
From: Roger T. <rog...@gm...> - 2005-08-23 01:05:39
I just fixed the earlier problem... hmm.
From: John B. <joh...@hp...> - 2005-08-23 01:10:17
So, I'm a little confused at the moment, then. Is the "spawn_daemon" problem the only problem you currently have with ttys and rmtfbs?

John
From: Roger T. <rog...@gm...> - 2005-08-23 01:13:59
I got rid of the tty and rfb problem before seeing this, so this might be a new, unrelated problem.
From: Roger T. <rog...@gm...> - 2005-08-23 01:35:43
Just wait. I think this might be related to my rmtfb fix...

Roger
From: Roger T. <rog...@gm...> - 2005-08-23 01:47:02
Could this be related to the previous as_xscribe.c as_do_vma() pg_off assertion failure? I see a path to rmtfb_newcli() that runs on the same node as the process:

as_do_vma() -> reop_import_file() -> rmtfb_getcli_id() -> rmtfb_newcli()

-Roger
From: Roger T. <rog...@gm...> - 2005-08-23 01:33:45
When this happens I see an endless stream of "rmtfb_newcli: fail" printks logged locally before the cluster hangs. "rmtfb_newcli" means some remote node is trying to open something on this node. There are "resource temporarily unavailable" messages on the serial console. Maybe the local process is trying to open the local tty via a remote node... kind of backwards, but not really when you consider that all nodes share the same /dev/pts. What does SSI-1.2 do with /dev/pts?

-Roger
From: John B. <joh...@hp...> - 2005-08-23 01:39:50
The newcli stuff, maybe. I can produce the -11 without it.

What happens is that node 1 thinks the pty is on node 2 and node 2 thinks the pty is on node 1. ssidev_remote_open ping-pongs between both nodes until things break. Something isn't being cleaned up/initialized properly when the pty is reused on a different node.

John
From: John B. <joh...@hp...> - 2005-08-23 01:56:40
I may be confusing you, or I may be confused.

To produce the -11 problem, I ssh'd to node 2, exited, and ssh'd to node 1. There is nothing in doing this that would call as_do_vma(), I'd think. Are you doing something different? If not, what kind of op is calling as_do_vma()?

John
From: Roger T. <rog...@gm...> - 2005-08-23 02:04:20
Nah... I was just citing a possible relationship to as_do_vma(). Oh, I see that's exactly how I trigger the error -11 too. I didn't run into this problem before fixing the rmtfb /dev/tty problem. I'll back out my rmtfb fix and see if I can reproduce -11 by following what you did. The only difference in the cluster's behavior with my rmtfb fix is that the /dev/tty rfb's get cleaned from the svrtable as designed.

-Roger
From: Roger T. <rog...@gm...> - 2005-08-23 02:38:15
By the way, kdb on the remote node never hit ssidev_remote_open(). I can run commands on the local node, even open new ttys, but any commands to the filesystem like `echo > xxx` or `ls` will not return. Shortly after, I run out of resources on all nodes and can't fork new processes. I don't see any immediate indications of memory starvation, either.

-Roger
From: Roger T. <rog...@gm...> - 2005-08-23 02:49:42
Looks like the system was running out of inodes... These numbers are a bit suspicious.

[root@node1 fs]# cat dentry-state
29050 16830 45 0 0 0
[root@node1 fs]# cat inode-state
25552 3007 0 0 0 0 0
[root@node1 fs]# cat inode-nr
25555 3007
From: Roger T. <rog...@gm...> - 2005-08-23 04:56:34
Okay, I think I found the bug. Just change the pty's file op to ssidev_remote_open() and have ssidev_get_inode_server() return i_version.

If my fix is valid, the remaining question is why this problem doesn't always happen on my cluster.

-Roger
From: Roger T. <rog...@gm...> - 2005-08-28 01:59:06
By the way, I hit this bug again after upgrading to 2.6.10-bk2.

Roger
From: Roger T. <rog...@gm...> - 2005-08-28 23:17:27
I'm at 2.6.10-bk5 now. Looks like the problem is stale ino->i_version in the dcache.

Roger
From: Roger T. <rog...@gm...> - 2005-09-17 00:08:29
To conclude, this problem is fixed in CVS.

-Roger