Thread: [SSI-devel] [ ssic-linux-Bugs-1001010 ] Can't halt initnode

[SSI-devel] [ ssic-linux-Bugs-1001010 ] Can't halt initnode

From: SourceForge.net <no...@so...> - 2004-07-31 00:04:03

Bugs item #1001010, was opened at 2004-07-30 17:03
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1001010&group_id=32541

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: David B. Zafman (dzafman)
Assigned to: Nobody/Anonymous (nobody)
Summary: Can't halt initnode

Initial Comment:

If the cluster administrator wants to take the current
initnode out of service, "clusternode_shutdown -N# -h
..." does not work correctly.  The problem is that the
base sys_reboot system call doesn't completely stop the
node.  I've added code to take down the ics interfaces
and run ics_nodedown() on all other nodes, but although
services are stopped, init is still running.  In a
failover environment, which is the only environment
where halting the initnode makes sense, this is bad
because the shared root is still writable.

I've checked in code to clusternode_shutdown to
disallow halt in this case.

Areas to fix:
1. Make the root read-only during service stop.
2. Improve the halting code in the kernel.
3. Stop init.  Process 1 should also be sent a SIGSTOP.
 This can be added to /sbin/halt, which would skip it in
the local "-L" case, because init must not be stopped
when a non-initnode is being halted (-L).
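
As an illustration only, the three items above could be laid out as the
sequence of steps a halt path might take. This is a sketch, not the
OpenSSI code: the function name halt_plan and its argument are
hypothetical, and it only prints the steps instead of running them.
Whether process 1 actually honors SIGSTOP is a kernel matter this
sketch does not address.

```shell
#!/bin/sh
# Hypothetical helper: prints the steps, does not execute them.
# $1 is "local" for the "-L" case (halting a non-initnode); any other
# value means the initnode itself is being halted.
halt_plan() {
    mode=$1
    # Item 1: remount the shared root read-only once services are stopped.
    echo "mount -o remount,ro /"
    # Item 3: stop process 1, but not in the local "-L" case.
    if [ "$mode" != "local" ]; then
        echo "kill -STOP 1"
    fi
    # Item 2 is kernel work; user space can only request the halt.
    echo "request halt via reboot(2)"
}

halt_plan initnode
halt_plan local
```

The only difference between the two calls is the SIGSTOP step, which is
left out in the local case.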

----------------------------------------------------------------------

[SSI-devel] [ ssic-linux-Bugs-1001010 ] Can't halt initnode

From: SourceForge.net <no...@so...> - 2004-07-31 19:34:58

Bugs item #1001010, was opened at 2004-07-30 17:03
Message generated for change (Comment added) made by dzafman

>Comment By: David B. Zafman (dzafman)
Date: 2004-07-31 12:34

Message:
Logged In: YES 
user_id=297844


Another minor issue: the ramdisk tried to halt a booting initnode that
had failed to mount the root.  Because of the way we perform the halt,
the node does not halt cleanly; it panics in nodedown handling, because
this was a simultaneous boot and other nodes were present.  Judging by
the stack we could fix cfs_nodedown_thread(), but I believe fixing the
halt code as described in this bug report makes that unnecessary.  A fix
limited to cfs_nodedown_thread() would still leave room for other panics
caused by the bad state of this machine.

Creating root device
mkrootdev: label /1 not found
mount: special device /dev/root does not exist
ERROR: Mounting root file system failed.
Unable to continue. Halting.
nm_add_node: Node 3 added
nm_add_node: Node 2 added
nm_add_node: Node 4 added
RTNL: assertion failed at devinet.c(825)
RTNL: assertion failed at devinet.c(825)
RTNL: assertion failed at igmp.c(556)
RTNL: assertion failed at igmp.c(529)
flushing ide devices: hda
System halted.
Node 2 has gone down!!!
Node 3 h<as1 >gUonnabe led otwno! !h!an
leNo dkee rn4e lha Ns UgLLo npeo idontwenr! !!dreferenceUna abtle  
vitort uhaanld laded kreersnse l0 00NU00L4L1 0pon tperri dnetirengfe 
eriepnc:                                       i<c40>2 a3t6 a2vdi            
etu*apld ae d=d re0s0s00 00000000041O0ps : pr00i0nt0ig teliapn :t   
nlicp0 2m3i6ia 2cdpqfc* pdsey m=53 c0800xx0 0s0d0_0od scsi_mod                     
m
CPU:    0
EIP:    0060:[<c0236a2d>]    Not tainted
EFLAGS: 00010286
EIP is at cfs_nodedown_thread [kernel] 0x1d (2.4.20sandbox-dzafman)
eax: 00000400   ebx: c32c8000   ecx: 00000000   edx: c3e0d800
esi: 00000000   edi: 00000000   ebp: c32c9fec   esp: c32c9fe8
ds: 0068   es: 0068   ss: 0068
Process cfs failover (pid: 65689, stackpage=c32c9000)
Stack: c0236a10 00000000 c010776d 00000002 00000000 00000000
Call Trace:
[<c0236a10>] cfs_nodedown_thread [kernel] 0x0 (0xc32c9fe8)
[<c010776d>] kernel_thread_helper [kernel] 0x5 (0xc32c9ff0)


----------------------------------------------------------------------

[SSI-devel] [ ssic-linux-Bugs-1001010 ] Can't halt initnode

From: SourceForge.net <no...@so...> - 2007-10-12 02:47:50

Bugs item #1001010, was opened at 2004-07-30 20:03
Message generated for change (Comment added) made by rogertsang

>Category: Booting / init
Group: None
Status: Open
Resolution: None
>Priority: 3
Private: No
Submitted By: David Zafman (dzafman)
Assigned to: Nobody/Anonymous (nobody)
Summary: Can't halt initnode

----------------------------------------------------------------------

>Comment By: Roger Tsang (rogertsang)
Date: 2007-10-11 22:47

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Need to validate SSI-1.9.3

----------------------------------------------------------------------

[SSI-devel] [ ssic-linux-Bugs-1001010 ] Can't halt initnode

From: SourceForge.net <no...@so...> - 2007-10-12 09:53:19

Bugs item #1001010, was opened at 2004-07-31 02:03
Message generated for change (Comment added) made by hughesj
----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2007-10-12 11:53

Message:
Logged In: YES 
user_id=166336
Originator: NO

Still present in 1.9.3.

node1:~# clusternode_shutdown -h -N 1 now

Broadcast message from root (1/ttyS0) (Fri Oct 12 10:47:31 2007):

Node 1 is going down for system halt NOW!
[...]
Deactivating swap...done.
Unmounting file systems:
umount2: Device or resource busy
umount: /boot: device is busy
umount2: Device or resource busy
umount: /boot: device is busy
/boot:
Unmounting file systems (retry):
[...]
System halted.
Node 2 has gone down!!!

Debian GNU/Linux 3.1 node1 tty1

Node1 login:

----------------------------------------------------------------------

[SSI-devel] [ ssic-linux-Bugs-1001010 ] Can't halt initnode

From: SourceForge.net <no...@so...> - 2008-04-20 22:38:31

Bugs item #1001010, was opened at 2004-07-30 20:03
Message generated for change (Settings changed) made by rogertsang

Category: Booting / init
>Group: v1.2.0
Status: Open
Resolution: None
Priority: 3
Private: No
Submitted By: David Zafman (dzafman)
Assigned to: Nobody/Anonymous (nobody)
Summary: Can't halt initnode

----------------------------------------------------------------------

>Comment By: Roger Tsang (rogertsang)
Date: 2008-04-20 18:38

Message:
Logged In: YES 
user_id=1246761
Originator: NO

You must have tested a non-initnode in 1.9.3, because
`clusternode_shutdown -h -N {initnode_num}` has been disabled by dzafman.

2.0.0pre3 fixes this bug for
`clusternode_shutdown -h -N {potential_initnode|compute_node}`.
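
For readers unfamiliar with the restriction, a guard of the kind
described (refuse the halt when the target is the current initnode)
might look roughly like this. It is a sketch under assumptions, not the
check dzafman committed: the function name halt_allowed is hypothetical,
and both node numbers are passed in as arguments rather than looked up
from the cluster.

```shell
#!/bin/sh
# Hypothetical guard: succeed only if the target node is not the
# current initnode.
# $1 = node number to halt, $2 = node number of the current initnode.
halt_allowed() {
    target=$1
    initnode=$2
    if [ "$target" -eq "$initnode" ]; then
        echo "refusing to halt node $target: it is the current initnode" >&2
        return 1
    fi
    return 0
}

# Example: node 1 is the initnode, so only the first request is allowed.
halt_allowed 2 1 && echo "halting node 2 is allowed"
halt_allowed 1 1 || echo "halting node 1 is refused"
```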

----------------------------------------------------------------------
