Thread: [SSI-devel] [ ssic-linux-Bugs-1001010 ] Can't halt initnode
From: SourceForge.net <no...@so...> - 2004-07-31 00:04:03
Bugs item #1001010, was opened at 2004-07-30 17:03
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1001010&group_id=32541

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: David B. Zafman (dzafman)
Assigned to: Nobody/Anonymous (nobody)
Summary: Can't halt initnode

Initial Comment:
If the cluster administrator wants to take the current initnode out of service, "clusternode_shutdown -N# -h ..." will not work right. The problem is that the sys_reboot base system call doesn't completely stop the node from doing things. I've added code to take down ics interfaces and run ics_nodedown() on all other nodes, but although services are stopped, init is still running. In a failover environment, which is the only one which makes sense, this is bad because the shared root is still writable. I've checked-in code into clusternode_shutdown, to disallow halt in this case.

Areas to fix:
1. Make the root read-only during service stop.
2. Improve the halting code in the kernel.
3. Stop init. Process 1 should also be sent a SIGSTOP. This can be added to /sbin/halt which skips that in the local "-L" case because it shouldn't be done when a non-initnode is being halted (-L).

----------------------------------------------------------------------
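The three areas to fix can be read as an ordered halt sequence. The following is a minimal sketch of that ordering only, not the OpenSSI code: plan_halt_steps and its step names are hypothetical, and the function returns the steps as data instead of touching the system. It assumes that the read-only remount, like the SIGSTOP to process 1, applies only when the initnode itself is halted and is skipped in the local "-L" case.

```python
import signal
from typing import List, Tuple

INIT_PID = 1  # process 1 (init), as named in the report


def plan_halt_steps(local: bool) -> List[Tuple[str, ...]]:
    """Return the ordered halt steps as data; nothing is executed.

    local=True models the "-L" case, where a non-initnode is being halted.
    """
    steps: List[Tuple[str, ...]] = [("stop-services",)]
    if not local:
        # Area 1: make the shared root read-only during service stop.
        # Assumption: only needed when the initnode itself goes down.
        steps.append(("remount", "/", "ro"))
        # Area 3: stop init so it cannot keep running after the halt.
        steps.append(("kill", str(INIT_PID), signal.SIGSTOP.name))
    # Area 2 (the kernel halting code) is outside a user-space sketch.
    steps.append(("halt",))
    return steps
```

Returning a plan keeps the ordering easy to inspect: plan_halt_steps(local=True) contains no remount and no signal to process 1, which matches the report's point that init must not be stopped when a non-initnode is halted.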
From: SourceForge.net <no...@so...> - 2004-07-31 19:34:58
Bugs item #1001010, was opened at 2004-07-30 17:03
Message generated for change (Comment added) made by dzafman
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1001010&group_id=32541

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: David B. Zafman (dzafman)
Assigned to: Nobody/Anonymous (nobody)
Summary: Can't halt initnode

----------------------------------------------------------------------

>Comment By: David B. Zafman (dzafman)
Date: 2004-07-31 12:34

Message:
Logged In: YES
user_id=297844

Another minor issue is that the ramdisk wanted to halt a booting initnode which failed to mount the root. Because of the way we are performing the halt operation, instead of getting a clean halt, the node ends up panic'ing in nodedown because it was a simultaneous boot and other nodes were present. Looking at the stack we could fix cfs_nodedown_thread(), but I believe that fixing the halt code in this bug report eliminates the need to. This is because there could be other panics due to the bad state of this machine.

Creating root device
mkrootdev: label /1 not found
mount: special device /dev/root does not exist
ERROR: Mounting root file system failed. Unable to continue.
Halting.
nm_add_node: Node 3 added
nm_add_node: Node 2 added
nm_add_node: Node 4 added
RTNL: assertion failed at devinet.c(825)
RTNL: assertion failed at devinet.c(825)
RTNL: assertion failed at igmp.c(556)
RTNL: assertion failed at igmp.c(529)
flushing ide devices: hda
System halted.
Node 2 has gone down!!!
Node 3 h<as1 >gUonnabe led otwno! !h!an leNo dkee rn4e lha Ns UgLLo npeo idontwenr! !!dreferenceUna abtle vitort uhaanld laded kreersnse l0 00NU00L4L1 0pon tperri dnetirengfe eriepnc: i<c40>2 a3t6 a2vdi etu*apld ae d=d re0s0s00 00000000041O0ps : pr00i0nt0ig teliapn :t nlicp0 2m3i6ia 2cdpqfc* pdsey m=53 c0800xx0 0s0d0_0od scsi_mod m
CPU: 0
EIP: 0060:[<c0236a2d>] Not tainted
EFLAGS: 00010286
EIP is at cfs_nodedown_thread [kernel] 0x1d (2.4.20sandbox- dzafman)
eax: 00000400 ebx: c32c8000 ecx: 00000000 edx: c3e0d800
esi: 00000000 edi: 00000000 ebp: c32c9fec esp: c32c9fe8
ds: 0068 es: 0068 ss: 0068
Process cfs failover (pid: 65689, stackpage=c32c9000)
Stack: c0236a10 00000000 c010776d 00000002 00000000 00000000
Call Trace: [<c0236a10>] cfs_nodedown_thread [kernel] 0x0 (0xc32c9fe8)
[<c010776d>] kernel_thread_helper [kernel] 0x5 (0xc32c9ff0)

----------------------------------------------------------------------
From: SourceForge.net <no...@so...> - 2007-10-12 02:47:50
Bugs item #1001010, was opened at 2004-07-30 20:03
Message generated for change (Comment added) made by rogertsang
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1001010&group_id=32541

>Category: Booting / init
Group: None
Status: Open
Resolution: None
>Priority: 3
Private: No
Submitted By: David Zafman (dzafman)
Assigned to: Nobody/Anonymous (nobody)
Summary: Can't halt initnode

----------------------------------------------------------------------

>Comment By: Roger Tsang (rogertsang)
Date: 2007-10-11 22:47

Message:
Logged In: YES
user_id=1246761
Originator: NO

Need to validate SSI-1.9.3

----------------------------------------------------------------------
From: SourceForge.net <no...@so...> - 2007-10-12 09:53:19
Bugs item #1001010, was opened at 2004-07-31 02:03
Message generated for change (Comment added) made by hughesj
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1001010&group_id=32541

Category: Booting / init
Group: None
Status: Open
Resolution: None
Priority: 3
Private: No
Submitted By: David Zafman (dzafman)
Assigned to: Nobody/Anonymous (nobody)
Summary: Can't halt initnode

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2007-10-12 11:53

Message:
Logged In: YES
user_id=166336
Originator: NO

Still present in 1.9.3.

node1:~# clusternode_shutdown -h -N 1 now

Broadcast message from root (1/ttyS0) (Fri Oct 12 10:47:31 2007):

Node 1 is going down for system halt NOW!
[...]
Deactivating swap...done.
Unmounting file systems: umount2: Device or resource busy
umount: /boot: device is busy
umount2: Device or resource busy
umount: /boot: device is busy
/boot: Unmounting file systems (retry):
[...]
System halted.
Node 2 has gone down!!!

Debian GNU/Linux 3.1 node1 tty1

Node1 login:

----------------------------------------------------------------------
From: SourceForge.net <no...@so...> - 2008-04-20 22:38:31
Bugs item #1001010, was opened at 2004-07-30 20:03
Message generated for change (Settings changed) made by rogertsang
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1001010&group_id=32541

Category: Booting / init
>Group: v1.2.0
Status: Open
Resolution: None
Priority: 3
Private: No
Submitted By: David Zafman (dzafman)
Assigned to: Nobody/Anonymous (nobody)
Summary: Can't halt initnode

----------------------------------------------------------------------

>Comment By: Roger Tsang (rogertsang)
Date: 2008-04-20 18:38

Message:
Logged In: YES
user_id=1246761
Originator: NO

You must have tested a non-initnode in 1.9.3 because `clusternode_shutdown -h -N {initnode_num}` has been disabled by dzafman. 2.0.0pre3 fixes this bug for `clusternode_shutdown -h -N {potential_initnode|compute_node}`.

----------------------------------------------------------------------
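The thread describes a check in clusternode_shutdown that disallows halting the current initnode while still letting other nodes be halted. As a rough illustration of that kind of guard, here is a minimal sketch; HaltRefused and check_halt_allowed are hypothetical names, and the actual check in clusternode_shutdown may look quite different.

```python
class HaltRefused(Exception):
    """Raised when a halt request targets the current initnode."""


def check_halt_allowed(target_node: int, current_initnode: int) -> None:
    """Refuse a halt aimed at the current initnode; allow any other node.

    Both arguments are cluster node numbers. How the current initnode is
    discovered is left out; the caller is assumed to know it already.
    """
    if target_node == current_initnode:
        raise HaltRefused(
            f"node {target_node} is the current initnode; halt is disallowed"
        )
```

With node 1 as the initnode, check_halt_allowed(2, 1) returns normally, while check_halt_allowed(1, 1) raises HaltRefused.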