From: <ebi...@xm...> - 2003-02-09 18:39:50
|
Corey Minyard <cmi...@mv...> writes:
> The panic case is actually the most interesting for us. We are using bootimg
> with the MCL coredump to take a kernel core to memory and pick it up on the
> next boot.
[snip]

With respect to DMA and SMP handling for kexec on panic, that case is much trickier. A lot of the normal methods simply don't apply because by definition in a panic something is broken, and that something may be the code we need to cleanly shut down the hardware. But I am not ready to sacrifice a method that works well in a properly working kernel just because the panic case can't use it.

In getting it working I suggest we start with the easy cases, where DMA and SMP are not big issues. Then we will have a working framework.

I am still digesting the crash dump code I have seen, but as far as I can tell what it does is compress the contents of memory, for writing out later.

To handle the hard cases for kexec on panic I would recommend the following:

- Place the recovery code in a reserved area of memory that the normal
  kernel will not touch, and actually run the code there. This trivially
  solves the DMA problem because the hardware is not DMA'ing into it.

- Set up the kernel that does the recovery so that the pool of memory it
  uses for dynamic allocations is also in the reserved area of memory, so
  that it is equally free of DMA dangers.

- Modify the kernel that does the recovery so it can run at a different
  physical address from the standard kernel, so it will not need to be
  moved out of the reserved area of memory.

- Modify the kernel that does the recovery to not care which cpu in an
  SMP system it comes up on first.

- Modify the kernel that does the recovery so that it is very robust in
  reinitializing devices, so it can cope with devices in a random state.
  Though most devices can be handled by simply ignoring them.

- Possibly preserve in the reserved area a separate copy of the tables
  (ACPI/MP/etc.) that the kernel needs for coming up. I actually don't
  think this needs to happen, as the kernel preserves those in place
  already.

At that point I believe a full memory core dump can be achieved without needing to do anything except jump to the other kernel on panic. All of the memory can be preserved because the kexec case would not have touched it.

I find this very attractive because it can be done with a very low impact on the primary kernel whose panic we want to capture, plus it is an extremely robust solution.

The one piece I don't know about is how to prioritize which pieces of memory are written out first. It is certainly a desirable feature, but do we need it if we can preserve everything? Or is the prioritizing information easy enough to get that we don't care?

Eric
|
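The reserved-area argument above rests on one invariant: no DMA target address can ever fall inside the reserved window, because the running kernel never handed those pages out. A minimal userspace sketch of the range check a boot-time allocator would enforce (the window bounds and the function name are hypothetical, chosen only for illustration):

```c
#include <stdint.h>

/* Hypothetical reserved window for the recovery kernel, e.g. 16MB-20MB.
 * Because the primary kernel never gives pages from this window to its
 * allocators, no driver can ever have programmed a DMA engine with an
 * address inside it -- which is the whole point of the reservation. */
#define RESERVED_START (16ULL << 20)
#define RESERVED_END   (20ULL << 20)

/* Does the buffer [addr, addr + len) intersect the reserved window? */
static int overlaps_reserved(uint64_t addr, uint64_t len)
{
    return addr < RESERVED_END && addr + len > RESERVED_START;
}
```

An allocator honouring the reservation would simply refuse (or never offer) any range for which overlaps_reserved() returns true.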
From: Kenneth S. <ke...@mv...> - 2003-02-11 01:36:08
|
"Eric W. Biederman" wrote:
> Suparna Bhattacharya <su...@in...> writes:
> > On Sun, Feb 09, 2003 at 11:39:27AM -0700, Eric W. Biederman wrote:
> > > Corey Minyard <cmi...@mv...> writes:
> > >
> > > With respect to DMA and SMP handling for kexec on panic that case is
> > > much trickier. A lot of the normal methods simply don't apply because
> > > by definition in a panic something is broken, and that something may
> > > be the code we need to cleanly shut down the hardware. But I am not
> > > ready to sacrifice a method that works well in a properly working
> > > kernel just because the panic case can't use it.
> > >
> > > In getting it working I suggest we start with the easy cases, where
> > > DMA and SMP are not big issues. And then we can have a working
> > > framework.
> >
> > I'd agree. That was also the idea behind the patch we'd just posted
> > for LKCD. With a basic working framework in hand that works for
> > simpler cases, we can now keep working on addressing more and harder
> > situations bit by bit.
>
> Agreed. I guess the primary question is can we trust the current
> device shutdown + reboot notifier path or do we need to make some
> large changes to avoid it.

So are the functions registered on the reboot notifier path guaranteed to be non-blocking? In the kexec on panic case, calls that can block would obviously be a bad thing. If they can block, perhaps we could add a new flag SYS_PANIC or something like that to tell the driver to only do a non-blocking shutdown of the chip.

> > Are you trying to address the possibility that DMA is overwriting
> > memory we are using in the recovery code, due to a runaway driver
> > or other code passing a wrong memory address to a device (e.g. in
> > a corrupted command area)?
>
> Not primarily. Instead I am trying to address the possibility that
> DMA is overwriting the recovery code due to a device not being shut
> down properly. Though it would happen to cover many cases of the wrong
> memory address being passed to a device.

The problem we were seeing was that rogue DMA from a network interface chip was corrupting dentries in the dentry cache when the rebooted kernel was coming back up. This caused a whole new set of panics. :-(

Ken Sumrall
ke...@mv...
|
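Ken's SYS_PANIC idea would extend the existing reboot-notifier event codes (SYS_DOWN, SYS_HALT, SYS_POWER_OFF) with one that tells a driver to do only a register-level, non-blocking stop. A toy userspace model of the proposal — the chain here is a simplified stand-in for the kernel's notifier_block list, and SYS_PANIC itself is the suggestion from this thread, not an existing code:

```c
#include <stddef.h>

/* Existing reboot-notifier event codes, plus the proposed one. */
enum { SYS_DOWN = 1, SYS_HALT, SYS_POWER_OFF, SYS_PANIC /* proposed */ };

struct notifier {
    int (*fn)(int event);
    struct notifier *next;
};

/* Walk the chain in order, as the kernel's notifier_call_chain()
 * does in spirit; returns how many handlers were invoked. */
static int call_chain(struct notifier *head, int event)
{
    int calls = 0;
    for (; head; head = head->next) {
        head->fn(event);
        calls++;
    }
    return calls;
}

/* A hypothetical NIC handler honouring the proposed contract: on
 * SYS_PANIC it may only poke registers to stop bus-mastering DMA,
 * never sleep, allocate, or take locks. */
static int dma_stopped;
static int nic_shutdown(int event)
{
    if (event == SYS_PANIC) {
        dma_stopped = 1;   /* non-blocking: just disable bus-mastering */
        return 0;
    }
    /* Full shutdown path may block (flush queues, wait on hardware). */
    dma_stopped = 1;
    return 0;
}
```

The point of the flag is purely contractual: the same chain is walked, but each handler promises a restricted, panic-safe subset of its shutdown work.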
From: <ebi...@xm...> - 2003-02-11 05:08:22
|
Kenneth Sumrall <ke...@mv...> writes:
> > Agreed. I guess the primary question is can we trust the current
> > device shutdown + reboot notifier path or do we need to make some
> > large changes to avoid it.
>
> So are the functions registered on the reboot notifier path guaranteed
> to be non-blocking? In the kexec on panic case, calls that can block
> would obviously be a bad thing. If they can block, perhaps we could add
> a new flag SYS_PANIC or something like that to tell the driver to only
> do a non-blocking shutdown of the chip.

I think there is some amount of blocking allowed, but that has not been clearly defined. Note that in 2.5.x there is a specific subset of the reboot notifiers, the shutdown() device method, that you don't need to register a notifier for. The rules are the same; it is just a little bit cleaner.

> > Not primarily. Instead I am trying to address the possibility that
> > DMA is overwriting the recovery code due to a device not being
> > shut down properly. Though it would happen to cover many cases of
> > the wrong memory address being passed to a device.
>
> The problem we were seeing was that rogue DMA from a network interface
> chip was corrupting dentries in the dentry cache when the rebooted
> kernel was coming back up. This caused a whole new set of panics. :-(

And this is exactly what a reserved hunk of memory, from say 16MB to 20MB, would handle. As the DMA could never have been set up at that address, it obviously will never be used...

Eric
|
From: Stephen H. <she...@os...> - 2003-02-11 17:09:36
|
On Mon, 2003-02-10 at 21:08, Eric W. Biederman wrote: > Kenneth Sumrall <ke...@mv...> writes: > > > > Suparna Bhattacharya <su...@in...> writes: > > > > Agreed. I guess the primary question is can we trust the current > > > device shutdown + reboot notifier path or do we need to make some > > > large changes to avoid it. > > > > > So are the functions registered on the reboot notifier path guaranteed > > to be non-blocking? In the kexec on panic case, calls that can block > > would obviously be a bad thing. If they can block, perhaps we could add > > a new flag SYS_PANIC or something like that to tell the driver to only > > do a non-blocking shutdown of the chip. > Some of the network shutdown reboot notifiers can block. I found this out the hard way when trying to convert notifiers to use RCU and discovered many warnings. So many that the effort was abandoned. |
From: Suparna B. <su...@in...> - 2003-02-10 11:08:42
|
I am using the OSDL versions of the kexec patches for 2.5.59 (plm 1442 and 1444) for lkcd-kexec based crash dump work. So far I had only been trying the cases where machine_kexec was being invoked directly from (safe) panics, which worked, i.e. it could successfully kexec and save dumps generated via artificially induced panics on a system that's not doing very much (not considering the harder cases for the moment).

Surprisingly though, when I tried just a simple kexec -e today (having loaded the kernel earlier on), I ran into the following Oops, consistently. I'm using kexec-tools-1.8, and this has worked for me earlier. The test system is a 4-way SMP machine. Has anyone seen this as well? (I'd already issued init 1 and unmounted filesystems by this point.)

sh-2.05a# /sbin/kexec -e
Synchronizing SCSI caches:
Shutting down devices
Starting new kernel
Unable to handle kernel paging request at virtual address 361ae000
 printing eip:
c011470a
*pde = 00000000
Oops: 0002
CPU:    0
EIP:    0060:[<c011470a>]    Not tainted
EFLAGS: 00010003
EIP is at machine_kexec+0x14a/0x190
eax: 00000097   ebx: f7742260   ecx: 00000025   edx: 361ac000
esi: c0114750   edi: 361ae000   ebp: f7365e94   esp: f7365e80
ds: 007b   es: 007b   ss: 0068
Process kexec (pid: 1685, threadinfo=f7364000 task=f6290060)
Stack: 361ae000 361ac000 f7742260 f7364000 00000000 f7365fbc c0126903 f7742260
       c02a71af c03a9aa8 00000001 00000000 f7fe1640 f7793ec0 c1b3b120 f7364000
       00000001 f7365edc c014dbef f7fe1668 f7fe1668 00000286 f7ff51e0 f7365efc
Call Trace:
 [<c0126903>] sys_reboot+0x363/0x400
 [<c014dbef>] invalidate_inode_buffers+0xf/0x90
 [<c01633b0>] clear_inode+0x10/0xb0
 [<c0238276>] sock_destroy_inode+0x16/0x20
 [<c016149e>] dput+0x1e/0x170
 [<c014cb56>] __fput+0x116/0x140
 [<c014b38f>] filp_close+0xcf/0xe0
 [<c014b43e>] sys_close+0x9e/0xd0
 [<c01091c7>] syscall_call+0x7/0xb
Code: f3 a5 a8 02 74 02 66 a5 a8 01 74 01 a4 e8 84 fe ff ff 6a 00

Regards
Suparna

--
Suparna Bhattacharya (su...@in...)
Linux Technology Center
IBM Software Labs, India
|
From: Andy P. <an...@os...> - 2003-02-10 17:09:40
|
On Mon, 2003-02-10 at 03:14, Suparna Bhattacharya wrote:
> I am using the OSDL versions of the kexec patches for
> 2.5.59 (plm 1442 and 1444) for lkcd-kexec based crash dump
> work.

<snip>

> Surprisingly though, when I tried just a simple
> kexec -e today (having loaded the kernel earlier on),
> I ran into the following Oops, consistently:
>
> I'm using kexec-tools-1.8, and this has worked for me
> earlier. The test system is a 4way SMP machine.
>
> Has anyone seen this as well ? (I'd already issued init 1
> and unmounted filesystems by this point)
>
> sh-2.05a# /sbin/kexec -e
> Synchronizing SCSI caches:
> Shutting down devices
> Starting new kernel
> Unable to handle kernel paging request at virtual address 361ae000
> printing eip:
> c011470a
> *pde = 00000000
> Oops: 0002
> CPU: 0
> EIP: 0060:[<c011470a>] Not tainted
> EFLAGS: 00010003
> EIP is at machine_kexec+0x14a/0x190
> eax: 00000097 ebx: f7742260 ecx: 00000025 edx: 361ac000
> esi: c0114750 edi: 361ae000 ebp: f7365e94 esp: f7365e80
> ds: 007b es: 007b ss: 0068
>
> Process kexec (pid: 1685, threadinfo=f7364000 task=f6290060)
> Stack: 361ae000 361ac000 f7742260 f7364000 00000000 f7365fbc
> c0126903 f7742260 c02a71af c03a9aa8 00000001 00000000 f7fe1640
> f7793ec0 c1b3b120 f7364000 00000001 f7365edc c014dbef f7fe1668
> f7fe1668 00000286 f7ff51e0 f7365efc
>
> Call Trace:
> [<c0126903>] sys_reboot+0x363/0x400
> [<c014dbef>] invalidate_inode_buffers+0xf/0x90
> [<c01633b0>] clear_inode+0x10/0xb0
> [<c0238276>] sock_destroy_inode+0x16/0x20
> [<c016149e>] dput+0x1e/0x170
> [<c014cb56>] __fput+0x116/0x140
> [<c014b38f>] filp_close+0xcf/0xe0
> [<c014b43e>] sys_close+0x9e/0xd0
> [<c01091c7>] syscall_call+0x7/0xb
>
> Code: f3 a5 a8 02 74 02 66 a5 a8 01 74 01 a4 e8 84 fe ff ff 6a 00
>
> Regards
> Suparna

Yes, I have seen that exact or similar oops when trying kexec for 2.5.59 on a 2-way Xeon system. The exact same software configuration does *not* generate that oops on a 1-way P3-800 system.

I've had some difficulty with the serial console on that system, so I don't yet have an exact traceback and cannot confirm 100% that yours and mine are identical. It sure *looks* the same.

Andy
|
From: <ebi...@xm...> - 2003-02-10 18:07:26
|
Andy Pfiffer <an...@os...> writes: > On Mon, 2003-02-10 at 03:14, Suparna Bhattacharya wrote: > > I am using the OSDL versions of the kexec patches for > > 2.5.59 (plm 1442 and 1444) for lkcd-kexec based crash dump > > work. > > <snip> > > > > > Surprisingly though, when I tried just a simple > > kexec -e today (having loaded the kernel earlier on), > > I ran into the following Oops, consistently: > > > > I'm using kexec-tools-1.8, and this has worked for me > > earlier. The test system is a 4way SMP machine. > > > > Has anyone seen this as well ? (I'd already issued init 1 > > and unmounted filesystems by this point) Hmm. Would love to know which cpu this is on... I think the primary candidate if this only occurs in smp is the switch_mm. It may be that modifying the init_mm is not safe, or it gets zapped somewhere else. As soon as I get distractions in other directions under control I will take a look. Eric |
From: Suparna B. <su...@in...> - 2003-02-11 07:16:27
Attachments:
kexec-usemm.patch
|
On Mon, Feb 10, 2003 at 11:07:06AM -0700, Eric W. Biederman wrote: > Andy Pfiffer <an...@os...> writes: > > > On Mon, 2003-02-10 at 03:14, Suparna Bhattacharya wrote: > > > Surprisingly though, when I tried just a simple > > > kexec -e today (having loaded the kernel earlier on), > > > I ran into the following Oops, consistently: > > > > > > I'm using kexec-tools-1.8, and this has worked for me > > > earlier. The test system is a 4way SMP machine. > > > > > > Has anyone seen this as well ? (I'd already issued init 1 > > > and unmounted filesystems by this point) > > Hmm. Would love to know which cpu this is on... > > I think the primary candidate if this only occurs in smp is > the switch_mm. It may be that modifying the init_mm is not safe, > or it gets zapped somewhere else. > The following patch from Anton Blanchard's WIP kexec tree for ppc64 seems to fix this for me. It just does a use_mm() (routine from fs/aio.c) instead of switch_mm(). Andy could you try this out and see if it helps ? The other change in Anton's tree that we should probably include uses a separate kexec_mm rather than init_mm for the mapping. Regards Suparna -- Suparna Bhattacharya (su...@in...) Linux Technology Center IBM Software Labs, India |
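The behavioural difference the patch relies on can be caricatured in userspace: a bare switch_mm()-style call merely repoints the CPU at another mm with no reference taken, while use_mm() (the fs/aio.c helper the patch borrows) first pins the new mm so it cannot be torn down underneath the switch. The mock below is only an illustration of that refcounting discipline — the struct and function names are invented, not the kernel's:

```c
#include <stddef.h>

/* Toy model of an mm_struct's lifetime rules. */
struct mm {
    int mm_count;   /* reference count, as in the real mm_struct */
};

struct task {
    struct mm *active_mm;
};

/* switch_mm-style: repoint the task, no reference taken. If the old
 * owner drops the mm concurrently, we are left using freed memory. */
static void plain_switch(struct task *t, struct mm *next)
{
    t->active_mm = next;
}

/* use_mm-style (after the fs/aio.c helper): pin the new mm before
 * switching, and only then release the old one. */
static void use_mm_style(struct task *t, struct mm *next)
{
    struct mm *prev = t->active_mm;
    next->mm_count++;      /* pin before switching */
    t->active_mm = next;
    if (prev)
        prev->mm_count--;  /* now safe to release */
}
```

The "pin before switch, release after" ordering is the whole trick: at no point does the task reference an mm whose count could reach zero.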
From: Andy P. <an...@os...> - 2003-02-11 17:05:21
|
On Mon, 2003-02-10 at 23:21, Suparna Bhattacharya wrote: <snip> > The following patch from Anton Blanchard's WIP kexec tree > for ppc64 seems to fix this for me. It just does a use_mm() > (routine from fs/aio.c) instead of switch_mm(). > > Andy could you try this out and see if it helps ? > > The other change in Anton's tree that we should probably > include uses a separate kexec_mm rather than init_mm > for the mapping. > > Regards > Suparna Will do. --Andy |
From: Andy P. <an...@os...> - 2003-02-11 23:47:10
|
On Tue, 2003-02-11 at 09:04, Andy Pfiffer wrote: > On Mon, 2003-02-10 at 23:21, Suparna Bhattacharya wrote: > <snip> > > The following patch from Anton Blanchard's WIP kexec tree > > for ppc64 seems to fix this for me. It just does a use_mm() > > (routine from fs/aio.c) instead of switch_mm(). > > > > Andy could you try this out and see if it helps ? > > > > The other change in Anton's tree that we should probably > > include uses a separate kexec_mm rather than init_mm > > for the mapping. > > > > Regards > > Suparna > > Will do. --Andy Answer: hard lock-up after decompressing the kernel. I'll see if I can get anything meaningful out of the system before it wedges. Andy |
From: <ebi...@xm...> - 2003-02-12 04:30:24
|
Andy Pfiffer <an...@os...> writes: > On Tue, 2003-02-11 at 09:04, Andy Pfiffer wrote: > > On Mon, 2003-02-10 at 23:21, Suparna Bhattacharya wrote: > > <snip> > > > The following patch from Anton Blanchard's WIP kexec tree > > > for ppc64 seems to fix this for me. It just does a use_mm() > > > (routine from fs/aio.c) instead of switch_mm(). > > > > > > Andy could you try this out and see if it helps ? > > > > > > The other change in Anton's tree that we should probably > > > include uses a separate kexec_mm rather than init_mm > > > for the mapping. > > > > > > Regards > > > Suparna > > > > Will do. --Andy > > Answer: hard lock-up after decompressing the kernel. I'll see if I can > get anything meaningful out of the system before it wedges. Which kernel is wedging. The kexec'd kernel. Or the kernel with the patch? Eric |
From: Andy P. <an...@os...> - 2003-02-12 22:32:15
|
On Tue, 2003-02-11 at 20:29, Eric W. Biederman wrote: > Andy Pfiffer <an...@os...> writes: > > > On Tue, 2003-02-11 at 09:04, Andy Pfiffer wrote: > > > On Mon, 2003-02-10 at 23:21, Suparna Bhattacharya wrote: > > > <snip> > > > > The following patch from Anton Blanchard's WIP kexec tree > > > > for ppc64 seems to fix this for me. It just does a use_mm() > > > > (routine from fs/aio.c) instead of switch_mm(). > > > > > > > > Andy could you try this out and see if it helps ? > > > > <snip> > > > > Regards > > > > Suparna > > > > > > Will do. --Andy > > > > Answer: hard lock-up after decompressing the kernel. I'll see if I can > > get anything meaningful out of the system before it wedges. > > Which kernel is wedging. The kexec'd kernel. Or the kernel with > the patch? > > Eric Correction: this patch is now working for me. While pruning my .config to debug my serial console problem, kexec worked on a 2-way for me several times in a row without failure. (I hadn't properly updated my script that invokes kexec with my preferred command line arguments). I'll add the patchlet to our PLM system, and try the entire package again on 2.5.60 on a 2-way. Andy |
From: Suparna B. <su...@in...> - 2003-02-13 09:45:04
|
On Wed, Feb 12, 2003 at 02:31:57PM -0800, Andy Pfiffer wrote: > On Tue, 2003-02-11 at 20:29, Eric W. Biederman wrote: > > Andy Pfiffer <an...@os...> writes: > > > On Tue, 2003-02-11 at 09:04, Andy Pfiffer wrote: > > > > On Mon, 2003-02-10 at 23:21, Suparna Bhattacharya wrote: > > > > <snip> > > > > > The following patch from Anton Blanchard's WIP kexec tree > > > > > for ppc64 seems to fix this for me. It just does a use_mm() > > > > > (routine from fs/aio.c) instead of switch_mm(). > > > > > > > > > > Andy could you try this out and see if it helps ? > > > > > > <snip> > > > > > Regards > > > > > Suparna > > > > > > > > Will do. --Andy > > > > > > Answer: hard lock-up after decompressing the kernel. I'll see if I can > > > get anything meaningful out of the system before it wedges. > > > > Which kernel is wedging. The kexec'd kernel. Or the kernel with > > the patch? > > > > Eric > > Correction: this patch is now working for me. While pruning my .config > to debug my serial console problem, kexec worked on a 2-way for me > several times in a row without failure. (I hadn't properly updated my > script that invokes kexec with my preferred command line arguments). Great ! Eventually we should probably avoid init_mm altogether (on ppc64 at least, init_mm can't be used as Anton pointed out to me) and setup a spare mm instead. Regards Suparna -- Suparna Bhattacharya (su...@in...) Linux Technology Center IBM Software Labs, India |
From: <ebi...@xm...> - 2003-02-13 15:11:14
|
Suparna Bhattacharya <su...@in...> writes: > Great ! > Eventually we should probably avoid init_mm altogether (on ppc64 > at least, init_mm can't be used as Anton pointed out to me) and > setup a spare mm instead. What is the problem with init_mm? Besides the fact that using it is now failing? Eric |
From: Suparna B. <su...@in...> - 2003-02-18 10:55:48
|
Here's the explanation from Anton about why using init_mm is a problem on ppc64.

Regards
Suparna

----- Forwarded message from Anton Blanchard <an...@sa...> -----

Date: Tue, 18 Feb 2003 20:56:23 +1100
From: Anton Blanchard <an...@sa...>
To: Suparna Bhattacharya <su...@in...>
Subject: Re: Fw: Re: [Fastboot] Re: Kexec on 2.5.59 problems ?

On Thu, Feb 13, 2003 at 08:10:41AM -0700, Eric W. Biederman wrote:
> Suparna Bhattacharya <su...@in...> writes:
> > Great !
> > Eventually we should probably avoid init_mm altogether (on ppc64
> > at least, init_mm can't be used as Anton pointed out to me) and
> > set up a spare mm instead.
>
> What is the problem with init_mm? Besides the fact that using it
> is now failing?

Hi Suparna,

On ppc64 we have many 2^41B (2TB) regions:

USER
KERNEL
VMALLOC
IO

Why 2TB? Well, our three-level Linux pagetables can map 2TB. The kernel has no pagetables, so we only need three sets of pagetables. As usual, each user task has its own set of pagetables. So that leaves vmalloc and IO. For IO we create our own pgd, ioremap_pgd, and for vmalloc we use init_mm. Why not? It's not being used anywhere else... except for kexec.

So init_mm covers the region of:

0xD000000000000000 to 0xD000000000000000+2^41

And what kexec wants is a page under 4GB :) That's why we created another mm.

Could you please forward it on to the list? Thanks!

Anton

----- End forwarded message -----

--
Suparna Bhattacharya (su...@in...)
Linux Technology Center
IBM Software Labs, India
|
From: <ebi...@xm...> - 2003-02-18 15:06:52
|
Suparna Bhattacharya <su...@in...> writes:
> Here's the explanation from Anton about why using init_mm is
> a problem on ppc64.

Thanks.

> Hi Suparna,
>
> On ppc64 we have many 2^41B (2TB) regions:
>
> USER
> KERNEL
> VMALLOC
> IO
>
> Why 2TB? Well, our three-level Linux pagetables can map 2TB. The kernel
> has no pagetables, so we only need three sets of pagetables. As usual,
> each user task has its own set of pagetables. So that leaves vmalloc
> and IO.
>
> For IO we create our own pgd, ioremap_pgd, and for vmalloc we use
> init_mm. Why not? It's not being used anywhere else... except for kexec.
>
> So init_mm covers the region of:
>
> 0xD000000000000000 to 0xD000000000000000+2^41
>
> And what kexec wants is a page under 4GB :)

In this case it definitely wants something identity mapped, which would mean in the first 2TB region. On x86 the limit is 4GB because I only have 32-bit pointers. On a 64-bit arch that limit should go away.

> That's why we created another mm.

That makes sense. I guess it boils down to the fact that init_mm is special-cased in a number of places, and using it is likely to get me into trouble...

You would not happen to have code that creates a separate mm, so I can be lazy, would you?

Eric
|
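Anton's layout can be sanity-checked with a little arithmetic: each region is 2^41 bytes, init_mm maps only the vmalloc region starting at 0xD000000000000000, and the page kexec wants must be identity-mapped (on x86, below 4GB) — an address no page in that region can ever have. A small check of those numbers (constants taken from Anton's mail; the function name is invented for illustration):

```c
#include <stdint.h>

#define REGION_SIZE   (1ULL << 41)            /* 2 TB per ppc64 region */
#define VMALLOC_BASE  0xD000000000000000ULL   /* the region init_mm maps */

/* Is addr inside the vmalloc region covered by init_mm? */
static int in_vmalloc_region(uint64_t addr)
{
    return addr >= VMALLOC_BASE && addr < VMALLOC_BASE + REGION_SIZE;
}
```

Since every address init_mm can map is at or above VMALLOC_BASE, no identity-mapped low page can come out of it — hence the separate mm.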
From: Suparna B. <su...@in...> - 2003-02-10 12:07:18
|
On Sun, Feb 09, 2003 at 11:39:27AM -0700, Eric W. Biederman wrote:
> Corey Minyard <cmi...@mv...> writes:
>
> With respect to DMA and SMP handling for kexec on panic that case is
> much trickier. A lot of the normal methods simply don't apply because
> by definition in a panic something is broken, and that something may
> be the code we need to cleanly shut down the hardware. But I am not
> ready to sacrifice a method that works well in a properly working
> kernel just because the panic case can't use it.
>
> In getting it working I suggest we start with the easy cases, where
> DMA and SMP are not big issues. And then we can have a working
> framework.

I'd agree. That was also the idea behind the patch we'd just posted for LKCD. With a basic working framework in hand that works for simpler cases, we can now keep working on addressing more and harder situations bit by bit.

> I am still digesting the crash dump code I have seen, but as far as I
> can tell what it does is to compress the contents of memory, for
> writing out later.

Yes. It actually saves a formatted compressed dump in memory, and later writes it out to disk as is.

> To handle the hard cases for kexec on panic I would recommend the
> following.
>
> - Place the recovery code in a reserved area of memory that the normal
>   kernel will not touch, and actually run the code there. This
>   trivially solves the DMA problem because the hardware is not DMA'ing
>
> - Setup the kernel that does the recovery so that the pool of memory
>   it uses for dynamic allocations is also in the reserved area of
>   memory so that it is equally free of DMA dangers.
>
> - Modify the kernel that does the recovery so it can be run at a
>   different physical address from the standard kernel, so it will not
>   need to be moved out of the reserved area of memory.

Are you trying to address the possibility that DMA is overwriting memory we are using in the recovery code, due to a runaway driver or other code passing a wrong memory address to a device (e.g. in a corrupted command area)? I'm wondering if just reserving an area of memory would help. As long as the address is visible/accessible by the device (i.e. unless we have the h/w support to apply protection at that level), can we really be safe in those weird or rare cases? Disabling the bus-master sounds like a more dependable option for that (via device shutdown or reboot notifiers as suitable) if it can be done.

Placing the recovery code in a safe reserved area (that the running kernel may not know about or may be protected) may reduce the possibility of the panic/buggy kernel overwriting it, but will it help the DMA case?

> - Modify the kernel that does the recovery to not care about
>   which cpu in an SMP system it comes up on first.
>
> - Modify the kernel that does the recovery so that it is very robust
>   in reinitializing devices. So it can cope with devices in a random
>   state. Though most devices can be handled by simply ignoring them.
>
> - Possibly preserve in the reserved area a separate copy of the tables
>   ACPI/MP/etc that the kernel needs for coming up. I actually don't
>   think this needs to happen as the kernel preserves those in place
>   already.
>
> At that point I believe a full memory core dump can be achieved
> without needing to do anything except to jump to the other kernel
> on panic. All of the memory can be preserved because the kexec case
> would not have touched it.
>
> I find this very attractive because it can be done with a very low
> impact on the primary kernel whose panic we want to capture, plus it
> is an extremely robust solution.
>
> The one piece I don't know about is how to prioritize which pieces of
> memory are written out first. It is certainly a desirable feature,
> but do we need that, if we can preserve everything? Or is it easy
> enough to get the prioritizing information that we don't care.

LKCD has support for doing that - it provides for specifying dump levels, to dump just a header, kernel pages, all in-use pages, and full memory. This can be tuned/extended for some more intermediate levels (e.g. header + stack traces for all cpus). There is also some work-in-progress code for more granular dump customisation as a future item.

While the patch I'd posted has been designed so that ideally it should be possible to preserve everything, I'm still not certain if the compression we get is good enough for all cases (e.g. a heavily loaded system with lots of non-redundant data) -- we really need to play around with the implementation and tune it. Secondly, for a large memory system, it could take a bit of time to compress all pages, and we might just want to dump potentially more relevant data (e.g. kernel pages) for some kinds of problems. It was easy enough to do this with some simple heuristics like dumping in-use pages which are non-LRU.

Regards
Suparna

--
Suparna Bhattacharya (su...@in...)
Linux Technology Center
IBM Software Labs, India
|
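Whether "preserve everything" fits in memory depends entirely on the compression ratio being debated here. As a toy illustration of why typical memory contents compress so well, a run-length encoder applied to a zero-filled (free) page — real dump code uses far stronger schemes, and the function below is invented purely for illustration, but even this trivial encoding collapses a free page to a few dozen bytes:

```c
#include <stddef.h>

/* Trivial run-length encoder: emits (count, byte) pairs with count
 * capped at 255, and returns the encoded length. Only an illustration
 * of compressibility, not the scheme any real dump tool uses. */
static size_t rle_encode(const unsigned char *in, size_t n,
                         unsigned char *out)
{
    size_t o = 0, i = 0;
    while (i < n) {
        size_t run = 1;
        while (i + run < n && run < 255 && in[i + run] == in[i])
            run++;
        out[o++] = (unsigned char)run;
        out[o++] = in[i];
        i += run;
    }
    return o;
}
```

A 4096-byte page of zeroes encodes to 17 runs (16 of length 255 plus one of 16), i.e. 34 bytes — roughly a 120:1 ratio on the easy case, which is where most of the headroom for an in-memory dump comes from.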
From: Corey M. <cmi...@mv...> - 2003-02-10 13:56:43
|
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Suparna Bhattacharya wrote:

|>I am still digesting the crash dump code I have seen, but as far as I
|>can tell what it does is to compress the contents of memory, for
|>writing out later.
|
|Yes. It actually saves a formatted compressed dump in memory,
|and later writes it out to disk as is.

MCL coredump does funny memory shuffling, too. It compresses pages into a contiguous area of memory, and as it runs into output pages that it has not yet compressed, it moves them into pages that it has already compressed and keeps track of where everything is located. That's a lot of the complexity of MCL coredump.

|>To handle the hard cases for kexec on panic I would recommend the
|>following.
|>
|>- Place the recovery code in a reserved area of memory that the normal
|>  kernel will not touch, and actually run the code there. This
|>  trivially solves the DMA problem because the hardware is not DMA'ing
|>
|>- Setup the kernel that does the recovery so that the pool of memory
|>  it uses for dynamic allocations is also in the reserved area of
|>  memory so that it is equally free of DMA dangers.
|>
|>- Modify the kernel that does the recovery so it can be run at a
|>  different physical address from the standard kernel, so it will not
|>  need to be moved out of the reserved area of memory.
|
|Are you trying to address the possibility that DMA is overwriting
|memory we are using in the recovery code, due to a runaway driver
|or other code passing a wrong memory address to a device (e.g. in
|a corrupted command area)? I'm wondering if just reserving
|an area of memory would help. As long as the address is visible/
|accessible by the device (i.e. unless we have the h/w support to
|apply protection at that level), can we really be safe in those
|weird or rare cases? Disabling the bus-master sounds like a
|more dependable option for that (via device shutdown or reboot
|notifiers as suitable) if it can be done.
|
|Placing the recovery code in a safe reserved area (that the
|running kernel may not know about or may be protected)
|may reduce the possibility of the panic/buggy kernel overwriting
|it, but will it help the DMA case?

Eric, I'd suggest you go with your previous advice and start simple and go one step at a time. You will never be able to build a system that can perfectly protect from anything that can go wrong. So start with the simple case that covers 95% of the problems. Build a system first that lets the driver quiesce the chip, then think about moving those functions and their data into special protected memory. I've actually never seen a dump taken with MCL coredump where the memory corruption was so bad it couldn't take the dump.

|While the patch I'd posted has been designed so that ideally
|it should be possible to preserve everything, I'm still not
|certain if the compression we get is good enough for all cases
|(e.g. a heavily loaded system with lots of non-redundant data)
|-- we really need to play around with the implementation and
|tune it. Secondly, for a large memory system, it could take a
|bit of time to compress all pages, and we might just want to
|dump potentially more relevant data (e.g. kernel pages) for
|some kinds of problems. It was easy enough to do this with some
|simple heuristics like dumping in-use pages which are non-LRU.

From my experience, data in memory is very compressible (more so than the average text file). Perhaps some pieces are not very compressible, but on the whole they are. Plus you don't have to have that much compression for this to work, just enough to give you memory to boot the next kernel and save off a dump. And speed is probably not a big issue here, since this should be a very rare occurrence.

-Corey
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE+R6+OmUvlb4BhfF4RAhW7AJ9ZUCbWk6TBLvbwYunyKMN0dAxf+QCff21/
WoOfzq4NrjYv3E0bOYhwSD8=
=T9Y9
-----END PGP SIGNATURE-----
|
From: Suparna B. <su...@in...> - 2003-02-10 15:02:05
|
On Mon, Feb 10, 2003 at 07:56:35AM -0600, Corey Minyard wrote:
> Suparna Bhattacharya wrote:
>
> |Yes. It actually saves a formatted compressed dump in memory,
> |and later writes it out to disk as is.
>
> MCL coredump does funny memory shuffling, too. It compresses
> pages into a contiguous area of memory, and as it runs into output
> pages that it has not yet compressed, it moves them into pages that
> it has already compressed and keeps track of where everything is

AFAICR, the MCL coredump implementation I'd seen (and used as a
reference to model some of this code for lkcd) seemed to save only a
kernel dump (not user-space pages), so it would use the free and user
pages as destinations for the compressed dump. What you are describing
sounds a little different and closer to what we are doing. I'd be
interested in taking a look at the implementation you are working with
if it actually saves the whole of memory by making use of pages it has
already compressed. Could you point me to the code?

> located. That's a lot of the complexity of MCL coredump.
>
> |While the patch I'd posted has been designed so that ideally
> |it should be possible to preserve everything, I'm still not
> |certain if the compression we get is good enough for all cases
> |(e.g. a heavily loaded system with lots of non-redundant data)
> |-- we really need to play around with the implementation and
> |tune it. Secondly, for a large memory system, it could take a
> |bit of time to compress all pages, and we might just want to
> |dump potentially more relevant data (e.g. kernel pages) for
> |some kinds of problems. It was easy enough to do this with some
> |simple heuristics like dumping in-use pages which are non-LRU.
>
> From my experience, data in memory is very compressible
> (more so than the average text file). Perhaps some pieces are
> not very compressible, but on the whole they are. Plus you don't

Well, it may just be a matter of how our implementation is tuned. MCL
compresses a much larger buffer at a time than we do at the moment (we
did it a page at a time to simplify some of the tracking in the dump
format), so that could be one factor to consider and maybe rethink.
It's a little early to say, though; I need to investigate further.

> have to have that much compression for this to work, just enough
> to give you memory to boot the next kernel and save off a dump.

It also has to be enough to not overwrite the current kernel (at least
the parts that the dump-saving code is using or relying on).

> And speed is probably not a big issue here, since this should be a
> very rare occurrence.

Speed is secondary of course, but just good to keep in mind for very
large memory systems.

Regards
Suparna

--
Suparna Bhattacharya (su...@in...)
Linux Technology Center
IBM Software Labs, India
|
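The buffer-size question raised above (page-at-a-time compression, as in the lkcd patch, versus a larger buffer, as in MCL) is easy to demonstrate in user space. The following is only an illustrative sketch using Python's zlib, not the lkcd or MCL code: compressing each 4 KB page as an independent stream resets the compressor's dictionary at every page boundary, so redundancy that spans pages is lost.

```python
import zlib

PAGE_SIZE = 4096

# Simulated memory with redundancy spanning page boundaries, which is
# typical of kernel data (many similar, repeated structures).
pattern = b"struct task_struct { pid=%d state=RUNNING } "
memory = (pattern * 2000)[:PAGE_SIZE * 16]
pages = [memory[i:i + PAGE_SIZE] for i in range(0, len(memory), PAGE_SIZE)]

# Page-at-a-time: each page becomes an independent zlib stream.
per_page = sum(len(zlib.compress(p)) for p in pages)

# Larger buffer: one stream over all pages, so repeated content in
# later pages compresses against content seen in earlier pages.
whole = len(zlib.compress(memory))

print(per_page, whole)  # the single large buffer compresses better
```

The per-page scheme also pays a fixed stream header/trailer cost per page, which is why MCL's larger buffers are one plausible factor in the compression-ratio difference discussed here.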
From: Corey M. <cmi...@mv...> - 2003-02-10 15:22:59
|
Suparna Bhattacharya wrote:
|AFAICR, the MCL coredump implementation I'd seen (and used as
|a reference to model some of this code for lkcd) seemed to
|save only a kernel dump (not user-space pages), so it would
|use the free and user pages as destinations for the compressed
|dump. What you are describing sounds a little different and
|closer to what we are doing. I'd be interested in taking a look
|at the implementation you are working with if it actually
|saves the whole of memory by making use of pages it has already
|compressed. Could you point me to the code?

I remembered incorrectly here. I was thinking of bootimg, which does
some weird page shuffling. MCL coredump does not save in a contiguous
region; it keeps a free list of pages it has already compressed,
allocates destination pages from its free list, and stores those in a
map.

-Corey
|
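The free-list scheme Corey describes, where source pages join the destination pool once their contents have been compressed, can be modelled in a few lines. This is a toy user-space simulation under my own assumptions (in particular the one-frame bootstrap pool and the name `page_map`), not MCL's actual implementation:

```python
import zlib

PAGE_SIZE = 4096
# Simulated physical memory: 32 compressible pages.
pages = [bytes([i]) * PAGE_SIZE for i in range(32)]

free_list = [-1]   # frame -1: a small reserved page seeds the pool
page_map = {}      # source page number -> (destination frame, data)

for src, data in enumerate(pages):
    comp = zlib.compress(data)
    assert len(comp) <= PAGE_SIZE   # holds while pages compress well
    dest = free_list.pop()          # allocate from the free list
    page_map[src] = (dest, comp)
    free_list.append(src)           # src is saved now; reuse its frame

# Later, the dump writer walks the map to write out (here: verify).
restored = [zlib.decompress(page_map[i][1]) for i in range(len(pages))]
assert restored == pages
```

The invariant that makes this work is the one debated in the thread: each compressed page must fit in at most one freed frame, so the free list never runs dry; a page of non-redundant data that fails to compress would break it.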
From: <ebi...@xm...> - 2003-02-10 17:57:06
|
Suparna Bhattacharya <su...@in...> writes:
> On Sun, Feb 09, 2003 at 11:39:27AM -0700, Eric W. Biederman wrote:
> > Corey Minyard <cmi...@mv...> writes:
> >
> > With respect to DMA and SMP handling for kexec on panic, that case
> > is much trickier. A lot of the normal methods simply don't apply,
> > because by definition in a panic something is broken, and that
> > something may be the code we need to cleanly shut down the
> > hardware. But I am not ready to sacrifice a method that works well
> > in a properly working kernel just because the panic case can't use
> > it.
> >
> > In getting it working I suggest we start with the easy cases,
> > where DMA and SMP are not big issues. Then we can have a working
> > framework.
>
> I'd agree. That was also the idea behind the patch we'd just posted
> for LKCD. With a basic working framework in hand that works for
> simpler cases, we can now keep working on addressing more and harder
> situations bit by bit.

Agreed. I guess the primary question is: can we trust the current
device shutdown + reboot notifier path, or do we need to make some
large changes to avoid it?

> Are you trying to address the possibility that DMA is overwriting
> memory we are using in the recovery code, due to a runaway driver
> or other code passing a wrong memory address to a device (e.g. in
> a corrupted command area) ?

Not primarily. Instead I am trying to address the possibility that DMA
is overwriting the recovery code due to a device not being shut down
properly. Though it would happen to cover many cases of a wrong memory
address being passed to a device.

> I'm wondering if just reserving an area of memory would help. As
> long as the address is visible/accessible by the device (i.e. unless
> we have the h/w support to apply protection at that level), can we
> really be safe in those weird or rare cases ? Disabling the
> bus-master sounds like a more dependable option for that (via device
> shutdown or reboot notifiers as suitable) if it can be done.

Basically, using a reserved area of memory is an alternative to device
shutdown or calling the reboot notifiers. If the device shutdown code
is reliable enough, we can go with that...

The other benefit of a reserved area of memory is that you can
simplify the other cases: you don't need to do anything before the
dump, because everything is preserved.

> Placing the recovery code in a safe reserved area (that the
> running kernel may not know about or may be protected)
> may reduce the possibility of the panic/buggy kernel overwriting
> it, but will it help the DMA case ?

Yes, for the same reasons. I am definitely not trying to address the
case of buggy hardware.

> > The one piece I don't know about is how to prioritize which pieces
> > of memory are written out first. It is certainly a desirable
> > feature, but do we need it if we can preserve everything? Or is
> > the prioritizing information easy enough to get that we don't
> > care?
>
> LKCD has support for doing that - it provides for specifying dump
> levels, to dump just a header, kernel pages, all in-use pages, or
> full memory. This can be tuned/extended for some more intermediate
> levels (e.g. header + stack traces for all cpus).

And that is why I thought of it. I need to review how that portion of
the code is done.

The one downside of the simplifications that come with a reserved area
of memory is that they make knowing the set of kernel allocations a
challenge. However, in most cases all in-use pages ~= full memory, and
the kernel pages can be computed statically. For more, I guess it
would be necessary to pass information about the current kernel's data
structures for tracking free memory to the dumper. So the
functionality does not need to be lost, but providing it becomes a
different problem.
> There is also some work-in-progress code for more granular
> dump customisation as a future item.
>
> While the patch I'd posted has been designed so that ideally
> it should be possible to preserve everything, I'm still not
> certain if the compression we get is good enough for all cases
> (e.g. a heavily loaded system with lots of non-redundant data)
> -- we really need to play around with the implementation and
> tune it.

And I am certain that with a preserved memory area we can preserve
everything without compression.

> Secondly, for a large memory system, it could take a
> bit of time to compress all pages, and we might just want to
> dump potentially more relevant data (e.g. kernel pages) for
> some kinds of problems. It was easy enough to do this with some
> simple heuristics like dumping in-use pages which are non-LRU.

I see. So you definitely have some interesting heuristics to pick
which pages to dump. I hate to break that, but...

Eric
|
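The dump levels described above (header only, kernel pages, all in-use pages, full memory) amount to a per-page selection filter where each level is a superset of the one below. A hedged sketch of that selection logic follows; the flag and function names are invented for illustration and are not LKCD's actual identifiers:

```python
# Per-page state flags (hypothetical names, not LKCD's real ones).
KERNEL, INUSE, LRU = 0x1, 0x2, 0x4

def selected(level, flags):
    """Decide whether a page with the given flags belongs in a dump
    at the given level. Each level selects a superset of the last."""
    if level == 0:                      # level 0: header only
        return False
    if level == 1:                      # level 1: kernel pages
        return bool(flags & KERNEL)
    if level == 2:                      # level 2: all in-use pages
        return bool(flags & (KERNEL | INUSE))
    return True                         # level 3: full memory

# A small mock page table: free, kernel, user in-use, user LRU, mixed.
pages = [0x0, KERNEL, INUSE, INUSE | LRU, KERNEL | INUSE]
counts = [sum(selected(lv, f) for f in pages) for lv in range(4)]
print(counts)
```

Suparna's "in-use but non-LRU" heuristic would slot in as an intermediate level here, testing `flags & INUSE and not flags & LRU`.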
From: Suparna B. <su...@in...> - 2003-02-11 12:49:44
|
On Mon, Feb 10, 2003 at 10:56:43AM -0700, Eric W. Biederman wrote:
> Suparna Bhattacharya <su...@in...> writes:
[snip]
> Not primarily. Instead I am trying to address the possibility that
> DMA is overwriting the recovery code due to a device not being shut
> down properly. Though it would happen to cover many cases of a wrong
> memory address being passed to a device.
[snip]
> The other benefit of a reserved area of memory is that you can
> simplify the other cases: you don't need to do anything before the
> dump, because everything is preserved.
[snip]
> Yes, for the same reasons. I am definitely not trying to address the
> case of buggy hardware.
[snip]
> And I am certain that with a preserved memory area we can preserve
> everything without compression.

OK, I see where you are coming from. It is an interesting possibility,
if you know how to pull it off for various architectures, and if the
working area that the new kernel needs in order to operate to the
extent of issuing the writeout is not too big (i.e. doesn't take away
too much memory from the operational kernel). Perhaps we could hide
this memory from the normal kernel virtual address space most of the
time, so it is less susceptible to software corruption (i.e. besides
physical access via DMA).

At the same time, we do want to quiesce/stop the DMA as soon as
possible, to get a dump that reflects the contents of memory at the
concerned instant as closely as possible. And in the buggy case we
want to stop any malfunctioning DMA (writes) immediately, to minimize
damage.

Regards
Suparna

--
Suparna Bhattacharya (su...@in...)
Linux Technology Center
IBM Software Labs, India
|
From: Suparna B. <su...@in...> - 2003-02-11 13:34:55
|
On Tue, Feb 11, 2003 at 06:25:08PM +0530, Suparna Bhattacharya wrote:
> On Mon, Feb 10, 2003 at 10:56:43AM -0700, Eric W. Biederman wrote:
> > Suparna Bhattacharya <su...@in...> writes:
> [snip]
> > Not primarily. Instead I am trying to address the possibility that
> > DMA is overwriting the recovery code due to a device not being
> > shut down properly. Though it would happen to cover many cases of
> > a wrong memory address being passed to a device.
>
> OK, I see where you are coming from. It is an interesting
> possibility, if you know how to pull it off for various
> architectures, and if the working area that the new kernel needs in
> order to operate to the extent of issuing the writeout is not too
> big (i.e. doesn't take away too much memory from the operational
> kernel). Perhaps we could hide this memory from the normal kernel
> virtual address space most of the time, so it is less susceptible
> to software corruption (i.e. besides physical access via DMA).

For the sort of problems which Ken is seeing, maybe we can, for a
start, do without all the modifications to make the new kernel run at
a different address, if we can assume that most I/O is likely to
happen on dynamically allocated buffers.

We could just reserve a memory area of reasonable size (how much?)
which would be used by the new kernel for all its allocations. We
already have the infrastructure to tell the new kernel which memory
areas not to use, so it is simple enough to ask it to exclude all but
the reserved area. By issuing the I/O as early as possible during
bootup (for lkcd all we need is the block device to be set up for I/O
requests), we can minimize the amount of memory to reserve in this
manner.

That might address a large percentage of the regular cases, i.e.
except where statically allocated buffers could be targets for DMA.
If we are using in-use (user) pages for saving the dump, then there is
a possibility of the dump getting corrupted by DMA, but there may be a
way to minimize that when we choose destination pages to use.

Regards
Suparna

--
Suparna Bhattacharya (su...@in...)
Linux Technology Center
IBM Software Labs, India
|
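The "exclude all but the reserved area" idea reduces to simple range arithmetic over the physical memory map: the old kernel never touches the reserved window, and the new kernel is told to avoid everything else. A sketch follows; the function name and range representation are my own, not kexec's real memory-map tables:

```python
def exclude_all_but(ram_ranges, reserved):
    """Given the machine's RAM ranges and a reserved (start, end)
    window, return the ranges the new kernel must NOT use, i.e.
    everything outside the reserved window."""
    r_start, r_end = reserved
    avoid = []
    for start, end in ram_ranges:
        if end <= r_start or start >= r_end:
            avoid.append((start, end))       # wholly outside window
            continue
        if start < r_start:
            avoid.append((start, r_start))   # part below the window
        if end > r_end:
            avoid.append((r_end, end))       # part above the window
    return avoid

MB = 1 << 20
# Example: 896 MB of RAM, with a 16 MB window reserved at 128 MB.
ram = [(0, 896 * MB)]
reserved = (128 * MB, 144 * MB)
print(exclude_all_but(ram, reserved))
```

How big the reserved window must be is exactly the open "how much?" question above: it has to hold the dormant kernel plus all of its dynamic allocations up to the point where the block device can take I/O requests.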
From: Corey M. <cmi...@mv...> - 2003-02-11 14:06:53
|
Suparna Bhattacharya wrote:
|
|For the sort of problems which Ken is seeing, maybe we can,
|for a start, do without all the modifications to make the
|new kernel run at a different address, if we can assume
|that most I/O is likely to happen on dynamically allocated
|buffers.
|
|We could just reserve a memory area of reasonable size (how
|much ?) which would be used by the new kernel for all its
|allocations. We already have the infrastructure to tell the
|new kernel which memory areas not to use, so it is simple
|enough to ask it to exclude all but the reserved area.
|By issuing the I/O as early as possible during bootup
|(for lkcd all we need is the block device to be set up for
|I/O requests), we can minimize the amount of memory to
|reserve in this manner.

DMA can occur almost anywhere. If you restrict the area of DMA, that
means you have to copy the contents to the final destination. I don't
think we want to do that in many cases.

|That might address a large percentage of the regular cases,
|i.e. except where statically allocated buffers could be
|targets for DMA. If we are using in-use (user) pages
|for saving the dump, then there is a possibility of the dump
|getting corrupted by DMA, but there may be a way to
|minimize that when we choose destination pages to use.

Unless you have some way to mark pages as current DMA targets, you
won't be able to do this. And the problem Ken and I are seeing is
happening after the new kernel has booted: an old DMA operation is
occurring after the new kernel has booted. That means two kernels
would have to choose the same DMA target areas, and that's fairly
unreasonable to ask, IMHO.

The only reasonable way I can think of to do this is to quiesce the
devices before dumping memory or doing a kexec. It's not that hard to
do; it's just that a lot of DMA-capable device drivers exist that
don't do this.

-Corey
|
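Quiescing devices before the kexec, as proposed above, amounts to walking the device list and disabling bus mastering on everything that supports it, while tolerating the many drivers that provide no such hook. A toy model of that walk follows; the class and hook names are invented for illustration, not actual kernel driver-model interfaces:

```python
class Device:
    def __init__(self, name, quiesce=None):
        self.name = name
        self._quiesce = quiesce   # driver hook; many drivers lack one
        self.dma_active = True

    def try_quiesce(self):
        if self._quiesce is None:
            return False          # driver never implemented shutdown
        self._quiesce(self)
        return True

def stop_dma(dev):
    dev.dma_active = False        # e.g. clear the PCI bus-master bit

devices = [Device("eth0", stop_dma), Device("scsi0", stop_dma),
           Device("oldcard0")]    # oldcard0 has no shutdown hook

not_quiesced = [d.name for d in devices if not d.try_quiesce()]
print(not_quiesced)   # devices that may still be doing DMA at kexec
```

The list of devices left unquiesced is exactly the exposure Corey describes: any of them can still DMA into memory after the new kernel has booted.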
From: Suparna B. <su...@in...> - 2003-02-11 14:34:59
|
On Tue, Feb 11, 2003 at 08:06:44AM -0600, Corey Minyard wrote:
> |We could just reserve a memory area of reasonable size (how
> |much ?) which would be used by the new kernel for all its
> |allocations. We already have the infrastructure to tell the
> |new kernel which memory areas not to use, so it is simple
> |enough to ask it to exclude all but the reserved area.
> |By issuing the I/O as early as possible during bootup
> |(for lkcd all we need is the block device to be set up for
> |I/O requests), we can minimize the amount of memory to
> |reserve in this manner.
>
> DMA can occur almost anywhere. If you restrict the area of DMA,
> that means you have to copy the contents to the final destination.
> I don't think we want to do that in many cases.

The scope here was just the case that Eric seemed to be trying to
address, the way I understood it, and hence a much simpler subset of
the problem at hand, since it is not really tackling the rogue/buggy
cases. There is no restriction on where DMA can happen, just a block
of memory set aside for the dormant kernel to use when it is
instantiated. So this is an area that the current kernel will not use
or touch, and will not specify as a DMA target, during "regular"
operation.

> |That might address a large percentage of the regular cases,
> |i.e. except where statically allocated buffers could be
> |targets for DMA. If we are using in-use (user) pages
> |for saving the dump, then there is a possibility of the dump
> |getting corrupted by DMA, but there may be a way to
> |minimize that when we choose destination pages to use.
>
> Unless you have some way to mark pages as current DMA targets, you
> won't be able to do this. And the problem Ken and I are seeing is
> happening after the new kernel has booted. An old DMA operation is
> occurring after the new kernel has booted. That means two kernels
> would have to choose the same DMA target areas, and that's fairly
> unreasonable to ask, IMHO.

Not really; this isn't about matching DMA target areas. It's about the
new kernel ignoring memory that the old kernel was using, and only
using the reserved area of memory which the old kernel was expected to
have left alone in normal operation. This does not cover the entire
spectrum of situations, where any physical address could be a
potential DMA target because a buggy kernel could have passed any
address to the device concerned. For that case, of course, quiescing
the devices seems like the best way out so far.

So whether such a reservation would solve the case you see depends on
whether the old DMA operation is targeted at a valid buffer in the old
kernel, or whether it is indeed a buggy scenario where DMA is
happening into an address it shouldn't really be overwriting.

> The only reasonable way I can think of to do this is to quiesce the
> devices before dumping memory or doing a kexec. It's not that hard
> to do; it's just that a lot of DMA-capable device drivers exist
> that don't do this.

Yes, this is indeed what we need eventually. What would it take to get
there? The main difficulty is making sure all device drivers do this.

Regards
Suparna

--
Suparna Bhattacharya (su...@in...)
Linux Technology Center
IBM Software Labs, India
|