From: Martin J. B. <mb...@ar...> - 2003-02-24 18:09:00
|
The patchset contains mainly scalability and NUMA stuff, and anything else
that stops things from irritating me. It's meant to be pretty stable, not so
much a testing ground for new stuff.

I'd be very interested in feedback from anyone willing to test on any
platform, however large or small.

ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/2.5.62/patch-2.5.62-mjb3.bz2

additional:
http://www.aracnet.com/~fletch/linux/2.5.59/pidmaps_nodepages

Since 2.5.62-mjb2 (~ = changed, + = added, - = dropped)

Notes: Fixes some critical scheduler hangs.

- discontig_x440		Pat Gaughen / IBM NUMA team
+ early_ioremap			Dave Hansen
+ x440disco_A0			Pat Gaughen / IBM NUMA team
+ fix_was_sched			Ingo / wli / Rick Lindsley
+ no_kirq			Martin J. Bligh
+ auto_disable_tsc		John Stultz
+ cleaner_inodes		Andrew Morton

Pending:
scheduler callers profiling (Anton)
PPC64 NUMA patches (Anton)
Child runs first (akpm)
Kexec
e1000 fixes
Non-PAE aligned kernel splits (Dave Hansen)
Update the lost timer ticks code
Ingo scheduler updates

Present in this patch:

early_printk			Dave Hansen et al.
	Allow printk before console_init

confighz			Andrew Morton / Dave Hansen
	Make HZ a config option of 100 Hz or 1000 Hz

config_page_offset		Dave Hansen / Andrea
	Make PAGE_OFFSET a config option

vmalloc_stats			Dave Hansen
	Expose useful vmalloc statistics

local_pgdat			William Lee Irwin
	Move the pgdat structure into the remapped space with lmem_map

numameminfo			Martin Bligh / Keith Mannthey
	Expose NUMA meminfo information under /proc/meminfo.numa

notsc				Martin Bligh
	Enable notsc option for NUMA-Q (new version for new config system)

mpc_apic_id			Martin J. Bligh
	Fix null ptr dereference (optimised away, but ...)

doaction			Martin J. Bligh
	Fix cruel torture of macros and small furry animals in io_apic.c

kgdb				Andrew Morton / Various People
	The older version of kgdb, synched with 2.5.54-mm1

noframeptr			Martin Bligh
	Disable -fomit-frame-pointer

ingosched			Ingo Molnar
	Modify NUMA scheduler to have independent tick basis.

schedstat			Rick Lindsley
	Provide stats about the scheduler under /proc/stat

sched_tunables			Robert Love
	Provide tunable parameters for the scheduler (+ NUMA scheduler)

early_ioremap			Dave Hansen
	Provide ioremap in very early boot when we only have 8Mb address space

x440disco_A0			Pat Gaughen / IBM NUMA team
	SLIT/SRAT parsing for x440 discontigmem

acpi_x440_hack			Anonymous Coward
	Stops x440 crashing, but owner is ashamed of it ;-)

numa_pci_fix			Dave Hansen
	Fix a potential error in the numa pci code from Stanford Checker

pfn_to_nid			William Lee Irwin
	Turn pfn_to_nid into a macro

kprobes				Vamsi Krishna S
	Add kernel probes hooks to the kernel

dmc_exit1			Dave McCracken
	Speed up the exit path, pt 1.

dmc_exit2			Dave McCracken
	Speed up the exit path, pt 1.

shpte				Dave McCracken
	Shared pagetables (as a config option)

thread_info_cleanup (4K stacks pt 1)	Dave Hansen / Ben LaHaise
	Prep work to reduce kernel stacks to 4K

interrupt_stacks (4K stacks pt 2)	Dave Hansen / Ben LaHaise
	Create a per-cpu interrupt stack.

stack_usage_check (4K stacks pt 3)	Dave Hansen / Ben LaHaise
	Check for kernel stack overflows.

4k_stack (4K stacks pt 4)	Dave Hansen
	Config option to reduce kernel stacks to 4K

fix_kgdb			Dave Hansen
	Fix interaction between kgdb and 4K stacks

stacks_from_slab		William Lee Irwin
	Take kernel stacks from the slab cache, not page allocation.

thread_under_page		William Lee Irwin
	Fix THREAD_SIZE < PAGE_SIZE case

lkcd				LKCD team
	Linux kernel crash dump support

percpu_loadavg			Martin J. Bligh
	Provide per-cpu loadaverages, and real load averages

irq_affinity			Martin J. Bligh
	Workaround for irq_affinity on clustered apic mode systems (eg x440)

kirq_clustered_fix		Dave Hansen / Martin J. Bligh
	Fix kirq for clustered apic systems (eg x440)

fix_was_sched			Ingo / wli / Rick Lindsley
	Fix scheduler hangs from deadlocks

no_kirq				Martin J. Bligh
	Allow disabling of kirq to work properly

auto_disable_tsc		John Stultz
	Automatically disable the TSC for NUMA-Q

cleaner_inodes			Andrew Morton
	Make noatime filesystems more efficient

-mjb				Martin J. Bligh
	Add a tag to the makefile
|
From: Mark H. <ma...@os...> - 2003-02-26 15:36:24
|
On Mon, 2003-02-24 at 10:08, Martin J. Bligh wrote:
> The patchset contains mainly scalability and NUMA stuff, and anything
> else that stops things from irritating me. It's meant to be pretty stable,
> not so much a testing ground for new stuff.
>
> I'd be very interested in feedback from anyone willing to test on any
> platform, however large or small.
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/2.5.62/patch-2.5.62-mjb3.bz2
>

Martin,

I have been seeing system hangs on my 16 processor numaq while running
contest. The system will hang within a few seconds to half an hour.
Unfortunately there is no stack trace or any other indication on the
system console. I have been running your 2.5.62-mjb2 without problems
previously. Any ideas what I can do to narrow this down?

Mark.

--
Mark Haverkamp <ma...@os...>
|
From: Martin J. B. <mb...@ar...> - 2003-02-26 15:55:38
|
>> The patchset contains mainly scalability and NUMA stuff, and anything
>> else that stops things from irritating me. It's meant to be pretty
>> stable, not so much a testing ground for new stuff.
>>
>> I'd be very interested in feedback from anyone willing to test on any
>> platform, however large or small.
>>
>> ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/2.5.62/patch-2.5.62-mjb3.bz2
>>
>
> Martin,
>
> I have been seeing system hangs on my 16 processor numaq while running
> contest. The system will hang within a few seconds to half an hour.
> Unfortunately there is no stack trace or any other indication on the
> system console. I have been running your 2.5.62-mjb2 without problems
> previously. Any ideas what I can do to narrow this down?

Humpf. Can you try backing out this patch (it caused me similar problems on
59, but seemed fine in 62). I suspect it's just changing timing enough that
we hit some other bug ... if you could, would be nice to try the ALT+SYSRQ
stuff, or turn on NMI watchdogs and get a backtrace ... I've not been able
to reproduce this on recent kernels.

Thanks,

M.

diff -urpN -X /home/fletch/.diff.exclude 330-no_kirq/include/asm-i386/mach-numaq/mach_mpparse.h 340-auto_disable_tsc/include/asm-i386/mach-numaq/mach_mpparse.h
--- 330-no_kirq/include/asm-i386/mach-numaq/mach_mpparse.h	Fri Jan 17 09:18:31 2003
+++ 340-auto_disable_tsc/include/asm-i386/mach-numaq/mach_mpparse.h	Mon Feb 24 08:14:42 2003
@@ -32,6 +32,7 @@ static inline void mps_oem_check(struct
 	if (mpc->mpc_oemptr)
 		smp_read_mpc_oem((struct mp_config_oemtable *) mpc->mpc_oemptr,
 				mpc->mpc_oemsize);
+	tsc_disable=1;
 }
 
 /* Hook from generic ACPI tables.c */
|
From: Mark H. <ma...@os...> - 2003-02-26 16:03:43
|
On Wed, 2003-02-26 at 07:55, Martin J. Bligh wrote:
> >> The patchset contains mainly scalability and NUMA stuff, and anything
> >> else that stops things from irritating me. It's meant to be pretty
> >> stable, not so much a testing ground for new stuff.
> >>
> >> I'd be very interested in feedback from anyone willing to test on any
> >> platform, however large or small.
> >>
> >> ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/2.5.62/patch-2.5.62-mjb3.bz2
> >>
> >
> > Martin,
> >
> > I have been seeing system hangs on my 16 processor numaq while running
> > contest. The system will hang within a few seconds to half an hour.
> > Unfortunately there is no stack trace or any other indication on the
> > system console. I have been running your 2.5.62-mjb2 without problems
> > previously. Any ideas what I can do to narrow this down?
>
> Humpf. Can you try backing out this patch (it caused me similar problems on
> 59, but seemed fine in 62). I suspect it's just changing timing enough that
> we hit some other bug ...

OK, I'll try this.

> if you could, would be nice to try the ALT+SYSRQ
> stuff, or turn on NMI watchdogs and get a backtrace ... I've not been able
> to reproduce this on recent kernels.

I'll try these first and see what happens.

Mark.

--
Mark Haverkamp <ma...@os...>
|
From: Mark H. <ma...@os...> - 2003-02-26 22:49:03
|
On Wed, 2003-02-26 at 07:55, Martin J. Bligh wrote:
> >> The patchset contains mainly scalability and NUMA stuff, and anything
> >> else that stops things from irritating me. It's meant to be pretty
> >> stable, not so much a testing ground for new stuff.
> >>
> >> I'd be very interested in feedback from anyone willing to test on any
> >> platform, however large or small.
> >>
> >> ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/2.5.62/patch-2.5.62-mjb3.bz2
> >>
> >
> > Martin,
> >
> > I have been seeing system hangs on my 16 processor numaq while running
> > contest. The system will hang within a few seconds to half an hour.
> > Unfortunately there is no stack trace or any other indication on the
> > system console. I have been running your 2.5.62-mjb2 without problems
> > previously. Any ideas what I can do to narrow this down?
>
> Humpf. Can you try backing out this patch (it caused me similar problems on
> 59, but seemed fine in 62). I suspect it's just changing timing enough that
> we hit some other bug ... if you could, would be nice to try the ALT+SYSRQ
> stuff, or turn on NMI watchdogs and get a backtrace ... I've not been able
> to reproduce this on recent kernels.
>
> Thanks,
>
> M.
>
> diff -urpN -X /home/fletch/.diff.exclude 330-no_kirq/include/asm-i386/mach-numaq/mach_mpparse.h 340-auto_disable_tsc/include/asm-i386/mach-numaq/mach_mpparse.h
> --- 330-no_kirq/include/asm-i386/mach-numaq/mach_mpparse.h	Fri Jan 17 09:18:31 2003
> +++ 340-auto_disable_tsc/include/asm-i386/mach-numaq/mach_mpparse.h	Mon Feb 24 08:14:42 2003
> @@ -32,6 +32,7 @@ static inline void mps_oem_check(struct
>  	if (mpc->mpc_oemptr)
>  		smp_read_mpc_oem((struct mp_config_oemtable *) mpc->mpc_oemptr,
>  				mpc->mpc_oemsize);
> +	tsc_disable=1;
>  }
>
>  /* Hook from generic ACPI tables.c */
>

I turned on NMI watchdogs and when the system hung, I saw no output. My
serial console is through a terminal server that isn't set up to pass
along the sysrq, so I need to get this fixed. In any case I backed out
the patch that you suggested and I have had no system hangs since.

Mark.

--
Mark Haverkamp <ma...@os...>
|
From: Randy.Dunlap <rdd...@os...> - 2003-02-26 22:56:35
|
|
| I turned on NMI watchdogs and when the system hung, I saw no output. My
| serial console is through a terminal server that isn't set up to pass
| along the sysrq, so I need to get this fixed. In any case I backed out
| the patch that you suggested and I have had no system hangs since.
|
| Mark.
| --
| Mark Haverkamp <ma...@os...>

Mark,

You can also use my "echo key > sysrq" patch.
It was updated to 2.5.62 by Zwane M.

It's available at
www.osdl.org/archive/rddunlap/patches/magickey_2562.patch
(after a possible 15-minute rsync delay).

--
~Randy
|
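For anyone whose console cannot deliver a break or ALT+SYSRQ, the idea behind an "echo key > sysrq" interface can be sketched in a few lines of C: write one command character to a trigger file. This is only an illustration. The function name `send_sysrq_key` is made up here, and the thread does not say which file Randy's patch exposes; later mainline kernels provide `/proc/sysrq-trigger`, where writing `t` dumps task states, and that path is assumed in the usage note below.

```c
#include <stdio.h>

/*
 * Write a single SysRq command key to a trigger file.
 * trigger_path is an assumption supplied by the caller; the thread does
 * not name the file that the magickey patch creates.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
int send_sysrq_key(const char *trigger_path, char key)
{
    FILE *f = fopen(trigger_path, "w");
    int ok;

    if (f == NULL)
        return -1;

    ok = (fputc(key, f) != EOF);

    /* fclose flushes the buffer, so a failed write can surface here. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

On a kernel that has the trigger file, calling `send_sysrq_key("/proc/sysrq-trigger", 't')` as root would be the equivalent of pressing ALT+SYSRQ+T on a local keyboard.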
From: Martin J. B. <mb...@ar...> - 2003-02-26 22:53:50
|
> I turned on NMI watchdogs and when the system hung, I saw no output. My
> serial console is through a terminal server that isn't set up to pass
> along the sysrq, so I need to get this fixed. In any case I backed out
> the patch that you suggested and I have had no system hangs since.

OK, I'll back out that patch for now, but it seems to indicate underlying
crud. What parameter did you set for NMI watchdog?

M.
|
From: Mark H. <ma...@os...> - 2003-02-26 23:03:24
|
On Wed, 2003-02-26 at 14:53, Martin J. Bligh wrote:
> > I turned on NMI watchdogs and when the system hung, I saw no output. My
> > serial console is through a terminal server that isn't set up to pass
> > along the sysrq, so I need to get this fixed. In any case I backed out
> > the patch that you suggested and I have had no system hangs since.
>
> OK, I'll back out that patch for now, but it seems to indicate underlying
> crud. What parameter did you set for NMI watchdog?

I set it to 1. In Documentation/nmi_watchdog.txt this looked like the
only option. Now that I look at apic.h, I see that I could set it to 2
also. If you like I can try this also.

Mark.

--
Mark Haverkamp <ma...@os...>
|
From: Martin J. B. <mb...@ar...> - 2003-02-26 23:05:12
|
>> > I turned on NMI watchdogs and when the system hung, I saw no output.
>> > My serial console is through a terminal server that isn't set up to
>> > pass along the sysrq, so I need to get this fixed. In any case I
>> > backed out the patch that you suggested and I have had no system hangs
>> > since.
>>
>> OK, I'll back out that patch for now, but it seems to indicate underlying
>> crud. What parameter did you set for NMI watchdog?
>
> I set it to 1. In Documentation/nmi_watchdog.txt this looked like the
> only option. Now that I look at apic.h, I see that I could set it to 2
> also. If you like I can try this also.

2 is what we used successfully last time, but I can't remember the
difference off the top of my head ... if you could try that, would be most
useful.

M.
|
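On the difference between the two values: in the i386 apic.h of that era, 1 names the I/O APIC watchdog and 2 the local APIC watchdog. The sketch below shows how a boot command line could be mapped onto those modes. It is a user-space illustration, not the kernel's own parser; `parse_nmi_watchdog` and `nmi_mode_name` are hypothetical helper names, and only the enum values are taken from the header.

```c
#include <stdlib.h>
#include <string.h>

/* Mode numbers as defined in the i386 apic.h header. */
enum nmi_mode {
    NMI_NONE = 0,
    NMI_IO_APIC = 1,
    NMI_LOCAL_APIC = 2,
    NMI_INVALID = 3
};

/*
 * Hypothetical helper: look for "nmi_watchdog=<n>" in a command line.
 * Returns NMI_NONE when the option is absent and NMI_INVALID when the
 * value is missing or out of range.
 */
enum nmi_mode parse_nmi_watchdog(const char *cmdline)
{
    static const char key[] = "nmi_watchdog=";
    const char *p = strstr(cmdline, key);
    char *end;
    long value;

    if (p == NULL)
        return NMI_NONE;

    p += strlen(key);
    value = strtol(p, &end, 10);
    if (end == p || value < 0 || value >= NMI_INVALID)
        return NMI_INVALID;

    return (enum nmi_mode)value;
}

/* Hypothetical helper: human-readable name for a mode. */
const char *nmi_mode_name(enum nmi_mode mode)
{
    switch (mode) {
    case NMI_NONE:       return "disabled";
    case NMI_IO_APIC:    return "I/O APIC";
    case NMI_LOCAL_APIC: return "local APIC";
    default:             return "invalid";
    }
}
```

With this, `nmi_mode_name(parse_nmi_watchdog("ro root=/dev/sda1 nmi_watchdog=2"))` yields "local APIC", the mode Martin says worked last time.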
From: Mark H. <ma...@os...> - 2003-02-27 17:22:03
|
On Wed, 2003-02-26 at 15:05, Martin J. Bligh wrote:
> >> > I turned on NMI watchdogs and when the system hung, I saw no output.
> >> > My serial console is through a terminal server that isn't set up to
> >> > pass along the sysrq, so I need to get this fixed. In any case I
> >> > backed out the patch that you suggested and I have had no system hangs
> >> > since.
> >>
> >> OK, I'll back out that patch for now, but it seems to indicate underlying
> >> crud. What parameter did you set for NMI watchdog?
> >
> > I set it to 1. In Documentation/nmi_watchdog.txt this looked like the
> > only option. Now that I look at apic.h, I see that I could set it to 2
> > also. If you like I can try this also.
>
> 2 is what we used successfully last time, but I can't remember the
> difference off the top of my head ... if you could try that, would be most
> useful.

Still no luck getting a stack trace. With nmi_watchdog=2, I get these
kind of messages on occasion:

Uhhuh. NMI received for unknown reason 35 on CPU 11.
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?

But when the system finally froze, there was nothing.

Mark.

--
Mark Haverkamp <ma...@os...>
|
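A note on the "unknown reason 35" message, offered as background rather than a diagnosis: as far as I recall, the i386 NMI handler reads a status byte from port 0x61, prints it in hex, and treats bit 7 as a memory parity/SERR error and bit 6 as an I/O check error. When neither bit is set and the watchdog does not claim the NMI, it prints the message above. The sketch below encodes just that bit test; the names `classify_nmi_reason` and `enum nmi_cause` are invented for illustration, and the bit layout should be checked against the kernel source before relying on it.

```c
/* Possible readings of the NMI status byte (illustrative names). */
enum nmi_cause {
    NMI_CAUSE_UNKNOWN,     /* neither error bit set */
    NMI_CAUSE_MEM_PARITY,  /* bit 7 (0x80) set */
    NMI_CAUSE_IO_CHECK,    /* bit 6 (0x40) set */
    NMI_CAUSE_BOTH         /* both bits set */
};

/*
 * Classify an NMI status byte under the assumption that only bits 7
 * and 6 identify a known source; every other bit is ignored.
 */
enum nmi_cause classify_nmi_reason(unsigned char reason)
{
    int parity = (reason & 0x80) != 0;
    int iochk = (reason & 0x40) != 0;

    if (parity && iochk)
        return NMI_CAUSE_BOTH;
    if (parity)
        return NMI_CAUSE_MEM_PARITY;
    if (iochk)
        return NMI_CAUSE_IO_CHECK;
    return NMI_CAUSE_UNKNOWN;
}
```

Read as hex, 0x35 has both of those bits clear, which is consistent with the handler reporting it as unknown instead of as a parity or I/O check error.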