From: Vlado D. <vl...@tt...> - 2009-07-27 07:32:11
|
Hello,

I'm posting here in the hope that someone has experienced a similar problem.
I've tried the latest kernel, 2.6.30.3, on a ProLiant DL380 G5 with oprofile
0.9.4, but I can't get any profiling information out of it.
The NMI does not seem to be triggered for the default event. I've also tried
booting the kernel with idle=poll.
In dmesg I can see that oprofile has actually chosen to use NMI:
oprofile: using NMI interrupt.

With the previous kernel, 2.6.18, it seems to work on the 2nd attempt (after
boot I need to start and stop profiling, and only after the second start does
it begin generating NMIs), but on the new one it doesn't work at all.

Here is the output from oprofile --start -V:

SESSION_DIR /var/lib/oprofile
LOCK_FILE /var/lib/oprofile/lock
SAMPLES_DIR /var/lib/oprofile/samples
CURRENT_SAMPLES_DIR /var/lib/oprofile/samples/current
CPUTYPE i386/core_2
BUF_SIZE default value
BUF_WATERSHED default value
CPU_BUF_SIZE default value
SEPARATE_LIB 0
SEPARATE_KERNEL 0
SEPARATE_THREAD 0
SEPARATE_CPU 0
CALLGRAPH 0
VMLINUX /boot/vmlinux-2.6.30.3
KERNEL_RANGE ffffffff80200000,ffffffff804f5432
XENIMAGE none
XEN_RANGE
Using default event: CPU_CLK_UNHALTED:100000:0:1:1
executing oprofiled --session-dir=/var/lib/oprofile --separate-lib=0 --separate-kernel=0 --separate-thread=0 --separate-cpu=0 --events=CPU_CLK_UNHALTED:60:0:100000:0:1:1, --vmlinux=/boot/vmlinux-2.6.30.3 --kernel-range=ffffffff80200000,ffffffff804f5432 --verbose=all
Events: CPU_CLK_UNHALTED:60:0:100000:0:1:1,
Using 2.6+ OProfile kernel interface.
kernel_start = ffffffff80200000, kernel_end = ffffffff804f5432
Reading module info.

And /proc/cpuinfo for the first core:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU 5110 @ 1.60GHz
stepping : 6
cpu MHz : 1600.190
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx tm2 ssse3 cx16 xtpr pdcm dca lahf_lm tpr_shadow
bogomips : 3200.38
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

Does anyone have an idea what needs to be checked or what is wrong (interrupt
settings, or something else)?

Thanks,
Vlado
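For readers unfamiliar with the event string in the verbose output, here is a minimal Python sketch that splits the user-facing form `CPU_CLK_UNHALTED:100000:0:1:1` into its fields, assuming the documented opcontrol layout name:count:unitmask:kernel:user. The `EventSpec` type and `parse_event_spec` function are hypothetical names made up for this illustration; they are not part of oprofile. The longer seven-field string passed to oprofiled is a different internal form and is deliberately not handled here.

```python
from typing import NamedTuple


class EventSpec(NamedTuple):
    """Hypothetical container for one opcontrol-style event specification."""
    name: str        # event name, e.g. CPU_CLK_UNHALTED
    count: int       # sample every `count` occurrences of the event
    unit_mask: int   # unit mask selecting a sub-event
    kernel: bool     # profile kernel-space code
    user: bool       # profile user-space code


def parse_event_spec(spec: str) -> EventSpec:
    """Split 'name:count:unitmask:kernel:user' into typed fields."""
    parts = spec.split(":")
    if len(parts) != 5:
        raise ValueError(f"expected 5 colon-separated fields, got {len(parts)}: {spec!r}")
    name, count, unit_mask, kernel, user = parts
    # int(x, 0) accepts both decimal and 0x-prefixed unit masks.
    return EventSpec(name, int(count), int(unit_mask, 0), kernel == "1", user == "1")


if __name__ == "__main__":
    print(parse_event_spec("CPU_CLK_UNHALTED:100000:0:1:1"))
```

Read this way, the default event in the report samples CPU_CLK_UNHALTED every 100000 occurrences, with unit mask 0, in both kernel and user space.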
From: Andi K. <an...@fi...> - 2009-07-27 23:51:23
|
Vlado Držík <vl...@tt...> writes:

> Hello,
>
> I'm posting here in the hope that someone has experienced a similar problem.
> I've tried the latest kernel 2.6.30.3 on a ProLiant DL380 G5 and oprofile
> 0.9.4 but I have a problem getting profiling information out.
> The NMI seems not to be triggered for the default event.

How did you determine this? Did you check /proc/interrupts?

> I've also tried
> booting the kernel with idle=poll.
> In dmesg I can see that oprofile has actually chosen to use NMI:
> oprofile: using NMI interrupt.

That means it detected your CPU.

It should normally work.

The first thing I tend to do when oprofile behaves strangely is to
rm -rf /root/.oprofile /var/lib/oprofile/*
and restart/reconfigure everything.

-Andi

--
ak...@li... -- Speaking for myself only.
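The /proc/interrupts check suggested above can be sketched in Python as follows. This is only an illustration: `nmi_counts` is a hypothetical helper, and it assumes the usual layout in which the row labelled `NMI:` carries one counter per CPU, optionally followed by a text description. Taking one snapshot before starting the profiler and one a few seconds after shows whether the counters move at all.

```python
from typing import List


def nmi_counts(interrupts_text: str) -> List[int]:
    """Return the per-CPU NMI counters found in /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            # Keep the numeric columns; skip any trailing description words.
            return [int(f) for f in fields[1:] if f.isdigit()]
    return []


def nmis_advanced(before: str, after: str) -> bool:
    """True if any CPU received at least one NMI between the two snapshots."""
    return any(b < a for b, a in zip(nmi_counts(before), nmi_counts(after)))


if __name__ == "__main__":
    # On a live system, read the file once before and once after starting
    # the profiler and pass both snapshots to nmis_advanced().
    with open("/proc/interrupts") as f:
        print(nmi_counts(f.read()))
```

If the counters stay at zero on every CPU while the profiler is supposedly running, no counter overflow NMIs are being delivered.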
From: Vlado D. <vl...@tt...> - 2009-07-28 07:19:05
|
Andi Kleen wrote:
> Vlado Drzik <vl...@tt...> writes:
>
>> Hello,
>>
>> I'm posting here in the hope that someone has experienced a similar problem.
>> I've tried the latest kernel 2.6.30.3 on a ProLiant DL380 G5 and oprofile
>> 0.9.4 but I have a problem getting profiling information out.
>> The NMI seems not to be triggered for the default event.
>
> How did you determine this? Did you check /proc/interrupts?

Yes, I've used /proc/interrupts.

>> I've also tried
>> booting the kernel with idle=poll.
>> In dmesg I can see that oprofile has actually chosen to use NMI:
>> oprofile: using NMI interrupt.
>
> That means it detected your CPU.
>
> It should normally work.
>
> The first thing I tend to do when oprofile behaves strangely is to
> rm -rf /root/.oprofile /var/lib/oprofile/*
> and restart/reconfigure everything.
>
> -Andi

I've tried that, but sadly it didn't help in my case :(

The kernel is compiled with these options:
CONFIG_PROFILING=y
CONFIG_OPROFILE=m
CONFIG_OPROFILE_IBS=y
CONFIG_HAVE_OPROFILE=y

What could be the reason that the NMI is not started?
From: Andi K. <an...@fi...> - 2009-07-28 07:47:35
|
> The kernel is compiled with these options:
> CONFIG_PROFILING=y
> CONFIG_OPROFILE=m
> CONFIG_OPROFILE_IBS=y
> CONFIG_HAVE_OPROFILE=y
>
> What could be the reason that the NMI is not started?

If earlier kernels worked then it must be triggered by some kernel change.
I haven't tested that particular CPU, but it works on a lot of other
similar ones.

We've had one problem in the past that caused stuck NMIs on some systems,
but that should already be fixed in 2.6.30.

If you know it worked on 2.6.18 and not on 2.6.30 you could do a git bisect
(see Documentation/BUG-HUNTING) and track down which change broke it, but
that would require some time.

-Andi

--
ak...@li... -- Speaking for myself only.
From: Tony J. <to...@su...> - 2009-07-28 18:50:10
|
On Mon, Jul 27, 2009 at 09:12:34AM +0200, Vlado Držík wrote:
> Hello,
>
> I'm posting here in the hope that someone has experienced a similar problem.
> I've tried the latest kernel 2.6.30.3 on a ProLiant DL380 G5 and oprofile
> 0.9.4 but I have a problem getting profiling information out.
> The NMI seems not to be triggered for the default event. I've also tried
> booting the kernel with idle=poll.
> In dmesg I can see that oprofile has actually chosen to use NMI:
> oprofile: using NMI interrupt.
>
> In previous kernels 2.6.18 it seems to be working on 2nd attempt (after
> boot I need to start, stop profiling and then after second start it
> starts generating NMI) but on new one it's not working at all.
>
> Here is the output from oprofile --start -V:

On the DL785 there is a known issue where the HP firmware is itself using
counter 0. HP has acknowledged the issue but, last I heard, has no immediate
plan to fix it. Same symptoms you are seeing: NMI==0 in /proc/interrupts.

Counter 0 is running, just not under oprofile control.

Try editing the events file (for your CPU, in /usr/share/oprofile) to use
counter 1 for CPU_CLK_UNHALTED -- i.e. remove counter 0 from the set. I don't
believe there is a way to specify this on the command line or in the session
file.

Let me know.

Tony
From: Andi K. <an...@fi...> - 2009-07-28 20:00:57
|
Tony Jones <to...@su...> writes:

> On the DL785 there is a known issue where the HP firmware is itself using
> counter 0. HP has acknowledged the issue but, last I heard, has no immediate
> plan to fix it. Same symptoms you are seeing: NMI==0 in /proc/interrupts.
>
> Counter 0 is running, just not under oprofile control.
>
> Try editing the events file (for your CPU, in /usr/share/oprofile) to use
> counter 1 for CPU_CLK_UNHALTED -- i.e. remove counter 0 from the set. I don't
> believe there is a way to specify this on the command line or in the session
> file.

Thanks Tony. That makes sense.

There's actually a new protocol to coordinate the BIOS use of counters
with the OS, but that's not implemented yet for Linux (it's on
my todo list). Once we have that, it might solve the problem.

But not using counter 0 sounds like a reasonable workaround for now.

-Andi

--
ak...@li... -- Speaking for myself only.
From: Andi K. <an...@fi...> - 2009-07-28 20:02:22
|
Tony Jones <to...@su...> writes:

Ok, on rethinking, there's one problem with Tony's theory -- 2.6.18 should not
have worked either. So maybe it's something else after all.

-Andi

--
ak...@li... -- Speaking for myself only.
From: Tony J. <to...@su...> - 2009-07-28 21:12:07
|
On Tue, Jul 28, 2009 at 10:02:06PM +0200, Andi Kleen wrote:
> Tony Jones <to...@su...> writes:
>
> Ok, on rethinking, there's one problem with Tony's theory -- 2.6.18 should not
> have worked either. So maybe it's something else after all.

On the DL785, HP asserted it was working on RHEL5 (2.6.18). There were no
interesting patches. It's been a while but IIRC I could get vanilla 2.6.18 to
boot over SLES11 userspace into single user, and it was working in NMI mode.

I recall talking to Robert about 2.6.18 working but I can't recall what he
said. We spent some time debugging it and came to the conclusion that counter
0 was running outside of Linux control ... and subsequent to this HP
acknowledged the firmware issue, so I never bisected down to an actual
non-working version > .18. Again, IIRC .21 wasn't working but I'm not 100%
sure. I can go back and recheck.

Tony
From: Andi K. <an...@fi...> - 2009-07-28 21:17:51
|
> I recall talking to Robert about 2.6.18 working but I can't recall what he
> said. We spent some time debugging it and came to the conclusion that counter
> 0 was running outside of Linux control ... and subsequent to this HP
> acknowledged the firmware issue, so I never bisected down to an actual
> non-working version > .18. Again, IIRC .21 wasn't working but I'm not 100%
> sure. I can go back and recheck.

You're saying that 2.6.18 worked on the DL785 too?

Yes, I already heard from other people that some boxes do that. So I don't
doubt your theory for the DL785. There's also a new protocol to detect
such sharing.

What just sounds strange is that it worked with 2.6.18. When the firmware
really uses the counter it shouldn't work on 2.6.18 either. So
I'm not sure it's the same problem on the DL380.

Or maybe 2.6.18 was more forceful than 2.6.30 in claiming counters, but
I can't think of any such change.

-Andi

--
ak...@li... -- Speaking for myself only.
From: Tony J. <to...@su...> - 2009-07-28 21:41:54
|
On Tue, Jul 28, 2009 at 11:17:35PM +0200, Andi Kleen wrote:

> You're saying that 2.6.18 worked on the DL785 too?

My recollection is that 2.6.18 worked (for the DL785) once I applied two
patches that were in the RHEL src.rpm to enable NMI support for the specific
AMD processor (both are in mainline).

> What just sounds strange is that it worked with 2.6.18. When the firmware
> really uses the counter it shouldn't work on 2.6.18 either. So
> I'm not sure it's the same problem on the DL380.

I just rechecked and the SLES bug was for DL785 *and* DL380g5 (Woodcrest).
Sorry, I knew it was for several machine types but had forgotten one was a
DL380 also. Disabling counter 0 in the events file (obviously different
events files) fixed it for both machines.

For the DL380 the file would be /usr/share/oprofile/i386/core_2/events

> Or maybe 2.6.18 was more forceful than 2.6.30 in claiming counters, but
> I can't think of any such change.

I only tried 2.6.18 on the DL785. My gut is that 2.6.18 was somehow more
"forceful" in claiming the counter. IIRC Robert said that the implementation
wrt NMI was different for this 2.6.18 but I never dug into it.

I agree, it doesn't all add up, but all I can say is that I'll wager that
changing the events file fixes the OP's problem.

Tony
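The events-file workaround described above can be sketched in Python. This is an illustration under stated assumptions, not a tested tool: `drop_counter` is a hypothetical helper, and it assumes each event is described on one line containing a `counters:0,1` token and a `name:EVENT` token, which is how the events files shipped under /usr/share/oprofile are commonly laid out. The sample line in the demo is likewise an assumption, not a copy of the real core_2 file. Back up the original file before changing it.

```python
import re


def drop_counter(events_text: str, event_name: str, counter: int) -> str:
    """Remove `counter` from the counters: list of the named event only."""
    name_re = re.compile(rf"\bname:{re.escape(event_name)}\b")

    def rewrite(match: "re.Match[str]") -> str:
        kept = [c for c in match.group(1).split(",") if int(c) != counter]
        if not kept:
            raise ValueError(f"{event_name} would be left with no counters")
        return "counters:" + ",".join(kept)

    out = []
    for line in events_text.splitlines():
        # Leave comments and every other event untouched.
        if not line.lstrip().startswith("#") and name_re.search(line):
            line = re.sub(r"\bcounters:([0-9,]+)", rewrite, line)
        out.append(line)
    trailing = "\n" if events_text.endswith("\n") else ""
    return "\n".join(out) + trailing


if __name__ == "__main__":
    # Assumed sample content; the real file may differ in detail.
    sample = (
        "event:0x3c counters:0,1 um:zero minimum:6000 "
        "name:CPU_CLK_UNHALTED : Clock cycles when not halted\n"
    )
    print(drop_counter(sample, "CPU_CLK_UNHALTED", 0), end="")
```

With counter 0 removed from the set, the profiler can only place CPU_CLK_UNHALTED on counter 1, which sidesteps a counter 0 that the firmware is already driving.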
From: Vlado Držík <vl...@tt...> - 2009-07-28 23:03:17
|
Oh, great!
Many thanks for your great memory, it really helped!
I've disabled counter 0 and now the NMI starts without problems on the DL380
G5 with 2.6.30.3.
Sorry, I'd forgotten to mention that the 2.6.18 kernel I used before was the
one from RHEL. So most probably it works there because of additional Red Hat
patches (but I can also try vanilla).
What is interesting is that for me it works with the 2.6.18 RHEL kernel only
on the 2nd attempt. Starting oprofile for the first time doesn't start the
NMI, but after stopping and starting it a 2nd time under 2.6.18 the NMI is
started.

Tony Jones wrote:
> On Tue, Jul 28, 2009 at 11:17:35PM +0200, Andi Kleen wrote:
>
>> You're saying that 2.6.18 worked on the DL785 too?
>
> My recollection is that 2.6.18 worked (for the DL785) once I applied two
> patches that were in the RHEL src.rpm to enable NMI support for the specific
> AMD processor (both are in mainline).
>
>> What just sounds strange is that it worked with 2.6.18. When the firmware
>> really uses the counter it shouldn't work on 2.6.18 either. So
>> I'm not sure it's the same problem on the DL380.
>
> I just rechecked and the SLES bug was for DL785 *and* DL380g5 (Woodcrest).
> Sorry, I knew it was for several machine types but had forgotten one was a
> DL380 also. Disabling counter 0 in the events file (obviously different
> events files) fixed it for both machines.
>
> For the DL380 the file would be /usr/share/oprofile/i386/core_2/events
>
>> Or maybe 2.6.18 was more forceful than 2.6.30 in claiming counters, but
>> I can't think of any such change.
>
> I only tried 2.6.18 on the DL785. My gut is that 2.6.18 was somehow more
> "forceful" in claiming the counter. IIRC Robert said that the implementation
> wrt NMI was different for this 2.6.18 but I never dug into it.
>
> I agree, it doesn't all add up, but all I can say is that I'll wager that
> changing the events file fixes the OP's problem.
>
> Tony
From: Tony J. <to...@su...> - 2009-07-28 23:24:53
|
On Wed, Jul 29, 2009 at 01:02:42AM +0200, Vlado Držík wrote:
> Oh, great!
> Many thanks for your great memory, it really helped!
> I've disabled counter 0 and now the NMI starts without problems on the DL380
> G5 with 2.6.30.3.
> Sorry, I'd forgotten to mention that the 2.6.18 kernel I used before was the
> one from RHEL. So most probably it works there because of additional Red Hat
> patches (but I can also try vanilla).

I don't recall any RHEL patches specific to this (HP) issue. On the DL785 I
believe vanilla worked for me once I'd added patches to support NMI. I'm not
sure whether Woodcrest NMI support is in 2.6.18. Try it; if it uses timer
mode, add in the necessary patches for NMI.

Maybe I'll free up some time this week to look more into why .18 was working.

Tony
From: Robert R. <rob...@am...> - 2009-07-29 07:22:00
|
On 28.07.09 14:10:53, Tony Jones wrote:
> On Tue, Jul 28, 2009 at 10:02:06PM +0200, Andi Kleen wrote:
> > Tony Jones <to...@su...> writes:
> >
> > Ok, on rethinking, there's one problem with Tony's theory -- 2.6.18 should not
> > have worked either. So maybe it's something else after all.
>
> On the DL785, HP asserted it was working on RHEL5 (2.6.18). There were no
> interesting patches. It's been a while but IIRC I could get vanilla 2.6.18 to
> boot over SLES11 userspace into single user, and it was working in NMI mode.
>
> I recall talking to Robert about 2.6.18 working but I can't recall what he
> said. We spent some time debugging it and came to the conclusion that counter
> 0 was running outside of Linux control ... and subsequent to this HP
> acknowledged the firmware issue, so I never bisected down to an actual
> non-working version > .18. Again, IIRC .21 wasn't working but I'm not 100%
> sure. I can go back and recheck.

There are reports that this is caused by the hpwdt module; disabling
it solves the problem too. This would explain why it is working with
v2.6.18, where the module was not yet supported (?).

We have also had a lot of rework of the oprofile code since
v2.6.18. Introducing a bug with it could have been possible too. But
this didn't come true.

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
email: rob...@am...
From: Tony J. <to...@su...> - 2009-07-29 16:18:35
|
On Wed, Jul 29, 2009 at 09:21:31AM +0200, Robert Richter wrote:

> There are reports that this is caused by the hpwdt module; disabling
> it solves the problem too. This would explain why it is working with
> v2.6.18, where the module was not yet supported (?).

I think this stems from the confusion of the hpwdt author, which has spread ;-)

There was an unrelated bug with hpwdt. It registered itself as the highest
priority on the die_notify chain, believing that the only source of NMIs would
be either the hw watchdog (which it was originally documented not to work
with) or the iLO service processor. So when oprofile NMIs started arriving it
would output an "I don't recognise this NMI" printk per NMI, effectively
locking up the machine. We only saw this on a DL585 which, unlike the
DL785/DL380, was correctly generating NMIs for counter 0. Also, disabling
hpwdt on the DL785/380 didn't change the lack-of-NMIs issue.

I reported the lockup issue to the hpwdt author.

Commit 47bece87b14b8 added support to allow hpwdt to work with the NMI
watchdog. At this time the author added a comment to our Bugzilla for the HP
oprofile/NMI bug regarding this commit. I didn't understand why, as the
change didn't do anything (for oprofile). I pointed this out to him and he
then created commit 44df75353bc8, which adds a priority module parameter
controlling whether it is first or last on the die_notify chain. I need to
test this second change today (for SLES), but I'm not seeing how it is related
to the lack-of-NMIs issue. Maybe I'm wrong?

> We have also had a lot of rework of the oprofile code since
> v2.6.18. Introducing a bug with it could have been possible too. But
> this didn't come true.

I think this is the case. I'll try to look at it this week and get some
better info.

Tony
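The first-versus-last ordering problem on the notifier chain can be illustrated with a toy Python model. Everything here is a simplification invented for this sketch: the names (`run_chain`, `make_watchdog`, `profiler`, `simulate`), the string return values, and the priority numbers are not kernel code. The model only assumes that handlers run in descending priority order and that a handler which claims an event stops the walk.

```python
from typing import Callable, List, Tuple

NOTIFY_DONE = "done"  # handler did not claim the event; keep walking the chain
NOTIFY_STOP = "stop"  # handler claimed the event; stop walking the chain

Handler = Callable[[str], str]


def run_chain(chain: List[Tuple[int, Handler]], source: str) -> None:
    """Call handlers from highest to lowest priority until one claims the NMI."""
    for _priority, handler in sorted(chain, key=lambda e: e[0], reverse=True):
        if handler(source) == NOTIFY_STOP:
            break


def make_watchdog(log: List[str]) -> Handler:
    """Toy watchdog: complains about every NMI it sees that is not its own."""
    def watchdog(source: str) -> str:
        if source == "watchdog":
            return NOTIFY_STOP
        log.append(f"unrecognised NMI from {source}")
        return NOTIFY_DONE
    return watchdog


def profiler(source: str) -> str:
    """Toy profiler: claims the NMIs raised by its own counter overflows."""
    return NOTIFY_STOP if source == "profiler" else NOTIFY_DONE


def simulate(watchdog_priority: int, nmis: int) -> int:
    """Return how many complaints the watchdog logs for `nmis` profiler NMIs."""
    log: List[str] = []
    chain = [(watchdog_priority, make_watchdog(log)), (0, profiler)]
    for _ in range(nmis):
        run_chain(chain, "profiler")
    return len(log)


if __name__ == "__main__":
    print("watchdog first:", simulate(100, 1000))
    print("watchdog last: ", simulate(-100, 1000))
```

In this model a watchdog at the front of the chain logs once per profiling NMI, while one at the back never sees them because the profiler claims them first. Neither ordering changes whether NMIs are generated at all, which is consistent with the point above that this is a separate issue from the missing NMIs.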