Thread: [perfmon2] Regression in perfmon2 in 2.6.30 for Power (possibly others)
From: Corey A. <cja...@li...> - 2009-08-23 02:00:59
Hello,

On Friday, we decided to double-check that the recently checked-in changes for Power7 are working for perfmon2 (and perf_counters). So I set up a system with the latest 2.6.30 perfmon2 kernel, latest libpfm, and latest PAPI code. I ran the PAPI tests and immediately had a number of problems.

The first problem is that there are missing groups in the power7_events.h file due to an error in the source file that I use to generate the C code. This can be corrected easily, and I will submit a patch for it early next week.

The second problem is that the mapping from PAPI events to native events required too many native events for PAPI_FP_OPS. This works OK for perf_counters, since it doesn't use the group tables to determine the values for the mmcr registers. However, the dispatch implementation in libpfm does use the groups, and there didn't exist a group containing all of the needed events. I can work around this by just removing one of the events so that a matching group can be found. We can probably generate a new group containing the needed event, and I will work on creating such a patch. (Another solution would be to port the perf_counters code to libpfm, but that would be quite a lot of work.)

I left the worst for last: even running a simple test like libpfm/examples_v2.x/self, I get very inconsistent counts. Sometimes I get zeros! The numbers I get are almost always less than they should be, which leads me to suspect that the pmd registers are getting zeroed out somewhere where they shouldn't be.

I wasn't sure if this was due to some difference between Power7 and earlier chips, so I set up the same experiment on a Power5 machine, and the problem replicates there too. So this isn't a Power7-specific problem. Something changed between 2.6.29 and 2.6.30. When I run the same experiment on a perfmon2 2.6.29 kernel, the results are very consistent and look correct.

I diffed the Power-specific source files that might have had some changes:
arch/powerpc/perfmon/perfmon.c and
arch/powerpc/perfmon/perfmon_power5.c
but there are no differences between 2.6.29 and 2.6.30.

Stephane, can you think of anything that you had to do in porting perfmon2 to 2.6.30 that might be causing this?

Can you run examples_v2.x/self on x86 to see if you are seeing the same issue?

Thanks for your consideration,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
cja...@us...
From: stephane e. <er...@go...> - 2009-08-23 13:55:32
Corey,

On Sun, Aug 23, 2009 at 4:00 AM, Corey Ashford<cja...@li...> wrote:
> On Friday, we decided to double check that the recently checked-in changes
> for Power7 are working for perfmon2 (and perf_counters).

good idea.

> [snip]
> I diffed the Power-specific source files that might have had some
> changes:
> arch/powerpc/perfmon/perfmon.c and
> arch/powerpc/perfmon/perfmon_power5.c
> but there are no differences between 2.6.29 and 2.6.30.

I don't see any major changes.

Here are a couple of tests you could try and run to narrow it down:
- taskset -c 0 self
- syst

Do those produce more stable and valid results?

> Stephane, can you think of anything that you had to do in porting
> perfmon2 to 2.6.30 that might be causing this?

Not enough data to figure it out.

> Can you run examples_v2.x/self on x86 to see if you are seeing
> the same issue?

I did and it is stable and correct. So must be something in the Power code or something in the generic code which impacts only Power.
From: Corey A. <cja...@li...> - 2009-08-23 23:04:24
On 08/23/2009 06:48 AM, stephane eranian wrote:
> [snip]
> I did and it is stable and correct. So must be something in the
> Power code or something in the generic code which impacts only
> Power.

Thanks, Stephane. I'll look deeper into this issue tomorrow (work day).

- Corey
From: Corey A. <cja...@li...> - 2009-08-24 18:48:15
stephane eranian wrote:
> Corey,
>
[snip]
> Here are a couple of tests you could try and run to narrow it down:
> - taskset -c 0 self
> - syst

"taskset -c 0 self" doesn't improve the behavior. The results are still all over the place.

"syst" is giving me an error, which may be something completely unrelated:

[root@elm3c4 examples_v2.x]# ./syst
cannot set affinity to CPU0: Invalid argument

I'll look into the syst problem first, since it should be easy to debug.

- Corey
From: stephane e. <er...@go...> - 2009-08-24 18:58:51
On Mon, Aug 24, 2009 at 8:48 PM, Corey Ashford<cja...@li...> wrote:
> "taskset -c 0 self" doesn't improve the behavior. The results are still all
> over the place.

That's strange, must be something really central. You need to enable debugging. Careful as this has changed again in 2.6.30 because of the dynamic_printk stuff. The good thing is that now you can turn on/off individual printk.

> "syst" is giving me an error, which may be something completely unrelated:
>
> [root@elm3c4 examples_v2.x]# ./syst
> cannot set affinity to CPU0: Invalid argument

Weird. You have a CPU0, don't you?

> I'll look into the syst problem first, since it should be easy to debug.
From: Corey A. <cja...@li...> - 2009-08-25 00:45:40
stephane eranian wrote:
> That's strange, must be something really central.
> You need to enable debugging. Careful as this has changed again in 2.6.30
> because of the dynamic_printk stuff. The good thing is that now you can
> turn on/off individual printk.

I'm not familiar with dynamic_printk, so that will take some research.

> Weird. You have a CPU0, don't you?

Yes :) I'm still debugging this to figure out what's going on. No results yet (took me a while to get systemtap running due to many pilot errors).

Regards,

- Corey
From: Corey A. <cja...@li...> - 2009-08-25 23:55:33
Corey Ashford wrote:
> Yes :) I'm still debugging this to figure out what's going on. No results yet
> (took me awhile to get systemtap running due to many pilot errors)

OK, I tracked the syst problem down. There is an error in syst.c which manifests itself on big-endian machines when syst.c is compiled in 32-bit mode.

The bit vector which is used to describe the CPUs that you want to set the affinity for is an array of 32-bit words (when using the compat_sys_sched_setaffinity system call in 32-bit mode). syst programs a vector of 64-bit words. On a little-endian machine, this wouldn't matter, because the least significant byte of the 32-bit or 64-bit word is always at offset 0. But on a big-endian machine, the least significant byte is at offset 0x3 or 0x7 depending on the word size. So the result is that the bit vector is interpreted as setting the affinity for a CPU which does not exist.

There are a couple of ways to fix this, and I will post a patch which contains both versions.

So, after fixing this problem, syst does produce reliable results on 2.6.30. So I am assuming now that the problem with the self test (and others) is that something is messed up with the per-thread context code. I will start working on this.

- Corey
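The word-size mismatch described in this message can be modeled in portable C. The sketch below is an illustration only, not code from syst.c: the helper names (store_be64, load_be32, cpu_seen_by_32bit_reader) are hypothetical, and big-endian byte order is simulated with explicit shifts so the result is the same on any host. It assumes the reader treats bit b of 32-bit word w as CPU 32*w + b.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 64-bit word into memory the way a big-endian CPU would. */
static void store_be64(uint64_t v, unsigned char out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char)(v >> (56 - 8 * i));
}

/* Load the 32-bit word at index idx the way a big-endian CPU would. */
static uint32_t load_be32(const unsigned char *buf, int idx)
{
    const unsigned char *p = buf + 4 * idx;

    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical model of the mismatch: the writer sets bit `cpu` in one
 * 64-bit mask word, the reader interprets the same 8 bytes as two 32-bit
 * words, where bit b of word w stands for CPU 32*w + b.
 * Returns the CPU number the reader sees, or -1 if no bit is set.
 */
static int cpu_seen_by_32bit_reader(unsigned cpu)
{
    unsigned char buf[8];

    assert(cpu < 64);
    store_be64((uint64_t)1 << cpu, buf);
    for (int w = 0; w < 2; w++) {
        uint32_t word = load_be32(buf, w);

        for (int b = 0; b < 32; b++)
            if (word & ((uint32_t)1 << b))
                return 32 * w + b;
    }
    return -1;
}
```

Under these assumptions, a request for CPU 0 is seen by the reader as CPU 32, which would be rejected on a machine with fewer than 33 CPUs. That is consistent with the "Invalid argument" error reported for syst.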
From: stephane e. <er...@go...> - 2009-08-26 00:12:20
Corey,

On Wed, Aug 26, 2009 at 1:55 AM, Corey Ashford<cja...@li...> wrote:
> Ok, I tracked the syst problem down. There is an error in syst.c which
> manifests itself on big-endian machines when syst.c is compiled in 32-bit
> mode.
> [snip]

I think nowadays we should simply use the libc cpu_set and call the regular sched_setaffinity() instead of having a custom version. That was from a long time ago. Hopefully, the official API will work on 32-bit big-endian systems.

> There are a couple of ways to fix this, and I will post a patch which
> contains both versions.
>
> So, after fixing this problem, syst does produce reliable results on 2.6.30.
> [snip]

Yes, most likely. That is why I asked you to try taskset -c 0 self to avoid switching from one CPU to another. But obviously you can be switched in and out.
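A minimal sketch of the libc approach suggested above, assuming Linux with glibc (the CPU_* macros need _GNU_SOURCE). The function names pin_to_cpu and first_allowed_cpu are hypothetical and do not come from syst.c. The macros operate on the library's own cpu_set_t type, so the caller never picks a word size for the mask. Whether this resolves the syst failure on 32-bit big-endian Power is not confirmed in this thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Pin the calling thread to one CPU using the libc cpu_set_t API. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    assert(cpu >= 0);              /* a negative CPU number is a caller bug */
    if (cpu >= CPU_SETSIZE) {
        errno = EINVAL;
        return -1;
    }
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Return the lowest CPU the calling thread may run on, or -1 on error. */
static int first_allowed_cpu(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

A caller would typically check the return value of pin_to_cpu() and report errno on failure, much as syst does when it prints "cannot set affinity".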
From: Corey A. <cja...@li...> - 2009-08-28 01:05:31
Hi Stephane,

I have made some progress in tracking this problem down. The big picture is that pfm_arch_ctxswin_thread is never getting called, so when the thread is switched out, and then back in again at some point, the PMU context is not getting restored onto the PMU registers, causing the counters to stop till the end of the run.

pfm_arch_ctxswin_thread is not getting called because of the following code in perfmon_ctxsw.c:

	/*
	 * TIF flag was removed since switch_to
	 * context is detaching, skip everything,
	 * keep oncpu=-1
	 */
	if (!test_thread_flag(TIF_PERFMON_CTXSW))
		goto skip_all;

Apparently the TIF_PERFMON_CTXSW flag is always cleared. I haven't tracked any farther back than this yet, but was hoping this might trigger a thought or two in your mind as to what might be going on.

I also noticed that this code appears to have changed from 2.6.29 to 2.6.30.

Anyway, I'd appreciate any thoughts you might have on this. I may not get back to looking at this till Monday afternoon, so no huge rush.

Thanks for your consideration,

- Corey
From: stephane e. <er...@go...> - 2009-08-28 14:01:42
Corey,

On Fri, Aug 28, 2009 at 3:05 AM, Corey Ashford<cja...@li...> wrote:
> [snip]
> Apparently the TIF_PERFMON_CTXSW flag is always cleared. I haven't tracked
> any farther back than this yet, but was hoping this might trigger a thought
> or two in your mind as to what might be going on.

TIF_PERFMON_CTXSW is only set in pfm_preload_context(). If you are testing with self.c I don't see how this can be happening at this point. I think you have to instrument the places where the flag gets cleared.
From: stephane e. <er...@go...> - 2009-08-29 17:00:23
Corey,

I think I have found and fixed the problem. As I was debugging 2.6.30 on Itanium I found a couple of issues. One of them is identical to the one you reported.

It turns out that:

	if (!test_thread_flag(TIF_PERFMON_CTXSW))
		goto skip_all;

is bogus because current is not the task we want to test on. It needs to be as follows instead:

	if (!test_tsk_thread_flag(task, TIF_PERFMON_CTXSW))
		goto skip_all;

With that, I think Power should work again.

I have also fixed a bogus initialization on set0. But that was introduced just last week by my vmalloc() changes.
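The difference between the two checks can be illustrated with a small user-space model. This is a sketch under stated assumptions, not kernel code: struct task, current_task, the flag bit number, and all function names are hypothetical stand-ins, and the model assumes that at the time of the check `current` refers to a task other than the one being switched in, as the message above describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TIF_PERFMON_CTXSW 0   /* bit number chosen for this model only */

struct task {
    unsigned long flags;      /* stand-in for the thread_info flags word */
};

static struct task *current_task;   /* stand-in for the kernel's `current` */

/* Models test_tsk_thread_flag(): test a flag on an explicit task. */
static bool test_tsk_flag(const struct task *t, int bit)
{
    assert(t != NULL && bit >= 0 && bit < 32);
    return (t->flags >> bit) & 1UL;
}

/* Models test_thread_flag(): always tests the flag on `current`. */
static bool test_current_flag(int bit)
{
    return test_tsk_flag(current_task, bit);
}

/* Old check: ignores `task`, so the answer depends on who `current` is. */
static bool ctxswin_buggy(struct task *task)
{
    (void)task;
    return test_current_flag(TIF_PERFMON_CTXSW);
}

/* Corrected check: tests the task whose PMU state is to be restored. */
static bool ctxswin_fixed(struct task *task)
{
    return test_tsk_flag(task, TIF_PERFMON_CTXSW);
}

/* Scenario: a monitored task is switched in while `current` is another task. */
static bool restores_monitored_task(bool (*ctxswin)(struct task *))
{
    struct task other = { 0 };
    struct task monitored = { 1UL << TIF_PERFMON_CTXSW };
    bool restored;

    current_task = &other;
    restored = ctxswin(&monitored);
    current_task = NULL;
    return restored;
}
```

In this model, restores_monitored_task(ctxswin_buggy) reports that the restore is skipped and restores_monitored_task(ctxswin_fixed) reports that it happens, which matches the symptom of counters that stop after the first context switch.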
From: Corey A. <cja...@li...> - 2009-08-30 08:17:29
That's great news, Stephane, thanks!  I'll give this a whirl on Monday.

- Corey

On 08/29/2009 10:00 AM, stephane eranian wrote:
> Corey,
>
> I think I have found and fixed the problem. As I was debugging
> 2.6.30 on Itanium I found a couple of issues. One of them is
> identical to the one you reported.
>
>
> It turns out that:
>>>        if (!test_thread_flag(TIF_PERFMON_CTXSW))
>>>                goto skip_all;
> Is bogus because current is not the task we want to test on.
> It needs to be as follows instead:
>
>>>        if (!test_tsk_thread_flag(task, TIF_PERFMON_CTXSW))
>>>                goto skip_all;
>
> With that, I think Power should work again.
>
> I have also fixed a bogus initialization on set0. But that was
> introduced just last week by my vmalloc() changes.
>
> On Fri, Aug 28, 2009 at 3:54 PM, stephane eranian<er...@go...> wrote:
>> Corey,
>>
>> On Fri, Aug 28, 2009 at 3:05 AM, Corey
>> Ashford<cja...@li...> wrote:
>>> Hi Stephane,
>>>
>>> I have made some progress in tracking this problem down.  The big picture is
>>> that pfm_arch_ctxswin_thread is never getting called, so when the thread is
>>> switched out, and then back in again at some point, the PMU context is not
>>> getting restored onto the PMU registers, causing the counters to stop till
>>> the end of the run.
>>>
>>> pfm_arch_ctxswin_thread is not getting called because of the following code
>>> in perfmon_ctxsw.c:
>>>        /*
>>>         * TIF flag was removed since switch_to
>>>         * context is detaching, skip everything,
>>>         * keep oncpu=-1
>>>         */
>>>        if (!test_thread_flag(TIF_PERFMON_CTXSW))
>>>                goto skip_all;
>>>
>>> Apparently the TIF_PERFMON_CTXSW flag is always cleared.  I haven't tracked
>>> any farther back than this yet, but was hoping this might trigger a thought
>>> or two in your mind as to what might be going on.
>>>
>>
>> TIF_PERFMON_CTXSW is only set in pfm_preload_context(). If you are testing
>> with self.c I don't see how this can be happening at this point. I think you
>> have to instrument the places where the flag gets cleared.
>>
>>
>>
>>> I also noticed that this code appears to have changed from 2.6.29 to 2.6.30.
>>>
>>> Anyway, I'd appreciate any thoughts you might have on this.  I may not get
>>> back to looking at this till Monday afternoon, so no huge rush.
>>>
>>> Thanks for your consideration,
>>>
>>> - Corey
>>>
>>> stephane eranian wrote:
>>>>
>>>> Corey,
>>>>
>>>> On Wed, Aug 26, 2009 at 1:55 AM, Corey
>>>> Ashford<cja...@li...> wrote:
>>>>>
>>>>> Corey Ashford wrote:
>>>>>>
>>>>>> stephane eranian wrote:
>>>>>>>
>>>>>>> On Mon, Aug 24, 2009 at 8:48 PM, Corey
>>>>>>> Ashford<cja...@li...> wrote:
>>>>>>>>
>>>>>>>> stephane eranian wrote:
>>>>>>>>>
>>>>>>>>> Corey,
>>>>>>>>>
>>>>>>>> [snip]
>>>>>>>>>
>>>>>>>>> Here are a couple of tests you could try and run to narrow it down:
>>>>>>>>>    - taskset -c 0 self
>>>>>>>>>    - syst
>>>>>>>>>
>>>>>>>> "taskset -c 0 self" doesn't improve the behavior.  The results are still
>>>>>>>> all over the place.
>>>>>>>>
>>>>>>> That's strange, must be something really central.
>>>>>>> You need to enable debugging. Careful as this has changed again in 2.6.30
>>>>>>> because of the dynamic_printk stuff. The good thing is that now you can
>>>>>>> turn on/off individual printk.
>>>>>>
>>>>>> I'm not familiar with dynamic_printk, so that will take some research.
>>>>>>
>>>>>>>> "syst" is giving me an error, which may be something completely
>>>>>>>> unrelated:
>>>>>>>>
>>>>>>>> [root@elm3c4 examples_v2.x]# ./syst
>>>>>>>> cannot set affinity to CPU0: Invalid argument
>>>>>>>>
>>>>>>> Weird. You have a CPU0, don't you?
>>>>>>
>>>>>> Yes :)  I'm still debugging this to figure out what's going on.  No
>>>>>> results yet
>>>>>> (took me a while to get systemtap running due to many pilot errors)
>>>>>
>>>>> Ok, I tracked the syst problem down.  There is an error in syst.c which
>>>>> manifests itself on big-endian machines when syst.c is compiled in 32-bit
>>>>> mode.
>>>>>
>>>>> The bit vector which is used to describe the cpus that you want to set the
>>>>> affinity for is an array of 32-bit words (when using the
>>>>> compat_sys_sched_setaffinity system call in 32-bit mode).  syst programs a
>>>>> vector of 64-bit words.  On a little-endian machine, this wouldn't matter,
>>>>> because the least significant byte of the 32-bit or 64-bit word is always at
>>>>> offset 0.  But on a big-endian machine, the least significant byte is at
>>>>> offset 0x3 or 0x7 depending on the word size.  So the result is that the bit
>>>>> vector is interpreted as setting the affinity for a cpu which does not
>>>>> exist.
>>>>>
>>>> I think nowadays, we should simply use the libc cpu_set and call the
>>>> regular sched_setaffinity() instead of having a custom version. That
>>>> was from a long time ago. Hopefully, the official API will work on 32-bit
>>>> big-endian systems.
>>>>
>>>>> There are a couple of ways to fix this, and I will post a patch which
>>>>> contains both versions.
>>>>>
>>>>> So, after fixing this problem, syst does produce reliable results on 2.6.30.
>>>>> So I am assuming now that the problem with the self test (and others)
>>>>> is that something is messed up with the per-thread context code.
>>>>>
>>>> Yes, most likely. That is why I asked you to try taskset -c 0 self to avoid
>>>> switching from one CPU to another. But obviously you can be switched in
>>>> and out.
>>>>
>>>>
>>>>> I will start working on this.
>>>>>
>>>>> - Corey
>>>>>
>>>>>
>>>
>>> --
>>> Regards,
>>>
>>> - Corey
>>>
>>> Corey Ashford
>>> Software Engineer
>>> IBM Linux Technology Center, Linux Toolchain
>>> Beaverton, OR
>>> 503-578-3507
>>> cja...@us...
>>>
>>
|
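The difference between the two flag tests quoted at the top of this message can be modeled in user space. This is an illustrative sketch only, not kernel code: `struct task`, `current_task`, `flag_set_on_current`, `flag_set_on_task`, `ctxswin_buggy`, `ctxswin_fixed` and the bit value chosen for `TIF_PERFMON_CTXSW_MODEL` are all hypothetical stand-ins for the kernel's `current`, `test_thread_flag()` and `test_tsk_thread_flag()`. It assumes, following Stephane's explanation, that the switch-in path runs while `current` refers to a different task than the one passed in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit value; the real flag number is architecture-specific. */
#define TIF_PERFMON_CTXSW_MODEL (1u << 0)

/* Minimal stand-in for a task: its thread flags and a restore counter. */
struct task {
    unsigned int flags;
    int pmu_restores;   /* how many times the PMU state was restored */
};

/* Stand-in for the kernel's `current` pointer. */
struct task *current_task;

/* Models test_thread_flag(): always inspects the current task. */
bool flag_set_on_current(unsigned int flag)
{
    assert(current_task != NULL);
    return (current_task->flags & flag) != 0;
}

/* Models test_tsk_thread_flag(): inspects the task it is given. */
bool flag_set_on_task(const struct task *t, unsigned int flag)
{
    return (t->flags & flag) != 0;
}

/* Pre-fix logic: tests the wrong task, so the restore can be skipped. */
void ctxswin_buggy(struct task *task)
{
    if (!flag_set_on_current(TIF_PERFMON_CTXSW_MODEL))
        return;             /* corresponds to "goto skip_all" */
    task->pmu_restores++;
}

/* Post-fix logic: tests the task that is actually being switched in. */
void ctxswin_fixed(struct task *task)
{
    if (!flag_set_on_task(task, TIF_PERFMON_CTXSW_MODEL))
        return;             /* corresponds to "goto skip_all" */
    task->pmu_restores++;
}
```

If `current_task` points at an unmonitored task while a monitored task is passed in, `ctxswin_buggy()` returns early and the counters are never restored, which matches the symptom Corey describes, whereas `ctxswin_fixed()` performs the restore.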
From: Corey A. <cja...@li...> - 2009-09-02 22:21:09
|
Hi Stephane,

Your change fixed the problem I was seeing.  Thanks!

- Corey

--
Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
cja...@us...
|
From: stephane e. <er...@go...> - 2009-09-12 09:54:08
|
Corey,

Glad to hear that. In the process I also found a couple of issues elsewhere
in x86 code.

On Sep 1, 2009 2:18 AM, "Corey Ashford" <cja...@li...> wrote:

Hi Stephane,

Your change fixed the problem I was seeing.  Thanks!

- Corey

stephane eranian wrote:
>
> Corey,
>
> I think I have found and fixed the problem.  As I w...

--
Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beav...
|