From: Don Z. <dz...@gm...> - 2006-04-03 21:32:53
Attachments:
oprofile_reserve_nmi.patch
|
Hello, I was trying to fix some functionality in the nmi watchdog but unfortunately it broke how oprofile works. As a result I was trying to come up with a patch that would fix the oprofile issue. The problem revolves around the fact that oprofile uses the same performance counter register as the nmi watchdog. Thus if one were to start oprofile, they would accidentally disable the watchdog, which in turn would cause other nmi watchdog issues. I have attached a patch which creates a reservation function so that only one subsystem (oprofile or nmi watchdog) could use a particular performance counter. The patch only contains changes for the athlon model (I'll get to the Pentium models in just a sec). It has minimal testing. More importantly, I was hoping someone could look at the patch and say it is an acceptable solution for oprofile to have the restricted functionality. If this is acceptable, then I was hoping to solict some feedback on a good way of implementing this on the pentium models. It seems in the AMD systems, only a few registers are ever touched, thus making it easy to hack in some code. On the Intel systems, the code seems to touch over 100 registers for no other reason than just to clear them.=20 This makes hacking more complicated. My question is are all those registers being touched on the Pentuim models really necessary? I mean over a 100 hundred registers for a handful of performance counters? Am I missing something? Cheers, Don |
From: Andi K. <ak...@su...> - 2006-04-03 23:26:37
|
On Monday 03 April 2006 23:32, Don Zickus wrote: > Hello, > > I was trying to fix some functionality in the nmi watchdog but > unfortunately it broke how oprofile works. As a result I was trying > to come up with a patch that would fix the oprofile issue. The > problem revolves around the fact that oprofile uses the same > performance counter register as the nmi watchdog. Thus if one were to > start oprofile, they would accidentally disable the watchdog, which in > turn would cause other nmi watchdog issues. > > I have attached a patch which creates a reservation function so that > only one subsystem (oprofile or nmi watchdog) could use a particular > performance counter. The patch only contains changes for the athlon > model (I'll get to the Pentium models in just a sec). It has minimal > testing. More importantly, I was hoping someone could look at the > patch and say it is an acceptable solution for oprofile to have the > restricted functionality. Looks reasonable to me. Two comments: - If you use the linux atomic bit operations (set_bit, clear_bit, test_and_set_bit etc.) you don't actually need the spinlock - I'm not sure how oprofile userland reacts to a reserved counter. Perhaps adding a printk for this case would be useful and fixing the userland to give a sensible error message (since it's effectively a change to the UI) > > If this is acceptable, then I was hoping to solict some feedback on a > good way of implementing this on the pentium models. It seems in the > AMD systems, only a few registers are ever touched, thus making it > easy to hack in some code. On the Intel systems, the code seems to > touch over 100 registers for no other reason than just to clear them. > This makes hacking more complicated. > > My question is are all those registers being touched on the Pentuim > models really necessary? I mean over a 100 hundred registers for a > handful of performance counters? Am I missing something? Intel decided to do it the complicated way for P4 (P3 and P-M are nice and simple to). Essentially there are multiple groups of performance counters belonging to different parts of the CPU and each group has a couple of counters that can be assigned to an arbitary event in the group (I probably left out some details) HyperThreading makes it more complicated again. The easy way to handle this mess I guess would be to just extend the bitmap to all possible counters. Bits are cheap. -Andi |
From: Stephane E. <er...@hp...> - 2006-04-04 07:22:41
|
Don, On Mon, Apr 03, 2006 at 04:32:46PM -0500, Don Zickus wrote: > > I was trying to fix some functionality in the nmi watchdog but > unfortunately it broke how oprofile works. As a result I was trying > to come up with a patch that would fix the oprofile issue. The > problem revolves around the fact that oprofile uses the same > performance counter register as the nmi watchdog. Thus if one were to > start oprofile, they would accidentally disable the watchdog, which in > turn would cause other nmi watchdog issues. > I welcome your initiative. I have had problems dealing with this NMI code with the perfmon2 subsystem. I looked at the patch and I am not quite sure what the goal is with the reserve_perftctr_nmi()/release_perfctr_nmi(). The logic seems to be that Oprofile or perfmon2 would share the counters with the NMI watchdog. If NMI watchdog is active, then they have one less counter (a counter group for P4). I am sure this could break some tools because most of the time they are not prepared to deal with unavailable counter. You need to convery that the counter is unavailable as opposed to implemented so that they can try requesting the same set of events BUT with another event -> counter assignement. The reservation works if you assume that NMI watchdog and Oprofile NMI-based interruption can work together. I do not see where you can chain the interrupt handler on NMI interrupt. The set_nmi_callback() replaces the existing handler, it does not add a new one. Thus, you break one or the other. If you cannot chain the NMI interrupt handler then what is the point of the reservation? Did I miss anything? > If this is acceptable, then I was hoping to solict some feedback on a > good way of implementing this on the pentium models. It seems in the > AMD systems, only a few registers are ever touched, thus making it > easy to hack in some code. On the Intel systems, the code seems to > touch over 100 registers for no other reason than just to clear them. > This makes hacking more complicated. You have about 18 counters and they are implemented by about 80 registers. Yes, they can all be used it depends on what you want to measure. > > My question is are all those registers being touched on the Pentuim > models really necessary? I mean over a 100 hundred registers for a > handful of performance counters? Am I missing something? > -- -Stephane |
From: Don Z. <dz...@gm...> - 2006-04-04 16:14:14
|
On 4/4/06, Stephane Eranian <er...@hp...> wrote: > Don, > > On Mon, Apr 03, 2006 at 04:32:46PM -0500, Don Zickus wrote: > > > > I was trying to fix some functionality in the nmi watchdog but > > unfortunately it broke how oprofile works. As a result I was trying > > to come up with a patch that would fix the oprofile issue. The > > problem revolves around the fact that oprofile uses the same > > performance counter register as the nmi watchdog. Thus if one were to > > start oprofile, they would accidentally disable the watchdog, which in > > turn would cause other nmi watchdog issues. > > > I welcome your initiative. I have had problems dealing with this NMI > code with the perfmon2 subsystem. > > I looked at the patch and I am not quite sure what the goal is > with the reserve_perftctr_nmi()/release_perfctr_nmi(). The logic > seems to be that Oprofile or perfmon2 would share the counters > with the NMI watchdog. If NMI watchdog is active, then they > have one less counter (a counter group for P4). I am sure this > could break some tools because most of the time they are not prepared > to deal with unavailable counter. You need to convery that the counter > is unavailable as opposed to implemented so that they can try requesting > the same set of events BUT with another event -> counter assignement. Thanks for the feedback (both Stephane and Andi). I didn't realize my patch would break the user space API. :( Initially I was trying to figure out a way to make that counter unavailable but couldn't find the right piece of code, so I went the other way and denying a request. If you could point me to the code which shows user space which counters are available, I would happily re-work the patch. > > The reservation works if you assume that NMI watchdog and Oprofile NMI-ba= sed > interruption can work together. I do not see where you can chain the inte= rrupt > handler on NMI interrupt. The set_nmi_callback() replaces the existing ha= ndler, > it does not add a new one. Thus, you break one or the other. If you canno= t > chain the NMI interrupt handler then what is the point of the reservation= ? > > Did I miss anything? > Yes, my original patch to Andi reworked the NMI watchdog such that set_nmi_callback() would _not_ disable it. Instead set_nmi_callback() would be used to register an NMI handler to capture interrupts the NMI subsystem didn't know what to do with (it knows NMI watchdog, memory parity errors, and i/o check errors). However, the problem arises over the new concept of the NMI watchdog never being allowed to be disabled. Since the NMI watchdog uses one of the performance counters, this conflicts with how oprofile implements its functionality. Therefore, I was trying to find a way to prevent oprofile from using whatever counter register NMI watchdog is using (as long as NMI watchdog is enabled at boot time), hence my reservation logic. I am open to suggestions of alternate ideas but this is the crux of the problem I am trying to solve. > > If this is acceptable, then I was hoping to solict some feedback on a > > good way of implementing this on the pentium models. It seems in the > > AMD systems, only a few registers are ever touched, thus making it > > easy to hack in some code. On the Intel systems, the code seems to > > touch over 100 registers for no other reason than just to clear them. > > This makes hacking more complicated. > > You have about 18 counters and they are implemented by about 80 registers= . > Yes, they can all be used it depends on what you want to measure. *blinks* I was hoping I just misread the code. This will cause my approach to be very intrusive on P4 boxes. Hopefully, we can come up with a simpler way. Cheers, Don |
From: Stephane E. <er...@hp...> - 2006-04-05 08:38:03
|
Don, On Tue, Apr 04, 2006 at 11:13:51AM -0500, Don Zickus wrote: > On 4/4/06, Stephane Eranian <er...@hp...> wrote: > > Don, > > > > On Mon, Apr 03, 2006 at 04:32:46PM -0500, Don Zickus wrote: > > > > > > I was trying to fix some functionality in the nmi watchdog but > > > unfortunately it broke how oprofile works. As a result I was trying > > > to come up with a patch that would fix the oprofile issue. The > > > problem revolves around the fact that oprofile uses the same > > > performance counter register as the nmi watchdog. Thus if one were to > > > start oprofile, they would accidentally disable the watchdog, which in > > > turn would cause other nmi watchdog issues. > > > > > I welcome your initiative. I have had problems dealing with this NMI > > code with the perfmon2 subsystem. > > > > I looked at the patch and I am not quite sure what the goal is > > with the reserve_perftctr_nmi()/release_perfctr_nmi(). The logic > > seems to be that Oprofile or perfmon2 would share the counters > > with the NMI watchdog. If NMI watchdog is active, then they > > have one less counter (a counter group for P4). I am sure this > > could break some tools because most of the time they are not prepared > > to deal with unavailable counter. You need to convery that the counter > > is unavailable as opposed to implemented so that they can try requesting > > the same set of events BUT with another event -> counter assignement. > > Thanks for the feedback (both Stephane and Andi). I didn't realize my > patch would break the user space API. :( Initially I was trying to > figure out a way to make that counter unavailable but couldn't find > the right piece of code, so I went the other way and denying a > request. If you could point me to the code which shows user space > which counters are available, I would happily re-work the patch. > I don't know about the particular details of OProfile. I think counters are exposed via /dev/oprofile filesystem. I don't know if what you did is enough to make unavailable counters disappear from there. Supposing it does. The issue is how is this taken into account? Typically you say I want to measure EVT1, EVT2, EVT3. Then, the tool figures out the assignment: EVT1 -> CNTR2, EVT2 -> CNTR1, EVT3 -> CNTR4 for instance. The key is to figure out if the assignement algorithm takes into the list of available counters as shown by the kernel or just as expected for this CPU. In others words, does this code check /dev/oprofile/.. to verify what is actually available. The other aspect to keep in mind here is that counters are not always symetrical. Not all events can be measured on any counter. Few CPUs have that level of flexibility. I think Pentium M and AMD Opteron do, but P4 does not. If you remove certain counters it may not be possible to make certain measurements anymore. > > > > The reservation works if you assume that NMI watchdog and Oprofile NMI-based > > interruption can work together. I do not see where you can chain the interrupt > > handler on NMI interrupt. The set_nmi_callback() replaces the existing handler, > > it does not add a new one. Thus, you break one or the other. If you cannot > > chain the NMI interrupt handler then what is the point of the reservation? > > > > Did I miss anything? > > > Yes, my original patch to Andi reworked the NMI watchdog such that > set_nmi_callback() would _not_ disable it. Instead set_nmi_callback() > would be used to register an NMI handler to capture interrupts the NMI > subsystem didn't know what to do with (it knows NMI watchdog, memory > parity errors, and i/o check errors). However, the problem arises > over the new concept of the NMI watchdog never being allowed to be > disabled. Since the NMI watchdog uses one of the performance > counters, this conflicts with how oprofile implements its > functionality. Therefore, I was trying to find a way to prevent > oprofile from using whatever counter register NMI watchdog is using > (as long as NMI watchdog is enabled at boot time), hence my > reservation logic. > You want both NMI interrupt for NMI watchdog and for OProfile. NMI watchdog is using one counter. Oprofile will use the others. If set_nmi_callback() registers an ADDITIONAL handler, then you have to dispatch between the various handlers. Watchdog and OProfile will trigger NMI on counter overflow. The dispatcher has to detect that this is an NMI from counters. Then, it needs to figure out which counter(s), based on this it needs to call the corresponding handler. Is that your plan? -- -Stephane |
From: Don Z. <dz...@gm...> - 2006-04-05 14:16:24
|
Stephane, > > Thanks for the feedback (both Stephane and Andi). I didn't realize my > > patch would break the user space API. :( Initially I was trying to > > figure out a way to make that counter unavailable but couldn't find > > the right piece of code, so I went the other way and denying a > > request. If you could point me to the code which shows user space > > which counters are available, I would happily re-work the patch. > > > > I don't know about the particular details of OProfile. I think counters > are exposed via /dev/oprofile filesystem. I don't know if what you did is > enough to make unavailable counters disappear from there. Supposing it do= es. > The issue is how is this taken into account? Typically you say I want to > measure EVT1, EVT2, EVT3. Then, the tool figures out the assignment: > EVT1 -> CNTR2, EVT2 -> CNTR1, EVT3 -> CNTR4 for instance. The key > is to figure out if the assignement algorithm takes into the list of > available counters as shown by the kernel or just as expected for this CP= U. > In others words, does this code check /dev/oprofile/.. to verify what is = actually > available. > > The other aspect to keep in mind here is that counters are not always sym= etrical. > Not all events can be measured on any counter. Few CPUs have that level o= f flexibility. > I think Pentium M and AMD Opteron do, but P4 does not. If you remove cert= ain counters > it may not be possible to make certain measurements anymore. > Thanks, this gives me a better idea how this subsystem works. I'll look through the code again to find this logic. > > > Yes, my original patch to Andi reworked the NMI watchdog such that > > set_nmi_callback() would _not_ disable it. Instead set_nmi_callback() > > would be used to register an NMI handler to capture interrupts the NMI > > subsystem didn't know what to do with (it knows NMI watchdog, memory > > parity errors, and i/o check errors). However, the problem arises > > over the new concept of the NMI watchdog never being allowed to be > > disabled. Since the NMI watchdog uses one of the performance > > counters, this conflicts with how oprofile implements its > > functionality. Therefore, I was trying to find a way to prevent > > oprofile from using whatever counter register NMI watchdog is using > > (as long as NMI watchdog is enabled at boot time), hence my > > reservation logic. > > > You want both NMI interrupt for NMI watchdog and for OProfile. > NMI watchdog is using one counter. Oprofile will use the others. > If set_nmi_callback() registers an ADDITIONAL handler, then you > have to dispatch between the various handlers. Watchdog and > OProfile will trigger NMI on counter overflow. The dispatcher > has to detect that this is an NMI from counters. Then, it needs to figure > out which counter(s), based on this it needs to call the corresponding > handler. > > Is that your plan? > Yes, but not as complicated. Basically, the NMI watchdog counter is checked on every NMI interrupt (if enabled). If the watchdog overflowed, then clear, reset and return success. Else if set_nmi_callback is not NULL, execute the handler. The logic doesn't neccessarily have to understand the concept of counters. Cheers, Don |
From: Don Z. <dz...@gm...> - 2006-04-07 21:27:41
Attachments:
oprofile_reserve_nmi2.patch
|
On 4/5/06, Don Zickus <dz...@gm...> wrote: > Stephane, > > > > Thanks for the feedback (both Stephane and Andi). I didn't realize m= y > > > patch would break the user space API. :( Initially I was trying to > > > figure out a way to make that counter unavailable but couldn't find > > > the right piece of code, so I went the other way and denying a > > > request. If you could point me to the code which shows user space > > > which counters are available, I would happily re-work the patch. > > > > > > > I don't know about the particular details of OProfile. I think counters > > are exposed via /dev/oprofile filesystem. I don't know if what you did = is > > enough to make unavailable counters disappear from there. Supposing it = does. > > The issue is how is this taken into account? Typically you say I want t= o > > measure EVT1, EVT2, EVT3. Then, the tool figures out the assignment: > > EVT1 -> CNTR2, EVT2 -> CNTR1, EVT3 -> CNTR4 for instance. The key > > is to figure out if the assignement algorithm takes into the list of > > available counters as shown by the kernel or just as expected for this = CPU. > > In others words, does this code check /dev/oprofile/.. to verify what i= s actually > > available. > > I reworked this patch a little bit. This time I added a hack to _not_ expose performance counters (avail_to_resrv_perfctr_nmi() ) through the /dev/oprofile interface if they are reserved. Its an ugly hack but that is the result of the way oprofile was designed. There was also a bunch of other cleanups too but more related to nmi.c. I am hoping this doesn't break your user space tools. As I don't really run oprofile, I don't know the right way to test this. Let me know if this is a better solution. Cheers, Don |
From: Stephane E. <er...@hp...> - 2006-04-10 09:40:51
|
Don, On Fri, Apr 07, 2006 at 05:27:38PM -0400, Don Zickus wrote: > On 4/5/06, Don Zickus <dz...@gm...> wrote: > > Stephane, > > > > > > Thanks for the feedback (both Stephane and Andi). I didn't realize my > > > > patch would break the user space API. :( Initially I was trying to > > > > figure out a way to make that counter unavailable but couldn't find > > > > the right piece of code, so I went the other way and denying a > > > > request. If you could point me to the code which shows user space > > > > which counters are available, I would happily re-work the patch. > > > > > > > > > > I don't know about the particular details of OProfile. I think counters > > > are exposed via /dev/oprofile filesystem. I don't know if what you did is > > > enough to make unavailable counters disappear from there. Supposing it does. > > > The issue is how is this taken into account? Typically you say I want to > > > measure EVT1, EVT2, EVT3. Then, the tool figures out the assignment: > > > EVT1 -> CNTR2, EVT2 -> CNTR1, EVT3 -> CNTR4 for instance. The key > > > is to figure out if the assignement algorithm takes into the list of > > > available counters as shown by the kernel or just as expected for this CPU. > > > In others words, does this code check /dev/oprofile/.. to verify what is actually > > > available. > > > > > I reworked this patch a little bit. This time I added a hack to _not_ > expose performance counters (avail_to_resrv_perfctr_nmi() ) through > the /dev/oprofile interface if they are reserved. Its an ugly hack > but that is the result of the way oprofile was designed. There was > also a bunch of other cleanups too but more related to nmi.c. > I think it looks okay. As Will pointed out, I think there is still a problem with HyperThreading (HT). The counters are ALL Shared in HT. Thus the P4_NMI_IQ_CCCR0 of thread0 and thread1 go to the same register,i.e, overwritten. I wonder how NMI works on HT CPUs? I think it would make sense to use two counters in HT mode, one for each CPU seen by the kernel. Yet this may render certain user measurement impossible. > I am hoping this doesn't break your user space tools. As I don't > really run oprofile, I don't know the right way to test this. Let me > know if this is a better solution. I don't own OProfile tools. I think as a result of this change, user level code must take a look at /dev/oprofile to figure out what counters are actually available. They must be prepared to operate with fewer counters than expected for a given CPU. I don't know the internals of the OProfile tools, not sure they already have support for this. In the case of perfmon, the PMU description table can use your avail_to_resrv_perfctr_nmi() function to mask off certain PMU registers that are reserved for NMI. The results on user tools will be the same, i.e, they need to be prepared to cope with less counters than expected. None do this today. But that could be fixed. Thanks. -- -Stephane |
From: Andi K. <ak...@su...> - 2006-04-10 14:44:39
|
On Monday 10 April 2006 11:34, Stephane Eranian wrote: > Thus the P4_NMI_IQ_CCCR0 of thread0 and thread1 go to the same > register,i.e, overwritten. I wonder how NMI works on HT CPUs? I think it > would make sense to use two counters in HT mode, one for each CPU seen by > the kernel. Yet this may render certain user measurement impossible The watchdog code should always use the first counter even on HT. This means it uses a shared counter essentially for both threads. Should be ok. -Andi |
From: Don Z. <dz...@gm...> - 2006-04-10 18:22:15
|
> > I reworked this patch a little bit. This time I added a hack to _not_ > > expose performance counters (avail_to_resrv_perfctr_nmi() ) through > > the /dev/oprofile interface if they are reserved. Its an ugly hack > > but that is the result of the way oprofile was designed. There was > > also a bunch of other cleanups too but more related to nmi.c. > > > I think it looks okay. As Will pointed out, I think there is still > a problem with HyperThreading (HT). The counters are ALL Shared in HT. > > Thus the P4_NMI_IQ_CCCR0 of thread0 and thread1 go to the same register,i= .e, > overwritten. I wonder how NMI works on HT CPUs? I think it would make sen= se > to use two counters in HT mode, one for each CPU seen by the kernel. Yet > this may render certain user measurement impossible. > > > I am hoping this doesn't break your user space tools. As I don't > > really run oprofile, I don't know the right way to test this. Let me > > know if this is a better solution. > > I don't own OProfile tools. I think as a result of this change, user leve= l > code must take a look at /dev/oprofile to figure out what counters are ac= tually > available. They must be prepared to operate with fewer counters than expe= cted > for a given CPU. I don't know the internals of the OProfile tools, not su= re > they already have support for this. > Who owns the tool? I just figured this mailing list was the right place to post patches and ask questions about making internal changes? Perhaps I was wrong? Cheers, Don |
From: John L. <le...@mo...> - 2006-04-11 23:48:47
|
On Mon, Apr 10, 2006 at 02:34:37AM -0700, Stephane Eranian wrote: > I don't own OProfile tools. I think as a result of this change, user level > code must take a look at /dev/oprofile to figure out what counters are actually > available. They must be prepared to operate with fewer counters than expected > for a given CPU. I don't know the internals of the OProfile tools, not sure > they already have support for this. They don't, you'd need to add support for this. Otherwise it's yet more userspace breakage. john |
From: Don Z. <dz...@gm...> - 2006-04-12 15:08:07
|
What other changes do I need to do for my changes to be acceptable. I know I still have to port to the p4 (which I get sick every time I think about the work needed) and some stubs for i386. Other than that, how do I go about getting the tools to accept this? Should I just post the patch and let them build off of it? Cheers, Don On 4/11/06, John Levon <le...@mo...> wrote: > On Mon, Apr 10, 2006 at 02:34:37AM -0700, Stephane Eranian wrote: > > > I don't own OProfile tools. I think as a result of this change, user le= vel > > code must take a look at /dev/oprofile to figure out what counters are = actually > > available. They must be prepared to operate with fewer counters than ex= pected > > for a given CPU. I don't know the internals of the OProfile tools, not = sure > > they already have support for this. > > They don't, you'd need to add support for this. Otherwise it's yet more > userspace breakage. > > john > |
From: John L. <le...@mo...> - 2006-04-12 15:09:46
|
On Wed, Apr 12, 2006 at 11:07:58AM -0400, Don Zickus wrote: > What other changes do I need to do for my changes to be acceptable. I > know I still have to port to the p4 (which I get sick every time I > think about the work needed) and some stubs for i386. Other than > that, how do I go about getting the tools to accept this? Should I > just post the patch and let them build off of it? I'm not sure what you mean here. You have to send a patch to opr...@li... that does the checking, and changes to the docs. regards john |
From: Andi K. <ak...@su...> - 2006-04-12 15:13:23
|
On Wednesday 12 April 2006 17:07, Don Zickus wrote: > What other changes do I need to do for my changes to be acceptable. I > know I still have to port to the p4 (which I get sick every time I > think about the work needed) and some stubs for i386. Other than > that, how do I go about getting the tools to accept this? Should I > just post the patch and let them build off of it? I think the tools should just display an error to the user right? If there is something intelligible in dmesg that explains the error that should be ok. -Andi |
From: John L. <le...@mo...> - 2006-04-12 15:20:28
|
On Wed, Apr 12, 2006 at 05:12:56PM +0200, Andi Kleen wrote: > > What other changes do I need to do for my changes to be acceptable. I > > know I still have to port to the p4 (which I get sick every time I > > think about the work needed) and some stubs for i386. Other than > > that, how do I go about getting the tools to accept this? Should I > > just post the patch and let them build off of it? > > I think the tools should just display an error to the user right? > > If there is something intelligible in dmesg that explains the error > that should be ok. BTW, I still think this needs to be optional, or at the very least a way to dynamically turn off the NMI watchdog so if we need the counter we can use it. john |
From: Stephane E. <er...@hp...> - 2006-04-14 13:11:46
|
Don, I took a second look at your patch. You are assuming that the naming scheme of Oprofile counters is the same as yours. I think this is too constrained. It would be better if avail_to_resrv_perfctr_nmi() would use the MSR address of the register to check. This is guaranteed to be the common piece of information. I understand this impacts the way you maintain the ownership via a bitmask. But the NMI is never going to use more than one counter and worst case (P4) you need to track 2 MSR addresses. What do you think? On Fri, Apr 07, 2006 at 05:27:38PM -0400, Don Zickus wrote: > On 4/5/06, Don Zickus <dz...@gm...> wrote: > > Stephane, > > > > > > Thanks for the feedback (both Stephane and Andi). I didn't realize my > > > > patch would break the user space API. :( Initially I was trying to > > > > figure out a way to make that counter unavailable but couldn't find > > > > the right piece of code, so I went the other way and denying a > > > > request. If you could point me to the code which shows user space > > > > which counters are available, I would happily re-work the patch. > > > > > > > > > > I don't know about the particular details of OProfile. I think counters > > > are exposed via /dev/oprofile filesystem. I don't know if what you did is > > > enough to make unavailable counters disappear from there. Supposing it does. > > > The issue is how is this taken into account? Typically you say I want to > > > measure EVT1, EVT2, EVT3. Then, the tool figures out the assignment: > > > EVT1 -> CNTR2, EVT2 -> CNTR1, EVT3 -> CNTR4 for instance. The key > > > is to figure out if the assignement algorithm takes into the list of > > > available counters as shown by the kernel or just as expected for this CPU. > > > In others words, does this code check /dev/oprofile/.. to verify what is actually > > > available. > > > > > I reworked this patch a little bit. This time I added a hack to _not_ > expose performance counters (avail_to_resrv_perfctr_nmi() ) through > the /dev/oprofile interface if they are reserved. Its an ugly hack > but that is the result of the way oprofile was designed. There was > also a bunch of other cleanups too but more related to nmi.c. > > I am hoping this doesn't break your user space tools. As I don't > really run oprofile, I don't know the right way to test this. Let me > know if this is a better solution. > > Cheers, > Don -- -Stephane |
From: Don Z. <dz...@gm...> - 2006-04-14 15:49:03
|
On 4/14/06, Stephane Eranian <er...@hp...> wrote: > Don, > > I took a second look at your patch. You are assuming > that the naming scheme of Oprofile counters is the same > as yours. I think this is too constrained. It would > be better if avail_to_resrv_perfctr_nmi() would use > the MSR address of the register to check. This is guaranteed > to be the common piece of information. I understand > this impacts the way you maintain the ownership via a > bitmask. But the NMI is never going to use more than > one counter and worst case (P4) you need to track 2 > MSR addresses. > > What do you think? > The reason I hacked in avail_to_resrv_perfctr_nmi the way I did is because I _didn't_ have the MSR address to play with. In fact the MSRs are even reserved yet. Unless I am misreading the oprofile code, it keeps two distinct structs: one to hold the msrs and the other to hold the file pointers to /dev/oprofile. It would be nice to have a reference pointer to either struct from the other, that way the lookup could be safer. I'll take another look at the code and see I missed something. But its hard to do when oprofile blindly sets up all the counters under /dev/oprofile then later checks to see if they are there and initializes them. Without re-writing half of the oprofile setup logic, I couldn't find a real easy way to do what I needed to do. Cheers, Don |
From: Stephane E. <er...@hp...> - 2006-04-18 08:42:57
|
Don, On Fri, Apr 14, 2006 at 11:48:59AM -0400, Don Zickus wrote: > On 4/14/06, Stephane Eranian <er...@hp...> wrote: > > Don, > > > > I took a second look at your patch. You are assuming > > that the naming scheme of Oprofile counters is the same > > as yours. I think this is too constrained. It would > > be better if avail_to_resrv_perfctr_nmi() would use > > the MSR address of the register to check. This is guaranteed > > to be the common piece of information. I understand > > this impacts the way you maintain the ownership via a > > bitmask. But the NMI is never going to use more than > > one counter and worst case (P4) you need to track 2 > > MSR addresses. > > > > What do you think? > > > The reason I hacked in avail_to_resrv_perfctr_nmi the way I did is > because I _didn't_ have the MSR address to play with. In fact the > MSRs are even reserved yet. Unless I am misreading the oprofile code, > it keeps two distinct structs: one to hold the msrs and the other to > hold the file pointers to /dev/oprofile. It would be nice to have a > reference pointer to either struct from the other, that way the lookup > could be safer. > > I'll take another look at the code and see I missed something. But > its hard to do when oprofile blindly sets up all the counters under > /dev/oprofile then later checks to see if they are there and > initializes them. Without re-writing half of the oprofile setup > logic, I couldn't find a real easy way to do what I needed to do. > What about those *fill_in_*() routines? They fill in the MSR addresses corresponding to counters. Couldn't you intervene at this level? -- -Stephane |
From: Don Z. <dz...@gm...> - 2006-04-18 14:25:32
|
On 4/18/06, Stephane Eranian <er...@hp...> wrote: > Don, > > > The reason I hacked in avail_to_resrv_perfctr_nmi the way I did is > > because I _didn't_ have the MSR address to play with. In fact the > > MSRs are even reserved yet. Unless I am misreading the oprofile code, > > it keeps two distinct structs: one to hold the msrs and the other to > > hold the file pointers to /dev/oprofile. It would be nice to have a > > reference pointer to either struct from the other, that way the lookup > > could be safer. > > > > I'll take another look at the code and see I missed something. But > > its hard to do when oprofile blindly sets up all the counters under > > /dev/oprofile then later checks to see if they are there and > > initializes them. Without re-writing half of the oprofile setup > > logic, I couldn't find a real easy way to do what I needed to do. > > > > What about those *fill_in_*() routines? They fill in the MSR addresses > corresponding to counters. Couldn't you intervene at this level? > Unfortunately, I can not. The fill_in-*() routines are called once a user selects an event to profile. This is waaaaay later than when the /dev/oprofile/* files are created. Hence my ugly hack. The problem (as I see it) was the code originally assumed it had full access to that part of the hardware and that is was always available. By adding in the changes I want to make, it completely breaks that assumption.=20 Now things start to get ugly. Like I said without re-writing much of the setup logic for oprofile, I have failed to find a better approach. Cheers, Don |
From: Stephane E. <er...@hp...> - 2006-04-19 17:57:32
|
Don, On Tue, Apr 18, 2006 at 10:25:26AM -0400, Don Zickus wrote: > On 4/18/06, Stephane Eranian <er...@hp...> wrote: > > Don, > > > > > The reason I hacked in avail_to_resrv_perfctr_nmi the way I did is > > > because I _didn't_ have the MSR address to play with. In fact the > > > MSRs are even reserved yet. Unless I am misreading the oprofile code, > > > it keeps two distinct structs: one to hold the msrs and the other to > > > hold the file pointers to /dev/oprofile. It would be nice to have a > > > reference pointer to either struct from the other, that way the lookup > > > could be safer. > > > > > > I'll take another look at the code and see I missed something. But > > > its hard to do when oprofile blindly sets up all the counters under > > > /dev/oprofile then later checks to see if they are there and > > > initializes them. Without re-writing half of the oprofile setup > > > logic, I couldn't find a real easy way to do what I needed to do. > > > > > > > What about those *fill_in_*() routines? They fill in the MSR addresses > > corresponding to counters. Couldn't you intervene at this level? > > > > Unfortunately, I can not. The fill_in-*() routines are called once a > user selects an event to profile. This is waaaaay later than when the > /dev/oprofile/* files are created. Hence my ugly hack. The problem > (as I see it) was the code originally assumed it had full access to > that part of the hardware and that is was always available. By adding > in the changes I want to make, it completely breaks that assumption. > Now things start to get ugly. > > Like I said without re-writing much of the setup logic for oprofile, I > have failed to find a better approach. > What about providing a second avail_() function that takes MSR addresses as input? I think it would make your interface more useable by non OProfile code. That would certainly make my job easier trying to interface perfmon2 to your NMI routines. > Cheers, > Don > > > ------------------------------------------------------- > This SF.Net email is sponsored by xPML, a groundbreaking scripting language > that extends applications into web and mobile media. Attend the live webcast > and join the prime developer group breaking into this new coding territory! > http://sel.as-us.falkag.net/sel?cmd_______________________________________________ > oprofile-list mailing list > opr...@li... > https://lists.sourceforge.net/lists/listinfo/oprofile-list -- -Stephane |
From: Don Z. <dz...@gm...> - 2006-04-20 22:28:45
|
On 4/19/06, Stephane Eranian <er...@hp...> wrote: > Don, > > > > What about providing a second avail_() function that takes MSR addresses = as input? > I think it would make your interface more useable by non OProfile code. > That would certainly make my job easier trying to interface perfmon2 to y= our NMI routines. I like it. In fact it makes better sense for all the reserve/release functions too. I'll add it. So now I have the following functions: avail_to_resrv_perfctr_nmi( unsigned int MSR) avail_to_resrv_perfctr_nmi_bit( unsigned int counter) //hack version for oprofile resrv_perfctr_nmi( unsigned int MSR) release_perfctr_nmi( unsigned int MSR) resrv_evntsel_nmi( unsigned int MSR) release_evntsel_nmi( unsigned int MSR) Cheers, Don |
From: Stephane E. <er...@hp...> - 2006-04-21 16:09:27
|
Don, On Thu, Apr 20, 2006 at 06:28:42PM -0400, Don Zickus wrote: > On 4/19/06, Stephane Eranian <er...@hp...> wrote: > > Don, > > > > > > > What about providing a second avail_() function that takes MSR addresses as input? > > I think it would make your interface more useable by non OProfile code. > > That would certainly make my job easier trying to interface perfmon2 to your NMI routines. > > I like it. In fact it makes better sense for all the reserve/release > functions too. I'll add it. > So now I have the following functions: > avail_to_resrv_perfctr_nmi( unsigned int MSR) > avail_to_resrv_perfctr_nmi_bit( unsigned int counter) //hack version > for oprofile > resrv_perfctr_nmi( unsigned int MSR) > release_perfctr_nmi( unsigned int MSR) > resrv_evntsel_nmi( unsigned int MSR) > release_evntsel_nmi( unsigned int MSR) Very good. Now I can use this for perfmon2. Thanks. -- -Stephane |
From: Andi K. <ak...@su...> - 2006-04-05 18:48:39
|
On Wednesday 05 April 2006 10:31, Stephane Eranian wrote: > The other aspect to keep in mind here is that counters are not always > symetrical. Not all events can be measured on any counter. Few CPUs have > that level of flexibility. I think Pentium M and AMD Opteron do, but P4 > does not. If you remove certain counters it may not be possible to make > certain measurements anymore. P4 has at least two counters per group, possible more. NMI watchdog will never use more than one. So any counter should be still accessible. > > > The reservation works if you assume that NMI watchdog and Oprofile > > > NMI-based interruption can work together. I do not see where you can > > > chain the interrupt handler on NMI interrupt. The set_nmi_callback() > > > replaces the existing handler, it does not add a new one. Thus, you > > > break one or the other. If you cannot chain the NMI interrupt handler > > > then what is the point of the reservation? > > > > > > Did I miss anything? > > > > Yes, my original patch to Andi reworked the NMI watchdog such that > > set_nmi_callback() would _not_ disable it. Instead set_nmi_callback() > > would be used to register an NMI handler to capture interrupts the NMI > > subsystem didn't know what to do with (it knows NMI watchdog, memory > > parity errors, and i/o check errors). However, the problem arises > > over the new concept of the NMI watchdog never being allowed to be > > disabled. Since the NMI watchdog uses one of the performance > > counters, this conflicts with how oprofile implements its > > functionality. Therefore, I was trying to find a way to prevent > > oprofile from using whatever counter register NMI watchdog is using > > (as long as NMI watchdog is enabled at boot time), hence my > > reservation logic. > > You want both NMI interrupt for NMI watchdog and for OProfile. > NMI watchdog is using one counter. Oprofile will use the others. > If set_nmi_callback() registers an ADDITIONAL handler, Only oprofile uses set_nmi_callback, the other users use DIE_NMI > then you > have to dispatch between the various handlers. Watchdog and > OProfile will trigger NMI on counter overflow. The dispatcher > has to detect that this is an NMI from counters. Then, it needs to figure > out which counter(s), based on this it needs to call the corresponding > handler. The point is that people want to get panics to reach the "panic on nmi" check at the end. That only works if not all users before that eat them up. Currently oprofile and the nmi watchdog are that badly behaved. -Andi |
From: William C. <wc...@re...> - 2006-04-05 18:59:15
|
Andi Kleen wrote: > On Wednesday 05 April 2006 10:31, Stephane Eranian wrote: > > >>The other aspect to keep in mind here is that counters are not always >>symetrical. Not all events can be measured on any counter. Few CPUs have >>that level of flexibility. I think Pentium M and AMD Opteron do, but P4 >>does not. If you remove certain counters it may not be possible to make >>certain measurements anymore. > > > P4 has at least two counters per group, possible more. NMI watchdog > will never use more than one. So any counter should be still accessible. Is this still true when hyperthreading is being used? A number of the tools, e.g. split the perf counters between the threads, leaving only one register in a group on a processor. -Will |
From: Stephane E. <er...@hp...> - 2006-04-05 19:08:24
|
Will, On Wed, Apr 05, 2006 at 02:57:01PM -0400, William Cohen wrote: > Andi Kleen wrote: > >On Wednesday 05 April 2006 10:31, Stephane Eranian wrote: > > > > > >>The other aspect to keep in mind here is that counters are not always > >>symetrical. Not all events can be measured on any counter. Few CPUs have > >>that level of flexibility. I think Pentium M and AMD Opteron do, but P4 > >>does not. If you remove certain counters it may not be possible to make > >>certain measurements anymore. > > > > > >P4 has at least two counters per group, possible more. NMI watchdog > >will never use more than one. So any counter should be still accessible. > > Is this still true when hyperthreading is being used? A number of the > tools, e.g. split the perf counters between the threads, leaving only > one register in a group on a processor. > This is a good point. Yes, perfmon does split the register file in two. The user sees on 8 counters and those 8 are mapped into one half of the other depending of which HW thread the program executes on. That is the only way to let two programs using the counters run on the same CPU core. -- -Stephane |