From: David G. <da...@gi...> - 2005-03-23 06:36:26
Well, having thrown my oar in on the concerns I have with the perfctr interface, now let me do the same for the perfmon interface. I'm moving this discussion onto the lse-tech list so that it is archived and publicly visible and all those good things.

Stephane, I've been having a closer look at the perfmon spec document (the version dated Dec 21, 2004). Below are a number of points which concern me (this is by no means exhaustive):

General issues:

* The multiplexed syscall takes an argument giving the number of parameters, the size and type of which depend on the individual call. Better would be an overall size argument, or even both an argument size and a number of arguments. It's a little harder to process, but at least you can tell how much memory a call will touch without having to know about every individual operation.

* The method of requesting overflow sampling or notification for a PMD assumes there is a unique PMC associated with that PMD. This is insufficiently general, since it is not naturally true for ppc64 (event selection for the various counters is controlled by a combination of various fields in the registers MMCR0, MMCR1 and MMCRA).

* How widespread is the use of the term "event sets"? Is it perfmon specific, or more widely established? I find the term rather misleading, and would prefer something like "subcontext".

More specific issues:

PFM_CREATE_CONTEXT

Altering the calling process's memory map as a side effect is icky. It could also cause problems for (the few) programs which need to take fine-grained control of their memory maps (JVMs?). Much better for the process to map the sample buffer with an explicit mmap() on the context's fd. You could, however, return an offset at which to perform the mmap().

PFM_WRITE_PMCS

The documentation says that PFM_MAX_PMD_BITVECTOR can vary between PMU models, but its value for the current PMU model is not exported anywhere.
Varying by architecture doesn't make much sense, since PMU details vary only mildly more between architectures than they do between CPU models within one architecture.

PFM_START / PFM_START_SET

I see no reason for two separate entry points; PFM_START_SET with a NULL argument can just leave the default or currently active set running, acting as a PFM_START.

PFM_STOP

This could reasonably be folded into PFM_START_SET as well, by having a special set id meaning "no set". Obviously the name of the operation would want to be changed, too (PFM_CHANGE_RUNNING?).

PFM_LOAD_CONTEXT

I'm not sure I see the point of the load_set argument. What can be accomplished with it that can't be with appropriate use of PFM_START_SET?

PFM_UNLOAD_CONTEXT

This could also reasonably be folded into the above, say by using load_pid == 0 to request binding the context to no thread at all. Again a name change would be in order (PFM_CHANGE_THREAD?).

PFM_CREATE_EVTSET / PFM_DELETE_EVTSET / PFM_CHANGE_EVTSET

Is there really a need to incrementally update the event sets? Would a PFM_SETUP_EVTSETS, which acts like PFM_CREATE_EVTSET but replaces all existing event sets with the ones described, suffice? This approach would not only reduce the number of entry points, but could also simplify the kernel's parameter checking. For example, at the moment deleting an event set which is referenced by another set's set_id_next must presumably either fail, or alter those other event sets to no longer reference the deleted set.

PFM_GET_DEFAULT_PMCS

The usefulness of this operation is not obvious to me. How would you envisage it being used?

PFM_GET_FEATURES

For one thing, this doesn't belong in the multiplexor - it doesn't use the fd, and would be better exported via /proc or /sys. But, in any case, I don't think I've ever seen a subsystem version like this well used, so I'm not sure the operation is a good idea at all. How would you envisage it being used?
PFM_GET_CONFIG / PFM_SET_CONFIG

Again, these definitely don't belong on the multiplexor. They don't use the fd, and since they set the permission regime for the multiplexor itself, they logically belong outside it. Under Linux these definitely ought to be sysctls. Even if this is ever ported to other OSes, I still don't think they belong here: setting up the permission regime for all the other calls is, I think, a logically OS-specific operation and doesn't belong in the core API. As far as I can tell it's unlikely you would use these operations in the same programs that use the rest of perfmon.

--
David Gibson                   | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you.  NOT _the_ _other_
                               | _way_ _around_!
http://www.ozlabs.org/people/dgibson
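The overall-size-argument idea above (which Stephane later refines to a (size, count) pair) lets the kernel bound and type-check a multiplexed call before touching any user memory. A minimal sketch in C; the command numbers, structure layouts, and the 4 KB cap are all hypothetical illustrations, not the real perfmon ABI:

```c
#include <stddef.h>

/* Hypothetical per-command element sizes for a multiplexed syscall
 * taking (cmd, arg, size, count).  Illustrative only. */
struct pfarg_reg   { unsigned long reg_num; unsigned long reg_value; };
struct pfarg_start { unsigned long set_id; };

static const size_t cmd_argsz[] = {
    [0] = sizeof(struct pfarg_reg),   /* e.g. a PFM_WRITE_PMCS-like cmd */
    [1] = sizeof(struct pfarg_reg),   /* e.g. a PFM_WRITE_PMDS-like cmd */
    [2] = sizeof(struct pfarg_start), /* e.g. a PFM_START-like cmd */
};

/* With an explicit (size, count) pair the kernel can reject a caller
 * compiled against the wrong structure layout, and bound the total
 * copy_from_user() before dispatching to any command handler.
 * Returns the number of bytes to copy in, or -1 on a bad call. */
static int check_call(unsigned int cmd, size_t size, size_t count)
{
    if (cmd >= sizeof(cmd_argsz) / sizeof(cmd_argsz[0]))
        return -1;                 /* unknown command */
    if (size != cmd_argsz[cmd])
        return -1;                 /* wrong data type / ABI skew */
    if (count > 4096 / size)
        return -1;                 /* cap total memory touched */
    return (int)(size * count);
}
```

Note the check needs no knowledge of what each operation does, which is exactly the point: the memory footprint is knowable generically.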
From: Stephane E. <er...@hp...> - 2005-03-29 10:09:08
David,

Sorry for the late reply, but I was travelling all of last week. Thank you for your feedback. That is exactly what I am asking for. See my comments.

On Wed, 2005-03-23 at 17:35 +1100, David Gibson wrote:

> Stephane, I've been having a closer look at the perfmon spec document (the version dated Dec 21, 2004). Below are a number of points which concern me (this is by no means exhaustive):

I have since updated the document, and several of the points you mention below have been fixed/changed.

> General issues:
>
> * The multiplexed syscall takes an argument giving the number of parameters, the size and type of which depend on the individual call. Better would be an overall size argument, or even both an argument size and a number of arguments. It's a little harder to process, but at least you can tell how much memory a call will touch without having to know about every individual operation.

I would rather see (size, count), because it would help the kernel filter out calls using the wrong data type.

> * The method of requesting overflow sampling or notification for a PMD assumes there is a unique PMC associated with that PMD. This is insufficiently general, since it is not naturally true for ppc64 (event selection for the various counters is controlled by a combination of various fields in the registers MMCR0, MMCR1 and MMCRA).

That is a good point. It is true that today overflow notification is requested via the PMC and not the PMD. The implementation assumes (wrongly) that PMCx corresponds to PMDx. The flag is recorded in the PMD-related structure, hence it would seem more natural to pass the RANDOM/OVFL_NOTIFY flags via PFM_WRITE_PMDS.
I did it via PFM_WRITE_PMCS because I considered those flags part of the configuration of the counters, so they would go with the PMC. For PPC64, it looks like you are in a situation similar to the P4, where multiple config registers are used to control a counter. We could move the flags to PFM_WRITE_PMDS.

> * How widespread is the use of the term "event sets"? Is it perfmon specific, or more widely established? I find the term rather misleading, and would prefer something like "subcontext".

I have seen this term used by Phil's PAPI toolkit, where it means the same thing. An event set is a software abstraction which encapsulates the entire PMU state.

> More specific issues:
>
> PFM_CREATE_CONTEXT
>
> Altering the calling process's memory map as a side effect is icky. It could also cause problems for (the few) programs which need to take fine-grained control of their memory maps (JVMs?). Much better for the process to map the sample buffer with an explicit mmap() on the context's fd. You could, however, return an offset at which to perform the mmap().

That has been fixed in the new rev of the specification. Now the call returns how big the buffer actually is; the application must then explicitly invoke mmap() using that size. This simplifies the implementation and follows the programming model people are most used to.

> PFM_WRITE_PMCS
>
> The documentation says that PFM_MAX_PMD_BITVECTOR can vary between PMU models, but its value for the current PMU model is not exported anywhere. Varying by architecture doesn't make much sense, since PMU details vary only mildly more between architectures than they do between CPU models within one architecture.

PFM_MAX_PMD_BITVECTOR is exported in the perfmon.h header file. At this point, it is provided by each architecture. When the processor architecture is nice, the PMU framework is specified there and it makes the job of software easier. For instance, on Itanium, the architecture says you can have up to 256 PMC and 256 PMD registers. Having this kind of information is very useful to size data structures appropriately. You don't want to have to copy large data structures (think copy_user) if you don't have to. Are you advocating that this be a PMU model-specific size?
> PFM_START / PFM_START_SET
>
> I see no reason for two separate entry points; PFM_START_SET with a NULL argument can just leave the default or currently active set running, acting as a PFM_START.

These calls have been merged as PFM_START with an optional argument. The optional argument is there to maintain backward compatibility with the existing 2.6 interface.

> PFM_STOP
>
> This could reasonably be folded into PFM_START_SET as well, by having a special set id meaning "no set". Obviously the name of the operation would want to be changed, too (PFM_CHANGE_RUNNING?).

We could find a way to merge START/STOP, but then the argument would have to indicate which of the two operations to perform. Your trick is one possibility; another would be to add a flag to the data structure passed.

> PFM_LOAD_CONTEXT
>
> I'm not sure I see the point of the load_set argument. What can be accomplished with it that can't be with appropriate use of PFM_START_SET?

You don't want to merge START and LOAD. That is not because you attach to a thread/CPU that you want monitoring to start right away. But I think you have a good point. The interface guarantees that on PFM_LOAD_CONTEXT, monitoring is stopped; you need an explicit START to activate it. This is true even if you detached while monitoring was active. I need to check whether there is something else involved here.

> PFM_UNLOAD_CONTEXT
>
> This could also reasonably be folded into the above, say by using load_pid == 0 to request binding the context to no thread at all. Again a name change would be in order (PFM_CHANGE_THREAD?).

That looks reasonable to me.

> PFM_CREATE_EVTSET / PFM_DELETE_EVTSET / PFM_CHANGE_EVTSET
>
> Is there really a need to incrementally update the event sets? Would a PFM_SETUP_EVTSETS, which acts like PFM_CREATE_EVTSET but replaces all existing event sets with the ones described, suffice? This approach would not only reduce the number of entry points, but could also simplify the kernel's parameter checking.
> For example, at the moment deleting an event set which is referenced by another set's set_id_next must presumably either fail, or alter those other event sets to no longer reference the deleted set.

This has been trimmed down to two calls in the new rev: PFM_CREATE_EVTSETS and PFM_DELETE_EVTSETS. If the event set already exists, PFM_CREATE_EVTSETS updates it. This is useful for set0, which always exists.

As for delete/create, those operations can only happen when the context is detached. Checking the validity of the event set chain is deferred until PFM_LOAD_CONTEXT, because at that point it is no longer possible to modify the sets. If a set_id_next is invalid, PFM_LOAD_CONTEXT fails.

> PFM_GET_DEFAULT_PMCS
>
> The usefulness of this operation is not obvious to me. How would you envisage it being used?

This command is now obsolete. Check the new rev of the document for PFM_GETINFO_PMCS and PFM_GETINFO_PMDS. These calls return the default values for PMCs and PMDs as well as a bitmask of the reserved fields for each register. They also return the PMD/PMC to actual HW register mappings.

> PFM_GET_FEATURES
>
> For one thing, this doesn't belong in the multiplexor - it doesn't use the fd, and would be better exported via /proc or /sys. But, in any case, I don't think I've ever seen a subsystem version like this well used, so I'm not sure the operation is a good idea at all. How would you envisage it being used?

In the new rev, this call has been folded into PFM_GET_CONFIG.

> PFM_GET_CONFIG / PFM_SET_CONFIG
>
> Again, these definitely don't belong on the multiplexor. They don't use the fd, and since they set the permission regime for the multiplexor itself, they logically belong outside it. Under Linux these definitely ought to be sysctls. Even if this is ever ported to other OSes, I still don't think they belong here.
> Setting up the permission regime for all the other calls is, I think, a logically OS-specific operation and doesn't belong in the core API. As far as I can tell it's unlikely you would use these operations in the same programs that use the rest of perfmon.

As discussed earlier, on Linux these operations could just as well be implemented with sysctls. I will update the document to incorporate your feedback, which I found very useful.

In terms of porting, I am getting closer to being able to send you a skeleton header/C file with the required callbacks. Please let me know of any special PPC64 behavior. For instance, looking at the Opteron and Pentium 4:

- On counter overflow, does PPC freeze the entire PMU?
- The HW counters are not 64 bits; what are the values of the upper bits of a counter? Should they be all 1s or all 0s?
- How is a counter overflow detected? When the full 64 bits of the counter overflow, or when there is a carry out of bit n-1 for a width of n?
- Are there any PPC64 PMU registers which can only be used by one thread at a time (shared)? Think hyperthreading.
- Is there a way to stop monitoring without having to modify all used PMC registers?

Thanks.
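The revised programming model Stephane describes above (PFM_CREATE_CONTEXT reports the sample-buffer size; the application then maps it explicitly on the context fd) is the standard Unix mmap-on-fd pattern. A sketch of that pattern, using an ordinary temporary file as a stand-in for a perfmon context fd; map_sample_buffer and the 4 KB size are illustrative, not the real API:

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* The application performs the mmap() itself, using a size the kernel
 * reported, instead of the kernel altering the process's memory map
 * behind its back. */
static void *map_sample_buffer(int fd, size_t buf_size)
{
    return mmap(NULL, buf_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

static int demo(void)
{
    char path[] = "/tmp/pfm_demo_XXXXXX";   /* stand-in for a context fd */
    int fd = mkstemp(path);
    size_t buf_size = 4096;                 /* as if returned by the kernel */

    if (fd < 0 || ftruncate(fd, (off_t)buf_size) < 0)
        return -1;

    char *buf = map_sample_buffer(fd, buf_size);
    if (buf == MAP_FAILED)
        return -1;

    memcpy(buf, "sample", 7);               /* use the mapping */
    int ok = (buf[0] == 's');

    munmap(buf, buf_size);
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

Programs that manage their own address space (the JVMs David mentions) keep full control, since nothing is mapped until they ask for it.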
From: David G. <da...@gi...> - 2005-03-30 06:05:33
On Tue, Mar 29, 2005 at 12:07:23PM +0200, Stephane Eranian wrote:

> Sorry for the late reply, but I was travelling all of last week. Thank you for your feedback. That is exactly what I am asking for. See my comments.
>
> On Wed, 2005-03-23 at 17:35 +1100, David Gibson wrote:
>
> > Stephane, I've been having a closer look at the perfmon spec document (the version dated Dec 21, 2004). Below are a number of points which concern me (this is by no means exhaustive):
>
> I have since updated the document, and several of the points you mention below have been fixed/changed.

Ok, where do I find the revised document? I'm afraid the link wasn't obvious to me on the perfmon site.

> > General issues:
> >
> > * The multiplexed syscall takes an argument giving the number of parameters, the size and type of which depend on the individual call. Better would be an overall size argument, or even both an argument size and a number of arguments. It's a little harder to process, but at least you can tell how much memory a call will touch without having to know about every individual operation.
>
> I would rather see (size, count), because it would help the kernel filter out calls using the wrong data type.

Fair enough.

> > * The method of requesting overflow sampling or notification for a PMD assumes there is a unique PMC associated with that PMD. This is insufficiently general, since it is not naturally true for ppc64 (event selection for the various counters is controlled by a combination of various fields in the registers MMCR0, MMCR1 and MMCRA).
>
> That is a good point. It is true that today overflow notification is requested via the PMC and not the PMD. The implementation assumes (wrongly) that PMCx corresponds to PMDx. The flag is recorded in the PMD-related structure, hence it would seem more natural to pass the RANDOM/OVFL_NOTIFY flags via PFM_WRITE_PMDS.
> I did it via PFM_WRITE_PMCS because I considered those flags part of the configuration of the counters, so they would go with the PMC. For PPC64, it looks like you are in a situation similar to the P4, where multiple config registers are used to control a counter. We could move the flags to PFM_WRITE_PMDS.

I think that would make more sense. Or maybe even a different mechanism entirely. How would you support performance monitor events that aren't counter overflows, for those CPUs that have such?

> > * How widespread is the use of the term "event sets"? Is it perfmon specific, or more widely established? I find the term rather misleading, and would prefer something like "subcontext".
>
> I have seen this term used by Phil's PAPI toolkit, where it means the same thing. An event set is a software abstraction which encapsulates the entire PMU state.

Oh well, I guess we're stuck with that one, then.

> > More specific issues:
> >
> > PFM_CREATE_CONTEXT
> >
> > Altering the calling process's memory map as a side effect is icky. It could also cause problems for (the few) programs which need to take fine-grained control of their memory maps (JVMs?). Much better for the process to map the sample buffer with an explicit mmap() on the context's fd. You could, however, return an offset at which to perform the mmap().
>
> That has been fixed in the new rev of the specification. Now the call returns how big the buffer actually is; the application must then explicitly invoke mmap() using that size. This simplifies the implementation and follows the programming model people are most used to.

Ok.

> > PFM_WRITE_PMCS
> >
> > The documentation says that PFM_MAX_PMD_BITVECTOR can vary between PMU models, but its value for the current PMU model is not exported anywhere.
> > Varying by architecture doesn't make much sense, since PMU details vary only mildly more between architectures than they do between CPU models within one architecture.
>
> PFM_MAX_PMD_BITVECTOR is exported in the perfmon.h header file. At this point, it is provided by each architecture. When the processor architecture is nice, the PMU framework is specified there and it makes the job of software easier. For instance, on Itanium, the architecture says you can have up to 256 PMC and 256 PMD registers. Having this kind of information is very useful to size data structures appropriately. You don't want to have to copy large data structures (think copy_user) if you don't have to. Are you advocating that this be a PMU model-specific size?

Having it per-architecture doesn't really make a lot of sense, since PM units vary only slightly less between CPUs of the same architecture than they do between CPUs of different architectures. The PM unit may well not be defined by the architecture specification (if such exists) at all, so I don't think you can count on there being a definitive limit on the number of PMDs in general.

The greatest number of PMDs on any PowerPC so far is 8, and I'm not aware of any plans for CPUs with more, but it wouldn't surprise me if it happened some day. Since this size can never be changed without breaking the ABI, we would have to leave room for expansion, and there's no real guidance as to how much.

So I think this should either be PM model dependent, or it should be truly global - per-architecture is a bad compromise. The latter, obviously, is much simpler to implement.

> > PFM_START / PFM_START_SET
> >
> > I see no reason for two separate entry points; PFM_START_SET with a NULL argument can just leave the default or currently active set running, acting as a PFM_START.
>
> These calls have been merged as PFM_START with an optional argument.
> The optional argument is there to maintain backward compatibility with the existing 2.6 interface.

Excellent.

> > PFM_STOP
> >
> > This could reasonably be folded into PFM_START_SET as well, by having a special set id meaning "no set". Obviously the name of the operation would want to be changed, too (PFM_CHANGE_RUNNING?).
>
> We could find a way to merge START/STOP, but then the argument would have to indicate which of the two operations to perform. Your trick is one possibility; another would be to add a flag to the data structure passed.

> > PFM_LOAD_CONTEXT
> >
> > I'm not sure I see the point of the load_set argument. What can be accomplished with it that can't be with appropriate use of PFM_START_SET?
>
> You don't want to merge START and LOAD. That is not because you attach to a thread/CPU that you want monitoring to start right away. But I think you have a good point. The interface guarantees that on PFM_LOAD_CONTEXT, monitoring is stopped; you need an explicit START to activate it. This is true even if you detached while monitoring was active. I need to check whether there is something else involved here.

Sorry, I don't fully follow what you're saying here (I can't parse the first sentence, in particular). My point is that it's not clear to me that there's anything useful you can accomplish with:

	CREATE <do stuff> LOAD START

that can't be done with

	CREATE+ATTACH <do stuff> START

> > PFM_UNLOAD_CONTEXT
> >
> > This could also reasonably be folded into the above, say by using load_pid == 0 to request binding the context to no thread at all. Again a name change would be in order (PFM_CHANGE_THREAD?).
>
> That looks reasonable to me.

> > PFM_CREATE_EVTSET / PFM_DELETE_EVTSET / PFM_CHANGE_EVTSET
> >
> > Is there really a need to incrementally update the event sets? Would a PFM_SETUP_EVTSETS, which acts like PFM_CREATE_EVTSET but replaces all existing event sets with the ones described, suffice?
> > This approach would not only reduce the number of entry points, but could also simplify the kernel's parameter checking. For example, at the moment deleting an event set which is referenced by another set's set_id_next must presumably either fail, or alter those other event sets to no longer reference the deleted set.
>
> This has been trimmed down to two calls in the new rev: PFM_CREATE_EVTSETS and PFM_DELETE_EVTSETS. If the event set already exists, PFM_CREATE_EVTSETS updates it. This is useful for set0, which always exists.
>
> As for delete/create, those operations can only happen when the context is detached. Checking the validity of the event set chain is deferred until PFM_LOAD_CONTEXT, because at that point it is no longer possible to modify the sets. If a set_id_next is invalid, PFM_LOAD_CONTEXT fails.

But again, is there a real reason to allow incremental updates? If there was a single operation which atomically changed all the event sets, it would mean one less entry point, *plus* we could do the error checking there (earlier error checking is always good). And we wouldn't even need user-allocated id numbers; the array position would suffice.

> > PFM_GET_DEFAULT_PMCS
> >
> > The usefulness of this operation is not obvious to me. How would you envisage it being used?
>
> This command is now obsolete. Check the new rev of the document for PFM_GETINFO_PMCS and PFM_GETINFO_PMDS. These calls return the default values for PMCs and PMDs as well as a bitmask of the reserved fields for each register. They also return the PMD/PMC to actual HW register mappings.

Oh good.

> > PFM_GET_FEATURES
> >
> > For one thing, this doesn't belong in the multiplexor - it doesn't use the fd, and would be better exported via /proc or /sys. But, in any case, I don't think I've ever seen a subsystem version like this well used, so I'm not sure the operation is a good idea at all.
> > How would you envisage it being used?
>
> In the new rev, this call has been folded into PFM_GET_CONFIG.

Ok.

> > PFM_GET_CONFIG / PFM_SET_CONFIG
> >
> > Again, these definitely don't belong on the multiplexor. They don't use the fd, and since they set the permission regime for the multiplexor itself, they logically belong outside it. Under Linux these definitely ought to be sysctls. Even if this is ever ported to other OSes, I still don't think they belong here: setting up the permission regime for all the other calls is, I think, a logically OS-specific operation and doesn't belong in the core API. As far as I can tell it's unlikely you would use these operations in the same programs that use the rest of perfmon.
>
> As discussed earlier, on Linux these operations could just as well be implemented with sysctls.

Yes, and I don't think having a cross-platform operation for this is worthwhile. This is a system administrator operation, not an operation for the users of perfmon, so I don't think having it platform specific is a problem at all.

> I will update the document to incorporate your feedback, which I found very useful.
>
> In terms of porting, I am getting closer to being able to send you a skeleton header/C file with the required callbacks. Please let me know of any special PPC64 behavior. For instance, looking at the Opteron and Pentium 4:
>
> - On counter overflow, does PPC freeze the entire PMU?

Optional, IIRC, depending on some control bits in MMCR0.

> - The HW counters are not 64 bits; what are the values of the upper bits of a counter? Should they be all 1s or all 0s?

The counter registers are 32 bits wide, but can only be effectively used as 31-bit counters (see below).

> - How is a counter overflow detected? When the full 64 bits of the counter overflow, or when there is a carry out of bit n-1 for a width of n?

The interrupts occur on (32-bit) counter negative, rather than overflow per se.
The only way to determine which counters have overflowed is to look at the sign bits. Furthermore, the sign bit must be cleared in order to clear the interrupt condition (hence only 31-bit counters, effectively).

Another issue which I ran into for perfctr is that interrupts can't be individually enabled or disabled for each counter. There is one control bit which determines whether PMC1 generates an interrupt on counter negative, and another control bit which determines whether the other PMCs cause an interrupt. Because the events for the counters are generally selected in groups, rather than individually, you need to be able to deal with overflow interrupts for a counter you don't otherwise care about.

Performance monitor interrupts can also be generated from the timebase. These occur on 0-1 transitions of bit 0, 8, 12 or 16 (selectable) of the (64-bit) timebase. The timebase frequency is not the same as the CPU core frequency, and depends on the system, not just the CPU (it can be externally clocked). The timebase is guaranteed to have a fixed frequency, even on systems with variable CPU frequency, so the ratio to CPU core frequency can also vary.

> - Are there any PPC64 PMU registers which can only be used by one thread at a time (shared)? Think hyperthreading.

Not as far as I'm aware.

> - Is there a way to stop monitoring without having to modify all used PMC registers?

Yes, there is a "freeze counters" (FC) bit in MMCR0 which will stop all the counters.

--
David Gibson                   | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you.  NOT _the_ _other_
                               | _way_ _around_!
http://www.ozlabs.org/people/dgibson
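The "counter negative" convention David describes implies the usual preload trick for sampling: to take an interrupt every `period` events, the counter is started at 0x80000000 - period, so the sign bit becomes set exactly on the period-th event. A small sketch of the arithmetic only; the function names are made up, and no real PMC/MMCR access is shown:

```c
#include <stdint.h>

/* PowerPC PMCs are 32 bits wide and raise a performance monitor
 * interrupt when the counter goes negative (sign bit set), not on
 * wraparound.  Preload so the sign bit sets after `period` events.
 * Since the sign bit must stay clear until then, the effective
 * maximum period is 2^31: a 31-bit counter. */
static uint32_t pmc_preload(uint32_t period)
{
    return 0x80000000u - period;
}

/* The sign bit doubles as the "this counter overflowed" flag; it must
 * be cleared again to clear the interrupt condition. */
static int pmc_overflowed(uint32_t pmc)
{
    return (pmc & 0x80000000u) != 0;
}
```

This also explains why the upper 33 bits of a 64-bit software counter have to be maintained by the kernel rather than by the hardware.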
From: Stephane E. <er...@hp...> - 2005-03-30 06:18:18
David,

The revised document can be found at:

	http://www.hpl.hp.com/techreports/2004/HPL-2004-200R1.html

This version also explains the PMC/PMD mappings onto actual registers. I will respond later today on the rest.

On Wed, Mar 30, 2005 at 04:05:06PM +1000, David Gibson wrote:

> On Tue, Mar 29, 2005 at 12:07:23PM +0200, Stephane Eranian wrote:
>
> > Sorry for the late reply, but I was travelling all of last week. Thank you for your feedback. That is exactly what I am asking for. See my comments.
> >
> > On Wed, 2005-03-23 at 17:35 +1100, David Gibson wrote:
> >
> > > Stephane, I've been having a closer look at the perfmon spec document (the version dated Dec 21, 2004). Below are a number of points which concern me (this is by no means exhaustive):
> >
> > I have since updated the document, and several of the points you mention below have been fixed/changed.
>
> Ok, where do I find the revised document? I'm afraid the link wasn't obvious to me on the perfmon site.
>
> > > General issues:
> > >
> > > * The multiplexed syscall takes an argument giving the number of parameters, the size and type of which depend on the individual call. Better would be an overall size argument, or even both an argument size and a number of arguments. It's a little harder to process, but at least you can tell how much memory a call will touch without having to know about every individual operation.
> >
> > I would rather see (size, count), because it would help the kernel filter out calls using the wrong data type.
>
> Fair enough.
>
> > > * The method of requesting overflow sampling or notification for a PMD assumes there is a unique PMC associated with that PMD. This is insufficiently general, since it is not naturally true for ppc64 (event selection for the various counters is controlled by a combination of various fields in the registers MMCR0, MMCR1 and MMCRA).
> >
> > That is a good point.
> > It is true that today overflow notification is requested via the PMC and not the PMD. The implementation assumes (wrongly) that PMCx corresponds to PMDx. The flag is recorded in the PMD-related structure, hence it would seem more natural to pass the RANDOM/OVFL_NOTIFY flags via PFM_WRITE_PMDS. I did it via PFM_WRITE_PMCS because I considered those flags part of the configuration of the counters, so they would go with the PMC. For PPC64, it looks like you are in a situation similar to the P4, where multiple config registers are used to control a counter. We could move the flags to PFM_WRITE_PMDS.
>
> I think that would make more sense. Or maybe even a different mechanism entirely. How would you support performance monitor events that aren't counter overflows, for those CPUs that have such?
>
> > > * How widespread is the use of the term "event sets"? Is it perfmon specific, or more widely established? I find the term rather misleading, and would prefer something like "subcontext".
> >
> > I have seen this term used by Phil's PAPI toolkit, where it means the same thing. An event set is a software abstraction which encapsulates the entire PMU state.
>
> Oh well, I guess we're stuck with that one, then.
>
> > > More specific issues:
> > >
> > > PFM_CREATE_CONTEXT
> > >
> > > Altering the calling process's memory map as a side effect is icky. It could also cause problems for (the few) programs which need to take fine-grained control of their memory maps (JVMs?). Much better for the process to map the sample buffer with an explicit mmap() on the context's fd. You could, however, return an offset at which to perform the mmap().
> >
> > That has been fixed in the new rev of the specification. Now the call returns how big the buffer actually is; the application must then explicitly invoke mmap() using that size.
> > This simplifies the implementation and follows the programming model people are most used to.
>
> Ok.
>
> > > PFM_WRITE_PMCS
> > >
> > > The documentation says that PFM_MAX_PMD_BITVECTOR can vary between PMU models, but its value for the current PMU model is not exported anywhere. Varying by architecture doesn't make much sense, since PMU details vary only mildly more between architectures than they do between CPU models within one architecture.
> >
> > PFM_MAX_PMD_BITVECTOR is exported in the perfmon.h header file. At this point, it is provided by each architecture. When the processor architecture is nice, the PMU framework is specified there and it makes the job of software easier. For instance, on Itanium, the architecture says you can have up to 256 PMC and 256 PMD registers. Having this kind of information is very useful to size data structures appropriately. You don't want to have to copy large data structures (think copy_user) if you don't have to. Are you advocating that this be a PMU model-specific size?
>
> Having it per-architecture doesn't really make a lot of sense, since PM units vary only slightly less between CPUs of the same architecture than they do between CPUs of different architectures. The PM unit may well not be defined by the architecture specification (if such exists) at all, so I don't think you can count on there being a definitive limit on the number of PMDs in general.
>
> The greatest number of PMDs on any PowerPC so far is 8, and I'm not aware of any plans for CPUs with more, but it wouldn't surprise me if it happened some day. Since this size can never be changed without breaking the ABI, we would have to leave room for expansion, and there's no real guidance as to how much.
>
> So I think this should either be PM model dependent, or it should be truly global - per-architecture is a bad compromise.
The latter, > obviously, is much simpler to implement. > > > > PFM_START / PFM_START_SET > > > > > > I see no reason for two separate entry points; PFM_START_SET > > > with NULL argument can just leave the default or currently actively > > > running set, as a PFM_START. > > > > > These calls have been merged as PFM_START with an optional argument. > > This optional argument is to maintain backward compatibility with > > the existing 2.6 interface. > > Excellent. > > > > PFM_STOP > > > > > > This could reasonably be folded into PFM_START_SET also, by > > > having a special set id meaning "no set". Obviously the name of the > > > operation would want to be changed, too (PFM_CHANGE_RUNNING?) > > > > > We could find a way to merge START/STOP but then in the argument it > > would have to indicate which of the two operations to perform. Your > > trick is one possibility. Another would be to add a flag to the data > > structure passed. > > > > > PFM_LOAD_CONTEXT > > > > > > I'm not sure I see the point of the load_set argument. What > > > can be accomplished with this that can't be with appropriate use of > > > PFM_START_SET? > > > > > You don't want to merge START and load. that is not because you attach > > to a thread/CPU that you want monitoring to start right away. > > But I think you have a good point. the interface guarantees that on > > PFM_LOAD_CONTEXT, monitoring is stopped. You need explicit START to > > activate. This is true even if you detached while monitoring was active. > > I need to check to see if there is something else involved here. > > Sorry, I don't fully follow what you're saying here (I can't parse the > first sentence, in particular). 
My point is it's not clear to me that > there's anything useful you can accomplish with: > CREATE <do stuff> LOAD START > that can't be done with > CREATE+ATTACH <do stuff> START > > > > PFM_UNLOAD_CONTEXT > > > > > > Could also reasonably be folded into the above, say using > > > load_pid == 0 to request binding the context to no thread at all. > > > Again a name change would be in order (PFM_CHANGE_THREAD?). > > > > > That looks reasonable to me. > > > > > PFM_CREATE_EVTSET / PFM_DELETE_EVTSET / PFM_CHANGE_EVSET > > > > > > Is there really a need to incrementally update the event sets? Would > > > a PFM_SETUP_EVTSETS which acts like PFM_CREATE_EVTSET, but replaces > > > all existing event sets with the ones described suffice? This > > > approach would not only reduce the number of entry points, but could > > > also simplify the kernel's parameter checking. For example at the > > > moment deleting an event set which is referenced by another set's > > > set_id_next must presumably either fail, or alter those other event > > > sets to no longer reference the deleted event set. > > > > > This has been trimmed down to two calls in the new rev: > > PFM_CREATE_EVTSETS and PFM_DELETE_EVTSETS. If the event set > > already exists, PFM_CREATE_EVTSETS updates it. This is useful > > for set0 which always exists. > > > > As for delete/create, those operations can only happen when > > the context is detached. Checking of the validity of the event > > set chain is deferred until PFM_LOAD_CONTEXT because at this > > point it is not possible to modify the sets. If the set_id_next > > is invalid, PFM_LOAD_CONTEXT fails. > > But again, is there a real reason to allow incremental updates? If > there was a single operation which atomically changed all the event > sets it would mean one less entry point, *plus* we could do the error > checking there (earlier error checking is always good). And we > wouldn't even need user allocated id numbers, the array position would > suffice. 
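A minimal sketch of the atomic "replace all event sets" idea discussed above: sets are identified by their array position, and every set_id_next link is validated up front instead of being deferred to PFM_LOAD_CONTEXT. The struct and function names here are hypothetical, not part of the real perfmon ABI.

```c
#include <assert.h>

/* Hypothetical descriptor: the set's id is simply its array index. */
struct evtset_desc {
    int set_id_next;            /* array index of the set to switch to */
};

/* Return 0 if every next-link points at a valid array position, -1
 * otherwise.  With a single atomic setup call, a bad chain can be
 * rejected here, before any existing kernel state is replaced. */
int pfm_setup_evtsets(const struct evtset_desc *sets, int nsets)
{
    int i;
    for (i = 0; i < nsets; i++)
        if (sets[i].set_id_next < 0 || sets[i].set_id_next >= nsets)
            return -1;
    /* ...atomically replace the context's current sets here... */
    return 0;
}
```

The point of the sketch is only the error-checking order: the whole chain is visible at setup time, so validation does not have to wait for load.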
> > > > PFM_GET_DEFAULT_PMCS > > > > > > The usefulness of this operation is not obvious to me. How would you > > > envisage it being used? > > > > > This command is now obsolete. Check the new rev of the document > > for PFM_GETINFO_PMCS and PFM_GETINFO_PMDS. These calls return > > the default values for PMCS and PMDS as well as bitmasks with the > > reserved fields for each register. They also return the actual > > PMD/PMC to actual HW register mappings. > > Oh good. > > > > PFM_GET_FEATURES > > > > > > For one thing this doesn't belong in the multiplexor - it doesn't use > > > the fd, and would be better exported via /proc or /sys. But, in any > > > case, I don't think I've ever seen a subsystem version like this well > > > used, so I'm not sure the operation is a good idea at all. How would > > > you envisage this being used? > > > > > In the new rev, this call has been folded into PFM_GET_CONFIG. > > Ok. > > > > PFM_GET_CONFIG / PFM_SET_CONFIG > > > > > > Again, these definitely don't belong on the multiplexor. They don't > > > use the fd, and since they set the permission regime for the > > > multiplexor itself, they logically belong outside. Under Linux these > > > definitely ought to be sysctls. If this is ever ported to other OSes, > > > I still don't think they belong here. Setting up the permission > > > regime for all the other calls is, I think, a logically OS-specific > > > operation and doesn't belong in the core API. As far as I can tell > > > it's unlikely you would use these operations in the same programs > > > using the rest of perfmon. > > > > > As discussed earlier, on Linux, these operations could as well be > > implemented with sysctls. > > Yes, and I don't think having a cross-platform operation for this is > worthwhile. This is a system administrator operation, not an > operation for the users of perfmon, so I don't think having it > platform specific is a problem at all. 
> > > I will update the document to incorporate your feedback which I found > > very useful. > > > > In terms of porting, I am getting closer to being able to send you > > a skeleton header/C file with the required callbacks. Please let me > > know of any special PPC64 behavior. For instance, looking > > at Opteron and Pentium 4: > > - on counter overflow, does PPC freeze the entire PMU? > > Optional, IIRC, depending on some control bits in MMCR0. > > > - HW counters are not 64-bits, what are the values of > > the upper bits for counters. Should they be all 1 or all 0? > > The counter registers are 32 bits wide, but can only be effectively > used as 31-bit counters (see below). > > > - How is a counter overflow detected? When the full 64 bits > > of the counter overflow or when there is a carry from bit > > n to n+1 for a width of n? > > The interrupts occur on (32 bit) counter negative, rather than > overflow, per se. The only way to determine which counters have > overflowed is to look at the sign bits. Furthermore, the sign bit > must be cleared in order to clear the interrupt condition (hence only > 31-bit counters, effectively). > > Another issue which I ran into for perfctr is that interrupts can't be > individually enabled or disabled for each counter. There is one > control bit which determines if PMC1 generates an interrupt on counter > negative, and another control bit which determines if other PMCs cause > an interrupt. > > Because the events for the counters are generally selected in groups, > rather than individually, you need to be able to deal with overflow > interrupts for a counter you don't otherwise care about. > > Performance monitor interrupts can also be generated from the > timebase. These occur on 0-1 transitions on bit 0, 8, 12 or 16 > (selectable) of the (64 bit) timebase. Timebase frequency is not the > same as CPU core frequency, and depends on the system, not just the > CPU (it can be externally clocked). 
The timebase is guaranteed to > have a fixed frequency, even on systems with variable CPU frequency, > so the ratio to CPU core frequency can also vary. > > > - Are there any PPC64 PMU registers which can only be used by > > one thread at a time (shared). Think hyperthreading. > > Not as far as I'm aware. > > > - Is there a way to stop monitoring without having to modify > > all used PMC registers. > > Yes, there is a "freeze counters" (FC) bit in MMCR0 which will stop > all the counters. > > -- > David Gibson | I'll have my music baroque, and my code > david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ > | _way_ _around_! > http://www.ozlabs.org/people/dgibson -- -Stephane |
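The "counter negative" convention David describes above can be sketched with plain user-space arithmetic. The helper names are invented for illustration and are not part of perfctr or perfmon.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the PowerPC counter-negative convention: counters are 32
 * bits wide but effectively 31-bit, because the interrupt condition is
 * the sign bit becoming set, and the sign bit must be cleared again to
 * clear the interrupt. */

/* Preset value so the counter goes negative after 'period' events
 * (period must be less than 2^31). */
uint32_t pmc_preset(uint32_t period)
{
    return 0x80000000u - period;
}

/* Which counters overflowed?  Only the sign bit says. */
int pmc_is_negative(uint32_t pmc)
{
    return (pmc & 0x80000000u) != 0;
}
```

A sampling tool would write `pmc_preset(period)` into the counter, take the interrupt when the sign bit sets, and rewrite the preset to rearm.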
From: Stephane E. <er...@hp...> - 2005-03-31 12:22:25
David, > > That is a good point. It is true that today overflow notification > > is requested via the PMC and not the PMD. The implementation assumes > > (wrongly) that PMCx corresponds to PMDx. The flag is recorded in > > the PMD related structure. Hence, it would seem more natural to > > pass the flags for RANDOM/OVFL_NOTIFY via PFM_WRITE_PMDS. I did it > > via PFM_WRITE_PMCS because I considered those flags as part of the > > configuration of the counters, hence they would go with PMC. For the > > PPC64, it looks like you are in a situation similar to P4, where > > multiple config registers are used to control a counter. We could > > move the flags to PFM_WRITE_PMDS. > > I think that would make more sense. Or maybe even a different > mechanism entirely. How would you support performance monitor events > that aren't counter overflows, for those CPUs that have such? > Now I see several problems if we make that move. The way tools typically work is that they say "I want to measure event X,Y,Z". Using a support library, they come up with the correct event to counter assignment (Event -> PMC). The PMU configuration registers (perfsel, PMC, ..) are written. The reason the perfmon flags were also set up when writing PMC is because they are part of the configuration as well. The counter register itself is not necessarily accessed, if the default value is good enough. Hence a PFM_WRITE_PMDS was not required. If we move to your approach, the call would become necessary. Now, it is true that no matter what a tool needs to know which PMDs (perfctr, PMD) are associated with the PMCs (perfsel, PMC) used for the measurement. At a minimum, this is needed for reading out the results. A portable tool cannot assume that PMCx corresponds to PMDx. In fact, on PPC and also P4, it seems you need multiple PMCs to set up a single counter. I also assume that the PMC selection determines the PMD. Certain events can be measured on any PMC register. 
No matter what, I think a tool would need to find out the association PMC -> PMD. This can be provided by a user level library. I think your proposal makes sense. The Itanium PMU model is very clean in that regard, making work fairly simple for tools. I guess I will have to revise this in my tools/libraries. Note that if we move all flags to PFM_WRITE_PMDS, that would also move the following other fields: eventid, smpl_pmds[], and reset_pmds[]. Another side effect of this is that the data structure passed to PFM_WRITE_PMDS will grow in size, thereby making the call less efficient (think copy_user). > > > PFM_WRITE_PMCS > > > > The documentation says that the PFM_MAX_PMD_BITVECTOR can vary between > > > PMU models. But what the value of this is for the current PMU model > > > is not exported anywhere. Varying by architecture doesn't make much > > > sense, since PMU model details vary only mildly more between > > > architectures than they do within CPU models of one architecture. > > > > > The PFM_MAX_PMD_BITVECTOR is exported in the perfmon.h header file. > > At this point, it is provided by each architecture. When the > > processor architecture is nice, then the PMU framework is specified > > there and it makes the job of software easier. For instance, on Itanium, > > the architecture says you can have up to 256 PMC and 256 PMD registers. > > Having this kind of information is very useful to size data structures > > appropriately. You don't want to have to copy large data structures > > (think copy_user) if you don't have to. Are you advocating that this > > be a PMU model specific size? > > Having it per-architecture doesn't really make a lot of sense, since > PM units vary only slightly less between CPUs of the same architecture > than they do between CPUs of different architectures. The PM unit may Well, I would cite Pentium III/Pentium M vs. P4/Xeon, that is quite a drastic change. Yet this is inside the same processor family. 
> well not be defined by the architecture specification (if such exists) > at all, so I don't think you can count on there being a definitive > limit on the number of PMDs in general. On Itanium there is an architected limit. That is quite nice. I don't think having a per PMU-model limit is manageable. It would be hard to manage all the variations for the data structures. How would you handle the X86 family that way: one size for PIII, one size for P4? Yet the kernel support files would probably be the same, in fact the same kernel boots on both. > > The greatest number of PMDs on any PowerPC so far is 8, and I'm not > aware of any plans for CPUs with more, but it wouldn't surprise me if > it happened some day. Since this size can never be changed, without > breaking the ABI, we would have to leave room for expansion, and > there's no real guidance as to how much. > > So I think this should either be PM model dependent, or it should be > truly global - per-architecture is a bad compromise. The latter, > obviously, is much simpler to implement. Another thing to take into account is that you may want to use virtual PMDs to access software resources. For instance, take perfctr: there is the TSC (timestamp) that could be mapped to a logical PMD. That way, it would be easy to specify it as part of the registers/resources to record in a sample (via the smpl_pmds[] mask). On Itanium, the debug registers are used by the PMU to restrict the code/data range to monitor. In the old interface I had a specific call to program the IBR/DBR. In the revised document, you will see that I have logical PMCs to do that. It simplifies the code and makes the interface more uniform, after all, in this situation the debug registers are really used to configure the PMU. You can imagine mapping some kernel resources to PMD, such as the amount of free memory or the PID of the current process. To make this work, it is really nice to have an upper bound for the physical PMU registers. 
Then you can add on top. That's what I did for Itanium. Another factor to consider here: the limit we use does not necessarily reflect the actual number of PMU registers. If I used 64 for the limit, that does not necessarily translate into 64 PMC registers. Hardware designers sometimes introduce holes in the namespace of registers because of wiring constraints (just a guess). Yet it would be quite costly for software to try and skip those holes. Hence the bitmasks may have holes in them. Picking a single value would be good. Let's say you pick 256 for max physical PMC and max physical PMD. Assuming there are no big holes in the namespace, I think that is a pretty safe limit. Then logical PMD/PMC could be added above that. Of course, if all you really have is a set of 4 PMCs, then you pay the copy cost for larger than needed data structures. But, the ABI would be preserved when the number of registers grows. > > > PFM_LOAD_CONTEXT > > > > I'm not sure I see the point of the load_set argument. What > > > can be accomplished with this that can't be with appropriate use of > > > PFM_START_SET? > > > > > You don't want to merge START and load. that is not because you attach > > to a thread/CPU that you want monitoring to start right away. > > But I think you have a good point. the interface guarantees that on > > PFM_LOAD_CONTEXT, monitoring is stopped. You need explicit START to > > activate. This is true even if you detached while monitoring was active. > > I need to check to see if there is something else involved here. > > Sorry, I don't fully follow what you're saying here (I can't parse the > first sentence, in particular). My point is it's not clear to me that > there's anything useful you can accomplish with: > CREATE <do stuff> LOAD START > that can't be done with > CREATE+ATTACH <do stuff> START > It depends on what you do in <stuff>. I assume that is where you program the PMU with PFM_WRITE_PMDS/PFM_WRITE_PMCS. 
I think you will see that the first model makes sense if you want to support batching: for(i=0; i < N ; i++) { c = CREATE PFM_WRITE_PMCS(c); PFM_WRITE_PMDS(c); } foreach(c) { ATTACH to target thread PFM_start(c); } As long as the context is not attached, no actual PMU hardware is touched. But you can still program the context (i.e., PMU software state). This becomes interesting for batching context setup. You can imagine a tool that monitors across fork/pthread_create. Creating and setting up a context can be quite costly. You can prepare the work and then dynamically on a fork/pthread_create event you simply have to attach and start and let go. We do have tools that follow across fork and provide per-thread measurement. > > > PFM_CREATE_EVTSET / PFM_DELETE_EVTSET / PFM_CHANGE_EVSET > > > > Is there really a need to incrementally update the event sets? Would > > > a PFM_SETUP_EVTSETS which acts like PFM_CREATE_EVTSET, but replaces > > > all existing event sets with the ones described suffice? This > > > approach would not only reduce the number of entry points, but could > > > also simplify the kernel's parameter checking. For example at the > > > moment deleting an event set which is referenced by another set's > > > set_id_next must presumably either fail, or alter those other event > > > sets to no longer reference the deleted event set. > > > > > This has been trimmed down to two calls in the new rev: > > PFM_CREATE_EVTSETS and PFM_DELETE_EVTSETS. If the event set > > already exists, PFM_CREATE_EVTSETS updates it. This is useful > > for set0 which always exists. > > > > As for delete/create, those operations can only happen when > > the context is detached. Checking of the validity of the event > > set chain is deferred until PFM_LOAD_CONTEXT because at this > > point it is not possible to modify the sets. If the set_id_next > > is invalid, PFM_LOAD_CONTEXT fails. > > But again, is there a real reason to allow incremental updates? 
If > there was a single operation which atomically changed all the event > sets it would mean one less entry point, *plus* we could do the error > checking there (earlier error checking is always good). And we > wouldn't even need user allocated id numbers, the array position would > suffice. > Keep in mind that PFM_CREATE_EVTSETS can be used to create multiple event sets at a time. This command can be called as many times as you want. If the set exists it is modified. You can create and delete event sets at will as long as the context is not attached to anything. The set number determines its position in the list of sets. That list determines the DEFAULT switch order. Letting the user pick the set number could be useful because it may correspond to some indexing scheme. The interface supports an override for the next set: the explicit next set. Why is this useful? This is interesting when the explicit link is pointing backwards. You can thus create sublists of sets. There is an example in the document. This is interesting because each sublist may be used to measure a certain metric. Again, you can prepare all the sublists in advance. Then you attach, point to a sublist and start (PFM_START). After a while, you stop, and restart on another sublist. This saves you the reprogramming of the sets, so you can alternate between sublists much faster. This is kind of an advanced feature, for most tools the basic ordering will be just fine. Checking the validity of the explicit next set when the set is created/modified would impose a programming order for the tool, i.e., you could not point to a set that does not already exist. At the time, I thought this could be better checked at PFM_LOAD time where you know you have the whole setup. > > > PFM_GET_CONFIG / PFM_SET_CONFIG > > > > Again, these definitely don't belong on the multiplexor. 
They don't > > > use the fd, and since they set the permission regime for the > > > multiplexor itself, they logically belong outside. Under Linux these > > > definitely ought to be sysctls. If this is ever ported to other OSes, > > > I still don't think they belong here. Setting up the permission > > > regime for all the other calls is, I think, a logically OS-specific > > > operation and doesn't belong in the core API. As far as I can tell > > > it's unlikely you would use these operations in the same programs > > > using the rest of perfmon. > > > > > As discussed earlier, on Linux, these operations could as well be > > implemented with sysctls. > > Yes, and I don't think having a cross-platform operation for this is > worthwhile. This is a system administrator operation, not an > operation for the users of perfmon, so I don't think having it > platform specific is a problem at all. > Fine with me. I'll switch to a pure sysctl approach then. > > In terms of porting, I am getting closer to being able to send you > > a skeleton header/C file with the required callbacks. Please let me > > know of any special PPC64 behavior. For instance, looking > > at Opteron and Pentium 4: > > - on counter overflow, does PPC freeze the entire PMU? > > Optional, IIRC, depending on some control bits in MMCR0. > Ok, at least there is something. > > - HW counters are not 64-bits, what are the values of > > the upper bits for counters. Should they be all 1 or all 0? > > The counter registers are 32 bits wide, but can only be effectively > used as 31-bit counters (see below). > > > - How is a counter overflow detected? When the full 64 bits > > of the counter overflow or when there is a carry from bit > > n to n+1 for a width of n? > > The interrupts occur on (32 bit) counter negative, rather than > overflow, per se. The only way to determine which counters have > overflowed is to look at the sign bits. 
Furthermore, the sign bit > must be cleared in order to clear the interrupt condition (hence only > 31-bit counters, effectively). Ok, that's fine. > > Another issue which I ran into for perfctr is that interrupts can't be > individually enabled or disabled for each counter. There is one > control bit which determines if PMC1 generates an interrupt on counter > negative, and another control bit which determines if other PMCs cause > an interrupt. I think that's fine also. For 64-bit software emulation you need to have overflow intr enabled for every counter anyway. > > Because the events for the counters are generally selected in groups, > rather than individually, you need to be able to deal with overflow > interrupts for a counter you don't otherwise care about. > > Performance monitor interrupts can also be generated from the > timebase. These occur on 0-1 transitions on bit 0, 8, 12 or 16 > (selectable) of the (64 bit) timebase. Timebase frequency is not the > same as CPU core frequency, and depends on the system, not just the > CPU (it can be externally clocked). The timebase is guaranteed to > have a fixed frequency, even on systems with variable CPU frequency, > so the ratio to CPU core frequency can also vary. > > > - Are there any PPC64 PMU registers which can only be used by > > one thread at a time (shared). Think hyperthreading. > > Not as far as I'm aware. That's good. > > > - Is there a way to stop monitoring without having to modify > > all used PMC registers. > > Yes, there is a "freeze counters" (FC) bit in MMCR0 which will stop > all the counters. > Ok. Thanks for your feedback. I am sure I'll come up with other PMU-specific questions. -- -Stephane |
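The 64-bit software emulation Stephane mentions above can be sketched as follows: each overflow interrupt folds 2^31 into a software-maintained high part and clears the hardware counter's sign bit (which also clears the interrupt condition). The struct and function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit virtual counter layered over an effectively 31-bit
 * hardware counter. */
struct soft_pmd {
    uint64_t high;              /* software-accumulated part */
    uint32_t hw;                /* the (simulated) hardware counter */
};

/* Overflow handler: account for the wrapped 2^31 events in software
 * and clear the sign bit, which also clears the interrupt. */
void soft_pmd_overflow(struct soft_pmd *p)
{
    p->high += 0x80000000u;
    p->hw &= 0x7fffffffu;
}

/* Full 64-bit virtual counter value, as a PFM_READ_PMDS-style call
 * would report it. */
uint64_t soft_pmd_read(const struct soft_pmd *p)
{
    return p->high + p->hw;
}
```

This is why overflow interrupts must stay enabled on every counter: missing one overflow loses 2^31 counts from the virtual value.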
From: David G. <da...@gi...> - 2005-04-01 03:02:30
On Thu, Mar 31, 2005 at 03:58:03AM -0800, Stephane Eranian wrote: > David, > > > > That is a good point. It is true that today overflow notification > > > is requested via the PMC and not the PMD. The implementation assumes > > > (wrongly) that PMCx corresponds to PMDx. The flag is recorded in > > > the PMD related structure. Hence, it would seem more natural to > > > pass the flags for RANDOM/OVFL_NOTIFY via PFM_WRITE_PMDS. I did it > > > via PFM_WRITE_PMCS because I considered those flags as part of the > > > configuration of the counters, hence they would go with PMC. For the > > > PPC64, it looks like you are in a situation similar to P4, where > > > multiple config registers are used to control a counter. We could > > > move the flags to PFM_WRITE_PMDS. > > > > I think that would make more sense. Or maybe even a different > > mechanism entirely. How would you support performance monitor events > > that aren't counter overflows, for those CPUs that have such? > > > > Now I see several problems if we make that move. The way tools > typically work is that they say "I want to measure event X,Y,Z". > Using a support library, they come up with the correct event to > counter assignment (Event -> PMC). The PMU configuration registers > (perfsel, PMC, ..) are written. The reason the perfmon flags were > also set up when writing PMC is because they are part of the > configuration as well. The counter register itself is not > necessarily accessed, if the default value is good enough. Hence a > PFM_WRITE_PMDS was not required. If we move to your approach, the > call would become necessary. Now, it is true that no matter what a > tool needs to know which PMDs (perfctr, PMD) are associated with the > PMCs (perfsel, PMC) used for the measurement. At a minimum, this is > needed for reading out the results. A portable tool cannot assume > that PMCx corresponds to PMDx. In fact, on PPC and also P4, it seems > you need multiple PMCs to set up a single counter. 
And conversely setting one PMC can affect multiple PMDs... > I also assume that > the PMC selection determines the PMD. Certain events can be measured > on any PMC register. No matter what, I think a tool would need to > find out the association PMC -> PMD. This can be provided by a user > level library. > > I think your proposal makes sense. The Itanium PMU model is very clean > in that regard, making work fairly simple for tools. I guess I will have > to revise this in my tools/libraries. > > Note that if we move all flags to PFM_WRITE_PMDS, that would also move > the following other fields: eventid, smpl_pmds[], and reset_pmds[]. Another > side effect of this is that the data structure passed to PFM_WRITE_PMDS will > grow in size thereby making the call less efficient (think copy_user). It might be worth splitting, say, PFM_CONFIGURE_PMDS, which would set the flags and reset values and so forth, away from PFM_WRITE_PMDS which would write the actual values to the PMDs. Presumably one would (usually) only need to call CONFIGURE_PMDS once, so that would remove the overhead of the larger structures from WRITE_PMDS calls used to update the values. > > > > PFM_WRITE_PMCS > > > > > > > > The documentation says that the PFM_MAX_PMD_BITVECTOR can vary between > > > > PMU models. But what the value of this is for the current PMU model > > > > is not exported anywhere. Varying by architecture doesn't make much > > > > sense, since PMU model details vary only mildly more between > > > > architectures than they do within CPU models of one architecture. > > > > > > > The PFM_MAX_PMD_BITVECTOR is exported in the perfmon.h header file. > > > At this point, it is provided by each architecture. When the > > > processor architecture is nice, then the PMU framework is specified > > > there and it makes the job of software easier. For instance, on Itanium, > > > the architecture says you can have up to 256 PMC and 256 PMD registers. 
> > > Having this kind of information is very useful to size data structures > > > appropriately. You don't want to have to copy large data structures > > > (think copy_user) if you don't have to. Are you advocating that this > > > be a PMU model specific size? > > > > Having it per-architecture doesn't really make a lot of sense, since > > PM units vary only slightly less between CPUs of the same architecture > > than they do between CPUs of different architectures. The PM unit may > > Well, I would cite Pentium III/Pentium M vs. P4/Xeon, that is quite a > drastic change. Yet this is inside the same processor family. Exactly my point. > > well not be defined by the architecture specification (if such exists) > > at all, so I don't think you can count on there being a definitive > > limit on the number of PMDs in general. > > On Itanium there is an architected limit. That is quite nice. I don't > think having a per PMU-model limit is manageable. It would be hard to > manage all the variations for the data structures. How would you handle > the X86 family that way: one size for PIII, one size for P4? Yet the > kernel support files would probably be the same, in fact the same > kernel boots on both. I tend to agree, so I think a single universal limit probably makes more sense. 256 seems like a fairly reasonable choice. Out of interest, how many PMDs do actual Itanium CPUs have? None of the CPUs that I'm familiar with have anything close to 256 PMDs. 8 is the largest number on ppc32 or ppc64 systems... or actually, 9, if you count the timebase, I guess. > > The greatest number of PMDs on any PowerPC so far is 8, and I'm not > > aware of any plans for CPUs with more, but it wouldn't surprise me if > > it happened some day. Since this size can never be changed, without > > breaking the ABI, we would have to leave room for expansion, and > > there's no real guidance as to how much. 
> > > > So I think this should either be PM model dependent, or it should be > > truly global - per-architecture is a bad compromise. The latter, > > obviously, is much simpler to implement. > > Another thing to take into account is that you may want to use > virtual PMDs to access software resources. For instance, take > perfctr: there is the TSC (timestamp) that could be mapped to a > logical PMD. That way, it would be easy to specify it as part of the > registers/resources to record in a sample (via the smpl_pmds[] mask). > On Itanium, the debug registers are used by the PMU to restrict the > code/data range to monitor. In the old interface I had a > specific call to program the IBR/DBR. In the revised document, you will > see that I have logical PMCs to do that. It simplifies the code and makes > the interface more uniform, after all, in this situation the debug registers > are really used to configure the PMU. You can imagine mapping some > kernel resources to PMD, such as the amount of free memory or the PID of the > current process. To make this work, it is really nice to have an upper > bound for the physical PMU registers. Then you can add on top. That's > what I did for Itanium. > > Another factor to consider here: the limit we use does not necessarily > reflect the actual number of PMU registers. If I used 64 for the limit, > that does not necessarily translate into 64 PMC registers. Hardware designers > sometimes introduce holes in the namespace of registers because of wiring > constraints (just a guess). Yet it would be quite costly for software to > try and skip those holes. Hence the bitmasks may have holes in them. Hrm... I doubt it would really be all that costly to pack the holes, especially when we have to check for software/virtualized PMDs in there as well. 
Of course, I am biased by ppc where we need to use switch statements to access the registers, even though they are contiguous (the special purpose register number is part of the instruction opcode, and can't be given indirectly). > Picking a single value would be good. Let's say you pick 256 for max > physical PMC and max physical PMD. Assuming there are no big holes > in the namespace, I think that is a pretty safe limit. Then logical > PMD/PMC could be added above that. Of course, if all you really have is > a set of 4 PMCs, then you pay the copy cost for larger than needed > data structures. But, the ABI would be preserved when the number > of registers grows. Indeed. > > > > PFM_LOAD_CONTEXT > > > > > > I'm not sure I see the point of the load_set argument. What > > > > can be accomplished with this that can't be with appropriate use of > > > > PFM_START_SET? > > > > > > > You don't want to merge START and load. that is not because you attach > > > to a thread/CPU that you want monitoring to start right away. > > > But I think you have a good point. the interface guarantees that on > > > PFM_LOAD_CONTEXT, monitoring is stopped. You need explicit START to > > > activate. This is true even if you detached while monitoring was active. > > > I need to check to see if there is something else involved here. > > > > Sorry, I don't fully follow what you're saying here (I can't parse the > > first sentence, in particular). My point is it's not clear to me that > > there's anything useful you can accomplish with: > > CREATE <do stuff> LOAD START > > that can't be done with > > CREATE+ATTACH <do stuff> START > > > It depends on what you do in <stuff>. I assume that is where you program > the PMU with PFM_WRITE_PMDS/PFM_WRITE_PMCS. 
I think you will see that the > first model makes sense if you want to support batching: > > for(i=0; i < N ; i++) { > c = CREATE > PFM_WRITE_PMCS(c); > PFM_WRITE_PMDS(c); > } > foreach(c) { > ATTACH to target thread > PFM_start(c); > } Sure, I can see the appeal of this. But is there a compelling reason we need to support this way of doing it, rather than: for (i=0; i<N; i++) { c = CREATE/ATTACH; WRITE_PMCS(c); WRITE_PMDS(c); PFM_START(c); <wait for monitored stuff to happen> PFM_STOP(c); } > As long as the context is not attach no acutal PMU hardware is touched. > But you can still program the context (i.e., PMU software state). This becomes > interesting for batching context setup. You can imagine a tool that monitors > across fork/pthread_create. Creating and setting up a context can be quite costly. > You can prepare the work and then dynamically on fork/pthread_create event > you simply have to attach and start and let go. We do have tools that do > follow across fork and provide per-thread measurement. Hmm.. ok. I think you've pretty much convinced me. > > > > PFM_CREATE_EVTSET / PFM_DELETE_EVTSET / PFM_CHANGE_EVSET > > > > > > > > Is there really a need to incrementally update the event sets? Would > > > > a PFM_SETUP_EVTSETS which acts like PFM_CREATE_EVTSET, but replaces > > > > all existing event sets with the ones described suffice. This > > > > approach would not only reduce the number of entry points, but could > > > > also simplify the kernel's parameter checking. For example at the > > > > moment deleting an event set which is reference by another sets > > > > set_id_next must presumably either fail, or alter those other event > > > > sets to no longer reference the deleted event set. > > > > > > > This has been trimmed down to two calls in the new rev: > > > PFM_CREATE_EVTSETS and PFM_DELETE_EVTSETS. If the event set > > > already exist, PFM_CREATE_EVTSETS updates it. This is useful > > > for set0 which always exists. 
> > > > > > As for delete/create, those operations can only happen when > > > the context is detached. Checking of the validity of the event > > > set chain is deferred until PFM_LOAD_CONTEXT because at this > > > point it is not possible to modify the sets. If there set_id_next > > > is invalid, PFM_LOAD_CONTEXT fails. > > > > But again, is there a real reason to allow incremental updates? If > > there was a single operation which atomically changed all the event > > sets it would mean one less entry point, *plus* we could do the error > > checking there (earlier error checking is always good). And we > > wouldn't even need user allocated id numbers, the array position would > > suffice. > > Keep in mind that PFM_CREATE_EVTSETS can be used to create multiple > event sets at a time. This command can be called as many times > as you want. If the set exists it is modified. You can create and > delete event sets at will as long as the context is not attached. > to anything. The set number determines its position in the list > of sets. That list determines the DEFAULT switch order. Letting > the user pick set number could be useful because it may correspond > to some indexing scheme. > > The interface supports an override for the > next set, this is the explicit next set. Why is this useful? This is > interesting when the explicit link is pointing backwards. You can > thus create sublists of sets. There is an example in the document. > This is interesting because each sublist may be used to measure > certain metric. Again, you can prepare all the sublists in advance. > Then you attach, point the a sublist and start (PFM_START). After > a while, you stop, and restart on another sublist. This saves you > the reprograming of the sets, you can therefore alternate between > sublists much faster. This is kind of an advanced feature, for most > tool the basic ordering will eb just fine. 
> Checking the validity of the explicit next set when the set > is created/modified would impose an programming order for the > tool, i.e., you could not point to a set that does not already exist. > At the time, I thought this could be better checked at PFM_LOAD time > where you know you have the whole setup. We seem to be talking at cross purposes here. I'm not suggesting any change to the features of event sets once they're established, or at what time they can be set up. I'm just suggesting that it might be simpler to require all the event sets to be created simultaneously, rather than allowing individual, or subsets of the event sets to be created and deleted at will. > > > > PFM_GET_CONFIG / PFM_SET_CONFIG > > > > > > > > Again, these definitely don't belong on the multiplexor. They don't > > > > use the fd, and since they set the permission regime for the > > > > mutliplexor itself, they logically belong outside. Under Linux these > > > > definitely ought to be sysctls. If this is ever ported to other OSes, > > > > I still don't think they belong here. Setting up the permission > > > > regime for all the other calls is, I think a logically OS-specific > > > > operation and doesn't belong in the core API. As far as I can tell > > > > it's unlikely you would use these operations in the same programs > > > > using the rest of perfmon. > > > > > > > As discussed earlier, on Linux, these operations could as well be > > > implemented with sysctls. > > > > Yes, and I don't think having a cross-platform operation for this is > > worthwhile. This is a system administrator operation, not an > > operation for the users of perfmon, so I don't think having it > > platform specific is a problem at all. > > > Fine with me. I'll switch to a pure sysctl approach then. > > > > In terms of porting, I am getting closer to being able to send you > > > a skeleton header/C file with the required callbacks. Please let me > > > know of any special PPC64 special behavior. 
For instance, looking > > > at Opteron and Pentium 4: > > > - on counter overflow, does PPC freeze the entire PMU > > > > Optional, IIRC, depending on some control bits in MMCR0. > > > Ok, at least there is something. > > > > - HW counters are not 64-bits, what are the values of > > > the upper bits for counters. Should they be all 1 or all 0. > > > > The counter registers are 32 bits wide, but can only be effectively > > used as 31-bit counters (see below). > > > > > - How is a counter overflow detected? When the full 64 bits > > > of the counter overflow or when there is a carry from bit > > > n to n+1 for a width on n. > > > > The interrupts occur on (32 bit) counter negative, rather than > > overflow, per se. The only way to determine which counters have > > overflowed is to look at the sign bits. Furthermore, the sign bit > > must be cleared in order to clear the interrupt condition (hence only > > 31-bit counters, effectively). > > Ok, that's fine. > > > > > Another issue which I ran into for perfctr is that interrupts can't be > > individually enabled or disabled for each counter. There is one > > control bit which determines if PMC1 generates an interrupt on counter > > negative, and another control bit which determines if other PMCs cause > > an interrupt. > > I think that's fine also. For 64-bit software emulation you need to have > overflow intr enabled for every counter anyway. > > > > > Because the events for the counters are generally selected in groups, > > rather than individually, you need to be able to deal with overflow > > interrupts for a counter you don't otherwise care about. > > > > Performance monitor interrupts can also be generated from the > > timebase. These occur on 0-1 transitions on bit 0, 8, 12 or 16 > > (selectable) of the (64 bit) timebase. Timebase frequency is not the > > same as CPU core frequency, and depends on the system, not just the > > CPU (it can be externally clocked). 
The timebase is guaranteed to > > have a fixed frequency, even on systems with variable CPU frequency, > > so the ratio to CPU core frequency can also vary. > > > > > - Are there any PPC64 PMU registers which can only be used by > > > one thread at a time (shared). Think hyperthreading. > > > > Not as far as I'm aware. > > That's good. > > > > > > - Is there a way to stop monitoring without having to modify > > > all used PMC registers. > > > > Yes, there is a "freeze counters" (FC) bit in MMCR0 which will stop > > all the counters. > > > > Ok. > > Thanks for your feedback. i am sure I'll come up with other PMU-specific > questions. > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson |
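The PPC counter-negative behaviour described above (interrupt when the sign bit sets, sign bit must be cleared to dismiss the interrupt, hence effectively 31-bit counters) can be sketched as follows. The helper names and the fold-and-clear scheme are illustrative only, not actual perfmon or kernel code:

```c
#include <stdint.h>

/* PPC PMCs are 32 bits wide, but the interrupt fires on "counter
 * negative" (sign bit set), so only 31 bits are effectively usable.
 * Hypothetical helpers for extending one PMC to a 64-bit software
 * counter. */

#define PMC_SIGN_BIT 0x80000000u

/* The only way to see which counter overflowed: check its sign bit. */
static int pmc_overflowed(uint32_t hw)
{
    return (hw & PMC_SIGN_BIT) != 0;
}

/* Accumulate the hardware value (which includes the 2^31 carried in
 * the sign bit) into the 64-bit software total, and return the value
 * to write back.  Writing back 0 clears the sign bit, which is what
 * dismisses the interrupt condition. */
static uint32_t pmc_fold(uint32_t hw, uint64_t *sw_total)
{
    *sw_total += hw;
    return 0;
}
```

Software 64-bit emulation then needs overflow interrupts enabled on every counter, which is why the coarse per-group interrupt enables are tolerable, as noted above.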
From: Stephane E. <er...@hp...> - 2005-04-04 12:14:30
David, > > Now I see several problems if we maek that move. The way tools > > typically work is that they say "I want to measure event X,Y,Z". > > Using a support library, they come up with the correct event to > > counter assignment (Event -> PMC). The PMU configuration registers > > (perfsel, PMC, ..) are written. The reason the perfmon flags where > > also setup when writing PMC is because, there a re part of the > > configuration as well. The counter register itself is not > > necessarily accessed, if default value is good enough. Hence a > > PFM_WRITE_PMDS was not required. If we move to your approach, the > > call woule become necessary. Now, it is true that no matter what a > > tool needs to know which PMD (perfctr, PMD) are associated with the > > PMC (perfsel, PMC) used for the measurement. At a minimum, this is > > needed for reading out the results. A portable tool cannot assume > > that PMCx corresponds to PMDx. In fact, on PPC and also P4, it seems > > you need multiple PMC to setup a single counter. > > And conversely setting one PMC can affect multiple PMDs... > That's true even on Itanium. Take the Branch Trace Buffer for instance. > > Note that if we move all flags to PFM_WRITE_PMDS, that would also move > > the following other fields: eventid, smpl_pmds[], and reset_pmds[]. Another > > side effect of this is that the dats structure passed to PFM_WRITE_PMDS will > > grow in size thereby making the call less efficient (think copy_user). > > It might be worth splitting, say, PFM_CONFIGURE_PMDS, which would set > the flags and reset values and so forth, away from PFM_WRITE_PMDS > which would write the actual values to the PMDs. Presumably one would > (usually) only need to call CONFIGURE_PMDS once, so that would remove > the overhead of the larger structures from WRITE_PMDS calls used to > update the values. > I have made the change now and the impact does not appear to be that big, at least on Itanium. 
> > I tend to agree, so I think a single universal limit probably makes > more sense. 256 seems like a fairly reasonable choice. > So what if we fix it to 256 for actual PMU hardware registers? Then anything above this is either other hardware registers or software state. To make things nicely aligned, we could add up to 64 bits on top of the 256. If you look at the new document, you'll see that this is how I do this for Itanium. For PMC, I have 0-256 reserved for actual PMC, 256-272 is for IBR and DBR (debug registers). For PMD, I have 0-256 for actual PMD registers. Would that fit the PowerPC model? It seems PowerPC, like the P4/Xeon, does not use indexed registers for the PMU. Hence a somewhat more complicated mapping must be found. > > Another factor to consider here. The limit we use does not necessarily > > reflect the actual number of PMU registers. If I used 64 for the limit, > > that does not necessarily translate into 64 PMC registers. Hardware designers > > sometimes introduce holes in the namespace of registers because of wiring > > constraint (just a guess). Yet it would be quite costly for software to > > try and skip those holes. Hence the bitmask may have holes in them. > > Hrm... I doubt it would really be all that costly to pack the wholes, > especially when we have to check for software/virtualized PMDs in > there as well. Of course, I am biased by ppc where we need to use > switch statements to access the registers, even though they are > contiguous (the special purpose register number is part of the > instruction opcode, and can't be given indirectly). > Oops, that's yet another difficulty... > Sure, I can see the appeal of this. But is there a compelling reason > we need to support this way of doing it, rather than: > > for (i=0; i<N; i++) { > c = CREATE/ATTACH; > WRITE_PMCS(c); > WRITE_PMDS(c); > PFM_START(c); > <wait for monitored stuff to happen> > PFM_STOP(c); > } > > > > As long as the context is not attach no acutal PMU hardware is touched. 
> > But you can still program the context (i.e., PMU software state). This becomes > > interesting for batching context setup. You can imagine a tool that monitors > > across fork/pthread_create. Creating and setting up a context can be quite costly. > > You can prepare the work and then dynamically on fork/pthread_create event > > you simply have to attach and start and let go. We do have tools that do > > follow across fork and provide per-thread measurement. > > Hmm.. ok. It think you've pretty much convinced me. > Excellent. > > This is interesting because each sublist may be used to measure > > certain metric. Again, you can prepare all the sublists in advance. > > Then you attach, point the a sublist and start (PFM_START). After > > a while, you stop, and restart on another sublist. This saves you > > the reprograming of the sets, you can therefore alternate between > > sublists much faster. This is kind of an advanced feature, for most > > tool the basic ordering will eb just fine. > > Checking the validity of the explicit next set when the set > > is created/modified would impose an programming order for the > > tool, i.e., you could not point to a set that does not already exist. > > At the time, I thought this could be better checked at PFM_LOAD time > > where you know you have the whole setup. > > We seem to be talking at cross purposes here. I'm not suggesting any > change to the features of event sets once they're established, or at > what time they can be set up. I'm just suggesting that it might be > simpler to require all the event sets to be created simultaneously, > rather than allowing individual, or subsets of the event sets to be > created and deleted at will. > But that's not the natural way the interface is designed. It does not mandate that you issue only a single PFM_WRITE_PMDS or PFM_WRITE_PMCS. So why should it be different for PFM_CREATE_EVTSETS. 
In fact, it may make it easier on applications which are highly modularized where each module contributes its part of the measurement without knowing about the others. This allows for incremental updates. -- -Stephane |
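The deferred checking discussed earlier in the thread, where a set may carry a set_id_next link to a set that does not exist yet, and the whole chain is only validated at PFM_LOAD_CONTEXT time, could look roughly like the sketch below. The structure and names are illustrative, not the actual perfmon types:

```c
#include <stddef.h>

/* Hypothetical sketch of deferred event-set chain validation:
 * dangling set_id_next links are tolerated while the context is
 * detached; the chain is only checked at PFM_LOAD_CONTEXT. */

#define PFM_MAX_SETS 8
#define PFM_NO_SET   (-1)

struct evt_set {
    int in_use;        /* has this set been created? */
    int set_id_next;   /* explicit next set, or PFM_NO_SET for default order */
};

/* Returns 0 if every in-use set points at an existing set, -1
 * otherwise.  This is the check a PFM_LOAD_CONTEXT-time validation
 * might perform; on failure the load would be refused. */
static int validate_set_chain(const struct evt_set *sets, int n)
{
    for (int i = 0; i < n; i++) {
        if (!sets[i].in_use || sets[i].set_id_next == PFM_NO_SET)
            continue;
        int next = sets[i].set_id_next;
        if (next < 0 || next >= n || !sets[next].in_use)
            return -1;
    }
    return 0;
}
```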
From: David G. <da...@gi...> - 2005-04-05 04:29:11
On Mon, Apr 04, 2005 at 04:50:30AM -0700, Stephane Eranian wrote: > David, > > > > Now I see several problems if we maek that move. The way tools > > > typically work is that they say "I want to measure event X,Y,Z". > > > Using a support library, they come up with the correct event to > > > counter assignment (Event -> PMC). The PMU configuration registers > > > (perfsel, PMC, ..) are written. The reason the perfmon flags where > > > also setup when writing PMC is because, there a re part of the > > > configuration as well. The counter register itself is not > > > necessarily accessed, if default value is good enough. Hence a > > > PFM_WRITE_PMDS was not required. If we move to your approach, the > > > call woule become necessary. Now, it is true that no matter what a > > > tool needs to know which PMD (perfctr, PMD) are associated with the > > > PMC (perfsel, PMC) used for the measurement. At a minimum, this is > > > needed for reading out the results. A portable tool cannot assume > > > that PMCx corresponds to PMDx. In fact, on PPC and also P4, it seems > > > you need multiple PMC to setup a single counter. > > > > And conversely setting one PMC can affect multiple PMDs... > > > That's true even on Itanium. Take the Branch Trace Buffer for instance. > > > > Note that if we move all flags to PFM_WRITE_PMDS, that would also move > > > the following other fields: eventid, smpl_pmds[], and reset_pmds[]. Another > > > side effect of this is that the dats structure passed to PFM_WRITE_PMDS will > > > grow in size thereby making the call less efficient (think copy_user). > > > > It might be worth splitting, say, PFM_CONFIGURE_PMDS, which would set > > the flags and reset values and so forth, away from PFM_WRITE_PMDS > > which would write the actual values to the PMDs. Presumably one would > > (usually) only need to call CONFIGURE_PMDS once, so that would remove > > the overhead of the larger structures from WRITE_PMDS calls used to > > update the values. 
> > > I have made the change now and the impact does not appear to big that big > at least on Itanium. Ok. > > I tend to agree, so I think a single universal limit probably makes > > more sense. 256 seems like a fairly reasonable choice. > > So what about we fix it to 256 for actual PMU hardware registers. Then > anything above this is either other hardware registers or software state. > To make things nicely aligned, we could add up to 64 bit on top of the 256. > If you look at the new document, you'll see that this is how I do this > for Itanium. For PMC, I have 0-256 reserved for actual PMC, 256-272 is for > IBR and DBR (debug registers). For PMD, I have 0-256 for actual PMD registers. > Would that fit the PowerPC model? It seem PowerPC like P4.Xeon does not > use indexed registers for PMU. Hence a somewhat more complicated mapping > must be found. Ok, so the restriction would be that only things below 256 could be triggered for the reset bitmaps and so forth, but there could be things numbered above that? > > > Another factor to consider here. The limit we use does not necessarily > > > reflect the actual number of PMU registers. If I used 64 for the limit, > > > that does not necessarily translate into 64 PMC registers. Hardware designers > > > sometimes introduce holes in the namespace of registers because of wiring > > > constraint (just a guess). Yet it would be quite costly for software to > > > try and skip those holes. Hence the bitmask may have holes in them. > > > > Hrm... I doubt it would really be all that costly to pack the wholes, > > especially when we have to check for software/virtualized PMDs in > > there as well. Of course, I am biased by ppc where we need to use > > switch statements to access the registers, even though they are > > contiguous (the special purpose register number is part of the > > instruction opcode, and can't be given indirectly). > > > Oops, that's yet another difficulty... It's not that big a deal. 
There aren't that many registers, so a switch isn't too bad. > > Sure, I can see the appeal of this. But is there a compelling reason > > we need to support this way of doing it, rather than: > > > > for (i=0; i<N; i++) { > > c = CREATE/ATTACH; > > WRITE_PMCS(c); > > WRITE_PMDS(c); > > PFM_START(c); > > <wait for monitored stuff to happen> > > PFM_STOP(c); > > } > > > > > > > As long as the context is not attach no acutal PMU hardware is touched. > > > But you can still program the context (i.e., PMU software state). This becomes > > > interesting for batching context setup. You can imagine a tool that monitors > > > across fork/pthread_create. Creating and setting up a context can be quite costly. > > > You can prepare the work and then dynamically on fork/pthread_create event > > > you simply have to attach and start and let go. We do have tools that do > > > follow across fork and provide per-thread measurement. > > > > Hmm.. ok. It think you've pretty much convinced me. > > > Excellent. > > > > This is interesting because each sublist may be used to measure > > > certain metric. Again, you can prepare all the sublists in advance. > > > Then you attach, point the a sublist and start (PFM_START). After > > > a while, you stop, and restart on another sublist. This saves you > > > the reprograming of the sets, you can therefore alternate between > > > sublists much faster. This is kind of an advanced feature, for most > > > tool the basic ordering will eb just fine. > > > Checking the validity of the explicit next set when the set > > > is created/modified would impose an programming order for the > > > tool, i.e., you could not point to a set that does not already exist. > > > At the time, I thought this could be better checked at PFM_LOAD time > > > where you know you have the whole setup. > > > > We seem to be talking at cross purposes here. 
I'm not suggesting any > > change to the features of event sets once they're established, or at > > what time they can be set up. I'm just suggesting that it might be > > simpler to require all the event sets to be created simultaneously, > > rather than allowing individual, or subsets of the event sets to be > > created and deleted at will. > > > But that's not the natural way the interface is designed. It does > not mandate that you issue only a single PFM_WRITE_PMDS or PFM_WRITE_PMCS. > So why should it be different for PFM_CREATE_EVTSETS. In fact, it may make > it easier on applications which are highly modularized where each module > contribute its part of the measurement without knowing about the others. > This allows for incremental updates. I guess. It's just that I can see compelling reasons why incremental WRITE_PMDS and WRITE_PMCS are useful, but the same is not true for CREATE_EVTSETS. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson |
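Since the ppc SPR number is encoded in the mfspr/mtspr opcode, the register access David mentions really does come down to a switch over compile-time constants. A sketch of the index-to-SPR mapping, using the POWER4-class SPR numbers as defined in the Linux kernel's asm/reg.h (the pfm_ name is made up); a real accessor would do mfspr(SPRN_PMCn) in each arm rather than return the number:

```c
/* ppc64 privileged performance monitor counter SPRs (POWER4-class
 * numbering, matching the Linux kernel's <asm/reg.h>). */
#define SPRN_PMC1  787
#define SPRN_PMC2  788
#define SPRN_PMC3  789
#define SPRN_PMC4  790
#define SPRN_PMC5  791
#define SPRN_PMC6  792
#define SPRN_PMC7  793
#define SPRN_PMC8  794

/* Map a counter index (1..8) to its SPR number.  In real code each
 * case would instead be "return mfspr(SPRN_PMCn);", because the SPR
 * number must be a compile-time constant in the instruction. */
static int pfm_pmc_spr(unsigned int n)
{
    switch (n) {
    case 1: return SPRN_PMC1;
    case 2: return SPRN_PMC2;
    case 3: return SPRN_PMC3;
    case 4: return SPRN_PMC4;
    case 5: return SPRN_PMC5;
    case 6: return SPRN_PMC6;
    case 7: return SPRN_PMC7;
    case 8: return SPRN_PMC8;
    default: return -1;
    }
}
```

With so few registers, the switch is cheap whether or not the logical namespace is packed, which is the point made above.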
From: Stephane E. <er...@hp...> - 2005-04-05 22:39:27
David, > > > I tend to agree, so I think a single universal limit probably makes > > > more sense. 256 seems like a fairly reasonable choice. > > > > So what about we fix it to 256 for actual PMU hardware registers. Then > > anything above this is either other hardware registers or software state. > > To make things nicely aligned, we could add up to 64 bit on top of the 256. > > If you look at the new document, you'll see that this is how I do this > > for Itanium. For PMC, I have 0-256 reserved for actual PMC, 256-272 is for > > IBR and DBR (debug registers). For PMD, I have 0-256 for actual PMD registers. > > Would that fit the PowerPC model? It seem PowerPC like P4.Xeon does not > > use indexed registers for PMU. Hence a somewhat more complicated mapping > > must be found. > > Ok, so the restriction would be that only things below 256 could be > triggered for the reset bitmaps and so forth, but there could be > things numbered above that? > I think we would be way on the safe side with: PFM_MAX_PMCS=320 (256+64) PFM_MAX_PMDS=320 (256+64) That would cover the IA-64 architected PMU and most others. All 320 PMDs would be treated equally. They could be used in the smpl_pmds bitmask but also in the reset_pmds bitmask, but there would be no guarantee it would have an effect. Suppose a logical PMD maps to the pid of the current task. You could include it in smpl_pmds but it would not make sense in reset_pmds. I don't know enough about PowerPC to figure out if it would be hard to come up with PMC/PMD mappings that would be simple. Do you have any ideas? > > > Hrm... I doubt it would really be all that costly to pack the wholes, > > > especially when we have to check for software/virtualized PMDs in > > > there as well. Of course, I am biased by ppc where we need to use > > > switch statements to access the registers, even though they are > > > contiguous (the special purpose register number is part of the > > > instruction opcode, and can't be given indirectly). 
> > > > > Oops, that's yet another difficulty... > > It's not that big a deal. There aren't that many registers, so a > switch isn't too bad. > Well, even on Itanium we do this for PMC because of actual PMC versus IBR/DBR (debug registers). Yes, it's a small switch-case. > > But that's not the natural way the interface is designed. It does > > not mandate that you issue only a single PFM_WRITE_PMDS or PFM_WRITE_PMCS. > > So why should it be different for PFM_CREATE_EVTSETS. In fact, it may make > > it easier on applications which are highly modularized where each module > > contribute its part of the measurement without knowing about the others. > > This allows for incremental updates. > > I guess. It's just that I can see compelling reasons why incremental > WRITE_PMDS and WRITE_PMCS are useful, but the same is not true for > CREATE_EVTSETS. > In practice you are probably right. Yet it feels strange to somehow restrict the uses of CREATE_EVTSETS. Related to event sets and the register virtual mapping that is done by perfctr: the same kind of mapping could be provided per event set. It would be possible to return the address at which a set is visible. That would be automatic remapping. Of course, that goes against the model which I recently changed, whereby the user must call mmap() explicitly to map the sampling buffer. But it would be hard to reuse the same call to map PMD registers of a set. There is only one file descriptor per context. So we need to find another trick to indicate which set to mmap. I think we could find a nice trick with the mmap offset. Offset=0 means sampling buffer, offset=1 means set0, offset=2 means set1 and so forth. Do you have any ideas on this? -- -Stephane |
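For reference, the PFM_MAX_PMCS/PFM_MAX_PMDS = 320 proposal above works out to five 64-bit words per register bitmask, with everything at index 256 and above being a logical/software register. A sketch with hypothetical helper names:

```c
#include <stdint.h>

/* Fixed-size register namespace as proposed: 0-255 for physical PMU
 * registers, 256-319 for logical/software registers.  Bitmasks such
 * as smpl_pmds[] or reset_pmds[] then need five 64-bit words. */

#define PFM_MAX_PMDS        320
#define PFM_PMD_BV_WORDS    ((PFM_MAX_PMDS + 63) / 64)
#define PFM_FIRST_SOFT_PMD  256   /* everything above is a logical PMD */

typedef uint64_t pfm_pmd_bv_t[PFM_PMD_BV_WORDS];

static void pfm_bv_set(pfm_pmd_bv_t bv, unsigned int pmd)
{
    bv[pmd / 64] |= 1ull << (pmd % 64);
}

static int pfm_bv_test(const pfm_pmd_bv_t bv, unsigned int pmd)
{
    return (int)((bv[pmd / 64] >> (pmd % 64)) & 1);
}

/* Logical PMDs (e.g. a pid pseudo-register) may appear in smpl_pmds
 * but have no guaranteed effect in reset_pmds, as noted above. */
static int pfm_pmd_is_soft(unsigned int pmd)
{
    return pmd >= PFM_FIRST_SOFT_PMD;
}
```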
From: David G. <da...@gi...> - 2005-04-06 02:14:51
On Tue, Apr 05, 2005 at 03:14:59PM -0700, Stephane Eranian wrote: > David, > > > > I tend to agree, so I think a single universal limit probably makes > > > > more sense. 256 seems like a fairly reasonable choice. > > > > > > So what about we fix it to 256 for actual PMU hardware registers. Then > > > anything above this is either other hardware registers or software state. > > > To make things nicely aligned, we could add up to 64 bit on top of the 256. > > > If you look at the new document, you'll see that this is how I do this > > > for Itanium. For PMC, I have 0-256 reserved for actual PMC, 256-272 is for > > > IBR and DBR (debug registers). For PMD, I have 0-256 for actual PMD registers. > > > Would that fit the PowerPC model? It seem PowerPC like P4.Xeon does not > > > use indexed registers for PMU. Hence a somewhat more complicated mapping > > > must be found. > > > > Ok, so the restriction would be that only things below 256 could be > > triggered for the reset bitmaps and so forth, but there could be > > things numbered above that? > > > I think we would be way of the safe side with: > PFM_MAX_PMCS=320 (256+64) > PFM_MAX_PMDS=320 (256+64) Ok. I suspect it will end up being massive overkill for nearly every CPU we ever deal with, but who cares, really. > That would cover IA-64 architected PMU and most others. > all 320 PMDS would be treated equal. They could be used in > the smpl_pmds bitmask but also in the reset_pmds bitmask but > tere would be no guarantee it would have an effect. Suppose > a logical PMD maps to the pid of the current task. You could > include it in smpl_pmds but it would not make sense in reset_pmds. > > I don't know enough about PowerPC to figure out if it would be ard > to come up with PMC/PMD mappings that would be simple. Do you have > any ideas? Well, there are so few, I don't think we need to be particularly clever. 
I suppose we could use the SPR numbers, minus some offset (perfctr does this in the latest versions), but I was thinking something simple, say: PMCs: 0: MMCR0 1: MMCR1 2: MMCR2 (ppc32 only) 3: MMCRA (ppc64 only) PMDs: 0: timebase 1: PMC1 2: PMC2 ... 8: PMC8 (The PowerPC documentation's use of PMC for "Performance Monitor Counter" makes this look a little confusing). Incidentally, how were you planning to implement the perfctr-like virtualized tsc and mmap() based sampling you were talking about? That could have an impact on how we do things here. > > > > Hrm... I doubt it would really be all that costly to pack the wholes, > > > > especially when we have to check for software/virtualized PMDs in > > > > there as well. Of course, I am biased by ppc where we need to use > > > > switch statements to access the registers, even though they are > > > > contiguous (the special purpose register number is part of the > > > > instruction opcode, and can't be given indirectly). > > > > > > > Oops, that's yet another difficulty... > > > > It's not that big a deal. There aren't that many registers, so a > > switch isn't too bad. > > Well, even on Itanium we do this for PMC because of actual PMC versus > IBR/DBR (debug registers). Yes, it's a small switch-case. > > > > But that's not the natural way the interface is designed. It does > > > not mandate that you issue only a single PFM_WRITE_PMDS or PFM_WRITE_PMCS. > > > So why should it be different for PFM_CREATE_EVTSETS. In fact, it may make > > > it easier on applications which are highly modularized where each module > > > contribute its part of the measurement without knowing about the others. > > > This allows for incremental updates. > > > > I guess. It's just that I can see compelling reasons why incremental > > WRITE_PMDS and WRITE_PMCS are useful, but the same is not true for > > CREATE_EVTSETS. > > > In practice you are proably right. Yet it feels strange to somehow restrict > the uses of CREATE_EVTSETS. 
> > > Related to event sets and the register virtual mapping that is done by perfctr. > The same kind of mapping could be provided per-event set. It would be possible > to return the address at which a set is visible. That would be automatic remapping. > Of course, that goes against the model which I recently change whereby the user > must call mmap() explicitely to remapping the sampling buffer. But it would > be hard to reuse the same call to map PMD registers of a set. There is only > one file descriptor per context. Sowe need to find another trick to indicate > which set to mmap. I think we could find a nice trick with the mmap offset. > Offset=0 means sampling buffer, offset=1 means set0, offset=2 means set1 and > so forth. Do you have any ideas on this? Yes, I've always thought using the offset as a selector for which information to access would be a reasonable idea. However, they'll need to be multiples of whole pages to work properly. I also think the offsets should be somewhere up high: because you also support read() on the fd, I think it would be counter-intuitive for an mmap() at 0 *not* to return the same data as read(). -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson |
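The offset-as-selector idea, with David's two refinements above (offsets in whole pages, per-set region up high so that offset 0 keeps meaning the same data read() returns), can be sketched as a decode step on the kernel side. All constants and names here are made up for illustration:

```c
/* Hypothetical start of the per-set mapping region, in pages, placed
 * well above any plausible sampling-buffer size. */
#define PFM_SET_AREA_PGOFF  0x100000ul

/* Decode a page offset passed to mmap() on the context fd.
 * Returns -1 for the sampling buffer, the set number for a per-set
 * mapping (one page per set), or -2 for an invalid offset. */
static long pfm_decode_pgoff(unsigned long pgoff,
                             unsigned long smpl_buf_pages,
                             unsigned long nsets)
{
    if (pgoff < smpl_buf_pages)
        return -1;                                   /* sampling buffer */
    if (pgoff >= PFM_SET_AREA_PGOFF &&
        pgoff < PFM_SET_AREA_PGOFF + nsets)
        return (long)(pgoff - PFM_SET_AREA_PGOFF);   /* set number */
    return -2;                                       /* invalid */
}
```

Keeping the sampling buffer at offset 0 preserves the intuition that mmap() at 0 and read() expose the same data.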
From: Stephane E. <er...@hp...> - 2005-04-06 08:19:53
David, > > I think we would be way of the safe side with: > > PFM_MAX_PMCS=320 (256+64) > > PFM_MAX_PMDS=320 (256+64) > > Ok. I suspect it will end up being massive overkill for nearly every > CPU we ever deal with, but who cares, really. > Yes, given how hard it is to get more counters I don't think we'll ever reach 256. But for PMUs which use indexed registers we may have, let's say, 32 registers scattered across the entire 256-entry namespace. > Well, there are so few, I don't think we need to be particularly > clever. I suppose we could use the SPR numbers, minus some offset > (perfctr does this in the latest versions), but I was thinking > something simple, say: > > PMCs: > 0: MMCR0 > 1: MMCR1 > 2: MMCR2 (ppc32 only) > 3: MMCRA (ppc64 only) > > PMDs: > 0: timebase > 1: PMC1 > 2: PMC2 > ... > 8: PMC8 > > (The PowerPC documentation's use of PMC for "Performance Monitor > Counter" makes this look a little confusing). From this description, it looks like you have 8 counters but you don't have 8 configuration registers for them. Do they always go in pairs? > Incidentally, how were you planning to implement the perfctr-like > virtualized tsc and mmap() based sampling you were talking about? > That could have an impact on how we do things here. > See below for mmapping. > > > > > Hrm... I doubt it would really be all that costly to pack the wholes, > > > > Related to event sets and the register virtual mapping that is done by perfctr. > > The same kind of mapping could be provided per-event set. It would be possible > > to return the address at which a set is visible. That would be automatic remapping. > > Of course, that goes against the model which I recently change whereby the user > > must call mmap() explicitely to remapping the sampling buffer. But it would > > be hard to reuse the same call to map PMD registers of a set. There is only > > one file descriptor per context. Sowe need to find another trick to indicate > > which set to mmap. 
> > I think we could find a nice trick with the mmap offset. > > Offset=0 means sampling buffer, offset=1 means set0, offset=2 means set1 and > > so forth. Do you have any ideas on this?

> Yes, I've always thought using the offset as a selector for which > information to access would be a reasonable idea. However, they'll > need to be multiples of whole pages to work properly. I also think > the offsets should be somewhere up high: because you also support > read() on the fd, I think it would be counter-intuitive for an mmap() > at 0 *not* to return the same data as read().

What about: on creation, each set returns an opaque cookie which must be used as the offset to mmap. This way, the user does not have to deal with page size. We could add the cookie in the structures passed to PFM_CREATE_EVTSETS and PFM_GETINFO_EVTSETS. This would cover set0, which is always created by default. And yes, you need one page per set because each set can individually be destroyed. The cookie could correspond to a low or high offset depending on what is more convenient.

On the question of read() on fd vs. mmap, there is an important difference here. The read() has the side effect of actually consuming the notification message from the message queue. Making the queue visible via mmap avoids the copying done by read but we lose the side-effect. We would still need a system call to remove that message from the queue. Mmapping is read-only.

-- -Stephane |
From: David G. <da...@gi...> - 2005-04-07 03:43:38
|
On Wed, Apr 06, 2005 at 12:55:21AM -0700, Stephane Eranian wrote:
> David,
>
> > > I think we would be way on the safe side with: > > > PFM_MAX_PMCS=320 (256+64) > > > PFM_MAX_PMDS=320 (256+64)
> >
> > Ok. I suspect it will end up being massive overkill for nearly every > > CPU we ever deal with, but who cares, really.
>
> Yes, given how hard it is to get more counters I don't think we'll ever > reach 256. But for PMUs which use indexed registers we may have, let's say, > 32 registers scattered across the entire 256-entry namespace.

Very well.

> > Well, there are so few, I don't think we need to be particularly > > clever. I suppose we could use the SPR numbers, minus some offset > > (perfctr does this in the latest versions), but I was thinking > > something simple, say:
> >
> > PMCs:
> > 0: MMCR0
> > 1: MMCR1
> > 2: MMCR2 (ppc32 only)
> > 3: MMCRA (ppc64 only)
> >
> > PMDs:
> > 0: timebase
> > 1: PMC1
> > 2: PMC2
> > ...
> > 8: PMC8
> >
> > (The PowerPC documentation's use of PMC for "Performance Monitor > > Counter" makes this look a little confusing).
>
> From this description, it looks like you have 8 counters but you don't have > 8 configuration registers for them. Do they always go in pairs?

No. The individual event selection fields, such as they are, are spread across MMCR0 (32bit) and MMCR1 (64bit). The rest of MMCR0 and MMCRA have general control bits, plus the settings for the various muxes which affect the interpretation of the event selection fields. I still don't fully understand the event selection logic, I have to admit - it's pretty baroque.

> > Incidentally, how were you planning to implement the perfctr-like > > virtualized tsc and mmap() based sampling you were talking about? > > That could have an impact on how we do things here.
>
> See below for mmapping.

> > > > > > Hrm... I doubt it would really be all that costly to pack the holes,
> > >
> > > Related to event sets and the register virtual mapping that is done by perfctr.
> > > The same kind of mapping could be provided per-event set. It would be possible > > > to return the address at which a set is visible. That would be automatic remapping. > > > Of course, that goes against the model which I recently changed whereby the user > > > must call mmap() explicitly to remap the sampling buffer. But it would > > > be hard to reuse the same call to map PMD registers of a set. There is only > > > one file descriptor per context. So we need to find another trick to indicate > > > which set to mmap. I think we could find a nice trick with the mmap offset. > > > Offset=0 means sampling buffer, offset=1 means set0, offset=2 means set1 and > > > so forth. Do you have any ideas on this?
> >
> > Yes, I've always thought using the offset as a selector for which > > information to access would be a reasonable idea. However, they'll > > need to be multiples of whole pages to work properly. I also think > > the offsets should be somewhere up high: because you also support > > read() on the fd, I think it would be counter-intuitive for an mmap() > > at 0 *not* to return the same data as read().
>
> What about on creation each set returns an opaque cookie which must be used > as the offset to mmap. This way, the user does not have to deal with page size. > We could add the cookie in the structures passed to PFM_CREATE_EVTSETS and > PFM_GETINFO_EVTSETS. This would cover set0 which is always created by default. > And yes, you need one page per set because each set can individually be destroyed. > The cookie could correspond to low or high offset depending on what is > more convenient.

Well, given that it is the offset to pass to mmap, I wouldn't really call it "opaque". But returning this offset is a good idea, yes.

Oh... hang on. I didn't immediately realize the implication of this. Does this mean that there is a separate sample buffer for each event set?

> On the question of read() on fd vs. mmap, there is an important difference > here.
> The read() has the side effect of actually consuming the notification > message from the message queue. Making the queue visible via mmap avoids the > copying done by read but we lose the side-effect. We would still need a system > call to remove that message from the queue. Mmapping is read-only.

I thought about this before, allowing mmap() access to the notification buffer. It could be done, allowing the ring buffer to be mmap()ed; however, as you point out, you'd need some way of consuming the messages. It occurred to me eventually that the natural way to do that would be to make appropriate lseek()s remove the messages.

However, that's not actually what I was getting at. I don't know that it's necessary to support mmap()ing the ring buffer. All I was saying is that if it were possible to mmap() at offsets around the value of the file pointer (so, near 0), it would be very peculiar for it to give you something other than read() does. So I suggest that for this other information we put it at a high offset and basically never allow the file pointer to reach up to it.

Although, going back for a minute. Supposing we did allow the notification messages to be read and consumed via mmap() and lseek(), do we still need to provide special notification messages? Could we make the notification ring buffer be just the sample buffer? It would have a certain elegance to it.

One other question: how are you planning to implement the mmap() based sampling and tsc virtualization features from perfctr?

-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson |
From: Stephane E. <er...@hp...> - 2005-04-07 23:36:48
|
David, > > > Well, there are so few, I don't think we need to be particularly > > > clever. I suppose we could use the SPR numbers, minus some offset > > > (perfctr does this in the latest versions), but I was thinking > > > something simple, say: > > > > > > PMCs: > > > 0: MMCR0 > > > 1: MMCR1 > > > 2: MMCR2 (ppc32 only) > > > 3: MMCRA (ppc64 only) > > > > > > PMDs: > > > 0: timebase > > > 1: PMC1 > > > 2: PMC2 > > > ... > > > 8: PMC8 > > > > > > (The PowerPC documentation's use of PMC for "Performance Monitor > > > Counter" makes this look a little confusing). > > > > > >From this description, look like you have 8 counters but you don't have > > 8 configuration registers for them. Do they always go in pairs? > > No. The individual event selection fields, such as they are are > spread across MMCR0 (32bit) and MMCR1 (64bit). The rest of MMCR0 and > MMCRA have general control bits, plus the settings for the various > muxes which affect the interpretation of the event selection fields. > I still don't fully understand the event selection logic, I have to > admit - it's pretty baroque. > Ok, I see how it is spread across those registers. > > What about on creation each set returns an opaque cookie which must be used > > as the offset to mmap. This way, the user does not have to deal with page size. > > We could add the cookie in the structures passed to PFM_CREATE_EVTSETS and > > PFM_GETINFO_EVTSETS. This would cover set0 which is always created by default. > > And yes, you need one page per set because each set can individually be destroyed. > > The cookie could correspond to low or high offset depending on what is > > more convenient. > > Well, given that it is the offset to pass to mmap, I wouldn't really > call if "opaque". But returning this offset is a good idea, yes. > > Oh... hang on. I didn't immediately realize the implication of this. > Does this mean that there is a separate sample buffer for each event > set? 
>
No, there is one sampling buffer per context and it is shared by all sets.

> > On the question of read() on fd vs. mmap, there is an important difference > > here. The read() has the side effect of actually consuming the notification > > message from the message queue. Making the queue visible via mmap avoids the > > copying done by read but we lose the side-effect. We would still need a system > > call to remove that message from the queue. Mmapping is read-only.
>
> I thought about this before, allowing mmap() access to the > notification buffer. It could be done, allowing the ring buffer to be > mmap()ed, however as you point out you'd need some way of consuming > the messages. It occurred to me eventually that the natural way to do > that would be to make appropriate lseek()s remove the messages.

Yes, that's an idea. I need to check and see how the callback looks for lseek. I hope we can filter/fail on baroque parameters, for the whence for instance. How to deal with movements that are not a multiple of the message size? If I recall, I fail the read(), so we could as well fail the lseek() to a position that is in the middle of a message. But now, how is using mmap+lseek more efficient than a plain read() of something like 56 bytes? You need a system call no matter what.

> However, that's not actually what I was getting at. I don't know that > it's necessary to support mmap()ing the ring buffer. All I was saying > is that if it were possible to mmap() at offsets around the value of > the file pointer (so, near 0), it would be very peculiar for it to > give you something other than read() does. So I suggest that for this > other information we put it at a high offset and basically never allow > the file pointer to reach up to it.

Yes, I think that would work.

> Although, going back for a minute. Supposing we did allow the > notification messages to be read and consumed via mmap() and lseek(), > do we still need to provide special notification messages.
> Could we > make the notification ring buffer be just the sample buffer? It would > have a certain elegance to it.

Well, the issue there is that perfmon allows you to handle sampling totally at the user level, i.e., no kernel level buffer. In that case you still want user level notification. Also, in the future, the message queue will carry more than just overflow notifications.

> One other question: how are you planning to implement the mmap() > based sampling and tsc virtualization features from perfctr?

I have started implementing the mmap() access for the virtualized 64-bit PMD registers. I am not too happy with the page consumption that this implies, at 320 PMDs x 8 bytes (minimum). Plus you need to indicate to the user which set is active. Only for that set is there the requirement to read the hardware registers. What is the common page size on PowerPC64, 8k/16k? It would be nice to fit this into a 4 kb page/set.

Also note that this mode is ONLY interesting in the case of a self-monitoring thread or for system wide. It does not help when a thread is monitoring another thread. Thus I was thinking that this special page allocation could be made an option. By default the 64-bit values would be allocated with the kernel set structure. If the user requests remapping then they would be allocated on a distinct page. All of this to try to spare memory. People are applying monitoring to large workloads with hundreds of threads to monitor.

I have not yet looked at the TSC virtualization. My idea was to use a "soft" PMD to implement this feature.

-- -Stephane |
From: David G. <da...@gi...> - 2005-04-08 01:26:24
|
On Thu, Apr 07, 2005 at 04:12:24PM -0700, Stephane Eranian wrote: > David, > > > > > Well, there are so few, I don't think we need to be particularly > > > > clever. I suppose we could use the SPR numbers, minus some offset > > > > (perfctr does this in the latest versions), but I was thinking > > > > something simple, say: > > > > > > > > PMCs: > > > > 0: MMCR0 > > > > 1: MMCR1 > > > > 2: MMCR2 (ppc32 only) > > > > 3: MMCRA (ppc64 only) > > > > > > > > PMDs: > > > > 0: timebase > > > > 1: PMC1 > > > > 2: PMC2 > > > > ... > > > > 8: PMC8 > > > > > > > > (The PowerPC documentation's use of PMC for "Performance Monitor > > > > Counter" makes this look a little confusing). > > > > > > > >From this description, look like you have 8 counters but you don't have > > > 8 configuration registers for them. Do they always go in pairs? > > > > No. The individual event selection fields, such as they are are > > spread across MMCR0 (32bit) and MMCR1 (64bit). The rest of MMCR0 and > > MMCRA have general control bits, plus the settings for the various > > muxes which affect the interpretation of the event selection fields. > > I still don't fully understand the event selection logic, I have to > > admit - it's pretty baroque. > > > Ok, I see how it is spread across those registers. > > > What about on creation each set returns an opaque cookie which must be used > > > as the offset to mmap. This way, the user does not have to deal with page size. > > > We could add the cookie in the structures passed to PFM_CREATE_EVTSETS and > > > PFM_GETINFO_EVTSETS. This would cover set0 which is always created by default. > > > And yes, you need one page per set because each set can individually be destroyed. > > > The cookie could correspond to low or high offset depending on what is > > > more convenient. > > > > Well, given that it is the offset to pass to mmap, I wouldn't really > > call if "opaque". But returning this offset is a good idea, yes. > > > > Oh... hang on. 
I didn't immediately realize the implication of this. > > Does this mean that there is a separate sample buffer for each event > > set? > > > No, there is one sampling buffer per context and it is shared by all sets. Ok. Why do we need a separate offset for each event set then? > > > On the question of read() on fd vs. mmap, there is an important difference > > > here. The read() has the side effect of actually consuming the notification > > > message from the message queue. Making the queue visible via mmap avoid the > > > copying done by read but we loose the side-effect. We would still need a system > > > call to remove that message from the queue. Mmaping is read-only. > > > > I thought about this before, allowing mmap() access to the > > notification buffer. I could be done, allowing the ring buffer to be > > mmap()ed, however as you point out you'd need some way of consuming > > the messages. It occurred to me eventually that the natural way to do > > that would be to make appropriate lseek()s remove the messages. > > Yes, that's an idea. I need to check and see how the callback looks > for lseek. I hope we can filter/fails on baroque paramters for the > whence for instance. I don't see why we couldn't handle the whence parameter fully - in fact that would even be useful, lseek(fd, 0, SEEK_END) would be guaranteed to consume all the buffer, for example; lseek(fd, message_size, SEEK_CUR) would consume exactly one message. > How to deal with movements that are not mltiple of the message > size. If we just treat this as a ring buffer with a byte offset, I don't see that that causes a problem. > If I recall I fail the read(), so we could as well fail the lseek() > to a position that is in the middle of a message. Erm... that's probably not a good idea. lseek() doesn't usually fail based on the value of the offset. > But now, how is > using mmap+lseek more efficient than a plain read() of something > like 56 bytes? You need a system no matter what. 
Yes, but mmap()+lseek() avoids the copy. Whether that's enough to make a significant difference or not, I don't know. I was thinking of it more in the context of a unified sample/notification buffer, as mentioned below, where we could be copying out rather more than 56 bytes per event. > > However, that's not actually what I was getting at. I don't know that > > it's necessary to support mmap()ing the ring buffer. All I was saying > > is that if it were possible to mmap() at offsets around the value of > > the file pointer (so, near 0), it would be very peculiar for it to > > give you something other than read() does. So I suggest that for this > > other information we put it at a high offset and basically never allow > > the file pointer to reach up to it. > > > Yes, I think that would work. > > > Although, going back for a minute. Supposing we did allow the > > notification messages to be read and consumed via mmap() and lseek(), > > do we still need to provide special notification messages. Could we > > make the notification ring buffer be just the sample buffer? It would > > have a certain elegance to it. > > > Well, the issue there is that perfmon allows you to handle sampling > totally at the user level, i.e., no kernel level buffer. Well, except there still is a kernel level buffer, in the form of the queue of notification events. Why not just make it always one buffer. > In that case > you still want user level notification. Also in the future, the message > queue will carry more than just overflow notifications. So? No reason other events couldn't go in the sample buffer too. Puts all the information in one place - even if these other notifications carry more information than will fit in a small message. Plus naturally maintains the correct order between overflows and other sorts of sampling events. > > One other question: how are you planning to implement the mmap() > > based sampling and tsc virtualization features from perfctr? 
> > > I have started implemented the mmap() access for the virtualized 64-bit > PMD registers. I am not too happy with the page consumption that this implies. > at 320 PMDS x 8 bytes (minimum). Yes, that is a bit of an issue. This is where Mikael's approach of renumbering the counters to reduce cacheline usage comes in. But that doesn't really fit into the perfmon model. It's worse than 8 bytes each if you need to worry about both sum and offset values, as perfctr does. Perfmon doesn't need to in the cases where it writes back to the counters, but that may not always be possible or practical. According to Mikael on at least some x86 machines, writing the counters is high latency and needs to be avoided as much as possible. And even without that, things like the timebase/tsc obviously can't be written to, so we would need a start/offset approach to virtualize them. > Plus you need to indicate to the user > which set is active. Only for that set there is the requirement to read the > hardware registers. I guess. I think for this sort of operation a separate window for each event set might make more sense. > What is the common page size on PowerPC64 8k/16k? 4k. > It would be nice to fit this into a 4 kb page/set. Also note that this mode > is ONLY interesting in the case of a self-monitoring thread or for system > wide. It does not when a thread is monitoring another thread. Not entirely true. Certainly self-monitoring is the really useful case here, but it would be possible to use this for low-latency monitoring of another process too, if for the application it's acceptable to put up with slightly out of date statistics. That's plausible if monitoring a steady-state process and the counter values can be normalized by tsc, for example. In this scenario we also need some synchronization between kernel and user to ensure that the user sampling process gets an atomic snapshot of all the counters. 
Currently perfctr handles that based on the tsc value, which I'm not convinced is entirely correct. I think a seqlock-like mechanism would be appropriate here.

> Thus I was > thinking that this special page allocation could be made an option. By > default the 64-bit values would be allocated with the kernel set structure. > If the user requests remapping then they would be allocated on a distinct > page. All of this to try to spare memory. People are applying monitoring > to large workloads with hundreds of threads to monitor.

Hrm... having an 'if' on every single access to the soft counter, to determine the correct location, doesn't really sound like a good idea.

> I have not yet looked at the TSC virtualization. My idea was to use > a "soft" PMD to implement this feature.

Sure. But since you can't write the tsc, remember that you'll need a start/sum mechanism to support user level sampling.

-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson |
From: Stephane E. <er...@hp...> - 2005-04-13 22:37:02
|
David,

> > > Well, given that it is the offset to pass to mmap, I wouldn't really > > > call it "opaque". But returning this offset is a good idea, yes.
> > >
> > > Oh... hang on. I didn't immediately realize the implication of this. > > > Does this mean that there is a separate sample buffer for each event > > > set?
> >
> > No, there is one sampling buffer per context and it is shared by all sets.
>
> Ok. Why do we need a separate offset for each event set then?

Here is a simple example, imagine the following sequence (simplified):

fd = PFM_CREATE_CONTEXT()  -> creates context and set0
PFM_CREATE_EVTSETS(1)      -> creates event set1

Now you would like to access set0 and set1 via remapping. You need to be able to uniquely identify each set for mmap().

PFM_GETINFO_EVTSETS(0, &setinfo0)  -> get event set info for set0
off_set0 = setinfo0.set_mmap_offset;

PFM_GETINFO_EVTSETS(1, &setinfo1)  -> get event set info for set1
off_set1 = setinfo1.set_mmap_offset;

/* now map each set separately */
addr_set0 = mmap(NULL, sizeof(pfm_set_view_t), PROT_READ, MAP_PRIVATE, fd, off_set0);
addr_set1 = mmap(NULL, sizeof(pfm_set_view_t), PROT_READ, MAP_PRIVATE, fd, off_set1);

This is necessary because we don't want to force the user to create and delete sets as one block. The pfm_set_view_t structure is defined something like:

typedef struct {
    uint64_t set_runs;    /* how many times the set was activated */
    uint16_t set_id;      /* set identification */
    uint16_t set_status;  /* inactive/active */
    uint32_t reserved;
    uint64_t pmds[PFM_MAX_PMDS];
} pfm_set_view_t;

> > > > On the question of read() on fd vs. mmap, there is an important difference > > > > here. The read() has the side effect of actually consuming the notification > > > > message from the message queue. Making the queue visible via mmap avoids the > > > > copying done by read but we lose the side-effect. We would still need a system > > > > call to remove that message from the queue. Mmapping is read-only.
> > >
> > > I thought about this before, allowing mmap() access to the > > > notification buffer. It could be done, allowing the ring buffer to be > > > mmap()ed, however as you point out you'd need some way of consuming > > > the messages. It occurred to me eventually that the natural way to do > > > that would be to make appropriate lseek()s remove the messages.
> >
> > Yes, that's an idea. I need to check and see how the callback looks > > for lseek. I hope we can filter/fail on baroque parameters for the > > whence for instance.
>
> I don't see why we couldn't handle the whence parameter fully - in > fact that would even be useful, lseek(fd, 0, SEEK_END) would be > guaranteed to consume all the buffer, for example; lseek(fd, > message_size, SEEK_CUR) would consume exactly one message.

The callback gets the whence parameter. I think we could handle it.

> > How to deal with movements that are not a multiple of the message > > size.
>
> If we just treat this as a ring buffer with a byte offset, I don't see > that that causes a problem.

I went back and forth on this. If you look at the latest version of the document, I think I try to explain why it is not very efficient to treat the notification message queue as a byte stream: it forces the application to issue two reads to extract a message, one to get the type and a second to read the body. The queue is currently managed as a message queue. Read size must be a multiple of the fixed message structure. It is not possible to read partial messages. It simplifies the read routine a lot. The application only needs to issue one read to get a full message. Sure, there are a few bytes wasted, but so far the largest message is the overflow message and that is likely going to be the most frequent message. The cost of copying 56 bytes is probably fairly small compared to the overall cost of the system call.
> > If I recall, I fail the read(), so we could as well fail the lseek() > > to a position that is in the middle of a message.
>
> Erm... that's probably not a good idea. lseek() doesn't usually fail > based on the value of the offset.

That's true.

> > > Although, going back for a minute. Supposing we did allow the > > > notification messages to be read and consumed via mmap() and lseek(), > > > do we still need to provide special notification messages. Could we > > > make the notification ring buffer be just the sample buffer? It would > > > have a certain elegance to it.
> > >
> > Well, the issue there is that perfmon allows you to handle sampling > > totally at the user level, i.e., no kernel level buffer.
>
> Well, except there still is a kernel level buffer, in the form of the > queue of notification events. Why not just make it always one buffer.

There is another aspect of perfmon that comes into play here. The buffer format is not under the control of the perfmon core. Its size and existence are totally under the control of the application and of the sampling buffer format. Also, without a read, I wonder how the application would wait/poll for any new events. Perfmon does not use signals to notify of new events. The read/poll/select can be used. Should a signal be necessary, then it is set up normally by requesting ownership of the resource via some fcntl(). In the end you can receive a SIGIO on notification events, and then you know there is something in the notification message queue.

> > I have started implementing the mmap() access for the virtualized 64-bit > > PMD registers. I am not too happy with the page consumption that this implies, > > at 320 PMDs x 8 bytes (minimum).
>
> Yes, that is a bit of an issue. This is where Mikael's approach of > renumbering the counters to reduce cacheline usage comes in. But that > doesn't really fit into the perfmon model.

I guess it is a matter of finding a smart mapping to hide some of the holes.
On IA-64, so far we are lucky: the PMD and PMC are allocated sequentially. For X86-64 and IA-32 (P6/Pentium M) it is trivial to implement sequential mappings. For PowerPC it looks fine as well. The P4/Xeon looks more challenging.

> It's worse than 8 bytes each if you need to worry about both sum and > offset values, as perfctr does. Perfmon doesn't need to in the cases > where it writes back to the counters, but that may not always be > possible or practical. According to Mikael on at least some x86 > machines, writing the counters is high latency and needs to be avoided > as much as possible. And even without that, things like the > timebase/tsc obviously can't be written to, so we would need a > start/offset approach to virtualize them.

> > It would be nice to fit this into a 4 kb page/set. Also note that this mode > > is ONLY interesting in the case of a self-monitoring thread or for system > > wide. It does not help when a thread is monitoring another thread.
>
> Not entirely true. Certainly self-monitoring is the really useful > case here, but it would be possible to use this for low-latency > monitoring of another process too, if for the application it's > acceptable to put up with slightly out of date statistics. That's > plausible if monitoring a steady-state process and the counter values > can be normalized by tsc, for example.

I would add, it is useful for self-monitoring where it is possible to read the PMD from the user level directly. Moreover, it is only useful if we know that the counter is going to exceed its hardware width, i.e. a direct read is not enough. Other than that I agree with you for the non-self-monitoring case. I guess we would have to set the level of expectations about the "age" of the values.

> In this scenario we also need some synchronization between kernel and > user to ensure that the user sampling process gets an atomic snapshot > of all the counters.
> Currently perfctr handles that based on the tsc > value, which I'm not convinced is entirely correct. I think a seqlock > like mechanism would be appropriate here.

Yes, I saw that in the perfctr code. I am not sure this is very reliable.

BTW, I got rid of PFM_SET_CONFIG/PFM_GET_CONFIG as you suggested. I use a combination of /proc and /proc/sys. I will fix the number of PMDS and PMCS for all platforms. I moved the reg_smpl_pmds/reg_reset_pmds/flags to PFM_WRITE_PMDS from PFM_WRITE_PMCS. I will be changing the type of reg_value for pfarg_pmc_t to "unsigned long" from "uint64_t". I think for all PMUs I looked at, the PMCs are always as wide as an unsigned long. For PMDs, the type must remain uint64_t. Can you confirm this fact for PPC32 and PPC64?

-- -Stephane |
From: David G. <da...@gi...> - 2005-04-14 01:40:30
|
On Wed, Apr 13, 2005 at 03:12:19PM -0700, Stephane Eranian wrote: > David, > > > > > Well, given that it is the offset to pass to mmap, I wouldn't really > > > > call if "opaque". But returning this offset is a good idea, yes. > > > > > > > > Oh... hang on. I didn't immediately realize the implication of this. > > > > Does this mean that there is a separate sample buffer for each event > > > > set? > > > > > > > No, there is one sampling buffer per context and it is shared by all sets. > > > > Ok. Why do we need a separate offset for each event set then? > > > Here is a simple example, imagine the following sequence (simplified): > > fd = PFM_CREATE_CONTEXT() -> creates context and set0 > PFM_CREATE_EVTSETS(1) -> creates event set1 > > Now you would like to access set0 and set1 via remapping. > You need to be able to uniquely identify each set for mmap(). > > PFM_GETINFO_EVTSETS(0, &setinfo0) -> get event set info for set0 > off_set0 = setinfo0.set_mmap_offset; > > PFM_GETINFO_EVTSETS(1, &setinfo1) -> get event set info for set1 > off_set1 = setinfo0.set_mmap_offset; > > /* now map each set separately */ > addr_set0 = mmap(NULL, sizeof(pfm_set_view_t), PROT_READ, MAP_PRIVATE, fd, off_set0); > addr_set1 = mmap(NULL, sizeof(pfm_set_view_t), PROT_READ, MAP_PRIVATE, fd, off_set1); > > This is necessary because we don't want to force the user the block create > and delete sets. The pfm_set_view_t structure is defined something like: > typedef struct { > uint64_t set_runs; /* how many times the set was activated */ > uint16_t set_id; /* set identification */ > uint16_t set_status; /* inactive/active */ > uint32_t reserved; > uint64_t pmds[PFM_MAX_PMDS]; > } pfm_set_view_t; Ah, ok, I take it this is for the user-level sampling support. Remember the last version of the document I've read didn't have this support, so I was confused as to what per-set information you were mapping; last I had details the only mappable thing was the sample buffer. 
> > > > > On the question of read() on fd vs. mmap, there is an important difference
> > > > > here. The read() has the side effect of actually consuming the notification
> > > > > message from the message queue. Making the queue visible via mmap avoids the
> > > > > copying done by read but we lose the side-effect. We would still need a system
> > > > > call to remove that message from the queue. Mmaping is read-only.
> > > >
> > > > I thought about this before, allowing mmap() access to the
> > > > notification buffer. It could be done, allowing the ring buffer to be
> > > > mmap()ed, however as you point out you'd need some way of consuming
> > > > the messages. It occurred to me eventually that the natural way to do
> > > > that would be to make appropriate lseek()s remove the messages.
> > >
> > > Yes, that's an idea. I need to check and see how the callback looks
> > > for lseek. I hope we can filter/fail on baroque parameters for the
> > > whence, for instance.
> >
> > I don't see why we couldn't handle the whence parameter fully - in
> > fact that would even be useful: lseek(fd, 0, SEEK_END) would be
> > guaranteed to consume all the buffer, for example; lseek(fd,
> > message_size, SEEK_CUR) would consume exactly one message.
> >
> The callback gets the whence parameter. I think we could handle it.
>
> > > How do we deal with movements that are not multiples of the message
> > > size?
> >
> > If we just treat this as a ring buffer with a byte offset, I don't see
> > that that causes a problem.
> >
> I went back and forth on this. If you look at the latest version of the
> document, I think I try to explain why it is not very efficient to treat
> the notification message queue as a byte stream. Because it forces the
> application to issue two reads to extract a message: one to get the type
> and a second to read the body.
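David's "ring buffer with a byte offset" reading of the queue can be modeled like this (an illustrative sketch only: in the real interface the buffer would be the mmap()ed region, and queue_consume() would be an lseek(fd, n, SEEK_CUR) on the context fd):

```c
#include <stdint.h>
#include <stddef.h>

/* Simulated notification queue.  The kernel advances 'tail' as it posts
 * messages; the application advances 'head' by seeking. */
typedef struct {
    const uint8_t *buf;   /* mmap()ed ring in the real interface */
    size_t size;          /* ring size in bytes */
    size_t head;          /* read position, advanced on consume */
    size_t tail;          /* write position (kernel side) */
} msg_queue_t;

static size_t queue_pending(const msg_queue_t *q)
{
    return q->tail - q->head;
}

/* Peek at the next message without consuming it: just a load from the
 * mapped buffer, no system call needed. */
static const uint8_t *queue_peek(const msg_queue_t *q)
{
    return q->buf + (q->head % q->size);
}

/* Consume n bytes; models lseek(fd, n, SEEK_CUR).  Consuming everything,
 * queue_consume(q, queue_pending(q)), models lseek(fd, 0, SEEK_END). */
static void queue_consume(msg_queue_t *q, size_t n)
{
    q->head += n;
}
```

Because the offset is a plain byte count, seeks that are not message-aligned are well defined; whether they are useful is up to the application.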
Well, that's a consequence of merging the two buffers, thereby making the
messages of variable size, not a consequence of treating the notification
queue as a byte stream per se.

But if we allowed mmap() access to the notification buffer, then we don't
need two syscalls after all. Just read the data from memory, then one
lseek() to consume the message.

> The queue is currently managed as a message
> queue. Read size must be a multiple of the fixed message structure. It is
> not possible to read partial messages. It simplifies the read routine a
> lot. The application only needs to issue one read to get a full message.
> Sure, there are a few bytes wasted, but so far the largest message is the
> overflow message and that is likely going to be the most frequent message.
> The cost of copying 56 bytes is probably fairly small compared to the
> overall cost of the system call.
>
> > > If I recall, I fail the read(), so we could as well fail the lseek()
> > > to a position that is in the middle of a message.
> >
> > Erm... that's probably not a good idea. lseek() doesn't usually fail
> > based on the value of the offset.
> >
> That's true.
>
> > > > Although, going back for a minute. Supposing we did allow the
> > > > notification messages to be read and consumed via mmap() and lseek(),
> > > > do we still need to provide special notification messages? Could we
> > > > make the notification ring buffer be just the sample buffer? It would
> > > > have a certain elegance to it.
> > > >
> > > Well, the issue there is that perfmon allows you to handle sampling
> > > totally at the user level, i.e., no kernel level buffer.
> >
> > Well, except there still is a kernel level buffer, in the form of the
> > queue of notification events. Why not just make it always one buffer?
>
> There is another aspect of perfmon that comes into play here. The buffer
> format is not under the control of the perfmon core.
> Its size and existence are totally under the control of the application's
> buffer sampling format.

So..? Obviously the sample formats would have to be written such that the
single sample buffer has enough meta-information that it can be unambiguously
parsed, but that's no big deal.

> Also without a read, I wonder how the application would be waiting/polling
> for any new events. Perfmon does not use signals to notify of new events.
> The read/poll/select can be used. Should a signal be necessary, then it is
> set up normally by requesting ownership of the resource via some fcntl().
> In the end you can receive a SIGIO on notification events, then you know
> there is something in the notification message queue.

Well, the stream is still there, so the application can still use select().
It's just that after the select() or poll() returns it will get the data
from the mmap()ed region and then lseek() instead of doing a read(). Or
alternatively, if it suits the application's structure, it could do a small
(blocking) read() to get the header of the next block of sampling data, then
read any extra information from the mmap.

> > > I have started implementing the mmap() access for the virtualized
> > > 64-bit PMD registers. I am not too happy with the page consumption that
> > > this implies, at 320 PMDs x 8 bytes (minimum).
> >
> > Yes, that is a bit of an issue. This is where Mikael's approach of
> > renumbering the counters to reduce cacheline usage comes in. But that
> > doesn't really fit into the perfmon model.
> >
> I guess it is a matter of finding a smart mapping to hide some of the
> holes. On IA-64, so far we are lucky: the PMDs and PMCs are allocated
> sequentially. For X86-64 and IA-32 (P6/Pentium M) it is trivial to
> implement sequential mappings. For PowerPC it looks fine as well. The
> P4/Xeon looks more challenging.
>
> > It's worse than 8 bytes each if you need to worry about both sum and
> > offset values, as perfctr does.
> > Perfmon doesn't need to in the cases
> > where it writes back to the counters, but that may not always be
> > possible or practical. According to Mikael, on at least some x86
> > machines, writing the counters is high latency and needs to be avoided
> > as much as possible. And even without that, things like the
> > timebase/tsc obviously can't be written to, so we would need a
> > start/offset approach to virtualize them.
> >
> > > It would be nice to fit this into a 4 kb page/set. Also note that this
> > > mode is ONLY interesting in the case of a self-monitoring thread or for
> > > system wide. It does not help when a thread is monitoring another
> > > thread.
> >
> > Not entirely true. Certainly self-monitoring is the really useful
> > case here, but it would be possible to use this for low-latency
> > monitoring of another process too, if for the application it's
> > acceptable to put up with slightly out of date statistics. That's
> > plausible if monitoring a steady-state process and the counter values
> > can be normalized by tsc, for example.
> >
> I would add, it is useful for self-monitoring where it is possible
> to read the PMDs from the user level directly. Moreover, it is only
> useful if we know that the counter is going to exceed its hardware
> width, i.e. a direct read is not enough.

Or if we're using the start/sum approach to avoid unnecessary writes to the
PMDs. According to Mikael this is very important on some x86 models.

> Other than that I agree with you for the non-self-monitoring case.
> I guess we would have to set the level of expectations on the "age" of
> the values.

Yes, some sort of setting for sampling frequency would probably make sense.

> > In this scenario we also need some synchronization between kernel and
> > user to ensure that the user sampling process gets an atomic snapshot
> > of all the counters. Currently perfctr handles that based on the tsc
> > value, which I'm not convinced is entirely correct.
> > I think a seqlock-like mechanism would be appropriate here.
>
> Yes, I saw that in the perfctr code. I am not sure this is very reliable.
>
> BTW, I got rid of PFM_SET_CONFIG/PFM_GET_CONFIG as you suggested.
> I use a combination of /proc and /proc/sys.

Excellent. Is there somewhere I can grab these latest test versions?

> I will fix the number of PMDs and PMCs for all platforms. I moved
> the reg_smpl_pmds/reg_reset_pmds/flags to PFM_WRITE_PMDS from
> PFM_WRITE_PMCS.

Good.

> I will be changing the type of reg_value for pfarg_pmc_t
> to "unsigned long" from "uint64_t". I think for all the PMUs I looked at,
> the PMCs are always at most as wide as an unsigned long. For PMDs, the
> type must remain uint64_t. Can you confirm this fact for PPC32 and PPC64?

No, no, no! Don't do this! Yes, things would fit on ppc32 and ppc64, but
having the size here variable is a really bad idea. It means that 32-bit
applications on a 64-bit kernel (very common on ppc64 and x86_64) will have
a different notion of the structure to the kernel, so we would need to
implement an ugly translation layer to support them. And even that wouldn't
work properly for >32-bit registers being accessed from 32-bit apps. Using
fixed-width types everywhere will save us a lot of pain in the ABI later.
Speaking of which, if there are any unsigned longs in the interface at
present, get rid of them. In fact, get rid of any unsigned ints as well - I
believe they're reliably 32-bit on all current platforms, but there's no
guarantee that will always be the case.

-- 
David Gibson                   | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
                               | _way_ _around_!
http://www.ozlabs.org/people/dgibson
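The seqlock-like mechanism David suggests for getting an atomic snapshot of the mapped counters could work roughly as follows (a generic single-writer sketch, not perfmon or perfctr code; real kernel/user code would also need the memory barriers that are elided here):

```c
#include <stdint.h>

/* The writer (kernel) bumps 'seq' to an odd value before updating and back
 * to even afterwards; a reader retries until it sees the same even value
 * before and after copying. */
typedef struct {
    volatile uint32_t seq;
    uint64_t counters[4];
} counter_page_t;

static void writer_update(counter_page_t *p, int i, uint64_t v)
{
    p->seq++;               /* now odd: update in progress */
    p->counters[i] = v;     /* (write barrier needed here in real code) */
    p->seq++;               /* even again: update complete */
}

static void reader_snapshot(const counter_page_t *p, uint64_t out[4])
{
    uint32_t s;
    do {
        do {
            s = p->seq;
        } while (s & 1);    /* writer mid-update: wait */
        for (int i = 0; i < 4; i++)
            out[i] = p->counters[i];
    } while (p->seq != s);  /* retry if the writer intervened */
}
```

The reader never blocks the writer, which matters here since the "writer" is the context-switch path.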
From: Stephane E. <er...@hp...> - 2005-04-14 10:55:36
|
David,

> > I went back and forth on this. If you look at the latest version of the
> > document, I think I try to explain why it is not very efficient to treat
> > the notification message queue as a byte stream. Because it forces the
> > application to issue two reads to extract a message: one to get the type
> > and a second to read the body.
>
> Well, that's a consequence of merging the two buffers, thereby making
> the messages of variable size, not a consequence of treating the
> notification queue as a byte stream per se.
>
> But if we allowed mmap() access to the notification buffer, then we
> don't need two syscalls after all. Just read the data from memory,
> then one lseek() to consume the message.
>
It is true that if the message queue was mmapped, the two read() calls
would be replaced by a load for the type and possibly loads for the rest.

In the case where the application does all the sampling at the user level,
it would mmap the notification queue; that's a page right there. If the
application is using a buffer format with buffer remapping, then it would
issue an mmap for the notification queue and one for the buffer. OTOH, and
I think that is your point, you could say the first case is like the second
but where the buffer size is actually zero. In other words, the mmap offset
for the buffer and the message queue would be, in effect, the same. That
sounds appealing to me. Now my problem would be to support legacy IA-64
applications which would still use the read model, but I think this could
work. I will see how this can be made to work with the code I have.

> Obviously the sample formats would have to be written such that the
> single sample buffer has enough meta-information that it can be
> unambiguously parsed, but that's no big deal.
>
Yes.

> > Also without a read, I wonder how the application would be waiting/polling
> > for any new events. Perfmon does not use signals to notify of new events.
> > The read/poll/select can be used.
> > Should a signal be necessary, then it is
> > set up normally by requesting ownership of the resource via some fcntl().
> > In the end you can receive a SIGIO on notification events, then you know
> > there is something in the notification message queue.
> >
> Well, the stream is still there, so the application can still use
> select(). It's just that after the select() or poll() returns it will
> get the data from the mmap()ed region and then lseek() instead of
> doing a read(). Or alternatively, if it suits the application's
> structure, it could do a small (blocking) read() to get the header of
> the next block of sampling data, then read any extra information from
> the mmap.
>
Yes, I think you could still poll/select and then go to the mmapped area to
get the actual data.

> > I would add, it is useful for self-monitoring where it is possible
> > to read the PMDs from the user level directly. Moreover, it is only
> > useful if we know that the counter is going to exceed its hardware
> > width, i.e. a direct read is not enough.
>
> Or if we're using the start/sum approach to avoid unnecessary writes
> to the PMDs. According to Mikael this is very important on some x86
> models.
>
I am not sure I understand the start/sum approach you are talking about.
The x86 variants all have this problem that it is very expensive to read
and write the PMU registers. Writes are a killer on context switch out
because there is no other way to stop monitoring but to write the perfsel
registers. Maybe on P4 there is something else.

> > BTW, I got rid of PFM_SET_CONFIG/PFM_GET_CONFIG as you suggested.
> > I use a combination of /proc and /proc/sys.
>
> Excellent. Is there somewhere I can grab these latest test versions?
>
I will try to make the current sources available next week. It would be
nice if you could attempt the ppc32/ppc64 port. I am confident the core
perfmon would support this now.
> > I will be changing the type of reg_value for pfarg_pmc_t
> > to "unsigned long" from "uint64_t". I think for all the PMUs I looked at,
> > the PMCs are always at most as wide as an unsigned long. For PMDs, the
> > type must remain uint64_t. Can you confirm this fact for PPC32 and PPC64?
>
> No, no, no! Don't do this! Yes, things would fit on ppc32 and ppc64,
> but having the size here variable is a really bad idea. It means that
> 32-bit applications on a 64-bit kernel (very common on ppc64 and
> x86_64) will have a different notion of the structure to the kernel,
> so we would need to implement an ugly translation layer to support
> them. And even that wouldn't work properly for >32-bit registers being
> accessed from 32-bit apps. Using fixed-width types everywhere will
> save us a lot of pain in the ABI later. Speaking of which, if there
> are any unsigned longs in the interface at present, get rid of them.
> In fact, get rid of any unsigned ints as well - I believe they're
> reliably 32-bit on all current platforms, but there's no guarantee
> that will always be the case.
>
Well, I would like you to take a look at the data structures defined in the
document. I tried hard to make them fixed size, yet some of them use size_t.
The mmap offset would introduce an off_t. Both are likely defined as
"unsigned long". There is also a big problem with the sample entry structure
for the default format. Think about the instruction pointer: this one has to
be defined as unsigned long (or uintptr_t). This works for both ILP32 and
LP64 systems. However, we have a problem for LP64 kernels with ILP32
applications, such as a ppc32 monitoring tool trying to decode the mmapped
sampling buffer written by a ppc64 kernel. There is no automatic way to
tell a compiler: "I am an ILP32 application but I would like to use the
64-bit version of certain structures". At best you need to have some
#define to force the pre-processor to pick up the right data structure.
In the case of ILP32 running on LP64, it would need to pick up a sample
entry format where the instruction pointer is defined as uint64_t. But
having it forced to uint64_t would be overkill (space) when running on an
ILP32 kernel. If you think of the mmap offset, most likely there is
32<->64-bit emulation for an ILP32 application running on an LP64 kernel.
The offset can only be 32 bits in that mode.

-- 
-Stephane
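David's fixed-width-types argument can be illustrated with a sketch like this (the structure is hypothetical, not the real pfarg_pmc_t): because every field has an explicit width, sizeof and every field offset are identical for ILP32 and LP64 compilers, so a 64-bit kernel can accept the structure from a 32-bit application without any compat translation layer.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical register-argument structure built only from fixed-width
 * types.  Its layout does not depend on the ABI's notion of 'long'. */
typedef struct {
    uint16_t reg_num;    /* which PMC/PMD */
    uint16_t reg_set;    /* which event set */
    uint32_t reg_flags;
    uint64_t reg_value;  /* full 64 bits even for a 32-bit application */
} pmc_arg_t;
```

With "unsigned long" for reg_value, the same structure would be 12 or 16 bytes depending on the ABI, which is exactly the mismatch David warns about.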
From: David G. <da...@gi...> - 2005-04-15 05:12:44
|
On Thu, Apr 14, 2005 at 03:30:57AM -0700, Stephane Eranian wrote:
> David,
>
> > > I went back and forth on this. If you look at the latest version of the
> > > document, I think I try to explain why it is not very efficient to treat
> > > the notification message queue as a byte stream. Because it forces the
> > > application to issue two reads to extract a message: one to get the type
> > > and a second to read the body.
> >
> > Well, that's a consequence of merging the two buffers, thereby making
> > the messages of variable size, not a consequence of treating the
> > notification queue as a byte stream per se.
> >
> > But if we allowed mmap() access to the notification buffer, then we
> > don't need two syscalls after all. Just read the data from memory,
> > then one lseek() to consume the message.
> >
> It is true that if the message queue was mmapped, the
> two read() calls would be replaced by a load for the type and possibly
> loads for the rest.
>
> In the case where the application does all the sampling at the user
> level, it would mmap the notification queue; that's a page right there.
> If the application is using a buffer format with buffer remapping, then
> it would issue an mmap for the notification queue and one for the buffer.
> OTOH, and I think that is your point, you could say the first case is like
> the second but where the buffer size is actually zero. In other words,
> the mmap offset for the buffer and the message queue would be, in
> effect, the same. That sounds appealing to me. Now my problem would be to
> support legacy IA-64 applications which would still use the read model,
> but I think this could work. I will see how this can be made to work
> with the code I have.

Ok, I'm no longer entirely sure we're talking about the same thing here,
and I'm not sure which things you think are a good idea and which you
don't.
To clarify, there are two separate ideas here:

1) To allow the notification queue to be accessed with mmap(), and
   consumed with lseek(), as a lower-overhead way of getting the messages.
   I wouldn't envisage removing the read() mechanism - an application
   could use either approach to read the messages from the stream,
   depending which was suitable (or even a combination of both).

2) To merge the notification queue and sampling buffer into a single data
   stream. The only connection with idea 1 is that using both together
   would mitigate some of the potential performance problems with idea 2.

My impression is that you're convinced (1) is a good idea, but I'm not so
sure what your current position on (2) is. For (2), supporting legacy ia64
programs would, indeed, be a tricky problem.

> > Obviously the sample formats would have to be written such that the
> > single sample buffer has enough meta-information that it can be
> > unambiguously parsed, but that's no big deal.
> >
> Yes.
>
> > > Also without a read, I wonder how the application would be
> > > waiting/polling for any new events. Perfmon does not use signals to
> > > notify of new events. The read/poll/select can be used. Should a signal
> > > be necessary, then it is set up normally by requesting ownership of the
> > > resource via some fcntl(). In the end you can receive a SIGIO on
> > > notification events, then you know there is something in the
> > > notification message queue.
> >
> > Well, the stream is still there, so the application can still use
> > select(). It's just that after the select() or poll() returns it will
> > get the data from the mmap()ed region and then lseek() instead of
> > doing a read(). Or alternatively, if it suits the application's
> > structure, it could do a small (blocking) read() to get the header of
> > the next block of sampling data, then read any extra information from
> > the mmap.
>
> Yes, I think you could still poll/select and then go to the mmapped area
> to get the actual data.
> > > I would add, it is useful for self-monitoring where it is possible
> > > to read the PMDs from the user level directly. Moreover, it is only
> > > useful if we know that the counter is going to exceed its hardware
> > > width, i.e. a direct read is not enough.
> >
> > Or if we're using the start/sum approach to avoid unnecessary writes
> > to the PMDs. According to Mikael this is very important on some x86
> > models.
>
> I am not sure I understand the start/sum approach you are talking about.
> The x86 variants all have this problem that it is very expensive to
> read and write the PMU registers. Writes are a killer on context switch
> out because there is no other way to stop monitoring but to write
> the perfsel registers. Maybe on P4 there is something else.

The impression I've gotten from Mikael and the perfctr code is that
apparently on at least some CPUs, reads are substantially less expensive
than writes. Hence, the code is organized so as to avoid writes, replacing
them with reads. So, to maintain a virtualized counter for a particular
task/thread, instead of writing the virtualized value back to the hardware
counter whenever the task becomes active, it uses the following sequence:

When switching to a monitored task:
	start = hardware pmc value

When switching away from a monitored task:
	sum += (hardware pmc value - start)

When sampling a monitored task during its time slice:
	sum += (hardware pmc value - start)
	start = hardware pmc value

At any point when the monitored task is running, the current virtualized
counter value is (sum + (hardware value - start)). This achieves a fully
virtualized counter value without any writes to the hardware counter. The
perfctr mmap()ed window gives both the start and sum values for each
active counter, so a self-monitoring task can compute the full software
counter value itself.
Obviously this technique cannot be used for counters generating overflow
interrupts, since the hardware and software values must be congruent in
that case. For such counters (i-mode counters, in perfctr terminology)
perfctr falls back to writing the hardware counter value when switching to
the monitored task. The same technique can be used where counter writes
are not just expensive, but impossible, such as for the tsc/timebase.

> > > BTW, I got rid of PFM_SET_CONFIG/PFM_GET_CONFIG as you suggested.
> > > I use a combination of /proc and /proc/sys.
> >
> > Excellent. Is there somewhere I can grab these latest test versions?
> >
> I will try to make the current sources available next week. It would be
> nice if you could attempt the ppc32/ppc64 port. I am confident the core
> perfmon would support this now.

Excellent. I will be occupied at linux.conf.au most of next week, but I'll
see what I can do. With any luck I will make some progress in the week
after.

> > > I will be changing the type of reg_value for pfarg_pmc_t
> > > to "unsigned long" from "uint64_t". I think for all the PMUs I looked
> > > at, the PMCs are always at most as wide as an unsigned long. For PMDs,
> > > the type must remain uint64_t. Can you confirm this fact for PPC32 and
> > > PPC64?
> >
> > No, no, no! Don't do this! Yes, things would fit on ppc32 and ppc64,
> > but having the size here variable is a really bad idea. It means that
> > 32-bit applications on a 64-bit kernel (very common on ppc64 and
> > x86_64) will have a different notion of the structure to the kernel,
> > so we would need to implement an ugly translation layer to support
> > them. And even that wouldn't work properly for >32-bit registers being
> > accessed from 32-bit apps. Using fixed-width types everywhere will
> > save us a lot of pain in the ABI later. Speaking of which, if there
> > are any unsigned longs in the interface at present, get rid of them.
> > In fact, get rid of any unsigned ints as well - I believe they're
> > reliably 32-bit on all current platforms, but there's no guarantee
> > that will always be the case.
> >
> Well, I would like you to take a look at the data structures defined in
> the document. I tried hard to make them fixed size, yet some of them use
> size_t. The mmap offset would introduce an off_t. Both are likely defined
> as "unsigned long". There is also a big problem with the sample entry
> structure for the default format. Think about the instruction pointer:
> this one has to be defined as unsigned long (or uintptr_t). This works
> for both ILP32 and LP64 systems. However, we have a problem for LP64
> kernels with ILP32 applications, such as a ppc32 monitoring tool trying
> to decode the mmapped sampling buffer written by a ppc64 kernel. There is
> no automatic way to tell a compiler: "I am an ILP32 application but I
> would like to use the 64-bit version of certain structures". At best you
> need to have some #define to force the pre-processor to pick up the right
> data structure. In the case of ILP32 running on LP64, it would need to
> pick up a sample entry format where the instruction pointer is defined as
> uint64_t. But having it forced to uint64_t would be overkill (space) when
> running on an ILP32 kernel. If you think of the mmap offset, most likely
> there is 32<->64-bit emulation for an ILP32 application running on an
> LP64 kernel. The offset can only be 32 bits in that mode.

Ah, yes, all those things could cause problems. I'll take a closer look
when I get the chance.

-- 
David Gibson                   | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
                               | _way_ _around_!
http://www.ozlabs.org/people/dgibson
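The start/sum scheme David describes for perfctr can be sketched as follows (function names are illustrative; hardware counter reads are passed in as plain values, and the fallback for overflow-interrupt counters is not shown):

```c
#include <stdint.h>

/* Software-virtualized counter using perfctr's start/sum scheme: the
 * hardware counter is only ever read, never written. */
typedef struct {
    uint64_t sum;    /* counts accumulated over past runs of the task */
    uint64_t start;  /* hardware value when the task was switched in */
} vcounter_t;

/* Called when switching TO the monitored task. */
static void switch_in(vcounter_t *c, uint64_t hw)
{
    c->start = hw;
}

/* Called when switching AWAY from the monitored task. */
static void switch_out(vcounter_t *c, uint64_t hw)
{
    c->sum += hw - c->start;
}

/* Periodic resync while the task runs (e.g. when sampling at a tick). */
static void resync(vcounter_t *c, uint64_t hw)
{
    c->sum += hw - c->start;
    c->start = hw;
}

/* Current virtualized value while the task is running. */
static uint64_t vread(const vcounter_t *c, uint64_t hw)
{
    return c->sum + (hw - c->start);
}
```

Exporting both sum and start through the mmap()ed window is what lets a self-monitoring task combine them with its own direct hardware read to get the full virtualized value without a system call.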