Thread: [perfmon2] Fwd: [patch 0/3] [Announcement] Performance Counters for Linux
From: stephane e. <er...@go...> - 2008-12-05 07:14:43
Hello everyone,

Here is another competing perfmon API proposal posted just today on LKML by Ingo Molnar and Thomas Gleixner (x86 kernel maintainers). They think their design is superior to that of the perfmon API. Looks like there is yet another battle to fight... I intend to respond tomorrow, but feel free to contribute to the LKML thread if you have a strong opinion on the matter.

---------- Forwarded message ----------
From: Thomas Gleixner <tg...@li...>
Date: Fri, Dec 5, 2008 at 12:44 AM
Subject: [patch 0/3] [Announcement] Performance Counters for Linux
To: LKML <lin...@vg...>
Cc: lin...@vg..., Andrew Morton <ak...@li...>, Ingo Molnar <mi...@el...>, Stephane Eranian <er...@go...>, Eric Dumazet <da...@co...>, Robert Richter <rob...@am...>, Arjan van de Veen <ar...@in...>, Peter Anvin <hp...@zy...>, Peter Zijlstra <a.p...@ch...>, Steven Rostedt <ro...@go...>, David Miller <da...@da...>, Paul Mackerras <pa...@sa...>

Performance counters are special hardware registers available on most modern CPUs. These registers count the number of certain types of hw events, such as instructions executed, cache misses suffered, or branches mis-predicted, without slowing down the kernel or applications. These registers can also trigger interrupts when a threshold number of events has passed, and can thus be used to profile the code that runs on that CPU.

We'd like to announce a brand new implementation of performance counter support for Linux. It is a very simple and extensible design that has the potential to implement the full range of features we would expect from such a subsystem.

The Linux Performance Counter subsystem (implemented via the patches posted in this announcement) provides an abstraction of performance counter hardware capabilities. It provides per-task and per-CPU counters, and it provides event capabilities on top of those. The code is far from complete, but the basic approach is already there and stable.
The biggest missing detail is lowlevel support for non-Intel CPUs and older Intel CPUs; right now the code is implemented for Intel Core2 (and later) Intel CPUs that have the PERFMON CPU feature. (See below for a wider list of missing/upcoming features.)

We are aware of the perfmon3 patchset that has been submitted to lkml recently. Our patchset tries to achieve a similar end result, with a fundamentally different (and we believe, superior :-) design:

- The API is based on a single counter abstraction.

- Only one single new system call is needed: sys_perf_counter_open(). All performance-counter operations are implemented via standard VFS APIs such as read() / fcntl() and poll().

- User-space is not exposed to lowlevel details like contexts or arrays of counters. Opening and reading a basic counter is as simple as 2 lines of C code:

    int main(void)
    {
            unsigned long long count;
            ssize_t ret;
            int fd;

            fd = perf_counter_open(3 /* PERF_COUNT_CACHE_MISSES */, 0, 0, 0, -1);
            ret = read(fd, &count, sizeof(count));
            if (ret == sizeof(count))
                    printf("Current count: %llu cache misses!\n", count);
            return 0;
    }

- Events and blocking/sleep are natural built-in properties of counters.

- No interaction with ptrace: any task (with sufficient permissions) can monitor other tasks, without having to stop that task.

- Mapping of counters to hw counters is not static; counters are scheduled dynamically on each CPU where a task runs.

- There's a /sys based reservation facility that allows the allocation of a certain number of hw counters for guaranteed sysadmin access.

- Generalized enumeration for common hw event types. Raw event codes can be passed to the API too, but the most common (and most useful) event codes are generalized into a hardware-independent registry of events:

    enum hw_event_types {
            PERF_COUNT_CYCLES,
            PERF_COUNT_INSTRUCTIONS,
            PERF_COUNT_CACHE_REFERENCES,
            PERF_COUNT_CACHE_MISSES,
            PERF_COUNT_BRANCH_INSTRUCTIONS,
            PERF_COUNT_BRANCH_MISSES,
    };

- Simplified lowlevel/arch support.
The x86 code for Intel CPUs (with the PERFMON CPU feature) is 340 lines of code that implements 7 straightforward lowlevel API calls:

    int  hw_perf_counter_init(struct perf_counter *counter, u32 hw_event_type);
    void hw_perf_counter_enable(struct perf_counter *counter);
    void hw_perf_counter_disable(struct perf_counter *counter);
    void hw_perf_counter_read(struct perf_counter *counter);
    void hw_perf_counter_enable_config(struct perf_counter *counter);
    void hw_perf_counter_disable_config(struct perf_counter *counter);
    void hw_perf_counter_setup(void);

There's one kernel/perf_counter.c core file, and a single arch/x86/kernel/cpu/perf_counter.c architecture support file. The impact on the kernel tree is relatively moderate:

    27 files changed, 1641 insertions(+), 7 deletions(-)

TODO:

- Non-Intel CPU support. Help is welcome :-)
- Round-robin scheduling of counters, when there are more task counters than hw counters available.
- Support for extended record types such as PEBS.
- Support for NMI events in the x86 code (the core design is ready).
- Make sure it works well with OProfile and the x86 NMI watchdog.

Short documentation is available in Documentation/perf-counters.txt

Find below the source of a simple monitoring demo. We'd like to seek the feedback of perfmon developers and architecture maintainers: what do you think about this approach?
Comments, reports, suggestions, flames and other types of feedback are more than welcome,

        Thomas, Ingo

---

/*
 * Performance counters monitoring test case
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <getopt.h>
#include <fcntl.h>
#include <stdio.h>
#include <errno.h>

#define __user
#include "sys.h"

static int count = 10000;
static int eventid;
static int tid;
static char *debuginfo;

static void display_help(void)
{
        printf("monitor\n");
        printf("Usage:\n"
               "monitor options threadid\n\n"
               "-e EID  --eventid=EID  eventid\n"
               "-c CNT  --count=CNT    event count on which IP is sampled\n"
               "-d FILE --debug=FILE   path to binary file with debug info\n");
        exit(0);
}

static void process_options(int argc, char *argv[])
{
        int error = 0;

        for (;;) {
                int option_index = 0;
                /** Options for getopt */
                static struct option long_options[] = {
                        {"count", required_argument, NULL, 'c'},
                        {"debug", required_argument, NULL, 'd'},
                        {"eventid", required_argument, NULL, 'e'},
                        {"help", no_argument, NULL, 'h'},
                        {NULL, 0, NULL, 0}
                };
                int c = getopt_long(argc, argv, "c:d:e:", long_options,
                                    &option_index);
                if (c == -1)
                        break;
                switch (c) {
                case 'c': count = atoi(optarg); break;
                case 'd': debuginfo = strdup(optarg); break;
                case 'e': eventid = atoi(optarg); break;
                default: error = 1; break;
                }
        }
        if (error || optind == argc)
                display_help();

        tid = atoi(argv[optind]);
}

int main(int argc, char *argv[])
{
        char str[256];
        uint64_t ip;
        ssize_t res;
        int fd;

        process_options(argc, argv);

        fd = perf_counter_open(eventid, count, 1, tid, -1);
        if (fd < 0) {
                perror("Create counter");
                exit(-1);
        }

        while (1) {
                res = read(fd, (char *) &ip, sizeof(ip));
                if (res != sizeof(ip)) {
                        perror("Read counter");
                        break;
                }
                if (!debuginfo) {
                        printf("IP: 0x%016llx\n", (unsigned long long)ip);
                } else {
                        sprintf(str, "addr2line -e %s 0x%llx\n",
                                debuginfo, (unsigned long long)ip);
                        system(str);
                }
        }
        close(fd);
        exit(0);
}
From: stephane e. <er...@go...> - 2008-12-06 02:36:43
Hello,

I have been reading all the threads after this unexpected announcement of a competing proposal for an interface to access the performance counters. I would like to respond to some of the things I have seen.

* ptrace: as Paul just pointed out, ptrace() is a limitation of the current perfmon implementation. This is not a limitation of the interface, as has been insinuated earlier. In my mind, this does not justify starting from scratch. There is nothing that precludes removing ptrace and using the IPI to chase down the PMU state, like you are doing. And in fact I believe we can do it more efficiently, because we would potentially collect multiple values in one IPI, something your API cannot allow because it is single-event oriented.

* There is more to perfmon than what you have looked at on LKML. There is advanced sampling support with a kernel level buffer which is remapped to user space. So there is no such thing as a couple of ptrace() calls per sample. In fact, there is zero-copy export to user space. In the case of PEBS, there is even zero-copy from HW to user space.

* The proposed API exposes events as individual entities. To measure N events, you need N file descriptors. There is no coordination of actions between the various events. If you want to start/stop all events, it seems you have to close the file descriptors and start over. That is not how people use this, especially people doing self-monitoring. They want to start/stop around critical loops or functions, and they want this to be fast.

* To read N events you need N syscalls and potentially N IPIs. There is no guarantee of atomicity between the reads. The argument of raising the priority to prevent preemption is bogus and unrealistic. We want regular users to be able to measure their own applications without having to have special privileges. This is especially impractical when you want to read from another thread. It is important to get a view of the counters that is as consistent as possible, and for that you want to read the registers as closely as possible to each other.

* As mentioned by Paul and Corey, the API inevitably forces the kernel to know about ALL the events and how they map onto counters. People who have been doing this in userland, and I am one of them, can tell you that this is a very hard problem. Looking at it just on the Intel and AMD x86 is misleading. It is not the number of events that matters, even if it contributes to the kernel bloat; it is managing the constraints between events (event A and B cannot be measured together; if event A uses counter X then B cannot be measured on counter Y). Sometimes, the value of a config register depends on which register you load it on. With the proposed API, all this complexity would have to go in the kernel. I don't think it belongs there, and it will lead to maintenance problems and longer delays to enable support of new hardware. The argument for doing this was that it would facilitate writing tools. But all that complexity does not belong in the tools but in a user library. This is what libpfm is designed for, and it has worked nicely so far. The role of the kernel is to control access to the PMU resource and to make sure incorrect programming of the registers cannot crash the kernel. If you do this, then providing support for new hardware is for the most part simply exposing the registers. Something which can even be discovered automatically on newer processors, e.g., ones supporting Intel architectural perfmon.

* Tools usually manage monitoring as a session. There was criticism about the perfmon context abstraction and vectors. A context is merely a synonym for session. I believe having a file descriptor per session is a natural thing to have. Vectors are used to access multiple registers in one syscall. Vectors have variable sizes; it depends on what you want to access. The size is not mandated by the number of registers of the underlying hardware.

* As mentioned by Paul, with certain PMUs, it is not possible to solve the event -> counter problem without having a global view of all the events. Your API being single-event oriented, it is not clear to me how this can be solved.

* It is not because you run a per-thread session that you should be limited to measuring at priv level 3.

* Modern PMUs, including AMD Barcelona and Itanium2, expose more than counters. Any API that assumes PMUs export only counters is going to be limited, e.g., Oprofile. Perfmon does not make that mistake: the interface does not know anything about counters nor sampling periods. It sees registers with values you can read or write. That has allowed us to support advanced features such as the Itanium2 opcode filter, Itanium2 code/data range restrictions (hosted in debug regs), AMD Barcelona IBS which has no event associated with it, the Itanium2 BranchTraceBuffer, Intel Core 2 LBR, and the Intel Core i7 uncore PMU. Some of those features have no ties with counters; they do not even overflow (e.g., LBR). They must be used in combination with counters. I don't think you will be able to do this with your API.

* With regards to sampling, advanced users have long been collecting more than just the IP. They want to collect the values of other PMU registers or even values of other non-PMU resources. With your API, it seems for every new need, you'd have to create a new perf_record_type, which translates into a kernel patch. This is not what people want. With perfmon, you have a choice of doing user level sampling (the user gets a notification for each sample), but you can also use a kernel sampling buffer. In that case, you can express what you want recorded in the buffer using simple bitmasks of PMU registers. There is no predefined set, no kernel patch. To make this even more flexible, the buffer format is not part of the interface: you can define your own and record whatever you want in whatever format you want. All is provided by kernel modules. You want a double buffer or a cyclic buffer? Just add your kernel module. It seems this feature has been overlooked by LKML reviewers, but it is really powerful.

* It is not clear to me how you would add a sampling buffer and remapping using your API, given the number of file descriptors you will end up using and the fact that you do not have the notion of a session.

* When sampling, you want to freeze the counters on overflow to get an as-consistent-as-possible view. There is no such guarantee in your API nor implementation. On some hardware platforms, e.g., Itanium, you have no choice: this is the behavior.

* Multiple counters can overflow at the same time and generate a single interrupt. With your approach, if two counters overflow simultaneously, then you need to enqueue two messages, yet only one SIGIO will be generated, it seems. I wonder how that works when self-monitoring.

In summary, although the idea of simplifying tools by moving the complexity elsewhere is legitimate, pushing it down to the kernel is the wrong approach in my opinion; perfmon has avoided that as much as possible for good reasons. We have shown, with libpfm, that a large part of the complexity can easily be encapsulated into a user library. I also don't think the approach of managing events independently of each other works for all processors. As pointed out by others, there are other factors at stake and they may not even be on the same core.

S. Eranian
From: Dan T. <ter...@ee...> - 2008-12-08 02:22:26
I'm reminded of the quote attributed to Einstein: "Make things as simple as possible, but no simpler". In that regard, it appears that Stephane's perfmon is closer to the mark than this proposal. If Stephane's observations below are even close to correct, it would make PAPI's first-person event-set caliper model essentially useless. We must be able to start and stop multiple counter values simultaneously and quickly for any validity to be inferred, even for derived measurements as simple as instructions-per-cycle.

dan terpstra
for the PAPI team

> -----Original Message-----
> From: stephane eranian [mailto:er...@go...]
> Sent: Friday, December 05, 2008 9:37 PM
> To: Thomas Gleixner
> Cc: lin...@vg...; Peter Zijlstra; David Miller; LKML; Steven
> Rostedt; Eric Dumazet; Paul Mackerras; Peter Anvin; Andrew Morton; Ingo
> Molnar; perfmon2-devel; Arjan van de Veen
> Subject: Re: [perfmon2] [patch 0/3] [Announcement] Performance Counters
> forLinux
>
> Hello,
>
> I have been reading all the threads after this unexpected announcement
> of a competing proposal for an interface to access the performance
> counters.
> I would like to respond to some of the things I have seen.
>
> * ptrace: as Paul just pointed out, ptrace() is a limitation of the
> current perfmon implementation. This is not a limitation of the
> interface as has been insinuated earlier. In my mind, this does
> not justify starting from scratch. There is nothing that precludes
> removing ptrace and using the IPI to chase down the PMU state,
> like you are doing. And in fact I believe we can do it more efficiently
> because we would potentially collect multiple values in one IPI,
> something your API cannot allow because it is single event oriented.
>
> * There is more to perfmon than what you have looked at on LKML. There
> is advanced sampling support with a kernel level buffer which is
> remapped to user space. So there is no such thing as a couple of
> ptrace() calls per sample. In fact, there is zero copy export to user
> space. In the case of PEBS, there is even zero-copy from HW to user
> space.
>
> * The proposed API exposes events as individual entities. To measure N
> events, you need N file descriptors. There is no coordination of
> actions between the various events. If you want to start/stop all
> events, it seems you have to close the file descriptors and start
> over. That is not how people use this, especially people doing self
> monitoring. They want to start/stop around critical loops or functions
> and they want this to be fast.
>
> * To read N events you need N syscalls and potentially N IPIs. There
> is no guarantee of atomicity between the reads. The argument of raising
> the priority to prevent preemption is bogus and unrealistic. We want
> regular users to be able to measure their own applications without
> having to have special privileges. This is especially unpractical when
> you want to read from another thread. It is important to get a view of
> the counters that is as consistent as possible and for that you want
> to read the registers as closely as possible from each other.
>
> * As mentioned by Paul, Corey, the API inevitably forces the kernel to
> know about ALL the events and how they map onto counters. People who
> have been doing this in userland, and I am one of them, can tell you
> that this is a very hard problem. Looking at it just on the Intel and
> AMD x86 is misleading. It is not the number of events that matters,
> even it contributes to the kernel bloat, it is managing the
> constraints between events (event A and B cannot be measured together,
> if event A uses counter X then B cannot be measured on counter Y).
> Sometimes, the value of a config register depends on which register
> you load it on. With the proposed API, all this complexity would have
> to go in the kernel. I don't think it belongs here and it will leads
> to maintenance problems, and longer delays to enable support of new
> hardware. The argument for doing this was that it would facilitate
> writing tools. But all that complexity does not belong in the tools
> but in a user library. This is what libpfm is designed for and it has
> worked nicely so far. The role of the kernel is to control access to
> the PMU resource and to make sure incorrect programming of the
> registers cannot crash the kernel. If you do this, then providing
> support for new hardware is for the most part simply exposing the
> registers. Something which can even be discovered automatically on
> newer processors, e.g., ones supporting Intel architectural perfmon.
>
> * Tools usually manage monitoring as a session. There was criticism
> about the perfmon context abstraction and vectors. A context is merely
> a synonym for session. I believe having a file descriptor per session
> is a natural thing to have. Vectors are used to access multiple
> registers in one syscall. Vectors have variable sizes, it depends on
> what you want to access. The size is not mandated by the number of
> registers of the underlying hardware.
>
> * As mentioned by Paul, with certain PMUs, it is not possible to solve
> the event -> counter problem without having a global view
> of all the events. Your API being single-event oriented, it is not
> clear to me how this can be solved.
>
> * It is not because you run a per thread session, that you should be
> limited to measuring at priv level 3.
>
> * Modern PMU, including AMD Barcelona, Itanium2, expose more than
> counters. Any API that assumes PMU export only counters is going to be
> limited, e.g. Oprofile. Perfmon does not make that mistake, the
> interface does not know anything about counters nor sampling periods.
> It sees registers with values you can read or write. That has allowed
> us to support advanced features such as Itanium2 Opcode filter,
> Itanium2 Code/Data range restrictions (hosted in debug regs), AMD
> Barcelona IBS which has no event associated with it, Itanium2
> BranchTraceBuffer, Intel Core 2 LBR, Intel Core i7 uncore PMU.
> Some of those features have no ties with counters, they do not even
> overflow (e.g., LBR). They must be used in combination with
> counters. I don't think you will be able to do this with your API.
>
> * With regards to sampling, advanced users have long been collecting
> more than just the IP. They want to collect the values of other
> PMU registers or even values of other non-PMU resources. With your
> API, it seems for every new need, you'd have to create a new
> perf_record_type, which translates into a kernel patch. This is not
> what people want. With perfmon, you have a choice of doing user
> level sampling (users gets notification for each sample) but you can
> also use a kernel sampling buffer. In that case, you can express
> what you want recorded in the buffer using simple bitmasks of PMU
> registers. There is no predefined set, no kernel patch.
> To make this even more flexible the buffer format is not part of the
> interface, you can define your own and record whatever you want
> in whatever format you want. All is provided by kernel modules. You
> want double-buffer, cyclic buffer, just add your kernel module. It
> seems this feature has been overlooked by LKML reviewers but it is
> really powerful.
>
> * It is not clear to me how you would add a sampling buffer and
> remapping using your API given the number of file descriptors you will
> end up using and the fact that you do not have the notion of a session.
>
> * When sampling, you want to freeze the counters on overflow to get an
> as consistent as possible view. There is no such guarantee in
> your API nor implementation. On some hardware platforms, e.g.,
> Itanium, you have no choice, this is the behavior.
>
> * Multiple counters can overflow at the same time and generate a
> single interrupt. With your approach, if two counters overflow
> simultaneously, then you need to enqueue two messages, yet only
> one SIGIO will be generated, it seems. Wonder how that works when
> self-monitoring.
>
> In summary, although the idea of simplifying tools by moving the
> complexity elsewhere is legitimate, pushing it down to the kernel
> is the wrong approach in my opinion, perfmon has avoided that as much
> as possible for good reasons. We have shown, with libpfm,
> that a large part of complexity can easily be encapsulated into a user
> library. I also don't think the approach of managing events
> independently of each other works for all processors. As pointed out
> by others, there are other factors at stake and they may not
> even be on the same core.
>
> S. Eranian
>
> ------------------------------------------------------------------------------
> SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
> The future of the web can't happen without you. Join us at MIX09 to help
> pave the way to the Next Web now. Learn more and register at
> http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
> _______________________________________________
> perfmon2-devel mailing list
> per...@li...
> https://lists.sourceforge.net/lists/listinfo/perfmon2-devel
From: stephane e. <er...@go...> - 2008-12-09 01:50:38
Hi,

Forgot to cc the list.

---------- Forwarded message ----------
From: stephane eranian <er...@go...>
Date: Tue, Dec 9, 2008 at 1:21 AM
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux
To: Ingo Molnar <mi...@el...>
Cc: Paul Mackerras <pa...@sa...>, Peter Zijlstra <a.p...@ch...>, Thomas Gleixner <tg...@li...>, LKML <lin...@vg...>, lin...@vg..., Andrew Morton <ak...@li...>, Eric Dumazet <da...@co...>, Robert Richter <rob...@am...>, Arjan van de Veen <ar...@in...>, Peter Anvin <hp...@zy...>, Steven Rostedt <ro...@go...>, David Miller <da...@da...>

On Mon, Dec 8, 2008 at 12:11 PM, Ingo Molnar <mi...@el...> wrote:
>
> * stephane eranian <er...@go...> wrote:
>
>> Let me explain the HW complexity a bit. It's all a matter of tradeoffs.
>> I have regular discussions with the PMU design architects about this.
>> If you talk to them, then you understand the environment they have to
>> live in and you understand why those constraints are there. The key
>> point to understand is that the PMU is never critical to the chip. The
>> chip can work well without. The real-estate on the chip is always very
>> tight. PMU is a 2nd class citizen, thus low in the priority list. [...]
>
> The chip designers i talk to with my scheduler maintainer hat on do point
> out that performance monitoring is (of course) in the critical path of
> any chip, and hence its overhead and impact on the gate count of various
> critical components of the CPU core and its impact on the power envelope
> must be kept very low.
>

You have a talent for turning people's arguments into something else. You dropped my example about the wire limitation. It was describing my point about constraints and the PMU as a 2nd-class citizen. I'd rather have a new, constrained PMU feature than no new feature at all. You also seem to limit your world to x86; you have to look beyond, to Itanium and Power, for instance.
I know quite well that the PMU is used for debugging internally and early on, so don't lecture me on this! I have participated in the architectural design of some.

> Nevertheless, the same chip designers rely on performance counters on a
> daily basis to plan their next-gen chip. They very much want them to work
> fine, and they work hard on making them relevant and easy to use. Often
> the performance counters are the _only_ real cheap hands-on insight into
> the dynamic situation of a modern CPU core, even for hw designers.
>

Like I did not know that?

> And all the current hw trends show that it's not just talk but action as
> well: the Core2 PMCs are already much saner (less constrained) than the
> P4 ones, and now they even expanded on them: Nehalem / Core i7 doubled
> the number of generic PMCs from two to four.
>

You think I am not aware of that? I know that quite well because I talk to the PMU architects on a regular basis, trying to get them to add new features and make the PMU easier to manage. And I make sure I broaden my horizon beyond x86. And yes, the PMU is becoming more and more critical and a true value-add. That's good for end-users, as long as the new features can be exposed.

> So, contrary to your suggestion, chip designers very much care about

You did not get my point, but I am not surprised...

> performance counters and they are working very hard to make this stuff
> useful to us. [ Yes, there are constraints even with generic counters
> (for example you only want a single line towards a PMC register from
> divider units), but the number of cross-counter constraints and their
> relevance is decreasing, not increasing. ]
>
> Anyway ... i think your reply highlights why the fundamental premise of
> your patchset is so wrong: i believe you have designed your code and APIs
> at the wrong level by (paradoxically) assuming in essence that
> performance counters do not matter in the general scheme of things. (!)
> So you introduced limited, special-purpose but still quite complex APIs

That's not a valid argument! Perfmon, unlike any other existing API, has exposed all the advanced features of all existing PMU models, and across multiple architectures.

> that tailored the ABIs to intricate low level details of PMUs. I see an
> explosion in complexity due to that incorrect design choice: too many

Your current API does not offer access to any of the advanced features of x86, like PEBS, IBS, LBR and others, let alone those on the other architectures. So again your arguments are unfounded.

> syscalls, too broad interaction between core code and architecture code,
> and too little practical utility in the end.
>

I think the number of syscalls is irrelevant; that's not how I measure the usefulness of an API. What matters is the functionality. Any performance monitoring API should let you:

- create a session
- program the registers
- start and stop on demand, as many times as you want
- attach to a thread or CPU
- read the register values
- use advanced support for event-based sampling

> We did what we believe to be the right thing: we gave performance
> counters the proper high-level abstraction they _deserve_, and we made
> performance counters a prime-time Linux citizen as well.
>

You have no validation to prove you chose the right level. As if the perfmon project did not put the PMU on the forefront. Who is going to buy that?
From: Rob F. <rj...@re...> - 2008-12-10 16:29:29
My reaction is more from a downstream tool developer and end user perspective. What I don't see in the new proposal is support for real end users of hardware performance counter information.

There is a long-existing community that is using the counters, including the hardware designers, driver writers, tool developers, and performance tuning specialists working for both vendors and end customers. Not everyone is in the same camp, as the hardware capabilities change from revision to revision of the chips: features are added, architectures evolve, and implementations are cleaned up. System vendors have their own tools and developers (SpeedShop, VTune, Tprof, Sun Studio, CodeAnalyst, etc.). There are academic and open source efforts with long histories (PAPI, oprofile, HPCToolkit (Rice, not IBM), etc.). We've lived with proprietary drivers/APIs and with a succession of open-source drivers (pci, perfctr, oprofile, perfmon). (My apologies to readers/developers whose favorite tool(s) I haven't mentioned.) Out-and-out religious wars have not erupted, but there are a lot of healthy disagreements.

A significant part of this community has been converging around Perfmon2/3, not because it is a thing of beauty, but because it is a tool that exposes the full HPM capabilities (which are often ugly) in a useful way for a community of tool developers and end users.

Before considering this new proposal seriously, I'd need to see it proven. This means that it needs to be developed, by the proposers, enough to be used seriously. I've got collaborators that measure compute resources in units of tens of teraFLOP-years, so my definition of "seriously" is that the HPM tool chain has to work with low overhead on huge clusters of multi-core, multi-socket machines, and it has to be able to provide performance insights that will let us get even more performance out of applications that already do pretty well. Google and other large users have similar notions of "serious".
Here's my set of strawman requirements:

-- Can it support a *completely* functional PAPI? There are a lot of tools (HPCToolkit, TAU, etc.) built on this layer.

-- Does it provide the means to support IBS/EBS profiling and efficiently record execution contexts? Can it support event-based call stack profiling?

-- Can it supplant or support oprofile by supporting the tools (CodeAnalyst, etc.) that depend on it?

-- Kernel and daemon profiling capabilities?

-- Does it have sufficiently low overhead? Six years ago DCPI/ProfileMe was capable of collecting around 5000 samples/second on a quad-socket 1GHz Alpha EV67 system with about a 1.5% overhead. That's the gold standard. Oprofile and pfmon are not far off that mark.

-- Does it even scale within one box? My workhorse systems today are quad-socket Barcelonas. I'm reliably using multiple, cooperating instances of pfmon (some measure on-core, others measure off-core events) to collect profiles using all 64 (4 per core x 16 cores) counters productively with low overhead. Real soon now I will have similar expectations regarding multi-socket Nehalems, where the resources will be 7 (heterogeneous) counters per core plus 8 "uncore" counters (I prefer "nest", Alex Mericas' terminology) per socket.

Regards,
Rob

stephane eranian wrote:
> Hello,
>
> I have been reading all the threads after this unexpected announcement
> of a competing proposal for an interface to access the performance counters.
> I would like to respond to some of the things I have seen.
>
<<<<<< Details of Stephane's comments elided >>>>>>
>
> In summary, although the idea of simplifying tools by moving the
> complexity elsewhere is legitimate, pushing it down to the kernel
> is the wrong approach in my opinion, perfmon has avoided that as much
> as possible for good reasons. We have shown, with libpfm,
> that a large part of complexity can easily be encapsulated into a user
> I also don't think the approach of managing events
> independently of each other works for all processors. As pointed out
> by others, there are other factors at stake, and the events may not
> even be on the same core.
>
> S. Eranian
>
> ------------------------------------------------------------------------------
> SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
> The future of the web can't happen without you. Join us at MIX09 to help
> pave the way to the Next Web now. Learn more and register at
> http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
> _______________________________________________
> perfmon2-devel mailing list
> per...@li...
> https://lists.sourceforge.net/lists/listinfo/perfmon2-devel

--
Robert J. Fowler
Chief Domain Scientist, HPC
Renaissance Computing Institute
The University of North Carolina at Chapel Hill
100 Europa Dr, Suite 540
Chapel Hill, NC 27517
V: 919.445.9670 F: 919.445.9669
rj...@re...
|
From: Andi K. <an...@fi...> - 2008-12-10 17:10:52
|
Rob Fowler <rj...@re...> writes:
>
> -- Can it supplant or support oprofile by supporting the tools (Code Analyst, etc.) that
>    depend on it?

There's no real need to supplant or support oprofile: at least in the short term, oprofile is not going away.

-Andi
--
ak...@li...
|
From: Corey J A. <cja...@us...> - 2008-12-05 09:30:05
|
At first glance:

- This would push all of the libpfm type of code into the kernel, where it's harder to maintain.

- It would be slower to read multiple counters, requiring one syscall per counter.

- How does it handle sampling and interrupt on overflow?

- If scheduling of the counters is dynamic, I think that means there's more overhead on every task switch.

- There's a pile of TODOs there for code which is already written and working in perfmon (full).

I'll look at this some more tomorrow.

Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
cja...@us...

"stephane eranian" <er...@go...> wrote on 12/04/2008 11:14:39 PM:

> Hello everyone,
>
> Here is another competing perfmon API proposal posted just today on LKML
> by Ingo Molnar, and Thomas Gleixner (x86 kernel maintainers).
>
> They think their design is superior to that of the perfmon API.
> Looks like there is yet another battle to fight...
>
> I intend to respond tomorrow but feel free to contribute to the LKML thread
> if you have a strong opinion on the matter.
>
>
> ---------- Forwarded message ----------
> From: Thomas Gleixner <tg...@li...>
> Date: Fri, Dec 5, 2008 at 12:44 AM
> Subject: [patch 0/3] [Announcement] Performance Counters for Linux
> To: LKML <lin...@vg...>
> Cc: lin...@vg..., Andrew Morton
> <ak...@li...>, Ingo Molnar <mi...@el...>, Stephane
> Eranian <er...@go...>, Eric Dumazet <da...@co...>,
> Robert Richter <rob...@am...>, Arjan van de Veen
> <ar...@in...>, Peter Anvin <hp...@zy...>, Peter Zijlstra
> <a.p...@ch...>, Steven Rostedt <ro...@go...>, David
> Miller <da...@da...>, Paul Mackerras <pa...@sa...>
>
>
> Performance counters are special hardware registers available on most modern
> CPUs. These registers count the number of certain types of hw events, such
> as instructions executed, cache misses suffered, or branches mis-predicted,
> without slowing down the kernel or applications.
> These registers can also
> trigger interrupts when a threshold number of events has passed - and can
> thus be used to profile the code that runs on that CPU.
>
> We'd like to announce a brand new implementation of performance counter
> support for Linux. It is a very simple and extensible design that has the
> potential to implement the full range of features we would expect from such
> a subsystem.
>
> The Linux Performance Counter subsystem (implemented via the patches
> posted in this announcement) provides an abstraction of performance counter
> hardware capabilities. It provides per task and per CPU counters, and it
> provides event capabilities on top of those.
>
> The code is far from complete - but the basic approach is already there
> and stable.
>
> The biggest missing detail is lowlevel support for non-Intel CPUs and
> older Intel CPUs - right now the code is implemented for Intel Core2
> (and later) Intel CPUs that have the PERFMON CPU feature. (See below
> for a wider list of missing/upcoming features.)
>
> We are aware of the perfmon3 patchset that has been submitted to lkml
> recently. Our patchset tries to achieve a similar end result, with
> a fundamentally different (and we believe, superior :-) design:
>
> - The API is based on a single counter abstraction
>
> - Only one single new system call is needed: sys_perf_counter_open().
>   All performance-counter operations are implemented via standard
>   VFS APIs such as read() / fcntl() and poll().
>
> - User-space is not exposed to lowlevel details like contexts or
>   arrays of counters. Opening and reading a basic counter is as simple
>   as 2 lines of C code:
>
>   int main(void)
>   {
>           u64 count;
>           int fd;
>           ssize_t ret;
>
>           fd = perf_counter_open(3 /* PERF_COUNT_CACHE_MISSES */, 0, 0, 0, -1);
>           ret = read(fd, &count, sizeof(count));
>           if (ret == sizeof(count))
>                   printf("Current count: %Ld cachemisses!", count);
>           return 0;
>   }
>
> - Events, blocking/sleep are natural built-in properties of counters.
>
> - No interaction with ptrace: any task (with sufficient permissions) can
>   monitor other tasks, without having to stop that task.
>
> - Mapping of counters to hw counters is not static - counters are
>   scheduled dynamically on each CPU where a task runs.
>
> - There's a /sys based reservation facility that allows the allocation
>   of a certain number of hw counters for guaranteed sysadmin access.
>
> - Generalized enumeration for common hw event types. Raw event codes
>   can be passed to the API too - but the most common (and most useful)
>   event codes are generalized into a hardware-independent registry
>   of events:
>
>   enum hw_event_types {
>           PERF_COUNT_CYCLES,
>           PERF_COUNT_INSTRUCTIONS,
>           PERF_COUNT_CACHE_REFERENCES,
>           PERF_COUNT_CACHE_MISSES,
>           PERF_COUNT_BRANCH_INSTRUCTIONS,
>           PERF_COUNT_BRANCH_MISSES,
>   };
>
> - Simplified lowlevel/arch support. The x86 code for Intel CPUs (with
>   the PERFMON CPU feature) is 340 lines of code that implements
>   7 straightforward lowlevel API calls:
>
>   int hw_perf_counter_init(struct perf_counter *counter, u32 hw_event_type);
>   void hw_perf_counter_enable(struct perf_counter *counter);
>   void hw_perf_counter_disable(struct perf_counter *counter);
>   void hw_perf_counter_read(struct perf_counter *counter);
>   void hw_perf_counter_enable_config(struct perf_counter *counter);
>   void hw_perf_counter_disable_config(struct perf_counter *counter);
>   void hw_perf_counter_setup(void);
>
> There's one kernel/perf_counter.c core file, and a single
> arch/x86/kernel/cpu/perf_counter.c architecture support file.
>
> The impact on the kernel tree is relatively moderate:
>
>   27 files changed, 1641 insertions(+), 7 deletions(-)
>
> TODO:
>
> - Non-Intel CPU support. Help is welcome :-)
>
> - Round-robin scheduling of counters, when there's more task counters
>   than hw counters available.
>
> - Support for extended record types such as PEBS.
>
> - Support for NMI events in the x86 code (the core design is ready)
>
> - Make sure it works well with OProfile and the x86 NMI watchdog
>
> Short documentation is available in Documentation/perf-counters.txt
>
> Find below the source of a simple monitoring demo.
>
> We'd like to seek the feedback of perfmon developers and architecture
> maintainers - what do you think about this approach?
>
> Comments, reports, suggestions, flames and other types of feedback
> are more than welcome,
>
> Thomas, Ingo
> ---
>
> /*
>  * Performance counters monitoring test case
>  */
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/time.h>
> #include <unistd.h>
> #include <stdint.h>
> #include <stdlib.h>
> #include <string.h>
> #include <getopt.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <errno.h>
>
> #define __user
>
> #include "sys.h"
>
> static int count = 10000;
> static int eventid;
> static int tid;
> static char *debuginfo;
>
> static void display_help(void)
> {
>         printf("monitor\n");
>         printf("Usage:\n"
>                "monitor options threadid\n\n"
>                "-e EID   --eventid=EID  eventid\n"
>                "-c CNT   --count=CNT    event count on which IP is sampled\n"
>                "-d FILE  --debug=FILE   path to binary file with debug info\n");
>         exit(0);
> }
>
> static void process_options(int argc, char *argv[])
> {
>         int error = 0;
>
>         for (;;) {
>                 int option_index = 0;
>                 /** Options for getopt */
>                 static struct option long_options[] = {
>                         {"count", required_argument, NULL, 'c'},
>                         {"debug", required_argument, NULL, 'd'},
>                         {"eventid", required_argument, NULL, 'e'},
>                         {"help", no_argument, NULL, 'h'},
>                         {NULL, 0, NULL, 0}
>                 };
>                 int c = getopt_long(argc, argv, "c:d:e:",
>                                     long_options, &option_index);
>                 if (c == -1)
>                         break;
>                 switch (c) {
>                 case 'c': count = atoi(optarg); break;
>                 case 'd': debuginfo = strdup(optarg); break;
>                 case 'e': eventid = atoi(optarg); break;
>                 default: error = 1; break;
>                 }
>         }
>         if (error || optind == argc)
>                 display_help();
>
>         tid = atoi(argv[optind]);
> }
> int main(int argc, char *argv[])
> {
>         char str[256];
>         uint64_t ip;
>         ssize_t res;
>         int fd;
>
>         process_options(argc, argv);
>
>         fd = perf_counter_open(eventid, count, 1, tid, -1);
>         if (fd < 0) {
>                 perror("Create counter");
>                 exit(-1);
>         }
>
>         while (1) {
>                 res = read(fd, (char *) &ip, sizeof(ip));
>                 if (res != sizeof(ip)) {
>                         perror("Read counter");
>                         break;
>                 }
>
>                 if (!debuginfo) {
>                         printf("IP: 0x%016llx\n", (unsigned long long)ip);
>                 } else {
>                         sprintf(str, "addr2line -e %s 0x%llx\n", debuginfo,
>                                 (unsigned long long)ip);
>                         system(str);
>                 }
>         }
>
>         close(fd);
>         exit(0);
> }
|