|
From: Matt W. <mw...@al...> - 2013-03-11 23:09:04
|
The xprof tool I proposed several weeks ago is available as a patch to valgrind-3.8.1.

https://code.google.com/p/valgrind-xprof/downloads/list

I'm assuming the chance of this being rolled into the valgrind distribution is near-zero.

Date: Wed, 20 Feb 2013 19:48:18 -0800
From: Matthew Wette <mw...@al...>
Subject: [Valgrind-developers] proposed new tool: xprof
To: val...@li...
Message-ID: <4D3...@al...>
Content-Type: text/plain; charset=us-ascii

Hi Folks,

I have been working on a new valgrind tool and want to get feedback on the approach and the chances of getting it rolled into the distribution. If it has potential, I'd also like feedback on ideas for user options, etc.

I'm calling the tool "xprof" (prefix "xp"). It is an execution profiler.

The context for tool use is the following: the user develops code on his desktop computer, but the downstream target is an embedded real-time application. He develops the code in a (physics-based) simulation of the target environment. For example, he develops a fuel-injection algorithm for an automobile engine. Early in the project, the embedded real-time group wants an estimate of the CPU utilization of his algorithm. The algorithm is difficult to run outside the context of the environment simulation, so he has trouble answering this question; typically, he can only make crude estimates. Enter valgrind/xprof. This tool would let the user quickly produce a better CPU loading estimate within his simulation environment by providing cycle-count estimates for specified regions of code. We are not after exact clock counts, but something better than crude flop estimates.
|
From: Rich C. <rc...@wi...> - 2013-03-12 01:23:22
|
On Mon, 11 Mar 2013 15:08:44 -0700 Matt Wette <mw...@al...> wrote:
> The xprof tool I proposed several weeks ago is available as a patch to valgrind-3.8.1.

I tried running the demo program and I got this result:

count: 1070 delta: 112
==16514==
==16514== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==16514==  Access not within mapped region at address 0x0
==16514==    at 0x400ED1D: _dl_fini (dl-fini.c:199)
==16514==    by 0x4C5D8B0: __run_exit_handlers (exit.c:78)
==16514==    by 0x4C5D934: exit (exit.c:100)
==16514==    by 0x4C4745B: (below main) (libc-start.c:258)

-- 
Rich Coe
rc...@wi...
|
From: Josef W. <Jos...@gm...> - 2013-03-12 10:51:46
|
Am 11.03.2013 23:08, schrieb Matt Wette:
> The xprof tool I proposed several weeks ago is available as a patch to
> valgrind-3.8.1.
>
> https://code.google.com/p/valgrind-xprof/downloads/list
>
> I'm assuming the chance of this being rolled into the valgrind
> distribution is near-zero.

I just had a look: I see that you assign each VEX IR a cycle penalty, and maintain a counter which increments according to the executed code.

The resulting cycle latency only makes sense if you have a very simple processor, without superscalarity, with in-order execution and no caches. The model does not include any throughput constraints or conflict penalties. Still, the tool goes a long way to cover every VEX IR. Wouldn't it be enough to have just a few instruction classes?

Passing the results back to the client application via a volatile variable really is weird (you are influencing what you measure). Why not a separate output? Or a return value of a client request? The only runtime information you really need is the execution count of SBs. Everything else can be done with post-processing, making the tool much faster.

Regarding any chance of merging: this is not my decision, but the current, separate tool seems to have limited value. A simplified version of your approach may be interesting as an add-on to cache simulation. Another suggestion: the lackey tool ("--detailed-counts=yes") prints out some statistics which could be refined to show counts for different AluOps, and to maintain a counter using cycle penalties. This way, it may provide the same results as your tool.

You already have it public on Google Code. Instead of a patch against a fixed Valgrind version, you could make a separate package out of it: you can detect the existence of an installed Valgrind in a configure script via pkg-config, and compile/link against the installed Valgrind headers/libraries.

Josef
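Josef's packaging suggestion could look roughly like this in a configure script. This is a sketch; it assumes the installed Valgrind provides a valgrind.pc pkg-config file (as valgrind -devel/-dev packages typically do):

```shell
#!/bin/sh
# Sketch: detect an installed Valgrind via pkg-config instead of
# patching a fixed Valgrind source tree.
if pkg-config --exists valgrind; then
    VALGRIND_CFLAGS=$(pkg-config --cflags valgrind)
    VALGRIND_LIBS=$(pkg-config --libs valgrind)
    echo "valgrind found: CFLAGS=$VALGRIND_CFLAGS LIBS=$VALGRIND_LIBS"
else
    echo "error: valgrind development files not found" >&2
    exit 1
fi
```

A tool packaged this way builds against whatever Valgrind the user has installed, rather than against one pinned release like 3.8.1.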
|
From: Matt W. <mw...@al...> - 2013-03-13 02:30:58
|
Josef,

Thanks for taking the time to check this out.

You are correct that it does not capture a good picture of superscalar processors. I feel that adding that capability may be doable as a future improvement. One could have a table that provides some information on the latency and throughput of the pipelines, and maybe rules for parallel processing. However, what I'm after is estimates that are better than nothing or super-crude estimates.

One critical feature, for the use-case I'm looking to help, is the ability to estimate cycles for a specific part of the code. Expanding lackey will not do the trick. The only other way I could see would be to have command-line options to set up counters for specific functions. Still, I think this option provides the most flexibility.

Matt
|
From: Josef W. <Jos...@gm...> - 2013-03-13 10:00:42
|
Am 13.03.2013 03:30, schrieb Matt Wette:
> You are correct that it does not capture a good picture of superscalar processors.
> I feel that adding that capability may be doable as a future improvement. One could
> have a table that provides some information on the latency and throughput of the
> pipelines, and maybe rules for parallel processing. However, what I'm after is
> estimates that are better than nothing or super-crude estimates.

For SoCs/embedded, your approach may be fine. Modern general-purpose processors are a completely different game. There it is better to assume that everything is pipelined and that out-of-order execution works more or less perfectly, i.e. only throughputs and long-lasting events (memory accesses) are limiting.

> One critical feature, for the use-case I'm looking to help, is the ability to estimate
> cycles for a specific part of the code. Expanding lackey will not do the trick.

You could add two client requests to lackey to reset/read a counter. Still, it's a bit weird, because the implementation of the client request itself will influence your counter.

Your volatile variable approach is really broken: assume that the counter variable needs to be read with two read operations (lower/upper part). Now, if your tool increments the counter between reading the lower and upper half, sometimes you get bogus values.

> The only other way I could see would be to have command-line options to have counters
> for specific functions. Still, I think this option provides the most flexibility.

Client requests and command-line flags can often do the same things, and are a convenience for the user. You could have a client request which tells your tool to capture a cycle estimation for a given function.

Josef
|
From: Matt W. <mw...@al...> - 2013-03-14 02:22:42
|
On Mar 13, 2013, at 3:00 AM, Josef Weidendorfer wrote:
> Am 13.03.2013 03:30, schrieb Matt Wette:
>> One critical feature, for the use-case I'm looking to help, is the ability to estimate
>> cycles for a specific part of the code. Expanding lackey will not do the trick.
>
> You could add two client requests to lackey to reset/read a counter.
> Still it's a bit weird because the implementation of the client request
> itself will influence your counter.

I have added another interface to do exactly this (CLRCTR(), GETCTR()). This is how I started (using lackey), but I thought the volatile variable would be easier for the user. I think the cost of the counter requests will be in the noise. Still, there is the possibility of decrementing the counter during instrumentation (by checking for calls to XP client requests).

> Your volatile variable approach is really broken: Assume that the
> counter variable needs to be read with two read operations (lower/upper part).
> Now if your tool increments the counter between
> reading lower and upper half, sometimes you get bogus values.

I am looking into this. Note that the host accesses it only on SB exits, but I still believe this could be a problem.

Regarding using the entire Iop table: it is essential that I capture the clock-count differences between multiply/add and divide. (PPC multiplies in one clock, divides in 17 clocks.)

Matt
|
From: Josef W. <Jos...@gm...> - 2013-03-14 10:18:06
|
Am 14.03.2013 03:22, schrieb Matt Wette:
>> Your volatile variable approach is really broken: Assume that the
>> counter variable needs to be read with two read operations
>> (lower/upper part).
>> Now if your tool increments the counter between
>> reading lower and upper half, sometimes you get bogus values.
>
> I am looking into this. Note that the host accesses only on SB exits.

What do you mean by "host"? Your tool or the client code? I thought you instrument a dirty call which increments the counter after every original VEX IR...

> But still I believe this could be a problem.
>
> Regarding using the entire Iop table: it is essential that I capture the
> clock-count differences between multiply/add and divide. (PPC multiplies
> in one clock, divides in 17 clocks.)

Sure.

Josef
|
From: Matt W. <mw...@al...> - 2013-03-14 13:09:01
|
On Mar 14, 2013, at 3:17 AM, Josef Weidendorfer wrote:
> What do you mean by "host"? Your tool or the client code? I thought
> you instrument a dirty call which increments the counter after every
> original VEX IR...

I meant the tool. The instrumentation function counts instructions within a superblock and then adds a dirty call at each exit to update the user's counter (and then zeros the instrumentation-time counter).

Matt
|
From: Josef W. <Jos...@gm...> - 2013-03-14 13:29:08
|
Am 14.03.2013 14:08, schrieb Matt Wette:
> I meant the tool. The instrumentation function counts instructions within a
> superblock and then adds a dirty call at each exit to update the user's counter
> (and then zeros the instrumentation-time counter).

Ok, that's fine then.

Josef
|
From: Matt W. <mw...@al...> - 2013-03-14 13:17:48
|
On Mar 14, 2013, at 6:08 AM, Matt Wette wrote:
> I meant the tool. The instrumentation function counts instructions within a
> superblock and then adds a dirty call at each exit to update the user's counter
> (and then zeros the instrumentation-time counter).

Oops. I will need to trap the client requests also, to make sure the counter is updated.