Re: [Valgrind-developers] Details about forthcoming
cachegrind/callgrind patch adding hardware costs
From: Niall D. <ndo...@bl...> - 2013-04-11 15:11:13
Attachments:
smime.p7s
> Am 10.04.2013 22:02, schrieb Niall Douglas:
> > The patch is fairly minimal. At the start of cachegrind/callgrind, if
> > the cache configuration parameters were not supplied, it looks for a
> > file called 'cpucacheconfig.xml'. If that is present, it loads it. If
> > it is not present, it calls
>
> I think this behaviour should be guarded by a command line flag:
> e.g. on x86, the timing tests should not run when no cache parameter file
> is present, as it will be supplied via CPUID.

CPUID can't tell you cache/memory access latencies. The code is clever
enough to realize when CPUID has returned cache configuration and it
automatically skips the generic tests for cache configuration, running only
those tests determining latencies.

You may have meant that you want a command line flag that skips the latency
tests too? A cpucacheconfig.xml file with zeroed entries is sufficient. Is
this too non-obvious though? Perhaps an explicit command line flag would be
better.

> > desc: I1 cache: 32768 B, 64 B, 8-way associative, 153 picosec
> > desc: D1 cache: 32768 B, 64 B, 8-way associative, 153 picosec
> > desc: LL cache: 6291456 B, 64 B, 12-way associative, 1112 picosec
> > desc: main memory: 1427 picosec
> > desc: max instruct cost: 302 picosec
> > desc: min instruct cost: 101 picosec
>
> What are these times? It would be nice to be more descriptive here (or
> better in the manual).

These are estimated cost times for a cache line, so a D1 cache access costs
153 picosec, whereas an LL cache access costs 1112 picosec. Or, put another
way, an L1 cache miss costs 1112 picosec, an LL cache miss costs 1427
picosec, etc.

The max and min instruction costs are a test of CPU ILP. Basically, max
instruction cost is determined like this:

  for(many)
    a[x] += a[x-1]

This forces the CPU to stall waiting for the result of each prior
calculation. Min instruction cost looks more like this:

  for(many)
    a[x] += a[x+n]

This lets the CPU execute as many adds in parallel as it can.
You'll note the max instruction cost is ~300 picosec while the L1 cache
line is ~150 picosec for 64 bytes. I would assume Intel deliberately chose
that to ensure that a single arithmetic instruction adding across cache
lines can run at single-cycle speed.

Regarding the documentation, I am happy to update this once we have the
patch sorted out and into Legal. I don't need approval for documentation
patches, thankfully.

> Putting all this information behind "desc:" lines is fine for the
> *_annotate scripts and KCachegrind, as this information is just forwarded
> to the user without trying to parse it. But this also means that tools
> can not rely on it never to change, or to have a specific format...

I think your idea about custom event counters is a great one for specifying
in a guaranteed machine-readable way what these numbers mean. I'll have a
play with those, maybe today, hopefully tomorrow. I'm currently caught up
in GSoC mentor petitioning :(

Many thanks for your advice Josef.

Niall
Re: [Valgrind-developers] Details about forthcoming
cachegrind/callgrind patch adding hardware costs
From: Niall D. <ndo...@bl...> - 2013-04-11 20:34:19
Attachments:
smime.p7s
> Am 11.04.2013 17:10, schrieb Niall Douglas:
> >>> desc: I1 cache: 32768 B, 64 B, 8-way associative, 153 picosec
> >
> > Estimated cost times for a cache line, so a D1 cache access costs 153
> > picosec, whereas a LL cache access costs 1112 picosec.
>
> Ok. "0.153 ns hit latency" may be more descriptive.

Is "153 picosec hit latency" okay? I try to avoid floating point printing.
I think valgrind's sprintf now supports %f, but not %e. I had to write my
own %e formatter, which is a challenge without a C math library.

Also, the read_picosecond_timer() suggests picosec. As that's the minimum
resolution offered by POSIX, that seems the right unit.

> But these numbers are so small... I assume there is prefetching going on,
> and you are actually measuring the maximal bandwidth from core to a cache
> level. I would assume that an unpredictable LL access has a latency more
> in the range of 15-30 cycles (and not 1 ns).

As I mentioned, I do permute the cache line fetches by reversing the last
eight bits of the cache line indexes to make them appear random-ish. I
*really* wish I could just point you at the code, but it's still tied up in
bureaucracy.

> I get the impression that we always should do prefetch simulation, and
> add an event "predicted LL miss" so as not to be totally off with the
> time estimation.

It's certainly coming to that with Sandy Bridge and later on Intel. We're
not there yet on ARM.

> So this somehow is the latency of an add instruction, using registers. I
> suppose with a multiplication it would be slower, and with a division
> even more so... I.e. that does not really match with any definition of
> "max cost" for me ;-)

There is a bit of method in the madness there.
If I have read it correctly, according to
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.5816&rep=rep1&type=pdf
roughly speaking, as an average case, most general purpose code has an ILP
of around 1.0, so if you can calculate the clock speed, you more or less
get the average instruction cost.

So this raises the question: how do you portably calculate the clock speed?
And by portably, I mean no assembler and no OS-specific code (POSIX is
okay). It turns out this isn't easy. The best approach I could think of was
to calculate the cost of the add instruction, given that it is *probably*
reasonably likely to be the same as the clock speed on most contemporary
CPUs [1].

Therefore, when I use max instruction cost in my calculations, it's really
a proxy for average instruction cost based on a number of reasonable
assumptions.

[1]: The Qualcomm Krait is not one of these CPUs. It incurs a stochastic
non-linear latency on arithmetic instructions. In the end, what can you do
in this situation?

> > You'll note the max instruction cost is ~300 picosec while the L1
> > cache line is ~150 picosec for 64 bytes. I would assume Intel
> > deliberately chose that to ensure that a single arithmetic instruction
> > adding across cache lines can run at single cycle speed.
>
> Ah, so these results are from an Intel processor? 101 ps means 1 cycle
> with 10GHz. I wondered which ARM processor you were using to get these
> impressive figures. So it's more like 3.3GHz and 3 add instructions
> throughput per cycle.

Correct. Sorry if I confused you. As I must prepare a patch for you guys,
and one which must pass Legal, I ported the work from internal valgrind to
Linux valgrind, which runs inside a VM on my Intel work PC. Hence the
figures are for that machine. Main memory cost is particularly incorrect
inside a VM.

I'll backport the patch to internal valgrind once internal peer review
approves it. There's absolutely no reason it shouldn't work seamlessly on
ARM though.
BTW, this is the custom event I added:

desc: I1 cache: 32768 B, 64 B, 8-way associative, 157 picosec
desc: D1 cache: 32768 B, 64 B, 8-way associative, 157 picosec
desc: LL cache: 6291456 B, 64 B, 12-way associative, 1120 picosec
desc: main memory: 1362 picosec
desc: max instruct cost: 316 picosec
desc: min instruct cost: 107 picosec

event: EPpsec = 316 Ir + 1120 I1mr + 1120 D1mr + 1120 D1mw + 1362 ILmr + 1362 DLmr + 1362 DLmw
event: EPpsec : Estimated Possible Picosecs

I think that's a reasonable calculation, more or less.

I'll change "picosec" => "picosec hit latency" shortly.

Niall
Re: [Valgrind-developers] Details about forthcoming
cachegrind/callgrind patch adding hardware costs
From: Niall D. <ndo...@bl...> - 2013-04-17 19:46:20
Attachments:
smime.p7s
I thought the list might appreciate seeing the improved callgrind in
action:
https://plus.google.com/photos/103330209479934785641/albums/5867891891426048561

The patch is submitted for release. It should clear within six weeks.

Next step is a callgrind output to XML converter. Any complaints if I write
the callgrind output parser in Boost.Spirit, Josef? I wouldn't want it to
be a maintenance headache.

Niall

> -----Original Message-----
> From: Niall Douglas
> Sent: 11 April 2013 16:34
> To: val...@li...
> Subject: Re: [Valgrind-developers] Details about forthcoming
> cachegrind/callgrind patch adding hardware costs
>
> [remainder of quoted message trimmed; it duplicates the 2013-04-11
> 20:34 email in full]
Re: [Valgrind-developers] Details about forthcoming
cachegrind/callgrind patch adding hardware costs
From: Josef W. <Jos...@gm...> - 2013-04-11 17:20:32
Am 11.04.2013 17:10, schrieb Niall Douglas:
>>> desc: I1 cache: 32768 B, 64 B, 8-way associative, 153 picosec
>
> Estimated cost times for a cache line, so a D1 cache access costs 153
> picosec, whereas a LL cache access costs 1112 picosec.

Ok. "0.153 ns hit latency" may be more descriptive.

But these numbers are so small... I assume there is prefetching going on,
and you are actually measuring the maximal bandwidth from core to a cache
level. I would assume that an unpredictable LL access has a latency more in
the range of 15-30 cycles (and not 1 ns).

I get the impression that we always should do prefetch simulation, and add
an event "predicted LL miss" so as not to be totally off with the time
estimation.

> The max and min instruction cost is a test for CPU ILP, so basically it
> determines max instruction cost like this:
>
>   for(many)
>     a[x] += a[x-1]
>
> This forces the CPU to stall waiting for the result of each prior
> calculation.

So this somehow is the latency of an add instruction, using registers. I
suppose with a multiplication it would be slower, and with a division even
more so... I.e. that does not really match with any definition of "max
cost" for me ;-)

> Min instruction cost looks more like this:
>
>   for(many)
>     a[x] += a[x+n]

And this then measures the throughput of an add instruction.

> You'll note the max instruction cost is ~300 picosec while the L1 cache
> line is ~150 picosec for 64 bytes. I would assume Intel deliberately
> chose that to ensure that a single arithmetic instruction adding across
> cache lines can run at single cycle speed.

Ah, so these results are from an Intel processor? 101 ps means 1 cycle with
10GHz. I wondered which ARM processor you were using to get these
impressive figures. So it's more like 3.3GHz and 3 add instructions
throughput per cycle.

Josef