|
From: Nguyen Vu H. <vuh...@gm...> - 2008-02-28 02:26:53
|
Hello all,

valgrind --tool=cachegrind reports that I have 2 million L2 misses.
The cachegrind manual page[1] says that an L2 miss costs around 10 cycles.

So, if I am using a Pentium 4 at 3.2GHz ( clock rate: 3200 M cycles per
second ), then the "performance lost" ( I don't know if it is the
right terminology ) is:

3.2 x 1000 millions / 2,012,977 = 123 seconds.

Is that correct ?

==18449== I refs: 3,593,017,532
==18449== I1 misses: 2,585,830
==18449== L2i misses: 21,133
==18449== I1 miss rate: 0.07%
==18449== L2i miss rate: 0.00%
==18449==
==18449== D refs: 1,912,484,305 (1,291,273,957 rd + 621,210,348 wr)
==18449== D1 misses: 20,941,668 ( 18,995,384 rd + 1,946,284 wr)
==18449== L2d misses: 1,991,844 ( 998,459 rd + 993,385 wr)
==18449== D1 miss rate: 1.0% ( 1.4% + 0.3% )
==18449== L2d miss rate: 0.1% ( 0.0% + 0.1% )
==18449==
==18449== L2 refs: 23,527,498 ( 21,581,214 rd + 1,946,284 wr)
==18449== L2 misses: 2,012,977 ( 1,019,592 rd + 993,385 wr)
==18449== L2 miss rate: 0.0% ( 0.0% + 0.1% )

[1] http://valgrind.org/docs/manual/cg-manual.html#cg-manual.cache
--
Best Regards,
Nguyen Hung Vu ( Nguyễn Vũ Hưng )
vuhung16plus{remove}@gmail.dot.com
An inquisitive look at Harajuku
http://www.flickr.com/photos/vuhung/sets/72157600109218238/
|
|
From: Mehmet B. <mb...@gm...> - 2008-02-28 06:37:27
|
Only if you are living in a perfect world. I once had the same dream (!) of
predicting performance loss from these statistics. I even used HW counters
and took every factor I know of into the calculation. I predicted the
penalties due to cache misses using the Calibrator tool, and I considered
latencies due to L1 and TLB misses as well. But I wasn't even close.

The way processors handle these events varies vastly. Besides, most results,
not only from simulators like cachegrind but also from HW counters, are
speculative.

I wish you good luck. I will be very much interested in any reasonable
results you may get. Please keep us updated.

-Memo
|
|
From: Nicholas N. <nj...@cs...> - 2008-02-29 05:56:57
|
On Thu, 28 Feb 2008, Nguyen Vu Hung wrote:

> valgrind --tool=cachegrind reports that I have 2 million L2 misses.
> The cachegrind manual page[1] says that an L2 miss costs around 10 cycles.

An L1 miss is around 10 cycles. An L2 miss is hundreds of cycles.

> So, if I am using a Pentium 4 at 3.2GHz ( clock rate: 3200 M cycles per
> second ), then the "performance lost" ( I don't know if it is the
> right terminology ) is:
>
> 3.2 x 1000 millions / 2,012,977 = 123 seconds.
>
> Is that correct ?

As Mehmet said, converting a cache miss count to actual time is extremely
difficult, because modern processors are so complicated.

Nick
|
|
From: Nguyen Vu H. <vuh...@gm...> - 2008-02-29 06:52:50
|
2008/2/29, Nicholas Nethercote <nj...@cs...>:
>
> As Mehmet said, converting a cache miss count to actual time is extremely
> difficult, because modern processors are so complicated.
For some unknown reason, Mehmet's email[1] didn't reach my mailbox. I
had to get it from the ML archive.
So, as you all said, this is a hard task that has not been done yet.
If we can't derive the performance penalty from cache misses
*theoretically*, is there any way of doing it empirically?
FYI, in the following example, the first run has ( almost ) no cache
misses while the second run does.
I am sorry for throwing the raw results at you; I have little time
and lack understanding of this topic.
[vuhung@ cachegrind-test]$cat 16M.c
#define BUF_SIZE 4096
static char buf[BUF_SIZE][BUF_SIZE]; // 16M
int main(void)
{
    int i, j;
    for (i = 0; i < BUF_SIZE; i++) {
        for (j = 0; j < BUF_SIZE; j++) {
#ifndef INVERSE
            buf[i][j] = 'a';
#else
            buf[j][i] = 'a';
#endif
        }
    }
}
[vuhung@ cachegrind-test]$gcc 16M.c
[vuhung@ cachegrind-test]$time ./a.out
real 0m0.112s
user 0m0.083s
sys 0m0.029s
[vuhung@ cachegrind-test]$gcc -DINVERSE 16M.c
[vuhung@ cachegrind-test]$time ./a.out
real 0m2.030s
user 0m1.918s
sys 0m0.021s
[vuhung@ cachegrind-test]$gcc 16M.c ; valgrind --tool=cachegrind ./a.out
[snip]
--28717-- warning: Pentium 4 with 12 KB micro-op instruction trace cache
--28717-- Simulating a 16 KB I-cache with 32 B lines
==28717==
==28717== I refs: 167,917,156
==28717== I1 misses: 954
==28717== L2i misses: 542
==28717== I1 miss rate: 0.00%
==28717== L2i miss rate: 0.00%
==28717==
==28717== D refs: 83,956,459 (67,159,832 rd + 16,796,627 wr)
==28717== D1 misses: 263,270 ( 898 rd + 262,372 wr)
==28717== L2d misses: 263,061 ( 699 rd + 262,362 wr)
==28717== D1 miss rate: 0.3% ( 0.0% + 1.5% )
==28717== L2d miss rate: 0.3% ( 0.0% + 1.5% )
==28717==
==28717== L2 refs: 264,224 ( 1,852 rd + 262,372 wr)
==28717== L2 misses: 263,603 ( 1,241 rd + 262,362 wr)
==28717== L2 miss rate: 0.1% ( 0.0% + 1.5% )
[vuhung@ cachegrind-test]$gcc -DINVERSE 16M.c ; valgrind --tool=cachegrind ./a.out
[snip]
--28730-- warning: Pentium 4 with 12 KB micro-op instruction trace cache
--28730-- Simulating a 16 KB I-cache with 32 B lines
==28730==
==28730== I refs: 167,917,158
==28730== I1 misses: 954
==28730== L2i misses: 542
==28730== I1 miss rate: 0.00%
==28730== L2i miss rate: 0.00%
==28730==
==28730== D refs: 83,956,461 (67,159,832 rd + 16,796,629 wr)
==28730== D1 misses: 16,778,342 ( 898 rd + 16,777,444 wr)
==28730== L2d misses: 16,778,133 ( 699 rd + 16,777,434 wr)
==28730== D1 miss rate: 19.9% ( 0.0% + 99.8% )
==28730== L2d miss rate: 19.9% ( 0.0% + 99.8% )
==28730==
==28730== L2 refs: 16,779,296 ( 1,852 rd + 16,777,444 wr)
==28730== L2 misses: 16,778,675 ( 1,241 rd + 16,777,434 wr)
==28730== L2 miss rate: 6.6% ( 0.0% + 99.8% )
[vuhung@ cachegrind-test]$cat /proc/cpuinfo ( dual CPU )
processor : 0,1
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Xeon(TM) CPU 3.20GHz
stepping : 1
cpu MHz : 3201.594
cache size : 1024 KB
[snip]
[1] http://sourceforge.net/mailarchive/forum.php?thread_name=78d7dd350802271826s6b351fdw4725ce7a8b78c9da%40mail.gmail.com&forum_name=valgrind-users
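
A rough way to turn the two runs above into an empirical per-miss cost is to divide the extra user time (converted to cycles) by the extra L2 misses. The sketch below does only that arithmetic; the function name is hypothetical, and it assumes that the whole slowdown is caused by the extra misses and that the simulated miss counts match what the real hardware saw. Neither assumption is guaranteed, and since the inverse traversal touches a different 4096-byte row on every access, TLB misses probably contribute as well.

```c
#include <assert.h>
#include <stdint.h>

/* Average extra cycles per extra miss between a slow and a fast run.
 * Assumption: the entire time difference is due to the additional misses. */
double cycles_per_extra_miss(double slow_s, double fast_s,
                             uint64_t slow_misses, uint64_t fast_misses,
                             double clock_hz)
{
    assert(slow_misses > fast_misses);
    return (slow_s - fast_s) * clock_hz / (double)(slow_misses - fast_misses);
}
```

With the numbers reported above (user times 1.918 s and 0.083 s, L2 misses 16,778,675 and 263,603, clock 3201.594 MHz) this gives roughly 356 cycles per extra miss. That is only an average for this one access pattern, not a general latency figure.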
|
|
From: Van P. L. <Lud...@sc...> - 2008-02-29 07:50:15
|
Hi,
>
> valgrind --tool=cachegrind reports that I have 2 million L2 misses.
> The cachegrind manual page[1] says that an L2 miss costs around 10 cycles.
>
> So, if I am using Pentium 4, 3.2GHz ( clock rate: 3200 M
> cycles per second ), then the "performance lost" ( I don't
> know if it is the right terminology ) is:
>
> 3.2 x 1000 millions / 2,012,977 = 123 seconds.
>
> Is that correct ?
To answer this specific question: I think it is wrong. The miss count
times the cycles per miss has to be divided by the clock rate, not the
other way round:

  2,012,977 misses x 10 cycles/miss
  ---------------------------------- = 0.0063 seconds
  3.2 x 1000 million cycles/second

But I guess L2 misses are more costly, as pointed out by Nicholas.
Hope this helps you a bit further.
KR,
Ludo Van Put
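
The corrected arithmetic can be sketched as a small C helper. The function name is hypothetical, and the result is only a naive upper bound: it assumes every miss stalls the processor for the full latency, which, as discussed elsewhere in this thread, real processors often avoid.

```c
#include <assert.h>
#include <stdint.h>

/* Naive upper bound on the time lost to cache misses, in seconds.
 * Assumption: each miss costs the full cycles_per_miss with no overlap. */
double miss_penalty_seconds(uint64_t misses, double cycles_per_miss,
                            double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)misses * cycles_per_miss / clock_hz;
}
```

Calling it with 2,012,977 misses, 10 cycles per miss and a 3.2e9 Hz clock reproduces the 0.0063 seconds above. With an assumed 200 cycles per miss (an illustrative value for "hundreds of cycles", not a measured one) the bound grows to about 0.126 seconds.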
|
|
From: Nicholas N. <nj...@cs...> - 2008-02-29 09:00:04
|
On Fri, 29 Feb 2008, Nguyen Vu Hung wrote:

> If we can't derive the performance penalty from cache misses
> *theoretically*, is there any way of doing it empirically?

If you really want to know more about this, read this paper:

www.ece.wisc.edu/~jes/papers/Perf_Ctr_ASPLOS.pdf

In short, on superscalar out-of-order processors you can compute this kind
of thing accurately based on a few pieces of data, related to cache and TLB
misses and branch mispredictions. However, the pieces of data you need are
ones that would require hardware performance counters, and I don't think any
existing machines have the right kind of hardware performance counters.

Nick
|
|
From: Josef W. <Jos...@gm...> - 2008-03-04 13:35:52
|
On Friday 29 February 2008, Nicholas Nethercote wrote:

> In short, on superscalar out-of-order processors you can compute this kind
> of thing accurately based on a few pieces of data, related to cache and TLB
> misses and branch mispredictions. However, the pieces of data you need are
> ones that would require hardware performance counters, and I don't think any
> existing machines have the right kind of hardware performance counters.

The hardware performance counters of the Itanium2 are quite sophisticated.
AFAIR, the specific reason(s) for pipeline stalls (called bubbles there) can
be checked out in detail. Check the architecture manual.

Josef
|
|
From: Nguyen Vu H. <vuh...@gm...> - 2008-02-29 09:25:30
|
2008/2/29, Van Put, Ludo <Lud...@sc...>:

> To answer this specific question: I think it is wrong.
>
>   2,012,977 misses x 10 cycles/miss
>   ---------------------------------- = 0.0063 seconds
>   3.2 x 1000 million cycles/second
>
> But I guess L2 misses are more costly, as pointed out by Nicholas.
>
> Hope this helps you a bit further.

Thanks for the correction.

Calibrator[1] is a small utility that helps find the relationship
between cache miss latency, cycles and performance loss in time[3]. It
also has a ( quite out-of-date ) database[2] of tested CPUs.

Mehmet, would you share your results? What is wrong with Calibrator ?

[1] http://monetdb.cwi.nl/Calibrator/
[2] http://www.cwi.nl/htbin/ins1/publications?request=search&field=KEYWORDS&pattern=Calibrator&title=Keyword:+Calibrator
[3] http://monetdb.cwi.nl/Calibrator/doc/calibrator.pdf
|
|
From: Nicholas N. <nj...@cs...> - 2008-02-29 23:31:16
|
On Fri, 29 Feb 2008, Nguyen Vu Hung wrote:

> Calibrator[1] is a small utility that helps find the relationship
> between cache miss latency, cycles and performance loss in time[3]. It
> also has a ( quite out-of-date ) database[2] of tested CPUs.
>
> Mehmet, would you share your results? What is wrong with Calibrator ?

Calibrator is good for giving you the worst-case latencies of L1 and L2
cache misses. The tricky thing with cache misses is that often machines can
do other stuff while waiting for the cache miss to be serviced. But exactly
how much depends on a lot of factors to do with the exact program being run.
That's why it's so hard to estimate this kind of thing.

Nick
|
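
One common way to observe a worst-case latency of this kind is a chain of dependent loads ("pointer chasing"): each load needs the result of the previous one, so the processor cannot overlap the misses. The sketch below shows only the chain itself and is not Calibrator's actual code; the names `build_ring` and `chase` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Link the n slots into a ring: slot i points to slot (i + stride) % n.
 * The ring visits every slot when stride and n share no common factor. */
void build_ring(size_t *next, size_t n, size_t stride)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        next[i] = (i + stride) % n;
}

/* Follow the ring for `steps` dependent loads and return the final slot.
 * Every load depends on the previous one, so misses cannot overlap. */
size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    while (steps--)
        p = next[p];
    return p;
}
```

Timing a long `chase` over an array much larger than the cache, for example with `clock()` from `<time.h>`, and dividing by the step count gives an approximate cost per dependent load. A real program usually has independent accesses that can overlap, which is why such a figure tends to overstate the cost per miss.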
|
From: Mehmet B. <mb...@gm...> - 2008-03-01 06:10:46
|
I didn't say there's something wrong with Calibrator. It worked fine for
me to estimate cache miss latencies. I am traveling right now, but will be
happy to share my results (for an Athlon machine) when I get back home. I
really liked the ps graphs it generates; they tell a lot.

-Memo
|
|
From: Mehmet B. <mb...@gm...> - 2008-03-06 16:11:49
|
As promised: (Calibrator estimates for an Athlon machine)

L1 latency  : 12 cycles
L2 latency  : 162 cycles
TLB latency : 28 cycles

Note that, while using HW counters, the L2 misses incurred by TLB misses are
also counted in the same way as L2 misses due to poor locality. So the value
for TLB miss excludes this implicit latency.

Thanks,
-Memo
|
|
From: Mehmet B. <mb...@gm...> - 2008-03-07 18:50:25
|
Calibrator results sound about right to me. Are you looking at a manual for
a newer architecture? Because those data sheet values are way too small for
old architectures (if only they were correct !!!). For recent architectures,
I would believe these values might be correct though.

Just for fun: here's (http://people.cs.vt.edu/~mehmetb/drop/Estimation.pdf)
how close I got to estimating performance using HW counter stats. On a
second look, I did pretty well :-)

The X axis is different problem sizes (matrices). Y is the time per problem.
The red curve is the baseline. Then I did some optimizations and measured
the results (blue curve), and I estimated the expected performance
improvements (black curve) by using cache event stats (nothing fancy:
time_def = num_event * event_latency).

I will appreciate all comments/suggestions...

-Memo

On Fri, Mar 7, 2008 at 5:58 AM, Andi Kleen <an...@fi...> wrote:

> "Mehmet Belgin" <mb...@gm...> writes:
>
> > (Calibrator estimates for an Athlon machine)
>
> Assuming you mean a K8, but K7 is similar
>
> > L1 latency : 12 cycles
>
> It should be <= 4 cycles actually.
>
> > L2 latency : 162 cycles
>
> That looks far too slow. Normally it should be <20 cycles
> (minimum 13 cycles) according to data sheets.
>
> Looks to me like calibrator is not very accurate.
>
> -Andi
|
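
The additive estimate mentioned above (time_def = num_event * event_latency) can be sketched as a sum over event classes. The struct and function names are hypothetical, and the model assumes that latencies simply add up with no overlap between events, so it is an illustration of the formula rather than a reconstruction of the code behind the linked plot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One class of event, e.g. L1 misses, L2 misses or TLB misses. */
struct event_cost {
    uint64_t count;    /* number of events observed */
    uint64_t latency;  /* assumed cycles lost per event */
};

/* Sum of count * latency over all event classes, in cycles.
 * Assumption: penalties are purely additive and never overlap. */
uint64_t penalty_cycles(const struct event_cost *ev, size_t n)
{
    assert(ev != NULL || n == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += ev[i].count * ev[i].latency;
    return total;
}
```

Dividing the result by the clock rate in Hz gives the estimated time. The latencies would come from a tool such as Calibrator and the counts from hardware counters or cachegrind.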
|
From: Nicholas N. <nj...@cs...> - 2008-03-07 20:24:15
|
On Fri, 7 Mar 2008, Andi Kleen wrote:

> But I doubt that any CPU you can find in the last 20 years or so
> had a 12 cycle L1 latency. Especially not a x86 CPU. That would just be
> horrible in performance.

I tried Calibrator with an Athlon (K7 model 4) a few years back and got
L1/L2 worst-case latencies of 12 cycles and 206 cycles. My understanding is
that L1 misses costing ~10 cycles and L2 misses costing ~100 cycles or more
(both in the worst case) is typical. That's why so much research has been
done relating to cache performance.

Nick
|