From: Josef W. <Jos...@gm...> - 2006-08-14 09:37:09
On Monday 14 August 2006 10:02, Hynek Schlawack wrote:
> I thought my computer is fast:
>
> caches:
> level    size   linesize    miss-latency          replace-time
>   1     64 KB   32 bytes    5.99 ns =  13 cy      8.37 ns =  18 cy
>   2    512 KB  128 bytes   80.71 ns = 178 cy     78.33 ns = 172 cy
>
> :/ So I have to replace the 10 and 100 with 13 and 178? I'd say, this
> should also go into the docs, or have I just overlooked it?

Yes and yes.

Usually, "fast" is associated with high peak performance, which says
nothing about main memory speed. That's even the problem of the Top500
list (and no processor vendor is interested in changing this).
But 80 ns is not really too bad either. You should look at the latencies
of 4/8-socket Opteron systems...

> There are bars but only approx. one pixel wide.

That's fine.

> Also, Callgrind says
>
> ==14861== I refs:        1,206,122,472
> ==14861== I1 misses:             3,955
> ==14861== L2i misses:            2,648
> ==14861== I1 miss rate:            0.0%
> ==14861== L2i miss rate:           0.0%

Hmm... instruction fetches most often hit in the cache. What about the
data accesses?

> >> So, any hints for throughout measuring? Callgrind was kind of my last
> >> hope...
> > What do you really want to see? Are you interested in exact time
> > measurement (min/max/average time from point A to point B in your
> > source)? Then your best bet is to put rdtsc calls yourself at these
> > points.
>
> Looks like it. But as said, it's also problematic, because of moving
> additional code.

You can subtract the overhead of the inserted rdtsc instruction, as you
know this overhead (you can measure it before).
BTW, you also should be able to read performance counters (... I am not
really sure if rdmsr is allowed in user space ...).

> So I guess a combination of rdtsc (exact times), oprofile
> (approx. runtime distribution) and callgrind (caches + callgraphs) is the
> way to go. That's also what I expected.

Yes. That's also a big TODO item for KCachegrind: to combine measurement
results of different tools to come up with something better.
VTune is supposed to support this (callgraph from instrumentation mode,
time from sampling).

> Hm, what would speak against rdtsc-instrumentation?

Why? If you are doing it yourself, you can avoid unneeded instrumentation,
you can control the overhead, and even subtract it from the result.

Josef
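
The overhead subtraction described above can be sketched roughly as follows. This is an illustration only, not code from this thread: it uses Python's time.perf_counter_ns() as a stand-in for rdtsc, and the names timer_overhead and timed are hypothetical. The idea is the same in either case: measure the cost of two back-to-back timer reads once, then subtract that cost from every measured interval.

```python
import time


def timer_overhead(clock=time.perf_counter_ns, samples=1000):
    """Return the smallest observed cost of two back-to-back clock reads.

    The minimum is used because larger values mostly reflect
    interference (interrupts, scheduling) rather than the timer itself.
    """
    best = None
    for _ in range(samples):
        start = clock()
        end = clock()
        delta = end - start
        if best is None or delta < best:
            best = delta
    return best


def timed(func, overhead, clock=time.perf_counter_ns):
    """Run func once and return (result, elapsed - overhead).

    The corrected time is clamped at zero, since a very short region
    can otherwise come out slightly negative.
    """
    start = clock()
    result = func()
    end = clock()
    return result, max(0, end - start - overhead)


if __name__ == "__main__":
    overhead = timer_overhead()
    _, elapsed = timed(lambda: sum(range(100_000)), overhead)
    print(f"timer overhead: {overhead} ns, corrected time: {elapsed} ns")
```

The clock is passed in as a parameter so that a different time source (or a fake one for testing) can be substituted without touching the measurement logic.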