|
From: Hynek S. <hs+...@ox...> - 2006-08-13 17:18:37
|
Hi,

I'm profiling software using Callgrind, and as I have some rather strange results, I'd like to ask some questions just to be sure...

I understand that Callgrind counts "instructions". Does that mean it simply adds up assembler instructions and uses the sum as the cost? I ask because that seems like a strange metric to me, since some instructions take longer than others. This would explain why pthread_mutex_lock() seems to hog the most cost while memcpy() (which gprof identified as the major hog) is almost negligible. Or am I missing something here?

I'd really like to understand why Callgrind's and gprof's results differ (and OProfile's, BTW, but that one is not useful for me as its granularity is too low). I couldn't find anything about this in either the Valgrind manual or on the KCachegrind pages, so I hope that someone here can help me...

TIA,
-hs
|
From: Josef W. <Jos...@gm...> - 2006-08-13 19:52:06
|
On Sunday 13 August 2006 19:18, Hynek Schlawack wrote:
> Hi,
>
> I'm profiling software using Callgrind and as I have some rather strange
> results, I'd like to ask some questions to go for sure...
>
> I understand, that callgrind counts "instructions". Does it mean, that
> it simply adds up assembler instructions and uses it as the costs?

Yes. "Instructions fetched" is one cost type provided by Cachegrind/Callgrind. You should not confuse this with any time cost.

> I ask
> because it seems as a strange metric to me because some instructions
> take longer than others.

Probably. However, a time estimation using instruction latencies as factors is no better. What is relevant is the throughput of a given instruction stream, not single latencies. To estimate that, you need to simulate a CPU pipeline; and to match reality, you need to know the branch prediction algorithm, the superscalar configuration of your processor, and so on (which is not really documented, BTW). Even with these parameters, such a simulator would probably be far too slow to be practical.

Cachegrind/Callgrind is good for seeing whether your code has cache problems and potential for cache optimizations. Together with average L1/L2 cache latencies, you can come up with a rough time estimation which is often quite good (there is a derived cost type "Cycle Estimation" provided with KCachegrind which defaults to 10/100 cycles latency for L1/L2 misses). You should adjust the formula for your machine. However, if you do not have cache problems, there probably is no good way to estimate time from the "instruction fetch" cost given by Cachegrind/Callgrind.

> This would explain, why pthread_mutex_lock()
> seems to hog the most costs and memcpy() (which gprof indentified as the
> major hog) is only neglectable.

This sounds like you are only looking at instruction fetches even though you have a lot of L2 misses. Can it be that you ran with the cache simulator switched off (which is the default with Callgrind)? Use "--simulate-cache=yes" and look at the cycle estimation cost (in KCachegrind).

> Or am I missing something here? I'd really like to understand, why
> Callgrind's and gprof's (and OProfile's btw, but that's not useful for
> me as it's granularity is too low) results differ.

I hope the above helps you understand your case. BTW, gprof does source instrumentation, and depending on the application, the overhead can be near 100%. This also disturbs the measurement itself. Pure sampling (like OProfile) is more exact.

Why is OProfile's granularity too low for you? In contrast to gprof, you can even adjust the sample interval there to tune the overhead. GProf does sampling too, but only with timers, and with the handler in user land. OProfile really should be more exact, as it does sample handling in kernel space with lower latency. Of course, pure sampling cannot give you a full call graph (you can make OProfile come up with extracts of the real call graph by doing a stack backtrace at every sample point).

> I couldn't find anything about this neither in the Valgrind manual nor
> on the KCachegrind pages, so I hope that someone here can help me...

This is becoming a FAQ. I will try to come up with something for the Callgrind manual.

Josef
|
From: Hynek S. <hs+...@ox...> - 2006-08-13 21:38:57
|
Hello Josef,

Josef Weidendorfer <Jos...@gm...> writes:

First of all, many thanks for your detailed answer!

>> I'm profiling software using Callgrind and as I have some rather
>> strange results, I'd like to ask some questions to go for sure... I
>> understand, that callgrind counts "instructions". Does it mean, that
>> it simply adds up assembler instructions and uses it as the costs?
> Yes. "Instructions fetched" is one cost type provided by
> Cachegrind/Callgrind. You should not confuse this with any time cost.

I don't. :)

>> I ask because it seems as a strange metric to me because some
>> instructions take longer than others.
> Probably. However, a time estimation using instruction latencies as
> factors is not better either. Relevant is the throughput of a given
> instruction stream, and not single latencies. For estimation of that,
> you need to simulate a CPU pipeline; and to match some reality, you
> need to know the branch prediction algorithm, and the superscalar
> configuration of your processor and so on (which is not really
> documented BTW). Even with these parameters, a simulator probably
> would be way too slow to be practical.

I don't doubt that - sorry if it sounded otherwise. In fact, that was the thought behind my question: that it's impossible to compute time from instruction counts (on today's CPUs).

> Cachegrind/Callgrind is good to see whether your code has cache problems
> and potential for cache optimizations. Together with average L1/L2 cache
> latencies, you can come up with a rough time estimation which is often
> quite good (there is a derived cost type "Cycle Estimation" provided
> with KCachegrind which defaults to 10/100 cycles latency for L1/L2. You
> should adjust the formula for your machine.

To be honest, I have trouble seeing it because I have no comparison. When do I have cache problems? All my functions have cache miss sums for both L1 and L2 < 0.5...

> However, if you do not have cache problems, there probably is no good
> way to estimate the time by using the "instruction fetch" cost given by
> Cachegrind/Callgrind.

OK. The docs at Valgrind sounded pretty general-purpose ("Callgrind: a heavyweight profiler"), so I've been trying to use it generally. :) The call-graph features are excellent, though.

>> This would explain, why pthread_mutex_lock()
>> seems to hog the most costs and memcpy() (which gprof indentified as the
>> major hog) is only neglectable.
> This sounds like you only look at the instruction fetches even though you
> have a lot of L2 misses. Can it be that you run with the cache simulator
> switched off (which is the default with Callgrind)?
> Use "--simulate-cache=yes" and look at the cycle estimation cost (in
> KCachegrind).

I tried both, but, to my shame, I overlooked the "Cycle Estimation".

> BTW, gprof is doing source instrumentation, and depending on the
> application, overhead can be near 100%. This also disturbs the
> measurement itself.

I know - I found that even minimal instrumentation (i.e. rdtsc) can have a huge impact on the results.

> Why is OProfile's granularity too low for you? In contrast to GProf,
> you even can adjust the sample interval there to tune the
> overhead.

I'm profiling _thin_ network layers over gigabit ethernet whose latencies are measured in < 100 µs. OProfile's lowest granularity is 3,000 cycles; if I'm not mistaken, I'd need 2,200 on my 2.2 GHz CPU to get 1,000,000 samples / second.

So, any hints for throughput measuring? Callgrind was kind of my last hope...

> GProf is doing sample too, but only with timers, and with the handler
> in user land. OProfile really should be more exact, as it does sample
> handling in kernel space with lower latency.

OK, this is a shock for me now... I always thought that gprof doesn't sample. :( Why does it do instrumentation then? Just for the call graph? I'll have to look at its internals, I guess. Or just throw it in the trashcan.

>> I couldn't find anything about this neither in the Valgrind manual
>> nor on the KCachegrind pages, so I hope that someone here can help
>> me...
> This is becoming a FAQ. I will try to come up with something for the
> Callgrind manual.

I'm sorry. :( I guess I had the wrong search terms for Gmane.

-hs
|
From: Josef W. <Jos...@gm...> - 2006-08-13 23:20:25
|
On Sunday 13 August 2006 23:38, Hynek Schlawack wrote:
> > Cachegrind/Callgrind is good to see whether your code has cache problems
> > and potential for cache optimizations. Together with average L1/L2 cache
> > latencies, you can come up with a rough time estimation which is often
> > quite good (there is a derived cost type "Cycle Estimation" provided
> > with KCachegrind which defaults to 10/100 cycles latency for L1/L2. You

Typo above: the default of that formula is 10 cycles for an L1 miss (= L2 access), and 100 cycles for an L2 miss (= main memory access).

> > should adjust the formula for your machine.
>
> To be honest, I have problems to see it because I have no
> comparision.

You can use Calibrator (see http://monetdb.cwi.nl/Calibrator).

> When do I have cache problems? All my functions have cache
> miss sums for both L1 and L2 < 0.5...

Good cache behavior would be a hit ratio > 97%, depending on the latencies of your system (slower main memory wants a higher hit ratio). Unfortunately, KCachegrind currently does not show ratios explicitly. When you select "Cycle Estimation" and look at the colored bars, the red part shows the fraction of the time estimation which comes from L2 misses. If there is no red part, you are fine.

> > BTW, gprof is doing source instrumentation, and depending on the
> > application, overhead can be near 100%. This also disturbs the
> > measurement itself.
>
> I know - I found that even minimal instrumentation (ie. rdtsc) can have
> huge impact on the results.

It always depends on how often the instrumentation itself is executed.

> > Why is OProfile's granularity too low for you? In contrast to GProf,
> > you even can adjust the sample interval there to tune the
> > overhead.
>
> I'm profiling _thin_ network layers over gigabit ethernet whose
> latencies are measured in < 100 µs.

And still, is it enough to measure user level only? Note that Callgrind or gprof cannot give you the latency spent in the kernel part of the network stack.

> OProfile's lowest granularity is
> 3,000 cycles; if I'm not mistaken, I'd need 2,200 on my 2.2 GHz CPU to
> have 1,000,000 samples / second.

I do not know about these limits. But OProfile's interrupt handler probably takes around 500-1000 cycles (only my rough estimate).

> So, any hints for throughput measuring? Callgrind was kind of my last
> hope...

What do you really want to see? Are you interested in exact time measurement (min/max/average time from point A to point B in your source)? Then your best bet is to put rdtsc calls at these points yourself.

But if you are interested in what code is touched between point A and B, and which instructions in the code path take most of the time, a statistical approach should be enough. You do not need a very high sample resolution, but you have to sample long enough for the counts to be statistically relevant. When you get 1 million sample points between point A and B, the time distribution should really be precise enough (depending on the amount of code that can be touched in code paths from A to B).

> > GProf is doing sample too, but only with timers, and with the handler
> > in user land. OProfile really should be more exact, as it does sample
> > handling in kernel space with lower latency.
>
> Ok, this is a shock for me now...I always thought, that gprof doesn't
> sample. :( Why does it instrumentation then? Just for the callgraph?

Yes. The instrumentation only increments counters for call arcs among functions; there is no rdtsc or similar. GProf does sampling, and sampling by itself can only give self costs. These costs are propagated up along the call graph to get an estimation of the inclusive costs. And for this to be possible, the call graph has to be exact - which is why the instrumentation is needed in the first place.

> > This is becoming a FAQ. I will try to come up with something for the
> > Callgrind manual.
>
> I'm sorry. :( I guess I had the wrong search terms for Gmane.

No problem. The Callgrind manual needs to be improved.

Josef
|
From: Hynek S. <hs+...@ox...> - 2006-08-14 08:03:03
|
Hello Josef,

Josef Weidendorfer <Jos...@gm...> writes:

Thanks again!

>> > should adjust the formula for your machine.
>> To be honest, I have problems to see it because I have no
>> comparision.
> You can use calibrator (see http://monetdb.cwi.nl/Calibrator).

I thought my computer was fast:

caches:
level  size    linesize    miss-latency       replace-time
  1     64 KB   32 bytes    5.99 ns =  13 cy   8.37 ns =  18 cy
  2    512 KB  128 bytes   80.71 ns = 178 cy  78.33 ns = 172 cy

:/ So I have to replace the 10 and 100 with 13 and 178? I'd say this should also go into the docs, or have I just overlooked it?

>> When do I have cache problems? All my functions have cache
>> miss sums for both L1 and L2 < 0.5...
> Good cache behavior would be a hit ratio > 97%, depending on latencies
> of your system (slower main memory wants a higher hit ratio).
> Unfortunately, KCachegrind currently does not show ratios explicitly.
> When you select the "Cycle estimation" and look at the colored bars,
> the red part will show the fraction of the time estimation which comes
> from L2 misses. If there is no red part, you are fine.

There are bars, but only approx. one pixel wide. Also, Callgrind says:

==14861== I refs:        1,206,122,472
==14861== I1 misses:     3,955
==14861== L2i misses:    2,648
==14861== I1 miss rate:  0.0%
==14861== L2i miss rate: 0.0%

So I guess I have no cache problems? :)

>> I know - I found that even minimal instrumentation (ie. rdtsc) can have
>> huge impact on the results.
> It always depends on how often the instrumention is executed itself.

It had funny effects (subfunctions seemed to take longer than the whole run) even when called only once. But it's also problematic due to its nature as a network application.

>> I'm profiling _thin_ network layers over gigabit ethernet whose
>> latencies are measured in < 100 µs.
> And still, it is enough to measure userlevel only?

Yes, my code is purely in user space.

> Note that callgrind or GProf can not give you latency spent in the
> kernel part of the network stack.

I know - only OProfile can.

>> OProfile's lowest granularity is 3,000 cycles; if I'm not mistaken,
>> I'd need 2,200 on my 2.2 GHz CPU to have 1,000,000 samples / second.
> I do not know about these limits.

They are documented inside `opcontrol -l'. If you want call graphs, you even have to multiply it by 15 (that isn't documented; I stumbled over 45,000 and asked John Levon).

> But OProfiles interrupt handler probably takes around 500-1000 cycles
> (only my rough estimation).

So you're suggesting it wouldn't make sense anyway... that's true indeed.

>> So, any hints for throughput measuring? Callgrind was kind of my last
>> hope...
> What do you really want to see? Are you interested in exact time
> measurement (min/max/average time from point A to point B in your
> source)? Then your best bet is to put rdtsc calls yourself at these
> points.

Looks like it. But as said, it's also problematic, because it inserts additional code.

> But if you are interested what code is touched between point A and B,
> and what instructions in the code path are taking most of the time, a
> statistical approach should be enough. And you do not need a very high
> sample resolution, but you have to sample long enough for the counts
> to be statistically relevant. When you get 1 million sample points
> between point A and B, the time distribution should really be precise
> enough (depending on the amount of code that can be touched in code
> paths from A to B).

So I guess a combination of rdtsc (exact times), OProfile (approx. runtime distribution) and Callgrind (caches + call graphs) is the way to go. That's also what I expected.

>> > GProf is doing sample too, but only with timers, and with the
>> > handler in user land. OProfile really should be more exact, as it
>> > does sample handling in kernel space with lower latency.
>> Ok, this is a shock for me now...I always thought, that gprof doesn't
>> sample. :( Why does it instrumentation then? Just for the callgraph?
> Yes. The instrumentation increments counters for call arcs among
> functions only. There is no rdtsc or similar. GProf does sampling, and
> sampling itself can only give self costs. These costs are propagated
> up along the call graph to get an estimation of the inclusive costs.
> And for this to be possible, the call graph has to be exact - which
> needs the instrumentation in the first place.

Hm, what would speak against rdtsc instrumentation? In the case of gprof it's clear (portability), but for people like me gprof is useless. I'm glad I didn't use it for serious stuff.

-hs
|
From: Josef W. <Jos...@gm...> - 2006-08-14 09:37:09
|
On Monday 14 August 2006 10:02, Hynek Schlawack wrote:
> I thought my computer was fast:
>
> caches:
> level  size    linesize    miss-latency       replace-time
>   1     64 KB   32 bytes    5.99 ns =  13 cy   8.37 ns =  18 cy
>   2    512 KB  128 bytes   80.71 ns = 178 cy  78.33 ns = 172 cy
>
> :/ So I have to replace the 10 and 100 with 13 and 178? I'd say this
> should also go into the docs, or have I just overlooked it?

Yes and yes. Usually, "fast" is associated with high peak performance, which says nothing about main memory speed. That's even the problem of the Top500 list (and no processor vendor is interested in changing this). But 80 ns is not really too bad either. You should look at the latencies of 4/8-socket Opteron systems...

> There are bars, but only approx. one pixel wide.

That's fine.

> Also, Callgrind says:
>
> ==14861== I refs:        1,206,122,472
> ==14861== I1 misses:     3,955
> ==14861== L2i misses:    2,648
> ==14861== I1 miss rate:  0.0%
> ==14861== L2i miss rate: 0.0%

Hmm... instruction fetches most often hit in the cache. What about the data accesses?

> >> So, any hints for throughput measuring? Callgrind was kind of my last
> >> hope...
> > What do you really want to see? Are you interested in exact time
> > measurement (min/max/average time from point A to point B in your
> > source)? Then your best bet is to put rdtsc calls yourself at these
> > points.
>
> Looks like it. But as said, it's also problematic, because it inserts
> additional code.

You can subtract the overhead of the inserted rdtsc instruction, as you know this overhead (you can measure it beforehand). BTW, you should also be able to read performance counters (... I am not really sure whether rdmsr is allowed in user space ...).

> So I guess a combination of rdtsc (exact times), OProfile
> (approx. runtime distribution) and Callgrind (caches + call graphs) is the
> way to go. That's also what I expected.

Yes. That's also a big TODO item for KCachegrind: combining the measurement results of different tools to come up with something better. VTune is supposed to support this (call graph from instrumentation mode, time from sampling).

> Hm, what would speak against rdtsc instrumentation?

Why? If you are doing it yourself, you can avoid unneeded instrumentation, you can control the overhead, and you can even subtract it from the result.

Josef
|
From: Hynek S. <hs+...@ox...> - 2006-08-14 12:39:48
|
Hello Josef,

Josef Weidendorfer <Jos...@gm...> writes:

>> Also, Callgrind says:
>>
>> ==14861== I refs:        1,206,122,472
>> ==14861== I1 misses:     3,955
>> ==14861== L2i misses:    2,648
>> ==14861== I1 miss rate:  0.0%
>> ==14861== L2i miss rate: 0.0%
> Hmm... instruction fetches most often hit in the cache. What about
> the data accesses?

I measured with more realistic data and got:

==22798== Events    : Ir     Dr     Dw     I1mr  D1mr  D1mw  I2mr  D2mr  D2mw
==22798== Collected : 754730 253655 132391 3789  4150  911   2589  2094  802
==22798==
==22798== I refs:        754,730
==22798== I1 misses:     3,789
==22798== L2i misses:    2,589
==22798== I1 miss rate:  0.50%
==22798== L2i miss rate: 0.34%
==22798==
==22798== D refs:        386,046  (253,655 rd + 132,391 wr)
==22798== D1 misses:     5,061    (  4,150 rd +     911 wr)
==22798== L2d misses:    2,896    (  2,094 rd +     802 wr)
==22798== D1 miss rate:  1.3%     (  1.6%      +     0.6%  )
==22798== L2d miss rate: 0.7%     (  0.8%      +     0.6%  )
==22798==
==22798== L2 refs:       8,850    (  7,939 rd +     911 wr)
==22798== L2 misses:     5,485    (  4,683 rd +     802 wr)
==22798== L2 miss rate:  0.4%     (  0.4%      +     0.6%  )

Still okay, is it?

>> Looks like it. But as said, it's also problematic, because it inserts
>> additional code.
> You can subtract the overhead of the inserted rdtsc instruction, as
> you know this overhead (you can measure it beforehand). BTW, you
> should also be able to read performance counters (... I am not really
> sure whether rdmsr is allowed in user space ...).

Hm, I guess I'll hack a general .a together to make that easy. I used to use just a preprocessor macro, but it turned out to be pretty unwieldy.

>> So I guess a combination of rdtsc (exact times), OProfile
>> (approx. runtime distribution) and Callgrind (caches + call graphs) is the
>> way to go. That's also what I expected.
> Yes. That's also a big TODO item for KCachegrind: combining the
> measurement results of different tools to come up with something
> better. VTune is supposed to support this (call graph from
> instrumentation mode, time from sampling).

Sounds like a really huge TODO.

>> Hm, what would speak against rdtsc instrumentation?
> Why? If you are doing it yourself, you can avoid unneeded
> instrumentation, you can control the overhead, and you can even
> subtract it from the result.

Some kind of automation would be nice, I'd say.

-hs