From: Josef W. <Jos...@gm...> - 2006-08-13 23:20:25
On Sunday 13 August 2006 23:38, Hynek Schlawack wrote:
> > Cachegrind/Callgrind is good to see whether your code has cache
> > problems and potential for cache optimizations. Together with average
> > L1/L2 cache latencies, you can come up with a rough time estimation
> > which is often quite good (there is a derived cost type "Cycle
> > Estimation" provided with KCachegrind which defaults to 10/100 cycles
> > latency for L1/L2. You should adjust the formula for your machine.

Typo: the default of that formula is 10 for an L1 miss (= L2 access),
and 100 for an L2 miss (= main memory access).

> To be honest, I have problems to see it because I have no comparison.

You can use Calibrator (see http://monetdb.cwi.nl/Calibrator).

> When do I have cache problems? All my functions have cache miss sums
> for both L1 and L2 < 0.5...

Good cache behavior would be a hit ratio > 97%, depending on the
latencies of your system (slower main memory wants a higher hit ratio).
Unfortunately, KCachegrind currently does not show ratios explicitly.
When you select "Cycle Estimation" and look at the colored bars, the
red part shows the fraction of the time estimation which comes from L2
misses. If there is no red part, you are fine.

> > BTW, gprof is doing source instrumentation, and depending on the
> > application, overhead can be near 100%. This also disturbs the
> > measurement itself.
>
> I know - I found that even minimal instrumentation (i.e. rdtsc) can
> have huge impact on the results.

It always depends on how often the instrumentation itself is executed.

> > Why is OProfile's granularity too low for you? In contrast to GProf,
> > you even can adjust the sample interval there to tune the overhead.
>
> I'm profiling _thin_ network layers over gigabit ethernet whose
> latencies are measured in < 100 µs.

And still, is it enough to measure user level only? Note that Callgrind
or GProf cannot show you latency spent in the kernel part of the
network stack.
> OProfile's lowest granularity is 3,000 cycles; if I'm not mistaken,
> I'd need 2,200 on my 2.2 GHz CPU to have 1,000,000 samples / second.

I do not know about these limits. But OProfile's interrupt handler
probably takes around 500-1000 cycles (only my rough estimate).

> So, any hints for thorough measuring? Callgrind was kind of my last
> hope...

What do you really want to see? Are you interested in exact time
measurement (min/max/average time from point A to point B in your
source)? Then your best bet is to put rdtsc calls at these points
yourself. But if you are interested in what code is touched between
point A and B, and which instructions in the code path take most of the
time, a statistical approach should be enough. And you do not need a
very high sample resolution, but you have to sample long enough for the
counts to be statistically relevant. When you get 1 million sample
points between point A and B, the time distribution should really be
precise enough (depending on the amount of code that can be touched in
code paths from A to B).

> > GProf is doing sampling too, but only with timers, and with the
> > handler in user land. OProfile really should be more exact, as it
> > does sample handling in kernel space with lower latency.
>
> Ok, this is a shock for me now... I always thought that gprof doesn't
> sample. :( Why does it do instrumentation then? Just for the call
> graph?

Yes. The instrumentation only increments counters for call arcs among
functions; there is no rdtsc or similar. GProf does sampling, and
sampling itself can only give self costs. These costs are propagated up
along the call graph to get an estimation of the inclusive costs. And
for this to be possible, the call graph has to be exact - which needs
the instrumentation in the first place.

> > This is becoming a FAQ. I will try to come up with something for the
> > Callgrind manual.
>
> I'm sorry. :( I guess I had the wrong search terms for Gmane.

No problem.
The Callgrind manual needs to be improved.

Josef