From: Hynek S. <hs+...@ox...> - 2006-08-14 08:03:03
Hello Josef,

Josef Weidendorfer <Jos...@gm...> writes:

Thanks again!

>> > should adjust the formula for your machine.
>> To be honest, I have problems seeing it because I have no
>> comparison.
> You can use calibrator (see http://monetdb.cwi.nl/Calibrator).

I thought my computer was fast:

caches:
level  size    linesize   miss-latency        replace-time
1      64 KB   32 bytes   5.99 ns  =  13 cy   8.37 ns  =  18 cy
2      512 KB  128 bytes  80.71 ns = 178 cy   78.33 ns = 172 cy

:/

So I have to replace the 10 and 100 with 13 and 178? I'd say this
should also go into the docs, or have I just overlooked it?

>> When do I have cache problems? All my functions have cache
>> miss sums for both L1 and L2 < 0.5...
> Good cache behavior would be a hit ratio > 97%, depending on the latencies
> of your system (slower main memory wants a higher hit ratio).
> Unfortunately, KCachegrind currently does not show ratios explicitly.
> When you select "Cycle estimation" and look at the colored bars,
> the red part will show the fraction of the time estimation that comes
> from L2 misses. If there is no red part, you are fine.

There are bars, but they are only approx. one pixel wide. Also, Callgrind says

==14861== I   refs:      1,206,122,472
==14861== I1  misses:            3,955
==14861== L2i misses:            2,648
==14861== I1  miss rate:          0.0%
==14861== L2i miss rate:          0.0%

So I guess I have no cache problems? :)

>> I know - I found that even minimal instrumentation (i.e. rdtsc) can have
>> a huge impact on the results.
> It always depends on how often the instrumentation itself is executed.

I had funny effects (subfunctions seeming to take longer than the whole
run) even when it was called only once. But it's also problematic due to
its nature as a network application.

>> I'm profiling _thin_ network layers over gigabit ethernet whose
>> latencies are measured in < 100 µs.
> And still, it is enough to measure user level only?

Yes, my code runs purely in user space.
> Note that callgrind or GProf cannot give you the latency spent in the
> kernel part of the network stack.

I know, only OProfile can.

>> OProfile's lowest granularity is 3,000 cycles; if I'm not mistaken,
>> I'd need 2,200 on my 2.2 GHz CPU to have 1,000,000 samples / second.
> I do not know about these limits.

They are documented inside `opcontrol -l'. If you want call graphs, you
even have to multiply it by 15 (that isn't documented; I stumbled over
45,000 and asked John Levon).

> But OProfile's interrupt handler probably takes around 500-1000 cycles
> (only my rough estimation).

So you're suggesting it wouldn't make sense anyway... that's true indeed.

>> So, any hints for thorough measuring? Callgrind was kind of my last
>> hope...
> What do you really want to see? Are you interested in exact time
> measurement (min/max/average time from point A to point B in your
> source)? Then your best bet is to put rdtsc calls yourself at these
> points.

Looks like it. But as said, it's also problematic, because it moves
additional code around.

> But if you are interested in what code is touched between point A and B,
> and which instructions in the code path take most of the time, a
> statistical approach should be enough. And you do not need a very high
> sample resolution, but you have to sample long enough for the counts
> to be statistically relevant. When you get 1 million sample points
> between point A and B, the time distribution should really be precise
> enough (depending on the amount of code that can be touched in code
> paths from A to B).

So I guess a combination of rdtsc (exact times), OProfile (approximate
runtime distribution) and Callgrind (caches + call graphs) is the way to
go. That's also what I expected.

>> > GProf does sampling too, but only with timers, and with the
>> > handler in user land. OProfile really should be more exact, as it
>> > does sample handling in kernel space with lower latency.
>> OK, this is a shock for me now... I always thought that gprof doesn't
>> sample. :( Why does it do instrumentation then? Just for the call graph?
> Yes. The instrumentation increments counters for call arcs among
> functions only. There is no rdtsc or similar. GProf does sampling, and
> sampling itself can only give self costs. These costs are propagated
> up along the call graph to get an estimation of the inclusive costs.
> And for this to be possible, the call graph has to be exact - which
> needs the instrumentation in the first place.

Hm, what would speak against rdtsc instrumentation? In the case of gprof
it's clear (portability), but for people like me gprof is useless. I'm
glad I didn't use it for serious stuff.

-hs