From: Josef W. <Jos...@gm...> - 2006-08-13 19:52:06
On Sunday 13 August 2006 19:18, Hynek Schlawack wrote:
> Hi,
>
> I'm profiling software using Callgrind, and as I have some rather strange
> results, I'd like to ask some questions to be sure...
>
> I understand that Callgrind counts "instructions". Does that mean it
> simply adds up assembler instructions and uses that as the cost?

Yes. "Instructions fetched" is one cost type provided by Cachegrind/Callgrind. You should not confuse this with any time cost.

> I ask because it seems like a strange metric to me, because some
> instructions take longer than others.

Probably. However, a time estimation using instruction latencies as factors is not better either. What is relevant is the throughput of a given instruction stream, not single-instruction latencies. To estimate that, you need to simulate a CPU pipeline; and to match reality at all, you need to know the branch prediction algorithm, the superscalar configuration of your processor, and so on (which is not really documented, BTW). Even with these parameters, a simulator would probably be way too slow to be practical.

Cachegrind/Callgrind is good for seeing whether your code has cache problems and potential for cache optimizations. Together with average L1/L2 cache latencies, you can come up with a rough time estimation which is often quite good (there is a derived cost type "Cycle Estimation" provided with KCachegrind, which defaults to 10/100 cycles latency for L1/L2; you should adjust the formula for your machine). However, if you do not have cache problems, there probably is no good way to estimate time from the "instruction fetch" cost given by Cachegrind/Callgrind.

> This would explain why pthread_mutex_lock() seems to hog the most costs
> and memcpy() (which gprof identified as the major hog) is negligible.

This sounds like you are only looking at instruction fetches even though you have a lot of L2 misses.
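To make the estimate concrete, here is a small sketch of my own (not KCachegrind code) of the kind of formula meant above, using the default 10/100 cycle latencies; the function name, event names, and the example miss counts are illustrative assumptions, and the latency weights are exactly what you should adjust for your machine:

```python
# Rough cycle estimate in the spirit of KCachegrind's "Cycle Estimation":
# each fetched instruction counts one cycle, plus a penalty per cache miss.
# The 10/100 latencies are the defaults mentioned above; adjust for your CPU.

def cycle_estimate(ir, l1_misses, l2_misses, l1_latency=10, l2_latency=100):
    """ir: instructions fetched; l1_misses/l2_misses: total L1/L2 misses."""
    return ir + l1_latency * l1_misses + l2_latency * l2_misses

# Hypothetical numbers: a memcpy-like function with few instructions but
# many L2 misses vs. a lock-like function that merely fetches more
# instructions. The miss-heavy one comes out more expensive.
memcpy_like = cycle_estimate(ir=1_000_000, l1_misses=50_000, l2_misses=20_000)
mutex_like = cycle_estimate(ir=3_000_000, l1_misses=1_000, l2_misses=0)
print(memcpy_like, mutex_like)  # → 3500000 3010000
```

This is why ranking functions by instruction fetches alone can invert the real ordering when cache behavior differs a lot between them.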
Can it be that you ran with the cache simulator switched off (which is the default for Callgrind)? Use "--simulate-cache=yes" and look at the cycle estimation cost (in KCachegrind).

> Or am I missing something here? I'd really like to understand why
> Callgrind's and gprof's (and OProfile's, btw, but that's not useful for
> me as its granularity is too low) results differ.

I hope the above helps you understand your case. BTW, gprof uses compile-time instrumentation, and depending on the application, the overhead can be near 100%. This also disturbs the measurement itself. Pure sampling (like OProfile) is more exact.

Why is OProfile's granularity too low for you? In contrast to gprof, you can even adjust the sample interval there to tune the overhead. Gprof does sampling too, but only with timers, and with the handler in user land. OProfile really should be more exact, as it does its sample handling in kernel space with lower latency. Of course, pure sampling cannot give you a full call graph (though you can make OProfile come up with extracts of the real call graph by doing a stack backtrace at every sample point).

> I couldn't find anything about this either in the Valgrind manual or
> on the KCachegrind pages, so I hope that someone here can help me...

This is becoming a FAQ. I will try to come up with something for the Callgrind manual.

Josef

> TIA,
> -hs
>
> _______________________________________________
> Valgrind-users mailing list
> Val...@li...
> https://lists.sourceforge.net/lists/listinfo/valgrind-users
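For reference, a typical invocation with the cache simulator enabled could look like the following sketch; "./myprog" is a placeholder for your binary, and the exact output filename depends on the process ID:

```shell
# Run under Callgrind with the cache simulator on, so L1/L2 miss counts
# are collected alongside instruction fetches (off by default):
valgrind --tool=callgrind --simulate-cache=yes ./myprog

# callgrind_annotate prints per-function event counts; loading the
# callgrind.out.<pid> file into KCachegrind additionally shows the
# derived "Cycle Estimation" cost type discussed above.
callgrind_annotate callgrind.out.<pid>
```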