From: Chaitali G. <cha...@ya...> - 2009-02-20 03:13:30
Hi,
I am a new user of the Valgrind tool.
I am trying to do cache profiling for a multi-threaded program written in C, using cachegrind. Though the output is quite detailed, I could not find the L1 and L2 cache miss counts for the different cores. I am running the multi-threaded program on a dual-core machine. How do I find out the L1 and L2 data cache misses incurred on each core?
Any suggestion would be highly appreciated.
Regards
Chaitali
From: Josef W. <Jos...@gm...> - 2009-02-20 11:21:40
On Friday 20 February 2009, Chaitali Gupta wrote:
> I am trying to do cache profiling for a multi-threaded program written in
> C. I am using cachegrind to do so. Though the output is really
> descriptive, I could not find the results of L1 and L2 cache misses for
> different cores. I am running the multi-threaded program in a dual core
> machine. How do I get to know the L1 and L2 data cache misses incurred
> in each core?

Callgrind reports cache simulation results per thread with "--separate-threads=yes" (use "--simulate-cache=yes" in addition). Note that Valgrind serializes threads, AFAIK with a "work slice" of 100,000 executed basic (super?) blocks. Thus there is no way to say how closely this work interleaving resembles what would have happened in reality, be it by time-sharing on one core or by running threads simultaneously on multiple cores in parallel.

The conclusion: the per-thread data do not say much in principle. However, they *can* be useful if you know that you use static work partitioning (e.g. OpenMP without dynamic/guided scheduling) and the code executed in each thread is more or less the same, only with different data: then, because the cache characteristics of the threads are likely similar, Valgrind approximates reality in some way.

And of course, cachegrind/callgrind simulates just one cache hierarchy, even for multithreaded code. However, multicores nowadays often have a shared LLC (last-level cache), so you should get some idea of the real LLC behavior when you look at cachegrind's L2 results. For anything better, one would definitely need a simulated time for each thread, and a way for Valgrind to influence thread scheduling dynamically. (Note that Valgrind currently leaves all scheduling decisions to the kernel.)

> Any suggestion would be highly appreciated.

What are you after? I usually would first check that the sequential version runs well, and then move on to parallelization. Cachegrind/callgrind is (currently) quite limited for the latter. Why not use OProfile (or a similar tool) to check for load balancing? That is probably not satisfying if you would like to analyse e.g. data sharing behaviour among cores; for the latter, simulation would be useful, but cachegrind/callgrind currently is not.

Josef