From: Aniruddha S. <sh...@cs...> - 2005-11-16 06:16:02
Hi Josef,

I am able to run KCacheGrind. It truly is a wonderful tool, very informative and powerful. Thanks.

I need clarification on the numbers that appear under the columns AcCost1, SpLoss1, AcCost2 and SpLoss2. I initially felt that (AcCost2 = L2 misses * L2 line size), and something similar for AcCost1 in the case of L1, and that SpLoss2 is a fraction of AcCost2, denoting the number of bytes never used. The numbers in the log file (obtained using the --log-file option) don't reflect this, so my impression is obviously wrong. Can you please explain how these numbers are obtained?

Thanks,
Aniruddha

--
-----------------------------------------------------------------------------------------
Aniruddha G. Shet           | Project webpage: http://forge-fre.ornl.gov/molar/index.html
Graduate Research Associate | Project webpage: http://www.cs.unm.edu/~fastos
Dept. of Comp. Sci. & Engg  | Personal webpage: http://www.cse.ohio-state.edu/~shet
The Ohio State University   | Office: DL 474
2015 Neil Avenue            | Phone: +1 (614) 292 7036
Columbus OH 43210-1277      | Cell: +1 (614) 446 1630
-----------------------------------------------------------------------------------------
From: Josef W. <Jos...@gm...> - 2005-11-16 17:45:39
On Wednesday 16 November 2005 07:15, you wrote:
> Callgrind is aborting during the profiling process. The execution of the
> profiled code completed successfully but the log file obtained by setting
> the --log-file option shows that Callgrind aborted at some stage. I have
> attached the log file for your reference.

Hmm... It is really at the very end, so the results are fine. It looks like some thread ID is set to 0, which is not a valid ID. Does this happen with smaller runs on your machine, too?

> I need clarification on the numbers that appear under the columns AcCost1,
> SpLoss1, AcCost2 and SpLoss2. [...]

The first question is how to get the most meaningful values. Understanding this will help you see why I did it this way.

I wanted to have a metric for the number of bytes actually touched in a cache line, and some kind of reuse measure, i.e. how often a cache line was accessed before being evicted. It is best to be able to simply sum up metric numbers; thus, the sum of all such individual numbers for a cache line should indicate the relevance of the performance problem you have at this point in your program. To put it another way: a large number should correspond to a large performance problem, because KCachegrind (or any profiling visualization) shows ordered lists with the highest numbers at the top.

Thus, I did not choose the number of bytes touched, but the number of bytes *not* touched (hence "spatial loss"). Similarly for the number of accesses to a memory block before it is evicted from a cache line: the performance problem is big if the number of accesses to one cache line is low. Thus, I took the reciprocal of the access number. Because the profile format only handles integer values, I use 1000/(access count) as the metric, calling this "access cost".

The second question is how to attribute these numbers to code positions. This is needed because the user wants to see where to optimize. I get the use metrics at eviction time of some memory block. Candidates are:
1) Data structure of the memory block evicted
2) Data structure of the memory block evicting
3) Source code position which triggered loading of the evicted block
4) Current code position, i.e. the position which triggered the eviction
5) Any combination of tuples from 1-4

As Callgrind currently cannot attribute to data structures, 1) and 2) are not possible. 5) is not supported by visualization. 3) should allow you to identify the data structure which has a problem with spatial loss (the structure layout should be better arranged according to usage) or access cost (e.g. candidates for blocking). 4) is in general not that useful; perhaps to detect code which is polluting the cache... I used 3) in Callgrind for the cache metrics, i.e. the numbers are attributed to the code position where a cache line was loaded.

This is all at the research stage, and I am interested in any comments.

Regarding implementation details: for every cache line, I have a bit mask of bytes used and an access count for the currently loaded block. As Callgrind simulates an inclusive cache, I have for every memory block in L1 a pointer to where this block is in L2. When a block is evicted from L1, I update SpLoss1 and AcCost1, and combine the values with the corresponding values of this memory block in L2. On L2 block eviction, I update SpLoss2 and AcCost2 accordingly.

If you run e.g. a KDE app with --cacheuse, you will see that most of the spatial loss comes from the runtime linker: for every symbol name, it looks up around 20 hash tables. As a hash table access loses 60 bytes (of 64, with cache line size 64), you get 1.2 KB of loss per symbol. Multiply this by the number of lookups (e.g. around 20,000 for konqueror), and the problem with slow startup times becomes more obvious.

Josef
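The per-line bookkeeping Josef describes can be sketched as follows. Callgrind itself is written in C; this is a Python illustration with invented names (`LineUse`, `access`, `evict` are not Callgrind identifiers), showing only the core idea: a byte-use bitmask and an access count per loaded line, converted at eviction time into SpLoss (bytes never touched) and AcCost (1000 divided by the access count).

```python
LINE_SIZE = 64  # bytes per cache line, as in the linker example below


class LineUse:
    """Per-cache-line use tracking (illustrative names, not Callgrind's)."""

    def __init__(self):
        self.used_mask = 0      # bit i set => byte i of the line was touched
        self.access_count = 0   # accesses to the line since it was loaded

    def access(self, offset, size):
        # Mark the touched bytes and count the access.
        for b in range(offset, min(offset + size, LINE_SIZE)):
            self.used_mask |= 1 << b
        self.access_count += 1

    def evict(self):
        # At eviction time, derive the two metrics:
        # SpLoss = bytes loaded but never touched ("spatial loss"),
        # AcCost = 1000 // access count (reciprocal, scaled to an integer).
        sploss = LINE_SIZE - bin(self.used_mask).count("1")
        accost = 1000 // self.access_count if self.access_count else 0
        return sploss, accost


# One access that touches only a 4-byte word, then eviction:
line = LineUse()
line.access(offset=16, size=4)
sploss, accost = line.evict()
print(sploss, accost)  # 60 bytes never used, access cost 1000
```

In the full simulator these per-eviction values would be accumulated into SpLoss1/AcCost1 at the source position that loaded the line (attribution candidate 3 above), and merged into the L2 block's counters for SpLoss2/AcCost2.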
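The runtime-linker arithmetic in the last paragraph works out as follows (all inputs are the approximate figures from the mail):

```python
# Back-of-envelope check of the runtime-linker example.
bytes_lost_per_access = 60   # 60 of 64 bytes untouched per hash-table probe
tables_per_symbol = 20       # ~20 hash tables consulted per symbol lookup
lookups = 20_000             # ~number of symbol lookups for konqueror

loss_per_symbol = bytes_lost_per_access * tables_per_symbol  # 1200 B ~= 1.2 KB
total_loss = loss_per_symbol * lookups                       # 24,000,000 B ~= 24 MB
print(loss_per_symbol, total_loss)
```

So a single application start wastes on the order of 24 MB of cache traffic on symbol resolution alone, which is why it dominates the --cacheuse output.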
From: Aniruddha S. <sh...@cs...> - 2005-11-16 20:25:16
Hi Josef,

The profiling completes successfully with smaller runs. Since you mentioned that the abort happens at the very end, can I consider the profiling output to be relevant and complete even for the aborted run?

I would like you to comment on my understanding of the AcCost and SpLoss events. While I understand what they stand for, I am not clear on how their values are actually obtained. From what I gather:

1) SpLoss1 is some fraction of (D1 misses * D1 line size + I1 misses * I1 line size) and hence serves as a measure of spatial loss at the L1 level.
2) SpLoss2 is some fraction of (L2 misses * L2 line size) and hence serves as a measure of spatial loss at the L2 level.
3) (D1 misses + I1 misses) <= AcCost1 <= (D1 misses + I1 misses) * 1000.
4) (L2 misses) <= AcCost2 <= (L2 misses) * 1000.

Thanks,
Aniruddha

----- Original Message -----
From: "Josef Weidendorfer" <Jos...@gm...>
To: <val...@li...>
Cc: "Aniruddha Shet" <sh...@cs...>
Sent: Wednesday, November 16, 2005 12:45 PM
Subject: [SPAM] Re: [Valgrind-users] Clarification on AcCost1, SpLoss1, AcCost2 and SpLoss2

> [...]
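The bounds proposed in points 3) and 4) can be probed numerically. Assuming each eviction contributes 1000 // (access count), per Josef's description, the upper bound always holds, but integer division makes the contribution 0 once a line is accessed more than 1000 times, so the lower bound only holds when access counts stay modest. This is an illustrative check, not Callgrind's code.

```python
# Probe the proposed bound: misses <= sum(AcCost) <= misses * 1000,
# assuming each eviction contributes 1000 // access_count.
def accost_sum(access_counts):
    return sum(1000 // c for c in access_counts)

evictions = [1, 2, 10, 500]      # access counts of four evicted lines
total = accost_sum(evictions)    # 1000 + 500 + 100 + 2 = 1602
misses = len(evictions)
assert misses <= total <= misses * 1000   # both bounds hold here...

# ...but the lower bound can fail: a line accessed 2000 times
# contributes 1000 // 2000 == 0, so heavily reused lines can push
# sum(AcCost) below the miss count.
assert accost_sum([2000, 2000]) == 0
```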