|
From: Aniruddha S. <sh...@cs...> - 2005-11-05 20:19:46
|
Hi,

Does Cachegrind allow you to specify that certain functions not be profiled? In other words, is there a way to turn profiling on and off in certain code segments?

Thanks,
Aniruddha
|
From: Josef W. <Jos...@gm...> - 2005-11-05 21:35:12
|
On Saturday 05 November 2005 21:19, Aniruddha Shet wrote:
> Does Cachegrind allow you to specify that certain functions not be profiled?
> In other words, is there a way to turn on and off profiling in certain code
> segments?

Cachegrind gives you the self cost of each function, so why not simply ignore the functions you do not want profiled in the output?

Or check out the Callgrind tool and the --toggle-collect=<func> option. It toggles the collection state when entering or leaving the given function.

Josef
|
From: Josef W. <Jos...@gm...> - 2005-11-09 09:49:19
|
On Wednesday 09 November 2005 04:03, you wrote:
> I want to speedup the simulation by specifying that a certain code segment
> not be profiled while the remainder of the code be profiled.

Ah, you should always say what you want to achieve when asking a question on the mailing list.

A speedup can only be achieved by simplified instrumentation. In Cachegrind/Callgrind, this would mean that the cache simulator is not fed the memory access stream, and that would make the whole cache simulation go wrong. So you do not want to change instrumentation with Cachegrind/Callgrind if you are interested in any cache misses/hits.

With Callgrind, the default is not to do cache simulation. But Callgrind has an additional slowdown because of call graph tracing, so this is not the way to get a speedup.

However, you can temporarily switch off instrumentation for all the code that runs. If you switch it on again later, you have to be prepared for cache misses which would not happen in reality. But an approximation of the real cache state should usually be reached a few million memory accesses later (depending on the application). Thus, depending on the cache size, you will get a fixed number of additional cache misses, but this number disappears into the noise after some time.

To start without instrumentation (the same as valgrind --tool=none ...), start Callgrind with "callgrind --instr-atstart=no ..." and run "callgrind_control -i on" afterwards, or use CALLGRIND_START_INSTRUMENTATION() from $prefix/include/valgrind/callgrind.h to switch instrumentation on programmatically.

Josef

> Will this really speedup the simulation? Does the --toggle-collect=<func>
> option ensure that functions called from within <func> are not profiled?
|
From: Aniruddha S. <sh...@cs...> - 2005-11-11 04:05:14
|
On Wed, 9 Nov 2005, Josef Weidendorfer wrote:

Hi,

I am running Callgrind with the options -v --log-file=summary --simulate-cache=yes --simulate-hwpref=yes --cacheuse=yes. The summary log file contains the lines:

  Prefetch Up: 0
  Prefetch Down: 0

What do these lines mean? From what I understand, --simulate-hwpref=yes simulates a hardware prefetcher, as found in the Intel Pentium 4 processor.

Also, does --cacheuse=yes collect cache line utilization statistics, i.e. what percentage of a line is utilized after being brought into the cache and before being evicted from it? Where can this information be viewed?

Thanks,
Aniruddha
--
-----------------------------------------------------------------------------------------
Aniruddha G. Shet           | Project webpage: http://forge-fre.ornl.gov/molar/index.html
Graduate Research Associate | Project webpage: http://www.cs.unm.edu/~fastos
Dept. of Comp. Sci. & Engg  | Personal webpage: http://www.cse.ohio-state.edu/~shet
The Ohio State University   | Office: DL 474
2015 Neil Avenue            | Phone: +1 (614) 292 7036
Columbus OH 43210-1277      | Cell: +1 (614) 446 1630
-----------------------------------------------------------------------------------------
|
From: Josef W. <Jos...@gm...> - 2005-11-11 11:27:35
|
On Friday 11 November 2005 05:05, you wrote:
> I am running Callgrind with the options -v --log-file=summary --simulate-cache=yes
> --simulate-hwpref=yes --cacheuse=yes. The summary log file contains the lines:
>
> Prefetch Up: 0
> Prefetch Down: 0

Oh, someone who is using the more advanced (and probably not that well tested) code! Very good. It would be nice if you could tell me whether these features are useful for you.

You cannot use --simulate-hwpref=yes and --cacheuse=yes at once; they are separate simulator code paths. I will change the code to print a warning about this, thanks.

If I run e.g.

  callgrind -v --simulate-hwpref=yes ls

(this option also switches on cache simulation), I get

  --12922-- Prefetch Up: 1507
  --12922-- Prefetch Down: 36

so I think this still works fine.

> What do these lines mean? From what I understand, --simulate-hwpref=yes
> simulates a hardware prefetcher, as is found in the Intel Pentium 4
> processor.

Yes. The P4 (and P-M) automatically detects upward and downward streaming, stopping at 4kB boundaries (streams on virtual addresses produce a disrupted stream of physical addresses at 4kB boundaries because of virtual memory).

A nice thing is that the Pentium-M has hardware performance counters exactly for the Prefetch Up and Prefetch Down events, i.e. you can observe the hardware prefetcher on the Pentium-M in action using OProfile/Perfex/PAPI, and compare the results with those from Callgrind.

With --simulate-hwpref=yes I add this heuristic, and presume that every line loaded by the hardware prefetcher will give a hit when accessed later on. Note that this is not always the case: the real access could come so early that you would still get a miss in reality, even though the hardware prefetcher has caught the line. Unfortunately, Callgrind has no way to get a simulated wall clock time, which would be needed to detect such cases.

So callgrind --simulate-hwpref gives the best case possible for the prefetcher. In reality, the result lies between the results without and with this option.

The intended usage is to compare results with and without the prefetcher. For functions where you see a big difference, the prefetcher is working quite well, i.e. any micro-optimizations to bring down the usual Callgrind results (without prefetcher) will not lead to any real improvement. But in code regions where the results are not really different, you can see that the prefetching heuristic of the P4/P-M is not working, and you can try to add software prefetch instructions (or otherwise change the code).

A drawback is that Callgrind does not take software prefetch instructions into account, as Valgrind does not feed these instructions to the tool, but ignores them. But if there really are users for this simulator enhancement, we can try to include them in the VG core (e.g. Cachegrind).

To make the comparison of the two runs easier, I should include a compare mode in KCachegrind.

> Also, does --cacheuse=yes collect cache line utilization statistics i.e.
> what percentage of a line is utilized after being brought into cache and
> before being evicted from the cache? Where can this information viewed?

Yes. The number of bytes never used in a cache line is attributed to the instruction which triggered the load. This is the event SpLoss1 (for L1) and, more importantly, SpLoss2 (for L2).

The full number of bytes loaded by an instruction is given by the number of L1 or L2 misses attributed to that instruction, multiplied by the cache line size. In KCachegrind, add new derived events with the formulas "64 L1m" and "64 L2m" to directly get the numbers to compare.

You can view this information with KCachegrind. Unfortunately, there was a hardcoded maximum of 10 event types in KCachegrind up to KDE 3.4.x, and --cacheuse=yes gives you 12 event types, leading to a load error. This is fixed in the version in KDE 3.5; or use the newest one from the website (kcachegrind.sf.net).

Theoretically, callgrind_annotate should be able to show these results, too. For it to cope with the format, you have to additionally provide --compress-pos=no --compress-strings=no on the callgrind command line. Even then, it fails with

  Line xxxx: summary event and total event mismatch

Oh yeah, it is time to provide a better command line tool...

Josef
|
From: Aniruddha S. <sh...@cs...> - 2005-11-12 19:01:40
|
On Fri, 11 Nov 2005, Josef Weidendorfer wrote:

Hi,

As you have indicated, I too want to use the --simulate-hwpref option to determine the performance benefit with and without the prefetcher. It serves as a measure of spatial locality in the profiled code.

I have yet to view the output of the --cacheuse option. Again, the objective is to understand the extent of spatial locality in the profiled code.

Thanks,
Aniruddha
|
From: Josef W. <Jos...@gm...> - 2005-11-13 11:11:26
|
On Saturday 12 November 2005 20:01, Aniruddha Shet wrote:
> As you have indicated, I too want to use the --simulate-hwpref option to
> determine the performance benefit with and without prefetcher. It serves
> as a measure of spatial locality in the profiled code.

The P4-like hardware prefetcher gives you a benefit if you do large sequential accesses to memory, i.e. streams. This is a special kind of spatial locality. But note that this hardware prefetcher (at least my simulation of it) detects streams of accessed *cache lines*, i.e. it will work even with a stride of 64 bytes (if your cache line size is 64 bytes). It does not tell you anything about spatial locality within a cache line.

The best way to see spatial locality is to change the cache line size of the simulator and compare the miss counts. If you have no spatial locality at all, you should get the same number of misses independent of the cache line size. If the misses go down with a larger cache line size, your program exhibits spatial locality.

You can set the cache parameters with Cachegrind/Callgrind. Compare e.g. the usual result with a result using an 8-byte line size.

> I am yet to view the output of --cacheuse option. Again, the objective is
> to understand the extent of spatial locality in the profiled code.

Yes. "No spatial locality" would mean: only one byte or word accessed per cache line before eviction. And this should be visible via cache use.

Josef
|
From: Aniruddha S. <sh...@cs...> - 2005-11-15 00:04:42
|
Hi,

Can you please help me correct the following error, which I am encountering while trying to install KCachegrind?

  if g++ -DHAVE_CONFIG_H -I. -I. -I.. -I/usr/include/kde -I/usr/lib/qt-3.1/include -I/usr/X11R6/include -DQT_THREAD_SUPPORT -D_REENTRANT -Wnon-virtual-dtor -Wno-long-long -Wundef -ansi -D_XOPEN_SOURCE=500 -D_BSD_SOURCE -Wcast-align -Wconversion -Wchar-subscripts -Wall -W -Wpointer-arith -Wwrite-strings -O2 -Wformat-security -Wmissing-format-attribute -fno-exceptions -fno-check-new -fno-common -MT callgraphview.o -MD -MP -MF ".deps/callgraphview.Tpo" -c -o callgraphview.o callgraphview.cpp; \
  then mv -f ".deps/callgraphview.Tpo" ".deps/callgraphview.Po"; else rm -f ".deps/callgraphview.Tpo"; exit 1; fi
  callgraphview.cpp: In constructor `PannerView::PannerView(QWidget*, const char*)':
  callgraphview.cpp:955: `WNoAutoErase' undeclared (first use this function)
  callgraphview.cpp:955: (Each undeclared identifier is reported only once for each function it appears in.)
  make[2]: *** [callgraphview.o] Error 1
  make[2]: Leaving directory `/a/osu4005/Valgrind/kcachegrind-0.4.6/kcachegrind'
  make[1]: *** [all-recursive] Error 1
  make[1]: Leaving directory `/a/osu4005/Valgrind/kcachegrind-0.4.6'
  make: *** [all] Error 2

Thanks,
Aniruddha
it will work even with a stride size > of 64 bytes (if your cache line size is 64 bytes). > This does not give you inner-cache-line spatial locality. > > The best to see spatial locality is to change the cache line size of the > simulator and compare the miss results. If you have no spatial locality at > all, you should get the same number of misses independent on the cache line > size. If the misses go down with larger cache line size, your program exhibits > spatial locality. > > You can set the cache parameters with cachegrind/callgrind. > Compare e.g. the usual result with a result with 8 byte line size. > > > I am yet to view the output of --cacheuse option. Again, the objective is > > to understand the extent of spatial locality in the profiled code. > > Yes. "No spatial locality" would mean: only one byte or word accessed per cache > line before eviction. And this should be visible via cache use. > > Josef > > > > > Thanks, > > Aniruddha > > > > > > > On Friday 11 November 2005 05:05, you wrote: > > > > On Wed, 9 Nov 2005, Josef Weidendorfer wrote: > > > > Hi, > > > > > > > > I am running Callgrind with the options -v --log-file=summary --simulate-cache=yes > > > > --simulate-hwpref=yes --cacheuse=yes. The summary log file contains the > > > > lines: > > > > > > > > Prefetch Up: 0 > > > > Prefetch Down: 0 > > > > > > Oh, someone which is using the more advanced (and probably not that much tested), > > > code! Very good. It would be nice if you can tell me if these features are > > > useful for you. > > > > > > You can not use --simulate-hwpref=yes and --cacheuse=yes at once. This is > > > separated simulator code. > > > I will change the code to give out a warning regarding this, thanks. > > > > > > If I do e.g. > > > > > > callgrind -v --simulate-hwpref=yes ls > > > > > > This option also switches on cache simulation. I get > > > > > > --12922-- Prefetch Up: 1507 > > > --12922-- Prefetch Down: 36 > > > > > > so I think this still works fine. 
> > > > > > > What do these lines mean? From what I understand, --simulate-hwpref=yes > > > > simulates a hardware prefetcher, as is found in the Intel Pentium 4 > > > > processor. > > > > > > Yes. The P4 (and P-M) automatically detects upward and downward streaming, > > > stopping at 4kB boundaries (streams on virtual addresses get a disrupted > > > stream of physical addresses at 4kB boundaries because of VM). > > > > > > A nice thing is that the Pentium-M has hardware performance counters exact > > > for the Prefetch/Up and Prefetch/Down events, i.e. you can observe the > > > hardware prefetcher on the Pentium-M in action by using OProfile/Perfex/PAPI, > > > and compare the results with that from Callgrind. > > > > > > By using --simulate-hwpref=yes I add this heuristic, and presume that > > > every line loaded by the hardware prefetcher will give a hit when accessed > > > later on. > > > > > > Note that this is not always the case: the real access could come that > > > early that you still would get a miss in reality, even if the hardware > > > prefetcher has catched the line. > > > Unfortunately, callgrind has no way to get a simulated wall clock time, > > > which would be needed to detect such cases. > > > > > > So callgrind --simulate-hwpref will give the best case possible for the > > > prefetcher. In reality, it is between the results without and with this > > > option. > > > > > > The usage is to compare results with and without the prefetcher. > > > For functions where you see a big difference, the prefetcher is working > > > quite good, i.e. any microoptimizations to bring down the usual callgrind > > > results (without prefetcher) will not lead to any real improvements. > > > > > > But in the code regions, where the results are not really different, you > > > see that the prefetching heuristic of the P4/PM is not working, and you > > > can try to add software prefetch instructions (or otherwise change the code). 
> > > > > > A drawback is that callgrind does not take software prefetch instructions > > > into account, as Valgrind does not feed these instructions to the tool, but > > > ignores them. But if there really are users for this simulator enhancement, > > > we can try to include them in the VG core (e.g. cachegrind). > > > > > > To make the comparison of the two runs easier, I should include a compare > > > mode in KCachegrind. > > > > > > > Also, does --cacheuse=yes collect cache line utilization statistics, i.e. > > > > what percentage of a line is utilized after being brought into the cache and > > > > before being evicted from the cache? Where can this information be viewed? > > > > > > Yes. The number of bytes never used in a cache line will be attributed to > > > the instruction which triggered the load. This is event SpLoss1 (for L1) > > > and, more importantly, SpLoss2 (for L2). > > > > > > The full amount of bytes loaded by an instruction is given by the number of > > > L1 or L2 misses this instruction gets attributed, multiplied by the cache > > > line size. In KCachegrind, add new derived events with the formulas > > > "64 L1m" and "64 L2m" to directly get the numbers to compare. > > > > > > You can view this information with KCachegrind. Unfortunately, there was a > > > hardcoded maximum of 10 event types in KCachegrind up to KDE 3.4.x. > > > And --cacheuse=yes gives you 12 event types, leading to a load error. > > > This is fixed in the version in KDE 3.5, or use the newest one from the > > > website (kcachegrind.sf.net). > > > > > > Theoretically, callgrind_annotate should be able to show these results, too. > > > For it to cope with the format, you have to additionally provide > > > --compress-pos=no --compress-strings=no > > > on the callgrind command line. Even then, it fails with > > > Line xxxx: summary event and total event mismatch > > > > > > Oh yeah, it is time to provide a better command line tool... 
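[Editor's sketch] The arithmetic behind the derived events above is simple enough to spell out; the miss and SpLoss2 numbers below are made up for illustration, and 64 bytes is the assumed cache line size:

```python
LINE_SIZE = 64  # assumed cache line size in bytes

def bytes_loaded(misses, line_size=LINE_SIZE):
    """Bytes brought into the cache by an instruction's misses
    (the "64 L2m" derived event from the text)."""
    return misses * line_size

def spatial_loss_fraction(sploss_bytes, misses, line_size=LINE_SIZE):
    """Fraction of the loaded bytes never touched before eviction
    (SpLoss2 relative to the total bytes loaded)."""
    return sploss_bytes / bytes_loaded(misses, line_size)

# Hypothetical instruction: 1000 L2 misses, SpLoss2 = 48000 bytes
print(bytes_loaded(1000))                 # 64000 bytes loaded in total
print(spatial_loss_fraction(48000, 1000)) # 0.75 -> 75% of each line wasted
```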
> > > > > > Josef > > > > > > > > > > > > > > Thanks, > > > > Aniruddha > > > > > > > > > ------------------------------------------------------- > > > SF.Net email is sponsored by: > > > Tame your development challenges with Apache's Geronimo App Server. Download > > > it for free - -and be entered to win a 42" plasma tv or your very own > > > Sony(tm)PSP. Click here to play: http://sourceforge.net/geronimo.php > > > _______________________________________________ > > > Valgrind-users mailing list > > > Val...@li... > > > https://lists.sourceforge.net/lists/listinfo/valgrind-users > > > > -- ----------------------------------------------------------------------------------------- Aniruddha G. Shet | Project webpage: http://forge-fre.ornl.gov/molar/index.html Graduate Research Associate | Project webpage: http://www.cs.unm.edu/~fastos Dept. of Comp. Sci. & Engg | Personal webpage: http://www.cse.ohio-state.edu/~shet The Ohio State University | Office: DL 474 2015 Neil Avenue | Phone: +1 (614) 292 7036 Columbus OH 43210-1277 | Cell: +1 (614) 446 1630 ----------------------------------------------------------------------------------------- |
|
From: Aniruddha S. <sh...@cs...> - 2005-11-16 04:11:48
Attachments:
W_s10_128K.sum.19842
|
On Mon, 14 Nov 2005, Aniruddha Shet wrote: Hi, Callgrind is aborting during the profiling process. The execution of the profiled code completed successfully, but the log file obtained by setting the --log-file option shows that Callgrind aborted at some stage. I have attached the log file for your reference. Thanks, Aniruddha > Hi, > > Can you please help me correct the following error that I am > encountering while trying to install KCachegrind? > > if > g++ -DHAVE_CONFIG_H -I. -I. -I.. -I/usr/include/kde > -I/usr/lib/qt-3.1/include > -I/usr/X11R6/include -DQT_THREAD_SUPPORT -D_REENTRANT > -Wnon-virtual-dtor > -Wno-long-long -Wundef -ansi -D_XOPEN_SOURCE=500 -D_BSD_SOURCE > -Wcast-align > -Wconversion -Wchar-subscripts -Wall -W -Wpointer-arith -Wwrite-strings > -O2 > -Wformat-security -Wmissing-format-attribute -fno-exceptions > -fno-check-new > -fno-common -MT callgraphview.o -MD -MP -MF > ".deps/callgraphview.Tpo" -c -o callgraphview.o callgraphview.cpp; \ > then mv -f ".deps/callgraphview.Tpo" ".deps/callgraphview.Po"; else rm -f > ".deps/callgraphview.Tpo"; exit 1; fi > callgraphview.cpp: In constructor `PannerView::PannerView(QWidget*, const > char*)': > callgraphview.cpp:955: `WNoAutoErase' undeclared (first use this function) > callgraphview.cpp:955: (Each undeclared identifier is reported only once > for > each function it appears in.) > make[2]: *** [callgraphview.o] Error 1 > make[2]: Leaving directory > `/a/osu4005/Valgrind/kcachegrind-0.4.6/kcachegrind' > make[1]: *** [all-recursive] Error 1 > make[1]: Leaving directory `/a/osu4005/Valgrind/kcachegrind-0.4.6' > make: *** [all] Error 2 > > > Thanks, > Aniruddha > > On Sun, 13 Nov 2005, Josef Weidendorfer wrote: > > > On Saturday 12 November 2005 20:01, Aniruddha Shet wrote: > > > On Fri, 11 Nov 2005, Josef Weidendorfer wrote: > > > Hi, > > > > > > As you have indicated, I too want to use the --simulate-hwpref option to 
It serves > > > as a measure of spatial locality in the profiled code. > > > > The P4-like hardware prefetcher gives you a benefit if you do > > large sequential accesses into memory, i.e. streams. This is a special > > kind of spatial locality. > > > > But note that this hardware prefetcher (at least my simulation) detects > > streams of accessed *cache lines*, i.e. it will work even with a stride size > > of 64 bytes (if your cache line size is 64 bytes). > > This does not give you inner-cache-line spatial locality. > > [...] 
|