|
From: Jerry S. <js...@fi...> - 2014-07-03 06:06:29
|
Hi, I'm new to the list. A quick look at the archives found a number of people asking and not much clear guidance. If there is a document or thread that I need to look at, could you please point the way?

I have a critical need to understand the cache use and memory access for a circuit simulation situation. We are doing circuit simulation and need to run months of simulation for a particular project. I am the person tasked with building up the compute platform for this, and I am trying to decide how to balance processor speed, cache size and cost to get the most simulation runs for our budget. The number of sims is near infinite, so it is really about getting the most done for the buck. As an added twist, one of the apps is a vendor-supplied 32-bit app, so I have needed to get 32-bit apps and libraries on my 64-bit machine to get cachegrind to run.

One of my test platforms, which I built for another project, has a pair of Intel Xeon e5-2643v2 processors. These have 6 cores and 25MB of L3 cache, so it gives me a great test for a high cache-per-core case. When I run cachegrind with no options, I get the right numbers back on the amount of memory and line width (64). I am guessing the number of associations (20) is right, but I don't have specs to compare to.

Why is it insisting that the numbers be powers of 2? If I use --LL with the real numbers it refuses to start. If I round these down to the next lower power of 2 (16M and 16 associations) it doesn't complain but it still doesn't run.

What do I have to do to get this to work for me? What is the general solution for all the modern chips that have nothing bound to powers of 2? This seems to be a fundamental issue with modern processors.

thanks in advance,
jerry |
|
From: Tom H. <to...@co...> - 2014-07-03 07:33:14
|
On 03/07/14 07:06, Jerry Scharf wrote:
> Why is it insisting that the numbers be powers of 2? If I use --LL with the real numbers it refuses to start. If I round these down to the next lower power of 2 (16M and 16 associations) it doesn't complain but it still doesn't run.

The internal design of the data structures used by cachegrind assumes that sizes will be powers of two and that it can compute indexes by bit masking.

That is, as you have found, no longer true for modern processors, but to date nobody has stepped up to fix cachegrind. On top of that it isn't clear that there is a good fix that wouldn't produce a loss of performance.

Tom

--
Tom Hughes (to...@co...) http://compton.nu/ |
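To illustrate the bit-masking point, here is a minimal sketch in Python. It is not cachegrind's actual code; the function names are made up for illustration. It shows why a shift-and-mask set index only matches the general divide-and-modulo index when the line size and the number of sets are both powers of two.

```python
def set_index_mask(addr: int, line_size: int, num_sets: int) -> int:
    """Set index via shift and mask; only correct for power-of-two sizes."""
    line_bits = line_size.bit_length() - 1   # log2(line_size) for a power of two
    return (addr >> line_bits) & (num_sets - 1)


def set_index_mod(addr: int, line_size: int, num_sets: int) -> int:
    """General set index via division and modulo; works for any positive sizes."""
    return (addr // line_size) % num_sets


addr = 0x12345678  # arbitrary example address

# 16384 sets (a power of two): both methods agree.
print(set_index_mask(addr, 64, 16384), set_index_mod(addr, 64, 16384))

# 20480 sets (not a power of two): the mask no longer selects a valid index.
print(set_index_mask(addr, 64, 20480), set_index_mod(addr, 64, 20480))
```

The mask version avoids a division on every simulated access, which is presumably where the performance concern comes from.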
|
From: Josef W. <Jos...@gm...> - 2014-07-03 09:27:06
|
On 03.07.2014 09:32, Tom Hughes wrote:
> On 03/07/14 07:06, Jerry Scharf wrote:
>
>> Why is it insisting that the numbers be powers of 2? If I use --LL with the real numbers it refuses to start. If I round these down to the next lower power of 2 (16M and 16 associations) it doesn't complain but it still doesn't run.

What's the output of "it still doesn't run"?

> The internal design of the data structures used by cachegrind assumes
> that sizes will be powers of two and that it can compute indexes by bit
> masking.

The cache size does not have to be a power of two; only the number of sets has to be, for the bit-masking trick to work. For example, 25MB with associativity 25, i.e. "--LL=26214400,25,64", works here. It probably does not make much difference whether the associativity is 20 or 25.

Of course we could use modulo instead of the bit-mask trick, but this gives around a factor of 2 slowdown, also for the original fast power-of-two cases.

BTW, if you have a test system anyway, real performance measurements should be more important to you. Cachegrind mainly helps in finding the code to optimize for reducing cache misses, and makes comparisons easier due to reproducibility.

> That is, as you have found, no longer true for modern processors, but to
> date nobody has stepped up to fix cachegrind.

The best way to fix this may just be to always find a good approximation of the parameters (keeping the cache size is more important) that makes the number of sets a power of two, and run the simulation with these. Cachegrind results are an approximation anyway (no hardware prefetcher simulation, LRU may not be really correct).

Josef

> On top of that it isn't
> clear that there is a good fix that wouldn't produce a loss of performance.
>
> Tom |
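The arithmetic behind the "number of sets" remark can be checked with a small Python sketch. The helper names are hypothetical and this is not part of cachegrind; it only assumes the usual relation sets = size / (associativity * line size) and lists the associativities that keep a given cache size while making the set count a power of two.

```python
def is_power_of_two(n: int) -> bool:
    """True if n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def num_sets(size_bytes: int, assoc: int, line_size: int) -> int:
    """Number of sets for a cache of the given size, associativity and line size."""
    sets, rem = divmod(size_bytes, assoc * line_size)
    if rem:
        raise ValueError("size is not a multiple of assoc * line_size")
    return sets


def pow2_assocs(size_bytes: int, line_size: int, max_assoc: int = 64) -> list:
    """Associativities that keep the size and give a power-of-two set count."""
    lines = size_bytes // line_size
    return [a for a in range(1, max_assoc + 1)
            if lines % a == 0 and is_power_of_two(lines // a)]


# The three configurations discussed in the thread: (size, assoc, line size).
for size, assoc, line in [(26214400, 20, 64),   # real L3: 20480 sets
                          (26214400, 25, 64),   # Josef's suggestion: 16384 sets
                          (16777216, 16, 64)]:  # rounded down: 16384 sets
    sets = num_sets(size, assoc, line)
    print(f"--LL={size},{assoc},{line}: {sets} sets, power of two: {is_power_of_two(sets)}")

print(pow2_assocs(26214400, 64))  # candidates that keep the full 25MB
```

For the 25MB case the only candidates up to 64 ways are 25 and 50, which is consistent with picking 25 as the closest value to the real associativity of 20.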
|
From: Jerry S. <js...@fi...> - 2014-07-03 17:49:33
|
Tom and Josef,

Thank you for your speedy responses.

Is cachegrind an instrumentation of the real system or a simulation of the processor cache based on the code executed? From your responses, it sounds like the latter. If so, this is actually better for what I want to do.

Because the cache is shared among the cores, it is hard to tell from single runs what it will be like with concurrent jobs. If it is simulated and I can just set the cache size to whatever I want, I can run a job with different cache parameters and find at least a first-order guess of how the job responds to cache size. This is the most useful thing for me right now.

It's a bit daunting when someone says "I want millions of simulations running as fast as possible." That is a good hard challenge. I've done work on long-lasting compute-bound jobs, but this has a bunch more moving parts than just the sparse matrix solver and forward error tests.

jerry |
|
From: Josef W. <Jos...@gm...> - 2014-07-04 08:00:56
|
On 03.07.2014 19:22, Jerry Scharf wrote:
> Tom and Josef,
>
> Thank you for your speedy responses.
>
> Is cachegrind an instrumentation of the real system or a simulation of the processor cache based on the code executed? From your responses, it sounds like the later.

Yes, it is a simulation. For the former, use perf/PAPI/oprofile/VTune/..., which use the performance counters of your processor.

> If so, this is actually better for what I want to do.
>
> Because of the fact that the cache is shared among the cores, it is hard to tell from single runs what it will be like with concurrent jobs. If it is simulated and I can just set the cache size to whatever I want, I can run a job with different cache parameters and find at least a first order guess of how the job responds to cache size. This is the most useful thing for me right now.

That's true. If you want to run 6 separate processes on your 6-core machine, and you have 25MB of L3, you should check how one process works with 25MB/6 of L3, i.e. around 4MB, as processes do not share address space.

However, if you have multithreaded code: cachegrind currently does not simulate shared L3 for multithreaded code, but expects the full hierarchy to be private for each thread.

> It's a bit daunting when someone says I want millions of simulations running as fast as possible. That is a good hard challenge. I've done work on long lasting compute bound jobs, but this has a bunch more moving parts than just the sparse matrix solver and forward error tests.

At least it sounds embarrassingly parallel. Not too bad.

Josef |
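The per-process estimate above (25MB / 6, roughly 4MB) and the idea of sweeping the simulated cache size could be sketched as follows in Python. This only builds command lines and does not run anything. The program name ./my_sim, the helper names, and the choice of 16 ways with 64-byte lines are assumptions for illustration; --tool, --LL, --cache-sim and --cachegrind-out-file are existing Valgrind options, and --cache-sim=yes is passed explicitly because newer Valgrind releases, as far as I know, no longer enable cache simulation by default.

```python
MIB = 1024 * 1024


def per_process_share(total_bytes: int, n_procs: int) -> int:
    """Evenly split a shared cache among independent processes."""
    return total_bytes // n_procs


def floor_pow2(n: int) -> int:
    """Largest power of two that is <= n (n must be positive)."""
    return 1 << (n.bit_length() - 1)


def cachegrind_cmd(program: list, ll_size: int, assoc: int = 16, line: int = 64) -> list:
    """Build a cachegrind command line for one simulated last-level cache size."""
    return ["valgrind", "--tool=cachegrind", "--cache-sim=yes",
            f"--LL={ll_size},{assoc},{line}",
            f"--cachegrind-out-file=cachegrind.out.LL{ll_size}",
            *program]


share = per_process_share(25 * MIB, 6)        # bytes of L3 per process
print(share, floor_pow2(share) // MIB, "MiB")  # nearest lower power-of-two size

# Sweep a few power-of-two sizes around the per-process share.
for size in (2 * MIB, 4 * MIB, 8 * MIB, 16 * MIB):
    print(" ".join(cachegrind_cmd(["./my_sim"], size)))
```

With a power-of-two size, 16 ways and 64-byte lines, the set count is a power of two for every size in the sweep, so each run should pass cachegrind's parameter check. Comparing the LL miss counts across the output files then gives the first-order sensitivity to cache size that Jerry is after.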
|
From: Josef W. <Jos...@gm...> - 2014-07-04 08:57:03
|
On 04.07.2014 10:00, Josef Weidendorfer wrote:
> However, if you have multithreaded code: cachegrind currently does not
> simulate shared L3 for multithreaded
> code, but expects the full hierarchy to be private for each thread.

Sorry, that's wrong. Cachegrind only maintains one hierarchy, and all threads go through the same single hierarchy. We probably should change something here.

Josef |