|
From: stavros k. <ska...@gm...> - 2014-05-28 19:34:52
|
Hello Josef, I am not sure how to directly reply to your post. In case I was supposed to receive an email, I didn't, so I hope this will work as a reply.

> this looks like a quite straight-forward extension, which is nice.

Thank you.

> Why does this only work for x86?

Because the aim of my dissertation was to make the extensions valid for Intel x86, other architectures have not been considered. For instance, where I had to add extra CPUID instructions or extra code, I only did so for the parts related to Intel x86. Furthermore, the TLB simulator simulates the TLB the way Intel x86 would handle it. In some cases it might work for other architectures as well, but chances are that it will not.

> What is the envisioned use of the detailed page tracking stuff?

I can't think of any useful use. I just did it because it could be done and I thought that somebody might need it.

> Can you give some numbers about the performance penalties of the simulator with the added features?

According to the tests made, L2 cache inclusion did not introduce a slowdown greater than 10% for code that is not cache-unfriendly; most of the programs in this test were Unix commands. Cache-unfriendly programs showed a slowdown of 16%.

TLB measuring on normal programs (the same ones used to test L2 cache inclusion), which do not have, or have no reason to have, unfriendly TLB behaviour, introduced a maximum slowdown of 41% and a minimum of 2%. With the detailed page tracking option enabled, the same programs showed a maximum slowdown of 76% and a minimum of 6%. Moreover, TLB measuring on TLB-unfriendly programs introduced a maximum slowdown of 440%, rising to 1400% with the page tracking option enabled.

Stavros
|
From: stavros k. <ska...@gm...> - 2014-05-31 23:03:57
|
>I suppose the worst case is random access on a huge array.
Close enough! The worst case among the programs tried was access to a huge
array, but it was not random. I used a variation of the program presented by
Nicholas Nethercote in his PhD thesis, which is as follows:
#define SIZE (2048)

int main(void)
{
    int h, i, j;
    static int a[SIZE][SIZE];

    for (h = 0; h < 10; h++)
        for (i = 0; i < SIZE; i++)
            for (j = 0; j < SIZE; j++)
                a[i][j] = 0;
    return 0;
}
>The usage of cachegrind is to find bottlenecks related to the cache
>hierarchy.
>I wonder if L2 miss behavior really helps a lot in this regard.
I agree with that. It would make results slightly more accurate, but it
does not contribute significantly.
It could be added as an option to help the very curious.
>Does your simulation run L2/L3 accesses then with physical addresses?
I am not sure I understand what you mean here. If I understand correctly,
you are asking whether virtual addresses are translated to physical ones
before the CPU cache simulation (the simulation Cachegrind already
provides) takes place, so that the CPU caches are simulated using physical
addresses instead of virtual ones?
If that was the question, then the answer is no. The TLB simulator is not
run before the CPU cache simulation, so physical addresses are not used
instead of virtual ones to simulate the CPU caches (L1, L2, L3).
>To understand whether good/bad TLB behavior has an impact on
>performance, I think
>one should be able to see whether page walks resulting from TLB misses
>themself
>can be served from cache or not.
>I suppose your simulation does not
>include this?
The TLB extension is only concerned with the TLB and does not interfere
with the CPU caches at all. Therefore, it does not provide results that
concern the TLB and the CPU caches together.
I will try and see if I can get permission to provide my dissertation,
where I think a good description of the capabilities and the
characteristics of the extensions can be found.
Stavros
|
|
From: Josef W. <Jos...@gm...> - 2014-05-30 09:47:07
|
Am 28.05.2014 21:34, schrieb stavros kaparelos:

>> Why does this only work for x86?
> Because the aim of my dissertation was to do any extensions to be valid
> for Intel x86 so other architectures haven't been considered.
> ...

Ok, so that's just a limitation of supported configurations. These should be simulated quite fine also with other ISAs. Of course, they may not reflect the cache configuration of these other architectures then. But this is similar to the current situation: cachegrind/callgrind simulates LRU replacement, but ARM often does random replacement in L1. In any case, the configuration parameters must be specifiable on the command line.

>> Can you give some numbers about the performance penalties of the simulator with the added features?
>
> L2 cache inclusion for not cache-unfriendly code according to the tests
> made did not introduce a slowdown greater than 10%. Most of the programs
> that took part on this test were Unix commands. Cache-unfriendly
> programs introduced a slowdown of 16%.

I suppose the worst case is random access on a huge array. The added cost is a failed tag search in the additional L2, and the penalty then depends on L2 associativity.

The usage of cachegrind is to find bottlenecks related to the cache hierarchy. I wonder if L2 miss behavior really helps a lot in this regard. It would be different if we were able to simulate multiple private L1/L2 of a multicore chip, to understand whether there is contention at the shared L3 level generated by all the cores.

> TLB measuring on normal programs (the same used to test L2 cache
> inclusion), that do not have - or they have no reason to have - an
> unfriendly TLB behaviour introduced a maximum slowdown of 41% and a
> minimum of 2%.

So TLB simulation must be optional. Does your simulation run L2/L3 accesses then with physical addresses? If so, you must have some page allocation algorithm implemented?
To understand whether good/bad TLB behavior has an impact on performance, I think one should be able to see whether page walks resulting from TLB misses can themselves be served from the cache or not. I suppose your simulation does not include this?

Josef