|
From: Dominic A. <zer...@go...> - 2009-04-16 10:24:29
|
Hi everyone,
I am currently trying to extend CacheGrind to support multi-core cache
simulation
( MESI protocol, 1:1 thread/L1 cache mapping come to my mind ).
However, I have stumbled over the event merging mechanism. I suspect
the way events are currently merged results in more reads being
reported than writes.
I think this is the case because writes (Dw events) get merged with
reads (Dr events) into modify events (Dm), which are only handled by
read handlers (e.g. "log_0I_1Dr_cache_access").
The functions to look at are "addEvent_Dw" and "flushEvents".
BTW: I noticed this "strangeness" when I included memory bus event
annotations in "InstrInfo".
My goal is to count how many "locked" (Intel prefix for cache-exclusive
r/w) reads/writes/modifies miss the cache.
The writes never showed up because they seem to be merged away.
( The merged Dm events go to log...Dr..... functions like
"log_0I_1Dr_cache_access". )
I temporarily disabled the merging in "addEvent_Dw" and immediately
saw locked writes!
My current assumption is that CacheGrind was not designed to be fully
accurate - or would you consider it a bug?
( The cache miss statistics are correct after all: because the cache is
write-allocate, merging a read/write pair into a single read does not
change anything except the read/write counters, which are off. The
man page and source code comments, however, do not indicate such
inaccuracies. )
I hope someone can enlighten me a bit :-)
So far I must say the codebase of Valgrind is quite nicely structured!
Thank you
Dominic
|
|
From: Julian S. <js...@ac...> - 2009-04-16 10:55:49
|
> My current assumption is that CacheGrind was not designed to be really
> accurate - or would you consider it a bug?

My impression is that Cachegrind is designed to model a single-processor
world, and has no concept of multiprocessors sharing a memory hierarchy,
or of the associated coherence protocols. So it may well be that what
is currently there doesn't make much sense for a multiprocessor world.

I can say for sure that you know more about multi-core cache stuff
than me :-)

Nevertheless, as Nick is the author of Cachegrind, he may have more
comments about this.

> So far I must say the codebase of Valgrind is quite nicely structured!

Thanks! We aim to please :-)

J
|
|
From: Dominic A. <zer...@go...> - 2009-04-16 11:52:36
|
Hi Julian,
(1)
I would probably suggest handling Ev_Dr/Ev_Dm differently in
"flushEvents".
Currently, an Ev_Dm is treated like an Ev_Dr (in two case statements),
although an Ev_Dm may have been constructed from a Dw event (data
write) in "addEvent_Dw" (in fact, that is the only place it happens!)
So the log_* functions will lose write events.
A log_*modify* handler could count both a read and a write:
n->parent->Dr.a++;
n->parent->Dw.a++;
A quick fix is to remove merging in "addEvent_Dw". Thus no Dm-events
ever get created.
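A sketch of what such a modify handler might look like (the struct
layout mimics Cachegrind's per-line cost centres, but the names here
are my own, hypothetical ones): a Dm event would do one simulated cache
access yet bump BOTH the read and the write access counters, so the
rd/wr totals stay correct.

```c
#include <assert.h>

/* Hypothetical cost-centre structs, modelled after Cachegrind's. */
typedef struct { unsigned long a, m1, m2; } CacheCC;
typedef struct { CacheCC Dr, Dw; } LineCC;

/* Sketch of a dedicated modify handler: one cache lookup, two
   access counters advanced. */
static void log_1Dm_access(LineCC* n)
{
    /* ...a single simulated cache lookup would happen here... */
    n->Dr.a++;   /* the read half of the modify  */
    n->Dw.a++;   /* the write half of the modify */
}
```

This keeps the single-lookup performance advantage of merging while
preserving the write count that the quick fix recovers.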
(2) Another suggestion:
"cg_main.c"
"Ist_Dirty" opcodes are limited in dataSize to 16 bytes. I would
suggest limiting the data size to twice the configured cache line size
instead.
Current code:
// Large (eg. 28B, 108B, 512B on x86) data-sized
// instructions will be done inaccurately, but they're
// very rare and this avoids errors from hitting more
// than two cache lines in the simulation.
if (dataSize > MIN_LINE_SIZE)
dataSize = MIN_LINE_SIZE;
MIN_LINE_SIZE is set to 16
However, "addEvent_Dr" has an assertion which checks "datasize":
tl_assert(datasize >= 1 && datasize <= MIN_LINE_SIZE);
Is this correct? Shouldn't it be twice the configured line size as
well? Currently it is not an issue, because there seems to be no
instruction yet that reads/writes more than 16 bytes.
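The clamp matters because of line straddling. A quick sketch (the
helper name is mine, not Cachegrind's): the number of cache lines an
access touches depends on its offset within a line, so a size clamped
to the line size spans at most two lines, while twice the line size can
span three.

```c
#include <assert.h>

/* How many cache lines does an access of `size` bytes at `addr`
   touch? (line_size is assumed positive.) */
static int lines_touched(unsigned long addr, int size, int line_size)
{
    unsigned long ls    = (unsigned long)line_size;
    unsigned long first = addr / ls;
    unsigned long last  = (addr + (unsigned long)size - 1) / ls;
    return (int)(last - first + 1);
}
```

With a 64-byte line, a 128-byte access starting one byte before a line
boundary touches three lines - which is exactly the case the two-line
simulation code is not prepared for.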
(3)
A comment in "cg_instrument" (the "Ist_Dirty" switch-case) may help in
understanding the code:
// "Ifx_Modify" is turned into an Ev_Dm within addEvent_Dw
if (d->mFx == Ifx_Read || d->mFx == Ifx_Modify)
addEvent_Dr( &cgs, curr_inode, dataSize, d->mAddr );
if (d->mFx == Ifx_Write || d->mFx == Ifx_Modify)
addEvent_Dw( &cgs, curr_inode, dataSize, d->mAddr );
All this stuff applies to Valgrind as it is - no multi-core stuff
involved yet ;-)
I am quite new to Valgrind, so I thought it would be better to ask on
the mailing list before filing a bug report (or several).
Thanks
Dominic
P.S. Nick directed me to the mailing list.
On Thu, Apr 16, 2009 at 12:59 PM, Julian Seward <js...@ac...> wrote:
>
>> My current assumption is that CacheGrind was not designed to be really
>> accurate - or would you consider it a bug?
>
> My impression is that Cachegrind is designed to model a single-processor
> world, and has no concept of multiprocessors sharing a memory hierarchy,
> or of the associated coherence protocols. So it may well be that what
> is currently there doesn't make much sense for a multiprocessor world.
>
> I can say for sure that you know more about multi-core cache stuff
> than me :-)
>
> Nevertheless, as Nick is the author of Cachegrind, he may have more
> comments about this.
>
>> So far I must say the codebase of Valgrind is quite nicely structured!
>
> Thanks! We aim to please :-)
>
> J
>
|
|
From: Dominic A. <zer...@go...> - 2009-04-17 10:07:40
|
Hi,
I have verified the read/write-counting in CacheGrind. It is not correct!
"test.c" is a small test case. The loop in the "test" function contains
the "80483b3: subl $0x1,-0x4(%ebp)" instruction, which causes both a
read and a write. Due to the merging in "addEvent_Dw" and the
conversion into a Dm event, which is treated like a Dr event, the write
counter is wrong:
==8584== D refs: 400,060,588 (300,043,577 rd + 100,017,011 wr)
vs.
==10891== D refs: 500,062,749 (300,043,577 rd + 200,019,172 wr)
(with merging disabled in "addEvent_Dw" = the quick fix)
If you look at the assembly, you will see that the loop has a 3/2
read/write ratio. And if you look at the D refs, you will see that the
"fixed" CacheGrind reports the correct ratio.
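The arithmetic behind those numbers can be sketched as follows (helper
names are mine): each loop iteration performs 3 data reads (the mov
load, the subl load, the cmpl load) and 2 data writes (the mov store,
the subl store). With 9,999,999 iterations per call and 10 calls, the
expected totals are roughly 300M reads and 200M writes; merging folds
the subl's store into a Dm logged as a read, so about 100M writes
disappear.

```c
#include <assert.h>

/* Per-iteration data accesses in the "test" loop body. */
static unsigned long expected_reads(unsigned long iters)
{
    return 3UL * iters;   /* mov load + subl load + cmpl load */
}
static unsigned long expected_writes(unsigned long iters)
{
    return 2UL * iters;   /* mov store + subl store */
}
static unsigned long writes_with_merging(unsigned long iters)
{
    return expected_writes(iters) - iters;  /* subl store merged away */
}
```

These come out to 299,999,970 rd / 199,999,980 wr (vs. 99,999,990 wr
with merging), matching the reported counts up to the small overhead
outside the loop.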
-----------------------------------------------------------------------------
"test" function disassembly -----------------------------------------
-----------------------------------------------------------------------------
08048394 <test>:
8048394: 55 push %ebp
8048395: 89 e5 mov %esp,%ebp
8048397: 83 ec 10 sub $0x10,%esp
804839a: c7 45 fc 7f 96 98 00 movl $0x98967f,-0x4(%ebp)
80483a1: c7 45 f8 00 00 00 00 movl $0x0,-0x8(%ebp)
80483a8: eb 0d jmp 80483b7 <test+0x23>
80483aa: 8b 45 f8 mov -0x8(%ebp),%eax
80483ad: 83 c0 02 add $0x2,%eax
80483b0: 89 45 f8 mov %eax,-0x8(%ebp)
80483b3: 83 6d fc 01 subl $0x1,-0x4(%ebp)
80483b7: 83 7d fc 00 cmpl $0x0,-0x4(%ebp)
80483bb: 75 ed jne 80483aa <test+0x16>
80483bd: c9 leave
80483be: c3 ret
compiled with "gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)"
-----------------------------------------------------------------------------
test.c ----------------------------------------------------------------------
-----------------------------------------------------------------------------
#include <stdio.h>
void test()
{
unsigned long i = 9999999L;
volatile int x = 0;
while(i>0) {
x+=2;
i--;
}
}
int main()
{
int i;
for(i=0;i<10;i++) test();
}
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
modified "addEvent_Dw" ---------------------------------------------
-----------------------------------------------------------------------------
/* Is it possible to merge this write with the preceding read? */
/* Disable this code, because Dm events are treated like reads later on:
lastEvt = &cgs->events[cgs->events_used-1];
if (cgs->events_used > 0
&& lastEvt->tag == Ev_Dr
&& lastEvt->Ev.Dr.szB == datasize
&& lastEvt->inode == inode
&& eqIRAtom(lastEvt->Ev.Dr.ea, ea))
{
if (inode->locked == False) {
VG_(printf)("write merges with read without lock prefix:\n");
show_debug_info(inode->instr_addr);
}
lastEvt->tag = Ev_Dm;
return;
}
*/
|
|
From: Josef W. <Jos...@gm...> - 2009-04-17 19:26:11
|
Hi Dominic,

On Thursday 16 April 2009, Dominic Account wrote:
> I am currently trying to extend CacheGrind to support multi-core cache
> simulation
> ( MESI protocol, 1:1 thread/L1 cache mapping come to my mind ).

Cool. To get realistic behavior, I suppose you track a simulated time
somehow and try to influence the scheduling of threads, perhaps also by
reducing the time slice Valgrind uses for switching between threads?

> However, I have stumbled over the event merging mechanism. I suspect
> the way events are currently merged results in more reads being
> reported than writes.
>
> I think this is the case because writes (Dw events) get merged with
> reads (Dr events) into modify events (Dm), which are only handled by
> read handlers (e.g. "log_0I_1Dr_cache_access").

Yes. As you note, because of the cache model, it does not really matter
for miss ratios (reads and writes produce the same result regarding
cache state). It has been this way since the early days of Cachegrind,
and I can only guess at the reasoning: perhaps the results matched
reality better. AFAIK, Nick did quite some investigation back then...

Anyway, Callgrind uses the same model, but it interprets "modify" as a
write, which I need in one optional extended model that produces
further L2 events depending on whether a dirty line needs to be written
back. Hmm... the modify->write mapping is wrong in a similar way. The
better solution really is to either call into the simulator with two
events, or to have a dedicated handler for a modify event (as you
suggest in the other mail; the latter would be better for simulation
performance).

> (2) Another suggestion:
>
> "cg_main.c"
>
> "Ist_Dirty" opcodes are limited in dataSize to 16 bytes. I would
> suggest limiting the data size to twice the configured cache line
> size instead.

If you do that, accesses can cover 3 cache lines, which the simulation
code is not prepared for. BTW, there are x86 instructions with large
data reads/writes, e.g. pushf/popf.

Cheers,
Josef
|
|
From: Dominic A. <zer...@go...> - 2009-04-18 10:29:45
|
Hi Josef,
Yes, indeed I have thought about influencing the scheduling in
Valgrind. This should be doable by granting the big lock only according
to a custom scheduler. However, this modifies the core and will break
compatibility.
Reducing the time slice was also on my mind. Simics has a similar
feature which even allows switching after every single instruction.
There is an obvious performance/accuracy trade-off.
I have not thought much about simulated time, but a straightforward
way would be to feed the instructions into a timing model.
Regarding the miss rate: everything that depends on "Dw_total" may be
off. Please have a look at "cg_fini":
D_total.a = Dr_total.a + Dw_total.a;
D_total.m1 = Dr_total.m1 + Dw_total.m1;
D_total.m2 = Dr_total.m2 + Dw_total.m2;
Ciao
Dominic
P.S.
Anyway, I am very glad CacheGrind exists. It is a very good starting point!
I have also prepared a patch.
|
|
From: Vince W. <vi...@cs...> - 2009-04-18 19:26:14
|
It's interesting to hear about the interest in multi-core Cachegrind. I'm working on a somewhat similar project, where I am generating memory traces with Valgrind, but I'm planning on using the Ruby CMP memory simulator from Wisconsin's Multifacet group to do the actual simulation. The work is going a bit slowly though, as unfortunately not all source code trees out there are as clean and understandable as Valgrind's. On Fri, 17 Apr 2009, Josef Weidendorfer wrote: > If you do that, accesses can cover 3 cache lines, which the simulation code > is not prepared for. BTW, there are x86 instructions with large data read/writes, > e.g. pushf/popf. another example of large cache writes on x86 are the string instructions, for example rep stosb, etc. Actual hardware considers a rep stosb with ecx=4096 as one 4096 byte write, not as 4096 individual byte writes. So if you are ever comparing your resutls against hardware perf counters this is something to keep in mind. Also if you are monitoring icache behavior, that's counted as one single store instruction, not 4096 ones (I have special code in my valgrind tools that handles this correctly. I forget if cachegrind currently does this properly). Vince |