|
From: Julian S. <js...@ac...> - 2012-05-21 14:24:13
|
Hi Nick, Josef,
I recently added some AVX support to V, and as a result added a new type
of 32-byte values (Ity_V256) to IR. Loads and stores of such values
cause Cachegrind and Callgrind to assert, because the size (32 bytes)
is larger than MIN_LINE_SIZE, which is 16.
As they currently are, both tools refuse to process memory accesses
bigger than 16, on the basis that the minimum possible line size is 16,
and so a 16 byte access could access 2 adjacent lines, which is a
situation they are prepared to handle. But not 3 lines, which is a
possible case for a 32 byte access w/ 16 byte lines (something for
which I'm sure no hardware actually exists).
So I was wondering if you had any views on how to fix it properly?
I hacked up the patch shown below (w/ equivalent for Callgrind), but
obviously it's not a good long term solution.
Options, all non-ideal:
* if we process 32 byte accesses "properly" then need to handle the
case of an access going over 3 16-byte lines (clearly would never
happen in real h/w); slow, complex, unrealistic
* or we can change MIN_LINE_SIZE to 32, but then we can't accurately
simulate caches with 16 byte lines
* or we can handle 32 byte accesses as 2 x 16 byte accesses, but then
we get incorrect access count numbers
* or we can change MIN_LINE_SIZE at run time to be 32 on AVX-capable
platforms, but that will slow the simulator down (maybe a lot) since
the shifting/masking can't then be baked in at V-build time
* or we could use the kludge in the patch below, but then we might
miss some misses (joke not intended)
Urrr. Any opinions? Other options?
Also, any opinions on me committing the the kludge temporarily? Current
situation is now that both tools assert any time they hit AVX code, which
is kinda ungood.
Thanks.
J
Index: cachegrind/cg_main.c
===================================================================
--- cachegrind/cg_main.c (revision 12570)
+++ cachegrind/cg_main.c (working copy)
@@ -1030,10 +1030,14 @@
IRExpr* data = st->Ist.WrTmp.data;
if (data->tag == Iex_Load) {
IRExpr* aexpr = data->Iex.Load.addr;
+ Int dataSize = sizeofIRType(data->Iex.Load.ty);
+ /* BEGIN AVX kludge */
+ if (dataSize > MIN_LINE_SIZE)
+ dataSize = MIN_LINE_SIZE;
+ /* END AVX kludge */
// Note also, endianness info is ignored. I guess
// that's not interesting.
- addEvent_Dr( &cgs, curr_inode, sizeofIRType(data-
>Iex.Load.ty),
- aexpr );
+ addEvent_Dr( &cgs, curr_inode, dataSize, aexpr );
}
break;
}
@@ -1041,8 +1045,12 @@
case Ist_Store: {
IRExpr* data = st->Ist.Store.data;
IRExpr* aexpr = st->Ist.Store.addr;
- addEvent_Dw( &cgs, curr_inode,
- sizeofIRType(typeOfIRExpr(tyenv, data)), aexpr );
+ Int dataSize = sizeofIRType(typeOfIRExpr(tyenv, data));
+ /* BEGIN AVX kludge */
+ if (dataSize > MIN_LINE_SIZE)
+ dataSize = MIN_LINE_SIZE;
+ /* END AVX kludge */
+ addEvent_Dw( &cgs, curr_inode, dataSize, aexpr );
break;
}
|
|
From: Nicholas N. <n.n...@gm...> - 2012-05-21 23:25:14
|
On Mon, May 21, 2012 at 7:22 AM, Julian Seward <js...@ac...> wrote: > > I recently added some AVX support to V, and as a result added a new type > of 32-byte values (Ity_V256) to IR. Loads and stores of such values > cause Cachegrind and Callgrind to assert, because the size (32 bytes) > is larger than MIN_LINE_SIZE, which is 16. > > As they currently are, both tools refuse to process memory accesses > bigger than 16, on the basis that the minimum possible line size is 16, > and so a 16 byte access could access 2 adjacent lines, which is a > situation they are prepared to handle. But not 3 lines, which is a > possible case for a 32 byte access w/ 16 byte lines (something for > which I'm sure no hardware actually exists). Can we model the 32-byte accesses correctly and just assert if the 3 line case occurs? (I wrote that simulation code so long ago that I don't have any additional insight compared to someone looking at it afresh...) Nick |
|
From: Josef W. <Jos...@gm...> - 2012-05-22 08:06:58
|
Hi Julian,
Am 21.05.2012 16:22, schrieb Julian Seward:
> Options, all non-ideal:
>
> * if we process 32 byte accesses "properly" then need to handle the
> case of an access going over 3 16-byte lines (clearly would never
> happen in real h/w); slow, complex, unrealistic
>
> * or we can change MIN_LINE_SIZE to 32, but then we can't accurately
> simulate caches with 16 byte lines
>
> * or we can handle 32 byte accesses as 2 x 16 byte accesses, but then
> we get incorrect access count numbers
>
> * or we can change MIN_LINE_SIZE at run time to be 32 on AVX-capable
> platforms, but that will slow the simulator down (maybe a lot) since
> the shifting/masking can't then be baked in at V-build time
The line size is already a runtime parameter. Where do you expect
slowness because of MIN_LINE_SIZE not being able to be baked in at build
time?
I see a few tl_asserts, but these all happen at instrumentation time,
which I do not think is so time critical.
Anyway, the simulation routine itself barks if we straddle more than
two lines, without needing MIN_LINE_SIZE for that...
> * or we could use the kludge in the patch below, but then we might
> miss some misses (joke not intended)
>
> Urrr. Any opinions? Other options?
The kludge is bad. For something like PUSHA, this is rarely enough,
but not really for inner loops using AVX.
What was the point to have a constant MIN_LIN_SIZE?
We should use the minimum of line size of L1 and LL instead.
Josef
>
> Also, any opinions on me committing the the kludge temporarily? Current
> situation is now that both tools assert any time they hit AVX code, which
> is kinda ungood.
>
> Thanks.
>
> J
>
>
> Index: cachegrind/cg_main.c
> ===================================================================
> --- cachegrind/cg_main.c (revision 12570)
> +++ cachegrind/cg_main.c (working copy)
> @@ -1030,10 +1030,14 @@
> IRExpr* data = st->Ist.WrTmp.data;
> if (data->tag == Iex_Load) {
> IRExpr* aexpr = data->Iex.Load.addr;
> + Int dataSize = sizeofIRType(data->Iex.Load.ty);
> + /* BEGIN AVX kludge */
> + if (dataSize> MIN_LINE_SIZE)
> + dataSize = MIN_LINE_SIZE;
> + /* END AVX kludge */
> // Note also, endianness info is ignored. I guess
> // that's not interesting.
> - addEvent_Dr(&cgs, curr_inode, sizeofIRType(data-
>> Iex.Load.ty),
> - aexpr );
> + addEvent_Dr(&cgs, curr_inode, dataSize, aexpr );
> }
> break;
> }
> @@ -1041,8 +1045,12 @@
> case Ist_Store: {
> IRExpr* data = st->Ist.Store.data;
> IRExpr* aexpr = st->Ist.Store.addr;
> - addEvent_Dw(&cgs, curr_inode,
> - sizeofIRType(typeOfIRExpr(tyenv, data)), aexpr );
> + Int dataSize = sizeofIRType(typeOfIRExpr(tyenv, data));
> + /* BEGIN AVX kludge */
> + if (dataSize> MIN_LINE_SIZE)
> + dataSize = MIN_LINE_SIZE;
> + /* END AVX kludge */
> + addEvent_Dw(&cgs, curr_inode, dataSize, aexpr );
> break;
> }
>
>
|
|
From: Julian S. <js...@ac...> - 2012-05-23 18:20:06
|
On Tuesday, May 22, 2012, Josef Weidendorfer wrote: > > * or we can change MIN_LINE_SIZE at run time to be 32 on AVX-capable > > > > platforms, but that will slow the simulator down (maybe a lot) since > > the shifting/masking can't then be baked in at V-build time > > The line size is already a runtime parameter. Where do you expect > slowness because of MIN_LINE_SIZE not being able to be baked in at build > time? Urr, my mistake -- I assumed it was baked in, but I did not actually inspect the sources :-) > What was the point to have a constant MIN_LIN_SIZE? > We should use the minimum of line size of L1 and LL instead. I don't know .. it came from Nick's code originally, I think. J |
|
From: Josef W. <Jos...@gm...> - 2012-05-22 14:55:03
|
Am 22.05.2012 01:24, schrieb Nicholas Nethercote:
> On Mon, May 21, 2012 at 7:22 AM, Julian Seward<js...@ac...> wrote:
>>
>> I recently added some AVX support to V, and as a result added a new type
>> of 32-byte values (Ity_V256) to IR. Loads and stores of such values
>> cause Cachegrind and Callgrind to assert, because the size (32 bytes)
>> is larger than MIN_LINE_SIZE, which is 16.
>>
>> As they currently are, both tools refuse to process memory accesses
>> bigger than 16, on the basis that the minimum possible line size is 16,
>> and so a 16 byte access could access 2 adjacent lines, which is a
>> situation they are prepared to handle. But not 3 lines, which is a
>> possible case for a 32 byte access w/ 16 byte lines (something for
>> which I'm sure no hardware actually exists).
>
> Can we model the 32-byte accesses correctly and just assert if the 3
> line case occurs?
As far as I understand, the assertion in the simulator should not
trigger in any case, as memory access lengths are always cut down to
MIN_LINE_SIZE at the moment.
This is needed, as there are some rare instructions with larger memory
accesses, such as PUSHA.
I think it makes more sense to cut access lengths down to the real
minimum of used line sizes, using the patch below.
Hmm. If people explicitly configure for 16-byte line size, and use AVX,
this will result in wrong simulations because of the large number of
cut down access sizes. I see these options:
* if we detect a processor with AVX, make 32 the minimal accepted
line size in manual configuration
* count the number of cuts, and print a warning if a threshould of e.g.
10000 is reached ?
* handle accesses straddling >2 lines correctly
Here is the result of vg_perf (vg_minline is with the patch), also
running "none" to see the measurement noise produced on my laptop. It
looks like all numbers more or less stay the same.
Josef
weidendo@lapbode134:~/SW/GitRepos/vg/valgrind (master)> perl
perf/vg_perf --tools=none,cachegrind --vg=. --vg=../vg-minline perf/
-- Running tests in perf ----------------------------------------------
-- bigcode1 --
bigcode1 . :0.14s no: 2.0s (14.5x, -----) ca: 6.3s (44.7x, -----)
bigcode1 vg-minline:0.14s no: 2.0s (14.3x, 1.5%) ca: 6.3s (44.7x, 0.0%)
-- bigcode2 --
bigcode2 . :0.14s no: 4.8s (34.6x, -----) ca:10.4s (74.5x, -----)
bigcode2 vg-minline:0.14s no: 4.8s (34.3x, 0.8%) ca:10.4s (74.3x, 0.3%)
-- bz2 --
bz2 . :0.67s no: 2.6s ( 3.9x, -----) ca:18.4s (27.5x, -----)
bz2 vg-minline:0.67s no: 2.6s ( 3.9x, -0.4%) ca:18.1s (27.1x, 1.6%)
-- fbench --
fbench . :0.28s no: 1.2s ( 4.4x, -----) ca: 5.1s (18.1x, -----)
fbench vg-minline:0.28s no: 1.2s ( 4.4x, 0.0%) ca: 5.1s (18.1x, 0.4%)
-- ffbench --
ffbench . :0.26s no: 1.2s ( 4.5x, -----) ca: 6.4s (24.5x, -----)
ffbench vg-minline:0.26s no: 1.2s ( 4.5x, 0.9%) ca: 6.4s (24.5x, -0.2%)
-- heap --
heap . :0.12s no: 0.8s ( 6.4x, -----) ca: 5.3s (44.3x, -----)
heap vg-minline:0.12s no: 0.8s ( 6.8x, -5.2%) ca: 5.3s (44.6x, -0.6%)
-- heap_pdb4 --
heap_pdb4 . :0.12s no: 0.9s ( 7.3x, -----) ca: 5.8s (48.3x, -----)
heap_pdb4 vg-minline:0.12s no: 0.8s ( 7.1x, 3.4%) ca: 5.8s (48.1x, 0.5%)
-- many-loss-records --
many-loss-records . :0.01s no: 0.3s (32.0x, -----) ca: 1.0s
(98.0x, -----)
many-loss-records vg-minline:0.01s no: 0.3s (29.0x, 9.4%) ca: 1.0s
(97.0x, 1.0%)
-- many-xpts --
many-xpts . :0.05s no: 0.4s ( 7.8x, -----) ca: 1.4s (28.0x, -----)
many-xpts vg-minline:0.05s no: 0.4s ( 7.6x, 2.6%) ca: 1.4s (28.4x, -1.4%)
-- sarp --
sarp . :0.03s no: 0.3s (11.0x, -----) ca: 1.4s (47.3x, -----)
sarp vg-minline:0.03s no: 0.3s (11.0x, 0.0%) ca: 1.4s (46.7x, 1.4%)
-- tinycc --
tinycc . :0.21s no: 1.7s ( 7.9x, -----) ca:11.5s (54.8x, -----)
tinycc vg-minline:0.21s no: 1.7s ( 8.0x, -1.8%) ca:11.4s (54.3x, 0.9%)
-- Finished tests in perf ----------------------------------------------
====================================================================
diff --git a/cachegrind/cg_main.c b/cachegrind/cg_main.c
index 4b36204..9982f23 100644
--- a/cachegrind/cg_main.c
+++ b/cachegrind/cg_main.c
@@ -69,6 +69,13 @@ static Bool clo_cache_sim = True; /* do cache
simulation? */
static Bool clo_branch_sim = False; /* do branch simulation? */
static Char* clo_cachegrind_out_file = "cachegrind.out.%p";
+
+/*------------------------------------------------------------*/
+/*--- Cachesim configuration ---*/
+/*------------------------------------------------------------*/
+
+static Int min_line_size = 0; /* min of L1 and LL cache line sizes */
+
/*------------------------------------------------------------*/
/*--- Types and Data Structures ---*/
/*------------------------------------------------------------*/
@@ -846,7 +853,7 @@ void addEvent_Dr ( CgState* cgs, InstrInfo* inode,
Int datasize, IRAtom*
{
Event* evt;
tl_assert(isIRAtom(ea));
- tl_assert(datasize >= 1 && datasize <= MIN_LINE_SIZE);
+ tl_assert(datasize >= 1 && datasize <= min_line_size);
if (!clo_cache_sim)
return;
if (cgs->events_used == N_EVENTS)
@@ -868,7 +875,7 @@ void addEvent_Dw ( CgState* cgs, InstrInfo* inode,
Int datasize, IRAtom*
Event* evt;
tl_assert(isIRAtom(ea));
- tl_assert(datasize >= 1 && datasize <= MIN_LINE_SIZE);
+ tl_assert(datasize >= 1 && datasize <= min_line_size);
if (!clo_cache_sim)
return;
@@ -1058,8 +1065,8 @@ IRSB* cg_instrument ( VgCallbackClosure* closure,
// instructions will be done inaccurately, but they're
// very rare and this avoids errors from hitting more
// than two cache lines in the simulation.
- if (dataSize > MIN_LINE_SIZE)
- dataSize = MIN_LINE_SIZE;
+ if (dataSize > min_line_size)
+ dataSize = min_line_size;
if (d->mFx == Ifx_Read || d->mFx == Ifx_Modify)
addEvent_Dr( &cgs, curr_inode, dataSize, d->mAddr );
if (d->mFx == Ifx_Write || d->mFx == Ifx_Modify)
@@ -1085,8 +1092,8 @@ IRSB* cg_instrument ( VgCallbackClosure* closure,
if (cas->dataHi != NULL)
dataSize *= 2; /* since it's a doubleword-CAS */
/* I don't think this can ever happen, but play safe. */
- if (dataSize > MIN_LINE_SIZE)
- dataSize = MIN_LINE_SIZE;
+ if (dataSize > min_line_size)
+ dataSize = min_line_size;
addEvent_Dr( &cgs, curr_inode, dataSize, cas->addr );
addEvent_Dw( &cgs, curr_inode, dataSize, cas->addr );
break;
@@ -1724,6 +1731,12 @@ static void cg_post_clo_init(void)
&clo_D1_cache,
&clo_LL_cache);
+ // min_line_size is used to make sure that we never feed
+ // accesses to the simulator straddling more than two
+ // cache lines at any cache level
+ min_line_size = (I1c.size < D1c.size) ? I1c.size : D1c.size;
+ min_line_size = (LLc.size < min_line_size) ? LLc.size : min_line_size;
+
cachesim_I1_initcache(I1c);
cachesim_D1_initcache(D1c);
cachesim_LL_initcache(LLc);
====================================================================
|
|
From: Julian S. <js...@ac...> - 2012-05-22 15:22:37
|
On Tuesday, May 22, 2012, Josef Weidendorfer wrote:
> > Can we model the 32-byte accesses correctly and just assert if the 3
> > line case occurs?
>
> As far as I understand, the assertion in the simulator should not
> trigger in any case, as memory access lengths are always cut down to
> MIN_LINE_SIZE at the moment.
Yes, I think you are right. My wording above was sloppy -- when I said
"refuse to process", I meant that the instrumenter asserted, not the
simulator.
> I think it makes more sense to cut access lengths down to the real
> minimum of used line sizes, using the patch below.
Yes.
> Hmm. If people explicitly configure for 16-byte line size, and use AVX,
> this will result in wrong simulations because of the large number of
> cut down access sizes. I see these options:
> * if we detect a processor with AVX, make 32 the minimal accepted
> line size in manual configuration
> * count the number of cuts, and print a warning if a threshould of e.g.
> 10000 is reached ?
> * handle accesses straddling >2 lines correctly
This seems such a rare and obscure case that I don't think it is
much worth worrying about. One other option is to print a warning message
at the start ("you configured a line size < the maximum access size")
and then cut access lengths down to the line size, with no further
warnings.
IIUC, the attached patch (Josef's) almost implements this anyway; only
the initial warning message is missing. Josef, is that correct?
J
|
|
From: Josef W. <Jos...@gm...> - 2012-05-22 16:16:54
|
Am 22.05.2012 17:20, schrieb Julian Seward:
> This seems such a rare and obscure case that I don't think it is
> much worth worrying about. One other option is to print a warning message
> at the start ("you configured a line size< the maximum access size")
What is "the maximum access size" here? Do you mean e.g. 32 when
detecting a processor supporting AVX? This would be fine IMHO.
Can a tool easily check for AVX support?
> and then cut access lengths down to the line size, with no further
> warnings.
We can not simply print the warning the first time an access length
is larger than "configured line size", as every x86 code can have pusha
(or similar, I remember size 108 for one instruction), resulting in the
warning...
>
> IIUC, the attached patch (Josef's) almost implements this anyway; only
> the initial warning message is missing. Josef, is that correct?
Yes.
Josef
>
> J
>
|
|
From: Julian S. <js...@ac...> - 2012-05-23 18:31:39
|
On Tuesday, May 22, 2012, Josef Weidendorfer wrote:
> Am 22.05.2012 17:20, schrieb Julian Seward:
> > This seems such a rare and obscure case that I don't think it is
> > much worth worrying about. One other option is to print a warning
> > message at the start ("you configured a line size< the maximum access
> > size")
>
> What is "the maximum access size" here? Do you mean e.g. 32 when
> detecting a processor supporting AVX?
Yes.
> This would be fine IMHO.
> Can a tool easily check for AVX support?
Uh, I don't think so. Hmm.
> > and then cut access lengths down to the line size, with no further
> > warnings.
>
> We can not simply print the warning the first time an access length
> is larger than "configured line size", as every x86 code can have pusha
> (or similar, I remember size 108 for one instruction), resulting in the
> warning...
Yes, 108 is fsave/frstor etc.
Maybe the concept of "cache line size < max access size" is not the
right one, since that is always going to be the case for (eg) fxsave,
and we will have to continue to silently truncate those accesses.
A better test (for emitting a warning message) is maybe "cache line
size < max size of any register" since that will be 16 on SSE-only
CPUs and 32 on AVX capable CPUs. That makes it non-AVX specific.
It does however mean making it possible for a tool to ask the core
for the size of the largest (guest) register.
J
|
|
From: Julian S. <js...@ac...> - 2012-06-01 16:28:49
|
Hi Josef, Nick,
Do either of you have any objection if I commit basically the patch
below, after some testing? I am trying to finish up the AVX infrastructure
changes this weekend and this is the last thing on my list.
Thanks.
J
On Tuesday, May 22, 2012, Josef Weidendorfer wrote:
> Am 22.05.2012 01:24, schrieb Nicholas Nethercote:
> > On Mon, May 21, 2012 at 7:22 AM, Julian Seward<js...@ac...> wrote:
> >> I recently added some AVX support to V, and as a result added a new type
> >> of 32-byte values (Ity_V256) to IR. Loads and stores of such values
> >> cause Cachegrind and Callgrind to assert, because the size (32 bytes)
> >> is larger than MIN_LINE_SIZE, which is 16.
> >>
> >> As they currently are, both tools refuse to process memory accesses
> >> bigger than 16, on the basis that the minimum possible line size is 16,
> >> and so a 16 byte access could access 2 adjacent lines, which is a
> >> situation they are prepared to handle. But not 3 lines, which is a
> >> possible case for a 32 byte access w/ 16 byte lines (something for
> >> which I'm sure no hardware actually exists).
> >
> > Can we model the 32-byte accesses correctly and just assert if the 3
> > line case occurs?
>
> As far as I understand, the assertion in the simulator should not
> trigger in any case, as memory access lengths are always cut down to
> MIN_LINE_SIZE at the moment.
>
> This is needed, as there are some rare instructions with larger memory
> accesses, such as PUSHA.
>
> I think it makes more sense to cut access lengths down to the real
> minimum of used line sizes, using the patch below.
>
> Hmm. If people explicitly configure for 16-byte line size, and use AVX,
> this will result in wrong simulations because of the large number of
> cut down access sizes. I see these options:
> * if we detect a processor with AVX, make 32 the minimal accepted
> line size in manual configuration
> * count the number of cuts, and print a warning if a threshould of e.g.
> 10000 is reached ?
> * handle accesses straddling >2 lines correctly
>
> Here is the result of vg_perf (vg_minline is with the patch), also
> running "none" to see the measurement noise produced on my laptop. It
> looks like all numbers more or less stay the same.
>
> Josef
>
> weidendo@lapbode134:~/SW/GitRepos/vg/valgrind (master)> perl
> perf/vg_perf --tools=none,cachegrind --vg=. --vg=../vg-minline perf/
> -- Running tests in perf ----------------------------------------------
> -- bigcode1 --
> bigcode1 . :0.14s no: 2.0s (14.5x, -----) ca: 6.3s (44.7x, -----)
> bigcode1 vg-minline:0.14s no: 2.0s (14.3x, 1.5%) ca: 6.3s (44.7x, 0.0%)
> -- bigcode2 --
> bigcode2 . :0.14s no: 4.8s (34.6x, -----) ca:10.4s (74.5x, -----)
> bigcode2 vg-minline:0.14s no: 4.8s (34.3x, 0.8%) ca:10.4s (74.3x, 0.3%)
> -- bz2 --
> bz2 . :0.67s no: 2.6s ( 3.9x, -----) ca:18.4s (27.5x, -----)
> bz2 vg-minline:0.67s no: 2.6s ( 3.9x, -0.4%) ca:18.1s (27.1x, 1.6%)
> -- fbench --
> fbench . :0.28s no: 1.2s ( 4.4x, -----) ca: 5.1s (18.1x, -----)
> fbench vg-minline:0.28s no: 1.2s ( 4.4x, 0.0%) ca: 5.1s (18.1x, 0.4%)
> -- ffbench --
> ffbench . :0.26s no: 1.2s ( 4.5x, -----) ca: 6.4s (24.5x, -----)
> ffbench vg-minline:0.26s no: 1.2s ( 4.5x, 0.9%) ca: 6.4s (24.5x, -0.2%)
> -- heap --
> heap . :0.12s no: 0.8s ( 6.4x, -----) ca: 5.3s (44.3x, -----)
> heap vg-minline:0.12s no: 0.8s ( 6.8x, -5.2%) ca: 5.3s (44.6x, -0.6%)
> -- heap_pdb4 --
> heap_pdb4 . :0.12s no: 0.9s ( 7.3x, -----) ca: 5.8s (48.3x,
> -----) heap_pdb4 vg-minline:0.12s no: 0.8s ( 7.1x, 3.4%) ca: 5.8s
> (48.1x, 0.5%) -- many-loss-records --
> many-loss-records . :0.01s no: 0.3s (32.0x, -----) ca: 1.0s
> (98.0x, -----)
> many-loss-records vg-minline:0.01s no: 0.3s (29.0x, 9.4%) ca: 1.0s
> (97.0x, 1.0%)
> -- many-xpts --
> many-xpts . :0.05s no: 0.4s ( 7.8x, -----) ca: 1.4s (28.0x,
> -----) many-xpts vg-minline:0.05s no: 0.4s ( 7.6x, 2.6%) ca: 1.4s
> (28.4x, -1.4%) -- sarp --
> sarp . :0.03s no: 0.3s (11.0x, -----) ca: 1.4s (47.3x, -----)
> sarp vg-minline:0.03s no: 0.3s (11.0x, 0.0%) ca: 1.4s (46.7x, 1.4%)
> -- tinycc --
> tinycc . :0.21s no: 1.7s ( 7.9x, -----) ca:11.5s (54.8x, -----)
> tinycc vg-minline:0.21s no: 1.7s ( 8.0x, -1.8%) ca:11.4s (54.3x, 0.9%)
> -- Finished tests in perf ----------------------------------------------
>
>
>
> ====================================================================
>
> diff --git a/cachegrind/cg_main.c b/cachegrind/cg_main.c
> index 4b36204..9982f23 100644
> --- a/cachegrind/cg_main.c
> +++ b/cachegrind/cg_main.c
> @@ -69,6 +69,13 @@ static Bool clo_cache_sim = True; /* do cache
> simulation? */
> static Bool clo_branch_sim = False; /* do branch simulation? */
> static Char* clo_cachegrind_out_file = "cachegrind.out.%p";
>
> +
> +/*------------------------------------------------------------*/
> +/*--- Cachesim configuration ---*/
> +/*------------------------------------------------------------*/
> +
> +static Int min_line_size = 0; /* min of L1 and LL cache line sizes */
> +
> /*------------------------------------------------------------*/
> /*--- Types and Data Structures ---*/
> /*------------------------------------------------------------*/
> @@ -846,7 +853,7 @@ void addEvent_Dr ( CgState* cgs, InstrInfo* inode,
> Int datasize, IRAtom*
> {
> Event* evt;
> tl_assert(isIRAtom(ea));
> - tl_assert(datasize >= 1 && datasize <= MIN_LINE_SIZE);
> + tl_assert(datasize >= 1 && datasize <= min_line_size);
> if (!clo_cache_sim)
> return;
> if (cgs->events_used == N_EVENTS)
> @@ -868,7 +875,7 @@ void addEvent_Dw ( CgState* cgs, InstrInfo* inode,
> Int datasize, IRAtom*
> Event* evt;
>
> tl_assert(isIRAtom(ea));
> - tl_assert(datasize >= 1 && datasize <= MIN_LINE_SIZE);
> + tl_assert(datasize >= 1 && datasize <= min_line_size);
>
> if (!clo_cache_sim)
> return;
> @@ -1058,8 +1065,8 @@ IRSB* cg_instrument ( VgCallbackClosure* closure,
> // instructions will be done inaccurately, but they're
> // very rare and this avoids errors from hitting more
> // than two cache lines in the simulation.
> - if (dataSize > MIN_LINE_SIZE)
> - dataSize = MIN_LINE_SIZE;
> + if (dataSize > min_line_size)
> + dataSize = min_line_size;
> if (d->mFx == Ifx_Read || d->mFx == Ifx_Modify)
> addEvent_Dr( &cgs, curr_inode, dataSize, d->mAddr );
> if (d->mFx == Ifx_Write || d->mFx == Ifx_Modify)
> @@ -1085,8 +1092,8 @@ IRSB* cg_instrument ( VgCallbackClosure* closure,
> if (cas->dataHi != NULL)
> dataSize *= 2; /* since it's a doubleword-CAS */
> /* I don't think this can ever happen, but play safe. */
> - if (dataSize > MIN_LINE_SIZE)
> - dataSize = MIN_LINE_SIZE;
> + if (dataSize > min_line_size)
> + dataSize = min_line_size;
> addEvent_Dr( &cgs, curr_inode, dataSize, cas->addr );
> addEvent_Dw( &cgs, curr_inode, dataSize, cas->addr );
> break;
> @@ -1724,6 +1731,12 @@ static void cg_post_clo_init(void)
> &clo_D1_cache,
> &clo_LL_cache);
>
> + // min_line_size is used to make sure that we never feed
> + // accesses to the simulator straddling more than two
> + // cache lines at any cache level
> + min_line_size = (I1c.size < D1c.size) ? I1c.size : D1c.size;
> + min_line_size = (LLc.size < min_line_size) ? LLc.size : min_line_size;
> +
> cachesim_I1_initcache(I1c);
> cachesim_D1_initcache(D1c);
> cachesim_LL_initcache(LLc);
> ====================================================================
|
|
From: Josef W. <Jos...@gm...> - 2012-06-01 21:29:03
|
Am 01.06.2012 18:25, schrieb Julian Seward:
>
> Hi Josef, Nick,
>
> Do either of you have any objection if I commit basically the patch
> below, after some testing?
I am fine with it.
The only issue is that the user will get a panic message from the
simulator for 32-byte (AVX) accesses when he configures a cache line
size of 16.
It would be nice to avoid that as you already suggested: allow a tool
to get the maximum guest register size, and let cachegrind/callgrind
check that the cache line size is not smaller (instead of the current
constant 16).
Josef
I am trying to finish up the AVX infrastructure
> changes this weekend and this is the last thing on my list.
>
> Thanks.
>
> J
>
>
> On Tuesday, May 22, 2012, Josef Weidendorfer wrote:
>> Am 22.05.2012 01:24, schrieb Nicholas Nethercote:
>>> On Mon, May 21, 2012 at 7:22 AM, Julian Seward<js...@ac...> wrote:
>>>> I recently added some AVX support to V, and as a result added a new type
>>>> of 32-byte values (Ity_V256) to IR. Loads and stores of such values
>>>> cause Cachegrind and Callgrind to assert, because the size (32 bytes)
>>>> is larger than MIN_LINE_SIZE, which is 16.
>>>>
>>>> As they currently are, both tools refuse to process memory accesses
>>>> bigger than 16, on the basis that the minimum possible line size is 16,
>>>> and so a 16 byte access could access 2 adjacent lines, which is a
>>>> situation they are prepared to handle. But not 3 lines, which is a
>>>> possible case for a 32 byte access w/ 16 byte lines (something for
>>>> which I'm sure no hardware actually exists).
>>>
>>> Can we model the 32-byte accesses correctly and just assert if the 3
>>> line case occurs?
>>
>> As far as I understand, the assertion in the simulator should not
>> trigger in any case, as memory access lengths are always cut down to
>> MIN_LINE_SIZE at the moment.
>>
>> This is needed, as there are some rare instructions with larger memory
>> accesses, such as PUSHA.
>>
>> I think it makes more sense to cut access lengths down to the real
>> minimum of used line sizes, using the patch below.
>>
>> Hmm. If people explicitly configure for 16-byte line size, and use AVX,
>> this will result in wrong simulations because of the large number of
>> cut down access sizes. I see these options:
>> * if we detect a processor with AVX, make 32 the minimal accepted
>> line size in manual configuration
>> * count the number of cuts, and print a warning if a threshould of e.g.
>> 10000 is reached ?
>> * handle accesses straddling>2 lines correctly
>>
>> Here is the result of vg_perf (vg_minline is with the patch), also
>> running "none" to see the measurement noise produced on my laptop. It
>> looks like all numbers more or less stay the same.
>>
>> Josef
>>
>> weidendo@lapbode134:~/SW/GitRepos/vg/valgrind (master)> perl
>> perf/vg_perf --tools=none,cachegrind --vg=. --vg=../vg-minline perf/
>> -- Running tests in perf ----------------------------------------------
>> -- bigcode1 --
>> bigcode1 . :0.14s no: 2.0s (14.5x, -----) ca: 6.3s (44.7x, -----)
>> bigcode1 vg-minline:0.14s no: 2.0s (14.3x, 1.5%) ca: 6.3s (44.7x, 0.0%)
>> -- bigcode2 --
>> bigcode2 . :0.14s no: 4.8s (34.6x, -----) ca:10.4s (74.5x, -----)
>> bigcode2 vg-minline:0.14s no: 4.8s (34.3x, 0.8%) ca:10.4s (74.3x, 0.3%)
>> -- bz2 --
>> bz2 . :0.67s no: 2.6s ( 3.9x, -----) ca:18.4s (27.5x, -----)
>> bz2 vg-minline:0.67s no: 2.6s ( 3.9x, -0.4%) ca:18.1s (27.1x, 1.6%)
>> -- fbench --
>> fbench . :0.28s no: 1.2s ( 4.4x, -----) ca: 5.1s (18.1x, -----)
>> fbench vg-minline:0.28s no: 1.2s ( 4.4x, 0.0%) ca: 5.1s (18.1x, 0.4%)
>> -- ffbench --
>> ffbench . :0.26s no: 1.2s ( 4.5x, -----) ca: 6.4s (24.5x, -----)
>> ffbench vg-minline:0.26s no: 1.2s ( 4.5x, 0.9%) ca: 6.4s (24.5x, -0.2%)
>> -- heap --
>> heap . :0.12s no: 0.8s ( 6.4x, -----) ca: 5.3s (44.3x, -----)
>> heap vg-minline:0.12s no: 0.8s ( 6.8x, -5.2%) ca: 5.3s (44.6x, -0.6%)
>> -- heap_pdb4 --
>> heap_pdb4 . :0.12s no: 0.9s ( 7.3x, -----) ca: 5.8s (48.3x,
>> -----) heap_pdb4 vg-minline:0.12s no: 0.8s ( 7.1x, 3.4%) ca: 5.8s
>> (48.1x, 0.5%) -- many-loss-records --
>> many-loss-records . :0.01s no: 0.3s (32.0x, -----) ca: 1.0s
>> (98.0x, -----)
>> many-loss-records vg-minline:0.01s no: 0.3s (29.0x, 9.4%) ca: 1.0s
>> (97.0x, 1.0%)
>> -- many-xpts --
>> many-xpts . :0.05s no: 0.4s ( 7.8x, -----) ca: 1.4s (28.0x,
>> -----) many-xpts vg-minline:0.05s no: 0.4s ( 7.6x, 2.6%) ca: 1.4s
>> (28.4x, -1.4%) -- sarp --
>> sarp . :0.03s no: 0.3s (11.0x, -----) ca: 1.4s (47.3x, -----)
>> sarp vg-minline:0.03s no: 0.3s (11.0x, 0.0%) ca: 1.4s (46.7x, 1.4%)
>> -- tinycc --
>> tinycc . :0.21s no: 1.7s ( 7.9x, -----) ca:11.5s (54.8x, -----)
>> tinycc vg-minline:0.21s no: 1.7s ( 8.0x, -1.8%) ca:11.4s (54.3x, 0.9%)
>> -- Finished tests in perf ----------------------------------------------
>>
>>
>>
>> ====================================================================
>>
>> diff --git a/cachegrind/cg_main.c b/cachegrind/cg_main.c
>> index 4b36204..9982f23 100644
>> --- a/cachegrind/cg_main.c
>> +++ b/cachegrind/cg_main.c
>> @@ -69,6 +69,13 @@ static Bool clo_cache_sim = True; /* do cache
>> simulation? */
>> static Bool clo_branch_sim = False; /* do branch simulation? */
>> static Char* clo_cachegrind_out_file = "cachegrind.out.%p";
>>
>> +
>> +/*------------------------------------------------------------*/
>> +/*--- Cachesim configuration ---*/
>> +/*------------------------------------------------------------*/
>> +
>> +static Int min_line_size = 0; /* min of L1 and LL cache line sizes */
>> +
>> /*------------------------------------------------------------*/
>> /*--- Types and Data Structures ---*/
>> /*------------------------------------------------------------*/
>> @@ -846,7 +853,7 @@ void addEvent_Dr ( CgState* cgs, InstrInfo* inode,
>> Int datasize, IRAtom*
>> {
>> Event* evt;
>> tl_assert(isIRAtom(ea));
>> - tl_assert(datasize>= 1&& datasize<= MIN_LINE_SIZE);
>> + tl_assert(datasize>= 1&& datasize<= min_line_size);
>> if (!clo_cache_sim)
>> return;
>> if (cgs->events_used == N_EVENTS)
>> @@ -868,7 +875,7 @@ void addEvent_Dw ( CgState* cgs, InstrInfo* inode,
>> Int datasize, IRAtom*
>> Event* evt;
>>
>> tl_assert(isIRAtom(ea));
>> - tl_assert(datasize>= 1&& datasize<= MIN_LINE_SIZE);
>> + tl_assert(datasize>= 1&& datasize<= min_line_size);
>>
>> if (!clo_cache_sim)
>> return;
>> @@ -1058,8 +1065,8 @@ IRSB* cg_instrument ( VgCallbackClosure* closure,
>> // instructions will be done inaccurately, but they're
>> // very rare and this avoids errors from hitting more
>> // than two cache lines in the simulation.
>> - if (dataSize> MIN_LINE_SIZE)
>> - dataSize = MIN_LINE_SIZE;
>> + if (dataSize> min_line_size)
>> + dataSize = min_line_size;
>> if (d->mFx == Ifx_Read || d->mFx == Ifx_Modify)
>> addEvent_Dr(&cgs, curr_inode, dataSize, d->mAddr );
>> if (d->mFx == Ifx_Write || d->mFx == Ifx_Modify)
>> @@ -1085,8 +1092,8 @@ IRSB* cg_instrument ( VgCallbackClosure* closure,
>> if (cas->dataHi != NULL)
>> dataSize *= 2; /* since it's a doubleword-CAS */
>> /* I don't think this can ever happen, but play safe. */
>> - if (dataSize> MIN_LINE_SIZE)
>> - dataSize = MIN_LINE_SIZE;
>> + if (dataSize> min_line_size)
>> + dataSize = min_line_size;
>> addEvent_Dr(&cgs, curr_inode, dataSize, cas->addr );
>> addEvent_Dw(&cgs, curr_inode, dataSize, cas->addr );
>> break;
>> @@ -1724,6 +1731,12 @@ static void cg_post_clo_init(void)
>> &clo_D1_cache,
>> &clo_LL_cache);
>>
>> + // min_line_size is used to make sure that we never feed
>> + // accesses to the simulator straddling more than two
>> + // cache lines at any cache level
>> + min_line_size = (I1c.size< D1c.size) ? I1c.size : D1c.size;
>> + min_line_size = (LLc.size< min_line_size) ? LLc.size : min_line_size;
>> +
>> cachesim_I1_initcache(I1c);
>> cachesim_D1_initcache(D1c);
>> cachesim_LL_initcache(LLc);
>> ====================================================================
>
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Valgrind-developers mailing list
> Val...@li...
> https://lists.sourceforge.net/lists/listinfo/valgrind-developers
>
|
|
From: Julian S. <js...@ac...> - 2012-06-03 22:44:16
|
On Friday, June 01, 2012, Josef Weidendorfer wrote: > Am 01.06.2012 18:25, schrieb Julian Seward: > > Hi Josef, Nick, > > > > Do either of you have any objection if I commit basically the patch > > below, after some testing? > > I am fine with it. > > The only issue is that the user will get a panic message from the > simulator for 32-byte (AVX) accesses when he configures a cache line > size of 16. > > It would be nice to avoid that as you already suggested: allow a tool > to get the maximum guest register size, and let cachegrind/callgrind > check that the cache line size is not smaller (instead of the current > constant 16). Ok, done, for Cachegrind at least. I will try to make an equivalent change for Callgrind. r12605. The failure message is, eg $ ./vg-in-place --tool=cachegrind --D1=16384,2,16 none/tests/amd64/avx-1 [...] ==22824== Cachegrind: cannot continue: the minimum line size (16) ==22824== must be equal to or larger than the maximum register size (32) ==22824== but it is not. Exiting now. J |
|
From: Josef W. <Jos...@gm...> - 2012-06-04 16:56:27
|
Am 04.06.2012 00:40, schrieb Julian Seward: > Ok, done, for Cachegrind at least. Looks good to me. > I will try to make an equivalent change for Callgrind. I see you already did that. Thanks! Josef |
|
From: Julian S. <js...@ac...> - 2012-06-04 17:03:45
|
On Monday, June 04, 2012, Josef Weidendorfer wrote: > Am 04.06.2012 00:40, schrieb Julian Seward: > > Ok, done, for Cachegrind at least. > > Looks good to me. > > > I will try to make an equivalent change for Callgrind. > > I see you already did that. Thanks! Oh, yeah. I was going to mail about that and forgot. Can you look over the diff a bit? I think it's OK but would appreciate a second opinion. It is a bit more complex than the Cachegrind case because in this case, the assertions re size fail if cache simulation is not selected, and so have to be "gated" on that condition. J |
|
From: Josef W. <Jos...@gm...> - 2012-06-04 17:56:41
|
Am 04.06.2012 19:00, schrieb Julian Seward: > second opinion. It is a bit more complex than the Cachegrind case > because in this case, the assertions re size fail if cache simulation > is not selected, and so have to be "gated" on that condition. You could have set CLG_(min_line_size) in any case, but it's ok this way, too. It looks fine to me. I just wondered about whether we do the right thing not really taking I1 cache line size into account for calculating CLG_(min_line_size). But we actually do: according to the comment in cg_arch.c atop the check against MIN_LINE_SIZE, it is set to 16 because that is larger than any instruction. Josef |
|
From: Josef W. <Jos...@gm...> - 2012-06-04 18:15:24
|
Am 04.06.2012 19:56, schrieb Josef Weidendorfer: > Am 04.06.2012 19:00, schrieb Julian Seward: >> second opinion. It is a bit more complex than the Cachegrind case >> because in this case, the assertions re size fail if cache simulation >> is not selected, and so have to be "gated" on that condition. > > You could have set CLG_(min_line_size) in any case, My bad. Of course it's better to gate it on whether simulation is done at all. So everything is fine. |