|
From: Carl L. <ce...@us...> - 2017-09-22 15:28:45
|
Valgrind developers:
I have two users that have recently run into what appears to be the same
issue. One of the workloads is a video application, the other is a
float128 workload for the new Power 9 instructions. Both workloads fail
with the message:
Pool = TEMP, start 0x5861af28 curr 0x58adfa48 end 0x58adfa67 (size 5000000)
vex: the `impossible' happened:
VEX temporary storage exhausted.
Increase N_{TEMPORARY,PERMANENT}_BYTES and recompile.
I increased the N_TEMPORARY_BYTES and N_PERMANENT_BYTES #defines as
given below.
--- a/VEX/priv/main_util.c
+++ b/VEX/priv/main_util.c
@@ -55,10 +55,10 @@
#if defined(ENABLE_INNER)
/* 5 times more memory to be on the safe side: consider each allocation is
8 bytes, and we need 16 bytes redzone before and after. */
-#define N_TEMPORARY_BYTES (5*5000000)
+#define N_TEMPORARY_BYTES (5*2000000000)
static Bool mempools_created = False;
#else
-#define N_TEMPORARY_BYTES 5000000
+#define N_TEMPORARY_BYTES 2000000000
#endif
static HChar temporary[N_TEMPORARY_BYTES] __attribute__((aligned(REQ_ALIGN)));
@@ -70,9 +70,9 @@ static ULong temporary_bytes_allocd_TOT = 0;
#if defined(ENABLE_INNER)
/* See N_TEMPORARY_BYTES */
-#define N_PERMANENT_BYTES (5*10000)
+#define N_PERMANENT_BYTES (5*100000)
#else
-#define N_PERMANENT_BYTES 10000
+#define N_PERMANENT_BYTES 100000
Once these were increased both workloads then hit the error:
x264 [info]: profile High, level 3.1
vex: priv/host_generic_reg_alloc3.c:470 (doRegisterAllocation_v3):
Assertion `instrs_in->arr_used <= 15000' failed.
vex storage: T total 373013384 bytes allocated
vex storage: P total 192 bytes allocated
From the comments in the code, it doesn't look like increasing the 15000
is a viable option. It appears that something is just generating too
much stuff. I am wondering if anyone can give me some idea what is going
on here. It all appears to be architecture independent code. Any
suggestions on how to go about debugging this would be helpful. Thanks.
Carl Love
|
|
From: Ivo R. <iv...@iv...> - 2017-09-22 15:53:01
|
> From the comments in the code, it doesn't look like increasing the 15000
> is a viable option.

Actually this limit is somewhat arbitrary.
For register live ranges, type Short is used but only because invalid
start/end range is indicated with -2.
The rest of negative range is actually unused.
I think this can be easily changed to UShort instead, and the limit
raised to 62000, for example.

I can send you a patch if you are interested and willing to try it.

> It appears that something is just generating too
> much stuff. I am wondering if anyone can give me some idea what is going
> on here. It all appears to be architecture independent code. Any
> suggestions on how to go about debugging this would be helpful. Thanks.

Dump the information what the VEX JIT is doing.
1) Start with --trace-flags=10000000 --trace-notbelow=0
   From the last block dumped, note the SB number.
2) Refine with --trace-flags=11111100 --trace-notbelow=<SB-1>
   You'll have relatively short dump with very useful information.

I.
|
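A minimal sketch of the Short versus UShort point, for illustration only. It is not the VEX register allocator code: the macro and function names are hypothetical, and placing the unsigned "invalid" marker at the top of the range is an assumption about how such a change could be made.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Signed scheme described above: -2 marks an invalid live-range endpoint,
   so only 0..INT16_MAX are usable as instruction indices. */
#define LR_INVALID_SIGNED   ((int16_t)-2)

/* Unsigned scheme (assumption): reserve the top value as the marker,
   leaving 0..UINT16_MAX-1 usable. */
#define LR_INVALID_UNSIGNED ((uint16_t)UINT16_MAX)

/* Can every instruction index up to n_instrs be stored in a signed 16-bit field? */
static bool fits_signed(uint32_t n_instrs)
{
    return n_instrs <= (uint32_t)INT16_MAX;
}

/* Same question for an unsigned 16-bit field with the top value reserved. */
static bool fits_unsigned(uint32_t n_instrs)
{
    return n_instrs < (uint32_t)LR_INVALID_UNSIGNED;
}
```

Under these assumptions a limit of 62000 does not fit the signed encoding but does fit the unsigned one, which is why the type change would have to come with the raised limit.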
|
From: Carl L. <ce...@us...> - 2017-09-22 17:20:49
|
On Fri, 2017-09-22 at 17:52 +0200, Ivo Raisr wrote:
> > From the comments in the code, it doesn't look like increasing the 15000
> > is a viable option.
>
> Actually this limit is somewhat arbitrary.
> For register live ranges, type Short is used but only because invalid
> start/end range is indicated with -2.
> The rest of negative range is actually unused.
> I think this can be easily changed to UShort instead, and the limit
> raised to 62000, for example.
>
> I can send you a patch if you are interested and willing to try it.
>
> > It appears that something is just generating too
> > much stuff. I am wondering if anyone can give me some idea what is going
> > on here. It all appears to be architecture independent code. Any
> > suggestions on how to go about debugging this would be helpful. Thanks.
>
> Dump the information what the VEX JIT is doing.
> 1) Start with --trace-flags=10000000 --trace-notbelow=0
> From the last block dumped, note the SB number.
> 2) Refine with --trace-flags=11111100 --trace-notbelow=<SB-1>
> You'll have relatively short dump with very useful information.
>
> I.
>
Ivo:
So, with some help from Aaron, it looks like we are generating a lot of
temporaries. At one point, I see the temporary map with:
------------------------ After pre-instr IR optimisation ------------------------
IRSB {
t0:F128 t1:F128 t2:F128 t3:I32 t4:I32 t5:I1 t6:I1 t7:I1
<cut for readability>
t9312:I32 t9313:I32 t9314:I32 t9315:I1 t9316:I32 t9317:I32 t9318:I1 t9319:I32
t9320:I32 t9321:I32 t9322:I1 t9323:I32 t9324:I32 t9325:I64
This occurs in the middle of a block of a bunch of P9 Floating point 128
instructions. Some of the P9 floating point 128 instructions take a
fair number of Iops to implement. I don't remember specifically for
each instruction at this point. It looks to us like there may just
be too many of these instructions in the Valgrind basic block. When the
instructions get converted to Iops it is just too big. Looking at a
partial assembly listing that Aaron gave me for the floating point test,
it looks like the assembly code is a very large sequential block of
instructions. One thought Aaron had was to see if we can tell Valgrind
to limit the size of its basic block. Not seeing a command line option
to do that.
In VEX/priv/main_main.c
There are the lines:
vcon->iropt_unroll_thresh = 120;
vcon->guest_max_insns = 60;
I found if they are changed to
vcon->iropt_unroll_thresh = 4;
vcon->guest_max_insns = 2;
It appears to "fix" the issue. Not sure what the ramifications of
forcing these so low are. Will do some more looking in the code to see
what these really do. Don't know if there is an existing mechanism to
allow control of this or not? Thoughts?
Carl Love
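A rough back-of-the-envelope model of why lowering guest_max_insns makes the assertion go away. This is only a sketch: the 15000 figure is the limit from the failed assertion quoted earlier in the thread, while the per-instruction expansion factor is a made-up parameter, not a measured value, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Limit taken from the failed assertion in doRegisterAllocation_v3. */
#define REG_ALLOC_MAX_INSTRS 15000

/* Worst-case size of one translated block, assuming (hypothetically) that
   every guest instruction expands to `per_insn` host instructions. */
static long worst_case_block_size(int guest_max_insns, int per_insn)
{
    return (long)guest_max_insns * per_insn;
}

/* True if the worst-case block stays within the register allocator limit. */
static bool block_fits(int guest_max_insns, int per_insn)
{
    return worst_case_block_size(guest_max_insns, per_insn) <= REG_ALLOC_MAX_INSTRS;
}
```

With an assumed expansion of 300 per guest instruction, a 60-instruction block could reach 18000 and exceed the limit, while a 2-instruction block stays far below it. The cost is many more, much smaller blocks.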
|
|
From: Carl L. <ce...@us...> - 2017-09-22 17:53:27
|
Ivo, Julian:

Looks like our issue is the same as was seen in
https://bugs.kde.org/show_bug.cgi?id=375839

The bug reports the same error and symptoms. Found the bug based on
comments in guest_generic_bb_to_IR.c.

/* Although we will try to disassemble up to vex_control.guest_max_insns
   insns into the block, the individual insn assemblers may hint to us that a
   disassembled instruction is verbose. In that case we will lower the limit
   so as to ensure that the JIT doesn't run out of space. See bug 375839 for
   the motivating example. */

Carl Love
|
|
From: Julian S. <js...@ac...> - 2017-09-23 06:55:21
|
Carl,

I see you found https://bugs.kde.org/show_bug.cgi?id=375839, which is good
as a reference point, because this is a POWER-esque re-run of the same
problem.

As a short term workaround, you can possibly get around this by running V
with --vex-guest-max-insns=<N> where <N> is, say, 30, 20, 10 or even 5.
This limits the number of guest insns incorporated into each IRSB to the
given value. The default is 50, I think. Going below about 10 isn't good
though; you'll lose performance.

For a proper fix, the underlying problem is that the ppc front end is
generating excessively verbose translations of some particular instruction.
As a first step, I'd recommend finding out which (using the normal
--trace-flags= technique) and seeing if you can easily make it less verbose.
I've found these problems normally to be associated with vector insns which
get implemented by "doing" each lane separately (I think you mentioned
that). It would be helpful if you could find and show an example of the
problematic translation.

If the translation can't easily be improved (or improving it doesn't fix
this), then we can also use the dynamic hinting mechanism that you found.
In short the PPC front end needs to mark the relevant instruction's
disassembly-status return value (I forget the name now) to set its .hint
field to Dis_HintVerbose. Look in the amd64 front end for examples.

We'll probably need both approaches. Please also file a bug, so we don't
have to track this by email.

J
|
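A toy model of the dynamic hinting idea, to make the mechanism concrete. Only the name Dis_HintVerbose comes from the thread. The enum, the function and the lowered limit below are hypothetical stand-ins, not the real VEX types or values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-instruction hint reported by a front end. */
typedef enum { Hint_None, Hint_Verbose } ToyHint;

/* Count how many guest instructions end up in one block.
   hints[i] is the hint reported after disassembling instruction i.
   The block normally takes up to max_insns instructions, but once a verbose
   instruction is seen the limit drops to verbose_limit (if that is lower). */
static size_t insns_in_block(const ToyHint *hints, size_t n,
                             size_t max_insns, size_t verbose_limit)
{
    size_t limit = max_insns;
    size_t taken = 0;
    while (taken < n && taken < limit) {
        if (hints[taken] == Hint_Verbose && verbose_limit < limit)
            limit = verbose_limit;
        taken++;  /* the instruction just disassembled stays in the block */
    }
    return taken;
}
```

In this model a run of ordinary instructions still fills the block up to the normal limit, so only blocks containing verbose instructions pay the cost of being cut short. That is the advantage over lowering the limit globally on the command line.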