|
From: Petar J. <mip...@gm...> - 2012-11-20 16:04:42
|
hi everyone,
First of all, I apologize for a rather lengthy email.
Here is an issue I would like to share and hopefully get some advice on.
Similar to other architectures, MIPS arch has a pair of instructions of
load-link and store-conditional, namely LL and SC.
We have been seeing some issues in which a program would end up in an infinite
loop due to SC failing each time. The probability of failure is closely related
to which compiler was used to compile Valgrind, but it will eventually fail with
any of them. With some native compilers, Valgrind always fails (i.e. stays in the
loop).
MIPS documentation lists some conditions under which SC will fail (see more
details below). It also says SC may succeed or *fail* if "a memory access
instruction (load, store, or prefetch) is executed on the processor executing
the LL/SC."
We have not been able to isolate the issue with a standalone sequence that
fails, no matter how many read/write memory accesses we put in the LL/SC region.
Yet, in some programs a sequence like this in the guest code:
lui $s0, 0x41
ori $s0, 0x00f0
move $t9, $zero
li $v0, 1
begin:
ll $v1, 0($s0) -------------------
bne $v1, $t9, kraj
move $a0, $zero
move $a0, $v0
sc $a0, 0($s0) ------------------
beqzl $a0, begin
move $at, $at
will fail. More precisely, it will fail with Callgrind; even more precisely, it
will fail with Callgrind and the option "--cacheuse=yes". Some instrumentation
that happens to the code between LL and SC causes the subsequent SC to fail.
Two ideas have been discussed to resolve the issue:
A) leave RMW region in one translation block (i.e. if a branch is placed between
LL and SC, do not stop there) as long as it fits under max-size block;
B) try to emulate LL/SC differently.
A) would be quick, but would there be any side effects?
Any other ideas? Anybody had a similar issue on other architecture?
It may also be worth mentioning that GDB treats LL/SC as one step, which means
that 'si' will step from LL to SC directly, passing over all instructions in
between.
Any advice is welcome!
Thanks.
Petar
Part of MIPS documentation on LL/SC:
"If either of the following events occurs between the execution of LL and SC,
the SC fails:
• A coherent store is completed by another processor or coherent I/O module into
the block of synchronizable
physical memory containing the word. The size and alignment of the block is
implementation dependent, but it is
at least one word and at most the minimum page size.
• An ERET instruction is executed.
If either of the following events occurs between the execution of LL and SC,
the SC may succeed or it may fail; the
success or failure is not predictable. Portable programs should not cause one
of these events.
• A memory access instruction (load, store, or prefetch) is executed on the
processor executing the LL/SC.
• The instructions executed starting with the LL and ending with the SC do not
lie in a 2048-byte contiguous region of virtual memory. (The region does not
have to be aligned, other than the alignment required for instruction
words.)
The following conditions must be true or the result of the SC is
UNPREDICTABLE:
• Execution of SC must have been preceded by execution of an LL instruction.
• An RMW sequence executed without intervening events that would cause the SC to
fail must use the same address in the LL and SC. The address is the same if
the virtual address, physical address, and cacheability &
coherency attribute are identical.
"
|
|
From: Josef W. <Jos...@gm...> - 2012-11-20 18:23:10
|
On 20.11.2012 17:04, Petar Jovanovic wrote:
> begin:
> ll $v1, 0($s0) -------------------
> bne $v1, $t9, kraj
> move $a0, $zero
> move $a0, $v0
> sc $a0, 0($s0) ------------------
> beqzl $a0, begin
> move $at, $at
>
> will fail. More precisely it will fail with Callgrind. More more precisely, it
> will fail with Callgrind and option "--cacheuse=yes". Some instrumentation that
> happens to the code between LL and SC will cause the subsequent SC to fail.
Hmm. In your example, there is no memory access in the RMW region.
However, Cachegrind/Callgrind do not call cache simulation functions
synchronously, but collect the calls and issue them in bunches. The only
way I see "--cacheuse=yes" making a difference is that the simulator
calls for previous memory accesses are moved into the RMW region,
which makes sense, as there is a branch there.
It may help if outstanding simulator calls get flushed before entering
the RMW region.
diff --git a/callgrind/main.c b/callgrind/main.c
index 41fcd9e..a68f069 100644
--- a/callgrind/main.c
+++ b/callgrind/main.c
@@ -1073,6 +1073,8 @@ IRSB* CLG_(instrument)( VgCallbackClosure*
dataTy = typeOfIRTemp(sbIn->tyenv, st->Ist.LLSC.result);
addEvent_Dr( &clgs, curr_inode,
sizeofIRType(dataTy), st->Ist.LLSC.addr );
+ /* flush events before LL, should help SC to succeed */
+ flushEvents( &clgs );
} else {
/* SC */
> Two ideas have been talked about to resolve the issue:
>
>
> A) leave RMW region in one translation block (i.e. if a branch is placed between
> LL and SC, do not stop there) as long as it fits under max-size block;
I do not think this is supported by VEX without larger changes.
>
> B) try to emulate LL/SC differently.
As Valgrind serializes threads, it should be enough to check whether
there was a schedule point within the RMW region, and make SC fail only
in that case.
Josef
|
|
From: Julian S. <js...@ac...> - 2012-11-20 22:22:56
|
On 11/20/2012 05:04 PM, Petar Jovanovic wrote:
> We have been seeing some issues in which a program would end up in an infinite
> loop due to SC failing each time. The probability to fail is closely related on
> which compiler was used to compile Valgrind, but it will fail with any
> eventually. With some native compilers, Valgrind always fails (i.e. stays in the
> loop).
Yes. I am not surprised to hear this. Because the JIT and the instrumentation
add arbitrary amounts of memory traffic between the original LL and SC, there
must be some point at which it causes the LL-SC to fail in cases where the
original guest-code LL-SC pair would have succeeded. Unfortunately I can't
think of any easy way to avoid the problem.
> A) leave RMW region in one translation block (i.e. if a branch is placed between
> LL and SC, do not stop there) as long as it fits under max-size block;
I don't think this would help. Whether or not the LL and SC are in the same
block isn't important. The problem is that there are extra memory references
in between the LL and SC that are not in the original code, and which cause
the LL/SC to fail.
J
|
From: Petar J. <mip...@gm...> - 2012-11-22 23:47:04
|
hi Josef,
flushing events before LL indeed helped!
Thanks a million!
We were aware that some memory access in the RMW region was causing the failure,
but we failed to pinpoint which one and under what conditions. I should have
also mentioned that there were some hw platforms on which it was not failing (at
least not for the tests in question), and since the MIPS spec says SC could
potentially fail for any load/store operation placed in between, we were losing
focus with every new attempt to resolve it.
Thanks again.
Petar
On Tue, Nov 20, 2012 at 7:22 PM, Josef Weidendorfer
<Jos...@gm...> wrote:
> On 20.11.2012 17:04, Petar Jovanovic wrote:
>
>> begin:
>>
>> ll $v1, 0($s0) -------------------
>> bne $v1, $t9, kraj
>> move $a0, $zero
>> move $a0, $v0
>> sc $a0, 0($s0) ------------------
>> beqzl $a0, begin
>> move $at, $at
>>
>> will fail. More precisely it will fail with Callgrind. More more
>> precisely, it
>> will fail with Callgrind and option "--cacheuse=yes". Some instrumentation
>> that
>> happens to the code between LL and SC will cause the subsequent SC to
>> fail.
>
>
> Hmm. In your example, there is no memory access in the RMW region.
>
> However, Cachegrind/Callgrind do not call cache simulation functions
> synchroniously, but collect them and call them in bunches. The only
> way I see "--cacheuse=yes" making a difference is that the simulator
> calls for previous memory accesses are moved within the RMW region.
> Which makes sense as there is a branch there.
>
> It may help if outstanding simulator calls get flushed before entering
> the RWM region.
>
> diff --git a/callgrind/main.c b/callgrind/main.c
> index 41fcd9e..a68f069 100644
> --- a/callgrind/main.c
> +++ b/callgrind/main.c
> @@ -1073,6 +1073,8 @@ IRSB* CLG_(instrument)( VgCallbackClosure*
> dataTy = typeOfIRTemp(sbIn->tyenv, st->Ist.LLSC.result);
> addEvent_Dr( &clgs, curr_inode,
> sizeofIRType(dataTy), st->Ist.LLSC.addr );
> + /* flush events before LL, should help SC to succeed */
> + flushEvents( &clgs );
> } else {
> /* SC */
>
>
>> Two ideas have been talked about to resolve the issue:
>>
>>
>> A) leave RMW region in one translation block (i.e. if a branch is placed
>> between
>> LL and SC, do not stop there
>
>
> I do no think this is supported by VEX without larger changes.
>
>
> ) as long as it fits under max-size block;
>>
>>
>> B) try to emulate LL/SC differently.
>
>
> As Valgrind is serializing threads, it should be enough to check if there
> was a schedule point within the RMW region, and make SC fail only
> in this case.
>
> Josef
>
|
|
From: Josef W. <Jos...@gm...> - 2012-11-27 10:50:08
|
On 23.11.2012 00:46, Petar Jovanovic wrote:
> flushing events before LL indeed helped!
Nice!
Cachegrind and Lackey use the same mechanism of delaying helper calls, so
I think it makes sense to do the same for these tools. Lackey even calls
VG_(printf) from helpers, thus eventually from within RMW sections.
Josef