From: John R. <jr...@bi...> - 2012-06-20 14:59:20
On 06/19/2012 09:50 AM, Julian Seward wrote:
> On Sunday, June 17, 2012, Philippe Waroquiers wrote:
>>> Use LoadLocked and StoreConditional instructions, as on MIPS.
>>> These sense the state of the cache line, and implement a "greedy"
>>> solution.  Your code provides the fixup when greedy fails
>>> (usually: try again, after re-fetch and re-modify.)
>>
>> Are the LL/SC instructions not suffering from the same performance
>> degradation as the x86/amd64 atomic instructions?
>> (cfr the unacceptable performance degradation in memcheck when
>> adding one atomic instruction in the STORE helperc).
>>
>> I have very bad knowledge of all these atomic things and similar,
>> but I am guessing that there is no order of magnitude difference
>> of performance between these approaches.
>
> I would agree with your guess.  In both cases (LL/SC or LOCK), some
> kind of global communication across all cores is implied, in the
> worst case.

Yes, in the worst case some kind of global communication across all
cores is required.  But if there is no contention, the cost is very
low, often zero.  That is faster than an x86-style bus-locked atomic
operation, which always costs at least 10 to 15 cycles even when
there is no contention.  Providing hardware support to separate the
read from the write, and integrating the inspection of cache state
into the LL+SC pair, are big advantages.

When there _is_ contention, [a single attempt at] LL+SC does not have
to be any slower than the normal communication that is required to
maintain cache coherency.  So again, LL+SC provides atomic update at
very low additional cost, except for the software retry that is
required when there is an actual collision.  A bus-locked atomic
operation guarantees forward progress for each contender [as far as
the hardware is concerned], but at higher cost.
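The "try again, after re-fetch and re-modify" pattern quoted above is
what a compare-exchange loop looks like in portable C.  This is only an
illustrative sketch (the function name is mine, not from the thread):
on MIPS, ARM, or POWER the compiler lowers the compare-exchange to an
LL/SC pair, while on x86 it becomes LOCK CMPXCHG.

```c
#include <stdatomic.h>

/* Atomic increment built from a compare-and-swap retry loop. */
static void atomic_increment(atomic_int *p)
{
    /* Re-fetch: read the current value of the location. */
    int old = atomic_load_explicit(p, memory_order_relaxed);

    /* Re-modify and attempt the store.  If another contender changed
     * the location in the meantime, the weak compare-exchange fails,
     * 'old' is refreshed with the value now in memory, and we retry.
     * At most one contender makes progress per collision. */
    while (!atomic_compare_exchange_weak_explicit(
               p, &old, old + 1,
               memory_order_seq_cst, memory_order_relaxed))
        ;  /* collision: loop and try again */
}
```

Note the use of the _weak_ variant: it is allowed to fail spuriously,
which matches LL/SC semantics (an SC can fail even without a real
conflict, e.g. after a cache-line eviction), and it lets the compiler
generate a single LL/SC pair inside the loop.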
LL+SC provides much lower cost per attempt, but at most one contender
will make progress during a collision, and repeated attempts can be
necessary for the same logical operation.

> One thing I do know (from testing on an old Core 2) is that the cost
> of LOCK depends on how far down the cache hierarchy the line in
> question is.  In the case where the line is in that core's D1, the
> cost was "only" about 10-15 cycles instead of the often-rumoured
> 100+ cycles.
>
> But I am not claiming to understand anything much about this,
> really; in particular I have no idea if such performance variations
> are something that it would be safe for us to depend on in the
> general case.

When there are only about 5 architectures, using "general case" can
be misleading.  "Cost proportional to cache depth" is true in every
actual instance that I know of.  However, the constant can vary a
lot, even from implementation to implementation within the same
architecture.  For instance, typical Intel Core 2 parts have a large
L2 and no L3, while Core i3/i5/i7 parts have a medium-sized L2 and a
medium-to-large L3.  On average, Core 2 can be faster per cycle for
some workloads, even with the same or smaller total cache size.