If I recall correctly (and it has been a couple of years now since I really messed with this stuff, and I was cribbing notes at the time), mutex acquisition and release, and CAS, each count as a reasonably strong (write?) barrier.  Now, it could be that the x86oid implementation of this is flawed, but this certainly was the original intent.

I don't have time right now to load the details of how all this works into my headspace right now, but I'd like to point out one possibly-counterintuitive aspect of the BARRIER macro: The body is executed BEFORE the barrier. I don't know if this factors into your design or not. I've always just found it easier to ignore the body parameter.

Umm... And the whole barrier thing might not actually be safe against code motion performed by the compiler. I don't know if the compiler does such things or not, just that it supposedly didn't when the barrier VOPs and macro were originally written.

I also note one worrisome comment about not having a data-dependency barrier in one place, allowing Alpha to re-order things. On non-Alpha, the data-dependency barrier emits no code. On Alpha, it's most likely required. Please add it before someone gets it into their head to start maintaining the Alpha backend.

-- Alastair Bridgewater

On Wed, Apr 16, 2014 at 2:10 AM, Douglas Katzman <dougk@google.com> wrote:
I would have thought that in writing
 (with-mutex (foo)
nothing in 'some-stuff' can be perceived as happening before the mutex is grabbed, *and* that all writes in flight within a mutex are flushed to memory as the mutex is released.
Jan Moringen has exposed a bug in globaldb that causes me to think otherwise or else to think that my analysis of something else is flawed, so here's the issue:

In the globaldb rehash algorithm, exactly one thread is elected to rehash an overflowed table, and other threads wait. (This is the infrequent non-lock-free logic obviously)
But it looks like more than one thread can think it is the chosen rehasher.
They don't try to rehash simultaneously - because mutexes work - but they try one after the other on the same storage vector, which should never happen. This speculation of mine comes from an AVER added to a patched globaldb that rehashing does not try to stomp on a key cell marked as visited.

The first question is whether people concur that in the code in 'info-env-rehash', the only way that a cell key can be marked +unavailable-key+ is if a rehasher put that there. This is done in a loop starting at line # 337 of trunk. The idea is that the rehasher tries to change every 'empty' key to 'unavailable' so that writers who try to claim empty cells fail, and block themselves until rehash completion. Non-empty keys/values get forwarded. Also, info-puthash AVERs that no caller supplies +unavailable-key+

So if it's the case that the rehasher *is* seeing unavailable-key in a key cell, the obvious next conclusion is that two rehashers both want to sweep the old storage.  The test for deciding whether to rehash is simple: having grabbed the mutex, is the storage vector that we _know_ to have overflowed EQ to the same storage vector that is currently in the table?
This is at the comment " ;; any thread could have beaten us to rehashing"
It sure seems like two threads are seeing that test pass. What I want to do is add a write barrier where it says "no write barrier needed" at the end of info-env-rehash and a read barrier a the "could have beaten" comment.
But I want to feel that this is more than just throwing a dart at a dartboard of possible solutions. We have used the same logic in QPX for years and the bug does not happen.

Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
Sbcl-devel mailing list