I would have thought that in writing
nothing in 'some-stuff' can be perceived as happening before the mutex is grabbed, *and* that all writes in flight within a mutex are flushed to memory as the mutex is released.
Jan Moringen has exposed a bug in globaldb that causes me to think otherwise or else to think that my analysis of something else is flawed, so here's the issue:
In the globaldb rehash algorithm, exactly one thread is elected to rehash an overflowed table, and other threads wait. (This is the infrequent non-lock-free logic obviously)
But it looks like more than one thread can think it is the chosen rehasher.
They don't try to rehash simultaneously - because mutexes work - but they try one after the other on the same storage vector, which should never happen. This speculation of mine comes from an AVER added to a patched globaldb that rehashing does not try to stomp on a key cell marked as visited.
The first question is whether people concur that in the code in 'info-env-rehash', the only way that a cell key can be marked +unavailable-key+ is if a rehasher put that there. This is done in a loop starting at line # 337 of trunk. The idea is that the rehasher tries to change every 'empty' key to 'unavailable' so that writers who try to claim empty cells fail, and block themselves until rehash completion. Non-empty keys/values get forwarded. Also, info-puthash AVERs that no caller supplies +unavailable-key+
So if it's the case that the rehasher *is* seeing unavailable-key in a key cell, the obvious next conclusion is that two rehashers both want to sweep the old storage. The test for deciding whether to rehash is simple: having grabbed the mutex, is the storage vector that we _know_ to have overflowed EQ to the same storage vector that is currently in the table?
This is at the comment " ;; any thread could have beaten us to rehashing"
It sure seems like two threads are seeing that test pass. What I want to do is add a write barrier where it says "no write barrier needed" at the end of info-env-rehash and a read barrier a the "could have beaten" comment.
But I want to feel that this is more than just throwing a dart at a dartboard of possible solutions. We have used the same logic in QPX for years and the bug does not happen.