From: David S. <dsc...@vy...> - 2001-03-28 16:48:02
|
> I've simply disabled the first 'read_lock' in the > DisplayObject::remove() method. I came to this because my application > always seems to hang at the point where it sets the 'visible' attribute > of an object to false, and this function is called in response to that > action. > > So, is there something wrong with the locking scheme in cvisual Yes, there is apparently something wrong. First of all, some background on the situation: The code in question is invoked to add or remove an object from the list of visible objects in a display. The reason it employs locking is that the visual renderer, in a separate thread, may be iterating through the same list and displaying the objects. It needs to be protected from encountering deleted objects, objects that are being modified, etc. To understand what's happening, we need to look in several places (1) The insert() and remove() functions themselves (2) DisplayListIterator::next() in displaylist.h - this is the function used by the renderer to walk through the list (3) The code in prim.cpp and arrprim.cpp that calls SetVisible(), which calls insert() and remove() This last is important, because it turns out that that code takes *another* write_lock, on "this". So the sequence for insert goes lock this lock head of list lock first item in list unlock first item unlock head unlock this The sequence for remove goes lock this lock prev unlock prev lock next unlock next unlock this The sequence for next is a little harder to see: lock head lock first unlock head lock second unlock first lock third unlock second lock fourth And so on. We need to watch out for deadlocks between next() in the rendering thread and either insert() or remove() in the other thread. No deadlocks can occur between insert() and next(). Intuitively, this is because both functions are walking down the list "with the grain". Specifically, each one locks a node A only if (1) It holds no other locks or (2) It holds a lock on A.prev and no other nodes If (1) is the case, this thread has *no* locks and so there can't possibly be a deadlock while this thread waits for A. If (2) is the case, and A is locked by the other thread, the other thread cannot be waiting for A.prev because it can't have both A.prev.prev and A locked. However, this logic doesn't work for remove(), which locks this->prev "against the grain". In fact, it's pretty easy to see that there CAN be a deadlock. To look at just one case: remove() next() -------- ------ lock this lock prev lock prev unlock prev.prev lock this (deadlock) This is fairly embarrassing, and also rather puzzling - why is it working for everyone except Fred? Now, Fred's fix removes the lock on prev. It's easy to see that the deadlock is eliminated, since remove() then runs with the grain of the list: lock this lock next unlock next unlock this Alas, this version of remove(), while deadlock proof, is highly suspicious in another way: it is modifying an object (prev) that isn't locked. To give a simple example of how this could cause problems, imagine that next() was written this way: void DisplayListIterator::next() { DisplayObject *old = ptr; ptr = old->next; // (1) if (old->next) // (2) old->next->mtx.sync_lock(); old->mtx.sync_unlock(); } This is a perfectly legitimate way to write it - ptr is locked going in, so all the accesses are to a locked object. However, imagine this code running alongside Fred's remove(). I won't give the details, but remove() could change old->next in between (1) and (2) above, resulting in ptr being set to one object, but a different object being locked! Now, the agonizing thing is that Fred's version probably *does* work as things are implemented now. The prev and next pointers are only used (thank goodness) in a very small number of places, and from a cursory examination all of them will probably work given the memory ordering semantics of Intel SMP. However, I hesitate to leave this land mine around for the next person who tries to work with (or port) the code. I think the best solution is for the locking in remove() to work like this: lock prev unlock prev lock this lock next unlock next unlock this This should be safe because the rendering thread never *modifies* the prev or this pointers, and multiple Python threads are protected from one another by the global interpreter lock. Therefore it isn't necessary to have this locked until we need to actually modify its prev or next pointers. For this to work, the write locks have to be *removed* from the code that calls SetVisible (in prim.cpp and arrprim.cpp, and maybe elsewhere?) insert() and remove() both need to be modified to lock "this" themselves. A plausible implementation of remove() is something like void DisplayObject::remove() { if (prev) { DisplayObject::read_lock P(prev->mtx); prev->next = next; } DisplayObject::write_lock T(this->mtx); if (next) { DisplayObject::read_lock N(next->mtx); next->prev = prev; } prev = next = 0; getObject().decrement_reference_count(); } Unfortunately, I have no more time for this today. > It does bother me that the insert() and remove() methods of > DisplayObject both create two locks, not releasing the first one before > acquiring the second even though the first seems no longer to be needed > when the second is acquired. I tried hacking the code to explicitly > delete the first lock as soon as possible, but that version still > deadlocks. Maybe you are unaware of the syntactic trick being used to implement the locks? Each lock is removed when it goes out of scope (roughly, when you hit a } in the code ): if (prev) { DisplayObject::read_lock P(prev->mtx); prev->next = next; } // <---- unlock P Dave |