From: Fred Y. <fr...@on...> - 2001-03-24 13:39:05
|
As I've mentioned before, I've got a VPython/Tk program that hangs on my machine, a dual Pentium3 800 system running WinNT4 SP6. Others have tried to recreate my problem, but none have seen it. I've found that with one small change to the C++ code for VPython itself, I can prevent the deadlock without seeming to cause any other problems. Here's the patch: --- displaylist.cpp.orig Sat Mar 24 07:09:58 2001 +++ displaylist.cpp Sat Mar 24 07:12:11 2001 @@ -143,7 +143,7 @@ void DisplayObject::remove() { if (prev) { - DisplayObject::read_lock P(prev->mtx); + //DisplayObject::read_lock P(prev->mtx); prev->next = next; } if (next) { -- I've simply disabled the first 'read_lock' in the DisplayObject::remove() method. I came to this because my application always seems to hang at the point where it sets the 'visible' attribute of an object to false, and this function is called in response to that action. So, is there something wrong with the locking scheme in cvisual, or am I just hacking around a deeper problem such as some glitch in display drivers? I admit that I know nothing about OpenGL, much less about what locking strategy is necessary. I don't even understand how there can be contention in my application since all work would seem to be done in the Tk thread in response to callback functions associated with the Slider objects. But here too, I don't understand just what threads are involved and how they can interleave. It does bother me that the insert() and remove() methods of DisplayObject both create two locks, not releasing the first one before acquiring the second even though the first seems no longer to be needed when the second is acquired. I tried hacking the code to explicitly delete the first lock as soon as possible, but that version still deadlocks. My problematic application is still at <http://www.ontosys.com/src/residue.py>. Here's the output from glinfo.py, showing my private version of cvisual.dll. My video card is a 3Dfx Voodoo-3 3000 16MB AGP, set at 1600 x 1200 x 32 at 85 Hz: -- $ python glinfo.py Visual-2001-02-27 e:\programs\python20\dlls\cvisual.dll Sat Mar 24 07:12:18 2001 e:\programs\python20\visual\__init__.py Tue Feb 27 21:58:28 2001 e:\programs\python20\visual\graph.py Tue Feb 20 20:35:42 2001 d:\winnt\system32\opengl32.dll Thu Nov 18 10:04:00 1999 OpenGL renderer active. Vendor: Microsoft Corporation Version: 1.1.0 Renderer: GDI Generic Extensions: GL_WIN_swap_hint GL_EXT_bgra GL_EXT_paletted_texture -- Fred Yankowski fr...@On... tel: +1.630.879.1312 Principal Consultant www.OntoSys.com fax: +1.630.879.1370 OntoSys, Inc 38W242 Deerpath Rd, Batavia, IL 60510, USA |
From: Fred Y. <fc...@ac...> - 2001-03-24 14:05:26
|
As I've mentioned before, I've got a VPython/Tk program that hangs on my machine, a dual Pentium3 800 system running WinNT4 SP6. Others have tried to recreate my problem, but none have seen it. I've found that with one small change to the C++ code for VPython itself, I can prevent the deadlock without seeming to cause any other problems. Here's the patch: --- displaylist.cpp.orig Sat Mar 24 07:09:58 2001 +++ displaylist.cpp Sat Mar 24 07:12:11 2001 @@ -143,7 +143,7 @@ void DisplayObject::remove() { if (prev) { - DisplayObject::read_lock P(prev->mtx); + //DisplayObject::read_lock P(prev->mtx); prev->next = next; } if (next) { -- I've simply disabled the first 'read_lock' in the DisplayObject::remove() method. I came to this because my application always seems to hang at the point where it sets the 'visible' attribute of an object to false, and this function is called in response to that action. So, is there something wrong with the locking scheme in cvisual, or am I just hacking around a deeper problem such as some glitch in display drivers? I admit that I know nothing about OpenGL, much less about what locking strategy is necessary. I don't even understand how there can be contention in my application since all work would seem to be done in the Tk thread in response to callback functions associated with the Slider objects. But here too, I don't understand just what threads are involved and how they can interleave. It does bother me that the insert() and remove() methods of DisplayObject both create two locks, not releasing the first one before acquiring the second even though the first seems no longer to be needed when the second is acquired. I tried hacking the code to explicitly delete the first lock as soon as possible, but that version still deadlocks. My problematic application is still at <http://www.ontosys.com/src/residue.py>. Here's the output from glinfo.py, showing my private version of cvisual.dll. My video card is a 3Dfx Voodoo-3 3000 16MB AGP, set at 1600 x 1200 x 32 at 85 Hz: -- $ python glinfo.py Visual-2001-02-27 e:\programs\python20\dlls\cvisual.dll Sat Mar 24 07:12:18 2001 e:\programs\python20\visual\__init__.py Tue Feb 27 21:58:28 2001 e:\programs\python20\visual\graph.py Tue Feb 20 20:35:42 2001 d:\winnt\system32\opengl32.dll Thu Nov 18 10:04:00 1999 OpenGL renderer active. Vendor: Microsoft Corporation Version: 1.1.0 Renderer: GDI Generic Extensions: GL_WIN_swap_hint GL_EXT_bgra GL_EXT_paletted_texture -- Fred Yankowski fr...@On... tel: +1.630.879.1312 Principal Consultant www.OntoSys.com fax: +1.630.879.1370 OntoSys, Inc 38W242 Deerpath Rd, Batavia, IL 60510, USA |
From: David S. <dsc...@vy...> - 2001-03-28 16:48:02
|
> I've simply disabled the first 'read_lock' in the > DisplayObject::remove() method. I came to this because my application > always seems to hang at the point where it sets the 'visible' attribute > of an object to false, and this function is called in response to that > action. > > So, is there something wrong with the locking scheme in cvisual Yes, there is apparently something wrong. First of all, some background on the situation: The code in question is invoked to add or remove an object from the list of visible objects in a display. The reason it employs locking is that the visual renderer, in a separate thread, may be iterating through the same list and displaying the objects. It needs to be protected from encountering deleted objects, objects that are being modified, etc. To understand what's happening, we need to look in several places (1) The insert() and remove() functions themselves (2) DisplayListIterator::next() in displaylist.h - this is the function used by the renderer to walk through the list (3) The code in prim.cpp and arrprim.cpp that calls SetVisible(), which calls insert() and remove() This last is important, because it turns out that that code takes *another* write_lock, on "this". So the sequence for insert goes lock this lock head of list lock first item in list unlock first item unlock head unlock this The sequence for remove goes lock this lock prev unlock prev lock next unlock next unlock this The sequence for next is a little harder to see: lock head lock first unlock head lock second unlock first lock third unlock second lock fourth And so on. We need to watch out for deadlocks between next() in the rendering thread and either insert() or remove() in the other thread. No deadlocks can occur between insert() and next(). Intuitively, this is because both functions are walking down the list "with the grain". Specifically, each one locks a node A only if (1) It holds no other locks or (2) It holds a lock on A.prev and no other nodes If (1) is the case, this thread has *no* locks and so there can't possibly be a deadlock while this thread waits for A. If (2) is the case, and A is locked by the other thread, the other thread cannot be waiting for A.prev because it can't have both A.prev.prev and A locked. However, this logic doesn't work for remove(), which locks this->prev "against the grain". In fact, it's pretty easy to see that there CAN be a deadlock. To look at just one case: remove() next() -------- ------ lock this lock prev lock prev unlock prev.prev lock this (deadlock) This is fairly embarrassing, and also rather puzzling - why is it working for everyone except Fred? Now, Fred's fix removes the lock on prev. It's easy to see that the deadlock is eliminated, since remove() then runs with the grain of the list: lock this lock next unlock next unlock this Alas, this version of remove(), while deadlock proof, is highly suspicious in another way: it is modifying an object (prev) that isn't locked. To give a simple example of how this could cause problems, imagine that next() was written this way: void DisplayListIterator::next() { DisplayObject *old = ptr; ptr = old->next; // (1) if (old->next) // (2) old->next->mtx.sync_lock(); old->mtx.sync_unlock(); } This is a perfectly legitimate way to write it - ptr is locked going in, so all the accesses are to a locked object. However, imagine this code running alongside Fred's remove(). I won't give the details, but remove() could change old->next in between (1) and (2) above, resulting in ptr being set to one object, but a different object being locked! Now, the agonizing thing is that Fred's version probably *does* work as things are implemented now. The prev and next pointers are only used (thank goodness) in a very small number of places, and from a cursory examination all of them will probably work given the memory ordering semantics of Intel SMP. However, I hesitate to leave this land mine around for the next person who tries to work with (or port) the code. I think the best solution is for the locking in remove() to work like this: lock prev unlock prev lock this lock next unlock next unlock this This should be safe because the rendering thread never *modifies* the prev or this pointers, and multiple Python threads are protected from one another by the global interpreter lock. Therefore it isn't necessary to have this locked until we need to actually modify its prev or next pointers. For this to work, the write locks have to be *removed* from the code that calls SetVisible (in prim.cpp and arrprim.cpp, and maybe elsewhere?) insert() and remove() both need to be modified to lock "this" themselves. A plausible implementation of remove() is something like void DisplayObject::remove() { if (prev) { DisplayObject::read_lock P(prev->mtx); prev->next = next; } DisplayObject::write_lock T(this->mtx); if (next) { DisplayObject::read_lock N(next->mtx); next->prev = prev; } prev = next = 0; getObject().decrement_reference_count(); } Unfortunately, I have no more time for this today. > It does bother me that the insert() and remove() methods of > DisplayObject both create two locks, not releasing the first one before > acquiring the second even though the first seems no longer to be needed > when the second is acquired. I tried hacking the code to explicitly > delete the first lock as soon as possible, but that version still > deadlocks. Maybe you are unaware of the syntactic trick being used to implement the locks? Each lock is removed when it goes out of scope (roughly, when you hit a } in the code ): if (prev) { DisplayObject::read_lock P(prev->mtx); prev->next = next; } // <---- unlock P Dave |
From: Fred Y. <fr...@on...> - 2001-03-28 17:45:49
|
David, Thank you for considering this problem. I look forward to getting an updated VPython that fixes the problem without requiring my hack. On Wed, Mar 28, 2001 at 11:52:04AM -0500, David Scherer wrote: > This is fairly embarrassing, and also rather puzzling - why is it working > for everyone except Fred? Such situations no longer surprise me. ;-) > Maybe you are unaware of the syntactic trick being used to implement the > locks? Each lock is removed when it goes out of scope (roughly, when you > hit a } in the code ): I thought that might be the case, but I wasn't sure if C++ guaranteed that behavior -- calling the destructor at exactly the end of the block -- and I didn't completely trust the compiler in any case. -- Fred Yankowski fr...@On... tel: +1.630.879.1312 Principal Consultant www.OntoSys.com fax: +1.630.879.1370 OntoSys, Inc 38W242 Deerpath Rd, Batavia, IL 60510, USA |
From: Markus G. <gr...@iu...> - 2001-04-03 10:42:23
|
Fred Yankowski wrote: > David, > > Thank you for considering this problem. I look forward to getting an > updated VPython that fixes the problem without requiring my hack. > > On Wed, Mar 28, 2001 at 11:52:04AM -0500, David Scherer wrote: > > This is fairly embarrassing, and also rather puzzling - why is it working > > for everyone except Fred? Fred's residue.py also hangs on my machine from time to time. I have a P200, Diamond Viper V550 16MB. Markus |