From: Tim P. <ti...@co...> - 2002-02-24 07:30:15
[Guido]
> [I still can't mail directly to python-sets, so I'm
> using a web interface to mail.

I'm still trying to figure out what "can't mail" might mean, and, since
I know you, I'm not going to offer helpful suggestions such as checking
to see that your computer is plugged in <wink>.

> ...
> There's one situation that I had in mind, based on a
> comment that Greg Wilson once made, where the refcount-
> based approach wins over the explicit sticky flag.
>
>     def algorithm(setOfSets):
>         s = set()
>         while s not in setOfSets:
>             "mutate s"
>         return s
>
> Here, the "s not in setOfSets" operation requires s to be
> hashed.  Using the straightforward sticky flag approach,
> this means that each time the test fails and s is mutated
> some more, a copy is made; but this is thrown away
> immediately.  (Greg complained about this problem with
> his current "freeze when hash() is used" strategy.)

Good point, and I had overlooked that.

> Here's a way to avoid the copy in the specific case of
> sets of sets (but not dicts of sets):  the __contains__
> method of sets checks if its argument is a mutable set,
> and then extracts the immutable set value from it *without*
> setting the shared flag.

Fine idea!

> There are some concerns when another thread might have
> access to the same mutable set object, and also when
> the comparison of some element might be able to access it;
> maybe the shared flag should be incremented at the start
> of __contains__ and decremented afterwards, but a lock is
> probably still needed.

That's all par for the course for mutating operations that can't exploit
the CPython GIL to ensure atomicity.  Greg's current Set prototype
ignores all that, i.e. when multiple threads mutate the same set at the
same time, there's no telling what may happen short of studying the
implementations of the methods involved.  Even (conceptually)
non-mutating operations can blow chunks then.
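To make the idea concrete, here's a toy modern-Python sketch of it (this
is *not* Greg's prototype -- the class, the `_shared` flag, and the
`freeze()` method are all hypothetical names, and it ignores the thread
concerns quoted above).  The point is just the `__contains__` trick:
peeking at a candidate mutable set's value without marking it shared, so
a failed membership test doesn't force a copy on the next mutation.

```python
class Set:
    """Toy copy-on-write mutable set.  All names here are hypothetical."""

    def __init__(self, elements=()):
        self._data = set(elements)
        self._shared = False  # True once a frozen snapshot of us escaped

    def freeze(self):
        # Hand out an immutable snapshot and mark ourselves as shared.
        self._shared = True
        return frozenset(self._data)

    def add(self, elt):
        if self._shared:
            # Someone may still hold our old snapshot: copy before mutating.
            self._data = set(self._data)
            self._shared = False
        if isinstance(elt, Set):
            elt = elt.freeze()  # containers keep the element's frozen value
        self._data.add(elt)

    def __contains__(self, candidate):
        # The optimization quoted above: extract a mutable Set's value
        # *without* setting its shared flag, so a failed test doesn't
        # force the candidate to copy itself on its next mutation.
        if isinstance(candidate, Set):
            candidate = frozenset(candidate._data)  # note: no freeze()
        return candidate in self._data


# The algorithm() pattern from the quote: repeated failed tests cost
# nothing extra, because s is never marked shared by __contains__.
sos = Set([frozenset({1, 2})])
s = Set()
hit = s in sos      # False, and s stays unshared
s.add(1)            # no copy forced by the failed test
s.add(2)
found = s in sos    # True now
```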
For example, here's the current Set.__hash__:

    def __hash__(self):
        """Calculate hash code for set by xor'ing hash codes of set
        elements.  This algorithm ensures that the hash code does not
        depend on the order in which elements are added to the set."""

        # If set already has hashcode, the set has been frozen, so
        # code is still valid.
        if self.hashcode is not None:
            return self.hashcode

        # Combine hash codes of set elements to produce set's hash code.
        self.hashcode = 0
        for elt in self:
            self.hashcode ^= hash(elt)
        return self.hashcode

If thread A and thread B both call this on a Set S before S has had its
hashcode computed, amusing things that can happen include that the two
streams of xors cancel each other out completely (y^y == 0 regardless
of y), leading to a bogus self.hashcode == 0 result.

Set classes I've written in Python for my own use didn't bother with
thread safety either.  It's a tradeoff, and sometimes the standard
Python library also knowingly trades away thread sanity for speed (like
random.random()).  Then it's up to the user to supply their own locking
around potentially unsafe operations, if their program cares.  Most
programs don't, so it seems like a win.

Another example, worth spelling out because it's unclear whether it's an
intentional hole:  the standard bisect.insort() isn't sane in the
presence of threads either.  It can't lead to an insane object, but if
multiple threads do insort() on the same list at the same time, the
resulting list may not be sorted.  I'll attach a little program that
shows this (it prints at least three "Oops!" lines every time on
Windows; on other OSes you may need to boost N and/or NTHREAD, and maybe
a lot -- Windows is unusually happy to switch threads frequently).  So
is that "a bug" or "a feature"?  I've never had a problem with it in
real life, and I'd abhor slowing insort() just because some yahoo might.
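The cancellation failure mode can be shown deterministically, without
threads, by spelling out the worst-case interleaving by hand (a sketch
in modern Python; the explicit alternation stands in for an unlucky
thread switch between the two concurrent __hash__ calls):

```python
# Threads A and B both start Set.__hash__ on the same set before either
# finishes, so both xor every element into the shared self.hashcode.
elements = [3, 5, 9]

# Uninterleaved: the hash a single caller would compute.
expected = 0
for elt in elements:
    expected ^= hash(elt)

# Worst-case interleaving: A and B alternate element by element, so each
# element gets xor'ed in twice -- and y ^ y == 0, so everything cancels.
shared = 0
for elt in elements:
    shared ^= hash(elt)  # thread A's turn
    shared ^= hash(elt)  # thread B sneaks in before A finishes
# shared is now 0, a bogus hashcode that both threads then cache.
```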
I confess I make this tradeoff in the same direction for virtually all
data structures I implement in Python:  it's great that operations on
the builtin types act atomic, but for non-thread-specific services coded
in Python I usually consider thread safety to be my client code's
problem.  Indeed, it seems like almost everyone does.  Common Java
practice differs (perhaps because of direct language support for
"synchronized" methods).  So *do* Set ops have to be thread-sane on
their own?  I'm uncomfortable with my gut reactions, because I find I
want to give a different answer depending on which language is used to
code the Set type:  in C yes, in Python no, in Java probably.  Yuck.


    import thread
    import bisect

    N = 10
    NTHREAD = 3

    a = []

    def f(mutex):
        for i in range(N):
            bisect.insort(a, i)
        mutex.release()

    locks = [thread.allocate_lock() for i in range(NTHREAD)]
    for lock in locks:
        lock.acquire()
        thread.start_new_thread(f, (lock,))
    for lock in locks:
        lock.acquire()

    print "len(a)", len(a)
    print a
    for i in range(len(a)-1):
        if a[i] > a[i+1]:
            print 'Oops!  At index', i, a[i], '>', a[i+1]
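For completeness, here's what "the user supplies their own locking"
looks like for the attached program, sketched in modern Python (the
threading module and with-statement replace the old thread module; the
lock name and counts are arbitrary):

```python
import bisect
import threading

N = 1000
NTHREAD = 4

a = []
a_lock = threading.Lock()  # client-supplied lock guarding the list

def f():
    for i in range(N):
        # Serialize the whole search-then-insert sequence; insort's
        # bisection and its list.insert must not be interleaved.
        with a_lock:
            bisect.insort(a, i)

threads = [threading.Thread(target=f) for _ in range(NTHREAD)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With the lock, the list always comes out sorted and complete.
```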