From: Tim P. <ti...@co...> - 2002-02-24 07:30:15
[Guido]
> [I still can't mail directly to python-sets, so I'm
> using a web interface to mail.

I'm still trying to figure out what "can't mail" might mean, and, since
I know you, I'm not going to offer helpful suggestions such as checking
to see that your computer is plugged in <wink>.

> ...
> There's one situation that I had in mind, based on a
> comment that Greg Wilson once made, where the refcount-
> based approach wins over the explicit sticky flag.
>
>     def algorithm(setOfSets):
>         s = set()
>         while s not in setOfSets:
>             "mutate s"
>         return s
>
> Here, the "s not in setOfSets" operation requires s to be
> hashed.  Using the straightforward sticky flag approach,
> this means that each time the test fails and s is mutated
> some more, a copy is made; but this is thrown away
> immediately.  (Greg complained about this problem with
> his current "freeze when hash() is used" strategy.)

Good point, and I had overlooked that.

> Here's a way to avoid the copy in the specific case of
> sets of sets (but not dicts of sets):  the __contains__
> method of sets checks if its argument is a mutable set,
> and then extracts the immutable set value from it *without*
> setting the shared flag.

Fine idea!

> There are some concerns when another thread might have
> access to the same mutable set object, and also when
> the comparison of some element might be able to access it;
> maybe the shared flag should be incremented at the start
> of __contains__ and decremented afterwards, but a lock is
> probably still needed.

That's all par for the course for mutating operations that can't exploit
the CPython GIL to ensure atomicity.  Greg's current Set prototype
ignores all that, i.e. when multiple threads mutate the same set at the
same time, there's no telling what may happen short of studying the
implementations of the methods involved.  Even (conceptually)
non-mutating operations can blow chunks then.
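To make the idea concrete, here's a toy modern-Python sketch of it (this
is *not* Greg's prototype -- the class, the `_shared` flag, and the
`freeze()` method are all hypothetical names, and it ignores the thread
concerns quoted above).  The point is just the `__contains__` trick:
peeking at a candidate mutable set's value without marking it shared, so
a failed membership test doesn't force a copy on the next mutation.

```python
class Set:
    """Toy copy-on-write mutable set.  All names here are hypothetical."""

    def __init__(self, elements=()):
        self._data = set(elements)
        self._shared = False  # True once a frozen snapshot of us escaped

    def freeze(self):
        # Hand out an immutable snapshot and mark ourselves as shared.
        self._shared = True
        return frozenset(self._data)

    def add(self, elt):
        if self._shared:
            # Someone may still hold our old snapshot: copy before mutating.
            self._data = set(self._data)
            self._shared = False
        if isinstance(elt, Set):
            elt = elt.freeze()  # containers keep the element's frozen value
        self._data.add(elt)

    def __contains__(self, candidate):
        # The optimization quoted above: extract a mutable Set's value
        # *without* setting its shared flag, so a failed test doesn't
        # force the candidate to copy itself on its next mutation.
        if isinstance(candidate, Set):
            candidate = frozenset(candidate._data)  # note: no freeze()
        return candidate in self._data


# The algorithm() pattern from the quote: repeated failed tests cost
# nothing extra, because s is never marked shared by __contains__.
sos = Set([frozenset({1, 2})])
s = Set()
hit = s in sos      # False, and s stays unshared
s.add(1)            # no copy forced by the failed test
s.add(2)
found = s in sos    # True now
```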
For example, here's the current Set.__hash__:

    def __hash__(self):
        """Calculate hash code for set by xor'ing hash codes of set
        elements.  This algorithm ensures that the hash code does not
        depend on the order in which elements are added to the set."""

        # If set already has hashcode, the set has been frozen, so
        # code is still valid.
        if self.hashcode is not None:
            return self.hashcode

        # Combine hash codes of set elements to produce set's hash code.
        self.hashcode = 0
        for elt in self:
            self.hashcode ^= hash(elt)
        return self.hashcode

If thread A and thread B both call this on a Set S before S has had its
hashcode computed, amusing things that can happen include that the two
streams of xors cancel each other out completely (y^y == 0 regardless
of y), leading to a bogus self.hashcode == 0 result.

Set classes I've written in Python for my own use didn't bother with
thread safety either.  It's a tradeoff, and sometimes the standard
Python library also knowingly trades away thread sanity for speed (like
random.random()).  Then it's up to the user to supply their own locking
around potentially unsafe operations, if their program cares.  Most
programs don't, so it seems like a win.

Another example, worth spelling out because it's unclear whether it's an
intentional hole:  the standard bisect.insort() isn't sane in the
presence of threads either.  It can't lead to an insane object, but if
multiple threads do insort() on the same list at the same time, the
resulting list may not be sorted.  I'll attach a little program that
shows this (it prints at least three "Oops!" lines every time on
Windows; on other OSes you may need to boost N and/or NTHREAD, and maybe
a lot -- Windows is unusually happy to switch threads frequently).  So
is that "a bug" or "a feature"?  I've never had a problem with it in
real life, and I'd abhor slowing insort() just because some yahoo might.
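The cancellation failure mode can be shown deterministically, without
threads, by spelling out the worst-case interleaving by hand (a sketch
in modern Python; the explicit alternation stands in for an unlucky
thread switch between the two concurrent __hash__ calls):

```python
# Threads A and B both start Set.__hash__ on the same set before either
# finishes, so both xor every element into the shared self.hashcode.
elements = [3, 5, 9]

# Uninterleaved: the hash a single caller would compute.
expected = 0
for elt in elements:
    expected ^= hash(elt)

# Worst-case interleaving: A and B alternate element by element, so each
# element gets xor'ed in twice -- and y ^ y == 0, so everything cancels.
shared = 0
for elt in elements:
    shared ^= hash(elt)  # thread A's turn
    shared ^= hash(elt)  # thread B sneaks in before A finishes
# shared is now 0, a bogus hashcode that both threads then cache.
```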
I confess I make this tradeoff in the same direction for virtually all
data structures I implement in Python:  it's great that operations on
the builtin types act atomic, but for non-thread-specific services coded
in Python I usually consider thread safety to be my client code's
problem.  Indeed, it seems like almost everyone does.  Common Java
practice differs (perhaps because of direct language support for
"synchronized" methods).  So *do* Set ops have to be thread-sane on
their own?  I'm uncomfortable with my gut reactions, because I find I
want to give a different answer depending on which language is used to
code the Set type:  in C yes, in Python no, in Java probably.  Yuck.


    import thread
    import bisect

    N = 10
    NTHREAD = 3

    a = []

    def f(mutex):
        for i in range(N):
            bisect.insort(a, i)
        mutex.release()

    locks = [thread.allocate_lock() for i in range(NTHREAD)]
    for lock in locks:
        lock.acquire()
        thread.start_new_thread(f, (lock,))
    for lock in locks:
        lock.acquire()

    print "len(a)", len(a)
    print a
    for i in range(len(a)-1):
        if a[i] > a[i+1]:
            print 'Oops!  At index', i, a[i], '>', a[i+1]
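For completeness, here's what "the user supplies their own locking"
looks like for the attached program, sketched in modern Python (the
threading module and with-statement replace the old thread module; the
lock name and counts are arbitrary):

```python
import bisect
import threading

N = 1000
NTHREAD = 4

a = []
a_lock = threading.Lock()  # client-supplied lock guarding the list

def f():
    for i in range(N):
        # Serialize the whole search-then-insert sequence; insort's
        # bisection and its list.insert must not be interleaved.
        with a_lock:
            bisect.insort(a, i)

threads = [threading.Thread(target=f) for _ in range(NTHREAD)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With the lock, the list always comes out sorted and complete.
```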