From: Rob S. <rsc...@xs...> - 2006-10-03 11:40:01
|
Hi, I noticed the recent changes to the IChemContainer interface. In this change the getAtoms()::List method was replaced by atoms()::Iterator. The idea to hide the atom collection for the outside world is alright, but I think there are two problems with the current solution: 1) Returning an Iterator breaks the possibility to use the Java 5 for-loop, because this expects an Iterable (an object that can produce an Iterator) or array. 2) Most Java iterator implementations are backed by their parent list, so removing items in the container also means the iterator won't touch them. This might cause a problem in a multithreaded environment. So you don't want to return the complete list, but you also don't want to return an iterator. I think the best solution is to return an array (or other immutable collection): public IAtom[] atoms() { return atomsList.toArray(atomsList.size()); // or java <5 equiv. } With kind regards, Rob Schellhorn |
From: Kai H. <Kai.Hartmann@Uni-Koeln.De> - 2006-10-04 12:30:49
|
Hi Rob, hi all. > I noticed the recent changes to the IChemContainer interface. In this > change the getAtoms()::List method was replaced by atoms()::Iterator. > The idea to hide the atom collection for the outside world is alright, > but I think there are two problems with the current solution: > > 1) Returning an Iterator breaks the possibility to use the Java 5 > for-loop, because this expects an Iterable (an object that can produce > an Iterator) or array. Good point. At least the classes that hold one container can implement Iterable. For AtomContainer, this is indeed a problem. > 2) Most Java iterator implementations are backed by their parent list, > so removing items in the container also means the iterator won't touch > them. This might cause a problem in a multithreaded environment. I'm not sure whether this is a really problem. Concurrent access from different threads as in the case you described will probably break many parts of cdk. > So you don't want to return the complete list, but you also don't want > to return an iterator. I think the best solution is to return an array > (or other immutable collection): > > public IAtom[] atoms() { > return atomsList.toArray(atomsList.size()); // or java <5 equiv. > } This is the way it was. Should we rethink our redesign? This would mean creating a copy of the internal arrays and pass the copy to the user. We wanted to avoid this. Thanks, Kai |
From: Rob S. <rsc...@xs...> - 2006-10-05 13:52:35
|
Kai Hartmann schreef: > Hi Rob, hi all. > >> >> 1) Returning an Iterator breaks the possibility to use the Java 5 >> for-loop, because this expects an Iterable (an object that can produce >> an Iterator) or array. > > Good point. At least the classes that hold one container can implement > Iterable. For AtomContainer, this is indeed a problem. > >> 2) Most Java iterator implementations are backed by their parent list, >> so removing items in the container also means the iterator won't touch >> them. This might cause a problem in a multithreaded environment. > > I'm not sure whether this is a really problem. Concurrent access from > different threads as in the case you described will probably break many > parts of cdk. The problem is that you can't make it thread safe, even if you want to. To achieve this, you need to syncronize with the internal atom collection in the IAtomContainer, which isn't available for outside classes. To solve this you might consider to wrap that collection using Collection.synchronized*() for instance. I *suspect* this will also synchronize the returned Iterator. >> So you don't want to return the complete list, but you also don't want >> to return an iterator. I think the best solution is to return an array >> (or other immutable collection): >> >> public IAtom[] atoms() { >> return atomsList.toArray(atomsList.size()); // or java <5 equiv. >> } > > This is the way it was. Should we rethink our redesign? This would mean > creating a copy of the internal arrays and pass the copy to the user. We > wanted to avoid this. Why do you want to avoid copying of such array's? The toArray() does not create a deep copy, so it actually only creates an array with references to existing object and is thus not that expensive. For both comments, what about this alternative: public class AtomContainer { private final List<IAtom> atoms = new ArrayList<IAtom>(); private final List<IBond> bond = new ArrayList<IBond>(); protected List<IAtom> getAtomList() { // For package use return atoms; } public Collection<IAtom> getAtoms() { /* * Unsupported operations (add, remove) throw by default an * UnsupportedOperationException, so the returned collection is * immutable. * Note that this construction is still not thread-safe. See 2) above.. */ return new AbstractCollection<IAtom>() { public Iterator<IAtom> iterator() { return atoms.iterator(); } public int size() { return atoms.size(); } }; } // Same for bonds } Regards, Rob |
From: Kai H. <Kai.Hartmann@Uni-Koeln.De> - 2006-10-06 10:03:10
|
Hi all. >>> 2) Most Java iterator implementations are backed by their parent >>> list, so removing items in the container also means the iterator >>> won't touch them. This might cause a problem in a multithreaded >>> environment. >> >> I'm not sure whether this is a really problem. Concurrent access from >> different threads as in the case you described will probably break many >> parts of cdk. > The problem is that you can't make it thread safe, even if you want to. > To achieve this, you need to syncronize with the internal atom > collection in the IAtomContainer, which isn't available for outside > classes. To solve this you might consider to wrap that collection using > Collection.synchronized*() for instance. I *suspect* this will also > synchronize the returned Iterator. You are right about the impossibility to create a thread-safe AtomContainer (and other cdk classes) at this moment - but this is not a new problem. To my knowledge, the cdk data classes are not designed to be thread-safe. It would be possible to create a thread-safe implementation of the interface, e.g. IAtomContainer, however. >>> So you don't want to return the complete list, but you also don't >>> want to return an iterator. I think the best solution is to return an >>> array (or other immutable collection): >>> >>> public IAtom[] atoms() { >>> return atomsList.toArray(atomsList.size()); // or java <5 equiv. >>> } >> >> This is the way it was. Should we rethink our redesign? This would mean >> creating a copy of the internal arrays and pass the copy to the user. We >> wanted to avoid this. > Why do you want to avoid copying of such array's? The toArray() does not > create a deep copy, so it actually only creates an array with references > to existing object and is thus not that expensive. Well, there have been quite a number of misuses, even within the cdk, because of the possibility to create an array. To give you an example: for (int k = 0; k < atomContainer.getAtoms().length;k++) { IAtom atom = atomContainer.getAtoms()[k]; } The advantage of *not* returning a copy is not only performance, but also the direct access to the data inside (including remove, add, etc. functionality), as the topic of this mail also suggests. Due to the recent comments, however, maybe it is a good idea to provide a method to get a shallow copy in form of an array. What do the other people on this list think? > For both comments, what about this alternative: > > public class AtomContainer { > > private final List<IAtom> atoms = new ArrayList<IAtom>(); > private final List<IBond> bond = new ArrayList<IBond>(); > > protected List<IAtom> getAtomList() { > // For package use > return atoms; > } > > public Collection<IAtom> getAtoms() { > /* > * Unsupported operations (add, remove) throw by default an > * UnsupportedOperationException, so the returned collection is > * immutable. > * Note that this construction is still not thread-safe. See 2) > above.. > */ > return new AbstractCollection<IAtom>() { > > public Iterator<IAtom> iterator() { > return atoms.iterator(); > } > > public int size() { > return atoms.size(); > } > }; > } > > // Same for bonds > } > This doesn't look bad either. We could keep the current Iterator implementations and wrap it inside the returned collection. I just wonder about the performance of this construct, as another temporary object will be created. Does someone have time to test these three approaches wrt runtime? - return the copied array and go through it - return the iterator and go through it - return the collection and go through it All with AtomContainer objects of different size (10, 100, 1000 atoms) and a warm-up phase for the JIT compiler. Any volunteer? Kai |
From: Kai H. <Kai.Hartmann@Uni-Koeln.De> - 2006-10-06 21:01:12
|
Rob Schellhorn schrieb: > Kai Hartmann wrote: >> Does someone have time to test these three >> approaches wrt runtime? >> >> - return the copied array and go through it >> - return the iterator and go through it >> - return the collection and go through it >> >> All with AtomContainer objects of different size (10, 100, 1000 atoms) >> and a warm-up phase for the JIT compiler. >> >> Any volunteer? >> >> Kai > I'll run a few tests this weekend. I will do some timing and memory > testing... a good change to become familiar with the TPTP > (http://www.eclipse.org/tptp/) plugin. I'll posts my results asap, > probably begin next week. Great! One more thing: it has to work with java 1.4, so it would be good if the test code doesn't use 1.5 features. Thanks for your comments and becoming involved, Kai |
From: Rob S. <rsc...@xs...> - 2006-10-09 08:10:14
|
Kai Hartmann wrote: > Rob Schellhorn schrieb: >> I'll run a few tests this weekend. I will do some timing and memory >> testing... a good change to become familiar with the TPTP >> (http://www.eclipse.org/tptp/) plugin. I'll posts my results asap, >> probably begin next week. > One more thing: it has to work with java 1.4, so it would be good if the > test code doesn't use 1.5 features. Test set-up: Java 1.5.0.9 Eclipse 3.2.1 TPTP 4.2.0 JUnit 4.0 runner Windows XP SP2 Test source: http://schellhorn.nl/iterationtest/source.jar Results: http://schellhorn.nl/iterationtest/results.xls (MS Excel, avg of 5 runs) Results2: http://schellhorn.nl/iterationtest/result.html (test for n=10000) I have compared the time to create a copy of a list and the time to iterate over the data structure. As well as the memory usage for lists containing 100, 1000 and 10000 items. Note that the timings are measured with the TPTP profiler, running without this profiler is a *lot* faster (+/- factor 100). Some observations: * An array consumes 4*n + 16 bytes of memory, so for n=10000 the memory usage is still only 40kb. On an 64bit OS this will probably be 8*n+c. * Iterators and iterable objects returned by ArrayList consume a constant amount of memory. So they *must* retrieve their data from the originating list. * Iterating over an array is way faster than iterating over any other data structure! This effect is so strong that even for n=100 an array is the fastest construction, although the list must be copied into the array first. * The new for loop in combination with lists is about twice as slow as a for(int i = 0; i < size; i++) or while() construction! I *suspect* this is an error in the JVM, because I don't see any reason why this should take longer -> bad optimalisation. * The list returned by Collections.unmodifiableList() is slower than creating an inline list without modification methods, in fact this is the slowest construction. Perhaps someone can run these tests under Linux and see whether the results are about the same? Rob |