From: Christoph S. <er...@do...> - 2007-07-15 13:32:44
Rajarshi,

thanks for the interesting email. I've only had time to skim it, but I
would like to add my two cents.

People here are also working on a conformer generator, and we've opted
for an internal coordinate representation which is stored separately
from the AtomContainer. I certainly agree that storing lists of
AtomContainers is not a good idea. As I said, our focus is currently on
the generation of conformers, and we found a reasonable way of storing
the sets of internal coordinates. We'll switch to whatever scheme is
agreed on as soon as the "CDK way of doing it" has been sorted out. It
looks to me that adding a way of handling internal coordinates to your
concept would be pretty much what we need.

One thing that came to my mind from an architectural point of view is
that we seem to lose the natural coupling between the chemical concept
(an atom in this case) and the coordinates of that concept. We rely on
the List<Point3D> having the same order as the Atom array in the
attached atom container. This is a bit of a "not so nice aspect" :-)
but can easily be fixed. One could imagine a hashtable with the atom
object as key and the List<List<Point3D>> as the value.

Cheers,

Chris

Rajarshi Guha wrote:
> Hi, in relation to another project I have been thinking about handling
> conformer data within the CDK. I've written some classes to handle
> them and have done some performance testing.
>
> Apologies in advance for a long mail, but comments would be appreciated.
>
> 1. Conformer data: Conformer data can be stored in many ways. I
> generate confs using OE's Omega which gives me a compressed OE binary
> file. To handle this outside OEChem I can convert it to SDF, where
> each conformer for a given molecule has the same title. I think this
> is a general way to handle most conformer data files.
>
> 2.
> Storing conformers: Since a set of conformers for a molecule is
> basically a set of identical molecules differing only in their 3D
> coordinates, one way to handle such data in the current CDK is to use
> IMoleculeSet/IChemModel.
>
> So all conformers for a given molecule are stored in an IMoleculeSet
> which is added to an IChemModel.
>
> Then if we have multiple molecules, we can store each IChemModel in an
> IChemSequence.
>
> This is fine but horrifically wasteful of memory - there is no need
> to store the connection table and related information for all
> conformers of a given molecule.
>
> So my ConformerContainer class improves upon this by storing a single
> IAtomContainer, which is initially the first conformer, so it has the
> connection table and related details. Next, it has a List of atom
> coordinate sets. There are various ways to do this:
> List<List<Point3d>> or List<Point3d[]>
>
> So now, a given molecule and its conformers are represented by a
> ConformerContainer object which stores *1* IAtomContainer object.
>
> I have modeled this class on a List by implementing the List
> interface. I believe that this is better than subclassing
> IAtomContainer since a collection of conformers is really just a list
> of conformers. In general, one is interested in looping over all
> conformers or accessing a single conformer. So common operations will
> be:
>
>     for (IAtomContainer conf : conformerContainer) {
>         // do something with this conf
>     }
>
> or
>
>     IAtomContainer conf = conformerContainer.get(10);
>
> So rather than looking like a molecular object, the class looks like
> a list of molecules. Each time a conformer is requested, the class
> will internally populate the coordinates of each atom in the
> IAtomContainer object and return the resultant molecule.
>
> 3. Reading conformers: As noted above, I've assumed that conformers
> are stored in SDF format, such that all conformers for a given
> molecule have the same title.
> Another way to check for conformers is by doing graph isomorphism,
> which is more rigorous but also more time consuming. For now, I have
> coded it so that it uses the titles to check for conformers.
>
> Now, conformer files can get large. So I think that a common
> operation will be to iterate over conformer collections, rather than
> slurp them in all at once. So I wrapped IteratingMDLReader such that
> rather than receiving an IMolecule at each iteration the user receives
> a ConformerContainer object containing the conformers for the current
> molecule. So you can do something like:
>
>     IteratingMDLConformerReader reader = new IteratingMDLConformerReader(
>         new FileReader(new File(filename)),
>         DefaultChemObjectBuilder.getInstance());
>
>     while (reader.hasNext()) {
>         ConformerContainer cc = (ConformerContainer) reader.next();
>         System.out.println("Base title = " + cc.getTitle() + " and nconf = " + cc.size());
>     }
>
> Now obviously, you could easily get all conformers for all molecules
> by collecting all the individual ConformerContainer objects.
>
> Performance Testing:
> --------------------
>
> OK, so this allows us to easily work with conformer data files in an
> intuitive manner. But is the extra class worth the effort? The
> results below indicate that it definitely is!
>
> For testing I looked at the following scenarios:
>
> V1. A ConformerContainer class where the coordinates for the
> conformers are stored as List<List<Point3d>>
>
> V2. A ConformerContainer class where the coordinates are stored as
> List<Point3d[]>
>
> V3. Using IMoleculeSet/IChemModel to represent the conformers for a
> given molecule
>
> For data, I used 50 molecules and generated conformers using Omega
> (2.1.0) with the default settings. This gave me 12,026 conformers in
> total with an average of 241 conformers per molecule.
>
> So given the conformer file, I looped over the conformers for each
> molecule using IteratingMDLConformerReader and at each iteration
> determined the memory consumption of the ConformerContainer or
> IChemModel, as the case may be.
>
> All testing was done with -Xms256m and -Xmx256m.
>
> The system details: OS X 10.4.10, JDK 1.5.0_07, latest CDK from SVN.
> To evaluate memory usage I used the code from
> http://java.sun.com/docs/books/performance/1st_edition/html/JPRAMFootprint.fm.html#11055
> which is not going to be very accurate, but gives a good idea for
> comparison purposes.
>
> Implementation    Average memory used (KB)
> V1                1951
> V2                1941
> V3                7146
>
> So not storing all the conformers as IAtomContainer objects is a big
> improvement. Not surprisingly, if I wanted to load all conformers for
> all molecules into memory, I can do it with either V1 or V2, but it
> will fail (out of heap space) if I try to use the IChemModel approach
> (or simply load 12K molecules into an IAtomContainer[]).
>
> -------------------------------------------------------------------
> Rajarshi Guha <rg...@in...>
> GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
> -------------------------------------------------------------------
> I live in my own little world...
> but it's OK, they like me there
>
> _______________________________________________
> Cdk-devel mailing list
> Cdk...@li...
> https://lists.sourceforge.net/lists/listinfo/cdk-devel

--
PD Dr. Christoph Steinbeck
Lecturer in Chemoinformatics
Univ. Tuebingen, WSI-RA, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071-29-78978 Fax: (+49/0) 7071-29-5091

What is man but that lofty spirit - that sense of enterprise.
... Kirk, "I, Mudd," stardate 4513.3..
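
A minimal sketch of one possible reading of the hashtable idea in the reply
above, for illustration only: coordinates are keyed by the atom object, so
nothing depends on the order of a coordinate list matching the order of the
atom array. SketchAtom and KeyedConformerList are hypothetical names standing
in for a CDK atom and a ConformerContainer-like class; they are not CDK API.
Plain double[] triples replace Point3d, and an IdentityHashMap plays the role
of the hashtable; both are assumptions made to keep the example
self-contained.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a CDK atom; only what the sketch needs. */
final class SketchAtom {
    final String symbol;
    double x, y, z;

    SketchAtom(String symbol) {
        this.symbol = symbol;
    }
}

/**
 * Hypothetical list-like conformer store: one shared atom array (standing in
 * for the single IAtomContainer) plus, for each atom, its coordinates in
 * every conformer, looked up by atom identity rather than by position.
 */
final class KeyedConformerList extends AbstractList<SketchAtom[]> {
    private final SketchAtom[] atoms;
    private final Map<SketchAtom, List<double[]>> coords = new IdentityHashMap<>();
    private int nConformers = 0;

    KeyedConformerList(SketchAtom[] atoms) {
        this.atoms = atoms;
        for (SketchAtom a : atoms) {
            coords.put(a, new ArrayList<>());
        }
    }

    /** Adds one conformer; xyz must map every atom to its {x, y, z}. */
    void addConformer(Map<SketchAtom, double[]> xyz) {
        // Validate first so a bad conformer never leaves the store half-updated.
        for (SketchAtom a : atoms) {
            double[] p = xyz.get(a);
            if (p == null || p.length != 3) {
                throw new IllegalArgumentException("missing or malformed coordinate for " + a.symbol);
            }
        }
        for (SketchAtom a : atoms) {
            coords.get(a).add(xyz.get(a).clone());
        }
        nConformers++;
    }

    /** Writes conformer `index` into the shared atoms and returns them. */
    @Override
    public SketchAtom[] get(int index) {
        if (index < 0 || index >= nConformers) {
            throw new IndexOutOfBoundsException("conformer " + index);
        }
        for (SketchAtom a : atoms) {
            double[] p = coords.get(a).get(index);
            a.x = p[0];
            a.y = p[1];
            a.z = p[2];
        }
        return atoms;
    }

    @Override
    public int size() {
        return nConformers;
    }
}
```

As in the design described in the quoted mail, get(i) overwrites the
coordinates of the shared atoms and returns the same array every time, so a
caller that needs two conformers at once has to copy the first before
requesting the second.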