From: Christoph S. <ste...@ic...> - 2001-11-20 10:38:22
Since Simon's mail and Egon's answer touch on so many nice questions, I thought I'd cut everything into different topics.

> Yes, simple SQL query would not do. Though some clever indexing should
> help. Without any literature read, my first guess would be to have the number
> of atoms as an indices. Thus looking for larger substructures would reduce
> the number of search molecules. And such a first reduction would be done in
> SQL.

This is a very important question which needs to be solved within the CDK database package. I'm extremely happy to have one of the early developers of JChemPaint, Stefan Krause, back in my group. He will work as a full-time developer for the next six months on implementing a prototype version of NMRShiftDB, based on the CDK. NMRShiftDB is an open-access, open-submission, open-source database for organic molecules and their carbon NMR spectra. Since this will be a formal thesis in computer science, we will get very nice documentation :-) But honestly, as a part of this work we will certainly have to focus on solving the substructure search problem in databases.

I'm still not sure how to do the preindexing that Egon mentions. It is clear that you have to analyse the molecule at the time you insert it into the database. One possibility is to have a "basic fragment dictionary". Such a dictionary, composed of a few thousand fragments, has been published, so one could rely on that. The compound to be inserted into the database is scanned for each of the fragments, and a bit in a flag array is set if the fragment is present in the molecule. If you then search for a substructure, you analyse this substructure in the same way and perform a logical "and" operation with the flag array of each database entry. This confines your full isomorphism checks to the subset of database entries for which this operation gives back the flag array of the substructure, i.e. the entries that contain every fragment found in the query.
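The fragment-dictionary screen described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the five-entry `FRAGMENTS` list is a made-up stand-in for a published dictionary of a few thousand fragments, the fragments present in each structure are supplied by hand instead of being detected by a real matcher, and the names `fingerprint` and `screen` are hypothetical. The sketch keeps an entry when every bit set for the query is also set for the entry.

```python
# Toy stand-in for a "basic fragment dictionary" (hypothetical, five entries only).
FRAGMENTS = ["C=O", "O-H", "C-N", "c1ccccc1", "C-Cl"]
BIT = {frag: 1 << i for i, frag in enumerate(FRAGMENTS)}


def fingerprint(present_fragments):
    """Build the flag array as an int: one bit per dictionary fragment found."""
    bits = 0
    for frag in present_fragments:
        if frag in BIT:  # fragments outside the dictionary set no bit
            bits |= BIT[frag]
    return bits


def screen(query_bits, database):
    """Return ids of entries whose flags contain every bit set in the query."""
    return [
        mol_id
        for mol_id, bits in database.items()
        if (bits & query_bits) == query_bits
    ]


# Flags are computed once, at insertion time (fragment lists assumed given).
database = {
    "acetic acid": fingerprint(["C=O", "O-H"]),
    "phenol": fingerprint(["O-H", "c1ccccc1"]),
    "benzamide": fingerprint(["C=O", "C-N", "c1ccccc1"]),
    "chlorobenzene": fingerprint(["c1ccccc1", "C-Cl"]),
}

# The query substructure is analysed in the same way as the database entries.
query = fingerprint(["C=O", "c1ccccc1"])
candidates = screen(query, database)
print(candidates)
```

Only the entries in `candidates` would then go through the full isomorphism check, since passing the screen does not prove that the substructure is present. A query that contains no dictionary fragment has no bits set and therefore passes every entry, so the screen never discards a true hit.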
There are other possibilities like the one that Egon mentions, where you do not rely on a fixed dictionary of fragments but use graph-theoretical descriptors to form the flags for screening the database. This seems to be the more general approach, but it is far less clear to me. Of course you can get very nice descriptors, like Morgan's "extended connectivity index", for an atom in a molecule, which characterize very specifically the surroundings of this very atom in this very molecule, but in a substructure search you only have substructures :-) and the indices will be different. You must thus confine yourself to very small atom environments, which then have lower screening power. Anyway, this has been solved very efficiently, so we should be able to do it :-)

If there are any ideas or even experiences, please post them to the list.

Cheers,

Chris

-- 
Dr. Christoph Steinbeck (http://www.ice.mpg.de/departments/ChemInf)
MPI of Chemical Ecology, Carl-Zeiss-Promenade 10, 07745 Jena, Germany
Tel: +49(0)3641 643644 - Fax: +49(0)3641 643665

What is man but that lofty spirit - that sense of enterprise.
... Kirk, "I, Mudd," stardate 4513.3..