[Cdk-devel] CDK Database engine for molecules

Since Simon's mail and Egon's answer touch on so many nice questions,
I thought I would split everything into separate topics.

> Yes, simple SQL query would not do. Though some clever indexing should
> help. Without any literature read, my first guess would be to have the number
> of atoms as an index. Thus looking for larger substructures would reduce
> the number of search molecules. And such a first reduction would be done in
> SQL.

This is a very important question which needs to be solved within the
CDK database package. I'm extremely happy to have one of the early
developers of JChemPaint, Stefan Krause, back in my group. He will work
as a full-time developer for the next six months on implementing a
prototype version of NMRShiftDB, based on the CDK. NMRShiftDB is an
open-access, open-submission, open-source database for organic molecules
and their carbon NMR spectra. Since this will be a formal thesis in
computer science, we will get very nice documentation :-)

But honestly, as a part of this work we will certainly have to focus on
solving the substructure search problem in databases. I'm still not sure
how to do the preindexing that Egon mentions. 

It is clear that you have to analyse the molecule at the time you insert
it into the database. One possibility is to have a "basic fragment
dictionary". Such a dictionary, composed of a few thousand fragments,
has been published, so one could rely on that. The compound to be
inserted into the database is scanned for the existence of each of the
fragments, and a bit in a flag array is set if the fragment is present
in the molecule. If you then search for a substructure, you analyse this
substructure in the same way and perform a logical "and" operation
with the flag array of each database entry. This confines your full
isomorphism checks to the subset of database entries for which this
operation gives back the flag array of the query, i.e. entries that
contain every fragment found in the query.
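
To make the screening step concrete, here is a minimal Java sketch. It
is only an illustration under assumptions: the five-entry DICTIONARY,
the class name FragmentScreen and the method names are hypothetical, and
the fragments found in each molecule are simply passed in as strings
(detecting them would itself need substructure matching, which is left
out here). An entry passes the screen when (query AND entry) == query,
so it has every fragment the query has.

```java
import java.util.BitSet;
import java.util.List;
import java.util.Set;

final class FragmentScreen {
    // Hypothetical stand-in for a published dictionary of a few thousand fragments.
    static final List<String> DICTIONARY =
            List.of("C=O", "O-H", "C-N", "c1ccccc1", "C-Cl");

    // Sets bit i when dictionary fragment i is listed as present in the molecule.
    static BitSet fingerprint(Set<String> fragmentsPresent) {
        BitSet bits = new BitSet(DICTIONARY.size());
        for (int i = 0; i < DICTIONARY.size(); i++) {
            if (fragmentsPresent.contains(DICTIONARY.get(i))) {
                bits.set(i);
            }
        }
        return bits;
    }

    // True if every bit set for the query is also set for the entry.
    // Only entries that pass need the expensive isomorphism check.
    static boolean passesScreen(BitSet query, BitSet entry) {
        BitSet common = (BitSet) query.clone();
        common.and(entry);
        return common.equals(query);
    }

    public static void main(String[] args) {
        BitSet benzoicAcid = fingerprint(Set.of("C=O", "O-H", "c1ccccc1"));
        BitSet chlorobenzene = fingerprint(Set.of("c1ccccc1", "C-Cl"));
        BitSet query = fingerprint(Set.of("C=O", "O-H"));

        System.out.println(passesScreen(query, benzoicAcid));   // true
        System.out.println(passesScreen(query, chlorobenzene)); // false
    }
}
```

The flag arrays could be stored next to each entry at insert time, so
the screen itself never has to touch the molecular graph.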

There are other possibilities like the one that Egon mentions, where you
do not rely on a fixed dictionary of fragments but use graph theoretical
descriptors to form the flags for screening the database. This seems to
be the more general approach, but it is far less clear to me. Of course
you can get very nice descriptors, like Morgan's "extended connectivity
index", for an atom in a molecule, which characterize very specifically
the surroundings of this very atom in this very molecule, but in a
substructure search you only have substructures :-) and the indices will
be different. You must thus confine yourself to very small atom
environments, which then have lower screening power.
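
A small Java sketch of why the indices differ, again under assumptions:
the class and method names are hypothetical, molecules are plain
adjacency lists without hydrogens, and the iteration count is fixed by
hand, whereas Morgan's procedure iterates until the number of distinct
values stops growing. Each atom starts with its degree and is then
replaced by the sum of its neighbours' values.

```java
import java.util.Arrays;

final class ExtendedConnectivity {
    // adjacency[i] lists the neighbours of atom i (hydrogens suppressed).
    static int[] values(int[][] adjacency, int iterations) {
        int n = adjacency.length;
        int[] ec = new int[n];
        for (int i = 0; i < n; i++) {
            ec[i] = adjacency[i].length; // iteration 0: the atom's degree
        }
        for (int k = 0; k < iterations; k++) {
            int[] next = new int[n];
            for (int i = 0; i < n; i++) {
                for (int nb : adjacency[i]) {
                    next[i] += ec[nb]; // sum over the neighbours' previous values
                }
            }
            ec = next;
        }
        return ec;
    }

    public static void main(String[] args) {
        // Query substructure C-C-O as a three-atom chain.
        int[][] query = { {1}, {0, 2}, {1} };
        // 1-butanol C-C-C-C-O; atoms 2, 3, 4 are the C-C-O part.
        int[][] butanol = { {1}, {0, 2}, {1, 3}, {2, 4}, {3} };

        System.out.println(Arrays.toString(values(query, 1)));   // [2, 2, 2]
        System.out.println(Arrays.toString(values(butanol, 1))); // [2, 3, 4, 3, 2]
    }
}
```

After one iteration the C-C-O atoms carry 2, 2, 2 in the query but
4, 3, 2 inside butanol, so the values cannot be compared directly.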

Anyway, this has been solved very efficiently, so we should be able to
do it :-)

If there are any ideas or even experiences, please comment to the list.

Cheers, 

Chris

--
Dr. Christoph Steinbeck (http://www.ice.mpg.de/departments/ChemInf)
MPI of Chemical Ecology, Carl-Zeiss-Promenade 10, 07745 Jena, Germany
Tel: +49(0)3641 643644 - Fax: +49(0)3641 643665

What is man but that lofty spirit - that sense of enterprise. 
... Kirk, "I, Mudd," stardate 4513.3..