
Structure Identification

Stefan Kuhn

javadbchem saves each compound only once; in that sense, it comes with a compound registration system. To identify compounds, the SMILES and the chiral SMILES (both using CDK) and the InChI (using JNI-INCHI; the InChI is Version 1 Standard InChI) are calculated. If two structures have identical SMILES, chiral SMILES and InChI, they are considered identical. Comparing the InChI in addition to the SMILES ensures that bugs in the SMILES implementation do not lead to problems. In theory, using only the InChI would have been possible as well, but the SMILES is calculated anyway. If for some reason the InChI cannot be calculated, the SMILES still works for identification.
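The identity rule above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class `CompoundRegistry`, its `register` method and the in-memory map are hypothetical and do not reflect javadbchem's actual classes or database schema, and the identifier strings are typed in by hand rather than produced by CDK or JNI-INCHI.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "save each compound only once"; not javadbchem code.
final class CompoundRegistry {

    // Maps the identifier triple (SMILES, chiral SMILES, InChI) to a compound id.
    private final Map<List<String>, Integer> idsByKey = new HashMap<>();
    private int nextId = 1;

    // Returns the id of an already registered compound if all three
    // identifiers match; otherwise stores the compound under a new id.
    // inchi may be null when InChI generation failed; the two SMILES
    // strings then still take part in the comparison.
    int register(String smiles, String chiralSmiles, String inchi) {
        // Arrays.asList permits null elements and compares element-wise.
        List<String> key = Arrays.asList(smiles, chiralSmiles, inchi);
        Integer existing = idsByKey.get(key);
        if (existing != null) {
            return existing;
        }
        int id = nextId++;
        idsByKey.put(key, id);
        return id;
    }

    int size() {
        return idsByKey.size();
    }

    public static void main(String[] args) {
        CompoundRegistry registry = new CompoundRegistry();
        // Illustrative identifier strings for ethanol, entered by hand.
        String ethanolInchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3";
        int first = registry.register("CCO", "CCO", ethanolInchi);
        int second = registry.register("CCO", "CCO", ethanolInchi);
        // Same SMILES but a different chiral SMILES counts as a new compound.
        int plain = registry.register("CC(N)C(O)=O", "CC(N)C(O)=O", null);
        int chiral = registry.register("CC(N)C(O)=O", "C[C@H](N)C(O)=O", null);
        System.out.println("duplicate reused id: " + (first == second));
        System.out.println("stereo variant got new id: " + (plain != chiral));
    }
}
```

In this sketch a compound registered once without an InChI and once with one would end up as two entries; how a real system should treat that case is a design decision the sketch does not address.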

Because JNI-INCHI is used, the calculation of the InChI may not work on an exotic platform; see the JNI-INCHI documentation for details. To skip InChI generation in that case, search for the getMolecule(IAtomContainer molecule, boolean[] isDoubleBondSpecified) method in the Peer.vm file. It should have an if (rs.next()) clause; that clause should then just do a "return oldmol;".

javadbchem adds implicit hydrogens to structures before they are saved or searched, so it does not matter whether hydrogens are given explicitly or left implicit. The only exception is when hydrogens matter for stereochemistry: leaving them out might then change the stereochemistry and therefore trigger a new structure. All implicit hydrogens are always saved explicitly in the database (with generated coordinates), which makes sure that properties can be assigned to hydrogens properly. Aromaticity detection is also performed whenever a structure is saved or searched. Finally, the (crude) normalization procedure in CDK is used. It takes an XML file as configuration; one called normalizer.xml is in the src/java directory of the javadbchem checkout. It only performs normalization of nitro groups. If you want to extend the file, see the CDK API documentation for details.

The chiral SMILES will be different for structures which are only partly or not at all stereo-specified. So you can have different entries for a certain structure with a) no stereo specification at all, b) some stereo specification and c) full stereo specification. For b) and c), there will be different entries for different stereo configurations. It can happen that one structure exists with all stereo centres specified and another with only some stereo centres specified, in exactly the same way as in the fully specified structure; these will be two different entries. No relationships are maintained in the database between these structures, but you can group them by non-stereo SMILES or the first layers of the InChI (or the first block of the InChIKey). A full treatment would require some sort of ontology.
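As a rough illustration of the grouping idea, the sketch below groups entries by the first block of their InChIKey, i.e. the 14 characters before the first hyphen, which hashes the core constitution and is therefore shared by stereo variants. The class `StereoGrouping` and its methods are hypothetical names, not part of javadbchem; the sample keys (three alanine entries and ethanol) are typed in for illustration and should be checked against your own data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; javadbchem itself maintains no such relationships.
final class StereoGrouping {

    // Returns the part of the InChIKey before the first hyphen.
    // Keys without a hyphen are returned unchanged.
    static String firstBlock(String inchiKey) {
        int dash = inchiKey.indexOf('-');
        return dash < 0 ? inchiKey : inchiKey.substring(0, dash);
    }

    // Groups InChIKeys that share the same first block, keeping input order.
    static Map<String, List<String>> groupByFirstBlock(List<String> inchiKeys) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String key : inchiKeys) {
            groups.computeIfAbsent(firstBlock(key), k -> new ArrayList<>()).add(key);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Illustrative sample keys, entered by hand.
        List<String> keys = Arrays.asList(
                "QNAYBMKLOCPYGJ-UHFFFAOYSA-N",  // alanine, no stereo specified
                "QNAYBMKLOCPYGJ-REOHHSEBSA-N",  // alanine, one configuration
                "QNAYBMKLOCPYGJ-UWTATZPHSA-N",  // alanine, other configuration
                "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"); // ethanol
        groupByFirstBlock(keys).forEach(
                (block, members) -> System.out.println(block + " -> " + members));
    }
}
```

Grouping by non-stereo SMILES would work the same way, with the non-stereo SMILES string taking the place of the first InChIKey block as the map key.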


Related

Wiki: Home
Wiki: HowTo Usage