Note that there is now a separate sf project allowing to build structure databases as describe below. The code can be used in any java project basically like a chemistry cartridge. See https://sourceforge.net/projects/javadbchem/
nmrshiftdb2 uses the CDK to have structure and substructure searches in a MySQL database. The technology used is detailed here. Generally the architecture has worked very well, especially considering the relatively low amount of work required (there were of course tricky bits, see below). A disadvantage of the design is that code is partly in the Java application, and not completely encapsulated in the database. The approach should be possible with small changes in any database system. The system works as follows:
1) There is a database table called MOLECULE which has the following fields (more fields exist in nmrshiftdb2, but they are not relevant for the structure search):
CREATE TABLE `MOLECULE` (
`MOLECULE_ID` int(11) NOT NULL default '0',
`SMILES_STRING` mediumblob NOT NULL,
`SMILES_STRING_CHIRAL` mediumblob,
`fp0` bigint(20) default NULL,
...
`fp15` bigint(20) default NULL,
PRIMARY KEY (`MOLECULE_ID`),
)
The SMILES_STRING and SMILES_STRING_CHIRAL fields hold a CDK-generated smiles and chiral smiles of the molecule (chiral smiles is null, if no chirality is contained in molecule). This is used for identity check and identity search by generating smiles for the search structure and searching for the smiles (for details on the tricky question of molecule identity, see [StructureIDs]). With an index on the fields, this gives fast identity searches (notice CDK smiles are canonical).
The fields fp0 to fp15 hold a 1024 bit CDK generated fingerprint (a bigint in mySQL has 64 bits, so the 1024 bits fit in 16 fields). You need to do some maths to break down the fingerprint into 16 figures, where the bits in the binary representation of the figure are the same as in a certain fragment of the fingerprint. If we calculate the same fingerprint for a search structure and break it down into 16 figures using the same algorithm, a prefiltering step using the fingerprint can be done in SQL with a query like this (the fingerprint is that of Benzene):
select MOLECULE_ID from MOLECULE where fp0 & 8388609=8388609 and fp1 & 0=0 and fp2 & 281474993487874=281474993487874 and fp3 & 562949953421312=562949953421312 and fp4 & 0=0 and fp5 & 0=0 and fp6 & 0=0 and fp7 & 528=528 and fp8 & 0=0 and fp9 & 0=0 and fp10 & 0=0 and fp11 & 0=0 and fp12 & 0=0 and fp13 & 0=0 and fp14 & 0=0 and fp15 & 0=0;
Notice & is the bitwise AND in mySQL, if a fingerprint in the database AND the one to search for are the same as the one to search for, at least each bit is in the fingerprint in the database as well . This search is fast, it takes e. g. ~0.2 seconds on 41083 structures on the nmrshiftdb2 server. You then need to do the actual precise search by using the CDK subgraph isomorphism in the Java code. Notice the overall performance is dependent on the subgraph isomorphism, but this is true for any CDK based implementation, no matter if its using a cartridge or not, which database system it is etc. My guidelines are as follows:
2) In nmrshiftdb2, the actual structure drawing is not saved as a mol file (like in most systems), but in atom and bond tables, where a line basically contains a line in the ctab files. The attributes represent columns in the ctab. The tables look like this (BOND_ATOM gives the many-to-many relationship between bonds and atoms).
CREATE TABLE `ATOM` (
`ATOM_ID` int(11) NOT NULL default '0',
`SYMBOL` char(3) NOT NULL default _,_
`MOLECULE_ID` int(11) NOT NULL default '0',
`HOSE_CODE_WITH_RINGS` text,
`ATOMIC_MASS` int(11) NOT NULL default '0',
`FORMAL_CHARGE` int(11) NOT NULL default '0',
`IS_AROMATIC` enum('false','true') NOT NULL default 'false',
`IS_VISIBLE` enum('false','true') NOT NULL default 'false',
`HETERO` varchar(5) NOT NULL default 'false',
`HOSE_CODE` varchar(180) NOT NULL default _,_
PRIMARY KEY (`ATOM_ID`),
KEY `MOLECULE_ID_INDEX` (`MOLECULE_ID`),
KEY `HOSE_CODE_INDEX` (`HOSE_CODE_WITH_RINGS`(300)),
KEY `MOLECULE_ATOM_INDEX` (`MOLECULE_ID`,`ATOM_ID`),
CONSTRAINT `0_321` FOREIGN KEY (`MOLECULE_ID`) REFERENCES `MOLECULE` (`MOLECULE_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `BOND` (
`BOND_ID` int(11) NOT NULL default '0',
`DEGREE` int(11) NOT NULL default '0',
`IS_CONFIGURATION_SPECIFIED` enum('false','true') NOT NULL default 'false',
`STEREO` int(11) NOT NULL default '0',
`IS_AROMATIC` enum('false','true') NOT NULL default 'false',
PRIMARY KEY (`BOND_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `BOND_ATOM` (
`BOND_ID` int(11) NOT NULL default '0',
`ATOM_ID` int(11) NOT NULL default '0',
`ORDER_NUMBER` int(11) NOT NULL default '0',
PRIMARY KEY (`BOND_ID`,`ATOM_ID`),
KEY `ATOM_ID` (`ATOM_ID`),
CONSTRAINT `0_325` FOREIGN KEY (`BOND_ID`) REFERENCES `BOND` (`BOND_ID`),
CONSTRAINT `0_326` FOREIGN KEY (`ATOM_ID`) REFERENCES `ATOM` (`ATOM_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Notice not all columns in the ctab are in here, but this can be changed. Just to make things clear with an example: If we have the entry for benzene, there is one entry in the molecule table with ID 7901, there are this entries in the ATOM table:
| 10040872 | C | 7901 | C-3-6;*C*C(*C,*C/*C,*&/*&)// | 12 | 0 | true | true | false | C-3;*C*C(*C,*C/*C,*&/*&)// |
| 10040873 | C | 7901 | C-3-6;*C*C(*C,*C/*C,*&/*&)// | 12 | 0 | true | true | false | C-3;*C*C(*C,*C/*C,*&/*&)// |
| 10040874 | C | 7901 | C-3-6;*C*C(*C,*C/*C,*&/*&)// | 12 | 0 | true | true | false | C-3;*C*C(*C,*C/*C,*&/*&)// |
| 10040875 | C | 7901 | C-3-6;*C*C(*C,*C/*C,*&/*&)// | 12 | 0 | true | true | false | C-3;*C*C(*C,*C/*C,*&/*&)// |
| 10040876 | C | 7901 | C-3-6;*C*C(*C,*C/*C,*&/*&)// | 12 | 0 | true | true | false | C-3;*C*C(*C,*C/*C,*&/*&)// |
| 10040877 | C | 7901 | C-3-6;*C*C(*C,*C/*C,*&/*&)// | 12 | 0 | true | true | false | C-3;*C*C(*C,*C/*C,*&/*&)// |
| 10040878 | H | 7901 | H-1;C(*C*C/H,H,*C,*C/H,H,*C,*&)H*&// | 1 | 0 | false| false | false | H-1;C(*C*C/H,H,*C,*C/H,H,*C,*&)H*&// |
| 10040879 | H | 7901 | H-1;C(*C*C/H,H,*C,*C/H,H,*C,*&)H*&// | 1 | 0 | false| false | false | H-1;C(*C*C/H,H,*C,*C/H,H,*C,*&)H*&// |
| 10040880 | H | 7901 | H-1;C(*C*C/H,H,*C,*C/H,H,*C,*&)H*&// | 1 | 0 | false| false | false | H-1;C(*C*C/H,H,*C,*C/H,H,*C,*&)H*&// |
| 10040881 | H | 7901 | H-1;C(*C*C/H,H,*C,*C/H,H,*C,*&)H*&// | 1 | 0 | false| false | false | H-1;C(*C*C/H,H,*C,*C/H,H,*C,*&)H*&// |
| 10040882 | H | 7901 | H-1;C(*C*C/H,H,*C,*C/H,H,*C,*&)H*&// | 1 | 0 | false| false | false | H-1;C(*C*C/H,H,*C,*C/H,H,*C,*&)H*&// |
| 10040883 | H | 7901 | H-1;C(*C*C/H,H,*C,*C/H,H,*C,*&)H*&// | 1 | 0 | false| false | false | H-1;C(*C*C/H,H,*C,*C/H,H,*C,*&)H*&// |
Note all atoms including the implicit Hs are modelled (this is shown by the IS_VISIBLE attribute being false). The bonds are like this:
| 10042647 | 1 | false | 0 | true |
| 10042652 | 2 | false | 0 | true |
| 10042653 | 1 | false | 0 | false |
| 10042648 | 2 | false | 0 | true |
| 10042654 | 1 | false | 0 | false |
| 10042649 | 1 | false | 0 | true |
| 10042655 | 1 | false | 0 | false |
| 10042650 | 2 | false | 0 | true |
| 10042656 | 1 | false | 0 | false |
| 10042651 | 1 | false | 0 | true |
| 10042657 | 1 | false | 0 | false |
| 10042658 | 1 | false | 0 | false |
Finally, the BOND_ATOM relationship is like this:
| 10042647 | 10040872 | 0 |
| 10042652 | 10040872 | 0 |
| 10042653 | 10040872 | 0 |
| 10042647 | 10040873 | 1 |
| 10042648 | 10040873 | 0 |
| 10042654 | 10040873 | 0 |
(not all given, the rest is obvious). From these tables, we can build a org.openscience.cdk.Molecule object (Atom and Bond objects can be populated from the tables and they can be added to the Molecule). From the Molecule, we can generate e. g. a mol file. In case somebody wonders where coordinates are: They are in separate COORDINATE_SET tables (see the nmrshiftdb2 ER-diagram for details), which enables as to hold several coordinate sets (effectively different drawings) for the same structure.
Now, why all the fuss? The main reason is this: If we want to hold shifts and their assignments to atoms (this is actually a special case of the "atom and property" requirement), we can now do this on the database level using referential integrity, since we have atoms in the database (there is a shift table as well, of course). This is crucial for keeping the database clean. In theory, it would be possible to have the mol files in the molecule table and give a combination of MOLECULE_ID and atom number in the mol file for the assignment, but then referential integrity is not enforced. From my experience, I imagine that a system where the relationship is not modelled on the database level, will become corrupt over time (this is not to say, of course, that the system here is fool-proof, because there are other possibilities for corruptions, but at least one major possibility is excluded). The same is true, by the way, for explicitly modelling the implict Hs: It will get you into a mess, if you don't. This is true at least (but I believe not only) if you want to have properties for implicit Hs.
Another small advantage is that properties can easily be seen and corrected in the database. So if you discover that e. g. a double bond in an aromatic ring is missing in your database, you can see this in the databse and correct this via an update in sql. Otherwise, you need to extract your molfile and update this (and trust me, any larger structure database will require manual updates).
As I said, altogether this was a relatively simple exercise. What took a lot of time was to fix and handle all the little details, some of them directly related to CDK, like optimizing the fingerprint, handling aromaticity etc. These are now part of CDK and so of e. g. OrChem as well. Note CDK is not perfect (no chemistry library is) and some problems directly affect the search. This is true at least of the aromaticity problems, but again this affects all CDK based implementations. If you are interested in Java code mentioned above, see the nmrshiftdb2 source code. Most methods are in org.openscience.nmrshiftdb.om.DBMolecule, setFingerprint(IMolecule mol) contains the fingerprint calculation (the underlying OR mapper torque has problems with the bgints, so in the handleFingerprints2 method a direct sql statement is done), getAsCDKMoleculeX build various CDK molecule (with/without implicit Hs etc.). The exact substructure match is done in org.openscience.nmrshiftdb.util.GeneralUtils (look for line SubgraphIsomorphism checker = new SubgraphIsomorphism(mol, thisMol,SubgraphIsomorphism.Algorithm.VF2)), but this is basic CDK code.