Re: [Rdkit-devel] PostgreSQL cartridge: Function to check whether a SMILES string is valid or not i
From: Adrian S. <ma...@ad...> - 2010-10-29 15:08:51
Hi Greg,

On Thu, Oct 28, 2010 at 14:04, Greg Landrum <gre...@gm...> wrote:
> Dear Adrian,
>
> On Thu, Oct 28, 2010 at 11:34 AM, Adrian Schreyer <am...@ca...> wrote:
>>
>> I moved to PostgreSQL 9.0 recently and installed the RDKit cartridge
>> which was easy to compile and install. It is also one of the fastest
>> cartridges I have used so far, definitely a great extension of RDKit!
>
> Thanks! I'm glad to hear that it's useful.
>
>> At the moment I am trying to create rdkit molecules and fingerprints
>> for the latest version of the ChEMBL database, there is however a
>> problem which can be solved easily. Since I already had the ChEMBL
>> database (including the SMILES strings), I added tables to hold the
>> mol and fp types for every compound and tried to populate it through
>> "insert into... select mol_in(ism::cstring)..." which works nicely
>> unless it finds a SMILES string it cannot parse such as this
>> "CCCC1234B567B89%10B%11%12%13B8%14%15B%11%16%17B%12%18%19B59%13B16%18C2%16%19(c%20ccc(Oc%21ccc(cc%21)C(=O)O)cc%20)B3%14%17B47%10%15"
>> (ChEBI: http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI%3A105310).
>> The query will fail then, and the only way to populate these tables is
>> to download the SMILES and filter them manually in Python. Is it
>> possible to have a boolean function (is_valid(cstring)?) in the
>> cartridge that simply checks if a SMILES can be parsed with RDKit or
>> not? This would make it possible to add this check to a where clause
>> in a query and as a result make the creation of mol types much easier.
>
> That's one good solution that would be easy to implement. The
> is_valid() function would be a useful thing to have anyway, so I'll go
> ahead and add it sometime in the near future. The downside is that it
> will take more or less twice as long to populate the database (since
> every molecule would have to be processed twice). Another option that
> might be better, but I'll have to check about how feasible it is,
> would be to have the molecule construction functions just return a
> null (or whatever the postgresql equivalent is).

Creating the mol types is actually not that slow, I thought, compared to
creating the indexes (based on the latest version of ChEMBL). Simplifying
the database creation is definitely worth it, and those costly operations
are done only once.

Another suggestion I had was to refactor the naming of the functions in
the cartridge to make them more similar to the underlying functions in
the Python/C++ libraries, e.g. mol_in ~ mol_from_smiles / mol_out ~
mol_to_smiles.

Given the context, maybe it is possible to reduce the number of instances
where a SMILES string cannot be parsed. From what I have seen so far,
this happens in three cases: the SMILES string contains a non-Daylight
extension (not RDKit's fault), exotic inorganic chemistry (negligible),
or the valence system in RDKit is violated. The last case is something
where I often encounter problems, for example bromic acid:

    Chem.MolFromSmiles('OBr(=O)=O')
    [15:21:29] Explicit valence for atom # 1 Br, 5, is greater than permitted

Is there a definition for the valence model used in RDKit somewhere in
the source tree? I assume the valence model is nothing that can be
manually changed by the user without breaking other aspects of the
software.

Cheers,
Adrian
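
For reference, the manual Python filtering step mentioned in the quoted message could look roughly like the sketch below. It relies on the fact that Chem.MolFromSmiles returns None when a SMILES string cannot be parsed or sanitized. The helper name partition_smiles and the sample input list are hypothetical and only for illustration; they are not part of RDKit or the cartridge. The RDKit import is guarded so the helper itself can be exercised with any parser callable.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def partition_smiles(
    smiles: Iterable[str],
    parse: Callable[[str], Optional[object]],
) -> Tuple[List[str], List[str]]:
    """Split SMILES strings into (parsable, rejected).

    `parse` is any callable that returns None for input it cannot
    handle, which is how Chem.MolFromSmiles reports a failure.
    partition_smiles is a hypothetical helper, not an RDKit function.
    """
    parsable: List[str] = []
    rejected: List[str] = []
    for smi in smiles:
        if parse(smi) is None:
            rejected.append(smi)
        else:
            parsable.append(smi)
    return parsable, rejected


# RDKit is optional here; the helper above does not depend on it.
try:
    from rdkit import Chem
except ImportError:
    Chem = None

if Chem is not None:
    # Illustrative input only; "C1CC" has an unclosed ring bond.
    candidates = ["CCO", "c1ccccc1", "C1CC"]
    ok, bad = partition_smiles(candidates, Chem.MolFromSmiles)
    print("parsable:", ok)
    print("rejected:", bad)
```

Only the strings in the first list would then be loaded into the table holding the mol type. Which strings end up rejected depends on the RDKit version and its valence model, so the split should be recomputed after an upgrade.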