Re: [Rdkit-devel] PostgreSQL cartridge: Function to check whether a SMILES string is valid or not i
From: Greg L. <gre...@gm...> - 2010-10-28 13:05:17
Dear Adrian,

On Thu, Oct 28, 2010 at 11:34 AM, Adrian Schreyer <am...@ca...> wrote:
>
> I moved to PostgreSQL 9.0 recently and installed the RDKit cartridge
> which was easy to compile and install. It is also one of the fastest
> cartridges I have used so far, definitely a great extension of RDKit!

Thanks! I'm glad to hear that it's useful.

> At the moment I am trying to create rdkit molecules and fingerprints
> for the latest version of the ChEMBL database, there is however a
> problem which can be solved easily. Since I already had the ChEMBL
> database (including the SMILES strings), I added tables to hold the
> mol and fp types for every compound and tried to populate it through
> "insert into... select mol_in(ism::cstring)..." which works nicely
> unless it finds a SMILES string it cannot parse such as this
> "CCCC1234B567B89%10B%11%12%13B8%14%15B%11%16%17B%12%18%19B59%13B16%18C2%16%19(c%20ccc(Oc%21ccc(cc%21)C(=O)O)cc%20)B3%14%17B47%10%15"
> (ChEBI: http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI%3A105310).
> The query will fail then, and the only way to populate these table is
> to download the SMILES and filter them manually in Python. Is it
> possible to have a boolean function (is_valid(cstring)?) in the
> cartridge that simply checks if a SMILES can be parsed with RDKit or
> not? This would make it possible to add this check to a where clause
> in a query and as a result make the creation of mol types much easier.

That's a good solution, and it would be easy to implement. An
is_valid() function would be useful to have anyway, so I'll go ahead
and add it sometime in the near future. The downside is that populating
the database will take roughly twice as long, since every molecule
would have to be processed twice: once to check it and once to build it.

Another option that might be better, though I'll have to check how
feasible it is, would be to have the molecule construction functions
just return a null (or whatever the PostgreSQL equivalent is) for input
they cannot parse.

Thanks for the suggestion,
-greg
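
For reference, the manual Python filtering mentioned in the quoted message could look roughly like the sketch below. It is only an illustration, not cartridge code: the helper name partition_smiles is made up for this example, and the parser is passed in as an argument so the helper itself does not depend on RDKit. It relies on the fact that RDKit's Chem.MolFromSmiles returns None for a SMILES it cannot parse, which is the same "return a null" behaviour discussed above for the molecule construction functions.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def partition_smiles(
    smiles: Iterable[str],
    parse: Callable[[str], Optional[object]],
) -> Tuple[List[str], List[str]]:
    """Split SMILES strings into (parseable, unparseable) lists.

    `parse` should return None for input it cannot handle, which is how
    RDKit's Chem.MolFromSmiles reports a failure. Exceptions raised by
    the parser are treated as failures as well.
    """
    valid: List[str] = []
    invalid: List[str] = []
    for smi in smiles:
        try:
            mol = parse(smi)
        except Exception:
            mol = None
        if mol is not None:
            valid.append(smi)
        else:
            invalid.append(smi)
    return valid, invalid


if __name__ == "__main__":
    # RDKit is optional here; the example is skipped if it is not installed.
    try:
        from rdkit import Chem
    except ImportError:
        Chem = None

    if Chem is not None:
        # "C1CC" has an unclosed ring and is expected to be rejected.
        ok, bad = partition_smiles(["CCO", "c1ccccc1", "C1CC"], Chem.MolFromSmiles)
        print("parseable:", ok)
        print("unparseable:", bad)
```

Only the strings in the first list would then be loaded into the mol table. Each molecule is parsed once here, but the round trip through Python is exactly the extra step that a check inside the cartridge would avoid.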