Re: [Rdkit-devel] PostgreSQL cartridge: Function to check whether a SMILES string is valid or not in RDKit


Hi Greg,

On Thu, Oct 28, 2010 at 14:04, Greg Landrum <gre...@gm...> wrote:
> Dear Adrian,
>
> On Thu, Oct 28, 2010 at 11:34 AM, Adrian Schreyer <am...@ca...> wrote:
>>
>> I moved to PostgreSQL 9.0 recently and installed the RDKit cartridge
>> which was easy to compile and install. It is also one of the fastest
>> cartridges I have used so far, definitely a great extension of RDKit!
>
> Thanks! I'm glad to hear that it's useful.
>
>> At the moment I am trying to create rdkit molecules and fingerprints
>> for the latest version of the ChEMBL database, there is however a
>> problem which can be solved easily. Since I already had the ChEMBL
>> database (including the SMILES strings), I added tables to hold the
>> mol and fp types for every compound and tried to populate them through
>> "insert into... select mol_in(ism::cstring)..." which works nicely
>> unless it finds a SMILES string it cannot parse such as this
>> "CCCC1234B567B89%10B%11%12%13B8%14%15B%11%16%17B%12%18%19B59%13B16%18C2%16%19(c%20ccc(Oc%21ccc(cc%21)C(=O)O)cc%20)B3%14%17B47%10%15"
>> (ChEBI: http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI%3A105310).
>> The query will fail then, and the only way to populate these tables is
>> to download the SMILES and filter them manually in Python. Is it
>> possible to have a boolean function (is_valid(cstring)?) in the
>> cartridge that simply checks if a SMILES can be parsed with RDKit or
>> not? This would make it possible to add this check to a where clause
>> in a query and as a result make the creation of mol types much easier.
>
> That's one good solution that would be easy to implement. The
> is_valid() function would be a useful thing to have anyway, so I'll go
> ahead and add it sometime in the near future. The downside is that it
> will take more or less twice as long to populate the database (since
> every molecule would have to be processed twice). Another option that
> might be better, but I'll have to check about how feasible it is,
> would be to have the molecule construction functions just return a
> null (or whatever the postgresql equivalent is).

Creating the mol types was actually not that slow, I thought, compared
to creating the indexes (based on the latest version of ChEMBL).
Simplifying the database creation is definitely worth it, and those
costly operations are done only once. Another suggestion I had was to
rename the functions in the cartridge to make them more similar to
the underlying functions in the Python/C++ libraries,
e.g. mol_in ~ mol_from_smiles / mol_out ~ mol_to_smiles.

Given the context, maybe it is possible to reduce the number of
instances where a SMILES string cannot be parsed. From what I have
seen so far, this happens in three cases: the SMILES string contains a
non-Daylight extension (not RDKit's fault), exotic inorganic chemistry
(negligible), or the valence model in RDKit is violated. The last case
is where I often run into problems, for example with bromic acid:

from rdkit import Chem
Chem.MolFromSmiles('OBr(=O)=O')
[15:21:29] Explicit valence for atom # 1 Br, 5, is greater than permitted

Is there a definition of the valence model used in RDKit somewhere in
the source tree? I assume the valence model is not something that can
be changed manually by the user without breaking other aspects of the
software.

Cheers,

Adrian
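
For anyone hitting the same problem before the cartridge gains a validity
check, here is a minimal Python sketch of the pre-filtering workaround
described in the thread. It relies on Chem.MolFromSmiles returning None
for a SMILES string that RDKit cannot parse, which includes valence
errors like the one shown above. The names partition_smiles and
rdkit_parser are hypothetical helpers made up for this illustration;
they are not part of RDKit or the cartridge.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def rdkit_parser() -> Callable[[str], Optional[object]]:
    """Return RDKit's SMILES parser.

    Imported lazily so that partition_smiles itself has no hard
    dependency on RDKit (assumption: RDKit is installed when this
    function is called).
    """
    from rdkit import Chem
    return Chem.MolFromSmiles


def partition_smiles(
    smiles: Iterable[str],
    parse: Callable[[str], Optional[object]],
) -> Tuple[List[str], List[str]]:
    """Split SMILES strings into (parsable, rejected).

    A string counts as rejected when parse() returns None, which is
    how Chem.MolFromSmiles signals a failure.
    """
    valid: List[str] = []
    rejected: List[str] = []
    for smi in smiles:
        if parse(smi) is not None:
            valid.append(smi)
        else:
            rejected.append(smi)
    return valid, rejected
```

With RDKit installed, partition_smiles(smiles_list, rdkit_parser())
should give the strings that are safe to pass to mol_in and the ones to
inspect by hand. This parses every molecule once in Python and again in
the database, so it has the same doubled cost that Greg mentions for
is_valid().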
