Re: [Rdkit-discuss] sanitization removes Hs - is this expected?
From: Greg L. <gre...@gm...> - 2015-04-02 04:22:46
Hi Michal,

On Wed, Apr 1, 2015 at 10:51 AM, Michal Krompiec <mic...@gm...> wrote:

> Hi Greg,
> Is it possible to do the same (i.e. create a molecule from SMILES without
> removing explicit hydrogens) in the postgresql cartridge? I would like to
> do a "restricted" substructure search using SMILES queries.

I think I understand your use case.

> For example, with the standard behaviour (hydrogens removed),
> c1ccccc1[CH3] is converted to c1ccccc1C and matches TNT and benzaldehyde,
> whereas if the hydrogens are not removed, this SMILES query would match TNT
> but not benzaldehyde. Of course, this can be done with SMARTS but SMILES
> with explicit hydrogens can be drawn in MarvinSketch in KNIME by a
> non-expert user.

Without getting overly into terminology, it sounds to me like you want people to be able to draw something corresponding to "C1=CC=CC=C1C([H])([H])[H]" in a sketcher, convert that to SMILES, and have the query constructed from that SMILES match toluene but not ethyl-benzene or benzaldehyde.

Going via SMARTS here does not work because [#6]-1=[#6]-[#6]=[#6]-[#6]=[#6]-1[H]C([H])[H] doesn't match much of anything.

Skipping sanitization, as you propose, isn't going to help here: the kekulized form of the ring will not be converted to aromatic and you won't get the matches you are looking for.

Here's an approach to this that works in Python:

    In [8]: m = Chem.MolFromSmiles('c1ccnc([H])n1', sanitize=False)

    In [9]: nm = Chem.MergeQueryHs(m)

    In [10]: Chem.SanitizeMol(nm)
    Out[10]: rdkit.Chem.rdmolops.SanitizeFlags.SANITIZE_NONE

    In [11]: Chem.MolFromSmiles('c1ccncn1').HasSubstructMatch(nm)
    Out[11]: True

    In [12]: Chem.MolFromSmiles('c1ccnc(C)n1').HasSubstructMatch(nm)
    Out[12]: False

Notice the MergeQueryHs() step; that's essential unless you are storing molecules in the database with Hs attached (pretty unlikely).

Being able to do something equivalent in the cartridge would certainly be useful.
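For reuse outside an interactive session, the same three steps (parse without sanitization, merge the explicit Hs into the query, then sanitize) can be wrapped in a small function. This is only a sketch: the helper name `query_from_smiles_with_hs` is made up for illustration and is not part of RDKit; the RDKit calls inside it are the ones used in the session above.

```python
from rdkit import Chem


def query_from_smiles_with_hs(smiles):
    """Hypothetical helper (not an RDKit function): build a substructure
    query in which explicit [H] atoms in the SMILES act as constraints."""
    # Parse without sanitization so the explicit H atoms are not stripped.
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    # Fold each explicit H into a query on the heavy atom it is attached to.
    query = Chem.MergeQueryHs(mol)
    # Sanitize only after merging, as in the session above.
    Chem.SanitizeMol(query)
    return query


query = query_from_smiles_with_hs('c1ccnc([H])n1')
pyrimidine = Chem.MolFromSmiles('c1ccncn1')
methylpyrimidine = Chem.MolFromSmiles('c1ccnc(C)n1')

# Same two checks as In [11] and In [12] above.
print(pyrimidine.HasSubstructMatch(query))
print(methylpyrimidine.HasSubstructMatch(query))
```

The target molecules are parsed the normal way, with implicit Hs, which is the situation you would have for molecules stored in a database.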
What I'd suggest is the addition of two functions: "query_mol_from_smiles()" and "query_mol_from_ctab()" that do this. Then you could do queries like:

    select * from mols where m @> query_mol_from_smiles('c1ccnc([H])n1');

and have it do the right thing.

Sound reasonable?

-greg