Re: [Rdkit-discuss] Exact sub-structure match
From: Greg L. <gre...@gm...> - 2016-05-02 14:03:25
Dear Stephane,
On Mon, May 2, 2016 at 2:06 PM, Téletchéa Stéphane <
ste...@un...> wrote:
> Dear all,
>
> I am trying to work on sub-structure match using rdkit, my goal is to
> identify unambiguously
> a carbohydrate moiety in carbohydrates, for example glucose or galactose
> in lactose.
>
> I have tried using the built-in functions from the doc, using "ideal"
> structures from pubchem as starting structures:
>
> gal = Chem.MolFromMolFile('Beta-D-Galactose_Structure3D_CID_6036.sdf')
> glu = Chem.MolFromMolFile('Beta-D-Glucose_Structure3D_CID_5793.sdf')
> lactose = Chem.MolFromMolFile('Lactose_Structure3D_CID_6134.sdf')
>
> However, I am not able to distinguish between galactose and glucose
> (they are isomers on the c4 position)
> during the SubStructure search:
>
>
> ms=[gal,glu]
> fps = [FingerprintMols.FingerprintMol(x) for x in ms]
> DataStructs.FingerprintSimilarity(fps[0],fps[1])
> 1.0
>
That's not doing a substructure search, it's looking at chemical similarity
using the RDKit fingerprint. The two things are quite different from each
other.
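To make the 1.0 above more concrete: as far as I know, FingerprintSimilarity uses the Tanimoto metric by default, and the RDKit fingerprint does not encode stereochemistry, so two stereoisomers end up setting the same bits. Below is a rough plain-Python sketch of that idea. The feature strings are made up for illustration and are not what RDKit actually hashes; `tanimoto` is a hypothetical helper, not an RDKit function.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two collections of features."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen for this sketch only.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical, hand-written "path" features. Stereochemistry is not part of
# them, so galactose and glucose (same connectivity) get identical sets.
gal_features = {"C-C", "C-O", "C-C-O", "O-C-O", "C-C-C"}
glu_features = set(gal_features)
other_features = gal_features - {"O-C-O"}  # a molecule missing one feature

print(tanimoto(gal_features, glu_features))    # 1.0
print(tanimoto(gal_features, other_features))  # 0.8
```

A score of 1.0 therefore only says "same features under this fingerprint", not "same molecule".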
>
> I *understand* that a fingerprint is meant to find the maximum
> sub-structure,
> but how could I distinguish between glu and gal in such a simple
> molecule like lactose?
> Is it possible in rdkit? Is it a bug (or a misuse)?
>
That particular fingerprint is designed for molecular similarity, not
substructure matching.
If you are interested in finding out whether or not the molecules have an
atom-atom match to each other, you can directly use the substructure
functionality.
Here's a demonstration using glucose and galactose built from SMILES
(the results should be the same if you construct the molecules from the
PubChem SDFs):
In [15]: galactose = Chem.MolFromSmiles('C([C@@H]1[C@@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O')
In [16]: glucose = Chem.MolFromSmiles('C([C@@H]1[C@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O')
In [17]: glucose.HasSubstructMatch(galactose)
Out[17]: True
In [18]: galactose.HasSubstructMatch(glucose)
Out[18]: True
This demonstrates that the default substructure match behavior does not
take chirality into account. This can be changed:
In [19]: glucose.HasSubstructMatch(galactose,useChirality=True)
Out[19]: False
In [20]: galactose.HasSubstructMatch(glucose,useChirality=True)
Out[20]: False
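If what is wanted is an "exact" match rather than a substructure match, one possible approach is to combine the chirality-aware substructure check with a size check. The following is only a rough sketch under that assumption: `same_structure` is a hypothetical helper, not part of RDKit, and it compares heavy atoms and bonds only (hydrogen counts, for example, are not checked).

```python
from rdkit import Chem


def same_structure(a, b, use_chirality=True):
    """Rough whole-molecule comparison (hypothetical helper, not an RDKit API).

    If both molecules have the same number of heavy atoms and bonds, a
    substructure match of b in a has to cover all of a.
    """
    if a.GetNumHeavyAtoms() != b.GetNumHeavyAtoms():
        return False
    if a.GetNumBonds() != b.GetNumBonds():
        return False
    return a.HasSubstructMatch(b, useChirality=use_chirality)


galactose = Chem.MolFromSmiles('C([C@@H]1[C@@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O')
glucose = Chem.MolFromSmiles('C([C@@H]1[C@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O')

print(same_structure(glucose, glucose))
print(same_structure(glucose, galactose))
print(same_structure(glucose, galactose, use_chirality=False))
```

With chirality switched off, the helper cannot tell the two sugars apart, which mirrors the default behavior shown above.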
And here's the same thing for lactose:
In [21]: lactose = Chem.MolFromSmiles('C([C@@H]1[C@@H]([C@@H]([C@H]([C@@H](O1)O[C@@H]2[C@H](O[C@H]([C@@H]([C@H]2O)O)O)CO)O)O)O)O')
In [22]: len(lactose.GetSubstructMatches(glucose))
Out[22]: 2
In [23]: len(lactose.GetSubstructMatches(galactose))
Out[23]: 2
In [24]: len(lactose.GetSubstructMatches(glucose,useChirality=True))
Out[24]: 1
In [25]: len(lactose.GetSubstructMatches(galactose,useChirality=True))
Out[25]: 1
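To see which atoms of lactose belong to which sugar, the matches themselves can be inspected instead of just counted. This is a minimal sketch reusing the SMILES from the session above; `chiral_matches` is a hypothetical wrapper name, and the expectation that the two units share only the bridging oxygen is my reading of the structure, not something stated in the original session.

```python
from rdkit import Chem

galactose = Chem.MolFromSmiles('C([C@@H]1[C@@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O')
glucose = Chem.MolFromSmiles('C([C@@H]1[C@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O')
lactose = Chem.MolFromSmiles(
    'C([C@@H]1[C@@H]([C@@H]([C@H]([C@@H](O1)O[C@@H]2[C@H](O[C@H]([C@@H]([C@H]2O)O)O)CO)O)O)O)O'
)


def chiral_matches(target, query):
    """Atom-index tuples in target that match query, respecting chirality."""
    return target.GetSubstructMatches(query, useChirality=True)


glc_hits = chiral_matches(lactose, glucose)
gal_hits = chiral_matches(lactose, galactose)
print("glucose unit atom indices:  ", glc_hits)
print("galactose unit atom indices:", gal_hits)

# Atoms claimed by both units; expected to be just the glycosidic oxygen.
shared = set(glc_hits[0]) & set(gal_hits[0])
shared_symbols = [lactose.GetAtomWithIdx(i).GetSymbol() for i in shared]
print("shared atoms:", shared_symbols)
```

The index tuples follow the atom order of the query, so position k in a match corresponds to atom k of the glucose or galactose molecule.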
Does that help answer your question?
-greg