Re: [Rdkit-discuss] MCS module - bonding and hybridization in substructure search


Dear Greg,

Thank you very much for your reply. I will try to explain in more detail what I would like to achieve; I hope it clarifies things a little.

Let's look at your example first and treat the first molecule (CC=CNC) in ["CC=CNC", "C=CNC=CC"] as a "query": we would like to check whether it is an EXACT match within the second molecule ("C=CNC=CC").

Your example is a case of the "solution to the Liz Wylie problem" at its best.

["CC=CNC", "C=CNC=CC"] ==> CC=CN - so 'no' - no exact match! This is what we would expect from the current "solution to the Liz Wylie problem", and it is what I would consider "CORRECT" for my purposes.

The columns of the tables below are:
>>> bond_type, bond_start_atom, bond_start_atom_symbol, bond_start_atom_hyb, bond_end_atom, bond_end_atom_symbol, bond_end_atom_hyb

CC=CNC
SINGLE 0 C SP3 1 C SP2
DOUBLE 1 C SP2 2 C SP2
SINGLE 2 C SP2 3 N SP2
SINGLE 3 N SP2 4 C SP3

C=CNC=CC

DOUBLE 0 C SP2 1 C SP2
SINGLE 1 C SP2 2 N SP2
SINGLE 2 N SP2 3 C SP2
DOUBLE 3 C SP2 4 C SP2
SINGLE 4 C SP2 5 C SP3
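
For readers who want to reproduce tables like the two above, here is a minimal sketch. It assumes RDKit is installed; `bond_row` and `bond_table` are names made up for this illustration, and the hybridization values RDKit reports may differ between versions.

```python
def bond_row(bond):
    """Format one bond in the column order used by the tables above.

    `bond` is expected to behave like an RDKit Bond object.
    """
    begin, end = bond.GetBeginAtom(), bond.GetEndAtom()
    fields = (
        bond.GetBondType(),
        begin.GetIdx(), begin.GetSymbol(), begin.GetHybridization(),
        end.GetIdx(), end.GetSymbol(), end.GetHybridization(),
    )
    return " ".join(str(field) for field in fields)


def bond_table(smiles):
    """Return one formatted row per bond of the molecule given as SMILES."""
    from rdkit import Chem  # local import: only this function needs RDKit

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return [bond_row(bond) for bond in mol.GetBonds()]
```

With RDKit available, `bond_table("CC=CNC")` should return one row per bond in the same column order as the first table.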

In your example the hybridizations of the C atoms in the CNC fragment of the two molecules do not match, and the overall result is OK. In the first ("query") molecule, the first C of the CNC fragment is sp2 (it is connected to the neighbouring C, atom 1, via the DOUBLE bond), the N is sp2, but the last C is sp3 and has only SINGLE bonds. In the second molecule (C=CNC=CC) both carbons of the CNC fragment are sp2 AND both take part in DOUBLE bonds, unlike the "query" molecule, where one carbon has a DOUBLE bond and the other has only SINGLE bonds.
What I would like to do is check whether one structure is an exact match within the other: the atoms must match, the bonds must match, and the hybridization of each atom must match. The bonding is the most important criterion, though, and that is where the exceptions show up, because an sp2 atom can be bonded only via SINGLE bonds. Let me illustrate what I mean with a couple of examples.
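
The exceptions mentioned above (atoms listed as sp2 that have only SINGLE bonds) can be picked out directly from the bond tables. The following is a rough sketch in plain Python that works on the table text rather than on RDKit molecules; `parse_table` and `sp2_single_only` are hypothetical helper names.

```python
from collections import defaultdict

# Bond tables copied from the two molecules above.
QUERY = """\
SINGLE 0 C SP3 1 C SP2
DOUBLE 1 C SP2 2 C SP2
SINGLE 2 C SP2 3 N SP2
SINGLE 3 N SP2 4 C SP3
"""  # CC=CNC

TARGET = """\
DOUBLE 0 C SP2 1 C SP2
SINGLE 1 C SP2 2 N SP2
SINGLE 2 N SP2 3 C SP2
DOUBLE 3 C SP2 4 C SP2
SINGLE 4 C SP2 5 C SP3
"""  # C=CNC=CC


def parse_table(text):
    """Return (atoms, bonds): atoms maps index -> (symbol, hybridization),
    bonds is a list of (bond_type, begin_index, end_index)."""
    atoms, bonds = {}, []
    for line in text.strip().splitlines():
        btype, i, sym_i, hyb_i, j, sym_j, hyb_j = line.split()
        atoms[int(i)] = (sym_i, hyb_i)
        atoms[int(j)] = (sym_j, hyb_j)
        bonds.append((btype, int(i), int(j)))
    return atoms, bonds


def sp2_single_only(text):
    """Indices of atoms listed as SP2 whose bonds are all SINGLE."""
    atoms, bonds = parse_table(text)
    bond_types = defaultdict(set)
    for btype, i, j in bonds:
        bond_types[i].add(btype)
        bond_types[j].add(btype)
    return sorted(
        idx for idx, (_, hyb) in atoms.items()
        if hyb == "SP2" and bond_types[idx] == {"SINGLE"}
    )


print(sp2_single_only(QUERY))   # [3]  (the N atom of CC=CNC)
print(sp2_single_only(TARGET))  # [2]  (the N atom of C=CNC=CC)
```

In both molecules only the N atom is flagged, which is exactly the kind of atom the examples below are about.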

Examples to illustrate it:

Example 1, Ala-Ala dipeptide case:

CC(N)C(=O)NC(C)C(=O)O

SINGLE 0 C SP3 1 C SP3
SINGLE 1 C SP3 2 N SP3
SINGLE 1 C SP3 3 C SP2
DOUBLE 3 C SP2 4 O SP2
SINGLE 3 C SP2 5 N SP2
SINGLE 5 N SP2 6 C SP3
SINGLE 6 C SP3 7 C SP3
SINGLE 6 C SP3 8 C SP2
SINGLE 8 C SP2 9 O SP2
DOUBLE 8 C SP2 10 O SP2

Suppose I have two "query" molecules:

1) CC(N)C(N)=O
CC(N)C(N)=O
SINGLE 0 C SP3 1 C SP3
SINGLE 1 C SP3 2 N SP3
SINGLE 1 C SP3 3 C SP2
SINGLE 3 C SP2 4 N SP2
DOUBLE 3 C SP2 5 O SP2

["CC(N)C(N)=O", "CC(N)C(=O)NC(C)C(=O)O"] ==> CC(N)C(N)=O - so 'yes' - the exact match! And "CORRECT!"
2) CC(N)C(O)=O
CC(N)C(=O)O
SINGLE 0 C SP3 1 C SP3
SINGLE 1 C SP3 2 N SP3
SINGLE 1 C SP3 3 C SP2
SINGLE 3 C SP2 4 O SP2
DOUBLE 3 C SP2 5 O SP2
["CC(N)C(=O)O", "CC(N)C(=O)NC(C)C(=O)O"] ==> CC(N)C=O - so 'no' - no exact match! But for the result to be "CORRECT" it should be a match, because the fragment is there.

I would like to check whether the query molecules are an EXACT match within the Ala-Ala dipeptide CC(N)C(=O)NC(C)C(=O)O. With the current "solution to the Liz Wylie problem", only molecule 1) is found there; molecule 2) is not found in CC(N)C(=O)NC(C)C(=O)O because the hybridization of the N atom does not match (SP3 in the query, SP2 for the amide N in the dipeptide). I very much need the "solution to the Liz Wylie problem" to prevent matching atoms with different hybridizations, but at the same time I would like to ensure that if an atom happens to have sp2 hybridization while being bonded only by single bonds, then its hybridization state is less important and what really matters is its bonding.

Example 2:

C\C=C\NC1CCC1
CC=CNC1CCC1
SINGLE 0 C SP3 1 C SP2
DOUBLE 1 C SP2 2 C SP2
SINGLE 2 C SP2 3 N SP2
SINGLE 3 N SP2 4 C SP3
SINGLE 4 C SP3 5 C SP3
SINGLE 5 C SP3 6 C SP3
SINGLE 6 C SP3 7 C SP3
SINGLE 7 C SP3 4 C SP3

Two "query" molecules:

1) C\C=C\N
CC=CN
SINGLE 0 C SP3 1 C SP2
DOUBLE 1 C SP2 2 C SP2
SINGLE 2 C SP2 3 N SP2

["C\C=C\N", "C\C=C\NC1CCC1"] ==> C/C=C/N - so 'yes' - the exact match! And "CORRECT!"

This is an easy example - everything matches between the "query" and the molecule - the atoms, the bonding and the hybridization.

2) NC1CCC1
NC1CCC1
SINGLE 0 N SP3 1 C SP3
SINGLE 1 C SP3 2 C SP3
SINGLE 2 C SP3 3 C SP3
SINGLE 3 C SP3 4 C SP3
SINGLE 4 C SP3 1 C SP3
["NC1CCC1", "C\C=C\NC1CCC1"] ==> C1CCC1 - so 'no' - no exact match! But for the result to be "CORRECT" it should be a match.

What does not match is the hybridization of the N atom between the "query" and the C\C=C\NC1CCC1 molecule, and that is true. But in both the "query" and C\C=C\NC1CCC1 the bond types of the N atom match: both N atoms have only SINGLE bonds. For me, the bonding match is more important than the hybridization match.

Example 3:

The last example illustrates the hierarchy of matching criteria I need. It is a case where the atoms and hybridizations all match, but the result is "INCORRECT".

CC\N=N\C1=CCC1
CCN=NC1=CCC1
SINGLE 0 C SP3 1 C SP3
SINGLE 1 C SP3 2 N SP2
DOUBLE 2 N SP2 3 N SP2
SINGLE 3 N SP2 4 C SP2
DOUBLE 4 C SP2 5 C SP2
SINGLE 5 C SP2 6 C SP3
SINGLE 6 C SP3 7 C SP3
SINGLE 7 C SP3 4 C SP2

One "query" molecule:

1) NC1=CCC1

NC1=CCC1

SINGLE 0 N SP2 1 C SP2
DOUBLE 1 C SP2 2 C SP2
SINGLE 2 C SP2 3 C SP3
SINGLE 3 C SP3 4 C SP3
SINGLE 4 C SP3 1 C SP2

["NC1=CCC1", "CCN=NC1=CCC1"] ==> NC1=CCC1 - so 'yes' - exact match! But it is "INCORRECT".

Why? Even though the N atoms in the "query" and in CCN=NC1=CCC1 are all sp2, both N atoms in CCN=NC1=CCC1 take part in a DOUBLE bond, whereas the N atom in the "query" has only a SINGLE bond. So the bonding does not match and, as I mentioned earlier, the bonding has higher priority than the hybridization.
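
One possible reading of this hierarchy, sketched under assumptions: keep the hybridization check, but treat an sp2 atom that has only SINGLE bonds as if it were sp3 before comparing. The plain-Python sketch below applies that rule to the bond tables of Examples 1 and 3, using a small backtracking substructure check (not RDKit's MCS code). All function names and the `bonding_first` flag are invented for this illustration.

```python
from collections import defaultdict

ALA_ALA = """\
SINGLE 0 C SP3 1 C SP3
SINGLE 1 C SP3 2 N SP3
SINGLE 1 C SP3 3 C SP2
DOUBLE 3 C SP2 4 O SP2
SINGLE 3 C SP2 5 N SP2
SINGLE 5 N SP2 6 C SP3
SINGLE 6 C SP3 7 C SP3
SINGLE 6 C SP3 8 C SP2
SINGLE 8 C SP2 9 O SP2
DOUBLE 8 C SP2 10 O SP2
"""  # Example 1 target
ALANINE = """\
SINGLE 0 C SP3 1 C SP3
SINGLE 1 C SP3 2 N SP3
SINGLE 1 C SP3 3 C SP2
SINGLE 3 C SP2 4 O SP2
DOUBLE 3 C SP2 5 O SP2
"""  # Example 1, query 2
AZO = """\
SINGLE 0 C SP3 1 C SP3
SINGLE 1 C SP3 2 N SP2
DOUBLE 2 N SP2 3 N SP2
SINGLE 3 N SP2 4 C SP2
DOUBLE 4 C SP2 5 C SP2
SINGLE 5 C SP2 6 C SP3
SINGLE 6 C SP3 7 C SP3
SINGLE 7 C SP3 4 C SP2
"""  # Example 3 target
ENAMINE = """\
SINGLE 0 N SP2 1 C SP2
DOUBLE 1 C SP2 2 C SP2
SINGLE 2 C SP2 3 C SP3
SINGLE 3 C SP3 4 C SP3
SINGLE 4 C SP3 1 C SP2
"""  # Example 3 query


def parse_table(text):
    """atoms: index -> (symbol, hyb); bonds: frozenset({i, j}) -> bond type."""
    atoms, bonds = {}, {}
    for line in text.strip().splitlines():
        btype, i, sym_i, hyb_i, j, sym_j, hyb_j = line.split()
        i, j = int(i), int(j)
        atoms[i], atoms[j] = (sym_i, hyb_i), (sym_j, hyb_j)
        bonds[frozenset((i, j))] = btype
    return atoms, bonds


def labels(atoms, bonds, bonding_first):
    """Atom labels to compare. Assumed rule: with bonding_first=True an SP2
    atom that has only SINGLE bonds is relabelled SP3."""
    types = defaultdict(set)
    for pair, btype in bonds.items():
        for idx in pair:
            types[idx].add(btype)
    out = {}
    for idx, (sym, hyb) in atoms.items():
        if bonding_first and hyb == "SP2" and types[idx] == {"SINGLE"}:
            hyb = "SP3"
        out[idx] = (sym, hyb)
    return out


def is_substructure(query_text, target_text, bonding_first):
    """True if every query atom and bond can be mapped onto the target."""
    q_atoms, q_bonds = parse_table(query_text)
    t_atoms, t_bonds = parse_table(target_text)
    q_lab = labels(q_atoms, q_bonds, bonding_first)
    t_lab = labels(t_atoms, t_bonds, bonding_first)
    order = sorted(q_atoms)

    def extend(mapping):
        if len(mapping) == len(order):
            return True
        q = order[len(mapping)]
        for t in t_atoms:
            if t in mapping.values() or q_lab[q] != t_lab[t]:
                continue
            # Each query bond to an already mapped atom must exist in the
            # target with the same bond type.
            if all(t_bonds.get(frozenset((t, mapping[p]))) == q_bonds[frozenset((q, p))]
                   for p in mapping if frozenset((q, p)) in q_bonds):
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]
        return False

    return extend({})


print(is_substructure(ALANINE, ALA_ALA, bonding_first=False))  # False
print(is_substructure(ALANINE, ALA_ALA, bonding_first=True))   # True
print(is_substructure(ENAMINE, AZO, bonding_first=False))      # True
print(is_substructure(ENAMINE, AZO, bonding_first=True))       # False
```

With `bonding_first=False` the sketch gives the same verdicts as the hybridization-only comparison described in the examples; with `bonding_first=True` it gives the verdicts asked for here. In RDKit itself the adjusted label could presumably be written onto each atom in place of the plain hybridization label before running the MCS search, but that is untested here.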

I hope that this clarifies what I would like to achieve. I know that it is probably a highly non-standard and unusual problem, but I would really appreciate your help with it! Of course the examples I gave are purely for computational purposes and do not reflect the chemical stability of those molecules.
Thanks a lot once again!
Have a great Sunday!
Janusz Petkowski

________________________________
From: Greg Landrum [gre...@gm...]
Sent: Saturday, November 14, 2015 11:26 PM
To: Janusz Petkowski
Cc: rdk...@li...
Subject: Re: [Rdkit-discuss] MCS module - bonding and hybridization in substructure search

Hi Janusz,

I'm not 100% sure what you're looking for, but I think it has something to do with including information about bond conjugation in the MCS procedure.

To confirm, can you please give a couple of examples of what you would like to have as output from the algorithm? Something like this, with the input molecules on the left and the desired result on the right, would help:
['CNC=CC', 'C=CNC=CC'] -> 'CNC=CC'
(I realize that specific example is not what you're looking for, it's just intended to be an example)

Once I've seen that I can try to figure out if it is currently doable and, if not, if it's possible to modify the code to support it.

Best,
-greg

On Fri, Nov 13, 2015 at 9:17 PM, Janusz Petkowski <jjp...@mi...> wrote:
Dear RDKit Community,

I am looking for a way to use the MCS module in RDKit to compare the atoms and bonding of two molecules while also taking the hybridization of each atom into consideration.
A solution to a similar problem was suggested before (inspired by this RDKit-discuss thread started by Liz Wylie: http://www.mail-archive.com/rdk...@li.../msg03676.html; see also http://sourceforge.net/p/rdkit/mailman/message/31830412/ ),
but even if it is computationally correct, it does not necessarily mirror some nuances of chemistry, and one may want to modify it in certain specific cases.
It works most of the time for cases like those proposed in the solution to the Liz Wylie case:

smis = ['CC(C)=C','CC(C)C']
 or

smis2 = ['CC(C)=C','CC(C)=N']
If we check whether the 'CCC' substructure is present in the molecules from those two data sets, then with Greg Landrum's solution CCC will be found only in 'CC(C)C', taking into account the atoms, the bonding and the hybridization of the atoms. It is all correct and cool!

But let's look at the other example:
Let's look for the N\CC\N substructure in 'C\C=C\NCCN\C=C\C', or for the 'NCN' substructure in NCN-C=C or 'C=CNCNC=C'. It will not be found there even if, "structurally speaking", it is there.
The problem is as follows: an electronegative atom next to a C=C bond will pull electron density from that bond, so the N-C bond in NCN-C=C will have a 'bit of' double-bond character, even if technically it is a single bond. The current solution to the Liz Wylie problem does not ignore that: it distinguishes between a regular N-C bond and an N-C bond next to a C=C bond (as in NCN-C=C), and because of that it will not find NCN in this structure. NCS in NCSC=C is matched because the S atom is more electropositive than N or O and so the bond does not have that double-bond character.
My question to the RDKit community is: how can Greg Landrum's solution to the Liz Wylie case be modified to successfully match the cases mentioned above while still retaining the hybridization check? (We do want the hybridization to match; we just want the bonding to be more important.) The problem is that the atoms that are not matched, like the N atoms above, have sp2 hybridization but technically are bonded by single bonds on all sides.
Thanks a lot for your help, time and consideration. This is my first post on the RDKit forum, and I am new to RDKit and Python in general, so I apologize if anything is not clear.
I would really appreciate your help!

Best regards,

Janusz Petkowski

------------------------------------------------------------------------------

_______________________________________________
Rdkit-discuss mailing list
Rdk...@li...
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
