Re: [Rdkit-discuss] delete a substructure
From: Pavel P. <pav...@uk...> - 2017-03-09 06:47:34
You might find this link useful:
http://www.rdkit.org/docs/GettingStartedInPython.html#chemical-transformations

However, the issue in your case is the SMARTS definitions. If one SMARTS
pattern completely covers another, it is difficult to tell whether the
overlap is an artifact or not. I think it would be reasonable to revise
the SMARTS to avoid such overlaps, or to create a list of rules (maybe
hierarchical) that defines which overlaps are valid and which are not.

Pavel.

On 03/08/2017 06:32 PM, Chenyang Shi wrote:
> Dear Hongbin,
>
> I tried your method on a molecule, 4-Methylsalicylic acid
> (CC1=CC(=C(C=C1)C(=O)O)O). I looped through all groups defined in
> the Joback method (using SMARTS), and used m.GetSubstructMatches to
> print out all atom positions. The result is summarized in the table.
>
> We can see there are duplicated counts coming from the COOH group. As
> suggested by Hongbin, we can remove duplicated atoms by looking at
> their positions: in this case, ((9,),), ((7,8),), ((7,),), and ((8,),)
> are subsets of ((7,8,9),) from -COOH. Indeed we can get rid of these
> duplicates. However, I also noticed that atom (3,) from the =C< (ring)
> group is also a part of -OH (phenol) ((10,3),). If we apply the same
> algorithm to remove duplicates, the =C< (ring) group will only be
> counted twice instead of three times.
>
> Greg, you mentioned as an alternative that I can delete a substructure
> using the chemical reaction method. It would be greatly appreciated if
> you could show me (or point me to) a simple code example, perhaps on a
> simple molecule? I find myself at a loss when browsing the manual. I
> would like to try that direction as well.
>
> Thanks,
> Chenyang
>
> Inline image 1
>
> On Mon, Mar 6, 2017 at 1:52 AM, Greg Landrum <gre...@gm...> wrote:
>
> The solution that Hongbin proposes to the double-counting problem
> is a good one. Just be sure to sort your substructure queries in
> the right order so that the more complex ones come first.
>
> Another thing you might think about is making your queries more
> specific. For example, as you pointed out, "[OH]" is very general
> and matches parts of carboxylic acids and a number of other
> functional groups. The RDKit has a set of fairly well tested
> (though certainly not perfect) functional group definitions in
> $RDBASE/Data/Functional_Group_Hierarchy.txt. The alcohol
> definition from there looks like this:
>     [O;H1;$(O-!@[#6;!$(C=!@[O,N,S])])]
>
> -greg
>
> On Mon, Mar 6, 2017 at 7:20 AM, 杨弘宾 <yan...@16...> wrote:
>
> Hi, Chenyang,
> You don't need to delete the substructure from the molecule.
> Just check whether the mapped atoms have already been matched. For
> example:
>
>     from rdkit import Chem
>
>     m = Chem.MolFromSmiles('CC(=O)O')
>     OH = Chem.MolFromSmarts('[OH]')
>     COOH = Chem.MolFromSmarts('C(O)=O')
>
>     m.GetSubstructMatches(OH)
>     >>((3,),)
>     m.GetSubstructMatches(COOH)
>     >>((1, 3, 2),)
>
> Since atom "3" has already been matched, it should be ignored.
> So you can create a "set" to record the matched atoms and avoid
> counting them repeatedly.
>
> ------------------------------------------------------------------------
> Hongbin Yang 杨弘宾
>
> *From:* Chenyang Shi <cs...@co...>
> *Date:* 2017-03-06 14:04
> *To:* Greg Landrum <gre...@gm...>
> *CC:* RDKit Discuss <rdk...@li...>
> *Subject:* Re: [Rdkit-discuss] delete a substructure
>
> Hi Greg,
>
> Thanks for the prompt reply. I did try "GetSubstructMatches()" and
> it returns the correct number of substructures for CH3COOH. The
> potential problem with this approach is that as the molecule gets
> more complicated, it may count certain functional groups more than
> once. For example, an --OH (alcohol) group will likely also be
> counted as part of --COOH. A safer way, in my mind, is to remove
> each substructure once it has been counted.
>
> Greg, you mentioned "chemical reaction functionality". Can you show
> me a demo script with that, using CH3COOH as an example?
> I will definitely delve into the manual to learn more, but reading
> your code will be a good start.
>
> Thanks,
> Chenyang
>
> On Sun, Mar 5, 2017 at 10:15 PM, Greg Landrum <gre...@gm...> wrote:
>
> Hi Chenyang,
>
> If you're really interested in counting the number of times the
> substructure appears, you can do that much quicker with
> `GetSubstructMatches()`:
>
>     In [2]: m = Chem.MolFromSmiles('CC(C)CCO')
>     In [3]: len(m.GetSubstructMatches(Chem.MolFromSmarts('[CH3;X4]')))
>     Out[3]: 2
>
> Is that sufficient, or do you actually want to sequentially remove
> all of the groups in your list?
>
> If you actually want to remove them, you are probably better off
> using the chemical reaction functionality instead of
> DeleteSubstructs(), which recalculates the number of implicit Hs on
> atoms after each call.
>
> -greg
>
> On Mon, Mar 6, 2017 at 4:21 AM, Chenyang Shi <cs...@co...> wrote:
>
> I am new to rdkit but I am already impressed by its vibrant
> community. I have a question regarding deleting a substructure. In
> the RDKit documentation, this snippet of code describes how to
> delete a substructure:
>
>     >>> from rdkit import Chem
>     >>> from rdkit.Chem import AllChem
>     >>> m = Chem.MolFromSmiles("CC(=O)O")
>     >>> patt = Chem.MolFromSmarts("C(=O)[OH]")
>     >>> rm = AllChem.DeleteSubstructs(m, patt)
>     >>> Chem.MolToSmiles(rm)
>     'C'
>
> This block of code first loads the molecule CH3COOH from its SMILES
> string, then defines the substructure COOH as a SMARTS pattern,
> which is to be deleted. After the final line of code, the program
> outputs 'C', in SMILES form.
>
> I wanted to develop a method for counting the groups in a molecule.
> In the CH3COOH case, I can count the --CH3 and --COOH groups by
> using their respective SMARTS patterns with no problem. However,
> when the molecule becomes more complicated, it is preferable to
> delete each substructure that has been searched for before moving
> on to the next SMARTS search.
> In the current case, however, after searching for the -COOH group
> and deleting it, the leftover is 'C', which is essentially CH4
> instead of --CH3. I cannot proceed with the SMARTS search for
> --CH3 ([CH3;A;X4!R]).
>
> Is there any way to work around this?
> Thanks,
> Chenyang
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdk...@li...
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
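
The rule-list idea suggested at the top of this message could be sketched roughly as follows, using the 4-methylsalicylic acid example from the quoted thread. Everything in it is an assumption made for illustration: the `count_groups` helper, the layout of `rules`, and the "owned positions" convention are hypothetical and are not part of RDKit or the Joback method. The match tuples are hardcoded in the shape that `GetSubstructMatches()` returns, so the sketch runs without RDKit. The -COOH and phenol matches are the ones quoted in the thread; the ring-carbon and `[OH]` matches were worked out by hand from the SMILES.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

Match = Tuple[int, ...]
# (group name, matches, positions within each match that the group "owns").
# owned=None means the group owns every atom in its match.
# This rule layout is a hypothetical convention, not an RDKit feature.
Rule = Tuple[str, Sequence[Match], Optional[Sequence[int]]]


def count_groups(rules: Sequence[Rule]) -> Dict[str, int]:
    """Count groups in priority order, skipping matches whose owned atoms are taken."""
    claimed: Set[int] = set()
    counts: Dict[str, int] = {}
    for name, matches, owned in rules:
        counts[name] = 0
        for match in matches:
            # Only the owned atoms take part in the overlap check.
            atoms = match if owned is None else tuple(match[i] for i in owned)
            if claimed.isdisjoint(atoms):
                claimed.update(atoms)
                counts[name] += 1
    return counts


# Atom indices for CC1=CC(=C(C=C1)C(=O)O)O; more specific groups come first.
rules: List[Rule] = [
    ("-COOH", ((7, 8, 9),), None),
    # Phenol matches (O, ring C) but owns only the oxygen (position 0),
    # so ring atom 3 stays available to the =C< (ring) group.
    ("-OH (phenol)", ((10, 3),), (0,)),
    ("=C< (ring)", ((1,), (3,), (4,)), None),
    # Generic hydroxyl: both oxygens are already claimed above.
    ("[OH]", ((9,), (10,)), None),
]

counts = count_groups(rules)
print(counts)
```

With real molecules the hardcoded tuples would come from calling `m.GetSubstructMatches(pattern)` for each SMARTS pattern. Letting the phenol rule own only its oxygen is what allows the overlap with the ring carbon, while the generic hydroxyl, which comes last, is rejected because both of its atoms are already claimed.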