Re: [Rdkit-discuss] Cleaning SD files
From: Greg L. <gre...@gm...> - 2010-09-17 03:20:25
Dear James,

On Thu, Sep 16, 2010 at 8:01 PM, James Davidson <J.D...@ve...> wrote:
>
> I have attached the python-script that I have at the moment (a) in case it
> is of some use to anybody else, (b) in the hope that I can improve my python
> and rdkit abilities through any suggested alterations (I'm sure there are
> many!), and (c) to form the basis of a couple of questions. At the moment,
> the script is just running through each compound; checking if the molecule
> is valid; and if so, noting how many components, and whether any of the
> atoms are outside of the desired list. These two results are then written
> out to a new SDF. I am then using this to make sure my data-set contains
> only compounds that I would say are 'reasonable' to build a melting-point
> model with. Now for the questions:

Thanks for sending along the script. I haven't been through it yet but
I will try and find some time later for that.

> 1. In RDKit, has the 'cleaning / washing / salt-stripping' of molecules
> already been formalised based on a set of rules, etc?

Not that I'm aware of on the open-source side of things. All of the
functionality required to do this is, I believe, present in the RDKit
though.

> 2. When identifying compounds that contain a non-allowed atom-type, why do
> I find the SMARTS def [!H;!C;!N;!O;!F;!S;!Cl;!Br;!I] gives unexpected
> results, but [!#1;!#6;!#7;!#8;!#9;!#16;!#17;!#35;!#53] works as I would
> expect?

This is a fairly common SMARTS "gotcha": in SMARTS the query "[C]"
means "aliphatic C". This leads to the following behavior:

  [3]>>> Chem.MolFromSmiles('c1ccccc1').GetSubstructMatches(Chem.MolFromSmarts('[!C]'))
  Out[3] ((0,), (1,), (2,), (3,), (4,), (5,))

If you want to be sure that your SMARTS will capture aliphatic or
aromatic atoms, you need to provide the atomic numbers, as in your
second query:

  [4]>>> Chem.MolFromSmiles('c1ccccc1').GetSubstructMatches(Chem.MolFromSmarts('[!#6]'))
  Out[4] ()

Best Regards,
-greg
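
For anyone who wants to try the two checks discussed in this message (number of components, and whether any atom falls outside the allowed element list), here is a minimal sketch. It is not the script that was attached to the original post. The names `ALLOWED_ATOMIC_NUMS`, `DISALLOWED` and `check_molecule` are made up for illustration, and the sketch reads SMILES strings rather than an SD file to keep it short. The query is built from atomic numbers so that it behaves the same for aliphatic and aromatic atoms.

```python
from rdkit import Chem

# Allowed elements from the question, as atomic numbers:
# H, C, N, O, F, S, Cl, Br, I
ALLOWED_ATOMIC_NUMS = (1, 6, 7, 8, 9, 16, 17, 35, 53)

# Builds "[!#1;!#6;!#7;...]": matches any atom whose element is not in the
# allowed list, regardless of whether the atom is aliphatic or aromatic.
DISALLOWED = Chem.MolFromSmarts(
    "[" + ";".join(f"!#{n}" for n in ALLOWED_ATOMIC_NUMS) + "]"
)


def check_molecule(smiles):
    """Return (n_components, has_disallowed_atom), or None if unparsable.

    Hypothetical helper for illustration only.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None for input it cannot parse or sanitize
        return None
    n_components = len(Chem.GetMolFrags(mol))
    return n_components, mol.HasSubstructMatch(DISALLOWED)


# The "gotcha" from the message: "[!C]" means "not aliphatic carbon",
# so it matches every aromatic carbon in benzene.
benzene = Chem.MolFromSmiles("c1ccccc1")
n_symbol = len(benzene.GetSubstructMatches(Chem.MolFromSmarts("[!C]")))
n_atomic_num = len(benzene.GetSubstructMatches(Chem.MolFromSmarts("[!#6]")))
print(n_symbol, n_atomic_num)  # 6 0

print(check_molecule("c1ccccc1"))          # (1, False)
print(check_molecule("CC(=O)[O-].[Na+]"))  # (2, True)
```

With these two values per record it is straightforward to keep only single-component molecules that contain no disallowed element, which is the filtering described in the question.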