Thread: [Rdkit-discuss] RDKit Descriptors
From: Robert D. <rkd...@gm...> - 2008-09-17 15:58:23
I've finally found time to start using RDKit and started with descriptor calculation. Following the examples on the wiki (http://code.google.com/p/rdkit/wiki/DescriptorsInTheRDKit), I get a KeyError any time I attempt to obtain HeavyAtomCount, RingCount, PEOP_VSA, SMR_VSA, Slogp_VSA, EState_VSA, and VSA_Estate. (BTW, what is the difference between the two last VSA descriptors?)

-Kirk DeLisle
From: Greg L. <gre...@gm...> - 2008-09-17 22:36:08
Dear Kirk,

On Thu, Sep 18, 2008 at 12:58 AM, Robert DeLisle <rkd...@gm...> wrote:
> I've finally found time to start using RDKit and started with descriptor
> calculation. Following the examples on the wiki
> (http://code.google.com/p/rdkit/wiki/DescriptorsInTheRDKit), I get a
> KeyError any time I attempt to obtain HeavyAtomCount, RingCount,

HeavyAtomCount and RingCount were introduced after the May release, so for now they are only in the Subversion version of the code. They will be in the Q3 release (which will happen sometime in the next couple of weeks, hopefully).

> PEOP_VSA,
> SMR_VSA, Slogp_VSA, EState_VSA, and VSA_Estate.

The various X_VSA descriptors are vector-valued and you access them by element, so you could ask for PEOE_VSA4 or Slogp_VSA10.

> (BTW, what is the
> difference between the two last VSA descriptors?)

The "standard" VSA descriptors sum atomic VSA contributions into bins determined by the other descriptor. So, for example, SMR_VSA sums atomic contributions to the VSA, using bins determined by the atomic contributions to the SMR. EState_VSA is the same; it just bins by atomic EState values. VSA_EState is reversed: atomic EState values are summed into bins determined by the VSA contributions.

Best Regards,
-greg
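To make the difference between the two binning directions concrete, here is a minimal pure-Python sketch. It does not call RDKit: the per-atom values, the bin edges, and the helper name binned_sum are all made up for illustration, and the edge convention (bisect_right) is an assumption rather than a statement about what RDKit does internally.

```python
from bisect import bisect_right


def binned_sum(bin_by, summed, edges):
    """Sum each `summed` value into the bin selected by the matching `bin_by` value.

    `edges` are ascending bin boundaries; len(edges) + 1 bins are returned.
    (Hypothetical helper, not part of RDKit.)
    """
    totals = [0.0] * (len(edges) + 1)
    for key, value in zip(bin_by, summed):
        totals[bisect_right(edges, key)] += value
    return totals


# Hypothetical per-atom values for a three-atom fragment (illustration only).
vsa = [4.0, 6.0, 5.0]     # per-atom surface-area contributions
estate = [1.5, 9.0, 1.0]  # per-atom EState values

estate_edges = [2.0]  # hypothetical bin boundary on the EState axis
vsa_edges = [4.5]     # hypothetical bin boundary on the VSA axis

# EState_VSA style: VSA is summed, EState picks the bin.
estate_vsa = binned_sum(estate, vsa, estate_edges)

# VSA_EState style: EState is summed, VSA picks the bin.
vsa_estate = binned_sum(vsa, estate, vsa_edges)

print(estate_vsa)  # [9.0, 6.0]
print(vsa_estate)  # [1.5, 10.0]
```

The same helper serves both cases; only the roles of the two per-atom lists are swapped, which is the whole distinction between EState_VSA and VSA_EState as described above.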
From: Robert D. <rkd...@gm...> - 2008-09-18 14:07:01
Greg,

Thank you for the response.

I was able to get PEOE_VSA1 through PEOE_VSA14, SMR_VSA1 through SMR_VSA10, and EState_VSA1 through EState_VSA11 working. Are these the correct limits on the vector components?

I was unable, however, to get Slogp_VSA or VSA_EState working with any integer suffix between 1 and 10.

I've also done a correlation analysis on all the descriptors that I've gotten working. After computing descriptors for some 24,000 compounds I removed those with less than 10% variance and limited correlations between variables to a maximum of 0.85 (using KNIME). I'm happy to send a list of the resulting descriptors or a correlation matrix if you or anyone else is interested.
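For readers who want to try a similar descriptor reduction outside KNIME, the following is a rough pure-Python sketch of a variance filter followed by a greedy pairwise-correlation filter. It is not KNIME's algorithm and not the exact procedure used above: the function names, the toy table, the absolute variance threshold, and the "keep the first of any correlated pair" rule are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def filter_descriptors(table, min_variance, max_abs_corr=0.85):
    """Return descriptor names that survive both filters, in input order.

    A column is dropped if its population variance is <= min_variance, or if
    its absolute correlation with an already-kept column exceeds max_abs_corr.
    """
    kept = []
    for name, values in table.items():
        if pvariance(values) <= min_variance:
            continue
        if all(abs(pearson(values, table[k])) <= max_abs_corr for k in kept):
            kept.append(name)
    return kept


# Toy descriptor table (made-up numbers): column name -> value per compound.
table = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.1],  # nearly a multiple of A
    "C": [1.0, 1.0, 1.0, 1.0],  # constant
    "D": [4.0, 1.0, 3.0, 2.0],
}

kept = filter_descriptors(table, min_variance=0.1)
print(kept)  # ['A', 'D']
```

Because the filter is greedy, the result depends on column order: of two highly correlated descriptors, whichever comes first is kept.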
From: Greg L. <gre...@gm...> - 2008-09-18 21:20:54
On Thu, Sep 18, 2008 at 11:07 PM, Robert DeLisle <rkd...@gm...> wrote:
> Greg,
>
> Thank you for the response.
>
> I was able to get PEOE_VSA1 through PEOE_VSA14, SMR_VSA1 through SMR_VSA10,
> and EState_VSA1 through EState_VSA11 working. Are these the correct limits
> on the vector components?

Yes. Just in case you used a more painful approach, here's the simplest way to check (without looking at the source in $RDBASE/Python/Chem/MolSurf.py):

    [17] >>> [x for x in AvailDescriptors.descDict.keys() if x.find('PEOE_VSA')!=-1]
    Out[17]:
    ['PEOE_VSA14', 'PEOE_VSA13', 'PEOE_VSA12', 'PEOE_VSA11', 'PEOE_VSA10',
     'PEOE_VSA8', 'PEOE_VSA7', 'PEOE_VSA6', 'PEOE_VSA5', 'PEOE_VSA4',
     'PEOE_VSA3', 'PEOE_VSA2', 'PEOE_VSA1', 'PEOE_VSA9']

> I was unable, however, to get Slogp_VSA or VSA_EState working with any
> integer suffix between 1 and 10.

That's strange. What errors were you getting?

> I've also done a correlation analysis on all the descriptors that I've
> gotten working. After computing descriptors for some 24,000 compounds I
> removed those with less than 10% variance and limited correlations between
> variables to a maximum of 0.85 (using KNIME). I'm happy to send a list of
> the resulting descriptors or a correlation matrix if you or anyone else is
> interested.

Sounds interesting. If you are willing, I would be happy to put this on the wiki, linked from the descriptors page. It would be best if you could also describe the source of the 24K compounds (or provide SMILES for them).

-greg
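Dictionary keys come back in arbitrary order, as the listing above shows, so it can help to sort the components by their numeric suffix when checking the limits of a vector-valued descriptor. A small sketch follows; it runs on a plain list of names copied from the output above (plus two extra names added only to show the filtering), so it does not need RDKit, and the helper name vector_components is hypothetical.

```python
import re


def vector_components(names, prefix):
    """Return names of the form <prefix><integer>, sorted by the integer suffix."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)$")
    found = []
    for name in names:
        match = pattern.match(name)  # anchored at the start of the name
        if match:
            found.append((int(match.group(1)), name))
    return [name for _, name in sorted(found)]


# Stand-in for the descriptor dictionary's keys: the PEOE_VSA names from the
# listing above, plus two unrelated names added for illustration.
names = [
    'PEOE_VSA14', 'PEOE_VSA13', 'PEOE_VSA12', 'PEOE_VSA11', 'PEOE_VSA10',
    'PEOE_VSA8', 'PEOE_VSA7', 'PEOE_VSA6', 'PEOE_VSA5', 'PEOE_VSA4',
    'PEOE_VSA3', 'PEOE_VSA2', 'PEOE_VSA1', 'PEOE_VSA9',
    'SMR_VSA1', 'MolWt',
]

components = vector_components(names, 'PEOE_VSA')
print(components[0], components[-1], len(components))  # PEOE_VSA1 PEOE_VSA14 14
```

Unlike the substring test with find(), the pattern is anchored at both ends and the match is case-sensitive, so a prefix such as 'EState_VSA' would not also pick up 'VSA_EState' names, and a prefix typed with different capitalization from the stored key returns nothing.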