From: joerg.wegner <joe...@we...> - 2006-02-22 03:44:08
Hi again, hi @JOELib users,

After a long time ... (still busy times) ... a bug fix release.

> We would be grateful if the participants would submit their favorite set of
> descriptors, and share it with all CoEPrA competitors.
> In such cases we might include the descriptors in the CoEPrA task.

That is marvellous, and sure ... see further below how to calculate JOELib2 features: around 5000 eigenvalue, RDF, autocorrelation, and complexity features (already at different smoothing levels).

But with respect to interpretability you should use the option to calculate counting SMARTS, e.g. simply write a script which generates linear or spherical patterns and forward them to OpenBabel or JOELib. OpenBabel is much faster at that, because JOELib has some mining prototyping features which slow things down. E.g. you can read and assign any data type to features in SDF files. The statistics will then 'recognize' features automatically, which makes it quite easy to use any special atom, bond, or other properties.

People who would rather just use the chemical expert system and the atom and bond properties have the choice of 53 atom properties at different smoothing levels, and also 9 different bond properties.

> The most interesting characteristic of SAR/QSAR models is the
> identification of the relevant descriptors, and in order to compare
> different types of descriptors, we are preparing several SAR/QSAR datasets
> that contain the chemical structures.

Well ... I still disagree ... it is ONE interesting part, but not the only one. The other is to create a good hypothesis language or algorithm, and this can, but need NOT, be based on feature vectors.

> However, I do not advocate the sole use of a consolidated set of
> descriptors... physical meaning and discrimination ability of atom types

Again: No-Free-Lunch (for optimization) ... so I agree.
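The pattern-generating script mentioned above could look like this minimal sketch: it enumerates linear carbon-chain counting SMARTS of increasing length. The output file name linear_smarts.txt, the chain lengths 2-5, and the plain [#6] carbon primitive are illustrative assumptions, not part of the mail.

```shell
#!/bin/sh
# Illustrative sketch: generate linear carbon-chain SMARTS patterns
# of length 2 to 5, one per line, into linear_smarts.txt.
: > linear_smarts.txt
pattern="[#6]"
for length in 2 3 4 5; do
  pattern="${pattern}[#6]"            # grow the chain by one carbon atom
  echo "$pattern" >> linear_smarts.txt
done
cat linear_smarts.txt
```

The resulting file can then be fed to OpenBabel or JOELib as a counting-SMARTS feature set.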
You seem to prefer the top-down way, I prefer the bottom-up way ... even after years it is still difficult to say what is really important, and especially what has the highest generalization ability. I am especially referring to the bias-variance decomposition. And if I build 1001 single models and merge them (a kind of user-boosting algorithm), that will cause overfitting by users. This is not proven, but my impression is still that we should avoid that kind of overfitting.

Finally (AtomType relevance): an activity need not be a connected part of a structure or explainable by a single atom type, so even with expert knowledge the underlying subgraph isomorphism problem cannot be avoided (top-down or bottom-up? ;-). And yes, the AtomType coding is incredibly good, but there is still room for improvement in the chemical expert systems, which are in the end responsible for assigning atom types.

<snip/>

How to calculate JOELib2 features?

Download the release from today: http://sourceforge.net/projects/joelib

Linux: you are fine
Windows: please install Cygwin ;-)

Change to the joelib2 base directory:

cd joelib2-2006-02-22

Csh:  setenv JOELIB2 `pwd`
Bash: export JOELIB2=`pwd`

Create SMARTS features any way you like ... educated guessing or combinatorial ... and store them in a file 'mySMARTS.txt', e.g.

[CX4H3][#6]
[CX4H2]([#6])[#6]
[CX4H1]([#6])([#6])[#6]
[CX4]([#6])([#6])([#6])[#6]
[CX3;$([H2]),$([H1][#6]),$(C([#6])[#6])]=[CX3;$([H2]),$([H1][#6]),$(C([#6])[#6])]
[CX2]#[CX2]
[CX3]=[CX2]=[CX3]
[ClX1][CX4]
[FX1][CX4]
[BrX1][CX4]
[IX1][CX4]

Test the installation:

sh featureCalculation.sh +ap +countSMARTS +binarySMARTS +jcc +SSKey src/resources/multiple.mol multiple_features.sdf SMARTS_InteLigand_veryshort.txt > logging.txt
sh featureStatistic.sh multiple_features.sdf > logging-stat.txt

This will produce two files, a statistics file and a binning file. Those files will be generated only once per SDF file, so they can be used in a larger mining environment.
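As a small example of the "combinatorial" route, the four halogen-on-sp3-carbon patterns from the list above can be generated instead of typed out. This is only a sketch; the loop over the four halogens is the only assumption, and the file name mySMARTS.txt follows the mail.

```shell
#!/bin/sh
# Illustrative sketch: build the [FX1][CX4], [ClX1][CX4], [BrX1][CX4],
# and [IX1][CX4] patterns combinatorially and append them to mySMARTS.txt.
: > mySMARTS.txt
for hal in F Cl Br I; do
  echo "[${hal}X1][CX4]" >> mySMARTS.txt  # terminal halogen bound to sp3 carbon
done
cat mySMARTS.txt
```

The same loop idea scales to larger substituent or atom-type alphabets, which is exactly where hand-typing patterns stops being practical.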
If you have special features, add them to
Joelib2/src/joelib2/src/data/plain/knowResults.txt
and add their data format to the appropriate section (regular expressions are allowed).

BTW, a molecule-specific Weka mining class is already part of the JOELib release. Flat-file support is also available if you want to create matrix files. And the Matlab interface might also be interesting for some users, but I recommend going for flat files instead, because the Matlab interface was designed for the old JOELib(1) version.

If you need any help, write me an e-mail or contact the mailing lists:
joe...@li...
joe...@li...

Best regards,
Joerg

-----Original Message-----
From: grbounce-nY5BowUAAAB926gH-oE5s-8BGcxAvbeI=joerg.wegner=web.de@googlegroups.com [mailto:grbounce-nY5BowUAAAB926gH-oE5s-8BGcxAvbeI=joerg.wegner=web.de@googlegroups.com] On behalf of co...@gm...
Sent: Wednesday, 22 February 2006 00:41
To: CoEPrA
Subject: CoEPrA and SAR/QSAR datasets

Dear CoEPrA participants,

I want to further elaborate on the composition of the SAR/QSAR datasets, because it seems that there is great interest in computing and comparing QSAR descriptors. Here I use "SAR/QSAR" with the meaning of "property prediction from chemical structures and structural descriptors", in order to include not only biological activities, but also physico-chemical properties.

For some datasets we are constrained by the nature of the compounds to provide only calculated descriptors. This is due to their origin (databases, corporate compounds, copyright problems, or the fact that the compounds are taken from well-known databases). In the case of proprietary compounds or copyrighted information, the situation is clear: we cannot provide the chemical structures, so there is nothing that we can do here. We use such datasets in CoEPrA because of (a) their value in data modeling or (b) the importance of the problem (for example, drug/non-drug classification).
If you know such datasets that might be suitable for the CoEPrA competition, please contact me at oii...@ut... or ie...@ya....

Regarding the datasets taken from well-known databases, we cannot provide the chemical structures because doing so would make the "blind prediction" character of the CoEPrA competition impossible. For example, compounds from the NCI database (or similar databases) will be given only as computed descriptors.

The most interesting characteristic of SAR/QSAR models is the identification of the relevant descriptors, and in order to compare different types of descriptors, we are preparing several SAR/QSAR datasets that contain the chemical structures. For peptide datasets we will provide only the sequences, while the chemical structures will be provided in a molecular format that can be read with OpenBabel. This format can be SMILES (i.e., no coordinates) or a format that contains 3D coordinates.

Based on the chemical structures provided in a CoEPrA task, the participants can generate 3D structures (if only SMILES codes are given, for example), generate conformers, use QM/MM/MD programs to optimize the geometry, align the molecules, or perform any computation on the molecular structure. Also, the participants can compute any set of structural descriptors and use it to generate QSAR models. However, "black box" descriptors are not allowed, because we want to learn something from these "blind predictions", not only to maximize the AUC or Q2.

The scope of the "blind predictions" in CoEPrA is to test as many hypotheses as possible, and this is why we would be interested to see comparative studies for various classes of descriptors, such as: "counts of atom types" versus "path counts" versus "autocorrelation vectors" versus "3D pharmacophores" versus "subgraph counts" versus "quantum indices" versus "molecular field descriptors", and so on.
Of course, all the above descriptors can be consolidated into a large set of descriptors, which should also be tested as a reference for QSAR models with individual classes of descriptors. However, I do not advocate the sole use of a consolidated set of descriptors, because:
(a) some sets of descriptors have a physical meaning, such as LSER or TLSER descriptors;
(b) some classes of descriptors are the result of a QSAR theory (i.e., atom types can discriminate between drug/non-drug compounds);
(c) the comparative evaluation of different classes of descriptors is lost.

Best regards,
Ovidiu

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "CoEPrA" group.
To post to this group, send email to Co...@go...
To unsubscribe from this group, send email to CoE...@go...
For more options, visit this group at http://groups.google.com/group/CoEPrA
-~----------~----~----~----~------~----~------~--~---