From: Joerg K. W. <we...@in...> - 2003-11-27 09:08:34
Hi Wayne,

> I grabbed JOELib-bin-20031117 and some data from the nci database, as I
> wanted to compare property calculations from JOELib to other
> calculators. I queried for all structures that had an experimental log
> P associated with them, 3576 structures came back.
>
> I saw a number of discrepancies between what's calculated by JOELib and
> what was recorded in the NCI dataset. For example :
>
> - In 16% of the cases # of rotational bonds disagreed
>
> - In ~29% of the cases, Number_of_HBA2 disagreed with the number of
> acceptors in the nci database
>
> - In ~1% of the cases, Number_of_HBD1 disagreed with the number of
> donors in the nci database
>
> - JOELib logP had a correlation of 0.64 with the experimental log P
> values; KOW & acd labs predictions in nci had correlations of 0.98 and 0.92
>
> Below is a snippet of the code I'm using - before I look any further at
> the reasons for the differences noted above, could you let me know if
> I'm using the library correctly?

Sounds reasonable!!!

I have actually submitted two 'model-building papers' which include the
following sentences:

'For comparing models it should be guaranteed that the descriptors are
using all the same atom typer, aromaticity- and hybridization-model.
Because many programs use text definitions for the atom types
[JOELib,OpnBabel] we recommend to use the same definitions or the same
data processing workflow to avoid bad prediction results for new molecules.'

As already mentioned several times, the descriptor calculation process
is the LAST step, after four expert systems have been processed:
http://www-ra.informatik.uni-tuebingen.de/software/joelib/tutorial/atomtyper.html

In my opinion, most programs have their own atom typer, which is
really critical!!!
Based on my descriptor calculation experience, I mostly trust JOELib and
OpenBabel, because both use the same atom typing definitions, which are
open-source, open-content and based on text files!!!

Let's say these expert systems fail for some compounds: we can at least
be sure that they will also fail for analogous compounds, so we will have
a systematic error. Because these models have a long tradition, they are
still really good, in my opinion. My cooperation partner told me that
the models are sometimes better than Sybyl.

So much for the results with the first definition of rotatable bonds,
H-donors and H-acceptors. The second definition is based on YOUR
definition of these descriptors. JOELib supports, e.g., two different
kinds of donors and acceptors, and there will never be a guarantee of
completeness!!! Most authors in the literature either give their SMARTS
patterns for these definitions or say, which is VERY BAD, 'we used
program XYZ'. From a computer scientist's point of view, a program is
not transparent!!!

Always use SMARTS or detailed descriptions of these descriptors!!!

On LogP: I have already published a paper on LogP prediction. As you
surely know, there are two main ways to predict values:

1. Group contribution approach: The open-source model in JOELib is one
   of these. The model is really not that good (I checked this). See the
   literature reference in the source code.

2. Descriptor/data mining approach: Part of my published paper:
   J. K. Wegner, A.
   Zell, Prediction of Aqueous Solubility and Partition
   Coefficient Optimized by a Genetic Algorithm Based Descriptor Selection
   Method, Journal of Chemical Information and Computer Sciences (JCICS),
   2003, 43(3), 1077-1084, DOI: 10.1021/ci034006u

To conclude: the main work I am paid for is chemical data mining, so I
know a lot of the problems in this area. Please don't hesitate to ask
me, although these topics can be a little off-topic for this mailing
list.

Do you know the JOELib interface to Weka?

Regards, Joerg

> Thanks,
> Wayne
>
> public class test {
>
>     public test() {
>     }
>
>     /**
>      * @param args the command line arguments
>      */
>     public static void main(String[] args) throws Exception {
>         SimpleReader sdfile = new SimpleReader(args[0]);
>         JOEMol mol = new JOEMol();
>
>         PrintStream out = new PrintStream(new FileOutputStream("out.dat"));
>
>         DescResult LogP = null;
>
>         out.println("E_NSC\tjoe_logP\tkow_LogP\texp_logP\tacd_logP");
>         while (sdfile.readNext(mol)) {
>             //System.out.println(mw.getDoubleValue(mol));
>             LogP = DescriptorHelper.instance().descFromMol(mol, "LogP");
>
>             String kow_LogP = convert(mol.getData("E_LOGP"));
>             String exp_LogP = convert(mol.getData("E_LOGP/2"));
>             String acd_LogP = convert(mol.getData("E_LOGP/3"));
>
>             String nsc = mol.getData("E_NSC").toString();
>
>             out.print(nsc + "\t" + LogP + "\t" + kow_LogP + "\t");
>             out.println(exp_LogP + "\t" + acd_LogP);
>         }
>     }
>
>     static String convert(joelib.data.JOEGenericData in) {
>         String result = "";
>         if (in != null) {
>             result = in.toString().trim().substring(0,
>                     in.toString().indexOf(' ', 1));
>         }
>         if (result == null) {
>             return "";
>         } else {
>             return result;
>         }
>     }
> }
>
> The "convert" function is used to clean off spaces & a zero appended to
> the field containing the predicted log p values. I don't know why the
> sd file dump from the nci has that.
>
> -------------------------------------------------------------------------------------------------
>
> This email may contain material that is confidential and privileged and
> is for the sole use of the intended recipient. Any review, reliance or
> distribution by others or forwarding without express permission is
> strictly prohibited. If you are not the intended recipient, please
> contact the sender and delete all copies.

--
Dipl. Chem. Joerg K. Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW: http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action.
E. Hemingway
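
A note on the quoted convert helper: it trims the string before calling substring, but takes the index of the first space from the untrimmed string. A value with leading whitespace is therefore cut at the wrong position, and a value with no space after the first character makes indexOf return -1, so substring throws. Below is a minimal sketch of a more defensive variant. It assumes the NCI field is whitespace-separated text such as "2.13 0", which the thread does not confirm. FieldCleaner and firstToken are hypothetical names, not part of JOELib, and the method takes a plain Object so the sketch compiles without the library.

```java
// Hypothetical helper, not part of JOELib: returns the first
// whitespace-delimited token of an SD-file data field.
final class FieldCleaner {

    private FieldCleaner() {
    }

    /**
     * Returns the first whitespace-delimited token of raw, or "" when raw
     * is null or blank. Assumes a field layout like "2.13 0" (value first,
     * then an appended zero), which is a guess about the NCI dump.
     */
    static String firstToken(Object raw) {
        if (raw == null) {
            return "";
        }
        String text = raw.toString().trim();
        if (text.isEmpty()) {
            return "";
        }
        // After trim() the text starts with a non-space character, so the
        // first element of the split is always the value itself.
        return text.split("\\s+")[0];
    }

    public static void main(String[] args) {
        System.out.println(firstToken(" 2.13 0")); // prints 2.13
        System.out.println(firstToken("1.5"));     // prints 1.5
    }
}
```

In the quoted program, each convert(mol.getData(...)) call could be replaced by FieldCleaner.firstToken(mol.getData(...)), since the helper only relies on toString().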