From: Nina Jeliazkova <nina@ac...>  2006-10-30 09:46:11

Rajarshi Guha <rguha@...> wrote:
> On Sun, 2006-10-29 at 08:41 +0200, Nina Jeliazkova wrote:
> > Rajarshi Guha <rguha@...> wrote:
> > >
> > > My reasons for objecting are more philosophical than anything else: a
> > > descriptor 'describes' some molecular or atomic feature. Now the
> > > description may not be exact and instead might be an approximation (e.g.
> > > GM charges, atom-additive descriptors). On the other hand, a predictive
> > > model tries to 'explain' a feature (and a side effect is the
> > > reproduction of the feature with varying degrees of accuracy). On this
> > > level I think a predictive model should not be used as a descriptor.
> >
> > Just my two cents on the philosophical side of the discussion.
> > I do not agree that prediction is a "side effect"; on the contrary,
> > prediction is the PRIMARY objective of a predictive model. In fact, it
> > is wrong to pretend that statistical models "explain" anything;
> > statistics can at best "correlate" features, which is not the same as
> > "explaining" (or finding a causal relationship).
>
> You are correct in that there are some modeling techniques whose only
> focus is predictive ability. These have been termed 'algorithmic'
> models (such as random forests, kNN regression/classification) by
> Breiman. On the other hand, there are 'data' models, which assume that
> the observed data are generated by a stochastic process, and the
> statistical model tries to embody that process. An example would be OLS.
>
> See Breiman, L., Statistical Science, 2001, 16(3), 199-231.
>
> I was referring to the latter case, since the goal of those model types
> is to generate an approximation to the stochastic process that generated
> the data.

Well, I was referring to the former case, since the most widely used models in QSAR simply try to correlate values.
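The algorithmic/data distinction can be made concrete with a toy sketch: a 1-nearest-neighbour regressor (purely predictive, no assumed data-generating process) next to an ordinary least-squares fit (which explicitly assumes y = a + b*x + noise). This is only an illustration of the two model families, not CDK code; the class and method names are invented.

```java
// Toy contrast between an "algorithmic" model (1-NN regression) and a
// "data" model (OLS) in Breiman's sense. Illustrative only; not CDK API.
public class ModelContrast {

    // 1-NN: return the response of the closest training point.
    // Makes no assumption about how the data were generated.
    static double knnPredict(double[] x, double[] y, double query) {
        int best = 0;
        for (int i = 1; i < x.length; i++) {
            if (Math.abs(x[i] - query) < Math.abs(x[best] - query)) {
                best = i;
            }
        }
        return y[best];
    }

    // OLS: assume y = a + b*x + noise and estimate {a, b}.
    static double[] olsFit(double[] x, double[] y) {
        double xMean = 0, yMean = 0;
        for (int i = 0; i < x.length; i++) { xMean += x[i]; yMean += y[i]; }
        xMean /= x.length;
        yMean /= y.length;
        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += (x[i] - xMean) * (y[i] - yMean);
            den += (x[i] - xMean) * (x[i] - xMean);
        }
        double slope = num / den;
        return new double[] { yMean - slope * xMean, slope };
    }

    public static void main(String[] args) {
        double[] x = { 1, 2, 3, 4 };
        double[] y = { 2, 4, 6, 8 };  // exactly linear toy data
        double[] fit = olsFit(x, y);  // recovers intercept 0, slope 2
        System.out.println("1-NN at 2.2: " + knnPredict(x, y, 2.2));
        System.out.println("OLS  at 2.2: " + (fit[0] + fit[1] * 2.2));
    }
}
```

Both produce a prediction, but only the OLS fit carries a claim about the underlying process; the 1-NN prediction is just interpolation from the data.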
> > What happens in practice is that *LogP can qualify both as a predictive
> > model and as a descriptor; the latter is widely used to predict other
> > properties. The same is true even for basic physicochemical properties:
> > one can deal with measured values, predicted physchem properties,
> > and/or use them as descriptors in another model.
>
> I have always thought this a strange approach: I can understand using
> atom-additive logP (say Ghose & Crippen) for a descriptor, but to use a
> logP value derived from a model (which contains some amount of random
> error and also has a certain domain of applicability) simply implies
> that one is purposely including extra degrees of error through the QSAR
> modeling process.

Right, such descriptors do introduce further error and uncertainty. The problem here is that this uncertainty is not handled properly in the next-step model. Otherwise, it is a perfectly valid approach to use derived properties to build a more complex model.

I was not sure whether XLogP is a predictive model (thanks, Christian, for the clarification!), but I am aware of at least the CLogP and Syracuse Atom/Fragment Contribution models (KOWWIN software), which are indeed predictive models. In the latter model, fragment weights and correction factors are derived via regression. I wonder if these are obtained differently in the XLogP algorithm.

> > IMHO, there should not be a distinction in the code on the basis of how
> > the descriptor is obtained. There will be more confusion if, e.g., XLogP
> > is not available among the classes implementing the descriptor
> > interface.
>
> I agree, but then again, XLogP is not based on a predictive model (?).
> My understanding is that it is simply a group-additive descriptor.
>
> > If there is really a need to distinguish, let's define an interface
> > specific for "predicted descriptors" that extends "descriptors" and use
> > them accordingly.
>
> I think suitable notes in the Javadoc (and ontology) should be
> sufficient.

Agree.
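The regression step in a fragment-contribution model amounts to a least-squares fit where each molecule's measured logP is modelled as a weighted sum of its fragment counts. A minimal sketch with two hypothetical fragment types, solving the 2x2 normal equations directly (all data are invented for illustration; this is not how any particular logP program is implemented):

```java
// Toy fragment-contribution fit: given fragment counts per molecule and
// measured logP values, estimate per-fragment weights by least squares
// (normal equations, two fragment types). All numbers are invented.
public class FragmentFit {

    // counts[i] = {n1, n2}: occurrences of fragments 1 and 2 in molecule i.
    // Solves (A^T A) w = A^T y for the weight vector w = {w1, w2}.
    static double[] fitWeights(int[][] counts, double[] logP) {
        double a11 = 0, a12 = 0, a22 = 0, b1 = 0, b2 = 0;
        for (int i = 0; i < counts.length; i++) {
            double n1 = counts[i][0], n2 = counts[i][1];
            a11 += n1 * n1;
            a12 += n1 * n2;
            a22 += n2 * n2;
            b1 += n1 * logP[i];
            b2 += n2 * logP[i];
        }
        double det = a11 * a22 - a12 * a12;
        return new double[] {
            (b1 * a22 - a12 * b2) / det,
            (a11 * b2 - a12 * b1) / det
        };
    }

    public static void main(String[] args) {
        // Three molecules with known fragment counts; "measured" logP
        // values generated from true weights 0.5 and -0.4 (no noise).
        int[][] counts = { { 2, 1 }, { 3, 0 }, { 1, 2 } };
        double[] logP = { 0.6, 1.5, -0.3 };
        double[] w = fitWeights(counts, logP);
        System.out.println("w1 = " + w[0] + ", w2 = " + w[1]);
        // A logP predicted from these weights carries the fit's own
        // error, which is exactly the concern about reusing predicted
        // values as descriptors in a second model.
        System.out.println("predicted logP for {1,1}: " + (w[0] + w[1]));
    }
}
```

With noisy measurements the recovered weights, and hence every downstream prediction, inherit the regression error; that is the uncertainty that the next-step model would need to account for.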
Regards,
Nina