From: joerg.wegner <joe...@we...> - 2006-02-22 03:44:08
Hi again, hi @JOELib users,

After a long time ... (still busy times) ... a bug fix release.

> We would be grateful if the participants would submit their favorite set of
> descriptors, and share it with all CoEPrA competitors.
> In such cases we might include the descriptors in the CoEPrA task.

That is marvellous, and sure ... see further below how to calculate JOELib2 features: around 5000 eigenvalue, RDF, autocorrelation, and complexity features (already at different smoothing levels).

But with respect to interpretability you should use the option to calculate counting SMARTS, e.g. simply write a script which generates linear or spherical patterns and forward them to OpenBabel or JOELib. OpenBabel is much faster at that, because JOELib has some mining prototyping features which slow things down. E.g. you can read and assign any data type to features in SDF files. The statistics will then 'recognize' features automatically, which makes it quite easy to use any special atom, bond, or other properties.

People who would rather just use the chemical expert system and the atom and bond properties have the choice of 53 atom properties at different smoothing levels, and also 9 different bond properties.

> The most interesting characteristic of SAR/QSAR models is the
> identification of the relevant descriptors, and in order to compare
> different types of descriptors, we are preparing several SAR/QSAR datasets
> that contain the chemical structures.

Well ... I still disagree ... it is ONE interesting part, but not the only one. The other is to create a good hypothesis language or algorithm, and this can, but need NOT, be based on feature vectors.

> However, I do not advocate the sole use of a consolidated set of
> descriptors... physical meaning and discrimination ability of atom types

Again: No-Free-Lunch (for optimization) ... so I agree.
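The pattern-generating script mentioned above could look like this minimal sketch: it enumerates linear carbon-chain counting SMARTS of increasing length. The output file name linear_smarts.txt, the chain lengths 2-5, and the plain [#6] carbon primitive are illustrative assumptions, not part of the mail.

```shell
#!/bin/sh
# Illustrative sketch: generate linear carbon-chain SMARTS patterns
# of length 2 to 5, one per line, into linear_smarts.txt.
: > linear_smarts.txt
pattern="[#6]"
for length in 2 3 4 5; do
  pattern="${pattern}[#6]"            # grow the chain by one carbon atom
  echo "$pattern" >> linear_smarts.txt
done
cat linear_smarts.txt
```

The resulting file can then be fed to OpenBabel or JOELib as a counting-SMARTS feature set.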
You seem to prefer the top-down way, I prefer the bottom-up way ... even after years it is still difficult to say what is really important, and especially what has the highest generalization ability. I am especially referring to the bias-variance decomposition. And if I build 1001 single models and merge them (a kind of user-boosting algorithm), that will cause overfitting by users. This is not proven, but my impression is still that we should avoid that kind of overfitting.

Finally (AtomType relevance): an activity need not be a connected part of a structure or explainable by a single atom type, so even with expert knowledge the underlying subgraph isomorphism problem cannot be avoided (top-down or bottom-up? ;-). And yes, the AtomType coding is incredibly good, but there is still room for improvement in the chemical expert systems, which are in the end responsible for assigning atom types.

<snip/>

How to calculate JOELib2 features?

Download the release from today: http://sourceforge.net/projects/joelib

Linux: you are fine
Windows: please install Cygwin ;-)

Change to the joelib2 base directory:

cd joelib2-2006-02-22

Csh:  setenv JOELIB2 `pwd`
Bash: export JOELIB2=`pwd`

Create SMARTS features any way you like ... educated guessing or combinatorial ... and store them in a file 'mySMARTS.txt', e.g.

[CX4H3][#6]
[CX4H2]([#6])[#6]
[CX4H1]([#6])([#6])[#6]
[CX4]([#6])([#6])([#6])[#6]
[CX3;$([H2]),$([H1][#6]),$(C([#6])[#6])]=[CX3;$([H2]),$([H1][#6]),$(C([#6])[#6])]
[CX2]#[CX2]
[CX3]=[CX2]=[CX3]
[ClX1][CX4]
[FX1][CX4]
[BrX1][CX4]
[IX1][CX4]

Test the installation:

sh featureCalculation.sh +ap +countSMARTS +binarySMARTS +jcc +SSKey src/resources/multiple.mol multiple_features.sdf SMARTS_InteLigand_veryshort.txt > logging.txt
sh featureStatistic.sh multiple_features.sdf > logging-stat.txt

This will produce two files, a statistics file and a binning file. Those files will be generated only once per SDF file, so they can be used in a larger mining environment.
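As a small example of the "combinatorial" route, the four halogen-on-sp3-carbon patterns from the list above can be generated instead of typed out. This is only a sketch; the loop over the four halogens is the only assumption, and the file name mySMARTS.txt follows the mail.

```shell
#!/bin/sh
# Illustrative sketch: build the [FX1][CX4], [ClX1][CX4], [BrX1][CX4],
# and [IX1][CX4] patterns combinatorially and append them to mySMARTS.txt.
: > mySMARTS.txt
for hal in F Cl Br I; do
  echo "[${hal}X1][CX4]" >> mySMARTS.txt  # terminal halogen bound to sp3 carbon
done
cat mySMARTS.txt
```

The same loop idea scales to larger substituent or atom-type alphabets, which is exactly where hand-typing patterns stops being practical.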
If you have special features, add them to
Joelib2/src/joelib2/src/data/plain/knowResults.txt
and add their data format to the appropriate section (regular expressions are allowed).

BTW, a molecule-specific Weka mining class is already part of the JOELib release. Flat-file support is also available if you want to create matrix files. And the Matlab interface might also be interesting for some users, but I recommend going for flat files instead, because the Matlab interface was designed for the old JOELib(1) version.

If you need any help, write me an e-mail or contact the mailing lists:
joe...@li...
joe...@li...

Best regards,
Joerg

-----Original Message-----
From: grbounce-nY5BowUAAAB926gH-oE5s-8BGcxAvbeI=joerg.wegner=web.de@googlegroups.com [mailto:grbounce-nY5BowUAAAB926gH-oE5s-8BGcxAvbeI=joerg.wegner=web.de@googlegroups.com] On behalf of co...@gm...
Sent: Wednesday, 22 February 2006 00:41
To: CoEPrA
Subject: CoEPrA and SAR/QSAR datasets

Dear CoEPrA participants,

I want to further elaborate on the composition of the SAR/QSAR datasets, because it seems that there is great interest in computing and comparing QSAR descriptors. Here I use "SAR/QSAR" with the meaning of "property prediction from chemical structures and structural descriptors", in order to include not only biological activities, but also physico-chemical properties.

For some datasets we are constrained by the nature of the compounds to provide only calculated descriptors. This is due to their origin (databases, corporate compounds, copyright problems, or the fact that the compounds are taken from well-known databases). In the case of proprietary compounds or copyrighted information, the situation is clear: we cannot provide the chemical structures, so there is nothing that we can do here. We use such datasets in CoEPrA because of (a) their value in data modeling or (b) the importance of the problem (for example, drug/non-drug classification).
If you know such datasets that might be suitable for the CoEPrA competition, please contact me at oii...@ut... or ie...@ya....

Regarding the datasets taken from well-known databases, we cannot provide the chemical structures because doing so would make the "blind prediction" character of the CoEPrA competition impossible. For example, compounds from the NCI database (or similar databases) will be given only as computed descriptors.

The most interesting characteristic of SAR/QSAR models is the identification of the relevant descriptors, and in order to compare different types of descriptors, we are preparing several SAR/QSAR datasets that contain the chemical structures. For peptide datasets we will provide only the sequences, while the chemical structures will be provided in a molecular format that can be read with OpenBabel. This format can be SMILES (i.e., no coordinates) or a format that contains 3D coordinates.

Based on the chemical structures provided in a CoEPrA task, the participants can generate 3D structures (if only SMILES codes are given, for example), generate conformers, use QM/MM/MD programs to optimize the geometry, align the molecules, or perform any computation on the molecular structure. Also, the participants can compute any set of structural descriptors and use it to generate QSAR models. However, "black box" descriptors are not allowed, because we want to learn something from these "blind predictions", not only to maximize the AUC or Q2.

The scope of the "blind predictions" in CoEPrA is to test as many hypotheses as possible, and this is why we would be interested to see comparative studies for various classes of descriptors, such as: "counts of atom types" versus "path counts" versus "autocorrelation vectors" versus "3D pharmacophores" versus "subgraph counts" versus "quantum indices" versus "molecular field descriptors", and so on.
Of course, all the above descriptors can be consolidated into a large set of descriptors, which should also be tested as a reference for QSAR models with individual classes of descriptors. However, I do not advocate the sole use of a consolidated set of descriptors, because:
(a) some sets of descriptors have a physical meaning, such as LSER or TLSER descriptors;
(b) some classes of descriptors are the result of a QSAR theory (i.e., atom types can discriminate between drug/non-drug compounds);
(c) the comparative evaluation of different classes of descriptors is lost.

Best regards,
Ovidiu

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "CoEPrA" group.
To post to this group, send email to Co...@go...
To unsubscribe from this group, send email to CoE...@go...
For more options, visit this group at http://groups.google.com/group/CoEPrA
-~----------~----~----~----~------~----~------~--~---