Re: [Cdk-devel] QSAR project

At 08:48 16/04/2004 -0400, Rajarshi Guha wrote:
>On Fri, 2004-04-16 at 06:08, Nina Nikolova wrote:
>
> > > > You haven't said anything about dictionaries. IMO this is an
> > > > essential part of the project. These dictionaries should certainly
> > > > be available to anyone and I would wish to see them included in
> > > > commercial tools as well. The same goes for test data sets.
> > >
> > > Ack.
> >
> > Apparently dictionary development should be the (first) part of the
> > project. I am not quite sure how we start. Is there already something
> > available?
> >
>
>This makes sense (especially since you can't make models without
>descriptors :)
>
>If I understand properly, having various dictionaries allows us to
>basically group descriptors. In Dr. Murray-Rust's post he mentions
>dictionaries for JOELib, Randic, Dragon etc.

This is one strategy. IIRC Dragon has ca. 900 descriptors. They range from 
molecular weight to various electrotopological indexes. The entries would 
be something like:

drag:mwt
drag:etop23

and so on. It doesn't matter what the IDs are as long as they are unique.

Now MWt is common to other programs. However, the concept might vary. One 
program might sum all the average atomic masses and another might take the 
largest peak in an HRMS (I don't think this is common in QSAR, but who 
knows). So it makes sense to create, say,

joe:molwt
joe:nrot

etc.
Note that the prefixes don't matter as long as they are mapped to a 
namespace URI which is constant, e.g.

<foo xmlns:drag="http://net.sf.qsar/dict/dragon">...
<foo xmlns:joe="http://net.sf.qsar/dict/joelib">...
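
To illustrate why the prefix is irrelevant, here is a minimal Python sketch 
using the standard library's xml.etree.ElementTree. It assumes, purely for 
illustration, that an entry appears as an element name; the second prefix 
"d" and the use of <foo> as a wrapper are also made up for the example. The 
parser expands each prefixed name to "{namespace URI}local name", so two 
documents that use different prefixes for the same URI yield the same name.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the example above.
DRAGON_NS = "http://net.sf.qsar/dict/dragon"

# Same namespace URI, two different (arbitrary) prefixes.
doc_a = f'<foo xmlns:drag="{DRAGON_NS}"><drag:mwt/></foo>'
doc_b = f'<foo xmlns:d="{DRAGON_NS}"><d:mwt/></foo>'


def qualified_tags(xml_text):
    """Return the child element names in expanded '{uri}local' form."""
    root = ET.fromstring(xml_text)
    return [child.tag for child in root]


tags_a = qualified_tags(doc_a)
tags_b = qualified_tags(doc_b)

# Both documents name the same entry once prefixes are resolved.
print(tags_a == tags_b)
```

The comparison is made on the expanded name, never on the literal string 
"drag:mwt", which is what lets each document pick its own prefixes.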

You will need to make sure that you either obtain permission from the 
original authors to copy program manuals or that you extract the 
definitions from the open literature. Of course it would be great if the 
QSAR authors wanted to join but we shouldn't expect this.

>How are we going to decide on namespaces? Should it be by descriptor
>type (topological, geometrical, informational etc) or by program
>(JOELib, Dragon etc) or something else?

Anything that makes sense and makes maintenance easy. In general I would 
have one main curator per dictionary. When the same concept (e.g. MWt) 
occurs in two dictionaries then we can create a communal dictionary which 
normalises the concept.
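
One way to picture the communal dictionary is as an explicit, 
curator-maintained mapping from program-specific entries to a shared one. 
The sketch below is hypothetical: the "common" namespace URI, the 
"molecularWeight" ID and the claim that the Dragon and JOELib entries mean 
the same thing are assumptions made only to show the mechanism. Entries with 
no asserted equivalence stay distinct.

```python
DRAGON = "http://net.sf.qsar/dict/dragon"
JOELIB = "http://net.sf.qsar/dict/joelib"
COMMON = "http://net.sf.qsar/dict/common"  # hypothetical communal dictionary

# Equivalences a curator has explicitly asserted (illustrative only).
# Each key and value is a (namespace URI, entry ID) pair.
SAME_AS = {
    (DRAGON, "mwt"): (COMMON, "molecularWeight"),
    (JOELIB, "molwt"): (COMMON, "molecularWeight"),
}


def normalise(entry):
    """Map an entry to its communal concept, or return it unchanged."""
    return SAME_AS.get(entry, entry)


# Asserted equivalents collapse to one concept.
print(normalise((DRAGON, "mwt")) == normalise((JOELIB, "molwt")))
# An entry nobody has normalised is left as it is.
print(normalise((JOELIB, "nrot")))
```

Nothing is merged on the strength of a similar name alone; an entry is 
normalised only when it has been added to the mapping.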

>One thing that occurs to me is that it is possible for overlap of
>descriptors - that is, two namespaces might list the same descriptor.
>Wouldn't this be a problem?

We cannot assume that similar names in different programs are precisely the 
same concept. For example, different programs may assign different atom 
types within a molecule, e.g. which are the C.ar in pyrrole? So we should 
start off by assuming these are different.

>For the case of references, does the CML schema (my terminology might be
>wrong) allow for a standard way to represent references (journal name,
>author, vol, year etc?)

No - we tend to use Dublin Core or other schemes. This is an area where 
there are already many approaches and CML does not add another.

Peter

>-------------------------------------------------------------------
>Rajarshi Guha <rx...@ps...> <http://jijo.cjb.net>
>GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
>-------------------------------------------------------------------
>Chemistry professors never die, they just fail to react.

Peter Murray-Rust
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069