[Joelib-help] Coding and similarity ?


Hi Rich,

I've changed the subject to be more precise.
I agree that things are getting complex, but primitive native 
numeric/nominal descriptors are only a really small subset of all 
possible codings for molecular structures (descriptor results).

descriptor (parameters, molecule): algorithm to get values
descriptor result:       storing object for the abstract molecule
                          numeric,nominal value, binary nominal value,
                          atom-pair, mcs, ...
query (parameters):  a search method getting a list of valid matchings
                      e.g. SMARTS, AP, shape, whatever, ...
metric (parameters, descRes1, descRes2): getting the similarity of two
                                          possible codings
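
The four roles above could be sketched roughly as Java interfaces. This is only an illustration: every type name below (Descriptor, DescriptorResult, Metric, Query and the toy implementations) is hypothetical and not taken from JOELib, and a plain String stands in for the molecule.

```java
import java.util.ArrayList;
import java.util.List;

// All names below are hypothetical and only mirror the four roles listed above.

/** Coding: the object that stores what a descriptor computed for one molecule. */
interface DescriptorResult {}

/** Descriptor: the algorithm that produces a coding from a molecule. */
interface Descriptor<M, R extends DescriptorResult> {
    R calculate(M molecule);
}

/** Metric: similarity of two codings of the same kind. */
interface Metric<R extends DescriptorResult> {
    double similarity(R a, R b);
}

/** Query: a search that returns the list of valid matchings. */
interface Query<M, T> {
    List<T> match(M molecule);
}

/** Toy coding: a single numeric value. */
final class NumericResult implements DescriptorResult {
    final double value;
    NumericResult(double value) { this.value = value; }
}

/** Toy descriptor: counts the letters of a string standing in for a molecule. */
final class LetterCount implements Descriptor<String, NumericResult> {
    @Override
    public NumericResult calculate(String molecule) {
        int n = 0;
        for (char c : molecule.toCharArray()) {
            if (Character.isLetter(c)) n++;
        }
        return new NumericResult(n);
    }
}

/** Toy metric: 1 for identical values, approaching 0 as they diverge. */
final class InverseDifference implements Metric<NumericResult> {
    @Override
    public double similarity(NumericResult a, NumericResult b) {
        return 1.0 / (1.0 + Math.abs(a.value - b.value));
    }
}

/** Toy query: positions at which a given character occurs. */
final class CharQuery implements Query<String, Integer> {
    private final char wanted;
    CharQuery(char wanted) { this.wanted = wanted; }

    @Override
    public List<Integer> match(String molecule) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < molecule.length(); i++) {
            if (molecule.charAt(i) == wanted) hits.add(i);
        }
        return hits;
    }
}
```

The point of the split is that the metric is typed on the result (the coding), not on the descriptor that produced it.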

> But one thing that is not clear to me is how a generic Metric (or Comparator) does its job (without violating encapsulation) of comparing two Descriptor calculations given that the way in which each Descriptor represents itself is unique. For example, a Tanimoto comparison of two fingerprints will be done one way, but a Tanimoto comparison of two TPSAs will be done very differently. A Euclidean distance comparison of Topological Torsion is straightforward, but the same comparison of clogP - that's done very differently, I imagine.
Generic would not be the correct term.
The basic problem we always have is that 'similarity' cannot, and definitely
should not, be separated from the metric, because a metric can only
interpret the features given.
I've tried to find a structure for my private literature collection, and I am now
of the opinion that coding and similarity are two sides of one coin.
So, we can have different images on either of the two sides, but we cannot
split the coin.

So, eventually every descriptorResult should have something like:
List metrics = descriptorResult.getPossibleMetrics();
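
A minimal sketch of that idea, assuming hypothetical types (MetricAwareResult, DistanceMetric, NamedMetric and ScalarResult are invented here for illustration and are not JOELib classes): each result type advertises the metrics that can interpret its coding.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical interfaces; JOELib's real DistanceMetric may look different.
interface DistanceMetric<R> {
    String name();
    double distance(R a, R b);
}

/** A result (coding) that knows which metrics are meaningful for it. */
interface MetricAwareResult<R> {
    List<DistanceMetric<R>> getPossibleMetrics();
}

/** Small helper that pairs a metric name with its distance function. */
final class NamedMetric<R> implements DistanceMetric<R> {
    private final String name;
    private final ToDoubleBiFunction<R, R> fn;

    NamedMetric(String name, ToDoubleBiFunction<R, R> fn) {
        this.name = name;
        this.fn = fn;
    }

    @Override public String name() { return name; }
    @Override public double distance(R a, R b) { return fn.applyAsDouble(a, b); }
}

/** Toy single-value coding offering two metrics. */
final class ScalarResult implements MetricAwareResult<ScalarResult> {
    final double value;
    ScalarResult(double value) { this.value = value; }

    @Override
    public List<DistanceMetric<ScalarResult>> getPossibleMetrics() {
        return Arrays.asList(
            new NamedMetric<ScalarResult>("absolute",
                (a, b) -> Math.abs(a.value - b.value)),
            new NamedMetric<ScalarResult>("squared",
                (a, b) -> (a.value - b.value) * (a.value - b.value)));
    }
}
```

A caller can then pick a metric by name from the list without knowing how the result stores its coding.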

And I'm also of the opinion that we should be really general here, because
most model building algorithms (classification, regression, clustering)
most often need only a kind of similarity and a mean value for a set of
molecules.
And the primitive Euclidean distance of descriptor (sub)sets is only the
plain data mining approach, which loses all topological information
(inverse QSAR problem).

> And then there's the problem that a generic Metric will need a much wider Descriptor interface to do a comparison than a generic DescriptorResult or Descriptor will have.
Hmm, I think the result holds the coding,
and the metric addresses similarity on that coding.

> How does JOELib handle these issues?
Not well, and in quite diverse ways.
For general descriptor results I've recently introduced:
joelib.math.similarity.DistanceMetric

It covers basic values (numeric, nominal, or binary nominal). Furthermore,
there are some hot topics that work directly on molecular structures. I
will not discuss these on the public mailing list, but I'm
definitely willing to cooperate here if the plan is to write a paper
using one of the new methods. For all methods we have the atom labelling
(set) problem!

EUCLIDEAN, TANIMOTO:
joelib.util.ComparisonHelper
The Euclidean or Tanimoto metric is chosen according to the kind of descriptor
given to
setComparisonDescriptor(String)
setComparisonDescriptor(String[])
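
For reference, the two metrics themselves can be sketched as below. This does not reproduce ComparisonHelper; the class name SimpleMetrics is hypothetical, and pairing numeric vectors with the Euclidean distance and bit fingerprints with the Tanimoto coefficient is an assumption about the typical case. Returning 1.0 for two empty fingerprints is likewise just one possible convention.

```java
import java.util.BitSet;

// Hypothetical helper, not part of JOELib.
final class SimpleMetrics {
    private SimpleMetrics() {}

    /** Euclidean distance between two numeric descriptor vectors. */
    static double euclideanDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors differ in length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Tanimoto coefficient: |A and B| / |A or B| on binary fingerprints. */
    static double tanimoto(BitSet a, BitSet b) {
        BitSet common = (BitSet) a.clone();
        common.and(b);
        BitSet either = (BitSet) a.clone();
        either.or(b);
        int union = either.cardinality();
        // Assumed convention: two empty fingerprints count as identical.
        return union == 0 ? 1.0 : (double) common.cardinality() / union;
    }
}
```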

ATOM-PAIR (unpublished work of Nikolas Fechner is also available, still in
development)
joelib.desc.types.atompair.BasicAPDistanceMetric

MCS (not public, still in development, paper submitted; I will eventually
publish it after the paper is accepted, but I'm not sure whether I'm willing
to share the implementation advantages so early)
Really weird, but I would prefer the most abstract object-oriented way you
can provide.
In fact there are two results (codings) and a metric based on these results.
But there are tons of ways to code the MCS (parameters for MCS generation)
and to apply the metric (parameters for the metric).

> It almost seems like the "Descriptor" category itself is overly general and needs to be broken down further. Otherwise any Descriptor framework will have to know too much about particular Descriptor implementations with the result being a decidedly non-object-oriented framework that is difficult to extend and maintain. How can we address this?
In JOELib every descriptor knows its result, so if you call
result=descriptor.calculate(molecule)
you will get the correct result. Because this is done using
Java reflection it is not the most efficient way, but if we use
result=descriptor.calculate(molecule, result) it will be efficient.
Hence, standard users have to pay a runtime penalty, because object
generation in Java is expensive (see also joelib.desc.ResultFactory).
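
The difference between the two call forms can be sketched as follows. The classes LengthResult and LengthDescriptor are hypothetical toys (a String stands in for the molecule), and how JOELib's ResultFactory actually creates results is not shown here; the sketch only assumes a reflective no-argument constructor call.

```java
// Hypothetical toy classes; not JOELib code.

/** Mutable result object so that it can be refilled and reused. */
final class LengthResult {
    int value;
}

final class LengthDescriptor {
    // The descriptor "knows" its result type.
    private final Class<LengthResult> resultType = LengthResult.class;

    /** Convenience form: creates a new result via reflection on every call. */
    LengthResult calculate(String molecule) {
        try {
            LengthResult fresh = resultType.getDeclaredConstructor().newInstance();
            return calculate(molecule, fresh);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot create result object", e);
        }
    }

    /** Efficient form: fills the result supplied by the caller, no allocation. */
    LengthResult calculate(String molecule, LengthResult result) {
        result.value = molecule.length();
        return result;
    }
}
```

In a loop over many molecules, the second form lets one result instance be refilled again and again, provided the caller has consumed the previous value first.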

I suggest that every result should know possible metrics.

I've also introduced a joelib.desc.DescriptorInfo object.
Additionally there exists the DescDescription object, which holds
information for each descriptor. If you try:
joelib/ant> ant JOELibTestGUI

and switch to the Info-->Descriptors panel, all information is
generated and loaded on the fly using:
1. DescriptorFactory (get all descriptors JOELib can calculate, so we
know the details for them; BTW, unavailable documentation will cause
annoying warnings, so developers are forced to provide documentation
files from the beginning)

2. Get the descriptor info for each descriptor

3. Load the single HTML documentation (also generated from DocBook-XML)
for each descriptor

4. Show the information.

Kind regards, Joerg
>  
> rich
> 
> "Joerg K. Wegner" <we...@in...> wrote:
> Hi again,
> 
> for performance reasons we should not use (as in JOELib):
> molecule.calculate("XYZ")
> 
> we should use:
> keyXYZ=KeyFactory.getKey("XYZ");
> 
> // and use internal caching for this descriptor
> molecule.calculate(keyXYZ);
> 
> Kind regards, Joerg
> 
> 
>>Hi Rich,
>>
>>
>>>* Molecule implements AtomGraph. In the near future, BondingSystem 
>>>should also implement AtomGraph to enable traversal/query with the 
>>>same tools used for Molecules (any objections to this?)
>>
>>Good.
>>
>>
>>>* Traversers traverse the graph structure of any AtomGraph. Traversers 
>>>are low-level components that are helpful for building higher-level 
>>>functionality. Currently two types of Traverser are available: 
>>>DepthFirstTraverser and CycleTraverser. Both use a system of Handlers 
>>>and Controllers - Handlers for handling events generated at various 
>>>stages of a traversal algorithm and Controllers for exercising limited 
>>>control over the algorithm itself. This system borrows from SAX's 
>>>ContentHandler idea. HanserCycleTraverser is an implementation of 
>>>CycleTraverser that uses Hanser's algorithm for finding the set of all 
>>>cycles of an AtomGraph using collapsing Path-Graphs.
>>
>>CycleTraverser should use an interface, so that we can switch the 
>>traverser.
>>If nothing is specified, a default traverser should be used.
>>The traverser should also have an ID and version number, analogous to 
>>descriptors.
>>
>>
>>
>>>* MoleculeComparator compares two AtomGraphs for isomorphism, but 
>>>without comparing atom/bonding properties. UllmanComparator implements 
>>>MoleculeComparator by using Ullmann's subgraph isomorphism algorithm. 
>>>Like Traverser, MoleculeComparator uses a system of Handlers and 
>>>Controllers for fine-grained control. It should be possible to use 
>>>this system to create additional isomorphism algorithms implementing 
>>>MoleculeComparator.
>>
>>Isn't this only a formulation problem ?
>>Can't we use a boolean method compareNode(LabelSet) which uses a set of 
>>labels to check isomorphism ?
>>
>>
>>>* QueryBuilder enables clients to build a molecular query using the 
>>>same process that is used for building a Molecule with 
>>>MoleculeBuilder. In fact, QueryBuilder extends MoleculeBuilder and can 
>>>be used in many contexts calling for a MoleculeBuilder. QueryBuilder 
>>>is designed for building queries that are based on a template molecule 
>>>with constraints placed on individual Atoms with AtomQuery.
>>
>>Can 'pharmacophores' also be treated with this approach? That is, are 
>>combined features allowed, e.g. a carboxylic acid group combined into a 
>>single feature, with a distance to all other features?
>>
>>
>>
>>>* SmartsQueryFactory is in the early stages, but is intended to 
>>>simplify the process of using QueryBuilder by enabling clients to use 
>>>SMARTS Atomic Primitive strings as keys to obtain a fully functional 
>>>AtomQuery. Although this isn't exactly a SMARTS parser, it isn't that 
>>>far from being one given Octet's SmilesReader. Currently only the 
>>>wildcard Atomic Primitive ("*") is supported, but others should be 
>>>appearing soon. The approach here has some elements in common with 
>>>that of CDK's growing SMARTS support, but there are also some 
>>>interesting differences.
>>
>>Same as above, so atom based (not feature based) compareNode(LabelSet) 
>>method, where the LabelSet is what i would call the chemical kernel atom 
>>labelling set.
>>
>>
>>>Looking a little further down the road for QSAR, what are people's 
>>>thoughts on a framework for molecular descriptors? Of course, there 
>>>are hundreds of descriptors, and of course we all have our ideas on what a 
>>>particular descriptor means or doesn't mean. What I'm actually 
>>>wondering about is what a descriptor facility in QSAR would look and 
>>>feel like. I've been looking at JOELib's descriptor framework, which 
>>>has some reasonable concepts. From what I can tell, there are two 
>>>basic kinds of descriptor: a "holistic" descriptor that is a single 
>>>value (i.e. TPSA) and which is primitive-like, and everything else, 
>>>which tends to be higher-resolution in nature (i.e. Topological 
>>>Torsion) and more object-like. Are there any other ideas? 
>>
>>With respect to query i would prefer the object approach, so we can use:
>>result=molecule.calculate("XYZ")
>>or as in JOELib
>>result1=calculator.calculate(mol1,"XYZ", Properties)
>>result2=calculator.calculate(mol2,"XYZ", Properties)
>>
>>for matching or similarity we can then use
>>// inherited from Comparator in Java API
>>// applicable for Euclidean, Tanimoto, atom-pairs
>>similarity=metricThatILike(result1,result2, Properties);
>>
>>For simple single value descriptors it would be also interesting to have:
>>similarity=metricThatILike(ResultSet1,ResultSet2, Properties);
>>Also with pharmacophore outlook or multiple graph isomorphism and not 
>>only pair-wise matching.
>>
>>So a query is from my standpoint a kind of similarity-metric which can 
>>only return 0 and 1. Sometimes, as in SMARTS matching we are only 
>>interested in subgraph isomorphism.
>>result1=calculator.calculate(mol1,"XYZ", LabelSet)
>>result2=calculator.calculate(mol2,"XYZ", LabelSet)
>>// only applicable for this specific calculator
>>// can be used for maximum common substructure search (MCS)
>>matchings=matchingsThatILike(result1,result2, Properties);
>>
>>So, for SMARTS matching we need also:
>>matchings=matchingsThatILike(query1,result2, Properties);
>>
>>For pharmacophores 2D/3D/Shape we can also use this approach, because 
>>the representation for the similarity/matching is the relevant point.
>>matchings=matchingsThatILike(query1,result2, Properties);
>>or
>>similarity=metricThatILike(result1,result2, Properties);
>>
>>Kind regards, Joerg
>>
>>
> 
> 
> 

-- 
Dipl. Chem. Joerg K. Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW:    http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action.
                                     (E. Hemingway)

Never mistake action for meaningful action.
                                (Hugo Kubinyi,2004)