From: Joerg K. W. <we...@in...> - 2004-07-27 15:51:31
Hi Rich,

I've changed the subject to be more precise.

I agree that things are getting complex, but primitive native
numeric/nominal descriptors are only a very small subset of all possible
codings for molecular structures (descriptor results).

descriptor (parameters, molecule):
    algorithm to get values
descriptor result:
    storing object for the abstract molecule
    (numeric, nominal value, binary nominal value, atom-pair, mcs, ...)
query (parameters):
    a search method getting a list of valid matchings,
    e.g. SMARTS, AP, shape, whatever, ...
metric (parameters, descRes1, descRes2):
    getting the similarity for two possible codings

> But one thing that is not clear to me is how a generic Metric (or
> Comparator) does its job (without violating encapsulation) of comparing
> two Descriptor calculations given that the way in which each Descriptor
> represents itself is unique. For example, a Tanimoto comparison of two
> fingerprints will be done one way, but a Tanimoto comparison of two
> TPSA's will be done very differently. A Euclidean distance comparison of
> Topological Torsion is straightforward, but the same comparison of
> clogP - that's done very differently, I imagine.

Generic would not be the correct term. The basic problem we always have
is that 'similarity' cannot, and definitely should not, be separated from
the metric, because a metric can only interpret the features given.
I've tried to find a structure for my private literature, and I am now of
the opinion that coding and similarity are two sides of one coin. So, we
can have different images on either of the two sides, but we cannot split
the coin.

So, eventually every descriptorResult should have something like:
List metrics = descriptorResult.getPossibleMetrics();

And I am also of the opinion that we should be really general here,
because most model building algorithms (classification, regression,
clustering) usually need only a kind of similarity and a meanValue for a
set of molecules.
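
As an illustration of the getPossibleMetrics() idea, here is a minimal Java sketch. It is not JOELib code: the names CodingMetricSketch, DescriptorResult, Metric, FingerprintResult, NumericResult and the helper fingerprint(...) are hypothetical, and only the method name getPossibleMetrics() is taken from the text above. The sketch assumes a binary fingerprint coding compared with Tanimoto, and a single numeric value compared with a Euclidean distance mapped to a similarity via 1/(1+d); that mapping is an arbitrary choice for the example.

```java
import java.util.BitSet;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not JOELib code: each result holds a coding and
// names the metrics that are able to interpret that coding.
final class CodingMetricSketch {

    /** A metric compares two results that share the same coding. */
    interface Metric<R> {
        String name();
        double similarity(R a, R b);
    }

    /** A descriptor result knows which metrics fit its own coding. */
    interface DescriptorResult<R> {
        List<Metric<R>> getPossibleMetrics();
    }

    /** Binary fingerprint coding (assumed representation: a BitSet). */
    static final class FingerprintResult implements DescriptorResult<FingerprintResult> {
        final BitSet bits;

        FingerprintResult(BitSet bits) { this.bits = bits; }

        @Override
        public List<Metric<FingerprintResult>> getPossibleMetrics() {
            Metric<FingerprintResult> tanimoto = new Metric<FingerprintResult>() {
                @Override public String name() { return "tanimoto"; }

                @Override public double similarity(FingerprintResult a, FingerprintResult b) {
                    BitSet common = (BitSet) a.bits.clone();
                    common.and(b.bits);              // bits set in both
                    BitSet union = (BitSet) a.bits.clone();
                    union.or(b.bits);                // bits set in either
                    int n = union.cardinality();
                    // Convention chosen here: two empty fingerprints are identical.
                    return n == 0 ? 1.0 : (double) common.cardinality() / n;
                }
            };
            return Collections.singletonList(tanimoto);
        }
    }

    /** Single numeric value coding (e.g. one holistic descriptor value). */
    static final class NumericResult implements DescriptorResult<NumericResult> {
        final double value;

        NumericResult(double value) { this.value = value; }

        @Override
        public List<Metric<NumericResult>> getPossibleMetrics() {
            Metric<NumericResult> euclidean = new Metric<NumericResult>() {
                @Override public String name() { return "euclidean"; }

                @Override public double similarity(NumericResult a, NumericResult b) {
                    double d = Math.abs(a.value - b.value); // 1-D Euclidean distance
                    return 1.0 / (1.0 + d);                 // assumed distance-to-similarity mapping
                }
            };
            return Collections.singletonList(euclidean);
        }
    }

    /** Convenience helper: build a fingerprint from the indices of its set bits. */
    static FingerprintResult fingerprint(int... onBits) {
        BitSet bits = new BitSet();
        for (int i : onBits) {
            bits.set(i);
        }
        return new FingerprintResult(bits);
    }

    public static void main(String[] args) {
        FingerprintResult fp1 = fingerprint(1, 2, 3);
        FingerprintResult fp2 = fingerprint(2, 3, 4);
        for (Metric<FingerprintResult> m : fp1.getPossibleMetrics()) {
            System.out.println(m.name() + " = " + m.similarity(fp1, fp2));
        }
        NumericResult v1 = new NumericResult(1.0);
        NumericResult v2 = new NumericResult(3.0);
        for (Metric<NumericResult> m : v1.getPossibleMetrics()) {
            System.out.println(m.name() + " = " + m.similarity(v1, v2));
        }
    }
}
```

The point of the design is that the caller never picks a metric that cannot read the coding: a Tanimoto metric is only ever offered by the fingerprint result, and the compiler rejects comparing a fingerprint with a numeric value.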
And the primitive Euclidean distance of descriptor (sub)sets is only the
plain data mining approach, which loses all topological information
(inverse QSAR problem).

> And then there's the problem that a generic Metric will need a much
> wider Descriptor interface to do a comparison than a generic
> DescriptorResult or Descriptor will have.

Hmm, I think the result holds the coding, and the metric addresses the
similarity on that coding.

> How does JOELib handle these issues?

Not well, and in really diverse ways.

For general descriptor results I've recently introduced
joelib.math.similarity.DistanceMetric
for basic values (numeric, nominal or binary nominal). Furthermore, there
are some hot topics working directly on molecular structures. I will not
discuss these things on the public mailing list, but I'm definitely
willing to cooperate here if the plan is to write a paper using one of
the new methods. For all methods we have the atom labelling (set)
problem!

EUCLIDEAN, TANIMOTO:
joelib.util.ComparisonHelper
The Euclidean or Tanimoto metric is chosen from the kind of descriptor
given to
setComparisonDescriptor(String)
setComparisonDescriptor(String[])

ATOM-PAIR (also unpublished work of Nikolas Fechner available, still in
development):
joelib.desc.types.atompair.BasicAPDistanceMetric

MCS (not public, still in development, paper submitted; eventually I will
publish after the paper has been accepted, but I'm not sure if I'm
willing to share the implementation advantages so early):
Really weird, but I would prefer the most abstract object-oriented way
you can provide. In fact, two results (coding) and a metric based on
these results. But there are tons of ways you can code the MCS
(parameters for MCS generation) and apply the metric (parameters for the
metric).

> It almost seems like the "Descriptor" category itself is overly general
> and needs to be broken down further.
> Otherwise any Descriptor framework will have to know too much about
> particular Descriptor implementations with the result being a decidedly
> non-object-oriented framework that is difficult to extend and maintain.
> How can we address this?

In JOELib every descriptor knows its result, so if you call
result=descriptor.calculate(molecule)
you will get the correct result. Because this is done by using Java
reflection, this is not the most efficient way, but if we use
result=descriptor.calculate(molecule, result)
this will be efficient. Hence, standard users will have to pay a runtime
penalty, because object generation in Java is expensive (see also
joelib.desc.ResultFactory).

I suggest that every result should know its possible metrics.
I've also introduced a joelib.desc.DescriptorInfo object.
Additionally, there is the DescDescription object, which holds
information for each descriptor.
If you try:
joelib/ant> ant JOELibTestGUI
and switch to the Info-->Descriptors panel, all information is generated
and loaded on the fly by using:
1. DescriptorFactory (get all descriptors JOELib can calculate, so we
   know the details for them; BTW, unavailable documentation will cause
   annoying warnings, so developers are forced to provide documentation
   files from the beginning)
2. Get descriptor infos for each descriptor
3. Load single HTML documentation (also generated from DocBook-XML) for
   each descriptor
4. Show information.

Kind regards, Joerg

>
> rich
>
> "Joerg K. Wegner" <we...@in...> wrote:
> Hi again,
>
> for performance reasons we should not use (as in JOELib):
> molecule.calculate("XYZ")
>
> we should use:
> keyXYZ=KeyFactory.getKey("XYZ");
>
> // and use internal caching for this descriptor
> molecule.calculate(keyXYZ);
>
> Kind regards, Joerg
>
>
>>Hi Rich,
>>
>>
>>>* Molecule implements AtomGraph. In the near future, BondingSystem
>>>should also implement AtomGraph to enable traversal/query with the
>>>same tools used for Molecules (any objections to this?)
>>
>>Good.
>>
>>
>>>* Traversers traverse the graph structure of any AtomGraph. Traversers
>>>are low-level components that are helpful for building higher-level
>>>functionality. Currently two types of Traverser are available:
>>>DepthFirstTraverser and CycleTraverser. Both use a system of Handlers
>>>and Controllers - Handlers for handling events generated at various
>>>stages of a traversal algorithm and Controllers for exercising limited
>>>control over the algorithm itself. This system borrows from SAX's
>>>ContentHandler idea. HanserCycleTraverser is an implementation of
>>>CycleTraverser that uses Hanser's algorithm for finding the set of all
>>>cycles of an AtomGraph using collapsing Path-Graphs.
>>
>>CycleTraverser should use an interface, so that we can switch the
>>traverser.
>>If nothing is said, a default traverser should be used.
>>The traverser should also have an ID and version number, analogous to
>>descriptors.
>>
>>
>>
>>>* MoleculeComparator compares two AtomGraphs for isomorphism, but
>>>without comparing atom/bonding properties. UllmanComparator implements
>>>MoleculeComparator by using Ullman's subgraph isomorphism algorithm.
>>>Like Traverser, MoleculeComparator uses a system of Handlers and
>>>Controllers for fine-grained control. It should be possible to use
>>>this system to create additional isomorphism algorithms implementing
>>>MoleculeComparator.
>>
>>Isn't this only a formulation problem?
>>Can't we use a boolean method compareNode(LabelSet) which uses a set of
>>labels to check isomorphism?
>>
>>
>>>* QueryBuilder enables clients to build a molecular query using the
>>>same process that is used for building a Molecule with
>>>MoleculeBuilder. In fact, QueryBuilder extends MoleculeBuilder and can
>>>be used in many contexts calling for a MoleculeBuilder. QueryBuilder
>>>is designed for building queries that are based on a template molecule
>>>with constraints placed on individual Atoms with AtomQuery.
>>
>>Can 'pharmacophores' also be treated with this approach? So, are
>>combined features allowed, e.g. a carboxylic acid group combined into a
>>single feature with a distance to all other features?
>>
>>
>>
>>>* SmartsQueryFactory is in the early stages, but is intended to
>>>simplify the process of using QueryBuilder by enabling clients to use
>>>SMARTS Atomic Primitive strings as keys to obtain a fully functional
>>>AtomQuery. Although this isn't exactly a SMARTS parser, it isn't that
>>>far from being one given Octet's SmilesReader. Currently only the
>>>wildcard Atomic Primitive ("*") is supported, but others should be
>>>appearing soon. The approach here has some elements in common with
>>>that of CDK's growing SMARTS support, but there are also some
>>>interesting differences.
>>
>>Same as above, so an atom based (not feature based) compareNode(LabelSet)
>>method, where the LabelSet is what I would call the chemical kernel atom
>>labelling set.
>>
>>
>>>Looking a little further down the road for QSAR, what are people's
>>>thoughts on a framework for molecular descriptors? Of course, there
>>>are hundreds of descriptors, and of course we all have our ideas on
>>>what a particular descriptor means or doesn't mean. What I'm actually
>>>wondering about is what a descriptor facility in QSAR would look and
>>>feel like. I've been looking at JOELib's descriptor framework, which
>>>has some reasonable concepts. From what I can tell, there are two
>>>basic kinds of descriptor: a "holistic" descriptor that is a single
>>>value (i.e. TPSA) and which is primitive-like, and everything else,
>>>which tends to be higher-resolution in nature (i.e. Topological
>>>Torsion) and more object-like. Are there any other ideas?
>>
>>With respect to queries I would prefer the object approach, so we can use:
>>result=molecule.calculate("XYZ")
>>or as in JOELib
>>result1=calculator.calculate(mol1,"XYZ", Properties)
>>result2=calculator.calculate(mol2,"XYZ", Properties)
>>
>>for matching or similarity we can then use
>>// inherited from Comparator in Java API
>>// applicable for Euclidean, Tanimoto, atom-pairs
>>similarity=metricThatILike(result1,result2, Properties);
>>
>>For simple single value descriptors it would also be interesting to have:
>>similarity=metricThatILike(ResultSet1,ResultSet2, Properties);
>>Also with a pharmacophore outlook or multiple graph isomorphism, and not
>>only pair-wise matching.
>>
>>So a query is, from my standpoint, a kind of similarity metric which can
>>only return 0 and 1. Sometimes, as in SMARTS matching, we are only
>>interested in subgraph isomorphism.
>>result1=calculator.calculate(mol1,"XYZ", LabelSet)
>>result2=calculator.calculate(mol2,"XYZ", LabelSet)
>>// only applicable for this specific calculator
>>// can be used for maximum common substructure search (MCS)
>>matchings=matchingsThatILike(result1,result2, Properties);
>>
>>So, for SMARTS matching we also need:
>>matchings=matchingsThatILike(query1,result2, Properties);
>>
>>For pharmacophores 2D/3D/Shape we can also use this approach, because
>>the representation for the similarity/matching is the relevant point.
>>matchings=matchingsThatILike(query1,result2, Properties);
>>or
>>similarity=metricThatILike(result1,result2, Properties);
>>
>>Kind regards, Joerg
>>
>>
>
>
>

--
Dipl. Chem. Joerg K. Wegner
Center of Bioinformatics Tuebingen (ZBIT)
Department of Computer Architecture
Univ. Tuebingen, Sand 1, D-72076 Tuebingen, Germany
Phone: (+49/0) 7071 29 78970
Fax: (+49/0) 7071 29 5091
E-Mail: mailto:we...@in...
WWW: http://www-ra.informatik.uni-tuebingen.de
--
Never mistake motion for action.
(E. Hemingway)

Never mistake action for meaningful action.
(Hugo Kubinyi,2004)