From: <tho...@us...> - 2009-08-15 09:51:46
Revision: 14728
          http://cdk.svn.sourceforge.net/cdk/?rev=14728&view=rev
Author:   thomaskuhn
Date:     2009-08-15 09:51:35 +0000 (Sat, 15 Aug 2009)

Log Message:
-----------
Applied Achims comments

Modified Paths:
--------------
    cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex

Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex
===================================================================
--- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-14 07:02:58 UTC (rev 14727)
+++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-15 09:51:35 UTC (rev 14728)
@@ -237,15 +237,7 @@
 %%%%%%%%%%%%%%%%
 %% Background %%
 %%
-\section*{Agenda}
-\begin{itemize}
- \item Provide anatomy of CDK-Taverna worker
- \item Give two or three examples in for each group in worker table
-\end{itemize}
-
-
-
 \section*{Background}
 The recent release of large open chemistry databases into the public domain~\cite{PubChem,IrwinJ2005,ChEMBL,Williams2008} calls for flexible, open toolkits to process them.
@@ -326,15 +318,15 @@
 \subsection*{The anatomy of a CDK-Taverna worker}
 To create a CDK-Taverna worker the Java class of this worker has to implement the CDKLocalWorker Interface. This interface defines that every worker has to define the following methods:\\
-public Map<String, DataThing> execute(Map<String, DataThing> inputMap) throws TaskExecutionException;\\
+public Map String, DataThing execute(Map String, DataThing inputMap) throws TaskExecutionException;\\
 \remark{TK: I am not able to draw the left and right angles for the Java Map object! Can somebody help me please here!}
 public String[] inputNames();\\
 public String[] inputTypes();\\
 public String[] outputNames();\\
 public String[] outputTypes();\\
-The method inputNames and outputNames return the Names of the ports of each worker whereas the inputTypes and outputTypes methods returns the name of the Java object type with its package declaration e.g. ``java/java.util.List'' for a List.
-\remark{EW: Here the paper should explain how chemical structures are passed around, using CML, which is now described elsewhere.}
+The method inputNames and outputNames return the Names of the ports of each worker whereas the inputTypes and outputTypes methods return the names for the Java object types with its package declaration e.g. \texttt{java/java.util.List} for a List. Within CDK-Taverna chemical structures are passed around using the Java object \texttt{java/org.openscience.cdk.applications.taverna.CMLChemfile}.
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %% Results and Discussion %%
 %%
@@ -516,7 +508,7 @@
 Markush structures are generic chemical structures, which contains variable patterns such as \emph{Heterocyclic}, \emph{Alkyl} or \emph{R = Methyl, Isopropyl, Pentyl}. Such structures are commonly used within patents for describing whole compound classes and are named after Eugene A. Markush who described these kind of structures firstly in his US patent in the 1920s. In the process of reaction enumeration, Markush structures are used to design generic reactions. These reactions are usable for the enumeration of large chemical spaces, which includes the generation of chemical target libraries. The result of the enumeration is an important base for patents and for the High Throughput Screening (HTS) laboratories in drug development. Within HTS experiments, a couple of years ago the enumeration of large libraries was driven to about O(10.000-100.000) molecules in one run. These large and often unspecific HTS libraries did not prove to be successful in general. Today the enumeration is based on more targeted libraries with a reduced size of O(1.000) molecules. For a reaction enumeration, a given reaction contains different building blocks, which are needed for the enumeration. Each reactant of the reaction represents a building block. A scientist then selects a number of molecules for each reactant and the reaction enumeration creates a list of all possible products. The list of products then passes a virtual screening before at last a scientist decides which products will be synthesized.
-Results can be visualized and inspected at the end in Bioclipse~\cite{Spjuth2007}. The CDK-Taverna contains worker which allows such a enumeration based on a generic reaction (see Figure~\ref{fig:ReactionEnumerationWorkflow} and ~\ref{fig:ReactionEnumerationSchema}).
+Results can be visualized and inspected at the end in Bioclipse~\cite{Spjuth2007}. CDK-Taverna contains workers which support an enumeration task based on a generic reaction (see Figure~\ref{fig:ReactionEnumerationWorkflow} and ~\ref{fig:ReactionEnumerationSchema}).
 \begin{figure*}
 \centering
@@ -531,10 +523,9 @@
 \end{figure*}
 \subsubsection*{Classification Workflows}
-During the work on this project classification workflows are mainly used to perform an unsupervised clustering. For the majority of classifications an implementation of the ART 2-A algorithm~\cite{Carpenter1991} developed by Stephen Grossberg and Gail Carpenter performs this task. This algorithm was chosen because of its capability to automatically classify open-categorical problems. Compared to traditional clustering methods like the k-nearest-neighbour-clustering, the ART 2-A algorithm is computationally less demanding and therefore applicable especially to large data sets.
-In such a workflow the Get\_QSAR\_vector\_from\_database worker loads data vectors for a specific SQL query a list of data vectors from the database (compare section ??? above).
-\remark{FIXME: finish reference TK: ???}
-This worker provides options to inspect the result vector which includes checks for values such as Not a Number (NaN) or Infinity. The options allow choosing the threshold for removing whole components of the vectors or single vectors and the removing of components whose min value equals its max value.
+During the work on this project classification workflows are mainly used to perform an unsupervised clustering. For the majority of classifications an implementation of the ART 2-A algorithm~\cite{Carpenter1991} developed by Stephen Grossberg and Gail Carpenter is used. This algorithm was chosen because of its capability to automatically classify open-categorical problems. Compared to traditional clustering methods like the k-nearest-neighbour-clustering, the ART 2-A algorithm is computationally less demanding and therefore applicable especially to large data sets.
+Within a typical clustering workflow the Get\_QSAR\_vector\_from\_database worker loads molecule's data vectors (compare descriptor calculation workflow above) for a specific molecular SQL query from the database.
+This worker provides options to inspect the result vector which includes checks for values such as Not a Number (NaN) or Infinity. The options allow choosing the threshold for removing whole components of the vectors or single vectors and the removing of components whose minimum value equals its maximum value.
 After the loading and a first cleaning of the data vector the classification is performed using the ART2A\_Classificator worker. For the configuration of this worker different options are available:
 \begin{itemize}
 \item Linear scaling of the input vector to values between 0 and 1.
@@ -544,7 +535,8 @@
 \item The setting of the maximum classification time.
 \item A limit for the number of classifications and a range for the vigilance parameter.
 \end{itemize}
-The implemented ART 2-A algorithm contains two possible convergence criteria. It converges if the classification does not change after one epoch or if the scalar product of the classes between two epochs is less than the required similarity. As output the worker stored the result within a compressed XML document. For the visualisation of the results this document is processed again. For this visualisation different workers are available depending on the aim of the task. Within a Chemical Diversity Analysis~\cite{KuhnDissertation} the ART 2-A algorithm was used for a hierarchical clustering approach of three databases (two containing Natural Products and one containing molecules of biological interests (ChEBI~\cite{Degtyarenko2008}; see Figure~\ref{fig:ART2AClassificationResult}). The population of different classes indicates a similar coverage of the chemical space from the two Natural Product databases whereas the ChEBI database covers a distinctly different chemical space.
+The implemented ART 2-A algorithm contains two possible convergence criteria. It converges if the classification does not change after one epoch or if the scalar product of the classes between two epochs is less than the required similarity. The clustering result is stored in form of a compressed XML document. This XML result document can be processed with different workers to create different visualizations depending on the aim of the clustering task. For chemical diversity analysis the ART 2-A worker was used for a successive top-down clustering of three chemical databases (two proprietary databases containing natural products and the ChEBI database~\cite{Degtyarenko2008} containing molecules of biological interest, see Figure ~\ref{fig:ART2AClassificationResult}). The occupancies of the different clusters show the similarity of the natural product databases in contrast to the ChEBI database which differs in chemical space occupation~\cite{KuhnDissertation}. This findings will be outlined in a subsequent publication.
+
 \begin{figure*}
 \centering
 \includegraphics[angle=0,clip=false,scale=.5]{pics/ATR2AClassificationWF.pdf}\\ \caption{This workflow loads a data vector from a database and performs an ART2A classification}
@@ -553,14 +545,10 @@
 \begin{figure*}
 \centering
- \includegraphics[angle=0,clip=false,scale=.5]{pics/ART2ACluster.pdf}\\ \caption{This diagram shows the composition of the different detected classes of the ART 2-A classification. Target was the comparison of two proprietary Natural Product databases in contrast to the ChEBI database. }
+ \includegraphics[angle=0,clip=false,scale=.5]{pics/ART2ACluster.pdf}\\ \caption{This diagram shows the occupancies of different detected clusters for three databases. Target was the comparison of two proprietary Natural Product databases in contrast to the ChEBI database. }
 \label{fig:ART2AClassificationResult}
 \end{figure*}
-\subsubsection*{Further Example Workflow}
-\cite{KuhnDissertation} describes different other example workflows and their results . This includes workflows for analysing the atom typing of the CDK and workflows for performing cluster analysis for the determination of the chemical diversity.
-
-
 %%%%%%%%%%%%%%%%%%%%%%
 \section*{Conclusions}
 The open-source solution which is provided by the CDK-Taverna plug-in allows an

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.