From: <ste...@us...> - 2008-09-03 16:12:19
Revision: 12147 http://cdk.svn.sourceforge.net/cdk/?rev=12147&view=rev Author: steinbeck Date: 2008-09-03 16:12:08 +0000 (Wed, 03 Sep 2008) Log Message: ----------- Comments from meeting with Achim, Thomas and Chris Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2008-09-03 15:37:30 UTC (rev 12146) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2008-09-03 16:12:08 UTC (rev 12147) @@ -230,6 +230,23 @@ %% Background %% %% \section*{Background} +There are large Chemistry databases popping up in the public domain (cite pubchem, ChEMBL, ZINC) +and call for easily customisable tools to process them. + +Open Drug Discovery and Open Notebook Science + +Areas calling for chemoinformatics workflow support include +\begin{itemize} + \item Chemical data filtering, transformation and migration workflows + \subitem testtesttesttest + \item Chemical information retrieval related workflows (structures, reactions, object relational data etc.) + \item Data analysis workflows (statistics, clustering, soft computing/computational intelligence, QSAR/QSPR/pharmacophore oriented workflows) +\end{itemize} + + +Why Open and why existing tools are not open? + + The workflow paradigm allows scientists a flexible creation of generic workflows using different kind of data sources, filters and algorithms which fits the ever-changing demand of current research. Two open source tools were used to @@ -246,6 +263,8 @@ of different functionality. KNIME \cite{KNIMEWeb} is a open source modular data exploration platform which is licensed under the Aladdin free public license and is developed by the group of Michael Berthold at the University of Konstanz, Germany. KNIME is based on the open-source Eclipse platform. 
+ +For all these different scenarios a number of different worker are implemented. %ToDo Cite Wendy \section*{Implementation} @@ -293,43 +312,36 @@ from the given repository location. After a restart of Taverna the new installed plug-in is usable with its features. +\subsection*{Iteration over large datasets} +XXX Move section from below XXXX + %%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Results and Discussion %% %% \section*{Results} -\subsection*{Potential CDK-Taverna Workflows} -The chemoinformatics extension of Taverna leads to different potential workflow -scenarios like: -\begin{itemize} - \item Data filtering and transformation workflows - \item Data migration workflows - \item Chemical information retrieval related workflows (structures, reactions, object relational data etc.) - \item QSAR/QSPR/pharmacophore oriented workflows - \item Data analysis workflows (statistics, clustering, soft computing/computational intelligence etc.) -\end{itemize} -For all these different scenarios a number of different worker are implemented. +\subsection*{Current Status} -\subsection*{Current Worker allocation} +XXX Table of implemented Workers, group by subject XXX The CDK-Taverna plug-in provides approximately 140 different worker. These worker are allocated into file IO (15), database IO (7), QSAR descriptors (76, 42 molecular / 6 bond / 1 atom pair / 27 atomic), data analysis worker (17), SMILE tools (2) and miscellaneous (ca. 30) like substructure filter, aromaticity detection, atom typing and reaction enumeration. -\subsection*{Database Back-End} +We will exemplify some of the components as part of example workflows described below. + +\subsection*{Database I/O} The CDK-Taverna project decided to use the PostgresSQL \cite{PostgreSQLWeb} database with the open-source Pgchem::tigress \cite{PGChemWeb}extension. This combination allows the storing and querying of molecules on the database using an implementation of the GIST index of the PostgresSQL database. 
%TODO: Cite Postgres DB and PGChem:Tigres and add the version numbers -\subsection*{Example Workflows} -The first simple example workflow, a substructure search workflow, already shows -a lot of the potential of such workflow systems. -\subsubsection*{Substructure Workflow} +\subsection*{Substructure Workflow} +XXX split into CDK based and database based XXX The substructure workflow performs a topological substructure search on a list of given molecules and a given molecular substructure. (see figure~\ref{fig:substructureworkflow.ps}) The inputs of this workflow will be @@ -349,7 +361,7 @@ \label{fig:substructureworkflow.ps} \end{figure*} -\subsubsection*{QSAR Calculation Workflow} +\subsubsection*{Descriptor Calculation Workflow} This more complex example (see figure~\ref{fig:QSARWorkflow})loads its molecules from a PostgresSQL database. The perception of the atom types is the next step, after loading the molecules from the database. In the following steps of the workflow each molecule gets @@ -382,6 +394,9 @@ \subsubsection*{Iterative QSAR Workflow} + +XXX move up to technical part XXX + The iterative QSAR workflow is more or less a hack which allows the user to handle many thousands of molecules. The Taverna architecture do not support things like for or while loops. That's the reason why this detour is so important for This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <tho...@us...> - 2008-09-09 22:04:47
Revision: 12253 http://cdk.svn.sourceforge.net/cdk/?rev=12253&view=rev Author: thomaskuhn Date: 2008-09-09 22:04:42 +0000 (Tue, 09 Sep 2008) Log Message: ----------- Work on the article Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2008-09-09 22:04:23 UTC (rev 12252) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2008-09-09 22:04:42 UTC (rev 12253) @@ -230,41 +230,50 @@ %% Background %% %% \section*{Background} -There are large Chemistry databases popping up in the public domain (cite pubchem, ChEMBL, ZINC) -and call for easily customisable tools to process them. - -Open Drug Discovery and Open Notebook Science - -Areas calling for chemoinformatics workflow support include +There are large Chemistry databases popping up in the public domain +\cite{PubChem}\cite{IrwinJ2005}\cite{Degtyarenko2008} and call for easily +customisable tools to process them. Within the Open Drug Discovery and Open +Notebook Science \cite{DeLano2005} a scientists can participate in the research +and get unrestricted access to what has been learned. Essential for an +open-source science are the openess of tools and data \cite{GuhaR2006}. The +drug discovery like any other science has a large set of typical workflows. +These workflows can be grouped into different diciplines like bioinformatic, +chemoinformatic or statistical workflows. Areas calling for chemoinformatics +workflow support include \begin{itemize} \item Chemical data filtering, transformation and migration workflows - \subitem testtesttesttest \item Chemical information retrieval related workflows (structures, reactions, object relational data etc.) 
\item Data analysis workflows (statistics, clustering, soft computing/computational intelligence, QSAR/QSPR/pharmacophore oriented workflows) \end{itemize} +The openess of the tools used in science is so important, because only +open-source tools can be reviewed and analysed completely at its base, the +source code. The open-source tools are not black boxes which produce an output +for a given input, but should be usable like black boxes on the one side and be +as informative as an encyclopedia if the scientist wants to know something +special of the used tool or algorthim. -Why Open and why existing tools are not open? - - -The workflow paradigm allows scientists a flexible creation of generic +The workflow paradigm allows scientists a flexible creation of generic workflows using different kind of data sources, filters and algorithms which fits the ever-changing demand of current research. Two open source tools were used to -create an open workflow environment. Taverna \cite{TomOinn2004} -as workflow environment provides with its extendable architecture a well +create an open workflow environment for chemoinformatic problems. Taverna +\cite{TomOinn2004} as workflow environment provides with its extendable architecture a well established platform for a chemoinformatics extension. The CDK-Taverna project as extension for Taverna uses the Chemistry Development Kit (CDK) -\cite{Steinbeck2004a,Steinbeck2003a} as Chemoinformatics software library. The -workflow or pipeline paradigm is not a new one and there are a number of tools +\cite{Steinbeck2004a}\cite{Steinbeck2003a} as Chemoinformatics software library. +The workflow or pipeline paradigm is not a new one and there are a number of tools out there which provide such chemoinformatic functionality for the user like the Pipeline Pilot \cite{PipelinePilotWeb} from SciTegic, a subsidiary of Accelrys or the InforSense platform from InforSense \cite{InforSenseWeb}. 
Both are commercial well established but closed source products with a large variety of different functionality. KNIME \cite{KNIMEWeb} is a open source modular data -exploration platform which is licensed under the Aladdin free public license and is developed by the group of Michael Berthold at the University of Konstanz, Germany. KNIME is based on the open-source Eclipse -platform. - -For all these different scenarios a number of different worker are implemented. +exploration platform which is licensed under the Aladdin free public license +and is developed by the group of Michael Berthold at the University of +Konstanz, Germany. KNIME is based on the open-source Eclipse platform. +The aim of this articel is not to evaluate the different workflow solutions but +to say the the only real open-source workflow environment for the given +chemoinformatic tasks is the combination of Taverna and the CDK-Taverna +plug-in. %ToDo Cite Wendy \section*{Implementation} @@ -314,8 +323,17 @@ \subsection*{Iteration over large datasets} -XXX Move section from below XXXX +For chemoinformatic applications the handling of large datasets is essential. +Programatically this is normaly done with a for or a while loop. Unluckly the +architecture of Taverna do not support such loops. Within the CDK-Taverna +there are worker available to build workflows which act like for or while loops +to process large datasets. This construction uses Taverna's iteration and retry +mechanism. +%see section iterative qsar workflow + + + %%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Results and Discussion %% %% @@ -323,13 +341,30 @@ \subsection*{Current Status} -XXX Table of implemented Workers, group by subject XXX -The CDK-Taverna plug-in provides approximately 140 different worker. These -worker are allocated into file IO (15), database IO (7), QSAR descriptors (76, -42 molecular / 6 bond / 1 atom pair / 27 atomic), data analysis worker (17), SMILE tools (2) and -miscellaneous (ca. 
30) like substructure filter, aromaticity detection, atom -typing and reaction enumeration. +The CDK-Taverna plug-in provides approximately 140 different worker. The +allocation of the can be found in the following table. +\begin{table}[htbp] +\caption{The worker allocation of CDK-Taverna} +\begin{flushleft} +\begin{tabular}{|l|c|} +\hline +\textbf{Grouped worker} & \multicolumn{1}{l|}{\textbf{Number of worker}} \\ \hline +File I/O & 15 \\ \hline +Database I/O & 7 \\ \hline +Molecular descriptors & 42 \\ \hline +Atom descriptors & 27 \\ \hline +Bond descriptors & 6 \\ \hline +Data analysis worker & 17 \\ \hline +SMILE tools & 2 \\ \hline +Miscellaneous & 30 \\ \hline +\end{tabular} +\end{flushleft} +\label{} +\end{table} + +The miscellaneous worker are for example, the substructure filter, the +aromaticity detection, the atom typing or the reaction enumeration We will exemplify some of the components as part of example workflows described below. \subsection*{Database I/O} @@ -337,20 +372,28 @@ database with the open-source Pgchem::tigress \cite{PGChemWeb}extension. This combination allows the storing and querying of molecules on the database using an implementation of the GIST index of the PostgresSQL database. -%TODO: Cite Postgres DB and PGChem:Tigres and add the version numbers \subsection*{Substructure Workflow} -XXX split into CDK based and database based XXX -The substructure workflow performs a topological substructure search on a list -of given molecules and a given molecular substructure. (see -figure~\ref{fig:substructureworkflow.ps}) The inputs of this workflow will be -the molecular substructure as SMILE \cite{WeiningerDavid1988} and a list of -structures stored in a file of the format MDL SD \cite{DalbyArthur1992}. 
The structures which match the substructure gets stored in MDL Mol -\cite{DalbyArthur1992} files while the not matching structures will be +A workflow to perform a substructure search can be done in different ways +depending on the type of input. I a first example the substructure workflow +performs a topological substructure search on a list of given molecules and a +given molecular substructure. (see figure~\ref{fig:substructureworkflow.ps}) +The workflow inputs are molecular substructure as SMILE +\cite{WeiningerDavid1988} and a list of structures stored in a file of the +format MDL SD \cite{DalbyArthur1992}. The structures which match the +substructure gets stored in MDL Mol \cite{DalbyArthur1992} files while the not matching structures will be converted into the Chemical Markup Language (CML) \cite{Murray-RustP1999, KuhnS2007}. This small workflow combines already 4 different molecular structure representations and the use of a topological substructure filter. +Another substructure search workflow will do the substructure search directly +on a database. Therefore it uses functionality provided by the Pgchem extension +of the PostgresSQL database. This allows the use of SQL commands to perform a +substructure search e.g. ``SELECT id, molecule FROM molecules +WHERE molecule = (SELECT molecule from molecules WHERE id = 1)`` The +following workflow ~\ref{fig:database_substructure_search} will perform the +substructure search directly on the database. The molecules containing the +substructure will afterwards printed within a table and stored in a pdf file. \begin{figure*} \centering @@ -361,6 +404,15 @@ \label{fig:substructureworkflow.ps} \end{figure*} +\begin{figure*} + \centering + \includegraphics[angle=90,clip=false,scale=.4]{pics/database_substructure_search.ps}\\ \caption{The workflow performs a substructure search on the database. + The substructure is loaded from a MDL SD file. 
The output will be a pdf + file containing a tabular view of the molecules containing the + substructure} + \label{fig:database_substructure_search} +\end{figure*} + \subsubsection*{Descriptor Calculation Workflow} This more complex example (see figure~\ref{fig:QSARWorkflow})loads its molecules from a PostgresSQL database. The perception of the atom types is the next step, after loading the molecules @@ -392,22 +444,17 @@ \label{fig:QSARWorkflow} \end{figure*} - \subsubsection*{Iterative QSAR Workflow} -XXX move up to technical part XXX - The iterative QSAR workflow is more or less a hack which allows the user to -handle many thousands of molecules. The Taverna architecture do not support -things like for or while loops. That's the reason why this detour is so important for -chemoinformatic workflows which often handle many thousand molecules. This -workflow (see figure~\ref{fig:IterativeQSARWorkflow}) processes each molecule -in the same manner as the QSAR Calculation Workflow but it uses different database workers. Instead of the one database -worker ``Get Molecules From Database`` uses this workflow three database -worker. The ``Iterative Molecule From Database Reader``, the ``Get Molecule -From Database`` and the ``Has Next Molecule From Database``. The first of the -three worker is used to configure the database connection and store it within -an internal object registry. The second worker gets as input the id of the +handle many thousands of molecules. This workflow (see +figure~\ref{fig:IterativeQSARWorkflow}) processes each molecule in the same +manner as the QSAR Calculation Workflow but it uses different database workers. +Instead of the one database worker ``Get Molecules From Database`` uses this +workflow three database worker. The ``Iterative Molecule From Database +Reader``, the ``Get Molecule From Database`` and the ``Has Next Molecule From +Database``. 
The first of the three worker is used to configure the database +connection and store it within an internal object registry. The second worker gets as input the id of the database connection and loads the molecule from the database for the workflow. But it loads only a subset of the original query using the SQL functions LIMIT and OFFSET. The last database worker checks whether the loaded molecules are @@ -429,6 +476,7 @@ \end{figure*} + %%%%%%%%%%%%%%%%%%%%%% \section*{Conclusions} The open-source solution which is provided by the CDK-Taverna plug-in allows an
From: <tho...@us...> - 2008-09-10 09:16:43
Revision: 12278 http://cdk.svn.sourceforge.net/cdk/?rev=12278&view=rev Author: thomaskuhn Date: 2008-09-10 09:16:39 +0000 (Wed, 10 Sep 2008) Log Message: ----------- Edit the article a bit Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2008-09-10 06:53:48 UTC (rev 12277) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2008-09-10 09:16:39 UTC (rev 12278) @@ -184,11 +184,20 @@ \begin{abstract} % Do not use inserted blank lines (ie \\) until main body of text. -\paragraph*{Background:} Workflows are great +\paragraph*{Background:} The availability of large chemistry database in +the public domain generates a demand for easily customisable tools to process +them. In terms of Open Drug Discovery and Open Notebook Science these tools +should be open-source and available for everyone. -\paragraph*{Results:} We made a workflow environment for chemoinformatics \ldots +\paragraph*{Results:} Taverna with the CDK-Taverna plug-in is an +open-source workflow solution for chemoinformatic problems which uses the +Chemistry Development Kit as chemoinformatic library. It provides about 160 +different workers to handle specific chemoinformtic tasks. -\paragraph*{Conclusions:} We are great \ldots +\paragraph*{Conclusions:} The combination of the two open-source projects +Taverna and the Chemistry Development Kit creates an workflow solution which +allows an easily customisable handling of chemoinformatic problems for users +without programming knowledge. \end{abstract} @@ -235,9 +244,9 @@ customisable tools to process them. Within the Open Drug Discovery and Open Notebook Science \cite{DeLano2005} a scientists can participate in the research and get unrestricted access to what has been learned. 
Essential for an -open-source science are the openess of tools and data \cite{GuhaR2006}. The +open-source science are the openness of tools and data \cite{GuhaR2006}. The drug discovery like any other science has a large set of typical workflows. -These workflows can be grouped into different diciplines like bioinformatic, +These workflows can be grouped into different disciplines like bioinformatic, chemoinformatic or statistical workflows. Areas calling for chemoinformatics workflow support include \begin{itemize} @@ -246,12 +255,12 @@ \item Data analysis workflows (statistics, clustering, soft computing/computational intelligence, QSAR/QSPR/pharmacophore oriented workflows) \end{itemize} -The openess of the tools used in science is so important, because only +The openness of the tools used in science is so important, because only open-source tools can be reviewed and analysed completely at its base, the source code. The open-source tools are not black boxes which produce an output -for a given input, but should be usable like black boxes on the one side and be -as informative as an encyclopedia if the scientist wants to know something -special of the used tool or algorthim. +for a given input. But the tools should be usable like black boxes on the one +side and be as informative as an encyclopedia on the other side if the scientist +wants to know something special of the used tool or algorithm. The workflow paradigm allows scientists a flexible creation of generic workflows using different kind of data sources, filters and algorithms which fits the @@ -261,17 +270,16 @@ established platform for a chemoinformatics extension. The CDK-Taverna project as extension for Taverna uses the Chemistry Development Kit (CDK) \cite{Steinbeck2004a}\cite{Steinbeck2003a} as Chemoinformatics software library. 
+ The workflow or pipeline paradigm is not a new one and there are a number of tools out there which provide such chemoinformatic functionality for the user like the Pipeline Pilot \cite{PipelinePilotWeb} from SciTegic, a subsidiary of Accelrys or the InforSense platform from InforSense \cite{InforSenseWeb}. Both are commercial well established but closed source products with a large variety -of different functionality. KNIME \cite{KNIMEWeb} is a open source modular data -exploration platform which is licensed under the Aladdin free public license -and is developed by the group of Michael Berthold at the University of -Konstanz, Germany. KNIME is based on the open-source Eclipse platform. -The aim of this articel is not to evaluate the different workflow solutions but -to say the the only real open-source workflow environment for the given +of different functionality. KNIME \cite{KNIMEWeb} is a modular data +exploration platform which uses a dual licensing model with the Aladdin free +public license. It is developed by the group of Michael Berthold at the +University of Konstanz, Germany. KNIME is based on the open-source Eclipse platform. The aim of this article is not to evaluate the different workflow solutions but to say that the only real open-source workflow environment for the given chemoinformatic tasks is the combination of Taverna and the CDK-Taverna plug-in. %ToDo Cite Wendy @@ -294,7 +302,7 @@ %TODO:At this position should be a table with a list of the SPI's The CDK-Taverna project implements some of these extensions which lead to an integration of the CDK functionality as so called Local Worker. These are -java classes which gets installed on each local Taverna installation. +Java classes which gets installed on each local Taverna installation. %TODO: Detailed description of the integration All worker provided from the CDK-Taverna plug-in implement the ``CDKLocalWorker`` interface. 
This interface is used for the detection of each @@ -313,20 +321,19 @@ for its installation. The plug-in manager needs for the description of the plug-in a XML file. This description XML file includes the name , the version number, the target Taverna version number, the repository location and the Maven -like java package description of the plug-in. After adding the plug-in -installation URL (http://www.cdk-taverna.de/plugin/) to the plug-in manager gets -the user a list of available plug-in version. To +like Java package description of the plug-in. After adding the plug-in +installation URL (http://www.cdk-taverna.de/plugin/) to the plug-in manager all +available plug-in versions will pop up for the user. To install the plug-in the user has to select the plug-in version and -press the install button. Than Taverna downloads the java libraries on the fly +press the install button. Than Taverna downloads the Java libraries on the fly from the given repository location. After a restart of Taverna the new installed plug-in is usable with its features. \subsection*{Iteration over large datasets} For chemoinformatic applications the handling of large datasets is essential. -Programatically this is normaly done with a for or a while loop. Unluckly the -architecture of Taverna do not support such loops. Within the CDK-Taverna -there are worker available to build workflows which act like for or while loops +Programmatically this is normally done with a for or a while loop. Unfortunately the +architecture of Taverna do not support such loops. CDK-Taverna provides worker to build workflows which act like for or while loops to process large datasets. This construction uses Taverna's iteration and retry mechanism. %see section iterative qsar workflow @@ -341,8 +348,8 @@ \subsection*{Current Status} -The CDK-Taverna plug-in provides approximately 140 different worker. The -allocation of the can be found in the following table. 
+The CDK-Taverna plug-in provides approximately 160 different worker. The +allocation of the worker can be found in the following table. \begin{table}[htbp] \caption{The worker allocation of CDK-Taverna} @@ -355,31 +362,33 @@ Molecular descriptors & 42 \\ \hline Atom descriptors & 27 \\ \hline Bond descriptors & 6 \\ \hline -Data analysis worker & 17 \\ \hline +ART2A classificator and result analysis worker & 10 \\ \hline +SimpleKMean and EM clusterer (uses Weka) & 3 \\ \hline SMILE tools & 2 \\ \hline -Miscellaneous & 30 \\ \hline +Inchi Parser & 2 \\ \hline +Miscellaneous & 50 \\ \hline \end{tabular} \end{flushleft} \label{} \end{table} The miscellaneous worker are for example, the substructure filter, the -aromaticity detection, the atom typing or the reaction enumeration +aromaticity detection, the atom typing or the reaction enumeration. We will exemplify some of the components as part of example workflows described below. \subsection*{Database I/O} -The CDK-Taverna project decided to use the PostgresSQL \cite{PostgreSQLWeb} -database with the open-source Pgchem::tigress \cite{PGChemWeb}extension. This +The CDK-Taverna project decided to use the PostgresSQL \cite{PostgreSQLWeb} +database with the open-source Pgchem::tigress \cite{PGChemWeb} extension. This combination allows the storing and querying of molecules on the database using an implementation of the GIST index of the PostgresSQL database. \subsection*{Substructure Workflow} A workflow to perform a substructure search can be done in different ways -depending on the type of input. I a first example the substructure workflow +depending on the type of input. In a first example the substructure workflow performs a topological substructure search on a list of given molecules and a given molecular substructure. 
(see figure~\ref{fig:substructureworkflow.ps}) -The workflow inputs are molecular substructure as SMILE +The workflow inputs are a molecular substructure as SMILE \cite{WeiningerDavid1988} and a list of structures stored in a file of the format MDL SD \cite{DalbyArthur1992}. The structures which match the substructure gets stored in MDL Mol \cite{DalbyArthur1992} files while the not matching structures will be @@ -400,7 +409,7 @@ \includegraphics[angle=0,clip=false,scale=.4]{pics/substructureworkflow.ps}\\ \caption{The workflow performs a topological substructure search on molecules from an MDL SD file. The input of this workflow will be a SMILE - which repesents the searchable substructure.} + which represents the searchable substructure.} \label{fig:substructureworkflow.ps} \end{figure*} @@ -414,9 +423,9 @@ \end{figure*} \subsubsection*{Descriptor Calculation Workflow} -This more complex example (see figure~\ref{fig:QSARWorkflow})loads its molecules from a PostgresSQL database. -The perception of the atom types is the next step, after loading the molecules -from the database. In the following steps of the workflow each molecule gets +This more complex example (see figure~\ref{fig:QSARWorkflow}) loads its +molecules from a PostgresSQL database. The perception of the atom types is the next step, after loading the molecules +from the database. In the following steps, the workflow gets for each molecule its implicit hydrogens and goes through the detection of the Hueckel armaticity. %Cite h\xFCckel aromaticity The tagging of each molecule is needed in the process of the extraction @@ -438,7 +447,7 @@ \begin{figure*} \centering \includegraphics[angle=0,clip=false,scale=.3]{pics/QSAR_Workflow.ps}\\ - \caption{The workflow calculates differnt QSAR properties for the given + \caption{The workflow calculates different QSAR properties for the given molecules which gets loaded from a PostgresSQL database. 
The results of the calculation will be stored in a csv file.} \label{fig:QSARWorkflow} @@ -471,7 +480,7 @@ \begin{figure*} \centering - \includegraphics[angle=0,clip=flase,scale=.3]{pics/Iterative_QSAR_Workflow.ps}\\ \caption{The workflow calculates differnt QSAR properties for the given molecules which gets loaded from a PostgresSQL database. The results of the calculation will be stored in a csv file.} + \includegraphics[angle=0,clip=flase,scale=.3]{pics/Iterative_QSAR_Workflow.ps}\\ \caption{The workflow calculates different QSAR properties for the given molecules which gets loaded from a PostgresSQL database. The results of the calculation will be stored in a csv file.} \label{fig:IterativeQSARWorkflow} \end{figure*} @@ -551,15 +560,8 @@ %% Do not use \listoffigures as most will included as separate files \section*{Figures} -\subsection*{Figure 1 - Sample figure title} -A short description of the figure content -should go here. -\subsection*{Figure 2 - Sample figure title} -Figure legend text. - - %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% %% %% Tables %% @@ -592,24 +594,5 @@ org.embl.ebi.escience.scuflui.spi.UIComponentFactorySPI \\ \hline \end{tabular} } - - -\subsection*{Table 1aa - Sample table title} -Here is an example of a \emph{small} table in \LaTeX\ using -\verb|\tabular{...}|. This is where the description of the table -should go. \par \mbox{} -\par -\mbox{ -\begin{tabular}{|c|c|c|} -\hline \multicolumn{3}{|c|}{My Table}\\ \hline -A1 & B2 & C3 \\ \hline -A2 & ... & .. \\ \hline -A3 & .. & . \\ \hline -\end{tabular} -} -\subsection*{Table 2 - Sample table title} -Large tables are attached as separate files but should -still be described here. - \end{bmcformat} \end{document}
From: <ste...@us...> - 2008-09-13 15:49:41
Revision: 12304 http://cdk.svn.sourceforge.net/cdk/?rev=12304&view=rev Author: steinbeck Date: 2008-09-13 15:49:37 +0000 (Sat, 13 Sep 2008) Log Message: ----------- Revised version up to and including "Background" Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2008-09-12 22:34:27 UTC (rev 12303) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2008-09-13 15:49:37 UTC (rev 12304) @@ -160,7 +160,7 @@ \iid(1)Cologne University Bioinformatics Center (CUBIC),% Zuelpicher Str. 47, Koeln, Germany\\ \iid(2)University of Applied Sciences Gelsenkirchen, Germany\\ -\iid(3)European Bioinformatics Institute (EBI), Hinxton, UK +\iid(3)European Bioinformatics Institute (EBI), Cambridge, UK }% \maketitle @@ -184,20 +184,20 @@ \begin{abstract} % Do not use inserted blank lines (ie \\) until main body of text. -\paragraph*{Background:} The availability of large chemistry database in -the public domain generates a demand for easily customisable tools to process -them. In terms of Open Drug Discovery and Open Notebook Science these tools -should be open-source and available for everyone. +\paragraph*{Background:} The recent release of large open access chemistry database in +the public domain generates a demand for flexible tools to process +them and discover new knowledge. To support Open Drug Discovery and +Open Notebook Science on top of these data resource, is is desirable +for the processing tools to be open-source and available for everyone. -\paragraph*{Results:} Taverna with the CDK-Taverna plug-in is an -open-source workflow solution for chemoinformatic problems which uses the -Chemistry Development Kit as chemoinformatic library. It provides about 160 -different workers to handle specific chemoinformtic tasks.
+\paragraph*{Results:} Here we describe a plugin for the open source workflow engine +Taverna based on the open source chemoinformatics library The Chemistry Development Kit +resulting in a open source workflow solution to attack chemoinformatics problems. +We have implemented more than 160 different workers to handle specific chemoinformtic tasks. \paragraph*{Conclusions:} The combination of the two open-source projects -Taverna and the Chemistry Development Kit creates an workflow solution which -allows an easily customisable handling of chemoinformatic problems for users -without programming knowledge. +Taverna and the Chemistry Development Kit creates an open chemoinformatics workflow +solution. \end{abstract} @@ -239,15 +239,19 @@ %% Background %% %% \section*{Background} -There are large Chemistry databases popping up in the public domain -\cite{PubChem}\cite{IrwinJ2005}\cite{Degtyarenko2008} and call for easily -customisable tools to process them. Within the Open Drug Discovery and Open -Notebook Science \cite{DeLano2005} a scientists can participate in the research -and get unrestricted access to what has been learned. Essential for an -open-source science are the openness of tools and data \cite{GuhaR2006}. The -drug discovery like any other science has a large set of typical workflows. +The recent release of large open Chemistry databases into the public domain +\cite{PubChem}\cite{IrwinJ2005}\cite{EBIDrugs2008} calls for flexible, open toolkits to process them. +These databases and tools will, for the first time, create opportunities for academia and third-world countries +to perform state-of-the-art open drug discovery and translational research - endeavors so far a domain of the +pharmaceutical industry. In order to enable this progress, the tools for processing these datasets must be free and open. 
+Within the Open Drug Discovery and Open +Notebook Science \cite{DeLano2005} scientists can participate in the research +and get unrestricted access to what has been learned \cite{GuhaR2006}. +This article describes the integration of various open source software packages to form an open +workflow engine for chemoinformatics and drug discovery. +These fields, like any other science, encompass sets of typical workflows. These workflows can be grouped into different disciplines like bioinformatic, -chemoinformatic or statistical workflows. Areas calling for chemoinformatics +chemoinformatic or statistical workflows. Areas calling for such workflow support include \begin{itemize} \item Chemical data filtering, transformation and migration workflows @@ -255,33 +259,28 @@ \item Data analysis workflows (statistics, clustering, soft computing/computational intelligence, QSAR/QSPR/pharmacophore oriented workflows) \end{itemize} -The openness of the tools used in science is so important, because only -open-source tools can be reviewed and analysed completely at its base, the -source code. The open-source tools are not black boxes which produce an output -for a given input. But the tools should be usable like black boxes on the one -side and be as informative as an encyclopedia on the other side if the scientist -wants to know something special of the used tool or algorithm. - -The workflow paradigm allows scientists a flexible creation of generic +The workflow paradigm allows scientists to flexibly create generic workflows using different kind of data sources, filters and algorithms which fits the -ever-changing demand of current research. Two open source tools were used to -create an open workflow environment for chemoinformatic problems. Taverna -\cite{TomOinn2004} as workflow environment provides with its extendable architecture a well -established platform for a chemoinformatics extension. 
The CDK-Taverna project as extension for -Taverna uses the Chemistry Development Kit (CDK) -\cite{Steinbeck2004a}\cite{Steinbeck2003a} as Chemoinformatics software library. +ever-changing demand of current research. +In order to achieve this, library methods are encapsulated in Lego(TM)-like building blocks +which can be manipulated with a mouse or any such pointing device in a graphics environment. +These building blocks can be connected by piplines to enable data flow between them, which is why +"pipelining" is often used interchangably for "workflow". +Here, two open source tools were used to +create an open workflow environment for chemoinformatic solutions: Taverna +\cite{TomOinn2004}, a workflow environment with an extensible architecture, and the +Chemistry Development Kit (CDK) +\cite{Steinbeck2004a}\cite{Steinbeck2003a}, a chemoinformatics software library. +The resulting CDK-Taverna package is the first completely free workflow solution for chemoinformatics. -The workflow or pipeline paradigm is not a new one and there are a number of tools -out there which provide such chemoinformatic functionality for the user like -the Pipeline Pilot \cite{PipelinePilotWeb} from SciTegic, a subsidiary of +Existing proprietary or semi-proprietary implementations of the workflow or pipeline paradigm +in molecular informatics include Pipeline Pilot \cite{PipelinePilotWeb} from SciTegic, a subsidiary of Accelrys or the InforSense platform from InforSense \cite{InforSenseWeb}. Both are commercial well established but closed source products with a large variety of different functionality. KNIME \cite{KNIMEWeb} is a modular data exploration platform which uses a dual licensing model with the Aladdin free public license. It is developed by the group of Michael Berthold at the -University of Konstanz, Germany. KNIME is based on the open-source Eclipse platform. 
The aim of this article is not to evaluate the different workflow solutions but to say that the only real open-source workflow environment for the given -chemoinformatic tasks is the combination of Taverna and the CDK-Taverna -plug-in. +University of Konstanz, Germany. KNIME is based on the open-source Eclipse platform. %ToDo Cite Wendy \section*{Implementation} @@ -291,50 +290,50 @@ The plug-in uses like Taverna Maven 2 \cite{MavenWeb} as build system. \subsection*{Taverna's extension points} -Taverna allows the execution of workflows which links together external +Taverna allows the execution of workflows linking together external remote or local, private or public, third-party or home-grown, heterogeneous open services, applications or databases. For the integration of these -different kind of resources Taverna provides different interfaces and -protocols for its extension. It allows for example an easy integration -of webservices which uses the WSDL \cite{WSDLWeb} or SOAP \cite{SOAPWeb} +different kind of resources Taverna provides various interfaces and +protocols for its extension. It allows for an easy integration +of webservices which use the WSDL \cite{WSDLWeb} or SOAP \cite{SOAPWeb} protocol. The CDK-Taverna plug-in uses a local extension of Taverna. For the local extension Taverna provides a list of different Service Provider Interfaces (SPI). %TODO:At this position should be a table with a list of the SPI's The CDK-Taverna project implements some of these extensions which lead to an -integration of the CDK functionality as so called Local Worker. These are -Java classes which gets installed on each local Taverna installation. +integration of the CDK functionality as so called Local Worker, running on the +same machine as the CDK taverna installation. %TODO: Detailed description of the integration All worker provided from the CDK-Taverna plug-in implement the ``CDKLocalWorker`` interface. This interface is used for the detection of each -worker. 
This detection is done within the implementation of the +worker, performed within the implementation of the ``CDKScavenger`` which itself implements the Taverna SPI ``org.embl.ebi.escience.scuflui.workbench.Scavenger``. -To add user interfaces for some of the worker an extending of the -``AbstractCDKProcessorAction`` is necessary. The ``AbstractCDKProcessorAction`` -itself implements the Taverna SPI +Adding user interfaces for some of the workers requires an extension of the +``AbstractCDKProcessorAction``,which itself implements the Taverna SPI ``org.embl.ebi.escience.scuflui.spi.ProcessorActionSPI``. The use of this SPI -allows the adding of for example file chooser dialogs to workers like file +allows the addition of, for example, file chooser dialogs for workers like file reader or writer. \subsection*{Plug-in installation} The CDK-Taverna project uses the plug-in detection manager of Taverna -for its installation. The plug-in manager needs for the description of the -plug-in a XML file. This description XML file includes the name , +for its installation, which requires an XML file describing the plugin with information like the name , the version number, the target Taverna version number, the repository location and the Maven like Java package description of the plug-in. After adding the plug-in installation URL (http://www.cdk-taverna.de/plugin/) to the plug-in manager all -available plug-in versions will pop up for the user. To -install the plug-in the user has to select the plug-in version and -press the install button. Than Taverna downloads the Java libraries on the fly -from the given repository location. After a restart of Taverna the new -installed plug-in is usable with its features. +available plug-in versions will be graphically available to the user. +In order to install the CDK-Taverna plugin, the user selects the desired version and +all neccessary Java libraries are installed on the fly +from the given repository location. 
+ + \subsection*{Iteration over large datasets} -For chemoinformatic applications the handling of large datasets is essential. -Programmatically this is normally done with a for or a while loop. Unfortunately the -architecture of Taverna do not support such loops. CDK-Taverna provides worker to build workflows which act like for or while loops -to process large datasets. This construction uses Taverna's iteration and retry +Chemoinformatics by definition deals with the discovery of chemical knowledge from large data collections. +Usually being to large to be loaded into memory as a whole, one needs to loop of these datasets to process them +piece by piece. Unfortunately the architecture of Taverna's current stable version do not support such loops. +CDK-Taverna therefore provides worker to build workflows which act like for or while loops +to process large datasets making use of Taverna's iteration and retry mechanism. %see section iterative qsar workflow This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <ste...@us...> - 2009-08-02 14:00:59
|
Revision: 14689 http://cdk.svn.sourceforge.net/cdk/?rev=14689&view=rev Author: steinbeck Date: 2009-08-02 14:00:52 +0000 (Sun, 02 Aug 2009) Log Message: ----------- changes to the text Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-01 15:34:51 UTC (rev 14688) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-02 14:00:52 UTC (rev 14689) @@ -120,7 +120,7 @@ %% %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -\title{CDK-Taverna: An open workflow environment for chemoinformatics} +\title{CDK-Taverna: An open workflow environment for cheminformatics} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% %% @@ -159,7 +159,7 @@ \address{% \iid(1)Cologne University Bioinformatics Center (CUBIC),% Zuelpicher Str. 47, Koeln, Germany\\ -\iid(2)University of Applied Sciences of Gelsenkirchen, Institute for Bioinformatics and Chemoinformatics, Recklinghausen, Germany\\ +\iid(2)University of Applied Sciences of Gelsenkirchen, Institute for Bioinformatics and cheminformatics, Recklinghausen, Germany\\ \iid(3)European Bioinformatics Institute (EBI), Cambridge, UK }% @@ -184,29 +184,26 @@ \begin{abstract} % Do not use inserted blank lines (ie \\) until main body of text. -\paragraph*{Background:} The recent release of large open access chemistry database in -the public domain generates a demand for flexible tools to process -them and discover new knowledge. To support Open Drug Discovery and -Open Notebook Science on top of these data resource, is is desirable +\paragraph*{Background:} The recent release of large open access chemistry databases \cite{PubChem,IrwinJ2005,EBIDrugs2008} +generates a demand for flexible tools to process +them and discover new knowledge. 
To freely support open science based on these data resources, is is desirable for the processing tools to be open-source and available for everyone. -\paragraph*{Results:} Here we describe a plugin for the open source workflow engine -Taverna based on the open source chemoinformatics library Chemistry Development Kit (CDK) -resulting in a open source workflow solution to attack chemoinformatics problems. -We have implemented more than 160 different workers to handle specific chemoinformtics tasks. +\paragraph*{Results:} Here we describe an add-on for the open source workflow engine +Taverna \cite{TomOinn2004} based on the open source cheminformatics library Chemistry Development Kit (CDK) \cite{Steinbeck2004a,Steinbeck2003a} +resulting in a open source workflow solution to tackle cheminformatics problems. +We have implemented more than 160 different workers to handle specific cheminformtics tasks. We describe +the applications of CDK-Taverna in four different usage scenarios. \paragraph*{Conclusions:} The combination of the open-source workflow engine -Taverna and the Chemistry Development Kit provides the first open chemoinformatics workflow -solution for the biosciences. +Taverna and the Chemistry Development Kit provides the first free \cite{free-software-foundation} cheminformatics workflow +solution for the biosciences. With the Taverna-community working towards a more powerful workflow engine and a more user-friendly user interface, +CDK-Taverna has the potential to become a free alternative to existing commercial workflow tools. \end{abstract} - - \ifthenelse{\boolean{publ}}{\begin{multicols}{2}}{} - - %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% %% %% The Main Body begins here %% @@ -239,20 +236,14 @@ %% Background %% %% \section*{Background} -The recent release of large open Chemistry databases into the public domain -\cite{PubChem}\cite{IrwinJ2005}\cite{EBIDrugs2008} calls for flexible, open toolkits to process them. 
+The recent release of large open chemistry databases into the public domain +\cite{PubChem,IrwinJ2005,EBIDrugs2008} calls for flexible, open toolkits to process them. These databases and tools will, for the first time, create opportunities for academia and third-world countries -to perform state-of-the-art open drug discovery and translational research - endeavors so far a domain of the -pharmaceutical industry. In order to enable this progress, the tools for processing these datasets must be free and open. -Within the Open Drug Discovery and Open -Notebook Science \cite{DeLano2005} scientists can participate in the research -and get unrestricted access to what has been learned \cite{GuhaR2006}. +to perform state-of-the-art open drug discovery and translational research -- endeavors so far a domain of the +pharmaceutical industry. In order to enable this progress, t This article describes the integration of various open source software packages to form an open -workflow engine for chemoinformatics and drug discovery. -These fields, like any other science, encompass sets of typical workflows. -These workflows can be grouped into different disciplines like bioinformatics, -chemoinformatics or statistical and machine learning workflows. Areas calling for such -workflow support include +workflow engine for cheminformatics, where numerous recurring tasks can be automated, including +tasks for \begin{itemize} \item Chemical data filtering, transformation, curation and migration workflows \item Chemical documentation and information retrieval related workflows (structures, reactions, pharmacophores, object relational data etc.) @@ -260,33 +251,34 @@ \end{itemize} The workflow paradigm allows scientists to flexibly create generic -workflows using different kind of data sources, filters and algorithms which fits the -ever-changing demand of current research. 
+workflows using different kind of data sources, filters and algorithms, which can later be adapted to changing need. . In order to achieve this, library methods are encapsulated in Lego(TM)-like building blocks -which can be manipulated with a mouse or any such pointing device in a graphics environment. -These building blocks can be connected by piplines to enable data flow between them, which is why -"pipelining" is often used interchangably for "workflow". -Here, two open source tools were used to -create an open workflow environment for chemoinformatics solutions: Taverna -\cite{TomOinn2004}, a workflow environment with an extensible architecture, and the -Chemistry Development Kit (CDK) -\cite{Steinbeck2004a}\cite{Steinbeck2003a}, a chemoinformatics software library. -The resulting CDK-Taverna package is the first completely free workflow solution for chemoinformatics. +which can be manipulated with a mouse or any pointing device in a graphical environment, +relieving the scientist from the need to learn a programming language. +Building blocks are connected by data pipelines to enable data flow between them, which is why +"pipelining" is often used interchangeably for "workflow". Existing proprietary or semi-proprietary implementations of the workflow or pipeline paradigm in molecular informatics include Pipeline Pilot \cite{PipelinePilotWeb} from SciTegic, a subsidiary of Accelrys or the InforSense platform from InforSense \cite{InforSenseWeb}. Both -are commercial well established but closed source products with a large variety +are commercially well established but closed source products with a large variety of different functionality. KNIME \cite{KNIMEWeb} is a modular data exploration platform which uses a dual licensing model with the Aladdin free public license. It is developed by the group of Michael Berthold at the -University of Konstanz, Germany. KNIME is based on the open-source Eclipse platform. +University of Konstanz, Germany. 
KNIME is based on the open-source Eclipse platform. +Here we present the combination of two open source tools to +create a free and open workflow environment for cheminformatics solutions. We have encapsulated : Taverna +\cite{TomOinn2004}, a workflow environment with an extensible architecture, and the +Chemistry Development Kit (CDK) +\cite{Steinbeck2004a,Steinbeck2003a}, a cheminformatics software library. +The resulting CDK-Taverna package is the first completely free \cite{free-software-foundation} workflow solution for cheminformatics. + \section*{Implementation} The CDK-Taverna plug-in written in Java is published under the GNU Lesser General Public License (LGPL). -The plug-in uses Maven 2 \cite{MavenWeb} as a build system like Taverna. +Like Taverna itself, the plug-in uses Maven 2 \cite{MavenWeb} as a build system. \subsection*{Taverna's extension points} Taverna allows the execution of workflows linking together heterogeneous @@ -324,7 +316,7 @@ \subsection*{Iteration over large datasets} -Chemoinformatics by definition deals with the discovery of chemical knowledge from large data collections. +cheminformatics by definition deals with the discovery of chemical knowledge from large data collections. These data sources are usually too large to be loaded into memory as a whole so one needs to loop over the datasets to process them piece by piece. Unfortunately the architecture of Taverna's current stable version does not support such loops. CDK-Taverna therefore provides workers which act like FOR or WHILE loops that make use of Taverna's iteration and retry mechanism to allow workflows @@ -480,7 +472,7 @@ The open-source solution which is provided by the CDK-Taverna plug-in allows an efficient and powerful linking between different resources without any need of deeper software-development knowledge. It provides the possibility to process -hundreds of thousands of molecules and is only limited by the available memory. 
The currently implemented workers allow the processing of chemical data in various formats, provides the possibility to calculate chemical properties and allows the perfoming of cluster analysis. The use of the PostgresSQL database with the Pgchem::tigres chemoinformatics cartridge allows the processing of chemical databases with up to a million molecules. +hundreds of thousands of molecules and is only limited by the available memory. The currently implemented workers allow the processing of chemical data in various formats, provides the possibility to calculate chemical properties and allows the perfoming of cluster analysis. The use of the PostgresSQL database with the Pgchem::tigres cheminformatics cartridge allows the processing of chemical databases with up to a million molecules. \section*{Availability and requirements} \begin{itemize}
From: <ste...@us...> - 2009-08-02 14:22:26
|
Revision: 14690 http://cdk.svn.sourceforge.net/cdk/?rev=14690&view=rev Author: steinbeck Date: 2009-08-02 14:22:12 +0000 (Sun, 02 Aug 2009) Log Message: ----------- Addition of new environments for todo and remark Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-02 14:00:52 UTC (rev 14689) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-02 14:22:12 UTC (rev 14690) @@ -57,6 +57,7 @@ %\usepackage[applemac]{inputenc} %applemac support if unicode package fails %\usepackage[latin1]{inputenc} %UNIX support if unicode package fails \usepackage{graphicx} +\usepackage{color} \urlstyle{rm} @@ -99,8 +100,9 @@ %% returned to the Review style setting. %% %% %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +\newcommand{\remark}[1]{\marginpar[\sf\tiny#1\hfil$\rightarrow$]{\sf\tiny$\leftarrow$#1}} +\newcommand{\todo}[1]{\textcolor{red}{\#\#\#}\remark{Todo #1}} - %Review style settings \newenvironment{bmcformat}{\begin{raggedright}\baselineskip20pt\sloppy\setboolean{publ}{false}}{\end{raggedright}\baselineskip20pt\sloppy} @@ -286,7 +288,7 @@ different kind of resources Taverna provides various interfaces and protocols for its extension. It allows for an easy integration of webservices which use the WSDL \cite{WSDLWeb} or SOAP \cite{SOAPWeb} -protocol. The CDK-Taverna plug-in uses a local extension of Taverna. For the +protocol. \remark{This is a remark} The CDK-Taverna plug-in uses a local extension of Taverna. For the local extension Taverna provides a list of different Service Provider Interfaces (SPI). 
The CDK-Taverna project implements some of these extensions: This leads to an integration of the CDK functionality as so called Local Workers which run on the @@ -318,7 +320,7 @@ cheminformatics by definition deals with the discovery of chemical knowledge from large data collections. These data sources are usually too large to be loaded into memory as a whole so one needs to loop over the datasets to process them -piece by piece. Unfortunately the architecture of Taverna's current stable version does not support such loops. +piece by piece. \todo{This is a todo} Unfortunately the architecture of Taverna's current stable version does not support such loops. CDK-Taverna therefore provides workers which act like FOR or WHILE loops that make use of Taverna's iteration and retry mechanism to allow workflows to process large datasets. %see section iterative qsar workflow
From: <ste...@us...> - 2009-08-02 21:23:41
|
Revision: 14691 http://cdk.svn.sourceforge.net/cdk/?rev=14691&view=rev Author: steinbeck Date: 2009-08-02 21:23:31 +0000 (Sun, 02 Aug 2009) Log Message: ----------- more text, more corrections Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-02 14:22:12 UTC (rev 14690) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-02 21:23:31 UTC (rev 14691) @@ -277,18 +277,20 @@ The resulting CDK-Taverna package is the first completely free \cite{free-software-foundation} workflow solution for cheminformatics. -\section*{Implementation} +\section*{Methods} The CDK-Taverna plug-in written in Java is published under the GNU Lesser General Public License (LGPL). Like Taverna itself, the plug-in uses Maven 2 \cite{MavenWeb} as a build system. \subsection*{Taverna's extension points} Taverna allows the execution of workflows linking together heterogeneous -open services, applications or databases (remote or local, private or public, third-party or home-grown, ). For the integration of these +open services, applications or databases (remote or local, private or public, third-party or home-grown) \remark{this is a verbose copy of +some text from taverna people and needs a citation}. For the integration of these different kind of resources Taverna provides various interfaces and protocols for its extension. It allows for an easy integration of webservices which use the WSDL \cite{WSDLWeb} or SOAP \cite{SOAPWeb} -protocol. \remark{This is a remark} The CDK-Taverna plug-in uses a local extension of Taverna. For the +protocol. \remark{Thomas, how much of the surrounding text is copied verbally from someone else?} +The CDK-Taverna plug-in uses a local extension of Taverna. 
For the local extension Taverna provides a list of different Service Provider Interfaces (SPI). The CDK-Taverna project implements some of these extensions: This leads to an integration of the CDK functionality as so called Local Workers which run on the @@ -305,8 +307,8 @@ reader or writer. \subsection*{Plug-in installation} -The CDK-Taverna project uses the plug-in detection manager of Taverna -for its installation. The detection manager requires a plugin description XML file that contains a plugin name, +The CDK-Taverna project employs the plug-in detection manager of Taverna +for its installation. The detection manager requires a plugin description XML file containing a plugin name, a version number, a target Taverna version number, a repository location and a Maven like Java package description. After adding the plug-in installation URL (http://www.cdk-taverna.de/plugin/) to the plug-in manager all @@ -316,18 +318,11 @@ all neccessary Java libraries are installed on the fly from the given repository location. -\subsection*{Iteration over large datasets} +\subsection{The anatomy of a CDK-Taverna worker} -cheminformatics by definition deals with the discovery of chemical knowledge from large data collections. -These data sources are usually too large to be loaded into memory as a whole so one needs to loop over the datasets to process them -piece by piece. \todo{This is a todo} Unfortunately the architecture of Taverna's current stable version does not support such loops. -CDK-Taverna therefore provides workers which act like FOR or WHILE loops that make use of Taverna's iteration and retry mechanism to allow workflows -to process large datasets. -%see section iterative qsar workflow +XXX the title says it all - - %%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Results and Discussion %% %% @@ -359,7 +354,18 @@ The miscellaneous workers are e.g. a substructure filter, an aromaticity detection, an atom typing or a reaction enumeration. 
Some of the workers are outlined as part of example workflows described below. +In the following, we outline the application of CDK-Taverna in a few selected workflow scenarios. +\subsection*{Iteration over large datasets} + +cheminformatics by definition deals with the discovery of chemical knowledge from large data collections. +These data sources are usually too large to be loaded into memory as a whole so one needs to loop over the datasets to process them +piece by piece. \todo{This is a todo} Unfortunately the architecture of Taverna's current stable version does not support such loops. +CDK-Taverna therefore provides workers which act like FOR or WHILE loops that make use of Taverna's iteration and retry mechanism to allow workflows +to process large datasets. +%see section iterative qsar workflow + + \subsection*{Database I/O} For database support the CDK-Taverna project uses the PostgresSQL \cite{PostgreSQLWeb} database with the open-source Pgchem::tigress \cite{PGChemWeb} extension. This @@ -492,14 +498,17 @@ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section*{Authors contributions} -Text for this section \ldots +CS and AZ conceived the project, lead the development. EW wrote an early version of CDK-Taverna and participated +in the all stages of the development. TK did the majority of CDK-Taverna development and +developed the projects to its current state. +All co-authors contributed to the manuscript. - %%%%%%%%%%%%%%%%%%%%%%%%%%% \section*{Acknowledgements} \ifthenelse{\boolean{publ}}{\small}{} -Text for this section \ldots +The authors express their gratitude to the Taverna team as well as the CDK +community for creating these great open tools.
From: <ste...@us...> - 2009-08-02 21:29:00
|
Revision: 14692 http://cdk.svn.sourceforge.net/cdk/?rev=14692&view=rev Author: steinbeck Date: 2009-08-02 21:28:53 +0000 (Sun, 02 Aug 2009) Log Message: ----------- more changes Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-02 21:23:31 UTC (rev 14691) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-02 21:28:53 UTC (rev 14692) @@ -433,8 +433,8 @@ \begin{figure*} \centering \includegraphics[angle=0,clip=false,scale=.3]{pics/QSAR_Workflow.pdf}\\ - \caption{The workflow calculates different QSAR descriptors for molecules - from a PostgresSQL database. The results of the calculation is be stored in a csv file.} + \caption{The workflow calculates various QSAR descriptors for molecules + from a PostgresSQL database. The results of the calculation is be stored in a CSV file.} \label{fig:QSARWorkflow} \end{figure*}
From: <eg...@us...> - 2009-08-03 08:10:36
|
Revision: 14693 http://cdk.svn.sourceforge.net/cdk/?rev=14693&view=rev Author: egonw Date: 2009-08-03 08:10:23 +0000 (Mon, 03 Aug 2009) Log Message: ----------- Updated my address to UU, and removed CUBIC for all of us Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-02 21:28:53 UTC (rev 14692) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-03 08:10:23 UTC (rev 14693) @@ -138,16 +138,16 @@ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -\author{Thomas Kuhn$^{1,2}$% +\author{Thomas Kuhn$^{1}$% \email{Thomas Kuhn - tho...@fh...}% \and -Egon Willighangen$^1$% -\email{Egon Willighangen - ego...@gm...} +Egon L. Willighangen$^2$% +\email{Egon L. Willighangen - ego...@fa...} \and -Achim Zielesny\correspondingauthor$^2$% +Achim Zielesny\correspondingauthor$^1$% \email{Achim Zielesny\correspondingauthor - ach...@fh...} and -Christoph Steinbeck\correspondingauthor$^{1,3}$% +Christoph Steinbeck\correspondingauthor$^{3}$% \email{Christoph Steinbeck\correspondingauthor - ste...@eb...}% } @@ -159,10 +159,9 @@ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \address{% -\iid(1)Cologne University Bioinformatics Center (CUBIC),% -Zuelpicher Str. 
47, Koeln, Germany\\ -\iid(2)University of Applied Sciences of Gelsenkirchen, Institute for Bioinformatics and cheminformatics, Recklinghausen, Germany\\ -\iid(3)European Bioinformatics Institute (EBI), Cambridge, UK +\iid(1) University of Applied Sciences of Gelsenkirchen, Institute for Bioinformatics and cheminformatics, Recklinghausen, Germany\\ +\iid(2) Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden\\ +\iid(3) European Bioinformatics Institute (EBI), Cambridge, UK }% \maketitle |
From: <eg...@us...> - 2009-08-03 21:13:18
|
Revision: 14696 http://cdk.svn.sourceforge.net/cdk/?rev=14696&view=rev Author: egonw Date: 2009-08-03 21:13:10 +0000 (Mon, 03 Aug 2009) Log Message: ----------- Added remark with extra info on the claim of being the first open source cheminfo workflow environment Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-03 21:11:46 UTC (rev 14695) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-03 21:13:10 UTC (rev 14696) @@ -197,7 +197,8 @@ the applications of CDK-Taverna in four different usage scenarios. \paragraph*{Conclusions:} The combination of the open-source workflow engine -Taverna and the Chemistry Development Kit provides the first free \cite{free-software-foundation} cheminformatics workflow +Taverna and the Chemistry Development Kit provides the first open source \cite{free-software-foundation} cheminformatics workflow +\remark{First release dates back to at least October 2005...: \url{http://chem-bla-ics.blogspot.com/2005/10/cdk-taverna-fully-recognized.html}, \url{http://sourceforge.net/projects/cdk/files/CDK-Taverna/20051018/cdk-taverna-20051018.jar/download}} solution for the biosciences. With the Taverna-community working towards a more powerful workflow engine and a more user-friendly user interface, CDK-Taverna has the potential to become a free alternative to existing commercial workflow tools. \end{abstract} |
From: <ste...@us...> - 2009-08-07 13:34:29
|
Revision: 14699 http://cdk.svn.sourceforge.net/cdk/?rev=14699&view=rev Author: steinbeck Date: 2009-08-07 13:34:19 +0000 (Fri, 07 Aug 2009) Log Message: ----------- Scaled figures to look better Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-07 13:15:29 UTC (rev 14698) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-07 13:34:19 UTC (rev 14699) @@ -159,7 +159,7 @@ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \address{% -\iid(1) University of Applied Sciences of Gelsenkirchen, Institute for Bioinformatics and Cheminformatics, Recklinghausen, Germany\\ +\iid(1) University of Applied Sciences of Gelsenkirchen, Institute for Bioinformatics and Cheminformatics, Recklinghausen, Germany\\ \iid(2) Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden\\ \iid(3) European Bioinformatics Institute (EBI), Cambridge, UK }% @@ -198,7 +198,7 @@ \paragraph*{Conclusions:} The combination of the open-source workflow engine Taverna and the Chemistry Development Kit provides the first open source \cite{free-software-foundation} cheminformatics workflow -\remark{First release dates back to at least October 2005...: \url{http://chem-bla-ics.blogspot.com/2005/10/cdk-taverna-fully-recognized.html}, \url{http://sourceforge.net/projects/cdk/files/CDK-Taverna/20051018/cdk-taverna-20051018.jar/download}} +\remark{First release dates back to at least October 2005...: \url{http://chem-bla-ics.blogspot.com/2005/10/cdk-taverna-fully-recognized.html}, \url{http://sourceforge.net/projects/cdk/files/CDK-Taverna/20051018/cdk-taverna-20051018.jar/download} CS: Ok, what do we do with this information?} solution for the biosciences. 
With the Taverna-community working towards a more powerful workflow engine and a more user-friendly user interface, CDK-Taverna has the potential to become a free alternative to existing commercial workflow tools. \end{abstract} @@ -237,14 +237,19 @@ %%%%%%%%%%%%%%%% %% Background %% %% +\section{Agenda} +XXX remove before submission + + + \section*{Background} The recent release of large open chemistry databases into the public domain \cite{PubChem,IrwinJ2005,EBIDrugs2008} calls for flexible, open toolkits to process them. These databases and tools will, for the first time, create opportunities for academia and third-world countries to perform state-of-the-art open drug discovery and translational research -- endeavors so far a domain of the pharmaceutical industry. In order to enable this progress, t -\remark{corrupt sentence?} - +\remark{corrupt sentence?} + This article describes the integration of various open source software packages to form an open workflow engine for cheminformatics, where numerous recurring tasks can be automated, including tasks for @@ -255,7 +260,7 @@ \end{itemize} The workflow paradigm allows scientists to flexibly create generic -workflows using different kinds of data sources, filters and algorithms, which can later be adapted to changing needs. +workflows using different kinds of data sources, filters and algorithms, which can later be adapted to changing needs. In order to achieve this, library methods are encapsulated in Lego(TM)-like building blocks which can be manipulated with a mouse or any pointing device in a graphical environment, relieving the scientist from the need to learn a programming language. @@ -291,7 +296,7 @@ different kind of resources Taverna provides various interfaces and protocols for its extension. It allows for an easy integration of webservices which use the WSDL \cite{WSDLWeb} or SOAP \cite{SOAPWeb} -protocol. \remark{Thomas, how much of the surrounding text is copied verbally from someone else? 
TK: As far as I remember only the frist part is a citation. the other part is my composition.} +protocol. \remark{CS: Thomas, how much of the surrounding text is copied verbally from someone else? TK: As far as I remember only the frist part is a citation. the other part is my composition.} The CDK-Taverna plug-in uses a local extension of Taverna. For the local extension Taverna provides a list of different Service Provider Interfaces (SPI). The CDK-Taverna project implements some of these extensions: This leads to an @@ -317,11 +322,12 @@ available plug-in versions will be graphically available to the user. In order to install the CDK-Taverna plugin the user selects the desired version. Then -all neccessary Java libraries are installed on the fly from the given repository location. +all necessary Java libraries are installed on the fly from the given repository location. \subsection{The anatomy of a CDK-Taverna worker} -\remark{Has somebody added this? And what is the goal here?} +\remark{Has somebody added this? And what is the goal here? CS: I added this and the idea is to show the programmatic skeleton of a +CDK-Taverna worker, if possible} XXX the title says it all @@ -330,8 +336,8 @@ %% \section*{Results} -The CDK-Taverna plug-in provides approximately 164 different cheminformatics workers. The -fields of application of these workers are described in the Table~\ref{tab:cheminfoWorkers}. +The CDK-Taverna plug-in provides approximately 164 different cheminformatics workers. The +fields of application of these workers are described in the Table~\ref{tab:cheminfoWorkers}. 
\begin{table}[htbp] \caption{The worker allocation of CDK-Taverna} @@ -340,8 +346,8 @@ \hline \textbf{Grouped worker} & \multicolumn{1}{l|}{\textbf{Number of worker}} \\ \hline File I/O & 15 \\ \hline -SMILES tools & 2 \\ \hline -InChI parser & 2 \\ \hline +SMILES tools & 2 \\ \hline +InChI parser & 2 \\ \hline Database I/O & 7 \\ \hline Molecular descriptors & 42 \\ \hline Atom descriptors & 27 \\ \hline @@ -350,7 +356,7 @@ Miscellaneous & 50 \\ \hline \end{tabular} \end{flushleft} -\label{tab:cheminfoWorkers} +\label{tab:cheminfoWorkers} \end{table} The miscellaneous workers are e.g. a substructure filter, an @@ -360,7 +366,7 @@ \subsection*{Iteration over large datasets} -Cheminformatics by definition deals with the discovery of chemical knowledge from large data collections. +Cheminformatics by definition deals with the discovery of chemical knowledge from large data collections. These data sources are usually too large to be loaded into memory as a whole so one needs to loop over the datasets to process them piece by piece. \todo{This is a todo} Unfortunately the architecture of Taverna's current stable version does not support such loops. CDK-Taverna therefore provides workers which act like FOR or WHILE loops that make use of Taverna's iteration and retry mechanism to allow workflows @@ -379,7 +385,7 @@ depending on the type of input. In a first example the substructure workflow performs a topological substructure search on a list of given molecules and a given molecular substructure (see figure~\ref{fig:substructureworkflow.ps}). -The workflow inputs are a molecular substructure as a SMILES string +The workflow inputs are a molecular substructure as a SMILES string \cite{WeiningerDavid1988} and a list of structures stored in a MDL SD file \cite{DalbyArthur1992}. The structures which match the substructure are stored as MDL Mol files \cite{DalbyArthur1992}. 
The non-matching structures are converted into the Chemical Markup Language (CML) \cite{Murray-RustP1999, KuhnS2007}. @@ -396,9 +402,9 @@ \centering \includegraphics[angle=0,clip=false,scale=.4]{pics/substructureworkflow.pdf}\\ \caption{The workflow performs a topological substructure search on - molecules from a MDL SD file. The input of this workflow is a SMILES string + molecules from a MDL SD file. The input of this workflow is a SMILES string which represents the substructure.} - \label{fig:substructureworkflow.ps} + \label{} \end{figure*} \begin{figure*} @@ -434,13 +440,13 @@ \begin{figure*} \centering - \includegraphics[angle=0,clip=false,scale=.3]{pics/QSAR_Workflow.pdf}\\ + \includegraphics[angle=0,clip=false,scale=.5]{pics/QSAR_Workflow.pdf}\\ \caption{The workflow calculates various QSAR descriptors for molecules from a PostgresSQL database. The results of the calculation is be stored in a CSV file.} \label{fig:QSARWorkflow} \end{figure*} -%<TODO THOMAS: Wir haben doch Angaben ber die Performance der Berechnungen: Die sollten hier rein> +%<TODO THOMAS: Wir haben doch Angaben ber die Performance der Berechnungen: Die sollten hier rein> \subsubsection*{Iterative QSAR Workflow} @@ -466,7 +472,7 @@ \begin{figure*} \centering - \includegraphics[angle=0,clip=flase,scale=.2]{pics/Iterative_QSAR_Workflow.pdf}\\ \caption{The workflow calculates different QSAR properties for the given molecules which gets loaded from a PostgresSQL database. The results of the calculation will be stored in a csv file.} \label{fig:IterativeQSARWorkflow} + \includegraphics[angle=0,clip=false,scale=.5]{pics/Iterative_QSAR_Workflow.pdf}\\ \caption{The workflow calculates different QSAR properties for the given molecules which gets loaded from a PostgresSQL database. 
The results of the calculation will be stored in a csv file.} \label{fig:IterativeQSARWorkflow} \end{figure*} \subsubsection*{Validation of CDK Atom Types} @@ -475,54 +481,55 @@ \begin{figure*} \centering - \includegraphics[angle=0,clip=flase,scale=.2]{pics/AtomTypingWF.pdf}\\ \caption{This workflow shows the iterative loading of molecules from a database and searches for molecules with (for the CDK) unknown atom types} + \includegraphics[angle=0,clip=false,scale=.5]{pics/AtomTypingWF.pdf}\\ \caption{This workflow shows the iterative loading of molecules from a database and searches for molecules with (for the CDK) unknown atom types} \label{fig:AtomTypingWF} \end{figure*} \subsubsection*{Reaction enumeration} -Markush structures are generic chemical structures, which contains variable patterns such as "Heterocyclic", "Alkyl" or "R = Methyl, Isopropyl, Pentyl". -\remark{This section does not contain a word about what CDK-Taverna can do in this field. This has to be added in combination with an example.} -Such structures are commonly used within patents for describing whole compound classes and are named after Eugene A. Markush who described these kind of structures firstly in his US patent in the 1920s. In the process of reaction enumeration, Markush structures are used to design generic reactions. These reactions are useable for the enumeration of large chemical spaces, which includes the generation of chemical target libraries. The result of the enumeration is an important base for patents and for the High Throughput Screening (HTS) laboratories in drug development. Within HTS experiments, a couple of years ago the enumeration of large libraries was driven to about O(10.000-100.000) molecules in one run. These large and often unspecific HTS libraries did not prove to be successful in general. Today the enumeration is based on more targeted libraries with a reduced size of O(1.000) molecules. 
+Markush structures are generic chemical structures, which contains variable patterns such as "Heterocyclic", "Alkyl" or "R = Methyl, Isopropyl, Pentyl". +\remark{This section does not contain a word about what CDK-Taverna can do in this field. This has to be added in combination with an example.} +Such structures are commonly used within patents for describing whole compound classes and are named after Eugene A. Markush who described these kind of structures firstly in his US patent in the 1920s. In the process of reaction enumeration, Markush structures are used to design generic reactions. These reactions are useable for the enumeration of large chemical spaces, which includes the generation of chemical target libraries. The result of the enumeration is an important base for patents and for the High Throughput Screening (HTS) laboratories in drug development. Within HTS experiments, a couple of years ago the enumeration of large libraries was driven to about O(10.000-100.000) molecules in one run. These large and often unspecific HTS libraries did not prove to be successful in general. Today the enumeration is based on more targeted libraries with a reduced size of O(1.000) molecules. For a reaction enumeration, a given reaction contains different building blocks, which are needed for the enumeration. Each reactant of the reaction represents a building block. A scientist then selects a number of molecules for each reactant and the reaction enumeration creates a list of all possible products. The list of products then passes a virtual screening before at last a scientist decides which products will be synthesized. \begin{figure*} \centering - \includegraphics[angle=0,clip=flase,scale=.4]{pics/reactionenumeration.pdf}\\ \caption{This reaction enumeration example contains two building blocks. For each building block, a list of three reactants is defined. 
This enumeration results in nine different products.} + \includegraphics[angle=0,clip=false,scale=.5]{pics/reactionenumeration.pdf}\\ \caption{This reaction enumeration example contains two building blocks. For each building block, a list of three reactants is defined. This enumeration results in nine different products.} \label{fig:ReactionEnumerationSchema} \end{figure*} \begin{figure*} \centering - \includegraphics[angle=0,clip=flase,scale=.4]{pics/reactionEnumerationWF_2.pdf}\\ \caption{This workflow performs a reaction enumeration. Consequently it loads a generic reaction from a MDL rxn file and two reactant lists from MDL SD files. The products from the enumeration are stored as MDL mol files. Besides these files a PDF document showing the 2D structure of the products is created. At the end Bioclipse will start up to inspect the results. \remark{Bioclipse should be mentioned and cited in the text.} } + \includegraphics[angle=0,clip=flase,scale=.4]{pics/reactionEnumerationWF_2.pdf}\\ \caption{This workflow performs a reaction enumeration. Consequently it loads a generic reaction from a MDL rxn file and two reactant lists from MDL SD files. The products from the enumeration are stored as MDL mol files. Besides these files a PDF document showing the 2D structure of the products is created. At the end Bioclipse will start up to inspect the results.} + \label{fig:ReactionEnumerationSchema} \end{figure*} \subsubsection*{Classification Workflows} \remark{I have just copied this from the phd thesis, Here also the question should the results of the classification be discusses?} -During the work on this project classification workflows are mainly used to perform an unsupervised clustering. For the majority of classifications an implementation of the ART 2-A algorithm \cite{Carpenter1991} developed by Stephen Grossberg and Gail Carpenter performs this task. This algorithm was chosen because of its capability to automatically classify open-categorical problems. 
Compared to traditional clustering methods like the "kth-nearest-neighbour"-clustering, the ART 2-A algorithm is computationally less demanding and therefore applicable especially to large data sets. -In such a workflow the Get\_QSAR\_vector\_from\_database worker loads data vectors for a specific SQL query a list of data vectors from the database (compare section ??? above). -\remark{FIXME: finish reference} -This worker provides options to inspect the result vector which includes checks for values such as Not a Number (NaN) or Infinity. The options allow choosing the threshold for removing whole components of the vectors or single vectors and the removing of components whose min value equals its max value. -After the loading and a first cleaning of the data vector the classification is performed using the ART2A\_Classificator worker. For the configuration of this worker different options are available: +During the work on this project classification workflows are mainly used to perform an unsupervised clustering. For the majority of classifications an implementation of the ART 2-A algorithm \cite{Carpenter1991} developed by Stephen Grossberg and Gail Carpenter performs this task. This algorithm was chosen because of its capability to automatically classify open-categorical problems. Compared to traditional clustering methods like the "kth-nearest-neighbour"-clustering, the ART 2-A algorithm is computationally less demanding and therefore applicable especially to large data sets. +In such a workflow the Get\_QSAR\_vector\_from\_database worker loads data vectors for a specific SQL query a list of data vectors from the database (compare section ??? above). +\remark{FIXME: finish reference} +This worker provides options to inspect the result vector which includes checks for values such as Not a Number (NaN) or Infinity. 
The options allow choosing the threshold for removing whole components of the vectors or single vectors and the removing of components whose min value equals its max value. +After the loading and a first cleaning of the data vector the classification is performed using the ART2A\_Classificator worker. For the configuration of this worker different options are available: \begin{itemize} - \item Linear scaling of the input vector to values between 0 and 1. - \item A switch between \textit{deterministic random} and \textit{random random} for the selection of the vectors to process. - \item The definition of the convergence criteria of the classification. - \item To set the required similarity for the convergence criteria. \remark{These three should be reworded too to sound as an option, not a process.} + \item Linear scaling of the input vector to values between 0 and 1. + \item A switch between \textit{deterministic random} and \textit{random random} for the selection of the vectors to process. + \item The definition of the convergence criteria of the classification. + \item To set the required similarity for the convergence criteria. \remark{These three should be reworded too to sound as an option, not a process.} \item To set the maximum classification time. \item To set the number of classifications and a range for the vigilance parameter. \end{itemize} -The implemented ART 2-A algorithm contains two possible convergence criteria. It converges if the classification does not change after one epoch or if the scalar product of the classes between two epochs is less than the required similarity. As output the worker stored the result within a compressed XML document. For the visualisation of the results this document is processed again. For this visualisation different workers are available depending on the aim of the task. 
\remark{An example is mandatory at this point which can be combined with the next section where an example is also mandatory.} +The implemented ART 2-A algorithm contains two possible convergence criteria. It converges if the classification does not change after one epoch or if the scalar product of the classes between two epochs is less than the required similarity. As output the worker stored the result within a compressed XML document. For the visualisation of the results this document is processed again. For this visualisation different workers are available depending on the aim of the task. \remark{An example is mandatory at this point which can be combined with the next section where an example is also mandatory.} \begin{figure*} \centering - \includegraphics[angle=0,clip=flase,scale=.4]{pics/ATR2AClassificationWF.pdf}\\ \caption{This workflow loads a data vector from a database and performs an ART2A classification} + \includegraphics[angle=0,clip=false,scale=.5]{pics/ATR2AClassificationWF.pdf}\\ \caption{This workflow loads a data vector from a database and performs an ART2A classification} \label{fig:ART2AClassification} \end{figure*} \subsubsection*{Further Example Workflow} -\cite{KuhnDissertation} descripes different other example workflows and their results . This includs workflows for analysing the atom typing of the CDK and workflows for performing cluster analysis for the determination of the chemical diversity. +\cite{KuhnDissertation} describes different other example workflows and their results . This includes workflows for analysing the atom typing of the CDK and workflows for performing cluster analysis for the determination of the chemical diversity. %%%%%%%%%%%%%%%%%%%%%% @@ -530,10 +537,11 @@ The open-source solution which is provided by the CDK-Taverna plug-in allows an efficient and powerful linking between different resources without any need of deeper software-development knowledge. 
It provides the possibility to process -hundreds of thousands of molecules and is only limited by the available memory. The currently implemented workers allow the processing of chemical data in various formats, provides the possibility to calculate chemical properties and allows -cluster analysis of molecular descriptor vectors. -The use of the PostgresSQL database with the Pgchem::tigres cheminformatics cartridge allows the processing of chemical databases with up to a million molecules. +hundreds of thousands of molecules and is only limited by the available memory. The currently implemented workers allow the processing of chemical data in various formats, provides the possibility to calculate chemical properties and allows +cluster analysis of molecular descriptor vectors. +The use of the PostgresSQL database with the Pgchem::tigres cheminformatics cartridge allows the processing of chemical databases with up to a million molecules. + \section*{Availability and requirements} \begin{itemize} \item \textbf{Project name:} CDK-Taverna This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <ste...@us...> - 2009-08-07 15:24:18
|
Revision: 14701 http://cdk.svn.sourceforge.net/cdk/?rev=14701&view=rev Author: steinbeck Date: 2009-08-07 15:24:11 +0000 (Fri, 07 Aug 2009) Log Message: ----------- Addition of a proper trademark symbol and cleanup of the Taverna extension points section. Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-07 14:42:56 UTC (rev 14700) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-07 15:24:11 UTC (rev 14701) @@ -252,14 +252,14 @@ workflow engine for cheminformatics, where numerous recurring tasks can be automated, including tasks for \begin{itemize} - \item Chemical data filtering, transformation, curation and migration workflows - \item Chemical documentation and information retrieval related workflows (structures, reactions, pharmacophores, object relational data etc.) - \item Data analysis workflows (statistics and clustering/machine learning for QSAR, diversity analysis etc.) + \item chemical data filtering, transformation, curation and migration workflows + \item chemical documentation and information retrieval related workflows (structures, reactions, pharmacophores, object relational data etc.) + \item data analysis workflows (statistics and clustering/machine learning for QSAR, diversity analysis etc.) \end{itemize} The workflow paradigm allows scientists to flexibly create generic workflows using different kinds of data sources, filters and algorithms, which can later be adapted to changing needs. 
-In order to achieve this, library methods are encapsulated in Lego(TM)-like building blocks +In order to achieve this, library methods are encapsulated in Lego\texttrademark-like building blocks which can be manipulated with a mouse or any pointing device in a graphical environment, relieving the scientist from the need to learn a programming language. Building blocks are connected by data pipelines to enable data flow between them, which is why @@ -285,29 +285,28 @@ \section*{Methods} The CDK-Taverna plug-in written in Java is published under the GNU Lesser General Public License (LGPL). -Like Taverna itself, the plug-in uses Maven 2 \cite{MavenWeb} as a build system. +Like Taverna itself the plug-in uses Maven 2 \cite{MavenWeb} as a build system. \subsection*{Taverna's extension points} Taverna allows the execution of workflows linking together heterogeneous -open services, applications or databases (remote or local, private or public, third-party or home-grown)\cite{Taylor2007} \remark{this is a verbose copy of -some text from taverna people and needs a citation TK: Done}. For the integration of these -different kind of resources Taverna provides various interfaces and -protocols for its extension. It allows for an easy integration -of webservices which use the WSDL \cite{WSDLWeb} or SOAP \cite{SOAPWeb} -protocol. \remark{CS: Thomas, how much of the surrounding text is copied verbally from someone else? TK: As far as I remember only the frist part is a citation. the other part is my composition.} -The CDK-Taverna plug-in uses a local extension of Taverna. For the -local extension Taverna provides a list of different Service Provider Interfaces (SPI). -The CDK-Taverna project implements some of these extensions: This leads to an -integration of the CDK functionality as so called Local Workers which run on the -same machine as the CDK taverna installation. -All workers provided from the CDK-Taverna plug-in implement the -``CDKLocalWorker`` interface. 
This interface is used for the detection of each -worker, performed within the implementation of -``CDKScavenger`` which itself implements the Taverna SPI -``org.embl.ebi.escience.scuflui.workbench.Scavenger``. +open services, applications or databases (remote or local, private or public, third-party or home-grown)\cite{Taylor2007}. +For the integration of these +different resource types Taverna provides various interfaces and +protocols for its extension. For example, it allows for easy use +of webservices through the WSDL \cite{WSDLWeb} or SOAP \cite{SOAPWeb} +protocol. +The CDK-Taverna plug-in, on the other side, uses a local extension of Taverna. For the +local extension Taverna provides a list of different Service Provider Interfaces (SPI), so +of which we implement in CDK-Taverna, which leads to an +integration of the CDK functionality as so called Local Workers running on the +same machine as the CDK-Taverna installation. +All workers in CDK-Taverna implement the +\texttt{CDKLocalWorker} interface. It is used for the detection of +workers by the \texttt{CDKScavenger} class which itself implements the Taverna SPI +\texttt{org.embl.ebi.escience.scuflui.workbench.Scavenger} interface. Adding user interfaces for some of the workers requires an extension of the -``AbstractCDKProcessorAction`` which itself implements the Taverna SPI -``org.embl.ebi.escience.scuflui.spi.ProcessorActionSPI``. The use of this SPI +\texttt{AbstractCDKProcessorAction} which again implements the Taverna SPI +\texttt{org.embl.ebi.escience.scuflui.spi.ProcessorActionSPI}. The use of this SPI allows the addition of, for example, file chooser dialogs for workers like file reader or writer. |
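The revision above describes workers being detected through a shared interface. The general pattern can be sketched as follows; this is a language-neutral analogy written in Python, not the actual Java code. The names `LocalWorker`, `EchoWorker`, `execute` and `discover_workers` are all hypothetical and do not reproduce the signatures of `CDKLocalWorker`, `CDKScavenger` or the Taverna SPI.

```python
from abc import ABC, abstractmethod
import inspect


class LocalWorker(ABC):
    """Hypothetical stand-in for a marker interface that every worker implements."""

    @abstractmethod
    def execute(self, inputs):
        """Map named inputs to named outputs."""


class EchoWorker(LocalWorker):
    """Toy worker used only for illustration; it returns a copy of its inputs."""

    def execute(self, inputs):
        return dict(inputs)


class NotAWorker:
    """Does not implement the interface, so discovery must skip it."""


def discover_workers(candidates):
    """Return the concrete classes among candidates that implement LocalWorker."""
    return [
        cls
        for cls in candidates
        if inspect.isclass(cls)
        and issubclass(cls, LocalWorker)
        and not inspect.isabstract(cls)
    ]


found = discover_workers([LocalWorker, EchoWorker, NotAWorker])
print([cls.__name__ for cls in found])
```

The design point is that the discovery code depends only on the interface, so new workers can be added without changing the component that lists them.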
From: <ste...@us...> - 2009-08-07 15:33:51
|
Revision: 14702 http://cdk.svn.sourceforge.net/cdk/?rev=14702&view=rev Author: steinbeck Date: 2009-08-07 15:33:43 +0000 (Fri, 07 Aug 2009) Log Message: ----------- Added one comment regarding workers Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-07 15:24:11 UTC (rev 14701) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-07 15:33:43 UTC (rev 14702) @@ -237,11 +237,15 @@ %%%%%%%%%%%%%%%% %% Background %% %% -\section{Agenda} -XXX remove before submission +\section*{Agenda} +\begin{itemize} + \item Provide anatomy of CDK-Taverna worker + \item Give two or three examples in for each group in worker table +\end{itemize} + \section*{Background} The recent release of large open chemistry databases into the public domain \cite{PubChem,IrwinJ2005,ChEMBL} calls for flexible, open toolkits to process them. @@ -313,16 +317,15 @@ \subsection*{Plug-in installation} The CDK-Taverna project employs the plug-in detection manager of Taverna for its installation. The detection manager requires a plugin description XML file containing a plugin name, -a version number, a target Taverna version number, a repository location and a Maven -like Java package description. After adding the plug-in +a version number, a target Taverna version number, a repository location and a Maven-like Java package description. After adding the plug-in installation URL (http://www.cdk-taverna.de/plugin/) to the plug-in manager all -available plug-in versions will be graphically available to the user. +available plug-in versions will be graphically accessible to the user. -In order to install the CDK-Taverna plugin the user selects the desired version. Then -all necessary Java libraries are installed on the fly from the given repository location. 
+In order to install the CDK-Taverna plugin the user selects the desired version after which +all necessary Java libraries are installed on-the-fly from the given repository location. -\subsection{The anatomy of a CDK-Taverna worker} +\subsection*{The anatomy of a CDK-Taverna worker} \remark{Has somebody added this? And what is the goal here? CS: I added this and the idea is to show the programmatic skeleton of a CDK-Taverna worker, if possible} XXX the title says it all @@ -339,26 +342,28 @@ \begin{table}[htbp] \caption{The worker allocation of CDK-Taverna} \begin{flushleft} -\begin{tabular}{|l|c|} +\begin{tabular}{|l|c|l|} \hline -\textbf{Grouped worker} & \multicolumn{1}{l|}{\textbf{Number of worker}} \\ \hline -File I/O & 15 \\ \hline -SMILES tools & 2 \\ \hline -InChI parser & 2 \\ \hline -Database I/O & 7 \\ \hline -Molecular descriptors & 42 \\ \hline -Atom descriptors & 27 \\ \hline -Bond descriptors & 6 \\ \hline -Clustering & 13 \\ \hline -Miscellaneous & 50 \\ \hline +\textbf{Workers by function} & \multicolumn{1}{l|}{\textbf{Number of workers}} & \textbf{Examples} \\ \hline +File I/O & 15 & SDFParser, XXX more examples\\ \hline +SMILES tools & 2 & \\ \hline +InChI parser & 2 & \\ \hline +Database I/O & 7 & \\ \hline +Molecular descriptors & 42 & \\ \hline +Atom descriptors & 27 & \\ \hline +Bond descriptors & 6 & \\ \hline +Clustering & 13 & \\ \hline +Miscellaneous & 50 & \\ \hline \end{tabular} \end{flushleft} \label{tab:cheminfoWorkers} \end{table} The miscellaneous workers are e.g. a substructure filter, an -aromaticity detection, an atom typing or a reaction enumeration. +aromaticity detector, an atomtyper or a reaction enumerator. Some of the workers are outlined as part of example workflows described below. +\remark{CS: Is there a complete list of available CDK-Taverna workers somewhere? 
Can we reference it or should +it go into the supplemental material?} In the following, we outline the application of CDK-Taverna in a few selected workflow scenarios. \subsection*{Iteration over large datasets} This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <ste...@us...> - 2009-08-07 15:39:07
|
Revision: 14703 http://cdk.svn.sourceforge.net/cdk/?rev=14703&view=rev Author: steinbeck Date: 2009-08-07 15:39:00 +0000 (Fri, 07 Aug 2009) Log Message: ----------- Minor fixes Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-07 15:33:43 UTC (rev 14702) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-07 15:39:00 UTC (rev 14703) @@ -370,15 +370,14 @@ Cheminformatics by definition deals with the discovery of chemical knowledge from large data collections. These data sources are usually too large to be loaded into memory as a whole so one needs to loop over the datasets to process them -piece by piece. \todo{This is a todo} Unfortunately the architecture of Taverna's current stable version does not support such loops. -CDK-Taverna therefore provides workers which act like FOR or WHILE loops that make use of Taverna's iteration and retry mechanism to allow workflows +piece by piece. Unfortunately the architecture of Taverna's current stable version does not support such loops. +CDK-Taverna therefore provides workers which act like FOR or WHILE loops,making use of Taverna's iteration and retry mechanism to allow workflows to process large datasets. %see section iterative qsar workflow - \subsection*{Database I/O} For database support the CDK-Taverna project uses the PostgresSQL \cite{PostgreSQLWeb} -database with the open-source Pgchem::tigress \cite{PGChemWeb} extension. This +relational database management system (RDBMS) with the open-source Pgchem::tigress \cite{PGChemWeb} extension. This combination allows storage and fast retrieval of up to a million molecules with a reasonable performance. The Pgchem::tigress extension uses an implementation of the GIST index of the PostgresSQL database. 
From: <ste...@us...> - 2009-08-07 16:36:01
|
Revision: 14704 http://cdk.svn.sourceforge.net/cdk/?rev=14704&view=rev Author: steinbeck Date: 2009-08-07 16:35:54 +0000 (Fri, 07 Aug 2009) Log Message: ----------- Some language corrections in the figure legends Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-07 15:39:00 UTC (rev 14703) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-07 16:35:54 UTC (rev 14704) @@ -378,11 +378,12 @@ \subsection*{Database I/O} For database support the CDK-Taverna project uses the PostgresSQL \cite{PostgreSQLWeb} relational database management system (RDBMS) with the open-source Pgchem::tigress \cite{PGChemWeb} extension. This -combination allows storage and fast retrieval of up to a million molecules with a reasonable performance. The Pgchem::tigress extension uses an implementation of the GIST index -of the PostgresSQL database. +combination allows storage and fast retrieval of up to a million molecules with a reasonable performance \remark{CS: Is this a +fixed limit? Where does this number come from?}. +The Pgchem::tigress extension uses an implementation of the GIST \remark{CS: What is this? Can we explain it or cite something?} of the PostgresSQL database. \subsection*{Substructure Workflow} -A workflow to perform a substructure search can be done in different ways +We may want to design workflows for performing substructure searches in different ways depending on the type of input. In a first example the substructure workflow performs a topological substructure search on a list of given molecules and a given molecular substructure (see figure~\ref{fig:substructureworkflow.ps}). @@ -390,27 +391,28 @@ \cite{WeiningerDavid1988} and a list of structures stored in a MDL SD file \cite{DalbyArthur1992}. 
The structures which match the substructure are stored as MDL Mol files \cite{DalbyArthur1992}. The non-matching structures are converted into the Chemical Markup Language (CML) \cite{Murray-RustP1999, KuhnS2007}. -This small workflow already combines 4 different molecular structure +This small example workflow already combines 4 different molecular structure representations and the use of a topological substructure filter. + A related workflow performs a substructure search directly on a database: It uses functionality provided by the Pgchem extension of the PostgresSQL database. This extension allows the use of SQL commands to perform a -substructure search e.g. ``SELECT id, molecule FROM molecules -WHERE molecule = (SELECT molecule from molecules WHERE id = 1)``. A demonstration is given with workflow ~\ref{fig:database_substructure_search}. -The molecules containing the substructure are printed in table form within a pdf file. +substructure search e.g. \texttt{SELECT id, molecule FROM molecules +WHERE molecule = (SELECT molecule from molecules WHERE id = 1)}. A demonstration is given with workflow ~\ref{fig:database_substructure_search}. +The molecules containing the substructure are reported in tabular form in a PDF file. \begin{figure*} \centering \includegraphics[angle=0,clip=false,scale=.4]{pics/substructureworkflow.pdf}\\ - \caption{The workflow performs a topological substructure search on - molecules from a MDL SD file. The input of this workflow is a SMILES string + \caption{Workflow performing a topological substructure search on + molecules from a \textbf{MDL SD file}. The input of this workflow is a SMILES string which represents the substructure.} \label{fig:substructureworkflow.ps} \end{figure*} \begin{figure*} \centering - \includegraphics[angle=90,clip=false,scale=.4]{pics/database_substructure_search.pdf}\\ \caption{The workflow performs a substructure search on the database. 
+ \includegraphics[angle=90,clip=false,scale=.4]{pics/database_substructure_search.pdf}\\ \caption{Workflow performing a substructure search in a \textbf{database}. The substructure is loaded from a MDL SD file. The output is a PDF file containing a tabular view of the molecules containing the substructure.}
From: <ste...@us...> - 2009-08-11 08:47:04
|
Revision: 14711 http://cdk.svn.sourceforge.net/cdk/?rev=14711&view=rev Author: steinbeck Date: 2009-08-11 08:46:57 +0000 (Tue, 11 Aug 2009) Log Message: ----------- fixed a lot of language issues in the db validation section Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-11 08:30:11 UTC (rev 14710) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-11 08:46:57 UTC (rev 14711) @@ -432,8 +432,7 @@ \begin{figure*} \centering \includegraphics[angle=-90,clip=true,scale=.5]{pics/QSAR_Worker_UI_small.pdf}\\ - \caption{The user interface allows a selection of the available QSAR descriptors. - The selected descriptors are calculated for each molecule during the execution of the workflow.} + \caption{User interface to select QSAR descriptors calculated for each molecule during the execution of the workflow.} \label{fig:QSARWorkerUI} \end{figure*} @@ -445,7 +444,7 @@ \centering \includegraphics[angle=0,clip=false,scale=.5]{pics/QSAR_Workflow.pdf}\\ \caption{Workflow calculating various QSAR descriptors for molecules - from a PostgresSQL database. The results of the calculation is be stored in a CSV file.} + from a PostgresSQL database. The results of the calculation is stored in a CSV file.} \label{fig:QSARWorkflow} \end{figure*} @@ -457,21 +456,22 @@ hundreds of thousands of molecules. This workflow (see figure~\ref{fig:IterativeQSARWorkflow}) processes each molecule in the same manner as the QSAR Calculation Workflow but it uses different database workers. -Instead of the single database worker \texttt{Get Molecules From Database} three database -workers are applied: \texttt{Iterative Molecule From Database Reader}, -\texttt{Get Molecule From Database} and \texttt{Has Next Molecule From Database}. 
+Instead of the single database worker \texttt{Get\_Molecules\_From\_Database} three database +workers are applied: \texttt{Iterative\_Molecule\_From\_Database\_Reader}, +\texttt{Get\_Molecule\_From\_Database} and \texttt{Has\_Next\_Molecule\_From\_Database}. The first of the three workers is used to configure the database connection and store it within an internal object registry. The second worker gets the ID of the database connection as an input and loads molecules from the database. Only a subset of the original query is loaded using the SQL functions LIMIT -and OFFSET. The last database worker checks whether the loaded molecules are +and OFFSET. The last database worker checks whether the set of loaded molecules the the last of this query or if further molecules must be loaded. If the latter applies the output of this last worker would be the text value \texttt{true}. A last but -essential worker is \texttt{Fail if true}. This worker throws an exception if it gets +essential worker is \texttt{Fail\_if\_true}. This worker throws an exception if it gets the value \texttt{true} as input. This worker is crucial for the nested workflow: If it fails the whole nested workflow fails. Taverna then provides a retry mechanism for a failing worker or nested workflow. This mechanism is used to re-run the -nested workflow as often as necessary. +nested workflow as often as necessary. This dirty workaround might become obsolete in +later versions of Taverna. \begin{figure*} \centering @@ -480,7 +480,10 @@ \subsubsection*{Validation of CDK Atom Types} -The calculation of different physiochemical properties is only feasible if the atom types are detected correctly for each molecule. The CDK contains methods for the detection of defined atom types. These methods allow also the identification of unknown atom types. With the help of these methods a validation of the CDK and of the processed data is possible. 
The detection of an unknown atom type by the CDK indicates whether the CDK is missing this specific atom type or the molecule contains chemically nonsensical atom types such as five bonded nitrogen's. Such a workflow (see figure~\ref{fig:AtomTypingWF})searches molecules which contain unknown atom types. Therefore it tries to perceive the atom types of each molecule using the Perceive\_atom\_types worker. Than the Extract\_the\_databaseID\_from\_the\_molecule worker extracts the database identifiers from the molecules with unknown atom types. As results of this workflow two text files are created, the first contains the identifier of the molecules and is created by the Iterative\_File\_Writer. The second file contains information about which atom of which molecule is unknown to the CDK. \remark{Should the results of this workflow be discussed here too?} +The calculation of physiochemical properties in the Chemistry Development Kit (CDK) relies on the +detection of atom types each molecule, including atom types being flagged as \emph{unknown}. Based on this, we devised +an example workflow (see figure~\ref{fig:AtomTypingWF}) for the validation of validation of the CDK atom typing procedures vs processed data. The detection of an unknown atom type by the CDK indicates that either the CDK lacks this specific atom type or the molecule contains chemically nonsensical atom types. In figure~\ref{fig:AtomTypingWF} the Perceive\_atom\_types worker performs an atom type detection, followed by the retrieval of the database ID for those molecules with +unknown atom type by the Extract\_the\_databaseID\_from\_the\_molecule worker. The workflow creates two text files, one containing the identifier of all molecules with unknown atom types, created by the Iterative\_File\_Writer, and a second one containing information about which atom of which molecule is unknown to the CDK. \remark{Should the results of this workflow be discussed here too? 
CS: Yes, pleeeaaaassseee :-)} \begin{figure*} \centering
From: <eg...@us...> - 2009-08-13 22:43:37
|
Revision: 14719 http://cdk.svn.sourceforge.net/cdk/?rev=14719&view=rev Author: egonw Date: 2009-08-13 22:43:28 +0000 (Thu, 13 Aug 2009) Log Message: ----------- Applied aspell to fix a number of spelling errors Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-13 15:08:27 UTC (rev 14718) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-13 22:43:28 UTC (rev 14719) @@ -142,8 +142,8 @@ \author{Thomas Kuhn$^{1}$% \email{Thomas Kuhn - tho...@fh...}% \and -Egon L. Willighangen$^2$% -\email{Egon L. Willighangen - ego...@fa...} +Egon L. Willighagen$^2$% +\email{Egon L. Willighagen - ego...@fa...} \and Achim Zielesny\correspondingauthor$^1$% \email{Achim Zielesny\correspondingauthor - ach...@fh...} @@ -361,27 +361,27 @@ \end{table} The miscellaneous workers are e.g. a substructure filter, an -aromaticity detector, an atomtyper or a reaction enumerator. +aromaticity detector, an atom typer or a reaction enumerator. Some of the workers are outlined as part of example workflows described below. \remark{CS: Is there a complete list of available CDK-Taverna workers somewhere? Can we reference it or should it go into the supplemental material? TK: My Diss contains a complete list in the appendix. So we can reference it or put it here in the supplemental material. } In the following, we outline the application of CDK-Taverna in a few selected workflow scenarios. -\subsection*{Iteration over large datasets} +\subsection*{Iteration over large data sets} Cheminformatics by definition deals with the discovery of chemical knowledge from large data collections. 
-These data sources are usually too large to be loaded into memory as a whole so one needs to loop over the datasets to process them +These data sources are usually too large to be loaded into memory as a whole so one needs to loop over the data sets to process them piece by piece. Unfortunately the architecture of Taverna's current stable version does not support such loops. CDK-Taverna therefore provides workers which act like FOR or WHILE loops,making use of Taverna's iteration and retry mechanism to allow workflows -to process large datasets. +to process large data sets. %see section iterative qsar workflow \subsection*{Database I/O} -For database support the CDK-Taverna project uses the PostgresSQL \cite{PostgreSQLWeb} +For database support the CDK-Taverna project uses the PostgreSQL \cite{PostgreSQLWeb} relational database management system (RDBMS) with the open-source Pgchem::tigress \cite{PGChemWeb} extension. This combination allows storage and fast retrieval of up to a million molecules with a reasonable performance \remark{CS: Is this a fixed limit? Where does this number come from? TK: No this is not a fixed limit. This was a number Ergo Schmid has told me during my thesis.}. -The Pgchem::tigress extension uses an implementation of the Generalized Search Tree (GiST) \cite{GISTWeb} of the PostgresSQL database. CDK-Taverna can use a local installation of the PostgresSQL Database with the Pgchem::tigress extension or can connect to a remote instance. +The Pgchem::tigress extension uses an implementation of the Generalized Search Tree (GiST) \cite{GISTWeb} of the PostgreSQL database. CDK-Taverna can use a local installation of the PostgreSQL Database with the Pgchem::tigress extension or can connect to a remote instance. 
\subsection*{Substructure Workflow} We may want to design workflows for performing substructure searches in different ways @@ -397,7 +397,7 @@ A related workflow performs a substructure search directly on a database: It uses functionality provided by the Pgchem extension -of the PostgresSQL database. This extension allows the use of SQL commands to perform a +of the PostgreSQL database. This extension allows the use of SQL commands to perform a substructure search e.g. \texttt{SELECT id, molecule FROM molecules WHERE molecule = (SELECT molecule from molecules WHERE id = 1)}. A demonstration is given with workflow ~\ref{fig:database_substructure_search}. The molecules containing the substructure are reported in tabular form in a PDF file. @@ -422,7 +422,7 @@ \subsubsection*{Descriptor Calculation Workflow} This workflow (see figure~\ref{fig:QSARWorkflow}) starts with loading its -molecules from a PostgresSQL database. The recognition of the atom types is the next step, followed by the addition +molecules from a PostgreSQL database. The recognition of the atom types is the next step, followed by the addition of implicit hydrogens for each molecule as well as the detection of H"uckel aromaticity. Each molecule is tagged for the descriptor calculation process. The tagging of the molecules is used to add an universal identifier to each molecule. This allows the identification of the corresponding QSAR values within the QSAR values table. The QSAR worker @@ -437,15 +437,15 @@ \label{fig:QSARWorkerUI} \end{figure*} -The result of the workflow is comma separated value (csv) text file +The result of the workflow is comma separated value (CSV) text file which contains the ID of the molecule and the calculate property values. The -property vectors may then be used for statistical analysis, clustering or machine learning purposes. 
With the PostgresSQL Database backend CDK-Taverna is able to calculate a large number of descriptors for many thousand of molecules in a reasonable time (see figure ~\ref{fig:TimeCalculateDescriptors}). +property vectors may then be used for statistical analysis, clustering or machine learning purposes. With the PostgreSQL Database backend CDK-Taverna is able to calculate a large number of descriptors for many thousand of molecules in a reasonable time (see figure ~\ref{fig:TimeCalculateDescriptors}). \begin{figure*} \centering \includegraphics[angle=0,clip=false,scale=.5]{pics/QSAR_Workflow.pdf}\\ \caption{Workflow calculating various QSAR descriptors for molecules - from a PostgresSQL database. The results of the calculation is stored in a CSV file.} + from a PostgreSQL database. The results of the calculation is stored in a CSV file.} \label{fig:QSARWorkflow} \end{figure*} @@ -481,7 +481,7 @@ \begin{figure*} \centering - \includegraphics[angle=0,clip=false,scale=.5]{pics/Iterative_QSAR_Workflow.pdf}\\ \caption{Workflow calculating different QSAR properties for molecules loaded from a PostgresSQL database. The results will be stored in a CSV file.} \label{fig:IterativeQSARWorkflow} + \includegraphics[angle=0,clip=false,scale=.5]{pics/Iterative_QSAR_Workflow.pdf}\\ \caption{Workflow calculating different QSAR properties for molecules loaded from a PostgreSQL database. The results will be stored in a CSV file.} \label{fig:IterativeQSARWorkflow} \end{figure*} \subsubsection*{Validation of CDK Atom Types} @@ -499,7 +499,7 @@ \subsubsection*{Reaction enumeration} Markush structures are generic chemical structures, which contains variable patterns such as \emph{Heterocyclic}, \emph{Alkyl} or \emph{R = Methyl, Isopropyl, Pentyl}. -Such structures are commonly used within patents for describing whole compound classes and are named after Eugene A. Markush who described these kind of structures firstly in his US patent in the 1920s. 
In the process of reaction enumeration, Markush structures are used to design generic reactions. These reactions are useable for the enumeration of large chemical spaces, which includes the generation of chemical target libraries. The result of the enumeration is an important base for patents and for the High Throughput Screening (HTS) laboratories in drug development. Within HTS experiments, a couple of years ago the enumeration of large libraries was driven to about O(10.000-100.000) molecules in one run. These large and often unspecific HTS libraries did not prove to be successful in general. Today the enumeration is based on more targeted libraries with a reduced size of O(1.000) molecules. +Such structures are commonly used within patents for describing whole compound classes and are named after Eugene A. Markush who described these kind of structures firstly in his US patent in the 1920s. In the process of reaction enumeration, Markush structures are used to design generic reactions. These reactions are usable for the enumeration of large chemical spaces, which includes the generation of chemical target libraries. The result of the enumeration is an important base for patents and for the High Throughput Screening (HTS) laboratories in drug development. Within HTS experiments, a couple of years ago the enumeration of large libraries was driven to about O(10.000-100.000) molecules in one run. These large and often unspecific HTS libraries did not prove to be successful in general. Today the enumeration is based on more targeted libraries with a reduced size of O(1.000) molecules. For a reaction enumeration, a given reaction contains different building blocks, which are needed for the enumeration. Each reactant of the reaction represents a building block. A scientist then selects a number of molecules for each reactant and the reaction enumeration creates a list of all possible products. 
The list of products then passes a virtual screening before at last a scientist decides which products will be synthesized. Results can be visualized and inspected at the end in Bioclipse~\cite{Spjuth2007}. The CDK-Taverna contains worker which allows such a enumeration based on a generic reaction (see figure~\ref{fig:ReactionEnumerationWorkflow} and ~\ref{fig:ReactionEnumerationSchema}). @@ -511,7 +511,7 @@ \begin{figure*} \centering - \includegraphics[angle=0,clip=flase,scale=.4]{pics/reactionEnumerationWF_2.pdf}\\ \caption{This workflow performs a reaction enumeration. Consequently it loads a generic reaction from a MDL rxn file and two reactant lists from MDL SD files. The products from the enumeration are stored as MDL mol files. Besides these files a PDF document showing the 2D structure of the products is created. At the end Bioclipse will start up to inspect the results.} + \includegraphics[angle=0,clip=flase,scale=.4]{pics/reactionEnumerationWF_2.pdf}\\ \caption{This workflow performs a reaction enumeration. Consequently it loads a generic reaction from a MDL rxn file and two reactant lists from MDL SD files. The products from the enumeration are stored as MDL molfiles. Besides these files a PDF document showing the 2D structure of the products is created. At the end Bioclipse will start up to inspect the results.} \label{fig:ReactionEnumerationWorkflow} \end{figure*} @@ -553,7 +553,7 @@ deeper software-development knowledge. It provides the possibility to process hundreds of thousands of molecules and is only limited by the available memory. The currently implemented workers allow the processing of chemical data in various formats, provides the possibility to calculate chemical properties and allows cluster analysis of molecular descriptor vectors. -The use of the PostgresSQL database with the Pgchem::tigres cheminformatics cartridge allows the processing of chemical databases with up to a million molecules. 
+The use of the PostgreSQL database with the Pgchem::tigres cheminformatics cartridge allows the processing of chemical databases with up to a million molecules. \section*{Availability and requirements} @@ -562,7 +562,7 @@ \item \textbf{Project home page:} http://www.cdk-taverna.de \item \textbf{Operating system(s):} Platform independent \item \textbf{Programming language:} Java - \item \textbf{Other requirements:} e.g. Java 1.6.0 or higher + \item \textbf{Other requirements:} Java 1.6.0 or higher \item \textbf{License:} GNU Library or Lesser General Public License (LGPL) \item \textbf{Any restrictions to use by non-academics:} non \end{itemize}
From: <eg...@us...> - 2009-08-13 22:59:58
|
Revision: 14721 http://cdk.svn.sourceforge.net/cdk/?rev=14721&view=rev Author: egonw Date: 2009-08-13 22:59:41 +0000 (Thu, 13 Aug 2009) Log Message: ----------- Capitalized Figure in 'see figure X' Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-13 22:46:40 UTC (rev 14720) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-13 22:59:41 UTC (rev 14721) @@ -387,7 +387,7 @@ We may want to design workflows for performing substructure searches in different ways depending on the type of input. In a first example the substructure workflow performs a topological substructure search on a list of given molecules and a -given molecular substructure (see figure~\ref{fig:substructureworkflow.ps}). +given molecular substructure (see Figure~\ref{fig:substructureworkflow.ps}). The workflow inputs are a molecular substructure as a SMILES string \cite{WeiningerDavid1988} and a list of structures stored in a MDL SD file \cite{DalbyArthur1992}. The structures which match the substructure are stored as MDL Mol files \cite{DalbyArthur1992}. The non-matching structures are @@ -421,13 +421,13 @@ \end{figure*} \subsubsection*{Descriptor Calculation Workflow} -This workflow (see figure~\ref{fig:QSARWorkflow}) starts with loading its +This workflow (see Figure~\ref{fig:QSARWorkflow}) starts with loading its molecules from a PostgreSQL database. The recognition of the atom types is the next step, followed by the addition of implicit hydrogens for each molecule as well as the detection of H"uckel aromaticity. Each molecule is tagged for the descriptor calculation process. The tagging of the molecules is used to add an universal identifier to each molecule. This allows the identification of the corresponding QSAR values within the QSAR values table. 
The QSAR worker provides a user interface based selection of multiple QSAR descriptors (see -figure~\ref{fig:QSARWorkerUI}) for the calculation of a molecule's property vector +Figure~\ref{fig:QSARWorkerUI}) for the calculation of a molecule's property vector based on the CDK. \begin{figure*} @@ -439,7 +439,7 @@ The result of the workflow is comma separated value (CSV) text file which contains the ID of the molecule and the calculate property values. The -property vectors may then be used for statistical analysis, clustering or machine learning purposes. With the PostgreSQL Database backend CDK-Taverna is able to calculate a large number of descriptors for many thousand of molecules in a reasonable time (see figure ~\ref{fig:TimeCalculateDescriptors}). +property vectors may then be used for statistical analysis, clustering or machine learning purposes. With the PostgreSQL Database backend CDK-Taverna is able to calculate a large number of descriptors for many thousand of molecules in a reasonable time (see Figure~\ref{fig:TimeCalculateDescriptors}). \begin{figure*} \centering @@ -460,7 +460,7 @@ The iterative QSAR workflow is a \emph{work-around} which allows the treatment of hundreds of thousands of molecules. This workflow (see -figure~\ref{fig:IterativeQSARWorkflow}) processes each molecule in the same +Figure~\ref{fig:IterativeQSARWorkflow}) processes each molecule in the same manner as the QSAR Calculation Workflow but it uses different database workers. Instead of the single database worker \texttt{Get\_Molecules\_From\_Database} three database workers are applied: \texttt{Iterative\_Molecule\_From\_Database\_Reader}, @@ -488,7 +488,7 @@ The calculation of physiochemical properties in the Chemistry Development Kit (CDK) relies on the detection of atom types each molecule, including atom types being flagged as \emph{unknown}. 
Based on this, we devised -an example workflow (see figure~\ref{fig:AtomTypingWF}) for the validation of validation of the CDK atom typing procedures vs processed data. The detection of an unknown atom type by the CDK indicates that either the CDK lacks this specific atom type or the molecule contains chemically nonsensical atom types. In figure~\ref{fig:AtomTypingWF} the Perceive\_atom\_types worker performs an atom type detection, followed by the retrieval of the database ID for those molecules with +an example workflow (see Figure~\ref{fig:AtomTypingWF}) for the validation of validation of the CDK atom typing procedures vs processed data. The detection of an unknown atom type by the CDK indicates that either the CDK lacks this specific atom type or the molecule contains chemically nonsensical atom types. In Figure~\ref{fig:AtomTypingWF} the Perceive\_atom\_types worker performs an atom type detection, followed by the retrieval of the database ID for those molecules with unknown atom type by the Extract\_the\_databaseID\_from\_the\_molecule worker. The workflow creates two text files, one containing the identifier of all molecules with unknown atom types, created by the Iterative\_File\_Writer, and a second one containing information about which atom of which molecule is unknown to the CDK. An analysis \cite{KuhnDissertation} of the atom type detection of three different databases with more than 600000 molecules showed that the CDK algorithms matches the atom types quite well but lacks on atom types of heavier atoms. This leads to problem with coordination compounds containing metals. \begin{figure*} @@ -501,7 +501,7 @@ Markush structures are generic chemical structures, which contains variable patterns such as \emph{Heterocyclic}, \emph{Alkyl} or \emph{R = Methyl, Isopropyl, Pentyl}. Such structures are commonly used within patents for describing whole compound classes and are named after Eugene A. 
Markush who described these kind of structures firstly in his US patent in the 1920s. In the process of reaction enumeration, Markush structures are used to design generic reactions. These reactions are usable for the enumeration of large chemical spaces, which includes the generation of chemical target libraries. The result of the enumeration is an important base for patents and for the High Throughput Screening (HTS) laboratories in drug development. Within HTS experiments, a couple of years ago the enumeration of large libraries was driven to about O(10.000-100.000) molecules in one run. These large and often unspecific HTS libraries did not prove to be successful in general. Today the enumeration is based on more targeted libraries with a reduced size of O(1.000) molecules. For a reaction enumeration, a given reaction contains different building blocks, which are needed for the enumeration. Each reactant of the reaction represents a building block. A scientist then selects a number of molecules for each reactant and the reaction enumeration creates a list of all possible products. The list of products then passes a virtual screening before at last a scientist decides which products will be synthesized. -Results can be visualized and inspected at the end in Bioclipse~\cite{Spjuth2007}. The CDK-Taverna contains worker which allows such a enumeration based on a generic reaction (see figure~\ref{fig:ReactionEnumerationWorkflow} and ~\ref{fig:ReactionEnumerationSchema}). +Results can be visualized and inspected at the end in Bioclipse~\cite{Spjuth2007}. The CDK-Taverna contains worker which allows such a enumeration based on a generic reaction (see Figure~\ref{fig:ReactionEnumerationWorkflow} and ~\ref{fig:ReactionEnumerationSchema}). \begin{figure*} \centering @@ -529,7 +529,7 @@ \item The setting of the maximum classification time. \item A limit for the number of classifications and a range for the vigilance parameter. 
\end{itemize} -The implemented ART 2-A algorithm contains two possible convergence criteria. It converges if the classification does not change after one epoch or if the scalar product of the classes between two epochs is less than the required similarity. As output the worker stored the result within a compressed XML document. For the visualisation of the results this document is processed again. For this visualisation different workers are available depending on the aim of the task. Within a Chemical Diversity Analysis \cite{KuhnDissertation} the ART 2-A algorithm was used for a hierarchical clustering approach of three databases (two containing Natural Products and one containing molecules of biological interests (ChEBI \cite{Degtyarenko2008}) see figure~\ref{fig:ART2AClassificationResult}). The population of different classes indicates a similar coverage of the chemical space from the two Natural Product databases whereas the ChEBI database covers a distinctly different chemical space. +The implemented ART 2-A algorithm contains two possible convergence criteria. It converges if the classification does not change after one epoch or if the scalar product of the classes between two epochs is less than the required similarity. As output the worker stored the result within a compressed XML document. For the visualisation of the results this document is processed again. For this visualisation different workers are available depending on the aim of the task. Within a Chemical Diversity Analysis \cite{KuhnDissertation} the ART 2-A algorithm was used for a hierarchical clustering approach of three databases (two containing Natural Products and one containing molecules of biological interests (ChEBI \cite{Degtyarenko2008}; see Figure~\ref{fig:ART2AClassificationResult}). The population of different classes indicates a similar coverage of the chemical space from the two Natural Product databases whereas the ChEBI database covers a distinctly different chemical space. 
\begin{figure*} \centering \includegraphics[angle=0,clip=false,scale=.5]{pics/ATR2AClassificationWF.pdf}\\ \caption{This workflow loads a data vector from a database and performs an ART2A classification} This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <eg...@us...> - 2009-08-13 23:02:37
|
Revision: 14722 http://cdk.svn.sourceforge.net/cdk/?rev=14722&view=rev Author: egonw Date: 2009-08-13 23:02:28 +0000 (Thu, 13 Aug 2009) Log Message: ----------- Using ~ before \cite{} to ensure no newline in the output Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-13 22:59:41 UTC (rev 14721) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-13 23:02:28 UTC (rev 14722) @@ -186,19 +186,19 @@ \begin{abstract} % Do not use inserted blank lines (ie \\) until main body of text. -\paragraph*{Background:} The recent release of large open access chemistry databases \cite{PubChem,IrwinJ2005,ChEMBL} +\paragraph*{Background:} The recent release of large open access chemistry databases~\cite{PubChem,IrwinJ2005,ChEMBL} generates a demand for flexible tools to process them and discover new knowledge. To freely support open science based on these data resources, it is desirable for the processing tools to be open-source and available for everyone. \paragraph*{Results:} Here we describe a novel combination of the workflow engine -Taverna \cite{TomOinn2004} and the cheminformatics library Chemistry Development Kit (CDK) \cite{Steinbeck2004a,Steinbeck2003a} +Taverna~\cite{TomOinn2004} and the cheminformatics library Chemistry Development Kit (CDK)~\cite{Steinbeck2004a,Steinbeck2003a} resulting in a open source workflow solution for cheminformatics. We have implemented more than 160 different workers to handle specific cheminformatics tasks. We describe the applications of CDK-Taverna in various usage scenarios. 
\paragraph*{Conclusions:} The combination of the workflow engine -Taverna and the Chemistry Development Kit provides the first open source \cite{free-software-foundation} cheminformatics workflow +Taverna and the Chemistry Development Kit provides the first open source~\cite{free-software-foundation} cheminformatics workflow \remark{First release dates back to at least October 2005...: \url{http://chem-bla-ics.blogspot.com/2005/10/cdk-taverna-fully-recognized.html}, \url{http://sourceforge.net/projects/cdk/files/CDK-Taverna/20051018/cdk-taverna-20051018.jar/download} CS: Ok, what do we do with this information? TK: No idea Egon?} solution for the biosciences. With the Taverna-community working towards a more powerful workflow engine and a more user-friendly user interface, CDK-Taverna has the potential to become a free alternative to existing commercial workflow tools. @@ -248,8 +248,7 @@ \section*{Background} -The recent release of large open chemistry databases into the public domain -\cite{PubChem,IrwinJ2005,ChEMBL} calls for flexible, open toolkits to process them. +The recent release of large open chemistry databases into the public domain~\cite{PubChem,IrwinJ2005,ChEMBL} calls for flexible, open toolkits to process them. These databases and tools will, for the first time, create opportunities for academia and third-world countries to perform state-of-the-art open drug discovery and translational research -- endeavors so far a domain of the pharmaceutical industry. @@ -271,23 +270,22 @@ \emph{pipelining} is often used interchangeably for \emph{workflow}. Existing proprietary or semi-proprietary implementations of the workflow or pipeline paradigm -in molecular informatics include Pipeline Pilot \cite{PipelinePilotWeb} from SciTegic, a subsidiary of -Accelrys or the InforSense platform from InforSense \cite{InforSenseWeb}. 
Both +in molecular informatics include Pipeline Pilot~\cite{PipelinePilotWeb} from SciTegic, a subsidiary of +Accelrys or the InforSense platform from InforSense~\cite{InforSenseWeb}. Both are commercially well established but closed source products with a large variety -of different functionality. KNIME \cite{KNIMEWeb} is a modular data +of different functionality. KNIME~\cite{KNIMEWeb} is a modular data exploration platform which uses a dual licensing model with the Aladdin free public license. It is developed by the group of Michael Berthold at the University of Konstanz, Germany. KNIME is based on the open-source Eclipse platform. Here we present the combination of various open source tools to create a free and open workflow environment for cheminformatics: We have integrated our open source cheminformatics -library, the Chemistry Development Kit (CDK) \cite{Steinbeck2004a,Steinbeck2003a} with Taverna -\cite{TomOinn2004}, a workflow environment with an extensible architecture, to produce CDK-Taverna, the first completely free \cite{free-software-foundation} workflow solution for cheminformatics. CDK-Taverna makes use of additional open source components such as Bioclipse \cite{Spjuth2007} and PGChem:Tigress \cite{PGChemWeb}. +library, the Chemistry Development Kit (CDK)~\cite{Steinbeck2004a,Steinbeck2003a} with Taverna~\cite{TomOinn2004}, a workflow environment with an extensible architecture, to produce CDK-Taverna, the first completely free~\cite{free-software-foundation} workflow solution for cheminformatics. CDK-Taverna makes use of additional open source components such as Bioclipse~\cite{Spjuth2007} and PGChem:Tigress \cite{PGChemWeb}. \section*{Methods} The CDK-Taverna plug-in written in Java is published under the GNU Lesser General Public License (LGPL). -Like Taverna itself the plug-in uses Maven 2 \cite{MavenWeb} as a build system. +Like Taverna itself the plug-in uses Maven 2~\cite{MavenWeb} as a build system. 
\subsection*{Taverna's extension points} Taverna allows the execution of workflows linking together heterogeneous @@ -295,7 +293,7 @@ For the integration of these different resource types Taverna provides various interfaces and protocols for its extension. For example, it allows for easy use -of webservices through the WSDL \cite{WSDLWeb} or SOAP \cite{SOAPWeb} +of webservices through the WSDL~\cite{WSDLWeb} or SOAP~\cite{SOAPWeb} protocol. The CDK-Taverna plug-in, on the other side, uses a local extension of Taverna. For the local extension Taverna provides a list of different Service Provider Interfaces (SPI), so @@ -377,11 +375,11 @@ %see section iterative qsar workflow \subsection*{Database I/O} -For database support the CDK-Taverna project uses the PostgreSQL \cite{PostgreSQLWeb} -relational database management system (RDBMS) with the open-source Pgchem::tigress \cite{PGChemWeb} extension. This +For database support the CDK-Taverna project uses the PostgreSQL~\cite{PostgreSQLWeb} +relational database management system (RDBMS) with the open-source Pgchem::tigress~\cite{PGChemWeb} extension. This combination allows storage and fast retrieval of up to a million molecules with a reasonable performance \remark{CS: Is this a fixed limit? Where does this number come from? TK: No this is not a fixed limit. This was a number Ergo Schmid has told me during my thesis.}. -The Pgchem::tigress extension uses an implementation of the Generalized Search Tree (GiST) \cite{GISTWeb} of the PostgreSQL database. CDK-Taverna can use a local installation of the PostgreSQL Database with the Pgchem::tigress extension or can connect to a remote instance. +The Pgchem::tigress extension uses an implementation of the Generalized Search Tree (GiST)~\cite{GISTWeb} of the PostgreSQL database. CDK-Taverna can use a local installation of the PostgreSQL Database with the Pgchem::tigress extension or can connect to a remote instance. 
\subsection*{Substructure Workflow} We may want to design workflows for performing substructure searches in different ways @@ -389,9 +387,9 @@ performs a topological substructure search on a list of given molecules and a given molecular substructure (see Figure~\ref{fig:substructureworkflow.ps}). The workflow inputs are a molecular substructure as a SMILES string -\cite{WeiningerDavid1988} and a list of structures stored in a MDL SD file \cite{DalbyArthur1992}. The structures which match the -substructure are stored as MDL Mol files \cite{DalbyArthur1992}. The non-matching structures are -converted into the Chemical Markup Language (CML) \cite{Murray-RustP1999, KuhnS2007}. +\cite{WeiningerDavid1988} and a list of structures stored in a MDL SD file~\cite{DalbyArthur1992}. The structures which match the +substructure are stored as MDL Mol files~\cite{DalbyArthur1992}. The non-matching structures are +converted into the Chemical Markup Language (CML)~\cite{Murray-RustP1999, KuhnS2007}. This small example workflow already combines 4 different molecular structure representations and the use of a topological substructure filter. @@ -452,7 +450,7 @@ \begin{figure*} \centering \includegraphics[angle=0,clip=false,scale=.3]{pics/timeneededtocalculatemoleculardescriptors.pdf}\\ - \caption{Overview of the time needed to calculate different molecular descriptors for 1000 molecules using CDK-Taverna. \cite{CDKTavernaBlogWeb}} + \caption{Overview of the time needed to calculate different molecular descriptors for 1000 molecules using CDK-Taverna.~\cite{CDKTavernaBlogWeb}} \label{fig:TimeCalculateDescriptors} \end{figure*} @@ -489,7 +487,7 @@ The calculation of physiochemical properties in the Chemistry Development Kit (CDK) relies on the detection of atom types each molecule, including atom types being flagged as \emph{unknown}. 
Based on this, we devised an example workflow (see Figure~\ref{fig:AtomTypingWF}) for the validation of validation of the CDK atom typing procedures vs processed data. The detection of an unknown atom type by the CDK indicates that either the CDK lacks this specific atom type or the molecule contains chemically nonsensical atom types. In Figure~\ref{fig:AtomTypingWF} the Perceive\_atom\_types worker performs an atom type detection, followed by the retrieval of the database ID for those molecules with -unknown atom type by the Extract\_the\_databaseID\_from\_the\_molecule worker. The workflow creates two text files, one containing the identifier of all molecules with unknown atom types, created by the Iterative\_File\_Writer, and a second one containing information about which atom of which molecule is unknown to the CDK. An analysis \cite{KuhnDissertation} of the atom type detection of three different databases with more than 600000 molecules showed that the CDK algorithms matches the atom types quite well but lacks on atom types of heavier atoms. This leads to problem with coordination compounds containing metals. +unknown atom type by the Extract\_the\_databaseID\_from\_the\_molecule worker. The workflow creates two text files, one containing the identifier of all molecules with unknown atom types, created by the Iterative\_File\_Writer, and a second one containing information about which atom of which molecule is unknown to the CDK. An analysis~\cite{KuhnDissertation} of the atom type detection of three different databases with more than 600000 molecules showed that the CDK algorithms matches the atom types quite well but lacks on atom types of heavier atoms. This leads to problem with coordination compounds containing metals. \begin{figure*} \centering @@ -516,7 +514,7 @@ \end{figure*} \subsubsection*{Classification Workflows} -During the work on this project classification workflows are mainly used to perform an unsupervised clustering. 
For the majority of classifications an implementation of the ART 2-A algorithm \cite{Carpenter1991} developed by Stephen Grossberg and Gail Carpenter performs this task. This algorithm was chosen because of its capability to automatically classify open-categorical problems. Compared to traditional clustering methods like the k-nearest-neighbour-clustering, the ART 2-A algorithm is computationally less demanding and therefore applicable especially to large data sets. +During the work on this project classification workflows are mainly used to perform an unsupervised clustering. For the majority of classifications an implementation of the ART 2-A algorithm~\cite{Carpenter1991} developed by Stephen Grossberg and Gail Carpenter performs this task. This algorithm was chosen because of its capability to automatically classify open-categorical problems. Compared to traditional clustering methods like the k-nearest-neighbour-clustering, the ART 2-A algorithm is computationally less demanding and therefore applicable especially to large data sets. In such a workflow the Get\_QSAR\_vector\_from\_database worker loads data vectors for a specific SQL query a list of data vectors from the database (compare section ??? above). \remark{FIXME: finish reference TK: ???} This worker provides options to inspect the result vector which includes checks for values such as Not a Number (NaN) or Infinity. The options allow choosing the threshold for removing whole components of the vectors or single vectors and the removing of components whose min value equals its max value. @@ -529,7 +527,7 @@ \item The setting of the maximum classification time. \item A limit for the number of classifications and a range for the vigilance parameter. \end{itemize} -The implemented ART 2-A algorithm contains two possible convergence criteria. It converges if the classification does not change after one epoch or if the scalar product of the classes between two epochs is less than the required similarity. 
As output the worker stored the result within a compressed XML document. For the visualisation of the results this document is processed again. For this visualisation different workers are available depending on the aim of the task. Within a Chemical Diversity Analysis \cite{KuhnDissertation} the ART 2-A algorithm was used for a hierarchical clustering approach of three databases (two containing Natural Products and one containing molecules of biological interests (ChEBI \cite{Degtyarenko2008}; see Figure~\ref{fig:ART2AClassificationResult}). The population of different classes indicates a similar coverage of the chemical space from the two Natural Product databases whereas the ChEBI database covers a distinctly different chemical space. +The implemented ART 2-A algorithm contains two possible convergence criteria. It converges if the classification does not change after one epoch or if the scalar product of the classes between two epochs is less than the required similarity. As output the worker stored the result within a compressed XML document. For the visualisation of the results this document is processed again. For this visualisation different workers are available depending on the aim of the task. Within a Chemical Diversity Analysis~\cite{KuhnDissertation} the ART 2-A algorithm was used for a hierarchical clustering approach of three databases (two containing Natural Products and one containing molecules of biological interests (ChEBI~\cite{Degtyarenko2008}; see Figure~\ref{fig:ART2AClassificationResult}). The population of different classes indicates a similar coverage of the chemical space from the two Natural Product databases whereas the ChEBI database covers a distinctly different chemical space. 
\begin{figure*} \centering \includegraphics[angle=0,clip=false,scale=.5]{pics/ATR2AClassificationWF.pdf}\\ \caption{This workflow loads a data vector from a database and performs an ART2A classification}
From: <eg...@us...> - 2009-08-13 23:12:24
|
Revision: 14724 http://cdk.svn.sourceforge.net/cdk/?rev=14724&view=rev Author: egonw Date: 2009-08-13 23:12:14 +0000 (Thu, 13 Aug 2009) Log Message: ----------- Rephrased to (IMHO) better reflect the history of the project Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-13 23:09:40 UTC (rev 14723) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-13 23:12:14 UTC (rev 14724) @@ -569,10 +569,9 @@ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section*{Authors contributions} -CS and AZ conceived the project, lead the development. EW wrote an early version of CDK-Taverna and participated -in the all stages of the development. TK did the majority of CDK-Taverna development and -developed the projects to its current state. -All co-authors contributed to the manuscript. +EW initiated the integration of Taverna and the CDK. CS and AZ conceived the project, +and lead the further development. TK did the majority of CDK-Taverna development and +developed the projects to its current state. All co-authors contributed to the manuscript. %%%%%%%%%%%%%%%%%%%%%%%%%%%
From: <eg...@us...> - 2009-08-14 06:12:29
|
Revision: 14725 http://cdk.svn.sourceforge.net/cdk/?rev=14725&view=rev Author: egonw Date: 2009-08-14 06:12:21 +0000 (Fri, 14 Aug 2009) Log Message: ----------- The paper is English, not German Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-13 23:12:14 UTC (rev 14724) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-14 06:12:21 UTC (rev 14725) @@ -58,7 +58,7 @@ %\usepackage[latin1]{inputenc} %UNIX support if unicode package fails \usepackage{graphicx} \usepackage{color} -\usepackage[german]{babel} +\usepackage[english]{babel} \urlstyle{rm}
From: <tho...@us...> - 2009-08-15 09:51:46
|
Revision: 14728 http://cdk.svn.sourceforge.net/cdk/?rev=14728&view=rev Author: thomaskuhn Date: 2009-08-15 09:51:35 +0000 (Sat, 15 Aug 2009) Log Message: ----------- Applied Achims comments Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-14 07:02:58 UTC (rev 14727) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-15 09:51:35 UTC (rev 14728) @@ -237,15 +237,7 @@ %%%%%%%%%%%%%%%% %% Background %% %% -\section*{Agenda} -\begin{itemize} - \item Provide anatomy of CDK-Taverna worker - \item Give two or three examples in for each group in worker table -\end{itemize} - - - \section*{Background} The recent release of large open chemistry databases into the public domain~\cite{PubChem,IrwinJ2005,ChEMBL,Williams2008} calls for flexible, open toolkits to process them. @@ -326,15 +318,15 @@ \subsection*{The anatomy of a CDK-Taverna worker} To create a CDK-Taverna worker the Java class of this worker has to implement the CDKLocalWorker Interface. This interface defines that every worker has to define the following methods:\\ -public Map<String, DataThing> execute(Map<String, DataThing> inputMap) throws TaskExecutionException;\\ +public Map String, DataThing execute(Map String, DataThing inputMap) throws TaskExecutionException;\\ \remark{TK: I am not able to draw the left and right angles for the Java Map object! Can somebody help me please here!} public String[] inputNames();\\ public String[] inputTypes();\\ public String[] outputNames();\\ public String[] outputTypes();\\ -The method inputNames and outputNames return the Names of the ports of each worker whereas the inputTypes and outputTypes methods returns the name of the Java object type with its package declaration e.g. ``java/java.util.List'' for a List. 
-\remark{EW: Here the paper should explain how chemical structures are passed around, using CML, which is now described elsewhere.} +The method inputNames and outputNames return the Names of the ports of each worker whereas the inputTypes and outputTypes methods return the names for the Java object types with its package declaration e.g. \texttt{java/java.util.List} for a List. Within CDK-Taverna chemical structures are passed around using the Java object \texttt{java/org.openscience.cdk.applications.taverna.CMLChemfile}. + %%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Results and Discussion %% %% @@ -516,7 +508,7 @@ Markush structures are generic chemical structures, which contains variable patterns such as \emph{Heterocyclic}, \emph{Alkyl} or \emph{R = Methyl, Isopropyl, Pentyl}. Such structures are commonly used within patents for describing whole compound classes and are named after Eugene A. Markush who described these kind of structures firstly in his US patent in the 1920s. In the process of reaction enumeration, Markush structures are used to design generic reactions. These reactions are usable for the enumeration of large chemical spaces, which includes the generation of chemical target libraries. The result of the enumeration is an important base for patents and for the High Throughput Screening (HTS) laboratories in drug development. Within HTS experiments, a couple of years ago the enumeration of large libraries was driven to about O(10.000-100.000) molecules in one run. These large and often unspecific HTS libraries did not prove to be successful in general. Today the enumeration is based on more targeted libraries with a reduced size of O(1.000) molecules. For a reaction enumeration, a given reaction contains different building blocks, which are needed for the enumeration. Each reactant of the reaction represents a building block. A scientist then selects a number of molecules for each reactant and the reaction enumeration creates a list of all possible products. 
The list of products then passes a virtual screening before at last a scientist decides which products will be synthesized. -Results can be visualized and inspected at the end in Bioclipse~\cite{Spjuth2007}. The CDK-Taverna contains worker which allows such a enumeration based on a generic reaction (see Figure~\ref{fig:ReactionEnumerationWorkflow} and ~\ref{fig:ReactionEnumerationSchema}). +Results can be visualized and inspected at the end in Bioclipse~\cite{Spjuth2007}. CDK-Taverna contains workers which support an enumeration task based on a generic reaction (see Figure~\ref{fig:ReactionEnumerationWorkflow} and ~\ref{fig:ReactionEnumerationSchema}). \begin{figure*} \centering @@ -531,10 +523,9 @@ \end{figure*} \subsubsection*{Classification Workflows} -During the work on this project classification workflows are mainly used to perform an unsupervised clustering. For the majority of classifications an implementation of the ART 2-A algorithm~\cite{Carpenter1991} developed by Stephen Grossberg and Gail Carpenter performs this task. This algorithm was chosen because of its capability to automatically classify open-categorical problems. Compared to traditional clustering methods like the k-nearest-neighbour-clustering, the ART 2-A algorithm is computationally less demanding and therefore applicable especially to large data sets. -In such a workflow the Get\_QSAR\_vector\_from\_database worker loads data vectors for a specific SQL query a list of data vectors from the database (compare section ??? above). -\remark{FIXME: finish reference TK: ???} -This worker provides options to inspect the result vector which includes checks for values such as Not a Number (NaN) or Infinity. The options allow choosing the threshold for removing whole components of the vectors or single vectors and the removing of components whose min value equals its max value. +During the work on this project classification workflows are mainly used to perform an unsupervised clustering. 
For the majority of classifications an implementation of the ART 2-A algorithm~\cite{Carpenter1991} developed by Stephen Grossberg and Gail Carpenter is used. This algorithm was chosen because of its capability to automatically classify open-categorical problems. Compared to traditional clustering methods like the k-nearest-neighbour-clustering, the ART 2-A algorithm is computationally less demanding and therefore applicable especially to large data sets. +Within a typical clustering workflow the Get\_QSAR\_vector\_from\_database worker loads molecule's data vectors (compare descriptor calculation workflow above) for a specific molecular SQL query from the database. +This worker provides options to inspect the result vector which includes checks for values such as Not a Number (NaN) or Infinity. The options allow choosing the threshold for removing whole components of the vectors or single vectors and the removing of components whose minimum value equals its maximum value. After the loading and a first cleaning of the data vector the classification is performed using the ART2A\_Classificator worker. For the configuration of this worker different options are available: \begin{itemize} \item Linear scaling of the input vector to values between 0 and 1. @@ -544,7 +535,8 @@ \item The setting of the maximum classification time. \item A limit for the number of classifications and a range for the vigilance parameter. \end{itemize} -The implemented ART 2-A algorithm contains two possible convergence criteria. It converges if the classification does not change after one epoch or if the scalar product of the classes between two epochs is less than the required similarity. As output the worker stored the result within a compressed XML document. For the visualisation of the results this document is processed again. For this visualisation different workers are available depending on the aim of the task. 
Within a Chemical Diversity Analysis~\cite{KuhnDissertation} the ART 2-A algorithm was used for a hierarchical clustering approach of three databases (two containing Natural Products and one containing molecules of biological interests (ChEBI~\cite{Degtyarenko2008}; see Figure~\ref{fig:ART2AClassificationResult}). The population of different classes indicates a similar coverage of the chemical space from the two Natural Product databases whereas the ChEBI database covers a distinctly different chemical space. +The implemented ART 2-A algorithm contains two possible convergence criteria. It converges if the classification does not change after one epoch or if the scalar product of the classes between two epochs is less than the required similarity. The clustering result is stored in form of a compressed XML document. This XML result document can be processed with different workers to create different visualizations depending on the aim of the clustering task. For chemical diversity analysis the ART 2-A worker was used for a successive top-down clustering of three chemical databases (two proprietary databases containing natural products and the ChEBI database~\cite{Degtyarenko2008} containing molecules of biological interest, see Figure ~\ref{fig:ART2AClassificationResult}). The occupancies of the different clusters show the similarity of the natural product databases in contrast to the ChEBI database which differs in chemical space occupation~\cite{KuhnDissertation}. This findings will be outlined in a subsequent publication. + \begin{figure*} \centering \includegraphics[angle=0,clip=false,scale=.5]{pics/ATR2AClassificationWF.pdf}\\ \caption{This workflow loads a data vector from a database and performs an ART2A classification} @@ -553,14 +545,10 @@ \begin{figure*} \centering - \includegraphics[angle=0,clip=false,scale=.5]{pics/ART2ACluster.pdf}\\ \caption{This diagram shows the composition of the different detected classes of the ART 2-A classification. 
Target was the comparison of two proprietary Natural Product databases in contrast to the ChEBI database. } + \includegraphics[angle=0,clip=false,scale=.5]{pics/ART2ACluster.pdf}\\ \caption{This diagram shows the occupancies of different detected clusters for three databases. Target was the comparison of two proprietary Natural Product databases in contrast to the ChEBI database. } \label{fig:ART2AClassificationResult} \end{figure*} -\subsubsection*{Further Example Workflow} -\cite{KuhnDissertation} describes different other example workflows and their results . This includes workflows for analysing the atom typing of the CDK and workflows for performing cluster analysis for the determination of the chemical diversity. - - %%%%%%%%%%%%%%%%%%%%%% \section*{Conclusions} The open-source solution which is provided by the CDK-Taverna plug-in allows an
From: <ste...@us...> - 2009-08-28 10:22:22
|
Revision: 14780 http://cdk.svn.sourceforge.net/cdk/?rev=14780&view=rev Author: steinbeck Date: 2009-08-28 09:10:39 +0000 (Fri, 28 Aug 2009) Log Message: ----------- Rearranged sections, fixed figure legends and code display Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-27 12:36:21 UTC (rev 14779) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-28 09:10:39 UTC (rev 14780) @@ -244,8 +244,7 @@ These databases and tools will, for the first time, create opportunities for academia and third-world countries to perform state-of-the-art open drug discovery and translational research -- endeavors so far a domain of the pharmaceutical industry. -This article describes the integration of various open source software packages started in 2005~\cite{Chemblaics2005,CDKTaverna2005} -to form an open workflow engine for cheminformatics, where numerous recurring tasks can be automated, including +Commonly used in this context are workflow engines for cheminformatics, where numerous recurring tasks can be automated, including tasks for \begin{itemize} \item chemical data filtering, transformation, curation and migration workflows @@ -275,60 +274,15 @@ An overview of workflow systems in life sciences was recently given by Tiwari \emph{et al.}~\cite{Tiwari2007}. -Here we present the combination of various open source tools to -create a free and open workflow environment for cheminformatics: We have integrated our open source cheminformatics -library, the Chemistry Development Kit (CDK)~\cite{Steinbeck2004a,Steinbeck2003a} with Taverna~\cite{TomOinn2004}, a workflow environment with an extensible architecture, to produce CDK-Taverna, the first completely free~\cite{free-software-foundation} workflow solution for cheminformatics. 
CDK-Taverna makes use of additional open source components such as Bioclipse~\cite{Spjuth2007} and PGChem:Tigress \cite{PGChemWeb}. +In 2005~\cite{Chemblaics2005,CDKTaverna2005} we started to +integrate our open source cheminformatics +library, the Chemistry Development Kit (CDK)~\cite{Steinbeck2004a,Steinbeck2003a} with Taverna~\cite{TomOinn2004}, a workflow environment with an extensible architecture, to produce CDK-Taverna, the first completely free~\cite{free-software-foundation} workflow solution for cheminformatics. Here we present +the results of this endeavour. CDK-Taverna makes use of additional open source components such as Bioclipse~\cite{Spjuth2007} and PGChem:Tigress \cite{PGChemWeb}. -\section*{Methods} -The CDK-Taverna plug-in written in Java is published under the GNU Lesser -General Public License (LGPL). -Like Taverna itself the plug-in uses Maven 2~\cite{MavenWeb} as a build system. -\subsection*{Taverna's extension points} -Taverna allows the execution of workflows linking together heterogeneous -open services, applications or databases (remote or local, private or public, third-party or home-grown)\cite{Taylor2007}. -For the integration of these -different resource types Taverna provides various interfaces and -protocols for its extension. For example, it allows for easy use -of webservices through the WSDL~\cite{WSDLWeb} or SOAP~\cite{SOAPWeb} -protocol. -The CDK-Taverna plug-in, on the other side, uses a local extension of Taverna. For the -local extension Taverna provides a list of different Service Provider Interfaces (SPI), so -of which we implement in CDK-Taverna, which leads to an -integration of the CDK functionality as so called Local Workers running on the -same machine as the CDK-Taverna installation. The CDK-Taverna plug-in Version 0.5.1 uses the CDK revision 12084. -All workers in CDK-Taverna implement the -\texttt{CDKLocalWorker} interface. 
It is used for the detection of -workers by the \texttt{CDKScavenger} class which itself implements the Taverna SPI -\texttt{org.embl.ebi.escience.scuflui.workbench.Scavenger} interface. -Adding user interfaces for some of the workers requires an extension of the -\texttt{AbstractCDKProcessorAction} which again implements the Taverna SPI -\texttt{org.embl.ebi.escience.scuflui.spi.ProcessorActionSPI}. The use of this SPI -allows the addition of, for example, file chooser dialogs for workers like file -reader or writer. -\subsection*{Plug-in installation} -The CDK-Taverna project employs the plug-in detection manager of Taverna -for its installation. The detection manager requires a plugin description XML file containing a plugin name, -a version number, a target Taverna version number, a repository location and a Maven-like Java package description. After adding the plug-in -installation URL (http://www.cdk-taverna.de/plugin/) to the plug-in manager all -available plug-in versions will be graphically accessible to the user. -In order to install the CDK-Taverna plugin the user selects the desired version after which -all necessary Java libraries are installed on-the-fly from the given repository location. - -\subsection*{The anatomy of a CDK-Taverna worker} -To create a CDK-Taverna worker the Java class of this worker has to implement the CDKLocalWorker Interface. This interface defines that every worker has to define the following methods:\\ -public Map String, DataThing execute(Map String, DataThing inputMap) throws TaskExecutionException;\\ \remark{TK: I am not able to draw the left and right angles for the Java Map object! 
Can somebody help me please here!} -public String[] inputNames();\\ -public String[] inputTypes();\\ -public String[] outputNames();\\ -public String[] outputTypes();\\ -The method inputNames and outputNames return the Names of the ports of each worker whereas the inputTypes and outputTypes methods return the names for the Java object types with its package declaration e.g. \texttt{java/java.util.List} for a List. Within CDK-Taverna chemical structures are passed around using the Java object \texttt{java/org.openscience.cdk.applications.taverna.CMLChemfile}. - - - %%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Results and Discussion %% %% @@ -497,13 +451,13 @@ \begin{figure*} \centering - \includegraphics[angle=0,clip=false,scale=.5]{pics/AtomTypingWF.pdf}\\ \caption{This workflow shows the iterative loading of molecules from a database and searches for molecules with (for the CDK) unknown atom types} + \includegraphics[angle=0,clip=false,scale=.5]{pics/AtomTypingWF.pdf}\\ \caption{Workflow for iterative loading of molecules from a database and searches for molecules with atom types unknown to the Chemistry Development Kit.} \label{fig:AtomTypingWF} \end{figure*} \begin{figure*} \centering - \includegraphics[angle=0,clip=false,scale=.5]{pics/AtomTypingResults.pdf}\\ \caption{The diagram shows the allocation of the unknown atom types detected during the analysis of the ChEBI database.} + \includegraphics[angle=0,clip=false,scale=.5]{pics/AtomTypingResults.pdf}\\ \caption{Allocation of the unknown atom types detected during the analysis of the ChEBI database.} \label{fig:AtomTypingResults} \end{figure*} @@ -515,13 +469,13 @@ \begin{figure*} \centering - \includegraphics[angle=0,clip=false,scale=.5]{pics/reactionenumeration.pdf}\\ \caption{This reaction enumeration example contains two building blocks. For each building block, a list of three reactants is defined. 
This enumeration results in nine different products.} + \includegraphics[angle=0,clip=false,scale=.5]{pics/reactionenumeration.pdf}\\ \caption{Reaction enumeration example with two building blocks. For each building block, a list of three reactants is defined. This enumeration results in nine different products.} \label{fig:ReactionEnumerationSchema} \end{figure*} \begin{figure*} \centering - \includegraphics[angle=0,clip=flase,scale=.4]{pics/reactionEnumerationWF_2.pdf}\\ \caption{This workflow performs a reaction enumeration. Consequently it loads a generic reaction from a MDL rxn file and two reactant lists from MDL SD files. The products from the enumeration are stored as MDL molfiles. Besides these files a PDF document showing the 2D structure of the products is created. At the end Bioclipse will start up to inspect the results.~\cite{MyExperiementWFReactionEnum}} + \includegraphics[angle=0,clip=flase,scale=.4]{pics/reactionEnumerationWF_2.pdf}\\ \caption{Reaction enumeration loading a generic reaction from a MDL rxn file and two reactant lists from MDL SD files. The products from the enumeration are stored as MDL molfiles. Besides these files a PDF document showing the 2D structure of the products is created. 
At the end Bioclipse will start up to inspect the results.~\cite{MyExperiementWFReactionEnum}} \label{fig:ReactionEnumerationWorkflow} \end{figure*} @@ -542,26 +496,83 @@ \begin{figure*} \centering - \includegraphics[angle=0,clip=false,scale=.5]{pics/ATR2AClassificationWF.pdf}\\ \caption{This workflow loads a data vector from a database and performs an ART2A classification} + \includegraphics[angle=0,clip=false,scale=.5]{pics/ATR2AClassificationWF.pdf}\\ \caption{Workflow loading a data vector from a database and performs an ART2A classification} \label{fig:ART2AClassification} \end{figure*} \begin{figure*} \centering - \includegraphics[angle=0,clip=false,scale=.5]{pics/ART2ACluster.pdf}\\ \caption{This diagram shows the occupancies of different detected clusters for three databases. Target was the comparison of two proprietary Natural Product databases in contrast to the ChEBI database. } + \includegraphics[angle=0,clip=false,scale=.5]{pics/ART2ACluster.pdf}\\ \caption{Occupancies of different detected clusters for three databases with the +goal to compare two proprietary Natural Product databases with the ChEBI database. } \label{fig:ART2AClassificationResult} \end{figure*} + %%%%%%%%%%%%%%%%%%%%%% \section*{Conclusions} -The open-source solution which is provided by the CDK-Taverna plug-in allows an -efficient and powerful linking between different resources without any need of -deeper software-development knowledge. It provides the possibility to process -hundreds of thousands of molecules and is only limited by the available memory. The currently implemented workers allow the processing of chemical data in various formats, provides the possibility to calculate chemical properties and allows +With CDK-Taverna we have presented the first free and open cheminformatics workflow solution for the biosciences. +It allows to link and process data from various sources in visually accessible workflow diagrams without any deeper +programming experience. 
Processing of hundreds of thousands of molecules has been demonstrated and the upper boundary +is only limited by the amount of available memory. The currently implemented workers allow the processing of chemical data in +various formats, provides the possibility to calculate chemical properties and allows cluster analysis of molecular descriptor vectors. -The use of the PostgreSQL database with the Pgchem::tigres cheminformatics cartridge allows the processing of chemical databases with up to a million molecules. +The use of the PostgreSQL database with the Pgchem::tigres cheminformatics cartridge provides +access to chemical databases with up to a million molecules. +\section*{Methods} +The CDK-Taverna plug-in written in Java is published under the GNU Lesser +General Public License (LGPL). +Like Taverna itself the plug-in uses Maven 2~\cite{MavenWeb} as a build system. + +\subsection*{Taverna's extension points} +Taverna allows the execution of workflows linking together heterogeneous +open services, applications or databases (remote or local, private or public, third-party or home-grown)\cite{Taylor2007}. +For the integration of these +different resource types Taverna provides various interfaces and +protocols for its extension. For example, it allows for easy use +of webservices through the WSDL~\cite{WSDLWeb} or SOAP~\cite{SOAPWeb} +protocol. +The CDK-Taverna plug-in, on the other side, uses a local extension of Taverna. For the +local extension Taverna provides a list of different Service Provider Interfaces (SPI), so +of which we implement in CDK-Taverna, which leads to an +integration of the CDK functionality as so called Local Workers running on the +same machine as the CDK-Taverna installation. The CDK-Taverna plug-in Version 0.5.1 uses the CDK revision 12084. +All workers in CDK-Taverna implement the +\texttt{CDKLocalWorker} interface. 
It is used for the detection of +workers by the \texttt{CDKScavenger} class which itself implements the Taverna SPI +\texttt{org.embl.ebi.escience.scuflui.workbench.Scavenger} interface. +Adding user interfaces for some of the workers requires an extension of the +\texttt{AbstractCDKProcessorAction} which again implements the Taverna SPI +\texttt{org.embl.ebi.escience.scuflui.spi.ProcessorActionSPI}. The use of this SPI +allows the addition of, for example, file chooser dialogs for workers like file +reader or writer. + +\subsection*{Plug-in installation} +The CDK-Taverna project employs the plug-in detection manager of Taverna +for its installation. The detection manager requires a plugin description XML file containing a plugin name, +a version number, a target Taverna version number, a repository location and a Maven-like Java package description. After adding the plug-in +installation URL (http://www.cdk-taverna.de/plugin/) to the plug-in manager all +available plug-in versions will be graphically accessible to the user. + +In order to install the CDK-Taverna plugin the user selects the desired version after which +all necessary Java libraries are installed on-the-fly from the given repository location. + + +\subsection*{The anatomy of a CDK-Taverna worker} +To create a CDK-Taverna worker the Java class of this worker has to implement the CDKLocalWorker Interface. This interface defines that every worker has to define the following methods:\\ + +\begin{verbatim} +public Map <String, DataThing> execute(Map String, DataThing inputMap) throws TaskExecutionException; +public String[] inputNames(); +public String[] inputTypes(); +public String[] outputNames(); +public String[] outputTypes(); +\end{verbatim} + +The method inputNames and outputNames return the names of the ports of each worker whereas the inputTypes and outputTypes methods return the names for the Java object types with its package declaration e.g. \texttt{java/java.util.List} for a List. 
Within CDK-Taverna chemical structures are passed around using the Java object \texttt{java/org.openscience.cdk.applications.taverna.CMLChemfile} + + \section*{Availability and requirements} \begin{itemize} \item \textbf{Project name:} CDK-Taverna This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
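The worker anatomy described in revision 14780 above can be sketched in Java. This is a minimal illustration under stated assumptions, not the plug-in's actual code: `DataThing` and `TaskExecutionException` are simplified stand-ins for the Taverna classes of the same name, the parameter of `execute` is assumed to have the same `Map<String, DataThing>` type as its return value, and `PassThroughWorker` with its port names is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for Taverna's DataThing wrapper (assumption; the real class is richer).
final class DataThing {
    private final Object data;
    DataThing(Object data) { this.data = data; }
    Object getDataObject() { return data; }
}

// Simplified stand-in for Taverna's TaskExecutionException (assumption).
class TaskExecutionException extends Exception {
    TaskExecutionException(String message) { super(message); }
}

// The five methods listed in the diff; the parameter type of execute is
// assumed to mirror the return type.
interface CDKLocalWorker {
    Map<String, DataThing> execute(Map<String, DataThing> inputMap) throws TaskExecutionException;
    String[] inputNames();
    String[] inputTypes();
    String[] outputNames();
    String[] outputTypes();
}

// Hypothetical worker, not part of CDK-Taverna: hands its input on unchanged.
class PassThroughWorker implements CDKLocalWorker {
    // Type string for chemical structures as quoted in the manuscript text.
    static final String CML_TYPE = "java/org.openscience.cdk.applications.taverna.CMLChemfile";

    public String[] inputNames() { return new String[] {"inputStructures"}; }
    public String[] inputTypes() { return new String[] {CML_TYPE}; }
    public String[] outputNames() { return new String[] {"outputStructures"}; }
    public String[] outputTypes() { return new String[] {CML_TYPE}; }

    public Map<String, DataThing> execute(Map<String, DataThing> inputMap) throws TaskExecutionException {
        // Look up the value arriving on the single input port.
        DataThing input = inputMap.get(inputNames()[0]);
        if (input == null) {
            throw new TaskExecutionException("Missing input port: " + inputNames()[0]);
        }
        // Publish the same value on the single output port.
        Map<String, DataThing> outputMap = new HashMap<>();
        outputMap.put(outputNames()[0], input);
        return outputMap;
    }
}
```

The port names returned by `inputNames` and `outputNames` double as the keys of the input and output maps, which is why the sketch reads and writes the maps through those arrays rather than repeating string literals.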
From: <eg...@us...> - 2009-08-30 15:41:07
Revision: 14786 http://cdk.svn.sourceforge.net/cdk/?rev=14786&view=rev Author: egonw Date: 2009-08-30 15:40:57 +0000 (Sun, 30 Aug 2009) Log Message: ----------- Moved all figures to the end, as BMC wants them outside the PDF anyway. Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-30 15:30:19 UTC (rev 14785) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-30 15:40:57 UTC (rev 14786) @@ -354,26 +354,6 @@ A demonstration is given with workflow~\ref{fig:database_substructure_search}. The molecules containing the substructure are reported in tabular form in a PDF file. -\begin{figure*} - \centering - \includegraphics[angle=0,clip=false,scale=.4]{pics/substructureworkflow.pdf}\\ - \caption{Workflow performing a topological substructure search on - molecules from a MDL SD file~\cite{MyExperiementWFSubstructure}. - The input of this workflow is a SMILES string - which represents the substructure.} - \label{fig:substructureworkflow.ps} -\end{figure*} - -\begin{figure*} - \centering - \includegraphics[angle=90,clip=false,scale=.4]{pics/database_substructure_search.pdf}\\ - \caption{Workflow performing a substructure search in a database~\cite{MyExperiementWFSubstructureDB}. - The substructure is loaded from a MDL SD file. The output is a PDF - file containing a tabular view of the molecules containing the - substructure.} - \label{fig:database_substructure_search} -\end{figure*} - \subsubsection*{Descriptor Calculation Workflow} The Descriptor Calculation Workflow workflow, depicted in Figure~\ref{fig:QSARWorkflow}, starts with loading its molecules from a PostgreSQL database. 
The recognition of the atom types is the next step, followed by the addition @@ -384,34 +364,10 @@ Figure~\ref{fig:QSARWorkerUI}) for the calculation of a molecule's property vector based on the CDK. -\begin{figure*} - \centering - \includegraphics[angle=-90,clip=true,scale=.5]{pics/QSAR_Worker_UI_small.pdf}\\ - \caption{User interface to select QSAR descriptors calculated for each molecule during - the execution of the workflow.} - \label{fig:QSARWorkerUI} -\end{figure*} - The result of the workflow is comma separated value (CSV) text file which contains the ID of the molecule and the calculate property values. The property vectors may then be used for statistical analysis, clustering or machine learning purposes. With the PostgreSQL Database backend CDK-Taverna is able to calculate a large number of descriptors for many thousand of molecules in a reasonable time (see Figure~\ref{fig:TimeCalculateDescriptors}). -\begin{figure*} - \centering - \includegraphics[angle=0,clip=false,scale=.5]{pics/QSAR_Workflow.pdf}\\ - \caption{Workflow calculating various QSAR descriptors for molecules - from a PostgreSQL database. The results of the calculation is stored in a CSV file.} - \label{fig:QSARWorkflow} -\end{figure*} - -\begin{figure*} - \centering - \includegraphics[angle=0,clip=false,scale=.3]{pics/timeneededtocalculatemoleculardescriptors.pdf}\\ - \caption{Overview of the time needed to calculate different molecular descriptors for 1000 - molecules~\cite{CDKTavernaBlogWeb}.} - \label{fig:TimeCalculateDescriptors} -\end{figure*} - \subsubsection*{Iterative QSAR Workflow} The iterative QSAR workflow is a \emph{work-around} which allows the treatment of @@ -436,14 +392,6 @@ nested workflow as often as necessary. This dirty workaround might become obsolete in later versions of Taverna. 
-\begin{figure*} - \centering - \includegraphics[angle=0,clip=false,scale=.5]{pics/Iterative_QSAR_Workflow.pdf} - \caption{Workflow calculating different QSAR properties for molecules loaded from a PostgreSQL - database~\cite{MyExperiementWFQSAR}. The results will be stored in a CSV file.} - \label{fig:IterativeQSARWorkflow} -\end{figure*} - \subsubsection*{Validation of CDK Atom Types} The calculation of physiochemical properties in the Chemistry Development Kit (CDK) relies on the @@ -461,38 +409,14 @@ algorithms on, for example, coordination compounds containing metals, making atom type matching an import filter in cheminformatics workflows. -\begin{figure*} - \centering - \includegraphics[angle=0,clip=false,scale=.5]{pics/AtomTypingWF.pdf}\\ \caption{Workflow for iterative loading of molecules from a database and searches for molecules with atom types unknown to the Chemistry Development Kit.} - \label{fig:AtomTypingWF} -\end{figure*} -\begin{figure*} - \centering - \includegraphics[angle=0,clip=false,scale=.5]{pics/AtomTypingResults.pdf}\\ \caption{Allocation of the unknown atom types detected during the analysis of the ChEBI database.} - \label{fig:AtomTypingResults} -\end{figure*} - \subsubsection*{Reaction enumeration} Markush structures are generic chemical structures, which contains variable patterns such as \emph{Heterocyclic}, \emph{Alkyl} or \emph{R = Methyl, Isopropyl, Pentyl}. Such structures are commonly used within patents for describing whole compound classes and are named after Eugene A. Markush who described these kind of structures firstly in his US patent in the 1920s. In the process of reaction enumeration, Markush structures are used to design generic reactions. These reactions are usable for the enumeration of large chemical spaces, which includes the generation of chemical target libraries. The result of the enumeration is an important base for patents and for the High Throughput Screening (HTS) laboratories in drug development. 
Within HTS experiments, a couple of years ago the enumeration of large libraries was driven to about O(10.000-100.000) molecules in one run. These large and often unspecific HTS libraries did not prove to be successful in general. Today the enumeration is based on more targeted libraries with a reduced size of O(1.000) molecules. For a reaction enumeration, a given reaction contains different building blocks, which are needed for the enumeration. Each reactant of the reaction represents a building block. A scientist then selects a number of molecules for each reactant and the reaction enumeration creates a list of all possible products. The list of products then passes a virtual screening before at last a scientist decides which products will be synthesized. Results can be visualized and inspected at the end in Bioclipse~\cite{Spjuth2007}. CDK-Taverna contains workers which support an enumeration task based on a generic reaction (see Figure~\ref{fig:ReactionEnumerationWorkflow} and ~\ref{fig:ReactionEnumerationSchema}). -\begin{figure*} - \centering - \includegraphics[angle=0,clip=false,scale=.5]{pics/reactionenumeration.pdf}\\ \caption{Reaction enumeration example with two building blocks. For each building block, a list of three reactants is defined. This enumeration results in nine different products.} - \label{fig:ReactionEnumerationSchema} -\end{figure*} -\begin{figure*} - \centering - \includegraphics[angle=0,clip=flase,scale=.4]{pics/reactionEnumerationWF_2.pdf}\\ \caption{Reaction enumeration loading a generic reaction from a MDL RXN file and two reactant lists from MDL SD files. The products from the enumeration are stored as MDL molfiles. Besides these files a PDF document showing the 2D structure of the products is created. 
At the end Bioclipse will start up to allow visualization and - analysis of the - results~\cite{MyExperiementWFReactionEnum}.} - \label{fig:ReactionEnumerationWorkflow} -\end{figure*} - \subsubsection*{Classification Workflows} During the work on this project classification workflows are mainly used to perform an unsupervised clustering. For the majority of classifications an implementation of the ART 2-A algorithm~\cite{Carpenter1991} developed by Stephen Grossberg and Gail Carpenter is used. This algorithm was chosen because of its capability to automatically classify open-categorical problems. Compared to traditional clustering methods like the k-nearest-neighbour-clustering, the ART 2-A algorithm is computationally less demanding and therefore applicable especially to large data sets. Within a typical clustering workflow the Get\_QSAR\_vector\_from\_database worker loads molecule's data vectors (compare descriptor calculation workflow above) for a specific molecular SQL query from the database. @@ -508,20 +432,6 @@ \end{itemize} The implemented ART 2-A algorithm contains two possible convergence criteria. It converges if the classification does not change after one epoch or if the scalar product of the classes between two epochs is less than the required similarity. The clustering result is stored in form of a compressed XML document. This XML result document can be processed with different workers to create different visualizations depending on the aim of the clustering task. For chemical diversity analysis the ART 2-A worker was used for a successive top-down clustering of three chemical databases (two proprietary databases containing natural products and the ChEBI database~\cite{Degtyarenko2008} containing molecules of biological interest, see Figure ~\ref{fig:ART2AClassificationResult}). 
The occupancies of the different clusters show the similarity of the natural product databases in contrast to the ChEBI database which differs in chemical space occupation~\cite{KuhnDissertation}. This findings will be outlined in a subsequent publication. -\begin{figure*} - \centering - \includegraphics[angle=0,clip=false,scale=.5]{pics/ATR2AClassificationWF.pdf}\\ \caption{Workflow loading a data vector from a database and performs an ART2A classification.} - \label{fig:ART2AClassification} -\end{figure*} - -\begin{figure*} - \centering - \includegraphics[angle=0,clip=false,scale=.5]{pics/ART2ACluster.pdf}\\ \caption{Occupancies of different detected clusters for three databases with the -goal to compare two proprietary Natural Product databases with the ChEBI database. } - \label{fig:ART2AClassificationResult} -\end{figure*} - - %%%%%%%%%%%%%%%%%%%%%% \section*{Conclusions} With CDK-Taverna we have presented the first free and open cheminformatics workflow solution for the biosciences. @@ -656,9 +566,122 @@ %% %% Do not use \listoffigures as most will included as separate files +\clearpage \section*{Figures} +\begin{figure*}[bh!] + \centering + \includegraphics[angle=0,clip=false,scale=.4]{pics/substructureworkflow.pdf}\\ + \caption{Workflow performing a topological substructure search on + molecules from a MDL SD file~\cite{MyExperiementWFSubstructure}. + The input of this workflow is a SMILES string + which represents the substructure.} + \label{fig:substructureworkflow.ps} +\end{figure*} +\clearpage + +\begin{figure*} + \centering + \includegraphics[angle=90,clip=false,scale=.4]{pics/database_substructure_search.pdf}\\ + \caption{Workflow performing a substructure search in a database~\cite{MyExperiementWFSubstructureDB}. + The substructure is loaded from a MDL SD file. 
The output is a PDF + file containing a tabular view of the molecules containing the + substructure.} + \label{fig:database_substructure_search} +\end{figure*} + +\clearpage + +\begin{figure*} + \centering + \includegraphics[angle=-90,clip=true,scale=.5]{pics/QSAR_Worker_UI_small.pdf}\\ + \caption{User interface to select QSAR descriptors calculated for each molecule during + the execution of the workflow.} + \label{fig:QSARWorkerUI} +\end{figure*} + +\clearpage + +\begin{figure*} + \centering + \includegraphics[angle=0,clip=false,scale=.5]{pics/QSAR_Workflow.pdf}\\ + \caption{Workflow calculating various QSAR descriptors for molecules + from a PostgreSQL database. The results of the calculation is stored in a CSV file.} + \label{fig:QSARWorkflow} +\end{figure*} + +\clearpage + +\begin{figure*} + \centering + \includegraphics[angle=0,clip=false,scale=.3]{pics/timeneededtocalculatemoleculardescriptors.pdf}\\ + \caption{Overview of the time needed to calculate different molecular descriptors for 1000 + molecules~\cite{CDKTavernaBlogWeb}.} + \label{fig:TimeCalculateDescriptors} +\end{figure*} + +\clearpage + +\begin{figure*} + \centering + \includegraphics[angle=0,clip=false,scale=.5]{pics/Iterative_QSAR_Workflow.pdf} + \caption{Workflow calculating different QSAR properties for molecules loaded from a PostgreSQL + database~\cite{MyExperiementWFQSAR}. 
The results will be stored in a CSV file.} + \label{fig:IterativeQSARWorkflow} +\end{figure*} + +\clearpage + +\begin{figure*} + \centering + \includegraphics[angle=0,clip=false,scale=.5]{pics/AtomTypingWF.pdf}\\ \caption{Workflow for iterative loading of molecules from a database and searches for molecules with atom types unknown to the Chemistry Development Kit.} + \label{fig:AtomTypingWF} +\end{figure*} + +\clearpage + +\begin{figure*} + \centering + \includegraphics[angle=0,clip=false,scale=.5]{pics/AtomTypingResults.pdf}\\ \caption{Allocation of the unknown atom types detected during the analysis of the ChEBI database.} + \label{fig:AtomTypingResults} +\end{figure*} + +\clearpage + +\begin{figure*} + \centering + \includegraphics[angle=0,clip=false,scale=.5]{pics/reactionenumeration.pdf}\\ \caption{Reaction enumeration example with two building blocks. For each building block, a list of three reactants is defined. This enumeration results in nine different products.} + \label{fig:ReactionEnumerationSchema} +\end{figure*} + +\clearpage + +\begin{figure*} + \centering + \includegraphics[angle=0,clip=flase,scale=.4]{pics/reactionEnumerationWF_2.pdf}\\ \caption{Reaction enumeration loading a generic reaction from a MDL RXN file and two reactant lists from MDL SD files. The products from the enumeration are stored as MDL molfiles. Besides these files a PDF document showing the 2D structure of the products is created. 
At the end Bioclipse will start up to allow visualization and + analysis of the + results~\cite{MyExperiementWFReactionEnum}.} + \label{fig:ReactionEnumerationWorkflow} +\end{figure*} + +\clearpage + +\begin{figure*} + \centering + \includegraphics[angle=0,clip=false,scale=.5]{pics/ATR2AClassificationWF.pdf}\\ \caption{Workflow loading a data vector from a database and performs an ART2A classification.} + \label{fig:ART2AClassification} +\end{figure*} + +\clearpage + +\begin{figure*} + \centering + \includegraphics[angle=0,clip=false,scale=.5]{pics/ART2ACluster.pdf}\\ \caption{Occupancies of different detected clusters for three databases with the +goal to compare two proprietary Natural Product databases with the ChEBI database. } + \label{fig:ART2AClassificationResult} +\end{figure*} + %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% %% %% Tables %% @@ -667,7 +690,9 @@ %% Use of \listoftables is discouraged. %% +\clearpage \section*{Tables} + \subsection*{Table 1 - List of available Service Provider Interfaces} A packaged subset of extended classes or interface of this table can be used to create a plug-in for Taverna which provides @@ -692,4 +717,5 @@ \end{tabular} } \end{bmcformat} + \end{document}
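The reaction enumeration caption in revision 14786 above states that two building blocks with three reactants each yield nine products, that is 3 x 3 combinations. A minimal Java sketch of that combinatorial step follows. The class name, method name and string labels are hypothetical, and plain strings stand in for molecules; the CDK-Taverna workers apply a generic reaction to real structures, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;

class ReactionEnumerationSketch {
    // Pairs every reactant of the first building block with every reactant
    // of the second, so the result has blockA.size() * blockB.size() entries.
    static List<String> enumerate(List<String> blockA, List<String> blockB) {
        List<String> products = new ArrayList<>();
        for (String a : blockA) {
            for (String b : blockB) {
                // Placeholder label only; no chemistry is applied here.
                products.add(a + "+" + b);
            }
        }
        return products;
    }
}
```

Because the product count is multiplicative, adding a third building block with three reactants would already give 27 combinations, which is consistent with the manuscript's remark that enumerated libraries can grow large quickly.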
From: <eg...@us...> - 2009-08-30 15:46:44
Revision: 14787 http://cdk.svn.sourceforge.net/cdk/?rev=14787&view=rev Author: egonw Date: 2009-08-30 15:46:37 +0000 (Sun, 30 Aug 2009) Log Message: ----------- Applied a suggestion from Achim made some while ago, but that never made it into a change the text Modified Paths: -------------- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex =================================================================== --- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-30 15:40:57 UTC (rev 14786) +++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-30 15:46:37 UTC (rev 14787) @@ -423,12 +423,12 @@ This worker provides options to inspect the result vector which includes checks for values such as Not a Number (NaN) or Infinity. The options allow choosing the threshold for removing whole components of the vectors or single vectors and the removing of components whose minimum value equals its maximum value. After the loading and a first cleaning of the data vector the classification is performed using the ART2A\_Classificator worker. For the configuration of this worker different options are available: \begin{itemize} - \item Linear scaling of the input vector to values between 0 and 1. - \item A switch between \textit{deterministic random} and \textit{random random} for the selection of the vectors to process. - \item The definition of the convergence criteria of the classification. - \item The setting of the required similarity for the convergence criteria. - \item The setting of the maximum classification time. - \item A limit for the number of classifications and a range for the vigilance parameter. 
+ \item linear scaling of the input vector to values between 0 and 1, + \item a switch between \textit{deterministic random} and \textit{random random} for the selection of the vectors to process, + \item the definition of the convergence criteria of the classification, + \item the required similarity for the convergence criteria, + \item the maximum classification time, and + \item a limit for the number of classifications and a range for the vigilance parameter. \end{itemize} The implemented ART 2-A algorithm contains two possible convergence criteria. It converges if the classification does not change after one epoch or if the scalar product of the classes between two epochs is less than the required similarity. The clustering result is stored in form of a compressed XML document. This XML result document can be processed with different workers to create different visualizations depending on the aim of the clustering task. For chemical diversity analysis the ART 2-A worker was used for a successive top-down clustering of three chemical databases (two proprietary databases containing natural products and the ChEBI database~\cite{Degtyarenko2008} containing molecules of biological interest, see Figure ~\ref{fig:ART2AClassificationResult}). The occupancies of the different clusters show the similarity of the natural product databases in contrast to the ChEBI database which differs in chemical space occupation~\cite{KuhnDissertation}. This findings will be outlined in a subsequent publication.
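Two of the preprocessing options named in revision 14787 above, removing components whose minimum equals their maximum and linearly scaling the input vectors to values between 0 and 1, can be sketched in Java as follows. This is an illustration under assumptions, not the workers' implementation, which the diff does not show: the class and method names are hypothetical, scaling is assumed to be per component (min-max), the input is assumed to be rectangular, and all values are assumed finite (the NaN and Infinity checks mentioned in the text are left out).

```java
import java.util.Arrays;

class VectorCleaningSketch {
    // Drops constant components and min-max scales the remaining ones to [0, 1].
    // Rows are data vectors, columns are components (descriptors).
    static double[][] cleanAndScale(double[][] vectors) {
        if (vectors.length == 0) {
            return new double[0][0];
        }
        int dim = vectors[0].length;
        double[] min = new double[dim];
        double[] max = new double[dim];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        // First pass: per-component minimum and maximum.
        for (double[] v : vectors) {
            for (int j = 0; j < dim; j++) {
                min[j] = Math.min(min[j], v[j]);
                max[j] = Math.max(max[j], v[j]);
            }
        }
        // A component is kept only if it actually varies.
        int kept = 0;
        for (int j = 0; j < dim; j++) {
            if (max[j] > min[j]) {
                kept++;
            }
        }
        // Second pass: copy the kept components, scaled linearly to [0, 1].
        double[][] result = new double[vectors.length][kept];
        for (int i = 0; i < vectors.length; i++) {
            int k = 0;
            for (int j = 0; j < dim; j++) {
                if (max[j] > min[j]) {
                    result[i][k++] = (vectors[i][j] - min[j]) / (max[j] - min[j]);
                }
            }
        }
        return result;
    }
}
```

Dropping constant components before scaling also avoids a division by zero in the scaling step, since a component with equal minimum and maximum has a zero range.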