From: <ste...@us...> - 2009-08-28 10:22:22
Revision: 14780
          http://cdk.svn.sourceforge.net/cdk/?rev=14780&view=rev
Author:   steinbeck
Date:     2009-08-28 09:10:39 +0000 (Fri, 28 Aug 2009)

Log Message:
-----------
Rearranged sections, fixed figure legends and code display

Modified Paths:
--------------
    cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex

Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex
===================================================================
--- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-27 12:36:21 UTC (rev 14779)
+++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-08-28 09:10:39 UTC (rev 14780)
@@ -244,8 +244,7 @@
 These databases and tools will, for the first time, create opportunities for academia and third-world countries to perform state-of-the-art open drug discovery and translational research -- endeavors so far a domain of the pharmaceutical industry.
-This article describes the integration of various open source software packages started in 2005~\cite{Chemblaics2005,CDKTaverna2005}
-to form an open workflow engine for cheminformatics, where numerous recurring tasks can be automated, including
+Commonly used in this context are workflow engines for cheminformatics, where numerous recurring tasks can be automated, including
 tasks for
 \begin{itemize}
 \item chemical data filtering, transformation, curation and migration workflows
@@ -275,60 +274,15 @@
 An overview of workflow systems in life sciences was recently given by Tiwari \emph{et al.}~\cite{Tiwari2007}.
-Here we present the combination of various open source tools to
-create a free and open workflow environment for cheminformatics: We have integrated our open source cheminformatics
-library, the Chemistry Development Kit (CDK)~\cite{Steinbeck2004a,Steinbeck2003a} with Taverna~\cite{TomOinn2004}, a workflow environment with an extensible architecture, to produce CDK-Taverna, the first completely free~\cite{free-software-foundation} workflow solution for cheminformatics. CDK-Taverna makes use of additional open source components such as Bioclipse~\cite{Spjuth2007} and PGChem:Tigress \cite{PGChemWeb}.
+In 2005~\cite{Chemblaics2005,CDKTaverna2005} we started to
+integrate our open source cheminformatics
+library, the Chemistry Development Kit (CDK)~\cite{Steinbeck2004a,Steinbeck2003a} with Taverna~\cite{TomOinn2004}, a workflow environment with an extensible architecture, to produce CDK-Taverna, the first completely free~\cite{free-software-foundation} workflow solution for cheminformatics. Here we present
+the results of this endeavour. CDK-Taverna makes use of additional open source components such as Bioclipse~\cite{Spjuth2007} and PGChem:Tigress \cite{PGChemWeb}.
-\section*{Methods}
-The CDK-Taverna plug-in written in Java is published under the GNU Lesser
-General Public License (LGPL).
-Like Taverna itself the plug-in uses Maven 2~\cite{MavenWeb} as a build system.
-\subsection*{Taverna's extension points}
-Taverna allows the execution of workflows linking together heterogeneous
-open services, applications or databases (remote or local, private or public, third-party or home-grown)\cite{Taylor2007}.
-For the integration of these
-different resource types Taverna provides various interfaces and
-protocols for its extension. For example, it allows for easy use
-of webservices through the WSDL~\cite{WSDLWeb} or SOAP~\cite{SOAPWeb}
-protocol.
-The CDK-Taverna plug-in, on the other side, uses a local extension of Taverna. For the
-local extension Taverna provides a list of different Service Provider Interfaces (SPI), so
-of which we implement in CDK-Taverna, which leads to an
-integration of the CDK functionality as so called Local Workers running on the
-same machine as the CDK-Taverna installation. The CDK-Taverna plug-in Version 0.5.1 uses the CDK revision 12084.
-All workers in CDK-Taverna implement the
-\texttt{CDKLocalWorker} interface. It is used for the detection of
-workers by the \texttt{CDKScavenger} class which itself implements the Taverna SPI
-\texttt{org.embl.ebi.escience.scuflui.workbench.Scavenger} interface.
-Adding user interfaces for some of the workers requires an extension of the
-\texttt{AbstractCDKProcessorAction} which again implements the Taverna SPI
-\texttt{org.embl.ebi.escience.scuflui.spi.ProcessorActionSPI}. The use of this SPI
-allows the addition of, for example, file chooser dialogs for workers like file
-reader or writer.
-\subsection*{Plug-in installation}
-The CDK-Taverna project employs the plug-in detection manager of Taverna
-for its installation. The detection manager requires a plugin description XML file containing a plugin name,
-a version number, a target Taverna version number, a repository location and a Maven-like Java package description. After adding the plug-in
-installation URL (http://www.cdk-taverna.de/plugin/) to the plug-in manager all
-available plug-in versions will be graphically accessible to the user.
-In order to install the CDK-Taverna plugin the user selects the desired version after which
-all necessary Java libraries are installed on-the-fly from the given repository location.
-
-\subsection*{The anatomy of a CDK-Taverna worker}
-To create a CDK-Taverna worker the Java class of this worker has to implement the CDKLocalWorker Interface. This interface defines that every worker has to define the following methods:\\
-public Map String, DataThing execute(Map String, DataThing inputMap) throws TaskExecutionException;\\ \remark{TK: I am not able to draw the left and right angles for the Java Map object! Can somebody help me please here!}
-public String[] inputNames();\\
-public String[] inputTypes();\\
-public String[] outputNames();\\
-public String[] outputTypes();\\
-The method inputNames and outputNames return the Names of the ports of each worker whereas the inputTypes and outputTypes methods return the names for the Java object types with its package declaration e.g. \texttt{java/java.util.List} for a List. Within CDK-Taverna chemical structures are passed around using the Java object \texttt{java/org.openscience.cdk.applications.taverna.CMLChemfile}.
-
-
-
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %% Results and Discussion %%
 %%
@@ -497,13 +451,13 @@
 \begin{figure*}
 \centering
- \includegraphics[angle=0,clip=false,scale=.5]{pics/AtomTypingWF.pdf}\\ \caption{This workflow shows the iterative loading of molecules from a database and searches for molecules with (for the CDK) unknown atom types}
+ \includegraphics[angle=0,clip=false,scale=.5]{pics/AtomTypingWF.pdf}\\ \caption{Workflow for iterative loading of molecules from a database and searches for molecules with atom types unknown to the Chemistry Development Kit.}
 \label{fig:AtomTypingWF}
 \end{figure*}
 \begin{figure*}
 \centering
- \includegraphics[angle=0,clip=false,scale=.5]{pics/AtomTypingResults.pdf}\\ \caption{The diagram shows the allocation of the unknown atom types detected during the analysis of the ChEBI database.}
+ \includegraphics[angle=0,clip=false,scale=.5]{pics/AtomTypingResults.pdf}\\ \caption{Allocation of the unknown atom types detected during the analysis of the ChEBI database.}
 \label{fig:AtomTypingResults}
 \end{figure*}
@@ -515,13 +469,13 @@
 \begin{figure*}
 \centering
- \includegraphics[angle=0,clip=false,scale=.5]{pics/reactionenumeration.pdf}\\ \caption{This reaction enumeration example contains two building blocks. For each building block, a list of three reactants is defined. This enumeration results in nine different products.}
+ \includegraphics[angle=0,clip=false,scale=.5]{pics/reactionenumeration.pdf}\\ \caption{Reaction enumeration example with two building blocks. For each building block, a list of three reactants is defined. This enumeration results in nine different products.}
 \label{fig:ReactionEnumerationSchema}
 \end{figure*}
 \begin{figure*}
 \centering
- \includegraphics[angle=0,clip=flase,scale=.4]{pics/reactionEnumerationWF_2.pdf}\\ \caption{This workflow performs a reaction enumeration. Consequently it loads a generic reaction from a MDL rxn file and two reactant lists from MDL SD files. The products from the enumeration are stored as MDL molfiles. Besides these files a PDF document showing the 2D structure of the products is created. At the end Bioclipse will start up to inspect the results.~\cite{MyExperiementWFReactionEnum}}
+ \includegraphics[angle=0,clip=flase,scale=.4]{pics/reactionEnumerationWF_2.pdf}\\ \caption{Reaction enumeration loading a generic reaction from a MDL rxn file and two reactant lists from MDL SD files. The products from the enumeration are stored as MDL molfiles. Besides these files a PDF document showing the 2D structure of the products is created. At the end Bioclipse will start up to inspect the results.~\cite{MyExperiementWFReactionEnum}}
 \label{fig:ReactionEnumerationWorkflow}
 \end{figure*}
@@ -542,26 +496,83 @@
 \begin{figure*}
 \centering
- \includegraphics[angle=0,clip=false,scale=.5]{pics/ATR2AClassificationWF.pdf}\\ \caption{This workflow loads a data vector from a database and performs an ART2A classification}
+ \includegraphics[angle=0,clip=false,scale=.5]{pics/ATR2AClassificationWF.pdf}\\ \caption{Workflow loading a data vector from a database and performs an ART2A classification}
 \label{fig:ART2AClassification}
 \end{figure*}
 \begin{figure*}
 \centering
- \includegraphics[angle=0,clip=false,scale=.5]{pics/ART2ACluster.pdf}\\ \caption{This diagram shows the occupancies of different detected clusters for three databases. Target was the comparison of two proprietary Natural Product databases in contrast to the ChEBI database. }
+ \includegraphics[angle=0,clip=false,scale=.5]{pics/ART2ACluster.pdf}\\ \caption{Occupancies of different detected clusters for three databases with the
+goal to compare two proprietary Natural Product databases with the ChEBI database. }
 \label{fig:ART2AClassificationResult}
 \end{figure*}
+
 %%%%%%%%%%%%%%%%%%%%%%
 \section*{Conclusions}
-The open-source solution which is provided by the CDK-Taverna plug-in allows an
-efficient and powerful linking between different resources without any need of
-deeper software-development knowledge. It provides the possibility to process
-hundreds of thousands of molecules and is only limited by the available memory. The currently implemented workers allow the processing of chemical data in various formats, provides the possibility to calculate chemical properties and allows
+With CDK-Taverna we have presented the first free and open cheminformatics workflow solution for the biosciences.
+It allows to link and process data from various sources in visually accessible workflow diagrams without any deeper
+programming experience. Processing of hundreds of thousands of molecules has been demonstrated and the upper boundary
+is only limited by the amount of available memory. The currently implemented workers allow the processing of chemical data in
+various formats, provides the possibility to calculate chemical properties and allows
 cluster analysis of molecular descriptor vectors.
-The use of the PostgreSQL database with the Pgchem::tigres cheminformatics cartridge allows the processing of chemical databases with up to a million molecules.
+The use of the PostgreSQL database with the Pgchem::tigres cheminformatics cartridge provides
+access to chemical databases with up to a million molecules.
+\section*{Methods}
+The CDK-Taverna plug-in written in Java is published under the GNU Lesser
+General Public License (LGPL).
+Like Taverna itself the plug-in uses Maven 2~\cite{MavenWeb} as a build system.
+
+\subsection*{Taverna's extension points}
+Taverna allows the execution of workflows linking together heterogeneous
+open services, applications or databases (remote or local, private or public, third-party or home-grown)\cite{Taylor2007}.
+For the integration of these
+different resource types Taverna provides various interfaces and
+protocols for its extension. For example, it allows for easy use
+of webservices through the WSDL~\cite{WSDLWeb} or SOAP~\cite{SOAPWeb}
+protocol.
+The CDK-Taverna plug-in, on the other side, uses a local extension of Taverna. For the
+local extension Taverna provides a list of different Service Provider Interfaces (SPI), so
+of which we implement in CDK-Taverna, which leads to an
+integration of the CDK functionality as so called Local Workers running on the
+same machine as the CDK-Taverna installation. The CDK-Taverna plug-in Version 0.5.1 uses the CDK revision 12084.
+All workers in CDK-Taverna implement the
+\texttt{CDKLocalWorker} interface. It is used for the detection of
+workers by the \texttt{CDKScavenger} class which itself implements the Taverna SPI
+\texttt{org.embl.ebi.escience.scuflui.workbench.Scavenger} interface.
+Adding user interfaces for some of the workers requires an extension of the
+\texttt{AbstractCDKProcessorAction} which again implements the Taverna SPI
+\texttt{org.embl.ebi.escience.scuflui.spi.ProcessorActionSPI}. The use of this SPI
+allows the addition of, for example, file chooser dialogs for workers like file
+reader or writer.
+
+\subsection*{Plug-in installation}
+The CDK-Taverna project employs the plug-in detection manager of Taverna
+for its installation. The detection manager requires a plugin description XML file containing a plugin name,
+a version number, a target Taverna version number, a repository location and a Maven-like Java package description. After adding the plug-in
+installation URL (http://www.cdk-taverna.de/plugin/) to the plug-in manager all
+available plug-in versions will be graphically accessible to the user.
+
+In order to install the CDK-Taverna plugin the user selects the desired version after which
+all necessary Java libraries are installed on-the-fly from the given repository location.
+
+
+\subsection*{The anatomy of a CDK-Taverna worker}
+To create a CDK-Taverna worker the Java class of this worker has to implement the CDKLocalWorker Interface. This interface defines that every worker has to define the following methods:\\
+
+\begin{verbatim}
+public Map <String, DataThing> execute(Map String, DataThing inputMap) throws TaskExecutionException;
+public String[] inputNames();
+public String[] inputTypes();
+public String[] outputNames();
+public String[] outputTypes();
+\end{verbatim}
+
+The method inputNames and outputNames return the names of the ports of each worker whereas the inputTypes and outputTypes methods return the names for the Java object types with its package declaration e.g. \texttt{java/java.util.List} for a List. Within CDK-Taverna chemical structures are passed around using the Java object \texttt{java/org.openscience.cdk.applications.taverna.CMLChemfile}
+
+
 \section*{Availability and requirements}
 \begin{itemize}
 \item \textbf{Project name:} CDK-Taverna