From: <eg...@us...> - 2009-11-07 20:53:18
Revision: 15048
          http://cdk.svn.sourceforge.net/cdk/?rev=15048&view=rev
Author:   egonw
Date:     2009-11-07 20:53:10 +0000 (Sat, 07 Nov 2009)

Log Message:
-----------
Reformating only; no textual changes

Modified Paths:
--------------
    cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex

Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex
===================================================================
--- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-11-07 20:44:57 UTC (rev 15047)
+++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-11-07 20:53:10 UTC (rev 15048)
@@ -278,8 +278,14 @@
 In 2005~\cite{Chemblaics2005,CDKTaverna2005} we started to integrate our open source cheminformatics
-library, the Chemistry Development Kit (CDK)~\cite{Steinbeck2004a,Steinbeck2003a} with Taverna~\cite{TomOinn2004}, a workflow environment with an extensible architecture, to produce CDK-Taverna, the first completely free~\cite{free-software-foundation} workflow solution for cheminformatics. Here we present
-the results of this endeavour. CDK-Taverna makes use of additional open source components such as Bioclipse~\cite{Spjuth2007} and PGChem:Tigress \cite{PGChemWeb}.
+library, the Chemistry Development Kit
+(CDK)~\cite{Steinbeck2004a,Steinbeck2003a} with Taverna~\cite{TomOinn2004}, a
+workflow environment with an extensible architecture, to produce CDK-Taverna,
+the first completely free~\cite{free-software-foundation} workflow solution for
+cheminformatics. Here we present
+the results of this endeavour. CDK-Taverna makes use of additional open source
+components such as Bioclipse~\cite{Spjuth2007} and PGChem:Tigress
+\cite{PGChemWeb}.
@@ -290,14 +296,16 @@
 %%
 \section*{Results}
-The CDK-Taverna plug-in currently provides 164 different cheminformatics workers. The
-fields of application of these workers are described in the Table~\ref{tab:cheminfoWorkers}.
-This includes workers for input and output (I/O) of various chemical files
-and line notations formats, databases, and descriptors for atoms, bonds and
-molecules. The miscellaneous workers are e.g. a substructure filter, an
-aromaticity detector, an atom typer or a reaction enumerator.
-Some of the workers are outlined as part of example workflows described below (for a complete list see \cite{KuhnDissertation}).
-In the following, we outline the application of CDK-Taverna in a few selected workflow scenarios.
+The CDK-Taverna plug-in currently provides 164 different cheminformatics
+workers. The fields of application of these workers are described in the
+Table~\ref{tab:cheminfoWorkers}. This includes workers for input and output
+(I/O) of various chemical files and line notations formats, databases, and
+descriptors for atoms, bonds and molecules. The miscellaneous workers are e.g. a
+substructure filter, an aromaticity detector, an atom typer or a reaction
+enumerator. Some of the workers are outlined as part of example workflows
+described below (for a complete list see~\cite{KuhnDissertation}). In the
+following, we outline the application of CDK-Taverna in a few selected workflow
+scenarios.
 \begin{table}[htbp]
 \caption{The worker allocation of CDK-Taverna by function.}
@@ -330,22 +338,28 @@
 %see section iterative qsar workflow
 \subsection*{Database I/O}
-For database support the CDK-Taverna project uses the PostgreSQL~\cite{PostgreSQLWeb}
-relational database management system (RDBMS) with the open-source Pgchem::tigress~\cite{PGChemWeb} extension. This
-combination allows storage and fast retrieval of up to a million molecules with a reasonable performance.
-The Pgchem::tigress extension uses an implementation of the Generalized Search Tree (GiST)~\cite{GISTWeb} of the PostgreSQL database. CDK-Taverna can use a local installation of the PostgreSQL Database with the Pgchem::tigress extension or can connect to a remote instance.
+For database support the CDK-Taverna project uses the
+PostgreSQL~\cite{PostgreSQLWeb} relational database management system (RDBMS)
+with the open-source Pgchem::tigress~\cite{PGChemWeb} extension. This
+combination allows storage and fast retrieval of up to a million molecules with
+a reasonable performance. The Pgchem::tigress extension uses an implementation
+of the Generalized Search Tree (GiST)~\cite{GISTWeb} of the PostgreSQL database.
+CDK-Taverna can use a local installation of the PostgreSQL Database with the
+Pgchem::tigress extension or can connect to a remote instance.
 \subsection*{Substructure Workflow}
-We may want to design workflows for performing substructure searches in different ways
-depending on the type of input. In a first example the substructure workflow
-performs a topological substructure search on a list of given molecules and a
-given molecular substructure (see Figure~\ref{fig:substructureworkflow.ps}).
-The workflow inputs are a molecular substructure as a SMILES string
-\cite{WeiningerDavid1988} and a list of structures stored in a MDL SD file~\cite{DalbyArthur1992}. The structures which match the
-substructure are stored as MDL Molfiles~\cite{DalbyArthur1992}. The non-matching structures are
-converted into the Chemical Markup Language (CML)~\cite{Murray-RustP1999,Willighagen2001,KuhnS2007}.
-This small example workflow already combines four different molecular structure
-representations and the use of a topological substructure filter.
+We may want to design workflows for performing substructure searches in
+different ways depending on the type of input. In a first example the
+substructure workflow performs a topological substructure search on a list of
+given molecules and a given molecular substructure (see
+Figure~\ref{fig:substructureworkflow.ps}). The workflow inputs are a molecular
+substructure as a SMILES string \cite{WeiningerDavid1988} and a list of
+structures stored in a MDL SD file~\cite{DalbyArthur1992}.
The structures which
+match the substructure are stored as MDL Molfiles~\cite{DalbyArthur1992}. The
+non-matching structures are converted into the Chemical Markup Language
+(CML)~\cite{Murray-RustP1999,Willighagen2001,KuhnS2007}. This small example
+workflow already combines four different molecular structure representations and
+the use of a topological substructure filter.
 A related workflow performs a substructure search directly on a database: It uses functionality provided by the Pgchem extension
@@ -359,18 +373,24 @@
 The molecules containing the substructure are reported in tabular form in a PDF file.
 \subsubsection*{Descriptor Calculation Workflow}
-The Descriptor Calculation Workflow workflow, depicted in Figure~\ref{fig:QSARWorkflow}, starts with loading its
-molecules from a PostgreSQL database. The recognition of the atom types is the next step, followed by the addition
-of implicit hydrogens for each molecule
-as well as the detection of H\"uckel aromaticity.
-Each molecule is tagged for the descriptor calculation process. The tagging of the molecules is used to add an universal identifier to each molecule. This allows the identification of the corresponding QSAR values within the QSAR values table. The QSAR worker
-provides a user interface based selection of multiple QSAR descriptors (see
-Figure~\ref{fig:QSARWorkerUI}) for the calculation of a molecule's property vector
-based on the CDK.
+The Descriptor Calculation Workflow workflow, depicted in
+Figure~\ref{fig:QSARWorkflow}, starts with loading its molecules from a
+PostgreSQL database. The recognition of the atom types is the next step,
+followed by the addition of implicit hydrogens for each molecule as well as the
+detection of H\"uckel aromaticity. Each molecule is tagged for the descriptor
+calculation process. The tagging of the molecules is used to add an universal
+identifier to each molecule. This allows the identification of the corresponding
+QSAR values within the QSAR values table.
The QSAR worker provides a user
+interface based selection of multiple QSAR descriptors (see
+Figure~\ref{fig:QSARWorkerUI}) for the calculation of a molecule's property
+vector based on the CDK.
-The result of the workflow is comma separated value (CSV) text file
-which contains the ID of the molecule and the calculate property values. The
-property vectors may then be used for statistical analysis, clustering or machine learning purposes. With the PostgreSQL Database backend CDK-Taverna is able to calculate a large number of descriptors for many thousand of molecules in a reasonable time (see Figure~\ref{fig:TimeCalculateDescriptors}).
+The result of the workflow is comma separated value (CSV) text file which
+contains the ID of the molecule and the calculate property values. The property
+vectors may then be used for statistical analysis, clustering or machine
+learning purposes. With the PostgreSQL Database backend CDK-Taverna is able to
+calculate a large number of descriptors for many thousand of molecules in a
+reasonable time (see Figure~\ref{fig:TimeCalculateDescriptors}).
 \subsubsection*{Iterative QSAR Workflow}
@@ -398,33 +418,77 @@
 \subsubsection*{Validation of CDK Atom Types}
-The calculation of physiochemical properties in the Chemistry Development Kit (CDK) relies on the
-detection of atom types for all atoms in each molecule. The atom types describe basic atomic
-properties needed by the various cheminformatics algorithms implemented in the CDK. If no
-atom type is recognized for an atom, the atom is flagged as \emph{unknown}. Based on
-the CDK's atom type perception functionality, we devised
-an example workflow (see Figure~\ref{fig:AtomTypingWF}) for the validation of the CDK atom typing procedures vs processed data. The detection of an unknown atom type by the CDK indicates that either the CDK lacks this specific atom type or the molecule contains chemically nonsensical atom types.
In Figure~\ref{fig:AtomTypingWF} the \texttt{Perceive\_atom\_types} worker performs an atom type detection, followed by the retrieval of the database ID for those molecules with
-unknown atom type by the \texttt{Extract\_the\_databaseID\_from\_the\_molecule} worker. The workflow creates two text files, one containing the identifier of all molecules with unknown atom types, created by the \texttt{Iterative\_File\_Writer}, and a second one containing information about which atom of which molecule is unknown to the CDK. An analysis~\cite{KuhnDissertation} of the atom type detection was performed on three different databases, two proprietary natural products databases and the open access database of Chemical Entities of Biological Interest (ChEBI) \cite{ChEBIWeb, Degtyarenko2008}, maintained
-at the European Bioinformatics Institute (EBI).
-The workflow was run with more than 600 thousand molecules and showed that the CDK algorithms matches the atom types
-quite well, but that the atom type list is not complete for metals and other heavy atoms. Incomplete
-atom type lists is not unique to the CDK and leads problems with the application of cheminformatics
-algorithms on, for example, coordination compounds containing metals, making atom type matching an import
-filter in cheminformatics workflows.
+The calculation of physiochemical properties in the Chemistry Development Kit
+(CDK) relies on the detection of atom types for all atoms in each molecule. The
+atom types describe basic atomic properties needed by the various
+cheminformatics algorithms implemented in the CDK. If no atom type is recognized
+for an atom, the atom is flagged as \emph{unknown}. Based on the CDK's atom type
+perception functionality, we devised an example workflow (see
+Figure~\ref{fig:AtomTypingWF}) for the validation of the CDK atom typing
+procedures vs processed data.
The detection of an unknown atom type by the CDK
+indicates that either the CDK lacks this specific atom type or the molecule
+contains chemically nonsensical atom types. In Figure~\ref{fig:AtomTypingWF} the
+\texttt{Perceive\_atom\_types} worker performs an atom type detection, followed
+by the retrieval of the database ID for those molecules with unknown atom type
+by the \texttt{Extract\_the\_databaseID\_from\_the\_molecule} worker. The
+workflow creates two text files, one containing the identifier of all molecules
+with unknown atom types, created by the \texttt{Iterative\_File\_Writer}, and a
+second one containing information about which atom of which molecule is unknown
+to the CDK. An analysis~\cite{KuhnDissertation} of the atom type detection was
+performed on three different databases, two proprietary natural products
+databases and the open access database of Chemical Entities of Biological
+Interest (ChEBI) \cite{ChEBIWeb, Degtyarenko2008}, maintained at the European
+Bioinformatics Institute (EBI). The workflow was run with more than 600 thousand
+molecules and showed that the CDK algorithms matches the atom types quite well,
+but that the atom type list is not complete for metals and other heavy atoms.
+Incomplete atom type lists is not unique to the CDK and leads problems with the
+application of cheminformatics algorithms on, for example, coordination
+compounds containing metals, making atom type matching an import filter in
+cheminformatics workflows.
-
 \subsubsection*{Reaction enumeration}
-Markush structures are generic chemical structures, which contains variable patterns such as \emph{Heterocyclic}, \emph{Alkyl} or \emph{R = Methyl, Isopropyl, Pentyl}.
-Such structures are commonly used within patents for describing whole compound classes and are named after Eugene A. Markush who described these kind of structures firstly in his US patent in the 1920s.
In the process of reaction enumeration, Markush structures are used to design generic reactions. These reactions are usable for the enumeration of large chemical spaces, which includes the generation of chemical target libraries. The result of the enumeration is an important base for patents and for the High Throughput Screening (HTS) laboratories in drug development. Within HTS experiments, a couple of years ago the enumeration of large libraries was driven to about O(10.000-100.000) molecules in one run. These large and often unspecific HTS libraries did not prove to be successful in general. Today the enumeration is based on more targeted libraries with a reduced size of O(1.000) molecules.
-For a reaction enumeration, a given reaction contains different building blocks, which are needed for the enumeration. Each reactant of the reaction represents a building block. A scientist then selects a number of molecules for each reactant and the reaction enumeration creates a list of all possible products. The list of products then passes a virtual screening before at last a scientist decides which products will be synthesized.
-Results can be visualized and inspected at the end in Bioclipse~\cite{Spjuth2007}. CDK-Taverna contains workers which support an enumeration task based on a generic reaction (see Figures~\ref{fig:ReactionEnumerationWorkflow} and ~\ref{fig:ReactionEnumerationSchema}).
+Markush structures are generic chemical structures, which contains variable
+patterns such as \emph{Heterocyclic}, \emph{Alkyl} or \emph{R = Methyl,
+Isopropyl, Pentyl}. Such structures are commonly used within patents for
+describing whole compound classes and are named after Eugene A. Markush who
+described these kind of structures firstly in his US patent in the 1920s. In the
+process of reaction enumeration, Markush structures are used to design generic
+reactions.
These reactions are usable for the enumeration of large chemical
+spaces, which includes the generation of chemical target libraries. The result
+of the enumeration is an important base for patents and for the High Throughput
+Screening (HTS) laboratories in drug development. Within HTS experiments, a
+couple of years ago the enumeration of large libraries was driven to about
+O(10.000-100.000) molecules in one run. These large and often unspecific HTS
+libraries did not prove to be successful in general. Today the enumeration is
+based on more targeted libraries with a reduced size of O(1.000) molecules. For
+a reaction enumeration, a given reaction contains different building blocks,
+which are needed for the enumeration. Each reactant of the reaction represents a
+building block. A scientist then selects a number of molecules for each reactant
+and the reaction enumeration creates a list of all possible products. The list
+of products then passes a virtual screening before at last a scientist decides
+which products will be synthesized. Results can be visualized and inspected at
+the end in Bioclipse~\cite{Spjuth2007}. CDK-Taverna contains workers which
+support an enumeration task based on a generic reaction (see
+Figures~\ref{fig:ReactionEnumerationWorkflow} and
+~\ref{fig:ReactionEnumerationSchema}).
-
 \subsubsection*{Classification Workflows}
-During the work on this project classification workflows are mainly used to perform an unsupervised clustering. For the majority of classifications an implementation of the ART 2-A algorithm~\cite{Carpenter1991} developed by Stephen Grossberg and Gail Carpenter is used. This algorithm was chosen because of its capability to automatically classify open-categorical problems. Compared to traditional clustering methods like the k-nearest-neighbour-clustering, the ART 2-A algorithm is computationally less demanding and therefore applicable especially to large data sets.
-Within a typical clustering workflow the \texttt{Get\_QSAR\_vector\_from\_database} worker loads molecule's data vectors (compare descriptor calculation workflow above) for a specific molecular SQL query from the database.
-This worker provides options to inspect the result vector which includes checks for values such as Not a Number (NaN) or Infinity. The options allow choosing the threshold for removing whole components of the vectors or single vectors and the removing of components whose minimum value equals its maximum value.
-After the loading and a first cleaning of the data vector the classification is performed using the \texttt{ART2A\_Classificator}
+During the work on this project classification workflows are mainly used to
+perform an unsupervised clustering. For the majority of classifications an
+implementation of the ART 2-A algorithm~\cite{Carpenter1991} developed by
+Stephen Grossberg and Gail Carpenter is used. This algorithm was chosen because
+of its capability to automatically classify open-categorical problems. Compared
+to traditional clustering methods like the k-nearest-neighbour-clustering, the
+ART 2-A algorithm is computationally less demanding and therefore applicable
+especially to large data sets. Within a typical clustering workflow the
+\texttt{Get\_QSAR\_vector\_from\_database} worker loads molecule's data vectors
+(compare descriptor calculation workflow above) for a specific molecular SQL
+query from the database. This worker provides options to inspect the result
+vector which includes checks for values such as Not a Number (NaN) or Infinity.
+The options allow choosing the threshold for removing whole components of the
+vectors or single vectors and the removing of components whose minimum value
+equals its maximum value. After the loading and a first cleaning of the data
+vector the classification is performed using the \texttt{ART2A\_Classificator}
 worker, as depicted in Figure~\ref{fig:ART2AClassification}.
For the configuration of this worker different options are available:
 \begin{itemize}
@@ -435,8 +499,22 @@
 \item the maximum classification time, and
 \item a limit for the number of classifications and a range for the vigilance parameter.
 \end{itemize}
-The implemented ART 2-A algorithm contains two possible convergence criteria. It converges if the classification does not change after one epoch or if the scalar product of the classes between two epochs is less than the required similarity. The clustering result is stored in form of a compressed XML document. This XML result document can be processed with different workers to create different visualizations depending on the aim of the clustering task. For chemical diversity analysis the ART 2-A worker was used for a successive top-down clustering of three chemical databases (two proprietary databases containing natural products and the ChEBI database~\cite{Degtyarenko2008} containing molecules of biological interest, see Figure ~\ref{fig:ART2AClassificationResult}). The occupancies of the different clusters show the similarity of the natural product databases in contrast to the ChEBI database which differs in chemical space occupation~\cite{KuhnDissertation}. This findings will be outlined in a subsequent publication.
+The implemented ART 2-A algorithm contains two possible convergence criteria. It
+converges if the classification does not change after one epoch or if the scalar
+product of the classes between two epochs is less than the required similarity.
+The clustering result is stored in form of a compressed XML document. This XML
+result document can be processed with different workers to create different
+visualizations depending on the aim of the clustering task.
For chemical
+diversity analysis the ART 2-A worker was used for a successive top-down
+clustering of three chemical databases (two proprietary databases containing
+natural products and the ChEBI database~\cite{Degtyarenko2008} containing
+molecules of biological interest, see
+Figure~\ref{fig:ART2AClassificationResult}). The occupancies of the different clusters
+show the similarity of the natural product databases in contrast to the ChEBI
+database which differs in chemical space occupation~\cite{KuhnDissertation}.
+This findings will be outlined in a subsequent publication.
+
 %%%%%%%%%%%%%%%%%%%%%%
 \section*{Conclusions}
 With CDK-Taverna we have presented the first free and open cheminformatics workflow solution for the biosciences.
@@ -489,7 +567,9 @@
 \subsection*{The anatomy of a CDK-Taverna worker}
-To create a CDK-Taverna worker the Java class of this worker has to implement the CDKLocalWorker Interface. This interface defines that every worker has to define the following methods:\\
+To create a CDK-Taverna worker the Java class of this worker has to implement
+the CDKLocalWorker Interface. This interface defines that every worker has to
+define the following methods:\\
 \begin{verbatim}
 public Map <String, DataThing> execute(Map String, DataThing inputMap)
@@ -500,7 +580,11 @@
 public String[] outputTypes();
 \end{verbatim}
-The method inputNames and outputNames return the names of the ports of each worker whereas the inputTypes and outputTypes methods return the names for the Java object types with its package declaration e.g. \texttt{java/java.util.List} for a List. Within CDK-Taverna chemical structures are passed around using the Java object \texttt{java/org.openscience.cdk.applications.taverna.CMLChemfile}
+The method inputNames and outputNames return the names of the ports of each
+worker whereas the inputTypes and outputTypes methods return the names for the
+Java object types with its package declaration e.g. \texttt{java/java.util.List}
+for a List.
Within CDK-Taverna chemical structures are passed around using the
+Java object \texttt{java/org.openscience.cdk.applications.taverna.CMLChemfile}
 \section*{Availability and requirements}

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.