From: <eg...@us...> - 2009-11-07 21:07:20
Revision: 15049
          http://cdk.svn.sourceforge.net/cdk/?rev=15049&view=rev
Author:   egonw
Date:     2009-11-07 21:07:11 +0000 (Sat, 07 Nov 2009)

Log Message:
-----------
Made the structure of the paper more clear by adding 'Scenario X:' headers to the six separate workflow scenarios

Modified Paths:
--------------
    cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex

Modified: cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex
===================================================================
--- cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-11-07 20:53:10 UTC (rev 15048)
+++ cdk-taverna-paper/trunk/cdk-taverna/bmc_article.tex 2009-11-07 21:07:11 UTC (rev 15049)
@@ -282,10 +282,9 @@
 (CDK)~\cite{Steinbeck2004a,Steinbeck2003a} with Taverna~\cite{TomOinn2004}, a workflow environment with an extensible architecture, to produce CDK-Taverna, the first completely free~\cite{free-software-foundation} workflow solution for
-cheminformatics. Here we present
-the results of this endeavour. CDK-Taverna makes use of additional open source
-components such as Bioclipse~\cite{Spjuth2007} and PGChem:Tigress
-\cite{PGChemWeb}.
+cheminformatics, which we present here. It makes additional use of other
+open source components such as Bioclipse~\cite{Spjuth2007} and
+PGChem:Tigress~\cite{PGChemWeb}.
 In the next section
@@ -304,7 +303,7 @@
 substructure filter, an aromaticity detector, an atom typer or a reaction enumerator. Some of the workers are outlined as part of example workflows described below (for a complete list see~\cite{KuhnDissertation}). In the
-following, we outline the application of CDK-Taverna in a few selected workflow
+following, we outline the application of CDK-Taverna in a six selected workflow
 scenarios.
 \begin{table}[htbp]
@@ -328,7 +327,7 @@
 \label{tab:cheminfoWorkers}
 \end{table}
-\subsection*{Iteration over large data sets}
+\subsubsection*{Iteration over large data sets}
 Cheminformatics by definition deals with the discovery of chemical knowledge from large data collections. These data sources are usually too large to be loaded into memory as a whole so one needs to loop over the data sets to process them
@@ -337,7 +336,7 @@
 to process large data sets.
 %see section iterative qsar workflow
-\subsection*{Database I/O}
+\subsubsection*{Database I/O}
 For database support the CDK-Taverna project uses the PostgreSQL~\cite{PostgreSQLWeb} relational database management system (RDBMS) with the open-source Pgchem::tigress~\cite{PGChemWeb} extension. This
@@ -347,7 +346,7 @@
 CDK-Taverna can use a local installation of the PostgreSQL Database with the Pgchem::tigress extension or can connect to a remote instance.
-\subsection*{Substructure Workflow}
+\subsection*{Scenario 1: Substructure Search}
 We may want to design workflows for performing substructure searches in different ways depending on the type of input. In a first example the substructure workflow performs a topological substructure search on a list of
@@ -372,7 +371,7 @@
 A demonstration is given with the workflow in Figure~\ref{fig:database_substructure_search}. The molecules containing the substructure are reported in tabular form in a PDF file.
-\subsubsection*{Descriptor Calculation Workflow}
+\subsection*{Scenario 2: Descriptor Calculation}
 The Descriptor Calculation Workflow workflow, depicted in Figure~\ref{fig:QSARWorkflow}, starts with loading its molecules from a PostgreSQL database. The recognition of the atom types is the next step,
@@ -392,7 +391,7 @@
 calculate a large number of descriptors for many thousand of molecules in a reasonable time (see Figure~\ref{fig:TimeCalculateDescriptors}).
-\subsubsection*{Iterative QSAR Workflow}
+\subsection*{Scenario 3: Iterative QSAR}
 The iterative QSAR workflow is a \emph{work-around} which allows the treatment of hundreds of thousands of molecules.
@@ -416,7 +415,7 @@
 nested workflow as often as necessary. This dirty workaround might become obsolete in later versions of Taverna.
-\subsubsection*{Validation of CDK Atom Types}
+\subsection*{Scenario 4: Validation of CDK Atom Types}
 The calculation of physiochemical properties in the Chemistry Development Kit (CDK) relies on the detection of atom types for all atoms in each molecule. The
@@ -446,7 +445,7 @@
 compounds containing metals, making atom type matching an import filter in cheminformatics workflows.
-\subsubsection*{Reaction enumeration}
+\subsection*{Scenario 5: Reaction Enumeration}
 Markush structures are generic chemical structures, which contains variable patterns such as \emph{Heterocyclic}, \emph{Alkyl} or \emph{R = Methyl, Isopropyl, Pentyl}. Such structures are commonly used within patents for
@@ -472,7 +471,7 @@
 Figures~\ref{fig:ReactionEnumerationWorkflow} and ~\ref{fig:ReactionEnumerationSchema}).
-\subsubsection*{Classification Workflows}
+\subsection*{Scenario 6: Classification Workflows}
 During the work on this project classification workflows are mainly used to perform an unsupervised clustering. For the majority of classifications an implementation of the ART 2-A algorithm~\cite{Carpenter1991} developed by