From: <and...@us...> - 2011-04-13 10:56:37
Revision: 15594 http://cdk.svn.sourceforge.net/cdk/?rev=15594&view=rev
Author: andreas1981
Date: 2011-04-13 10:56:27 +0000 (Wed, 13 Apr 2011)

Log Message:
-----------
- Continued working on the paper

Modified Paths:
--------------
    cdk-taverna-2-paper/bmc_article.tex
    cdk-taverna-2-paper/pics/looattrevaluation.png

Modified: cdk-taverna-2-paper/bmc_article.tex
===================================================================
--- cdk-taverna-2-paper/bmc_article.tex	2011-04-13 07:27:29 UTC (rev 15593)
+++ cdk-taverna-2-paper/bmc_article.tex	2011-04-13 10:56:27 UTC (rev 15594)
@@ -308,6 +308,7 @@
 % TODO Andreas: Please elaborate on which loop structures Taverna now supports. Also mention the new GUI functionality.
 Problems with iterations over large data sets\cite{Kuhn} are completely resolved in version 2.0 due to new implementations in Taverna. The new version defines control constructs such as while-loop and if-then-else. To simulate control structures like a while-loop in Taverna 1.x, it was necessary to define a sub-workflow which fails in execution until a certain state is reached. The failure of the sub-workflow thus causes the whole workflow to be executed again for a predefined number of intervals. Taverna version 2.x defines new loop processors which allow loops to be set up more easily. The termination criterion can now be evaluated by listening to a state port.\cite{Missier}
+Not only improvements to the engine have been made: the user interface of the Taverna 2.0 workbench was also improved compared to its predecessor. Manipulating and designing workflows directly in a graphical workflow editor improves usability significantly. Features like copy/paste and undo/redo appreciably simplify workflow creation and maintenance.
 % TODO Andreas: Please go wild with pictures here.
 Version 1.0 provided only basic functionality for combinatorial chemistry related reaction enumeration.
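The loop pattern described above, re-running a sub-workflow until a termination criterion read from a state port is satisfied, can be sketched as follows. This is an illustrative Python sketch, not Taverna or CDK-Taverna code; all names (`run_subworkflow`, `loop_processor`, the `processed` counter) are hypothetical.

```python
# Illustrative sketch (not Taverna code) of a loop processor: a sub-workflow
# is re-invoked until a termination criterion, evaluated on a "state port",
# is satisfied, with a bounded number of retries as in the Taverna 1.x workaround.

def run_subworkflow(state):
    # Hypothetical sub-workflow step: advances a counter toward a target.
    state = dict(state)
    state["processed"] += 10
    return state

def loop_processor(initial_state, finished, max_iterations=100):
    """Re-invoke the sub-workflow until finished(state) is true."""
    state = initial_state
    for _ in range(max_iterations):
        if finished(state):  # termination criterion evaluated on the state
            return state
        state = run_subworkflow(state)
    raise RuntimeError("loop did not terminate within max_iterations")

result = loop_processor({"processed": 0}, lambda s: s["processed"] >= 50)
```

The termination test runs before each invocation, so the loop stops as soon as the state reaches the target.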
 CDK-Taverna 2.0 offers major enhancements which considerably widen the industrial applicability, like multi-match detection, any number of reactants, products and generic groups as well as variable R-groups, ring sizes
@@ -322,12 +323,26 @@
 \item \texttt{Simple KMeans}
 \item \texttt{XMeans}
 \end{itemize}
-Furthermore the regression machine learning workers are implemented. These workers enable the analysis of attributes, the segmentation of datasets into training and test sets, model generation, prediction and result visualisation.
+Furthermore the regression machine learning workers are implemented. These workers enable the analysis of attributes, the segmentation of datasets into training and test sets, model generation, prediction and result visualisation. Four algorithms are currently implemented for regression purposes:
+\begin{itemize}
+\item \texttt{Multiple Linear Regression}
+\item \texttt{Three-Layer Perceptron-Type Neural Networks}
+\item \texttt{Support Vector Machines}
+\item \texttt{M5P regression trees}
+\end{itemize}
-For attribute analysis two workers are available: \texttt{GA Attribute Selection} and \texttt{Leave-one-out Attribute Selection}. The former worker uses a genetic algorithm implementation to find an optimal combination of attributes. To this end, each individual has a different subset of attributes enabled. The evaluation is performed either with the complete dataset or by using an n-fold cross-validation. The fitness function used is $s(RMSE)=\left(\frac{1}{RMSE}\right)^{3}$ where RMSE is the root mean squared error of the evaluated subset. A mutation consists of switching attributes on or off, and cross-over is performed by interchanging the states of regions of attributes. The second worker resembles a `Leave-one-out' method: attributes are consecutively left out and the worst-performing attribute is discarded. This step is repeated until only a single attribute survives.
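The genetic-algorithm operators and the leave-one-out elimination described above can be sketched as follows. This is an illustrative Python sketch (the plugin itself is implemented in Java); the `evaluate` callback and all data are hypothetical stand-ins for a real cross-validated RMSE computation, and the fitness exponent is kept as a parameter.

```python
import random

# Illustrative sketch (not the plugin's Java implementation) of the two
# attribute-selection strategies: GA operators and leave-one-out elimination.

def fitness(rmse, exponent=2):
    """s(RMSE) = (1/RMSE)^exponent: lower RMSE gives higher fitness."""
    return (1.0 / rmse) ** exponent

def mutate(mask, rate=0.05, rng=random):
    """Mutation: switch individual attributes on or off."""
    return [(not bit) if rng.random() < rate else bit for bit in mask]

def crossover(mask_a, mask_b, rng=random):
    """Cross-over: interchange the states of a region of attributes."""
    point = rng.randrange(1, len(mask_a))
    return mask_a[:point] + mask_b[point:], mask_b[:point] + mask_a[point:]

def leave_one_out_ranking(attributes, evaluate):
    """Consecutively leave each attribute out, discard the one whose removal
    hurts the model least (lowest RMSE without it), and repeat until a single
    attribute survives. Returns attributes in the order they were discarded
    (least important first)."""
    remaining, discarded = list(attributes), []
    while len(remaining) > 1:
        worst = min(remaining,
                    key=lambda a: evaluate([x for x in remaining if x != a]))
        remaining.remove(worst)
        discarded.append(worst)
    return discarded + remaining
```

The discard order directly yields the importance ranking plotted in the leave-one-out analysis.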
The result should give an impression of how important the different attributes are. Figure \ref{fig:LeaveOneOutResults} shows the results from a `Leave-one-out' analysis.
+For attribute analysis two workers are available: \texttt{GA Attribute Selection} and \texttt{Leave-one-out Attribute Selection}. The former worker uses a genetic algorithm implementation to find an optimal combination of attributes. To this end, each individual has a different subset of attributes enabled. The evaluation is performed either with the complete dataset or by using an n-fold cross-validation. The fitness function used is $s(RMSE)=\left(\frac{1}{RMSE}\right)^{2}$ where RMSE is the root mean squared error of the evaluated subset. A mutation consists of switching attributes on or off, and cross-over is performed by interchanging the states of regions of attributes. The second worker uses a `Leave-one-out' strategy for evaluating the performance of each attribute: attributes are consecutively left out and the worst-performing attribute is discarded. This step is repeated until only a single attribute survives. The result should give an impression of how important the different attributes are. Figure \ref{fig:LeaveOneOutResults} shows the results from a `Leave-one-out' analysis and figure \ref{fig:LeaveOneOutWorkflow} the related workflow.
+The plugin provides the \texttt{Split Dataset Into Train-/Testset} worker for splitting the data into a training set and a test set. Three different algorithms are available:
+\begin{itemize}
+\item \texttt{Random}
+\item \texttt{Cluster Representatives}
+\item \texttt{Single Global Max}
+\end{itemize}
+As the name implies, the first algorithm splits the data randomly. The second algorithm first clusters the data, with the number of clusters set to the number of training-set datapoints. Afterwards one datapoint from each cluster is randomly inserted into the training set. The rest is transferred into the test set.
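The \texttt{Cluster Representatives} strategy just described can be sketched as follows. This is an illustrative Python sketch, not the plugin's Java/Weka implementation; the cluster assignment is assumed to be precomputed (e.g. by a clustering worker), with the number of distinct clusters equal to the desired training-set size.

```python
import random

# Illustrative sketch (not the plugin's implementation) of the Cluster
# Representatives split: one random representative per precomputed cluster
# goes into the training set, all remaining datapoints into the test set.

def cluster_representatives_split(datapoints, cluster_of, rng=random):
    """cluster_of[i] is the (precomputed) cluster index of datapoints[i];
    the number of distinct clusters equals the desired training-set size."""
    clusters = {}
    for idx, c in enumerate(cluster_of):
        clusters.setdefault(c, []).append(idx)
    # one random datapoint from each cluster becomes a training representative
    train_idx = {rng.choice(members) for members in clusters.values()}
    train = [datapoints[i] for i in sorted(train_idx)]
    test = [datapoints[i] for i in range(len(datapoints)) if i not in train_idx]
    return train, test
```

Because each cluster contributes exactly one representative, the training set covers the spread of the data rather than being a purely random sample.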
The last algorithm starts like the former one. Subsequently, the worst-predicted datapoint in the test set is identified and exchanged with the corresponding datapoint in the training set. This step is repeated for a predefined number of iterations.
+
 % TODO Andreas: Please fill in.
-Last but not least parallel computing with multicore processors is supported by multiple-threading workers where possible. Many operations scale nearly linearly with the number of cores.
+Last but not least parallel computing with multicore processors is supported by multiple-threading workers where possible. Many operations scale nearly linearly with the number of cores. Especially the machine learning components and the QSAR calculation workers benefit from the parallel computation. Table \ref{tab:ThreadedWorker} gives an overview of all workers which are capable of using multiple threads for their calculations.
 %%%%%%%%%%%%%%%%%%%%%%
@@ -410,14 +425,14 @@
 \begin{figure*}[hbt]
 \centering
 \includegraphics[angle=0,clip=false,scale=1.0]{pics/looattrevaluation.png}
- \caption{In the graph the root mean square error is plotted against the number of left-out attributes. The root mean square error rises when leaving out attributes one after another. Importantly, the first fifty attributes have almost no influence on the result, but the last thirty attributes are essential for the result.}
+ \caption{The graph shows the root mean square error plotted against the number of left-out attributes. The root mean square error rises as attributes are left out one after another. Importantly, the first fifty attributes have almost no influence on the result, but the last thirty attributes are essential for the result.}
 \label{fig:LeaveOneOutResults}
 \end{figure*}
 \begin{figure*}[hbt]
 \centering
 \includegraphics[angle=0,clip=false,scale=1.0]{pics/looattrworkflow.png}
- \caption{}
+ \caption{The drafted workflow performs a `Leave-One-Out' attribute analysis.
For this purpose a regression dataset is generated from a QSAR data CSV file. Afterwards the composed datasets are written out as XRFF files and a CSV file is generated showing the order of the left-out attributes. Additionally the results are visualised in a PDF file. This workflow is also usable with the \texttt{GA Attribute Selection} worker.}
 \label{fig:LeaveOneOutWorkflow}
 \end{figure*}
@@ -434,6 +449,7 @@
 \subsection*{Table 1 - CDK-Taverna 2.0 workers}
 Overview on CDK-Taverna 2.0 workers categorized by their function. \par \mbox{} \par
+ \begin{table}[hbt]
 \mbox{
 \begin{tabular}{|c|c|c|}\hline
 \textbf{Function} & \textbf{\# workers} & \textbf{Examples} \\ \hline
@@ -442,8 +458,26 @@
 Machine learning & ? & kMeans, Perceptron, SVM \\ \hline
 \end{tabular}
 }
+ \end{table}
+
+ \subsection*{Table 2 - Overview on multi-threading CDK-Taverna 2.0 workers}
+ Overview on CDK-Taverna 2.0 workers which are capable of using multiple threads for their calculations. \par \mbox{}
+ \par
+ \begin{table}[hbt]
+ \mbox{
+ \begin{tabular}{|c|c|}\hline
+ \textbf{Worker} & \textbf{Function} \\ \hline
+ QSAR Descriptor Threaded & Calculation of QSAR descriptors \\ \hline
+ Weka Clustering & Clustering data \\ \hline
+ GA Attribute Selection & Attribute evaluation using a genetic algorithm \\ \hline
+ Leave-One-Out Attribute Selection & Attribute evaluation using a `Leave-One-Out' strategy \\ \hline
+ Split Dataset Into Train-/Testset & Separating datasets into training sets and test sets \\ \hline
+ Weka Regression & Building regression models \\ \hline
+ \end{tabular}
+ \label{tab:ThreadedWorker}
+ }
+ \end{table}
-
 \end{bmcformat}
 \end{document}

Modified: cdk-taverna-2-paper/pics/looattrevaluation.png
===================================================================
(Binary files differ)

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.