From: <and...@us...> - 2011-10-26 06:58:22
Revision: 15625
http://cdk.svn.sourceforge.net/cdk/?rev=15625&view=rev
Author: andreas1981
Date: 2011-10-26 06:58:15 +0000 (Wed, 26 Oct 2011)
Log Message:
-----------

Modified Paths:
--------------
cdk-taverna-2-paper/bmc_article.bib
cdk-taverna-2-paper/bmc_article.tex

Modified: cdk-taverna-2-paper/bmc_article.bib
===================================================================
--- cdk-taverna-2-paper/bmc_article.bib 2011-10-11 14:54:40 UTC (rev 15624)
+++ cdk-taverna-2-paper/bmc_article.bib 2011-10-26 06:58:15 UTC (rev 15625)
@@ -346,7 +346,7 @@
  address = {Berlin},
  author = {Zielesny, Achim},
  pages = {548},
- publisher = {Springer: Intelligent Systems Reference Library, in print},
+ publisher = {Springer: Intelligent Systems Reference Library, Volume 18},
  title = {From Curve Fitting to Machine Learning: An illustrative Guide to scientific Data Analysis and Computational Intelligence},
  year = {2011}
 }

Modified: cdk-taverna-2-paper/bmc_article.tex
===================================================================
--- cdk-taverna-2-paper/bmc_article.tex 2011-10-11 14:54:40 UTC (rev 15624)
+++ cdk-taverna-2-paper/bmc_article.tex 2011-10-26 06:58:15 UTC (rev 15625)
@@ -196,9 +196,7 @@
 \begin{abstract}
 % Do not use inserted blank lines (ie \\) until main body of text.
 \paragraph*{Background:} The computational processing and analysis of small molecules is at heart of cheminformatics and structural bioinformatics and their application in e.g. metabolomics or drug discovery. Pipelining or workflow tools allow for the Lego(TM)-like, graphical assembly of I/O modules and algorithms into a complex workflow which can be easily deployed, modified and tested without the hassle of implementing it into a monolithic application. The CDK-Taverna project aims at building a free open-source cheminformatics pipelining solution through combination of different open-source projects such as Taverna, the Chemistry Development Kit (CDK) or the Waikato Environment for Knowledge Analysis (WEKA). A first integrated version 1.0 of CDK-Taverna was recently released to the public.
-
 \paragraph*{Results:} The CDK-Taverna project was migrated to the most up-to-date versions of its foundational software libraries with a complete re-engineering of its worker's architecture (version 2.0). 64-bit computing and multi-core usage by paralleled threads are now supported to allow for fast in-memory processing and analysis of large sets of molecules. Earlier deficiencies like workarounds for iterative data reading are removed. The combinatorial chemistry related reaction enumeration features are considerably enhanced. Additional functionality for calculating a natural product likeness score for small molecules is implemented to identify possible drug candidates. Finally the data analysis capabilities are extended with new workers that provide access to the open-source WEKA library for clustering and machine learning as well as training and test set partitioning. The new features are outlined with usage scenarios.
-
 \paragraph*{Conclusions:} CDK-Taverna 2.0 as an open-source cheminformatics workflow solution matured to become a freely available and increasingly powerful tool for the biosciences. The combination of the new CDK-Taverna worker family with the already available workflows developed by a lively Taverna community and published on myexperiment.org enables molecular scientists to quickly calculate, process and analyse molecular data as typically found in e.g. today's systems biology scenarios.
 \end{abstract}
@@ -302,9 +300,10 @@
 The configuration panel has to be registered in the \texttt{CDKConfigurationPanelFactory} class of the \texttt{org.openscience.cdk.applications.taverna.ui.view} package.
 More details on how to write workers and their configuration panels are provided at the project's wiki page \url{http://cdk-taverna-2.ts-concepts.de/wiki/index.php?title=Main_Page}.
 \subsection*{Requirements}
-CDK-Taverna 2.0 supports 64-bit computing by use with a Java 64-bit virtual machine. The CDK-Taverna 2.0 plug-in is written in Java and requires Java 6 or higher. The latest java version is available at \url{http://www.java.com/de/download/}. The CDK-Taverna 2.0 plug-in is developed and tested on Microsoft Windows 7 (32 and 64 bit) as well as Mac OS X (32 and 64 bit). Problems on Linux distributions are not expected.
+CDK-Taverna 2.0 supports 64-bit computing by use with a Java 64-bit virtual machine. The CDK-Taverna 2.0 plug-in is written in Java and requires Java 6 or higher. The latest java version is available at \url{http://www.java.com/de/download/}. The CDK-Taverna 2.0 plug-in is developed and tested on Microsoft Windows 7 as well as Linux and Mac OS X (32 and 64 bit).
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %% Results and Discussion %%
 %%
@@ -326,7 +325,7 @@
 The curation and signature scoring workers may not only be applied in evaluating the NP-likeness of compound libraries but also in evaluating the metabolite-likeness of theoretical metabolites for predicting whole metabolomes. The latter application was the original purpose for the worker development and corresponding results will be presented in a subsequent publication.
 \subsection*{Clustering and machine learning applications}
-Unsupervised clustering tries to partition input data into a number of groups smaller than the number of data whereas supervised machine learning tries to model the association of input and output data. If the output codes continuous quantities a regression task is defined. Alternatively the output may code classes so that a classification task is addressed. Molecular data sets for clustering consist of input vectors where each vector represents a molecular entity and consists of a set of molecular descriptors itself. Molecular data sets for machine learning add to each input vector a corresponding output vector with features to be learned - thus they consist of I/O pairs of input and output vectors.
+Unsupervised clustering tries to partition input data into a number of groups smaller than the number of data whereas supervised machine learning tries to construct model functions that map the input data onto their corresponding output data. If the output codes continuous quantities a regression task is defined. Alternatively the output may code classes so that a classification task is addressed. Molecular data sets for clustering consist of input vectors where each vector represents a molecular entity and consists of a set of molecular descriptors itself. Molecular data sets for machine learning add to each input vector a corresponding output vector with features to be learned - thus they consist of I/O pairs of input and output vectors.
 The clustering and machine learning workers of CDK-Taverna 2.0 allow the use of distinct WEKA functionality. As far as clustering is concerned the ART-2a worker of version 1.0 is supplemented with five additional WEKA-based workers which offer
@@ -349,7 +348,7 @@
 \end{itemize}
-Machine learning workers support the analysis of the significance of single input vector components, the partitioning of machine learning data into training and test sets, associative model generation and model based predictions as well as result visualization. There is a total of six WEKA-based machine learning methods available: Two workers allow regression as well as classification procedures ...
+Machine learning workers support the significance analysis of single components (i.e. features) of an input vector to obtain smaller inputs with a reduced set of components/features, the partitioning of machine learning data into training and test sets, the construction of input/output mapping model functions and model based predictions as well as result visualization. There is a total of six WEKA-based machine learning methods available: Two workers allow regression as well as classification procedures ...
 \begin{itemize}

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.