Feed4Weka is an open source library that integrates Weka with new machine learning algorithms and extends some core features already existing in the main Weka framework.
In short, Weka is an open source Java library for machine learning and data mining, developed by the University of Waikato and released under the GNU General Public License. The library contains modules for:
==> Data processing. A vast collection of algorithms for transforming and manipulating tables, and for computing relevant statistics.
==> Classification,  regression and clustering. The library embodies several well-known algorithms from the literature.
==> Association rules. Weka contains some limited algorithms for frequent pattern discovery and association rules.
==> Visualization. Histograms and scatter plots are the main visualization tools in Weka. 
Extending Weka is relatively straightforward, as explained at http://weka.wikispaces.com/. Feed4Weka extends Weka by adding new algorithms as described on that wiki. In addition, Feed4Weka changes the general structure of the main Weka application by adding new functionality for outlier detection and for co-clustering. Adding these features required a major modification of the Weka core and of the interface, introducing the data structures that support the above additions.
To summarize, Feed4Weka:
* Enriches the list of algorithms included in Weka;
* Adds outlier detection and co-clustering to the interface; neither is included in the original Weka distribution.


Feed4Weka in detail

The algorithms developed within Feed4Weka can be categorized into the following tasks:
Classifiers
Grouped by their features, the classifiers included in Feed4Weka are:
Probabilistic classifiers
* Maximum Entropy Model. This classifier implements the MaxEnt model described in
o Jaynes, E. T., 1986 (new version online 1996), 'Monkeys, kangaroos and N', in Maximum-Entropy and Bayesian Methods in Applied Statistics, J. H. Justice (ed.), Cambridge University Press, Cambridge, p. 26.
Rule-Based Classifiers
* PNRule. Positive and Negative rules for mining with rarity, described in
o Ramesh Agarwal and Mahesh V. Joshi, PNrule: A New Framework for Learning Classifier Models in Data Mining (A Case-Study in Network Intrusion Detection), First SIAM Conference on Data Mining 2000
* Slipper (Simple Learner with Iterative Pruning to Produce Error Reduction). A specialization of a boosting technique, described in 
o W.W. Cohen, Y. Singer. A Simple, Fast, and Effective Rule Learner. Proceedings of the Sixteenth National Conference on Artificial Intelligence. Orlando Florida (United States of America, 1999) 335-342. 
* CBA-GB. Apriori-based associative classification:  
o B. Liu, W. Hsu, Y. Ma. Integrating Classification and Association Rule Mining. 4th International Conference on Knowledge Discovery and Data Mining (KDD98). New York (USA, 1998) 80-86.
* MRNB. A mixed hierarchical-rule based approach, described in 
o Gianni Costa, Massimo Guarascio, Giuseppe Manco, Riccardo Ortale, Ettore Ritacco: Rule Learning with Probabilistic Smoothing. DaWaK 2009: 428-440
Decision-Tree classifiers
* AUC-CITree. The splitting criterion is defined by the AUC of a Naive Bayes classifier. The description can be found in
o Jiang Su, Harry Zhang. Learning Conditional Independence Tree for Ranking. Fourth IEEE International Conference on Data Mining (ICDM '04), 2004
* QUEST (Quick Unbiased Efficient Statistical Tree). A statistical splitting criterion guides this algorithm, as described in
o Loh, W.-Y. and Shih, Y.-S. (1997), Split selection methods for classification trees, Statistica Sinica, vol. 7, 815-840 
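For orientation, the maximum-entropy classifier listed above uses the standard log-linear parameterization (this is the textbook formulation, not necessarily the exact parameterization used in Feed4Weka):

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\!\Big(\sum_i \lambda_i f_i(x, y)\Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big(\sum_i \lambda_i f_i(x, y')\Big)
```

Here the f_i are feature functions over an instance/class pair and the weights lambda_i are fit by maximizing the training likelihood, which yields the maximum-entropy distribution consistent with the observed feature expectations.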
Clusterers
Again, the clustering algorithms can be categorized into:
Model-based clustering
* CEM. This is a variation of the EM algorithm, as described in
o  Figueiredo, Jain, Unsupervised Learning of Finite Mixture Models, IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(3), 2002.
Categorical data clusterers
* Rock (RObust Clustering using linKs). An agglomerative hierarchical clustering described in
o  Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes," icde, pp.512, 15th International Conference on Data Engineering (ICDE'99), 1999
* TrKMeans. A variation of the K-Means algorithm, described in
o Fosca Giannotti, Cristian Gozzi, Giuseppe Manco: Clustering Transactional Data. PKDD 2002: 175-187
* Limbo. This is an entropy-based clustering algorithm, described in
o Periklis Andritsos, Panayiotis Tsaparas, Renée J. Miller, Kenneth C. Sevcik. LIMBO: Scalable Clustering of Categorical Data. Advances in Database Technology - EDBT 2004 (2004), pp. 531-532.
* CURE. A hierarchical clustering based on representative points, described in
o Sudipto Guha, Rajeev Rastogi, Kyuseok Shim. CURE: an efficient clustering algorithm for large databases. Information Systems, Volume 26, Number 1, March 2001.
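As a concrete reference point for the clusterers above, here is a minimal, self-contained sketch of flat k-means (Lloyd's algorithm) on one-dimensional data; this is the generic baseline that variants such as TrKMeans build on, not the Feed4Weka implementation itself.

```java
import java.util.Arrays;

// Minimal sketch of flat k-means (Lloyd's algorithm) on 1-D data.
// Illustration only -- NOT the Feed4Weka code.
public class KMeansSketch {

    // Returns the k centroids, sorted ascending.
    // Assumes k >= 2 and data.length >= k.
    public static double[] cluster(double[] data, int k, int iters) {
        double[] centroids = new double[k];
        // Naive deterministic init: evenly spaced points of the input.
        for (int c = 0; c < k; c++)
            centroids[c] = data[c * (data.length - 1) / (k - 1)];
        int[] assign = new int[data.length];
        for (int it = 0; it < iters; it++) {
            // Assignment step: each point goes to its nearest centroid.
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (Math.abs(data[i] - centroids[c]) < Math.abs(data[i] - centroids[best]))
                        best = c;
                assign[i] = best;
            }
            // Update step: each centroid becomes the mean of its points.
            for (int c = 0; c < k; c++) {
                double sum = 0;
                int n = 0;
                for (int i = 0; i < data.length; i++)
                    if (assign[i] == c) { sum += data[i]; n++; }
                if (n > 0) centroids[c] = sum / n;
            }
        }
        Arrays.sort(centroids);
        return centroids;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 1.2, 0.8, 10.0, 10.5, 9.5};
        // Two centroids: one near 1.0, one near 10.0.
        System.out.println(Arrays.toString(cluster(data, 2, 20)));
    }
}
```

TrKMeans and the other categorical clusterers above replace the Euclidean distance and mean-update steps with measures suited to transactional/categorical data; the overall assign-then-update loop is the same.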
Outlier Detection
Feed4Weka integrates into Weka a new structural component. This major modification allows Weka to incorporate outlier detection algorithms and visualizers. Besides the structural modification, Feed4Weka includes
* Feature Bagging. The algorithm is based on a bagging procedure, with a model-based clustering baseline for the detection of outliers. A description can be found in 
o Lazarevic, A. and Kumar, V. 2005. Feature bagging for outlier detection. In Proceedings of the Eleventh ACM SIGKDD international Conference on Knowledge Discovery in Data Mining (Chicago, Illinois, USA, August 21 - 24, 2005). KDD '05. ACM, New York, NY, 157-166
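To illustrate the bagging idea behind this detector, the toy sketch below scores each point with a base detector on several random feature subsets and averages the scores. The base detector here (distance from the per-subset mean) is a stand-in for simplicity, not the model-based detector the paper and Feed4Weka use.

```java
import java.util.Arrays;
import java.util.Random;

// Toy sketch of feature bagging for outlier detection: run a base
// detector on random feature subsets and average the outlier scores.
// The base detector (distance from the subset mean) is a placeholder.
public class FeatureBaggingSketch {

    static double[] score(double[][] data, int rounds, long seed) {
        int n = data.length, d = data[0].length;
        Random rnd = new Random(seed);
        double[] total = new double[n];
        for (int r = 0; r < rounds; r++) {
            // Draw a random feature subset of size between d/2 and d-1.
            int size = d / 2 + rnd.nextInt(d - d / 2);
            int[] feats = rnd.ints(0, d).distinct().limit(size).toArray();
            // Base detector: Euclidean distance from the subset mean.
            double[] mean = new double[size];
            for (double[] row : data)
                for (int j = 0; j < size; j++) mean[j] += row[feats[j]] / n;
            for (int i = 0; i < n; i++) {
                double s = 0;
                for (int j = 0; j < size; j++) {
                    double diff = data[i][feats[j]] - mean[j];
                    s += diff * diff;
                }
                total[i] += Math.sqrt(s) / rounds; // average over rounds
            }
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] data = {
            {0, 0, 0}, {1, 0, 1}, {0, 1, 0}, {9, 9, 9} // last row is the outlier
        };
        // The last score should dominate the others.
        System.out.println(Arrays.toString(score(data, 10, 7)));
    }
}
```

The point of the bagging step is that an outlier visible only in a subspace can be masked when all features are used at once; averaging over subsets recovers it.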
CoClustering
Feed4Weka includes two main co-clustering algorithms:
* Information Theoretic CoClustering. An entropy-based co-clustering algorithm, described in
o Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh and Srujana Merugu and Dharmendra S. Modha. A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation, In KDD 2004, pages 509-514
* Fully Automatic Cross-Associations (FACA). Again, an entropy-based divisive algorithm, described in 
o Deepayan Chakrabarti, Spiros Papadimitriou, Dharmendra S. Modha and Christos Faloutsos. Fully automatic cross-associations. In KDD 2004, pages 79-88, ACM Press 
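For orientation, the information-theoretic co-clustering objective underlying the first algorithm chooses row and column clusterings that lose as little mutual information as possible (the cited KDD 2004 paper generalizes this to arbitrary Bregman divergences):

```latex
\min_{\hat{X}, \hat{Y}} \; \Big[ I(X; Y) - I(\hat{X}; \hat{Y}) \Big]
```

where X and Y are the row and column variables of the data matrix, \hat{X} and \hat{Y} are their clustered versions, and I(\cdot;\cdot) denotes mutual information.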

Overview of Feed4Weka

Adding Feed4Weka to a specific Weka installation is relatively simple. The following directions assume that a user has Weka 3.6.3. In order to include Feed4Weka, a user must modify the files GenericObjectEditor.props and GenericPropertiesCreator.props, adding the proper lines for the algorithms described above. I've included an example runnable jar that contains the main Weka release, as well as the Feed4Weka algorithms.
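As a concrete illustration, entries in Weka's GenericObjectEditor.props are comma-separated lists of class names under a superclass key. The excerpt below is only a sketch: the feed4weka.* package and class names are hypothetical, so check the actual names in the sources before editing the file.

```properties
# Hypothetical excerpt -- the real Feed4Weka class names may differ;
# check the sources in Feed4WekaSRC.jar for the actual packages.
weka.classifiers.Classifier=\
 weka.classifiers.bayes.NaiveBayes,\
 feed4weka.classifiers.rules.Slipper,\
 feed4weka.classifiers.trees.QUEST

weka.clusterers.Clusterer=\
 weka.clusterers.SimpleKMeans,\
 feed4weka.clusterers.Limbo
```

After editing the props files, the new algorithms appear in the usual chooser dialogs of the Weka Explorer.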

The sources are contained in the package 

Feed4WekaSRC.jar

the example package is 

Feed4Weka.jar

In order to run the example package, download it and run it with the java command

java -Xmx1024M -jar feed4weka.jar



Source: README.txt, updated 2013-07-01