Indiana University, Bloomington, USA
School of Informatics and Computing
Red-RF: Reduced Random Forest for big data using priority voting & dynamic data reduction
README file
May 2015
=============================================================================================

CITATION:
=========

Please cite the following paper(s) when using Red-RF:

H. Mohsen, H. Kurban, K. Zimmer, M. Jenne and M. Dalkilic. Red-RF: Reduced Random Forests using priority voting & dynamic data reduction. Proceedings of the 4th IEEE International Congress on Big Data (IEEE BigData Congress'2015), 118-125, New York, NY, June-July 2015.

H. Mohsen, H. Kurban, M. Jenne and M. Dalkilic. A New Set of Random Forests with Varying Dynamic Data Reduction and Voting Techniques. Proceedings of the 2014 IEEE International Conference on Data Science and Advanced Analytics (IEEE DSAA'2014), 309-405, Shanghai, China, October-November 2014.

CODE EXECUTION:
===============

The provided code runs as-is on the provided cancer data (the input file must be in the working directory).

To run the code against a new data set, adjust the following global variables in the code as needed:

// Names of the attributes, excluding the label. It is CRUCIAL that this array has exactly n items if the data has n attributes. The names themselves do not matter, so generic ones such as "att1", "att2", etc. are fine.
attributeNames={"CT","UCSi","UCSh","MA","SICZ","BN","BC","NN","M"}; 

// Prefix of the input file name. Change it to match the new input file.
public static String dataPrefix="cancer";  

// The impurity measure to use: 2 for Gini index, 1 for entropy, 0 for error rate. The default is 2 (Gini).
public static int typeOfImpurity=2;

// The maximum branching factor in forest trees.
public static int numberOfBranches=5;
 
// N', the size of the sample used to build each tree in the random forest.
public static int NPrime = 15; 

// m, the number of attributes randomly chosen at each split while building the forest trees.
public static int m = (int) Math.ceil(Math.sqrt(attributeNames.length));

// Number of trees in the original (whole) forest.
public static int forestSize =150;

// Size of the dataset (number of rows/records)
public static int dataSize=683;
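
As an illustration of the typeOfImpurity setting, the three measures can be sketched for a single tree node as follows. This is a minimal sketch using the standard textbook definitions. The class ImpuritySketch and the method impurity are hypothetical names that are not taken from RedRF.java, whose actual implementation may differ.

```java
// Hypothetical helper, NOT part of RedRF.java: textbook versions of the three
// impurity measures that typeOfImpurity selects between.
final class ImpuritySketch {

    // counts[i] is the number of records of class i that reach a node.
    static double impurity(int[] counts, int typeOfImpurity) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        if (total == 0) {
            return 0.0; // assumption: an empty node is treated as pure
        }

        double sumOfSquares = 0.0;
        double entropy = 0.0;
        double largestShare = 0.0;
        for (int c : counts) {
            double p = (double) c / total;
            sumOfSquares += p * p;
            if (p > 0) {
                entropy -= p * (Math.log(p) / Math.log(2)); // log base 2
            }
            largestShare = Math.max(largestShare, p);
        }

        switch (typeOfImpurity) {
            case 2:
                return 1.0 - sumOfSquares;  // Gini index
            case 1:
                return entropy;             // entropy, in bits
            case 0:
                return 1.0 - largestShare;  // classification error rate
            default:
                throw new IllegalArgumentException("typeOfImpurity must be 0, 1 or 2");
        }
    }

    public static void main(String[] args) {
        int[] counts = {30, 10}; // made-up class counts for illustration
        System.out.println("Gini index: " + impurity(counts, 2));
        System.out.println("Entropy:    " + impurity(counts, 1));
        System.out.println("Error rate: " + impurity(counts, 0));
    }
}
```

All three measures are 0 for a pure node and reach their maximum when the classes are evenly mixed, so a lower value after a split indicates a better split.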

The code performs 10-fold cross-validation. When execution finishes, the average accuracy and execution times are printed to the console. An example is attached (Screenshot.png).
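
For readers unfamiliar with 10-fold cross-validation, the partitioning step can be sketched as below. This is only an illustration: the class CrossValidationSketch, the method makeFolds and the fixed seed are hypothetical, and the way RedRF.java actually shuffles and splits the records is not described in this README.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration, NOT taken from RedRF.java.
final class CrossValidationSketch {

    // Splits record indices 0..dataSize-1 into k folds whose sizes differ by at most one.
    static List<List<Integer>> makeFolds(int dataSize, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < dataSize; i++) {
            indices.add(i);
        }
        // Assumption: records are shuffled once before being dealt into folds.
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < dataSize; i++) {
            folds.get(i % k).add(indices.get(i)); // deal round-robin
        }
        return folds;
    }

    public static void main(String[] args) {
        int dataSize = 683; // same value as the dataSize variable above
        List<List<Integer>> folds = makeFolds(dataSize, 10, 42L);
        for (int f = 0; f < folds.size(); f++) {
            int testSize = folds.get(f).size();
            // Each fold serves once as the test set; the other nine form the training set.
            System.out.println("fold " + f + ": test=" + testSize
                    + ", train=" + (dataSize - testSize));
        }
    }
}
```

In each of the 10 rounds, one fold is held out for testing and the remaining nine are used for training; the reported accuracy is the average over the rounds.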

INPUT FILE:
===========

In the input CSV file:
- All attribute values must be numerical. Categorical values must be preprocessed into numerical ones (for example, 2 categorical values could be converted to 1 and 4).
- Labels must be 0 and 1 (not "0" and "1", i.e. no quotation marks), and they must be in the last column of the input CSV file.
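
The two requirements above can be checked before running the code. The following is a minimal sketch, assuming comma-separated fields; the class InputRowCheck and the method check are hypothetical and not part of RedRF.java.

```java
// Hypothetical pre-flight check for one data row, NOT part of RedRF.java.
final class InputRowCheck {

    // Returns null if the row meets the requirements, otherwise a description of the problem.
    static String check(String line, int numberOfAttributes) {
        // Assumption: fields are separated by commas; -1 keeps trailing empty fields.
        String[] fields = line.split(",", -1);
        if (fields.length != numberOfAttributes + 1) {
            return "expected " + (numberOfAttributes + 1) + " fields but found " + fields.length;
        }
        // Every attribute value must parse as a number.
        for (int i = 0; i < numberOfAttributes; i++) {
            try {
                Double.parseDouble(fields[i].trim());
            } catch (NumberFormatException e) {
                return "attribute " + (i + 1) + " is not numerical: " + fields[i];
            }
        }
        // The label is the last field and must be a bare 0 or 1.
        String label = fields[numberOfAttributes].trim();
        if (!label.equals("0") && !label.equals("1")) {
            return "label must be 0 or 1 without quotation marks, found: " + label;
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up rows with 9 attributes followed by the label.
        String[] rows = {
            "5,1,1,1,2,1,3,1,1,0",
            "5,1,1,1,2,?,3,1,1,0",
            "5,1,1,1,2,1,3,1,1,\"1\""
        };
        for (String row : rows) {
            String problem = check(row, 9);
            System.out.println(row + " -> " + (problem == null ? "OK" : problem));
        }
    }
}
```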

HEAP & ROC FILES:
=================

Heap and ROC files are generated in the working directory.

To generate the heap distribution or ROC plots for a new data set:

For the heap distribution:
- When execution finishes, the generated file will be called prefix_heap.txt. Run the Heap commands in the attached R-Commands.txt. The commands are based on the pROC R library.

For the ROC plot:
- When execution finishes, the generated file will be called prefix_ROC.txt. Run the ROC commands in the attached R-Commands.txt (histogram generation).

CONTACT:
========

For inquiries, please contact us at hmohsen@imail.iu.edu (or @indiana.edu).
Source: README.txt, updated 2015-12-01