Files:

spark-classifier_1.3-SNAPSHOT.zip     2018-03-19   127.0 MB
README.txt                            2018-03-19   3.3 kB
spark-classifier_SourceCode1.3.zip    2018-03-19   17.6 kB

Overview:

This program can be used both to train a model and to classify with it. After training, you can query the model through a RESTful web service.

The web service (based on jetty and javaspark) exposes classification/prediction as a service.
Refer to the sections below according to how you plan to use the program.

Install:

1. Download the package "spark-classifier_1.3-SNAPSHOT.zip"

2. Unzip the pre-built distribution and follow the steps below

3. Understand the folder structure of the release after unzipping

* spark-classifier_\<version>
    * /lib: contains all dependent jars
    * /conf: contains classifier.properties, please review this file before running the program 
    * /model: the default model path, where the model is saved (after training) and read (by the classification service). You need write access to this folder
    * /spark-classifier-\<version>.jar: the main driver jar


Configuration:

Currently it supports Random Forest and Multilayer Perceptron classifiers. Set the one you want in "conf/classifier.properties":

# Currently supported algorithm RANDOM_FOREST or MULTILEVEL_PERCEPTRON
classifier.algorithm=MULTILEVEL_PERCEPTRON
#classifier.algorithm=RANDOM_FOREST

Feature and label columns are each given as a comma (,) separated list. A * for the label means every column will be predicted. Feature columns are skipped if they also appear among the label columns.

classifier.featurecols=Number,Follow up

####list of labels to be predicted
#### '*' will process all the columns
classifier.labelcols=Root Cause
#classifier.labelcols=L1, L2, L3...
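
To make the column settings concrete, here is a minimal Python sketch of how such a properties file could be read and the label list resolved. It is not the program's own parser: the helper names `parse_properties`, `split_columns` and `resolve_labels` are hypothetical, and the handling of `*` (every column becomes a label except the feature columns) is only one reading of the rule described above.

```python
def parse_properties(text):
    """Parse simple key=value lines, ignoring blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def split_columns(value):
    """Split a comma-separated column list and trim surrounding whitespace."""
    return [c.strip() for c in value.split(",") if c.strip()]


def resolve_labels(all_columns, feature_cols, label_setting):
    """Assumed rule: '*' selects every column; feature columns are never labels."""
    if label_setting.strip() == "*":
        wanted = list(all_columns)
    else:
        wanted = split_columns(label_setting)
    return [c for c in wanted if c not in feature_cols]


SAMPLE = """
# Currently supported algorithm RANDOM_FOREST or MULTILEVEL_PERCEPTRON
classifier.algorithm=MULTILEVEL_PERCEPTRON
classifier.featurecols=Number,Follow up
classifier.labelcols=Root Cause
"""

props = parse_properties(SAMPLE)
features = split_columns(props["classifier.featurecols"])
labels = resolve_labels(["Number", "Follow up", "Root Cause"], features,
                        props["classifier.labelcols"])
print(props["classifier.algorithm"], features, labels)
```

Column names that contain spaces, such as "Follow up", survive because the list is split on commas only.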

Train the model:

cmd > java -cp spark-classifier-<version>-SNAPSHOT.jar:lib/*:conf org.arrahtech.classifier.ClassifierTrainer
    
    The input file name and the output model location are defined in `conf/classifier.properties`.
    
    The command above assumes that `conf/classifier.properties` is set up correctly.
    

Use the model to predict or classify:

cmd > java -cp spark-classifier-<version>-SNAPSHOT.jar:lib/*:conf org.arrahtech.service.ClassifierService

This starts the default jetty server, which accepts POST requests. You can then post to the RESTful API at http://localhost:4567/classify/<algorithm_name>/<label_name> with -d jsonfile

Where \<algorithm_name> can be "random_forest" or "multilevel_perceptron", \<label_name> is the label column name (the column the model was trained for) in your training dataset, and the JSON file holds the feature columns and their values, which are the input for prediction or classification.
    
cmd > curl -XPOST http://localhost:4567/classify/random_forest/LABEL1 -d '[{
        "FeatureField1":"FeatureField1VALUE",
        "FeatureField2":"FeatureField2VALUE",
        "FeatureField3":"FeatureField3VALUE"}]'

    > Response JSON
    [{
        "classifiedLabel": "PredictedValue",
        "probability": "0.951814884316891"
    }]
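
The same request can be made from a script. The following is a minimal Python sketch using only the standard library, assuming the service runs at the address shown in the curl example. The function names `classify_url`, `classify` and `parse_predictions` are hypothetical and not part of the package, and the response shape is taken from the sample response above.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:4567"  # address used in the curl example


def classify_url(algorithm, label, base_url=BASE_URL):
    """Build the /classify/<algorithm_name>/<label_name> URL."""
    # quote() percent-encodes unsafe characters, e.g. a space becomes %20.
    return f"{base_url}/classify/{quote(algorithm, safe='')}/{quote(label, safe='')}"


def classify(algorithm, label, rows, base_url=BASE_URL):
    """POST a list of feature dicts and return the decoded JSON response."""
    body = json.dumps(rows).encode("utf-8")
    # Like `curl -d`, urllib sends a form-urlencoded Content-Type by default.
    request = urllib.request.Request(classify_url(algorithm, label, base_url),
                                     data=body, method="POST")
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def parse_predictions(response_rows):
    """Turn the response into (label, probability) pairs.

    The sample response carries the probability as a string, so it is
    converted to float here; a missing probability becomes None.
    """
    results = []
    for row in response_rows:
        prob = row.get("probability")
        results.append((row.get("classifiedLabel"),
                        float(prob) if prob is not None else None))
    return results
```

With the service running, `parse_predictions(classify("random_forest", "LABEL1", [{"FeatureField1": "FeatureField1VALUE"}]))` would return the predicted label and its probability for each input row.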


Things to Remember:
1.) Presently it accepts only a txt file with a field separator
2.) Null is replaced by NULLVALUE, as null cannot be used in the model
3.) multilevel_perceptron does not give the probability of the predicted value. This feature is available in the latest Apache Spark version.
4.) Currently label_name must not contain the hyphen '-' character
5.) If the label column name contains a space, use '%20' for the space.
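
Items 2, 4 and 5 can be checked on the client side before a request is sent. The Python sketch below is only an illustration: `prepare_row` and `encode_label` are hypothetical helpers, and replacing nulls in the client is an assumption, since the program may already perform that substitution itself.

```python
from urllib.parse import quote

NULL_PLACEHOLDER = "NULLVALUE"  # placeholder named in item 2


def prepare_row(row):
    """Replace None values with the NULLVALUE placeholder (client-side assumption)."""
    return {key: NULL_PLACEHOLDER if value is None else value
            for key, value in row.items()}


def encode_label(label):
    """Reject hyphens (item 4) and percent-encode spaces as %20 (item 5)."""
    if "-" in label:
        raise ValueError(f"label name must not contain '-': {label!r}")
    return quote(label, safe="")


print(encode_label("Root Cause"))
print(prepare_row({"Number": None, "Follow up": "yes"}))
```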
Source: README.txt, updated 2018-03-19