Questions related to prediction

Help
Anonymous
2013-01-10
2013-01-21

  • Anonymous
    2013-01-10

    Hi,

    I actually have a couple of questions related to the predict() method.

    The first:

In the documentation (specifically for GClasses::GSupervisedLearner::predict(const double* pIn, double* pOut), http://waffles.sourceforge.net/apidoc/html/class_g_classes_1_1_g_supervised_learner.html#a11c114a06eb49a07da465e9e6782ef2c) it says:

    Evaluate pIn to compute a prediction for pOut. The model must be trained (by calling train) before the first time that this method is called. pIn and pOut should point to arrays of doubles of the same size as the number of columns in the training matrices that were passed to the train method.

    I understand that pIn should point to an array of the same size as the number of columns, but what about pOut? If my prediction problem is to classify an n-ary feature vector to a binary label, wouldn't the size of pOut be just one?

    The second:

I am actually looking for a way to threshold the output of a classifier (e.g. NaiveBayes) in order to compute, for instance, the Receiver Operating Characteristic (ROC) curve. Is there a way (function or method) to obtain the output probabilities from NaiveBayes rather than just the label?

    The third:

Is it true that, in order to predict on a pIn that is sparse, pIn is transformed to a dense representation inside the classifier code?

    Regards,
    Jorn

     
  • Mike Gashler
    2013-01-10

1- The train method requires a feature (input) matrix and a label (output) matrix. The predict method requires an input vector (pIn) and an output vector (pOut). pIn should be the same size as the number of columns in the feature matrix, and pOut the same size as the number of columns in the label matrix. Yes, for classification tasks the size of pOut should be 1. The only times pOut would be larger than 1 are when there are multiple label columns, or for regression to a vector.

2- Call "predictDistribution" instead of just "predict". It is more cumbersome to use, but it provides the additional information you seek.

    3- The answer to this question is algorithm-specific. For example, GKNN efficiently evaluates pIn in sparse format. GNaiveBayes, however, converts it to dense format and then uses a common evaluation method. (It is currently implemented this way so that it can learn from the zeros as well as the ones.)

     

  • Anonymous
    2013-01-10

    Hi,

    Thanks, again, for the answers!

    Regards,
    Jorn

     

  • Anonymous
    2013-01-17

    Hi,

I have yet another question related to the predict method. The problem I am trying to solve is a binary classification task. The general design of my code is that I load a pretrained model (in this case a Naive Bayes model, GClasses::GNaiveBayes) and use it to predict the labels of new instances.

I noticed that in my setup the predictDistribution method is the speed bottleneck. I timed each line of code, and the call to predictDistribution (line 2 in the snippet below) takes about 10 ms, while the rest of the calls each take less than 1 ms. What is the bottleneck in the prediction method? I traced the call and couldn't find any obvious complexity issues. Are there ways of optimizing the prediction calls for speed?

int ClassificationModel::Classify(double *input, double *output)
{
    ........
    1. GClasses::GPrediction pDistr[1];

    2. this->myModel->predictDistribution(input, pDistr);

    3. double prob = pDistr[0].asCategorical()->likelihood(1);
    ........
}

    Regards,
    Jorn

     
  • Mike Gashler
    2013-01-18

I did some testing with a profiler. It looks like the predict and predictDistribution methods have approximately the same cost. Most of the computational cost is in GNaiveBayesOutputValue::eval (GNaiveBayes.cpp:136-153), and more than half of that is spent computing logarithms. This method could be optimized by pre-computing those values and storing them, instead of computing them each time a prediction is needed.

     

  • Anonymous
    2013-01-18

    Hi,

    Thanks for your time!
I have been able to speed up classification by a factor of 10 using your suggestions.

    Regards,
    Jorn

     
  • Mike Gashler
    2013-01-18

    Wow! I didn't realize there was such a bottleneck here. Thanks for pointing it out. (If you would like commit access to our git repository, just send me a SourceForge user ID by e-mail, and I can add you to the project.)

     
  • Jorn Bakker
    2013-01-21

    Hi,

I was pleasantly surprised as well. A few remarks: this speed increase is for prediction only, and I haven't thoroughly tested the code yet.

    Regards,
    Jorn

     

