
#16 Unknown Values in Output Vectors Become 0 When Evaluating

Milestone: v1.0_(example)
Status: closed
Owner: nobody
Labels: None
Priority: 5
Updated: 2015-12-22
Created: 2014-12-17
Creator: MLUser
Private: No

If an output vector contains unknown values, they are converted to 0 during evaluation. One dataset with unknown values in its outputs is bridges, which can be found at https://sourceforge.net/projects/meka/files/Datasets/.

During evaluation, however, outputs are converted to integers at line 96 of meka.core.Result.java in the function addResult(double pred[], Instance real). I know that this is happening because I set a conditional breakpoint at line 242 of meka.core.MLUtils.java (in the function toIntArray(Instance x, int L)), with the condition Double.isNaN(x.value(j)), and it was hit. The issue is that Meka stores unknown values as Double.NaN, and converting NaN to an integer in Java yields 0. During evaluation, this means that for a model to get an output vector correct, it must predict 0 wherever the target output vector has an unknown value. This issue is not present during training, so models are unlikely to learn to predict 0 for unknown values. In the end, this means that results on the bridges dataset are artificially low.
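
The conversion described above can be reproduced outside Meka. The following is a minimal sketch, not Meka's actual code: the class NanCastDemo and its toIntArray(double[]) helper are hypothetical stand-ins for the conversion in MLUtils.toIntArray, and they only assume that unknown values arrive as Double.NaN and are narrowed to int.

```java
class NanCastDemo {
    // Hypothetical stand-in for the conversion step: each double is narrowed
    // to int. Java defines the narrowing of NaN to int as 0, so an unknown
    // value becomes indistinguishable from the nominal value with index 0.
    static int[] toIntArray(double[] values) {
        int[] out = new int[values.length];
        for (int j = 0; j < values.length; j++) {
            out[j] = (int) values[j];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] target = {2.0, Double.NaN, 1.0}; // the middle output is unknown
        System.out.println(java.util.Arrays.toString(toIntArray(target))); // prints [2, 0, 1]
    }
}
```

After the conversion there is no way to tell whether the 0 in the middle position was a real value or an unknown one.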

I'm not sure how to fix this. Output vectors with unknown values could be skipped when finding exact match scores, or perhaps any value could be counted as correct for the outputs that are unknown. Or perhaps unknown outputs should not be considered valid, so an exception could be thrown when they are detected.
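
Two of the options above might look roughly like the sketch below. It is an illustration under stated assumptions, not a patch against Meka: the class MaskedExactMatch and both method names are hypothetical, targets are assumed to be passed as a double[] in which unknown values are Double.NaN, and predictions are assumed to be already converted to int.

```java
class MaskedExactMatch {
    // Option "any value counts as correct": positions whose target is NaN
    // (unknown) are ignored, and all remaining positions must match exactly.
    static boolean matchesIgnoringUnknown(double[] real, int[] pred) {
        for (int j = 0; j < real.length; j++) {
            if (Double.isNaN(real[j])) {
                continue; // unknown target: nothing to compare against
            }
            if ((int) Math.round(real[j]) != pred[j]) {
                return false;
            }
        }
        return true;
    }

    // Option "unknown outputs are not valid": fail loudly instead of
    // silently turning NaN into 0.
    static void requireNoUnknown(double[] real) {
        for (int j = 0; j < real.length; j++) {
            if (Double.isNaN(real[j])) {
                throw new IllegalArgumentException("unknown value in output " + j);
            }
        }
    }
}
```

With a target of {2.0, NaN, 1.0}, a prediction of {2, 3, 1} would count as an exact match under the first method, while the second method would reject the instance outright.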

I am using revision 282 of the code.

Discussion

  • Jesse Read

    Jesse Read - 2014-12-18
    • status: open --> pending
     
  • Jesse Read

    Jesse Read - 2014-12-18

    This is a kind of partially-supervised scenario. Meka does support semi-supervised learning (the classifiers that implement SemiSupervised) but that case assumes that a batch of instances is unlabelled -- and obviously does not need to be evaluated. This case is different in that individual labels are unknown. Setting all unknown labels to zero is probably not actually that bad -- most datasets will be 'noisy' in this sense with unknown or missed labels hidden as 0s. As I am not sure how common it is to know when the labels are unknown, I probably won't bother with a fix yet, but will keep it in mind.

    Another possible work-around is to consider a multi-target case, and encode the 'unknown class' as a 3 or something.
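
A rough sketch of that work-around, assuming each target's number of nominal values is known up front: the unknown value is mapped to one extra class index past the last real one. The class UnknownAsClass and its encode method are hypothetical names used only for illustration, not part of Meka.

```java
class UnknownAsClass {
    // numValues[j] is the number of real nominal values of target j, so the
    // valid indices are 0 .. numValues[j] - 1. An unknown (NaN) target is
    // encoded as numValues[j], i.e. one additional "unknown" class.
    static int[] encode(double[] real, int[] numValues) {
        int[] out = new int[real.length];
        for (int j = 0; j < real.length; j++) {
            out[j] = Double.isNaN(real[j]) ? numValues[j] : (int) Math.round(real[j]);
        }
        return out;
    }
}
```

For a target with three nominal values (indices 0 to 2), an unknown would be encoded as 3, which keeps it distinct from the real value at index 0.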

    Thanks for pointing this out.

     
  • MLUser

    MLUser - 2015-01-05

    Thanks for your response. I think that for the multi-label case it does make sense, as you said, to set unknown labels to zero. The bridges dataset, however, is multi-target, so zero doesn't mean that a label is not present; rather, zero refers to a specific nominal value.

    Bridges is the only dataset I have seen with this problem, so this is not a pressing concern.

     
  • Jesse Read

    Jesse Read - 2015-12-22
    • status: pending --> closed
     
