[Classifier4j-devel] Serious Classifier4J bug
From: Thomas J. <tje...@ya...> - 2008-06-04 12:37:20
Hi,
I am currently investigating the possibilities of an automated bug classification system based on log files and found Classifier4J to be very useful for this. Thank you for that!
However, during some more extensive tests with large log files, I found a serious bug in the BayesianClassifier class which I thought you should know about.
This is the problematic code:
    double z = 0d;
    double xy = 0d;
    for (int i = 0; i < wps.length; i++)
    {
        if (z == 0)
        {
            z = (1 - wps[i].getProbability());
        }
        else
        {
            z = z * (1 - wps[i].getProbability());
        }
        if (xy == 0)
        {
            xy = wps[i].getProbability();
        }
        else
        {
            xy = xy * wps[i].getProbability();
        }
    }
The bug occurs when you have a large number of words. Then z and/or xy tend towards 0 and will often eventually reach exactly 0 through floating-point underflow. This alone would be a problem, because if both reach 0, the final division would be 0/0 (NaN for Java doubles).
But this never happens, because of a second problem. Whenever z or xy reaches zero, the code assumes that the variable still needs to be initialized and overwrites it with the current word's value. This can happen multiple times and causes essentially arbitrary results.
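To make the re-initialization effect concrete, here is a minimal sketch that runs the same loop on a plain array of probabilities. The class name ResetBugDemo, the method buggyCombine, and the final division xy / (xy + z) are assumptions for illustration; the email does not show how z and xy are combined at the end.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of Classifier4J.
class ResetBugDemo {

    // Same logic as the loop above, applied to raw probabilities.
    static double buggyCombine(double[] probs) {
        double z = 0d;
        double xy = 0d;
        for (double p : probs) {
            z = (z == 0) ? (1 - p) : z * (1 - p);
            xy = (xy == 0) ? p : xy * p;
        }
        // Assumed final step: the email does not show this line.
        return xy / (xy + z);
    }

    public static void main(String[] args) {
        // 539 words, each with probability 0.25: the true combined
        // score is tiny, because every single word votes "no match".
        double[] probs = new double[539];
        Arrays.fill(probs, 0.25);
        // xy underflows to 0 at the 538th word and is re-initialized
        // to 0.25 at the 539th, so the result lands near 1 instead.
        System.out.println(buggyCombine(probs));
    }
}
```

With these inputs xy is reset exactly once, right at the last word, which is enough to flip the classification.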
The second problem can easily be overcome by initializing z and xy with 1d outside the loop and removing the two if/else blocks.
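A minimal sketch of that intermediate fix, again on a plain array. The names ProductFixDemo and combine are hypothetical, and the final division is an assumption. It removes the re-initialization, but the underflow remains.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of Classifier4J.
class ProductFixDemo {

    // z and xy start at 1 and are always multiplied, never reset.
    static double combine(double[] probs) {
        double z = 1d;
        double xy = 1d;
        for (double p : probs) {
            z *= (1 - p);
            xy *= p;
        }
        // Assumed final step: the email does not show this line.
        return xy / (xy + z);
    }

    public static void main(String[] args) {
        // Works for a handful of words.
        System.out.println(combine(new double[] {0.9, 0.8}));

        // With 1100 neutral words (p = 0.5) both products underflow
        // to 0, so the division is 0/0 and the result is NaN.
        double[] many = new double[1100];
        Arrays.fill(many, 0.5);
        System.out.println(Double.isNaN(combine(many)));
    }
}
```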
The first problem was a bit trickier, but I found a simple and elegant solution: logarithms! The key formulas here are:
log(a * b) = log(a) + log(b)
log(a / b) = log(a) - log(b)
I also merged z and xy into a single variable to save one logarithm per loop iteration. The final code looks like this:
    // xyz accumulates log(z / xy) as a sum of per-word log ratios
    double xyz = 0d;
    for (int i = 0; i < wps.length; i++)
    {
        xyz = xyz + Math.log((1 - wps[i].getProbability()) / wps[i].getProbability());
    }
    double numerator = 1;
    double denominator = 1 + Math.exp(xyz);
    // don't worry about too large or too small xyz values here, Java handles this:
    // Math.exp overflows to Infinity or underflows to 0, giving a result of 0 or 1
    return numerator / denominator;
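Why the merged variable gives the same result can be sketched as follows. This assumes the original method ends by returning xy / (xy + z), which the email does not show, and that every probability p_i = wps[i].getProbability() lies strictly between 0 and 1.

```latex
\begin{aligned}
xy &= \prod_i p_i, \qquad z = \prod_i (1 - p_i) \\
\frac{xy}{xy + z}
  &= \frac{1}{1 + z / xy}
   = \frac{1}{1 + \exp\bigl(\log z - \log xy\bigr)} \\
  &= \frac{1}{1 + \exp\Bigl(\sum_i \log\frac{1 - p_i}{p_i}\Bigr)}
\end{aligned}
```

The sum inside the exponential is exactly what the loop accumulates in xyz, so neither product ever has to be formed.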
Now the classifier code works as expected, even with a huge number of words. Feel free to use my code to apply a fix to your project.
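For anyone who wants to try the log-based version in isolation, here is a self-contained sketch on a plain array. The names LogDomainDemo and combine are hypothetical, and it assumes all probabilities are strictly between 0 and 1 (a value of exactly 0 or 1 would make the logarithm infinite).

```java
import java.util.Arrays;

// Hypothetical demo class, not part of Classifier4J.
class LogDomainDemo {

    // Sums log((1 - p) / p) instead of multiplying probabilities,
    // then maps the sum back to a probability in [0, 1].
    static double combine(double[] probs) {
        double xyz = 0d;
        for (double p : probs) {
            xyz += Math.log((1 - p) / p);
        }
        return 1 / (1 + Math.exp(xyz));
    }

    public static void main(String[] args) {
        // 1100 neutral words: the plain products would underflow,
        // the log sum stays at 0 and the result stays at 0.5.
        double[] neutral = new double[1100];
        Arrays.fill(neutral, 0.5);
        System.out.println(combine(neutral));

        // 5000 strongly negative words: Math.exp overflows to
        // Infinity and the division yields 0 without any error.
        double[] negative = new double[5000];
        Arrays.fill(negative, 0.01);
        System.out.println(combine(negative));
    }
}
```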
Have fun!
Thomas Jentzsch | *** Every bit is sacred ! ***
tjentzsch at yahoo dot de |