Kolmogorov-Smirnov test poor performance
Java library of statistical distribution
Brought to you by:
robbyjo
I'm using the Kolmogorov-Smirnov implementation to calculate the p-value:
ks_pvalue = DistributionTest.kolmogorov_smirnov_test(part1, part2)[1];
part1 and part2 are each double[50000].
I timed the call and it took 14,500 ms (i.e. 14.5 seconds).
This is very slow: in R the same test takes under 1 second, and the JDistLib Bartlett implementation takes ~200 ms on the exact same data.
Do you know of any reason for this performance issue?
Thanks in advance,
Gilad
Anonymous
Thank you for your bug report. The reason is that R always switches to an inexact p-value computation for huge data sets like that, whereas JDistlib always uses the exact method. So it is not a JDistlib bug per se. I recognize that computing exact p-values may not be desirable in some situations, especially in time-critical applications, so I have added an option in version 0.3.7 to allow inexact p-value computation to save time. The default will always be the exact method, however. Also, be aware of the difference between exact and inexact p-values, especially when it comes to multiple testing (since false discovery methods are notoriously sensitive to minute changes in p-values).
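To illustrate what an inexact p-value means here, the following is a minimal standalone sketch of the usual asymptotic approach: compute the two-sample statistic D, scale it by sqrt(n*m/(n+m)), and evaluate the tail of Kolmogorov's limiting distribution as a short alternating series. This is not JDistlib code and not the option added in 0.3.7; the class name AsymptoticKs and its methods are hypothetical, and the sketch assumes non-empty inputs without NaN values.

```java
import java.util.Arrays;

/** Illustrative only: asymptotic two-sample KS p-value. Not JDistlib code. */
final class AsymptoticKs {

    /** Two-sample KS statistic D = max |F_x(t) - F_y(t)| over all t. */
    static double statistic(double[] x, double[] y) {
        double[] a = x.clone();
        double[] b = y.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        int i = 0, j = 0;
        double d = 0.0;
        while (i < a.length && j < b.length) {
            double v = Math.min(a[i], b[j]);
            // Step both empirical CDFs past every value <= v (handles ties).
            while (i < a.length && a[i] <= v) i++;
            while (j < b.length && b[j] <= v) j++;
            double diff = Math.abs((double) i / a.length - (double) j / b.length);
            d = Math.max(d, diff);
        }
        return d;
    }

    /** Asymptotic p-value: 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lambda^2). */
    static double pValue(double d, int n, int m) {
        // Cast before multiplying so n * m cannot overflow int.
        double lambda = d * Math.sqrt((double) n * m / ((double) n + m));
        // For small lambda the limiting CDF is negligible, so the p-value is 1.
        if (lambda < 0.2) return 1.0;
        double sum = 0.0;
        double sign = 1.0;
        for (int k = 1; k <= 100; k++) {
            double term = Math.exp(-2.0 * k * k * lambda * lambda);
            sum += sign * term;
            if (term < 1e-15) break; // remaining terms are below double precision
            sign = -sign;
        }
        return Math.min(1.0, Math.max(0.0, 2.0 * sum));
    }
}
```

With this sketch, a call would look like pValue(statistic(part1, part2), part1.length, part2.length). The cost is dominated by the two sorts, which is why the inexact route stays fast for 50,000 points per sample. The result is an approximation and can differ from the exact p-value, most noticeably for small samples.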
When I did profiling (to identify the code sections with performance problems), I also encountered an integer overflow bug for bigger data sets. I have fixed this bug in v0.3.7 as well.
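I have not said where exactly the overflow occurred, so the snippet below is only a generic illustration of how this class of bug shows up with samples of this size, not the actual JDistlib fix. It assumes a computation that needs the product of the two sample sizes; the class name SampleSizeProduct is hypothetical. With 50,000 points per sample the product is 2,500,000,000, which exceeds Integer.MAX_VALUE (2,147,483,647).

```java
/** Illustrative only: int overflow when multiplying two sample sizes. */
final class SampleSizeProduct {

    /** Buggy: the multiplication is done in int and silently wraps around. */
    static int asInt(int n, int m) {
        return n * m;
    }

    /** Safe: widen to long before multiplying. */
    static long asLong(int n, int m) {
        return (long) n * m;
    }

    /** Defensive: throws ArithmeticException instead of wrapping. */
    static int checked(int n, int m) {
        return Math.multiplyExact(n, m);
    }
}
```

For n = m = 50000, asInt returns -1794967296 while asLong returns 2500000000. A wrapped negative value used as a loop bound or array size tends to produce wrong results or an exception only on large inputs, which is why such bugs often surface during profiling with bigger data sets.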
Thank you for the bug report.