
#19 Kolmogorov-Smirnov test poor performance

Milestone: 1.0
Status: closed
Updated: 2014-12-11
Created: 2014-12-09
Creator: Anonymous
Private: No

I'm using the Kolmogorov-Smirnov implementation to calculate the p-value:
ks_pvalue = DistributionTest.kolmogorov_smirnov_test(part1, part2)[1];
part1 and part2 are each a double[50000].
I timed the call and it took 14,500 ms (i.e. 14.5 seconds).
This is very slow: in R the same test takes under 1 second, and the JDistLib Bartlett implementation on the exact same data takes ~200 ms.
Do you know of any reason for this performance issue?
Thanks in advance,
Gilad

Discussion

  • Roby Joehanes

    Roby Joehanes - 2014-12-11

    Thank you for your bug report. The reason is that R always switches to an inexact p-value computation for data sets this large, whereas JDistlib always uses the exact p-value method. So it is not a JDistlib bug per se. I recognize that computing exact p-values may not be desirable in some situations, especially in time-critical applications, so I have added an option in version 0.3.7 that allows inexact p-value computation to save time. The default remains the exact method, however. Also, be aware that exact and inexact p-values differ, which matters especially for multiple testing (false discovery methods are notoriously sensitive to minute changes in p-values).
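For readers wondering what an inexact p-value looks like in practice, here is a minimal, self-contained sketch of the classical asymptotic (Kolmogorov series) approximation for the two-sample test. It is not JDistlib's code and does not use JDistlib's API; the class and method names (`KsAsymptoticSketch`, `statistic`, `asymptoticPValue`) are made up for illustration. For comparison, R's `ks.test`, per its documentation, defaults to the exact method in the two-sample case only when the product of the sample sizes is below 10000, which is why two samples of 50000 take the fast path there.

```java
import java.util.Arrays;

// Illustrative sketch only: hypothetical names, not part of JDistlib.
final class KsAsymptoticSketch {

    /** Two-sample KS statistic D = max |F1(x) - F2(x)| over the pooled sample. */
    static double statistic(double[] a, double[] b) {
        double[] x = a.clone();
        double[] y = b.clone();
        Arrays.sort(x);
        Arrays.sort(y);
        int i = 0, j = 0;
        double d = 0.0;
        while (i < x.length && j < y.length) {
            double v = Math.min(x[i], y[j]);
            // Step both empirical CDFs past every observation equal to v (handles ties).
            while (i < x.length && x[i] <= v) i++;
            while (j < y.length && y[j] <= v) j++;
            double diff = Math.abs((double) i / x.length - (double) j / y.length);
            d = Math.max(d, diff);
        }
        return d;
    }

    /**
     * Asymptotic two-sided p-value:
     * Q(lambda) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lambda^2),
     * with lambda = sqrt(n*m/(n+m)) * D.
     */
    static double asymptoticPValue(double d, int n, int m) {
        // Widen to double before multiplying so n*m cannot overflow int.
        double lambda = Math.sqrt((double) n * m / ((double) n + m)) * d;
        // The series converges slowly near zero; the p-value is 1 to within ~1e-12 there.
        if (lambda < 0.2) return 1.0;
        double sum = 0.0;
        double sign = 1.0;
        for (int k = 1; k <= 100; k++) {
            double term = sign * Math.exp(-2.0 * k * k * lambda * lambda);
            sum += term;
            if (Math.abs(term) < 1e-15) break;
            sign = -sign;
        }
        return Math.max(0.0, Math.min(1.0, 2.0 * sum));
    }
}
```

The cost here is dominated by the two sorts, so it scales as O(n log n) rather than with the product of the sample sizes. The trade-off is the one noted above: the result is an approximation and will not match the exact p-value digit for digit.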

    While profiling the code to find the sections with performance problems, I also found an integer overflow bug that affects larger data sets. I have fixed this bug in v0.3.7 as well.
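The thread does not say where the overflow occurred, so the following is only a generic illustration of how this kind of bug can arise with samples of this size, not a description of the actual JDistlib fix. The class and method names are hypothetical. With two samples of 50000, the product of the sample sizes is 2,500,000,000, which exceeds the largest `int` (2,147,483,647).

```java
// Illustrative sketch only: hypothetical names, not taken from JDistlib.
final class SampleSizeProductSketch {

    /** Plain int multiplication: silently wraps around on overflow. */
    static int productAsInt(int n, int m) {
        return n * m;
    }

    /** Widening one operand to long first keeps the full product. */
    static long productAsLong(int n, int m) {
        return (long) n * m;
    }

    /** Math.multiplyExact throws ArithmeticException instead of wrapping. */
    static int productChecked(int n, int m) {
        return Math.multiplyExact(n, m);
    }
}
```

For n = m = 50000, `productAsInt` returns -1794967296, `productAsLong` returns 2500000000, and `productChecked` throws an `ArithmeticException`. A negative or wrapped value used as an array size or loop bound is a typical way such a bug surfaces only on large inputs.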

    Thank you for the bug report.

     
  • Roby Joehanes

    Roby Joehanes - 2014-12-11
    • status: open --> closed
     
