
#19 Kolmogorov-Smirnov test poor performance

1.0
closed
2014-12-11
2014-12-09
Anonymous
No

I'm using the Kolmogorov-Smirnov implementation to calculate the p-value:
ks_pvalue = DistributionTest.kolmogorov_smirnov_test(part1, part2)[1];
part1 and part2 are each double[50000].
I timed the call and it took 14,500 ms (i.e. 14.5 seconds).
That is very slow: in R the same test takes under 1 second, and the JDistLib Bartlett implementation takes ~200 ms on the exact same data.
Do you know of any reason for this performance issue?
Thanks in advance,
Gilad

Discussion

  • Roby Joehanes

    Roby Joehanes - 2014-12-11

    Thank you for your bug report. The reason is that R switches to an inexact p-value computation whenever the data sets are as large as these, whereas JDistlib always uses the exact p-value method. So it is not a JDistlib bug per se. I recognize that computing exact p-values may not be desirable in some situations, especially in time-critical applications, so I have added an option in version 0.3.7 that allows inexact p-value computation to save time. The default will remain the exact method, however. Also, you need to be aware of the difference between exact and inexact p-values, especially when it comes to multiple testing, since false discovery rate methods are notoriously sensitive to minute changes in p-values.

    When I did profiling (to identify the code sections with performance problems), I also encountered an integer overflow bug for bigger data sets. I have fixed this bug in v0.3.7 as well.

    Thank you for the bug report.
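
For readers wondering what an "inexact" p-value looks like in practice, here is a minimal sketch of the usual asymptotic approximation for the two-sample test, based on the limiting Kolmogorov distribution. It is not JDistlib's code and does not show the option added in 0.3.7; the class and method names (`KsAsymptoticSketch`, `statistic`, `asymptoticPValue`) are hypothetical. The sketch also multiplies the sample sizes in `double`, because `50000 * 50000` does not fit in an `int`. Whether that product is where the overflow mentioned above occurred is an assumption; the thread does not say.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of JDistlib: asymptotic two-sample KS p-value. */
final class KsAsymptoticSketch {

    /** Two-sided statistic D = max |F1(v) - F2(v)| over the pooled sample values. */
    static double statistic(double[] x, double[] y) {
        double[] a = x.clone();
        double[] b = y.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        int i = 0, j = 0;
        double d = 0.0;
        while (i < a.length && j < b.length) {
            double v = Math.min(a[i], b[j]);
            // Step past every value <= v in both samples so ties move together.
            while (i < a.length && a[i] <= v) i++;
            while (j < b.length && b[j] <= v) j++;
            double diff = Math.abs((double) i / a.length - (double) j / b.length);
            d = Math.max(d, diff);
        }
        return d;
    }

    /** Asymptotic p-value: Q(lambda) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lambda^2). */
    static double asymptoticPValue(double d, int n, int m) {
        // Multiply in double: n * m as int overflows once it exceeds 2^31 - 1.
        double lambda = d * Math.sqrt((double) n * (double) m / ((double) n + (double) m));
        if (lambda < 0.2) {
            return 1.0; // the series converges slowly here and Q is 1 to about 12 digits
        }
        double sum = 0.0;
        double sign = 1.0;
        for (int k = 1; k <= 100; k++) {
            double term = Math.exp(-2.0 * k * k * lambda * lambda);
            sum += sign * term;
            if (term < 1e-16) break; // remaining terms are negligible
            sign = -sign;
        }
        return Math.min(1.0, Math.max(0.0, 2.0 * sum));
    }

    public static void main(String[] args) {
        double[] part1 = {0.1, 0.4, 0.7, 1.2, 1.9};
        double[] part2 = {0.3, 0.9, 1.5, 2.2, 2.8};
        double d = statistic(part1, part2);
        System.out.println("D = " + d);
        System.out.println("asymptotic p = " + asymptoticPValue(d, part1.length, part2.length));
    }
}
```

The cost here is dominated by sorting, so it scales as O(n log n) in the sample size. The approximation is only trustworthy for large samples, and its result can differ slightly from the exact p-value, which is the caveat about multiple testing raised above.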

     
  • Roby Joehanes

    Roby Joehanes - 2014-12-11
    • status: open --> closed
     
