Menu

#360 subtle & rare bug in vsearch

critical
closed-accepted
None
1
2014-12-20
2014-08-19
Diab Jerius
No

There's a rare bug in binary search algorithms having to do with integer overflow in the calculation of the midpoint of the search range. See http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html

The offending code in PDL's vsearch is:

m = (jh+jl) >> 1;

where jh & jl are declared as having type PDL_Indx. At least on my 64bit linux box, PDL_Indx is declared as (signed) long. The fix (as per the above post) is to coerce them into unsigned longs, and then perform the shift, e.g.

m = ((unsigned PDL_Indx) jh + (unsigned PDL_Indx) jl) >> 1;

(which of course won't work).

It doesn't look like there's an unsigned version of PDL_Indx defined. To prevent future catastrophes, it'd be nice if that were done.

Discussion

  • Derek Lamb

    Derek Lamb - 2014-08-19

    That blog post mentions that any divide-and-conquer function could possibly have the same bug. We should also look at qsort & related subs: they all have a line that looks like "median = (i+j)/2". There may be others.

     
    • Chris Marshall

      Chris Marshall - 2014-08-23

      I would categorize this as a "bug waiting to happen". :-)

      My preference for a fix is the better alternative from the
      blog post. It is clearer since it is explicit that the code
      cannot overflow for non-negative values of low. The
      version with (unsigned PDL_Indx) is less clear since the
      behavior comes from mandated C language performance
      in overflow.

      mid = low + ((high - low) / 2);

      We definitely could use code scrubbing to ensure this
      isn't going to bite us somewhere else.

      --Chris

       
  • Diab Jerius

    Diab Jerius - 2014-08-25

    Replacing the bit shift by an addition and a division operation results in a substantial performance hit. I've attached some test code. Here are the results of running it as

    % time ./time 100 && time ./time 100 1
    divide

    real 0m15.859s
    user 0m15.817s
    sys 0m0.012s
    shift

    real 0m11.461s
    user 0m11.433s
    sys 0m0.004s

    To be honest, without any comments, either version

    mid = ((unsigned int)low + (unsigned int)high)) >> 1;

    mid = low + ((high - low) / 2);

    is unsatisfactory, and since both have to be documented, I'd choose the faster version over the slower one. Proper documentation about the effects of C unsigned integer overflow should alleviate any confusion about the code.

     
  • Chris Marshall

    Chris Marshall - 2014-08-25

    Fixed in git.

    Thanks for the timing code. With -O3 I get essentially no difference in performance and there is no requirement to introduce an unsigned PDL_Indx datatype to "fix" the problem.

    That said, improved types support for PDL3 should revisit this decision for development going forward.

     
    • Diab Jerius

      Diab Jerius - 2014-08-25

      Good catch on the optimization. Thanks.

       
  • Chris Marshall

    Chris Marshall - 2014-08-25
    • status: open --> open-fixed
    • assigned_to: Chris Marshall
    • Priority: 5 --> 1
     
  • Chris Marshall

    Chris Marshall - 2014-12-20
    • status: open-fixed --> closed-accepted
     
  • Chris Marshall

    Chris Marshall - 2014-12-20

    Its in git for the future. Thanks for observing and reporting the problem.

     

Log in to post a comment.