Menu

#404 rounding errors in linear model from "stats"

5.0
open
nobody
stats (1)
5
2014-08-18
2014-07-23
No

This small, completely linear dataset gives NaN for the errors of slope and intercept in "stats" linear model:

stats "-"
1 2
2 3
4 5
e

I took a look at the code in "src/stats.c" (around line 320) and the one thing that sprang to mind was that line 318

double ss = (ssyy - res.slope * ssxy) / (n - 2);

could become negative by some rounding errors, so that the calculation of the errors (following lines) would include the root of a negative number.

Funnily, my self-compiled gnuplot5 (mingw/win7) gives a small (6e-10) positive error for the same dataset. Above is with Tatsuro´s gp5 binary from 2014-07-14. It´s the other way round with this one

stats "-"
1.1 2.2
2.2 3.3
4.4 5.5
e

Most other linear data triplets i tried just return errors zero.

I´m not sure how to tackle this, however, or if it should be fixed at all. In most cases, i guess, people know if their data is perfectly linear. ymmv

(errors in the range of 1e-7 can already trigger the described problem, e.g

stats "-"
1 2
2 3
4 5.0000001

. Who does measurements that are so good? ;-) )

Discussion

  • Ethan Merritt

    Ethan Merritt - 2014-07-23

    What exactly is the "bug" here?

    Are you saying that on some platform the rounding error can cause the program to fail due to floating underflow or sqrt of a negative number?

    Have you tried adding a test something like the one below?

    if (fabs(ss) < DBL_EPSILON)
    ss = 0.0;

     
  • Karl Ratzsch

    Karl Ratzsch - 2014-07-24

    Doesn´t make much difference, because ss contains ssyy - res.slope * ssxy, and ssyy and ssxy already are defined as such error-prone differences. So the uncertainity is somewhat bigger.

    One could do

    if (ss < 0) ss = 0.0;    /* negative values can occur due to fp errors */
    

    but that seems a bit half-hearted to me.

    I think it´d be best to just throw a warning "Variance in linear model is below FP precision, errors might be invalid." if abs(ss) is below a certain value. But for this someone (with better knowledge of fp math than me) would have to derive that value.

    Don´t know if that´d be worthwhile. ;-)

     
    • Hans-Bernhard Broeker

      if abs(ss) is below a certain value.

      That criterion would be worse than useless. The result of abs(ss) changes in proportion if you multiply all input y values by a common factor. No fixed "certain value" could possibly be correct.

       
  • Hans-Bernhard Broeker

    This small, completely linear dataset gives NaN for the errors of slope and intercept in "stats" linear model:

    And IMHO, that's a perfectly sensible outcome. Displaying anything else instead of the actual NaN that resulted would, to no small extent, be doing a disservice to the user.

     

    Last edit: Hans-Bernhard Broeker 2014-07-24
  • Karl Ratzsch

    Karl Ratzsch - 2014-07-24

    Acknowledged, but with other similar data you get error zero or a small positive number. And if the data is only very nearly linear (last example above), the reported error can be grossly wrong. Some measurements are precise to the tenth digit, it could happen with "real" data, i say.

    It´d be good to warn the user if the reported error is unreliable. One only needs to come up with a safe criterion for detection. I´ll try to work this out sometime.

    Can we move this to "Feature Requests" in the meantime? I agree it´s by no means a real "bug".

     
  • Ethan Merritt

    Ethan Merritt - 2014-08-18

    Ticket moved from /p/gnuplot/bugs/1447/

     

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.