#1238 Weird behaviour when combining smooth and NaN

Milestone: 5.0
Status: closed-out-of-date
Owner: nobody
Labels: None
Priority: 5
Updated: 2014-08-22
Created: 2013-05-04
Creator: Ulrik Falklöf
Private: No

Hi,

With a file containing a mix of numbers and NaN values, the smooth function works as I expect for this command:

  plot "foo.dat" using 1:2 smooth unique



But the smooth function gives a very different result for the equivalent (?) command:

  plot "foo.dat" using ($1):($2) smooth unique



For the first command, all points with NaN are silently ignored and one smoothed curve is drawn based on the remaining points. But for the second command, the curve is split into multiple parts at the NaN values. Each part is smoothed independently of the others, and the parts are drawn on top of each other.

There are also similar problems with filter expressions resulting in NaN values, such as:

  plot "foo.dat" using 1:($3 > 1 ? $2 : NaN) smooth unique



I've been using GnuPlot 4.6.3 compiled from source on Linux.

Question 1: The smooth algorithm functions in interpol.c use UNDEFINED points to split a curve into separate parts, via the next_curve function. Is that by design? If so, why?

Question 2: The function datafile.c:df_readascii handles NaN in data files differently depending on if NaN is retrieved via an action table or not. If there is an action table it will return DF_UNDEFINED, but otherwise it will set line_okay = 0 and silently ignore the line. Is that by design? If so, why?

Question 3A: When the filter expression (in the last command above) evaluates to NaN, the action table function returns NaN, but that is not detected until axis.h:STORE_WITH_LOG_AND_UPDATE_RANGE, which keeps the point and marks it as UNDEFINED. Is the NaN supposed to be detected earlier, by the comparison between temp and VERYLARGE in eval.c:evaluate_at? If so, that comparison does not work for me (VERYLARGE=DBL_MAX/2-1, GCC=4.6.3, x86), whereas the similar negated check in axis.h:STORE_WITH_LOG_AND_UPDATE_RANGE does.

Question 3B: Are filter expressions that evaluate to NaN supposed to result in ignored lines or in points marked as UNDEFINED? Since I'm working with large data sets, which are filtered by expressions such as ($3 > 1 ? $2 : NaN), I would prefer that the filtered data be ignored instead of being stored as UNDEFINED points and wasting memory and CPU.

Would you accept a patch that would change the behaviour mentioned above?

Cheers,
Ulrik Falklöf

Discussion

  • Ethan Merritt
    2013-05-04

    Thank you for the bug report. Please consider this as only a preliminary set of answers to your questions. There are real issues here that will take more than a quick look to resolve completely.

    Question 1: The smooth algorithm functions in interpol.c use UNDEFINED points to split a curve into separate parts, via the next_curve function. Is that by design? If so, why?

    I believe that the intent was to split the curve based on the presence of blank lines. A problem arises because the definition of "blank line" was not adequately established. Note for example the comments in datafile.c

       /* EAM - Oct 2002 Distinguish between
        * DF_MISSING and DF_BAD.  Previous versions
        * would never notify caller of either case.
        * Now missing data will be noted. Bad data
        * should arguably be noted also, but that
        * would change existing default behavior.  */
    

    There are already some hooks in the lower level routines (datafile.c:df_readline) but the information is not returned to the callers (plot2d.c plot3d.c). This would be a good time to explore what improvements are possible.
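
    To make the two behaviours from the report concrete, here is a minimal C sketch of the two possible policies for UNDEFINED points: treating them as separators, or skipping them. The types and function names are hypothetical and are not taken from interpol.c; the sketch only counts the segments that a smoothing routine would end up processing.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical point status, loosely modelled on the UNDEFINED marking
     * discussed above; these are not gnuplot's actual declarations. */
    typedef enum { PT_INRANGE, PT_UNDEFINED } pt_status;

    typedef struct {
        double x, y;
        pt_status status;
    } point;

    /* Policy 1: an UNDEFINED point ends the current segment, so every run of
     * valid points is smoothed on its own. Returns the number of segments. */
    size_t count_segments_split(const point *pts, size_t n)
    {
        size_t segments = 0;
        int in_segment = 0;

        assert(pts != NULL || n == 0);
        for (size_t i = 0; i < n; i++) {
            if (pts[i].status == PT_UNDEFINED) {
                in_segment = 0;          /* close the current run */
            } else if (!in_segment) {
                in_segment = 1;          /* first valid point of a new run */
                segments++;
            }
        }
        return segments;
    }

    /* Policy 2: UNDEFINED points are skipped, so all valid points belong to
     * a single segment. Returns 1 if any valid point exists, otherwise 0. */
    size_t count_segments_skip(const point *pts, size_t n)
    {
        assert(pts != NULL || n == 0);
        for (size_t i = 0; i < n; i++)
            if (pts[i].status != PT_UNDEFINED)
                return 1;
        return 0;
    }
    ```

    For a data set with two isolated UNDEFINED points in the middle, the first policy yields three independently smoothed pieces, while the second yields one curve, which matches the difference between the two plot commands in the report.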

    Question 2: The function datafile.c:df_readascii handles NaN in data files differently depending on if NaN is retrieved via an action table or not. If there is an action table it will return DF_UNDEFINED, but otherwise it will set line_okay = 0 and silently ignore the line. Is that by design? If so, why?

    I do not know what the original intent was, but I agree with you that it is undesirable. Changing this was judged to break backward compatibility and hence postponed until the next major release of gnuplot (version 5). It has already been changed in the CVS source for the development version, and is documented under "New features" and "missing" (p. 105 in the pdf version)
    http://gnuplot.sourceforge.net/gnuplot_cvs.pdf
    It may be that additional changes to the code are needed to catch all problem cases, so your testing is welcome.

    Question 3A: When the filter expression (in the last command above) evaluates to NaN, the action table function returns NaN, but that is not detected until axis.h:STORE_WITH_LOG_AND_UPDATE_RANGE, which keeps the point and marks it as UNDEFINED. Is the NaN supposed to be detected earlier, by the comparison between temp and VERYLARGE in eval.c:evaluate_at?

    The line of code in evaluate_at() is clearly an error. Amazing that it wasn't fixed at the same time as the code in axis.h. In fact that whole routine looks like it needs an overhaul, since the comments indicate hackery aimed at ancient versions of various platforms. I'll have a serious look at it.
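
    The difference between the two checks follows from IEEE 754 comparison semantics: every ordered comparison involving NaN is false. Below is a minimal sketch, assuming the check in evaluate_at() has the positive form and the one in axis.h has the negated form. The exact source expressions are not reproduced here, the function names are hypothetical, and VERYLARGE is defined as quoted in the report.

    ```c
    #include <float.h>
    #include <math.h>

    /* Value as quoted in the report; assumed, not copied from the source. */
    #define VERYLARGE (DBL_MAX / 2 - 1)

    /* Positive form: flags only values known to lie outside the range.
     * Both comparisons are false for NaN, so NaN is NOT flagged. */
    int out_of_range_positive(double v)
    {
        return v > VERYLARGE || v < -VERYLARGE;
    }

    /* Negated form: flags everything not known to lie inside the range.
     * The inner test is false for NaN, so NaN IS flagged. */
    int out_of_range_negated(double v)
    {
        return !(v > -VERYLARGE && v < VERYLARGE);
    }

    /* Explicit form: states the NaN case directly instead of relying on
     * how the range comparison happens to be written. */
    int out_of_range_explicit(double v)
    {
        return isnan(v) || out_of_range_positive(v);
    }
    ```

    All three functions agree for ordinary finite values and for infinities; they differ only for NaN, where the positive form returns 0 and the other two return 1.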

    Question 3B: Are filter expressions that evaluate to NaN supposed to result in ignored lines or in points marked as UNDEFINED? Since I'm working with large data sets, which are filtered by expressions such as ($3 > 1 ? $2 : NaN), I would prefer that the filtered data be ignored instead of being stored as UNDEFINED points and wasting memory and CPU.

    This is complicated, and again the details have already been changed in the development version in preparation for version 5. For one thing, some plot types (image plots, heat maps, gridded data) come out mangled if NaN points are silently omitted rather than being passed through as UNDEFINED. Another issue is that deleting or omitting points at the time they are read in prevents re-using them in place if the filter criteria change. E.g. it would be nice to do

    filter(x) = something
    plot LARGEDATASTREAM using 1 : 2 : (filter($1) ? foo : baz)
    filter(x) = something_else
    refresh
    

    Note that the "refresh" command uses the previously stored data rather than rereading the input stream. This doesn't currently work in the form shown, but I'd like to aim for that.
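
    The trade-off can be sketched in C as follows. This is only an illustration of the idea, not the way gnuplot stores data: the struct, the status flag, and both example filters are hypothetical. The point is that a value kept in memory and merely flagged can be re-evaluated against a new filter, whereas a value dropped at read time cannot.

    ```c
    #include <assert.h>
    #include <math.h>
    #include <stddef.h>

    /* Hypothetical per-point flag; not gnuplot's actual declarations. */
    typedef enum { PT_INRANGE, PT_UNDEFINED } point_flag;

    typedef struct {
        double x, y;        /* values as originally read from the input */
        point_flag status;  /* outcome of the most recent filter pass */
    } stored_point;

    typedef int (*filter_fn)(double x);

    /* Re-evaluate a filter over points that are already stored, without
     * re-reading the input. Points rejected by an earlier filter stay in
     * the array and can become valid again. Returns the number kept. */
    size_t reapply_filter(stored_point *pts, size_t n, filter_fn keep)
    {
        size_t kept = 0;

        assert(pts != NULL || n == 0);
        assert(keep != NULL);
        for (size_t i = 0; i < n; i++) {
            if (!isnan(pts[i].y) && keep(pts[i].x)) {
                pts[i].status = PT_INRANGE;
                kept++;
            } else {
                pts[i].status = PT_UNDEFINED;
            }
        }
        return kept;
    }

    /* Two example filters, standing in for "something" and "something_else". */
    int keep_above_one(double x)   { return x > 1.0; }
    int keep_below_three(double x) { return x < 3.0; }
    ```

    Calling reapply_filter() first with keep_above_one and then with keep_below_three on the same array changes which points are flagged without touching the stored x and y values. The cost is the one raised in the question: rejected points still occupy memory.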

    Would you accept a patch that would change the behaviour mentioned above?

    Yes, but it would only make sense to patch relative to the development version (4.7) since some of this behaviour has already been changed. Equally important would be a set of test scripts and data demonstrating the desired before/after behaviour, so that alternative patches can be evaluated.

     
  • Ethan Merritt
    2014-03-14

    • status: open --> closed-out-of-date
     
  • Ethan Merritt
    2014-03-14

    I believe that these issues have been addressed in 4.7 and will be present in the 5.0 release.

    If I'm wrong about that, please re-open this with a specific test case that shows the remaining problem.