When a plot command processes input columns through a using spec, expressions inside parentheses are evaluated. If a $N (or column(N)) reference within such an expression returns NaN, the entire expression is currently treated as undefined, even if the final evaluated result is a valid number.
I don't know if I should call it a bug or if it is a side effect of the spec.
This behavior seems to have been introduced after Bug #1896 and the related commit.
Here's an example:
$data <<EOD
1 5
2 4
3 3
4 NaN
5 NaN
6 3
7 5
EOD
Suppose we want to plot columns 1 and 2 with 'with boxes', replacing NaN values with 0 using an expression in the using spec.
This command works correctly:
plot $data using 1:(valid(2)?$2:0) with boxes
It replaces NaN at rows 4 and 5 with 0 as expected.
However, the following command fails, even though the final result of the expression should be the same:
isnan(x) = (x==x)?0:1
plot $data using 1:(isnan($2)?0:$2) with boxes
Here is a contrived example that illustrates the situation:
plot $data using 1:($2,$1) with boxes
Although the initial $2 reference in the comma expression has no effect on the final result, it seems to trigger an undefined flag due to the NaN, which causes the entire expression to be treated as invalid.
It is unclear whether this behavior is intentional or a bug, but ideally the validity of an expression should depend on its final result, not on intermediate evaluations. Use cases are few, but it is hard to avoid this failure without already knowing about the behavior.
Full sample script:
$data <<EOD
1 5
2 4
3 3
4 NaN
5 NaN
6 3
7 5
EOD
set xrange [0:8]
set yrange [0:10]
set title "using 1:(valid(2)?$2:0)"
plot $data using 1:(valid(2)?$2:0) with boxes
pause -1 "return press key"
set title "using 1:(isnan($2)?0:$2)"
isnan(x) = (x==x)?0:1
plot $data using 1:(isnan($2)?0:$2) with boxes
pause -1 "return press key"
set title "using 1:(isnan(value[$0+1])?0:$2)"
array value[7] = [5,4,3,NaN,NaN,3,5]
plot $data using 1:(isnan(value[$0+1])?0:$2) with boxes
pause -1 "return press key"
set title "using 1:($2,$1)"
plot $data using 1:($2,$1) with boxes
pause -1 "return press key"
Another sample script:
$data <<EOD
1 3 4
2 4 NaN
3 8 NaN
4 9 NaN
5 10 NaN
6 12 NaN
7 16 NaN
8 19 NaN
9 22 NaN
10 24 10
EOD
a = 0
stats $data using (a=a+$3, $1) nooutput
print a
print STATS_invalid ### 8 (expected 0)
print STATS_sum ### 11.0 (expected 55.0)
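A possible workaround for the stats script above, in the same spirit as the valid(2) form that works for plot, is to guard the column reference so that $3 is never evaluated on rows where column 3 holds NaN. This is only a sketch: it assumes valid(3) behaves inside stats the way valid(2) does in the plot example, and it has not been checked against every gnuplot version.

```gnuplot
# Sketch of a guarded accumulation (assumption: valid(3) returns 0 on NaN rows).
$data <<EOD
1 3 4
2 4 NaN
3 8 NaN
4 9 NaN
5 10 NaN
6 12 NaN
7 16 NaN
8 19 NaN
9 22 NaN
10 24 10
EOD
a = 0
# $3 is referenced only when column 3 holds a valid number, so the NaN rows
# should never set the undefined flag for the expression as a whole.
stats $data using (a = a + (valid(3) ? $3 : 0), $1) nooutput
print a
print STATS_invalid
print STATS_sum
```

If the guard works as intended, no row should be counted as invalid and a should hold the sum of the valid entries in column 3.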
There may be several related issues here, but the heart of the matter is your very pertinent observation that the comma operator for serial evaluation is not acting as a proper sequence point. In particular, error conditions are not cleared in between successive expressions in (<exp1>,<exp2>,<exp3>,...).
This is fixable, and the fix is now in version 6.1 via commit 2cf853b028.
With that in place, it is possible to modify the definition of isnan in your sample script as follows:
isnan(X) = (TEMP=((X==X)?0:1),TEMP)
With this change all the plots in your first sample script work as I think you expected them to, and the second script using stats also runs as expected.
It may well be, however, that other unexpected (or poorly documented) cases exist. In general gnuplot 6 tries to figure out whether or not an input data value that is missing/NaN/undefined is required in order to plot that point. If it is required, the point is marked as invalid. If it is not required, the point is considered valid and the result (usually NaN) is stored in the corresponding slot of the data structure for that point. The problem is, what exactly does "required" mean? Both x and y are obviously required in order to plot a point at [x,y]. But if the plot command references a z value and the z value is missing, what then? For example
The point can be drawn, but the missing or NaN z value affects the sort order. Does that make the data point valid or invalid? As it happens, in this case the points are sorted (incorrectly) using some artefactual value that ends up in z. I think this is neither obvious nor documented.
If you have further thoughts on the matter, please attach them here or raise the issue for discussion on the gnuplot-beta mailing list.
Last edit: Ethan Merritt 2025-04-16
Thank you for looking into this so quickly.
Here is how I think missing and invalid data should be handled in the using specification.
Handling of Missing and Invalid Data in Parentheses Evaluation in the using Specification
Missing Data
When missing data is referenced using $N or column(N), the evaluation of the parentheses is determined as missing, even during intermediate steps of sequential evaluation. If we consider missing data as inherently inaccessible, then references to missing data via other column access functions, such as valid(N), strcol(N), or timecolumn(N, "timefmt"), should result in the same outcome.
Invalid Data
When invalid data is referenced using $N or column(N), the evaluation of the parentheses is not immediately determined as invalid at that point. Intermediate results during sequential evaluation do not affect the outcome; only the final result determines the evaluation of the parentheses. Invalid data is represented as NaN, which can be used in comparisons and arithmetic operations with other numeric values.
There may be differences of opinion. In particular, the part regarding 'valid', 'strcol', and 'timecolumn' may have significant impact due to potential compatibility issues.
Aside from that, I believe the behavior can be implemented by applying the following modifications in addition to commit "3e1f93bd".
While this change might unintentionally interfere with the intended behavior of the original fix for Bug #1896, the rationale behind the current behaviour of '$N' is still unclear to me.
This is a script for verifying the differences between missing and invalid data.
In cases 1 and 3, NaN values in the data file are treated as missing, so we expect bars to be drawn only at x = 1, 3, and 7. In cases 2 and 4, NaN values in the data file are treated as invalid, so we expect bars to be drawn for all x values from 1 to 7. The behavior differs before and after applying the patch. As mentioned above, 'valid(N)' does not trigger the missing flag, so after applying the patch, case 3 will produce the same plot as cases 2 and 4.
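The missing/invalid distinction itself can be seen with a much smaller sketch. This is not the verification script from this post; it reuses the $data block from the first message and assumes that 'set datafile missing NaN' is the mechanism for treating NaN fields as missing.

```gnuplot
# Sketch only: a minimal contrast between invalid and missing NaN fields.
$data <<EOD
1 5
2 4
3 3
4 NaN
5 NaN
6 3
7 5
EOD
set xrange [0:8]
set yrange [0:10]
set multiplot layout 2,1

# Default: NaN fields in the input are invalid data.
set title "NaN treated as invalid"
plot $data using 1:2 with linespoints

# Assumption: this is how NaN fields are switched to "missing".
set datafile missing NaN
set title "NaN treated as missing"
plot $data using 1:2 with linespoints

unset multiplot
```

The gnuplot documentation for 'set datafile missing' states that invalid data causes a gap in a line drawn through sequential points while missing data does not, so the two panels are expected to differ between x = 3 and x = 6.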
I am confused.
I wonder if you are seeing the same result from running that script that I do?
You say "after applying the patch, case 3 will produce the same plot as case 2 and 4", but for me cases 2 and 4 yield different plots. This is true for both gnuplot 6.0 and 6.1 (output attached).
Using the current git tip, commit 2cf853b0, I can make the output for case 2 the same as for case 4 by correcting the definition of isnan() to use serial evaluation:
isnan(X) = (TEMP=((X==X)?0:1),TEMP)
But now cases 2, 3, and 4 all produce the same output even without your additional patch.
Are you seeing something different?
The script compares the values in the second and third columns, and plots the valid and larger one.
Attached are the outputs from gnuplot 6.0.2 and gnuplot 6.1.0 with patch [commit "3e1f93bd" & my patch]. These two figures match the expected changes resulting from the patch.
In "case1" and "case2", all input data is referenced using "column(N)". This is the usage pattern addressed by this ticket. Case1 involves reading missing data and is therefore unaffected by the patch. In contrast, case2 involves reading invalid data and shows changes after the patch.
In "case3" and "case4", "valid(N)" is used to ensure that "column(N)" is only accessed when the value is valid. When a value is invalid, the script avoids referencing "column(N)". As a result, case3 and case4 render correctly even in version 6.0.2, and no changes are observed with or without the patch.
Among the four figures, only case2 should be affected by the patch.
Figures are attached here.
It is the definition isnan(X) = (X==X)?0:1 that is incorrect. This does not work in a using specifier exactly because evaluation of X==X sets an error flag. This is exactly why the separate function valid(x) was introduced.
The recent change to serial evaluation makes it possible to define isnan(X) to work the way you want, which is good, and "fixes" case 2 of the multiplot example.
I am not following the rationale for the proposed patch. I can see how case (2) shows that it might be useful to provide a builtin function isnan(), but the patch would affect other cases than evaluation to NaN. It would suppress detection of other "invalid" cases as well, I think. I would have to hunt up or regenerate test scripts for the original problems that led to changes in that code section in 2017 (Bug #1896), 2018 (commit b8304eafc), and 2022 (commit c8d468de9).
I believe this function definition is actually working as intended.
To me, it seems that the reason "isnan($1)" doesn't behave as expected within a "using" clause is that referencing $1 when it points to an invalid value causes the undefined flag to be set. Am I missing something here?
I'd like to clarify that I didn't open this issue in order to be able to define "isnan". My concern is more about the side effect of referencing $N, which appears to be the actual issue.
As you mentioned, it's not entirely clear what parts of the code might be affected by my patch, so I agree that we need to proceed cautiously. However, the fact that merely referencing a value sets the undefined flag seems like a side effect to me.
I need to think about it some more and go back to look at the earlier modification.