From: Ethan A M. <me...@uw...> - 2022-03-04 08:18:01
I think, or hope anyway, that discussion will lead to at least a partial
solution going forward.

More history follows. Skip down to the TL;DNR line if you like.

Going back at least as far as gnuplot version 3.7.1 there is a mechanism
using "set datafile missing <foo>" to flag data points that should be
ignored during input. As originally implemented this mechanism worked as
intended for commands of the form

    plot FOO using 1:2

But in those days the program rejected any input that evaluated to NaN
(not-a-number).

Some time in version 5.2 gnuplot gained the flexibility to return all
requested columns to the user even if one of them contained NaN.
For instance NaN in the color field of palette-color or image data can be
handled as "don't draw this pixel", but dropping the pixel from the input
stream altogether would mess up the image raster.

Unfortunately this led to a new class of errors. Basically if the using
spec contains an expression that can't handle NaN then *not* skipping that
input line is bad also. See Bug #2042.

In commit b8304eaf I tried to handle this case also (so feel free to blame
me for ensuing complications). Here is the commit message:

> skip evaluation of using spec expression that depends on a missing value
>
> The presence of a "missing" flag in an input data field is easy to detect
> if it is encountered in place of a bare number ('using N'), but if it is
> encountered during evaluation of an expression ('using (func($N))') then the
> function evaluation may exit via int_error() before the missing value can be
> flagged for the calling code. Now we pre-screen expressions in a using spec
> for the presence of "$N" and note the dependence on column N in a new field
> use_spec.depends_on_column. During data input, if this field is non-zero
> the content can be checked for a missing value flag before trying to
> evaluate the expression is was found in.
> Bug #2042

This first appeared in version 5.4.0.

No one at the time raised the issue of breaking the valid(N) function.
That was clearly something we missed.

TL;DNR

I dislike the idea of going back to either the original behaviour
(always skip input line if a column is NaN or missing) or the
version 5.2 behaviour (see Bug #2042).

You have now pointed out the very real bug that one cannot use valid(N)
in an expression containing $N because the entire expression will be
skipped. So I dislike the way it is now, also.

Possible options:

- Revert commit b8304eaf.
  That would re-introduce Bug #2042 but would allow use of valid()
  to avoid it.

- Better documentation. Trying to explain all this is probably much too
  difficult, but we could at least warn that if you use valid(N) then you
  must also use column(N) rather than $N.

- At the cost of some hackery, I could add a check for the valid()
  function itself immediately before the check for $N added by the commit
  shown above. The logic would be "if we see the user is doing their own
  validity checks, we will not use the known-imperfect check that looks
  for $ signs". I could cook up a patch for that if you want to test it.

- Your notion of setting a default value for missing data is interesting.
  I don't know quite where that would go. A uniform default could be done
  by something like

      set datafile missing "?" default FOO

  But that is probably too global to be useful. You might want different
  defaults for different columns and for different input files.
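
For concreteness, the valid(N) problem and the column(N) workaround
discussed in this thread can be sketched as below. This is an untested
illustration that assumes the version 5.4 behaviour as described here.
The inline data and the helper name defcol() are made up for the example
and are not part of gnuplot.

```gnuplot
# Hypothetical inline data: column 2 carries the missing flag in row 2.
set datafile missing "?"
$DATA << EOD
1 10
2 ?
3 30
EOD

# Per the description above, the "$2" inside the expression marks the
# using spec as dependent on column 2, so row 2 is expected to be
# skipped before valid(2) is ever evaluated.
plot $DATA using 1:(valid(2) ? $2 : 0) with points title "dollar form"

# Workaround reported in this thread: column(2) is not seen by the
# pre-screen for "$N", so valid(2) is evaluated and row 2 is expected
# to be plotted with the default value 0.
plot $DATA using 1:(valid(2) ? column(2) : 0) with points title "column form"

# Hypothetical per-column default wrapped in a user-defined function.
# The name defcol is an assumption made for this example.
defcol(n, d) = valid(n) ? column(n) : d
plot $DATA using 1:(defcol(2, 0)) with points title "defcol"
```

A helper like defcol() lets each column and each plot command choose its
own default, which a single global default on "set datafile missing"
could not.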

still thinking

    Ethan

On Thursday, 3 March 2022 14:41:50 PST Juhász Péter wrote:
> On Wed, 2022-03-02 at 14:36 -0800, Ethan A Merritt wrote:
> > On Wednesday, 2 March 2022 06:37:26 PST Peter Juhasz wrote:
> > > On Wed, Mar 2, 2022 at 5:22 AM Ethan A Merritt <me...@uw...> wrote:
> > >
> > > > > Observations:
> > > > > - it's as if the mere presence of a $X in the specification
> > > > >   causes the datum to be marked as invalid, and dropped entirely,
> > > > >   if column X doesn't contain data, no matter what the rest of
> > > > >   the specification is.
> > > > This is intentional.
> > > [... detailed explanation snipped ...]
>
> This all sounds eminently horrible. I admit I never had to think into
> the various edge cases like `column($2)` and so on, but even
> acknowledging these, it seems to me that peeking ahead in the
> expression and just throwing away the datum if a column happens to be
> empty/invalid is an overly zealous, and in the end, suboptimal,
> approach, because it throws away that datum even if the invalid value
> ends up not influencing the result.
>
> If I understand you correctly, my use case (plotting a dataset with a
> potentially null column with a default value, so that a point is
> plotted for every row) is supposed to be impossible, and the method
> I've eventually stumbled on (`valid(N) ? column(N) : 0`) only works
> because of an implementation detail (that peeking ahead for `column()`
> was deemed too hard).
>
> And even accepting all this, the fact remains that there is an
> undocumented discrepancy in the user interface (`column(N)` vs `$N`
> behaves differently, even though the documentation states that the
> latter is just a shortcut to the former), and that all of this is not
> easily discoverable and confusing to the user.
>
> >
> > Any suggestions for improvement are welcome.
>
> To try to be constructive:
>
> - the state of affairs should be documented somewhere (though I'm not
>   sure under which keyword does it belong)
>
> - perhaps there could be a separate, explicit function for plotting a
>   column with a default value. My subconscious is trying to suggest a
>   defined-or operator like perl's `//`, but I realise that that would not
>   help much with the problems that already exist on the level of
>   expression parsing. But perhaps a two-argument alternative to
>   `column(X)` could be added, one that allows a default value to be
>   specified, e.g. `defaultcolumn(1, 3.14)`, or just make `column` accept
>   an optional second argument, like `timecolumn` does.
>
> I'm not sure how much value this second proposal would add, though,
> since the `valid ? column : default` syntax does work, but perhaps a
> separate explicit function would be useful if `column` was ever "fixed"
> to behave like the dollar operator.
>
> >
> > cheers,
> > Ethan
>
> best regards,
> Peter Juhasz

--
Ethan A Merritt
Biomolecular Structure Center, K-428 Health Sciences Bldg
MS 357742, University of Washington, Seattle 98195-7742