From: Ethan A M. <me...@uw...> - 2022-03-11 04:43:05
On Friday, 4 March 2022 00:17:15 PST Ethan A Merritt wrote:
> I dislike the idea of going back to either the original behaviour
> (always skip input line if a column is NaN or missing) or the version 5.2
> behaviour (see Bug #2042).
>
> You have now pointed out the very real bug that one cannot use
> valid(N) in an expression containing $N because the entire expression
> will be skipped.
> So I dislike the way it is now, also.
>
> Possible options:
>
> - Revert commit b8304eaf. That would re-introduce Bug #2042 but would
>   allow use of valid() to avoid it.
>
> - Better documentation. Trying to explain all this is probably much too
>   difficult, but we could at least warn that if you use valid(N) then
>   you must also use column(N) rather than $N.
>
> - At the cost of some hackery, I could add a check for the valid() function
>   itself immediately before the check for $N added by the commit shown above.
>   The logic would be "if we see the user is doing their own validity checks,
>   we will not use the known-imperfect check that looks for $ signs".
>   I could cook up a patch for that if you want to test it.

Eventually I realized that the valid() function goes all the way back
to gnuplot version 3.something, and it always had the property that
it couldn't catch "missing" values because they were discarded at an
earlier stage of the input.

I have come around to thinking that the only major issue here is the one
that Peter originally pointed out: that $N and column(N) are documented
as being identical but they were not. So commit c8d468de adds the same
initial check to column(N) that already exists for $N. In both cases
if column N contains a missing value flag then the data point is skipped
prior to evaluation of the expression in the "using" specifier.
So you cannot in practice catch such points by using

    valid(N) ? $N : <foo>

There is a secondary issue as pointed out earlier in this thread that
if the "missing" column is referenced only indirectly then the
pre-evaluation check doesn't catch it. I.e.

    N = 2
    filter(i) = (valid(i) ? column(i) : 0)
    plot FOO using 1:(filter(N))

This is way too complex for the parser to recognize that column 2 is
special prior to evaluating filter(N), so in this case the valid(i) test
would do something. But given that it would not do anything on the less
convoluted cases it would probably be a bad idea to use it in a script.
Better to disable any "missing" flag and then use
"set datafile missing NaN" instead.

The same commit c8d468de replaces the internal handling of this case
to make it (I hope) more robust.

        Ethan

>
> - Your notion of setting a default value for missing data is interesting.
>   I don't know quite where that would go. A uniform default could be
>   done by something like
>       set datafile missing "?" default FOO
>   But that is probably too global to be useful. You might want different
>   defaults for different columns and for different input files.
>
> still thinking
>
>         Ethan
>
> On Thursday, 3 March 2022 14:41:50 PST Juhász Péter wrote:
> > On Wed, 2022-03-02 at 14:36 -0800, Ethan A Merritt wrote:
> > > On Wednesday, 2 March 2022 06:37:26 PST Peter Juhasz wrote:
> > > > On Wed, Mar 2, 2022 at 5:22 AM Ethan A Merritt <me...@uw...>
> > > > wrote:
> > > >
> > > > > > Observations:
> > > > > > - it's as if the mere presence of a $X in the specification
> > > > > > causes the
> > > > > > datum to be marked as invalid, and dropped entirely, if column
> > > > > > X
> > > > > > doesn't contain data, no matter what the rest of the
> > > > > > specification is.
> > >
> > > This is intentional.
> > >
> > [... detailed explanation snipped ...]
> >
> > This all sounds eminently horrible.
> > I admit I never had to think into
> > the various edge cases like `column($2)` and so on, but even
> > acknowledging these, it seems to me that peeking ahead in the
> > expression and just throwing away the datum if a column happens to be
> > empty/invalid is an overly zealous, and in the end, suboptimal,
> > approach, because it throws away that datum even if the invalid value
> > ends up not influencing the result.
> >
> > If I understand you correctly, my use case (plotting a dataset with a
> > potentially null column with a default value, so that a point is
> > plotted for every row) is supposed to be impossible, and the method
> > I've eventually stumbled on (`valid(N) ? column(N) : 0`) only works
> > because of an implementation detail (that peeking ahead for `column()`
> > was deemed too hard).
> >
> > And even accepting all this, the fact remains that there is an
> > undocumented discrepancy in the user interface (`column(N)` vs `$N`
> > behaves differently, even though the documentation states that the
> > latter is just a shortcut to the former), and that all of this is not
> > easily discoverable and confusing to the user.
> >
> > >
> > > Any suggestions for improvement are welcome.
> >
> > To try to be constructive:
> >
> > - the state of affairs should be documented somewhere (though I'm not
> > sure under which keyword does it belong)
> >
> > - perhaps there could be a separate, explicit function for plotting a
> > column with a default value. My subconscious is trying to suggest a
> > defined-or operator like perl's `//`, but I realise that that would not
> > help much with the problems that already exist on the level of
> > expression parsing. But perhaps a two-argument alternative to
> > `column(X)` could be added, one that allows a default value to be
> > specified, e.g. `defaultcolumn(1, 3.14)`, or just make `column` accept
> > an optional second argument, like `timecolumn` does.
> >
> > I'm not sure how much value this second proposal would add, though,
> > since the `valid ? column : default` syntax does work, but perhaps a
> > separate explicit function would be useful if `column` was ever "fixed"
> > to behave like the dollar operator.
> >
> > >
> > > cheers,
> > > Ethan
> >
> > best regards,
> > Peter Juhasz

-- 
Ethan A Merritt
Biomolecular Structure Center, K-428 Health Sciences Bldg
MS 357742, University of Washington, Seattle 98195-7742