
#2140 Selecting columns using strcol adds extra spaces

None
open
nobody
None
2019-02-25
2019-02-22
No

Plotting columns of tab-separated data with table using strcol inserts additional spaces that prevent the use of common comparison operations. That is, given a column containing the letter 'a', plotting using (strcol(1) eq 'a' ? true_val : false_val) on a datablock will always be false because the value extracted from the original data is padded with a space character, e.g. '<space>a'.

Steps to reproduce:

[13:28:00]:tmp$ printf "" > in.tsv; for i in {1..10}; do printf "a\tb\tc\n" >> in.tsv; done
[13:28:07]:tmp$ cat -t in.tsv  # basically a<tab>b<tab>c<newline>
a^Ib^Ic
a^Ib^Ic
a^Ib^Ic
a^Ib^Ic
a^Ib^Ic
a^Ib^Ic
a^Ib^Ic
a^Ib^Ic
a^Ib^Ic
a^Ib^Ic
[14:28:18]:tmp$ gnuplot -d -e "set datafile separator tab; set table 'out.tsv' separator tab; plot 'in.tsv' using (strcol(1)):(strcol(2)) with table; unset table;"
[14:29:42]:tmp$ cat -t out.tsv # basically <space>a<tab><space>b<newline>
 a^I b
 a^I b
 a^I b
 a^I b
 a^I b
 a^I b
 a^I b
 a^I b
 a^I b
 a^I b
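The padding can also be checked mechanically rather than by reading the cat -t output. A minimal sketch, assuming a POSIX shell and awk; the helper name count_padded_fields is hypothetical and not part of gnuplot:

```shell
# Hypothetical helper: print how many tab-separated fields in file $1
# begin with a space character.
count_padded_fields() {
    awk -F '\t' '
        { for (i = 1; i <= NF; i++) if ($i ~ /^ /) n++ }
        END { print n + 0 }   # n + 0 prints 0 when no field matched
    ' "$1"
}
```

Run as count_padded_fields out.tsv; any non-zero count means fields were written with a leading space. For a file laid out like the out.tsv listing above (ten lines, two padded fields each) the count would be 20.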

Comparison failure can be reproduced via:

 [15:02:26]:tmp$ gnuplot -d -e "fn(val,pred)=(val eq pred ? 'true' : 'false'); \
> set datafile separator tab; \
> set table \$BLOCK separator tab; \
> plot 'in.tsv' using (strcol(1)):(strcol(2)) with table; \
> unset table; \
> set table 'out2.tsv' separator tab; \
> plot \$BLOCK using (fn(strcol(1),'a')):(fn(strcol(2),'b')) with table; \
> unset table;"; \
> cat out2.tsv 
 false   true
 false   true
 false   true
 false   true
 false   true
 false   true
 false   true
 false   true
 false   true
 false   true

Behavior confirmed on 5.2.4-5.2.6 and branch-5-2-stable

$ git log -n1
commit d83adc16f5a572f5d004963ead8326591498dd41 (HEAD -> branch-5-2-stable, origin/branch-5-2-stable)
Author: Ethan A Merritt <merritt@u.washington.edu>
Date:   Tue Feb 19 23:10:08 2019 -0800

    Clear STATS_* variables before performing 'stats' analysis
...
$ uname -a
Darwin pinion.local 18.2.0 Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64 x86_64

Discussion

  • Ethan Merritt

    Ethan Merritt - 2019-02-23

    For numerical values the extra whitespace is ignored.
    For string content, if you know that the string itself does not contain internal blanks then it would be sufficient to say
    set datafile separator whitespace

    Beyond that, maybe we should add a built-in string operator trim(str) as in perl6 or perl5 String::Util. That could be quite useful.

     
    • Ethan Merritt

      Ethan Merritt - 2019-02-23

      Now in 5.3:

      gnuplot> help trim
      
       `trim("  padded string ")` returns the original string stripped of leading
       and trailing whitespace.  This is useful for string comparisons of input
       data fields that may contain extra whitespace. For example
            plot FOO using 1:( trim(strcol(3)) eq "A" ? $2 : NaN )
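The effect of trimming on a string comparison can be illustrated outside gnuplot. The sketch below is a shell analogue only, not gnuplot's implementation; the function name trim_ws is hypothetical, and it strips only spaces and tabs:

```shell
# Hypothetical shell analogue of trim(): strip leading and trailing
# spaces and tabs from the first argument and print the result.
trim_ws() {
    printf '%s\n' "$1" | sed -e 's/^[[:blank:]]*//' -e 's/[[:blank:]]*$//'
}

padded=' a'
# The raw comparison fails because of the leading space ...
[ "$padded" = 'a' ] && echo 'raw: true' || echo 'raw: false'
# ... while the trimmed value compares equal.
[ "$(trim_ws "$padded")" = 'a' ] && echo 'trimmed: true' || echo 'trimmed: false'
```

This prints "raw: false" followed by "trimmed: true", which mirrors the false/true difference that a trimmed strcol value is meant to resolve.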
      
       
      • Mike Tegtmeyer

        Mike Tegtmeyer - 2019-02-25

        Hi Ethan,

        I think the trim command will be useful in general. I was achieving a similar workaround using the word call.

        I think that there still is an underlying bug however. It seems that the with table command prepends characters to your data (specifically the space char) when it shouldn't. It is reasonable to expect gnuplot to change how your data is delimited (since it was set via a separator) but I think that it is unreasonable for gnuplot (or any application for that matter) to modify the data in some irreparable way during basic I/O. If someone is using the tab character as a column separator then it is reasonable to expect that the data might be padded with whitespace characters and should be preserved. For example with the current proposed fix, it appears to be impossible to load in string data prepended with X spaces and write them back out (either as a file or as a datablock) and preserve the X spaces. Either they will be increased by 1 (which the user didn't ask for) or they will be stripped (via a trim-like operation).

        To me, round-tripping data by reading in a tab-separated file of potentially space-padded string data using set datafile separator tab and then writing it back out again using set table ... separator tab should succeed. I.e., the input data and the output data should be the same.
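The round-trip expectation can be stated as a byte-for-byte comparison. A minimal sketch, assuming every input column is written back out in the same order; the helper name roundtrip_check is hypothetical:

```shell
# Hypothetical helper: report whether two files are byte-for-byte identical.
# $1 is the original input, $2 is the file written back out.
roundtrip_check() {
    if cmp -s "$1" "$2"; then
        echo 'identical'
    else
        echo 'differs'
    fi
}
```

Called as roundtrip_check in.tsv out.tsv, this prints "differs" whenever a space is added to or removed from any field, which is the failure described above.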

         
        • Ethan Merritt

          Ethan Merritt - 2019-02-25

          My view is that any algorithm, format, or language that is sensitive to the amount of whitespace is broken or badly designed or however you want to state it. Even Fortran eventually outgrew that craziness. I have more sympathy for the complaint that programs (excel, soffice, etc.) are inconsistent about whether string values are placed in quotes. This is explicitly allowed by the RFC 4180 standard but that doesn't make it any less annoying. So far as I know the standard says nothing about leading or trailing blank space.

          However, note that the C documentation says that string input via %s "stops at white space or at the maximum field width, whichever occurs first", which implies that input is insensitive to the amount of trailing whitespace. So it might be better to favor "%s<blank>" over "<blank>%s" on output.

           
          • Mike Tegtmeyer

            Mike Tegtmeyer - 2019-02-25

            I hear what you are saying but I would counter then that 'separator tab' directives don't actually do what they claim to do. To me, delimited data means that everything that is between the delimiters is considered data. I'm somewhat familiar with RFC4180 and in my reading, I believe RFC 4180 agrees with this. That is, in CSV

            foo,bar is not the same as foo, bar (note the space)

            And from the RFC 4180 BNF:

            TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

            Here hex 0x20 is the space character

            The command set datafile separator tells gnuplot that data fields in
            subsequent input files are separated by a specific character rather than by
            whitespace. ...

            If what gnuplot is actually going to do is insert a tab (as directed by 'separator tab') and then add some additional whitespace but the user should just consider the tab and any additional whitespace a column separator, then 'separator tab' doesn't actually add any real functionality. Meaning 'separator tab' is basically equivalent to 'separator whitespace'.
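The difference between the two splitting modes can be illustrated with awk, used here only as a stand-in; it says nothing about how gnuplot parses fields internally. With whitespace splitting a leading space is discarded, while with a tab separator it is part of the field. The function name first_field is hypothetical:

```shell
# Print the first field of each input line in square brackets.
# $1 selects the splitting mode: "tab", or anything else for awk's
# default whitespace splitting.
first_field() {
    case "$1" in
        tab) awk -F '\t' '{ print "[" $1 "]" }' ;;
        *)   awk '{ print "[" $1 "]" }' ;;
    esac
}

printf ' a\t b\n' | first_field whitespace   # prints [a]
printf ' a\t b\n' | first_field tab          # prints [ a]
```

If a tab-separated writer pads its fields, only a whitespace-style reader recovers the original value, which is the sense in which the two modes would collapse into one.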

            In my opinion, what we are talking about here is not a language (which I agree should be insensitive to whitespace); we are talking about a data file where the data may or may not contain whitespace as valid data. gnuplot can assume that it doesn't (separator whitespace). However, if the data does contain some and the user is explicitly stating how their data is or should be laid out, and what is or is not considered data, by using the 'separator' directive, then there is a reason for it and gnuplot should respect that.

            In the end, I'm suggesting that gnuplot's CSV handling should be in accordance with RFC 4180 and tab-separated data should be the same with the exception of exchanging the comma for the tab. I think that this follows the principle of least surprise. That is, if I give gnuplot:

            data[tab]data, gnuplot shouldn't rewrite it as data[tab][space]data and tell me it is the same thing.

            I'm struggling to come up with another application that adds additional whitespace to field separators after you've set it to be a tab. Certainly not excel, Matlab, Octave, or Maple.

             
