Plotting columns of tab-separated data 'with table' using strcol inserts additional spaces that prevent the use of common comparison operations. That is, given a column containing the letter 'a', plotting with (strcol(1) eq 'a' ? true_val : false_val) on a datablock will always be false, because the value extracted from the original data is padded with a space character, e.g. '<space>a'.
Steps to reproduce:
[13:28:00]:tmp$ printf "" > in.tsv; for i in {1..10}; do printf "a\tb\tc\n" >> in.tsv; done
[13:28:07]:tmp$ cat -t in.tsv  # basically a<tab>b<tab>c<newline>
a^Ib^Ic
a^Ib^Ic
a^Ib^Ic
a^Ib^Ic
a^Ib^Ic
a^Ib^Ic
a^Ib^Ic
a^Ib^Ic
a^Ib^Ic
a^Ib^Ic
[14:28:18]:tmp$ gnuplot -d -e "set datafile separator tab; set table 'out.tsv' separator tab; plot 'in.tsv' using (strcol(1)):(strcol(2)) with table; unset table;"
[14:29:42]:tmp$ cat -t out.tsv  # basically <space>a<tab><space>b<newline>
 a^I b
 a^I b
 a^I b
 a^I b
 a^I b
 a^I b
 a^I b
 a^I b
 a^I b
 a^I b
Comparison failure can be reproduced via:
[15:02:26]:tmp$ gnuplot -d -e "fn(val,pred)=(val eq pred ? 'true' : 'false'); \
> set datafile separator tab; \
> set table \$BLOCK separator tab; \
> plot 'in.tsv' using (strcol(1)):(strcol(2)) with table; \
> unset table; \
> set table 'out2.tsv' separator tab; \
> plot \$BLOCK using (fn(strcol(1),'a')):(fn(strcol(2),'b')) with table; \
> unset table;"; \
> cat out2.tsv
false true
false true
false true
false true
false true
false true
false true
false true
false true
false true
Behavior confirmed on 5.2.4-5.2.6 and branch-5-2-stable
$ git log -n1
commit d83adc16f5a572f5d004963ead8326591498dd41 (HEAD -> branch-5-2-stable, origin/branch-5-2-stable)
Author: Ethan A Merritt <merritt@u.washington.edu>
Date:   Tue Feb 19 23:10:08 2019 -0800

    Clear STATS_* variables before performing 'stats' analysis
...
$ uname -a
Darwin pinion.local 18.2.0 Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64 x86_64
For numerical values the extra whitespace is ignored.
For string content, if you know that the string itself does not contain internal blanks, then it would be sufficient to say
set datafile separator whitespace
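As a rough analogy in Python (this is not gnuplot's actual parser), splitting on runs of whitespace absorbs the padding, but it also breaks up any field that contains a blank. The sample rows and variable names are made up for illustration.

```python
# Analogy only: str.split() with no argument splits on runs of whitespace,
# roughly what whitespace separation means for simple fields.
padded_row = " a\t b"            # a row padded the way 'with table' pads it
print(padded_row.split())        # ['a', 'b']  (padding is absorbed)

# Hypothetical row whose first field contains an internal blank.
row_with_blank = " New York\t b"
print(row_with_blank.split())    # ['New', 'York', 'b']  (field is broken up)
```

This is why the workaround only applies when no string value contains internal blanks.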
Beyond that, maybe we should add a built-in string operator
trim(str)
as in perl6 or perl5 String::Util. That could be quite useful.
Now in 5.3:
Hi Ethan,
I think the trim command will be useful in general. I was achieving a similar workaround using a call to word().
I think that there is still an underlying bug, however. It seems that the 'with table' command prepends a space character to your data when it shouldn't. It is reasonable to expect gnuplot to change how your data is delimited (since that was set via a separator), but I think it is unreasonable for gnuplot (or any application, for that matter) to modify the data in some irreparable way during basic I/O. If someone is using the tab character as a column separator, then it is reasonable to expect that the data might be padded with whitespace characters, and that padding should be preserved. For example, with the current proposed fix it appears to be impossible to load string data prepended with X spaces and write it back out (either as a file or as a datablock) while preserving the X spaces. Either they will be increased by 1 (which the user didn't ask for) or they will be stripped (via a trim-like operation).
To me, round-tripping data by reading in a tab-separated file of potentially space-padded string data using
set datafile separator tab
and then writing it back out again using
set table ... separator tab
should succeed. I.e., the input data and the output data should be the same.
My view is that any algorithm, format, or language that is sensitive to the amount of whitespace is broken or badly designed or however you want to state it. Even Fortran eventually outgrew that craziness. I have more sympathy for the complaint that programs (excel, soffice, etc) are inconsistent about whether string values are placed in quotes. This is explicitly allowed by the RFC 4180 standard but that doesn't make it any less annoying. So far as I know the standard says nothing about leading or trailing blank space.
However, note that the C documentation says that string input via %s "stops at white space or at the maximum field width, whichever occurs first", which implies that input is insensitive to the amount of trailing whitespace. So it might be better to favor "%s<blank>" over "<blank>%s" on output.
I hear what you are saying, but I would counter that the 'separator tab' directives then don't actually do what they claim to do. To me, delimited data means that everything between the delimiters is considered data. I'm somewhat familiar with RFC 4180, and in my reading it agrees with this. That is, in CSV
foo,bar
is not the same as
foo, bar
(note the space).
And from the RFC 4180 BNF:
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
Here hex 0x20 is the space character
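To illustrate this reading, here is a small sketch using Python's standard csv module. It is only one example of a parser that treats the space as data and says nothing about gnuplot's own reader. With the default dialect (skipinitialspace is False), the space after the comma stays part of the field.

```python
import csv

# Parse two one-line CSV records with the default dialect.
plain = next(csv.reader(["foo,bar"]))
spaced = next(csv.reader(["foo, bar"]))

print(plain)             # ['foo', 'bar']
print(spaced)            # ['foo', ' bar']  (leading space kept as data)

# The two records are therefore not equal.
print(plain == spaced)   # False
```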
If what gnuplot is actually going to do is insert a tab (as directed by 'separator tab') and then add some additional whitespace, and the user is expected to treat the tab plus any additional whitespace as a column separator, then 'separator tab' doesn't add any real functionality. That is, 'separator tab' is basically equivalent to 'separator whitespace'.
In my opinion, what we are talking about here is not a language (which I agree should be insensitive to whitespace) but a data file, where the data may or may not contain whitespace as valid data. gnuplot can assume that it doesn't (separator whitespace). However, if the data does contain whitespace and the user explicitly states, via the 'separator' directive, how their data is or should be laid out and what is or is not considered data, then there is a reason for it and gnuplot should respect that.
In the end, I'm suggesting that gnuplot's CSV handling should be in accordance with RFC 4180, and that tab-separated data should be handled the same way, with the tab taking the place of the comma. I think this follows the principle of least surprise. That is, if I give gnuplot
data[tab]data
then gnuplot shouldn't rewrite it as
data[tab][space]data
and tell me it is the same thing.
I'm struggling to come up with another application that adds additional whitespace to field separators after you've set the separator to be a tab. Certainly not Excel, MATLAB, Octave, or Maple.
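The round-trip property being asked for can be written down as a small check. This is a sketch in Python, not gnuplot code; the helper names read_tsv and write_tsv are hypothetical and the sample data is made up. It assumes that only the tab is a separator and that everything between tabs, including spaces, is data.

```python
def read_tsv(text):
    """Split each line on tab only; spaces inside a field are data."""
    return [line.split("\t") for line in text.splitlines()]


def write_tsv(rows):
    """Join fields with a single tab and nothing else."""
    return "".join("\t".join(row) + "\n" for row in rows)


# Fields with leading and trailing spaces that must survive unchanged.
original = "a\t  b\tc \n x\ty\tz\n"
round_tripped = write_tsv(read_tsv(original))
print(round_tripped == original)   # True
```

A writer that emits a tab followed by a space would fail this check, because each pass adds one more space in front of every field.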