From: sfeam <E. A Merritt> <sf...@us...> - 2013-09-04 19:33:32
On Sunday, 01 September, 2013 20:54:01 Juhász Péter wrote:
> Dear gnuplot developers,
>
> Ethan has a patch on Sourceforge that aims to solve the old FAQ "how can I combine values from columns in multiple input files?"
>
> http://sourceforge.net/p/gnuplot/patches/615/
>
> First, some observations on the patch and its description:
>
> I don't like the name "merge" for this operation because, quite simply, merging is not what it does. If we were to go with it, I'd propose "store" or "stash" (the latter is inspired by "git stash")...
>
> ...but I don't really like it as it is. I found the new command and its usage pattern quite hard to understand, and the whole thing comes across as a hack.
>
> So, I thought, if we wanted to solve the original FAQ problem, why not attack it directly, by extending the plot command to allow plotting from multiple files -- but I couldn't find an acceptable solution, the plot command being quite complicated as it is, both in its user interface and in its implementation.
>
> Then a new idea struck: introduce a new concept called "datasource", in effect a layer that comes between low-level datafile reading and plotting. In this mechanism a datasource would be a kind of "virtual file" that defines how the contents of one or more real data files are to be combined, transformed and filtered, and the result fed to the plot command.
>
> For example:
>
>     set datasource $DATA1 "foo.txt" paste "bar.txt"
>     plot $DATA1
>
> This command would take the two files foo.txt and bar.txt and concatenate them line by line, like the Unix "paste" command, and let the plot command see the result as if it were a single combined file.
>
> Other combinations could be defined, for example "cat", which would just concatenate the files one after the other (like the similarly named Unix command), or "transpose", which would act on just one file, transposing its contents. The usual "using", "every" etc. modifiers could be applied to the file names.
>
> It is important that the "set datasource" command itself would not perform these operations; it would just prepare them. The data files would be read (and the specified transformations performed on them) only when the plot command is executed.
>
> (The alternative is that the operations are performed by the set command itself and the results saved into a datablock. This would be simpler to implement, but the resulting datablocks could take up a lot of space, or the operation may not be possible at all if the input files are very large -- this is a problem with the original merge proposal as well.)

True. But if the data set is too large to hold in memory, then the operations I want to perform on it will be impractical in any case.

> Note that I don't have any code to show yet; this RFC is just to poll public opinion to see if the concept makes sense at all.

On the one hand I can see the advantage of agreeing on a desirable user interface first and only then working on the implementation. But on the other hand I'm not convinced that an implementation of this particular interface is practical, or even possible. Certainly it would require a lot of new code.

> Also note that I personally have some reservations about the whole thing: it would potentially require rewriting / mucking up sensitive "here be dragons" parts of the code, with unclear benefit. It would also go against the Unix principle: we can run external commands and we have adequate text-processing utilities outside gnuplot, so there is little need to make gnuplot into a text-processing-kitchen-sink-included utility.

That too. If all it would accomplish is to internalize cat, paste, grep, and friends, then I don't think it is worth starting down that path at all.

> Let me know what you think about all this.

I'm not sure that your proposal would address the actual use case that motivated my "merge" patch. If I knew how to handle this conveniently with some combination of paste/cat/grep/awk, I probably would just do that instead of working on a separate patch. Possibly it would be better to use R or Octave, but I understand gnuplot a whole lot better than either of those, so...

Here is my application. There are a couple of dedicated programs that were written to deal with exactly this class of experimental data, but they offer limited hooks for visualization. As a die-hard gnuplot hacker, it seemed easier to me to extend gnuplot's data-processing mechanism than to add general visualization tools to the existing dedicated processing programs, some of which are not open source.

Experimental data is stored in many (tens to hundreds) of individual files. Each consists of at least 3 columns of data:

    sample coordinate, sample value, error estimate

The task is to group and scale subsets of the files to agree with each other, where it is unknown in advance

    1) which files will in fact agree with each other after scaling
    2) what range of sample coordinates this agreement holds for
    3) what scaling function is optimal

After identifying the optimal files, range and scaling, the corresponding data is to be merged into a single output file for subsequent analysis.

Using "paste" only works if the sample ranges and points in the data files are identical. This is often true but is not guaranteed. That's more or less what I was doing before the "merge" patch, but it is rather cumbersome and requires checking separately, in advance, that the sample points line up correctly.

If the scaling were already known, then plotting the data would be easy even though it requires reading in multiple files. But to optimize the range and scaling you want to interactively select ranges of data, the files to include in the scaling, and the scaling function whose parameters are to be optimized. Alternating "fit" and "plot" commands in gnuplot can do this, but only if the data is accessible all at once, i.e. as if it were present in separate columns of a single input file.
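For concreteness, here is a minimal sketch of one such fit/plot cycle, assuming the two files have already been combined into a single (hypothetical) merged.dat in which column 1 is the sample coordinate, columns 2 and 3 hold the value and error from a reference file, and columns 4 and 5 the value and error from the file being scaled. The file name, column layout, and simple multiplicative scaling are illustrative only:

    # candidate scaling: a single multiplicative factor (illustrative)
    k = 1.0
    s(v) = k * v
    # adjust k so that k * (value from the second file, column 4) matches
    # the reference value (column 2), weighted by its error estimate (column 3)
    fit s(x) 'merged.dat' using 4:2:3 via k
    # replot both data sets to judge the agreement (and its useful range) by eye
    plot 'merged.dat' using 1:2:3 with yerrorbars title 'reference', \
         'merged.dat' using 1:(k*$4):(k*$5) with yerrorbars title 'scaled'

In practice the fit would also be restricted to the sub-range of the sample coordinate over which the files are believed to agree, which is exactly the part that has to be explored interactively.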
The thing is, I don't think your proposal would actually work in practice for the "fit" part of this. If every cycle of L-M fitting requires re-reading the original files through an intermediate layer of data presentation, my guess is that the throughput would become so awful as to be unusable. I suppose it's fair to answer that we won't know the throughput limitations until an implementation exists. I also suppose that I should spend more time exploring whether this task can be done in R or Octave.

Anyhow, thanks very much for the feedback. I'll continue to ponder alternatives.

Ethan

>
> Peter Juhasz
>