Re: Lengthy discussion about datafile.c...

Ethan A Merritt wrote:

>On Saturday 14 August 2004 08:54 pm, Daniel Sebald wrote:
>  
>
>>>Your docs say
>>>	+ Gnuplot will retrieve a number of binary
>>>	+ variables equal to the largest column specified in the `<using list>`.
>>>	+ For example, `using 1:3` would cause three columns to be read, of which
>>>	+ the second will be ignored.
>>>So how do you handle the case of 10 logical columns of data in the file,
>>>of which you only want to read the 2nd and 4th?  How do you skip "columns"
>>>5 to 10 of each "line"?
>>>      
>>>
>>"format" is supposed to be analogous to the "using" format string, so
>>something like the following should work
>>
>>plot "datafile.dat" binary format ="%10float" using 2:4
>>    
>>
>
>But according to the documentation I quoted above, that would only
>read in 4 logical columns, leaving 6 more unread values in the file
>before you get to the next set of input values.  How do you tell it
>to skip the next 6 columns?
>

Well, the documentation is a bit misleading.  If all one specifies is 
"using", without a format string, then the assumption is that the 
highest number is the number of columns.  But when the format string is 
there, it tells that a line is 10 floats in this case.  I guess the idea 
was that one could leave out the format string and just type "using 
1:2:3" for example.  I'd not be against requiring "format" and "using" 
to appear together.  A lot of code goes into handling these special 
assumptions and whatnot.
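
To make the distinction concrete, here is a minimal Python sketch of the idea, not gnuplot's actual code.  It assumes little-endian 32-bit floats, and `read_columns` is a hypothetical helper.  The record width plays the role of format="%10float" and the column tuple plays the role of "using 2:4".

```python
import struct

def read_columns(data, floats_per_record, using):
    """Split raw bytes into fixed-width records and keep only the 1-based
    columns named in `using` (hypothetical helper, not part of gnuplot)."""
    record = struct.Struct("<%df" % floats_per_record)  # assumed little-endian float32
    rows = []
    for offset in range(0, len(data) - record.size + 1, record.size):
        values = record.unpack_from(data, offset)
        rows.append(tuple(values[c - 1] for c in using))
    return rows

# Two records of 10 floats each: 1..10 and 11..20.
raw = struct.pack("<20f", *range(1, 21))

# Record width taken from the format string: columns 2 and 4 of each 10-float line.
print(read_columns(raw, 10, (2, 4)))   # [(2.0, 4.0), (12.0, 14.0)]

# Record width guessed from the largest "using" column: the stream is misread.
print(read_columns(raw, 4, (2, 4)))
```

The second call shows the case Ethan is worried about: if the width is inferred as 4, the remaining 6 values of each line are treated as the start of the next record.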

>>>What constitutes the logical equivalent of a "blank line" in your binary
>>>files? Or is there no equivalent to the auto-determination of scan lines?
>>>      
>>>
>>A blank line occurs when the scan line reaches its end.  For example,
>>here is the scatter2 example from the image demo
>>
>>splot 'scatter2.bin' binary record=30,30,29,26 endian=little using 1:2:3
>>
>>which means blank lines occur at the 30th line, 60th line, etc.
>>    
>>
>
>But there you have told it on the command line what the structure is.
>The thing about blank lines in an ascii input file is that they define
>a structure on the fly; you don't need to specify it on the command line.
>I would much rather require a file format that indicates what each
>logical line contains.  A blank line is then indicated *in the file* by some
>designated code (probably some number of 0s, but whatever).
>

My geezerness only goes back to the days of PDP 11-70 and the 8 inch 
floppy platter.  But I can't recall binary files ever having special 
characters to serve as the end of a record.  (If so, what would they be 
such that they wouldn't clash with a valid IEEE float?  Is there a NaN 
in IEEE float that could serve as an end of record?)
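
On the NaN question: IEEE 754 single precision does reserve a whole family of bit patterns for NaN (exponent bits all ones, mantissa nonzero), so a sentinel is at least conceivable.  The Python sketch below is only an illustration.  The value `END_OF_RECORD` is an arbitrary, hypothetical choice, and the scheme assumes the writer never emits that exact pattern as data, which is not guaranteed in general.

```python
import struct

END_OF_RECORD = 0x7FC0DEAD  # hypothetical sentinel: a quiet NaN with an arbitrary payload

def is_nan_bits(word):
    """True if a 32-bit pattern is an IEEE 754 single-precision NaN."""
    return (word & 0x7F800000) == 0x7F800000 and (word & 0x007FFFFF) != 0

def split_records(data):
    """Split a little-endian stream of 32-bit words at the sentinel."""
    records, current = [], []
    for (word,) in struct.iter_unpack("<I", data):
        if word == END_OF_RECORD:
            records.append(current)
            current = []
        else:
            # Reinterpret the same 4 bytes as a float only after the check,
            # so the comparison is done on raw bits rather than on NaN values.
            current.append(struct.unpack("<f", struct.pack("<I", word))[0])
    if current:
        records.append(current)
    return records

stream = (struct.pack("<2f", 1.0, 2.0)
          + struct.pack("<I", END_OF_RECORD)
          + struct.pack("<f", 3.0))
print(split_records(stream))   # [[1.0, 2.0], [3.0]]
```

The comparison is made on the integer bit pattern because NaN never compares equal to anything as a float, and because a payload may not survive a round trip through floating-point conversion.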

>>Well, I'd not thought of strings in the binary file, but perhaps
>>something like "%s" in the format string?  I would probably make the
>>restriction that the strings within the file need to be NULL terminated.
>> That's not an unrealistic expectation, is it?  Or wait, maybe "%s"
>>could be general length but NULL terminated; "%[#]s" could be a fixed
>>length of # characters.
>>    
>>
>
>Fixed length strings are not interesting.  You could use NULL-termination,
>but only if you specify everything on the command line because otherwise
>the input routine doesn't know whether it's reading a string at all.
>

Right.  I guess all this sort of information would be in some type of 
header.

>>I'm not sure what you mean by a matrix of strings.
>>    
>>
>
>I mean like an input file consisting of 10 lines of 5 strings each.
>Only in this case it would be a binary file containing 50 NULL-terminated
>strings that you have somehow flagged as being in a 10x5 matrix.
>
>  
>
>>What is the matrix variant?  
>>    
>>
>
>Like your example above. (At least I *think* that's what your example
>was doing).   A regular array of values all of the same sort.  E.g.
>a  100x200x300 grid with x varying faster than y faster than z.
>But since it's regular and all the entries are the same length you
>know exactly where to find every element without any funky format
>stuff.
>

Yes, I'd certainly be content with something simple to get data across 
from an application to gnuplot.  I mean, the expectation is not that one 
has a thousand columns of binary data of which you want to pick out only 
two of them.  In most cases one could say something like

format="%int" using 1:2:3

or

format="%float" using 1:2:3

I think originally this started as a slight variation on the using 
command, but it got too confusing from that.  So "binary" developed its 
own using string.

>>The problem is that an image of say 500 x 500 pixels gets very big in ASCII.
>>    
>>
>
>I think that is a non-issue.  You don't have to store this anywhere; you're 
>just piping it in.    But this is the very straightforward case that I called 
>a matrix.  You know in advance it's a 500x500 array, and you know how big
>each element is.  No need for format statements, using specs, or any of that.
>
>[EAM puts on geezer hat again]  Back in the old days of limited disk space
>it was a big win to store numeric data in binary files.  This caused 
>man-centuries of time to be wasted in dealing with cross-platform conversions
>and uncertainty about the exact format of the binary files.  When disks got
>cheap we all heaved a huge sigh of relief and for the most part stopped 
>using binary output files.  It's just not worth it.  So what if the ascii 
>equivalent is big?  Just compress it and it goes back to being about the
>same size as the original binary (OK, that depends a bit on what sort of
>data it is).  
>

[Battle of the geezers coming]  Point taken, when we're talking a few 
hundred data points.  But when we're talking images, it could be a 
megabyte file, and converting that to ASCII yields a 10 megabyte file. 
 (Each data point gets expanded to a floating point ASCII number.  In 
addition the (x,y) locations have to be added as columns.)  Then it has 
to be read in using formatted I/O.  (I assume scanf is slightly 
slower than reading raw data.)  There gets to be this delay between hitting the 
return key and an image popping up in what looks like Octave, but is 
really Gnuplot.
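
The size argument can be checked with a rough Python sketch.  It assumes 4-byte samples in the binary file and one "x y value" line per pixel in the ASCII file, with the value printed in C-style "%e" notation.  Both are assumptions for illustration, and the exact ratio depends on the print format chosen.

```python
def ascii_vs_binary_size(width, height, bytes_per_sample=4):
    """Return (binary_bytes, ascii_bytes) for a width x height image.

    Binary: one raw sample per pixel.  ASCII: one 'x y value' line per pixel,
    with the value printed like C's "%e" (an assumed, typical format).
    """
    binary_bytes = width * height * bytes_per_sample
    ascii_bytes = 0
    for y in range(height):
        for x in range(width):
            line = "%d %d %e\n" % (x, y, 0.5)  # 0.5 stands in for a pixel value
            ascii_bytes += len(line)
    return binary_bytes, ascii_bytes

binary_bytes, ascii_bytes = ascii_vs_binary_size(500, 500)
print(binary_bytes, ascii_bytes, ascii_bytes / binary_bytes)
```

Only the byte counts are compared here.  The cost of parsing the text on input comes on top of that and is not measured.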

Another similar situation might be a speech waveform.  I understand that 
the person using the software should really appropriately down-sample 
the data so that one isn't sending all kinds of data and extraneous high 
resolution to gnuplot, but people don't do that unfortunately.  They've 
got fast computers and 120 Gbyte hard drives.

>Bottom line is I really don't like this general binary input format.
>If you know enough about your binary format to write a cryptic 
>description like
>    plot "datafile.dat" binary format="%*int16%float32%*float32%" \
>    record=30,30,29,26 endian=little 
>then by gum, you know enough to write a jiffy filter routine and
>pipe normal ascii input into gnuplot.   Many users may not be
>up to this, but those same users won't be able to figure out the
>endian business anyhow.  Where exactly is the big gain?
>Simplicity is worth *a lot*.  Far more than saving a little bandwidth
>in the input pipe.
>
>Input of binary files containing regular arrays may be worth it
>for convenience.   But more complicated input requiring flags for
>bit order, word size, floating point format, and pre-announcement
>of the file structure? ---- All that strikes me as being more trouble
>than it is worth.  Will your code work on an Amiga?  On a 64-bit
>VMS machine?   Who is going to explain to users how to set
>all the right flags to make it work?
>

Hey, I'm not going to fight you on that one.  I'm all for simplicity. 
 No one ever offered up a simple solution.  It started as a slight 
variation on the current implementation of using/etc.  I probably figured 
at the time why not treat binary just like ascii so that all the 
functionality that ascii input has is also present for binary, e.g., 
passing through a function, etc.?  It grew from there.

I would add that I myself am deterred from implementing general binary 
if the df_readline() is going to continue to grow with functionality 
from within.  Unless, say, the use_spec processing is converted to a 
function that can be called from multiple places, trying to maintain two 
"analogous", or "parallel", routines is too much for anyone, whether he 
or she is the original author or not.

>I believe that Petr had some specific applications in mind, so
>maybe he can step in and clarify exactly what pieces of this 
>code he wanted, and why.  I myself plot many sorts of data in
>gnuplot, but I've  never felt a need for direct binary input.
>  
>

... but I would say that binary input has to exist if one is going to 
display images.  The faster the response between hitting the return key 
and the image popping up on the screen, the better.  If it gets too slow 
for relatively small images, the user's response won't be favorable.

Offer up a simple way of doing it...

Would a syntax where there is _no_ format string and _no_ using string 
simplify matters?  That is, everything must be binary floats and there 
cannot be any discarded columns.  It would remove a lot of bits and 
pieces of code that would add up to quite a lot, I guess.  How about the 
sample intervals?  Are those useful?  Again, the main structure of a 
binary data file from an application would just be a solid string of raw 
data, but it isn't unreasonable to require that everything about (x,y) 
positions be explicit, rather than implicit.

Is there a subset of the syntax we've offered up which will work 
sufficiently and provide some flexibility so that both large image files 
and large linear files, like speech or other forms of lengthy time 
records can be transferred efficiently?

If one wants to rule out lengthy binary linear records, and still allow 
large binary image files, how about an extension to the gnuplot binary 
format?  That is, would you allow a variable to follow "binary" as 
binary presently exists in the CVS version?  The purpose of the 
variable would be to indicate how many "channels" or "entries" or 
whatever are associated with a location in the grid.  This would allow 
the use of grayscale and RGB images.  For example, "binary 3" would be

N     x1              x2              ...   xN
y1    <r11 g11 b11>   <r21 g21 b21>   ...   <rN1 gN1 bN1>
y2    <r12 g12 b12>   <r22 g22 b22>   ...   <rN2 gN2 bN2>
etc.
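
As a rough illustration of what an application would have to write, here is a Python sketch that packs an RGB grid in the layout above.  This is a proposed format, not something gnuplot reads today.  `pack_binary3` is a hypothetical helper, and it assumes every value, including N, is stored as a little-endian single-precision float.

```python
import struct

def pack_binary3(xs, ys, pixels):
    """Pack an RGB grid in the proposed layout (hypothetical, float32 throughout):
    N, x1..xN, then for each row: y, followed by N (r, g, b) triples.
    pixels[j][i] is the (r, g, b) triple at (xs[i], ys[j])."""
    n = len(xs)
    out = [struct.pack("<f", float(n)), struct.pack("<%df" % n, *xs)]
    for y, row in zip(ys, pixels):
        if len(row) != n:
            raise ValueError("each row needs one RGB triple per x value")
        flat = [channel for triple in row for channel in triple]
        out.append(struct.pack("<f", y))
        out.append(struct.pack("<%df" % (3 * n), *flat))
    return b"".join(out)

# A 2 x 2 image: red and green on the first row, blue and white on the second.
blob = pack_binary3(
    xs=[0.0, 1.0],
    ys=[0.0, 1.0],
    pixels=[[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
            [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]],
)
print(len(blob))   # 4 * (1 + 2 + 2 * (1 + 3 * 2)) = 68 bytes
```

With one channel instead of three, the same loop produces the grayscale case, so the channel count is the only thing the reader needs to be told on the command line.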

The x and y wouldn't necessarily have to be Cartesian.  They could be 
radial, if ever one gets adventurous enough to attempt circular images 
like sonograms, CT...  which probably won't happen.  I do acknowledge 
that gnuplot binary is limiting though.  But it works for me.  Petr may 
have feelings otherwise.  (But Petr, it may be possible to take a lot of 
the binary code and write a little app that converts ESRF to gpbin or 
gpbin3, then have a little awk script so that gnuplot behaves almost 
exactly like "plot 'image.edf' with rgbimage".)

Seriously, given some consensus on a simple, acceptable approach, I can 
toss it together in a matter of hours and be done.  From my perspective, 
so long as I can use Octave to get images, in this case spectrograms, 
into a PostScript or PS/Latex form, that has axes and tics and labels, 
and can be imported to a LaTeX document, I'm happy.  What I have now 
works for me, but I won't be motivated to do a simpler design without 
some consensus, as opposed to offering up some other alternative for 
evaluation.  I'm happy for the feedback and willing to change things if 
it will go somewhere.

Dan
