Binary syntax reduction [Was: Lengthy discussion...]

Ethan Merritt wrote:

>(wandering a bit off topic)
>
>On Sunday 15 August 2004 01:03 am, Daniel J Sebald wrote:
>  
>
>>My geezerness only goes back to the days of PDP 11-70 and the 8 inch
>>floppy platter.  But I can't recall binary files ever having special
>>characters to serve as the end of a record.
>>    
>>
>
>PDP 11/xx used the FILES-11 filesystem, in which meta-information
>about record type, disk allocation, ACLs, etc were stored in a 
>separate meta-file, not as in-line info.
>
>These filesystems supported very complicated record structures
>for database work (still in use today), but also had 3 main "simple"
>file structures:
>	Fixed-length records:
>		What it sounds like.  The record length was specified in
>		meta-data.  A read operation returned 1 whole record.
>	Variable-length records:
>		Each record began with an integer specifying how long
>		the record was.
>	CR/LF:
>		Unix-like stream-of-bytes, with end of record signalled by
>		either a CR or a LF.
>
>On top of that, Fortran used carriage-control characters at the beginning
>of a record.
>

OK, you win the geezer challenge...  Anyway, gnuplot binary then is 
similar to variable-length record.
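
For anyone who has not met the variable-length layout described above, here is a minimal sketch in Python. The 4-byte little-endian length prefix and the function names are assumptions made for illustration; this is not the actual FILES-11 or gnuplot binary layout.

```python
import io
import struct

def write_records(stream, records):
    """Write each record as a 4-byte little-endian length, then its bytes."""
    for payload in records:
        stream.write(struct.pack("<I", len(payload)))
        stream.write(payload)

def read_records(stream):
    """Yield one record at a time until the stream runs out of headers."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # no complete length prefix left
        (length,) = struct.unpack("<I", header)
        yield stream.read(length)

# Round trip through an in-memory stream; no special end-of-record
# character is needed because the length prefix delimits each record.
buf = io.BytesIO()
write_records(buf, [b"abc", b"", b"hello world"])
buf.seek(0)
records = list(read_records(buf))
```

The reader never scans the payload for a terminator, which is why arbitrary binary bytes are safe inside a record.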

>But it's not a file. It never hits the disk, so I/O speed is not an issue.
>And at current memory bandwidths, transferring 10 MB of data should
>take only about 0.01 sec (if I haven't dropped a decimal point somewhere).
>That will be totally dominated by the I/O time to read the original binary
>data from a disk file.  So it may be unaesthetic to have an intermediate
>ascii stream, but I doubt it will be noticeable in terms of interactive
>response.  
>

Here is a test.  Let's say a 500 x 500 image is processed in Octave and 
is to be plotted.  I don't think 500 x 500 is unreasonable; x-ray 
angiography images and telescopic space images are usually pretty big.  
If you have Octave, try the following to simulate the amount of data 
that would be transferred through the pipe.  (Granted, we have no idea 
what kind of bottlenecks might exist in how Octave is programmed for 
the pipe--perhaps it could be improved--but we'll use this as a rough test.)

t = [1:500*500]/100;
s = sin(t);
plot(t,s);

On my machine, a three-year-old Dell with a Pentium 4, 900-1000 MHz 
system bus, that plot takes 8 seconds.  After the 4th second the Octave 
command line returns, and 4 seconds after that the gnuplot plot appears.  
To me, that time is unacceptable.  (Imagine the derision...  No, the 
solution is not to buy a faster computer.)  There are probably a couple 
of things going on.  First, the pipe may not transfer data at the rate you 
suggest, due to time sharing perhaps.  Who knows?  Second, there is also 
the issue of this being formatted I/O, meaning that every value has to 
go through the scanf function.  Does that slow things down?

Now an example in Octave using the m-file designed to use the image and 
binary features added to gnuplot.

A = 1./hilb(500);
imagegp(A);

This takes 3/4 to 1 second.  Tolerable.  There is a difference here 
though.  The binary data goes through a file.  So maybe the file is 
faster than the pipe.

Let's try one last test.  Sending the image data to a file in ascii 
form.  I'll put an "if 1" around the instructions to ensure they are all 
executed as fast as possible one after the other.

X = ones(size(A,2),1)* [1:size(A,1)];
Y = [1:size(A,2)]'*ones(1,size(A,1));
N = size(A,1)*size(A,2);
B = [reshape(X,N,1) reshape(Y,N,1) reshape(A,N,1)]';

if 1
    fid = fopen("junk.dat","w");
    fprintf(fid, "%f %f %f\n", B);
    fclose(fid);
    graw("plot \'junk.dat\' using 1:2:3 w image\n");
end

This takes 6 or 7 seconds.  So files and a pipe are roughly the same in 
this crude test.  Perhaps the file is even faster because more data is 
being transferred in that case.  However, other things within gnuplot 
may differ, e.g., reading from a file versus reading from '-'.

Anyway, rough test.  But the conclusion is that it is probably the 
'fprintf' and 'scanf', i.e., formatted I/O, that slows things down, and 
binary data is a nice feature to have with images.
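
The same comparison can be repeated without Octave or gnuplot in the loop. The sketch below, in Python, pushes 500 x 500 values through a formatted route (print, then parse) and through a raw binary route (pack, then unpack) and reports size and time for each. It only stands in for the fprintf/scanf path; it does not measure gnuplot itself, and the timings depend entirely on the machine and interpreter.

```python
import struct
import time

# Same amount of data as the 500 x 500 test above.
values = [i / 100.0 for i in range(500 * 500)]

# Formatted route: every value is printed as text and parsed back.
start = time.perf_counter()
text = "\n".join("%f" % v for v in values)
parsed = [float(token) for token in text.split()]
text_seconds = time.perf_counter() - start

# Binary route: the values are copied as 8-byte doubles, no parsing.
start = time.perf_counter()
blob = struct.pack("<%dd" % len(values), *values)
unpacked = struct.unpack("<%dd" % len(values), blob)
binary_seconds = time.perf_counter() - start

print("formatted: %d bytes, %.3f s" % (len(text), text_seconds))
print("binary:    %d bytes, %.3f s" % (len(blob), binary_seconds))
```

Note that the binary route also returns the values bit for bit, while the "%f" route keeps only six decimal places.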

>>I would add that I myself am deterred from implementing general binary
>>if the df_readline() is going to continue to grow with functionality
>>from within.  Unless, say, the use_spec processing is converted to a
>>function that can be called from multiple places, trying to maintain two
>>"analogous", or "parallel", routines is too much for anyone, whether he
>>or she is the original author or not.
>>    
>>
>
>You mean changing use_spec[] from an array into a function?
>If that turns out to be useful then I suppose it would be reasonable.
>

Yeah, but I'm not advocating that.  You are persuading me that perhaps 
"binary" should be simpler.  The question is, how many people will use 
gnuplot from the command line for processing images?  Not many, so I 
would say that passing data through a function isn't that necessary, as 
in this example:

plot 'blutux.rgb' binary array=128x128 flipy format='%uchar' using 
(1.5*$1):2:3 with rgbimage

The primary use I have in mind for this "large data set plotting" is 
something done by an application in an ephemeral way.  Just send some 
data over, plot it, and discard the data.  So, perhaps the ability to 
skip data within a binary file isn't necessary.  That is, no '%*uchar%' 
kind of stuff, or skipping a number of bytes at the head of the file.

How about tossing out the multiple-records-per-file feature?  If there 
is more than one big data set to plot, just create multiple files.

How about tossing the implicit sampling interval?  That would mean that 
all data must appear in the file; for example, the (x,y) coordinates for 
each pixel of an image must be stored along with the pixel value.  That 
means a sample image for the 'image.dem' program would increase in size 
by a factor of 5/3 (five values per pixel instead of three).  No problem.

Translations: toss those in the case where the coordinates are in the file.

All of this stuff would reduce a lot of the code, much of which is for 
interpreting the keywords.  With no "using" there can be no functions. 
 Also, let's say with binary, no strings, no time data, etc.  Again, 
this kind of stuff will be small in quantity if ever it is plotted, in 
which case ASCII can be used.  What I mean is there is no need to plot 
500 strings.

I'd hesitate to toss '%uchar', etc.  Although I could give on that one. 
 But let's rule out multiple data types per file.  Maybe just one 
%float, etc. inside the format string.  The code that does the 
transformation inside the df_readbinary() routine is fairly 
straightforward.  There is a set of tables to compute datasizes upon 
compilation.  Looks nasty but once it is compiled, it probably isn't too 
big.
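
To give a feel for what such a size table amounts to, here is a rough Python sketch. It is not the table in df_readbinary(); the dictionary and function names are hypothetical, and of the format names shown only '%uchar' and '%float' appear in this thread ('%double' is added purely for illustration).

```python
import struct

# Hypothetical lookup from a format name to a struct type code.
FORMAT_CODES = {
    "%uchar": "B",   # unsigned 8-bit integer
    "%float": "f",   # 32-bit IEEE float
    "%double": "d",  # 64-bit IEEE float (illustrative only)
}

def record_size(format_name, count):
    """Bytes occupied by `count` values of one type, with no padding."""
    # The "<" prefix selects standard sizes and disables alignment padding.
    return struct.calcsize("<" + FORMAT_CODES[format_name] * count)

# A 128 x 128 image with three '%uchar' components per pixel.
rgb_bytes = 128 * 128 * record_size("%uchar", 3)
# The same pixel count with three '%float' components per pixel.
float_bytes = 128 * 128 * record_size("%float", 3)
```

With one data type per file, a single table lookup fixes the record size for the whole read.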

I'd hesitate to toss the endian information too.  That code inside 
df_readbinary() also isn't too bad.  The thing is, Octave has the 
qualifiers "ieee-le" and "ieee-be" associated with its fopen() routine.  
They pay attention to endianness, so maybe gnuplot binary should too.
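
Why the byte order matters can be shown in a few lines of Python. This is only a sketch: read_floats() and its endian argument are hypothetical names chosen to mirror an "endian=" style keyword, not gnuplot code.

```python
import struct

def read_floats(data, endian="little"):
    """Decode a buffer of 32-bit floats using the stated byte order."""
    prefix = "<" if endian == "little" else ">"
    count = len(data) // 4
    return list(struct.unpack("%s%df" % (prefix, count), data[:count * 4]))

value = 1.5
little = struct.pack("<f", value)  # least significant byte first
big = struct.pack(">f", value)     # most significant byte first

# The two encodings hold the same four bytes in opposite order.
same_bytes_reversed = little == big[::-1]

# Decoding with the matching order recovers the value; the wrong order
# silently yields a different number rather than an error.
right = read_floats(big, endian="big")[0]
wrong = read_floats(big, endian="little")[0]
```

Since nothing fails loudly when the order is wrong, the writer and the reader have to agree on it explicitly.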

So, in order to get functionality, here is a possible reduced syntax.

binary {3 | xy | xyz | xyzc} {format="string"} {endian=little}

Now if we want to toss the format, and require "all floats, all the 
time", fine.

But the first part of that syntax is to allow entry for both images and 
long linear records such as speech waveforms or whatever.

binary  :  The current gpbin file

binary 3  :  Very similar to current gpbin, what I call gpbin3.  That 
is, it is the matrix format, but each element of the matrix has 3 
components.  (Could make that an arbitrary number, 1 up to max columns.)

Now that covers images, i.e., a matrix format.  But what about sampling 
in one dimension?  Perhaps that could be done with gpbin if one sets N 
(the number of columns and first number in the file) to one.  But that 
is tricky from the user's perspective.  Hence the following:

binary xy  :  Two "columns" of data.  Would be useful for 2D plots.

binary xyz  :  Three "columns" of data.  Would be useful for 3D plots.

binary xyzc  :  Four "columns" of data.  Would be useful for 3D plots 
with color.

This wouldn't have to be the exact syntax.  For example, it would be 
nice if one could just specify the number of columns with 2, 3, 4, 5, 
..., max_cols, but that would conflict with trying to introduce multiple 
components per element of matrix binary.
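
As an illustration of what an application would have to write for the column-style modes, here is a Python sketch. The xy/xyz/xyzc keywords come from the proposal above; everything else (the function names, 32-bit floats, little-endian default) is an assumption, since none of this syntax exists in gnuplot yet.

```python
import struct

# Values per record for each proposed keyword.
COLUMNS = {"xy": 2, "xyz": 3, "xyzc": 4}

def _record_struct(mode, endian):
    prefix = "<" if endian == "little" else ">"
    return struct.Struct("%s%df" % (prefix, COLUMNS[mode]))

def pack_records(mode, rows, endian="little"):
    """Pack rows of numbers as consecutive fixed-size float records."""
    record = _record_struct(mode, endian)
    out = bytearray()
    for row in rows:
        if len(row) != COLUMNS[mode]:
            raise ValueError("expected %d values per row" % COLUMNS[mode])
        out += record.pack(*row)
    return bytes(out)

def unpack_records(mode, data, endian="little"):
    """Split a buffer back into tuples, one per record."""
    return list(_record_struct(mode, endian).iter_unpack(data))

# Two (x, y) points; values chosen to be exact in 32-bit floats.
points = [(0.0, 1.0), (0.5, 2.25)]
data = pack_records("xy", points)
decoded = unpack_records("xy", data)
```

Because every record has the same size, the reader needs no header at all: the keyword alone says how to cut up the stream.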

Dan
