Re: Lengthy discussion about datafile.c...


Ethan Merritt wrote:

>On Saturday 14 August 2004 11:56 am, Daniel J Sebald wrote:
>  
>
>>Hope this doesn't sound like a lecture, but I want to discuss how to
>>keep datafile.c clean and prevent the evolution of convoluted code that
>>is starting to occur with that file.
>>    
>>
>
>[snip lengthy rant, some of which is on target, some not]
>

:-)

>Didn't we already have this discussion a few months ago?
>I proposed that the whole notion of tracking input data by how
>many columns were read in has outlived its usefulness.
>I think we should get rid of max_cols and all the various
>tests that depend on it, and instead pass explicit information
>about the requested input data.  Hans-Bernard disagreed.
>

Well, yeah.  Lots of disagreement, but not much agreement on what the 
paradigm should be.  I'm suggesting that we reach some agreement to 
avoid too much divergence.  The discussion sort of faded...

>>df_open(int max_using, int plot_mode)
>>    
>>
>
>Like that, yes, except that 
>(1) I think max_using is not necessary or desirable, and 
>(2) I proposed passing a pointer to the whole plot
>structure rather than passing only the plot style.
>

Another paradigm is fine; passing in a pointer to the whole plot is 
fine.  Just not a combination of multiple views.  Some of the image 
stuff may fit one paradigm or the other, so it might be good to 
adhere to only one in the near future.

I know that for ASCII files the number of columns can be determined by 
the file itself and gnuplot readjusts accordingly.  That code is 
currently in plot2d.c.  That will remain there?  Or will that be moved 
to inside datafile.c as part of df_readline?  df_open?

>>While on this topic of df_readline, I wonder if introducing too much
>>"plot dependent" stuff into df_readline is a good idea.
>>    
>>
>
>And that was Hans-Bernard's counterargument.
>
>  
>
>>For example, 
>>with the histogram tics, this kind of line seems like it shouldn't be in
>>a file-reading routine:
>>
>>             add_tic_user(axis,temp_string,xpos,0);
>>
>>Is there some way to move this functionality outside of df_readline()
>>back into plot2d.c?
>>    
>>
>
>Why?  It is not specific to 2D plots.  But anyway, the answer is no.
>The information being processed, the tic labels, are not specific to 
>the current plot; they are a property of the axis.  The code belongs
>in axis.c, which is where it is currently.  But still you have to call it
>from somewhere, and I maintain the logical place (maybe the only
>possible place) is the point at which you obtain the information.
>That is set.c in the case of axis tic info coming from a "set [xyz]tics"
>command, and datafile.c in the case of tic info read in from a file.
>

OK, let me back up here.  I think I see now that the more important 
issue is that the data to be plotted, the immigration.dat file for 
example, won't work because it has more columns than allowed by max_cols 
passed into the df_readline routine.  That is, the normal gnuplot ascii 
file looks like

"string" **
<data>

"string"
<data>

"string"
<data>

But the 'immigration.dat' file is

    **
"string"   "string"   ....   "string"
<data>     <data>    ....   <data>

where the data which is to serve as the tic labels (read as a string 
rather than a number) is contained in the ** element.

I don't have all the answers, but I'll make some comments.  In the 
latter case, that bunch of strings at the start of the file could all 
be read at once.  In fact, the strategy is similar to that of the 
"gnuplot binary" type of file, where the x values run along the top and 
the first column afterward holds the y values.  Also, the df_readline() 
routine might be easily rearranged to remove the max_cols restriction and 
make the value of j returned dynamic, from 2, 3, 4, etc. all the way up 
to 500 if one wants.  It may just mean dynamic allocation of memory (which 
doesn't need to be reallocated if the number of read values doesn't 
change, so efficiency is preserved).
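
To make the allocation idea concrete, here is a rough sketch of a buffer 
that grows only when a record is wider than any seen before.  The names 
column_buffer, column_buffer_reserve and column_buffer_free are made up 
for illustration; nothing like them exists in datafile.c as far as I know.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable storage for the values of one input record. */
typedef struct {
    double *values;   /* storage for column values */
    size_t capacity;  /* number of doubles currently allocated */
} column_buffer;

/* Make sure there is room for `needed` values.  Memory is reallocated
 * only when the record is wider than the current capacity, so a file
 * with a constant column count allocates once.
 * Returns 0 on success, -1 if the allocation fails. */
static int column_buffer_reserve(column_buffer *buf, size_t needed)
{
    assert(buf != NULL);
    if (needed <= buf->capacity)
        return 0;                    /* common case: nothing to do */

    double *grown = realloc(buf->values, needed * sizeof *grown);
    if (grown == NULL)
        return -1;                   /* the old storage is still valid */

    buf->values = grown;
    buf->capacity = needed;
    return 0;
}

/* Release the storage and reset the buffer to its empty state. */
static void column_buffer_free(column_buffer *buf)
{
    assert(buf != NULL);
    free(buf->values);
    buf->values = NULL;
    buf->capacity = 0;
}
```

A reader would start from an empty buffer ({NULL, 0}) and call 
column_buffer_reserve() with the column count of each record before 
storing values.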

>>I pose this question because I've been trying to make the case that
>>df_readascii() and df_readbinary(), or whatever, should be transparent
>>to the calling routine.  If functionality like above keeps being added
>>to df_readascii (df_readline) then soon the situation arises where
>>certain types of plots can't be done simply because the data comes from
>>a binary data file.
>>    
>>
>
>If that is indeed true then I have reservations about introducing binary
>input at all.  Are you saying that it will not be possible to read strings
>in from a binary file, so that the new "plot with labels" and 
>"using ...:xticlabels(<col>)" will not work?  If so, then the functionality
>has already diverged.  And if the two modes have different capabilities
>then all the more reason to keep them separate in the code as well.
>

No, certainly I could add reading strings from binary data files.  But I 
would propose making it a generic thing.  Say for example, a command 
line syntax (or it doesn't have to be command line, it could be an 
internal variable) whereby one of the columns can be designated as a 
string tic label, e.g., "ticlabel <col>", or whatever.  But its meaning 
is generic; it is just a string passed back and treated accordingly.  In 
the case of histograms it is a tic label.  Perhaps something different 
for something else.

However, my feeling about df_readline(), df_readascii(), df_readbinary() 
is that these should be core little routines (in scope anyway), a 
kernel if you will, that take in data and shuffle it off to somewhere 
else to be processed further.

If one mixes dedicated code like plot->histogram, etc. into 
df_readascii(), then one also needs to remember to make that change in 
df_readbinary().  If it is tweaked in one location, then it has to be 
touched in another spot, which might be forgotten.
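
Here is a toy illustration of what I mean by a kernel.  Everything in it 
(sketch_record, sketch_reader, read_ascii_record, read_binary_record, 
sum_record) is invented for this example and is not gnuplot code.  Both 
readers fill the same record type, and the only "plot dependent" step 
(here just a sum) lives in the caller, so it is written once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_COLS 8   /* arbitrary limit, for this sketch only */

/* What every reader hands back, whatever the file format. */
typedef struct {
    double values[SKETCH_MAX_COLS];
    int count;
} sketch_record;

/* Common reader signature: source bytes in, one record out. */
typedef int (*sketch_reader)(const void *src, size_t len, sketch_record *out);

/* ASCII reader: src is a NUL-terminated line of numbers; len is unused. */
static int read_ascii_record(const void *src, size_t len, sketch_record *out)
{
    const char *text = src;
    int consumed = 0;

    (void)len;
    assert(out != NULL);
    out->count = 0;
    while (out->count < SKETCH_MAX_COLS &&
           sscanf(text, "%lf%n", &out->values[out->count], &consumed) == 1) {
        text += consumed;
        out->count++;
    }
    return out->count;
}

/* Binary reader: src holds len bytes of native doubles. */
static int read_binary_record(const void *src, size_t len, sketch_record *out)
{
    size_t n = len / sizeof(double);

    assert(out != NULL);
    if (n > SKETCH_MAX_COLS)
        n = SKETCH_MAX_COLS;
    memcpy(out->values, src, n * sizeof(double));
    out->count = (int)n;
    return out->count;
}

/* Caller-side processing: identical for both formats. */
static double sum_record(sketch_reader reader, const void *src, size_t len)
{
    sketch_record rec;
    double total = 0.0;
    int n = reader(src, len, &rec);

    for (int i = 0; i < n; i++)
        total += rec.values[i];
    return total;
}
```

Adding a new kind of plot then means touching sum_record()'s counterpart 
in the caller, not each reader separately.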

So, maybe a version of df_readline as follows:

int df_readline(double vector[], char **string)

where now the vector can be of any length, and the string is a location 
where df_readline is to put a pointer to a character string that it 
dynamically allocates.  (It can be one, at most, of the columns treated 
as a string rather than a number.)
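
As a rough sketch of that interface, and only a sketch: the function 
below is called sketch_readline() to make clear it is not the real 
df_readline().  It parses a line already held in memory rather than 
reading from the open file, and it takes an extra max_values argument 
that my proposed prototype does not have.  Those are simplifying 
assumptions for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the proposed df_readline().
 * Parses one whitespace-separated text record:
 *   - numeric fields are stored in vector[] (at most max_values of them);
 *   - the first double-quoted field is copied into newly allocated
 *     memory and handed back through *string (NULL if there is none).
 * Returns the number of numeric values stored, or -1 if allocation fails.
 * The caller owns *string and must free() it. */
int sketch_readline(const char *line, double vector[], int max_values,
                    char **string)
{
    int count = 0;

    assert(line != NULL && string != NULL);
    *string = NULL;

    while (*line != '\0') {
        while (isspace((unsigned char)*line))
            line++;
        if (*line == '\0')
            break;

        if (*line == '"') {
            /* Quoted field: runs to the closing quote or end of line. */
            const char *start = line + 1;
            const char *end = strchr(start, '"');
            size_t len = end ? (size_t)(end - start) : strlen(start);

            if (*string == NULL) {          /* keep only the first one */
                *string = malloc(len + 1);
                if (*string == NULL)
                    return -1;
                memcpy(*string, start, len);
                (*string)[len] = '\0';
            }
            line = end ? end + 1 : start + len;
        } else {
            char *rest;
            double value = strtod(line, &rest);

            if (rest == line) {
                /* Not a number: skip this field entirely. */
                while (*line != '\0' && !isspace((unsigned char)*line))
                    line++;
                continue;
            }
            if (count < max_values)
                vector[count++] = value;
            line = rest;
        }
    }
    return count;
}
```

The caller decides what the string means: a tic label for histograms, 
perhaps something else for another plot style.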

Does this get around some problems?  Am I understanding the big issue 
now, that there are more columns now than max_cols?  I guess I'm asking 
that if the max_cols restriction were dropped, would the current set up 
allow you to move data into the plot structure as desired?  Is there a 
paradigm shift here for the way data can be arranged in the file for 
histograms?

>>Ethan, what is the minimal amount of information that you would need
>>coming back from df_readline() to implement headers from files?  If
>>df_readline() were equipped with a char pointer for which df_readline
>>could realloc() memory and assign a string, would that do it?
>>    
>>
>
>That's what it does now. Because plot->title is not visible from
>inside df_readline (which actually I would prefer), the title is allocated
>and a pointer to it is stored in a static variable. A helper routine 
>df_set_key_title() is later called from get_data(), which is indeed in 
>plot2d.c.  No global variables are involved.
>

Yeah, that is fine.  I assume that df_set_key_title() is not within 
df_readline().  My major point in all this is to keep df_readline() clean 
and generic, and in the long run it will promote happiness.

>>That is, I might propose
>>             add_tic_user(axis,temp_string,xpos,0);
>>could be moved to plot2d.c. 
>>    
>>
>
>You are confusing plot titles and axis tic labels.  The two things are
>quite different.  One is a specific property of the current plot,
>the other is not.
>
>I know, you are going to point to a single place where the histogram 
>code stuffs a plot title into an axis tic label.  I'm not terribly happy
>about that either, but let's split that off into a totally separate 
>discussion that only applies to stacked histograms.
>

No biggie.  Got to start somewhere.

>>PS:  I've concluded that moving df_readbinary() to another file would
>>require the sharing of too many "local" variables.
>>    
>>
>
>I don't agree.  Most of those local variables are indeed local.
>They should not *need* to be shared.
>

Well, here is the thing.  There is a certain element of this that can't 
be disentangled (if that's a word).  A lot of the parameters for reading 
from a file are set up by df_open() because it is there that the 
keywords from the command line are processed.  So, at the point of 
df_open() it isn't known yet whether the file is ascii or binary.  That 
could be fixed by first, at the start of df_open(), checking all the 
keywords to see if one is "binary", but that's not graceful.  So yes, 
a df_open_binary() could even be written in which all the keywords are 
interpreted again.  But why repeat all these in a different file if 
they are going to be pretty much the same?  "every" works the same, 
"thru" works the same, etc.
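
Roughly what I have in mind, as a sketch only: one keyword pass that 
treats "binary" as just another option next to the shared ones.  The 
names sketch_open_opts and sketch_parse_opts are invented, the keywords 
are assumed to arrive as an array of already-split tokens, and "every" 
is reduced to a single integer, which is much simpler than the real 
syntax.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result of parsing the open-time keywords once. */
typedef struct {
    int is_binary;   /* set when the "binary" keyword is seen */
    int every;       /* point increment from "every N"; 1 by default */
} sketch_open_opts;

/* Walk the keyword tokens a single time.  "every" is handled the same
 * way whether or not "binary" appears, so nothing is duplicated.
 * Returns 0 on success, -1 on an unknown keyword or a bad argument. */
static int sketch_parse_opts(const char *const tokens[], int ntokens,
                             sketch_open_opts *opts)
{
    assert(opts != NULL);
    opts->is_binary = 0;
    opts->every = 1;

    for (int i = 0; i < ntokens; i++) {
        if (strcmp(tokens[i], "binary") == 0) {
            opts->is_binary = 1;
        } else if (strcmp(tokens[i], "every") == 0 && i + 1 < ntokens) {
            opts->every = atoi(tokens[++i]);
            if (opts->every < 1)
                return -1;       /* increment must be positive */
        } else {
            return -1;           /* unknown keyword or missing argument */
        }
    }
    return 0;
}
```

After the pass, the caller can branch on is_binary to pick the reader, 
while every shared setting has been filled in exactly once.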

Let me make this revision, and maybe that will help things fall in place.

Dan
