From: Harald H. <h.h...@tu...> - 2004-08-13 09:34:37

In this mailing list there has been some discussion about the 'nolevel2' option of the postscript terminal, which has now been changed to 'ai' in CVS. I do not like the new name and prefer the suggestion 'level1' that has also been made here (or, even better, the suggestion described below). Some reasons:

- Newer Versions of Adobe Illustrator (at least AI CS 11) do understand the level 2 code. Thus, using 'ai' is misleading here.
- There are other programmes that fail using the new code, for example gv (while gs understands it).
- The option switches off Level 2 code and switches to Level 1, independently of where you use the output. Thus, naming it after what it does is better.

What about an option 'pslevel' that takes an argument:
- 'pslevel 1' restricts to Postscript level 1
- 'pslevel 2' allows Postscript level 2 (some day, PS level 3 features may be added)

With this method, future problems with level questions will be avoided.

What do you think?

Yours
Harald

--
Harald Harders                          Langer Kamp 8
Technische Universität Braunschweig     D-38106 Braunschweig
Institut für Werkstoffe                 Germany
E-Mail: h.h...@tu...                    Tel: +49 (5 31) 3 91-3062
WWW : http://www.harald-harders.de      Fax: +49 (5 31) 3 91-3058
From: Ethan M. <merritt@u.washington.edu> - 2004-08-13 15:41:47

On Friday 13 August 2004 02:33 am, Harald Harders wrote:
> - Newer Versions of Adobe Illustrator (at least AI CS 11) do understand
>   the level 2 code. Thus, using 'ai' is misleading here.

So we are still looking for a good name. That's OK with me; I'm just trying to make everyone happy.

> - There are other programmes that fail using the new code, for example gv
>   (while gs understands it).

??? gv works for me. The only issue is the one of anti-aliasing.

> What about an option 'pslevel' that takes an argument:
> - 'pslevel 1' restricts to Postscript level 1
> - 'pslevel 2' allows Postscript level 2 (some day, PS level 3 features may be added)
> With this method, future problems with level questions will be avoided.

But the code produced by the driver is *not* Level 1 PostScript, even without the Pattern definitions. Calling it Level 1 is just not correct.

Maybe instead of trying to find a generic label, we must settle for being absolutely specific:

    set term post {patternlevel2}

This describes precisely what the option does, at the cost of a long option name. OK, we could allow almost_equals("pat$ternlevel2")

> What do you think?
>
> Yours
> Harald

--
Ethan A Merritt                 merritt@u.washington.edu
Biomolecular Structure Center   Mailstop 357742
University of Washington, Seattle, WA 98195
From: Harald H. <h.h...@tu...> - 2004-08-13 15:55:59

On Fri, 13 Aug 2004, Ethan Merritt wrote:
> On Friday 13 August 2004 02:33 am, Harald Harders wrote:
> > - Newer Versions of Adobe Illustrator (at least AI CS 11) do understand
> >   the level 2 code. Thus, using 'ai' is misleading here.
>
> So we are still looking for a good name. That's OK with me;
> I'm just trying to make everyone happy.
>
> > - There are other programmes that fail using the new code, for example gv
> >   (while gs understands it).
>
> ??? gv works for me. The only issue is the one of anti-aliasing.

I have just noticed this today, too. I have prepared a bug report, since gstate, setgstate and currentgstate also do not work with the x11alpha device.

> > What about an option 'pslevel' that takes an argument:
> > - 'pslevel 1' restricts to Postscript level 1
> > - 'pslevel 2' allows Postscript level 2 (some day, PS level 3 features may be added)
> > With this method, future problems with level questions will be avoided.
>
> But the code produced by the driver is *not* Level 1 PostScript,
> even without the Pattern definitions. Calling it Level 1 is just not
> correct.

Mmmh, which are the Postscript level 2 commands used in Gnuplot? Wasn't it a good idea to make a pslevel option which really switches back to postscript level 1 syntax? Or does this involve too strong limitations?

> Maybe instead of trying to find a generic label, we must settle for
> being absolutely specific:
>
>     set term post {patternlevel2}

If it is not possible to restrict to level 1: What about

    set term post solidfill

versus

    set term post patternfill

maybe with shortcuts sf and pf?
From: Ethan M. <merritt@u.washington.edu> - 2004-08-13 19:55:35

On Friday 13 August 2004 08:55 am, Harald Harders wrote:
> > But the code produced by the driver is *not* Level 1 PostScript,
> > even without the Pattern definitions. Calling it Level 1 is just not
> > correct.
>
> Mmmh, which are the Postscript level 2 commands used in Gnuplot?

I apologize for what may have been an unwarranted assumption. You are right, and I was unduly accepting of the first line of the output file, which has been

    %!PS-Adobe-2.0

since at least gnuplot version 3.7.

In a quick check of the source for Level 2 extensions listed in the PostScript reference manual I don't see any obvious ones. I am not sure about whether our code assumes Level 2 behaviour with regard to font encodings.

> Wasn't it a good idea to make a pslevel option which really switches back to
> postscript level 1 syntax? Or does this involve too strong limitations?

It's a good idea. I thought it would involve actual work, but I guess I was too pessimistic. However, Daniel's upcoming binary + image patch depends on the Level2 operator "filter", so we will have to keep an eye on that.

So I am convinced. I'll change the syntax again, and make the option "level1".

--
Ethan A Merritt                 merritt@u.washington.edu
Biomolecular Structure Center   Mailstop 357742
University of Washington, Seattle, WA 98195
From: Harald H. <h.h...@tu...> - 2004-08-14 17:12:56

> You are right, and I was unduly accepting of the first line of the
> output file, which has been
>     %!PS-Adobe-2.0
> since at least gnuplot version 3.7

This is not the Postscript level but the version number of the DSC comment structure. Version 3.0 of the "PostScript Language Document Structuring Conventions Specification" is dated 25 September 1992, long before Postscript Level 3 was started. The Postscript header does not specify which Postscript level is used.

> In a quick check of the source for Level 2 extensions listed in the
> PostScript reference manual I don't see any obvious ones. I am not
> sure about whether our code assumes Level 2 behaviour with
> regard to font encodings.

I will have a look at the Postscript Language Reference to find out if we really are free of level2 code.

> However, Daniel's upcoming binary + image patch depends on the
> Level2 operator "filter", so we will have to keep an eye on that.

We should speak to him to ensure that also the filter operator is not used when using level1.
From: Harald H. <h.h...@tu...> - 2004-08-14 17:56:02

On Sat, 14 Aug 2004, Harald Harders wrote:
> > In a quick check of the source for Level 2 extensions listed in the
> > PostScript reference manual I don't see any obvious ones. I am not
> > sure about whether our code assumes Level 2 behaviour with
> > regard to font encodings.
>
> I will have a look at the Postscript Language Reference to find out if we
> really are free of level2 code.

I think the reencoding of fonts is also level 1. According to the Language Reference, /ISOLatin1Encoding is only predefined in Postscript Level 2. But the gnuplot postscript files (re)define /ISOLatin1Encoding anyway, so that a Postscript Level 2 interpreter is not necessary.

As a test I have tried to view a postscript output file without the definition of /ISOLatin1Encoding, and it works both with gs and with a Level-2 printer.

Yours
Harald
From: Daniel J S. <dan...@ie...> - 2004-08-14 18:44:51

Harald Harders wrote:
> > However, Daniel's upcoming binary + image patch depends on the
> > Level2 operator "filter", so we will have to keep an eye on that.
>
> We should speak to him to ensure that also the filter operator is not used
> when using level1.

Here are the commands that the image routine uses:

    %%%%BeginImage
    gsave translate scale [ M 0 0 N 0 0 ]
    currentfile /ASCII85Decode filter false
    %%%%BeginPalette
    [ /Indexed\n /DeviceRGB <bunch of characters for palette> ] setcolorspace
    %%%%EndPalette
    << /ImageType 1 /Width M /Height N /BitsPerComponent
       /ImageMatrix [ M 0 0 N 0 0 ] /Decode [ 0 #]
       /DataSource currentfile /ASCII85Decode filter
       /MultipleDataSources false /Interpolate false >>
    %%%%BeginData
    colorimage image
    <bunch of characters for image>
    %%%%EndData
    grestore
    %%%%EndImage

The only non level-1 suspect commands might be

    /DataSource currentfile
    /Interpolate false

The filter Ethan refers to is the ASCII85Decode filter, which I'm guessing would have been in PostScript from the beginning. The reason for the /Interpolate false is that for a scientific plotting program there shouldn't be any additional image processing done beyond what the user may have applied or intended.

Dan
From: Ethan M. <merritt@u.washington.edu> - 2004-08-14 19:29:44

On Saturday 14 August 2004 12:10 pm, Daniel J Sebald wrote:
> The filter Ethan refers to is the ASCII85Decode filter, which I'm
> guessing would have been in PostScript from the beginning.

No. I am referring to the PostScript command "filter", which is a Level 2 extension.
From: Daniel J S. <dan...@ie...> - 2004-08-14 18:30:53

Hope this doesn't sound like a lecture, but I want to discuss how to keep datafile.c clean and prevent the evolution of convoluted code that is starting to occur with that file.

Having done quite a bit of modular and object oriented code for a few years, I'd like to discourage the use of global variables across two conceptually different pieces of code. In particular, in this instance it is the global variable "df_datum". Let's begin there.

I see that before the histogram strings code was added, df_datum was used outside of datafile.c in one--and only one--case, plot2d.c. I would discourage making df_datum available to the outside world in datafile.h. Here is its use:

    case 0:         /* not blank line, but df_readline couldn't parse it */
        {
            df_close();
            int_error(current_plot->token, "Bad data on line %d", df_line_number);
        }
    case 1:
        {           /* only one number */
            /* x is index, assign number to y */
            v[1] = v[0];
            v[0] = df_datum;
            /* nobreak */
        }
    case 2:         /* x, y */
        /* ylow and yhigh are same as y */

OK. The idea is simple; if only one variable is read from the file then the index into the file should be used as the x value and the read variable should be used as y. But I'd like to point out a few things.

First, consider what is being done here. The use of df_readline() starting the loop, i.e.,

    while ((j = df_readline(v, max_cols)) != DF_EOF) {

retrieves from df_readline the variables (v[]) and the number of variables read (j) in a modular fashion. However, plot2d.c retrieves a short time later something also generated by df_readline (df_datum) as a global variable. That is not good. If you were evaluating that practise for, say, a software exam, ask yourself what grade you would give it.

Alright then, so how to avoid using df_datum like that? Well, I would say that the case 1 situation above really could be easily moved into df_readline itself (** referenced later). Notice that if at the end of df_readline() we knew that this was for 2D data (in the current case, that is a given fact because reading 3D data is its own special routine) we could have this exact same code. For example, at the end of df_readline():

    if (df_plot_mode == MODE_PLOT && output == 1) {
        v[1] = v[0];
        v[0] = df_datum;
        output++;
    }

The df_plot_mode variable is something I invented for the image stuff; it is recorded upon opening the file. (It isn't the main focus of this point.) You might argue, well, then df_readline() needs to know something about how the data is going to be used. Yes, but all it needs to know is that it is intended for 2D data or 3D data. In other words, we are making available to df_readline some knowledge of what the *minimum* allowable number of variables is. In that scenario, our case statement would conclude that *both* 0 variables returned and 1 variable returned are invalid, i.e.,

    case 0:         /* not blank line, but df_readline couldn't parse it */
    case 1:
        {
            df_close();
            int_error(current_plot->token, "Bad data on line %d", df_line_number);
        }
    case 2:         /* x, y */
        /* ylow and yhigh are same as y */

In some sense the test for case 0 and case 1 is merely a sanity check. It is when the number of returned variables is 2 or greater that the meaning of the data is open for interpretation based upon the plot style, etc.

Alright, so let me summarize three alternatives with the various concepts:

1. Allow the use of df_datum as a global variable, leaving the code as is. This encourages people to use df_datum at will, and that is starting to happen. It isn't exactly modular and clean practise.

2. Pass into df_readline() a variable indicating the minimum number of columns that is valid, e.g., df_readline(v[], mincols, maxcols), where in plot2d.c mincols would be 2 and in plot3d.c mincols would be 3. I think you understand the point. This is probably the most appropriate "modular" syntax, but I would say that passing in that mincols has little advantage if in fact we know it will always be 2 or 3 depending upon the plot mode.

3. What I described above. It is conceptually the same as 2) but instead the information is passed to datafile.c via df_open(int max_using, int plot_mode), where in plot2d.c plot_mode would be MODE_PLOT and in plot3d.c plot_mode would be MODE_SPLOT. Like case 2), df_datum no longer needs to be globally available.

I would opine that alternative #3 is a good balance (a sketch of what this could look like follows at the end of this message).

Now, having said all that, I'd like to point out some code that Ethan added to df_readline() somewhere:

    /* FIXME EAM - Trap special case of only a single 'using' column. */
    /* But really we need to handle general case of implicit column 0 */
    if (output == 1)
        xpos = (axcol == 0) ? df_datum : v[axcol-1];
    else
        xpos = v[axcol];

I think that is exactly the concept I've described in 3) above. (**) So you can see how things are starting to get convoluted.

While on this topic of df_readline, I wonder if introducing too much "plot dependent" stuff into df_readline is a good idea. For example, with the histogram tics, this kind of line seems like it shouldn't be in a file-reading routine:

    add_tic_user(axis,temp_string,xpos,0);

Is there some way to move this functionality outside of df_readline() back into plot2d.c? I pose this question because I've been trying to make the case that df_readascii() and df_readbinary(), or whatever, should be transparent to the calling routine. If functionality like above keeps being added to df_readascii (df_readline) then soon the situation arises where certain types of plots can't be done simply because the data comes from a binary data file.

Ethan, what is the minimal amount of information that you would need coming back from df_readline() to implement headers from files? If df_readline() were equipped with a char pointer for which df_readline could realloc() memory and assign a string, would that do it? That is, I might propose

    df_readline(v[], maxcols, string)

where string is a character pointer. Then

    add_tic_user(axis,temp_string,xpos,0);

could be moved to plot2d.c. I see absolutely nothing wrong with that addition. I think that is much cleaner than working with so many global variables.

In any case, if my comments have motivated anyone to change something, please hold off until after the image patch... or if you want me to make an attempt at removing df_datum from outside datafile.c as part of the image patch, I can do that.

Dan

PS: I've concluded that moving df_readbinary() to another file would require the sharing of too many "local" variables. So I'm thinking that just putting it to the bottom of datafile.c is best... that's actually where it probably should be organized anyway.
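A minimal, self-contained sketch of what alternative #3 could look like, assuming the MODE_PLOT/MODE_SPLOT names used above; the stub reader and its data are purely illustrative and are not code from the actual patch:

    #include <stdio.h>

    /* Sketch of alternative #3: the plot mode is handed to df_open() once,
     * and df_readline() uses it to supply the implicit x column, so that
     * df_datum never needs to be exported from datafile.c.               */

    typedef enum { MODE_PLOT, MODE_SPLOT } df_mode;   /* names proposed above */

    static df_mode df_plot_mode;   /* recorded by df_open(), private to datafile.c */
    static int     df_datum;       /* private point counter, no longer exported    */

    static void df_open(const char *name, int max_using, df_mode plot_mode)
    {
        (void) name; (void) max_using;
        df_plot_mode = plot_mode;
        df_datum = -1;
    }

    /* Toy reader: pretends each "line" of the file holds one number (4.0). */
    static int df_readline(double v[], int max_cols)
    {
        int output = 1;            /* number of columns actually read       */
        (void) max_cols;
        df_datum++;
        v[0] = 4.0;
        if (df_plot_mode == MODE_PLOT && output == 1) {
            v[1] = v[0];           /* single column: value becomes y ...    */
            v[0] = df_datum;       /* ... and the point index becomes x     */
            output++;
        }
        return output;
    }

    int main(void)
    {
        double v[2];
        df_open("demo.dat", 2, MODE_PLOT);     /* as plot2d.c would call it */
        for (int i = 0; i < 3; i++) {
            int j = df_readline(v, 2);
            printf("%d columns: x=%g y=%g\n", j, v[0], v[1]);
        }
        return 0;
    }

In plot3d.c the only difference would be passing MODE_SPLOT to df_open(), and the single-column fixup would then not apply.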
From: Daniel J S. <dan...@ie...> - 2004-08-14 19:46:41

Daniel J Sebald wrote:
> Having done quite a bit of modular and object oriented code for a
> few years, I'd like to discourage the use of global variables across
> two conceptually different pieces of code. In particular, in this
> instance it is the global variable "df_datum".

I see now there are several variables made global outside datafile.c:

    * public variables declared in this file.
    * int df_no_use_specs - number of columns specified with 'using'
    * int df_no_tic_specs - count of additional ticlabel columns
    * int df_line_number  - for error reporting
    * int df_datum        - increases with each data point
    * TBOOLEAN df_binary  - it's a binary file
    *     [ might change this to return value from df_open() ]
    * int df_eof          - end of file
    * int df_timecol[]    - client controls which cols read as time

Well, then it's starting to get to be a lot of work to adhere to a modular strategy. Anyway, things like df_line_number could be done differently. From what I see, that variable is used outside of datafile.c only as an argument to error messages when things fail, e.g.,

    int_error(this_plot->token,
              "2 columns only possible with explicit pm3d style (line %d)",
              df_line_number);

Another approach might be to create a function df_int_error() which acts just like int_error() but in addition tags on information about the file line number it is currently at and then calls int_error(). That way, error messages for the line number can be made consistent and easily altered if for some reason more information is to be added.

Dan
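A small self-contained sketch of the df_int_error() idea described above; gnuplot's real int_error() is replaced here by a printing stub, and in gnuplot itself the wrapper would sit inside datafile.c next to df_line_number:

    #include <stdarg.h>
    #include <stdio.h>

    static int df_line_number = 42;     /* stays private to datafile.c        */

    /* Stub standing in for gnuplot's int_error(); it just prints here. */
    static void int_error(int token, const char *fmt, ...)
    {
        va_list ap;
        fprintf(stderr, "error near token %d: ", token);
        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);
        fputc('\n', stderr);
    }

    /* Proposed wrapper: callers report data-file errors without ever
     * touching df_line_number directly.                                */
    static void df_int_error(int token, const char *fmt, ...)
    {
        char msg[256];
        va_list ap;
        va_start(ap, fmt);
        vsnprintf(msg, sizeof msg, fmt, ap);
        va_end(ap);
        int_error(token, "%s (line %d of data file)", msg, df_line_number);
    }

    int main(void)
    {
        /* what plot2d.c might call instead of int_error(..., df_line_number) */
        df_int_error(7, "2 columns only possible with explicit pm3d style");
        return 0;
    }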
From: Ethan M. <merritt@u.washington.edu> - 2004-08-14 20:22:08

On Saturday 14 August 2004 11:56 am, Daniel J Sebald wrote:
> Hope this doesn't sound like a lecture, but I want to discuss how to
> keep datafile.c clean and prevent the evolution of convoluted code that
> is starting to occur with that file.

[snip lengthy rant, some of which is on target, some not]

Didn't we already have this discussion a few months ago? I proposed that the whole notion of tracking input data by how many columns were read in has outlived its usefulness. I think we should get rid of max_cols and all the various tests that depend on it, and instead pass explicit information about the requested input data. Hans-Bernard disagreed.

>     df_open(int max_using, int plot_mode)

Like that, yes, except that (1) I think max_using is not necessary or desirable, and (2) I proposed passing a pointer to the whole plot structure rather than passing only the plot style.

> While on this topic of df_readline, I wonder if introducing too much
> "plot dependent" stuff into df_readline is a good idea.

And that was Hans-Bernard's counterargument.

> For example,
> with the histogram tics, this kind of line seems like it shouldn't be in
> a file-reading routine:
>
>     add_tic_user(axis,temp_string,xpos,0);
>
> Is there some way to move this functionality outside of df_readline()
> back into plot2d.c?

Why? It is not specific to 2D plots. But anyway, the answer is no. The information being processed, the tic labels, are not specific to the current plot; they are a property of the axis. The code belongs in axis.c, which is where it is currently. But still you have to call it from somewhere, and I maintain the logical place (maybe the only possible place) is the point at which you obtain the information. That is set.c in the case of axis tic info coming from a "set [xyz]tics" command, and datafile.c in the case of tic info read in from a file.

> I pose this question because I've been trying to make the case that
> df_readascii() and df_readbinary(), or whatever, should be transparent
> to the calling routine. If functionality like above keeps being added
> to df_readascii (df_readline) then soon the situation arises where
> certain types of plots can't be done simply because the data comes from
> a binary data file.

If that is indeed true then I have reservations about introducing binary input at all. Are you saying that it will not be possible to read strings in from a binary file, so that the new "plot with labels" and "using ...:xticlabels(<col>)" will not work? If so, then the functionality has already diverged. And if the two modes have different capabilities then all the more reason to keep them separate in the code as well.

> Ethan, what is the minimal amount of information that you would need
> coming back from df_readline() to implement headers from files? If
> df_readline() were equipped with a char pointer for which df_readline
> could realloc() memory and assign a string, would that do it?

That's what it does now. Because plot->title is not visible from inside df_readline (which actually I would prefer), the title is allocated and a pointer to it is stored in a static variable. A helper routine df_set_key_title() is later called from get_data(), which is indeed in plot2d.c. No global variables are involved.

> That is, I might propose
>     add_tic_user(axis,temp_string,xpos,0);
> could be moved to plot2d.c.

You are confusing plot titles and axis tic labels. The two things are quite different. One is a specific property of the current plot, the other is not.

I know, you are going to point to a single place where the histogram code stuffs a plot title into an axis tic label. I'm not terribly happy about that either, but let's split that off into a totally separate discussion that only applies to stacked histograms.

> PS: I've concluded that moving df_readbinary() to another file would
> require the sharing of too many "local" variables.

I don't agree. Most of those local variables are indeed local. They should not *need* to be shared.

> So I'm thinking that
> just putting it to the bottom of datafile.c is best... that's actually
> where it probably should be organized anyway.

All right. Do that as a first step. At least it may disentangle the code paths enough that I can do an xxdiff between the new and old versions to see what you are actually changing. Right now it's such a tangle that I don't have an overall sense of what is shared and what isn't.
From: Daniel J S. <dan...@ie...> - 2004-08-14 22:25:31

Ethan Merritt wrote:
> On Saturday 14 August 2004 11:56 am, Daniel J Sebald wrote:
> > Hope this doesn't sound like a lecture, but I want to discuss how to
> > keep datafile.c clean and prevent the evolution of convoluted code that
> > is starting to occur with that file.
>
> [snip lengthy rant, some of which is on target, some not]

:-)

> Didn't we already have this discussion a few months ago?
> I proposed that the whole notion of tracking input data by how
> many columns were read in has outlived its usefulness.
> I think we should get rid of max_cols and all the various
> tests that depend on it, and instead pass explicit information
> about the requested input data. Hans-Bernard disagreed.

Well, yeah. Lots of disagreement, but not much agreement on what the paradigm should be, and I'm suggesting that there be some agreement to avoid too much divergence. The discussion sort of faded...

> > df_open(int max_using, int plot_mode)
>
> Like that, yes, except that
> (1) I think max_using is not necessary or desirable, and
> (2) I proposed passing a pointer to the whole plot
> structure rather than passing only the plot style.

Another paradigm is fine, passing in a pointer to the whole plot is fine. Just not a combination of multiple views. Some of the image stuff may fit one or the other paradigm, so it might be good to adhere to only one in the near future.

I know that for ASCII files the number of columns can be determined by the file itself and gnuplot readjusts accordingly. That code is currently in plot2d.c. Will that remain there? Or will that be moved to inside datafile.c as part of df_readline? df_open?

> > While on this topic of df_readline, I wonder if introducing too much
> > "plot dependent" stuff into df_readline is a good idea.
>
> And that was Hans-Bernard's counterargument.
>
> > For example,
> > with the histogram tics, this kind of line seems like it shouldn't be in
> > a file-reading routine:
> >
> >     add_tic_user(axis,temp_string,xpos,0);
> >
> > Is there some way to move this functionality outside of df_readline()
> > back into plot2d.c?
>
> Why? It is not specific to 2D plots. But anyway, the answer is no.
> The information being processed, the tic labels, are not specific to
> the current plot; they are a property of the axis. The code belongs
> in axis.c, which is where it is currently. But still you have to call it
> from somewhere, and I maintain the logical place (maybe the only
> possible place) is the point at which you obtain the information.
> That is set.c in the case of axis tic info coming from a "set [xyz]tics"
> command, and datafile.c in the case of tic info read in from a file.

OK, let me back up here. I think I see now the more important issue here is that the data to be plotted, the imigration.dat file for example, won't work because it has more columns than allowed by max_cols passed into the df_readline routine. That is, the normal gnuplot ascii file looks like

       **
    "string"  <data>
    "string"  <data>
    "string"  <data>

But the 'imigration.dat' file is

       **
    "string"  "string"  ....  "string"
    <data>    <data>    ....  <data>

where the data which is to serve as the tic labels (read as a string rather than a number) is contained in the ** element.

I don't have all the answers, but I'll make some comments. In the latter case, that bunch of strings at the start of the file could all be read at once. In fact, it is almost similar in strategy to the "gnuplot binary" type of file where along the top are the x values and the first column afterward is the y_values.

Also, the df_readline() routine might be easily arranged to remove the max_cols restriction and make the value of j returned dynamic, from 2, 3, 4, etc. all the way up to 500 if one wants. It may just mean dynamic allocation of memory (which doesn't need to be reallocated if the number of read values doesn't change, thus saving efficiency).

> > I pose this question because I've been trying to make the case that
> > df_readascii() and df_readbinary(), or whatever, should be transparent
> > to the calling routine. If functionality like above keeps being added
> > to df_readascii (df_readline) then soon the situation arises where
> > certain types of plots can't be done simply because the data comes from
> > a binary data file.
>
> If that is indeed true then I have reservations about introducing binary
> input at all. Are you saying that it will not be possible to read strings
> in from a binary file, so that the new "plot with labels" and
> "using ...:xticlabels(<col>)" will not work? If so, then the functionality
> has already diverged. And if the two modes have different capabilities
> then all the more reason to keep them separate in the code as well.

No, certainly I could add reading strings from binary data files. But I would propose making it a generic thing. Say for example, a command line syntax (or it doesn't have to be command line, it could be an internal variable) whereby one of the columns can be designated as a string tic label, e.g., "ticlabel <col>", or whatever. But its meaning is generic, it is just a string passed back and treated accordingly. In the case of histograms it is a tic label. Perhaps something different for something else.

However, my feeling about df_readline(), df_readascii(), df_readbinary() is that these should be core little routines (in scope anyway), a kernel if you will, that take in data and shuffle it off to somewhere else to be processed further. If one mixes dedicated code like plot->histogram, etc. into df_readascii(), then they also need to remember to make that change in df_readbinary(). If it is tweaked in one location, then it has to be touched in another spot, which might go forgotten. So, maybe a version of df_readline as follows:

    int df_readline(double vector[], char **string)

where now the vector can be of any length, and the string is a location where df_readline is to put a pointer to a character string that it dynamically allocates. (It can be one, at most, of the columns treated as a string rather than a number.)

Does this get around some problems? Am I understanding the big issue now, that there are more columns now than max_cols? I guess I'm asking that if the max_cols restriction were dropped, would the current set up allow you to move data into the plot structure as desired? Is there a paradigm shift here for the way data can be arranged in the file for histograms?

> > Ethan, what is the minimal amount of information that you would need
> > coming back from df_readline() to implement headers from files? If
> > df_readline() were equipped with a char pointer for which df_readline
> > could realloc() memory and assign a string, would that do it?
>
> That's what it does now. Because plot->title is not visible from
> inside df_readline (which actually I would prefer), the title is allocated
> and a pointer to it is stored in a static variable. A helper routine
> df_set_key_title() is later called from get_data(), which is indeed in
> plot2d.c. No global variables are involved.

Yeah, that is fine. I assume that df_set_key_title() is not within df_readline(). My major point in all this is to keep df_readline() clean and generic, and in the long run it will promote happiness.

> > That is, I might propose
> >     add_tic_user(axis,temp_string,xpos,0);
> > could be moved to plot2d.c.
>
> You are confusing plot titles and axis tic labels. The two things are
> quite different. One is a specific property of the current plot,
> the other is not.
>
> I know, you are going to point to a single place where the histogram
> code stuffs a plot title into an axis tic label. I'm not terribly happy
> about that either, but let's split that off into a totally separate
> discussion that only applies to stacked histograms.

No biggie. Got to start somewhere.

> > PS: I've concluded that moving df_readbinary() to another file would
> > require the sharing of too many "local" variables.
>
> I don't agree. Most of those local variables are indeed local.
> They should not *need* to be shared.

Well, here is the thing. There is a certain element of this that can't be disentangled (if that's a word). A lot of the parameters for reading from a file are set up by df_open() because it is there that the keywords from the command line are processed. So, at the point of df_open() it isn't known yet whether the file is ascii or binary. That could be fixed by first, at the start of df_open, checking all the keywords to see if one is "binary", but that's not graceful. So, yes, even a df_open_binary() could be generated where all the keywords are again interpreted. But why repeat all these in a different file if they are going to be pretty much the same? "every" works the same, "thru" works the same, etc.

Let me make this revision, and maybe that will help things fall in place.

Dan
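To illustrate the interface proposed above, a toy self-contained sketch in which the reader hands one dynamically allocated string back to the caller, so that plot2d.c rather than datafile.c would decide whether to call add_tic_user(); the extra line argument and the trivial parsing stand in for the real data-file handling and are assumptions of this sketch only:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sketch of the proposed interface: the reader returns the numeric
     * columns in v[] and, at most, one dynamically allocated string;
     * the caller owns the string and decides how to use it.           */
    static int df_readline(double v[], char **string, const char *line)
    {
        /* toy parser: expects   "label" number   on each input line    */
        char label[64];
        double y;
        if (sscanf(line, " \"%63[^\"]\" %lf", label, &y) != 2)
            return 0;                  /* couldn't parse this line      */
        v[0] = y;
        free(*string);                 /* discard any previous label    */
        *string = strdup(label);       /* hand the label to the caller  */
        return 1;
    }

    int main(void)
    {
        const char *file[] = { "\"Jan\" 10.5", "\"Feb\" 12.0", "\"Mar\" 9.75" };
        double v[8];
        char *ticlabel = NULL;

        for (size_t i = 0; i < sizeof file / sizeof file[0]; i++) {
            if (df_readline(v, &ticlabel, file[i]) == 1) {
                /* here plot2d.c, not datafile.c, would call
                 * add_tic_user(axis, ticlabel, xpos, 0);               */
                printf("x=%zu  y=%g  ticlabel=%s\n", i, v[0], ticlabel);
            }
        }
        free(ticlabel);
        return 0;
    }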
From: Ethan M. <merritt@u.washington.edu> - 2004-08-14 23:00:29

On Saturday 14 August 2004 03:50 pm, you wrote:
> OK, let me back up here. I think I see now the more important issue
> here is that the data to be plotted, the imigration.dat file for
> example, won't work because it has more columns than allowed by max_cols
> passed into the df_readline routine.

No, that's completely wrong. The data is being plotted 1 column at a time. Sure there are lots of columns in the data file, but there's nothing special about that.

>        **
>     "string"  "string"  ....  "string"
>     <data>    <data>    ....  <data>
>
> where the data which is to serve as the tic labels (read as a string
> rather than a number) is contained in the ** element.

No. The scheme you describe does not correspond to any of the histogramming modes I implemented. Just forget about histograms for the purpose of this discussion. You are going way off on a tangent because you have misunderstood how the current code works. I don't ever need to read in more than 2 columns of data for histogramming. And the histogram code doesn't use strings anyhow, except insofar as you normally want to specify tic labels to go with your histograms. But the tic label business and the histograms are separate bits of code.

[snip]

> No, certainly I could add reading strings from binary data files. But I
> would propose making it a generic thing. Say for example, a command
> line syntax (or it doesn't have to be command line, it could be an
> internal variable) whereby one of the columns can be designated as a
> string tic label, e.g., "ticlabel <col>", or whatever. But its meaning
> is generic, it is just a string passed back and treated accordingly. In
> the case of histograms it is a tic label. Perhaps something different
> for something else.

Ugh. Daniel. Please try to understand what the current code is doing. You're just so far off base I don't know how to reply to your comments except to say they are irrelevant.

Hint: In the current code every requested column is returned twice, once as a number and once as a string. The caller can choose whether it wants the string value or the numeric value. I don't know how this fits in with your binary data files, but I assure you it is fully generic. The histogram code doesn't use this anyhow; histogramming is not about strings, it's about columns of numbers. You are, I am guessing, thinking about my *other* new plotting mode - 'with labels'.

> Does this get around some problems? Am I understanding the big issue
> now, that there are more columns now than max_cols?

No. Nothing at all like that. I want to get rid of max_cols not because I want lots of columns, but just because it is not used for anything that really has to do with columns. It is only used to try to deduce back to what the plot type is - which I think is nuts. If you need to know the plot type then just pass the plot type.

> Well, here is the thing. There is a certain element of this that can't
> be disentangled (if that's a word). A lot of the parameters for reading
> from a file are set up by df_open() because it is there that the
> keywords from the command line are processed. So, at the point of
> df_open() it isn't known yet whether the file is ascii or binary. That
> could be fixed by first, at the start of df_open, checking all the
> keywords to see if one is "binary", but that's not graceful.

Hunh? Now I'm the one who is confused. I really have not been looking at that part of your patch because I have no use for binary input. But I assumed you told the program *somehow* that this was a binary data file. How does this work at all if there isn't a keyword on the command line?

> Let me make this revision, and maybe that will help things fall in place.

OK. I will wait and have a look at it. But I seriously hope that you can basically leave all the existing code in df_readline() untouched.
From: Daniel J S. <dan...@ie...> - 2004-08-15 01:55:38

Ethan Merritt wrote:
> On Saturday 14 August 2004 03:50 pm, you wrote:
> > OK, let me back up here. I think I see now the more important issue
> > here is that the data to be plotted, the imigration.dat file for
> > example, won't work because it has more columns than allowed by max_cols
> > passed into the df_readline routine.
>
> No, that's completely wrong. The data is being plotted 1 column at
> a time. Sure there are lots of columns in the data file, but there's
> nothing special about that.

OK, got it.

> Hint: In the current code every requested column is returned
> twice, once as a number and once as a string. The caller can
> choose whether it wants the string value or the numeric value.
> I don't know how this fits in with your binary data files, but I
> assure you it is fully generic. The histogram code doesn't use
> this anyhow; histogramming is not about strings, it's about
> columns of numbers. You are, I am guessing, thinking about
> my *other* new plotting mode - 'with labels'.

Inside df_readline() is the if statement:

    if (use_spec[output].expected_type >= CT_XTICLABEL) {

and inside of that case is an instruction:

    if (df_current_plot)
        xpos += df_current_plot->histogram->start;

and inside of graphics.h is the definition of histogram for curve_points:

    struct histogram_style *histogram;   /* Only used if plot_style == HISTOGRAM */

If I follow, only the histogram plot style, then, can make use of that particular block of code inside df_readline(). That, or one needs to realize that histogram->start can be used in a generic fashion. I guess it isn't that big of a deal, but it just seems like the use_spec[] portion of df_readline() is growing very large and doing lots of specific stuff.

> > Does this get around some problems? Am I understanding the big issue
> > now, that there are more columns now than max_cols?
>
> No. Nothing at all like that. I want to get rid of max_cols not because
> I want lots of columns, but just because it is not used for anything that
> really has to do with columns. It is only used to try to deduce back to
> what the plot type is - which I think is nuts. If you need to know the plot
> type then just pass the plot type.

Oh, well yeah, guessing the structure from the max cols is kind of silly.

> > Well, here is the thing. There is a certain element of this that can't
> > be disentangled (if that's a word). A lot of the parameters for reading
> > from a file are set up by df_open() because it is there that the
> > keywords from the command line are processed. So, at the point of
> > df_open() it isn't known yet whether the file is ascii or binary. That
> > could be fixed by first, at the start of df_open, checking all the
> > keywords to see if one is "binary", but that's not graceful.
>
> Hunh? Now I'm the one who is confused. I really have not been
> looking at that part of your patch because I have no use for binary
> input. But I assumed you told the program *somehow* that this
> was a binary data file. How does this work at all if there isn't a
> keyword on the command line?

What I'm saying is that if all the parameters that are controlled by the command line keywords are to have their own "local instance" in a different file, say "binfile.c", that will mean there has to be a variant of df_open() just for binary files, which resides inside "binfile.c". It will look extremely similar to the current df_open(). I'd like to avoid that sort of thing, i.e., code repetition.

Dan
From: Ethan M. <merritt@u.washington.edu> - 2004-08-15 00:08:27

On Saturday 14 August 2004 03:50 pm, Daniel J Sebald wrote:
> Ethan Merritt wrote:
> > Didn't we already have this discussion a few months ago?
> > I proposed that the whole notion of tracking input data by how
> > many columns were read in has outlived its usefulness.
> > I think we should get rid of max_cols and all the various
> > tests that depend on it, and instead pass explicit information
> > about the requested input data. Hans-Bernard disagreed

Oh, and apologies for the typo in Hans-Bernhard's name.

> I know that for ASCII files the number of columns can be determined by
> the file itself and gnuplot readjusts accordingly.

Which brings up another issue. The description of your binary read "format" commands looks *really* fragile. I mean the stuff being parsed in plot_option_binary_format(). I am seriously worried that it won't transfer well across 32/64 bit machines, that it won't handle string data, and worst of all that it requires too much user-knowledge of file and data types. Basically I don't like it.

[EAM puts on geezer hat] In the old days of Fortran programming and VMS file systems, binary files had actual "records". In those days there was an obvious parallel between "columns" in an ascii file and "records" in a binary file. But that approach has been drowned by the unix notion that "everything is a stream of bytes". It's *really hard* to figure out what data is in a binary stream, and I am dubious that it is worth spending thousands of lines of code in gnuplot trying to do so. The unix way in such a case would be to run the input binary data through a tailored filter on its way into gnuplot. That way gnuplot only has to know about ascii input, and you can debug a suitable filter for your application without having to recode gnuplot.

Your docs say

    + Gnuplot will retrieve a number of binary
    + variables equal to the largest column specified in the `<using list>`.
    + For example, `using 1:3` would cause three columns to be read, of which
    + the second will be ignored.

So how do you handle the case of 10 logical columns of data in the file, of which you only want to read the 2nd and 4th? How do you skip "columns" 5 to 10 of each "line"?

What constitutes the logical equivalent of a "blank line" in your binary files? Or is there no equivalent to the auto-determination of scan lines?

Do you plan to handle strings? How? Would you require a full "binary format" description in this case? Is there such a thing as a matrix of strings?

The matrix variant is far more straight-forward. I would think this will be by far the most common use anyhow, and it would cover the pixel images that you obviously have fondness for. Could we maybe have a first cut version of this patch that only deals with matrix format binary data?
From: Daniel J S. <dan...@ie...> - 2004-08-15 03:28:58

Ethan Merritt wrote:
> On Saturday 14 August 2004 03:50 pm, Daniel J Sebald wrote:
> > I know that for ASCII files the number of columns can be determined by
> > the file itself and gnuplot readjusts accordingly.
>
> Which brings up another issue. The description of your binary read
> "format" commands looks *really* fragile. I mean the stuff being parsed
> in plot_option_binary_format(). I am seriously worried that
> it won't transfer well across 32/64 bit machines, that it won't handle
> string data, and worst of all that it requires too much user-knowledge
> of file and data types. Basically I don't like it.

Don't have a 64 bit machine to try this on. But the question as to how it will transfer is a matter of how data is stored in the file. Is there a 64-bit IEEE floating point format? There probably is. 32-bit floats in files are still certainly readable. 64-bit should work so long as the native file byte order matches the CPU/compiler byte order.

> [EAM puts on geezer hat] In the old days of Fortran programming and
> VMS file systems, binary files had actual "records". In those days
> there was an obvious parallel between "columns" in an ascii file
> and "records" in a binary file. But that approach has been drowned
> by the unix notion that "everything is a stream of bytes".

I know, that's the crux.

> It's *really hard* to figure out what data is in a binary stream, and I
> am dubious that it is worth spending thousands of lines of code in
> gnuplot trying to do so. The unix way in such a case would be to
> run the input binary data through a tailored filter on its way into
> gnuplot. That way gnuplot only has to know about ascii input, and
> you can debug a suitable filter for your application without having
> to recode gnuplot.

The problem is that an image of say 500 x 500 pixels gets very big in ASCII.

> Your docs say
>     + Gnuplot will retrieve a number of binary
>     + variables equal to the largest column specified in the `<using list>`.
>     + For example, `using 1:3` would cause three columns to be read, of which
>     + the second will be ignored.
> So how do you handle the case of 10 logical columns of data in the file,
> of which you only want to read the 2nd and 4th? How do you skip "columns"
> 5 to 10 of each "line"?

"format" is supposed to be analogous to the "using" format string, so something like the following should work

    plot "datafile.dat" binary format ="%10float" using 2:4

(But actually, I see there is a bug because 10 is greater than MAX_COLS, which is a silly restriction in the code... I'll fix that.) Or, if there were a mix of variable types,

    plot "datafile.dat" binary format = "%*int16%float32%*float32%int16%3*int%3*float"

I agree that unless one uses this a lot it is a bit arcane. But recall, one of the primary uses is for automation. Passing an image from Octave to Gnuplot in binary is an example. Once the image() script in Octave is written with the proper format string, there is no need to deal with that again in Octave.

> What constitutes the logical equivalent of a "blank line" in your binary
> files? Or is there no equivalent to the auto-determination of scan lines?

A blank line occurs when the scan line reaches its end. For example, here is the scatter2 example from the image demo

    splot 'scatter2.bin' binary record=30,30,29,26 endian=little using 1:2:3

which means blank lines occur at the 30th line, 60th line, etc. Whatever application is sending the data to gnuplot must know the quantity being sent. If that information is stored within the datafile and must be interpreted, then that requires additional routines, an example of which Petr has supplied. Such routines are easy to link in, but I'm not enthusiastic about writing all sorts of binary file routines for the bazillion different formats in the computer world. My original goal with all this was to quickly pump raw image data across to Octave.

> Do you plan to handle strings? How? Would you require a full "binary format"
> description in this case? Is there such a thing as a matrix of strings?

Well, I'd not thought of strings in the binary file, but perhaps something like "%s" in the format string? I would probably make the restriction that the strings within the file need to be NULL terminated. That's not an unrealistic expectation, is it? Or wait, maybe "%s" could be general length but NULL terminated; "%[#]s" could be a fixed length of # characters.

I'm not sure what you mean by a matrix of strings.

> The matrix variant is far more straight-forward. I would think this will be
> by far the most common use anyhow, and it would cover the pixel images
> that you obviously have fondness for. Could we maybe have a first cut
> version of this patch that only deals with matrix format binary data?

What is the matrix variant? Gnuplot binary? That was available all along. However, gnuplot binary doesn't work for color images. (Need three channels for that.) The switch BINARY_DATA_FILE can be undefined to remove binary datafiles from the code. Gnuplot binary would still work with the switch off.

Dan
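For concreteness, a stand-alone sketch of how an application could produce a file shaped like the scatter2 example above: four records of lengths 30, 30, 29 and 26, each point written as three consecutive 32-bit floats on a little-endian machine (so that endian=little applies); the data values themselves are made up for illustration only:

    #include <stdio.h>
    #include <stdlib.h>

    /* Writes x,y,z triples as raw 32-bit floats, record by record.
     * Assumes the writing machine is little-endian (endian=little). */
    int main(void)
    {
        const int record_len[] = { 30, 30, 29, 26 };   /* record=30,30,29,26 */
        FILE *fp = fopen("scatter2.bin", "wb");
        if (!fp)
            return EXIT_FAILURE;

        for (int r = 0; r < 4; r++) {
            for (int i = 0; i < record_len[r]; i++) {
                float point[3];
                point[0] = (float) i;            /* x */
                point[1] = (float) r;            /* y */
                point[2] = (float) (i * r);      /* z: arbitrary demo values */
                fwrite(point, sizeof(float), 3, fp);
            }
            /* no separator is written: the record lengths on the plot
             * command line tell gnuplot where the "blank lines" fall   */
        }
        fclose(fp);
        return EXIT_SUCCESS;
    }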
From: Ethan A M. <merritt@u.washington.edu> - 2004-08-15 04:30:52

On Saturday 14 August 2004 08:54 pm, Daniel Sebald wrote:
> > Your docs say
> >     + Gnuplot will retrieve a number of binary
> >     + variables equal to the largest column specified in the `<using list>`.
> >     + For example, `using 1:3` would cause three columns to be read, of which
> >     + the second will be ignored.
> > So how do you handle the case of 10 logical columns of data in the file,
> > of which you only want to read the 2nd and 4th? How do you skip "columns"
> > 5 to 10 of each "line"?
>
> "format" is supposed to be analogous to the "using" format string, so
> something like the following should work
>
>     plot "datafile.dat" binary format ="%10float" using 2:4

But according to the documentation I quoted above, that would only read in 4 logical columns, leaving 6 more unread values in the file before you get to the next set of input values. How do you tell it to skip the next 6 columns?

> > What constitutes the logical equivalent of a "blank line" in your binary
> > files? Or is there no equivalent to the auto-determination of scan lines?
>
> A blank line occurs when the scan line reaches its end. For example,
> here is the scatter2 example from the image demo
>
>     splot 'scatter2.bin' binary record=30,30,29,26 endian=little using 1:2:3
>
> which means blank lines occur at the 30th line, 60th line, etc.

But there you have told it on the command line what the structure is. The thing about blank lines in an ascii input file is that they define a structure on the fly; you don't need to specify it on the command line. I would much rather require a file format that indicates what each logical line contains. A blank line is then indicated *in the file* by some designated code (probably some number of 0s, but whatever).

> Well, I'd not thought of strings in the binary file, but perhaps
> something like "%s" in the format string? I would probably make the
> restriction that the strings within the file need to be NULL terminated.
> That's not an unrealistic expectation, is it? Or wait, maybe "%s"
> could be general length but NULL terminated; "%[#]s" could be a fixed
> length of # characters.

Fixed length strings are not interesting. You could use NULL-termination, but only if you specify everything on the command line because otherwise the input routine doesn't know whether it's reading a string at all.

> I'm not sure what you mean by a matrix of strings.

I mean like an input file consisting of 10 lines of 5 strings each. Only in this case it would be a binary file containing 50 NULL-terminated strings that you have somehow flagged as being in a 10x5 matrix.

> What is the matrix variant?

Like your example above. (At least I *think* that's what your example was doing). A regular array of values all of the same sort. E.g. a 100x200x300 grid with x varying faster than y faster than z. But since it's regular and all the entries are the same length you know exactly where to find every element without any funky format stuff.

> The problem is that an image of say 500 x 500 pixels gets very big in ASCII.

I think that is a non-issue. You don't have to store this anywhere; you're just piping it in. But this is the very straightforward case that I called a matrix. You know in advance it's a 500x500 array, and you know how big each element is. No need for format statements, using specs, or any of that.

[EAM puts on geezer hat again] Back in the old days of limited disk space it was a big win to store numeric data in binary files. This caused man-centuries of time to be wasted in dealing with cross-platform conversions and uncertainty about the exact format of the binary files. When disks got cheap we all heaved a huge sigh of relief and for the most part stopped using binary output files. It's just not worth it. So what if the ascii equivalent is big? Just compress it and it goes back to being about the same size as the original binary (OK, that depends a bit on what sort of data it is).

Bottom line is I really don't like this general binary input format. If you know enough about your binary format to write a cryptic description like

    plot "datafile.dat" binary format="%*int16%float32%*float32%" \
         record=30,30,29,26 endian=little

then by gum, you know enough to write a jiffy filter routine and pipe normal ascii input into gnuplot. Many users may not be up to this, but those same users won't be able to figure out the endian business anyhow. Where exactly is the big gain? Simplicity is worth *a lot*. Far more than saving a little bandwidth in the input pipe.

Input of binary files containing regular arrays may be worth it for convenience. But more complicated input requiring flags for bit order, word size, floating point format, and pre-announcement of the file structure? ---- All that strikes me as being more trouble than it is worth. Will your code work on an Amiga? On a 64-bit VMS machine? Who is going to explain to users how to set all the right flags to make it work?

I believe that Petr had some specific applications in mind, so maybe he can step in and clarify exactly what pieces of this code he wanted, and why. I myself plot many sorts of data in gnuplot, but I've never felt a need for direct binary input.

--
Ethan A Merritt
Department of Biochemistry & Biomolecular Structure Center
University of Washington, Seattle
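As one concrete illustration of the filter approach suggested above (not part of any actual patch), a tiny converter that reads native-format 32-bit float triples from stdin and writes plain ASCII columns that gnuplot already understands; the record length of 30 is an assumed example value:

    #include <stdio.h>

    /* bin2ascii: read raw float triples from stdin, emit ASCII columns.
     * A blank line is printed every RECORD points to mark a new scan line. */
    #define RECORD 30    /* assumed record length for this example */

    int main(void)
    {
        float v[3];
        long n = 0;

        while (fread(v, sizeof(float), 3, stdin) == 3) {
            printf("%g %g %g\n", v[0], v[1], v[2]);
            if (++n % RECORD == 0)
                putchar('\n');     /* gnuplot's usual "blank line" separator */
        }
        return 0;
    }

It could then be piped in with something like splot '< ./bin2ascii < scatter2.bin' using 1:2:3, keeping all knowledge of the binary layout outside gnuplot.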
From: Daniel J S. <dan...@ie...> - 2004-08-15 07:38:09
|
Ethan A Merritt wrote: >On Saturday 14 August 2004 08:54 pm, Daniel Sebald wrote: > > >>>Your docs say >>> + Gnuplot will retrieve a number of binary >>> + variables equal to the largest column specified in the `<using list>`. >>> + For example, `using 1:3` would cause three columns to be read, of which >>> + the second will be ignored. >>>So how do you handle the case of 10 logical columns of data in the file, >>>of which you only want to read the 2nd and 4th? How do you skip "columns" >>>5 to 10 of each "line"? >>> >>> >>"format" is supposed to be analogous to the "using" format string, so >>something like the following should work >> >>plot "datafile.dat" binary format ="%10float" using 2:4 >> >> > >But according to the documentation I quoted above, that would only >read in 4 logical columns, leaving 6 more unread values in the file >before you get to the next set of input values. How do you tell it >to skip the next 6 columns? > Well, the documentation is a bit misleading. If all one specifies is "using", without a format string, then the assumption is that the highest number is the number of columns. But when the format string is there, it tells that a line is 10 floats in this case. I guess the idea was that one could leave out the format string and just type "using 1:2:3" for example. I'd not be against the "format" and "using" required to appear together. A lot of code goes to these special assumptions and whatnot. >>>What constitutes the logical equivalent of a "blank line" in your binary >>>files? Or is there no equivalent to the auto-determination of scan lines? >>> >>> >>A blank line occurs when the scan line reaches its end. For example, >>here is the scatter2 example from the image demo >> >>splot 'scatter2.bin' binary record=30,30,29,26 endian=little using 1:2:3 >> >>which means blank lines occur at the 30th line, 60th line, etc. >> >> > >But there you have told it on the command line what the structure is. >The thing about blank lines in an ascii input file is that they define >a structure on the fly; you don't need to specify it on the command line. >I would much rather require a file format that indicates what each >logical line contains. A blank line is then indicated *in the file* by some >designated code (probably some number of 0s, but whatever). > My geezerness only goes back to the days of PDP 11-70 and the 8 inch floppy platter. But I can't recall binary files ever having special characters to serve as the end of a record. (If so, what are they that they wouldn't clash with a valid IEEE float? Is there a NAN in IEEE float that could serve as an end of record?) >>Well, I'd not thought of strings in the binary file, but perhaps >>something like "%s" in the format string? I would probably make the >>restriction that the strings within the file need to be NULL terminated. >> That's not an unrealistic expectation, is it? Or wait, maybe "%s" >>could be general length but NULL terminated; "%[#]s" could be a fixed >>length of # characters. >> >> > >Fixed length strings are not interesting. You could use NULL-termination, >but only if you specify everything on the command line because otherwise >the input routine doesn't know whether it's reading a string at all. > Right. I guess all this sort of information would be in some type of header. >>I'm not sure what you mean by a matrix of strings. >> >> > >I mean like an input file consisting of 10 lines of 5 strings each. 
>Only in this case it would be a binary file containing 50 NULL-terminated >strings that you have somehow flagged as being in a 10x5 matrix. > > > >>What is the matrix variant? >> >> > >Like your example above. (At least I *think* that's what your example >was doing). A regular array of values all of the same sort. E.g. >a 100x200x300 grid with x varying faster than y faster than z. >But since it's regular and all the entries are the same length you >know exactly where to find every element without any funky format >stuff. > Yes, I'd certainly be content with something simple to get data across from an application to gnuplot. I mean, the expectation is not that one has a thousand columns of binary data of which you want to pick out only two of them. In most cases one could say something like format="%int" using 1:2:3 or format="%float" using 1:2:3 I think originally this started as a slight variation on the using command, but it got too confusing from that. So "binary" developed it's own using string. >>The problem is that an image of say 500 x 500 pixels gets very big in ASCII. >> >> > >I think that is a non-issue. You don't have to store this anywhere; you're >just piping it in. But this is the very straightforward case that I called >a matrix. You know in advance it's a 500x500 array, and you know how big >each element is. No need for format statements, using specs, or any of that. > >[EAM puts on geezer hat again] Back in the old days of limited disk space >it was a big win to store numeric data in binary files. This caused >man-centuries of time to be wasted in dealing with cross-platform conversions >and uncertainty about the exact format of the binary files. When disks got >cheap we all heaved a huge sigh of relief and for the most part stopped >using binary output files. It's just not worth it. So what if the ascii >equivalent is big? Just compress it and it goes back to being about the >same size as the original binary (OK, that depends a bit on what sort of >data it is). > [Battle of the geezers coming] Point taken, when we're talking a few hundred data points. But when we're talking images, it could be a megabyte file, and converting that to ASCII yields a 10 megabyte file. (Each data point gets expanded to a floating point ASCII number. In addition the (x,y) locations have to be added as columns.) Then it has to be read in using the formatted I/O. (I assume scanf is slightly slower than raw data.) There gets to be this delay between hitting the return key and an image popping up in what looks like Octave, but is really Gnuplot. Another similar situation might be a speech waveform. I understand that the person using the software should really appropriately down-sample the data so that one isn't sending all kinds of data and extraneous high resolution to gnuplot, but people don't do that unfortunately. They've got fast computers and 120 Gbyte hard drives. >Bottom line is I really don't like this general binary input format. >If you know enough about your binary format to write a cryptic >description like > plot "datafile.dat" binary format="%*int16%float32%*float32%" \ > record=30,30,29,26 endian=little >then by gum, you know enough to write a jiffy filter routine and >pipe normal ascii input into gnuplot. Many users may not be >up to this, but those same users won't be able to figure out the >endian business anyhow. Where exactly is the big gain? >Simplicity is worth *a lot*. Far more than saving a little bandwidth >in the input pipe. 
>
>Input of binary files containing regular arrays may be worth it
>for convenience. But more complicated input requiring flags for
>bit order, word size, floating point format, and pre-announcement
>of the file structure? ---- All that strikes me as being more trouble
>than it is worth. Will your code work on an Amiga? On a 64-bit
>VMS machine? Who is going to explain to users how to set
>all the right flags to make it work?

Hey, I'm not going to fight you on that one. I'm all for simplicity. No one
ever offered up a simple solution. It started as a slight variation on the
current implementation of "using", etc. I probably figured at the time, why
not treat binary just like ascii so that all the functionality that ascii
input has is also present for binary, e.g., passing through a function, etc.?
It grew from there.

I would add that I myself am deterred from implementing general binary if the
df_readline() is going to continue to grow with functionality from within.
Unless, say, the use_spec processing is converted to a function that can be
called from multiple places, trying to maintain two "analogous", or
"parallel", routines is too much for anyone, whether he or she is the
original author or not.

>I believe that Petr had some specific applications in mind, so
>maybe he can step in and clarify exactly what pieces of this
>code he wanted, and why. I myself plot many sorts of data in
>gnuplot, but I've never felt a need for direct binary input.

... but I would say that binary input has to exist if one is going to display
images. The faster the response between hitting the return key and the image
popping up on the screen, the better. If it gets too slow for relatively
small images, the user's response won't be favorable.

Offer up a simple way of doing it... Would a syntax where there is _no_
format string and _no_ using string simplify matters? That is, everything
must be binary floats and there cannot be any discarded columns. It would
remove a lot of bits and pieces of code that add up to quite a bit, I guess.
How about the sample intervals? Are those useful? Again, the main structure
of a binary data file from an application would just be a solid string of raw
data, but it isn't unreasonable to require everything about (x,y) positions
to be explicit, rather than implicit.

Is there a subset of the syntax we've offered up which will work sufficiently
and provide some flexibility so that both large image files and large linear
files, like speech or other forms of lengthy time records, can be transferred
efficiently?

If one wants to rule out lengthy binary linear records, and still allow large
binary image files, how about an extension to the gnuplot binary format?
That is, would you allow a variable to follow "binary" as binary presently
exists in the CVS version? The purpose of this variable would be to indicate
how many "channels" or "entries" or whatever are associated with a location
in the grid. This would allow the use of grayscale and RGB images. For
example, "binary 3" would be

N   x1             x2             ...  xN
y1  <r11 g11 b11>  <r21 g21 b21>  ...  <rN1 gN1 bN1>
y2  <r12 g12 b12>  <r22 g22 b22>  ...  <rN2 gN2 bN2>
etc.

The x and y wouldn't necessarily have to be Cartesian. They could be radial,
if ever one gets adventurous enough to attempt circular images like
sonograms, CT... which probably won't happen. I do acknowledge that gnuplot
binary is limiting, though. But it works for me. Petr may have feelings
otherwise.
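For concreteness, a minimal C sketch of how an application might emit the
"binary 3" (gpbin3) layout described just above. The function name, the use
of plain C floats, and the assumption that the rgb buffer holds the three
components of each pixel adjacently, row by row, are choices of this sketch,
not part of any agreed-upon format:

/* Hypothetical writer for the "binary 3" (gpbin3) layout described above:
 * <N> <x1>..<xN>, then for each row its y value followed by N (r,g,b)
 * triples.  Plain C floats throughout -- an assumption, not a spec. */
#include <stdio.h>

static void write_gpbin3(FILE *fp, int nx, int ny,
                         const float *x,   /* nx x coordinates          */
                         const float *y,   /* ny y coordinates          */
                         const float *rgb) /* ny*nx*3 values, row-major */
{
    float n = (float) nx;
    int j;

    fwrite(&n, sizeof(float), 1, fp);        /* leading column count */
    fwrite(x, sizeof(float), nx, fp);        /* x coordinates        */
    for (j = 0; j < ny; j++) {
        fwrite(&y[j], sizeof(float), 1, fp); /* y value for this row */
        fwrite(&rgb[(size_t) j * nx * 3], sizeof(float), (size_t) 3 * nx, fp);
    }
}

A plain gnuplot-binary matrix would be written the same way, with one value
per grid point instead of three.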
(But Petr, it may be possible to take a lot of the binary code and write a
little app that converts ESRF to gpbin or gpbin3, then have a little awk
script so that gnuplot behaves almost exactly like "plot 'image.edf' with
rgbimage".)

Seriously, given some consensus on a simple, acceptable approach, I can toss
it together in a matter of hours and be done. From my perspective, so long as
I can use Octave to get images, in this case spectrograms, into PostScript or
PS/LaTeX form, with axes and tics and labels, and import them into a LaTeX
document, I'm happy. What I have now works for me, but I won't be motivated
to do a simpler design without some consensus, as opposed to just offering up
yet another alternative for evaluation. I'm happy for the feedback and
willing to change things if it will go somewhere.

Dan
|
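As a rough check of the formatted-I/O cost raised in the message above ("I
assume scanf is slightly slower than raw data"), a crude, self-contained
timing along the following lines can be used. The file names and the 500x500
value count are arbitrary, and this is not code from the patch:

/* Crude timing of formatted vs. raw input: read 500*500 values once with
 * fscanf() from an ascii file and once with fread() from a raw-float file.
 * "data.txt" and "data.bin" are placeholder names. */
#include <stdio.h>
#include <time.h>

#define NVALS (500 * 500)

int main(void)
{
    static float buf[NVALS];
    double x;
    int i;
    clock_t t0, t1;
    FILE *fa = fopen("data.txt", "r");   /* NVALS ascii numbers */
    FILE *fb = fopen("data.bin", "rb");  /* NVALS raw C floats  */

    if (!fa || !fb)
        return 1;

    t0 = clock();
    for (i = 0; i < NVALS; i++)
        if (fscanf(fa, "%lf", &x) != 1)
            break;
    t1 = clock();
    printf("fscanf: %.3f s\n", (double) (t1 - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    fread(buf, sizeof(float), NVALS, fb);
    t1 = clock();
    printf("fread : %.3f s\n", (double) (t1 - t0) / CLOCKS_PER_SEC);

    fclose(fa);
    fclose(fb);
    return 0;
}

The point is only to separate parsing cost from transfer cost; absolute
numbers will vary from machine to machine.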
From: Daniel J S. <dan...@ie...> - 2004-08-15 20:23:44
|
Ethan Merritt wrote:
>(wandering a bit off topic)
>
>On Sunday 15 August 2004 01:03 am, Daniel J Sebald wrote:
>
>>My geezerness only goes back to the days of PDP 11-70 and the 8 inch
>>floppy platter. But I can't recall binary files ever having special
>>characters to serve as the end of a record.
>
>PDP 11/xx used the FILES-11 filesystem, in which meta-information
>about record type, disk allocation, ACLs, etc were stored in a
>separate meta-file, not as in-line info.
>
>These filesystems supported very complicated record structures
>for database work (still in use today), but also had 3 main "simple"
>file structures:
>  Fixed-length records:
>    What it sounds like. The record length was specified in
>    meta-data. A read operation returned 1 whole record.
>  Variable-length records:
>    Each record began with an integer specifying how long
>    the record was.
>  CR/LF:
>    Unix-like stream-of-bytes, with end of record signalled by
>    either a CR or a LF.
>
>On top of that, Fortran used carriage-control characters at the beginning
>of a record.

OK, you win the geezer challenge... Anyway, gnuplot binary is then similar to
a variable-length record.

>But it's not a file. It never hits the disk, so I/O speed is not an issue.
>And at current memory bandwidths, transferring 10 MB of data should
>take only about 0.01 sec (if I haven't dropped a decimal point somewhere).
>That will be totally dominated by the I/O time to read the original binary
>data from a disk file. So it may be unaesthetic to have an intermediate
>ascii stream, but I doubt it will be noticeable in terms of interactive
>response.

Here is a test. Let's say a 500 x 500 image is processed in Octave and is to
be plotted. I don't think 500 x 500 is unreasonable; x-ray angiography
images, telescopic space images -- they're usually pretty big. If you have
Octave, try the following to simulate the amount of data that would be
transferred through the pipe. (Granted, we have no idea what kind of
bottlenecks might exist in how Octave is programmed for the pipe -- perhaps
it could be improved -- but we'll use this as a rough test.)

t = [1:500*500]/100;
s = sin(t);
plot(t,s);

On my machine, a three-year-old Dell with a Pentium 4 and a 900-1000 MHz
system bus, that plot takes 8 seconds. After the 4th second the Octave
command line returns, and 4 seconds after that the gnuplot plot appears. To
me, that time is unacceptable. (Imagine the derision... No, the solution is
not to buy a faster computer.) There are probably a couple of things going
on. First, the pipe may not transfer data at the rate you suggest, due to
time sharing perhaps. Who knows? Second, there is also the issue of this
being formatted I/O, meaning that every value has to go through the scanf
function. Does that slow things down?

Now an example in Octave using the m-file designed to use the image and
binary features added to gnuplot.

A = 1./hilb(500);
imagegp(A);

This takes 3/4 to 1 second. Tolerable. There is a difference here, though:
the binary data goes through a file. So maybe the file is faster than the
pipe. Let's try one last test, sending the image data to a file in ascii
form. I'll put an "if 1" around the instructions to ensure they are all
executed as fast as possible, one after the other.
X = ones(size(A,2),1)*[1:size(A,1)];
Y = [1:size(A,2)]'*ones(1,size(A,1));
N = size(A,1)*size(A,2);
B = [reshape(X,N,1) reshape(Y,N,1) reshape(A,N,1)]';
if 1
  fid = fopen("junk.dat","w");
  fprintf(fid, "%f %f %f\n", B);
  fclose(fid);
  graw("plot \'junk.dat\' using 1:2:3 w image\n");
end

This takes 6 or 7 seconds. So files and a pipe are roughly the same in this
crude test. Perhaps the file is even faster because more data is being
transferred in that case. However, there are other things within gnuplot,
i.e., reading from a file and reading from '-' may be different. Anyway, a
rough test. But the conclusion is that it is probably fprintf and scanf,
i.e., formatted I/O, that slow things down, and binary data is a nice feature
to have with images.

>>I would add that I myself am deterred from implementing general binary
>>if the df_readline() is going to continue to grow with functionality
>>from within. Unless, say, the use_spec processing is converted to a
>>function that can be called from multiple places, trying to maintain two
>>"analogous", or "parallel", routines is too much for anyone, whether he
>>or she is the original author or not.
>
>You mean changing use_spec[] from an array into a function?
>If that turns out to be useful then I suppose it would be reasonable.

Yeah, but I'm not advocating that. You are persuading me that perhaps
"binary" should be simpler. The question is, how many people will use
Gnuplot from the command line for processing images? Not many; so I would
say that passing data through a function isn't that necessary, as in this
example

plot 'blutux.rgb' binary array=128x128 flipy format='%uchar' using (1.5*$1):2:3 with rgbimage

The primary use I have in mind for this "large data set plotting" is
something done by an application in an ephemeral way. Just send some data
over, plot it, and discard the data. So, perhaps the ability to skip data
within a binary file isn't necessary. That is, no '%*uchar%' kind of stuff,
or skipping a number of bytes at the head of the file.

How about tossing out the multiple-records-per-file feature? If there is
more than one big data set to plot, just create multiple files.

How about tossing the implicit sampling interval? That would mean that all
data must appear in the file; for example, the (x,y) coordinates for each
pixel of an image must be in the file along with the pixel value. That means
a sample image for the 'image.dem' program would increase in size by a factor
of 5/3. No problem. Translations? Toss those in the case where coordinates
are in the file.

All of this stuff would reduce a lot of the code, much of which is for
interpreting the keywords. With no "using" there can be no functions. Also,
let's say that with binary there are no strings, no time data, etc. Again,
this kind of stuff will be small in quantity if ever it is plotted, in which
case ASCII can be used. What I mean is there is no need to plot 500 strings.

I'd hesitate to toss '%uchar', etc., although I could give on that one. But
let's rule out multiple data types per file; maybe just one %float, etc.,
inside the format string. The code that does the transformation inside the
df_readbinary() routine is fairly straightforward. There is a set of tables
to compute data sizes upon compilation. Looks nasty, but once it is compiled
it probably isn't too big.

I'd hesitate to toss the endian information too. That code inside
df_readbinary() also isn't too bad. The thing is, Octave has qualifiers
associated with its fopen() routine, "ieee-le" and "ieee-be".
They pay attention to endianness, so maybe gnuplot binary should too.

So, in order to get functionality, here is a possible reduced syntax:

binary {3 | xy | xyz | xyzc} {format="string"} {endian=little}

Now if we want to toss the format, and require "all floats, all the time",
fine. But the first part of that syntax is to allow entry for both images
and long linear records such as speech waveforms or whatever.

binary      : The current gpbin file.
binary 3    : Very similar to current gpbin; what I call gpbin3. That is, it
              is the matrix format, but each element of the matrix has 3
              components. (Could make that an arbitrary number, 1 up to max
              columns.)

Now that covers images, i.e., a matrix format. But what about sampling in
one dimension? Perhaps that could be done with gpbin if one sets N (the
number of columns and first number in the file) to one. But that is tricky
from the user's perspective. Hence the following:

binary xy   : Two "columns" of data. Would be useful for 2D plots.
binary xyz  : Three "columns" of data. Would be useful for 3D plots.
binary xyzc : Four "columns" of data. Would be useful for 3D plots with color.

This wouldn't have to be the exact syntax. For example, it would be nice if
one could just specify the number of columns with 2, 3, 4, 5, ..., max_cols,
but that would conflict with trying to introduce multiple components per
element of matrix binary.

Dan
|
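On the endian= option in the proposed syntax: the handling it implies is
essentially an in-place byte reversal per value whenever the file's byte
order differs from the host's, something like the sketch below. This is a
generic illustration, not the actual code in df_readbinary():

/* Generic per-value byte swap of the kind an endian= (or "swap") option
 * implies; size is the width of one value in bytes (2, 4 or 8). */
#include <stddef.h>

static void swap_bytes(void *value, size_t size)
{
    unsigned char *b = value;
    size_t i;

    for (i = 0; i < size / 2; i++) {
        unsigned char t = b[i];
        b[i] = b[size - 1 - i];
        b[size - 1 - i] = t;
    }
}

Each value read from the file would be passed through something like this
before use, so the cost is small compared with the read itself.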
From: Daniel J S. <dan...@ie...> - 2004-08-15 21:42:40
|
Daniel J Sebald wrote:
> binary {3 | xy | xyz | xyzc} {format="string"} {endian=little}

Or perhaps a better way of doing this would be

binary {#} {columns} {format="string"} {endian=big}

Here

binary #         : Matrix format; # specifies the number of components per
                   matrix element.
binary # columns : Binary data has # columns of "string" data format.

How's that? No user functions, no strings, but I think that would cover most
applications that use temporary data as opposed to, say, archived ascii where
strings etc. are useful. I could implement that as df_readbinary() fairly
easily. That would allow binary data for 2D plots.

So, how about 3D data? It already exists. But let's examine its structure.
Basically, inside plot3d.c is the following:

if (df_matrix)
    xdatum = df_3dmatrix(this_plot, NEED_PALETTE(this_plot));
else {
    <snip>
    while ((j = df_readline(v,MAXDATACOLS)) != DF_EOF) {

Inside of df_3dmatrix is some code that looks very similar to df_readline
(with using specs, etc. ... maybe "using" can't be discarded after all), but
instead df_3dmatrix does the work of storing the data in the plot structure,
rather than passing data back to plot3d.c for interpreting. The question I
pose is: could df_3dmatrix be replaced by df_readline(), which uses
df_readbinary() internally? The advantage of df_3dmatrix() is that it has a
short little loop to do the storage rather than passing back a v[] vector.
However, it isn't consistent with the predominant gnuplot "paradigm" for
interpreting and storing data. Plus, if there is a df_readline() which is
very similar to df_3dmatrix, it is a bit of code repetition. Now, I know
Ethan would like to pass the plot structure pointer in to the reading
routine. Perhaps then the idea is to move away from that strategy, and
df_3dmatrix is more the desired model.

Here are some interesting comments from the code and online help:

/* FIXME HBB 20001207: doesn't respect 'index' at all, even though it
 * could, and probably should. */
static float ** df_read_matrix(int *rows, int *cols)

The `index` keyword is not supported, since the file format allows only one
surface per file. The `every` and `using` filters are supported. `using`
operates as if the data were read in the above triplet form.

So, it seems it may be desirable to have 'using' as part of the binary
command. But I don't think 'index' works from what is in a gnuplot-binary
file. There is only the value <N+1> at the start of the file, not the other
dimension. So there is no way of knowing when to end one record without
something like "record=120x150". But I'm drifting to the geezer camp, not
allowing more than one record per file.

Dan
|
From: <mi...@ph...> - 2004-08-16 07:39:57
|
>> Well, I'd not thought of strings in the binary file, but perhaps
>> something like "%s" in the format string? I would probably make the
>> restriction that the strings within the file need to be NULL
>> terminated.
>> I'm not sure what you mean by a matrix of strings.

Isn't "binary file with strings" the same as "ascii file with strings",
just with \0 instead of \n -- thus a "tr" filter would do it all? Aha, that
won't help if binary data and strings are mixed in one file. Do you mean
this case?

>> The problem is that an image of say 500 x 500 pixels gets very big in
>> ASCII.
>
> I think that is a non-issue. You don't have to store this anywhere;
> you're just piping it in.

Piping is not as fast as a direct read.

For example, I have recently benchmarked reading a big .gz file with a C
program using (1) popen("gzip -c -d"), and (2) linking it with zlib.
Case (2) was 15% faster -- quite an interesting speed-up if you have to
read 2 GB of data.

> When disks got cheap we all heaved a huge sigh of
> relief and for the most part stopped using binary output files.

I don't think this is right.

- You can work on a computer where you are not authorized to replace the
  hard disk (the case in many companies). Also notebook hard disks are not
  cheap, fast, or easily replaceable.
- Digital detectors are improving, and nowadays I have to deal with
  2048x2048x16bit image series. That's plenty of data -- it grows
  quadratically with improvements in detector technology.
- I guess image processing will never switch to ascii data -- it will
  always be too big and slow.

> Many users may not be
> up to this, but those same users won't be able to figure out the
> endian business anyhow.

In the "with image" patch, the parameters and thus the command line options
for reading binary (matrix) files were designed carefully so that you can
read any type of data. The command line options cover the same range of
options you have to fill in for any binary image reader, even a GUI-like one,
to read arbitrary image data. The user must always know his data, that's it,
and he is not bothered that he has to pass this information to gnuplot or
whichever other image drawer.

> Simplicity is worth *a lot*. Far more than saving a little bandwidth
> in the input pipe.

The patch reading and drawing binary data is a major speedup. Try to
compare drawing a big (>512x512) traditional gnuplot binary data file and a
binary image file.

> Input of binary files containing regular arrays may be worth it
> for convenience. But more complicated input requiring flags for
> bit order, word size, floating point format, and pre-announcement
> of the file structure?
> All that strikes me as being more trouble
> than it is worth. Will your code work on an Amiga? On a 64-bit
> VMS machine?

I think so. You can specify Float32, Float64, etc.

Probably you cannot draw binary floats saved by Turbo Pascal v 5--6,
because these are 6 bytes -- I don't know about recent Pascals. But you
cannot read those in any other image program, I guess.

> Who is going to explain to users how to set
> all the right flags to make it work?

Users working with image processing know their format structure. It's the
user who explains to gnuplot what to draw via command line options to
"plot ... with image".
Otherwise, you or somebody else writes a reader for Octave, and from there
you draw your matrix via imagegp.m, included in the patch.

> I believe that Petr had some specific applications in mind, so
> maybe he can step in and clarify exactly what pieces of this
> code he wanted, and why.
Yes, I want to quickly image binary image data with axes x and y in physical
units (not pixel numbers).

> I myself plot many sorts of data in
> gnuplot, but I've never felt a need for direct binary input.

I need it always when drawing an image larger than >=128x128. Otherwise,
the drawing speed is very low (especially on X11) and memory consumption is
high (I remember that gnuplot eats about 130 B for a point read and drawn
from a column-wise file).
The current version of Daniel's patch fully satisfies my needs.

--- PM
|
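For reference, a sketch of the two reading strategies Petr compares above
(piping the output of gzip versus linking against zlib). The buffer size and
the byte counting are arbitrary choices of this sketch, not his actual
benchmark code:

/* Read a .gz file two ways: (1) through a pipe from "gzip -c -d", and
 * (2) directly with zlib's gzread().  Both just count the bytes read. */
#include <stdio.h>
#include <zlib.h>

#define BUFSZ 65536

static size_t read_via_pipe(const char *path)
{
    char cmd[1024], buf[BUFSZ];
    size_t n, total = 0;
    FILE *fp;

    snprintf(cmd, sizeof(cmd), "gzip -c -d %s", path);
    if ((fp = popen(cmd, "r")) == NULL)
        return 0;
    while ((n = fread(buf, 1, BUFSZ, fp)) > 0)
        total += n;
    pclose(fp);
    return total;
}

static size_t read_via_zlib(const char *path)
{
    char buf[BUFSZ];
    int n;
    size_t total = 0;
    gzFile gz = gzopen(path, "rb");

    if (gz == NULL)
        return 0;
    while ((n = gzread(gz, buf, BUFSZ)) > 0)
        total += (size_t) n;
    gzclose(gz);
    return total;
}

Timing each function on the same large file gives the kind of comparison
described above; the difference presumably comes from the extra copy through
the pipe and the scheduling of the second process.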
From: Daniel J S. <dan...@ie...> - 2004-08-16 17:04:05
|
mi...@ph... wrote:
>>>Well, I'd not thought of strings in the binary file, but perhaps
>>>something like "%s" in the format string? I would probably make the
>>>restriction that the strings within the file need to be NULL
>>>terminated.
>>>I'm not sure what you mean by a matrix of strings.
>
>Isn't "binary file with strings" the same as "ascii file with strings",
>just with \0 instead of \n -- thus a "tr" filter would do it all? Aha, that
>won't help if binary data and strings are mixed in one file. Do you mean
>this case?

Yes, that case. But I think they still are the same. If you look at binary
files with headers in an editor, the strings near the top are just as
readable as if it were an ascii file.

>>>The problem is that an image of say 500 x 500 pixels gets very big in
>>>ASCII.
>>
>>I think that is a non-issue. You don't have to store this anywhere;
>>you're just piping it in.
>
>Piping is not as fast as a direct read.
>
>For example, I have recently benchmarked reading a big .gz file with a C
>program using (1) popen("gzip -c -d"), and (2) linking it with zlib.
>Case (2) was 15% faster -- quite an interesting speed-up if you have to
>read 2 GB of data.
>
>>When disks got cheap we all heaved a huge sigh of
>>relief and for the most part stopped using binary output files.
>
>I don't think this is right.
>
>- You can work on a computer where you are not authorized to replace the
>  hard disk (the case in many companies).

Good point. That sort of thing has been in U.S. news lately, i.e., misplaced
drives that shouldn't be misplaced.

>  Also notebook hard disks are not cheap, fast, or easily replaceable.
>- Digital detectors are improving, and nowadays I have to deal with
>  2048x2048x16bit image series. That's plenty of data -- it grows
>  quadratically with improvements in detector technology.
>- I guess image processing will never switch to ascii data -- it will
>  always be too big and slow.

Petr has it exactly right with image processing. As computers get faster,
the technology seems to fill the void. That is a point I wanted to make
before. (Call me the "neo-geezer".)

>>Many users may not be
>>up to this, but those same users won't be able to figure out the
>>endian business anyhow.

We've included an option "swap" with which a person doesn't need to know
what big and little endian mean. Just swap the order and see how it turns
out.

>In the "with image" patch, the parameters and thus the command line options
>for reading binary (matrix) files were designed carefully so that you can
>read any type of data. The command line options cover the same range of
>options you have to fill in for any binary image reader, even a GUI-like
>one, to read arbitrary image data. The user must always know his data,
>that's it, and he is not bothered that he has to pass this information to
>gnuplot or whichever other image drawer.
>
>>Simplicity is worth *a lot*. Far more than saving a little bandwidth
>>in the input pipe.
>
>The patch reading and drawing binary data is a major speedup. Try to
>compare drawing a big (>512x512) traditional gnuplot binary data file and a
>binary image file.
>
>>Input of binary files containing regular arrays may be worth it
>>for convenience. But more complicated input requiring flags for
>>bit order, word size, floating point format, and pre-announcement
>>of the file structure?

There are all kinds of data files out there.
I guess the question is, should the user be obligated to write something to
make their files conform to gnuplot, or should the syntax exist for the user
to finagle gnuplot into reading his or her file? Take the moderately
proficient Linux user, like myself. I can toil with a program's syntax to a
certain extent. But to write a Linux utility to convert data file formats,
that's more trouble. But I acknowledge Ethan's point; perhaps a bit of "code
bloat". (Let's see if I can help that; see below.)

>>All that strikes me as being more trouble
>>than it is worth. Will your code work on an Amiga? On a 64-bit
>>VMS machine?
>
>I think so. You can specify Float32, Float64, etc.
>
>Probably you cannot draw binary floats saved by Turbo Pascal v 5--6,
>because these are 6 bytes -- I don't know about recent Pascals. But you
>cannot read those in any other image program, I guess.
>
>>Who is going to explain to users how to set
>>all the right flags to make it work?
>
>Users working with image processing know their format structure. It's the
>user who explains to gnuplot what to draw via command line options to
>"plot ... with image".
>Otherwise, you or somebody else writes a reader for Octave, and from there
>you draw your matrix via imagegp.m, included in the patch.
>
>>I believe that Petr had some specific applications in mind, so
>>maybe he can step in and clarify exactly what pieces of this
>>code he wanted, and why.
>
>Yes, I want to quickly image binary image data with axes x and y in
>physical units (not pixel numbers).
>
>>I myself plot many sorts of data in
>>gnuplot, but I've never felt a need for direct binary input.
>
>I need it always when drawing an image larger than >=128x128. Otherwise,
>the drawing speed is very low (especially on X11) and memory
>consumption is high (I remember that gnuplot eats about 130 B for a
>point read and drawn from a column-wise file).
>The current version of Daniel's patch fully satisfies my needs.

That's correct. We'd thought the current format, with gnuplot's
eight-field-wide point, was sort of inefficient for large data files.
However, tacking on a new scheme for more compact storage would be too much
of a paradigm shift.

In that same vein, I'd like to address that df_3dmatrix() routine again.
Now, inside of there is code that looks very similar to the df_readbinary()
I've created. So, this df_3dmatrix() started out in some ways similar to
df_readline(), with the use_specs and all. And Ethan and I have now
discussed this problem of code re-use, or similar functionality for binary
and ascii, without intermixing them in the same routine to create a mess.
So, that df_3dmatrix() has in some sense not kept up in functionality with
df_readline().

I'd like to propose that you let me take a bit of time to move the important
parts of df_3dmatrix() that aren't already in df_readbinary(), which I think
are very few, into df_readbinary(). I could easily make that df_readbinary()
routine read gnuplot binary files. Then df_3dmatrix() and its helper routine
df_read_matrix() could be discarded. That would make the innards of plot2d.c
and plot3d.c use only the "df_readline()" approach to bringing in data.
Perhaps one doesn't like the "df_readline()" approach, but I think there is
an advantage to having just one paradigm *and* to having df_readascii() and
df_readbinary() in the same file, where they share their many similarities.
It is also a good reminder that if someone adds functionality to
df_readascii(), there is always that df_readbinary() to be aware of.

Basically, df_3dmatrix() and df_read_matrix() read in the whole data file at
once, then go through a short loop to store the array into the "point
structure". Is that the direction that Gnuplot should head? If you think
not, and agree that having just a "df_readline()" form of input is good,
then that right there will free up some code space and assuage concerns
about code bloat.

Dan
|
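To make the "one paradigm" idea concrete, here is an abstract sketch of what
a single input loop shared by an ascii reader and a binary reader could look
like. The type names and the function-pointer arrangement are inventions of
this sketch; gnuplot's actual df_readline()/df_readbinary() interface may
differ:

/* Abstract sketch of a single input loop shared by an ascii reader and a
 * binary reader.  The names here are illustrative only; they are not
 * gnuplot's actual interfaces. */
#include <stdio.h>

#define MAXCOLS  7
#define READ_EOF (-1)

/* A reader fills v[] with one record's values and returns the number of
 * columns read, or READ_EOF at end of input. */
typedef int (*reader_fn)(FILE *fp, double v[MAXCOLS]);

static void input_loop(FILE *fp, reader_fn read_record)
{
    double v[MAXCOLS];
    int ncols;

    while ((ncols = read_record(fp, v)) != READ_EOF) {
        /* Store the point; the storage path is the same no matter which
         * reader produced the values. */
        printf("record with %d column(s), first value %g\n",
               ncols, ncols > 0 ? v[0] : 0.0);
    }
}

The point of such an arrangement is that the plotting code has one storage
path regardless of whether the values came from ascii or binary input.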