From: Lars Hecking <lhecking@us...>  20051201 09:51:16

 Forwarded message from Thomas Mattison <mattison@...>  To: lhecking@..., broeker@..., cgaylord@... From: Thomas Mattison <mattison@...> Subject: volunteering for gnuplot work? Date: Wed, 30 Nov 2005 13:21:24 0800 Hi I'm a physics professor, and I use gnuplot for teaching and my own work. There are some annoyances about the fitting aspects of gnuplot that I would like to volunteer to fix. The major annoyance is that the errors from fits are not done in what I consider to be the correct way. There exists a definition for the error of a fit that is independent of the goodness of the fit, although of course it depends on the function, the distribution of the data, and the error bars on the data. Gnuplot fits do not report this result. Internally, gnuplot does calculate this error, but it then rescales it by the goodness of the fit. If the function fits the data perfectly at every point, the error returned by gnuplot is zero. This is clearly nonsense. Consider the simplest possible case of a single data point with a y value and error, and fitting a singleparameter function whose value is just the parameter. The value of the fit parameter should be the y value, and the error on the fit parameter should be the error on the data point. A gnuplot fit returns the right parameter value, but since the goodness of fit is perfect, gnuplot says the error on the fit parameter is zero. The rescaling that gnuplot does to the fit errors is equivalent to calculating a single scale factor on the data point errors from the residuals of the fit, and using the rescaled data errors in the fit. There is some value to this rescaling in many common circumstances. The standard formula for fit parameter errors requires that valid data errors be input to the fit. If the user does not have them, the gnuplot algorithm is equivalent to computing a uniform value for the data errors from the residuals of the fit. I don't have a better suggestion for what to do in such cases. My proposed solution would be to report both versions of the error: the conventional error that is independent of the goodness of fit, and the rescaled error that gnuplot presently reports. I would also explicitly report the factor by which the data errors have been effectively rescaled. If the user does not include errors in the fit, I would make both answers be the present gnuplot answer. In this case, the above errorrescaling factor is in fact the effective error that gnuplot has calculated for the data points. If the data were plotted with this constant error value, then they should agree with the fitted function within this error about 2/3 of the time. A feature of this fix is that it would change the format of the parameter and error report from the fits. I don't have as good an idea as you folks do about whether users would care about format changes. Perhaps a backwardcompatibility switch could be put in, so the oldformat report could still be available. Another annoyance is the lack of a nice way to record fit error values (it would also be nice to record parameter errors from multiple similar fits to different data sets). I would add a feature: create a parameterlog file that would contain gnuplotreadable fit summary information: chisquare, parmA, normalerrorA, rescalederrorA, parmB, regularerrorB, rescalederrorB, ... Each fit would append to the end of the file, similar to the present fit log file. I would precede each line with a copy of the fit command that produced the fit (with a # in front so gnuplot would consider it to be a comment). Probably I would also have a commented line giving the time of the fit. It would also be possible to create a comment line containing headers to show which column means what for the fit, using the user's names for the fit variables. An example of using this feature is an experiment students do in my lab. They fit the currentvoltage relationship for a diode, which is I(V) = I0 * (exp(V/V0)  1). They do this with the diode dunked in liquid nitrogen, dry ice and alcohol, ice water, and boiling water. The I0 and V0 parameters depend on temperature, and I want the students to plot and fit the parameters vs temperature. The students could take this new parameterlog file, add a temperature column to the file with a text editor, then plot and fit from this file. I also have some minor annoyances that I think are worth fixing. I would change the default FIT_START_LAMBDA to be 0.01 as recommended by Numerical Recipes rather than the method used now (I tell my students to do this and it frequently helps). I would make the default value for uninitialized variables in fits to be 0.01 rather than 1e30 (the numerical derivatives algorithm tends to break with such a tiny default). I would fix the numerical derivatives algorithm so it would not break if the initial value of a parameter is zero. There is a more major feature addition that would be nice. Frequently, data has errors not only on the yvariable, but also on the xvariable. Gnuplot (nicely) handles plotting data with both x and y errors, but it doesn't know how to fit such data. The rigorously correct way involves fitting for adjustments to all the xvariable values, as well as the parameters. The naive way to do this produces impractically big matrices when there is a lot of data, even if the fit function is simple (the computation time goes as datapoints cubed instead of proportional to points). There is a more efficient but still rigorous way to do it (goes as parameters squared times points), and an even more efficient way (just a few times longer than the conventional fit, with not much extra coding required). Very few commonly available programs that do fits deal correctly with errors in x variables as well as y variables. I would be willing to try adding this to gnuplot. Note that there is a userinterface consistency issue. The following lines do analogous things. plot 'file' using 1:2:3 with err fit f(x) 'file' using 1:2:3 via a,b,c But the next two lines don't do analogous things plot 'file' using 1:2:3:4 with xyerr fit f(x,y) 'file' using 1:2:3:4 via a,b,c The plot command uses columns 3 and 4 for x and y errors, and the fit command uses columns 3 and 4 for z and z errors. One solution would be to make the interpretation of the columns in a fit command depend on the number of variables in the function. In this case, fit f(x ) 'file' using 1:2:3:4 via a,b,c fit f(x,y) 'file' using 1:2:3:4 via a,b,c would do different things. The first one would be y vs x with x and y errors, the second would be z vs (x,y) with z errors. Another solution would be to allow fit commands to "with" options like plot commands. If the "fit with" options were absent, gnuplot would do what it does now. If the "fit with" options were present, they would override the current default. I have my own clanguage fitting package for nonlinear fits using numerical derivatives that has all the capability of the gnuplot fitting internals. It would probably be easier for me to add xy error fitting by replacing the fitting guts of gnuplot with my own code than to reverseengineer what's there now well enough to add what would be required (yes I have looked at it). I would be willing to do bugfixes and maintenance to the fitting stuff afterwards. So what do you think of the possibility of me getting involved in this? Cheers Prof. Thomas Mattison, Dept. of Physics & Astronomy, Univ. of British Columbia Present Address: Stanford Linear Accelerator Center 2575 Sand Hill Road, Menlo Park, CA, 94025 Building 48 (Research Office Building), Mail Station MS35 Office: ROB231 Phone: 6509265342 Fax: 6509268522  End forwarded message  