----- Forwarded message from Thomas Mattison <mattison@...> -----
To: lhecking@..., broeker@...,
From: Thomas Mattison <mattison@...>
Subject: volunteering for gnuplot work?
Date: Wed, 30 Nov 2005 13:21:24 -0800
I'm a physics professor, and I use gnuplot for teaching and my own
work. There are some annoyances about the fitting aspects of gnuplot
that I would like to volunteer to fix.
The major annoyance is that the errors from fits are not done in what I
consider to be the correct way. There exists a definition for the
error of a fit that is independent of the goodness of the fit, although
of course it depends on the function, the distribution of the data, and
the error bars on the data. Gnuplot fits do not report this result.
Internally, gnuplot does calculate this error, but it then rescales it
by the goodness of the fit.
If the function fits the data perfectly at every point, the error
returned by gnuplot is zero. This is clearly nonsense.
Consider the simplest possible case of a single data point with a y
value and error, and fitting a single-parameter function whose value is
just the parameter. The value of the fit parameter should be the y
value, and the error on the fit parameter should be the error on the
data point. A gnuplot fit returns the right parameter value, but since
the goodness of fit is perfect, gnuplot says the error on the fit
parameter is zero.
The rescaling that gnuplot does to the fit errors is equivalent to
calculating a single scale factor on the data point errors from the
residuals of the fit, and using the rescaled data errors in the fit.
There is some value to this rescaling in many common circumstances.
The standard formula for fit parameter errors requires that valid data
errors be input to the fit. If the user does not have them, the
gnuplot algorithm is equivalent to computing a uniform value for the
data errors from the residuals of the fit. I don't have a better
suggestion for what to do in such cases.
My proposed solution would be to report both versions of the error: the
conventional error that is independent of the goodness of fit, and the
rescaled error that gnuplot presently reports. I would also explicitly
report the factor by which the data errors have been effectively
rescaled. If the user does not include errors in the fit, I would make
both answers be the present gnuplot answer. In this case, the above
error-rescaling factor is in fact the effective error that gnuplot has
calculated for the data points. If the data were plotted with this
constant error value, then they should agree with the fitted function
within this error about 2/3 of the time.
A feature of this fix is that it would change the format of the
parameter and error report from the fits. I don't have as good an idea
as you folks do about whether users would care about format changes.
Perhaps a backward-compatibility switch could be put in, so the
old-format report could still be available.
Another annoyance is the lack of a nice way to record fit error values
(it would also be nice to record parameter errors from multiple similar
fits to different data sets).
I would add a feature: create a parameter-log file that would contain
gnuplot-readable fit summary information: chisquare, parm-A,
normal-errorA, rescaled-error-A, parm-B, regular-errorB,
rescaled-error-B, ... Each fit would append to the end of the file,
similar to the present fit log file. I would precede each line with a
copy of the fit command that produced the fit (with a # in front so
gnuplot would consider it to be a comment). Probably I would also have
a commented line giving the time of the fit. It would also be possible
to create a comment line containing headers to show which column means
what for the fit, using the user's names for the fit variables.
An example of using this feature is an experiment students do in my
lab. They fit the current-voltage relationship for a diode, which is
I(V) = I0 * (exp(V/V0) - 1). They do this with the diode dunked in
liquid nitrogen, dry ice and alcohol, ice water, and boiling water.
The I0 and V0 parameters depend on temperature, and I want the students
to plot and fit the parameters vs temperature. The students could take
this new parameter-log file, add a temperature column to the file with
a text editor, then plot and fit from this file.
I also have some minor annoyances that I think are worth fixing. I
would change the default FIT_START_LAMBDA to be 0.01 as recommended by
Numerical Recipes rather than the method used now (I tell my students
to do this and it frequently helps). I would make the default value
for uninitialized variables in fits to be 0.01 rather than 1e-30 (the
numerical derivatives algorithm tends to break with such a tiny
default). I would fix the numerical derivatives algorithm so it would
not break if the initial value of a parameter is zero.
There is a more major feature addition that would be nice. Frequently,
data has errors not only on the y-variable, but also on the x-variable.
Gnuplot (nicely) handles plotting data with both x and y errors, but
it doesn't know how to fit such data. The rigorously correct way
involves fitting for adjustments to all the x-variable values, as well
as the parameters. The naive way to do this produces impractically big
matrices when there is a lot of data, even if the fit function is
simple (the computation time goes as data-points cubed instead of
proportional to points). There is a more efficient but still rigorous
way to do it (goes as parameters squared times points), and an even
more efficient way (just a few times longer than the conventional fit,
with not much extra coding required).
Very few commonly available programs that do fits deal correctly with
errors in x variables as well as y variables. I would be willing to
try adding this to gnuplot.
Note that there is a user-interface consistency issue.
The following lines do analogous things.
plot 'file' using 1:2:3 with err
fit f(x) 'file' using 1:2:3 via a,b,c
But the next two lines don't do analogous things
plot 'file' using 1:2:3:4 with xyerr
fit f(x,y) 'file' using 1:2:3:4 via a,b,c
The plot command uses columns 3 and 4 for x and y errors, and the fit
command uses columns 3 and 4 for z and z errors.
One solution would be to make the interpretation of the columns in a
fit command depend on the number of variables in the function. In this
fit f(x ) 'file' using 1:2:3:4 via a,b,c
fit f(x,y) 'file' using 1:2:3:4 via a,b,c
would do different things. The first one would be y vs x with x and y
the second would be z vs (x,y) with z errors.
Another solution would be to allow fit commands to "with" options like
plot commands. If the "fit with" options were absent, gnuplot would do
what it does now. If the "fit with" options were present, they would
override the current default.
I have my own c-language fitting package for nonlinear fits using
numerical derivatives that has all the capability of the gnuplot
fitting internals. It would probably be easier for me to add xy error
fitting by replacing the fitting guts of gnuplot with my own code than
to reverse-engineer what's there now well enough to add what would be
required (yes I have looked at it). I would be willing to do bug-fixes
and maintenance to the fitting stuff afterwards.
So what do you think of the possibility of me getting involved in this?
Prof. Thomas Mattison, Dept. of Physics & Astronomy, Univ. of British
Present Address: Stanford Linear Accelerator Center
2575 Sand Hill Road, Menlo Park, CA, 94025
Building 48 (Research Office Building), Mail Station MS35
Office: ROB-231 Phone: 650-926-5342 Fax: 650-926-8522
----- End forwarded message -----