Re: Re: interpretation of fit errors

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Hi

Consider the case where the "function" to be fit is just a constant: we think the the y-value of all data points should be the same, independent of x-value.  Let's call this constant C.  Surely we want gnuplot to be able to do this "fit" and give a meaningful error on C.

Then consider the special case where there is only a single data point: an x-value, a y-value, and a y-error.  The fit result should C=y, and the error should be sigma-C = sigma-y.  The fit should be "perfect" with a chisquare of zero.  The chisquare divided by the number of measurements is still zero.  

Internally, gnuplot's fit algorithm knows a number equivalent to sigma-C = sigma-y, but the reported error is that multiplied by chisquare/measurement, so the result is zero.

It is totally unreasonable to claim that "fitting" the single data point gives C with infinite precision, when the data point has a finite error.

One can make a similar argument about the case of fitting a line (two parameters, slope and intercept) to two x-y-yerror data points.  Gnuplot's fit will go exactly through both data points, with zero chisquare per measurement, so the reported errors on the slope and intercept will both be zero.  But clearly there is a range of slopes and range of intercepts that are "within one sigma" of both data points.  And internally gnuplot's fit algorithm knows those errors on slope and intercept, before it multiplies them by zero.

Every other fitting program that I have ever used, which takes into account y-errors, just reports the "internal" error and doesn't divide it by chisquare/measurement the way gnuplot does.

If the y-errors are known and are "correct", the points will scatter around the fit curve such that the residual (measurement minus curve) divided by the error has a gaussian distribution with mean of zero and sigma of 1.0.  If the "real" y-errors are twice as large as the "assumed" y-errors, the gaussian will have sigma of about 2.0, and the chisquare/measurement will be about 4.0.

However, the user is not required to provide y-errors to gnuplot.  In that case, the errors of the fit parameters are more problematic.  Gnuplot does what every other no-y-errors fitting program does: treat all the y-errors as being 1.0.  The "internal" error calculation is still done, but one does not expect it to really describe the errors of the parameters.  And one does not expect the "chisquare" per measurement to be approximately 1.0.

But we can use the "chisquare" per measurement to estimate the y-error, assuming that it's the same for all measurements.  The result is that y-error is just sqrt(chisquare/measurement).  

If we then repeated the fit using those y-errors, we do expect that the "internal" parameter errors will describe our knowledge of the parameters.

In fact, it's not necessary to actually repeat the fit.  The parameter values would come out exactly the same, and the parameter errors would just be multiplied by sqrt(chisquare/measurement).

This is in fact what gnuplot's fit algorithm does.  And for the case where no user errors are provided, I think it's (almost) the most sensible thing to do.  

So what should be done?  If it were me, I'd report both the "raw" and the "rescaled" parameter errors on the console.  The documentation should state that if the user has provided errors, the "raw" errors are the most meaningful, and if the user has not provided errors, the "rescaled" errors are the most meaningful.  This requires essentially no algorithm changes, just some print statement changes.

Perhaps at the end of the fit, one could print a recommendation to the user on which error to use, based on whether or not he provided errors, and on the chisquare per measurement.  If no errors were provided, only the rescaled errors are meaningful.  If errors were provided and the chisquare per measurement is reasonable (less than 2 perhaps), the raw errors are meaningful.  If errors were provided and the chisquare per measurement is not reasonable, then the raw errors are questionable and the rescaled errors might be more meaningful.

I would make one small additional change: chisquare should be divided by "degrees of freedom" == (measurements minus parameters) not just measurements.  Do this both for reporting STDFIT and rescaling of errors.  Dividing by degrees of freedom instead of measurements is totally standard statistical practice.

In the one-parameter fit to one data point and line-fit to two data points examples I give above, measurements-parameters is zero.  Ideally the chisquare is also zero, so ideally the rescaling factor is 0/0=NAN, which gives the "correct" result that the rescaled errors are meaningless (in these cases).  In practice the chisquare won't be exactly zero, so the rescaling factor will be INF, still telling the user that the rescaled errors are meaningless.  This is much more sensible than what gnuplot does now, which is report the errors are zero.

If we were fitting a line to 3 points instead of 2, with errors supplied, we expect the average raw chisquare to be 1, not 3.  If errors are not supplied, we expect the raw "chisquare" not to be 3*yerror^2, but only 1*yerror^2.  Dividing by (measurements - parameters) gives a significantly different (and more correct) estimate of the yerrors, and thus of the parameter errors.

                                      Cheers

Prof. Thomas Mattison           Hennings 276
University of British Columbia  Dept. of Physics and Astronomy
6224 Agricultural Road          Vancouver  BC        V6T 1Z1  CANADA
mat...@ph...         phone: 604-822-9690  fax:604-822-5324

Re: Re: interpretation of fit errors

A portable, multi-platform, command-line driven graphing utility

Re: Re: interpretation of fit errors