On 2-Dec-05, at 4:58 AM, Hans-Bernhard Broeker wrote:
> Thomas Mattison wrote:
>> On 1-Dec-05, at 4:38 AM, Hans-Bernhard Broeker wrote:
>>> Thomas Mattison wrote:
>
>>>> The major annoyance is that the errors from fits are not done in
>>>> what I consider to be the correct way.
> [...]
>
>>> I know. I made it that way, at least partly on purpose.
>
>> For cases where the model being fit is good, the data is good, and
>> the errors are appropriate, there will still be fluctuations in the
>> value of chisquare from data set to data set.
>
> Yes. But those should be small. Small enough that the effect on
> parameter errors doesn't really matter. If you're seriously worried
> about, say, a 10-percent change to the parameter errors, no simple
> fitting program will do the job anyway. Odds are that more
> fundamental violations (non-gaussian data errors, mostly) will do much
> more damage to the fit's behaviour than that.
Many people do worry about getting the errors right to less than the
fluctuations in chisquare per degree of freedom. If I were refereeing
a paper, I would send it back for revision if it used the definition of
error that gnuplot does.
And you don't need a more complicated fitting program than gnuplot to
get those errors. Gnuplot already does the right calculation
internally, it just doesn't report it. It reports only a non-standard,
fluctuation-sensitive error (although I will grant that it is possible
in most cases to recover the standard error from the fit log by some
simple hand calculations).
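
To make the two error definitions concrete, here is a minimal Python sketch (not gnuplot code) of a weighted straight-line fit. The function name and the restriction to y = a + b*x are assumptions chosen for illustration. The "standard" errors come from the covariance matrix alone; the "rescaled" errors multiply them by sqrt(chisquare/ndf), which is the relation between the two described in this thread.

```python
import math

def weighted_line_fit(x, y, sigma):
    """Weighted least-squares fit of y = a + b*x (illustrative helper).

    Returns (a, b, standard_errors, rescaled_errors, chisq, ndf).
    """
    w = [1.0 / s**2 for s in sigma]
    S = sum(w)
    Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sy = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    delta = S * Sxx - Sx**2

    a = (Sxx * Sy - Sx * Sxy) / delta
    b = (S * Sxy - Sx * Sy) / delta

    # Standard errors: square roots of the covariance-matrix diagonal.
    # They depend only on x and sigma, not on the scatter of y.
    std = (math.sqrt(Sxx / delta), math.sqrt(S / delta))

    chisq = sum(wi * (yi - a - b * xi)**2 for wi, xi, yi in zip(w, x, y))
    ndf = len(x) - 2

    # Rescaled errors: standard errors times sqrt(chisq/ndf), so they
    # fluctuate from data set to data set along with chisq.
    scale = math.sqrt(chisq / ndf)
    resc = (std[0] * scale, std[1] * scale)
    return a, b, std, resc, chisq, ndf
```

Calling it on two data sets with the same x values and the same sigma gives identical standard errors but different rescaled errors, which is the fluctuation sensitivity at issue here.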
> Maybe we could use a policy like what the PDG has for global parameter
> fits of particle properties: report the unmodified errors and the
> chisq/ndf individually as the usual result, but if the chisq/ndf is
> bigger than 1, scale the errors instead, and add a note to this
> effect.
This sounds like what I have been proposing: report both the
non-fluctuating standard error, and the rescaled error that gnuplot
produces now. But I would report both, instead of choosing for the
user based on the chisquare value.
Users should look at chisquare/ndf (or better, we should calculate the
chi square probability given the degrees of freedom for them inside
gnuplot). If that is reasonable, then use the standard non-fluctuating
error reported. If the chisquare/ndf is bad, the rescaled error is a
reasonable rough estimate of the errors, but they should be suspicious.
If they did not even use errors when they did the fit, the chisquare
does not have a well-defined meaning, and only the rescaled error has
much meaning. In that case, I would be willing to consider reporting
only the rescaled error.
But the program and documentation would be simpler if it treated fits
with errors and fits without errors in the same way, and left it up to
the user to interpret the results appropriately.
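
The chi-square probability mentioned above can be computed in a few lines. The following is a Python sketch only, not a proposal for the actual gnuplot implementation; the function name is an assumption. It uses the standard closed-form finite series for the upper-tail probability (one form for even ndf, one for odd ndf) and makes no attempt to handle extreme values of chisquare.

```python
import math

def chisq_probability(chisq, ndf):
    """Upper-tail probability Q(chisq | ndf): the chance of getting a
    chi-square at least this large if the model and errors are right."""
    if ndf < 1 or chisq < 0:
        raise ValueError("need ndf >= 1 and chisq >= 0")
    half = 0.5 * chisq
    if ndf % 2 == 0:
        # Even ndf: exp(-x/2) * sum_{k=0}^{ndf/2-1} (x/2)^k / k!
        term = 1.0
        total = 1.0
        for k in range(1, ndf // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd ndf: erfc(sqrt(x/2)) plus a finite series in odd powers of chi.
    chi = math.sqrt(chisq)
    q = math.erfc(chi / math.sqrt(2.0))
    term = chi  # chi^(2r-1) / (1*3*...*(2r-1)) for r = 1
    total = 0.0
    for r in range(1, (ndf - 1) // 2 + 1):
        total += term
        term *= chisq / (2 * r + 1)
    return q + math.sqrt(2.0 / math.pi) * math.exp(-half) * total
```

A user would then read a very small probability as a sign that the model or the errors are suspect, and a probability very close to 1 as a hint that the errors may be overestimated.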
>
>>> I think it would make a lot more sense to add a couple lines to
>>> 'fit.log' instead of creating what would be an almost complete copy
>>> of all of its content.
>
>> The reason I would like a separate file is that the fit.log file is
>> intended to be read by humans, and is not particularly readable by
>> gnuplot.
>
> It's relatively easy to have the best of both worlds. Just comment
> out all the lines meant for human eyes only, and change the format of
> the others a bit to accommodate gnuplot's syntax a bit.
That's pretty similar to my compromise suggestion: put a comment
character in the first column of every line presently sent to fit.log,
and add new lines containing gnuplot-readable parameters and errors.
But for the applications I have in mind, it would be best to put all of
the parameters and errors from a given fit into a single
gnuplot-readable line. That line might get rather wide, and would be
not very compatible with human eyes. So I would not change the
existing human-readable format except to add a comment character, and
just add one non-human-readable line (perhaps with a comment line
containing parameter names and error labels in horizontal format
instead of column format).
> But I think a continuously appended fit log is of rather limited
> usability for automated re-use --- you can't feed pieces of it back
> into gnuplot any easier than you could do with individual files
> created by the 'update' command. You could only 'load' all of it, but
> that won't do you any good, because all variables would be those of
> the last fit.
What I want is not the ability to 'load' the fit.log, but to 'plot' or
'fit' the parameters embedded in it from multiple similar fits to
different data sets. After many similar fits, fit.log (after ignoring
the commented human-readable lines) would effectively have columns of
fit parameters and errors. You could plot parameter 1 vs fit number
with its error, for instance, or plot parameter 3 vs parameter 2. The
user could also use a text-editor to add columns to each
gnuplot-readable line with information about the conditions of the data
(the temperature in my example).
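
To illustrate the kind of log layout being discussed, here is a small Python sketch. Everything about the layout is hypothetical: the column order, the comment text, and the numbers are invented for illustration and are not an existing gnuplot format. The one thing it relies on from gnuplot is that data-file lines beginning with '#' are treated as comments.

```python
def read_fit_summary(lines):
    """Return one list of floats per fit, skipping '#' comment lines."""
    rows = []
    for line in lines:
        text = line.strip()
        if not text or text.startswith("#"):
            continue  # human-readable section, skipped as a comment
        rows.append([float(tok) for tok in text.split()])
    return rows

# Hypothetical log: invented layout and numbers, purely for illustration.
sample_log = """\
# FIT: data read from 'run1.dat'
# a = 1.02 +/- 0.05
# columns: a a_err b b_err temperature
1.02 0.05 2.10 0.08 290
# FIT: data read from 'run2.dat'
1.10 0.06 2.31 0.07 310
"""

rows = read_fit_summary(sample_log.splitlines())

# Parameter 1 with its error versus fit number.
for fit_number, row in enumerate(rows):
    print(fit_number, row[0], row[1])
```

With one wide line per fit, the surviving rows form ordinary columns, so plotting one parameter against fit number, against another parameter, or against a hand-added column such as temperature needs no external tool.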
> I think that kind of work would be easier with external tools that
> collect data from individual 'update'd files than by working from a
> single long log.
My goal is to make such tools unnecessary, and have gnuplot do most or
all of the work, if it can be done by modest changes to the code.
> That's exactly what 'update' is for --- it just doesn't output the
> errors and such yet, but that should be easy to add.
Adding the errors to "update" output would be easy; making the input
parser ignore them for "fit via 'file'" would be more challenging.
But the format of "update" and "via" files is human-readable with one
parameter per line, and that is not the best for feeding back into
gnuplot for plotting.
<note to spectators: now the subject changes to the possibility of
doing fits with errors on the x-variable as well as the y variable.
One issue is how to tell the fit command which column means what>
>> I also suggested having the fit parser respect "with" options to make
>> the distinction.
>
> I don't think 'with' is a good name to use for that --- it has a
> rather different job in plot and splot. Changing the number of
> allowed/required using specifiers, and the interpretation of the data
> they yield, is really only a side effect of selecting a plot style,
> and a somewhat confusing one at that. Exporting only the confusing
> aspects of 'with' over to fit, while keeping none of the
> straightforward meanings that excuse them, feels like a bad user
> interface design to me.
I'm not sure I follow your logic. From the user point of view, "plot
with" for errorbars seems straightforward. err or yerr means the third
column is y errors, xerr means the third column has x errors, xyerr
means 3=xerr, 4=yerr. All these could be accepted by the y=f(x)
fitter.
It's true that splot doesn't presently know how to deal with z-errors.
But if it ever does, I suspect that it will be triggered by "with zerr"
and will expect 4 data columns, for x:y:z:zerr.
Plot also accepts asymmetric errors (4-column form for err, xerr, yerr
and 6-column for xyerr) that would have to be rejected by the fitter.
If the errorbars are really asymmetric, and that really matters to the
fit user, (s)he needs professional help that (s)he's not going to get
from gnuplot!
Letting "fit" accept "with err" would allow it to be more similar to
"plot" which I think is good interface design. [I would keep the
existing syntax for y=f(x) fits functioning if the user leaves off
"with err" for the sake of compatibility]
The joker in this proposal is that it is not compatible with the
present syntax of the fit command, where the existence of a 4th data
column forces a z=f(x,y) fit [the existence of a 3rd column forces a
y=f(x) fit with y errors, only 2 columns means y=f(x) fit with uniform
y errors].
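
The column rules can be summarized in a short Python sketch. This is only an illustration of the proposal, not gnuplot's parser: the function name is an assumption, and the keywords are the shorthand used in this message (err, yerr, xerr, xyerr). Without a "with" keyword it applies the present column-count rules; with one, the keyword decides what the columns mean.

```python
def interpret_fit_columns(ncols, with_style=None):
    """Map a column count and an optional 'with' keyword to column roles."""
    if with_style is None:
        # Present rules: the number of columns alone decides the fit type.
        present = {
            2: ("x", "y"),               # y=f(x), uniform y errors
            3: ("x", "y", "yerr"),       # y=f(x) with y errors
            4: ("x", "y", "z", "zerr"),  # z=f(x,y) with z errors
        }
        if ncols not in present:
            raise ValueError("unsupported number of columns: %d" % ncols)
        return present[ncols]

    # Proposed rules: the keyword selects the meaning of the columns.
    proposed = {
        "err": ("x", "y", "yerr"),
        "yerr": ("x", "y", "yerr"),
        "xerr": ("x", "y", "xerr"),
        "xyerr": ("x", "y", "xerr", "yerr"),
    }
    if with_style not in proposed:
        raise ValueError("unknown style: %r" % with_style)
    roles = proposed[with_style]
    if ncols != len(roles):
        # For example, the asymmetric error-bar forms that plot accepts.
        raise ValueError("'with %s' expects %d columns" % (with_style, len(roles)))
    return roles
```

The conflict is visible in the four-column case: with no keyword it means x:y:z:zerr, while "with xyerr" would reinterpret the same four columns as x:y:xerr:yerr.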
But that syntax is already a bit awkward, because the user doesn't
always have z-errors available. At present, the user is told to invent
one by saying "fit f(x,y) 'file' using x:y:z:(1)". But we don't force
users to explicitly invent an error column for y=f(x) fits without y
errors.
So my proposal is to invent an "sfit" command that does z=f(x,y) fits.
It would expect 3 columns for x:y:z, and accept 4 columns for
x:y:z:zerr. I would make its parser also accept (but probably not
require) "with zerr" right from the start.
gnuplot already makes a distinction between 'plot' for y(x) and 'splot'
for z(x,y). Adding 'sfit' is consistent with this. After all, to
check a 'fit' of y=f(x), the user does a 'plot' with both the data and
the fit function. After doing a z=f(x,y) fit, he needs to do an
'splot' with both the data and the fit function. So why not call the
fit an 'sfit' ?
> The impact of the 3D fitting feature on the parser is quite minimal as
> it is. It just checks whether there are 4 columns of data or not.
I'm trying to think of things from the user perspective, as well as the
maintainer perspective. I'm not sure that the present interface to
z=f(x,y) fits gets the balance right. And I am volunteering to do
the proposed parser changes. The point is to get opinions on what is
the right thing to do, which involves both users and maintainers.
> This would become much more complicated with all the suggested
> extensions, and then it might be necessary to add a new command. But
> we usually try to avoid adding top-level commands as far as possible.
> 'fit' and 'update' were about the only additions to that set in the
> last decade. So I would prefer to do it by some other means than
> adding a command 'sfit'.
A user probably doesn't care if a system has a small number of
top-level commands with many options or a large number of top-level
commands with few options. The user cares about how easy the system is
to use. If having more top-level commands allows the options to be
simpler to understand and use, it should be considered. Some such
changes can make the maintenance easier also, by allowing more code to
be shared across functionalities.
So here's the summary of what I am proposing to do for gnuplot fitting:
1. Change fit error reporting to include both standard and rescaled
   errors (rescaled error == present gnuplot error).
2. Write gnuplot-readable parameter and error summary lines to file
   (an additional file and/or fit.log, perhaps with commenting-out of
   fit.log lines).
3. Add ability to fit data with x errors as well as y errors
   (distinguished by "with xerr" or "with xyerr" in fit command,
   without breaking present interface, including x:y:z:zerr syntax
   for z=f(x,y) fits).
4. Add ability of fit command parser to interpret "with err" and
   "with yerr" (without breaking present interface).
5. Add 'sfit' command for z=f(x,y) fits, accepting x:y:z, x:y:z:err,
   and also "x:y:z:err with zerr".
Cheers
Prof. Thomas Mattison, Dept. of Physics & Astronomy, Univ. of British
Columbia
Present Address: Stanford Linear Accelerator Center
2575 Sand Hill Road, Menlo Park, CA, 94025
Building 48 (Research Office Building), Mail Station MS35
Office: ROB-231 Phone: 650-926-5342 Fax: 650-926-8522