From: Hans-Bernhard B. <br...@ph...> - 2005-12-01 12:37:37
|
Thomas Mattison wrote: > The major annoyance is that the errors from fits are not done in what I > consider to be the correct way. There exists a definition for the error > of a fit that is independent of the goodness of the fit, although of > course it depends on the function, the distribution of the data, and the > error bars on the data. Gnuplot fits do not report this result. I know. I made it that way, at least partly on purpose. The basic problem is that once the chisq/ndf is far away from one, discussing correctness of parameter errors is a waste of energy. In such a case whole fit is plain and simply wrong, so no parameter error value really has a right of calling itself correct. chisq/ndf far away from one means that either the data errors are unrealistic, or the model is wrong (over-/underfitting), or both. In this situation, it's a quite completely arbitrary choice whether to believe the input errors, or the residuals. gnuplot chooses to believe the residuals. > If the function fits the data perfectly at every point, the error > returned by gnuplot is zero. This is clearly nonsense. No --- the fit itself is clearly nonsense. I have to insist that outputting nonsense as the result is fully justified in such a case. > My proposed solution would be to report both versions of the error: the > conventional error that is independent of the goodness of fit, and the > rescaled error that gnuplot presently reports. I would also explicitly > report the factor by which the data errors have been effectively > rescaled. That factor already is reported. It's sqrt(WSSR/ndf). I don't really think that there's much value in outputting two numbers as seemingly independent results where in reality there's just a multiplication by common factor needed to go from one to the other. > A feature of this fix is that it would change the format of the > parameter and error report from the fits. I don't have as good an idea > as you folks do about whether users would care about format changes. Probably not as much now as they would have before version 4.0, when we introduced 'set fit errorvariables' which lets users get at the parameter errors directly from inside gnuplot, without parsing the fit.log or screen output to extract them. > I would add a feature: create a parameter-log file that would contain > gnuplot-readable fit summary information: chisquare, parm-A, > normal-errorA, rescaled-error-A, parm-B, regular-errorB, > rescaled-error-B, ... Each fit would append to the end of the file, > similar to the present fit log file. I would precede each line with a > copy of the fit command that produced the fit (with a # in front so > gnuplot would consider it to be a comment). Probably I would also have > a commented line giving the time of the fit. It would also be possible > to create a comment line containing headers to show which column means > what for the fit, using the user's names for the fit variables. I think it would make a lot more sense to add a couple lines to 'fit.log' instead of creating what would be an almost complete copy of all of its content. > I also have some minor annoyances that I think are worth fixing. I > would change the default FIT_START_LAMBDA to be 0.01 as recommended by > Numerical Recipes rather than the method used now (I tell my students > to do this and it frequently helps). Caution there --- the actual algorithm is not the same as that in NR (although it used to be), so not all recommendations issued by that book may apply to gnuplot unmodified. I took the computation for the initial lambda from a textbook on numerical maths (Schwarz "Numerische Mathematik", Teubner Verlag, in German). I.e. the default startup for lambda is not any particular fixed number. It's computed from the problem. > I would make the default value for uninitialized variables in fits to > be 0.01 rather than 1e-30 (the numerical derivatives algorithm tends > to break with such a tiny default). It's not 1e-30 right now --- it's 1.0 > I would fix the numerical derivatives algorithm so it would > not break if the initial value of a parameter is zero. > There is a more major feature addition that would be nice. Frequently, > data has errors not only on the y-variable, but also on the x-variable. > Gnuplot (nicely) handles plotting data with both x and y errors, but it > doesn't know how to fit such data. Neither do Mrs. Marquard and Levenberg, or anybody else I've heard about, for generic non-linear fitting. > The rigorously correct way involves > fitting for adjustments to all the x-variable values, as well as the > parameters. Rigorously, fitting doesn't even know what x values are. It's just a typical use case simplification to treat the model as a function y[i] = f(x[i], parameters) Internally, the algorithm only assumes y[i] = f[i](parameters) > But the next two lines don't do analogous things > > plot 'file' using 1:2:3:4 with xyerr > fit f(x,y) 'file' using 1:2:3:4 via a,b,c > > The plot command uses columns 3 and 4 for x and y errors, and the fit > command uses columns 3 and 4 for z and z errors. That's because that fit has no useful relation with the 'plot' command you're comparing it to. The type of command to compare it with would be splot 'file' using 1:2:3:4 with errorbars (which unfortunately still doesn't exist). > One solution would be to make the interpretation of the columns in a fit > command depend on the number of variables in the function. It won't work as easy as that --- parameters can also be passed as parameters, i.e. you can fit f(x,a,b,c) 'file' u 1:2:3 via a,b,c instead of fit f(x) 'file' u 1:2:3 via a,b,c so the argument count of the function is strictly a red herring. |
From: Thomas M. <mat...@ph...> - 2005-12-01 22:56:56
|
Hi To the spectators of this discussion, it's in the context of me volunteering to do actual work, not complaining that someone else should! On 1-Dec-05, at 4:38 AM, Hans-Bernhard Broeker wrote: > Thomas Mattison wrote: > >> The major annoyance is that the errors from fits are not done in what >> I consider to be the correct way. There exists a definition for the >> error of a fit that is independent of the goodness of the fit, >> although of course it depends on the function, the distribution of >> the data, and the error bars on the data. Gnuplot fits do not report >> this result. > > I know. I made it that way, at least partly on purpose. > > The basic problem is that once the chisq/ndf is far away from one, > discussing correctness of parameter errors is a waste of energy. In > such a case whole fit is plain and simply wrong, so no parameter error > value really has a right of calling itself correct. > > chisq/ndf far away from one means that either the data errors are > unrealistic, or the model is wrong (over-/underfitting), or both. > In this situation, it's a quite completely arbitrary choice whether to > believe the input errors, or the residuals. gnuplot chooses to > believe the residuals. > For cases where the model being is fit is good, the data is good, and the errors are appropriate, there will still be fluctuations in the value of chisquare from data set to data set. It is wrong to say that the "lucky" data sets really determine the parameters better than the "unlucky" data sets. The parameter errors should be independent of the fluctuations in the chisquare. Every other fitting program I have run across reports the canonical error, which is independent of the chisquare. I agree that if the chisq/ndf is far greater than 1, one should interpret the canonical error with a grain of salt. That is why I advocate reporting both the canonical error and the rescaled error. >> If the function fits the data perfectly at every point, the error >> returned by gnuplot is zero. This is clearly nonsense. > > No --- the fit itself is clearly nonsense. I have to insist that > outputting nonsense as the result is fully justified in such a case. I disagree. If we fit a straight line to two data points, the fit will be perfect, with a chisquare of zero. If the errors on the two data points are known, then it is simple to calculate the errors on the slope and intercept by standard error propagation. The canonical error from a standard fitting program agrees exactly with the error-propagation result. The errors are well-defined, meaningful, and finite. But the gnuplot convention says that the errors on the slope and intercept are either zero because the chisquare is zero, or undefined (because in the case of two data points and two parameters, there are zero degrees of freedom). >> I would add a feature: create a parameter-log file that would contain >> gnuplot-readable fit summary information: chisquare, parm-A, >> normal-errorA, rescaled-error-A, parm-B, regular-errorB, >> rescaled-error-B, ... Each fit would append to the end of the file, >> similar to the present fit log file. I would precede each line with >> a copy of the fit command that produced the fit (with a # in front so >> gnuplot would consider it to be a comment). Probably I would also >> have a commented line giving the time of the fit. It would also be >> possible to create a comment line containing headers to show which >> column means what for the fit, using the user's names for the fit >> variables. > > I think it would make a lot more sense to add a couple lines to > 'fit.log' instead of creating what would be an almost complete copy of > all of its content. The reason I would like a separate file is that the fit.log file is intended to be read by humans, and is not particularly readable by gnuplot. In my original post I gave a real-world example from teaching with gnuplot where students do multiple similar fits to different data sets, then plot the fit parameters (not the fit data!) with gnuplot. They even fit the fit parameters (with errors) using gnuplot. It's tedious and error-prone to read the fit.log files, and type the parameters and errors into another file, before gnuplot can read them. If gnuplot fits put their results in gnuplot-readable form into a file, it would be simple and reliable. Adding lines to fit.log that would be gnuplot-readable would be a partial solution, but a human would still have to delete the non-readable lines. Perhaps a compromise of putting a comment character at the start of all the other lines of fit.log would be the best solution. >> There is a more major feature addition that would be nice. >> Frequently, data has errors not only on the y-variable, but also on >> the x-variable. Gnuplot (nicely) handles plotting data with both x >> and y errors, but it doesn't know how to fit such data. > > Neither do Mrs. Marquard and Levenberg, or anybody else I've heard > about, for generic non-linear fitting. The theory behind it is simple enough, and some programs (including mine) can do it, they are just not as widely known as the simpler case of errors only in y. > Rigorously, fitting doesn't even know what x values are. It's just a > typical use case simplification to treat the model as a function > > y[i] = f(x[i], parameters) > > Internally, the algorithm only assumes > > y[i] = f[i](parameters) Yes, as I stated you need to go beyond y[i] = f[i](parameters). It's not a simple extension of standard algorithms, in the way that z[i] = f(x[i], y[i], parameters) is a simple extension of y[i] = f(x[i], parameters). But it can be done, and errors on the x variables are common in practice, so it should be done more often. > >> But the next two lines don't do analogous things >> plot 'file' using 1:2:3:4 with xyerr >> fit f(x,y) 'file' using 1:2:3:4 via a,b,c >> The plot command uses columns 3 and 4 for x and y errors, and the fit >> command uses columns 3 and 4 for z and z errors. >> One solution would be to make the interpretation of the columns in a >> fit command depend on the number of variables in the function. > > It won't work as easy as that --- parameters can also be passed as > parameters, i.e. you can > > fit f(x,a,b,c) 'file' u 1:2:3 via a,b,c > > instead of > > fit f(x) 'file' u 1:2:3 via a,b,c > > so the argument count of the function is strictly a red herring. I hadn't thought of that. So that proposal doesn't work. I also suggested having the fit parser respect "with" options to make the distinction. Another possibility would be to make a new command name like "sfit" and use that for fits with both x and y as independent variables. That would probably simplify the fit command parser (at the expense of creating a new, but hopefully simple, command to parse), and allow more parallelism between plot/fit and splot/sfit. Cheers Prof. Thomas Mattison, Dept. of Physics & Astronomy, Univ. of British Columbia Present Address: Stanford Linear Accelerator Center 2575 Sand Hill Road, Menlo Park, CA, 94025 Building 48 (Research Office Building), Mail Station MS35 Office: ROB-231 Phone: 650-926-5342 Fax: 650-926-8522 |
From: Hans-Bernhard B. <br...@ph...> - 2005-12-02 12:57:47
|
Thomas Mattison wrote: > On 1-Dec-05, at 4:38 AM, Hans-Bernhard Broeker wrote: >> Thomas Mattison wrote: >>> The major annoyance is that the errors from fits are not done in what >>> I consider to be the correct way. [...] >> I know. I made it that way, at least partly on purpose. > For cases where the model being is fit is good, the data is good, and > the errors are appropriate, there will still be fluctuations in the > value of chisquare from data set to data set. Yes. But those should be small. Small enough that the effect on parameter errors doesn't really matter. If you're seriously worried about, say, a 10-percent change to the parameter errors, no simple fitting program will do the job anyway. Odds are that more fundamental violations (non-gaussian data errors, mostly) will do much more damage to the fit's behaviour than that. > I agree that if the chisq/ndf is far greater than 1, one should > interpret the canonical error with a grain of salt. That is why I > advocate reporting both the canonical error and the rescaled error. I rest unconvinced that that's a good idea. Maybe we could use a policy like what the PDG has for global parameter fits of particle properties: report the unmodified errors and the chisq/ndf individually as the usual result, but if the chisq/ndf is bigger than 1, scale the errors instead, and add a note to this effect. The basic question is almost a philosophical one: what do we want to fit: the data, or their errors? chisq/ndf > 1 means that the model doesn't fit the data errors --- either the errorbars were unrealistically small, or the model too simplistic. The unscaled errors you would get in such a case try to fit the errorbars, gnuplot's output fits the data. It treats the errorbars as weights, and nothing else. >> I think it would make a lot more sense to add a couple lines to >> 'fit.log' instead of creating what would be an almost complete copy of >> all of its content. > The reason I would like a separate file is that the fit.log file is > intended to be read by humans, and is not particularly readable by gnuplot. It's relatively easy to have the best of both worlds. Just comment out all the lines meant for human eyes only, and change the format of the others a bit to accomodate gnuplot's syntax a bit. But I think a continuously appended fit log is of rather limited usability for automated re-use --- you can't feed pieces of it back into gnuplot any easier than you could do with individual files created by the 'update' command. You could only 'load' all of it, but that won't do you any good, because all variables would be those of the last fit. > In my original post I gave a real-world example from teaching with > gnuplot where students do multiple similar fits to different data sets, > then plot the fit parameters (not the fit data!) with gnuplot. They > even fit the fit parameters (with errors) using gnuplot. I think that kind of work would be easier with external tools that collect data from individual 'update'd files than by working from a single long log. > If gnuplot fits put their results in gnuplot-readable form into a file, > it would be simple and reliable. That's exactly what 'update' is for --- it just doesn't output the errors and such yet, but that should be easy to add. > I also suggested having the fit parser respect "with" options to make > the distinction. I don't think 'with' is a good name to use for that --- it has a rather different job in plot and splot. Changing the number of allowed/required using specifiers, and the interpretation of the data they yield, is really only a side effect of selecting a plot style, and a somewhat confusing one at that. Exporting only the confusing aspects of 'with' over to fit, while keeping none of the straightforward meanings that excuse them, feels like a bad user interface design to me. > Another possibility would be to make a new command name like "sfit" and > use that for fits with both x and y as independent variables. That > would probably simplify the fit command parser (at the expense of > creating a new, but hopefully simple, command to parse), and allow more > parallelism between plot/fit and splot/sfit. The impact of the 3D fitting feature on the parser is quite minimal as it is. It just checks whether there are 4 columns of data or not. This would become much more complicated with all the suggested extensions, and then it might be necessary to add a new command. But we usually try to avoid adding top-level commands as far as possible. 'fit' and 'update' were about the only additions to that set in the last decade. So I would prefer to do it by some other means than adding a command 'sfit'. |
From: Thomas M. <mat...@ph...> - 2005-12-02 20:07:51
|
On 2-Dec-05, at 4:58 AM, Hans-Bernhard Broeker wrote: > Thomas Mattison wrote: >> On 1-Dec-05, at 4:38 AM, Hans-Bernhard Broeker wrote: >>> Thomas Mattison wrote: > >>>> The major annoyance is that the errors from fits are not done in >>>> what I consider to be the correct way. > [...] > >>> I know. I made it that way, at least partly on purpose. > >> For cases where the model being is fit is good, the data is good, and >> the errors are appropriate, there will still be fluctuations in the >> value of chisquare from data set to data set. > > Yes. But those should be small. Small enough that the effect on > parameter errors doesn't really matter. If you're seriously worried > about, say, a 10-percent change to the parameter errors, no simple > fitting program will do the job anyway. Odds are that more > fundamental violations (non-gaussian data errors, mostly) will do much > more damage to the fit's behaviour than that. Many people do worry about getting the errors right to less than the fluctuations in chisquare per degree of freedom. If I were refereeing a paper, I would send it back for revision if it used the definition of error that gnuplot does. And you don't need a more complicated fitting program than gnuplot to get those errors. Gnuplot already does the right calculation internally, it just doesn't report it. It reports only a non-standard, fluctuation-sensitive error (although I will grant that it is possible in most cases to recover the standard error from the fit log by some simple hand calculations). > Maybe we could use a policy like what the PDG has for global parameter > fits of particle properties: report the unmodified errors and the > chisq/ndf individually as the usual result, but if the chisq/ndf is > bigger than 1, scale the errors instead, and add a note to this > effect. This sounds like what I have been proposing: report both the non-fluctuating standard error, and the rescaled error that gnuplot produces now. But I would report both, instead of choosing for the user based on the chisquare value. Users should look at chisquare/ndf (or better, we should calculate the chi square probability given the degrees of freedom for them inside gnuplot). If that is reasonable, then use the standard non-fluctuating error reported. If the chisquare/ndf is bad, the rescaled error is a reasonable rough estimate of the errors, but they should be suspicious If they did not even use errors when they did the fit, the chisquare does not have a well-defined meaning, and only the rescaled error has much meaning. In that case, I would be willing to consider reporting only the rescaled error. But the program and documentation would be simpler if it treated fits with errors and fits without errors in the same way, and left it up to the user to interpret the results appropriately. > >>> I think it would make a lot more sense to add a couple lines to >>> 'fit.log' instead of creating what would be an almost complete copy >>> of all of its content. > >> The reason I would like a separate file is that the fit.log file is >> intended to be read by humans, and is not particularly readable by >> gnuplot. > > It's relatively easy to have the best of both worlds. Just comment > out all the lines meant for human eyes only, and change the format of > the others a bit to accomodate gnuplot's syntax a bit. That's pretty similar to my compromise suggestion: put a comment character in the first column of every line presently sent to fit.log, and add new lines containing gnuplot-readable parameters and errors. But for the applications I have in mind, it would be best to put all of the parameters and errors from a given fit into a single gnuplot-readable line. That line might get rather wide, and would be not very compatible with human eyes. So I would not change the existing human-readable format except to add a comment character, and just add one non-human-readable line (perhaps with a comment line containing parameter names and error labels in horizontal format instead of column format). > But I think a continuously appended fit log is of rather limited > usability for automated re-use --- you can't feed pieces of it back > into gnuplot any easier than you could do with individual files > created by the 'update' command. You could only 'load' all of it, but > that won't do you any good, because all variables would be those of > the last fit. What I want is not the ability to 'load' the fit.log, but to 'plot' or 'fit' the parameters embedded in it from multiple similar fits to different data sets. After many similar fits, fit.log (after ignoring the commented human-readable lines) would effectively have columns of fit parameters and errors. You could plot parameter 1 vs fit number with its error, for instance, or plot parameter 3 vs parameter 2. The user could also use a text-editor to add columns to each gnuplot-readable line with information about the conditions of the data (the temperature in my example). > I think that kind of work would be easier with external tools that > collect data from individual 'update'd files than by working from a > single long log. My goal is to make such tools unnecessary, and have gnuplot do most or all of the work, if it can be done by modest changes to the code. > That's exactly what 'update' is for --- it just doesn't output the > errors and such yet, but that should be easy to add. Adding the errors to "update" output would be easy, making the input parser ignore them for " fit via 'file' " would be more challenging. But the format of "update" and "via" files is human-readable with one parameter per line, and that is not the best for feeding back into gnuplot for plotting. <note to spectators: now the subject changes to the possibility of doing fits with errors on the x-variable as well as the y variable. One issue is how to tell the fit command which column means what> >> I also suggested having the fit parser respect "with" options to make >> the distinction. > > I don't think 'with' is a good name to use for that --- it has a > rather different job in plot and splot. Changing the number of > allowed/required using specifiers, and the interpretation of the data > they yield, is really only a side effect of selecting a plot style, > and a somewhat confusing one at that. Exporting only the confusing > aspects of 'with' over to fit, while keeping none of the > straightforward meanings that excuse them, feels like a bad user > interface design to me. I'm not sure I follow your logic. From the user point of view, "plot with" for errorbars seems straightforward. err or yerr means the third column is y errors, xerr means the third column has x errors, xyerr means 3=xerr, 4=yerr. All these could be accepted by the y=f(x) fitter. It's true that splot doesn't presently know how to deal with z-errors. But if it ever does, I suspect that it will be triggered by "with zerr" and will expect 4 data columns, for x:y:z:zerr. Plot also accepts asymmetric errors (4-column form for err, xerr, yerr and 6-column for xyerr) that would have to be rejected by the fitter. If the errorbars are really asymmetric, and that really matters to the fit user, (s)he needs professional help that (s)he's not going to get from gnuplot! Letting "fit" accept "with err" would allow it to be more similar to "plot" which I think is good interface design. [I would keep the existing syntax for y=f(x) fits functioning if the user leaves off "with err" for the sake of compatibility] The joker in this proposal is that it is not compatible with the present syntax of the fit command, where the existence of a 4th data column forces a z=f(x,y) fit [the existence of a 3d column forces a y=f(x) fit with y errors, only 2 columns means y=f(x) fit with uniform y errors]. But that syntax is already a bit awkward, because the user doesn't always have z-errors available. At present, the user is told to invent one by saying "fit f(x,y) 'file' using x:y:z:(1). But we don't force users to explicitly invent an error column for y=f(x) fits without y errors. So my proposal is to invent an "sfit" command that does z=f(x,y) fits. It would expect 3 columns for x:y:z, and accept 4 columns for x:y:z:zerr. I would make its parser also accept (but probably not require) "with zerr" right from the start. gnuplot already makes a distinction between 'plot' for y(x) and 'splot' for z(x,y). Adding 'sfit' is consistent with this. After all, to check a 'fit' of y=f(x), the user does a 'plot' with both the data and the fit function. After doing a z=f(x,y) fit, he needs to do an 'splot' with both the data and the fit function. So why not call the fit an 'sfit' ? > The impact of the 3D fitting feature on the parser is quite minimal as > it is. It just checks whether there are 4 columns of data or not. I'm trying to think of things from the user perspective, as well as the maintainer perspective. I'm not sure that the present interface to z=f(x,y) fits gets the balance right. And I am volunteering to do proposed the parser changes. The point is to get opinions on what is the right thing to do, which involves both users and maintainers. > This would become much more complicated with all the suggested > extensions, and then it might be necessary to add a new command. But > we usually try to avoid adding top-level commands as far as possible. > 'fit' and 'update' were about the only additions to that set in the > last decade. So I would prefer to do it by some other means than > adding a command 'sfit'. A user probably doesn't care if a system has a small number of top-level commands with many options or a large number of top-level commands with few options. The user cares about how easy the system is to use. If having more top-level commands allows the options to be simpler to understand and use, it should be considered. Some such changes can make the maintenance easier also, by allowing more code to be shared across functionalities. So here's the summary of what I am proposing to do for gnuplot fitting 1. Change fit error reporting to include both standard and rescaled errors (rescaled error == present gnuplot error) 2. Write gnuplot-readable parameter and error summary lines to file (an additional file and/or fit.log, perhaps with commenting-out of fit.log lines) 3. Add ability to fit data with both x errors as well as y errors (distinguished by "with xerr" or "with xyerr" in fit command, without breaking present interface, including x:y:z:zerr syntax for z=f(x,y) fits) 4. Add ability of fit command parser to interpret "with err" and "with yerr" (without breaking present interface) 5. Add 'sfit' command for z=f(x,y) fits, accepting x:y:z, x:y:z:err, and also "x:y:z:err with zerr" Cheers Prof. Thomas Mattison, Dept. of Physics & Astronomy, Univ. of British Columbia Present Address: Stanford Linear Accelerator Center 2575 Sand Hill Road, Menlo Park, CA, 94025 Building 48 (Research Office Building), Mail Station MS35 Office: ROB-231 Phone: 650-926-5342 Fax: 650-926-8522 |