From: Daniel J S. <dan...@ie...> - 2011-05-23 00:55:19
On 05/22/2011 01:14 PM, pl...@pi... wrote:
> On 05/21/11 23:04, Daniel J Sebald wrote:
>> On 05/21/2011 03:15 AM, pl...@pi... wrote:
>>> Hi,
>>>
>>> 'help fit' reports that the fit command uses Levenberg–Marquardt algo to
>>> do the fit.
>>>
>>> I think this raises an important question that very often over-looked by
>>> many users of least-squares techniques even at maths PhD level.
>>>
>>> Such techniques often only optimised the y error rather than the
>>> perpendicular error from the line. This is implicitly assuming y
>>> uncertainty >> x uncertainty. While this condition is often satisfied
>>> in a controlled experiment there are many situations where this is not
>>> applicable and gets totally overlooked.
>>>
>>> A common case is scatter plots which are frequently used to seek a
>>> relations between two quantities , each with significant errors /
>>> uncertainties.
>>>
>>> In this situation the fitted line is "wrong". In fact it's the
>>> application that is wrong , hence the wrong result. This may or may not
>>> be apparent to the eye.
>>>
>>> I have seen this happen so many times (including once in a PhD thesis
>>> report!) that I think it needs a serious health warning in the doc.
>>>
>>> "Warning: using least-squares inappropriately can seriously damage your
>>> reputation". ;)
>>>
>>> Firstly , could you confirm the basis on which this algo is applied in
>>> gnuplot? Does it only optimise vertical y residuals?
>>
>> Often it is an assumption that the independent variables are exact
>> measurements. Not true, typically, but if the variance is small and
>> homoscedastic, the two can probably be lumped together. I.e., we are
>> searching for a relationship:
>>
>> Y = f(X + eps1) + eps2
>>   ~= f(X) + C eps1 + residual + eps2
>>   ~= f(X) + (C eps1 + eps2)
>>
>> where hopefully the residual due to nonlinearity of the relationship is
>> small compared to other randomness. It's up to the user's judgment and
>> knowledge of the application to determine that.
>>
>> Anyway, your point is true of most software packages: details are so
>> often lacking. That's why it would be nice to have a set of white
>> papers to go along with the software so that people know exactly what
>> the algorithm is, both for the benefit of the user and other developers.
>> Most of the time it is "here's a hunk of code, use it at your own risk".
>>
>> Dan
>>
>
> Thanks, I've read up on that algo and clearly this is only doing NLLS on
> y residuals.
>
> So my suggestion is that this is made abundantly clear in the help text.
> I'm not suggesting your "while paper" but just some comment to the
> effect that 'fit' will not give correct results if there are non
> negligible errors in x values.
>
> It absolutely amazes me how few people realise this , even highly
> qualified ones, so this is not some pedantic nicety.
>
> Most people seem to think once they've heard of doing a least squares
> fit that's all there is to it and it's some magical formula that works
> for all cases.
>
> I suggest modifying the first paragraph of help fit with something like
> the following:
>
> >>
> The `fit` command can fit a user-supplied expression to a set of data
> points (x,z) or (x,y,z), using an implementation of the nonlinear
> least-squares (NLLS) Marquardt-Levenberg algorithm. Any user-defined
> variable occurring in the expression may serve as a fit parameter, but
> the return type of the expression must be real.
> >>
>
> new
> >>
> The `fit` command can fit a user-supplied expression to a set of data
> points (x,z) or (x,y,z), using an implementation of the non-linear
> least-squares (NLLS) Marquardt-Levenberg algorithm. This algorithm
> optimises y residuals only and carries the implicit assumption that
> error/uncertainties in x are negligible. If that is not the case the
> fit may succeed but will give wrong results.

I'm OK with the change, but for a few things.
First, you stated 'optimizes y' and 'uncertainties in x', but please check that this precisely describes the algorithm. The beginning of the paragraph lists (x, z) or (x, y, z). The first expression has no 'y', so what is optimized in that case? The second expression has (x, y, z), so is it only the 'x' in that case that is assumed exact? Or is it both 'x' and 'y'? Perhaps that sentence should be, "This algorithm optimises z residuals only and carries the implicit assumption that error/uncertainties in x (and y) are negligible."

Second, I would hold off using the statement "wrong results" and maybe use "a poor fit". Saying the result is wrong means the algorithm is broken, but it just provides numbers. The user is using the wrong tool. Also, "wrong" means a bad fit in this context and one can get a bad fit and misinterpret results even if uncertainty in x is small.

Third, if there is an alternative approach for the case when x uncertainty can't be ignored, keep the discussion going and maybe we can add such a feature to the list of items we'd like to add in the future.

Dan
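
For anyone following the thread who wants to see the effect numerically, here is a minimal sketch in Python (not gnuplot, and not how `fit` is implemented). It generates a straight-line relationship with equal noise on x and y, then compares a fit that minimises vertical residuals only with one that minimises perpendicular distances. All names, the true slope of 1, and the noise levels are assumptions chosen for illustration. The perpendicular fit is the closed-form orthogonal regression for a line, which itself assumes the x and y error variances are equal.

```python
import random


def _centered_sums(xs, ys):
    """Return (sxx, syy, sxy): sums of squares and cross-products about the means."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, syy, sxy


def vertical_slope(xs, ys):
    """Slope of the line minimising vertical (y) residuals only."""
    sxx, _, sxy = _centered_sums(xs, ys)
    return sxy / sxx


def perpendicular_slope(xs, ys):
    """Slope of the line minimising perpendicular distances.

    Assumes equal error variance in x and y (orthogonal regression).
    """
    sxx, syy, sxy = _centered_sums(xs, ys)
    d = syy - sxx
    return (d + (d * d + 4.0 * sxy * sxy) ** 0.5) / (2.0 * sxy)


# Illustrative data: true relation y = x, with the same noise added to x and y.
rng = random.Random(0)      # fixed seed so the run is repeatable
noise_sigma = 2.0           # assumed, applied to both coordinates
x_true = [rng.uniform(0.0, 10.0) for _ in range(2000)]
xs = [x + rng.gauss(0.0, noise_sigma) for x in x_true]
ys = [x + rng.gauss(0.0, noise_sigma) for x in x_true]

slope_vertical = vertical_slope(xs, ys)
slope_perp = perpendicular_slope(xs, ys)

print("true slope:           1.0")
print("vertical residuals:  ", round(slope_vertical, 3))
print("perpendicular dist.: ", round(slope_perp, 3))
```

With these settings the vertical-residual slope should come out clearly below 1, because the noise in x flattens the fitted line, while the perpendicular fit should stay close to 1. Neither result indicates a broken algorithm; the difference comes from which error model matches the data.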