From: William P. Y. H. <wil...@gm...> - 2005-12-31 08:06:16
While doing my homework I needed regress and rcoplot, so I've written them, but they're far from complete. The problems I have are:

For regress:
1. The Matlab documentation at www.mathworks.com (for regress) mentions QR factorization (using a QR factorization instead of computing a matrix inverse). Can anyone explain this to me and maybe modify the code?
2. The confidence interval for beta returned by my regress differs from the one returned by Matlab's regress (I compared my results with my friend's, and also with another data set from my textbook). I've implemented the formulas from my textbook (don't ask -- it's in Chinese, but if you would like to know the formulas, I can post them here), but I still don't know what went wrong. A note: for any particular data set, my intervals seem to be scaled by a factor that depends on the data set. So if Matlab's interval is [b-m, b+m], mine is [b-qm, b+qm] with q < 1, but I don't know where this q comes from, and it is different for every data set.
3. The p value I calculate differs from my friend's calculation (using Matlab).

For rcoplot:
1. The round circles in the middle are ugly.
2. It's a bit slow, but that's the fastest I can get. I've tried looping the plot command (and using hold on/off), but that's slower.

Note that I'm using % for my comments, because I was comparing the code with my friend's, and he only knows Matlab.

When the problems above are solved, I'll tidy up the functions and submit them (to octave-forge? or to Octave?). Please help :)

--
William Poetra Yoga Hadisoeseno
From: Paul K. <pki...@us...> - 2005-12-31 10:09:24
Here's what I wrote in wsolve:

  [nr,nc] = size(A);
  if nc > nr, error("underdetermined system"); end

  ## system solution: A x = y => x = inv(A) y
  ## QR decomposition has good numerical properties:
  ##   AP = QR, with P'P = Q'Q = I, and R upper triangular
  ## so
  ##   inv(A) y = P inv(R) inv(Q) y = P inv(R) Q' y = P (R \ (Q' y))
  ## Note that y is usually a vector and Q is a matrix, so it will
  ## be faster to compute (y' Q)' than (Q' y).
  [Q,R,p] = qr(A,0);
  x = R\(y'*Q)';
  x(p) = x;
  s.R = R; s.R(:,p) = R;
  s.df = nr-nc;
  s.normr = norm(y - A*x);

Here's what I wrote in polyconf:

  ## Confidence intervals for linear system are given by:
  ##    x' p +/- sqrt( Finv(1-a,1,df) var(x' p) )
  ## where for confidence intervals,
  ##    var(x' p) = sigma^2 (x' inv(A'A) x)
  ## and for prediction intervals,
  ##    var(x' p) = sigma^2 (1 + x' inv(A'A) x)
  ##
  ## Rather than A'A we have R from the QR decomposition of A, but
  ## R'R equals A'A.  Note that R is not upper triangular since we
  ## have already multiplied it by the permutation matrix, but it
  ## is invertible.  Rather than forming the product R'R, which is
  ## ill-conditioned, we can rewrite x' inv(A'A) x as the equivalent
  ##    x' inv(R) inv(R') x = t t', for t = x' inv(R)
  ## Since x is a vector, t t' is the inner product sumsq(t).
  ## Note that LAPACK allows us to do this simultaneously for many
  ## different x using sqrt(sumsq(X/R,2)), with each x on a different row.
  ##
  ## Note: sqrt(F(1-a;1,df)) = T(1-a/2;df)

  function [y,dy] = confidence(A,p,S,alpha,typestr)
    if nargin < 4, alpha = []; end
    if nargin < 5, typestr = 'ci'; end
    y = A*p(:);
    switch typestr,
      case 'ci', pred = 0; default_alpha = erfc(1/sqrt(2));
      case 'pi', pred = 1; default_alpha = 0.05;
      otherwise, error("use 'ci' or 'pi' for interval type");
    end
    if isempty(alpha), alpha = default_alpha; end
    s = t_inv(1-alpha/2,S.df)*S.normr/sqrt(S.df);
    dy = s*sqrt(pred+sumsq(A/S.R,2));
  endfunction

Can you use these two functions to do what you need?

- Paul

On Dec 31, 2005, at 3:06 AM, William Poetra Yoga Hadisoeseno wrote:

> The Matlab documentation at www.mathworks.com (for regress) mentions
> QR factorization (using a QR factorization instead of computing a
> matrix inverse). Can anyone explain this to me?
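For reference, the two pieces above combine roughly as follows for a regress-style calculation. This is only a sketch, not the wsolve/polyconf code itself: the function name is invented, and it assumes a Student-t quantile function is available (tinv, or t_inv in older octave-forge).

  ## Sketch: QR-based least squares with coefficient confidence intervals,
  ## avoiding any explicit inv(X'*X).
  function [b, bint] = qr_ci_sketch (y, X, alpha)
    if nargin < 3, alpha = 0.05; end
    [n, k] = size (X);
    [Q, R, p] = qr (X, 0);               # economy QR with column pivoting
    b = R \ (Q' * y);                    # coefficients, in permuted order
    b(p) = b;                            # undo the permutation
    df = n - k;                          # residual degrees of freedom
    sigma = norm (y - X*b) / sqrt (df);  # residual standard deviation
    Rs = R;  Rs(:,p) = R;                # un-permute columns: Rs'*Rs == X'*X
    se = sigma * sqrt (sumsq (eye (k) / Rs, 2));   # sqrt(diag(inv(X'*X)))
    t  = tinv (1 - alpha/2, df);
    bint = [b - t*se, b + t*se];
  endfunction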
From: Paul K. <pki...@us...> - 2005-12-31 10:37:10
You should just be able to use errorbar.

Assuming you don't want to because you don't like how it looks, can you set the symbol size larger instead of drawing circles yourself?

  __gnuplot_set__('pointsize 2'); plot(rand(5,1),'o;;');

I would suggest plotting using:

  plot(cx,cy,'o;;',lx,ly,'-;;')

where cx,cy are the points and lx,ly are matrices with three columns and two rows for each residual, defining the end points of the error bars and the caps respectively.

If you decide to draw the circles yourself, change this to:

  plot(cx,cy,'-;;',lx,ly,'-;;')

where cx,cy contain one column per circle.

No for loops needed, and a small number of separate matrices being plotted should improve your performance a lot.

- Paul

On Dec 31, 2005, at 3:06 AM, William Poetra Yoga Hadisoeseno wrote:

> For rcoplot:
> 1. The round circles in the middle are ugly.
> 2. It's a bit slow, but that's the fastest I can get. I've tried
> looping the plot command (and using hold on/off), but that's slower.
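For reference, the single-call, vectorized plot described above can be built roughly like this. It is only a sketch with toy data; in a real rcoplot the residuals r and intervals rint would come from regress, and the cap width is arbitrary.

  n    = 10;
  r    = randn (n, 1);                  # toy residuals
  rint = [r - 1, r + 1];                # toy [lower, upper] interval end points
  cx = 1:n;  cy = r(:)';                # case numbers and residuals
  w  = 0.25;                            # half-width of the caps
  ## one 2-point segment per column: error bar, lower cap, upper cap
  lx = [ [cx; cx], [cx - w; cx + w], [cx - w; cx + w] ];
  ly = [ rint',    [rint(:,1)'; rint(:,1)'], [rint(:,2)'; rint(:,2)'] ];
  plot (cx, cy, 'o;;', lx, ly, '-;;');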
From: William P. Y. H. <wil...@gm...> - 2005-12-31 15:12:12
On 12/31/05, Paul Kienzle <pki...@us...> wrote:
> You should just be able to use errorbar.
>
> Assuming you don't want to because you don't like how it looks, can you
> set the symbol size larger instead of drawing circles yourself?
>
>   __gnuplot_set__('pointsize 2'); plot(rand(5,1),'o;;');
>
> I would suggest plotting using:
>
>   plot(cx,cy,'o;;',lx,ly,'-;;')
>
> where cx,cy are the points and lx,ly are matrices with three columns
> and two rows for each residual, defining the end points of the error
> bars and the caps respectively.
>
> If you decide to draw the circles yourself, change this to:
>
>   plot(cx,cy,'-;;',lx,ly,'-;;')
>
> where cx,cy contain one column per circle.
>
> No for loops needed, and a small number of separate matrices being
> plotted should improve your performance a lot.
>

OK, I was at first a bit confused because your code generates '+' style points on the graph, but I think that's a bug in Octave. Well, it seems I'll have to look at the plotting commands... Thanks anyway, I'll see what I can use ;)

--
William Poetra Yoga Hadisoeseno
From: William P. Y. H. <wil...@gm...> - 2006-01-01 09:20:38
Attachments:
regress.m
On 12/31/05, William Poetra Yoga Hadisoeseno <wil...@gm...> wrote:
> On 12/31/05, Paul Kienzle <pki...@us...> wrote:
> > Here's what I wrote in wsolve:
> >
> ...
> >
> > Here's what I wrote in polyconf:
> >
>
> Thanks, I'll take a look ;)
>

OK, today I finally finished the function (I think). I used Paul's notes at the end of wsolve.m. Take a look at the attachment ;)

Paul, do you want to have it in octave-forge, or should I submit it for inclusion in Octave instead?

--
William Poetra Yoga Hadisoeseno
From: Paul K. <pki...@us...> - 2006-01-01 17:05:54
On Jan 1, 2006, at 4:20 AM, William Poetra Yoga Hadisoeseno wrote:

> OK, today I finally finished the function (I think). I used Paul's
> notes at the end of wsolve.m. Take a look at the attachment ;)

You are not using the economy QR decomposition. This can lead to very large matrices for overdetermined systems.

Also, you should generally avoid computing the full inverse as part of solving a system. It is more expensive and I believe less stable than working directly off the QR decomposition that you have already computed.

Any reason you did not use my wsolve code as it stands?

Similarly, do you have a reason for not using the code I sent for computing confidence intervals, which also avoids direct computation of the inverse?

If there is some reason to prefer your code I would like to know what it is.

- Paul
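For reference, the size difference is easy to see on a toy overdetermined system; this is only an illustration with made-up data, not part of any submitted code.

  X = rand (1000, 5);  y = rand (1000, 1);  # toy overdetermined system
  [Qf, Rf] = qr (X);       # full QR:    Qf is 1000x1000, Rf is 1000x5
  [Qe, Re] = qr (X, 0);    # economy QR: Qe is 1000x5,    Re is 5x5
  b_qr   = Re \ (Qe' * y);             # solve directly from the QR factors
  b_norm = inv (X' * X) * (X' * y);    # normal equations: more work, less stable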
From: William P. Y. H. <wil...@gm...> - 2006-01-03 08:40:58
On 1/2/06, Paul Kienzle <pki...@us...> wrote:
>
> On Jan 1, 2006, at 4:20 AM, William Poetra Yoga Hadisoeseno wrote:
>
> > OK, today I finally finished the function (I think). I used Paul's
> > notes at the end of wsolve.m. Take a look at the attachment ;)
>
> You are not using the economy QR decomposition. This can lead to
> very large matrices for overdetermined systems.
>
> Also, you should generally avoid computing the full inverse as
> part of solving a system. It is more expensive and I believe
> less stable than working directly off the QR decomposition that
> you have already computed.
>
> Any reason you did not use my wsolve code as it stands?
>
> Similarly, do you have a reason for not using the code I sent
> for computing confidence intervals, which also avoids direct
> computation of the inverse?
>
> If there is some reason to prefer your code I would like
> to know what it is.
>

No reason; it's just that I didn't study my linear algebra course well (never went to class, for example :( ), and I haven't retaken it yet, so that part of mathematics is my "blurry spot" (not quite a blind spot...). As such, I don't really understand your code well; for my data set my code works, so for the time being I haven't used your code yet. I'm having my exams now, so I won't have much time for that. But I'll look at it again (so don't include it yet). Alternatively, if you would like to modify the function, I don't mind :) (although it wouldn't be very educational for me).

--
William Poetra Yoga Hadisoeseno
From: William P. Y. H. <wil...@gm...> - 2006-02-15 13:03:27
Attachments:
regress.m
OK, so it seems I've been idle for the past few weeks. Would you like to try the file? It's attached.

--
William Poetra Yoga Hadisoeseno
From: Paul K. <pki...@us...> - 2006-02-19 23:39:19
On Feb 15, 2006, at 8:03 AM, William Poetra Yoga Hadisoeseno wrote:

> OK, so it seems I've been idle for the past few weeks. Would you like
> to try the file? It's attached.
>
> --
> William Poetra Yoga Hadisoeseno
> <regress.m>

Okay, I added a test of the Longley data set from the NIST Statistical Reference Database.

The performance is pretty bad (the norm of the answer minus the certified value is about 10^-7), but that's a bit better than the competition.

Someone with more R experience than me will have to see if it performs any better.

Also, if you can convert bint to standard error values, that would be convenient.

- Paul

----

  % Longley data from the NIST Statistical Reference Dataset
  Z = [ 60323    83.0   234289   2356   1590   107608   1947
        61122    88.5   259426   2325   1456   108632   1948
        60171    88.2   258054   3682   1616   109773   1949
        61187    89.5   284599   3351   1650   110929   1950
        63221    96.2   328975   2099   3099   112075   1951
        63639    98.1   346999   1932   3594   113270   1952
        64989    99.0   365385   1870   3547   115094   1953
        63761   100.0   363112   3578   3350   116219   1954
        66019   101.2   397469   2904   3048   117388   1955
        67857   104.6   419180   2822   2857   118734   1956
        68169   108.4   442769   2936   2798   120445   1957
        66513   110.8   444546   4681   2637   121950   1958
        68655   112.6   482704   3813   2552   123366   1959
        69564   114.2   502601   3931   2514   125368   1960
        69331   115.7   518173   4806   2572   127852   1961
        70551   116.9   554894   4007   2827   130081   1962 ];
  % Results certified by NIST using 100 digit arithmetic
  % b and standard error in b
  V = [ -3482258.63459582          890420.383607373
        15.0618722713733           84.9149257747669
        -0.358191792925910E-01     0.334910077722432E-01
        -2.02022980381683          0.488399681651699
        -1.03322686717359          0.214274163161675
        -0.511041056535807E-01     0.226073200069370
        1829.15146461355           455.478499142212 ];
  Rsq = 0.995479004577296;
  F = 330.285339234588;

  y = Z(:,1); X = [ones(rows(Z),1), Z(:,2:end)];
  [b, bint, r, rint, stats] = regress (y, X, 0.05);
  assert(b, V(:,1), 6e-7)
  assert(stats(1), Rsq, 7e-14)
  assert(stats(2), F, 5e-9)
From: William P. Y. H. <wil...@gm...> - 2006-02-21 13:42:27
On 2/20/06, Paul Kienzle <pki...@us...> wrote:
>
> Okay, I added a test of the Longley data set from the NIST Statistical
> Reference Database.
>
> The performance is pretty bad (the norm of the answer minus the
> certified value is about 10^-7), but that's a bit better than the
> competition.
>
> Someone with more R experience than me will have to see if it performs
> any better.
>
> Also, if you can convert bint to standard error values, that would be
> convenient.
>
> - Paul
>
> ----
>
> % Longley data from the NIST Statistical Reference Dataset
> [...]
> y = Z(:,1); X = [ones(rows(Z),1), Z(:,2:end)];
> [b, bint, r, rint, stats] = regress (y, X, 0.05);
> assert(b, V(:,1), 6e-7)
> assert(stats(1), Rsq, 7e-14)
> assert(stats(2), F, 5e-9)
>

Yes, you can convert bint to the standard error values with (0.05 is alpha, 9 is the degrees of freedom):

  ((bint(:, 1) - bint(:, 2)) / 2) / tinv (0.05 / 2, 9)

but the accuracy will be affected. (It's not so accurate anyway...)

--
William Poetra Yoga Hadisoeseno
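For reference, spelled out against the Longley test above, that conversion is equivalent to the following sketch. It is meant to run right after the test, with bint and V still in scope, and the tolerance shown is only indicative.

  alpha = 0.05;
  df    = 16 - 7;                 # observations minus estimated parameters
  ## same as William's expression, written with the positive t quantile
  se    = (bint(:,2) - bint(:,1)) / 2 / tinv (1 - alpha/2, df);
  assert (se, V(:,2), 8e-3)       # only passes with a loose tolerance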
From: William P. Y. H. <wil...@gm...> - 2006-03-08 01:51:37
Attachments:
regress.m
On 2/21/06, William Poetra Yoga Hadisoeseno <wil...@gm...> wrote:
> On 2/20/06, Paul Kienzle <pki...@us...> wrote:
> >
> > Okay, I added a test of the Longley data set from the NIST Statistical
> > Reference Database.
> >
> > The performance is pretty bad (the norm of the answer minus the
> > certified value is about 10^-7), but that's a bit better than the
> > competition.
> >
> > Also, if you can convert bint to standard error values, that would be
> > convenient.
> >
> > - Paul
> >
> > ----
> >
> > % Longley data from the NIST Statistical Reference Dataset
> > [...]
> > y = Z(:,1); X = [ones(rows(Z),1), Z(:,2:end)];
> > [b, bint, r, rint, stats] = regress (y, X, 0.05);
> > assert(b, V(:,1), 6e-7)
> > assert(stats(1), Rsq, 7e-14)
> > assert(stats(2), F, 5e-9)
> >
>
> Yes, you can convert bint to the standard error values with (0.05 is
> alpha, 9 is the degrees of freedom):
>   ((bint(:, 1) - bint(:, 2)) / 2) / tinv (0.05 / 2, 9)
> but the accuracy will be affected. (It's not so accurate anyway...)
>

I still don't know how to improve the accuracy...

I've added an assertion for the standard error values, but I had to set the tolerance to 7.6e-3 (which is not good). Also, I think one digit for the tolerance would be good; what do you think?

By the way, in the octave-forge version (in the repository), the functions tinv and fcdf are replaced by t_inv and f_cdf. Why?

--
William Poetra Yoga Hadisoeseno
From: Paul K. <pki...@us...> - 2006-03-08 12:37:06
On Mar 7, 2006, at 8:51 PM, William Poetra Yoga Hadisoeseno wrote:

> On 2/21/06, William Poetra Yoga Hadisoeseno <wil...@gm...> wrote:
>> On 2/20/06, Paul Kienzle <pki...@us...> wrote:
>>>
>>> Okay, I added a test of the Longley data set from the NIST
>>> Statistical Reference Database.
>>>
>>> The performance is pretty bad (the norm of the answer minus the
>>> certified value is about 10^-7), but that's a bit better than the
>>> competition.
>>>
>>> [...]
>>>
>>> y = Z(:,1); X = [ones(rows(Z),1), Z(:,2:end)];
>>> [b, bint, r, rint, stats] = regress (y, X, 0.05);
>>> assert(b, V(:,1), 6e-7)
>>> assert(stats(1), Rsq, 7e-14)
>>> assert(stats(2), F, 5e-9)
>>>
>>
>> Yes, you can convert bint to the standard error values with (0.05 is
>> alpha, 9 is the degrees of freedom):
>>   ((bint(:, 1) - bint(:, 2)) / 2) / tinv (0.05 / 2, 9)
>> but the accuracy will be affected. (It's not so accurate anyway...)
>>
>
> I still don't know how to improve the accuracy...
>
> I've added an assertion for the standard error values, but I had to
> set the tolerance to 7.6e-3 (which is not good). Also, I think one
> digit for the tolerance would be good; what do you think?

I generally use only one digit for the tolerance.

> By the way, in the octave-forge version (in the repository), the
> functions tinv and fcdf are replaced by t_inv and f_cdf. Why?

Because tinv and fcdf were not available in my version of Octave. These can change once we abandon support for 2.1.x.

- Paul
From: William P. Y. H. <wil...@gm...> - 2006-03-09 06:49:49
On 3/8/06, Paul Kienzle <pki...@us...> wrote:
>
> I generally use only one digit for the tolerance.
>

OK, then use one digit :) The standard error values differ by as much as 10e-3 (as can be seen from the assert), and it doesn't improve if I embed the test in the main part of the script (that is, just using sqrt(v*c) as the standard error values).

> Because tinv and fcdf were not available in my version of Octave.
> These can change once we abandon support for 2.1.x.
>

OK, looking forward to it :)

--
William Poetra Yoga Hadisoeseno
From: Paul K. <pki...@us...> - 2006-03-09 09:41:27
On Mar 9, 2006, at 1:49 AM, William Poetra Yoga Hadisoeseno wrote:

> On 3/8/06, Paul Kienzle <pki...@us...> wrote:
>>
>> I generally use only one digit for the tolerance.
>>
>
> OK, then use one digit :) The standard error values differ by as much
> as 10e-3 (as can be seen from the assert), and it doesn't improve if I
> embed the test in the main part of the script (that is, just using
> sqrt(v*c) as the standard error values).

Yet again I had to loosen the criteria on the tests in order for them to pass on the OS X PPC architecture.

The dataset is known to be hard, so I'm not surprised by the weak tolerances. I'm curious whether, e.g., R can compute better results.

- Paul
From: William P. Y. H. <wil...@gm...> - 2006-03-09 11:41:12
On 3/9/06, Paul Kienzle <pki...@us...> wrote:
>
> Yet again I had to loosen the criteria on the tests in order for them
> to pass on the OS X PPC architecture.
>
> The dataset is known to be hard, so I'm not surprised by the weak
> tolerances. I'm curious whether, e.g., R can compute better results.
>

Well, OK, so can we say regress.m is more or less finished now? Maybe I can start on rcoplot. I didn't study the code in the rcoplot documentation, so is it legal for me to write that function anyway (without referring to that code)?

--
William Poetra Yoga Hadisoeseno
From: Paul K. <pki...@us...> - 2006-03-10 01:37:11
It is best if you can find an independent explanation of the algorithm which you can cite in the documentation. It makes the documentation better, and it makes the argument that your implementation is independent easier to support. E.g.,

  http://www.itl.nist.gov/div898/handbook/pri/section2/pri245.htm

- Paul

On Mar 9, 2006, at 6:41 AM, William Poetra Yoga Hadisoeseno wrote:

> On 3/9/06, Paul Kienzle <pki...@us...> wrote:
>>
>> Yet again I had to loosen the criteria on the tests in order for them
>> to pass on the OS X PPC architecture.
>>
>> The dataset is known to be hard, so I'm not surprised by the weak
>> tolerances. I'm curious whether, e.g., R can compute better results.
>>
>
> Well, OK, so can we say regress.m is more or less finished now? Maybe
> I can start on rcoplot. I didn't study the code in the rcoplot
> documentation, so is it legal for me to write that function anyway
> (without referring to that code)?
>
> --
> William Poetra Yoga Hadisoeseno
>