From: eric j. <er...@en...> - 2002-06-11 22:00:09
> From: cbarker@localhost.localdomain [mailto:cbarker@localhost.localdomain]
>
> eric jones wrote:
> > The default axis choice influences how people choose to lay out their
> > data in arrays. If the default is to sum down columns, then users lay
> > out their data so that this is the order of computation.
>
> This is absolutely true. I definitely choose my data layout so that the
> various rank reducing operators do what I want. Another reason to have
> consistency. So I don't really care which way is default, so the default
> might as well be the better performing option.
>
> Of course, compatibility with previous versions is helpful too...arrrgg!
>
> What kind of a performance difference are we talking here anyway?

Guess I ought to test instead of just saying it is so...

I ran the following test of summing 200 sets of 10000 numbers. I expected
a speed-up of about 2... I didn't get it. They are pretty much the same
speed on my machine?? (more later)

C:\WINDOWS\system32>python
ActivePython 2.2.1 Build 222 (ActiveState Corp.) based on
Python 2.2.1 (#34, Apr 15 2002, 09:51:39) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from Numeric import *
>>> import time
>>> a = ones((10000,200),Float) * arange(10000)[:,NewAxis]
>>> b = ones((200,10000),Float) * arange(10000)[NewAxis,:]
>>> t1 = time.clock();x=sum(a,axis=0);t2 = time.clock();print t2-t1
0.0772411018719
>>> t1 = time.clock();x=sum(b,axis=-1);t2 = time.clock();print t2-t1
0.079615705348

I also tried FFT, and did see a difference -- a speed-up of 1.5+:

>>> q = ones((1024,1024),Float)
>>> t1 = time.clock();x = FFT.fft(q,axis=0);t2 = time.clock();print t2-t1
0.907373143793
>>> t1 = time.clock();x= FFT.fft(q,axis=-1);t2 = time.clock();print t2-t1
0.581641800843
>>> .907/.581
1.5611015490533564

Same in scipy:

>>> from scipy import *
>>> a = ones((1024,1024),Float)
>>> import time
>>> t1 = time.clock(); q = fft(a,axis=0); t2 = time.clock();print t2-t1
0.870259488287
>>> t1 = time.clock(); q = fft(a,axis=-1); t2 = time.clock();print t2-t1
0.489512214541
>>> t1 = time.clock(); q = fft(a,axis=0); t2 = time.clock();print t2-t1
0.849266317367
>>> .849/.489
1.7361963190184049

So why is sum() the same speed for both cases? I don't know. I wrote a
quick C program that is similar to how Numeric loops work, and I saw about
a factor of 4 improvement by summing rows instead of columns:

C:\home\eric\wrk\axis_speed>gcc -O2 axis.c

C:\home\eric\wrk\axis_speed>a
summing rows (sec): 0.040000
summing columns (sec): 0.160000
pass

These numbers are more like what I expected to see in the Numeric tests,
but they are strange when compared to the Numeric timings -- the row sum
is twice as fast as Numeric, while the column sum is twice as slow.
Because all the work is done in C and we're summing reasonably long
arrays, the Numeric and C versions should be roughly the same speed. I can
understand why summing rows is twice as fast in my C routine -- the
Numeric loop code is not going to win awards for being optimal. What I
don't understand is why the column summation is twice as slow in my C code
as in Numeric. This should not be. I've posted it below in case someone
can enlighten me.

I think in general, you should see a speed-up of 1.5+ when summing over
the "faster" axis. This holds true for fft in Python and for my sum in C.
As to why I don't see it in Numeric's sum(), I'm not sure. It is certainly
true that non-strided access makes the best use of cache and is *usually*
faster.
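If anyone wants to try the sum() comparison on another machine, here is a
rough script version that repeats each timing and keeps the best run, in
case run-to-run jitter is hiding a difference. (best_of is just a
throwaway helper I cooked up for this, not anything in Numeric.)

import time
from Numeric import *

def best_of(n, func, arr, axis):
    # run the call n times and keep the fastest timing
    best = 1e300
    for i in range(n):
        t1 = time.clock()
        func(arr, axis)
        t2 = time.clock()
        if t2 - t1 < best:
            best = t2 - t1
    return best

a = ones((10000,200),Float) * arange(10000)[:,NewAxis]
b = ones((200,10000),Float) * arange(10000)[NewAxis,:]

print "sum(a, axis=0):  %f" % best_of(5, sum, a, 0)
print "sum(b, axis=-1): %f" % best_of(5, sum, b, -1)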
eric

------------------------------------------------------------------------

#include <malloc.h>
#include <stdio.h>    /* for printf */
#include <time.h>

int main()
{
    double *a, *sum1, *sum2;
    int i, j, si, sj, ind, I, J;
    int small=200, big=10000;
    clock_t t1, t2;

    /* layout 1: small x big, element (i,j) at a[si*i + sj*j]
       with si=big, sj=1, so each row is contiguous in memory */
    I = small; J = big;
    si = big;  sj = 1;

    a    = (double*)malloc(I*J*sizeof(double));
    sum1 = (double*)malloc(small*sizeof(double));
    sum2 = (double*)malloc(small*sizeof(double));

    //set memory
    for(i = 0; i < I; i++) {
        sum1[i] = 0;
        sum2[i] = 0;
        ind = si * i;
        for(j = 0; j < J; j++) {
            a[ind] = (double)j;
            ind += sj;
        }
        ind += si;
    }

    /* sum along the rows */
    t1 = clock();
    for(i = 0; i < I; i++) {
        sum1[i] = 0;
        ind = si * i;
        for(j = 0; j < J; j++) {
            sum1[i] += a[ind];
            ind += sj;
        }
        ind += si;
    }
    t2 = clock();
    printf("summing rows (sec): %f\n", (t2-t1)/(float)CLOCKS_PER_SEC);

    /* layout 2: big x small, element (i,j) at a[si*i + sj*j]
       with si=1, sj=big */
    I = big; J = small;
    sj = big; si = 1;

    t1 = clock();
    //set memory
    for(i = 0; i < I; i++) {
        ind = si * i;
        for(j = 0; j < J; j++) {
            a[ind] = (double)i;
            ind += sj;
        }
        ind += si;
    }
    /* sum down the columns */
    for(j = 0; j < J; j++) {
        sum2[j] = 0;
        ind = sj * j;
        for(i = 0; i < I; i++) {
            sum2[j] += a[ind];
            ind += si;
        }
    }
    t2 = clock();
    printf("summing columns (sec): %f\n", (t2-t1)/(float)CLOCKS_PER_SEC);

    /* check that both passes computed the same sums */
    for (i=0; i < small; i++) {
        if(sum1[i] != sum2[i])
            printf("failure %d, %f %f\n", i, sum1[i], sum2[i]);
    }
    printf("pass %f\n", sum1[0]);
    return 0;
}
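P.S. A caveat I noticed while pasting the code in: in the second block the
clock() calls bracket the strided re-fill of a as well as the summation,
and since the summation's inner loop increments ind by si=1, its reads are
actually contiguous. Below is a rough, untested sketch of a variant that
fills the array once up front and times only the two summation orders --
the same row sums computed first with unit-stride reads and then with a
stride of "big" doubles per read. I haven't re-run the numbers with it, so
treat it as a sketch, not a result.

/* Untested sketch: fill a up front, then time only the summation loops.
   Both passes compute the same per-row sums; the first reads memory
   contiguously, the second reads with a stride of big doubles. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    int small = 200, big = 10000;
    int i, j;
    clock_t t1, t2;
    double *a  = (double*)malloc((size_t)small * big * sizeof(double));
    double *s1 = (double*)malloc(small * sizeof(double));
    double *s2 = (double*)malloc(small * sizeof(double));

    /* a is small x big, row-major: a[i][j] = a[i*big + j] = j */
    for (i = 0; i < small; i++)
        for (j = 0; j < big; j++)
            a[i*big + j] = (double)j;

    /* contiguous traversal: inner loop walks along a row, stride 1 */
    t1 = clock();
    for (i = 0; i < small; i++) {
        s1[i] = 0;
        for (j = 0; j < big; j++)
            s1[i] += a[i*big + j];
    }
    t2 = clock();
    printf("contiguous sums (sec): %f\n", (t2 - t1)/(float)CLOCKS_PER_SEC);

    /* strided traversal: same sums, but consecutive reads jump a whole
       row (big doubles) apart */
    for (i = 0; i < small; i++)
        s2[i] = 0;
    t1 = clock();
    for (j = 0; j < big; j++)
        for (i = 0; i < small; i++)
            s2[i] += a[i*big + j];
    t2 = clock();
    printf("strided sums (sec):    %f\n", (t2 - t1)/(float)CLOCKS_PER_SEC);

    /* check that both passes computed the same sums */
    for (i = 0; i < small; i++)
        if (s1[i] != s2[i])
            printf("failure %d, %f %f\n", i, s1[i], s2[i]);
    printf("pass %f\n", s1[0]);

    free(a); free(s1); free(s2);
    return 0;
}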