From: Tim H. <tim...@ie...> - 2003-02-11 21:04:09
Perry Greenfield wrote:

>Tim Hochberg writes:
>
>>              Overhead (c)   Overhead (nc)   TimePerElement (c)   TimePerElement (nc)
>>  NumPy       10 us          10 us           85 ps                95 ps
>>  NumArray    200 us         530 us          45 ps                135 ps
>>  Psymeric    50 us          65 us           80 ps                80 ps
>>
>>The times shown above are for Float64s and are pretty approximate, and
>>they happen to be a particularly favorable array shape for Psymeric. I
>>have seen Psymeric as much as 50% slower than NumPy for large arrays of
>>certain shapes.
>>
>>The overhead for NumArray is surprisingly large. After doing this
>>experiment I'm certainly more sympathetic to Konrad wanting less
>>overhead for NumArray before he adopts it.
>>
>Wow! Do you really mean picoseconds? I never suspected that
>either Numeric or numarray were that fast. ;-)
>
My bad, I meant ns. What's a little factor of 10^3 among friends.

>Anyway, this issue is timely [Err...]. As it turns out we started
>looking at ways of improving small array performance a couple weeks
>ago and are coming closer to trying out an approach that should
>reduce the overhead significantly.
>
>But I have some questions about your benchmarks. Could you show me
>the code that is used to generate the above timings? In particular
>I'm interested in the kinds of arrays that are being operated on.
>It turns out that the numarray overhead depends on more than
>just contiguity and it isn't obvious to me which case you are testing.
>
I'll send you Psymeric, including all the tests, by private email to
avoid cluttering up the list. (Don't worry, it's not huge -- only 750
lines of Python at this point.) You can let me know if you find any
horrible issues with it.

>For example, Todd's benchmarks indicate that numarray's overhead is
>about a factor of 5 larger than numpy when the input arrays are
>contiguous and of the same type. On the other hand, if the array
>is not contiguous or requires a type conversion, the overhead is
>much larger. (Also, these cases require blocking loops over large
>arrays; we have done nothing yet to optimize the block size or
>the speed of that loop.) If you are doing the benchmark on
>contiguous, same type arrays, I'd like to get a copy of the benchmark
>program to try to see where the disagreement arises.
>
Basically, I'm operating on two random, contiguous, 3x3 Float64
arrays. In the noncontiguous case the arrays are indexed using
[::2,::2] and [1::2,::2], so these arrays are 2x2 and 1x2. Hmmm, that
wasn't intentional; I'm measuring axis stretching as well. However,
using [::2,::2] for both arrays doesn't change things a whole lot.
The core timing part looks like this:

    t0 = clock()
    if op == '+':
        c = a + b
    elif op == '-':
        c = a - b
    elif op == '*':
        c = a * b
    elif op == '/':
        c = a / b
    elif op == '==':
        c = a == b
    else:
        raise ValueError("unknown op %s" % op)
    t1 = clock()

This is done N times, the first M values are thrown away, and the
remaining values are averaged. Currently N is 3 and M is 1, so not a
lot of averaging is taking place.

>The very preliminary indications are that we should be able to make
>numarray overheads approximately 3 times higher for all ufunc cases.
>That's still slower, but not by a factor of 20 as shown above. How
>much work it would take to reduce it further is unclear (the main
>bottleneck at that point appears to be how long it takes to create
>new output arrays).
>
That's good. I think it's important to get people like Konrad on board,
and that will require dropping the overhead.
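Roughly, a self-contained sketch of that timing loop looks like the
following. This is an illustration, not the actual Psymeric test code
(that went to Perry off-list); it assumes present-day numpy and
time.perf_counter as stand-ins for Numeric/numarray and clock(), with
the same op dispatch and the same N/M (repeat, discard, average) scheme
described above.

    # Minimal sketch of the timing loop described in this message;
    # assumes numpy and perf_counter in place of Numeric/numarray and clock().
    import numpy as np
    from time import perf_counter

    def time_op(a, b, op, n_repeats=3, n_discard=1):
        """Time one binary op: run it n_repeats times, throw away the
        first n_discard timings (warm-up), and average the rest."""
        times = []
        for _ in range(n_repeats):
            t0 = perf_counter()
            if op == '+':
                c = a + b
            elif op == '-':
                c = a - b
            elif op == '*':
                c = a * b
            elif op == '/':
                c = a / b
            elif op == '==':
                c = a == b
            else:
                raise ValueError("unknown op %s" % op)
            t1 = perf_counter()
            times.append(t1 - t0)
        return sum(times[n_discard:]) / (n_repeats - n_discard)

    # Contiguous case: two random 3x3 Float64 arrays.
    a = np.random.random((3, 3))
    b = np.random.random((3, 3))

    # Noncontiguous case: the views described above -- 2x2 and 1x2,
    # so the op also exercises broadcasting ("axis stretching").
    a_nc = a[::2, ::2]
    b_nc = b[1::2, ::2]

    for label, x, y in [("contiguous", a, b), ("noncontiguous", a_nc, b_nc)]:
        for op in '+-*/':
            print(label, op, "%.2f us" % (time_op(x, y, op) * 1e6))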
>We are still mainly in the analysis and design phase of how to
>improve performance for small arrays and block looping. We believe
>that this first step will not require moving very much of the
>existing Python code into C (but some will be). Hopefully we
>will have some working code in a couple weeks.
>
I hope it goes well.

-tim
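The Overhead / TimePerElement split in the table at the top isn't
spelled out in the message; the sketch below shows one plausible way
such numbers could be backed out, assuming a linear cost model
t(n) ~= overhead + n * per_element fit over several array sizes. The
sizes, the time_add helper, and the least-squares fit are illustrative
assumptions, not Psymeric's actual benchmark code.

    # Back out per-call overhead and per-element cost from raw timings,
    # assuming t(n) ~= overhead + n * per_element (illustrative only).
    import numpy as np
    from time import perf_counter

    def time_add(shape, reps=100):
        """Average wall-clock time of one a + b for arrays of the given shape."""
        a = np.random.random(shape)
        b = np.random.random(shape)
        t0 = perf_counter()
        for _ in range(reps):
            c = a + b
        return (perf_counter() - t0) / reps

    sizes = [10, 100, 1_000, 10_000, 100_000]      # elements per array
    times = [time_add((n,)) for n in sizes]

    # Least-squares fit: polyfit returns [slope, intercept],
    # i.e. [per-element cost, per-call overhead].
    per_element, overhead = np.polyfit(np.array(sizes, float), np.array(times), 1)
    print("overhead     %.1f us" % (overhead * 1e6))
    print("per element  %.1f ns" % (per_element * 1e9))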