|
From: Dr. H. G. K. <Han...@E-...> - 2005-12-22 07:53:44
|
Tim Churches schrieb: > Hans Georg Krauthaeuser wrote: > >>Hans Georg Krauthaeuser schrieb: >> >> >>>Hi All, >>> >>>I was playing with scipy.stats.itemfreq when I observed the following >>>overflow: >>> >>>In [119]:for i in [254,255,256,257,258]: >>> .....: l=[0]*i >>> .....: print i, stats.itemfreq(l), l.count(0) >>> .....: >>>254 [ [ 0 254]] 254 >>>255 [ [ 0 255]] 255 >>>256 [ [0 0]] 256 >>>257 [ [0 1]] 257 >>>258 [ [0 2]] 258 >>> >>>itemfreq is pretty small (in stats.py): >>> >>>---------------------------------------------------------------------- >>>def itemfreq(a): >>> """ >>>Returns a 2D array of item frequencies. Column 1 contains item values, >>>column 2 contains their respective counts. Assumes a 1D array is passed. >>> >>>Returns: a 2D frequency table (col [0:n-1]=scores, col n=frequencies) >>>""" >>> scores = _support.unique(a) >>> scores = sort(scores) >>> freq = zeros(len(scores)) >>> for i in range(len(scores)): >>> freq[i] = add.reduce(equal(a,scores[i])) >>> return array(_support.abut(scores, freq)) >>>---------------------------------------------------------------------- >>> >>>It seems that add.reduce is the source for the overflow: >>> >>>In [116]:from scipy import * >>> >>>In [117]:for i in [254,255,256,257,258]: >>> .....: l=[0]*i >>> .....: print i, add.reduce(equal(l,0)) >>> .....: >>>254 254 >>>255 255 >>>256 0 >>>257 1 >>>258 2 >>> >>>Is there any possibility to avoid the overflow? > > > Apropos the preceding, herewith a thread on the Numpy list from a more > than a few months ago. The take-home message is that for integer arrays, > add.reduce is very fast at producing results which fall into two > categories: a) correct or b) incorrect due to overflow. Unfortunately > there is no equally quick method of determining into which of these two > categories any specific result returned by add.reduce falls. > > Tim C > > From: Tim Churches <tc...@op...> > To: Todd Miller <jm...@st...> > Date: Apr 12 2005 - 7:51am > > Todd Miller wrote: > >>On Sun, 2005-04-10 at 10:23 +1000, Tim Churches wrote: >> >> >>>I just got caught by code equivalent to this (with NumPy 23.8 on 32 bit >>>Linux): >>> >>> >>>>>>import Numeric as N >>>>>>a = N.array((2000000000,1000000000),typecode=N.Int32) >>>>>>N.add.reduce(a) >>> >>>-1294967296 >>> >>>OK, it is an elementary mistake, but the silent overflow caught me >>>unawares. casting the array to Float64 before summing it avoids the >>>error, but in my instance the actual data is a rank-1 array of 21 >>>million integers with a mean value of about 140 (which adds up more than >>>sys.maxint), and casting to Float64 will use quite a lot of memory (as >>>well as taking some time). >>> >>>Any advice for catching or avoiding such overflow without always >>>incurring a performance and memory hit by always casting to Float64? >> >> >>Here's what numarray does: >> >> >> >>>>>import numarray as N >>>>>a = N.array((2000000000,1000000000),typecode=N.Int32) >>>>>N.add.reduce(a) >> >>-1294967296 >> >>So basic reductions in numarray have the same "careful while you're >>shaving" behavior as Numeric; it's fast but easy to screw up. > > > Sure, but how does one be careful? It seems that for any array of two > integers or more which could sum to more than sys.maxint or less than > -sys.maxint, add.reduce() in both NumPy and Numeric will give either a) > the correct answer or b) the incorrect answer, and short of adding up > the array using a safer but much slower method, there is no way of > determining if the answer provided (quickly) by add.reduce is right or > wrong? Which seems to make it fast but useless (for integer arrays, at > least? Is that an unfair summary? Can anyone point me towards a method > for using add.reduce() on small arrays of large integers with values in > the billions, or on large arrays of fairly small integer values, which > will not suddenly and without warning give the wrong answer? > > >>But: >> >> >> >>>>>a.sum() >> >>3000000000L >> >> >>>>>a.sum(type='d') >> >>3000000000.0 >> >>a.sum() blockwise upcasts to the largest type of kind on the fly, in >>this case, Int64. This avoids the storage overhead of typecasting the >>entire array. > > > That's on a 64-bit platform, right? The same method could be used to > cast the accumulator to a Float64 on a 32-bit platform to avoid casting > the entire array? > > >>A better name for the method would have been sumall() since it sums all >>elements of a multi-dimensional array. The flattening process reduces >>on one dimension before flattening preventing a full copy of a >>discontiguous array. It could be smarter about choosing the dimension >>of the initial reduction. > > > OK, thanks. Unfortunately it is not possible for us to port our > application to numarray at the moment. But the insight is most helpful. > > Tim C Tim, Thank you very much for your answer. I posted two follow-ups to my own post on c.l.p (http://groups.google.de/group/comp.lang.python/browse_thread/thread/a96c404d73d71259/7769fca1fa562718?hl=de#7769fca1fa562718) A least for scipy version '0.3.2' the problem comes directly from the ufunc 'add' for bool arrays: In [178]:array(array([1],'b')*2) Out[178]:array([2],'i') In [179]:array(array([1],'b')+array([1],'b')) Out[179]:array([2],'b') 'multiply' changes the typecode. So, in this case a+a != a*2 if a is an array with bytecode 'b'. Regards Hans Georg Krauthäuser -- Otto-von-Guericke-Universitaet Magdeburg IGET | fon: +49 391 67 12195 Postfach 4120 | fax: +49 391 67 11236 39016 Magdeburg | email: hg...@et... Germany | www: www.uni-magdeburg.de/krauthae |