From: Stephen S. <ma...@st...> - 2006-07-18 11:59:36
Hi,

Does anyone have any suggestions for summarising data in numpy?

The quick description is that I want to do something like the SQL statement:

    SELECT sum(field1), sum(field2) FROM table GROUP BY field3;

The more accurate description is that my data is stored in PyTables HDF
format, with 24 monthly files, each with 4m records describing how customers
performed that month. Each record looks something like this:

    ('200604', 651404500000L, '800', 'K', 12L, 162.0, 2000.0, 0.054581, 0.0,
     0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0,
     8.80, 0.86, 7.80, 17.46, 0.0, 70.0, 0.0, 70.0, -142.93, 0.0, 2000.0,
     2063.93, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -9.71, 7.75,
     87.46, 77.75, -3.45, 0.22, -0.45, -0.57, 73.95)

The first 5 fields are status fields (month_string, account_number,
product_code, account_status, months_since_customer_joined). The remaining
48 fields represent different aspects of the customer's performance during
that month. I read 100,000 of these records at a time and turn them into a
numpy recarray with:

    dat = hdf_table.read(start=pos, stop=pos+block_size)
    dat = numpy.asarray(dat._flatArray, dtype=dat.array_descr)

I'd like to reduce these 96m records x 53 fields down to monthly averages
for each tuple (month_string, months_since_customer_joined), which in the
case above is ('200604', 12L). This will let me compare the performance of
newly acquired customers at the same point in their lifecycle as customers
acquired 1 or 2 years ago.

The end result should be a dataset something like

    res[month_index, months_since_customer_joined]
      = array([ num_records, sum_field_5, sum_field_6, sum_field_7,
                ... sum_field_52 ])

with a shape of (24, 24, 49).

I've played around with lexsort(), take(), sum(), etc, but get very
confused and end up feeling that I'm making things more complicated than
they need to be. So any advice from numpy veterans on how best to proceed
would be very welcome!
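To make the target computation concrete, here is a rough sketch of the kind
of block-by-block accumulation I have in mind, assuming the recarray field
names match the status fields above and that months_since_customer_joined
is capped at 23 (I suspect there is a better way, hence this post):

    import numpy

    N_MONTHS = 24    # 24 monthly files
    N_TENURE = 24    # assumed cap on months_since_customer_joined
    N_FIELDS = 48    # performance fields per record

    # res[month_index, tenure] = [num_records, sum_field_5, ..., sum_field_52]
    res = numpy.zeros((N_MONTHS, N_TENURE, 1 + N_FIELDS))

    def accumulate_block(dat, month_index, perf_fields, res):
        """Add one block's record counts and field sums into res[month_index]."""
        # Bin each record by months since the customer joined.
        tenure = dat['months_since_customer_joined'].astype(numpy.intp)
        tenure = numpy.clip(tenure, 0, N_TENURE - 1)
        # Record count per tenure bucket.
        counts = numpy.bincount(tenure)
        res[month_index, :len(counts), 0] += counts
        # Weighted bincount gives per-bucket sums of each performance field.
        for j, name in enumerate(perf_fields):
            sums = numpy.bincount(tenure, weights=dat[name].astype(float))
            res[month_index, :len(sums), 1 + j] += sums

Here perf_fields would be dat.dtype.names[5:], and once all blocks are
accumulated, dividing res[..., 1:] by res[..., :1] gives the monthly
averages (guarding against empty buckets).

Cheers

Stephen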
From: Tom D. <tom...@al...> - 2006-07-18 13:49:56
I suggest:

  - lexsort
  - itertools.groupby of the indices
  - take

I think it would be really great if numpy had the first two steps as a
function, or something like that. It is really useful to be able to take an
array, bucket it, and apply further numpy operations like accumulation
functions.

On 7/18/06, Stephen Simmons <ma...@st...> wrote:
> Hi,
>
> Does anyone have any suggestions for summarising data in numpy?
>
> The quick description is that I want to do something like the SQL statement:
>     SELECT sum(field1), sum(field2) FROM table GROUP BY field3;
> [snip]
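In code, the pipeline I mean is roughly the following (a toy sketch with a
made-up group key and value array, not Stephen's actual data):

    import itertools
    import numpy

    # Toy stand-ins: one group key per row, and two value columns.
    group  = numpy.array([2, 0, 1, 0, 2, 2])
    values = numpy.arange(12.0).reshape(6, 2)

    # 1. lexsort: the permutation that sorts the rows by group key.
    order = numpy.lexsort((group,))
    sorted_keys = group[order]

    # 2. itertools.groupby over the sorted positions: one run per key.
    sums = {}
    for key, run in itertools.groupby(range(len(order)),
                                      key=lambda i: sorted_keys[i]):
        idx = order[list(run)]
        # 3. take: pull out that bucket's rows, then reduce with numpy.
        sums[key] = numpy.take(values, idx, axis=0).sum(axis=0)

    # sums[0] == rows 1 and 3 summed; sums[2] == rows 0, 4 and 5 summed.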
From: Stephen S. <ma...@st...> - 2006-07-21 02:05:57
Hi,

The function bincount() counts the number of occurrences of each value
found in the input array:

    In [15]: numpy.bincount( array([1,3,3,3,4], dtype=int32) )
    Out[15]: array([0, 1, 0, 3, 1])

According to the documentation, the input array must contain non-negative
integers. However, an exception is raised when the input's dtype is an
unsigned integer type, even though an unsigned dtype is an explicit
guarantee of this non-negativity condition:

    In [157]: numpy.bincount( array([1,3,3,3,4], dtype=uint32) )
    TypeError: array cannot be safely cast to required type

This seems to be a bug.

Cheers

Stephen

P.S. I'm not familiar enough with the numpy source to track down where this
type checking is done. But I did find a trivial typo in an error message in
the function arr_bincount() in numpy/lib/src/_compiled_base.c. The assert
message here has lost its initial 'F':

    Py_Assert(numbers[mni] >= 0,
              "irst argument of bincount must be non-negative");
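In the meantime, a workaround that seems to do the trick here is to force
the cast to the signed type bincount() wants before calling it (at the cost
of an extra copy):

    import numpy

    a = numpy.array([1, 3, 3, 3, 4], dtype=numpy.uint32)
    # astype() forces the cast that bincount() refuses to do implicitly.
    counts = numpy.bincount(a.astype(numpy.intp))
    # counts -> array([0, 1, 0, 3, 1])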
From: Stephen S. <ma...@st...> - 2006-07-21 03:45:06
While playing a little more with bincount(), one modification would be
handy:

    Allow negative integers in the bin list, but skip them when counting
    bins.

My specific use case is calculating subtotals on columns of large datasets
(1m rows x 30 cols), where some rows need to be excluded. The groupings are
expensive to compute, and sometimes will involve ~99% of the rows
(eliminate only outliers/errors), and other times only ~5% of the rows
(focus in on a subset). I'd like to calculate subtotals like this using
bincount(), without having to copy the large datasets just to eliminate the
unwanted rows:

    # Assign each row to a group numbered from 0..G, except for -1 for
    # rows to exclude
    row_groups = expensive_function(data)

    # Count number in each group, excluding those with grp==-1
    grp_counts = bincount(list=row_groups)

    # Use bincount() to form subtotals by column, excluding those with
    # grp==-1
    subtotals = column_stack([ bincount(list=row_groups, weights=data[:,i])
                               for i in range(data.shape[1]) ])

Is there any appetite to make such a change to bincount()? This would
require two simple changes to bincount() in _compiled_base.c and an update
to the docstring. Here is the diff file with enough context to show the
entire arr_bincount() function:

*** orig_compiled_base.c	2006-07-21 13:14:21.250000000 +1000
--- _compiled_base.c	2006-07-21 13:34:41.718750000 +1000
***************
*** 70,143 ****
      intp j ;
      for ( j = 1 ; j < len; j ++ )
          if ( i [j] < min ) {min = i [j] ; mn = j ;}
      return mn;
  }

  static PyObject *
  arr_bincount(PyObject *self, PyObject *args, PyObject *kwds)
  {
      /* histogram accepts one or two arguments. The first is an array
!      * of non-negative integers and the second, if present, is an
       * array of weights, which must be promotable to double.
       * Call these arguments list and weight. Both must be one-
       * dimensional. len (weight) == len(list)
       * If weight is not present:
!      *   histogram (list) [i] is the number of occurrences of i in list.
       * If weight is present:
       *   histogram (list, weight) [i] is the sum of all weight [j]
!      *   where list [j] == i.                                           */
      /* self is not used */
      PyArray_Descr *type;
      PyObject *list = NULL, *weight=Py_None ;
      PyObject *lst=NULL, *ans=NULL, *wts=NULL;
!     intp *numbers, *ians, len , mxi, mni, ans_size;
      int i;
      double *weights , *dans;
      static char *kwlist[] = {"list", "weights", NULL};

      Py_Try(PyArg_ParseTupleAndKeywords(args, kwds, "O|O", kwlist, &list, &weight));
      Py_Try(lst = PyArray_ContiguousFromAny(list, PyArray_INTP, 1, 1));
      len = PyArray_SIZE(lst);
      numbers = (intp *) PyArray_DATA(lst);
      mxi = mxx (numbers, len) ;
-     mni = mnx (numbers, len) ;
-     Py_Assert(numbers[mni] >= 0,
-               "irst argument of bincount must be non-negative");
      ans_size = numbers [mxi] + 1 ;
      type = PyArray_DescrFromType(PyArray_INTP);
      if (weight == Py_None) {
          Py_Try(ans = PyArray_Zeros(1, &ans_size, type, 0));
          ians = (intp *)(PyArray_DATA(ans));
          for (i = 0 ; i < len ; i++)
!             ians [numbers [i]] += 1 ;
          Py_DECREF(lst);
      }
      else {
          Py_Try(wts = PyArray_ContiguousFromAny(weight, PyArray_DOUBLE, 1, 1));
          weights = (double *)PyArray_DATA (wts);
          Py_Assert(PyArray_SIZE(wts) == len, "bincount: length of weights " \
                    "does not match that of list");
          type = PyArray_DescrFromType(PyArray_DOUBLE);
          Py_Try(ans = PyArray_Zeros(1, &ans_size, type, 0));
          dans = (double *)PyArray_DATA (ans);
          for (i = 0 ; i < len ; i++) {
!             dans[numbers[i]] += weights[i];
          }
          Py_DECREF(lst);
          Py_DECREF(wts);
      }
      return ans;

  fail:
      Py_XDECREF(lst);
      Py_XDECREF(wts);
      Py_XDECREF(ans);
      return NULL;
  }
--- 70,145 ----
      intp j ;
      for ( j = 1 ; j < len; j ++ )
          if ( i [j] < min ) {min = i [j] ; mn = j ;}
      return mn;
  }

  static PyObject *
  arr_bincount(PyObject *self, PyObject *args, PyObject *kwds)
  {
      /* histogram accepts one or two arguments. The first is an array
!      * of integers and the second, if present, is an
       * array of weights, which must be promotable to double.
       * Call these arguments list and weight. Both must be one-
       * dimensional. len (weight) == len(list)
       * If weight is not present:
!      *   histogram (list) [i] is the number of occurrences of i in list
!      *   for i>=0. Negative i values are ignored.
       * If weight is present:
       *   histogram (list, weight) [i] is the sum of all weight [j]
!      *   where list [j] == i and i>=0.                                  */
      /* self is not used */
      PyArray_Descr *type;
      PyObject *list = NULL, *weight=Py_None ;
      PyObject *lst=NULL, *ans=NULL, *wts=NULL;
!     intp *numbers, *ians, len , mxi, ans_size;
      int i;
      double *weights , *dans;
      static char *kwlist[] = {"list", "weights", NULL};

      Py_Try(PyArg_ParseTupleAndKeywords(args, kwds, "O|O", kwlist, &list, &weight));
      Py_Try(lst = PyArray_ContiguousFromAny(list, PyArray_INTP, 1, 1));
      len = PyArray_SIZE(lst);
      numbers = (intp *) PyArray_DATA(lst);
      mxi = mxx (numbers, len) ;
      ans_size = numbers [mxi] + 1 ;
      type = PyArray_DescrFromType(PyArray_INTP);
      if (weight == Py_None) {
          Py_Try(ans = PyArray_Zeros(1, &ans_size, type, 0));
          ians = (intp *)(PyArray_DATA(ans));
          for (i = 0 ; i < len ; i++)
!             if (numbers[i]>=0) {
!                 ians[numbers [i]] += 1 ;
!             }
          Py_DECREF(lst);
      }
      else {
          Py_Try(wts = PyArray_ContiguousFromAny(weight, PyArray_DOUBLE, 1, 1));
          weights = (double *)PyArray_DATA (wts);
          Py_Assert(PyArray_SIZE(wts) == len, "bincount: length of weights " \
                    "does not match that of list");
          type = PyArray_DescrFromType(PyArray_DOUBLE);
          Py_Try(ans = PyArray_Zeros(1, &ans_size, type, 0));
          dans = (double *)PyArray_DATA (ans);
          for (i = 0 ; i < len ; i++) {
!             if (numbers[i]>=0) {
!                 dans[numbers[i]] += weights[i];
!             }
          }
          Py_DECREF(lst);
          Py_DECREF(wts);
      }
      return ans;

  fail:
      Py_XDECREF(lst);
      Py_XDECREF(wts);
      Py_XDECREF(ans);
      return NULL;
  }

Cheers

Stephen
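P.S. For checking the patch, the proposed behaviour should be equivalent to
this pure-numpy version, which makes exactly the temporary copies I would
like to avoid:

    import numpy

    def bincount_skip_negative(groups, weights=None):
        """Proposed semantics: accumulate into bins, ignoring negative groups."""
        groups = numpy.asarray(groups)
        keep = groups >= 0                  # rows that land in a real bin
        if weights is None:
            return numpy.bincount(groups[keep])
        weights = numpy.asarray(weights, dtype=float)
        return numpy.bincount(groups[keep], weights=weights[keep])

    # e.g. bincount_skip_negative([1, -1, 3, 3, -1, 4]) -> array([0, 1, 0, 2, 1])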