Hi all,
Currently, the aggregate() function can compute only a single grouped statistic per call. The statistic is selected through the third parameter, funcname (of type string). Here is an example computing the mean:
set verbose off
open data4-10 -q
# series/list mode
list L = ENROLL CATHOL
matrix m = aggregate(L, REGION, mean)
print m
It would be useful if the third parameter, funcname, could also be an array of strings, allowing the user to request multiple statistics at once (each function must return a scalar value for each distinct group).
Currently, this is quite cumbersome to do:
set verbose off
open data4-10 -q
# series/list mode
list L = ENROLL CATHOL
list groupby = REGION
matrix avg = aggregate(L, groupby, mean)
matrix std = aggregate(L, groupby, sd) # make sure you group by the same variables
matrix combine = avg ~ std[,2+nelem(groupby):]
# labels are not unique; appending the name of the requested statistic in brackets would be useful
strings column_labels = cnameget(avg) + cnameget(std)[2+nelem(groupby):]
cnameset(combine, column_labels)
print combine
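For what it's worth, the workaround above can be generalised to any number of statistics with a foreach loop. This is only a minimal sketch: it relies on textual $i substitution so that each aggregate() call has the same form as the calls above, and it assumes (as in the snippet above) that the statistics start at column 2+nelem(groupby). The names k, nfix, combine and labels, and the "name(stat)" label format, are my own illustrative choices. The resulting columns are ordered by statistic first, then by series.

```
set verbose off
open data4-10 -q

list L = ENROLL CATHOL
list groupby = REGION

# statistics start after the group-by column(s) and the count column
scalar k = 2 + nelem(groupby)
scalar nfix = k - 1

matrix combine = {}
strings labels = array(0)

loop foreach i mean median sd
    matrix m = aggregate(L, groupby, $i)
    strings names = cnameget(m)
    scalar nc = cols(m)
    if rows(combine) == 0
        # first pass: keep the group-by and count columns once
        combine = m[,1:nfix]
        loop j=1..nfix
            labels += names[j]
        endloop
    endif
    # append the statistic columns and tag each label with the statistic
    combine ~= m[,k:nc]
    loop j=k..nc
        labels += sprintf("%s(%s)", names[j], "$i")
    endloop
endloop

cnameset(combine, labels)
print combine
```

Note that the grouping is still redone on every pass, so this only saves typing, not computation.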
Here is a pseudo-example of the proposed syntax, using series as data input:
list L = ENROLL CATHOL
strings stats = defarray("mean", "median", "sd")
matrix m = aggregate(L, REGION, stats)
print m
Best
Artur
Well, I find aggregate() already quite complex. In the list case, for example, one would then need a convention for the ordering of the result columns: would it be by series/variable or by aggregation method?
I can see a potential argument relating to a speed gain, because the grouping doesn't have to be done again and again. So the question IMO is, is the complexity worth the gain? Is speed actually an issue?
I agree with Sven: aggregate is complex enough already.
Allow me to close this - sorry, Artur, but it seems others are not enthusiastic about squeezing more stuff into aggregate().