Jeroen De Dauw wrote:
> although I suppose the most general solution of all would be to implement aggregation queries.
> ..

> I guess GROUP BY and COUNT() functionality are the bits that would jeopardize sanity? :)

I actually discussed this at length with Yaron, and we concluded that generic group by functionality would not be terribly useful, since it's hard to imagine cases where you would not just want to count the occurrences. My current implementation is pretty much equivalent to doing a group by count I think (not sure, as I'm not that familiar with the SQL group by statement).


GROUP BY is basically a way to tell the SQL parser that you want to feed every hit where field X has the same value into an aggregate function such as COUNT or SUM; for all aggregate functions except COUNT, this assumes that the function will be taking its parameters from another field or fields. 
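
For anyone who hasn't used GROUP BY, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and rows are placeholders I made up for illustration; they are not taken from SMW's storage schema.

```python
import sqlite3

# In-memory database with a made-up table; names and values are
# placeholders, not SMW's actual schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city (name TEXT, located_in TEXT, population INTEGER)"
)
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?)",
    [
        ("A", "X", 100),
        ("B", "X", 250),
        ("C", "Y", 40),
    ],
)

# Every row sharing the same located_in value is fed into the aggregates.
# COUNT(*) needs no other column; SUM takes its input from population.
rows = conn.execute(
    "SELECT located_in, COUNT(*), SUM(population) "
    "FROM city GROUP BY located_in ORDER BY located_in"
).fetchall()

print(rows)  # [('X', 2, 350), ('Y', 1, 40)]
```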

This actually does apply to inline queries.  Consider the following example from the SMW wiki:

{{#ask: [[Category:City]] [[located in::Germany]] 
| ?population 
| ?area#km² = Size in km²
}}

This produces:

             Population    Size in km²
Berlin       3,391,409     891.85 km²
Frankfurt    679,664       248.31 km²
Munich       1,259,678     310.43 km²
Stuttgart    595,452       208.754 km²

(Which I hope is legible in your email client.)

If this were the return set for a database query, one could tweak it by grouping by "located in" and returning, say, the total number of people living in Germany's cities, or the average area of a German city in square kilometers.  Replace "located in::Germany" with "Continent::Europe" while keeping things grouped by "located in", and you could run a comparison of the urban populations of Germany, France, Switzerland, etc.

The real question isn't whether or not such a query would be useful; the question is whether or not it would be useful enough to justify the complications and overhead that would come with implementing it.  Do we really want people performing statistical analysis by means of inline queries, or would the business of grouping pages by property and aggregating results within those groups be better handled by a third-party ontology engine?

If we do decide that a more comprehensive "aggregate results" inline query is warranted, I'd suggest not trying to shoehorn it into #ask.  For example:
{{#summarize: [[Category:City]] [[located in::Germany]] 
| shared=located in
| ?sum(population) = urban population | ?avg(area)#km² = Average Size
}}
#summarize would be similar to #ask, except that there would be a mandatory shared parameter, all of the printout statements would be assumed to be aggregate functions, and the result formats would use the values of the shared property instead of the names of the matching pages:

           Urban Population    Average Size
Germany    5,926,203           414.836 km²
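
To make the intended semantics concrete, here is a rough Python sketch of what #summarize would compute for the four cities in the #ask result above. The summarize() function, its arguments, and the dict layout are my own invention for illustration; none of this is existing SMW code.

```python
from collections import defaultdict

# The four result rows from the #ask example, as plain dicts.
cities = [
    {"page": "Berlin", "located in": "Germany", "population": 3391409, "area": 891.85},
    {"page": "Frankfurt", "located in": "Germany", "population": 679664, "area": 248.31},
    {"page": "Munich", "located in": "Germany", "population": 1259678, "area": 310.43},
    {"page": "Stuttgart", "located in": "Germany", "population": 595452, "area": 208.754},
]


def avg(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def summarize(rows, shared, printouts):
    """Hypothetical stand-in for #summarize.

    Groups rows by the value of the `shared` property, then applies each
    (aggregate function, property) pair in `printouts` within each group.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[shared]].append(row)
    return {
        key: {
            label: func([member[prop] for member in members])
            for label, (func, prop) in printouts.items()
        }
        for key, members in groups.items()
    }


summary = summarize(
    cities,
    shared="located in",
    printouts={
        "urban population": (sum, "population"),
        "Average Size": (avg, "area"),
    },
)

for key, row in summary.items():
    print(key, f"{row['urban population']:,}", f"{row['Average Size']:.3f} km²")
# Germany 5,926,203 414.836 km²
```

Note that the result rows are keyed by the value of the shared property ("Germany"), not by page name, which is the main way this differs from #ask.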

Again, the main issue here is the overhead that you're likely to encounter implementing this sort of thing.  How do you keep the processing overhead to a minimum, and how low can that minimum be?  Which aggregate functions does #summarize recognize?  (For instance, I could see arguments for recognizing aggregate functions such as "count if" and "sum of product", to borrow two fairly useful examples from the spreadsheet world; but that would entail more work on the designers' part, if only in the form of providing a lightweight but secure hook for others to use in creating their own.)  And so on.
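
One way such a hook might look, sketched in Python rather than in SMW's own code: aggregate functions are registered by name in a whitelist, and a query can only call names that appear in it, so nothing from the query text is ever evaluated as code. The registry, the decorator, and the function names ("countif", "sumproduct") are all hypothetical.

```python
from typing import Callable, Dict

# Whitelist of aggregate functions, keyed by the name a query would use.
AGGREGATES: Dict[str, Callable] = {}


def register(name):
    """Decorator that adds a function to the whitelist under `name`."""
    def decorator(func):
        AGGREGATES[name] = func
        return func
    return decorator


@register("sum")
def agg_sum(values):
    return sum(values)


@register("countif")
def count_if(values, predicate):
    """Number of values for which predicate(value) is true."""
    return sum(1 for value in values if predicate(value))


@register("sumproduct")
def sum_product(xs, ys):
    """Sum of pairwise products, as in a spreadsheet's SUMPRODUCT."""
    return sum(x * y for x, y in zip(xs, ys))


def apply_aggregate(name, *args):
    """Look up a registered aggregate; unknown names are rejected."""
    if name not in AGGREGATES:
        raise ValueError(f"unknown aggregate function: {name}")
    return AGGREGATES[name](*args)


populations = [3391409, 679664, 1259678, 595452]
# Cities in the example with more than a million inhabitants.
print(apply_aggregate("countif", populations, lambda p: p > 1_000_000))  # 2
print(apply_aggregate("sumproduct", [1, 2, 3], [4, 5, 6]))  # 32
```

The open questions from above still stand: this says nothing about how the predicate for "count if" would be written in wiki syntax, or what the lookup and grouping cost on a large wiki.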

--
Jonathan "Dataweaver" Lang