From: Jeroen De D. <jer...@gm...> - 2011-11-08 14:08:50
|
Hey all,

I have implemented general support for value distributions in result formats in SMW. This email explains the feature and is meant to gather feedback on it before SMW 1.7 is released.

== Goal ==

Allow visualizing how many times each value in a result occurs, i.e. allow for creating value distributions.

For example, this result set: foo bar baz foo bar bar ohi
will be turned into
* bar (3)
* foo (2)
* baz (1)
* ohi (1)

This can then be displayed in chart formats, with the value as label and the occurrence count as value. Although the most obvious use for this is charts, it can really be used with any format.

== Current implementation: how to use it ==

Each format needs to add support for this functionality before you'll be able to use it to visualize value distributions. Right now only jqplotbar and jqplotpie make use of it. All formats that support this functionality accept 3 additional parameters:

* distribution (on/off) - whether a value distribution should be calculated and shown instead of the regular results.
* distributionsort (asc/desc/none) - the sort order of the values, by occurrence count.
* distributionlimit (positive whole number) - the maximum number of values to visualize.

This example will get the countries the matching cities are located in, count the occurrences of each, and display this as a pie chart. Note the use of the mainlabel parameter. If mainlabel=- is omitted, the cities themselves will also be put into the value distribution.

{{#ask: [[Category:Locations]] [[Has location type::City]]
| ?Located in
| format=jqplotpie
| distribution=on
| mainlabel=-
| limit=500
}}

This example will do the same query, but will only show the 10 countries with the most matching cities, in descending order.

{{#ask: [[Category:Locations]] [[Has location type::City]]
| ?Located in
| format=jqplotpie
| distribution=on
| distributionsort=desc
| distributionlimit=10
| mainlabel=-
| limit=500
}}

You can see these examples and 2 others working on the mapping documentation wiki, making use of the example semantic data there:
http://mapping.referata.com/wiki/Value_distribution_examples

== Implementation details (technical) ==

After looking into several options I decided to implement this as a result printer class deriving from SMWResultPrinter. This requires changes to each format that wants to support this behaviour, but makes those changes relatively easy. This approach seems like a good balance between making this functionality available as easily as possible and staying sane.

This class is called SMWDistributablePrinter and can be found here:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/SemanticMediaWiki/includes/queryprinters/SMW_QP_Distributable.php?view=markup

Example jqplotpie implementation:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/SemanticResultFormats/jqPlot/SRF_jqPlotPie.php?view=markup

== Request for comments ==

Feedback is welcome. The main question for users is what names the parameters should use. Right now they all start with "distribution", but there might be a better (and shorter) name. From developers I'd like to know if you agree with this architecture.

Cheers

--
Jeroen De Dauw
http://www.bn2vs.com
Don't panic. Don't be evil.
--
|
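The counting described under "Goal", together with the three parameters, can be sketched in a few lines. This is only a Python illustration of the idea, not the PHP code in SMWDistributablePrinter; the function name value_distribution and its sort and limit arguments are made up for this sketch, and applying the limit after sorting is an assumption.

```python
from collections import Counter


def value_distribution(values, sort="none", limit=None):
    """Return (value, occurrence count) pairs for a flat list of result values.

    sort  -- "asc", "desc" or "none" (order by occurrence count)
    limit -- maximum number of pairs to keep, or None for no limit
    """
    counts = Counter(values)
    # Counter keeps first-occurrence order, so "none" leaves that order intact.
    pairs = list(counts.items())
    if sort == "desc":
        pairs.sort(key=lambda pair: pair[1], reverse=True)
    elif sort == "asc":
        pairs.sort(key=lambda pair: pair[1])
    # Assumption: the limit is applied after sorting.
    if limit is not None:
        pairs = pairs[:limit]
    return pairs


results = ["foo", "bar", "baz", "foo", "bar", "bar", "ohi"]
print(value_distribution(results, sort="desc"))
# -> [('bar', 3), ('foo', 2), ('baz', 1), ('ohi', 1)]
```

With sort="desc" and limit=2 the same input would be reduced to the two most frequent values, which is roughly what distributionsort=desc with distributionlimit=2 is described as doing.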
From: Krabina B. <kr...@kd...> - 2011-11-08 14:14:17
|
Thumbs up, Jeroen!!!! Great feature that will definitely enhance our wikis!

I don't have a problem with the naming of
| distribution=
| distributionsort=
| distributionlimit=

Can it also be used with regular formats like csv or table (resulting in the value for the distribution being displayed)?

regards,
Bernhard
|
From: Dan B. <dan...@gm...> - 2011-11-08 14:31:53
|
Great work! I really like this feature. It has been requested many times, so it's really cool that it's been done. Thanks for doing this.

We briefly discussed how this could be implemented a while back, and, although the proposal back then may push the balance of implementation in the direction of insanity, I'd like to mention that proposal again here for comparison.

One way to do this would be to automatically give each property some special properties, such that the property itself could be queried for its set of unique values, and the number of times each value has been used. This is immediately tricky, because you need to link each value with its occurrence count somehow (perhaps using sub-objects), but the advantages are:

1) You don't have to modify result printers, you just pass them the results of the property query.
2) It allows you to query on the number of uses, for example, querying for all property-values that have more than 10 uses, or all values of City that have exactly 5 locations, etc.
3) It opens the way for having more meaningful property pages, automatically having unique values linked to searches, which I think is what people tend to expect.

As you mentioned previously, this could be difficult to implement, but I prefer more general approaches to increasingly specific approaches for getting at the data... although I suppose the most general solution of all would be to implement aggregation queries.

Thanks again for this great work that I'll definitely use heavily. I only mention the above as a point for discussion, and I expect you'll take it in that light (coming from a dumb user with no implementation experience).

Cheers,
Dan.
|
From: Stephan G. <s7...@gm...> - 2011-11-08 18:25:39
|
Hi Jeroen,

nice work, this was long overdue!

For the discussion: What about something like this:

{{#ask: [[Category:Locations]] [[Has location type::City]]
| ?Located in
| ?count(*)
| group by=Located in
| format=jqplotpie
| mainlabel=-
| limit=500
}}

I guess GROUP BY and COUNT() functionality are the bits that would jeopardize sanity? :)

Cheers,
Stephan
|
From: Jeroen De D. <jer...@gm...> - 2011-11-09 05:45:44
|
Hey,

> Can it also be used with regular formats like csv or table (resulting in the value for the distribution to be displayed?)

Each format needs to add support for this functionality, so at the moment you cannot use value distribution with these formats. However, it's relatively easy to add this support. I'm not sure that adding it to all formats makes much sense though. For a lot of them the usefulness seems limited, and it's a bunch of work to do. If it is wanted everywhere, it might be worth reworking the query result class a little so it's possible to modify query results (in a sane manner) before they get passed to the actual result printer, which would also allow for other kinds of post-query processing.

> One way to do this would be to automatically give each property some special properties, such that the property itself could be queried for its set of unique values, and the number of times each value has been used.

Interesting, I had not thought of this. Implementing this would be completely different from what I did though, and it'd be, as you say, more powerful. If any system to handle this is created, it could probably easily be made more generic, and support all kinds of computations, not just the occurrence count of values of a property. It might even go hand in hand with query management functionality (which allows for automatic invalidation of query caches when their source data is modified).

This will not be trivial to implement, and is out of scope of what I want to do here. If such functionality is created, it might make the value distribution feature a bit obsolete, but I don't see this happening soon (unless someone throws money or devs at it). I'm curious about your ideas on this though and have some questions:

* Where/when would this property meta data be computed? Doing it on every change of any occurrence of the property might be quite expensive.
* Where would you define how to compute this meta data? If possible it'd be neat to have control over this in the wiki itself.

> although I suppose the most general solution of all would be to implement aggregation queries.
> ..
> I guess GROUP BY and COUNT() functionality are the bits that would jeopardize sanity? :)

I actually discussed this at length with Yaron, and we concluded that generic group by functionality would not be terribly useful, since it's hard to imagine cases where you would not just want to count the occurrences. My current implementation is pretty much equivalent to doing a group by count I think (not sure, as I'm not that familiar with the SQL group by statement).

> For the discussion: What about something like this:
> {{#ask: [[Category:Locations]] [[Has location type::City]]
> | ?Located in
> | ?count(*)
> | group by=Located in
> | format=jqplotpie
> | mainlabel=-
> | limit=500
> }}

What would the advantage of this syntax be? I suspect it's less clear to most users, and it's definitely harder to implement, since you'll need to recognize ?count(*) as a special printout.

Cheers

--
Jeroen De Dauw
http://www.bn2vs.com
Don't panic. Don't be evil.
--
|
From: Jon L. <dat...@gm...> - 2011-11-09 09:01:56
|
Jeroen De Dauw wrote:
> > although I suppose the most general solution of all would be to implement aggregation queries.
> > ..
> > I guess GROUP BY and COUNT() functionality are the bits that would jeopardize sanity? :)
>
> I actually discussed this at length with Yaron, and we concluded that generic group by functionality would not be terribly useful, since it's hard to imagine cases where you would not just want to count the occurrences. My current implementation is pretty much equivalent to doing a group by count I think (not sure, as I'm not that familiar with the SQL group by statement).

GROUP BY is basically a way to tell the SQL parser that you want to feed every hit where field X has the same value into an aggregate function such as COUNT or SUM; for all aggregate functions *except* COUNT, this assumes that the function will be taking its parameters from another field or fields.

This actually *does* apply to inline queries. Take, for example, the following, taken from the SMW Wiki:

{{#ask: [[Category:City]] [[located in::Germany]]
| ?population
| ?area#km² = Size in km²
}}

This produces:

            Population   Size in km²
Berlin      3,391,409    891.85 km²
Frankfurt     679,664    248.31 km²
Munich      1,259,678    310.43 km²
Stuttgart     595,452    208.754 km²

(Which I hope is legible in your email client.)

If this were the return set for a database query, one could tweak it by grouping by "located in" and returning, say, the total number of people living in Germany's cities, or the average number of square kilometers in a German city. Replace "Located in::Germany" with "Continent::Europe" while keeping things grouped by "Located in", and you could run a comparison of the urban populations of Germany, France, Switzerland, etc.

The real question isn't whether or not such a query would be useful; the question is whether or not it would be useful *enough* to justify the complications and overhead that would come with implementing it. Do we really want people performing statistical analysis by means of inline queries, or would the business of grouping pages by property and aggregating results within those groups be better handled by a third-party ontology engine?

If we *do* decide that a more comprehensive "aggregate results" inline query is warranted, I'd suggest *not* trying to shoehorn it into #ask. For example:

{{#summarize: [[Category:City]] [[located in::Germany]]
| shared=located in
| ?sum(population) = urban population
| ?avg(area)#km² = Average Size
}}

#summarize would be similar to #ask, except that there would be a mandatory shared parameter, all of the printout statements would be assumed to be aggregate functions, and the result formats would use the values of the shared property instead of the names of the matching pages:

          Urban Population   Average Size
Germany   5,926,203          414.836 km²

Again, the main issue here is the overhead that you're likely to encounter implementing this sort of thing. How do you keep the processing overhead to a minimum, and how low can that minimum be? Which aggregate functions does #summarize recognize? (For instance, I could see arguments for recognizing aggregate functions such as "count if" and "sum of product", to borrow two fairly useful examples from the spreadsheet world; but that would entail more work on the designers' part, if only in the form of providing a light-weight but secure hook for others to use in creating their own.) And so on.

--
Jonathan "Dataweaver" Lang
|
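To make the grouping idea concrete, here is a small Python sketch of what a #summarize-style aggregation would compute. #summarize is only a proposal in this thread, so nothing here reflects an existing SMW parser function; the summarize helper, its argument shapes and the dictionary field names are all hypothetical. The city figures are the ones from the table in the message above.

```python
from collections import defaultdict
from statistics import mean


def summarize(rows, shared, aggregates):
    """Group rows by the `shared` field and apply each aggregate per group.

    `aggregates` maps an output label to a (function, field) pair.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[shared]].append(row)

    summary = {}
    for key, members in groups.items():
        summary[key] = {
            label: func([member[field] for member in members])
            for label, (func, field) in aggregates.items()
        }
    return summary


cities = [
    {"city": "Berlin", "located in": "Germany", "population": 3391409, "area": 891.85},
    {"city": "Frankfurt", "located in": "Germany", "population": 679664, "area": 248.31},
    {"city": "Munich", "located in": "Germany", "population": 1259678, "area": 310.43},
    {"city": "Stuttgart", "located in": "Germany", "population": 595452, "area": 208.754},
]

result = summarize(
    cities,
    shared="located in",
    aggregates={
        "urban population": (sum, "population"),
        "average size": (mean, "area"),
    },
)
print(result["Germany"]["urban population"])        # -> 5926203
print(round(result["Germany"]["average size"], 3))  # -> 414.836
```

Passing plain functions as aggregates is one way to picture the "hook for others to use in creating their own" mentioned above: a "count if" would just be another function handed to the same helper.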
From: Dan B. <dan...@gm...> - 2011-11-09 09:41:45
|
On 9 November 2011 05:45, Jeroen De Dauw <jer...@gm...> wrote:
>> One way to do this would be to automatically give each property some special properties, such that the property itself could be queried for its set of unique values, and the number of times each value has been used.
>
> This will not be trivial to implement, and is out of scope of what I want to do here. If such functionality is created, it might make the value distribution feature a bit obsolete, but I don't see this happening soon

Right. I think the solution you have currently will be very useful for quite some time. I just wanted to raise one of my favourite suggestions ;-)

> I'm curious about your ideas on this though and have some questions:
>
> * Where/when would this property meta data be computed? Doing it on every change of any occurrence of the property might be quite expensive.

I'll leave this one to the experts! My frame of reference is categories, which do get updated every time a category is changed on a page (excepting changes made through templates). However, I see the problem: when adding another instance of property-value [[x::y]], you don't want to re-calculate all the counts of all values for property x.

> * Where would you define how to compute this meta data? If possible it'd be neat to have control over this in the wiki itself.

Yeah, that would be really cool. That's how I've been doing it with templates on the property page, but it isn't possible (afaik) to link string type properties to their counts. Again, I think this leads us back to aggregate queries (unless 'counts' were done as a one-off, like 'Modification date').

>> although I suppose the most general solution of all would be to implement aggregation queries.
>> ..
>> I guess GROUP BY and COUNT() functionality are the bits that would jeopardize sanity? :)
>
> I actually discussed this at length with Yaron, and we concluded that generic group by functionality would not be terribly useful, since it's hard to imagine cases where you would not just want to count the occurrences. My current implementation is pretty much equivalent to doing a group by count I think (not sure, as I'm not that familiar with the SQL group by statement).

There are a bunch of neat things you can do with GROUP BY, here is a flavour:
http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html

Thanks again for the great work. I'm going to go add barcharts to almost every numeric property in all my wikis :-D

Dan.

P.S. I like the #summarise idea with hooks for creating aggregate functions.
|
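The worry about recalculating every count when one [[x::y]] annotation changes can be pictured with an incrementally maintained index. The following Python sketch is purely illustrative: the PropertyValueCounts class and its methods are invented for this example and say nothing about how SMW stores data or how such counts would actually be kept there.

```python
from collections import defaultdict


class PropertyValueCounts:
    """Keeps per-property usage counts up to date one annotation at a time."""

    def __init__(self):
        # property name -> value -> number of uses
        self._counts = defaultdict(lambda: defaultdict(int))

    def add(self, prop, value):
        # Called when an annotation [[prop::value]] is saved.
        self._counts[prop][value] += 1

    def remove(self, prop, value):
        # Called when an annotation is deleted; drop values that reach zero.
        values = self._counts[prop]
        if values[value] <= 1:
            values.pop(value, None)
        else:
            values[value] -= 1

    def counts(self, prop):
        return dict(self._counts[prop])

    def values_with_more_uses_than(self, prop, threshold):
        # E.g. "all property-values that have more than 10 uses".
        return sorted(v for v, n in self._counts[prop].items() if n > threshold)


index = PropertyValueCounts()
for country in ["Germany", "France", "Germany", "Germany"]:
    index.add("Located in", country)
index.remove("Located in", "France")
print(index.counts("Located in"))  # -> {'Germany': 3}
```

Each change touches a single counter, so the cost per edit stays constant no matter how many distinct values the property has. The trade-off is that the index has to be kept consistent with every way an annotation can change.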
From: Michael E. <er...@on...> - 2011-11-29 17:08:33
|
Hi Jeroen,

we have had a similar request a couple of times already. However, I think that having distribution=on (which actually is a count for groups) is often not sufficient, e.g. instead of the pure number users want the percentage (compared to all values), i.e. 42.8% instead of 3 (out of 7) occurrences. Other aggregates naturally come to mind as well, but are more complicated since they aggregate actual values, not purely the number of "rows".

Nevertheless, the simple solution you propose, IMHO, should be realized in a way that *no custom coding is required for query printers*. If I am not mistaken, the JQPlot result formats expect a label and a numeric value for displaying the charts. If your Distributable code modified the query results to contain only the labels and the numbers, then they could also be rendered by all other result formats and no additional code would be needed. The aggregation would be a kind of post-processing of the query results before they are passed to the result printers, turning a one-column query with n rows and m distinct values into a two-column query with m rows.

Would this be something your code could support?

thx,
michael

--
Semantic Enterprise Wiki - SMW+ / Halo Extension
Want to get involved? http://smwforum.ontoprise.com/development
--
Dr. Michael Erdmann
http://www.ontoprise.com
|
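The post-processing step proposed above, which turns a one-column result into label/number rows and optionally reports percentages, could look roughly like this. It is a Python sketch under assumptions: to_label_count_rows is a made-up name, the rows are plain tuples rather than SMW query result objects, and percentages are rounded to one decimal place (so 3 out of 7 shows as 42.9 rather than the truncated 42.8).

```python
from collections import Counter


def to_label_count_rows(values, as_percentage=False):
    """Collapse n single-column rows into one (label, number) row per distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    # most_common() orders by count, highest first.
    for label, count in counts.most_common():
        number = round(100 * count / total, 1) if as_percentage else count
        rows.append((label, number))
    return rows


column = ["foo", "bar", "baz", "foo", "bar", "bar", "ohi"]
print(to_label_count_rows(column))
# -> [('bar', 3), ('foo', 2), ('baz', 1), ('ohi', 1)]
print(to_label_count_rows(column, as_percentage=True)[0])
# -> ('bar', 42.9)
```

Because the output is just a two-column table, any printer that can render a label and a number could consume it without knowing that an aggregation happened.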
From: Jeroen De D. <jer...@gm...> - 2011-11-29 17:18:42
|
Hey,

> modify the query results to contain only the labels and the numbers

This was actually the first approach I considered and implemented to some extent. However, the query result object is really not made to be used like this, and the ways to get around this were just too much of a hack, which is why I decided to go with the current approach.

Having some more generic mechanism that does not require QPs to care about what post-processing is happening at all would be nice, but would require rewriting the query result class or going with some messed-up architecture. Either way, it's a bunch of work, which, although I agree it would be useful, is not something I'm going to take on now. If you or someone else wants to have a go at it, please do; I'll be happy to help review it if needed.

What I implemented should be seen as a way for query printers to support value distribution behaviour without all of them reinventing the wheel, not as a generic post-processing system.

Cheers

--
Jeroen De Dauw
http://www.bn2vs.com
Don't panic. Don't be evil.
--
|
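The "shared base class" design described here can be pictured with a small template-method sketch. This is Python written for illustration only and does not mirror the real SMWDistributablePrinter or SMWResultPrinter interfaces: the class names, method names and constructor arguments below are all hypothetical.

```python
from collections import Counter


class DistributablePrinter:
    """Hypothetical base class that handles the distribution options once."""

    def __init__(self, distribution=False, sort="none", limit=None):
        self.distribution = distribution
        self.sort = sort
        self.limit = limit

    def render(self, values):
        # Subclasses never count values themselves; they only format.
        if not self.distribution:
            return self.render_plain(values)
        return self.render_distribution(self._distribute(values))

    def _distribute(self, values):
        pairs = list(Counter(values).items())
        if self.sort in ("asc", "desc"):
            pairs.sort(key=lambda pair: pair[1], reverse=(self.sort == "desc"))
        if self.limit is not None:
            pairs = pairs[: self.limit]
        return pairs

    def render_plain(self, values):
        raise NotImplementedError

    def render_distribution(self, pairs):
        raise NotImplementedError


class TextBarPrinter(DistributablePrinter):
    """Example format: one line per value, with a bar of '#' characters."""

    def render_plain(self, values):
        return "\n".join(values)

    def render_distribution(self, pairs):
        return "\n".join(f"{label} {'#' * count}" for label, count in pairs)


printer = TextBarPrinter(distribution=True, sort="desc", limit=2)
print(printer.render(["foo", "bar", "baz", "foo", "bar", "bar", "ohi"]))
# prints:
# bar ###
# foo ##
```

The point of the pattern is that each format only adds a method for drawing (label, count) pairs, while parameter handling, counting, sorting and limiting live in one place. A generic post-processing layer would go further by removing even that per-format method.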