From: glenn m. <gl...@fu...> - 2005-10-18 14:46:12
|
[Discussion moved here from talk pages...]

Here are a few more thoughts about why templates seem incredibly
important to this effort, to me.

- Redundancy is always dangerous. All the same arguments against
separate coding of relations outside of the normal article text also
apply to any scheme that requires an extra level of annotation in
either a template or its inputs to add semantic significance. I
suggest that the scheme will be qualitatively more reliable if
templates are defined to have inherent semantic significance. This
might ultimately involve tweaking the template system somehow before
it makes total sense, but the current state of templates seems like a
workable approximation to me.

- Semantic annotations are individually worthless. That is, linking
topic A to topic B isn't interesting in itself. What's interesting is
defining the relationship between things like topic A and things like
topic B, and then linking *all* of the things like topic A to all of
their corresponding things like topic B. You have to approach complete
coverage before your query results become useful. Thus the human work
of defining the relationship schema is central to the whole semantic
process. Templates (particularly infoboxes) are mediawiki's closest
approach to microformats, at the moment, and thus the center of
discussion about, effectively, the common schema of various data
types. Anything we can do so that work done for
input/display/consistency purposes is also, by definition, semantic
progress, will be powerfully in our advantage.

- Although it's obviously possible to build small demos on sample data
(or to use non-public data from other installations of mediawiki), the
big oohs needed to catapult this project into the mediawiki mainstream
will only come from being able to execute a new semantic query against
the live wikipedia to answer some question that up until now could not
have been answered by machine.
If we can repurpose template input as
semantic data, we have some hope of doing a genuinely interesting
query on current real data, which seems vastly preferable to having to
go back and hand-code some huge number of relationships. And this
applies even more so, obviously, to future wikipedia data. If semantic
coding is a separate step, it will be done erratically, and all it
takes is a tiny amount of data incompleteness to render semantic query
results effectively meaningless.

glenn
|
From: Markus <ma...@ai...> - 2005-10-18 20:50:10
|
On Tuesday 18 October 2005 16:45, glenn mcdonald wrote:
> [Discussion moved here from talk pages...]

Thanks!

> Here are a few more thoughts about why templates seem incredibly
> important to this effort, to me.

Agreed, they are important ("incredibly" or otherwise ;-).

> - Redundancy is always dangerous. All the same arguments against
> separate coding of relations outside of the normal article text also
> apply to any scheme that requires an extra level of annotation in
> either a template or its inputs to add semantic significance. I
> suggest that the scheme will be qualitatively more reliable if
> templates are defined to have inherent semantic significance. This
> might ultimately involve tweaking the template system somehow before
> it makes total sense, but the current state of templates seems like a
> workable approximation to me.

Well, the current software is (almost) able to do this already. Namely, you
can define templates that include semantic relations. This does not work in
the CVS release, because it can currently only be switched on by making a
small change to the MediaWiki code, and at the moment we want the system to
be installed without such patches. However, a simple two-step way to enable
this feature is now described in the CVS version of "INSTALL". To see it
working on a toy example, try http://wiki.ontoworld.org/index.php/Karlsruhe.

Of course, this only supports cases where a template contains exactly the
object of the typed link -- no further intelligence is currently provided.
However, one could certainly think about extending this to have an inherent
semantic template mechanism, but it is already quite close, isn't it?

As I wrote elsewhere on talk pages, you can never expect all interesting
information to be included in templates, so you cannot solely rely on such
templates.
But it would certainly make another helpful input mechanism for an
annotation database.

Considering your remark on redundancy, I perfectly agree. That is one reason
why we chose to include the markup inside the article text (in contrast to
other ideas where it is at the bottom or outside the wikisource). This
ensures that data is given only once. The layout mechanism given by templates
can certainly be combined with the annotation mechanism given by semantic
links, but neither of these technologies can really replace the other or even
tries to do so. So I would not call this redundancy (you can easily prove me
wrong if you have a tool that extracts from templates what we can currently
annotate).

> - Semantic annotations are individually worthless. That is, linking
> topic A to topic B isn't interesting in itself. What's interesting is
> defining the relationship between things like topic A and things like
> topic B, and then linking *all* of the things like topic A to all of
> their corresponding things like topic B. You have to approach complete
> coverage before your query results become useful.

I partly agree. But "complete" (even close to complete) is too much to expect.
It is like saying "Wikipedia can only be useful if all articles are fully
correct." ;-) If I look for politicians born in Uganda, I would certainly not
rely on getting *all* such people -- Wikipedia's data is always incomplete,
with or without the annotation.

> Thus the human work
> of defining the relationship schema is central to the whole semantic
> process.
> Templates (particularly infoboxes) are mediawiki's closest
> approach to microformats,

This hints at a discussion that will be hard to conclude, even on this small
list ...

A short remark on the relationship of our current work to microformats: As you
know, we intend to manage our information in triples (article, relation,
article) and (article, attribute, data value), since this seems to be the
simplest form of the information we encounter in the wiki. This basically
suggests using RDF and the related W3C standards, so that we can gain from
the related developments. Microformats are closely related to these ideas,
but favor different kinds of basic data structures and refer to a different
set of technologies. I guess we could have used these as well, and we can
still incorporate some ideas from this community. However, the syntactic
differences between "semantic XHTML" and "semantic RDF" are not very relevant
to our basic endeavor. We chose RDF since it is an established standard that
is based on a very simple data format, and since we are aware of much software
that is available to deal with such data.

Independently of the technical ideas around microformats, templates are
without doubt a great place to extract information. But the fact that they
have a loose relationship to microformats is not really important for us.

> at the moment, and thus the center of
> discussion about, effectively, the common schema of various data
> types.

Not sure whether I understand. What exactly do you mean by "common schema of
various data types"? (The labels "relation", "attribute", "type" and so on
tend to be very ambiguous; note that we have a fixed idea about these
concepts in the current project.)

> Anything we can do so that work done for
> input/display/consistency purposes is also, by definition, semantic
> progress, will be powerfully in our advantage.

I think so.
A major goal of our design was a user-friendly yet
unambiguous input method. Display is not really addressed so far (other than
having some output at the bottom of pages, but this is not comparable to
having customized templates). Finally, "consistency" is another very wide
term. We try to be consistent with the current editing customs of the users:
data is added in links, where it has been added in numerous other extensions
before, and all information is given in the wikisource for anybody to read
and change. We also tried to be consistent with current web standards, and
thus chose RDF with the option for some OWL-type behavior later on. Is this
what you mean by "consistency"? Do you see further actions that we should
take now to improve said issues?

> - Although it's obviously possible to build small demos on sample data
> (or to use non-public data from other installations of mediawiki), the
> big oohs needed to catapult this project into the mediawiki mainstream
> will only come from being able to execute a new semantic query against
> the live wikipedia to answer some question that up until now could not
> have been answered by machine.

Indeed. That is exactly what we intend to do.

> If we can repurpose template input as
> semantic data, we have some hope of doing a genuinely interesting
> query on current real data, which seems vastly preferable to having to
> go back and hand-code some huge number of relationships.

Again, using templates is something that we hope to be able to do. But not as
a sole input method. They just don't contain enough data, and the data they
contain is provided in a very variable form (also note that many "templates"
still are plain wiki/html tables).
Also you may have noted that many domains
in Wikipedia use templates only on some of the concerned articles, while many
other articles still wait for template support (I guess the "Wikipedia
editor on the street" is not really that much into templates).

However, we do not need to be afraid of hand-coding a huge number of
relationships. It may be some effort when starting the project in practice,
but doing it for some fixed subdomain (like movies etc.) is not a big
problem. A handful of German Wikipedians annotated (in a different way,
compatible both with semantic templates and with semantic links) *all*
persons in German Wikipedia (which is the world's second largest wiki). It
took them one weekend. They had some little tool to support them and just ran
over the articles. It can be done quickly, and it can be done even by people
who are not experts in a given domain.

Other examples of successful large-scale editing efforts are again connected
to templates. Most pages (e.g. all the animal-related articles!) started off
with a wiki-table based layout where we would now use templates. Since last
year (when I noticed this during editing) someone must have changed all these
tables into templates. Probably also without much tool support. Wikipedia
really has this unique potential to develop extremely quickly -- in one year,
semantic relations can be as common and ubiquitous as categories and
templates.

> And this
> applies even more so, obviously, to future wikipedia data. If semantic
> coding is a separate step, it will be done erratically,

Maybe, but the chances are not that high. If you mistype the name of some
semantic relation, you can as well mistype the label of some template
variable. And others will correct it. That's the Wiki-Way.

> and all it
> takes is a tiny amount of data incompleteness to render semantic query
> results effectively meaningless.
As I said before, I don't think Wikipedia can ever achieve completeness of
data. Neither can Google manage to return *all* meaningful results to a
query. Though we have a good chance of approaching completeness in many
topics (given, for example, that the Earth's geography will continue to change
at a slower pace than Wikipedia ;-), completeness is not crucial for being
highly useful.

Regards,

Markus

-- 
Markus Krötzsch
Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe
ma...@ai...        phone +49 (0)721 608 7362
www.aifb.uni-karlsruhe.de/WBS/        fax +49 (0)721 693 717
|
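[Editorial sketch] The triple model described in the message above, (article, relation, article) and (article, attribute, data value), can be illustrated with a few lines of Python. This is only a minimal sketch under stated assumptions: the class names, the relation labels "born in" and "has occupation", and the placeholder people are all hypothetical, and none of this is the project's actual storage code. It also shows the point about incompleteness: a query still returns useful answers when some annotations are missing.

```python
from typing import NamedTuple, Union


class Relation(NamedTuple):
    """An (article, relation, article) triple."""
    subject: str
    relation: str
    obj: str


class Attribute(NamedTuple):
    """An (article, attribute, data value) triple."""
    subject: str
    attribute: str
    value: Union[str, int, float]


# Hypothetical annotations: placeholder articles, not real Wikipedia data.
TRIPLES = [
    Relation("Person A", "born in", "Uganda"),
    Relation("Person A", "has occupation", "Politician"),
    Relation("Person B", "born in", "Uganda"),  # occupation not annotated yet
    Relation("Person C", "born in", "Kenya"),
    Relation("Person C", "has occupation", "Politician"),
    Attribute("Person A", "birth year", 1950),
]


def subjects_with(triples, relation, obj):
    """Return every article that has the given relation to the given object."""
    return {
        t.subject
        for t in triples
        if isinstance(t, Relation) and t.relation == relation and t.obj == obj
    }


def politicians_born_in(triples, country):
    """Intersect two relation lookups; unannotated articles are simply absent."""
    born = subjects_with(triples, "born in", country)
    politicians = subjects_with(triples, "has occupation", "Politician")
    return sorted(born & politicians)


print(politicians_born_in(TRIPLES, "Uganda"))  # ['Person A']
```

"Person B" is missing from the result only because one annotation has not been added yet, which is the kind of partial answer the message argues is still useful.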
From: glenn m. <gl...@fu...> - 2005-10-19 15:22:56
|
Just a few other observations in reply:

You say "you can never expect all interesting information to be
included in templates, so you cannot solely rely on such templates."

This seems a little circular to me. I agree that you can't expect the
current template effort, done for reasons other than semantic tagging,
to cover all the needs of semantic annotation completely by
coincidence. But that doesn't mean that the two efforts couldn't be
unified. In particular, if you took named template parameters to *be*
the semantic annotation method, rather than introducing new syntax,
then anything not yet covered by an existing template could simply be
added as a new one. Like:

Template:IsCapitalOf
{{{1}}}

or perhaps more usefully:

Template:IsCapital
{{{OfCountry}}}

So anything already in templates is semantic annotation for free, and
anything anybody wants to add takes no more rote work than defining
relations, and less learning and error. And obviously there's now no
special code involved in input, which simplifies everything
considerably.

You say "Wikipedia's data is always incomplete, with or without the
annotation."

And obviously this is true, overall, but it's perfectly realistic to
expect that Wikipedia can have complete sets of information within
constrained domains. There's no reason it can't have the names,
capitals, populations, currencies and official languages of all 192
UN-recognized nations, for example, and the coordinates and
populations of all of those capitals and every other large (by
whatever definition) city on the planet. Ditto basic discographical
information for every Beatles studio album, basic biographical
information for all US Presidents, etc. Private wikis running on the
same software are likely to have other sorts of complete sets, like
all of a company's employees, customers, products, etc.
My contention
is that the vast majority of useful semantic queries will be based
primarily or exclusively on verifiably complete information sets based
on known (even if informal) schema.

You note "If you mistype the name of some semantic relation, you can
as well mistype the label of some template variable."

My point about fragility was that if there's only one thing to type (a
template variable), and it has a direct visible consequence as well as
an invisible semantic effect, editors will be much more likely to get
it right.

And lastly, my point about microformats was about the collaborative
work of agreeing on the expected properties of given data types, not
the literal XHTML/XML/whatever syntax used by microformats.org specs.
hCard, for example, although it's literally an XHTML syntax, is also
an agreement on the list of properties that constitutes a "business
card" data element. (Or, in this case, an agreement that vCard already
defined this subset, which is fine.) Although part of the debate about
a Microformat is syntactical, just like part of the debate about a
wikipedia infobox format is just about visual appearance, the core of
both is an effort to codify the data schema.

glenn
|
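[Editorial sketch] The proposal above, treating named template parameters as the annotation method and checking coverage within a constrained domain, can be sketched as follows. This is a hedged illustration only: the function names, the "Template.parameter" property naming, and the sample pages are assumptions, and the regular expression deliberately handles only flat template calls (no nested templates, no links with pipes inside values).

```python
import re

# Matches a flat template call such as {{IsCapital|OfCountry=Germany}}.
TEMPLATE_CALL = re.compile(r"\{\{\s*([^|{}]+?)\s*((?:\|[^{}]*)?)\}\}")


def template_triples(article, wikitext):
    """Turn each named template parameter into (article, property, value)."""
    triples = []
    for match in TEMPLATE_CALL.finditer(wikitext):
        name, raw_params = match.group(1), match.group(2)
        for part in raw_params.split("|")[1:]:
            if "=" not in part:
                continue  # positional parameters carry no property name
            key, value = part.split("=", 1)
            triples.append((article, f"{name}.{key.strip()}", value.strip()))
    return triples


def missing(domain, triples, prop):
    """List the articles of a known domain that lack the given property."""
    covered = {subject for subject, p, _ in triples if p == prop}
    return sorted(set(domain) - covered)


# Hypothetical page sources keyed by article title.
pages = {
    "Berlin": "{{IsCapital|OfCountry=Germany}}",
    "Paris": "{{IsCapital|OfCountry=France}}",
    "Rome": "Rome is a city.",  # template not added yet
}

triples = []
for title, text in pages.items():
    triples.extend(template_triples(title, text))

print(triples)
print(missing(pages, triples, "IsCapital.OfCountry"))  # ['Rome']
```

The `missing` helper reflects the completeness argument: when the domain is a known, finite set, gaps can be listed mechanically instead of silently distorting query results.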
From: Markus <ma...@ai...> - 2005-10-20 14:47:06
|
[I guess this mail was intended for the list. Here it is with the reply.]

On Wednesday 19 October 2005 17:22, glenn mcdonald wrote:
> Just a few other observations in reply:
>
> You say "you can never expect all interesting information to be
> included in templates, so you cannot solely rely on such templates."
>
> This seems a little circular to me. I agree that you can't expect the
> current template effort, done for reasons other than semantic tagging,
> to cover all the needs of semantic annotation completely by
> coincidence. But that doesn't mean that the two efforts couldn't be
> unified. In particular, if you took named template parameters to *be*
> the semantic annotation method, rather than introducing new syntax,
> then anything not yet covered by an existing template could simply be
> added as a new one. Like:
>
> Template:IsCapitalOf
> {{{1}}}
>
> or perhaps more usefully:
>
> Template:IsCapital
> {{{OfCountry}}}
>
> So anything already in templates is semantic annotation for free, and
> anything anybody wants to add takes no more rote work than defining
> relations, and less learning and error. And obviously there's now no
> special code involved in input, which simplifies everything
> considerably.

Indeed, I withdraw one of my arguments, namely that templates would be too
inflexible since they always are predefined collections of properties. I
found that this is not the case, since one can just juxtapose several
templates to get a layout that corresponds to having a single monolithic
template (articles on biological species do so).
So I think you are right that one could use templates the way you propose.
Inside the article, such an annotation would then probably look like this:

{{has capital| object = Berlin}}

or maybe better

{{Relation:has capital| object = Berlin}}

(where "Relation:" helps users to distinguish this from classical templates)

Such templates can also be included in running text, in which case they would
not have a markup other than maybe being a link (in case some people
don't want to have template-like infoboxes on every article with semantic
info). So we really could use templates as a complete replacement for
semantic links. Also note that one would probably want to have an alternative
label for template values in this case (since data types might require
formats that are not optimal inside the article). So the syntax in the
article might be:

{{Relation:has capital| object = Berlin, Germany| alt=Berlin}}

Of course one has to be aware that it would still not be so easy to just
convert the given templates to semantic ones and have all annotations in
place -- my earlier remarks that most templates are used in a rather relaxed
way (where explanatory texts or multiple values are given in one template
variable) still apply, and one would still have to convert many things
manually. Anyway, classical and new "semantic" templates can coexist so that
this transition would work.

So one would basically be left with the question of whether one prefers the
syntax

{{Relation:has capital|object=Berlin, Germany|alt=Berlin}}
over
[[has capital::Berlin (Germany)|Berlin]]

and

{{Attribute:length| value = 45000km|alt=45,000km}}
over
[[length:=45000km|45,000km]]

The former reuses a known scheme, the latter is IMHO easier to read and also
is very close to known schemes. Still I think both are good options on the
article-side.
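[Editorial sketch] To illustrate that the two article-side syntaxes compared above carry the same information, the following Python sketch parses both into one common tuple. It is an assumption-laden toy, not the extension's parser: the function names and the (kind, name, target, label) tuple shape are hypothetical, and the patterns only cover the flat forms shown in the message.

```python
import re

# [[has capital::Berlin (Germany)|Berlin]] or [[length:=45000km|45,000km]]
LINK = re.compile(r"\[\[([^\[\]|:]+?)(::|:=)([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")
# {{Relation:has capital|object=...|alt=...}} or {{Attribute:length|value=...|alt=...}}
TEMPLATE = re.compile(r"\{\{\s*(Relation|Attribute):([^|{}]+?)\s*\|([^{}]*)\}\}")


def parse_link(text):
    """Parse the semantic-link syntax into (kind, name, target, label)."""
    m = LINK.fullmatch(text.strip())
    if m is None:
        return None
    name, separator, target, alt = m.groups()
    kind = "relation" if separator == "::" else "attribute"
    target = target.strip()
    return (kind, name.strip(), target, (alt or target).strip())


def parse_template(text):
    """Parse the template syntax into the same (kind, name, target, label)."""
    m = TEMPLATE.fullmatch(text.strip())
    if m is None:
        return None
    kind, name, raw_params = m.groups()
    params = {}
    for part in raw_params.split("|"):
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip()] = value.strip()
    target = params.get("object" if kind == "Relation" else "value")
    if target is None:
        return None
    return (kind.lower(), name.strip(), target, params.get("alt") or target)


print(parse_link("[[has capital::Berlin (Germany)|Berlin]]"))
print(parse_template("{{Relation:has capital|object=Berlin (Germany)|alt=Berlin}}"))
print(parse_link("[[length:=45000km|45,000km]]"))
print(parse_template("{{Attribute:length| value = 45000km|alt=45,000km}}"))
```

For equal inputs both functions return the same tuple, so the choice between the two syntaxes is a question of readability for editors and not of expressive power, which matches the conclusion drawn in the message.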
On the "schema-side" one still has to take care of making the new semantic
templates really semantic. So one has to somehow describe the special meaning
of the semantic templates in the article "Template:Relation:is capital of".
If you just write [[{{{object}}}|{{{alt}}}]] inside the template code, then
this template is not distinguished from other templates. You need some more
information for managing your annotation information in the database. An
example case would be that you have two templates, both of which can be used
to give an object for "has capital" (say one that fits into an info table,
and one that can be used inline). So you have to make the connection to "has
capital" within the template texts. You also need to know the datatype of an
attribute to parse its value and unit, which suggests a type management
system similar to the one we are currently aiming for (where "Attribute:foo"
articles are the places to define the type of "foo", and "Type:sometype"
might be used for documenting types, or even for customizing them if this is
implemented).

Our current implementation achieves this and we could indeed restrict our new
markup of links to work only in template code. So we already (using the small
patch described in INSTALL) allow users to have semantic templates and so the
"template only" operation can fully be simulated and tested. On the other
hand, this also allows us to keep our current extension in place: users can
use it directly if they prefer the syntax, or they can leave the annotation
to annotated templates, if they prefer this variant. I am sure that such
annotated templates will be very valuable in some areas, and we will really
get much data "for free" as you say.

> You say "Wikipedia's data is always incomplete, with or without the
> annotation."
>
> And obviously this is true, overall, but it's perfectly realistic to
> expect that Wikipedia can have complete sets of information within
> constrained domains. There's no reason it can't have the names,
> capitals, populations, currencies and official languages of all 192
> UN-recognized nations, for example, and the coordinates and
> populations of all of those capitals and every other large (by
> whatever definition) city on the planet. Ditto basic discographical
> information for every Beatles studio album, basic biographical
> information for all US Presidents, etc. Private wikis running on the
> same software are likely to have other sorts of complete sets, like
> all of a company's employees, customers, products, etc. My contention
> is that the vast majority of useful semantic queries will be based
> primarily or exclusively on verifiably complete information sets based
> on known (even if informal) schema.

I agree that there will be many domains where information is complete. I just
don't see that this completeness is so essential for using the semantic
search etc. Today, I use Wikipedia for all kinds of questions, no matter
whether the information for this area is complete or not. Maybe completeness
is more important for offline uses (like importing the geographical data into
an education software). But even there one could live with some errors.
Anyway, the more complete the better, and striving for perfection won't
hurt ;-)

> You note "If you mistype the name of some semantic relation, you can
> as well mistype the label of some template variable."
>
> My point about fragility was that if there's only one thing to type (a
> template variable), and it has a direct visible consequence as well as
> an invisible semantic effect, editors will be much more likely to get
> it right.

OK, agreed. As noted above, you can have this with the current implementation
already by hiding the annotation inside the template code.
I agree that this
is much better in cases where the template is known to have exactly the input
we expect (e.g. "Berlin" and not "[[Berlin]] since 1990, and [[Bonn]] before
that").

> And lastly, my point about microformats was about the collaborative
> work of agreeing on the expected properties of given data types, not
> the literal XHTML/XML/whatever syntax used by microformats.org specs.
> hCard, for example, although it's literally an XHTML syntax, is also
> an agreement on the list of properties that constitutes a "business
> card" data element. (Or, in this case, an agreement that vCard already
> defined this subset, which is fine.) Although part of the debate about
> a Microformat is syntactical, just like part of the debate about a
> wikipedia infobox format is just about visual appearance, the core of
> both is an effort to codify the data schema.

Ah, I see! So you would consider templates as a predefined data scheme which
should be developed based on community agreement. Now that I know that
templates are actually more flexible than I thought, I would even argue that
one does not need a fixed data scheme -- you can just agree on a number of
single properties that are suitable for a template and users can compose
these freely in each article. For example, this is the template part of "Fox
Squirrel":

{{Taxobox begin | color = pink | name = Fox Squirrel}}
{{Taxobox image | image = [[Image:squirrel_on_fence.jpg|200px]] | caption = }}
{{Taxobox begin placement | color = pink}}
{{Taxobox regnum entry | taxon = [[Animal|Animalia]]}}
{{Taxobox phylum entry | taxon = [[Chordate|Chordata]]}}
{{Taxobox classis entry | taxon = [[Mammal]]ia}}
{{Taxobox ordo entry | taxon = [[Rodentia]]}}
{{Taxobox familia entry | taxon = [[Sciuridae]]}}
{{Taxobox subfamilia entry | taxon = [[Sciurinae]]}}
{{Taxobox genus entry | taxon = ''[[Sciurus]]''}}
{{Taxobox species entry | taxon = '''''S. niger'''''}}
{{Taxobox end placement}}
{{Taxobox section binomial | color = pink | binomial_name = Sciurus niger | author = [[Carolus Linnaeus|Linnaeus]], | date = 1758}}
{{Taxobox end}}

The box appears as a single table, but it is composed of many single
templates, grouped by "begin" and "end" templates. Looks rather elegant to me
but of course leads to the creation of many tiny templates that can only be
used in this context. For other areas, fixed templates might be better,
possibly with the option of inserting additional properties at their bottom.

Markus

-- 
Markus Krötzsch
Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe
ma...@ai...        phone +49 (0)721 608 7362
www.aifb.uni-karlsruhe.de/WBS/        fax +49 (0)721 693 717
|
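[Editorial sketch] The composed Taxobox above shows how juxtaposed single-property templates could be read as data. The following Python sketch collects the "entry" templates into one record. It is only an illustration under assumptions: the function names are hypothetical, the pattern is written for the "Taxobox <rank> entry" form quoted in the message, and the markup stripping is deliberately crude.

```python
import re

# Matches lines such as {{Taxobox genus entry | taxon = ''[[Sciurus]]''}}.
ENTRY = re.compile(r"\{\{\s*Taxobox (\w+) entry\s*\|\s*taxon\s*=\s*(.*?)\s*\}\}")
# Matches [[Target]] and [[Target|Label]], capturing the displayed text.
WIKI_LINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]")


def plain(value):
    """Keep the displayed text: drop link brackets and bold/italic quotes."""
    value = WIKI_LINK.sub(r"\1", value)
    return re.sub(r"'{2,}", "", value).strip()


def taxobox_ranks(wikitext):
    """Map each rank named in an entry template to its plain-text taxon."""
    return {rank: plain(value) for rank, value in ENTRY.findall(wikitext)}


SAMPLE = """
{{Taxobox begin | color = pink | name = Fox Squirrel}}
{{Taxobox begin placement | color = pink}}
{{Taxobox regnum entry | taxon = [[Animal|Animalia]]}}
{{Taxobox classis entry | taxon = [[Mammal]]ia}}
{{Taxobox genus entry | taxon = ''[[Sciurus]]''}}
{{Taxobox species entry | taxon = '''''S. niger'''''}}
{{Taxobox end placement}}
{{Taxobox end}}
"""

print(taxobox_ranks(SAMPLE))
```

Each small template contributes exactly one property, so the record can grow or shrink per article without a fixed schema, which is the flexibility the message points out. Values that mix links with free text (such as "[[Berlin]] since 1990, ...") would still need manual cleanup.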