From: Jens F. <jf...@mo...> - 2006-04-09 12:57:44
Hello,

Matthias Schindler has talked me into doing a short review of the Semantic MediaWiki extension. The idea of it looks very interesting. It would make many lists obsolete and would allow a kind of data mining we currently don't provide. And I'd love to have well-structured geographic (meta-)data available to draw maps.

The current software is version 0.3, so it's still far from production readiness. Deficits I've found, in mostly chronological order:

* The installation of the SMW tables is triggered from a special page. The database user that MediaWiki uses at that point is not allowed to create tables if the user was created by the MediaWiki installer. A hook for extensions in the installer could be helpful here.

* The tables are created as type=MyISAM. That's a no-go for sites like wikipedia.org. MyISAM provides only very poor locking and is only suitable for read-only databases.

* No indexes are created. All requests require full table scans. The choice of 'text' as datatype for page titles ('subject') in all SMW tables makes this even worse - it prevents the creation of proper indices.

* References to the page table are given as (namespace, title), not using page_id. While this is nice for querying SMW tables (no join needed), it creates a potential risk when renaming pages (high write load).

* Properties (e.g. population, geographic coordinates, birth dates) are stored as strings. This makes queries like 'all cities with more than a million inhabitants' very expensive. Another no-go for a site like wikipedia.org; those queries will happen frequently. (A sketch of an index-friendly alternative follows after this list.)

* Non-standard way to handle local settings. These should be incorporated into the global LocalSettings.php. Many of these settings should be auto-detected, esp. smwgServer, SMW_ScriptPath and SMW_IP.

* Naming conventions for variables: SMW_RAPPath, enableTemplateSupport and glNamespacesWithSemanticLinks are examples of three different ways to name variables.

* Poor use of the database abstraction layer. If you used the abstraction layer, you wouldn't need globals for the table names and you wouldn't have to quote all fields yourself. You'd also benefit from future extensions, like query optimization or query distribution.

* There shouldn't be an editing help link on article pages; only show it on edit pages.

* I don't like how the relations and attributes are displayed. But I've no idea yet how to improve it.
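To make the DB-related points concrete, a layout along these lines would allow proper indexing (just a sketch - the table and column names are invented, not a proposal for the actual schema):

    -- Sketch only: page_id references plus a typed value column,
    -- so that real indexes become possible.
    CREATE TABLE smw_attributes (
      subject_id INT(8) UNSIGNED NOT NULL,          -- page.page_id of the annotated page
      attribute  VARCHAR(255) BINARY NOT NULL,      -- attribute name, e.g. 'Population'
      value_num  DOUBLE DEFAULT NULL,               -- parsed numeric value, NULL if none
      value_text VARCHAR(255) BINARY DEFAULT NULL,  -- normalized string form
      KEY subj (subject_id),
      KEY attr_num (attribute, value_num)
    ) TYPE=InnoDB;

    -- 'All cities with more than a million inhabitants' then becomes an
    -- indexed range scan instead of a full table scan:
    SELECT p.page_title
      FROM smw_attributes a, page p
     WHERE a.attribute = 'Population'
       AND a.value_num > 1000000
       AND p.page_id = a.subject_id;

Referencing page_id also means a page rename costs no writes in this table at all.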
For some of these items I've created bug reports on sourceforge, some also include patches.

Regards,

jens

From: Markus <ma...@ai...> - 2006-04-10 10:12:06
On Sunday 09 April 2006 14:57, Jens Frank wrote:
> Hello,
>
> Matthias Schindler has talked me into doing a short review of the Semantic
> MediaWiki extension. The idea of it looks very interesting. It would make
> many lists obsolete and would allow a kind of data mining we currently
> don't provide. And I'd love to have well-structured geographic (meta-)data
> available to draw maps.
>
> The current software is version 0.3, so it's still far from production
> readiness. Deficits I've found, in mostly chronological order:
>
> * The installation of the SMW tables is triggered from a special page.
>   The database user that MediaWiki uses at that point is not allowed to
>   create tables if the user was created by the MediaWiki installer. A hook
>   for extensions in the installer could be helpful here.

Yes, I definitely agree. Integration into the MediaWiki installer is desirable, but we did not give it a high priority.

Your next bunch of comments has a single answer -- see below ...

> * The tables are created as type=MyISAM. That's a no-go for sites like
>   wikipedia.org. MyISAM provides only very poor locking and is only
>   suitable for read-only databases.
>
> * No indexes are created. All requests require full table scans.
>   The choice of 'text' as datatype for page titles ('subject') in all
>   SMW tables makes this even worse - it prevents the creation of proper
>   indices.
>
> * References to the page table are given as (namespace, title), not using
>   page_id. While this is nice for querying SMW tables (no join needed),
>   it creates a potential risk when renaming pages (high write load).
>
> * Properties (e.g. population, geographic coordinates, birth dates) are
>   stored as strings. This makes queries like 'all cities with more than a
>   million inhabitants' very expensive. Another no-go for a site like
>   wikipedia.org; those queries will happen frequently.

All true and fully agreed. The reason is that the tables as you see them now are not intended for querying at all. The simple semantic search does not provide much functionality anyway and is clearly not efficient for large sites (it does not even split high numbers of results onto multiple pages as QueryPage does).

The internal tables are just caches that store the parser output for further processing, especially for RDF export. You are right that one could still optimize table structure and indexing, but since we have not yet implemented all of our intended functionality, it is hard to do a goal-directed optimization (e.g. creating indexes).

The reason for the string format on numbers is similar: we use the tables as caches for exporting RDF, and so just store the RDF string versions of the data once we have parsed it. The advantage is that we do not have to conceive a complex table layout just to abolish it in the next version when another mode for query answering is found.

The question is of course how we can provide efficient support for complex queries at Wikipedia scale. There are two possibilities:

(1) The original intention was to load the generated RDF into a triplestore that then efficiently handles datatype queries internally. Triplestores can deal with several (tens to hundreds of) millions of triples, and they support SPARQL as a query language (now almost standard, i.e. a "W3C candidate recommendation").
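In SPARQL, Jens's example query would look roughly like this (the property name and namespace are invented for illustration; they are not the actual export vocabulary):

    # Sketch only: 'wiki:Population' is an invented property URI.
    PREFIX wiki: <http://wiki.example.org/property#>
    SELECT ?city
    WHERE {
      ?city wiki:Population ?pop .
      FILTER (?pop > 1000000)
    }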
We developed an experimental server application which crawls the wiki's RDF based on the recent changes, keeps a triplestore up-to-date with the current wiki content, and supports all kinds of SPARQL queries.

This was our plan, and it all looked nice, until we found out that Wikipedia does not accept Sun Java software. Unfortunately, the most powerful triplestores are all written in Java and are very unlikely to run on free implementations or compilers (though the stores themselves are free).

But of course there is another way:

(2) Handle queries with SQL by creating an efficient DB layout for querying that overcomes all the deficiencies you mention above. The disadvantage is that this solution is far more work. If you have a triplestore, you just upgrade to the next version to get more features and better performance. With your own DB layout and query mechanism you have to do it all by yourself. Most of SPARQL should map to SQL easily, but "intelligent" features such as subclass inferencing are not that easy to implement from scratch.

Anyway, (2) now seems to be the only possible path towards Wikipedia and we are grateful for all support in setting up something fast.

> * Non-standard way to handle local settings. These should be incorporated
>   into the global LocalSettings.php. Many of these settings should be
>   auto-detected, esp. smwgServer, SMW_ScriptPath and SMW_IP.

True. We kept things apart where possible in order to allow people to try it out without too much patching in their MW installation. I would put this under the "improve installation process" item above.

> * Naming conventions for variables: SMW_RAPPath, enableTemplateSupport
>   and glNamespacesWithSemanticLinks are examples of three different
>   ways to name variables.

Yes. This is historical and should be easy to fix. If anybody cares, I am willing to take a beautification tour through the whole code. Until recently, I just wrote the code alone, and did not have any resources left for minor cleanups/rewrites.

> * Poor use of the database abstraction layer. If you used the abstraction
>   layer, you wouldn't need globals for the table names and you wouldn't
>   have to quote all fields yourself. You'd also benefit from future
>   extensions, like query optimization or query distribution.

OK, for the next DB restructuring (towards item (2) above ...) I will have a look into this.
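I suppose the read access would then look roughly like this (an untested sketch inside an extension function; 'smw_attributes' is an invented table name, not a real SMW table):

    # Going through MediaWiki's Database class instead of raw SQL:
    # table names are resolved and condition values quoted for us.
    $dbr =& wfGetDB( DB_SLAVE );
    $res = $dbr->select(
        'smw_attributes',
        array( 'subject_id', 'value_num' ),
        array( 'attribute' => 'Population' ),
        'SMW::getAttributeValues'
    );
    while ( $row = $dbr->fetchObject( $res ) ) {
        # ... use $row->subject_id and $row->value_num ...
    }
    $dbr->freeResult( $res );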
> * There shouldn't be an editing help link on article pages; only show it
>   on edit pages.

Sure. We just put this in to have a visible starter on our demo wiki. The whole infobox will become configurable. It should be possible to display it only if non-empty, and there should be some JScript to hide the box. The whole data could then be hidden by default.

> * I don't like how the relations and attributes are displayed. But I've
>   no idea yet how to improve it.

In the infobox? I agree. Maybe a hidden-by-default infobox will make this issue less important.

> For some of these items I've created bug reports on sourceforge, some
> also include patches.

I will have a look at it. Thanks a lot for taking the time to read the code so thoroughly. I think we could really use some help with implementing efficient SQL-based querying, since we are not experts in this domain (we are more on the side of triplestores ...). The problem is that we are really developing SMW in our non-existing spare time. The project could evolve much faster if we did not have 10h/day jobs besides the coding :-s

Still I think that the project has quite some potential to do much good for Wikipedia, also in combination with all the other small projects towards usable machine processing of WP content -- but we could really use some help to get it deployable.

Best regards,
Markus

--
Markus Krötzsch
Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe
ma...@ai...  phone +49 (0)721 608 7362
www.aifb.uni-karlsruhe.de/WBS/  fax +49 (0)721 693 717
From: Jama P. <ja...@de...> - 2006-04-10 12:48:48
On Mon, Apr 10, 2006 at 12:11:27PM +0200, Markus Krötzsch wrote:
> This was our plan, and it all looked nice, until we found out that
> Wikipedia does not accept Sun Java software. Unfortunately, the most
> powerful triplestores are all written in Java and are very unlikely to
> run on free implementations or compilers (though the stores themselves
> are free).

Besides the "Sun's Java isn't free" argument there is another, just as important one: people using MediaWiki often don't have Java on their hosting solution. Java is also not known for integrating well with other language environments.

I personally don't understand why Java is so popular in the semantic web software research field. To me, the freedom and integration aspects are much more important than an 'easy' all-in-one platform for building RDF applications.

> But of course there is another way:
>
> (2) Handle queries with SQL by creating an efficient DB layout for
> querying that overcomes all the deficiencies you mention above. The
> disadvantage is that this solution is far more work. If you have a
> triplestore, you just upgrade to the next version to get more features
> and better performance. With your own DB layout and query mechanism you
> have to do it all by yourself. Most of SPARQL should map to SQL easily,
> but "intelligent" features such as subclass inferencing are not that
> easy to implement from scratch.
>
> Anyway, (2) now seems to be the only possible path towards Wikipedia and
> we are grateful for all support in setting up something fast.

Have you looked at these MySQL/PHP based RDF stores?

http://www.appmosphere.com/pages/en-arc_rdf_store
http://www.aktors.org/technologies/3store/

I've used 3store, but still need to look at the ARC RDF store. 3store is okay, but not that all-round yet. I think the ARC RDF store looks the most interesting for the SMW project.

I don't think incremental updates are possible with ARC, though the recent-changes method sounds interesting. However, you would still need some kind of command-line tool to re-create the RDF graph from the whole MW DB once in a while (not just for optimizations).

Here's an interesting application built using ARC: http://www.confoto.org

regards,

Jama Poulsen
http://wikicompany.org
From: Markus <ma...@ai...> - 2006-04-10 14:20:52
On Monday 10 April 2006 14:48, Jama Poulsen wrote:
> On Mon, Apr 10, 2006 at 12:11:27PM +0200, Markus Krötzsch wrote:
> > This was our plan, and it all looked nice, until we found out that
> > Wikipedia does not accept Sun Java software. Unfortunately, the most
> > powerful triplestores are all written in Java and are very unlikely to
> > run on free implementations or compilers (though the stores themselves
> > are free).
>
> Besides the "Sun's Java isn't free" argument there is another, just as
> important one: people using MediaWiki often don't have Java on their
> hosting solution. Java is also not known for integrating well with other
> language environments.
>
> I personally don't understand why Java is so popular in the semantic web
> software research field. To me, the freedom and integration aspects are
> much more important than an 'easy' all-in-one platform for building RDF
> applications.

Well, before Java it was C++, and C before that. It's just the current mainstream imperative language. And this is not completely unjustified, since it has many features that are quite helpful. I really doubt that Java should be the first thing to consider when writing a web application, but it is a language that many people know quite well (try to find some CS student who has worked with PHP to see what I mean ...). This is, I think, the main reason why it is so widespread today. And it has the advantage of being relatively platform independent, which is also important in a research context (yes, so are most scripting languages, but scripting is not the solution for everything). But I also see that WP has good reasons not to use Java. So we have to live with the given situation.

> > But of course there is another way:
> >
> > (2) Handle queries with SQL by creating an efficient DB layout for
> > querying that overcomes all the deficiencies you mention above. The
> > disadvantage is that this solution is far more work. If you have a
> > triplestore, you just upgrade to the next version to get more features
> > and better performance. With your own DB layout and query mechanism you
> > have to do it all by yourself. Most of SPARQL should map to SQL easily,
> > but "intelligent" features such as subclass inferencing are not that
> > easy to implement from scratch.
> >
> > Anyway, (2) now seems to be the only possible path towards Wikipedia
> > and we are grateful for all support in setting up something fast.
>
> Have you looked at these MySQL/PHP based RDF stores?
> http://www.appmosphere.com/pages/en-arc_rdf_store

Sounds interesting. I did not know this one. Is it free in a strict sense? Performance figures? Is it ready for productive use (it seems to be very new)?

> http://www.aktors.org/technologies/3store/
>
> I've used 3store, but still need to look at the ARC RDF store. 3store is
> okay, but not that all-round yet. I think the ARC RDF store looks the
> most interesting for the SMW project.

We also thought about Redland, but I am not sure how active this project currently is. 3store is AFAIK document-based, i.e. you reload the whole RDF whenever something has changed. This would not be suitable for our change-intensive environment.

> I don't think incremental updates are possible with ARC, though the
> recent-changes method sounds interesting. However, you would still need
> some kind of command-line tool to re-create the RDF graph from the whole
> MW DB once in a while (not just for optimizations).

This must be possible in any case, but it should not happen often. The problem is that it must be possible to change an article, then go right to a search function or automatically generated list, and find the changes reflected there. If the search results are updated only once in a while, users will be confused or uninterested.

> Here's an interesting application built using ARC: http://www.confoto.org

Another issue with RDF stores is that we actually need a quadstore in our case. The reason is that RDF is slightly more complex than the data in our caching tables (there are additional triples for labels and types, but also some data might translate to more than one triple [e.g. for geo-coordinates, we intend to use a format where you can also get the latitude and longitude values as decimal numbers]), so we have to keep track of which data came from which article.
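In table terms, I imagine each statement carrying its article of origin as a fourth column, roughly like this (again only a sketch with invented names):

    -- Sketch of a 'quad' layout: every triple records the page it was
    -- parsed from, so one geo-annotation may yield several rows
    -- (latitude, longitude, ...) that share the same origin.
    CREATE TABLE smw_quads (
      source_page INT(8) UNSIGNED NOT NULL,   -- page.page_id of the origin article
      subject     VARCHAR(255) BINARY NOT NULL,
      predicate   VARCHAR(255) BINARY NOT NULL,
      object      VARCHAR(255) BINARY NOT NULL,
      KEY source (source_page),
      KEY sp (subject, predicate)
    ) TYPE=InnoDB;

    -- When an article changes, exactly its statements can be replaced
    -- (the id is of course made up):
    DELETE FROM smw_quads WHERE source_page = 4711;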
Regards,
Markus

--
Markus Krötzsch
Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe
ma...@ai...  phone +49 (0)721 608 7362
www.aifb.uni-karlsruhe.de/WBS/  fax +49 (0)721 693 717
From: Danny A. <dan...@gm...> - 2006-04-10 18:00:53
FYI, there's a well-maintained list of RDF toolkits here:

http://www.wiwiss.fu-berlin.de/suhl/bizer/toolkits/

--
http://dannyayers.com
From: Jama P. <ja...@de...> - 2006-04-10 19:50:15
On Mon, Apr 10, 2006 at 04:18:49PM +0200, Markus Krötzsch wrote:
> On Monday 10 April 2006 14:48, Jama Poulsen wrote:
> > http://www.appmosphere.com/pages/en-arc_rdf_store
>
> Sounds interesting. I did not know this one. Is it free in a strict sense?
> Performance figures? Is it ready for productive use (it seems to be very
> new)?

The W3C Software License:

http://www.appmosphere.com/pages/en-arc_license
http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231

Without any special "disclaimers, notices, or terms and conditions" this license is basically a BSD-style license, and thus compatible with the GNU GPL.

I've just upgraded Wikicompany to MW 1.6 and also upgraded the SMW code (which is looking better all the time!), so I'll probably have a look at ARC soon, to check the API, documentation, etc. I may contact appmosphere to see what their plans are for ARC. A somewhat more open project management style would be good for this project, I think.

> Another issue with RDF stores is that we actually need a quadstore in our
> case. The reason is that RDF is slightly more complex than the data in
> our caching tables (there are additional triples for labels and types,
> but also some data might translate to more than one triple [e.g. for
> geo-coordinates, we intend to use a format where you can also get the
> latitude and longitude values as decimal numbers]), so we have to keep
> track of which data came from which article.

Decoupling the statement store from the RDF query engine could also make it easier to switch RDF engines down the line.

The GeoRSS 'microformat' may also be of interest here: http://georss.org/rdf_rss1.html

Jama Poulsen