From: Markus <ma...@ai...> - 2006-04-12 18:09:54
Hi Chris, thanks for the quick answer. I take from your words that RAP might not yet be quick enough *in general*. On the other hand, no single tool really meets all our needs (especially since the Java-stores are out), and RAP at least appears to be well-maintained and evolving. I also like that RAP can be configured for different settings (e.g. with various levels of inferencing), so we could allow people to switch on complex features if they have smaller wikis.

My inquiry was rather general. We have a lot of data, but we do not need all of these functions to be very fast. What we really do is:

== Standard wiki usage ==

* On normal article *views* (by far the most common operation), at most some simple reads are needed (if the article is not in cache and certain annotations are used). The same is true for *previews* during editing.
* On every article *write*, the store has to be updated (delete + write). This could be optimized by checking for actual changes in the RDF.

== Semantic features ==

* Further simple reads occur for exporting RDF. This could be optimized by caching.
* Complex queries shall be supported in a simplified inline syntax: users add queries to the article source, and the article then shows the result lists. These lists need to be updated regularly, but not on every change. So, if it is not affordable to do live updates for the query results, updating result lists included in articles once a day might also be acceptable. This is quite an extreme case (and might not be motivating for contributors who want to see their changes take effect immediately), but it illustrates that we are somewhat flexible.

What we really need is to guarantee that the standard usage is hardly slowed down at all.
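To illustrate the write-path optimization mentioned above (checking for actual changes before touching the store): one could diff the old and new triple sets and only issue delete/insert operations for the difference. A minimal Python sketch; the function name and the set-of-tuples representation are my own illustration, not RAP's actual API:

```python
# Sketch of the "check for actual changes" optimization on article writes.
# Instead of an unconditional delete-all + rewrite, compute the difference
# between the old and new triple sets and touch the store only for the delta.

def triple_delta(old_triples, new_triples):
    """Return (to_delete, to_insert) given two iterables of (s, p, o) triples."""
    old, new = set(old_triples), set(new_triples)
    return old - new, new - old

old = {("Berlin", "population", "3400000"),
       ("Berlin", "country", "Germany")}
new = {("Berlin", "population", "3500000"),
       ("Berlin", "country", "Germany")}

to_delete, to_insert = triple_delta(old, new)
# Only the changed population triple is deleted and re-inserted;
# the unchanged country triple causes no store access at all.
```

If the delta is empty, the store update can be skipped entirely, which matters because most article edits change prose rather than annotations.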
The added semantic features are somewhat optional: we need a certain amount to convince anyone to use the extension, but we can be restrictive to ensure acceptable performance. It would also be OK to restrict queries wrt. complexity or size of the result set. Our problem with evaluation is that we do not have real testing data until the extension is active in some major wiki, but we need to ensure some amount of scalability before that.

I would also like to learn more about the current capabilities of Appmosphere. My impression was that its RDF store and query features are rather new -- is it currently recommended for major productive use? Having an integrated API of RAP and Appmosphere would clearly be great for our setting.

Redland is the third store that we really consider. Since it seems to be a one-man project, I wonder whether its future development is secured (e.g. the demos on the site were all disabled when Dave Beckett switched to Yahoo!).

Concerning 3Store, I thought that they have a document-centric approach where you first load a large RDF document and then ask queries. Whatever the performance of the querying is, we could not afford to reload the whole data every time someone makes a change. The PHP binding of 3Store is realized by making calls to shell commands from PHP.

On Wednesday 12 April 2006 17:50, Chris Bizer wrote:
> Hi Markus,
>
> > We consider using RAP as a quadstore for Semantic MediaWiki (see
> > http://wiki.ontoworld.org).
>
> Interesting.
>
> > In the long run, we are interested in inferencing, but for now
> > Wikipedia-size scalability is most important.
>
> Hmm, sorry, to my knowledge there are no systematic comparisons of the
> performance of RAP with other RDF toolkits.
>
> We did some relatively unsystematic performance testing when we implemented
> different features, but the results are outdated by now.
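To make the restriction on result-set size that I suggest above concrete: an inline query's result list can simply be truncated server-side before it reaches the article. A minimal Python sketch; all names here are my own illustration, not the API of RAP or any other store:

```python
import itertools

MAX_INLINE_RESULTS = 20  # hypothetical cap for result lists embedded in articles

def limited_results(result_iter, limit=MAX_INLINE_RESULTS):
    """Materialize at most `limit` rows from a (possibly huge) query result."""
    return list(itertools.islice(result_iter, limit))

# Even if a query could match thousands of articles, rendering the page
# only pays for the first `limit` rows.
rows = limited_results(iter(range(100000)))
```

Where the store supports it, pushing the cap into the query itself (e.g. SPARQL's LIMIT modifier) is cheaper still, since the store can stop evaluating early instead of producing rows that are thrown away.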
>
> Sören Auer and Bart Pieterse (both cc'ed) have used RAP in bigger projects
> and I guess they are the best sources for practical experiences with the
> performance of RAP with bigger real-world datasets.
>
> My general impression is that, as PHP itself is still slower than languages
> like Java or C, RAP is also slow and its performance cannot be compared
> with toolkits like Jena or Sesame. Sören might disagree with me on this
> point.
>
> > Are there recent evaluations concerning the performance of the different
> > storage models? In particular, we are interested in scalability of the
> > following functions:
> >
> > 1 SPARQL queries:
> > 1.1 general performance
>
> Around one second for a medium-complex query against a data set with
> 100 000 triples in memory, much slower if the data set is in a database.
> Tobias Gauss can give you details.
>
> A PHP alternative for SPARQL queries against data sets which are stored in
> a database is Benjamin's appmosphere toolkit:
> http://www.appmosphere.com/pages/en-arc. He does smarter SPARQL-to-SQL
> rewriting than RAP and should theoretically be faster.
>
> > 1.2 performance of "join-intensive" queries (involving long chains of
> > triples)
> > 1.3 performance of datatype queries (e.g. selecting/sorting results by
> > some xsd:int or xsd:decimal)
> > 1.4 performance for partial result lists (e.g. getting only the first 20)
> > 2 simple read access (e.g. getting all triples of a certain pattern or
> > RDF dataset)
>
> OK with models up to 100 000 triples. Don't know about bigger models.
> Sören?
>
> > 3 write access
> > 3.1 adding triples to an existing store
> > 3.2 deleting selected triples from the store
>
> Should be OK. I think Sören implemented some workarounds for bulk updates.
>
> > 4 impact of RDF dataset features/named graph functionality
>
> About 5% slower than operations on classic RDF models.
>
> > For inclusion in Wikipedia, dealing with about 10 Mio triples split into
> > 1 Mio RDF datasets is probably necessary.
>
> Too much for RAP, too much for appmosphere (Benjamin?), and I guess even
> hard for Jena, Redland and Co. if the queries become more complicated.
>
> > We are working on useful update and caching strategies to reduce access
> > to the RDF store, but a rather high number of parallel requests is still
> > to be expected (though normal reading of articles will not touch the
> > store). It would also be possible to restrict to certain types of
> > queries if this leads to improved performance.
> >
> > We currently use RAP as an RDF parser for importing ontologies into
> > Semantic MediaWiki. For querying our RDF data, we consider reusing
> > existing triplestores such as Redland or RAP, but also using SQL queries
> > directly. Java toolkits are not an option since Wikipedia requires the
> > use of free software (and free Java implementations probably don't
> > support current RDF stores).
>
> If "current RDF stores" means Named Graph stores, then you could use a
> combination of Jena and NG4J. Jena is BSD-licensed and supports SPARQL.
> NG4J adds an API for manipulating Named Graph sets. See:
> http://www.wiwiss.fu-berlin.de/suhl/bizer/ng4j/
>
> > I can imagine that one can already find performance measures for RAP
> > somewhere on the web -- sorry if I missed this.
>
> Not that I know of. But all efforts in that direction are highly welcome.
>
> Cheers,
>
> Chris
>
> > Best regards,
> >
> > Markus
> >
> > --
> > Markus Krötzsch
> > Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe
> > ma...@ai...   phone +49 (0)721 608 7362
> > www.aifb.uni-karlsruhe.de/WBS/   fax +49 (0)721 693 717

-- 
Markus Krötzsch
Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe
ma...@ai...   phone +49 (0)721 608 7362
www.aifb.uni-karlsruhe.de/WBS/   fax +49 (0)721 693 717