From: Markus <ma...@ai...> - 2006-04-12 18:09:54
Hi Chris, thanks for the quick answer. I take from your words that RAP might not yet be quick enough *in general*. On the other hand, no single tool really meets all our needs (especially since the Java-stores are out), and RAP at least appears to be well-maintained and evolving. I also like that RAP can be configured for different settings (e.g. with various levels of inferencing), so we could allow people to switch on complex features if they have smaller wikis.

My inquiry was rather general. We have a lot of data, but we do not need all of these functions to be very fast. What we really do is:

== Standard wiki usage ==

* On normal article *views* (by far the most common operation), at most some simple reads are needed (if the article is not in cache and certain annotations are used). The same is true for *previews* during editing.
* On every article *write*, the store has to be updated (delete + write). This could be optimized by checking for actual changes in the RDF.

== Semantic features ==

* Further simple reads occur for exporting RDF. This could be optimized by caching.
* Complex queries shall be supported in a simplified inline syntax: users add queries to the article source, and the article then shows the result lists. These lists need to be updated regularly, but not on every change. So, if it is not affordable to do live updates for the query results, updating result lists included in articles once a day might also be acceptable. This is quite an extreme case (and might not be motivating for contributors who want to see their changes take effect immediately), but it illustrates that we are somewhat flexible.

What we really need is to guarantee that the standard usage is hardly slowed down at all.
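To illustrate the write-path optimization mentioned above (checking for actual changes before touching the store): one could diff the old and new triple sets and only issue delete/insert operations for the difference. A minimal Python sketch; the function name and the set-of-tuples representation are my own illustration, not RAP's actual API:

```python
# Sketch of the "check for actual changes" optimization on article writes.
# Instead of an unconditional delete-all + rewrite, compute the difference
# between the old and new triple sets and touch the store only for the delta.

def triple_delta(old_triples, new_triples):
    """Return (to_delete, to_insert) given two iterables of (s, p, o) triples."""
    old, new = set(old_triples), set(new_triples)
    return old - new, new - old

old = {("Berlin", "population", "3400000"),
       ("Berlin", "country", "Germany")}
new = {("Berlin", "population", "3500000"),
       ("Berlin", "country", "Germany")}

to_delete, to_insert = triple_delta(old, new)
# Only the changed population triple is deleted and re-inserted;
# the unchanged country triple causes no store access at all.
```

If the delta is empty, the store update can be skipped entirely, which matters because most article edits change prose rather than annotations.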
The added semantic features are somewhat optional: we need a certain amount to convince anyone to use the extension, but we can be restrictive to ensure acceptable performance. It would also be OK to restrict queries wrt. complexity or size of the result set. Our problem with evaluation is that we do not have real testing data until the extension is active in some major wiki, but we need to ensure some amount of scalability before that.

I would also like to learn more about the current capabilities of Appmosphere. My impression was that its RDF store and query features are rather new -- is it currently recommended for major productive use? Having an integrated API of RAP and Appmosphere would clearly be great for our setting.

Redland is the third store that we really consider. Since it seems to be a one-man project, I wonder whether its future development is secured (e.g. the demos on the site were all disabled when Dave Beckett switched to Yahoo!).

Concerning 3Store, I thought that they have a document-centric approach where you first load a large RDF document and then ask queries. Whatever the performance of the querying is, we could not afford to reload the whole data every time someone makes a change. The PHP binding of 3Store is realized by making calls to shell commands from PHP.

On Wednesday 12 April 2006 17:50, Chris Bizer wrote:
> Hi Markus,
>
> > We consider using RAP as a quadstore for Semantic MediaWiki (see
> > http://wiki.ontoworld.org).
>
> Interesting.
>
> > In the long run, we are interested in inferencing, but for now
> > Wikipedia-size scalability is most important.
>
> Hmm, sorry, to my knowledge there are no systematic comparisons of the
> performance of RAP with other RDF toolkits.
>
> We did some relatively unsystematic performance testing when we implemented
> different features, but the results are outdated by now.
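To make the restriction on result-set size that I suggest above concrete: an inline query's result list can simply be truncated server-side before it reaches the article. A minimal Python sketch; all names here are my own illustration, not the API of RAP or any other store:

```python
import itertools

MAX_INLINE_RESULTS = 20  # hypothetical cap for result lists embedded in articles

def limited_results(result_iter, limit=MAX_INLINE_RESULTS):
    """Materialize at most `limit` rows from a (possibly huge) query result."""
    return list(itertools.islice(result_iter, limit))

# Even if a query could match thousands of articles, rendering the page
# only pays for the first `limit` rows.
rows = limited_results(iter(range(100000)))
```

Where the store supports it, pushing the cap into the query itself (e.g. SPARQL's LIMIT modifier) is cheaper still, since the store can stop evaluating early instead of producing rows that are thrown away.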
>
> Sören Auer and Bart Pieterse (both cc'ed) have used RAP in bigger projects
> and I guess they are the best sources for practical experiences with the
> performance of RAP with bigger real-world datasets.
>
> My general impression is that, as PHP itself is still slower than languages
> like Java or C, RAP is also slow and its performance cannot be compared
> with toolkits like Jena or Sesame. Sören might disagree with me on this
> point.
>
> > Are there recent evaluations concerning the performance of the different
> > storage models? In particular, we are interested in scalability of the
> > following functions:
> >
> > 1 SPARQL queries:
> > 1.1 general performance
>
> Around one second for a medium-complex query against a data set with
> 100 000 triples in memory, much slower if the data set is in a database.
> Tobias Gauss can give you details.
>
> A PHP alternative for SPARQL queries against data sets which are stored in
> a database is Benjamin's appmosphere toolkit:
> http://www.appmosphere.com/pages/en-arc. He does smarter SPARQL-to-SQL
> rewriting than RAP and should theoretically be faster.
>
> > 1.2 performance of "join-intensive" queries (involving long chains of
> > triples)
> > 1.3 performance of datatype queries (e.g. selecting/sorting results by
> > some xsd:int or xsd:decimal)
> > 1.4 performance for partial result lists (e.g. getting only the first 20)
> > 2 simple read access (e.g. getting all triples of a certain pattern or
> > RDF dataset)
>
> OK with models up to 100 000 triples. Don't know about bigger models.
> Sören?
>
> > 3 write access
> > 3.1 adding triples to an existing store
> > 3.2 deleting selected triples from the store
>
> Should be OK. I think Sören implemented some workarounds for bulk updates.
>
> > 4 impact of RDF dataset features/named graph functionality
>
> About 5% slower than operations on classic RDF models.
>
> > For inclusion in Wikipedia, dealing with about 10 Mio triples split into
> > 1 Mio RDF datasets is probably necessary.
>
> Too much for RAP, too much for appmosphere (Benjamin?), and I guess even
> hard for Jena, Redland and Co. if the queries become more complicated.
>
> > We are working on useful update and caching strategies to reduce access
> > to the RDF store, but a rather high number of parallel requests is still
> > to be expected (though normal reading of articles will not touch the
> > store). It would also be possible to restrict to certain types of
> > queries if this leads to improved performance.
> >
> > We currently use RAP as an RDF parser for importing ontologies into
> > Semantic MediaWiki. For querying our RDF data, we consider reusing
> > existing triplestores such as Redland or RAP, but also using SQL queries
> > directly. Java toolkits are not an option since Wikipedia requires the
> > use of free software (and free Java implementations probably don't
> > support current RDF stores).
>
> If "current RDF stores" means Named Graph stores, then you could use a
> combination of Jena and NG4J. Jena is BSD-licensed and supports SPARQL.
> NG4J adds an API for manipulating Named Graph sets. See:
> http://www.wiwiss.fu-berlin.de/suhl/bizer/ng4j/
>
> > I can imagine that one can already find performance measures for RAP
> > somewhere on the web -- sorry if I missed this.
>
> Not that I know of. But all efforts in that direction are highly welcome.
>
> Cheers,
>
> Chris
>
> > Best regards,
> >
> > Markus
> >
> > --
> > Markus Krötzsch
> > Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe
> > ma...@ai...   phone +49 (0)721 608 7362
> > www.aifb.uni-karlsruhe.de/WBS/   fax +49 (0)721 693 717

-- 
Markus Krötzsch
Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe
ma...@ai...   phone +49 (0)721 608 7362
www.aifb.uni-karlsruhe.de/WBS/   fax +49 (0)721 693 717