Thank you, Markus - it's a really good review! I wonder if there is any way to unify performance reporting across all SMW instances so we can compare the effects of large data sets and different system configs (e.g. disabled cache and so on). I just looked at
the profileinfo.php script; it might be an answer, actually.
I wonder if a real (possibly outdated) Wikipedia data set is going to be set up as a test case for SMW to handle (with Semantic Templates, of course). I was going to do that, but I don't have the resources for it. This might help to make the goal of "Semantic Wikipedia" more transparent.
I'll be happy to run the tests on a system with a significant amount of data if you need a testbed.
On Friday, 14 December 2007, Sergey Chernyshev wrote:
> Got it - if it'll speed up the process, that'll be great. Currently SMW on
> top of MW runs significantly slower than just MW, which is not very good
> because it means that SMW+MW can't scale as well as MW alone.
> Can you describe in a couple of paragraphs how SMW data and queries are
> getting cached and how that cache is being invalidated, what works on the
> fly and what is served from parser cache?
> I understand it's a lot to describe, but for projects with massive amounts of
> data and traffic, performance can be a big show-stopper - we picked MW for
> one of our projects because of Wikipedia's example of performance and
> predictability and I hope that it's not too distant for SMW to inherit
> these qualities, but I'd like to understand the overall picture.
Yes, agreed. Of course we have always designed basic algorithms with regard
to performance and scalability, and especially tried to pick features based
on this aspect. On the other hand, caching is significantly under-developed
in SMW as it is, since it mainly uses the existing MW caches where
applicable. There are various types of operations that are relevant to
performance, and each can probably be optimised/cached independently:
(1) Basic page display -- by far the most common operation.
(2) Query answering, inline and on Special:Ask.
(3) Annotation parsing and page formatting.
(4) Maintenance specials such as Special:Properties.
(5) OWL/RDF export.
(6) The browsing special Special:Browse.
I will sketch performance issues for each of those. For actual numbers, see
http://ontoworld.org/profileinfo.php to find out how severe each operation is.
(1) is clearly the main operation, and for existing pages SMW merely uses MW's
parser/page caches. No mechanism for cache invalidation exists, but MW
regularly updates page caches. This means inline query results can be outdated,
but it gives us good hope for basic scalability in large environments. In
particular, SMW does not hook into any operations that happen when reproducing
parser-cached pages. Even the Factbox comes from the parser cache (which is why we cannot
readily translate it to the user's language as MW does for categories).
(2) Query answering is done without any dedicated caching, and this is clearly a
problem. While inline queries are computed only once and stored in the parser
cache afterwards, Special:Ask has no caching facility at all. This needs to
change in the future. Targeted cache invalidation might still be difficult
and it is not clear whether the effort is needed (one could enable manual
cache clearing like for pages). A new query cache -- design, architecture and
implementation -- is needed here.
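To make the idea in (2) a bit more concrete, here is a minimal sketch in Python of a query cache with time-based expiry plus manual clearing, i.e. without any targeted invalidation. It is not SMW code and not a proposed design: the class name QueryCache, the evaluate callback and the ttl_seconds parameter are all assumptions made up for illustration.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class QueryCache:
    """Hypothetical result cache keyed by query string (not an SMW API)."""

    def __init__(self, evaluate: Callable[[str], List[str]],
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._evaluate = evaluate      # the expensive, uncached query answering
        self._ttl = ttl_seconds        # entries older than this are recomputed
        self._clock = clock            # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def ask(self, query: str) -> List[str]:
        """Return a cached result if it is fresh, otherwise evaluate and store."""
        key = query.strip()
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        result = self._evaluate(query)
        self._entries[key] = (now, result)
        return result

    def purge(self, query: Optional[str] = None) -> None:
        """Manual cache clearing: one query, or everything if none is given."""
        if query is None:
            self._entries.clear()
        else:
            self._entries.pop(query.strip(), None)
```

With something like this, a repeated request for the same query within the expiry window would not hit the database again, and an editor could still force a refresh by purging, much like purging a page.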
(3) Page formatting uses very few additional DB calls, and mainly works on the
wiki source code that was already retrieved anyway. It has no major
performance impact (see smwfParserHook in the profile).
(4) Maintenance specials can be slow, but have been designed to allow the
caching mechanism that MW uses for its maintenance specials. This is not
implemented, but it would be possible. One design decision, probably in more
cases, is whether to have transparent caching in the storage implementation,
or whether to trigger caching explicitly in the caller (which may help to not
make the storage implementation even bigger than it is now).
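The two options in (4) can be contrasted with a small Python sketch. Everything here is hypothetical: Store, get_property_usage, CachingStore and PropertiesSpecial are stand-in names, and the property names are made-up sample data, not taken from SMW.

```python
from typing import Dict, List, Optional


class Store:
    """Stand-in for a storage backend; the method name is hypothetical."""

    def __init__(self) -> None:
        self.calls = 0  # counts how often the expensive operation runs

    def get_property_usage(self) -> List[str]:
        self.calls += 1  # pretend this is an expensive database scan
        return ["Population", "Located in"]


class CachingStore(Store):
    """Option A: transparent caching inside the storage implementation."""

    def __init__(self) -> None:
        super().__init__()
        self._cached: Optional[List[str]] = None

    def get_property_usage(self) -> List[str]:
        if self._cached is None:
            self._cached = super().get_property_usage()
        return self._cached


class PropertiesSpecial:
    """Option B: the caller caches explicitly; the store stays small."""

    def __init__(self, store: Store) -> None:
        self._store = store
        self._cache: Dict[str, List[str]] = {}

    def render(self) -> List[str]:
        if "usage" not in self._cache:
            self._cache["usage"] = self._store.get_property_usage()
        return self._cache["usage"]

    def recache(self) -> None:
        """Explicit refresh, triggered by the caller rather than the store."""
        self._cache.clear()
```

Option A keeps callers unaware of caching but grows the storage class; option B leaves the store untouched and puts the refresh policy where the special page lives.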
(5) OWL/RDF export takes time, mostly depending on the export settings of
your site. The result could be cached internally in a similar way that
page-content is cached. External caches could be configured to cache RDF as
well. Yet this is not to be neglected, since a number of Semantic Web
crawlers and misguided RSS-spiders regularly visit the RDF.
(6) Special:Browse is not inefficient, but as it is a specialised form
of "What links here" it also faces similar performance issues.
Finally, SMW needs practically no time to load if it is not strictly needed.
So enabling it hardly slows down the wiki for services that need no SMW.
Summing up, the required caching facilities in order of relevance would
probably be: (2) [Queries], (4) [Specials], (5) [OWL/RDF]. I do not think
that the other parts need too much care, but analysing the current profileinfo
may yield more insights. Concerning (2), which is by far the most severe
performance problem, we have included many ways of restricting queries, so
that large sites can always switch off features until it works again (SMW is
still useful without very complex queries). At the moment this is the
suggested procedure for large sites, and we can also offer some support for
helping such sites to not experience major problems (things of course also
depend a lot on the wiki's actual structure).
> Thank you,
> On Dec 14, 2007 1:12 PM, Markus Krötzsch <firstname.lastname@example.org> wrote:
> > On Freitag, 14. Dezember 2007, Sergey Chernyshev wrote:
> > > Markus, can you elaborate on three values - what's the difference between
> > > SOME and FULL?
> > FULL is what used to be "true" in 1.0 (default)
> > NONE is what used to be "false" in all versions
> > SOME is new, but does basically what 0.7 did earlier.
> > So SOME only considers redirects for pages that appear directly in the
> > query.
> > For example, assume "r1" and "r2" are redirects to "p". Then asking
> > for "[[property::r1]]" yields the same results as asking
> > for "[[property::p]]" or "[[property::r2]]".
> > This is not too hard to do. Now FULL evaluates redirects even when
> > joining subqueries or asking for categories. As an example, assume that
> > in addition to the above there is a page "q" with annotation
> > "[[property::r1]]", and assume further that r2 is in Category2 and that
> > p is in Category3. Then each of the following queries contains "q" in
> > its result list:
> > * <ask>[[property::<q>[[Category:Category3]]</q>]]</ask>
> > * <ask>[[property::<q>[[Category:Category2]]</q>]]</ask>
> > Neither would work with SOME only. But as you can imagine, doing these
> > additional considerations about redirects at query time consumes a lot of
> > additional time (in particular since we use MW's redirect table, which is
> > not even optimised for this kind of game).
> > If you make sure that properties do not point to redirects, and that
> > redirects have no categories or properties, then SOME should always
> > suffice (I think it was discussed earlier to have a Special page for that
> > kind of maintenance).
> > -- Markus
> > > Sergey
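The NONE/SOME/FULL distinction from the quoted message can be illustrated with a toy model in Python, using the same example pages (r1 and r2 redirect to p, q is annotated with [[property::r1]], r2 is in Category2, p is in Category3). This is only a sketch of the behaviour as described in the message, not of how SMW evaluates queries: the data structures and the function names equivalents, ask_value and ask_in_category are assumptions, and redirect chains are not handled.

```python
from typing import Dict, List, Set, Tuple

# Example data from the message: r1 and r2 redirect to p.
REDIRECTS: Dict[str, str] = {"r1": "p", "r2": "p"}
CATEGORIES: Dict[str, Set[str]] = {"r2": {"Category2"}, "p": {"Category3"}}
# (subject, property, value) triples, stored exactly as written on the page.
ANNOTATIONS: List[Tuple[str, str, str]] = [("q", "property", "r1")]


def equivalents(page: str) -> Set[str]:
    """The redirect target of a page plus every page redirecting to it."""
    target = REDIRECTS.get(page, page)
    return {target} | {src for src, dst in REDIRECTS.items() if dst == target}


def ask_value(prop: str, page: str, mode: str) -> List[str]:
    """[[prop::page]]: the page appears directly in the query."""
    # SOME and FULL both resolve redirects for pages named in the query.
    wanted = {page} if mode == "NONE" else equivalents(page)
    return sorted(s for s, p, v in ANNOTATIONS if p == prop and v in wanted)


def ask_in_category(prop: str, category: str, mode: str) -> List[str]:
    """[[prop::<q>[[Category:category]]</q>]]: a join with a subquery."""
    results = set()
    for s, p, v in ANNOTATIONS:
        if p != prop:
            continue
        # Only FULL resolves redirects for values met while joining.
        candidates = equivalents(v) if mode == "FULL" else {v}
        if any(category in CATEGORIES.get(c, set()) for c in candidates):
            results.add(s)
    return sorted(results)
```

In this model, ask_value("property", "p", "SOME") finds q, while ask_in_category("property", "Category3", mode) finds q only for mode "FULL", because each stored value then has to be expanded at query time. That per-value expansion is the extra cost the message refers to.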