Re: [Exist-open] exist and performance


[I see in the meantime that a reply from Wolfgang has come in, but I will
post this anyway, even though it covers almost the same ground, just in case
viewing the same thing from a slightly different angle helps]

hakhan wrote:

[...]

> What do I do wrong? Did I reach the limitations of a native
> XML database and should I step over to an XML-enabled
> database where querying data is perhaps much faster?

I don't think it's possible to give you much useful advice at the level of
generality at which you state your problem. Certainly, >100MB per document
is pretty big compared to the size of documents used by other eXist users
known to me (although I believe there are some users with documents of such
sizes). And in general with all XML databases I've looked at, if a given
body of data can be conveniently split into many smaller documents rather
than placed in few very large ones, storage and retrieval performance
improves.
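
To make the splitting idea concrete, here is a minimal Python sketch, not
anything eXist-specific. It assumes the large document is essentially a long
run of repeated record elements; the function name split_records and the
element name "item" are made up for illustration. Each record is emitted as
its own small document, which could then be stored separately.

```python
import io
import xml.etree.ElementTree as ET

def split_records(source, record_tag):
    """Yield each <record_tag> element as its own small XML document string.

    Assumes records are not nested inside other records of the same name.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            yield ET.tostring(elem, encoding="unicode")
            # Discard the subtree's contents to limit memory use.
            elem.clear()

# Hypothetical sample standing in for one very large document.
sample = io.BytesIO(
    b"<catalog>"
    b"<item id='1'><name>a</name></item>"
    b"<item id='2'><name>b</name></item>"
    b"</catalog>"
)
parts = list(split_records(sample, "item"))
```

Each string in parts is a well-formed document in its own right. Whether the
split is "convenient" depends on whether your queries ever need to span
records.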

Do you really need the granularity of indexing and retrieval that eXist
offers? If you are dealing with what are in effect fairly flat documents,
you might be better using a fulltext retrieval system (maybe keeping the
filesystem as the repository and retrieving sub-document level sections via
a SAX parse). Lucene is probably the best-known Open Source example, but
there are others, and many proprietary ones. On the other hand, is your data
highly structured but not strongly hierarchical? In that case you might
be better storing it in a more traditional RDBMS or an OODB (you could still
generate XML for interchange purposes if that is a requirement). As for XML
=> RDBMS "adapters", most of which are commercial and expensive, I'm
personally a bit sceptical. I've not been able to do any proper testing, but
my subjective quick impression of two such systems was that they performed
best on data which could really have been better kept purely in a RDBMS in
the first place.
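
As a rough illustration of the "filesystem plus SAX parse" option mentioned
above, the following Python sketch pulls the text of sub-document sections
out of a file or stream without building the whole tree in memory. The class
name SectionGrabber and the element name "chapter" are assumptions of mine;
a real setup would first use the fulltext engine to find which file to parse.

```python
import io
import xml.sax

class SectionGrabber(xml.sax.ContentHandler):
    """Collect the text content of every element named `tag`."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0       # greater than 0 while inside a matching element
        self.buffer = []     # text chunks of the section being read
        self.sections = []   # finished sections, in document order

    def startElement(self, name, attrs):
        if name == self.tag:
            self.depth += 1

    def characters(self, content):
        if self.depth:
            self.buffer.append(content)

    def endElement(self, name):
        if name == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.sections.append("".join(self.buffer))
                self.buffer = []

def extract_sections(source, tag):
    """Parse `source` (a filename or binary stream) and return section texts."""
    handler = SectionGrabber(tag)
    xml.sax.parse(source, handler)
    return handler.sections

# Hypothetical sample standing in for a file kept on the filesystem.
doc = io.BytesIO(
    b"<book><chapter>One</chapter><chapter>Two <b>bold</b></chapter></book>"
)
sections = extract_sections(doc, "chapter")
```

Here sections holds one string per chapter, with the text of any nested
markup folded in.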

If a native XML solution is generically the right one for your needs, document
size is not the only thing to consider when comparing systems.
Performance also depends on the structure of your documents and on the
nature of your queries. As elsewhere, there is an indexing-related trade-off
between storage and retrieval times. By default, eXist indexes all the
text in the document using its full-text index. Depending on your documents
and your needs, you might get significant improvements in storage time with
no penalty in retrieval time if you could identify portions of your
documents for selective fulltext indexing, disable indexing of attribute
values if you don't need them for retrieval, etc. Or you might want to turn
off fulltext indexing altogether and use range indexes on suitable
components. A further point is that some XQueries are intrinsically more
expensive than others, and some are more expensive under one implementation
than under others.
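
The effect of selective indexing can be shown with a toy Python sketch. This
is emphatically not how eXist implements or configures its indexes; the
function build_index and the element names are invented purely to show why
indexing only the portions you search on means less work at storage time.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_index(xml_text, include=None):
    """Map each lowercased word to the set of element names containing it.

    If `include` is given, only elements with those names are indexed;
    everything else is stored but skipped by the indexer.
    """
    index = defaultdict(set)
    for elem in ET.fromstring(xml_text).iter():
        if include is not None and elem.tag not in include:
            continue
        for word in re.findall(r"\w+", elem.text or ""):
            index[word.lower()].add(elem.tag)
    return index

# Hypothetical record: a short title and a longer body.
doc = (
    "<rec><title>Fast queries</title>"
    "<body>Long body text about queries</body></rec>"
)
full = build_index(doc)                          # index everything
selective = build_index(doc, include={"title"})  # index titles only
```

The selective index has fewer entries to build and store, and a search
restricted to titles gets the same answer from either one. The price is that
words occurring only in the body can no longer be found through the index.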

However, there may well be list members who can give concrete advice on the
basis of using eXist for documents like yours, and who could comment on the
strategies embodied in the queries you are finding frustratingly slow. But
unless they know a bit more about what your documents are actually like and
what sort of retrieval needs you have, they won't be able to offer their
experience.

Michael Beddow
