Re: [Exist-development] How big can you supersize this puppy?

> What if we scale up 10x to nearly quarter of a million documents?  The file sizes still shouldn't be all that big for
> modern hardware, but will the performance scale linearly or close to it, assuming a powerful enough server (say a
> dual-cpu, 6-Core machine (12 cores, 24 native threads) with gobs of memory)?

I have collections with nearly half a million documents, but the
document count alone doesn't say much.

The main question is what queries you intend to run and how well they
can be optimized (from previous conversations, I believe you should
have lots of optimization potential in your queries). A brute force
query which has to scan huge node sets may well scale worse than
linearly. Throwing more hardware at eXist may help to speed up a slow
query by a few percent, but you will nevertheless hit a limit at some
point. Improvements by an order of magnitude are only possible by
having a very close look at the queries and how they are evaluated.
Profiling is key.
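
To illustrate the difference between a brute force scan and an optimized
lookup, here is a minimal Python sketch. It is not eXist code: the data
layout, the function names and the "author" field are all hypothetical,
chosen only to show why one approach touches every item and the other
does not.

```python
from collections import defaultdict

# Hypothetical toy data: (doc_id, author) pairs standing in for node values.
def make_docs(n):
    return [(i, f"author{i % 100}") for i in range(n)]

def scan(docs, author):
    """Brute force: inspects every document, so cost grows with collection size."""
    return [doc_id for doc_id, a in docs if a == author]

def build_index(docs):
    """One-off cost: map each value to the ids of the documents containing it."""
    index = defaultdict(list)
    for doc_id, a in docs:
        index[a].append(doc_id)
    return index

def lookup(index, author):
    """Indexed: cost depends on the size of the result, not of the collection."""
    return index.get(author, [])

docs = make_docs(250_000)
index = build_index(docs)
assert scan(docs, "author7") == lookup(index, "author7")
```

Both functions return the same ids, but only scan() does work proportional
to the whole collection on every call. More cores do not change that.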

I'm continuously trying to push the limit in the query engine. I still
see so many things to be improved and optimized. Right now, I'm
working on a facility to speed up complex "order by" expressions by
using pre-computed indexes. I also just found out that some queries
generate a huge number of intermediate XQuery value instances which
are completely unnecessary. There's a lot more of this to look at.
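
The general idea behind a pre-computed ordering can be sketched as
follows. This is only an illustration in Python under my own assumptions,
not the facility being built in eXist: expensive_key, build_sort_index and
order_by are hypothetical names, and the documents are made-up dicts.

```python
def expensive_key(doc):
    # Stand-in for a complex "order by" expression evaluated per item.
    return (doc["year"], doc["title"].lower())

def build_sort_index(docs):
    """Precompute each document's rank under the ordering, keyed by id."""
    ordered = sorted(docs, key=expensive_key)
    return {doc["id"]: rank for rank, doc in enumerate(ordered)}

def order_by(result_ids, sort_index):
    """Order a query result using stored ranks only; no key is recomputed."""
    return sorted(result_ids, key=sort_index.__getitem__)

docs = [
    {"id": 1, "year": 2010, "title": "Zebra"},
    {"id": 2, "year": 2009, "title": "apple"},
    {"id": 3, "year": 2010, "title": "Aardvark"},
]
sort_index = build_sort_index(docs)
print(order_by([1, 3], sort_index))
```

The key expression is evaluated once per document when the index is built.
At query time, ordering a result set is reduced to comparing stored
integers.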

Those are the areas we should invest in, and I would really welcome
more interest in this.

> Bit too big or practical to cache the whole structure.dbx in memory,
> regardless of the size of the memory in the server.

eXist never caches the whole structure.dbx in memory. Doing so has a
very *negative* impact on performance, as repeated tests showed. You
mainly want the inner btree pages to be in memory, while the leaf
pages reside on disk. So increasing the cache memory only makes
sense up to a certain limit.
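
A rough calculation shows how small the inner pages are compared to the
leaves. The sketch below assumes an idealised, fully packed B+-tree-style
layout. The key count and the fanout of 200 are made-up figures, not
eXist's actual page layout, and btree_page_counts is a hypothetical helper.

```python
import math

def btree_page_counts(num_keys, fanout):
    """Return (leaf_pages, inner_pages) for an idealised, fully packed tree."""
    leaf_pages = math.ceil(num_keys / fanout)
    inner_pages = 0
    level = leaf_pages
    # Each level above the leaves needs one page per `fanout` pages below it.
    while level > 1:
        level = math.ceil(level / fanout)
        inner_pages += level
    return leaf_pages, inner_pages

# Assumed figures for illustration only.
leaves, inner = btree_page_counts(100_000_000, 200)
print(leaves, inner, inner / leaves)
```

Under these assumptions the inner pages come to well under one percent of
the leaf pages, so a modest cache already holds all of them. Memory beyond
that point is spent on leaf pages.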

> At what point do I start looking at alternative storage mechanisms, (RDBMS, Hadoop, memcached, etc.) or co-operating
> distributed eXist instances?

Or: use something like Hadoop or Cassandra as a backend for eXist? My
recent index redesigns were preparing the ground for things like that
and I was already testing a clustered big btree implementation with
the structural index (it worked, though I was running out of time and
would need to invest some more time to make it fast).

Wolfgang
