From: Wolfgang M. <wol...@ex...> - 2010-03-29 16:19:17
> What if we scale up 10x to nearly quarter of a million documents? The file sizes still shouldn't be all that big for
> modern hardware, but will the performance scale linearly or close to it, assuming a powerful enough server (say a
> dual-cpu, 6-Core machine (12 cores, 24 native threads) with gobs of memory)?

I have collections with nearly half a million documents. This doesn't say much. The main question is what queries you intend to run and how well they can be optimized (from previous conversations, I believe you should have lots of optimization potential in your queries). A brute force query which has to scan huge node sets may well scale worse than linearly.

Throwing more hardware at eXist may help to speed up a slow query by a few percent, but you will nevertheless hit a limit at some point. Improvements by an order of magnitude are only possible by having a very close look at the queries and how they are evaluated. Profiling is key.

I'm continuously trying to push the limit in the query engine. I still see so many things to be improved and optimized. Right now, I'm working on a facility to speed up complex "order by" expressions by using pre-computed indexes. I also just found out that some queries generate a huge number of intermediate XQuery value instances which are completely unnecessary. There's a lot more of this to look at. Those are the areas we should invest in and I would really welcome more interest in this.

> Bit too big or practical to cache the whole structure.dbx in memory,
> regardless of the size of the memory in the server.

eXist never caches the whole structure.dbx in memory. Doing so has a very *negative* impact on performance, as repeated tests showed. You mainly want the inner btree pages to be in memory, while the leaf pages reside on disk. So increasing the cache memory only makes sense up to a certain limit.

> At what point do I start looking at alternative storage mechanisms, (RDBMS, Hadoop, memcached, etc.) or co-operating
> distributed eXist instances?

Or: use something like Hadoop or Cassandra as backend for eXist? My recent index redesigns were preparing the ground for things like that and I was already testing a clustered big btree implementation with the structural index (it worked, though I was running out of time and would need to invest some more time to make it fast).

Wolfgang
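
To put rough numbers on the point about inner btree pages: a back-of-the-envelope sketch in Python, assuming an idealized, fully packed B+-tree. The function name, the fan-out and the keys-per-leaf figures are hypothetical values chosen for illustration; they are not eXist's actual page layout or settings.

```python
import math


def btree_page_counts(num_keys: int, keys_per_leaf: int, fanout: int):
    """Return (leaf_pages, inner_pages, height) for an idealized,
    fully packed B+-tree. All parameters are illustrative assumptions,
    not values taken from eXist."""
    leaf_pages = math.ceil(num_keys / keys_per_leaf)
    inner_pages = 0
    level = leaf_pages  # number of pages on the level currently being indexed
    height = 1          # a tree with a single leaf page has height 1
    while level > 1:
        # Each inner page points at up to `fanout` pages of the level below.
        level = math.ceil(level / fanout)
        inner_pages += level
        height += 1
    return leaf_pages, inner_pages, height


if __name__ == "__main__":
    leaves, inner, height = btree_page_counts(
        num_keys=1_000_000, keys_per_leaf=100, fanout=100
    )
    share = inner / (inner + leaves)
    print(f"leaf pages:  {leaves}")
    print(f"inner pages: {inner}")
    print(f"height:      {height}")
    print(f"inner pages as share of all pages: {share:.2%}")
```

With these assumed numbers the inner pages come to 101 out of 10,101 pages, so keeping them in memory costs very little compared with caching the whole file, while every lookup still needs at most one leaf page read from disk. Once the cache already holds the inner pages, adding more memory brings much less benefit.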