From: Andrzej J. T. <an...@ch...> - 2010-03-29 14:59:12
Looking to get some guidance on how big you can scale an eXist database.

Right now, our instances are about 15-25K documents, where each document is in the 25K-2M range, probably averaging around 150-200K. This results in dom.dbx = 3.5G, structure.dbx = 1.8G, collections.dbx = 4.2M and values.dbx = 155M, which is not all that large compared to some relational databases.

What if we scale up 10x, to nearly a quarter of a million documents? The file sizes still shouldn't be all that big for modern hardware, but will the performance scale linearly, or close to it, assuming a powerful enough server (say a dual-CPU, 6-core machine (12 cores, 24 native threads) with gobs of memory)?

OK... if that works, how about two orders of magnitude (100x current size)? That would give us 2.5M documents, a 250GB dom.dbx and a structure.dbx in the 180GB range. A bit too big to be practical to cache the whole structure.dbx in memory, regardless of the size of the memory in the server.

At what point do I start looking at alternative storage mechanisms (RDBMS, Hadoop, memcached, etc.) or co-operating distributed eXist instances?

Thanks for any insights from those who have pushed big databases in eXist...

--
Andrzej Taramina
Chaeron Corporation: Enterprise System Solutions
http://www.chaeron.com
From: Wolfgang M. <wol...@ex...> - 2010-03-29 16:19:17
> What if we scale up 10x to nearly a quarter of a million documents? The file sizes still shouldn't be all that big
> for modern hardware, but will the performance scale linearly or close to it, assuming a powerful enough server (say
> a dual-CPU, 6-core machine (12 cores, 24 native threads) with gobs of memory)?

I have collections with nearly half a million documents. This doesn't say much, though. The main question is what queries you intend to run and how well they can be optimized (from previous conversations, I believe there should be lots of optimization potential in your queries). A brute-force query which has to scan huge node sets may well scale worse than linearly. Throwing more hardware at eXist may help to speed up a slow query by a few percent, but you will nevertheless hit a limit at some point. Improvements by an order of magnitude are only possible by taking a very close look at the queries and how they are evaluated. Profiling is key.

I'm continuously trying to push the limit in the query engine. I still see so many things to be improved and optimized. Right now, I'm working on a facility to speed up complex "order by" expressions by using pre-computed indexes. I also just found out that some queries generate a huge number of intermediate XQuery value instances which are completely unnecessary. There's a lot more of this to look at. Those are the areas we should invest in, and I would really welcome more interest in this.

> A bit too big to be practical to cache the whole structure.dbx in memory, regardless of the size of the memory in
> the server.

eXist never caches the whole structure.dbx in memory. Doing so has a very *negative* impact on performance, as repeated tests showed. You mainly want the inner btree pages to be in memory, while the leaf pages reside on disk. So increasing the cache memory only makes sense up to a certain limit.

> At what point do I start looking at alternative storage mechanisms (RDBMS, Hadoop, memcached, etc.) or co-operating
> distributed eXist instances?

Or: use something like Hadoop or Cassandra as a backend for eXist? My recent index redesigns were preparing the ground for things like that, and I was already testing a clustered big-btree implementation with the structural index (it worked, though I was running out of time and would need to invest some more time to make it fast).

Wolfgang
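[Editor's note] Wolfgang's distinction between brute-force scans and optimizable queries can be sketched in XQuery. The collection path, element names and the collection.xconf fragment below are hypothetical, purely for illustration:

```xquery
(: Brute force: the engine has to walk every document's node tree and
   compute a string value per document before it can test the predicate. :)
for $d in collection("/db/data")//doc[contains(string(.), "W-1001")]
return $d

(: Index-friendly: an equality comparison on a specific element, which a
   configured range index can answer from its keys alone. :)
for $d in collection("/db/data")//doc[sku = "W-1001"]
return $d
```

The second form only helps if a matching index is actually configured, e.g. something along the lines of `<create qname="sku" type="xs:string"/>` inside the `<index>` element of that collection's collection.xconf; profiling each query, as Wolfgang suggests, is what confirms the index is really being used.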
From: Dmitriy S. <sha...@gm...> - 2010-03-29 16:32:04
On Mon, 2010-03-29 at 10:59 -0400, Andrzej Jan Taramina wrote:
> Looking to get some guidance on how big you can scale an eXist database. [...]
> Thanks for any insights from those that have pushed big databases in eXist...

eXist can store as much as you have hard disk space for. The main question is what eXist can do with it and what it can't. It is very good at selecting a "single" small result from that big amount of data, but the problems start as soon as you increase the number of evaluations and use "special" operations like order by.

--
Cheers,
Dmitriy Shabanov
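[Editor's note] Dmitriy's contrast, cheap pinpoint selection versus costly set-wide operations, looks roughly like this in XQuery (collection and element names are made up for the sketch):

```xquery
(: Cheap: one small result, locatable through an index-backed predicate;
   only the matching node ever needs to be touched. :)
collection("/db/orders")//order[@id = "o-42"]/total

(: Costly: every order's total must be retrieved and the whole node set
   sorted before a single item can be returned. :)
for $o in collection("/db/orders")//order
order by xs:decimal($o/total) descending
return $o/@id
```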
From: Adam R. <ad...@ex...> - 2010-03-31 20:43:30
Andrzej,

A chap on the mailing list has quite some experience of scaling eXist into the hundreds-of-gigabytes range; perhaps if you email him he could share some of his experiences with you as well: José María Fernández González, jmfernandez <at> cnb.uam.es

On 29 March 2010 15:59, Andrzej Jan Taramina <an...@ch...> wrote:
> Looking to get some guidance on how big you can scale an eXist database. [...]
> Thanks for any insights from those that have pushed big databases in eXist...
>
> _______________________________________________
> Exist-development mailing list
> Exi...@li...
> https://lists.sourceforge.net/lists/listinfo/exist-development

--
Adam Retter
eXist Developer { United Kingdom }
ad...@ex...
irc://irc.freenode.net/existdb
From: José M. F. G. <jm...@us...> - 2010-04-07 11:22:10
Hi Adam (and everybody),

yes, that's my very old e-mail! My professional e-mail is now jmf...@cn..., since the research group I'm working in moved to CNIO (the Spanish National Cancer Research Centre) three years ago. Obviously, you can also use this one :-)

My old tests (1-2 years ago) were focused on scalability at the single-document, collection and query levels. A nice improvement since then is the FT index implementation used in eXist. It has improved A LOT, because the FTI implementation prior to the Lucene index did not scale up. The eXist team has also removed many of the existing internal bottlenecks, most of them in intermediate-results processing.

But as Wolfgang has written, there are lots of possible improvements which are only needed when you are working with huge database instances. I guess intermediate results in a complex query on a huge database can still trigger an OutOfMemoryError; for instance, a sequence of a hundred thousand in-memory nodes being generated from database content. Another example: when you have to sort a huge sequence of nodes based on a complex condition (which is hopefully being addressed by Wolfgang's latest developments).

Best wishes,
	José María

On 03/31/10 22:43, Adam Retter wrote:
> Andrzej,
>
> A chap on the mailing list has quite some experience of scaling eXist into the hundreds-of-gigabytes range;
> perhaps if you email him he could share some of his experiences with you as well: José María Fernández González,
> jmfernandez <at> cnb.uam.es [...]

--
"Violence is the last refuge of the incompetent" - Salvor Hardin in Isaac Asimov's "Foundation"
"Premature optimization is the root of all evil." - Donald Knuth

José María Fernández González		e-mail: jos...@gm...
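[Editor's note] José María's OutOfMemoryError scenario, a huge sequence of in-memory nodes constructed from database content, can be sketched as follows. The names are hypothetical, and whether the first variant really materializes everything depends on how lazily the engine evaluates, but it matches the failure mode he describes:

```xquery
(: Risky on a huge instance: may build one in-memory <hit> element per
   stored record before the page is sliced out of the sequence. :)
let $all :=
    for $r in collection("/db/data")//record
    return <hit>{ string($r/title) }</hit>
return subsequence($all, 1, 20)

(: Friendlier: slice the stored nodes first, construct only one page
   of in-memory elements. :)
for $r in subsequence(collection("/db/data")//record, 1, 20)
return <hit>{ string($r/title) }</hit>
```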