From: Andrzej J. T. <an...@ch...> - 2010-03-29 14:59:12
Looking to get some guidance on how big you can scale an eXist database.

Right now, our instances are about 15-25K documents, where each document is in the 25K-2M range, probably averaging around 150-200K. This results in dom.dbx = 3.5G, structure.dbx = 1.8G, collections.dbx = 4.2M and values.dbx = 155M, which is not all that large compared to some relational databases.

What if we scale up 10x, to nearly a quarter of a million documents? The file sizes still shouldn't be all that big for modern hardware, but will the performance scale linearly, or close to it, assuming a powerful enough server (say a dual-CPU, 6-core machine (12 cores, 24 native threads) with gobs of memory)?

OK... if that works, how about two orders of magnitude (100x current size)? That would give us 2.5M documents, a 250GB dom.dbx and a structure.dbx in the 180GB range. A bit too big to be practical to cache the whole structure.dbx in memory, regardless of the size of the memory in the server.

At what point do I start looking at alternative storage mechanisms (RDBMS, Hadoop, memcached, etc.) or co-operating distributed eXist instances?

Thanks for any insights from those who have pushed big databases in eXist...

--
Andrzej Taramina
Chaeron Corporation: Enterprise System Solutions
http://www.chaeron.com
From: Wolfgang M. <wol...@ex...> - 2010-03-29 16:19:17
> What if we scale up 10x to nearly a quarter of a million documents? The file sizes still shouldn't be all that big
> for modern hardware, but will the performance scale linearly or close to it, assuming a powerful enough server (say
> a dual-CPU, 6-core machine (12 cores, 24 native threads) with gobs of memory)?

I have collections with nearly half a million documents. This doesn't say much, though. The main question is what queries you intend to run and how well they can be optimized (from previous conversations, I believe there should be lots of optimization potential in your queries). A brute-force query which has to scan huge node sets may well scale worse than linearly. Throwing more hardware at eXist may help to speed up a slow query by a few percent, but you will nevertheless hit a limit at some point. Improvements by an order of magnitude are only possible by taking a very close look at the queries and how they are evaluated. Profiling is key.

I'm continuously trying to push the limit in the query engine. I still see so many things to be improved and optimized. Right now, I'm working on a facility to speed up complex "order by" expressions by using pre-computed indexes. I also just found out that some queries generate a huge number of intermediate XQuery value instances which are completely unnecessary. There's a lot more of this to look at. Those are the areas we should invest in, and I would really welcome more interest in this.

> A bit too big to be practical to cache the whole structure.dbx in memory, regardless of the size of the memory in
> the server.

eXist never caches the whole structure.dbx in memory. Doing so has a very *negative* impact on performance, as repeated tests showed. You mainly want the inner btree pages to be in memory, while the leaf pages reside on disk. So increasing the cache memory only makes sense up to a certain limit.

> At what point do I start looking at alternative storage mechanisms (RDBMS, Hadoop, memcached, etc.) or co-operating
> distributed eXist instances?

Or: use something like Hadoop or Cassandra as a backend for eXist? My recent index redesigns were preparing the ground for things like that, and I was already testing a clustered big-btree implementation with the structural index (it worked, though I was running out of time and would need to invest some more time to make it fast).

Wolfgang
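[Editor's note] Wolfgang's distinction between brute-force scans and optimizable queries can be sketched in XQuery. The collection path, element names and the collection.xconf fragment below are hypothetical, purely for illustration:

```xquery
(: Brute force: the engine has to walk every document's node tree and
   compute a string value per document before it can test the predicate. :)
for $d in collection("/db/data")//doc[contains(string(.), "W-1001")]
return $d

(: Index-friendly: an equality comparison on a specific element, which a
   configured range index can answer from its keys alone. :)
for $d in collection("/db/data")//doc[sku = "W-1001"]
return $d
```

The second form only helps if a matching index is actually configured, e.g. something along the lines of `<create qname="sku" type="xs:string"/>` inside the `<index>` element of that collection's collection.xconf; profiling each query, as Wolfgang suggests, is what confirms the index is really being used.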
From: Dmitriy S. <sha...@gm...> - 2010-03-29 16:32:04
On Mon, 2010-03-29 at 10:59 -0400, Andrzej Jan Taramina wrote:
> Looking to get some guidance on how big you can scale an eXist database. [...]
> Thanks for any insights from those that have pushed big databases in eXist...

eXist can store as much as you have hard disk space for. The main question is what eXist can do with it and what it can't. It is very good at selecting a "single" small result from that big amount of data, but the problems start as soon as you increase the number of evaluations and use "special" operations like order by.

--
Cheers,
Dmitriy Shabanov
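[Editor's note] Dmitriy's contrast, cheap pinpoint selection versus costly set-wide operations, looks roughly like this in XQuery (collection and element names are made up for the sketch):

```xquery
(: Cheap: one small result, locatable through an index-backed predicate;
   only the matching node ever needs to be touched. :)
collection("/db/orders")//order[@id = "o-42"]/total

(: Costly: every order's total must be retrieved and the whole node set
   sorted before a single item can be returned. :)
for $o in collection("/db/orders")//order
order by xs:decimal($o/total) descending
return $o/@id
```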
From: Adam R. <ad...@ex...> - 2010-03-31 20:43:30
Andrzej,

A chap on the mailing list has quite some experience of scaling eXist into the hundreds-of-gigabytes range; perhaps if you email him he could share some of his experiences with you as well: José María Fernández González, jmfernandez <at> cnb.uam.es

On 29 March 2010 15:59, Andrzej Jan Taramina <an...@ch...> wrote:
> Looking to get some guidance on how big you can scale an eXist database. [...]
> Thanks for any insights from those that have pushed big databases in eXist...
>
> _______________________________________________
> Exist-development mailing list
> Exi...@li...
> https://lists.sourceforge.net/lists/listinfo/exist-development

--
Adam Retter
eXist Developer { United Kingdom }
ad...@ex...
irc://irc.freenode.net/existdb
From: José M. F. G. <jm...@us...> - 2010-04-07 11:22:10
Hi Adam (and everybody),

yes, that's my very old e-mail! My professional e-mail is now jmf...@cn..., since the research group I'm working in moved to CNIO (the Spanish National Cancer Research Centre) three years ago. Obviously, you can also use this one :-)

My old tests (1-2 years ago) were focused on scalability at the single-document, collection and query levels. A nice improvement since then is the FT index implementation used in eXist. It has improved A LOT, because the FTI implementation prior to the Lucene index did not scale up. The eXist team has also removed many of the existing internal bottlenecks, most of them in intermediate-results processing.

But as Wolfgang has written, there are lots of possible improvements which are only needed when you are working with huge database instances. I guess intermediate results in a complex query on a huge database can still trigger an OutOfMemoryError; for instance, a sequence of a hundred thousand in-memory nodes being generated from database content. Another example: when you have to sort a huge sequence of nodes based on a complex condition (which is hopefully being addressed by Wolfgang's latest developments).

Best wishes,
	José María

On 03/31/10 22:43, Adam Retter wrote:
> Andrzej,
>
> A chap on the mailing list has quite some experience of scaling eXist into the hundreds-of-gigabytes range;
> perhaps if you email him he could share some of his experiences with you as well: José María Fernández González,
> jmfernandez <at> cnb.uam.es [...]

--
"Violence is the last refuge of the incompetent" - Salvor Hardin in Isaac Asimov's "Foundation"
"Premature optimization is the root of all evil." - Donald Knuth

José María Fernández González		e-mail: jos...@gm...
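[Editor's note] José María's OutOfMemoryError scenario, a huge sequence of in-memory nodes constructed from database content, can be sketched as follows. The names are hypothetical, and whether the first variant really materializes everything depends on how lazily the engine evaluates, but it matches the failure mode he describes:

```xquery
(: Risky on a huge instance: may build one in-memory <hit> element per
   stored record before the page is sliced out of the sequence. :)
let $all :=
    for $r in collection("/db/data")//record
    return <hit>{ string($r/title) }</hit>
return subsequence($all, 1, 20)

(: Friendlier: slice the stored nodes first, construct only one page
   of in-memory elements. :)
for $r in subsequence(collection("/db/data")//record, 1, 20)
return <hit>{ string($r/title) }</hit>
```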