From: José M. F. G. <jmf...@cn...> - 2008-06-28 13:56:25
Hi Ben,

The documents I'm working with are much bigger (from tens of MB to tens of GB each, with a mean around 200MB), but the datasets have more or less the same total volume, so I hope the following tips help you.

From my experience working with such volumes of information, first of all you need fast filesystems (not ext3, please) for your eXist instances, fast storage access (for instance, Linux software RAID0 with 4 disks performs very well) and at least 2GB for the Java VM. If you are concerned about the data-loss risk of RAID0, you should consider a hardware RAID3 solution. As http://www.acnc.com/04_01_03.html explains (and as I had the chance to test some years ago), it behaves very well on writes.

Another strategy, which I have not had the chance to test with the latest developments, is using a separate physical disk/device for the journal logs. The journal is synced very often on insertions, deletions and updates, and the performance of those operations suffers because too many disk head movements are required (journal updates on top of the insert/update/delete operations themselves). See the first sketch below.

At the database level, you must identify the scalability bottlenecks which can affect your installation, and one of them is the FTS index. As indexes are associated with collections, the size of an index depends on the content of its collection. FTS indexes do not scale up very well because, while building or using the index, an array with the ordered positions where each logged term appears must be kept in memory. On a collection with many FTS-indexed terms, insertions slow down due to continuous cache invalidations. The partial solution is to create sub-collections for the content, so each collection keeps its own, smaller copy of the index (see the second and third sketches below).

Another scalability bottleneck is recovery from a crash (e.g. after a power failure). As far as I know (correct me if I'm wrong, Wolf), eXist index files are not journaled, so when the recovery system suspects that an index file may be corrupted, it erases the index files and rebuilds them. With a volume of tens or hundreds of GB, that can take almost as long as inserting the whole database content again.

Last, but not least, are the queries you are going to issue. eXist performs very well when it can use qname-based (range or FTS) indexes for the queries (see the last sketch below). Otherwise, when you query a huge volume of information, memory usage can grow too much due to the creation of intermediate node fragments, and the feared OutOfMemoryError is fired. The latest patches committed to the stable and trunk branches mitigate that problem, but I have not had the chance to test them.
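To make the journal point concrete: in the eXist versions I have checked, the journal location can be moved through the <recovery> element in conf.xml. A minimal sketch; the /journal-disk/exist path is just an example mount point, and you should verify the attribute names against the conf.xml shipped with your version:

    <!-- conf.xml (excerpt): keep the journal on its own spindle so
         journal syncs do not compete with data-file head movements -->
    <recovery enabled="yes"
              sync-on-commit="yes"
              group-commit="no"
              size="100M"
              journal-dir="/journal-disk/exist"/>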
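For the index side, this is roughly what a per-collection configuration looks like. It lives under /db/system/config, mirroring the path of the data collection; the qnames (title, author) and the collection path are made up for the example:

    <!-- collection.xconf, e.g. stored as
         /db/system/config/db/library/part0/collection.xconf -->
    <collection xmlns="http://exist-db.org/collection-config/1.0">
      <index>
        <!-- qname-based FTS index restricted to title elements -->
        <fulltext default="none" attributes="no">
          <create qname="title" content="mixed"/>
        </fulltext>
        <!-- qname-based range index for exact matches on author -->
        <create qname="author" type="xs:string"/>
      </index>
    </collection>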
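And a minimal sketch of the sub-collection strategy, using the xmldb extension module. The 16-bucket scheme, the /db/library root and the function name are all invented for the example:

    declare namespace xmldb = "http://exist-db.org/xquery/xmldb";

    (: Spread documents over sub-collections so that each one
       keeps its own, smaller FTS index :)
    declare function local:store-sharded($name as xs:string,
                                         $doc as node()) as xs:string {
        (: derive a bucket (0..15) from the first character of the name :)
        let $bucket := string-to-codepoints($name)[1] mod 16
        let $sub := concat("part", $bucket)
        return (
            (: assumed harmless when the sub-collection already exists :)
            xmldb:create-collection("/db/library", $sub),
            xmldb:store(concat("/db/library/", $sub), $name, $doc)
        )[last()]
    };

A call like local:store-sharded("doc42.xml", $doc) then lands the document in one of /db/library/part0 .. /db/library/part15, each with its own index files.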
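Finally, a query shaped so that those qname-based indexes can actually be used. The element names match the made-up collection.xconf above, and &= is the classic FTS operator ("all terms must match"):

    (: both predicates can be answered from the qname-based indexes,
       so eXist does not have to materialize intermediate fragments :)
    for $a in collection("/db/library")//article[title &= "huge data"]
                                                [author = "Smith"]
    return $a/title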
Best Regards,
	José María

Ben Bangert wrote:
> What's the current best strategy when it comes to scaling eXist to
> handle huge amounts of data and throughput? I'm currently only storing
> around 15-25gb of data, but it's expanding at a decent pace and I'm
> realistically looking at upwards of a terabyte of XML data in the next
> 6 months.
>
> Is there anyone currently storing that much in eXist, or close to it?
>
> I know in the past that eXist generally scaled up to about 20gb or so,
> so I figure I could always shard, since my dataset does split well into
> groupings that will likely be 20-25gb each. Though it'd be easier to
> manage with fewer shards of course, so being able to store 100gb or so
> per server would be more manageable.
>
> Regarding data, my XML documents generally range in size from 15KB to
> about 5MB, with about 3% of my XML documents being as large as 50MB.
> eXist performs great so far, I'm just wondering how far others have
> taken it, and what their experience has been.
>
> Thanks,
> Ben