From: José M. F. G. <jmf...@cn...> - 2008-06-28 13:56:25
Hi Ben,

The documents I'm working with are much bigger (from tens of MB to tens of GB each, with a mean around 200MB), but the datasets have more or less the same total volume, so I hope the following tips help you.

From my experience working with such volumes of information, first of all you need fast filesystems (not ext3, please) for your eXist instances, fast storage access (for instance, Linux software RAID0 with 4 disks performs very well) and at least 2GB for the Java VM. If you are concerned about the data-loss risk of RAID0, you should consider a hardware RAID3 solution. As http://www.acnc.com/04_01_03.html explains (and as I had the chance to test some years ago), it behaves very well on writes.

Another strategy, which I have not had the chance to test with the latest developments, is using a separate physical disk/device for the journal logs. The journal is synced very often on insertions, deletions and updates, and the performance of those operations suffers because too many disk head movements are required (journal updates on top of the insert/update/delete operations themselves). See the first sketch below.

At the database level, you must identify the scalability bottlenecks which can affect your installation, and one of them is the FTS index. As indexes are associated with collections, the size of an index depends on the content of its collection. FTS indexes do not scale up very well because, while building or using the index, an array with the ordered positions where each logged term appears must be kept in memory. On a collection with many FTS-indexed terms, insertions slow down due to continuous cache invalidations. The partial solution is to create sub-collections for the content, so each collection keeps its own, smaller copy of the index (see the second and third sketches below).

Another scalability bottleneck is recovery from a crash (e.g. after a power failure). As far as I know (correct me if I'm wrong, Wolf), eXist index files are not journaled, so when the recovery system suspects that an index file may be corrupted, it erases the index files and rebuilds them. With a volume of tens or hundreds of GB, that can take almost as long as inserting the whole database content again.

Last, but not least, are the queries you are going to issue. eXist performs very well when it can use qname-based (range or FTS) indexes for the queries (see the last sketch below). Otherwise, when you query a huge volume of information, memory usage can grow too much due to the creation of intermediate node fragments, and the feared OutOfMemoryError is fired. The latest patches committed to the stable and trunk branches mitigate that problem, but I have not had the chance to test them.
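To make the journal point concrete: in the eXist versions I have checked, the journal location can be moved through the <recovery> element in conf.xml. A minimal sketch; the /journal-disk/exist path is just an example mount point, and you should verify the attribute names against the conf.xml shipped with your version:

    <!-- conf.xml (excerpt): keep the journal on its own spindle so
         journal syncs do not compete with data-file head movements -->
    <recovery enabled="yes"
              sync-on-commit="yes"
              group-commit="no"
              size="100M"
              journal-dir="/journal-disk/exist"/>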
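For the index side, this is roughly what a per-collection configuration looks like. It lives under /db/system/config, mirroring the path of the data collection; the qnames (title, author) and the collection path are made up for the example:

    <!-- collection.xconf, e.g. stored as
         /db/system/config/db/library/part0/collection.xconf -->
    <collection xmlns="http://exist-db.org/collection-config/1.0">
      <index>
        <!-- qname-based FTS index restricted to title elements -->
        <fulltext default="none" attributes="no">
          <create qname="title" content="mixed"/>
        </fulltext>
        <!-- qname-based range index for exact matches on author -->
        <create qname="author" type="xs:string"/>
      </index>
    </collection>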
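And a minimal sketch of the sub-collection strategy, using the xmldb extension module. The 16-bucket scheme, the /db/library root and the function name are all invented for the example:

    declare namespace xmldb = "http://exist-db.org/xquery/xmldb";

    (: Spread documents over sub-collections so that each one
       keeps its own, smaller FTS index :)
    declare function local:store-sharded($name as xs:string,
                                         $doc as node()) as xs:string {
        (: derive a bucket (0..15) from the first character of the name :)
        let $bucket := string-to-codepoints($name)[1] mod 16
        let $sub := concat("part", $bucket)
        return (
            (: assumed harmless when the sub-collection already exists :)
            xmldb:create-collection("/db/library", $sub),
            xmldb:store(concat("/db/library/", $sub), $name, $doc)
        )[last()]
    };

A call like local:store-sharded("doc42.xml", $doc) then lands the document in one of /db/library/part0 .. /db/library/part15, each with its own index files.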
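Finally, a query shaped so that those qname-based indexes can actually be used. The element names match the made-up collection.xconf above, and &= is the classic FTS operator ("all terms must match"):

    (: both predicates can be answered from the qname-based indexes,
       so eXist does not have to materialize intermediate fragments :)
    for $a in collection("/db/library")//article[title &= "huge data"]
                                                [author = "Smith"]
    return $a/title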
Best Regards,
	José María

Ben Bangert wrote:
> What's the current best strategy when it comes to scaling eXist to
> handle huge amounts of data and throughput? I'm currently only storing
> around 15-25gb of data, but it's expanding at a decent pace and I'm
> realistically looking at upwards of a terabyte of XML data in the next
> 6 months.
>
> Is there anyone currently storing that much in eXist, or close to it?
>
> I know in the past that eXist generally scaled up to about 20gb or so,
> so I figure I could always shard, since my dataset does split well into
> groupings that will likely be 20-25gb each. Though it'd be easier to
> manage with fewer shards of course, so being able to store 100gb or so
> per server would be more manageable.
>
> Regarding data, my XML documents generally range in size from 15KB to
> about 5MB, with about 3% of my XML documents being as large as 50MB.
> eXist performs great so far, I'm just wondering how far others have
> taken it, and what their experience has been.
>
> Thanks,
> Ben