First off, I suggest making sure you are allocating the maximum amount of VM available to your Java process (you haven't said whether this is .NET or Java).

If it's Java, the most you'll likely get on a 32-bit machine is 2 GB, but on a 64-bit machine you can get much larger. Java will give you "out of memory" errors even if there is still memory free on your machine if you don't launch it with the right arguments. Look at the -Xmx argument; e.g., -Xmx1024m gives Java 1 GB of RAM to play with.
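For example, a command line with a larger heap might look like `java -Xmx1024m -jar saxon9ee.jar -s:big.xml -xsl:transform.xsl` (the jar and file names here are placeholders for whatever you actually run). A quick sanity check that your -Xmx setting is taking effect, as a minimal sketch (the class name is mine):

```java
// Prints the maximum heap this JVM will use; run it with and without
// -Xmx to confirm the setting is actually being picked up.
public class MaxHeap {
    public static void main(String[] args) {
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap: " + maxMb + " MB");
    }
}
```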

 

Then, as Jason says, a 10x multiplier is about average; it could be less, could be more.

But even with the streaming options in Saxon EE, not everything streams automatically; only some operations do. So don't expect to magically be able to run your 5 GB XML file in 1 GB of RAM even with Saxon EE.

 

Preprocessing the file into smaller files is often your best bet (or using an XML database, but that's a whole other story).

There are lots of XML Splitters out there that work in streaming mode.

My favorite, of course <author bias>, is xsplit, which comes with xmlsh.

 

http://www.xmlsh.org/CommandXsplit

 

But googling around, you will find many, and you likely won't have to write your own.

Most large XML files that I've had to work with are amenable to splitting, as they are typically a single root element wrapping lots of documents. If you don't need to do cross-searching and can operate on a smaller piece at a time, this works. Otherwise you may need to look into an XML database designed for handling large collections of data.
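If you do end up rolling your own, the technique is straightforward with a streaming pull parser. Here is a minimal sketch using StAX (javax.xml.stream, part of the JDK) that copies each direct child of the root element into its own file without ever building a tree in memory; it illustrates the general approach, not how xsplit itself works, and the file-naming scheme is mine:

```java
import javax.xml.stream.*;
import javax.xml.stream.events.XMLEvent;
import java.io.*;

// Streaming splitter sketch: each direct child of the root element
// is written to its own file (prefix1.xml, prefix2.xml, ...).
public class SimpleSplitter {

    public static int split(Reader in, String prefix) throws Exception {
        XMLInputFactory inFactory = XMLInputFactory.newInstance();
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        XMLEventReader reader = inFactory.createXMLEventReader(in);
        XMLEventWriter writer = null;
        int depth = 0, chunks = 0;
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                depth++;
                if (depth == 2) { // direct child of the root: start a new chunk
                    chunks++;
                    writer = outFactory.createXMLEventWriter(
                            new FileWriter(prefix + chunks + ".xml"));
                }
            }
            if (writer != null && depth >= 2) {
                writer.add(event); // copy everything inside the current chunk
            }
            if (event.isEndElement()) {
                if (depth == 2) { // chunk finished: flush and close its file
                    writer.close();
                    writer = null;
                }
                depth--;
            }
        }
        reader.close();
        return chunks;
    }

    public static void main(String[] args) throws Exception {
        split(new FileReader(args[0]), "chunk-");
    }
}
```

Because only one event is held at a time, memory use stays flat no matter how large the input is; you can then run your real XSLT over each chunk file separately.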

----------------------------------------

David A. Lee

dlee@calldei.com

http://www.xmlsh.org

 

From: Jason Smith [mailto:jsmith@infotrustgroup.com]
Sent: Saturday, February 25, 2012 12:07 PM
To: Mailing list for the SAXON XSLT and XQuery processor
Subject: Re: [saxon] Running out of memory

 

http://saxonica.com/feature-matrix.html

 

According to the latest feature matrix, you'll need to get Saxon EE or EE-T if you want to do this purely with XSLT.  Search on "streaming facilities."

 

XML documents processed into DOMs are really big. Expect 10, or even 20, times the size of the original input document in RAM.

 

If you don't have EE or EE-T, or if you are just looking for a challenge, you can use SAX or one of the pull parsers (e.g., XPP) to make an initial streaming pass over the large document and break it into small chunks.

 

Jason Smith
Software Engineer

InfoTrust Group, Inc.

500 Discovery Parkway, Suite 200
Superior, CO 80027
Email jsmith@infotrustgroup.com

WEB www.infotrustgroup.com


 


From: Mark Rubelmann [mrubelmann@gmail.com]
Sent: Saturday, February 25, 2012 8:15 AM
To: saxon-help@lists.sourceforge.net
Subject: [saxon] Running out of memory

Hi,

 

I'm *hoping* to process some big XML files with XSLT, but Saxon is throwing an out-of-memory exception when the input is only around 350 or 400 MB.  I understand that XSLT needs to have the whole input tree in memory, but I don't understand why it's failing with such a [relatively] small input.  Is the loaded tree like 10x bigger than the source XML?  Is there anything I can do to alleviate the problem?  I don't have a really solid requirement yet, but I'd feel a lot better if it could handle two gigs.

 

Thanks,

Mark