I would suggest moving to the full Saxon product: the Saxon code is identical, but you will be using the Sun Java JDK 1.4 rather than the old Microsoft JVM.
 
If you pass a node-set as a parameter, Saxon will generally avoid storing the nodes in memory if it can: wherever possible it uses pipelined processing. The node-set is held internally as a "NodeSetIntent" object (in effect, an expression that can select the nodes when they are actually needed), not as a physical list of nodes. Saxon 8.0 is a lot smarter about this kind of thing than Saxon 6.5.3, so you may also like to try running your stylesheet on 8.0 (I would be interested in any comparison).
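
As a rough illustration of what "an expression that can select the nodes when they are actually needed" means, here is a minimal Java sketch. It is not Saxon's code: the class name DeferredSelection and its methods are invented for this example, and plain strings stand in for nodes. The object stores only the source and the selection test, and the matching items are produced one at a time when someone consumes them.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;
import java.util.stream.Stream;

/** Hypothetical stand-in for a deferred node-set: it stores how to select, not what was selected. */
class DeferredSelection<T> {
    private final Supplier<Stream<T>> source; // re-opens the underlying data on demand
    private final Predicate<T> test;          // the "expression" that picks the items

    DeferredSelection(Supplier<Stream<T>> source, Predicate<T> test) {
        this.source = source;
        this.test = test;
    }

    /** Nothing is scanned until the caller consumes the returned stream. */
    Stream<T> select() {
        return source.get().filter(test);
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("a1", "b1", "a2", "b2");
        // Creating the selection does no work and copies no data.
        DeferredSelection<String> subset =
                new DeferredSelection<>(nodes::stream, n -> n.startsWith("a"));
        // Only now are the items visited, one at a time.
        subset.select().forEach(System.out::println);
    }
}
```

Passing such an object as a parameter costs almost nothing, which is the point: the cost is paid only when the receiving template actually iterates over the nodes.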
 
Having said that, processing a 1GB source document is not going to be easy. I used to recommend that your real memory should be ten times the size of the source document. I've seen people get away with four times, but you shouldn't attempt it with less than that. Writing a prefilter (as a Java SAX filter) might be worthwhile.
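
A prefilter of this kind might look something like the sketch below. It assumes that the data you don't need sits in subtrees identified by a single element name; the class name DropElementFilter, the element name "debug" and the countElements helper are all invented for the example. The filter sits between the XML parser and the consumer and simply does not forward events for the unwanted subtrees, so they never reach the tree builder.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical prefilter: drops every subtree rooted at an element with the given local name. */
class DropElementFilter extends XMLFilterImpl {
    private final String dropName;
    private int skipDepth = 0; // greater than 0 while inside a dropped subtree

    DropElementFilter(XMLReader parent, String dropName) {
        super(parent);
        this.dropName = dropName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (skipDepth > 0 || localName.equals(dropName)) {
            skipDepth++; // swallow this element and track nesting
        } else {
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (skipDepth > 0) {
            skipDepth--;
        } else {
            super.endElement(uri, localName, qName);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Text inside a dropped subtree is discarded too.
        // (Comments and processing instructions are not handled in this sketch.)
        if (skipDepth == 0) {
            super.characters(ch, start, length);
        }
    }

    /** Demo helper: counts the elements that survive the filter. */
    static int countElements(String xml, String dropName) {
        final int[] count = {0};
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // guarantees localName is populated
            XMLReader filter =
                    new DropElementFilter(factory.newSAXParser().getXMLReader(), dropName);
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName,
                                         Attributes atts) {
                    count[0]++;
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("prefilter demo failed", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml = "<log><keep/><debug><x/><y/></debug><keep/></log>";
        System.out.println(countElements(xml, "debug")); // log, keep, keep survive: prints 3
    }
}
```

To use a filter like this with a transformation, you would wrap it in a javax.xml.transform.sax.SAXSource (constructed from the filter and an InputSource for the real file) and supply that as the source of the transformation, so that the stylesheet only ever sees the reduced document.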
 
Michael Kay


From: saxon-help-admin@lists.sourceforge.net [mailto:saxon-help-admin@lists.sourceforge.net] On Behalf Of Cary Millsap
Sent: 04 August 2004 17:56
To: Saxon Help List
Subject: [saxon] Pass a node-set? or an XPath expression?

I’m using Instant Saxon (thus XSL 1.1) to create an output document with the following structure:

 

            f(entire input document)
            f(subset 1 of input document)
            f(subset 2 of input document)
            …
            f(subset n of input document)

 

My input documents will be very large (1GB+ in some cases), as will some of the subsets.

 

What is the best way to pass the subset information to the template that implements f? Passing a node-set seems to be the most natural way, but I’m concerned about performance and memory consumption. Passing an XPath expression as a string and letting f do a saxon:evaluate() is another option. However, I suspect that with this option, my memory consumption will be the same as in the node-set case (unless passing a node-set creates an extra copy of the data).

 

Do you think one of these options is my best approach? Are there smarter options I should be considering?

 

Thank you very much.

 

Cary Millsap