It looks viable to me.
Another approach, which we use in the stylesheets for building the
XSLT/XQuery specs, is to run one stylesheet over the source which constructs
a document holding index information, which is stored on disk and then used
by later tasks in the processing pipeline (which in this case is controlled
by Ant). This means you only need to do the index-building when something
has changed.
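A minimal sketch of what such an index-building stylesheet could look like for your case follows. It is not the stylesheet we use for the specs. It assumes $sources is supplied as a sequence of atom:link elements, as in your description; the index and entry element names and the build-index template name are made up for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="atom saxon">

  <!-- Assumption: the list of source files, one atom:link per file -->
  <xsl:param name="sources" as="element(atom:link)*"/>

  <!-- Hypothetical entry point: writes one <entry> per source file,
       pairing the file's atom:id with its href -->
  <xsl:template name="build-index">
    <index>
      <xsl:for-each select="$sources">
        <!-- discard-document returns its argument, so the path can
             continue from it; the document becomes eligible for release -->
        <entry href="{@href}"
               id="{saxon:discard-document(doc(@href))/atom:entry/atom:id}"/>
      </xsl:for-each>
    </index>
  </xsl:template>
</xsl:stylesheet>
```

A later task in the pipeline could then read the stored index with doc() and look up an href with something like index/entry[@id = $id]/@href, without opening any of the source files.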
You could save half the memory by not copying the documents to create a
temporary tree, but instead just having a variable that holds references to
the documents.
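As a rough sketch of that idea, assuming $sources is a sequence of atom:link elements and each file has a single atom:entry with one atom:id, as in your examples (the variable name source-docs is just illustrative):

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:atom="http://www.w3.org/2005/Atom"
    exclude-result-prefixes="atom">

  <!-- Assumption: the list of source files, one atom:link per file -->
  <xsl:param name="sources" as="element(atom:link)*"/>

  <!-- A sequence of the document nodes returned by doc();
       no copy of their content is made -->
  <xsl:variable name="source-docs" as="document-node()*"
                select="for $l in $sources return doc($l/@href)"/>

  <xsl:template match="atom:link[starts-with(@href, 'forthcoming:')]">
    <xsl:variable name="id" select="substring-after(@href, 'forthcoming:')"/>
    <!-- Look the entry up directly in the original documents -->
    <xsl:copy-of select="$source-docs/atom:entry[atom:id eq $id]"/>
  </xsl:template>
</xsl:stylesheet>
```

The documents themselves are still all held in memory with this version; it only avoids the second copy in the temporary tree.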
Rather than
<xsl:copy-of select="for $d in doc($href) return ($d,
saxon:discard-document($d))"/>
I would suggest
<xsl:copy-of select="saxon:discard-document(doc($href))"/>
It looks odd, but discard-document is really just a way to say "release it
as soon as you can". In fact, your code will give you two copies of each
document, because discard-document returns the value of its argument, so
both $d and the result of the call end up in the sequence. The same change
applies to the other call, in hpp:atom-id.
In fact, though, as you're using Saxon-SA, I think you can do
<xsl:copy-of select="doc($href)" saxon:read-once="yes"/>
which doesn't need the discard: the tree isn't built in memory, so there's
nothing to discard. (I haven't actually checked that this simple use of
saxon:read-once works, but it should.)
You could go a step further and do
<xsl:value-of select="unparsed-text($href)" disable-output-escaping="yes"/>
which bypasses the parsing and serialization as well as the tree-building,
though it could give trouble if the files contain XML or DOCTYPE
declarations, since those would be copied verbatim into the middle of the
output.
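One possible way around the XML declaration part of that problem is to strip it with replace() before writing the text out. This is an untested sketch: the inline-raw template name is made up, and it does nothing about DOCTYPE declarations, which are harder to remove reliably with a regular expression.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Assumption: URI of the file to inline -->
  <xsl:param name="href" as="xs:string"/>

  <!-- Hypothetical template: emits the file's text unescaped, minus a
       leading XML declaration if there is one -->
  <xsl:template name="inline-raw">
    <xsl:value-of disable-output-escaping="yes"
        select="replace(unparsed-text($href),
                        '^&lt;\?xml\s[^?]*\?>\s*', '')"/>
  </xsl:template>
</xsl:stylesheet>
```

The pattern is anchored at the start of the string, so it only removes a declaration that is the very first thing in the file.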
Michael Kay
http://www.saxonica.com/
> -----Original Message-----
> From: saxon-help-bounces@...
> [mailto:saxon-help-bounces@...] On Behalf
> Of James A. Robinson
> Sent: 25 October 2007 13:55
> To: Saxon Help
> Subject: [saxon] q about discard-document
>
>
> Hi folks,
>
> I'd like to ask folks who have experience using the saxon
> extension {http://saxon.sf.net/}discard-document about a
> scenario I have in mind and whether or not they think there
> might be any benefit at all to using this extension.
>
> I have a stylesheet which is in effect performing a merge of
> around 60 documents, producing a 35 megabyte document. There
> is an initial document being fed into the stylesheet, and it is
> effectively a list of ids, e.g.,
>
> <atom:entry xmlns:atom="http://www.w3.org/2005/Atom">
> ...
> <atom:link href="forthcoming:123456"/>
> </atom:entry>
>
> There is also a parameter, named $sources, being passed in
> which lists a number of files:
>
> <atom:link xmlns:atom="http://www.w3.org/2005/Atom"
> href="file:/some/path/to/a/file"/>
>
> My stylesheet is taking the 'forthcoming:123456' and
> comparing it to atom:id values it extracts from files listed
> in $sources, e.g.,
>
> doc("file:/some/path/to/a/file")/atom:entry/atom:id
>
> Right now the stylesheet is simply walking through the
> $sources list and actually building a temporary tree where it
> is expanding all the atom:link elements using doc(@href).
> This gives me an enormous variable which I can use to resolve
> things like forthcoming:123456 into the appropriate
> $sources/atom:entry[atom:id eq '123456'].
>
> The problem is that this naive approach is very memory
> intensive, and I can see a point in the future where we might
> run into problems where there is just too much content to
> build the temporary tree.
>
> I've been thinking about building an alternative stylesheet
> to take advantage of {http://saxon.sf.net/}discard-document.
> Before I tackled it I wanted to ask if people here had
> thoughts about whether or not it would be worthwhile.
>
> An underlying contract of the operation is that the files
> listed in $sources will not change while my stylesheet is
> running, so I'm not worried about having to read the document twice.
>
> Basically I am thinking I can perform this work in two
> passes. The first pass will extract just the atom:id value
> from each file in $sources, the second pass would hopefully
> operate in a streaming fashion, where each file I inline
> would be read once, copied to the output tree, and discarded
> from memory as the next article is processed.
>
> I was thinking I'd need to apply the following techniques to
> the problem. One is a function to extract the id from a $sources
> document:
>
> <xsl:function name="hpp:atom-id" as="xs:token?">
> <xsl:param name="href" as="xs:string"/>
> <xsl:variable name="id" as="xs:string+"
> select="for $d in doc($href) return
> (xs:token($d/atom:entry/atom:id),
> saxon:discard-document($d))"/>
> <xsl:sequence select="xs:token($id[1])"/>
> </xsl:function>
>
> and the other is to build a template which will inline the
> document during the expansion phase, where I'm looking up the
> href for the ids I extracted previously and then copying the
> document it points to into the output tree wholesale:
>
> <xsl:template match="atom:link[starts-with(@href, 'forthcoming:')]">
> <xsl:variable name="href"
> select="hpp:href-for-id(substring-after(@href,
> 'forthcoming:'))"/>
> <xsl:copy-of
> select="for $d in doc($href) return ($d,
> saxon:discard-document($d))"/>
> </xsl:template>
>
> I'd welcome any warnings or other advice people could share
> about whether or not this seems like a workable technique to
> keep the memory usage down.
>
>
> Jim
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> James A. Robinson jim.robinson@...
> Stanford University HighWire Press http://highwire.stanford.edu/
> +1 650 7237294 (Work) +1 650 7259335 (Fax)
>