I'm trying to do the following:
I want a list of Elements from any XML file. Each Element should only be
listed once and only the one with the longest String should be listed.
Since the files I'll be transforming can be around a gigabyte, Streaming is
Here is what I came up with:
xmlns:xd="[url]http://www.oxygenxml.com/ns/doc/xsl[/url]" version="2.0" xmlns:saxon="[url]http://saxon.sf.net/[/url]"
<xsl:variable name="var" saxon:assignable="yes">
<xsl:variable name="ns" select="namespace-uri(.)"/>
<xsl:with-param name="ns" select="$ns"/>
<xsl:variable name="text" select="."/>
<xsl:variable name="length" select="string-length($text)"/>
<xsl:variable name="elem" select="name()"/>
<xsl:when test="not($var/elements/*[name()=$elem][string-length() > $length])">
I have never used the "Global Variable" concept before since I didn't know it
existed but I always missed that in XSLT compared to other Languages.
I'm running that code in Oxygen with Saxon9ee and with a 10mb sized file it
allready takes about 150s, so I'm guessing with a 1GB sized file it must take
Any Idea how to speed that up?
You are most definitely on the wrong track. Your usage of Saxon-specific
programming constructs coming from procedural programming seems to be
completely unnecessary. You just want to check each element node for the
length of text content and its name. This could be done using recursion, I
Using <xsl:template match="*"> you could visit each element node, and using
<xsl:param ...="">/<xsl:with-param ...=""> you could pass-on the names of already
Your chances to get a good hint into the correct direction are bigger in the
regular xsl-list, therefore you should try to build a solution without Saxon
I wouldn't have a Problem at all using tools from the standard xsl-list, my
Problem is the large filesize.
When I do a <xsl:template match="*"> on a 1GB file wouldn't I still be
building the DOM-tree?
Which I thought would be slower than streaming, if possible at all with the
available memory on my machine.
To get a streamed transformation here, you do indeed need the saxon:iterate
extension (which anticipates xsl:iterate in the XSLT 2.1 specification.
However, you don't need saxon:assign - the idea is that with saxon:iterate,
you can set the parameter values for the next iteration using xsl:with-param
within saxon:continue, and pick up these new values in an xsl:param nested
within the saxon:iterate call.
However, this won't directly help with the performance problem, which is
caused by the large xsl:copy-of used to copy the intermediate results on each
iteration (well, on many iterations). The problem is that the intermediate
results grow larger and larger. One solution is to call out to Java and use a
mutable HashMap to hold a mapping from element names to the size largest
element of that name so far encountered. Or you could use your current
approach, but modified so that you never have more than one value for each
element name in the intermediate results: this means that instead of simply
appending a new element to the results, you should do a transformation on the
histogram document to replace the element holding the previous largest value
by an element holding the new largest value. This way the size of the copy is
bounded by the number of distinct element names in the document vocabulary,
and tassuming random data the number of times the structure is copied is
statistically 1/1 + 1/2 + 1/3 + 1/4 ... 1/n where n is the number of element
instances in the source; this rises as n increases, but only slowly, so the
overall performance might be effectlvely linear.
There's another problem here that I didn't spot yesterday. Your selection
doc()// includes all elements in the document including the outermost
element. When you do // it needs to evaluate the string value of the element,
which for the outermost element will be about a gigabyte in size. I think that
what you are actually trying to do is to select elements that have no element
children - which isn't easy either in streaming mode. But a good start might
be to exclude the outermost element from the search, so it becomes ///.
I spotted that last one myself when I looked at the code again yesterday. :)
Thanks a lot for your insight though, I'll try to avoid the copy-of and modify
my current approach the way you suggested, since this is also kind of a proof
of concept for me I wanted to stick to xslt.
I find that simple sounding problem quite interesting looking at it from an
Log in to post a comment.
Sign up for the SourceForge newsletter:
You seem to have CSS turned off.
Please don't fill out this field.