Retrieving "largest" Element...

  • Tobias Klevenz

    Tobias Klevenz - 2010-07-12


    I'm trying to do the following:

    I want a list of elements from an arbitrary XML file. Each element name
    should be listed only once, keeping the occurrence with the longest string
    value. Since the files I'll be transforming can be around a gigabyte,
    streaming is pretty much mandatory.
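    To illustrate (made-up sample): given an input like

    ```xml
    <root>
        <a>hi</a>
        <b>hello world</b>
        <a>hello</a>
    </root>
    ```

    the result should contain each element name exactly once, each with its
    longest text: <b>hello world</b> and <a>hello</a>.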

    Here is what I came up with:

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:saxon="http://saxon.sf.net/"
        exclude-result-prefixes="xs saxon">

        <xsl:variable name="var" saxon:assignable="yes">
            <elements/>
        </xsl:variable>

        <xsl:template match="/">
            <xsl:variable name="ns" select="namespace-uri(.)"/>
            <xsl:call-template name="iterate">
                <xsl:with-param name="ns" select="$ns"/>
            </xsl:call-template>
        </xsl:template>

        <xsl:template name="iterate">
            <xsl:param name="ns"/>
            <saxon:iterate select="saxon:stream(doc('10000K.xml')//*[. = text()])">
                <xsl:variable name="text" select="."/>
                <xsl:variable name="length" select="string-length($text)"/>
                <xsl:variable name="elem" select="name()"/>
                <xsl:variable name="var2">
                    <elements>
                        <xsl:choose>
                            <xsl:when test="not($var/elements/*[name() = $elem][string-length() &gt; $length])">
                                <xsl:copy-of select="."/>
                                <xsl:copy-of select="$var/elements/*[not(name() = $elem)]"/>
                            </xsl:when>
                            <xsl:otherwise>
                                <xsl:copy-of select="$var/elements/*"/>
                            </xsl:otherwise>
                        </xsl:choose>
                    </elements>
                </xsl:variable>
                <saxon:assign name="var">
                    <xsl:copy-of select="$var2"/>
                </saxon:assign>
            </saxon:iterate>
            <xsl:copy-of select="$var/elements/*"/>
        </xsl:template>

    </xsl:stylesheet>

    I had never used the "global variable" concept before since I didn't know it
    existed, but I have always missed it in XSLT compared to other languages.
    I'm running this code in Oxygen with Saxon 9 EE, and with a 10 MB file it
    already takes about 150 s, so I'm guessing a 1 GB file would take far too
    long. Any idea how to speed that up?


  • Michael Müller-Hillebrand


    You are most definitely on the wrong track. Your use of Saxon-specific
    constructs borrowed from procedural programming seems completely
    unnecessary. You just want to check each element node for its name and the
    length of its text content. This could be done using recursion, I think.

    Using <xsl:template match="*"> you could visit each element node, and using
    <xsl:param>/<xsl:with-param> you could pass on the names of already seen
    nodes.
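    For comparison, a non-streaming XSLT 2.0 sketch of the same job using
    grouping rather than the template/param recursion sketched above (file and
    wrapper names invented):

    ```xml
    <xsl:template match="/">
        <elements>
            <!-- group text-only leaf elements by name, keep the longest of each -->
            <xsl:for-each-group select="//*[not(*)]" group-by="name()">
                <xsl:copy-of select="(current-group()[string-length(.) =
                    max(current-group()/string-length(.))])[1]"/>
            </xsl:for-each-group>
        </elements>
    </xsl:template>
    ```

    This builds the whole tree in memory, though, so it only works while the
    input fits.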

    Your chances of getting a good hint in the right direction are better on the
    regular xsl-list, so you should try to build a solution without Saxon
    extensions.

    • Michael
  • Tobias Klevenz

    Tobias Klevenz - 2010-07-12

    I wouldn't have a problem at all using standard (non-Saxon) XSLT; my problem
    is the large file size.

    When I do <xsl:template match="*"> on a 1 GB file, wouldn't I still be
    building the whole document tree in memory? I thought that would be slower
    than streaming, if it is possible at all with the memory available on my
    machine.

    • Tobias
  • Michael Kay

    Michael Kay - 2010-07-12

    To get a streamed transformation here, you do indeed need the saxon:iterate
    extension (which anticipates xsl:iterate in the XSLT 2.1 specification).
    However, you don't need saxon:assign: the idea is that with saxon:iterate
    you set the parameter values for the next iteration using xsl:with-param
    within saxon:continue, and pick up the new values in an xsl:param nested
    within the saxon:iterate call.
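    The shape of that pattern looks roughly like this (a sketch; the parameter
    name and the update expression are invented, and under streaming the matched
    node would need to be copied before it can be retained across iterations):

    ```xml
    <saxon:iterate select="saxon:stream(doc('10000K.xml')//*[. = text()])">
        <!-- $best: longest element seen so far, one per distinct name -->
        <xsl:param name="best" as="element()*" select="()"/>
        <xsl:variable name="prev" select="$best[name() = name(current())]"/>
        <saxon:continue>
            <xsl:with-param name="best" select="
                if (string-length($prev) &gt;= string-length(.))
                then $best
                else ($best except $prev, .)"/>
        </saxon:continue>
    </saxon:iterate>
    ```

    After the iteration finishes you would still have to emit $best as the
    result.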

    However, this won't directly help with the performance problem, which is
    caused by the large xsl:copy-of used to copy the intermediate results on each
    iteration (well, on many iterations). The problem is that the intermediate
    results grow larger and larger. One solution is to call out to Java and use a
    mutable HashMap to hold a mapping from element names to the largest element
    of each name encountered so far. Or you could keep your current approach,
    but modify it so that you never hold more than one value per element name in
    the intermediate results: instead of simply appending a new element to the
    results, do a transformation on the histogram document that replaces the
    element holding the previous largest value with an element holding the new
    largest value. This way the size of each copy is bounded by the number of
    distinct element names in the document vocabulary, and assuming random data
    the number of times the structure is copied is statistically
    1/1 + 1/2 + 1/3 + ... + 1/n, where n is the number of element instances in
    the source; this rises as n increases, but only slowly, so the overall
    performance might be effectively linear.
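    For reference, that sum is the n-th harmonic number, which grows only
    logarithmically:

    ```
    H(n) = 1/1 + 1/2 + 1/3 + ... + 1/n  ≈  ln(n) + 0.5772
    ```

    so even for n = 10^6 element instances the expected number of copies is only
    about ln(10^6) + 0.58 ≈ 14.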

  • Michael Kay

    Michael Kay - 2010-07-13

    There's another problem here that I didn't spot yesterday. Your selection
    doc('10000K.xml')//*[. = text()] includes every element in the document,
    including the outermost one. When the predicate [. = text()] is applied, the
    string value of the element has to be evaluated, and for the outermost
    element that string value is about a gigabyte in size. I think what you are
    actually trying to do is select elements that have no element children -
    which isn't easy either in streaming mode. But a good start would be to
    exclude the outermost element from the search, so the expression becomes
    doc('10000K.xml')/*//*[. = text()].

  • Tobias Klevenz

    Tobias Klevenz - 2010-07-13

    Hi Michael,

    I spotted that last one myself when I looked at the code again yesterday. :)

    Thanks a lot for your insight, though. I'll try to avoid the copy-of and
    modify my current approach the way you suggested; since this is also kind of
    a proof of concept for me, I wanted to stick to XSLT.

    I find this simple-sounding problem quite interesting when looking at it
    from an XSLT point of view.

    • Tobias
