Menu

How do you store substrings?

Help
Anonymous
2001-08-01
2012-10-08
  • Anonymous

    Anonymous - 2001-08-01

    Just curious, but having perhaps too much experience with procedural languages, I cannot help but wonder how quickly the stack and heap fill up during recursive calls. For example, since strings are immutable, do you only store an index and length pair for each call to substring? Also, do you optimize interpretation of templates with tail recursion?

    Here is a somewhat contrived example, but it helps illustrate the potential for exponential consumption of memory:

    <xsl:template name="print">
        <xsl:param name="text"/>
       
        <xsl:choose>
            <xsl:when test="starts-with($text, '&#x09;')">-</xsl:when>
            <xsl:when test="starts-with($text, '&#x0A;')">~</xsl:when>
            <xsl:when test="starts-with($text, '&#x20;')">_</xsl:when>
           
            <xsl:otherwise>
                <xsl:value-of select="substring($text, 1, 1)"/>
            </xsl:otherwise>
        </xsl:choose>
       
        <xsl:if test="$text">
            <xsl:call-template name="print">
                <xsl:with-param name="text" select="substring($text, 2)"/>
            </xsl:call-template>
        </xsl:if>
    </xsl:template>

     
    • Michael Kay

      Michael Kay - 2001-08-02

      Your particular example will be implemented in Saxon using a technique called "tail recursion": if the template calls itself, and doesn't need to do any work when the recursive call returns, then the same stack-frame is reused for each level of recursion. In my tests, Saxon handles a million levels of recursion happily where MSXML3 and Xalan fall over after about 10,000.

      Incidentally, even when tail recursion isn't possible, the amount of memory used increases linearly with the depth of recursion: it certainly isn't exponential, as your question suggests.

      As to the original question, Saxon generally evaluates scalar values (strings, booleans, and numbers) eagerly, but evaluates node-sets lazily. This is because node-sets can be very memory-hungry whereas scalar values usually aren't.

      Mike Kay

       
      • Anonymous

        Anonymous - 2001-08-02

        Fantastic. I am not surprised early implementations do not have efficient tail recursion processing yet. It is great to hear Saxon does.

        My idea for substrings might be implemented with a String wrapper like the following:

        class Substring
        {
            Substring(String s)
            {
                this.s = s;
                p = 0;
                l = s.length();
            }

            Substring(Substring ss, int p, int l)
            {
                this.s = ss.s;
                this.p = p;
                this.l = l;
            }

            etc...

            final String s;
            final int p;
            final int l;
        }

        This could significantly reduce memory consumption and processing time for large files.

         
        • Michael Kay

          Michael Kay - 2001-08-02

          Certainly possible, hard to judge the "win".

          The big win is to find some way of reducing the number of string copy's between the parser reading the raw text, the transformer copying it to the result tree, and the serializer outputting it. For example, I've been looking at whether it's possible to do entity expansion lazily - which would require a tighter interface between Saxon and the XML parser.

          Mike Kay