Predict implementation strategy

Help
2010-05-04
2012-10-08
  • Hello Mr Kay!

    I would like your advice on how I can ensure that
    my functions are efficient in Saxon.

    Context:

    I'm trying to parse text reports in XSLT.
    From the XSLT point of view, a report is a value of type xs:string*.

    The process is organized as a sequence of subviews
    imposed on the original data. This way, parsing looks like
    a sequence of filters over the original data.

    The problem:

    A data subview is implemented as a function.
    My problem is to ensure that such functions are efficient.

    An example:

    Consider this subview: the data before a specific line pattern.
    The function p:view1() implements such a filter.

      <xsl:function name="p:view1" as="xs:string*">
        <xsl:param name="view" as="xs:string*"/>
    
        <xsl:sequence select="p:view1($view, 1)"/>
      </xsl:function>
    
      <xsl:function name="p:view1" as="xs:string*">
        <xsl:param name="view" as="xs:string*"/>
        <xsl:param name="row" as="xs:integer"/>
    
        <xsl:variable name="line" as="xs:string?" select="$view[$row]"/>
    
        <xsl:if test="exists($line) and not(p:condition1($view, $row))">
          <xsl:sequence select="$line"/>
          <xsl:sequence select="p:view1($view, $row + 1)"/>
        </xsl:if>
      </xsl:function>
    
      <xsl:function name="p:condition1" as="xs:boolean">
        <xsl:param name="view" as="xs:string*"/>
        <xsl:param name="row" as="xs:integer"/>
    
        <xsl:sequence select="
          matches($view[$row], '^\s+$') and
          matches($view[$row + 1], 'E N D   O F   R E P O R T')"/>
      </xsl:function>
    
    ...
      <!-- view before final marker. -->
      <xsl:variable name="view1" as="xs:string*" select="p:view1($view)"/>
    ...
      <!-- view of joined page content without page headers and footers -->
      <xsl:variable name="view2" as="xs:string*" select="p:view2($view1)"/>
    ...
    

    My concern is to make these functions efficient, meaning that they
    should be lazy enough and should not try to cache the whole output.

    Do you think it is possible to achieve this goal in Saxon?

    P.S. After all I've written here, I'm no longer convinced about the correct
    implementation language.
    I see that I'm going deep into implementation details. Perhaps I should
    consider generating Java or C# report parsers. :-)

     
  • Michael Kay
    2010-05-04

    The general advice I would give is (a) create a measurement framework that
    allows you to determine the performance you are getting with sufficient
    precision, including its scalability as workload factors (such as input
    document size) change, (b) set performance targets, (c) if performance is not
    meeting targets, try to understand why, by using the tools available: the
    -explain output to show the decisions made by the optimizer, timing profiles
    showing where the execution time is spent at the XSLT level, Java timing
    profiles showing where it is spent at the Java level. If you find an example
    where Saxon's execution strategy is clearly sub-optimal, then I'm always
    interested to know. If you want to know why the optimizer made the decisions
    it did, then I will try and explain. However, in general I can't give free
    advice or help to people who want me to do open-ended performance studies or
    improvements on a particular workload.
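
    As a rough illustration of step (a), a measurement harness might look like
    the sketch below. It is a Java sketch under stated assumptions, not Saxon
    tooling: the class name ScalingHarness and the synthetic <report>/<line>
    input are hypothetical, and the JAXP identity transform is only a
    placeholder for the real stylesheet. It records the best wall-clock time
    per input size, so that scaling can be compared against a target.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical harness: times a transformation at several input sizes. */
final class ScalingHarness {

    /** Builds a synthetic document with the given number of line elements. */
    static String makeInput(int lines) {
        StringBuilder sb = new StringBuilder("<report>");
        for (int i = 0; i < lines; i++) {
            sb.append("<line>row ").append(i).append("</line>");
        }
        return sb.append("</report>").toString();
    }

    /** Returns, for each size, the best wall-clock time in nanoseconds over the repeats. */
    static Map<Integer, Long> measure(int[] sizes, int repeats) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        Map<Integer, Long> best = new LinkedHashMap<>();
        for (int size : sizes) {
            String input = makeInput(size);
            long min = Long.MAX_VALUE;
            for (int r = 0; r < repeats; r++) {
                // Assumption: the identity transform stands in for the real stylesheet.
                Transformer t = factory.newTransformer();
                long start = System.nanoTime();
                t.transform(new StreamSource(new StringReader(input)),
                            new StreamResult(new StringWriter()));
                min = Math.min(min, System.nanoTime() - start);
            }
            best.put(size, min);
        }
        return best;
    }

    public static void main(String[] args) throws Exception {
        // Doubling the size each step makes non-linear growth easy to spot.
        Map<Integer, Long> result = measure(new int[] {1000, 2000, 4000, 8000}, 5);
        result.forEach((size, ns) -> System.out.println(size + " lines: " + ns / 1000 + " us"));
    }
}
```

    To time a real stylesheet, one would presumably compile it once with
    factory.newTemplates(...) and time the transform call on a Transformer
    obtained from those templates instead of the identity transform.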

    There's nothing obviously inefficient about your code; if I were doing a
    performance exercise on it I would want to know a lot more about the project
    requirements, e.g. the actual and required performance, the data volumes, the
    environment in which the XSLT code is running, the opportunities for tuning
    components other than your XSLT code.

     
  • Thank you.
    I realize that I could not expect an exhaustive answer.
    Probably I just needed a place to articulate my problems.
    Sorry for this misuse.

    My problem is the unbounded size of the input (I've seen at least 3 GB).
    I could supply it to XSLT as xs:string* through an extension function.
    I need to prevent the engine from caching the input or any view of the input.

    From the implementation perspective, I would like the engine to work with
    a buffered sequence (by analogy with a buffered stream).
    I might be able to implement a "buffered sequence" with an extension function
    that wraps the output of each function that produces a subview.
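
    The "buffered sequence" idea can be sketched outside XSLT. The following
    is a minimal Java sketch, not a Saxon extension function: the class name
    LinesBeforeMarker is hypothetical, and how such an iterator would be wired
    into Saxon is left open. It mirrors the p:view1 filter with one line of
    lookahead, so it holds only two lines at a time whatever the input size.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Pattern;

/** Hypothetical lazy view: yields lines until a blank line is followed by the end marker. */
final class LinesBeforeMarker implements Iterator<String> {
    private static final Pattern BLANK = Pattern.compile("^\\s+$");
    private static final Pattern END = Pattern.compile("E N D   O F   R E P O R T");

    private final Iterator<String> source;
    private String current;    // next candidate line to emit, or null at end of input
    private String lookahead;  // the line after current, or null at end of input
    private boolean done;

    LinesBeforeMarker(Iterator<String> source) {
        this.source = source;
        this.current = pull();
        this.lookahead = pull();
    }

    /** Pulls one line from the underlying source, or null when it is exhausted. */
    private String pull() {
        return source.hasNext() ? source.next() : null;
    }

    @Override
    public boolean hasNext() {
        if (done || current == null) {
            return false;
        }
        // Same test as p:condition1: whitespace-only line followed by the marker line.
        if (lookahead != null
                && BLANK.matcher(current).find()
                && END.matcher(lookahead).find()) {
            done = true;
            return false;
        }
        return true;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String result = current;
        current = lookahead;
        lookahead = pull();
        return result;
    }
}

final class LazyViewDemo {
    public static void main(String[] args) {
        String report = "line 1\nline 2\n   \nE N D   O F   R E P O R T\ntrailer\n";
        // BufferedReader.lines() is lazy, so the input is never fully materialized.
        Iterator<String> lines = new BufferedReader(new StringReader(report)).lines().iterator();
        Iterator<String> view = new LinesBeforeMarker(lines);
        while (view.hasNext()) {
            System.out.println(view.next()); // prints "line 1" then "line 2"
        }
    }
}
```

    Further views (such as p:view2) could be stacked the same way, each
    wrapping the iterator of the previous one, which is the filter chain
    described above.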

     
  • David Lee
    2010-05-04

    My opinions may not be the same as Mr Kay's.
    But personally I would not try to use XSLT to parse a 3 GB text file on any
    mortal machine.
    If it were me, I would write a pre-parser in another language such as C, Java,
    C#, or even Perl.
    This pre-parser would split the file into smaller pieces at appropriate places
    (instead of just using "split", which is too blind),
    and would also try to add SOME XML structure to the files.

    Once you have a directory of reasonably sized, even slightly XML-encoded files,
    you will have an easier time.
    However, depending on the needs of your processing, it may still be difficult.
    Ideally you don't need access to the entire 3 GB all at once, so you can run
    XSLT on a subset of the files iteratively. 3 GB of text will take at least
    10 GB of RAM, if not more, once loaded into Java.
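
    A pre-parser along these lines might be sketched as follows in Java. This
    is only a sketch under assumptions: the class name ReportSplitter, the
    <chunk>/<line> element names, the chunk-N.xml file names, and the boundary
    test passed in as a predicate are all hypothetical, since where the
    "appropriate places" are depends on the report format. It reads one line
    at a time, so memory use does not grow with the size of the input.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Predicate;

/** Hypothetical pre-parser: splits a text report into small, lightly structured XML files. */
final class ReportSplitter {

    /** Escapes XML markup characters and blanks out control characters XML 1.0 forbids. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:
                    // Form feeds and similar controls are common in reports but illegal in XML 1.0.
                    sb.append(c < 0x20 && c != '\t' ? ' ' : c);
            }
        }
        return sb.toString();
    }

    /** Writes chunk-1.xml, chunk-2.xml, ... into outDir and returns the number of chunks. */
    static int split(BufferedReader in, Predicate<String> startsNewChunk, Path outDir)
            throws IOException {
        int chunks = 0;
        BufferedWriter out = null;
        try {
            String line;
            while ((line = in.readLine()) != null) {
                // The first line always opens a chunk; later ones do so only at a boundary.
                if (out == null || startsNewChunk.test(line)) {
                    if (out != null) {
                        out.write("</chunk>\n");
                        out.close();
                    }
                    chunks++;
                    out = Files.newBufferedWriter(
                            outDir.resolve("chunk-" + chunks + ".xml"), StandardCharsets.UTF_8);
                    out.write("<chunk n=\"" + chunks + "\">\n");
                }
                out.write("  <line>" + escape(line) + "</line>\n");
            }
            if (out != null) {
                out.write("</chunk>\n");
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        String sample = "PAGE 1\nfirst & second\nPAGE 2\nthird\n";
        Path dir = Files.createTempDirectory("chunks");
        // Assumption for the demo: a line starting with "PAGE" begins a new piece.
        int n = split(new BufferedReader(new StringReader(sample)),
                      l -> l.startsWith("PAGE"), dir);
        System.out.println(n + " chunk files written to " + dir);
    }
}
```

    Each resulting file can then be handed to XSLT on its own, which keeps the
    memory needed per transformation tied to the chunk size rather than to the
    size of the whole report.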