Learn how easy it is to sync an existing GitHub or Google Code repo to a SourceForge project! See Demo

Close

Eliminating saxon:assign

2005-06-21
2012-10-08
  • [This thread continues the "Template recursion, StackOverflowError, saxon:while a nd variable assignability" thread on the saxon-help list.]

    Hi Andre, Dimitre, Michael

    Trying to summarise/mediate the follow-ons to my last message
    >
    > Am I right in saying your problem could be equivalent to
    > the following
    >
    > You have two input models.
    > - your large database model
    > - your problem goal
    > You perform a side-effect free analysis to compute
    > - a goal-specific change request model
    > The change request model then controls an update of the
    > database model?

    I think Andre says yes. The above algorithm forms part of an infinite
    loop, and there may be many of side-effect analyses of varying complexity.

    Dimitre is very sceptical that the side-effect analyses require
    saxon:assign. I share this scepticism. Each analysis is probably amenable
    to a suitable functional approach.

    Michael has not commented on whether Saxon can optimise the application
    of the change request to the database model, so until XSLT can perform
    an in-place update, I'm inclined to agree with Andre that saxon:assign
    is required to avoid copying a huge model just to change a couple of nodes.
    This is the use-case that we need to address.

    Regards
    
        Ed Willink
    
     
    • > Michael has not commented on whether Saxon can
      > optimise the application
      > of the change request to the database model, so
      > until XSLT can perform
      > an in-place update, I'm inclined to agree with Andre
      > that saxon:assign
      > is required to avoid copying a huge model just to
      > change a couple of nodes.
      > This is the use-case that we need to address.

      I completely don't understand this.

      How can this be done with saxon:assign?

      It assigns to a variable -- the need to have an updated xml document is still there.

      On the other side, Mike presented his optimisation called "virtual copy". This allows not to copy every time a sequence of nodes but only to copy the newly added nodes.

      This can be used in updating a graph in-memory in only linear time and space as opposed to quadratical one if all nodes had to be copied.

      Cheers,
      Dimitre Novatchev.

       
    • StratML-OS
      StratML-OS
      2005-06-27

      Communication is not easy some times as I can not see the "proof again that the main problem is the excessive complexity". What excessive complexity? It is not excessive to us anyway and I am trying to understand why it is excessive to you? I also can not see how assign "actually contributes to the increased complexity and to making the problem bigger... ", not for us surely!

      Isn't the file generation tracking example simple enough? or is "How would you use keys to accumulate the ids of the pages that have been done so far and to lookup ids and check each new one if it has been used already ? and how would you add it to the list for the next iteration, if it has not? " too complex still ?

      I agree with you about Mr. Kay, but how can you talk of bad progamming style already, when you have no decent alternative yet and you seem to have difficulties with the "excessive complexity" and in understanding the issues?
      Please tell me more about your programming style and show me how you would solve any of our use cases currently using assign, even just the simple one that I proposed here and that you seem to ignore, so systematically, so far, would help, for a start.

      ac

      Note: The other use case that you are refering to is not about finding the paths between 2 nodes but about evaluating those paths while traversing the graph.

      Note: Our application is really very small for all that it does. It is structured and modular, constantly evolving, yet operating 24/7. Still we plan to take it much further. I can not judge our programming style like you do, but something has to be right.

      I would have appreciated if you could have helped make it (and our programming style) better.

       
    • StratML-OS
      StratML-OS
      2005-06-21

      How about a parenthesis to look at a simpler but yet somewhat similar case :

      an XSLT application processes the contents of trees of xml files and produces, lets say, thousands of xhtml pages, and lets also say that the application, as it generates every xhtml file, needs to accumulate the urls (or file names, as determined from the corresponding content anlysis/processing) of each already generated file in order to constantly ensure that the new file url about to be created has not already been used and so, avoid a "Cannot write more than one result document to the same URI" XSLT processing error.

      In a case like this, what kind of data structure and process should be used to accumulate and search the url list ?

      Thank you
      ac

       
    • > In a case like this, what kind of data structure
      > and process should be used to accumulate and
      > search the url list ?

      This is exactly the case where virtual copy must shine!

      The data structure to use is a sequence of nodes.

      Each time a new node is appended to it, none of the existing nodes is copied.

      As for the naming -- there are some ways that will guarantee unique names and will avoid tracking the set of names completely.

      Cheers,
      Dimitre

       
    • To Dimitre:

      > > Michael has not commented on whether Saxon can
      > > optimise the application
      > > of the change request to the database model, so
      > > until XSLT can perform
      > > an in-place update, I'm inclined to agree with Andre
      > > that saxon:assign
      > > is required to avoid copying a huge model just to
      > > change a couple of nodes.
      > > This is the use-case that we need to address.
      >
      >
      > I completely don't understand this.
      >
      > How can this be done with saxon:assign?
      >
      > It assigns to a variable -- the need to have an updated xml
      > document is still there.

      [I've never actually used saxon:assign, so I may misunderstand
      it's capabilities.]

      I would imagine that the 'efficient' update could use an
      XPath to tunnel down to the input node that needs changing,
      then zap it with an assign, and then return without creating
      an output document, but having updated the input.

      The cost of this is pretty much the XPath traversal to the
      target node(s) rather than an operation, however optimised,
      that involves every node in the database.

      In principle, we can match this assign performance, by realising
      the zap in an apply-templates/copy traversal, and have the
      XSLT compiler recognise that

      a) the input life-time terminates at the traversal
       [b) the output life-time starts at the traversal]
      c) the output is substantially the same as the input
      

      so that the traversal implementation, while apparently creating
      a new output tree with a full set of copies, is optimised to
      re-use input nodes as output nodes and to perform in-place
      updates/assigns where the output nodes differ from the inputs.
      Since input is re-used as output, the traversal of unchanged
      nodes should be optimised away so that only changed nodes are
      visited.

      [I am sure issues of node identity and pattern matches make this
      impossible in arbitrary code, since local compilation cannot properly
      analyse nested matches. However XSLT does not allow late discovery
      of dynamically loaded stylesheets, so there must be a point at which
      a global compilation can perform the relevant determination of
      hierarchical output/input correspondence. No doubt there are
      certain challenging XSLT concepts that would preclude optimisation
      - generate-id() since input and output no longer distinct
      - saxon:eval() since its accesses cannot be analysed
      This was why I suggested that the programmers intent of in-place
      operation be an annotation so that the above reason can be reported
      in a diagnostic.]

      > On the other side, Mike presented his optimisation called
      > "virtual copy". This allows not to copy every time a sequence
      > of nodes but only to copy the newly added nodes.
      >
      > This can be used in updating a graph in-memory in only linear
      > time and space as opposed to quadratical one if all nodes had
      > to be copied.

      The "virtual copy" copy is certainly good compared to a copy,
      but let's get this in perspective. In C/C++ terms a virtual
      copy is just a pointer to const, so of course you can have
      lots of pointers to source objects at minimal cost. Michael
      clarified the restrictive usage to orphan outputs, such as a
      sequence of nodes used as a work list. I'm not sure that it
      would be possible to have such a sequence of nodes without a
      "virtual copy", so the "virtual copy" is a necessary and very
      sensible requirement of XSLT 2, it's not an optimisation.
      [Sorry Michael; like with magic, once you understand the
      trick you wonder why you ever thought it was clever.]

      To DNAOS:

      > > In a case like this, what kind of data structure
      > > and process should be used to accumulate and
      > > search the url list ?
      >
      > This is exactly the case where virtual copy must shine!
      >
      > The data structure to use is a sequence of nodes.

      Yes, or a sequence of generate-id() returns.

      Regards
      
          Ed Willink
      
       
      • Ed,

        > [I've never actually used saxon:assign, so I may
        > misunderstand it's capabilities.]

        saxon:assign only updates an xsl:variable -- not a node.

        >
        > I would imagine that the 'efficient' update could use
        > an XPath to tunnel down to the input node that
        > needs changing, then zap it with an assign, and
        > then return without creating an output document, but
        > having updated the input.

        This will probably not be implemented as it violates the basic principles of functional programming, creating side effects among other things.

        Cheers,
        Dimitre

         
    • Hi Dimitre

      > > [I've never actually used saxon:assign, so I may
      > > misunderstand it's capabilities.]

      >
      > saxon:assign only updates an xsl:variable -- not a node.

      Oh, well that makes saxon:assign much less useful, and that
      much easier to replace. We therefore lack a convincing
      use-case for it - short of analysing Andre's 25K lines.

      > >
      > > I would imagine that the 'efficient' update could use
      > > an XPath to tunnel down to the input node that
      > > needs changing, then zap it with an assign, and
      > > then return without creating an output document, but
      > > having updated the input.
      >
      > This will probably not be implemented as it violates the
      > basic principles of functional programming, creating side
      > effects among other things.

      Yes, it should not be what the functional programmer programs;
      just what the optimised execution of a functional program does.

      Regards
      
          Ed Willink
      
       
    • StratML-OS
      StratML-OS
      2005-06-22

      Of course we could have a unique id generator (ex: generateid()) but what if the generated pages are referred to by other documents and the links need to be maintained?

      As for using a nodeset for the required data structure, it is an alternative and would be very useful in many cases, if performance is good. Mutable nodesets, without assign, is interesting, but in this example, if the file names are not too large, wouldn't a string with, lets say, for example, a space delimited list, searched with regular expressions, be a more efficient (space & time) alternative, assuming that the string is also mutable with (or without assign) ?

      Thank you.
      ac

       
      • > Of course we could have a unique id generator (ex:
        > generateid()) but what if the generated pages are
        > referred to by other documents and the links need
        > to be maintained?

        Many ways:

        1. Using an XPath expression that selects a node-set consisting of only that node.

        2. Calling an extension function that produces a GUID

        .....

        etc.

        > but in this example, if the file names are not too
        > large, wouldn't a string with, lets say, for example, a
        > space delimited list, searched with regular
        > expressions, be a more efficient (space & time)
        > alternative, assuming that the string is also mutable
        > with (or without assign) ?

        No. having to copy a string every time something must be appended to it is O(N^2)

        Cheers,
        Dimitre

         
        • Michael Kay
          Michael Kay
          2005-06-24

          >No. having to copy a string every time something must be appended to it is O(N^2)

          It all depends what one chooses to optimize...

          MK

           
          • > > No. having to copy a string every time something
            > > must be appended to it is O(N^2)
            >
            > It all depends what one chooses to optimize...
            >
            > MK

            Yes, but we'd (we == the XSLT programmers) prefer the optimization to be done on sequence of nodes, otherwise serializing it to a pipe-delimiting string defeats the whole idea behind XSLT and makes using JavaScript preferrable.

            I still didn't have the time to test the effects of virtual copy, but this is a very good step forward.

            This optimization is being performed in functional languages as sublist sharing and is one of the most popular and wellknown optimization technique.

            DN

             
    • StratML-OS
      StratML-OS
      2005-06-23

      Hi,

      This is not just a use case, it is a pattern that we faced 25 years ago as we developed Lisp, Logo, and other functional programming engines and it is still here today. The reason that it does not go away is that it is a fundamental logic pattern that is not well covered by a pure functional approach. I am sorry that life is not perfect and simple but it is the only one that we have. Since life has more use cases along this (and other) pattern(s), from simpler to more complex ones, all dependent on assign, not only the subject of this message track could have been something more like "re-evaluating assign", but I maintain my initial claim that assign is indispensable to XSLT and should be part of the standard language, just like regular expressions, for example.

      If ever this can be better solved and magically handled by an "optimized execution of a functional program " hiding intricacies of "what the functional programmer programs " today, and at a better performance and maintenance cost than using assign, I will be very interested, and at more than one level.

      In the mean time, lets not cripple this fascinating language.

      Thank you.
      Andre Cusson
      01 COMMUNICATIONS

       
    • StratML-OS
      StratML-OS
      2005-06-24

      Node-set or String, update optimization is key for performance. Size, and size related performance issues (ex: copying to append), still remain proportional to the nature of the data structure itself (ex: string or nodeset). But what about the (read-only) search/lookup operations, that are usually performed much more often than updates? How should they be optimized? What is the performance factor between searching a string with regex or traversing a nodeset comparing string contents?

      Thanks.
      ac

       
    • StratML-OS
      StratML-OS
      2005-06-24

      > Of course we could have a unique id generator (ex:
      > generateid()) but what if the generated pages are
      > referred to by other documents and the links need
      > to be maintained?

      Many ways:

      1. Using an XPath expression that selects a node-set consisting of only that node.

      2. Calling an extension function that produces a GUID

      I mean outside documents, somewhere on the Web not under our application's control and that have links referring to pages that the application is (re-)generating. The way I understand it, if the id changes, the links break unless they are also updated correspondingly. I can not see how "Using an XPath expression that selects a node-set consisting of only that node" or "Calling an extension function that produces a GUID" can be used to maintain the outside links (and bookmarks) to the pages that we are generating with new IDs. Could you explain?

      Thank you.
      ac

       
    • > I can not see how "Using an XPath expression that
      > selects a node-set consisting of only that node"
      > or "Calling an extension function that produces a
      > GUID" can be used to maintain the outside links
      > (and bookmarks) to the pages that we are
      > generating with new IDs. Could you explain?

      Let's take a GUID.

      By definition it is unique.

      Make a particular GUID (say guid1)the @id attribute of the node that must be referenced.

      Then use an XPath expression either with the id() function or just something like a:

      pathExpression[@id = guid1]

      In the extreme case when the presence of a DTD or the structure of the xml document cannot be guaranteed, this may be used:

      (//nodeName[@id = guid1])[1]

      which some XSLT processors optimize.

      Using a XPath expression for a node is a similar approach, assuming that only such changes to a xml document are done that do not change the path to the node.

      Cheers,
      Dimitre

       
      • Of course, in XSLT the way to reference nodes is with keys (xsl:key and the key() function).

        As originally the problem was referring to finding a path between two nodes in a graph, if one is using GML to represent a graph, then it is very convenient to use keys in order to find all nodes for which there's an arc from a given node or all nodes from which there is an arc to a given node, or to find all outgoing or all incoming arcs from/to a given node.

        DN

         
    • StratML-OS
      StratML-OS
      2005-06-27

      Hi,

      >Make a particular GUID (say guid1)the @id attribute of the >node that must be referenced.
      >
      >Then use an XPath expression either with the id() function >or just something like a:
      >
      >pathExpression[@id = guid1]

      There is no XPath engine for the referring pages, they are bookmarks in users or subscriber browsers or links in html pages, in unknown web sites ... How would they know about your generated ids?

      >As originally the problem was referring to finding a path >between two nodes in a graph, if one is using GML to >represent a graph, then it is very convenient to use >keys in order to find all nodes for which there's an arc >from a given node or all nodes from which there is an >arc to a given node, or to find all outgoing or all >incoming arcs from/to a given node.

      We had no need to figure-out all incoming or outgoing arcs from a given node (as this is obvious when traversing, as we then have direct access to the required references to adjacent nodes), the need was for analyzing all potential paths between 2 (typically non-adjacent) nodes, and selecting the most appropriate in a given context and even then the main issue was specifically to avoid running in circles while traversing the structure (+- travelling salesman with no plan and a non- mutable graph).

      Also, I had a quick look at our application core which declares over 400 global variables and params, of which 16 are assignable and represent about 8 different use cases for assign, most of which follow a relatively similar logical pattern.

      Removing assign is not a theoretical issue here.
      Without assign or a very efficient and adequate replacement, we will be out of business and so will many other projects, and consequently the platform.

      I think that this is serious and that as there seems to be no effective replacement and none at a price that Mr Kay and other providers can afford (typically less than maintaining assign, I suppose), assign should be added to the standard.

      It is fascinating to dream about pure concepts and approaches but a programming language is a tool to solve real problems in the real world, where everything is not just simple and pure. Unfortunate maybe but interesting surely.

      We also have some use cases for keys that are working just fine. Keys could not be used for our assign use-cases, for various reasons including performance, size, non-mutability, only-global, not the right datastructure, or a mix of any of these.

      How would you use keys to accumulate the ids of the pages that have been done so far and to lookup ids and check each new one if it has been used already ? and how would you add it to the list for the next iteration, if it has not?

      ac

       
    • > Also, I had a quick look at our application core
      > which declares over 400 global variables and
      > params, of which 16 are assignable and represent
      > about 8 different use cases for assign, most of
      > which follow a relatively similar logical pattern.

      So, this is a proof again that the main problem is the excessive complexity of the system described.

      As for finding a path between two nodes in a graph, (of course without going into a loop) there exist clean algorithms to do this.

      And using saxon:assign actually contributes to the increased complexity and to making the problem bigger...

      I respect very much Michael Kay and admire his work. No doubt, he will take the right decision on saxon:assign and the negative value of supporting bad programming style.

      DN