## Re: [saxon] Help understanding Saxon performance problem

 Re: [saxon] Help understanding Saxon performance problem From: Michael Kay - 2013-02-20 10:25:16 Attachments: Message as HTML ```It's a little bit difficult to know exactly which expression is causing the trouble here, partly because I don't have access to the volumetrics. For example following-sibling::node()[following-sibling::*[1] is \$next] is potentially quadratic in the number of siblings, but it's not clear from your description whether the number of siblings grows very large or remains constant as the file size grows. A recursive function might do better: function endMatter (this, next, all) { var f = \$this/following-sibling::node()[1] if f is \$next then \$all else endMatter(\$f, \$next, (\$f, \$all)) } Another possible culprit is the call on preceding::* in node()[preceding::*[. intersect \$docNodes][1] is \$final] I think the question here is, how big is \$docNodes? Testing the intersect operator will be linear in the size of \$docNodes. And you use a similar construct elsewhere, e.g this looks rather expensive if \$docNodes is large: match="*[not(descendant::* intersect \$docNodes)]"> .... For this kind of thing a useful trick can be to define a key with use="generate-id()": xsl:key name="n" match="*[. intersect \$docNodes]" use="generate-id()"/> if test="key('n', generate-id(\$X))" then tests whether \$X is in \$docNodes in constant time. For the particular match pattern, another technique I have used is to precompute the nodes that will match it in a global variable: var \$docNodesAncestors = \$docNodes/ancestor::* and then match="*[not(. intersect \$docNodesAncestors)]" or use the same key() trick as before. Michael Kay Saxonica On 20/02/2013 03:38, John Barstow wrote: > I've got one particular XSLT construct that seems to be a huge problem > under Saxon, but I'm not sure exactly how to rewrite it for better > performance. When I say huge, it takes the processing of a 2MB > document from a few seconds to over 36 hours. > > The Problem > ========= > > The intention is to split a monolithic XML file into a number of > smaller self-contained files. In this example, we want each `part` > element to end up in its own document. In reality, we pass in the list > of nodes to break on as a stylesheet parameter as it varies according > to external metadata. > > The twist that makes this less straightfoward is that we want any end > matter that follows a part to become a child of that part when we've > generated the files. > > Aside: The reasoning is that we're working with legislation which > places constraints on how we structure the presentation. The sample > below is taken from a unit test and represents the most complex case; > in the most common case there is no end matter but we still pay a huge > performance penalty! > > Sample doc: > > > > a > > b > c > > > d > e > f > g > > > > > > Desired output: > 1.xml > > > a > > > > > 2.xml - > b > > 3.xml - > c > > 4.xml - > de > > 5.xml - > fg > > The Underperforming Solution > ===================== > > We pass in the list of nodes which will become documents as the > `docNodes` parameter. I've elided all the logic irrelevant to the > immediate problem, hopefully I haven't elided anything critical to > following the logic here. In all probability I've introduced > unnecessary complexity in the iteration towards a working solution. > > Unfortunately the profiler is not giving me very useful information, > basically pointing me towards this template without enough information > to deduce which XPath expressions are the culprits. > > > > > > > > > > > select="node()[preceding::*[. intersect \$docNodes][1] is \$final]" /> > > select="following-sibling::*[descendant-or-self::* intersect > \$docNodes][1]" /> > > select="following-sibling::node()[following-sibling::*[1] is \$next]" /> > > > > > > > > select="(\$child_endmatter, \$sibling_endmatter, \$endmatter)" /> > name="endmatter_dest" select="\$final" /> > > > > > > > > > > > > > > > > > > mode="chunk" /> > select="node()[not(parent::*/descendant::*[. intersect \$docNodes][1] > << .)]" mode="chunk"> > > > > select="\$child_endmatter, \$sibling_endmatter, \$endmatter" mode="chunk" /> > > > > > > > We have similar template to optimize for a common case where no > descendants will result in a new document (basically less convoluted > logic): > > .... > > The "chunk" mode is uninteresting, basically a copy to the output > document: > > > > > > > > > > > > > > > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_d2d_feb > > > _______________________________________________ > saxon-help mailing list archived at http://saxon.markmail.org/ > saxon-help@... > https://lists.sourceforge.net/lists/listinfo/saxon-help ```

 [saxon] Help understanding Saxon performance problem From: John Barstow - 2013-02-20 03:39:02 Attachments: Message as HTML ```I've got one particular XSLT construct that seems to be a huge problem under Saxon, but I'm not sure exactly how to rewrite it for better performance. When I say huge, it takes the processing of a 2MB document from a few seconds to over 36 hours. The Problem ========= The intention is to split a monolithic XML file into a number of smaller self-contained files. In this example, we want each `part` element to end up in its own document. In reality, we pass in the list of nodes to break on as a stylesheet parameter as it varies according to external metadata. The twist that makes this less straightfoward is that we want any end matter that follows a part to become a child of that part when we've generated the files. Aside: The reasoning is that we're working with legislation which places constraints on how we structure the presentation. The sample below is taken from a unit test and represents the most complex case; in the most common case there is no end matter but we still pay a huge performance penalty! Sample doc: a b c d e f g Desired output: 1.xml a 2.xml - b 3.xml - c 4.xml - de 5.xml - fg The Underperforming Solution ===================== We pass in the list of nodes which will become documents as the `docNodes` parameter. I've elided all the logic irrelevant to the immediate problem, hopefully I haven't elided anything critical to following the logic here. In all probability I've introduced unnecessary complexity in the iteration towards a working solution. Unfortunately the profiler is not giving me very useful information, basically pointing me towards this template without enough information to deduce which XPath expressions are the culprits. We have similar template to optimize for a common case where no descendants will result in a new document (basically less convoluted logic): .... The "chunk" mode is uninteresting, basically a copy to the output document: ```
 Re: [saxon] Help understanding Saxon performance problem From: Michael Kay - 2013-02-20 10:25:16 Attachments: Message as HTML ```It's a little bit difficult to know exactly which expression is causing the trouble here, partly because I don't have access to the volumetrics. For example following-sibling::node()[following-sibling::*[1] is \$next] is potentially quadratic in the number of siblings, but it's not clear from your description whether the number of siblings grows very large or remains constant as the file size grows. A recursive function might do better: function endMatter (this, next, all) { var f = \$this/following-sibling::node()[1] if f is \$next then \$all else endMatter(\$f, \$next, (\$f, \$all)) } Another possible culprit is the call on preceding::* in node()[preceding::*[. intersect \$docNodes][1] is \$final] I think the question here is, how big is \$docNodes? Testing the intersect operator will be linear in the size of \$docNodes. And you use a similar construct elsewhere, e.g this looks rather expensive if \$docNodes is large: match="*[not(descendant::* intersect \$docNodes)]"> .... For this kind of thing a useful trick can be to define a key with use="generate-id()": xsl:key name="n" match="*[. intersect \$docNodes]" use="generate-id()"/> if test="key('n', generate-id(\$X))" then tests whether \$X is in \$docNodes in constant time. For the particular match pattern, another technique I have used is to precompute the nodes that will match it in a global variable: var \$docNodesAncestors = \$docNodes/ancestor::* and then match="*[not(. intersect \$docNodesAncestors)]" or use the same key() trick as before. Michael Kay Saxonica On 20/02/2013 03:38, John Barstow wrote: > I've got one particular XSLT construct that seems to be a huge problem > under Saxon, but I'm not sure exactly how to rewrite it for better > performance. When I say huge, it takes the processing of a 2MB > document from a few seconds to over 36 hours. > > The Problem > ========= > > The intention is to split a monolithic XML file into a number of > smaller self-contained files. In this example, we want each `part` > element to end up in its own document. In reality, we pass in the list > of nodes to break on as a stylesheet parameter as it varies according > to external metadata. > > The twist that makes this less straightfoward is that we want any end > matter that follows a part to become a child of that part when we've > generated the files. > > Aside: The reasoning is that we're working with legislation which > places constraints on how we structure the presentation. The sample > below is taken from a unit test and represents the most complex case; > in the most common case there is no end matter but we still pay a huge > performance penalty! > > Sample doc: > > > > a > > b > c > > > d > e > f > g > > > > > > Desired output: > 1.xml > > > a > > > > > 2.xml - > b > > 3.xml - > c > > 4.xml - > de > > 5.xml - > fg > > The Underperforming Solution > ===================== > > We pass in the list of nodes which will become documents as the > `docNodes` parameter. I've elided all the logic irrelevant to the > immediate problem, hopefully I haven't elided anything critical to > following the logic here. In all probability I've introduced > unnecessary complexity in the iteration towards a working solution. > > Unfortunately the profiler is not giving me very useful information, > basically pointing me towards this template without enough information > to deduce which XPath expressions are the culprits. > > > > > > > > > > > select="node()[preceding::*[. intersect \$docNodes][1] is \$final]" /> > > select="following-sibling::*[descendant-or-self::* intersect > \$docNodes][1]" /> > > select="following-sibling::node()[following-sibling::*[1] is \$next]" /> > > > > > > > > select="(\$child_endmatter, \$sibling_endmatter, \$endmatter)" /> > name="endmatter_dest" select="\$final" /> > > > > > > > > > > > > > > > > > > mode="chunk" /> > select="node()[not(parent::*/descendant::*[. intersect \$docNodes][1] > << .)]" mode="chunk"> > > > > select="\$child_endmatter, \$sibling_endmatter, \$endmatter" mode="chunk" /> > > > > > > > We have similar template to optimize for a common case where no > descendants will result in a new document (basically less convoluted > logic): > > .... > > The "chunk" mode is uninteresting, basically a copy to the output > document: > > > > > > > > > > > > > > > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_d2d_feb > > > _______________________________________________ > saxon-help mailing list archived at http://saxon.markmail.org/ > saxon-help@... > https://lists.sourceforge.net/lists/listinfo/saxon-help ```
 Re: [saxon] Help understanding Saxon performance problem From: John Barstow - 2013-02-22 00:37:50 Attachments: Message as HTML ```On Wednesday, February 20, 2013, Michael Kay wrote: > > Another possible culprit is the call on preceding::* in > > node()[preceding::*[. intersect \$docNodes][1] is \$final] > > I think the question here is, how big is \$docNodes? Testing the intersect > operator will be linear in the size of \$docNodes. > This seems to be the issue. A small document with 40 \$docNodes runs in about a second, while a more typical document has about 800 \$docNodes and takes 1.5 hours. There are quite a lot of expressions that do an intersection test, so switching to a set of keys will definitely help. Once again your undisputed mastery of XSLT lights the way. ```