On 11/02/2011 00:57, CRB wrote:
Possibly an exercise in the ridiculous - but I am stumped by what I thought would be rather simple: using XSLT to partition a large log file (20 MB) into multiple smaller files (4 MB).

Here is what I have:

   <xsl:variable name="input" select="unparsed-text('ServerLog.csv')"/>
    
   <xsl:template match="/">
       
       <rawsearchlog>
           <xsl:for-each-group select="tokenize($input, '\n')" group-by="position() mod 2000 = 0">
               <set>
                   <xsl:for-each select="current-group()">
                       <xsl:element name="row">
                           <xsl:sequence select="."/>
                       </xsl:element>
                   </xsl:for-each>
               </set>
           </xsl:for-each-group>
       </rawsearchlog>
       
   </xsl:template>

The value of the group-by key here (position() mod 2000 = 0) is a boolean, so you end up with only two groups. You'll get lines 2000, 4000, 6000, etc. in one group, and all the other lines in the other group.

Try instead:

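<xsl:for-each-group select="tokenize($input, '\n')" group-adjacent="(position() - 1) idiv 2000">

The key must be an integer division (idiv), not mod: it is 0 for lines 1 to 2000, 1 for lines 2001 to 4000, and so on, so each run of 2000 adjacent lines shares one key. With mod, every line would have a different key from its neighbour and each group would hold a single line.

Note that group-adjacent is generally likely to be more efficient than group-by.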
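
For illustration, here is a minimal sketch of a complete XSLT 2.0 stylesheet that groups the lines into chunks and writes each chunk to its own file with xsl:result-document. The chunk-size parameter and the chunk-N.xml file names are hypothetical, chosen only for this example; ServerLog.csv is the file from the original post. It is one possible arrangement, not a tested solution for the 20 MB log.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical parameter: number of lines per output file -->
  <xsl:param name="chunk-size" as="xs:integer" select="2000"/>

  <!-- The log is read as plain text, relative to the stylesheet location -->
  <xsl:variable name="input" select="unparsed-text('ServerLog.csv')"/>

  <xsl:template match="/">
    <!-- Lines 1..2000 get key 0, lines 2001..4000 get key 1, and so on -->
    <xsl:for-each-group select="tokenize($input, '\n')"
                        group-adjacent="(position() - 1) idiv $chunk-size">
      <!-- Hypothetical file names: chunk-1.xml, chunk-2.xml, ... -->
      <xsl:result-document href="chunk-{current-grouping-key() + 1}.xml">
        <set>
          <xsl:for-each select="current-group()">
            <row><xsl:value-of select="."/></row>
          </xsl:for-each>
        </set>
      </xsl:result-document>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

Because the template matches "/", the transformation still needs some source document, even a dummy one. The relative href values are resolved against the base output URI, so where the chunk files end up depends on how the processor is invoked.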

Multiple outputs aside for the moment, I find myself challenged just to get the grouping of a sequence. The above runs but does not partition into groups of 2000.

Alternatively, I had been thinking the group-by would be something like:

group-by=". | following-sibling::node()[position() &lt; 2000]"

The items in a sequence are not (in general) siblings of each other. To be siblings, two items need to have a common parent in an XML tree. tokenize() produces strings, which don't have a parent because they are not nodes.
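
To make the point concrete, the following sketch shows what would be needed before a sibling axis could apply: the strings are first wrapped in row elements inside a temporary tree, so that they share a parent. The variable name rows is an assumption made for this example. This is shown only to illustrate the node/string distinction; the group-adjacent approach avoids building the intermediate tree.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:variable name="input" select="unparsed-text('ServerLog.csv')"/>

  <!-- Temporary tree: each line becomes a row element, so the rows
       share a parent (the document node) and are genuine siblings -->
  <xsl:variable name="rows">
    <xsl:for-each select="tokenize($input, '\n')">
      <row><xsl:value-of select="."/></row>
    </xsl:for-each>
  </xsl:variable>

  <xsl:template match="/">
    <rawsearchlog>
      <!-- Rows 1, 2001, 4001, ... each start a new set -->
      <xsl:for-each select="$rows/row[position() mod 2000 = 1]">
        <set>
          <!-- The starting row plus the 1999 rows that follow it -->
          <xsl:copy-of select=". | following-sibling::row[position() &lt; 2000]"/>
        </set>
      </xsl:for-each>
    </rawsearchlog>
  </xsl:template>

</xsl:stylesheet>
```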

Michael Kay
Saxonica

(this question isn't actually Saxon-specific, so it would be better posted on a general XSLT forum such as the xsl-list at mulberrytech.com)