XMLFilters to split documents (was: preview)

2004-05-12
2012-10-08
  • Gunther Schadow

    Gunther Schadow - 2004-05-12

    Hello,

    I have a little XMLFilter that splits a document into multiple documents at the second-level element. This is to split up HUGE files into smaller chunks to transform. Problem is SAXON doesn't handle startDocument and endDocument consistently.

    Example: HUGE document:

    <root>
      <record id="1"> ... </record>
      <record id="2"> ... </record>
      ...
      <record id="1234483822934883"> ... </record>
    </root>

    XMLSplitter filter does this:

    startDocument()
    <record id="1"> ... </record>
    endDocument()

    startDocument()
    <record id="2"> ... </record>
    endDocument()

    ...

    startDocument()
    <record id="1234483822934883"> ... </record>
    endDocument()

    If the final transformer is the IdentityTransformer, all is fine. But if I use a real SAXON transformer at the end or in the middle of a longer filter chain, it only transforms the first document and then stops.

    Would it be against the relevant specifications if the transformer ran the transformation at endDocument and then, at the next startDocument, reset its state and started over again?

    I think this would be a nice replacement for the lost preview mode.

    regards,
    -Gunther

     
    • Gunther Schadow

      Gunther Schadow - 2004-05-12

      This task is a conundrum. I have been chewing on it for several hours now with no workable solution in sight.

      What I want is a single filter pipeline where one node in the chain transforms smaller sub-documents of a big document to avoid building a humongous tree.

      The problem is that at the point when we catch the startElement event of the node that should become the top node of the sub-document, we need to return right away in order for anything to move forward. However, the only way to invoke a Transformer is using the synchronous transform method, which will block.

      Somehow the parent XML parser must receive back the control from startElement and still the Transformer must be initialized and ready to become the ContentHandler for the next event.

      This does not seem possible. That's bad.

      Either the Controller should allow multiple documents to come down the pipeline or we need an asynchronous mode of invoking it, e.g.,

      Transformer transformer = transformerFactory.newTransformer(new StreamSource(xsltFileName));

      startElement(...) {

        if (...) // second-level element
        {
          SAXResult result = new SAXResult();
          ...

          // getReadyForSAXEvents() is the proposed asynchronous call; TRAX has no such method
          ContentHandler contentHandler = transformer.getReadyForSAXEvents(result);

          parent.setContentHandler(contentHandler);

          return;
        }
        ...
      }

      How hard would it be?

      -Gunther

       
      • Michael Kay

        Michael Kay - 2004-05-13

        This is surely what the TransformerHandler is designed to do. You create a TransformerHandler from your Templates object. This is a ContentHandler, so you can feed it all the SAX events. Saxon's implementation simply builds the source tree in response to these events, and does the transformation during the endDocument() call.
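As a rough illustration of the point above, here is a minimal sketch of pushing SAX events into a TransformerHandler by hand. It assumes the default JDK TransformerFactory supports the SAX feature (Saxon's does as well); the class name, the tiny stylesheet and the `record` element are made up for the example and are not taken from the original filter.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.helpers.AttributesImpl;

public class TransformerHandlerDemo {
    // Hypothetical stylesheet, for illustration only: prints the id of a <record>.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/record'>id=<xsl:value-of select='@id'/></xsl:template>"
      + "</xsl:stylesheet>";

    static String run() throws Exception {
        TransformerFactory tf = TransformerFactory.newInstance();
        if (!tf.getFeature(SAXTransformerFactory.FEATURE)) {
            throw new IllegalStateException("factory does not support SAX handlers");
        }
        SAXTransformerFactory factory = (SAXTransformerFactory) tf;

        // Compile the stylesheet once; the Templates object can be reused.
        Templates templates = factory.newTemplates(new StreamSource(new StringReader(XSLT)));

        // The handler is an ordinary ContentHandler: no blocking transform() call.
        TransformerHandler handler = factory.newTransformerHandler(templates);
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out)); // set the result before any events

        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "id", "id", "CDATA", "1");

        // Each call returns immediately; the result is complete after endDocument().
        handler.startDocument();
        handler.startElement("", "record", "record", atts);
        handler.endElement("", "record", "record");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```

Every callback returns to the caller right away, so an upstream parser or filter can keep delivering events, and the output is available once endDocument() has returned.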

        Michael Kay

         
    • Michael Kay

      Michael Kay - 2004-05-13

      Another observation: Saxon's implementation of TransformerHandler is not at present serially reusable. It wouldn't be too difficult to make it so (the main effort is in testing), but there really isn't a need. I generally recommend creating a new Transformer for each transformation, and the same goes for the TransformerHandler: just create a new one before calling startDocument() to process each mini-document. Reusing Transformers means that the document pool (which holds all source documents used during a transformation) is not cleared out, which consumes memory unnecessarily.
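The advice above (one fresh TransformerHandler per mini-document, all created from one compiled Templates object) might look roughly like the following sketch. It is not the TRAXPipe code; the class name, the demo stylesheet and the fixed "depth 2" split rule are assumptions for illustration, and namespace prefix mappings, comments and processing instructions are not forwarded.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SplitAndTransform extends DefaultHandler {
    // Hypothetical stylesheet, for illustration only.
    static final String DEMO_XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/record'>id=<xsl:value-of select='@id'/></xsl:template>"
      + "</xsl:stylesheet>";

    private final SAXTransformerFactory factory;
    private final Templates templates;
    final List<String> results = new ArrayList<>();
    private int depth = 0;
    private TransformerHandler current; // non-null while inside a sub-document
    private StringWriter out;

    SplitAndTransform(SAXTransformerFactory factory, Templates templates) {
        this.factory = factory;
        this.templates = templates;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        depth++;
        if (depth == 2) { // second-level element: open a new mini-document
            try {
                current = factory.newTransformerHandler(templates); // fresh handler each time
            } catch (TransformerConfigurationException e) {
                throw new SAXException(e);
            }
            out = new StringWriter();
            current.setResult(new StreamResult(out));
            current.startDocument();
        }
        if (current != null) current.startElement(uri, local, qName, atts);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (current != null) current.characters(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (current != null) current.endElement(uri, local, qName);
        if (depth == 2) { // close the mini-document; the output is complete after this
            current.endDocument();
            results.add(out.toString());
            current = null;
        }
        depth--;
    }

    static List<String> run(String xml, String xslt) throws Exception {
        SAXTransformerFactory factory = (SAXTransformerFactory) TransformerFactory.newInstance();
        Templates templates = factory.newTemplates(new StreamSource(new StringReader(xslt)));
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        SplitAndTransform splitter = new SplitAndTransform(factory, templates);
        reader.setContentHandler(splitter);
        reader.parse(new InputSource(new StringReader(xml)));
        return splitter.results;
    }
}
```

Only one sub-document tree is alive at any time, and the stylesheet is compiled once, so the per-record cost is limited to creating the handler and building the small tree.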

      Michael Kay

       
    • Gunther Schadow

      Gunther Schadow - 2004-05-14

      Thanks, Michael. However, whether or not one creates a new Transformer each time (I trust that one can arrange for the XSLT source to be parsed only once), the real problem is that the SAX and TRAX protocols are incompatible for a use case like this.

      Have you ever actually done this, i.e., used an XMLFilter to break a big document into a sequence of small documents, and then called the Transformer on each one? After going in circles on this for several hours, I am convinced that it is impossible to do in a straightforward way.

      The dilemma is that (1) the startElement event call-back needs to return in order for the upstream parser to continue producing events, and (2) the only way to invoke a Transformer is through the blocking transform() method. (1) and (2) are inherently incompatible.

      The solutions I know of are all messy: either hack wildly in the SAXON Controller to provide a means of calling it asynchronously, or create a DOM (or just an XML character buffer) of the sub-document and process it at the endElement event. Both solutions involve unnecessary overhead.

      While this is an issue with the TRAX and SAX specifications, a nice work-around would be for SAXON to pilot an alternative TRAX API as I outlined above: one where SAXON presents itself (or a proxy of itself) as a ContentHandler that can be sent SAX events, rather than asking for a SAXSource in a blocking transformer invocation.

      EXAMPLE FOR THE PROBLEM:

      Hmm ... could it be that the XMLFilter is such a proxy? Could the protocol go like this:

      XMLReader <---parent--- XMLSplitterFilter <---...

      On startElement() where a sub-document starts, XMLSplitterFilter creates a new SAXON XMLFilter and another XMLSplitter.Receiver that catches the endDocument(). Then it will return from startElement(). ... AHHHHHH! I'm going crazy ... this still doesn't work because the only way to start the new XMLFilter transformer is by invoking it synchronously through its XMLReader.parse() interface.

      What would really be needed is for the SAX specification to allow pushing multiple documents through a single pipe, with the semantics that any transformer resets itself upon receipt of the startDocument() event. The startDocument event should probably be given a String systemId argument as well.

       
      • Michael Kay

        Michael Kay - 2004-05-14

        The mechanism you describe "One where SAXON presents itself (or a proxy of itself) as a ContentHandler that can be sent SAX events, rather than asking for a SAXSource in a blocking transformer invocation." sounds precisely like the JAXP TransformerHandler. Can you explain why the TransformerHandler doesn't fit the bill?

        Michael Kay

         
    • Gunther Schadow

      Gunther Schadow - 2004-05-14

      ... Aha! You may be right. This could be it! Wonderful. I'll try it right away; it sounds like this is the thing I was missing.

       
    • Gunther Schadow

      Gunther Schadow - 2004-05-16

      Hurray! That was it. Thanks for giving me that hint about the TransformerHandler; I had completely forgotten about it. To give back to the group, I updated my TRAXPipe stuff in the "patch" section of this project. It now has document splitter and joiner XMLFilters, and by default each transformer node can handle a sequence of documents thanks to the TransformerHandler.

      regards
      -Gunther

       
