Quick processing with DocumentBuilder

  • AlecF

    AlecF - 2011-05-08

    I'm using the XPath support in Saxon 9.3 HE to extract text from blog pages
    posted by friends and colleagues. Creating an XdmNode from an XHTML file
    with the s9api DocumentBuilder always takes 100+ seconds, and I am
    struggling to work out how to build an XdmNode from the original web pages
    quickly. In contrast, when I experiment with a well-formed XML file, the
    build method completes in under a second.

    Document validation is not required for my purpose so I explicitly disable it.
    Also, I have read that Saxon prefers StreamSource and SAXSource. In my
    experience, though, calling build with either as a Source does not reduce the
    time requirement.

    A snippet of my code follows for illustration.

            DocumentBuilder builder = processor.newDocumentBuilder();
            builder.setSchemaValidator(null); // disable validation
            // build from previously fetched xhtml file (10 - 30KB)
            // HtmlTidy used to clean up page first and temporarily save to disk
            // 100+ seconds required to build an XdmNode
            XdmNode doc = builder.build(new File(fileName));
            // xpath compile quick and successful
            XPathCompiler xPathCompiler = processor.newXPathCompiler();
            xPathCompiler.declareNamespace("", "http://www.w3.org/1999/xhtml");
            XPathExecutable xPathExecutable = xPathCompiler.compile(xpath);

    My question: what technique should I use to reduce the time required to
    build a document?

    Thank you for your consideration.

  • Michael Kay

    Michael Kay - 2011-05-08

    My first guess would be that the time is being spent fetching the XHTML DTD
    from a web server. If that's the case, the answer is to redirect the DTD
    references to a local copy by using a catalog resolver. Note that the DTD will
    be fetched whether or not you are performing validation.
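
To illustrate the point above, here is a minimal sketch using only the JDK's SAX parser, not Saxon and not a catalog. It shows that a non-validating parse still asks for the external DTD, and that an EntityResolver can answer the request locally. The class name `OfflineDtdSketch`, the method `parseOffline`, and the choice to return an empty DTD are assumptions made for this illustration. With an empty DTD, named entities such as `&nbsp;` are undefined, so a real setup would point the resolver (or a catalog) at local copies of the XHTML DTD files instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class OfflineDtdSketch {

    /** Parses the document and returns the DTD system IDs that were answered locally. */
    static List<String> parseOffline(String xhtml) throws Exception {
        List<String> intercepted = new ArrayList<>();

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(false); // validation is off, yet the DTD is still requested
        XMLReader reader = factory.newSAXParser().getXMLReader();

        // Hypothetical resolver: answer every XHTML DTD request with an empty DTD
        // so that the parser never opens a network connection for it.
        reader.setEntityResolver((publicId, systemId) -> {
            if (publicId != null && publicId.startsWith("-//W3C//DTD XHTML")) {
                intercepted.add(systemId);
                return new InputSource(new StringReader(""));
            }
            return null; // anything else: let the parser use its default behaviour
        });

        reader.setContentHandler(new DefaultHandler());
        reader.parse(new InputSource(new StringReader(xhtml)));
        return intercepted;
    }

    public static void main(String[] args) throws Exception {
        String page =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
          + "  \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
          + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
          + "<head><title>Hello</title></head><body><p>Hi</p></body></html>";
        System.out.println("DTD requests served locally: " + parseOffline(page));
    }
}
```

The same XMLReader could be wrapped in a SAXSource and handed to the s9api DocumentBuilder, which is the route taken later in this thread.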

  • AlecF

    AlecF - 2011-06-19

    Thank you for the diagnosis. I used
    org.apache.xml.resolver.tools.CatalogResolver and my initial parse time
    dropped to < 2 seconds. As a hint to others, in my case I created a
    CatalogManager.properties file and a catalog file, then added a
    CatalogResolver from the Xerces project in the following manner:

            XMLReader reader = XMLReaderFactory.createXMLReader();
            // attach the catalog resolver so DTD lookups are served locally
            reader.setEntityResolver(new CatalogResolver());
            InputSource is = new InputSource(new FileReader(fileName));
            javax.xml.transform.sax.SAXSource saxSource = new SAXSource(reader, is);
            DocumentBuilder builder = processor.newDocumentBuilder();
            XdmNode doc = builder.build(saxSource);
            // continue with xpath handling
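
For readers who want to see what a catalog lookup does, the following is a rough sketch that swaps in the JDK's built-in `javax.xml.catalog` API (Java 9 and later) in place of the Apache resolver used above. Its `CatalogResolver` is a different class from `org.apache.xml.resolver.tools.CatalogResolver`. The class name `CatalogSketch`, the file name `catalog.xml`, and the single catalog entry are assumptions for illustration only. The actual catalog and CatalogManager.properties from the post are not shown in the thread.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import org.xml.sax.InputSource;

public class CatalogSketch {

    static final String XHTML_STRICT_PUBLIC_ID = "-//W3C//DTD XHTML 1.0 Strict//EN";
    static final String XHTML_STRICT_SYSTEM_ID =
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd";

    /** Writes a one-entry OASIS XML catalog into dir and returns a resolver backed by it. */
    static CatalogResolver resolverFor(Path dir) throws Exception {
        // The uri attribute is relative, so it is resolved against the catalog's location.
        String catalog =
            "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">\n"
          + "  <public publicId=\"" + XHTML_STRICT_PUBLIC_ID + "\"\n"
          + "          uri=\"xhtml1-strict.dtd\"/>\n"
          + "</catalog>\n";
        Path catalogFile = dir.resolve("catalog.xml");
        Files.write(catalogFile, catalog.getBytes(StandardCharsets.UTF_8));
        return CatalogManager.catalogResolver(CatalogFeatures.defaults(), catalogFile.toUri());
    }

    /** Returns the local location that the catalog maps the XHTML Strict DTD to. */
    static String resolvedLocation(Path dir) throws Exception {
        InputSource local =
            resolverFor(dir).resolveEntity(XHTML_STRICT_PUBLIC_ID, XHTML_STRICT_SYSTEM_ID);
        return local.getSystemId();
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("catalog-sketch");
        // A local copy of xhtml1-strict.dtd (and the entity files it references)
        // would need to be placed in this directory before parsing real pages.
        System.out.println("DTD redirected to: " + resolvedLocation(dir));
    }
}
```

Because the JDK resolver also implements org.xml.sax.EntityResolver, it could be passed to `reader.setEntityResolver(...)` in the same position as the resolver in the snippet above.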
