Sorry I'm late responding, but I hope the answers are useful anyway.

On 18 Feb 2014, at 14:57, Rademacher, Gunther <> wrote:

I am trying to process a large XML document from Java, such that the
application can pull results in linear time, i.e. independent of the
document size.

I would understand "independent of the document size" as meaning constant time, not linear time, so I'm slightly confused about which you mean.

Parsing the document will always take linear time, and will often dominate query execution time. This is true whether or not you are streaming.
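One way to see how the two costs compare is to time the parse and the query as separate phases. The sketch below is only an illustration: it uses the JDK's built-in DOM parser and XPath engine rather than Saxon, the class and method names (ParseVersusQuery, makeDocument, measure) are made up for the example, and the absolute numbers it prints say nothing about Saxon's performance.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: JDK DOM and XPath stand in for Saxon here.
class ParseVersusQuery {

    // Builds a synthetic document with the given number of child elements.
    static String makeDocument(int children) {
        StringBuilder sb = new StringBuilder("<root>");
        for (int i = 0; i < children; i++) {
            sb.append("<item>").append(i).append("</item>");
        }
        return sb.append("</root>").toString();
    }

    // Returns {parse nanoseconds, query nanoseconds, number of nodes selected}.
    static long[] measure(String xml) throws Exception {
        long t0 = System.nanoTime();
        // Phase 1: parse the text and build the in-memory tree.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        long t1 = System.nanoTime();
        // Phase 2: compile and run the query against the tree built above.
        // Nothing is parsed a second time.
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("/*/*", doc, XPathConstants.NODESET);
        long t2 = System.nanoTime();
        return new long[] {t1 - t0, t2 - t1, nodes.getLength()};
    }

    public static void main(String[] args) throws Exception {
        for (int size : new int[] {1_000, 10_000, 100_000}) {
            long[] r = measure(makeDocument(size));
            System.out.printf("children=%d parse=%d us query=%d us selected=%d%n",
                    size, r[0] / 1_000, r[1] / 1_000, r[2]);
        }
    }
}
```

Running it at several document sizes shows how each phase scales. The first iteration also pays JVM warm-up costs, so it is worth repeating the measurement before drawing conclusions.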

Query execution time may be constant in some cases; in other cases it is so much faster than document parsing that it appears constant, even though it isn't. And of course it may be linear in (say) the maximum number of siblings of a single parent node, which may or may not increase with document size.

My naive attempt was to use the iterator from XQueryEvaluator like this:
      XQueryExecutable xqueryExecutable = xqeryCompiler.compile("doc('" + input + "')/*/*/string()");
      XQueryEvaluator query = xqueryExecutable.load();
      query.setSource(new StreamSource(new FileInputStream(new File(new URI(input)))));
      return query.iterator();

The first point to note is that this won't do streaming; it will build the document tree in memory. The parsing and tree building occur within the setSource() method.

however a fair amount of time (depending on document size) is spent when
setSource is called,

That makes sense: this is the parsing and tree building.

and most of the remaining time goes into fetching
the first result:
      t(0): 3885 msec
      t(1): 8628 msec
      t(65536): 8659 msec
      t(503524): 9814 msec

I'm not sure what these numbers actually represent. Is this the time taken to fetch the first result, as a function of the size of the document?

I think it's possible that you are parsing the source document twice: once when calling setSource(), and once when calling the doc() function. This won't happen if Saxon recognizes that both refer to the same absolute URI, but that could easily not be the case. You don't need to supply the same input document twice.

Also tried saxon:stream, or using with a SAXDestination, but no success.

That's a completely different approach and the metrics are going to be quite different, so we need to look at it separately.


Michael Kay