Thanks for the reply, Mike. Yes, I have read this paper (though not in detail), and as you said it looks very promising for the huge documents we deal with here. I was trying to run XQuery against an 8.7GB file, which as you know is really huge, and we came up with an idea similar to document projection. Initially we thought of using XSLT to strip down the XML, but after some initial tests we asked ourselves why we couldn't use a regular streaming parser such as SAX instead. We ended up using StAX to eliminate the unwanted XML elements, which reduced the size from 8.7GB to around 500MB.
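
For anyone following the thread, here is a minimal sketch of the kind of StAX filtering I mean. It is not our production code: the class name ElementStripper and the sample element names are made up for illustration, and it assumes the unwanted elements can be identified by local name alone. It copies events from input to output and skips the whole subtree of any unwanted element.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Set;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of Saxon or of the StAX API.
class ElementStripper {

    /** Copies XML from in to out, dropping each unwanted element and its subtree. */
    static void strip(Reader in, Writer out, Set<String> unwanted) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        int skipDepth = 0; // > 0 while we are inside an unwanted subtree
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (skipDepth > 0) {
                // Track nesting so we know when the unwanted subtree ends.
                if (event.isStartElement()) {
                    skipDepth++;
                } else if (event.isEndElement()) {
                    skipDepth--;
                }
                continue;
            }
            if (event.isStartElement()
                    && unwanted.contains(event.asStartElement().getName().getLocalPart())) {
                skipDepth = 1;
                continue;
            }
            writer.add(event);
        }
        writer.flush();
        writer.close();
        reader.close();
    }

    /** Convenience wrapper for small in-memory documents. */
    static String strip(String xml, Set<String> unwanted) {
        StringWriter out = new StringWriter();
        try {
            strip(new StringReader(xml), out, unwanted);
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not filter XML", e);
        }
        return out.toString();
    }
}
```

Because only one event is held at a time, memory use does not grow with the size of the input, which is why this worked for us where loading the full tree did not.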

 

We want to do this before we pass the document to the XQuery engine, because our XQuery itself is complicated (simple, but it needs to do a lot of things) and we did not want to overload the XQuery engine. The challenge is how to get the list of elements that the XQuery uses. Then I thought that since Saxon (StaticQueryContext) analyzes the XQuery, it might have this information, which I could use to strip down the XML and then pass the stripped XML stream on for execution. If you have more pointers on this I would appreciate them; otherwise that is fine. I will investigate your code further and come back with specific questions. Thanks for your detailed explanation.

 

>Most of the static analysis that it requires is already done by Saxon (at least in Saxon-SA), though not all.

 

What do you mean by this statement? Do you mean that Saxon should already have the entire list of elements that the XQuery requires from the input XML? What are your plans for implementing document projection in future versions of Saxon?

 

>If you've got time, I suggest you study the paper and see how much of the required information can be obtained from the existing >Saxon expression tree.

           

I will read this paper thoroughly and come back with specific questions. (I need to understand the Saxon expression tree first.)

 

>(Alternatively, you could work from the parsed query in XQueryX form: but you would then have to perform a lot more of the analysis >yourself.)

 

I will skip this for now and come back to it only if nothing else works for me.

 

Thanks,

Srinivas


From: saxon-help-bounces@lists.sourceforge.net [mailto:saxon-help-bounces@lists.sourceforge.net] On Behalf Of Michael Kay
Sent: Monday, August 28, 2006 3:41 PM
To: 'Mailing list for SAXON XSLT queries'
Subject: Re: [saxon] Saxon Grouping Extension Function (Xquery)

 

 > I have a quick question. I was investigating the Saxon code to find out how the StaticQueryContext compiles XQuery (it seems complicated :-) ).

 

Yes, it's complicated! There are six main phases of processing:

 

* parsing, which constructs an abstract syntax tree (a tree of Expression objects)

 

* binding of variable and function references to their declarations

 

* simplification, which does some very simple context-independent rewrites

 

* type checking, which checks expressions against the type checking rules, decides the static type of each expression, and generates extra code (extra nodes in the expression tree) to perform run-time checks and conversions

 

* optimization, which does more complex rewrites such as moving expressions out of loops

 

* slot allocation: defining where variables will be stored on the local stackframe.

 

 

>Is it possible to find out which elements (from the input XML) a given XQuery is interested in? As you suggested, I need to strip down my XML using some such technique.

 

This technique is sometimes called "document projection", and is described in a paper by Amelie Marian and Jerome Simeon at http://www-db.research.bell-labs.com/user/simeon/xml_projection.pdf. It looks a very promising technique. Most of the static analysis that it requires is already done by Saxon (at least in Saxon-SA), though not all. If you've got time, I suggest you study the paper and see how much of the required information can be obtained from the existing Saxon expression tree. (Alternatively, you could work from the parsed query in XQueryX form: but you would then have to perform a lot more of the analysis yourself.)

 

Michael Kay

http://www.saxonica.com/

 

 

 

Thanks,

Srinivas Kusunam

 


From: saxon-help-bounces@lists.sourceforge.net [mailto:saxon-help-bounces@lists.sourceforge.net] On Behalf Of Michael Kay
Sent: Friday, August 25, 2006 3:09 PM
To: 'Mailing list for SAXON XSLT queries'
Subject: Re: [saxon] Saxon Grouping Extension Function (Xquery)

 

My first thought was that you would be best off doing this using a sort, followed by group-adjacent. The group-adjacent functionality is available in XSLT, but not in XQuery, even with the saxon:for-each-group extension. In XQuery, you would have to implement the group-adjacent logic using a recursive scan, which imposes its own stresses with this kind of data volume. On reflection, however, I don't think the sort would use any less memory than for-each-group.
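
To make the comparison concrete, here is a rough sketch, in Java rather than XSLT, of what "sort, then group adjacent" amounts to when all you want is a count per key. The class and method names are invented for illustration and this is not how Saxon implements grouping. Note that the sorted copy still holds every key in memory, which is why the sort buys nothing over for-each-group.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not Saxon code.
class AdjacentGrouping {

    /** Sorts the keys, then counts runs of equal adjacent keys in a single pass. */
    static List<String> countRuns(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys); // full copy: same order of memory as the input
        Collections.sort(sorted);
        List<String> result = new ArrayList<>();
        String current = null;
        int count = 0;
        for (String key : sorted) {
            if (!key.equals(current)) {
                // Key changed: emit the finished group and start a new one.
                if (current != null) {
                    result.add(current + "=" + count);
                }
                current = key;
                count = 0;
            }
            count++;
        }
        if (current != null) {
            result.add(current + "=" + count); // emit the last group
        }
        return result;
    }
}
```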

 

In XSLT, the logic is simply

 

<xsl:for-each-group select="Title" group-by="ModelYear">
  <distribution>
    <value><xsl:value-of select="current-grouping-key()"/></value>
    <count><xsl:value-of select="count(current-group())"/></count>
  </distribution>
</xsl:for-each-group>

 

and I would suggest you try that first.

 

First check that you can actually load the document into memory (e.g. by running a query such as count(//*)). If that fails then you're going to have to do something to reduce its size by pre-filtering. If it does load into memory, then the above code adds a requirement to hold a hash table containing one entry for each key value, mapped to a list containing object references to the nodes with that key. That's likely to be much smaller than the tree itself.
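
As a rough picture of that hash table (a sketch only, with a made-up Title class standing in for the element nodes, and not Saxon's actual data structure), the grouping amounts to something like this in Java:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; Saxon's internal classes are different.
class GroupByIndex {

    /** Stand-in for a Title element node in the tree. */
    static final class Title {
        final String modelYear;

        Title(String modelYear) {
            this.modelYear = modelYear;
        }
    }

    /** One map entry per key value, mapped to references to the nodes with that key. */
    static Map<String, List<Title>> groupByModelYear(List<Title> titles) {
        // LinkedHashMap keeps groups in order of first appearance.
        Map<String, List<Title>> groups = new LinkedHashMap<>();
        for (Title title : titles) {
            groups.computeIfAbsent(title.modelYear, k -> new ArrayList<>()).add(title);
        }
        return groups;
    }

    /** Produces "value=count" for each group, like the distribution elements above. */
    static List<String> distribution(List<Title> titles) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<Title>> group : groupByModelYear(titles).entrySet()) {
            result.add(group.getKey() + "=" + group.getValue().size());
        }
        return result;
    }
}
```

The lists hold only references to nodes that already exist in the tree, so the extra cost is one reference per node plus one entry per distinct key.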

 

You can of course write the above using the XQuery saxon:for-each-group construct if you really want, but I'm not sure why you would want to: a standard XSLT solution seems better than a non-standard XQuery one.

 

(There's a fairly easy Saxon optimization I could implement to detect that the only thing you are doing with the group is to count its members: but before implementing an optimization, I have to ask how many people would benefit from it).
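
To illustrate the idea (this is only a sketch of the principle with invented names, not the optimization as it would appear in Saxon), if the group is used only for count(), the table can hold a running counter per key instead of a list of node references:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Saxon code.
class CountOnlyGrouping {

    /** Keeps one integer per distinct key instead of a list of group members. */
    static Map<String, Integer> countByKey(List<String> keys) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum); // insert 1, or add 1 to the existing count
        }
        return counts;
    }
}
```

Memory then depends on the number of distinct key values rather than on the number of nodes.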

 

The performance of this should be fine so long as you don't run out of memory.

 

Michael Kay

http://www.saxonica.com/

 

*****************************************************************
This message has originated from RLPTechnologies,
26955 Northwestern Highway, Southfield, MI 48033.
 
RLPTechnologies sends various types of email
communications.  If this email message concerns the
potential licensing of an RLPT product or service, and
you do not wish to receive further emails regarding Polk
products, forward this email to Do_Not_Send@rlpt.com
with the word "remove" in the subject line.
 
The email and any files transmitted with it are confidential
and intended solely for the individual or entity to whom they
are addressed.
 
If you have received this email in error, please delete this
message and notify the Polk System Administrator at
postmaster@rlpt.com.
*****************************************************************
 