From: Adam R. <ad...@ex...> - 2010-07-22 08:45:54
Hmm, an interesting question...

Basically, your query constructs a LOT of strings and then throws away the
ones it does not need, or at least offers them up to the GC (via
distinct-values). The problem is that for every node in every document you
are creating a string of its full path. So for 100,000 documents, if each
document has maybe 100 nodes, that is 100,000 * 100 = 10,000,000 strings of
perhaps substantial length.

If you know how many nodes you have and the average string length of a path
representation, you could make a stab at the memory calculation, since eXist
is built on Java and it is possible to calculate the memory usage of a
string:
http://www.javamex.com/tutorials/memory/string_memory_usage.shtml

Do you need this information just once, or is it something you need
frequently? It may be something that would be better done at document
storage time, by means of a trigger or some sort of plugin to the indexing
pipeline.

On 22 July 2010 03:42, David Elwell <el...@ve...> wrote:
>
> I am new to XQuery and eXist & trying to get a handle on a fairly large
> collection of XML files--approaching 100,000. The files are
> consistently structured, so I am not anticipating more than a couple
> hundred paths.
>
> Thanks to http://www.xqueryfunctions.com/ I have managed to construct a
> query that lists elements more or less hierarchically:
>
> ---------------------------------------------------------------
>
> xquery version "1.0";
> declare namespace functx = "http://www.functx.com";
> declare function functx:path-to-node
>   ( $nodes as node()* ) as xs:string* {
>
>   $nodes/string-join(ancestor-or-self::*/name(.), '/')
> } ;
> declare function functx:distinct-element-paths
>   ( $nodes as node()* ) as xs:string* {
>
>   distinct-values(functx:path-to-node($nodes/descendant-or-self::*))
> } ;
> declare function functx:sort
>   ( $seq as item()* ) as item()* {
>
>   for $item in $seq
>   order by $item
>   return $item
> } ;
> let $in-xml := collection("NAMEOFCOLLECTION")
> return
>   functx:sort(functx:distinct-element-paths(
>     $in-xml))
>
> ------------------------------------------------------------------
>
> In eXist's sandbox, this generates a numbered, alphabetical list of
> nodes & descendants for test collections of a few files, but I haven't
> figured out how to run such a query on the complete collection of tens
> of thousands of files without exceeding memory and other limits.
>
> My intention is to develop a query that would allow me to catalog the
> node structure of any collection as a first step toward constructing
> more specific queries. As someone who is in over his head, I have a few
> questions, and even a partial answer to any one of them would be
> helpful:
>
> 1. Is this something that has been done already, as XQuery or by some
> other means? If so, where can I find more information on the proper
> tools and approach?
>
> 2. Is this task impossible or simply not worth the time and trouble for
> such a large collection? How does a smart user estimate system
> limitations in advance of running a query?
>
> 3. How should I adapt or replace my query to address a large
> collection?
>
> 4. How should I improve the query to make it more useful and
> informative as a diagram of a collection?
>
> 5. Am I asking the wrong questions?
>
> David Elwell
>

--
Adam Retter

eXist Developer
{ United Kingdom }
ad...@ex...
irc://irc.freenode.net/existdb
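
As a rough illustration of the memory estimate suggested in the reply above, here is a minimal Java sketch. The document and node counts come from the thread; the average path length of 40 characters and the per-object sizes (a 24-byte String object, a 12-byte char array header, 2 bytes per character, 8-byte alignment) are assumptions based on the layout of older 32-bit JVMs, not figures measured on eXist. The class and method names are hypothetical.

```java
// Back-of-envelope estimate of the heap needed if every path string is live at once.
// All size constants are ASSUMPTIONS for an older 32-bit JVM, not measured values.
final class PathStringMemoryEstimate {

    static final long STRING_OBJECT_BYTES = 24;     // assumed: object header + char[] reference + 3 int fields
    static final long CHAR_ARRAY_HEADER_BYTES = 12; // assumed: object header + array length field
    static final long BYTES_PER_CHAR = 2;           // Java chars are 16-bit
    static final long ALIGNMENT = 8;                // assumed: objects padded to 8-byte boundaries

    // Round a size up to the next multiple of the alignment.
    static long roundUp(long bytes) {
        return ((bytes + ALIGNMENT - 1) / ALIGNMENT) * ALIGNMENT;
    }

    // Estimated bytes for one String holding the given number of characters.
    static long bytesPerString(int chars) {
        return STRING_OBJECT_BYTES + roundUp(CHAR_ARRAY_HEADER_BYTES + BYTES_PER_CHAR * chars);
    }

    // Estimated bytes for one path string per node across the whole collection.
    static long totalBytes(long documents, long nodesPerDocument, int avgPathChars) {
        return documents * nodesPerDocument * bytesPerString(avgPathChars);
    }

    public static void main(String[] args) {
        long documents = 100_000;    // collection size mentioned in the thread
        long nodesPerDocument = 100; // "maybe 100 nodes" from the reply
        int avgPathChars = 40;       // hypothetical average path length

        long strings = documents * nodesPerDocument;
        long total = totalBytes(documents, nodesPerDocument, avgPathChars);
        System.out.printf("%,d strings x %d bytes = %,d bytes%n",
                strings, bytesPerString(avgPathChars), total);
    }
}
```

With these assumed figures the estimate comes to about 1.2 GB if all 10,000,000 path strings were held at the same time. The real footprint depends on the JVM in use and on how quickly the garbage collector can reclaim the discarded strings, so this is only an upper-bound style sanity check.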