From: Alistair M. <ali...@go...> - 2010-07-23 14:36:56
Hi all,

I don't know if this is useful, but last year I wrote a short Python script using the SAX API to generate some statistics on the paths in a large collection of XML (6 million MODS records), and used Hadoop to run it. Details at [1].

Cheers

Alistair

[1] http://dublincore.org/dcmirdataskgroup/DataConversion

On Fri, Jul 23, 2010 at 06:50:44AM -0500, Dan McCreary wrote:
> Hi David,
>
> I put together two examples of path analysis queries in the XQuery Wikibook
> here:
>
> http://en.wikibooks.org/wiki/XQuery#Path_Analysis
>
> These examples work fine for small XML files and using these tools and the
> oXygen tools that create XML Schemas from instance documents we now have
> tools that can do some very quick analysis on small documents. I also use a
> small library of "metrics" reports that count the number of nodes, elements,
> attributes etc. but I think these are not quite as useful.
>
> Adam is correct, most complex XML data sets do come with XML Schemas that
> are the ultimate guide to complex data sets. But sometimes we get TEI
> documents that don't have XML Schemas and it is very handy to be able to get
> a report of what TEI elements are used to annotate paragraphs, for example.
> We use these reports to configure the TEI annotator tools.
>
> What is missing are tools that work with very large collections of
> not-quite-uniform documents. The way I would start to look at these data
> sets is to build some basic XML Schemas and then run scripts that validate
> all documents in a collection against these XML Schemas.
>
> I should note that information about what elements are used within documents
> is stored in the indexes used by eXist to perform fast XQueries, but I do
> not know if these interfaces are exposed as XQuery functions. Using these
> interfaces could provide a very fast interface for very large collections.
>
> - Dan
>
> On Fri, Jul 23, 2010 at 3:51 AM, Adam Retter <ad...@ex...> wrote:
>
> > I have to say in my experience this is the first time someone has
> > wanted to see all distinct paths so that they can determine what they
> > are working with. Typically there is some advance
> > understanding of the data, its meaning and structure, perhaps from a
> > Schema or from the creator of the set.
> >
> > On 23 July 2010 04:25, David Elwell <el...@ve...> wrote:
> > > Adam--
> > >
> > > Thanks for pointing me to
> > > http://www.javamex.com/tutorials/memory/string_memory_usage.shtml. It's
> > > going to take me another reading or two to make sense of it.
> > >
> > > I am using the paths from the small sample collection to guide my
> > > construction of queries addressing the complete collection, so even
> > > though I probably don't _need_ the information, I am finding it useful.
> > > It just gnaws at me that I might be missing something if I can't
> > > catalog every distinct path in the collection.
> > >
> > > I don't foresee using such a query frequently on the same collection.
> > > It starts to answer the first question I had when confronted with a
> > > collection of XML files: Structurally, what am I dealing with here? I
> > > imagine I'll have the same question when I confront my next collection,
> > > and then I would turn to an improved version of the same query to make
> > > an initial survey.
> > >
> > > David
> > >
> > > On 07/22/2010 04:45:44 AM, Adam Retter wrote:
> > >> Hmm an interesting question...
> > >>
> > >> Basically your query constructs a LOT of strings and then throws away
> > >> the ones it does not need or at least offers them up to the GC (via
> > >> distinct-values).
> > >>
> > >> The problem is that for every node in every document you are creating
> > >> a string of its full path, so for 100,000 documents if your documents
> > >> have maybe 100 nodes then that is 100,000 * 100 strings of perhaps
> > >> substantial length.
> > >>
> > >> If you know how many nodes you have, and the average string length of
> > >> a path representation you could make a stab at the memory calculations
> > >> based on the fact that eXist is built on Java and it is possible to
> > >> calculate the memory usage of a string -
> > >> http://www.javamex.com/tutorials/memory/string_memory_usage.shtml
> > >>
> > >> Do you need this information just once for some reason? or is it
> > >> something you need frequently?
> > >>
> > >> It may be something that would be better done at document storage time
> > >> by means of a trigger or some sort of plugin to the indexing pipeline.
> > >>
> > >>
> > >> On 22 July 2010 03:42, David Elwell <el...@ve...> wrote:
> > >> >
> > >> > I am new to XQuery and eXist & trying to get a handle on a fairly large
> > >> > collection of XML files--approaching 100,000. The files are
> > >> > consistently structured, so I am not anticipating more than a couple
> > >> > hundred paths.
> > >> >
> > >> > Thanks to http://www.xqueryfunctions.com/ I have managed to construct a
> > >> > query that lists elements more or less hierarchically:
> > >> >
> > >> > ---------------------------------------------------------------
> > >> >
> > >> > xquery version "1.0";
> > >> > declare namespace functx = "http://www.functx.com";
> > >> > declare function functx:path-to-node
> > >> >   ( $nodes as node()* ) as xs:string* {
> > >> >
> > >> >   $nodes/string-join(ancestor-or-self::*/name(.), '/')
> > >> > } ;
> > >> > declare function functx:distinct-element-paths
> > >> >   ( $nodes as node()* ) as xs:string* {
> > >> >
> > >> >   distinct-values(functx:path-to-node($nodes/descendant-or-self::*))
> > >> > } ;
> > >> > declare function functx:sort
> > >> >   ( $seq as item()* ) as item()* {
> > >> >
> > >> >   for $item in $seq
> > >> >   order by $item
> > >> >   return $item
> > >> > } ;
> > >> > let $in-xml := collection("NAMEOFCOLLECTION")
> > >> > return
> > >> >   functx:sort(functx:distinct-element-paths(
> > >> >     $in-xml))
> > >> >
> > >> > ------------------------------------------------------------------
> > >> >
> > >> > In eXist's sandbox, this generates a numbered, alphabetical list of
> > >> > nodes & descendants for test collections of a few files, but I haven't
> > >> > figured out how to run such a query on the complete collection of tens
> > >> > of thousands of files without exceeding memory and other limits.
> > >> >
> > >> > My intention is to develop a query that would allow me to catalog the
> > >> > node structure of any collection as a first step toward constructing
> > >> > more specific queries. As someone who is in over his head, I have a few
> > >> > questions, and even a partial answer to any one of them would be
> > >> > helpful:
> > >> >
> > >> > 1. Is this something that has been done already, as XQuery or by some
> > >> > other means? If so, where can I find more information on the proper
> > >> > tools and approach?
> > >> >
> > >> > 2. Is this task impossible or simply not worth the time and trouble for
> > >> > such a large collection? How does a smart user estimate system
> > >> > limitations in advance of running a query?
> > >> >
> > >> > 3. How should I adapt or replace my query to address a large
> > >> > collection?
> > >> >
> > >> > 4. How should I improve the query to make it more useful and
> > >> > informative as a diagram of a collection?
> > >> >
> > >> > 5. Am I asking the wrong questions?
> > >> >
> > >> > David Elwell
> > >> >
> > >> > ------------------------------------------------------------------------------
> > >> > This SF.net email is sponsored by Sprint
> > >> > What will you do first with EVO, the first 4G phone?
> > >> > Visit sprint.com/first -- http://p.sf.net/sfu/sprint-com-first
> > >> > _______________________________________________
> > >> > Exist-open mailing list
> > >> > Exi...@li...
> > >> > https://lists.sourceforge.net/lists/listinfo/exist-open
> > >> >
> > >>
> > >> --
> > >> Adam Retter
> > >>
> > >> eXist Developer
> > >> { United Kingdom }
> > >> ad...@ex...
> > >> irc://irc.freenode.net/existdb
> > >>
> >
> > --
> > Adam Retter
> >
> > eXist Developer
> > { United Kingdom }
> > ad...@ex...
> > irc://irc.freenode.net/existdb
> >
>
> --
> Dan McCreary
> Semantic Solutions Architect
> office: (952) 931-9198
> cell: (612) 986-1552

--
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
The Wellcome Trust Centre for Human Genetics
Roosevelt Drive
Oxford
OX3 7BN
United Kingdom
Web: http://purl.org/net/aliman
Email: ali...@gm...
Tel: +44 (0)1865 287669
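
For anyone wanting to try the streaming approach mentioned at the top of this message, here is a minimal sketch of counting element paths with Python's standard xml.sax module. It is not the script referred to above, only an illustration of the idea under a few assumptions: the names PathCounter and count_paths are made up for this example, documents arrive as byte strings, attributes and namespaces are ignored, and nothing here deals with Hadoop or eXist.

```python
import xml.sax
from collections import Counter


class PathCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts how often each element path occurs."""

    def __init__(self):
        super().__init__()
        self.stack = []          # names of the currently open elements
        self.counts = Counter()  # "a/b/c" -> number of occurrences

    def startDocument(self):
        # Reset the stack so a previous failed parse cannot leak state.
        self.stack = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.counts["/".join(self.stack)] += 1

    def endElement(self, name):
        self.stack.pop()


def count_paths(xml_documents):
    """Return a Counter of element paths over an iterable of XML byte strings."""
    handler = PathCounter()
    for doc in xml_documents:
        xml.sax.parseString(doc, handler)
    return handler.counts


if __name__ == "__main__":
    # Two tiny made-up records, only to show the shape of the output.
    sample = [
        b"<mods><titleInfo><title>A</title></titleInfo></mods>",
        b"<mods><titleInfo><title>B</title></titleInfo><genre>x</genre></mods>",
    ]
    for path, n in sorted(count_paths(sample).items()):
        print(n, path)
```

Because the parser streams each document and the counter keeps one entry per distinct path, memory use should grow with the number of distinct paths rather than with the total number of nodes, which is the cost Adam describes for the distinct-values query quoted above.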