Re: [Exist-open] Extracting collection node structure


Hi all,

I don't know if this is useful, but last year I wrote a short Python
script using the SAX API to generate some statistics on the paths in a
large collection of XML (6 million MODS records), and used Hadoop to run
it. Details at [1].
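
To give a flavour of the approach, here is a minimal sketch of path counting with Python's standard xml.sax module. It is not the script described at [1], and it leaves out the Hadoop part entirely; the names PathCounter and count_paths are made up for illustration. The point is that a streaming parser only holds the stack of open elements and one counter entry per distinct path, so memory grows with the number of distinct paths rather than with the number of nodes.

```python
import io
import xml.sax
from collections import Counter


class PathCounter(xml.sax.ContentHandler):
    """Counts how often each element path occurs across parsed documents."""

    def __init__(self):
        super().__init__()
        self.stack = []          # names of the elements currently open
        self.counts = Counter()  # path string -> number of occurrences

    def startElement(self, name, attrs):
        # Entering an element: extend the current path and count it.
        self.stack.append(name)
        self.counts["/".join(self.stack)] += 1

    def endElement(self, name):
        # Leaving an element: drop it from the current path.
        self.stack.pop()


def count_paths(sources):
    """Parse each source (file name or binary file object) and tally paths."""
    handler = PathCounter()
    for source in sources:
        handler.stack.clear()  # every document starts from an empty path
        xml.sax.parse(source, handler)
    return handler.counts


if __name__ == "__main__":
    # Two tiny in-memory documents stand in for files of a real collection.
    sample = [
        io.BytesIO(b"<mods><titleInfo><title>A</title></titleInfo></mods>"),
        io.BytesIO(b"<mods><titleInfo><title>B</title></titleInfo><name/></mods>"),
    ]
    for path, n in sorted(count_paths(sample).items()):
        print(n, path)
```

For a real collection you would pass file names instead of the in-memory samples, for example the result of glob.glob over an export directory.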

Cheers

Alistair

[1] http://dublincore.org/dcmirdataskgroup/DataConversion

On Fri, Jul 23, 2010 at 06:50:44AM -0500, Dan McCreary wrote:
> Hi David,
> 
> I put together two examples of path analysis queries in the XQuery Wikibook
> here:
> 
> http://en.wikibooks.org/wiki/XQuery#Path_Analysis
> 
> These examples work fine for small XML files.  Between them and the oXygen
> tools that create XML Schemas from instance documents, we now have tools
> that can do some very quick analysis on small documents.  I also use a
> small library of "metrics" reports that count the number of nodes, elements,
> attributes, etc., but I think these are not quite as useful.
> 
> Adam is correct: most complex XML data sets do come with XML Schemas that
> are the ultimate guide to complex data sets.  But sometimes we get TEI
> documents that don't have XML Schemas, and it is very handy to be able to get
> a report of which TEI elements are used to annotate paragraphs, for example.
> We use these reports to configure the TEI annotator tools.
> 
> What is missing are tools that work with very large collections of
> not-quite-uniform documents.  The way I would start to look at these data
> sets is to build some basic XML Schemas and then run scripts that validate
> all documents in a collection against these XML Schemas.
> 
> I should note that information about what elements are used within documents
> is stored in the indexes used by eXist to perform fast XQueries, but I do
> not know if these interfaces are exposed as XQuery functions.  Using these
> interfaces could provide a very fast interface for very large collections.
> 
> - Dan
> 
> On Fri, Jul 23, 2010 at 3:51 AM, Adam Retter <ad...@ex...> wrote:
> 
> > I have to say that, in my experience, this is the first time someone has
> > wanted to see all distinct paths so that they can determine what they
> > are working with. Typically there is some advance understanding of the
> > data, its meaning and structure, perhaps from a Schema or from the
> > creator of the set.
> >
> > On 23 July 2010 04:25, David Elwell <el...@ve...> wrote:
> > > Adam--
> > >
> > > Thanks for pointing me to
> > > http://www.javamex.com/tutorials/memory/string_memory_usage.shtml. It's
> > > going to take me another reading or two to make sense of it.
> > >
> > > I am using the paths from the small sample collection to guide my
> > > construction of queries addressing the complete collection, so even
> > > though I probably don't _need_ the information, I am finding it useful.
> > > It just gnaws at me that I might be missing something if I can't
> > > catalog every distinct path in the collection.
> > >
> > > I don't foresee using such a query frequently on the same collection.
> > > It starts to answer the first question I had when confronted with a
> > > collection of XML files: Structurally, what am I dealing with here? I
> > > imagine I'll have the same question when I confront my next collection,
> > > and then I would turn to an improved version of the same query to make
> > > an initial survey.
> > >
> > > David
> > >
> > > On 07/22/2010 04:45:44 AM, Adam Retter wrote:
> > >> Hmm an interesting question...
> > >>
> > >> Basically your query constructs a LOT of strings and then throws away
> > >> the ones it does not need or at least offers them up to the GC (via
> > >> distinct-values).
> > >>
> > >> The problem is that for every node in every document you are creating
> > >> a string of its full path, so for 100,000 documents if your documents
> > >> have maybe 100 nodes then that is 100,000 * 100 strings of perhaps
> > >> substantial length.
> > >>
> > >> If you know how many nodes you have, and the average string length of
> > >> a path representation you could make a stab at the memory
> > >> calculations based on the fact that eXist is built on Java and it is
> > >> possible to calculate the memory usage of a string -
> > >> http://www.javamex.com/tutorials/memory/string_memory_usage.shtml
> > >>
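
The back-of-envelope calculation suggested above can be sketched as follows. Everything numeric here is an assumption for illustration: a fixed per-string overhead of 40 bytes, 2 bytes per character (UTF-16, as in JVMs of this era), rounding up to 8-byte alignment, and an average path length of 40 characters. The function name estimate_string_bytes is made up; check the linked page and your own JVM for real figures.

```python
def estimate_string_bytes(num_strings, avg_chars,
                          overhead_bytes=40, bytes_per_char=2, alignment=8):
    """Rough heap estimate for num_strings strings of avg_chars characters.

    overhead_bytes, bytes_per_char and alignment are assumed values for an
    older JVM, not measurements; adjust them for the JVM actually in use.
    """
    raw = overhead_bytes + bytes_per_char * avg_chars
    # Round each string up to the next multiple of the alignment.
    per_string = -(-raw // alignment) * alignment
    return num_strings * per_string


if __name__ == "__main__":
    docs = 100_000       # collection size mentioned in the thread
    nodes_per_doc = 100  # the "maybe 100 nodes" guess from the thread
    avg_path_chars = 40  # assumed average path length
    total = estimate_string_bytes(docs * nodes_per_doc, avg_path_chars)
    print(f"{total / 2**30:.2f} GiB")
```

Under these assumptions the 10 million path strings come to about 1.2 billion bytes before distinct-values discards any of them, which would go some way toward explaining the memory limits being hit.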
> > >> Do you need this information just once for some reason? Or is it
> > >> something you need frequently?
> > >>
> > >> It may be something that would be better done at document storage
> > >> time by means of a trigger or some sort of plugin to the indexing
> > >> pipeline.
> > >>
> > >>
> > >> On 22 July 2010 03:42, David Elwell <el...@ve...> wrote:
> > >> >
> > >> > I am new to XQuery and eXist & trying to get a handle on a fairly
> > >> > large collection of XML files--approaching 100,000. The files are
> > >> > consistently structured, so I am not anticipating more than a
> > >> > couple hundred paths.
> > >> >
> > >> > Thanks to http://www.xqueryfunctions.com/ I have managed to
> > >> > construct a query that lists elements more or less hierarchically:
> > >> >
> > >> > ---------------------------------------------------------------
> > >> >
> > >> > xquery version "1.0";
> > >> > declare namespace functx = "http://www.functx.com";
> > >> > declare function functx:path-to-node
> > >> >  ( $nodes as node()* )  as xs:string* {
> > >> >
> > >> > $nodes/string-join(ancestor-or-self::*/name(.), '/')
> > >> >  } ;
> > >> > declare function functx:distinct-element-paths
> > >> >  ( $nodes as node()* )  as xs:string* {
> > >> >
> > >> >   distinct-values(functx:path-to-node($nodes/descendant-or-self::*))
> > >> >  } ;
> > >> > declare function functx:sort
> > >> >  ( $seq as item()* )  as item()* {
> > >> >
> > >> >   for $item in $seq
> > >> >   order by $item
> > >> >   return $item
> > >> >  } ;
> > >> > let $in-xml := collection("NAMEOFCOLLECTION")
> > >> > return
> > >> > functx:sort(functx:distinct-element-paths(
> > >> >     $in-xml))
> > >> >
> > >> > ------------------------------------------------------------------
> > >> >
> > >> > In eXist's sandbox, this generates a numbered, alphabetical list of
> > >> > nodes & descendants for test collections of a few files, but I
> > >> > haven't figured out how to run such a query on the complete
> > >> > collection of tens of thousands of files without exceeding memory
> > >> > and other limits.
> > >> >
> > >> > My intention is to develop a query that would allow me to catalog
> > >> > the node structure of any collection as a first step toward
> > >> > constructing more specific queries. As someone who is in over his
> > >> > head, I have a few questions, and even a partial answer to any one
> > >> > of them would be helpful:
> > >> >
> > >> > 1. Is this something that has been done already, as XQuery or by
> > >> > some other means? If so, where can I find more information on the
> > >> > proper tools and approach?
> > >> >
> > >> > 2. Is this task impossible or simply not worth the time and trouble
> > >> > for such a large collection? How does a smart user estimate system
> > >> > limitations in advance of running a query?
> > >> >
> > >> > 3. How should I adapt or replace my query to address a large
> > >> > collection?
> > >> >
> > >> > 4. How should I improve the query to make it more useful and
> > >> > informative as a diagram of a collection?
> > >> >
> > >> > 5. Am I asking the wrong questions?
> > >> >
> > >> > David Elwell
> > >> >
> > >> >
> > >> >
> > >>
> > >> > _______________________________________________
> > >> > Exist-open mailing list
> > >> > Exi...@li...
> > >> > https://lists.sourceforge.net/lists/listinfo/exist-open
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Adam Retter
> > >>
> > >> eXist Developer
> > >> { United Kingdom }
> > >> ad...@ex...
> > >> irc://irc.freenode.net/existdb
> > >>
> > >>
> > >
> > >
> > >
> > >
> >
> >
> >
> > --
> > Adam Retter
> >
> > eXist Developer
> > { United Kingdom }
> > ad...@ex...
> > irc://irc.freenode.net/existdb
> >
> >
> >
> 
> 
> 
> -- 
> Dan McCreary
> Semantic Solutions Architect
> office: (952) 931-9198
> cell: (612) 986-1552


-- 
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
The Wellcome Trust Centre for Human Genetics
Roosevelt Drive
Oxford
OX3 7BN
United Kingdom
Web: http://purl.org/net/aliman
Email: ali...@gm...
Tel: +44 (0)1865 287669
