From: Birnbaum, D. J <dj...@pi...> - 2009-02-25 03:44:09
Dear eXistentialists,

I've returned to an old problem, now with a little more information, and I'd be grateful for any advice readers might be able to provide.

In the XQuery script below, the corpus contains 103 documents with <TEI.2> as their root. Documents contain a variable number of <articleName> elements at various depths under the root; there are 1537 total <articleName> elements with 749 distinct values. I'm trying to generate an alphabetized list of all of the distinct <articleName> values with a count of how many times each value occurs in the corpus. The collection is indexed, and the collection.xconf file follows the XQuery, below. The index configuration lives in:

    /db/system/config/db/mss/collection.xconf

When I inquired about this script a while ago, some readers of this list advised me to try to avoid distinct-values(), since that requires atomization, which bypasses indexing. It turns out that distinct-values() isn't the bottleneck; the bottleneck line is:

    let $occurrenceCount := count(collection("/db/mss")//articleName[. eq $i])

If I replace this with:

    let $occurrenceCount := 1

the query returns essentially instantly (that is, distinct-values() doesn't slow it down in any way that a user would notice). In its real version, as reproduced below, it takes approximately 23 seconds to return, which is unacceptable by a factor of roughly ... well ... 23. In both cases I've run the script multiple times; I don't notice any difference between first and subsequent runs (that is, I don't think the difference between the two versions has anything to do with caching).

For what it's worth, I use "eq" instead of "&=" in the offending line because I need to match the full content of <articleName> exactly, since some <articleName> values are substrings of other <articleName> values. Changing "eq" to "&=" changes the count (i.e., returns the wrong count) and, incidentally, takes longer (approximately 31 seconds, i.e., roughly 35% longer).
Any suggestions about how I might speed this up? I could pre-cook the counts and store them separately, but shouldn't I be able to generate this sort of information on the fly? Would a comparable counting operation in an RDBMS also entail this sort of spectacularly tedious bottleneck and, if not, any thoughts about what I might be doing wrong here?

Thanks,

David
dj...@pi...

----[consolidated.xq]----
xquery version "1.0";

let $corpus := collection("/db/mss")//TEI.2
let $totalarticles := $corpus//articleName
let $article :=
    for $i in distinct-values($totalarticles)
    order by $i
    return $i
return
<html>
    <head><title>Consolidated Text List</title></head>
    <body>
        <p>Total manuscripts in corpus: {string(count($corpus))}
        <br/>Total articles in corpus: {string(count($totalarticles))}
        <br/>Distinct articles in corpus: {string(count($article))}
        </p>
        <ol>
        {for $i in $article
         let $occurrenceCount := count(collection("/db/mss")//articleName[. eq $i])
         return <li>{$i} ({$occurrenceCount} {if ($occurrenceCount eq 1) then " occurrence" else " occurrences"})</li>
        }</ol>
    </body>
</html>

----[collection.xconf]----
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <fulltext default="none" attributes="false">
            <!-- Full text indexes -->
            <create qname="articleName"/>
        </fulltext>
        <!-- Range indexes -->
        <create qname="articleName" type="xs:string"/>
    </index>
</collection>
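
For comparison, here is an untested sketch of one variant of the counting loop. It reads the <articleName> values once and counts matches in that in-memory sequence, rather than querying the collection once per distinct value. The variable $names is introduced here purely for illustration and is not part of the script above; whether this is actually faster in eXist is an assumption, not a measurement.

```xquery
xquery version "1.0";

(: Assumption: comparing articleName by its string value is what is wanted. :)
(: $names is a hypothetical helper: one string per articleName element. :)
let $names :=
    for $a in collection("/db/mss")//articleName
    return string($a)
return
<ol>
{
    for $i in distinct-values($names)
    (: Count against the in-memory sequence, not against the collection. :)
    let $occurrenceCount := count($names[. eq $i])
    order by $i
    return
        <li>{$i} ({$occurrenceCount} {if ($occurrenceCount eq 1) then " occurrence" else " occurrences"})</li>
}
</ol>
```

With 1537 values and 749 distinct keys, this amounts to roughly 1.15 million string comparisons, and it deliberately makes no use of the range index, so it would only help if the cost lies in evaluating the path expression repeatedly rather than in the comparisons themselves.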