From: Joe W. <jo...@gm...> - 2011-09-15 13:53:37
Sharmin,

I can't speak directly to the memory aspect of your question, but I have some suggestions for optimizing your query.

> <create qname="marc:record"/>
> <ngram qname="marc:record"/>
> for $record in /marc:collection/marc:record[fn:matches(., "design", 'i')]

Note that you've defined both range and ngram indexes on marc:record. Because your query uses fn:matches(), it is going to make use of the range index.

1. "Function fn:matches returns true if any substring of its argument string matches the regular expression. The query engine thus needs to scan all index entries as the match could be at any position of an entry. You can reduce the range of entries to be scanned by anchoring your pattern at the start of a string (where applicable):" (See http://exist-db.org/tuning.html#d1973e783; see also http://exist-db.org/indexing.html#rangeidx.)

So try:

    for $record in /marc:collection/marc:record[fn:matches(., "^design$", 'i')]

See http://demo.exist-db.org/functions/fn/matches for info regarding these regular expression anchors. Keep in mind that the anchors change what the pattern matches: with both ^ and $, the whole string value has to be "design", so this only applies where you are testing a short value rather than searching inside a long one.

2. Range indexes are best for strongly typed data or strings (see http://exist-db.org/tuning.html#d1973e497). I'd go further and say that range indexes perform best when you're comparing with =, <, or >. If your use of marc:record doesn't fit that description, consider applying a Lucene full text index on marc:record rather than a range or ngram index (see http://www.exist-db.org/lucene.html). Your query would look like:

    for $record in /marc:collection/marc:record[ft:query(., "design")]

3. Whether you go with range or Lucene for the index, you can optimize your query by avoiding the top-down approach of selecting "/marc:collection" before "marc:record" (see http://exist-db.org/tuning.html#d1973e783).
Consider these alternatives, which follow this advice:

    for $record in //marc:record[fn:matches(., "^design$", 'i')][parent::marc:collection]

or, if the parent::marc:collection isn't significant for your query:

    for $record in //marc:record[fn:matches(., "^design$", 'i')]

or, in the Lucene case:

    for $record in //marc:record[ft:query(., "design")][parent::marc:collection]

    for $record in //marc:record[ft:query(., "design")]

Cheers,
Joe

On Thu, Sep 15, 2011 at 9:07 AM, Sharmin Choudhury <sha...@ya...> wrote:
> Hi,
> I have posted about this before and unfortunately have still been unable to
> solve the problem. So I am trying again,
> eXist setup: Running on a standalone Jetty Server with JVM options -Xms1000m
> -Xmx5000m. The machine itself has 12.0 GB of memory. So plenty of memory.
> Dataset: MARC21XML library database that is 831 MB in size.
> Index for the MARC21XML:
> <?xml version="1.0" encoding="UTF-8"?>
> <collection xmlns="http://exist-db.org/collection-config/1.0">
>     <index xmlns:marc="http://www.loc.gov/MARC21/slim">
>         <create qname="marc:record"/>
>         <create qname="@tag"/>
>         <create qname="@code"/>
>         <create qname="marc:subfield"/>
>         <create qname="marc:leader"/>
>         <create qname="marc:datafield"/>
>         <ngram qname="@tag"/>
>         <ngram qname="@code"/>
>         <ngram qname="marc:subfield"/>
>         <ngram qname="marc:leader"/>
>         <ngram qname="marc:datafield"/>
>         <ngram qname="marc:record"/>
>     </index>
> </collection>
> Query I am trying to execute through the Sandbox:
> declare namespace marc = "http://www.loc.gov/MARC21/slim";
> for $record in /marc:collection/marc:record[fn:matches(., "design", 'i')]
> let $title := $record/marc:datafield[@tag='245']/marc:subfield/text()
> let $author :=
> $record/marc:datafield[@tag='100']/marc:subfield[@code='a']/text()
> let $otherauthor :=
> $record/marc:datafield[@tag='700']/marc:subfield[@code='a']/text()
> let $publocation :=
> $record/marc:datafield[@tag='260']/marc:subfield[@code='a']/text()
> let $publisher :=
> $record/marc:datafield[@tag='260']/marc:subfield[@code='b']/text()
> let $pubdate :=
> $record/marc:datafield[@tag='260']/marc:subfield[@code='c']/text()
> let $edition := $record/marc:datafield[@tag='250']/marc:subfield/text()
> let $description := $record/marc:datafield[@tag='653']/marc:subfield/text()
> let $campus :=
> $record/marc:datafield[@tag='949']/marc:subfield[@code='l']/text()
> let $shelf :=
> $record/marc:datafield[@tag='949']/marc:subfield[@code='s']/text()
> let $isbn :=
> $record/marc:datafield[@tag='020']/marc:subfield[@code='a']/text()
> return
> <data id="MDX Catalogue">
>     <sort><x_sort>Date</x_sort><y_sort>Title</y_sort></sort>
>     <title id="Title">{$title}</title>
>     <top_left id="Date">{$pubdate}</top_left>
>     <subtitle id="Edition">{$edition}</subtitle>
>     <cat_1 id="Author(s)">
>         <keyword_1>{$author[1]}</keyword_1>
>         <keyword_2>{$otherauthor[1]}</keyword_2>
>         <keyword_3>{$otherauthor[2]}</keyword_3>
>     </cat_1>
>     <cat_2>
>         <keyword_1 id="Location">{$publocation}</keyword_1>
>         <keyword_2 id="Publisher">{$publisher}</keyword_2>
>         <keyword_3 id="Library Location">{$campus[1]}, {$shelf[1]}</keyword_3>
>     </cat_2>
>     <blurb id="Description">{$description[1]}</blurb>
>     <drill_1 id="Web" type="Library Website">{$isbn}</drill_1>
>     <drill_2 id="Web" type="Waterstones"></drill_2>
> </data>
>
> But eXist crashes giving Java Heap Space Error. I have been assured that
> eXist should be able to handle much larger datasets than the one I have but
> I am not sure what I am doing wrong. If anyone has any insights, that would
> be really helpful.
> Thanks!
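
For reference, the Lucene suggestion in point 2 of the reply needs a matching index definition, which the reply does not spell out. The fragment below is only a sketch of what that might look like, assuming the same collection configuration file as in the quoted message and eXist's standard Lucene analyzer; the existing range and ngram entries are left out here for brevity, and whether to keep them is a separate decision.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: assumes this replaces or extends the collection.xconf
     quoted above. The collection it applies to is not named in the thread. -->
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:marc="http://www.loc.gov/MARC21/slim">
        <lucene>
            <!-- Assumption: the default StandardAnalyzer is adequate for this data -->
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <!-- Full text index on the element queried with ft:query(., "design") -->
            <text qname="marc:record"/>
        </lucene>
    </index>
</collection>
```

A changed index configuration only affects documents stored afterwards, so the collection would most likely have to be reindexed before ft:query() returns results for the existing data.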