I think there are too many unknowns to give a reasonable answer. I would say it's best to make the key as selective as possible, unless the time taken to build the index becomes excessive. One of the variables is how often the index is used after it has been built.

As always, the best way -- perhaps the only way -- to answer performance questions is by measurement.

Michael Kay
Saxonica

On 15/04/2011 18:11, Joseph R. McEnerney wrote:

I am using Saxon 9 EE on a semantic web project at Cornell University's Mann Library.

My work involves tens of thousands of organization names held in an XML file, e.g.:

<?xml version="1.0" encoding="UTF-8"?>
<ExtantOrgs>
   <org>
     <uri>http://vivo.cornell.edu/individual/AI-ISCER-01FADCA1A14000076D1</uri>
     <name>Abraham House</name>
   </org>
   <org>
     <uri>http://vivo.cornell.edu/individual/AI-ISCER-01FADCA1A14000076D3</uri>
     <name>ABS Global. Inc</name>
   </org>
   .
   .
   .
</ExtantOrgs>

 

I am using <xsl:key> and key() together with a function to cut down on the size of the sets on which I need to do isomorphic string matches.

 

My transform flow goes like this:

<!-- Read in the big file: -->
<xsl:variable name='extantOrgs'
              select="document($orgsIn)/ExtantOrgs"/>

<!-- Key definition: -->
<xsl:key name='orgsKey' match='org' use='vfx:myfunc(name)'/>

<!-- Given a string to match (i.e. rdfs:label), construct a small
     list of possible candidates -->
<xsl:variable name='orglist'
              select='key("orgsKey", vfx:myfunc(rdfs:label), $extantOrgs)'/>

<!-- Search the list: -->
<xsl:variable name='results'
              select='$orglist[vfx:isoMatch(name, $n)]/uri'/>

<!-- do something interesting with $results -->

 

 

 

For example, if vfx:myfunc just returns the first n characters of an organization name, I typically have only a few dozen iso matches to do. As I hoped, this technique produces small searches and good performance.
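
The body of vfx:myfunc is not shown in this message. A minimal sketch of a "first n characters" version might look like the following, assuming XSLT 2.0. The namespace URI bound to vfx and the $prefixLength parameter are made up for illustration. The lower-casing and whitespace normalization are also additions: they can only make the candidate sets larger, never cause a match to be missed, provided the same function is applied on both sides of the lookup.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:vfx="http://example.org/vfx">  <!-- assumed namespace URI -->

  <!-- Hypothetical tuning knob: how many leading characters form the key value -->
  <xsl:param name="prefixLength" as="xs:integer" select="4"/>

  <!-- Sketch only: returns the normalized, lower-cased prefix of a name.
       An empty or missing name yields the zero-length string. -->
  <xsl:function name="vfx:myfunc" as="xs:string">
    <xsl:param name="s" as="xs:string?"/>
    <xsl:sequence
        select="substring(lower-case(normalize-space($s)), 1, $prefixLength)"/>
  </xsl:function>

</xsl:stylesheet>
```

Raising $prefixLength makes the key more selective (smaller candidate sets); lowering it makes the sets larger.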

 

I can see that there are (at least) two searches in this technique:

 

  1. the search of the key index and
  2. the search of $orglist

 

MY QUESTION concerns the design of vfx:myfunc:

 

Should vfx:myfunc tend to make the size of $orglist as small as possible, or just produce an approximately uniform 'set size' distribution, or is there a better way to proceed?
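
One way to see which 'set size' distribution a given vfx:myfunc actually produces is to group the org elements by the same expression the key uses and count each group. The fragment below is a sketch under assumptions: bucketReport is a hypothetical template name, the fragment is meant to sit in the same stylesheet as $extantOrgs and vfx:myfunc above, and the output method is assumed to be text.

```xml
<!-- Diagnostic sketch (hypothetical template name): writes one line per
     distinct key value, largest bucket first, as "keyValue<TAB>count" -->
<xsl:template name="bucketReport">
  <xsl:for-each-group select="$extantOrgs/org" group-by="vfx:myfunc(name)">
    <!-- Sort the groups by size so the worst-case buckets appear first -->
    <xsl:sort select="count(current-group())" order="descending"/>
    <xsl:value-of
        select="concat(current-grouping-key(), '&#9;',
                       count(current-group()), '&#10;')"/>
  </xsl:for-each-group>
</xsl:template>
```

Each count is the number of vfx:isoMatch calls a lookup on that key value would cost, so comparing the report for two candidate versions of vfx:myfunc shows how the bucket sizes differ before any timing is done.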

 

Thanks for your attention.

 

-joe

 

Joseph R. Mc Enerney

jrm424@cornell.edu

_______________________________________________ saxon-help mailing list archived at http://saxon.markmail.org/ saxon-help@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/saxon-help