I am using Saxon 9 EE on a semantic web project at Cornell University's Mann Library.

My work involves tens of thousands of organization names arranged in an XML file, e.g.:

 

<?xml version="1.0" encoding="UTF-8"?>

<ExtantOrgs>

   <org>

     <uri>http://vivo.cornell.edu/individual/AI-ISCER-01FADCA1A14000076D1</uri>

     <name>Abraham House</name>

   </org>

   <org>

     <uri>http://vivo.cornell.edu/individual/AI-ISCER-01FADCA1A14000076D3</uri>

     <name>ABS Global. Inc</name>

   </org>

.

.

.

</ExtantOrgs>

 

I am using <xsl:key> and key() together with a function to cut down on

the size of the sets on which I need to do isomorphic string matches.

 

My transform flow goes like this:

 

<!-- Read in the big file: -->

 

<xsl:variable name='extantOrgs'

               select="document($orgsIn)/ExtantOrgs"/>

 

<!-- Key Defn: -->

 

<xsl:key name='orgsKey' match='org' use='vfx:myfunc(name)'/>

 

<!-- Given a string to match (i.e. rdfs:label), construct a small

list of possible candidates -->

 

<xsl:variable name='orglist'

               select='key("orgsKey", vfx:myfunc(rdfs:label), $extantOrgs)'/>

 

<!-- Search the list: -->

 

<xsl:variable name='results'

        select='$orglist[vfx:isoMatch(name,$n)]/uri'/>

 

<!-- do something interesting with $results -->

 

 

 

For example, if vfx:myfunc just returns the first n characters

of an organization name, I typically have only a few dozen iso matches to do.

As I hoped, this technique produces small searches and good performance.
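
To make the "first n characters" example concrete, here is a minimal sketch of what such a vfx:myfunc could look like in XSLT 2.0. It is only an illustration, not the function from my stylesheet: the vfx namespace URI, the $prefixLen variable and its value of 3, and the case and whitespace normalisation are all assumptions made for the sketch.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:vfx="http://example.org/vfx"
    exclude-result-prefixes="xs vfx">

  <!-- Assumption: the namespace URI bound to vfx above is a placeholder -->

  <!-- Assumption: hypothetical prefix length; no value for n is fixed here -->
  <xsl:variable name="prefixLen" as="xs:integer" select="3"/>

  <!-- Sketch of a prefix-based key function: collapse whitespace,
       fold to upper case, then keep the first $prefixLen characters.
       An empty argument yields the empty string. -->
  <xsl:function name="vfx:myfunc" as="xs:string">
    <xsl:param name="orgName" as="xs:string?"/>
    <xsl:sequence
        select="substring(upper-case(normalize-space($orgName)), 1, $prefixLen)"/>
  </xsl:function>

  <!-- Key definition as in the flow above: one index entry per org,
       filed under the prefix of its name -->
  <xsl:key name="orgsKey" match="org" use="vfx:myfunc(name)"/>

</xsl:stylesheet>
```

With a function of this shape, every org whose name shares the same normalised prefix lands in the same key bucket, so the size of $orglist is the size of that bucket.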

 

I can see that there are (at least) two searches in this technique:

 

  1. the search of the key index and

  2. the search of $orglist

 

MY QUESTION concerns the design of vfx:myfunc:

 

Should vfx:myfunc aim to make $orglist as small as possible, or

just produce an approximately uniform 'set size' distribution, or

is there a better way to proceed?
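
To put numbers on the 'set size' distribution for a given candidate function, something like the following stylesheet could be run against the ExtantOrgs file. It is a sketch under assumptions: the vfx namespace URI is a placeholder, and the three-character upper-cased prefix is a hypothetical stand-in for vfx:myfunc. It prints each key value with the number of orgs that share it, largest set first.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:vfx="http://example.org/vfx"
    exclude-result-prefixes="xs vfx">

  <xsl:output method="text"/>

  <!-- Assumption: hypothetical stand-in for vfx:myfunc
       (first 3 characters of the whitespace-normalised, upper-cased name) -->
  <xsl:function name="vfx:myfunc" as="xs:string">
    <xsl:param name="orgName" as="xs:string?"/>
    <xsl:sequence
        select="substring(upper-case(normalize-space($orgName)), 1, 3)"/>
  </xsl:function>

  <!-- Group the orgs by key value and report the size of each group,
       one tab-separated line per key value, sorted by descending size -->
  <xsl:template match="/ExtantOrgs">
    <xsl:for-each-group select="org" group-by="vfx:myfunc(name)">
      <xsl:sort select="count(current-group())" order="descending"/>
      <xsl:value-of
          select="concat(current-grouping-key(), '&#9;',
                         count(current-group()), '&#10;')"/>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

Comparing this output for two or three candidate functions shows how many distinct key values each one produces and how unevenly the orgs are spread across them.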

 

Thanks for your attention.

 

-joe

 

Joseph R. Mc Enerney

jrm424@cornell.edu