I think it’s a question of degree. Creating an unstemmed equivalent to EVERY searchable field may be taking it too far; in some cases, one unstemmed field actually covers several others (e.g. title_full_unstemmed has the same text as title, title_short and title_full).  We could create unstemmed versions of all the variations in order to get extremely granular relevance ranking, but that’s probably overkill.

One simple change that would offer somewhat more comprehensive coverage without greatly expanding the schema would be to add an allfields_unstemmed field.  That would probably have a fairly significant effect on index size…  but maybe not.

Then it becomes a question of which in-between cases matter.  Do we care about tables of contents?  How about geographic/genre/era?  How about series titles?  These are relatively little-used areas where adding unstemmed versions would probably have little impact on index size…  but is it worth increasing the size and complexity of the schema and search configuration?  I’m not sure.

I’m definitely not opposed to expanding the use of unstemmed fields in the trunk – the unstemmed title and topic fields just went into the trunk recently, and it may well be appropriate to add a few more.  I’m just not sure how far to take it before it becomes a burden rather than a help.  Comments are welcome!  If you would like to share a patch for discussion, that might be helpful as well.

- Demian

From: Tuan Nguyen [mailto:tuan@yorku.ca]
Sent: Wednesday, September 29, 2010 9:31 AM
To: Demian Katz
Cc: Osullivan L.; vufind-tech@lists.sourceforge.net
Subject: Re: [VuFind-Tech] ? wildcard

We took this approach from day one: every searchable field has an equivalent unstemmed version. We also use the unstemmed versions to give a higher boost to exact/unstemmed matches. Could we expand this and make it part of the standard schema that every searchable field has an unstemmed equivalent? I understand the concern about growing the size of the index, but in our experience the increase in index size is not significant.
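
As a rough illustration of the boosting idea, here is a minimal Python sketch. It is not VuFind or Solr code: the boost values and the tiny stem table are assumptions made up for the example, and title_full is assumed to be the stemmed counterpart of title_full_unstemmed.

```python
# Toy stemmer (assumption): only this one mapping is taken from the thread;
# any other word is left unchanged.
STEMS = {"globalization": "global"}


def stem(word):
    """Return the stemmed form of a word, or the word itself if unknown."""
    return STEMS.get(word, word)


# Hypothetical boosts; real values would live in the search configuration.
BOOSTS = {"title_full_unstemmed": 2.0, "title_full": 1.0}


def score(query, title):
    """Add up the boosts of every field in which the query matches."""
    words = title.lower().split()
    total = 0.0
    # Unstemmed field: the query must match a word exactly.
    if query in words:
        total += BOOSTS["title_full_unstemmed"]
    # Stemmed field: query and title words are compared after stemming.
    if stem(query) in {stem(w) for w in words}:
        total += BOOSTS["title_full"]
    return total


titles = ["Globalization and trade", "Global trade"]
ranked = sorted(titles, key=lambda t: score("global", t), reverse=True)
print(ranked)
```

With these assumed boosts, a search for "global" ranks "Global trade" above "Globalization and trade", because only the first title also matches the unstemmed field.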

On Sep 29, 2010, at 9:00 AM, Demian Katz wrote:

Take a look at r3023 – I made a few adjustments to the searchspecs.yaml file so that unstemmed fields are used more effectively when advanced queries are generated.  The situation still isn’t perfect, as there are still stemmed fields without unstemmed equivalents…  but this offers proper coverage of title and subject, so it’s a vast improvement!

- Demian

From: Tuan Nguyen [mailto:tuan@yorku.ca] 
Sent: Wednesday, September 29, 2010 8:52 AM
To: Osullivan L.
Cc: vufind-tech@lists.sourceforge.net
Subject: Re: [VuFind-Tech] ? wildcard

Hi Luke,

The ? wildcard works as advertised. The problem is with the stemming. You can see how this works in the analysis tab of the Solr admin interface.

globalization gets stemmed to global

globalisation gets stemmed to globalis
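
The mismatch can be reproduced outside Solr with a small Python sketch. It assumes the query was something like globali?ation (the original query is not quoted here) and that wildcard patterns are compared against the indexed terms without being stemmed themselves, which is how I understand Lucene wildcard queries to behave. Only the two stem results above come from the analysis screen; everything else is illustrative.

```python
from fnmatch import fnmatchcase

# Stemmed forms as reported in this thread.
STEMS = {"globalization": "global", "globalisation": "globalis"}


def index_terms(words, stemmed):
    """Return the set of terms that a stemmed or unstemmed field would store."""
    return {STEMS.get(w, w) if stemmed else w for w in words}


def wildcard_hits(pattern, terms):
    """Match a ?/* pattern directly against indexed terms.

    The pattern is used as typed (assumption: no stemming is applied to it).
    """
    return sorted(t for t in terms if fnmatchcase(t, pattern))


words = ["globalization", "globalisation"]
stemmed_hits = wildcard_hits("globali?ation", index_terms(words, stemmed=True))
unstemmed_hits = wildcard_hits("globali?ation", index_terms(words, stemmed=False))
print(stemmed_hits)
print(unstemmed_hits)
```

Against the stemmed terms the pattern finds nothing, because "global" and "globalis" are both shorter than the pattern requires. Against the unstemmed terms it finds both spellings.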

[Attached images: image001.png, image002.png]

On Sep 29, 2010, at 8:15 AM, Osullivan L. wrote:

globalisation