When I find time in the next few weeks, I will evaluate the
effect on index size of adding an allfields_unstemmed field on my test server – if
it is not too severe, that’s probably worth implementing as at least one
more step toward full unstemmed search support.
From: Tuan Nguyen
Sent: Wednesday, September 29, 2010 10:23 AM
To: Demian Katz
Cc: Osullivan L.; firstname.lastname@example.org
Subject: Re: [VuFind-Tech] ? wildcard
You're right, we may have taken it too far.
allfields_unstemmed would have sufficed for dealing with this type of wildcard.
On Sep 29, 2010, at 9:47 AM, Demian Katz wrote:
I think it’s a question of degrees – I think creating an
unstemmed equivalent to EVERY searchable field may be taking it too far; in some
cases, one unstemmed field actually covers several others (e.g.
title_full_unstemmed has the same text as title, title_short, and
title_full). We could create unstemmed versions of all the variations in
order to get extremely granular relevance ranking, but I think that’s probably
One simple change that would offer somewhat more comprehensive
coverage without greatly expanding the schema would be to add an
allfields_unstemmed field. That would probably have a fairly significant
effect on index size… but maybe not.
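To illustrate why an unstemmed copy of a field tends to add to index size, here is a minimal sketch that compares the number of distinct terms produced with and without stemming. The `toy_stem` function, the `vocabulary` helper, and the sample records are all hypothetical; the suffix rules are a crude stand-in and not Solr's actual stemmer, so this shows the direction of the effect only, not its magnitude.

```python
def toy_stem(term: str) -> str:
    """Hypothetical suffix stripper for illustration; NOT Solr's stemmer."""
    for suffix in ("ations", "ation", "ing", "es", "s"):
        # Only strip when at least three characters would remain.
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def vocabulary(texts, analyzer):
    """Return the set of distinct terms after applying the analyzer."""
    return {analyzer(token) for text in texts for token in text.lower().split()}


# Made-up sample records, standing in for the text copied into allfields.
records = [
    "indexing library catalogs",
    "index of library catalog",
    "indexes and libraries",
]

stemmed = vocabulary(records, toy_stem)        # word variants collapse together
unstemmed = vocabulary(records, lambda t: t)   # every variant is its own term

print(len(stemmed), "distinct stemmed terms")
print(len(unstemmed), "distinct unstemmed terms")
```

The unstemmed copy keeps every word variant as a separate term, so its term dictionary is larger, and the field also stores a second set of postings for the same text. How much that matters in practice depends on the data, which is why measuring on a test server is the safer route.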
Then it becomes a question of which cases in between
matter. Do we care about tables of contents? How about
geographic/genre/era? How about series titles? These are relatively
little-used areas where adding unstemmed versions would probably have little
impact on index size… but is it worth increasing the size and complexity
of the schema and search configuration? I’m not sure.
I’m definitely not opposed to expanding the use of unstemmed
fields in the trunk – the unstemmed title and topic fields just went into the
trunk recently, and it may well be appropriate to add a few more. I’m
just not sure how far to take it before it becomes a burden rather than a help.
Comments are welcome! If you would like to share a patch for discussion,
that might be helpful as well.
We took this approach from day
one: every searchable field has an equivalent unstemmed version. We also use
the unstemmed version to give a higher boost to exact/unstemmed matches. Could we
expand this and make it part of the standard schema that every searchable field
has an unstemmed equivalent? I understand the concern about growing the size of
the index, but from our experience the increase in index size is not
On Sep 29, 2010, at 9:00 AM, Demian Katz wrote:
Take a look at r3023 – I made a few adjustments to the
searchspecs.yaml file so that unstemmed fields are used more effectively when
advanced queries are generated. The situation still isn’t perfect, as
there are still stemmed fields without unstemmed equivalents… but this
offers proper coverage of title and subject, so it’s a vast improvement!
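As a rough illustration of pairing a stemmed field with its unstemmed equivalent, the sketch below builds a Lucene-style query string that searches both and gives the unstemmed field the higher boost. The field names `title` and `title_full_unstemmed` come from this thread; the function name and the boost values are assumptions made for the example and are not taken from r3023 or from the real searchspecs.yaml.

```python
def build_title_query(user_terms: str,
                      stemmed_boost: int = 100,
                      unstemmed_boost: int = 300) -> str:
    """Build a query over the stemmed and unstemmed title fields.

    Boost values are hypothetical; the only point is that the unstemmed
    clause gets the larger one, so exact word forms rank higher.
    """
    # Escape embedded double quotes so the phrase stays well-formed.
    escaped = user_terms.replace('"', '\\"')
    clauses = [
        f'title:"{escaped}"^{stemmed_boost}',
        f'title_full_unstemmed:"{escaped}"^{unstemmed_boost}',
    ]
    return " OR ".join(clauses)


print(build_title_query("globalization"))
# title:"globalization"^100 OR title_full_unstemmed:"globalization"^300
```

A record that contains the exact word form matches both clauses and scores higher than one that matches only through the stemmed field.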
The ? wildcard works as
advertised. The problem is with the stemming. You can see how this works in the
analysis tab of the Solr admin interface.
globalization gets stemmed to
globalisation gets stemmed to
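The interaction between the ? wildcard and stemming can be sketched as follows. This assumes that wildcard patterns are compared against the indexed terms as typed, without going through the stemmer themselves. The `toy_stem` rule is hypothetical and does not reproduce the stems Solr actually generates; Python's `fnmatchcase` stands in for the wildcard matcher because it also treats ? as exactly one character.

```python
from fnmatch import fnmatchcase


def toy_stem(term: str) -> str:
    """Hypothetical stemmer: strips a trailing 'ation'. NOT Solr's stemmer."""
    if term.endswith("ation") and len(term) > 8:
        return term[: -len("ation")]
    return term


def wildcard_hits(pattern: str, terms) -> list:
    """Return the indexed terms that match the pattern, sorted."""
    return sorted(t for t in terms if fnmatchcase(t, pattern))


words = ["globalization", "globalisation"]
stemmed_terms = {toy_stem(w) for w in words}   # what a stemmed field would hold
unstemmed_terms = set(words)                   # what an unstemmed field would hold

# The pattern is not stemmed, so it is compared with truncated stems and fails.
print(wildcard_hits("globali?ation", stemmed_terms))
# []

# Against the unstemmed field the full word forms are present and both match.
print(wildcard_hits("globali?ation", unstemmed_terms))
# ['globalisation', 'globalization']
```

This is why an unstemmed field is needed for the wildcard to behave as users expect: the wildcard itself works, but the terms it would match no longer exist in a stemmed field.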
On Sep 29, 2010, at 8:15 AM, Osullivan L. wrote: