Hi, I have a question about using a stemmer.
Sometimes you want to apply a stemmer to a query and sometimes you don't.
Is there a way to combine the stemmed and unstemmed results in a single query statement?
I was able to get the two separate result lists by running the search twice, but I want to know if there's a practical way to get combined results.
Thanks.
You don't mention what you are using for your indexing/query software.
Indri builds indexes with one stemmer at a time. Querying that index uses the stemmer (or lack thereof) defined during the build.
Galago allows you to build indexes with multiple stemmed parts. However, you must specify which stemmer (or no stemming) you are using to process queries. The stemmer defined is applied to all the queries.
There are ways to mix unstemmed and stemmed parts in very low-level Galago queries, but the work required to fill in the query smoothing parameters is excessive and not really worthwhile.
Not sure how meaningful combined results would be. The scoring of terms will be different depending on the query terms and stemmers used, making the result scores/rankings somewhat confusing to interpret.
Furthermore, I am not certain how Galago handles duplicate document IDs having different scores in a ranking. I don't think it is possible to get the same document ID multiple times in a ranking (different scores).
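For what it's worth, one way to sidestep both the incomparable scores and the duplicate document IDs is to merge the two result lists outside the search engine, using ranks only. The sketch below uses reciprocal rank fusion, which is a swapped-in technique and not something Indri or Galago does for you. The function name `fuse_rankings`, the constant `k=60`, and the document IDs are all made up for illustration.

```python
from collections import defaultdict


def fuse_rankings(rankings, k=60):
    """Merge ranked lists of document IDs with reciprocal rank fusion.

    Only rank positions are used, so scores from differently stemmed
    runs never have to be compared. Each document ID appears once in
    the output, no matter how many input lists contain it.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))


# Hypothetical results from a stemmed run and an unstemmed run.
stemmed_run = ["d3", "d1", "d7"]
unstemmed_run = ["d1", "d9", "d3"]
merged = fuse_rankings([stemmed_run, unstemmed_run])
print(merged)
```

Documents that rank well in both runs float to the top, and the fused values are only useful for ordering, not as retrieval scores.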
Thanks for the response.
In my domain, searching with a stemmer generally works better. However, some of the query terms, usually proper nouns, need to be searched as is. I can run the search multiple times with different options, but the scoring schemes are different, as you already mentioned, so merging the scores by averaging becomes impractical. One other approach I can think of is to preprocess the query and decide per term whether to use the stemmer, but that seems like too much feature engineering to me.
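To make the preprocessing idea concrete, here is a minimal sketch. It assumes that a capitalized token is a proper noun, which is a crude heuristic, and it uses a placeholder `toy_stem` function standing in for whatever real stemmer the index was built with. Both helper names are hypothetical, and turning the two term lists into an actual Indri or Galago query is left out.

```python
def toy_stem(token):
    """Placeholder stemmer: strips one trailing 's'.

    Stand-in only; a real setup would call the same stemmer
    that was used to build the index.
    """
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def split_query(query, stem=toy_stem):
    """Split a query into terms to keep as is and terms to stem.

    Assumption: capitalized tokens are proper nouns and must not
    be stemmed; everything else is lowercased and stemmed.
    """
    keep_as_is, stemmed = [], []
    for token in query.split():
        if token[0].isupper():
            keep_as_is.append(token)
        else:
            stemmed.append(stem(token.lower()))
    return keep_as_is, stemmed


keep_as_is, stemmed = split_query("flights to Athens airports")
print(keep_as_is, stemmed)
```

Here "Athens" is left untouched, whereas a suffix-stripping stemmer would have turned it into "Athen".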