#3 TokenizerOperation check number of tokens in result


Having this feature, the TokenizerOperation should also
check if ALL the tokens in the query are somehow
represented in the resulting base form iw(s).

to illustrate this:

for a multi-word term query, let's say 'european union',
the Morphological Processor should not return 'european'
and 'union' beside 'european union' as base forms, and
especially, for a multi-word term that has no good base
form, the result should be an empty set, i.e. for the
query 'fun organisation' the result set should be empty
instead of containing 'fun' and 'organisation'.

however, if the concatenation does contain both original
tokens in some way ('cell phone' -> 'cellphone' or
'cell-phone') such a base form should be returned.
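
a minimal sketch of the check described above, in plain Java. it is not the library's code: the class TokenCoverageFilter and its methods are hypothetical names, and "somehow represented" is assumed here to mean that every query token appears, in order, in the candidate once spaces, hyphens and underscores are removed. a real check would probably also have to accept stemmed variants of each token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the library.
final class TokenCoverageFilter {

    // Lower-case the string and drop whitespace, hyphens and underscores,
    // so that 'cell phone', 'cell-phone' and 'cellphone' all compare equal.
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[\\s_-]+", "");
    }

    // True if every token of the query occurs, in order, in the candidate.
    static boolean coversAllTokens(String query, String candidate) {
        String flat = normalize(candidate);
        int from = 0;
        for (String token : query.trim().split("\\s+")) {
            String t = normalize(token);
            int at = flat.indexOf(t, from);
            if (at < 0) {
                return false; // this token is not represented in the candidate
            }
            from = at + t.length();
        }
        return true;
    }

    // Keep only the candidate base forms that cover all query tokens.
    static List<String> filter(String query, List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (coversAllTokens(query, candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

with this assumption, 'european union' keeps only 'european union' out of the three candidates above, 'fun organisation' yields an empty list, and 'cell phone' keeps both 'cellphone' and 'cell-phone'.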

see forum "open discussion":

By: jdidion ( John Didion )
RE: MorphologicalProcessor and TokenizerOperation
2004-01-13 11:00

My original goal in designing the TokenizerOperation was
to return as much as possible and let the developer
decide how to filter the results. However, it does seem
kind of useless in most cases to return individual tokens
as results. I think the most logical thing to do is add a
parameter in the properties file for only returning words
with the same number of tokens as the word being
stemmed. Could you please enter this as a feature
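
the parameter suggested in the quoted post could look roughly like the sketch below. everything in it is an assumption for illustration: the property key, the class TokenCountFilter, and the use of java.util.Properties as a stand-in for the library's own configuration file. tokens are assumed to be separated by whitespace or underscores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of the library.
final class TokenCountFilter {

    // Hypothetical property key; the real parameter name is not specified.
    static final String KEY = "tokenizer.same_token_count_only";

    // Number of tokens, splitting on whitespace or underscores.
    static int countTokens(String s) {
        String trimmed = s.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("[\\s_]+").length;
    }

    // When the property is "true", keep only candidates with as many
    // tokens as the query; otherwise return all candidates unchanged.
    static List<String> apply(Properties props, String query, List<String> candidates) {
        boolean strict = Boolean.parseBoolean(props.getProperty(KEY, "false"));
        if (!strict) {
            return new ArrayList<>(candidates);
        }
        int expected = countTokens(query);
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (countTokens(candidate) == expected) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

note that a pure token count is stricter than the check requested in this ticket: with the parameter switched on, 'cell phone' -> 'cellphone' would be dropped, because the result has one token and the query has two.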


