I wonder how to modify the parameter file so that the Tokenizer Operation in the Morphological Processor concatenates multi-word terms and looks them up, but does not return the single terms as results when 'lookupAllBaseForms' (or 'lookupIndexWord') is called.
To illustrate this:
If I have a multi-word term, let's say 'european union', the Morphological Processor returns 'european' and 'union' besides 'european union' as base forms, while I would like to have only 'european union' returned. This is a problem if there is no index word for the multi-word term itself, as with, say, some mysterious 'fun organisation', in which case I would get 'fun' and 'organisation', while I would rather have an empty set returned. Does anyone know how to do this? Of course, I could patch this myself by checking whether the number of terms in the query and in the result are the same, but this would also destroy such nice things as getting the desirable result 'cellphone' for the query 'cell phone'.
Thanks in advance.
My original goal in designing the TokenizerOperation was to return as much as possible and let the developer decide how to filter the results. However, it does seem kind of useless in most cases to return individual tokens as results. I think the most logical thing to do is to add a parameter in the properties file for returning only words with the same number of tokens as the word being stemmed. Could you please enter this as a feature request?
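Until such a parameter exists, the filtering can be done on the caller's side. The following is only a rough sketch of one possible post-filter, not part of the library: it works on plain strings (the base forms already returned by the lookup), and the class and method names are made up for illustration. It keeps a candidate if it has the same number of tokens as the query, or if it equals the query's tokens run together, which is an assumption added here so that 'cellphone' survives for 'cell phone'.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the library: filters base forms after lookup.
class BaseFormFilter {

    // Split a term on whitespace or underscores (multi-word lemmas may use either).
    static String[] tokens(String term) {
        return term.trim().split("[\\s_]+");
    }

    // The term with all delimiters removed, lower-cased: "cell phone" -> "cellphone".
    static String joined(String term) {
        return String.join("", tokens(term)).toLowerCase();
    }

    // Keep candidates with the same token count as the query,
    // or candidates that are the query's tokens concatenated.
    static List<String> filter(String query, List<String> candidates) {
        int queryTokenCount = tokens(query).length;
        String queryJoined = joined(query);
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            boolean sameTokenCount = tokens(candidate).length == queryTokenCount;
            boolean isConcatenation = joined(candidate).equals(queryJoined);
            if (sameTokenCount || isConcatenation) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [european union]
        System.out.println(filter("european union",
                Arrays.asList("european", "union", "european union")));
        // prints []
        System.out.println(filter("fun organisation",
                Arrays.asList("fun", "organisation")));
        // prints [cellphone]
        System.out.println(filter("cell phone",
                Arrays.asList("cell", "phone", "cellphone")));
    }
}
```

Note that the concatenation check compares surface strings only, so an inflected query such as 'cell phones' would not match 'cellphone'. Handling that case would need the stemmed tokens, which this sketch does not have access to.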