From: Thomas C. <t_c...@ya...> - 2012-05-30 09:55:40
"The computing concept of a word as an immutable unit between nonalphanumeric characters is very anglocentric. Most languages (most European languages at least) have considerable inflection compared to English and this isn't consistent with the "word" concept." And the idea that adding implicit * in both sides solves the problem is a very common error for computer programmers : even in english you have inflexions inside the word (for example irregular verbs : make/made, get/got...) If I good understand, a well-parametrized tokenizer - the ones given by Lucene, for example - is capable to understand such inside inflexions. You are right when you say that "word" is not the right term : from a linguistic point of view the right term could be "lemma" or "stem". As I suggested in my last message if you want to do like traditionnal word processors (which are not linguistic-centric) you should add a checkbox "whole words only", maybe unchecked by default so that the default behaviour remains unchanged. Or you may want to adopt a linguistic point of view and speak about key stems/lemmata, meaning that the result will be dependant on the tokenizer you use. But I understand that this second option is more difficult to implement so it could be a second step. Thomas |