From: Paul J. L. <pa...@lu...> - 2010-02-10 01:26:24
On Feb 9, 2010, at 2:36 PM, Itamar Syn-Hershko wrote:

> I'm not sure what you mean.

I mean the ability to know, for a given piece of text, where the token
boundaries are (e.g., words).

> CLucene StandardTokenizer is meant for internal use only, and provides the
> calling Analyzer with a stream of identified tokens (it classifies the
> tokens, not just tokenizes them).

Classifies them how? Also, one can plug in one's own tokenizer, yes?

> The ICU tokenizer is a general purpose tokenizer (like Boost's
> implementation is), with loads of extra functionality the CLucene one
> doesn't have or need.

I only care about tokenization of a sequence of characters into words.

- Paul
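[For readers following the distinction being drawn here: plain word tokenization only finds boundaries, while a classifying tokenizer such as CLucene's StandardTokenizer additionally labels each token (number, hostname, email, etc.). A minimal sketch of the boundary-only case, using just the C++ standard library as a stand-in for what an ICU- or Boost-based tokenizer would do; the function name is made up for illustration:]

```cpp
#include <cctype>
#include <string>
#include <vector>

// Boundary-only word tokenizer: emits each maximal run of alphanumeric
// characters. Unlike a classifying tokenizer, it attaches no type
// information to the tokens it produces.
std::vector<std::string> tokenize_words(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(c));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

[Note this simple version is ASCII-only; handling scripts without explicit word separators (Thai, CJK) is exactly where ICU's dictionary-based break iteration earns its keep.]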