From: Paul J. L. <pa...@lu...> - 2010-02-10 01:26:24
On Feb 9, 2010, at 2:36 PM, Itamar Syn-Hershko wrote:

> I'm not sure what you mean.

I mean the ability to know, for a given piece of text, where the token
boundaries are (e.g., words).

> CLucene StandardTokenizer is meant for internal use only, and provides the
> calling Analyzer with a stream of identified tokens (it classifies the
> tokens, not just tokenizes them).

Classifies them how? Also, one can plug in one's own tokenizer, yes?

> The ICU tokenizer is a general purpose tokenizer (like Boost's
> implementation is), with loads of extra functionality the CLucene one
> doesn't have or need.

I only care about tokenization of a sequence of characters into words.

- Paul
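[For readers following the distinction being drawn here: plain word tokenization only finds boundaries, while a classifying tokenizer such as CLucene's StandardTokenizer additionally labels each token (number, hostname, email, etc.). A minimal sketch of the boundary-only case, using just the C++ standard library as a stand-in for what an ICU- or Boost-based tokenizer would do; the function name is made up for illustration:]

```cpp
#include <cctype>
#include <string>
#include <vector>

// Boundary-only word tokenizer: emits each maximal run of alphanumeric
// characters. Unlike a classifying tokenizer, it attaches no type
// information to the tokens it produces.
std::vector<std::string> tokenize_words(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(c));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

[Note this simple version is ASCII-only; handling scripts without explicit word separators (Thai, CJK) is exactly where ICU's dictionary-based break iteration earns its keep.]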