RE: RE: RE: [Classifier4j-devel] SimpleHTMLTokenizer should use decorator pattern
From: Nick L. <nl...@es...> - 2004-09-02 23:23:46
> On Thu, 2 Sep 2004 16:44:43 +0930, Nick Lothian
> <nl...@es...> wrote:
> > > DefaultTokenizer can only work for Latin languages. I'm
> > > planning to write a CJKTokenizer to split Chinese characters.
> >
> > Why does DefaultTokenizer only work for Latin languages? There is a
> > constructor that will let you pass in a custom regexp to split on -
> > is that not sufficient?
>
> Some Asian languages are not like English: words are not separated by
> spaces or any other characters. Text within a sentence is continuous.
>
> Some discussion about CJK word segmentation:
> http://www.webmasterworld.com/forum32/284.htm
>

I can't read that thread - it is marked members only.

I knew that Asian languages didn't split based on spaces, but I did think
it was possible to split based on a regexp. (See n-gram tokenization in
http://sourceforge.net/mailarchive/forum.php?thread_id=3404351&forum_id=8740
and Zope's CJKSplitter:
http://www.zope.org/Members/panjunyong/CJKSplitter).

How are you planning on doing it? I've seen some discussion of
dictionary-based splitting - is that what you are planning?

In any case, I'm happy to accept patches to SimpleHTMLTokenizer to make it
work how you'd like.

Nick
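For what it's worth, the n-gram approach mentioned above can be sketched roughly like the class below. This is a hypothetical standalone example, not part of Classifier4J's API: runs of CJK ideographs are split into overlapping bigrams, while everything else is split on whitespace (as a regexp-based tokenizer would do).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of bigram (2-gram) tokenization for CJK text.
// Class and method names are illustrative only.
public class CJKBigramSketch {

    // True if the character is a CJK unified ideograph.
    static boolean isCJK(char c) {
        return Character.UnicodeBlock.of(c)
                == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<String>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (isCJK(c)) {
                // Emit overlapping bigrams for a run of CJK characters.
                int start = i;
                while (i < input.length() && isCJK(input.charAt(i))) {
                    i++;
                }
                if (i - start == 1) {
                    tokens.add(input.substring(start, i));
                } else {
                    for (int j = start; j < i - 1; j++) {
                        tokens.add(input.substring(j, j + 2));
                    }
                }
            } else {
                // Collect a run of non-CJK, non-whitespace characters.
                int start = i;
                while (i < input.length()
                        && !Character.isWhitespace(input.charAt(i))
                        && !isCJK(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            }
        }
        return tokens;
    }
}
```

So `tokenize("hello 中文分词 world")` would yield `[hello, 中文, 文分, 分词, world]` - no dictionary needed, at the cost of producing tokens that may not be real words.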