RE: RE: RE: [Classifier4j-devel] SimpleHTMLTokenizer should use decorator pattern
From: Nick L. <nl...@es...> - 2004-09-02 23:23:46
> On Thu, 2 Sep 2004 16:44:43 +0930, Nick Lothian
> <nl...@es...> wrote:
> > > DefaultTokenizer can only work for Latin languages. I'm
> > > planning to write a CJKTokenizer to split Chinese characters.
> >
> > Why does DefaultTokenizer only work for Latin languages? There is a
> > constructor that will let you pass in a custom regexp to split on -
> > is that not sufficient?
>
> Some Asian languages are not like English: words are not separated by
> spaces or any other characters. Text within a sentence is continuous.
>
> Some discussion about CJK word segmentation:
> http://www.webmasterworld.com/forum32/284.htm
>

I can't read that thread - it is marked members only.

I knew that Asian languages didn't split based on spaces, but I did think
it was possible to split based on a regexp. (See n-gram tokenization in
http://sourceforge.net/mailarchive/forum.php?thread_id=3404351&forum_id=8740
and Zope's CJKSplitter:
http://www.zope.org/Members/panjunyong/CJKSplitter).

How are you planning on doing it? I've seen some discussion of
dictionary-based splitting - is that what you are planning?

In any case, I'm happy to accept patches to SimpleHTMLTokenizer to make it
work how you'd like.

Nick
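For what it's worth, the n-gram approach mentioned above can be sketched roughly like the class below. This is a hypothetical standalone example, not part of Classifier4J's API: runs of CJK ideographs are split into overlapping bigrams, while everything else is split on whitespace (as a regexp-based tokenizer would do).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of bigram (2-gram) tokenization for CJK text.
// Class and method names are illustrative only.
public class CJKBigramSketch {

    // True if the character is a CJK unified ideograph.
    static boolean isCJK(char c) {
        return Character.UnicodeBlock.of(c)
                == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<String>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (isCJK(c)) {
                // Emit overlapping bigrams for a run of CJK characters.
                int start = i;
                while (i < input.length() && isCJK(input.charAt(i))) {
                    i++;
                }
                if (i - start == 1) {
                    tokens.add(input.substring(start, i));
                } else {
                    for (int j = start; j < i - 1; j++) {
                        tokens.add(input.substring(j, j + 2));
                    }
                }
            } else {
                // Collect a run of non-CJK, non-whitespace characters.
                int start = i;
                while (i < input.length()
                        && !Character.isWhitespace(input.charAt(i))
                        && !isCJK(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            }
        }
        return tokens;
    }
}
```

So `tokenize("hello 中文分词 world")` would yield `[hello, 中文, 文分, 分词, world]` - no dictionary needed, at the cost of producing tokens that may not be real words.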