Re: RE: RE: [Classifier4j-devel] SimpleHTMLTokenizer should use decorator pattern
From: Leo L. <leo...@gm...> - 2004-09-02 09:14:50
On Thu, 2 Sep 2004 16:44:43 +0930, Nick Lothian <nl...@es...> wrote:
> > DefaultTokenizer can only work for Latin languages. I'm planning to
> > write a CJKTokenizer to split Chinese characters.
> >
>
> Why does DefaultTokenizer only work for latin languages? There is a
> constructor that will let you pass in a custom regexp to split on - is that
> not sufficient?
>

Some Asian languages are not like English: words are not separated by
spaces or any other characters, so a sentence is one continuous run of
text. A regexp that splits on delimiters has nothing to split on
between the words.

Some discussion about CJK word segmentation:
http://www.webmasterworld.com/forum32/284.htm

--
-----------------------------------------------------------------------------------------
Leo Liang
E-mail: leo...@gm...
Blog (tech & learning): http://aleung.blogbus.com
Blog (photography & outdoor): http://sunnyday.cn2k.net
Delicious bookmark: http://del.icio.us/aleung
-----------------------------------------------------------------------------------------
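
To illustrate the point about text without word separators, here is a minimal sketch of one common fallback: emit every Chinese/Japanese character as its own token and keep space-separated Latin words whole. It is an assumption-laden example, not the proposed CJKTokenizer and not part of Classifier4J; the class name CjkTokenizerSketch and its methods are hypothetical, and real word segmentation would need a dictionary or statistical model rather than per-character splitting.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone class; it does not implement any Classifier4J interface.
public class CjkTokenizerSketch {

    // True for scripts that are typically written without spaces between
    // words (Chinese and Japanese). This list is an assumption of the sketch.
    static boolean isUnspacedScript(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HAN
                || script == UnicodeScript.HIRAGANA
                || script == UnicodeScript.KATAKANA;
    }

    // Splits input into tokens: one token per Han/Hiragana/Katakana character,
    // and one token per run of other letters or digits.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (isUnspacedScript(cp)) {
                flush(word, tokens);                      // end any pending Latin word
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(cp);                 // extend the current word
            } else {
                flush(word, tokens);                      // whitespace or punctuation
            }
        }
        flush(word, tokens);
        return tokens;
    }

    // Moves the pending word, if any, into the token list.
    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        // prints [Classifier4J, 中, 文, 分, 词, test]
        System.out.println(tokenize("Classifier4J 中文分词 test"));
    }
}
```

Per-character tokens lose word boundaries, so overlapping two-character tokens (bigrams) are a frequent refinement when no segmentation dictionary is available.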