RE: RE: [Classifier4j-devel] SimpleHTMLTokenizer should use decorator pattern
From: Nick L. <nl...@es...> - 2004-09-02 07:17:47
> > > Now, SimpleHTMLTokenizer inherits from DefaultTokenizer. If I make a
> > > new ITokenizer implement, I have to rewrite a HTML tokenizer.
> > >
> > > If SimpleHTMLTokenizer use decorator pattern, it can be re-used in
> > > other ITokenizer implements.
> > >
> > > --------------------> ITokenizer
> > > |          |                  |
> > > -- SimpleHTMLTokenizer   DefaultTokenizer
> > >
> >
> > Why would you want to use any of the functionality of SimpleHTMLTokenizer
> > without also using DefaultTokenizer?
> >
> > SimpleHTMLTokenizer doesn't really do a great deal more than
> > DefaultTokenizer, and I would like to understand which parts of it you want
> > to reuse.
> >
> > Nick
>
> DefaultTokenizer can only work for latin language. I'm planning to
> write a CJKTokenizer to splite chinese characters.
>

Why does DefaultTokenizer only work for latin languages? There is a
constructor that will let you pass in a custom regexp to split on - is that
not sufficient?

I should also point out that SimpleHTMLTokenizer is probably insufficient for
almost any real world usage - it will break on mis-matched tags, for instance.

Nick
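
For readers following the thread, here is a minimal sketch of the two ideas under discussion: a tokenizer that splits on a caller-supplied regexp, and an HTML-stripping decorator that can wrap any tokenizer. It does not use Classifier4J's classes. The `Tokenizer` interface, `RegexTokenizer`, and `HtmlStrippingTokenizer` are hypothetical stand-ins invented for illustration, and the naive tag-stripping regexp and the one-token-per-Han-character split are assumptions, not the library's behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenizerSketch {

    /** Hypothetical stand-in for a tokenizer interface; not Classifier4J's ITokenizer. */
    interface Tokenizer {
        String[] tokenize(String input);
    }

    /** Splits on a caller-supplied regular expression and drops empty tokens. */
    static class RegexTokenizer implements Tokenizer {
        private final String regex;

        RegexTokenizer(String regex) {
            this.regex = regex;
        }

        @Override
        public String[] tokenize(String input) {
            List<String> tokens = new ArrayList<>();
            for (String token : input.split(regex)) {
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
            }
            return tokens.toArray(new String[0]);
        }
    }

    /** Decorator: strips anything that looks like a tag, then delegates to the wrapped tokenizer. */
    static class HtmlStrippingTokenizer implements Tokenizer {
        private final Tokenizer delegate;

        HtmlStrippingTokenizer(Tokenizer delegate) {
            this.delegate = delegate;
        }

        @Override
        public String[] tokenize(String input) {
            // Naive on purpose: like any regexp approach, this mishandles malformed markup.
            String text = input.replaceAll("<[^>]*>", " ");
            return delegate.tokenize(text);
        }
    }

    // Assumption: split on whitespace, and also before and after every Han character.
    static final String HAN_SPLIT = "\\s+|(?<=\\p{IsHan})|(?=\\p{IsHan})";

    public static void main(String[] args) {
        Tokenizer words = new HtmlStrippingTokenizer(new RegexTokenizer("\\W+"));
        System.out.println(Arrays.toString(words.tokenize("<p>Hello <b>world</b></p>")));

        // "\u4e2d\u6587" is two Han characters, written as escapes to avoid encoding issues.
        Tokenizer cjk = new HtmlStrippingTokenizer(new RegexTokenizer(HAN_SPLIT));
        System.out.println(Arrays.toString(cjk.tokenize("<p>\u4e2d\u6587 text</p>")));
    }
}
```

With the decorator, the HTML handling is written once and reused with either splitting strategy. Whether a single regexp is sufficient for real CJK text is a separate question: per-character splitting is only the simplest possible approach.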