>
> > >
> > > Now, SimpleHTMLTokenizer inherits from DefaultTokenizer. If I write a
> > > new ITokenizer implementation, I have to rewrite an HTML tokenizer.
> > >
> > > If SimpleHTMLTokenizer used the decorator pattern, it could be reused
> > > with other ITokenizer implementations.
> > >
> > > --------------------> ITokenizer
> > > | | |
> > > -- SimpleHTMLTokenizer DefaultTokenizer
> > >
> > >
> >
> > Why would you want to use any of the functionality of SimpleHTMLTokenizer
> > without also using DefaultTokenizer?
> >
> > SimpleHTMLTokenizer doesn't really do a great deal more than
> > DefaultTokenizer, and I would like to understand which parts of it you
> > want to reuse.
> >
> > Nick
>
> DefaultTokenizer only works for Latin-script languages. I'm planning to
> write a CJKTokenizer to split Chinese characters.
>
Why does DefaultTokenizer only work for Latin-script languages? There is a
constructor that lets you pass in a custom regexp to split on - is that
not sufficient?
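
As a rough illustration of what a custom split expression could look like,
here is a minimal sketch using plain java.util.regex rather than the library
itself. The class name RegexSplitSketch and the pattern are assumptions made
up for this example. The pattern splits on runs of non-letter, non-digit
characters and also on either side of any Han character, so each Chinese
character comes out as its own token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the library: shows a split regexp only.
class RegexSplitSketch {
    // Assumed pattern: break on non-letter/non-digit runs, and break
    // immediately before and after every Han character.
    static final String SPLIT =
        "[^\\p{L}\\p{N}]+|(?<=\\p{IsHan})|(?=\\p{IsHan})";

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String t : input.split(SPLIT)) {
            // Zero-width matches next to a separator yield empty strings.
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

Whether a pattern like this is good enough for real Chinese text is a
separate question, since it produces single characters rather than words.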

I should also point out that SimpleHTMLTokenizer is probably insufficient
for almost any real-world usage - it will break on mismatched tags, for
instance.

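For reference, the decorator arrangement proposed above might look roughly
like the sketch below. None of this is the library's code: the Tokenizer
interface, WhitespaceTokenizer and HtmlStrippingTokenizer are hypothetical
names, and the interface is only assumed to resemble ITokenizer. The tag
stripping is a naive regexp and shares the weakness just mentioned with
malformed markup.

```java
// Hypothetical stand-in for the library's tokenizer interface.
interface Tokenizer {
    String[] tokenize(String input);
}

// Hypothetical concrete tokenizer; any implementation could take its place.
class WhitespaceTokenizer implements Tokenizer {
    public String[] tokenize(String input) {
        String trimmed = input.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}

// Decorator: removes tags, then hands the text to the wrapped tokenizer.
class HtmlStrippingTokenizer implements Tokenizer {
    private final Tokenizer delegate;

    HtmlStrippingTokenizer(Tokenizer delegate) {
        this.delegate = delegate;
    }

    public String[] tokenize(String input) {
        // Naive: anything shaped like <...> is replaced with a space.
        String text = input.replaceAll("<[^>]*>", " ");
        return delegate.tokenize(text);
    }
}
```

With this shape, a CJK tokenizer could be passed to the constructor in place
of WhitespaceTokenizer without touching the HTML handling.
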
Nick