    • H.X.T

      H.X.T - 2009-05-05

      Hi, a few days ago I found this framework for creating a VSM.

      These days I have been working on a simple Chinese text classification (TC) system.

      I used WVTool in my system. However, I found a problem.

      When I used WVTool for TC in English, it was OK.

      But when it comes to Chinese, there is a problem.

      You assume that the length of a word should be greater than 2.

      That's OK for English.

      In Chinese, a word is composed of characters.

      A character is like a letter.

      Most Chinese words are only 2 characters long.

      You put the assumption in the class AbstractStemmer, so even when the DummyStemmer class, which is supposed to do nothing, is set in the config, it in fact filters out all words whose length is less than 3.

      So when I used this lib in my system, I got nothing in the wordlist.

      I was confused until I saw the source code of AbstractStemmer.

      P.S. WordFilter contains this assumption too.
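
      To make the effect concrete, here is a minimal, self-contained sketch of a minimum-length filter like the one described above. It is not WVTool code: the class name LengthFilterDemo and the method filterByMinLength are hypothetical, and the threshold of 3 is only taken from the description in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LengthFilterDemo {

    // Hypothetical helper (not part of WVTool): keeps only the tokens
    // that contain at least minLength characters.
    static List<String> filterByMinLength(List<String> tokens, int minLength) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // Count code points so that each Chinese character counts as one.
            if (token.codePointCount(0, token.length()) >= minLength) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> english = Arrays.asList("text", "classification", "system");
        List<String> chinese = Arrays.asList("中文", "文本", "分类", "系统");

        // With a minimum length of 3, the English tokens all survive.
        System.out.println(filterByMinLength(english, 3));
        // The two-character Chinese words are all dropped.
        System.out.println(filterByMinLength(chinese, 3));
        // With a minimum length of 1, the Chinese words are kept.
        System.out.println(filterByMinLength(chinese, 1));
    }
}
```

      With a minimum length of 3 the Chinese list comes back empty, which matches the empty wordlist I got. A configurable minimum length would avoid this.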

      So, for a universal lib that is not only for English, do you think anything should be done about this problem?

        Sorry, I made a mistake:

        the class that makes that assumption is the



