Welcome to Open Discussion
Hi, a few days ago I found this framework for creating the VSM.
These days I have been working on a simple Chinese text classification (TC) system.
I used WVTool in my system; however, I found a problem.
When I used WVTool for TC in English, it was fine,
but when it comes to Chinese, there is a problem.
You assume that the length of a word should be greater than 2.
That's OK for English.
In Chinese, a word is composed of characters.
A character is like a letter.
Most Chinese words are only 2 characters long.
You put this assumption in the class AbstractStemmer, so even the DummyStemmer class, which is supposed to do nothing when set in the config, in fact filters out every word whose length is less than 3.
So when I used this library in my system, I got nothing in the word list.
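To make the effect concrete, here is a minimal sketch of what such a length check does to Chinese tokens. It is not WVTool code: the class LengthFilterDemo, the constant MIN_LENGTH and the sample tokens are all made up for illustration, and the threshold of 3 is only what I inferred from the behaviour. The Chinese words are written as Unicode escapes so that the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in, NOT WVTool code: mimics a step that keeps only
// tokens longer than 2 characters.
class LengthFilterDemo {

    // Assumed threshold, inferred from the behaviour described in the post.
    static final int MIN_LENGTH = 3;

    // Returns the tokens whose length is at least MIN_LENGTH.
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() >= MIN_LENGTH) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> english = Arrays.asList("text", "classification", "of");
        // Three common two-character Chinese words:
        // "wenben" (text), "fenlei" (classification), "xitong" (system).
        List<String> chinese = Arrays.asList(
                "\u6587\u672C", "\u5206\u7C7B", "\u7CFB\u7EDF");

        System.out.println(filter(english)); // only "of" is dropped
        System.out.println(filter(chinese)); // every token is dropped
    }
}
```

With English input only the short word "of" is lost, but with the Chinese input the result is an empty list, which matches the empty word list I saw.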
I was confused until I saw the source code of AbstractStemmer.
P.S. WordFilter contains this assumption too.
So, as a universal library, not one only for English, do you think anything should be done about this problem?
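One possible direction, sketched only as a suggestion, would be to let the caller choose the minimum length instead of fixing it in the code. The class below is hypothetical and does not use any WVTool interface; the name ConfigurableLengthFilter and its constructor parameter are my own assumptions. It counts Unicode code points rather than Java chars, so a character outside the Basic Multilingual Plane is counted once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, NOT part of WVTool: a length filter whose threshold
// is supplied by the caller instead of being hard-coded.
class ConfigurableLengthFilter {

    private final int minLength;

    ConfigurableLengthFilter(int minLength) {
        this.minLength = minLength;
    }

    // Counts Unicode code points, so a surrogate pair counts as one character.
    boolean accepts(String token) {
        return token.codePointCount(0, token.length()) >= minLength;
    }

    // Returns the tokens that pass the configured length check.
    List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (accepts(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "wenben" (text), "fenlei" (classification), "de" (a one-character particle)
        List<String> chinese = Arrays.asList(
                "\u6587\u672C", "\u5206\u7C7B", "\u7684");

        // A threshold of 2 keeps the two-character words.
        System.out.println(new ConfigurableLengthFilter(2).filter(chinese).size());
        // A threshold of 3 drops all of them.
        System.out.println(new ConfigurableLengthFilter(3).filter(chinese).size());
    }
}
```

An English setup could keep passing 3, while a Chinese setup could pass 2 or 1, so the existing behaviour would not have to change for current users.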
Sorry, I made a mistake:
the class that makes that assumption is the