Re: [Classifier4j-devel] HTML Tokenize v0.000001 Ready for review, Where to put stemmer

Nick Lothian wrote:

> Our stemmers & stop words will be language
> specific. Unfortunately I don't see a way around this, unless there is some
> magic way to generate a stop word list & stemmer in any language...

Could we take the same approach Lucene does? In short (if you have never
used Lucene), here is an article on indexing with Lucene:
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html. So, to index
documents in Russian, I do something like this:

    RussianAnalyzer analyzer =
            new RussianAnalyzer(RussianCharsets.UnicodeRussian);
    IndexWriter writer = new IndexWriter(indexPath, analyzer,
            false /* do not create index, it exists */);
    writer.addDocument(toLuceneDocument(dto));
    writer.optimize();
    writer.close();

where IndexWriter is something like a JDBC provider. Analyzers are
tokenisers and stemmers in one. There are also simple analyzers that
just convert all text to lowercase. "...The second parameter provides the
implementation of Analyzer that should be used for pre-processing the 
text before it is indexed. This [*NOT FROM MY CODE, see article for 
context (moedusa)*] particular implementation of Analyzer eliminates 
stop words, converts tokens to lower case, and performs a few other 
small input modifications, such as eliminating periods from acronyms" 
(http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html).
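
To make the "tokeniser and filters in one" idea concrete, here is a rough
sketch in plain Java. It is not Lucene code: SimpleStopAnalyzer and its
analyze() method are names I made up for illustration, and it only
lowercases, splits on non-letter characters and drops stop words (no
stemming).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical class for illustration; not part of Lucene or Classifier4J.
class SimpleStopAnalyzer {
    private final Set<String> stopWords;

    // The stop words are supplied by the caller, so they can come from anywhere.
    SimpleStopAnalyzer(String[] stopWords) {
        this.stopWords = new HashSet<>(Arrays.asList(stopWords));
    }

    // Tokenise, lowercase and filter in a single pass.
    List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // \p{L} and \p{Nd} are Unicode-aware, so this also splits Cyrillic text.
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}

class SimpleStopAnalyzerDemo {
    public static void main(String[] args) {
        SimpleStopAnalyzer analyzer =
                new SimpleStopAnalyzer(new String[] {"the", "and"});
        System.out.println(analyzer.analyze("The cat, and the HAT!"));
        // prints [cat, hat]
    }
}
```

The same class should work for Russian text if it is given a Russian
stop-word array; only the data changes, not the code.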

So, to deal with encoding, I need only my (Russian) analyzer. It should
be a stemming analyzer, if there is an implementation, or just a
stop-word analyzer. The initial stop words are stored as an array, and
could also be instantiated from anywhere else; that is up to the
developer. There is no code in Lucene to search for a stop-word list or
anything like that...

It means that you, as the author, only need to provide the API :) We'll do
the rest.
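
Just to show what I mean by "provide the API", here is a minimal sketch.
All names (TextAnalyzer, WordCounter) are hypothetical and are not taken
from Classifier4J; the only point is that the library depends on a small
interface and each user plugs in an analyzer for their own language.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical extension point; name and signature are illustrative only.
interface TextAnalyzer {
    String[] analyze(String text);
}

// Hypothetical library-side consumer: it counts tokens and knows nothing
// about languages, stop words or stemming.
class WordCounter {
    private final TextAnalyzer analyzer;

    WordCounter(TextAnalyzer analyzer) {
        this.analyzer = analyzer;
    }

    Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : analyzer.analyze(text)) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }
}

class WordCounterDemo {
    public static void main(String[] args) {
        // A trivial user-supplied analyzer: lowercase, then split on whitespace.
        TextAnalyzer lowercase =
                text -> text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        Map<String, Integer> counts =
                new WordCounter(lowercase).count("Spam spam eggs");
        System.out.println(counts.get("spam")); // prints 2
    }
}
```

A Russian user would then pass in an analyzer wrapping a Russian stemmer
or stop-word list, without any change on the library side.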