Re: [Classifier4j-devel] HTML Tokenize v0.000001 Ready for review, Where to put stemmer
From: moedusa <mo...@in...> - 2003-11-17 07:03:51
Nick Lothian wrote:
> Our stemmers & stop words will be language
> specific. Unfortunately I don't see a way around this, unless there is some
> magic way to generate a stop word list & stemmer in any language...

Could we take the same approach Lucene does? In short (if you have never used Lucene), here is an article on indexing with Lucene:
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html

So, to index documents in Russian, I do something like this:

    RussianAnalyzer analyzer = new RussianAnalyzer(RussianCharsets.UnicodeRussian);
    IndexWriter writer = new IndexWriter(indexPath, analyzer, false /* do not create index, it exists */);
    writer.addDocument(toLuceneDocument(dto));
    writer.optimize();
    writer.close();

where IndexWriter is something like a JDBC provider. Analyzers are tokenizers and stemmers in one. There are also simple analyzers that just convert all text to lowercase.

"...The second parameter provides the implementation of Analyzer that should be used for pre-processing the text before it is indexed. This [*NOT FROM MY CODE, see article for context (moedusa)*] particular implementation of Analyzer eliminates stop words, converts tokens to lower case, and performs a few other small input modifications, such as eliminating periods from acronyms"
(http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html)

So, to deal with encoding, I need only my (Russian) analyzer. It should be a stemming analyzer if an implementation exists, or otherwise just a stop-word analyzer. The initial stop words are stored as an array, and could also be loaded from anywhere else; that is up to the developer. There is no code in Lucene that searches for a stop-word list or anything like that...

It means that you, as the author, only need to provide the API :) We'll do the rest.
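To make the "only provide the API" idea concrete, here is a minimal sketch of what a pluggable analyzer could look like without any Lucene dependency. The names TextAnalyzer and StopWordAnalyzer are hypothetical, made up for this illustration; they are not existing Classifier4J or Lucene classes. The sketch only lowercases, splits on non-alphanumeric characters and drops stop words; a stemming analyzer would be another implementation of the same interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical interface (an assumption for this sketch): one object that
// turns raw text into the tokens the classifier should see.
interface TextAnalyzer {
    List<String> analyze(String text);
}

// Hypothetical stop-word analyzer. The stop words arrive as a plain array,
// so the caller decides where the list comes from (code, file, database...).
final class StopWordAnalyzer implements TextAnalyzer {
    private final Set<String> stopWords;

    // Stop words are expected in lower case, because tokens are lowercased
    // before they are compared against the set.
    StopWordAnalyzer(String[] stopWords) {
        this.stopWords = new HashSet<>(Arrays.asList(stopWords));
    }

    @Override
    public List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on any run of characters that are neither Unicode letters
        // nor decimal digits, so the same code handles non-Latin scripts.
        String[] parts = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String part : parts) {
            // A leading delimiter yields an empty first element; skip it.
            if (!part.isEmpty() && !stopWords.contains(part)) {
                tokens.add(part);
            }
        }
        return tokens;
    }
}
```

With something like this, supporting a new language would mean passing in a different stop-word array (or a different TextAnalyzer implementation), and the classifier itself would only ever see the TextAnalyzer interface.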