From: Travis B. <tr...@us...> - 2003-06-24 03:31:39
Adriana,

I'm sorry I didn't reply to your last email. Choosing an indexing vocabulary is a good question, and the answer depends on your data set. IGLU is designed not to make those decisions for you.

After you have trained the vector creator by passing the documents through it, you can make two method calls to control the vocabulary and the size of each index.

setLimitTopN(int) tells the vector creator to use only the top N most frequently occurring terms in the indices it generates.

setMaxSize(int n) tells it to return only indices with the top N most highly ranked terms for each term vector created.

In my experience, setLimitTopN is not terribly effective: by limiting the vocabulary size in that way, you run the risk of dropping terms that help differentiate documents. setMaxSize is usually good to use because you don't want your vectors to be too large. I'd try setMaxSize(25) and see whether you get good performance with that.

Does this answer your question?

Travis

--- Adriana - IG <adr...@ig...> wrote:
> Dear Travis Bauer,
>
> Do you know how can I get the indexing terms from the docs.? How many
> terms does need it have.?
>
> I used all classes from the IGLU to index the document. And I have good
> results. I only don't know how many terms select to Index list.
>
> Thank you.
>
> Adriana.
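
For readers of the archive, the difference between the two limits described in the reply can be illustrated with a small standalone sketch. This is not IGLU code and does not call the IGLU API. The class VocabularyLimits and the methods topNVocabulary and truncateVector are hypothetical names invented for this illustration, and the assumption that "most highly ranked" means "largest weight" is ours. The first method mimics a global cut by corpus frequency (the idea behind setLimitTopN), the second a per-vector cut by term weight (the idea behind setMaxSize).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names come from IGLU.
class VocabularyLimits {

    // Global cut: keep the n terms with the highest corpus-wide frequency.
    // Ties are broken alphabetically so the result is deterministic.
    static Set<String> topNVocabulary(Map<String, Integer> corpusFreq, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(corpusFreq.entrySet());
        entries.sort((a, b) -> {
            int byFreq = Integer.compare(b.getValue(), a.getValue());
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        });
        Set<String> vocab = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(n, entries.size()))) {
            vocab.add(e.getKey());
        }
        return vocab;
    }

    // Per-vector cut: keep the maxSize terms with the largest weight in one vector.
    // Assumes "rank" means weight, which may differ from IGLU's ranking.
    static Map<String, Double> truncateVector(Map<String, Double> weights, int maxSize) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(weights.entrySet());
        entries.sort((a, b) -> {
            int byWeight = Double.compare(b.getValue(), a.getValue());
            return byWeight != 0 ? byWeight : a.getKey().compareTo(b.getKey());
        });
        Map<String, Double> truncated = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : entries.subList(0, Math.min(maxSize, entries.size()))) {
            truncated.put(e.getKey(), e.getValue());
        }
        return truncated;
    }

    public static void main(String[] args) {
        // Made-up corpus frequencies: "zebra" is rare but distinctive.
        Map<String, Integer> corpusFreq = new LinkedHashMap<>();
        corpusFreq.put("the", 50);
        corpusFreq.put("data", 30);
        corpusFreq.put("index", 20);
        corpusFreq.put("zebra", 2);

        // Made-up term weights for a single document.
        Map<String, Double> docVector = new LinkedHashMap<>();
        docVector.put("zebra", 0.9);
        docVector.put("data", 0.3);
        docVector.put("index", 0.2);
        docVector.put("the", 0.05);

        // The global cut drops the rare term for every document.
        System.out.println("Global top-3 vocabulary: " + topNVocabulary(corpusFreq, 3));
        // The per-vector cut keeps whatever is strongest in this document.
        System.out.println("Vector truncated to 2:   " + truncateVector(docVector, 2));
    }
}
```

In this toy data the global cut discards "zebra" even though it is the strongest term in the sample document, which is the risk the reply attributes to setLimitTopN. The per-vector cut only bounds the vector size and keeps that term.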