From: Geoff H. <ghu...@ws...> - 2003-02-28 17:59:50
> 1. Why do the documentation for external_parser and the comments
> before Retriever::got_word both say that the word location must be
> in the range 0-1000?

That's a 3.1-ism. The documentation is wrong. Oops.

> first word of any *other* entry. Could we add "meta" information at
> successive locations starting from, say, location 10,000?

Actually, now that I think about it, a better idea is to use negative word locations for META information. This would leave "0" empty and make it impossible to match across the boundary, but fix phrase searching for META words. As for some other arbitrary number--we might actually have documents that long (esp. with PDF indexing).

> 3. With phrase searching, do we still need valid_punctuation? For
> example, "post-doctoral" currently gets entered as three words at the
> *same* location: "post", "doctoral" and "postdoctoral". Would it be
> better to convert queries for post-doctoral into the phrase "post

This is a strange example. What if I had a hyphenated word? I don't know that your "phrase conversion" is the best solution. What we do need is a flexible "word parser" that addresses some of these issues. After all, Unicode raises even more problems about "what is a word."

> "the non-smoker" will match "the smoker", since all the words are
> given the same position in the database, but a search for "the non
> smoker" won't match "the non-smoker". This also reduces the size of

For some people, punctuation has meaning. Let's say we have part numbers or dates. "3/24/03" isn't really the same as "32403" and I'm not sure the phrase search works well either. Yes, reducing the database size would improve speed.

Perhaps Gilles can comment on the motivations for the compound-word additions. (I'm having a hard time pulling them up in my mail archive or on the web.)

> 4. Does anybody know what the existing external parsers do about words
> less than the minimum length? Because they are passed the

I don't think most external parsers bother with the config file. Remember that any word should go through HtWordList, and this should throw out words that are too long, too short, in the bad_words list, etc.

-Geoff
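
P.S. To make the negative-location idea concrete, here is a rough sketch. It is not ht://Dig code: `build_index` and `phrase_match` are made-up names, and a plain Python dict stands in for the word database. It assumes body words are numbered 1, 2, 3, ... and META words -1, -2, -3, ..., so location 0 stays empty and a phrase cannot match across the boundary.

```python
from collections import defaultdict


def build_index(body_words, meta_words):
    """Map each word to the set of locations where it occurs.

    Assumption for this sketch: body words get locations 1, 2, 3, ...
    and META words get -1, -2, -3, ...  Location 0 is never used.
    """
    index = defaultdict(set)
    for loc, word in enumerate(body_words, start=1):
        index[word.lower()].add(loc)
    for loc, word in enumerate(meta_words, start=1):
        index[word.lower()].add(-loc)
    return index


def phrase_match(index, phrase):
    """Return True if the words of `phrase` sit at consecutive locations."""
    words = [w.lower() for w in phrase.split()]
    if not words:
        return False
    for start in index.get(words[0], ()):
        # Body locations count upward, META locations count downward.
        step = 1 if start > 0 else -1
        if all(start + step * i in index.get(w, ())
               for i, w in enumerate(words)):
            return True
    return False
```

With body words "ht dig search engine" and META words "open source indexing", a phrase search for "open source" matches within the META words, while "engine open" (last body word followed by first META word) does not.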