Re: [htdig-dev] Several questions...


> 1. Why does the documentation for  external_parser  and the comments
> before  Retriever::got_word  both say that the word location must be
> in the range 0-1000?

That's a 3.1-ism. The documentation is wrong. Oops.

> first word of any *other* entry.  Could we add "meta" information at
> successive locations starting from, say, location 10,000?

Actually, now that I think about it, a better idea is to use negative
word locations for META information. This would leave location 0 empty
and make it impossible for a phrase to match across the META/body
boundary, but it would fix phrase searching for META words. As for some
other arbitrary number such as 10,000: we might actually have documents
that long (especially with PDF indexing).
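
A rough sketch of the negative-location idea, to make the boundary behaviour concrete. Everything here is hypothetical (WordEntry, assignLocations and matchesPhrase are illustrative names, not ht://Dig code), and it assumes META words are numbered upward so that the last one lands at -1 and body words start at 1:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: one indexed word and its location.
struct WordEntry {
    std::string word;
    int location;
};

// Assumed numbering: META words occupy -n .. -1, body words 1 .. m.
// Location 0 is never assigned.
std::vector<WordEntry> assignLocations(const std::vector<std::string>& body,
                                       const std::vector<std::string>& meta) {
    std::vector<WordEntry> index;
    int loc = -static_cast<int>(meta.size());
    for (const auto& w : meta) index.push_back({w, loc++});
    loc = 1;
    for (const auto& w : body) index.push_back({w, loc++});
    return index;
}

// A phrase matches if its words sit at consecutive locations.
bool matchesPhrase(const std::vector<WordEntry>& index,
                   const std::vector<std::string>& phrase) {
    if (phrase.empty()) return false;
    auto has = [&](const std::string& w, int loc) {
        for (const auto& e : index)
            if (e.location == loc && e.word == w) return true;
        return false;
    };
    for (const auto& start : index) {
        if (start.word != phrase[0]) continue;
        bool ok = true;
        for (std::size_t k = 1; k < phrase.size() && ok; ++k)
            ok = has(phrase[k], start.location + static_cast<int>(k));
        if (ok) return true;
    }
    return false;
}
```

With META words "fast search" and body words "search engine", the phrases "fast search" and "search engine" each match, but "fast search engine" does not, because nothing is ever stored at location 0.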

> 3. With phrase searching, do we still need  valid_punctuation?  For
> example, "post-doctoral" currently gets entered as three words at the
> *same* location:  "post", "doctoral" and "postdoctoral".  Would it be
> better to convert queries for  post-doctoral  into the phrase "post

This is a strange example. What if I had a hyphenated word? I don't know
that your "phrase conversion" is the best solution. What we do need is a
flexible "word parser" that addresses some of these issues. After all,
Unicode raises even more problems about "what is a word."

> "the non-smoker" will match "the smoker", since all the words are
> given the same position in the database, but a search for "the non
> smoker" won't match "the non-smoker".  This also reduces the size of

For some people, punctuation has meaning. Let's say we have part numbers
or dates. "3/24/03" isn't really the same as "32403" and I'm not sure the
phrase search works well either.
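
To illustrate the concern, here is a minimal sketch of the compound-word behaviour described in the question (the parts plus the joined form). It is not the ht://Dig parser: wordVariants is a hypothetical helper, and it assumes every non-alphanumeric character counts as punctuation:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Returns the pieces of a token split at punctuation, followed by the
// token with punctuation removed when there was more than one piece.
std::vector<std::string> wordVariants(const std::string& token) {
    std::vector<std::string> variants;
    std::string part, joined;
    for (unsigned char c : token) {
        if (std::isalnum(c)) {
            part += static_cast<char>(c);
            joined += static_cast<char>(c);
        } else if (!part.empty()) {
            variants.push_back(part);
            part.clear();
        }
    }
    if (!part.empty()) variants.push_back(part);
    if (variants.size() > 1) variants.push_back(joined);
    return variants;
}
```

Under these assumptions "post-doctoral" yields "post", "doctoral" and "postdoctoral", while "3/24/03" yields "3", "24", "03" and "32403". None of those variants keeps the slashes, which is the information a date or part number depends on.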

Yes, reducing the database size would improve speed. Perhaps Gilles can
comment on the motivations for the compound-word additions. (I'm having a
hard time pulling them up in my mail archive or on the web.)

> 4. Does anybody know what the existing external parsers do about words
> less than the minimum length?  Because they are passed the

I don't think most external parsers bother with the config file. Remember
that any word should go through HtWordList, and this should throw out words
that are too long, too short, in the bad_words list, etc.
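
The kind of check I mean looks roughly like the sketch below. This is not the HtWordList interface; WordFilter and its members are made-up names, and the limits are placeholders for whatever the config file supplies:

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical filter applied to every word before it is stored.
struct WordFilter {
    std::size_t minLength;                      // assumed minimum word length
    std::size_t maxLength;                      // assumed maximum word length
    std::unordered_set<std::string> badWords;   // assumed bad-word list

    // True only if the word passes all three checks.
    bool accepts(const std::string& word) const {
        if (word.size() < minLength) return false;
        if (word.size() > maxLength) return false;
        return badWords.count(word) == 0;
    }
};
```

Since the filtering happens in one central place, an external parser can hand over every word it finds without reading the config itself.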

-Geoff