[htdig] Re: [htdig-dev] pdf, docs and keywords

I'm moving this discussion to the htdig-general mailing list, because
it's not developer-oriented, but rather deals with configuration issues.

According to Natalija Stevens:
> Yes I have re-indexed pages after changing conf files.
> Version of htdig is 3.2 -oB4

Which snapshot of htdig 3.2.0b4 are you using?  3.2.0b4 hasn't been
released yet, but people have been using development snapshots of it
for about 2 years now, and we've been recommending them for over a
year.  The problem is that the snapshots come out weekly, and a lot has
changed over those hundred-odd weeks, including a lot of bug fixes.
So, when you say you're using "B4", it doesn't mean much to us without a
snapshot date.  There were scoring bugs in snapshots before November 2000.

> External converter we are using is doc2html.pl, which  does pdf to text.
> For docs we use catdoc 0.914-1.
> 
> The answer on the last question is  that I don't think so. I know that the
> search is performed through first 100 words and if not found, then the
> message will appear like : " Can't find your search at the beginning of
> document". It still displays document if there was a searched word after
> first 100 words.

This is standard behaviour, because the excerpt that htdig stores in the
database is often shorter than the whole document.  The excerpt length
is configurable: on my site, I use an excerpt_length setting 5 times
the default.

See http://www.htdig.org/attrs.html#excerpt_length
    http://www.htdig.org/attrs.html#no_excerpt_show_top
and http://www.htdig.org/attrs.html#no_excerpt_text
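
As a rough sketch only, the three attributes above might be set like
this in the config file.  The values are illustrative assumptions, not
copied from any real site: 1500 assumes the documented excerpt_length
default is 300, and the no_excerpt_text wording is made up.

```
# Illustrative values only; check attrs.html for your snapshot's defaults.

# Assumed to be 5 times a default of 300.
excerpt_length:       1500

# Show the top of the excerpt when the search words aren't found in it.
no_excerpt_show_top:  true

# Used instead when no_excerpt_show_top is false (hypothetical wording).
no_excerpt_text:      <em>(Search words not found in the stored excerpt.)</em>
```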

> Hope this clarifies my problem a bit.

Not necessarily, unless you're assuming that if the word doesn't show up
in the excerpt, then it should score lower even if it's in the document.
That isn't the case.  The excerpt has no bearing on score calculations.
If a word appears several times in a Word or PDF document, even if the
occurrences fall beyond the first 10K or so covered by the excerpt, it
will still score higher than the same word appearing only once in
another document.

Also, because your title_factor is 100, words in titles will score
very high, even if they're not anywhere else in the text.  doc2html.pl
extracts the PDF title from the PDF's info dictionary (whether it's set
appropriately or not) and it also uses the file name of a Word document
in a title it generates when converting these.
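
If the titles generated by the converter turn out to be unreliable, one
possible adjustment would be to narrow the gap between the factors
quoted further down.  The sketch below uses made-up values purely for
illustration; it isn't a recommendation for any particular site.

```
# Hypothetical values, chosen only to show a smaller title/text gap.
title_factor:     10
keywords_factor:  50
text_factor:      1
```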

> From: Gilles Detillieux [mailto:gr...@sc...]
...
> According to Natalija Stevens:
> > Hi my current weighting in .conf file is
> > 
> > title_factor: 100
> > keywords_factor: 50
> > text_factor: 1
> > other _factors are set to 0
> >  
> > at the top of the pages I have
> > <meta name="keywords" content="blah, blah, blah">
> > 
> > I have also set in conf file the line that equals keywords and
> > htdig-keywords, that I found on this discussion group.
> > 
> > 
> > My problem is that as a result of search I first get all pdf and doc
> > files (marked with four stars), then rest of the search. This bothers
> > me as some of these pdf-s and docs files are not really relevant to
> > the search, they might just have search word mention on one or two
> > places in the text.
> > 
> > The rest of the search is marked with three and less stars, although
> > those with search words in title should really get 4 stars rather
> > than 3 and 2.
> 
> What version of htdig are you running?  If 3.1.x, did you reindex after
> changing the factors in your config file?  What external parser or converter
> are you using to index pdf and doc files?  Do these scripts output <title>
> and meta keywords tags from info extracted from the pdf or doc files?

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)