From: David A. <D.J...@so...> - 2003-02-17 11:45:41
You have

    keywords_factor: 500

Is it possible that the authors of the PDF documents have been diligent in
setting the keywords whilst the authors of the HTML pages have not bothered?

Just a thought.

David Adams
Corporate Information Services
Information Systems Services
University of Southampton

----- Original Message -----
From: "Michael Boer" <boerm@u.washington.edu>
To: <htd...@li...>
Sent: Friday, February 14, 2003 11:18 PM
Subject: [htdig] pdf and doc hits sorted first in htsearch results?

> We are running ht://Dig 3.2.0b4-011302 on a Red Hat 7.3 system, installed from
> the standard Red Hat RPMs. We have been using doc2html to parse PDFs and DOCs,
> with the following lines at the end of /etc/htdig.conf:
>
> external_parsers: application/msword->text/html /usr/local/bin/doc2html.pl \
>     application/postscript->text/html /usr/local/bin/doc2html.pl \
>     application/pdf->text/html /usr/local/bin/doc2html.pl
>
> The mystery is: How can we get htsearch to stop bunching all the .pdf and .doc
> files at the top of the results? For reasons unclear to me, all matching .pdf
> files are listed, then all the .docs files, and then all the .html files.
>
> Our search algorithm and weighting factors are like this:
>
> search_algorithm: exact:1 synonyms:0.2 endings:0.1
>
> #backlink_factor: 1000.0
> #date_factor: 0.00
> #description_factor: 150
> #heading_factor: 5.0
> keywords_factor: 500
> meta_description_factor: 100
> #text_factor: 1
> #title_factor: 100
> heading_factor_1: 10
> heading_factor_2: 5
> heading_factor_3: 4
> #heading_factor_4: 1
> #heading_factor_5: 1
> #heading_factor_6: 0
>
> Any suggestions? (We're just about ready to give up indexing .pdf and .doc
> files altogether.)
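
The keywords hypothesis in the reply can be illustrated with a toy model. This is only a sketch and not ht://Dig's actual ranking code: it assumes a score is a simple sum of field occurrences multiplied by that field's factor. The factor values for keywords, meta description and heading level 1 are taken from the quoted config; the text factor of 1 is an assumption (that line is commented out in the config), and the function name and both sample documents are hypothetical.

```python
# Toy model only: NOT ht://Dig's real scoring, just an illustration of
# how one very large factor can dominate a weighted sum.
FACTORS = {
    "keywords": 500,          # keywords_factor from the quoted config
    "meta_description": 100,  # meta_description_factor from the quoted config
    "heading_1": 10,          # heading_factor_1 from the quoted config
    "text": 1,                # assumed value; text_factor is commented out
}


def toy_score(hits, factors=FACTORS):
    """Return the sum of (occurrences in a field) * (that field's factor)."""
    return sum(factors[field] * count for field, count in hits.items())


# Hypothetical documents: the PDF has the search term once in its keywords
# metadata, the HTML page has it in a heading and many times in the body.
pdf_hits = {"keywords": 1, "text": 3}
html_hits = {"heading_1": 1, "text": 20}

pdf_score = toy_score(pdf_hits)
html_score = toy_score(html_hits)

# Same documents with the keywords factor turned off, as a diagnostic.
no_keywords = {**FACTORS, "keywords": 0}
pdf_score_no_kw = toy_score(pdf_hits, no_keywords)
html_score_no_kw = toy_score(html_hits, no_keywords)

print(pdf_score, html_score, pdf_score_no_kw, html_score_no_kw)
```

In this toy model a single keyword match outweighs a heading match plus twenty body matches, and the ordering flips once the keywords factor is removed. If the real index behaves similarly, temporarily lowering keywords_factor and comparing the result order would be one way to test the hypothesis.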