From: Michael B. <boerm@u.washington.edu> - 2003-02-14 23:18:15
We are running ht://Dig 3.2.0b4-011302 on a Red Hat 7.3 system, installed from the standard Red Hat RPMs. We have been using doc2html to parse PDFs and DOCs, with the following lines at the end of /etc/htdig.conf:

external_parsers: application/msword->text/html /usr/local/bin/doc2html.pl \
                  application/postscript->text/html /usr/local/bin/doc2html.pl \
                  application/pdf->text/html /usr/local/bin/doc2html.pl

The mystery is: How can we get htsearch to stop bunching all the .pdf and .doc files at the top of the results? For reasons unclear to me, all matching .pdf files are listed first, then all the .doc files, and then all the .html files.

Our search algorithm and weighting factors are set like this:

search_algorithm: exact:1 synonyms:0.2 endings:0.1
#backlink_factor: 1000.0
#date_factor: 0.00
#description_factor: 150
#heading_factor: 5.0
keywords_factor: 500
meta_description_factor: 100
#text_factor: 1
#title_factor: 100
heading_factor_1: 10
heading_factor_2: 5
heading_factor_3: 4
#heading_factor_4: 1
#heading_factor_5: 1
#heading_factor_6: 0

Any suggestions? (We're just about ready to give up indexing .pdf and .doc files altogether.)