From: Steve Y. <st...@fr...> - 2004-12-15 16:00:34
You may want to double-check the permissions and ownership of pdftotext and pdfinfo. I had the same problem and realized those files were owned by root (I had to install them as root), so the web server could not use them until I changed the ownership. Just a suggestion.

On Wed, 15 Dec 2004 08:41:21 -0600, Jon Sorensen <jo...@st...> wrote:
> ? [application/pdf] Plain Text 190604
> !! Unable to execute /www/htdig/bin/doc2html/pdf2html.pl for PDF
> (pdf2html) document
>
> in the log file which would lead me to believe that the permissions are
> wrong or that
> my $PDF2HTML = '/www/htdig/bin/doc2html/pdf2html.pl';
> is wrong in doc2html.pl
>
> but as far as I know that's not the case. Is there anything else that
> could be causing this?
>
> thanks
> ----- Original Message -----
> From: David Adams
> To: Jon Sorensen ; htd...@li...
> Sent: Wednesday, December 15, 2004 3:59 AM
> Subject: Re: [htdig] pdf indexing problems
>
>
> What do you see in the /www/htdig/bin/doc2html/DOC2HTML_LOG file?
>
> David Adams
> ----- Original Message -----
> From: Jon Sorensen
> To: htd...@li...
> Sent: Tuesday, December 14, 2004 5:19 PM
> Subject: [htdig] pdf indexing problems
>
>
> I posted a question recently about indexing pdfs with doc2html
> but I can't figure out what the problem is. I believe that the
> config is correct
> but there may be a problem there. When I dig a number of pdfs the
> files
> are read but the words indexed are not correct:
> word: Read@0
> word: 8192@4
> word: from@9
> Does anyone know what this indicates?
> From looking at the message archives it seems that others have had
> this problem
> but there weren't any solutions posted in the messages
>
> my config and output follows. thanks in advance for any help, I
> appreciate it.
>
> in doc2html.pl:
>
> $ENV{DOC2HTML_LOG} = '/www/htdig/bin/doc2html/DOC2HTML_LOG';
> my $PDF2HTML = '/www/htdig/bin/doc2html/pdf2html.pl';
>
> in pdf2html.pl:
>
> my $PDFTOTEXT = "/usr/bin/pdftotext";
> my $PDFINFO = "/usr/bin/pdfinfo";
>
> rundig output:
>
> Content-Type: application/pdf
> Header line:
> returnStatus = 0
> Read 8192 from document
> Read 8192 from document
> Read 8192 from document
> Read 8192 from document
> Read 907 from document
> Read a total of 361355 bytes
> word: Read@0
> word: 8192@4
> word: from@9
> word: document@13
> word: Read@21
> word: 8192@26
> word: from@30
> word: document@35
> word: Read@43
> word: 8192@47
> word: from@52
> word: document@56
> size = 361355
> pick: www.flexco.com, # servers = 1
> 80:358:0:http://www.flexco.com/prod_info/installation_instruct/AR301_Alligator_Rivet_Gauge.pdf:
> Retrieval command for
> http://www.flexco.com/prod_info/installation_instruct/AR301_Alligator_Rivet_Gauge.pdf:
> GET /prod_info/installation_instruct/AR301_Alligator_Rivet_Gauge.pdf
> HTTP/1.0
> Cookie: authorized=true
> User-Agent: htdig/3.1.6 (jo...@st...)
> Host: www.flexco.com
>
>
> config file:
>
> database_dir: /www/htdig/db_flexco_new
> start_url: http://www.flexco.com/index.cfm
> limit_urls_to: http://www.flexco.com/
> exclude_urls: /cgi-bin/ .cgi /prod_info/safety.cfm /landing.cfm
> bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
> .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi
> .css #.pdf
> maintainer: jo...@st...
> max_head_length: 10000
> max_doc_size: 5000000
> no_excerpt_show_top: true
> search_algorithm: exact:1 synonyms:0.5 endings:0.1
> template_map: Long long ${common_dir}/flexco/long.html \
> Short short ${common_dir}/flexco/short.html
> template_name: long
> search_results_header: ${common_dir}/flexco/header.html
> search_results_footer: ${common_dir}/flexco/footer.html
> #search_results_wrapper: ${common_dir}/flexco/wrapper.html
> nothing_found_file: ${common_dir}/flexco/nomatch.html
> syntax_error_file: ${common_dir}/flexco/syntax.html
> cookie: authorized=true
> maximum_pages: 20
> external_parsers: application/pdf->text/html
> /www/htdig/bin/doc2html/doc2html.pl
> wordlist_compress: false
> wordlist_compress_zlib: false
> minimum_word_length: 2
> bad_word_list: ${common_dir}/badwords.txt
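
For anyone who wants to run the ownership and permission check suggested at the top of this message, here is a minimal Python sketch. It is only an illustration and not part of ht://Dig or doc2html: the function name check_helper and the HELPERS list are made up for this example, and the paths are simply the ones quoted in this thread, so adjust them for your install.

```python
import os
import pwd
import stat

# Paths quoted in this thread; adjust for your own installation.
HELPERS = [
    "/www/htdig/bin/doc2html/doc2html.pl",
    "/www/htdig/bin/doc2html/pdf2html.pl",
    "/usr/bin/pdftotext",
    "/usr/bin/pdfinfo",
]


def check_helper(path):
    """Report owner, mode and executability of one helper (hypothetical name)."""
    info = {"path": path, "exists": os.path.exists(path)}
    if not info["exists"]:
        return info
    st = os.stat(path)
    try:
        info["owner"] = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        # The uid has no passwd entry; fall back to the numeric id.
        info["owner"] = str(st.st_uid)
    info["mode"] = stat.filemode(st.st_mode)
    # Can the user running this script execute the file?
    info["executable_by_me"] = os.access(path, os.X_OK)
    # Is the "other" execute bit set, i.e. usable by users outside owner/group?
    info["executable_by_others"] = bool(st.st_mode & stat.S_IXOTH)
    return info


if __name__ == "__main__":
    for helper in HELPERS:
        print(check_helper(helper))
```

The result only means something if the script runs as the same user that runs htdig or the web server, since that is the user that has to execute the helpers. If a helper shows up as owned by root and not executable by others, that matches the situation described in the reply above.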