Re: [htdig] PDF-SEARCH

Hi Natalya,
it seems that the path to perl is wrong, and that is why the Perl
scripts don't work.

Check the first line of each Perl script (.pl) and correct the path
to perl there. Maybe perl isn't even installed  ;-)
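
A rough way to do that check from the shell, as a sketch only: the
function name check_script is made up for this example, and I'm assuming
a POSIX sh with head and sed. It reads the "#!" line of a script and
reports whether the interpreter named there exists. It also flags a
carriage return at the end of that line (a file saved with DOS line
endings), which produces the same "bad interpreter" error.

```shell
#!/bin/sh
# check_script FILE  (hypothetical helper, not part of ht://Dig)
# Reports whether the interpreter on FILE's "#!" line can be executed.
check_script() {
    first=$(head -n 1 "$1")      # first line only; $(...) drops the newline
    cr=$(printf '\r')            # a literal carriage return, for comparison

    case $first in
        '#!'*) ;;
        *) echo "$1: no #! line"; return 1 ;;
    esac

    case $first in
        *"$cr") echo "$1: #! line ends with a carriage return"; return 1 ;;
    esac

    # Strip "#!", optional blanks, and any arguments such as -w.
    interp=$(printf '%s\n' "$first" | sed 's/^#![[:space:]]*//; s/[[:space:]].*$//')

    if [ -x "$interp" ]; then
        echo "$1: ok ($interp)"
    else
        echo "$1: missing interpreter $interp"
        return 1
    fi
}
```

For example, "check_script /usr/local/bin/conv_doc.pl" tests the script
from this thread, and "command -v perl" shows where perl is installed,
if it is on your PATH at all. That is the path to put on the first line.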

Best wishes,
Martin

Natalya Kolesnikova wrote:

> Hello David, 
> 
> thank you very much for your support!
> 
> Yes, htdig is reading a .pdf file. pdftotext and pdfinfo are working OK, too.
> But when I run conv_doc.pl (or doc2html.pl) from the command line with a
> PDF file as an argument, I get the error message:
> bad interpreter: no such file or directory/usr/bin/perl.
> 
> What is wrong here?
> 
> best regards
> Natalya
> 
> 
>>OK, so far we have established:
>>
>>1)    Htdig is reading a .PDF file
>>2)    You are attempting to use /usr/local/bin/conv_doc.pl to convert it.
>>3)    No text is being extracted from the .PDF file, so it is not being
>>indexed.
>>
>>This suggests that the fault is with /usr/local/bin/conv_doc.pl.  Please
>>try executing this from the command line:
>>
>>            /usr/local/bin/conv_doc.pl  somepdffile.pdf
>>
>>where somepdffile.pdf is a PDF file from which it should be able to
>>extract text.  See what happens.
>>This is a necessary step in the diagnosis.
>>
>>David Adams
>>Corporate Information Services
>>Information Systems Services
>>University of Southampton
>>
>>
>>----- Original Message ----- 
>>From: "Natalya Kolesnikova" <Ja...@gm...>
>>To: "Gilles Detillieux" <gr...@sc...>
>>Cc: <D.J...@so...>; <htd...@li...>
>>Sent: Thursday, October 09, 2003 9:51 AM
>>Subject: Re: [htdig] PDF-SEARCH
>>
>>
>>
>>>Yes, I get the error message "Deleted: no excerpt"!
>>>
>>>Natalya
>>>
>>>
>>>>According to Natalya Kolesnikova:
>>>>
>>>>>Thank you, David, for your help!
>>>>>
>>>>>But when I run htmerge, I get the following message:
>>>>>htmerge: Document database has no URLs. Check your config file and
>>>>>try running htdig again.
>>>>
>>>>Are there any other htmerge error messages, such as a "Deleted: no
>>>>excerpt" message?  I suspect what's happening here is that htdig adds
>>>>the single URL for the PDF file, which you specify in start_url, to the
>>>>database, but when it tries to index it, it finds nothing to index.
>>>>When htmerge sees that nothing was indexed for this one document, it
>>>>removes it from the database, but then complains that there are no URLs
>>>>left in the database.
>>>>Seeing all the htmerge error messages (try htmerge -v after htdig)
>>>>would give us a more complete picture.
>>>>
>>>>Please follow through on Dave's and my suggestions below...
>>>>
>>>>
>>>>>>Ok, your configuration file contains:
>>>>>>
>>>>>>external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
>>>>>>              application/postscript->text/html /usr/local/bin/conv_doc.pl \
>>>>>>              application/pdf->text/html /usr/local/bin/conv_doc.pl
>>>>>>
>>>>>>so you are using conv_doc.pl.
>>>>>>
>>>>>>Please check one thing in your configuration file: make sure there
>>>>>>are no white space characters after the \ characters at the end of
>>>>>>lines, this is most important.
>>>>
>>>>My first hunch is that this isn't the problem, because if htdig didn't
>>>>see the full external_parsers definition (all 3 lines of it), it likely
>>>>would be trying to use acroread and the PDF:: class, so we'd see
>>>>messages from there.  However, it's an easy thing to check for, and
>>>>always a good idea to pay close attention to in any case, so please do
>>>>have a look at these lines.
>>>>
>>>>
>>>>>>If your configuration file is OK, then the problem must be with
>>>>>>/usr/local/bin/conv_doc.pl or the utilities it calls.
>>>>>>Try running /usr/local/bin/conv_doc.pl from the command line with a
>>>>>>.PDF file as argument and see what the result is.
>>>>
>>>>This is a very important test.  Your first test, with the start_url
>>>>set to
>>>>http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf
>>>>showed that it failed with this single PDF file, which suggests a
>>>>problem either with that PDF file or with the setup of the external
>>>>parser.  The next step is to find out which is at fault, and this test
>>>>will do that.  If it fails on the introduction_to_IPR.pdf file (i.e. it
>>>>produces no output), try it on a few other files as well.  If it
>>>>doesn't work on any of them, I'd suspect that conv_doc.pl is not
>>>>properly configured.  In this case, you should try pdftotext directly
>>>>on these PDF files to see if that works.
>>>>
>>>>If it produces output for some PDF files, but not others, it may be
>>>>that the ones for which it produces nothing actually contain no
>>>>indexable text.  Some PDF files contain only image data, including
>>>>perhaps scanned pages that display as text, but in fact are only a
>>>>"picture" of a page.
>>>>
>>>>Once you can get conv_doc.pl to spit out text when run manually,
>>>>the following step will be to try htdig on those same PDF files,
>>>>one at a time, using htdig -ivvvv (note: 4 "v" options this time,
>>>>so htdig shows each word it parses).  If you get that far, then the
>>>>next stage would be to use your original start_url to index your whole
>>>>site, and see if it will find all the PDF files.  If it doesn't, see
>>>>http://www.htdig.org/FAQ.html#q5.27
>>>>
>>>>-- 
>>>>Gilles R. Detillieux              E-mail: <gr...@sc...>
>>>>Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
>>>>Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
>>>>
>>
>