|
From: Natalya K. <Ja...@gm...> - 2003-10-08 09:22:57
|
Thank you very much for your help!
I don't get an error message, but .pdf files never show up in my search results!
Here is the htdig -ivvv output when start_url is a single PDF file.
What is wrong???
natalya.kolesnikova@intranet:~> htdig -ivvv
1:1:http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf
New server: intranet.panasonic.de, 80
Retrieval command for http://intranet.panasonic.de/robots.txt: GET /robots.txt HTTP/1.0
User-Agent: htdig/3.1.6 (kol...@pa...)
Host: intranet.panasonic.de
Header line: HTTP/1.1 200 OK
Header line: Date: Wed, 08 Oct 2003 08:36:24 GMT
Header line: Server: Apache/1.3.27 (Linux/SuSE) PHP/4.3.1
Header line: Last-Modified: Tue, 21 Aug 2001 22:00:00 GMT
Converted Tue, 21 Aug 2001 22:00:00 GMT to Tue, 21 Aug 2001 22:00:00
Header line: ETag: "44005-e7-3b82d9e0"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 231
Header line: Connection: close
Header line: Content-Type: text/plain
Header line:
returnStatus = 0
Read 231 from document
Read a total of 231 bytes
Parsing robots.txt file using myname = htdig
Robots.txt line: # exclude help system from robots
Robots.txt line: User-agent: *
Found 'user-agent' line: *
Robots.txt line: Disallow: /manual/
Found 'disallow' line: /manual/
Robots.txt line: Disallow: /doc/
Found 'disallow' line: /doc/
Robots.txt line: Disallow: /gif/
Found 'disallow' line: /gif/
Robots.txt line: # but allow htdig to index our doc-tree
Robots.txt line: User-agent: susedig
Found 'user-agent' line: susedig
Robots.txt line: Disallow:
Robots.txt line: # disallow stress test
Robots.txt line: user-agent: stress-agent
Found 'user-agent' line: stress-agent
Robots.txt line: Disallow: /
Pattern: /manual/|/doc/|/gif/
pushed
pick: intranet.panasonic.de, # servers = 1
0:0:0:http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf: Retrieval command for http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf: GET /pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf HTTP/1.0
User-Agent: htdig/3.1.6 (kol...@pa...)
Host: intranet.panasonic.de
Header line: HTTP/1.1 200 OK
Header line: Date: Wed, 08 Oct 2003 08:36:24 GMT
Header line: Server: Apache/1.3.27 (Linux/SuSE) PHP/4.3.1
Header line: Last-Modified: Fri, 29 Aug 2003 11:25:19 GMT
Converted Fri, 29 Aug 2003 11:25:19 GMT to Fri, 29 Aug 2003 11:25:19
Header line: ETag: "314005-51e38-3f4f381f"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 335416
Header line: Connection: close
Header line: Content-Type: application/pdf
Header line:
returnStatus = 0
Read 8192 from document
[... "Read 8192 from document" repeated 40 times in all ...]
Read 7736 from document
Read a total of 335416 bytes
size = 335416
pick: intranet.panasonic.de, # servers = 1
natalya.kolesnikova@intranet:~>
> According to Natalya Kolesnikova:
> > maybe I am stupid, but it doesn't work for me! Can somebody help me?
> > I have tried with acroread and with the external parser xpdf, but it
> > doesn't work!!!!
> > I need the Installation Guide!!! :)))
>
> See http://www.htdig.org/FAQ.html#q4.9
>
> That is the installation guide for PDF indexing. If you've carefully read
> and implemented everything recommended there, and checked out FAQs 5.2
> and 5.37 as David recommended (twice), then please provide more details,
> such as what error messages you get, or give us an excerpt of htdig -ivvv
> output when start_url is set to point to just one single PDF file.
>
> There are dozens of potential points of failure in this process, so simply
> saying "it doesn't work" gives us no information that can help pinpoint
> which point of failure is the one that needs to be addressed.
>
> Also, make sure you have links in your HTML files to all PDF files you
> want to index. (See http://www.htdig.org/FAQ.html#q5.25)
>
> --
> Gilles R. Detillieux E-mail: <gr...@sc...>
> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
> Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
|
|
From: Gustave Stresen-R. <ted...@ma...> - 2003-10-08 16:41:16
|
If I'm not mistaken, since your start_url is a pdf document, it's the
only document that will get parsed and, as far as I know, htdig is unable
to follow links in a pdf document. Htdig is only able to follow links in
html documents. Please correct me if I'm wrong on this last statement.

You'll probably need to create some sort of index document that has links
to all the pdf files you want to index.
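For example, a bare-bones index page along these lines should work (the
file names below are placeholders for whatever PDFs you actually have):

    <html>
    <head><title>PDF index</title></head>
    <body>
    <!-- htdig follows these links; the external parser handles each PDF -->
    <a href="IPR_books_JPO/introduction_to_IPR.pdf">Introduction to IPR</a>
    <a href="IPR_books_JPO/another_book.pdf">Another book</a>
    </body>
    </html>

You would then point start_url at this page instead of at a single PDF.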
On Wednesday, October 8, 2003, at 09:19 AM, Natalya Kolesnikova wrote:

> start_url: http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf

Ted Stresen-Reuter
|
From: Gilles D. <gr...@sc...> - 2003-10-08 21:51:28
|
According to Gustave Stresen-Reuter:
> If I'm not mistaken, since your start_url is a pdf document, it's the
> only document that will get parsed and, as far as I know, htdig is
> unable to follow links in a pdf document. Htdig is only able to follow
> links in html documents. Please correct me if I'm wrong on this last
> statement.
>
> You'll probably need to create some sort of index document that has
> links to all the pdf files you want to index.
>
> On Wednesday, October 8, 2003, at 09:19 AM, Natalya Kolesnikova wrote:
> > start_url: http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf

Natalya set the start_url this way at my recommendation (see earlier
postings in the thread) to rule out whether it's a problem with htdig
being able to actually index PDF files given the URLs, as opposed to a
problem with finding the URLs to the PDFs. Her test showed that it failed
with a single PDF file, which suggests a problem either with that PDF
file or with the setup of the external parser. That's the next stage of
testing to tackle.

Once her configuration is working reliably for a single PDF, given the
URL, she'll be in a better position to try and see if it's also having
problems finding the URLs from links in other documents.

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
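P.S. For anyone who wants to repeat that isolation test: the idea is to
make a scratch copy of the config file, set start_url in the copy to the
URL of just one PDF, e.g.

    start_url: http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf

and then run htdig against only that copy (the file name here is just an
example):

    htdig -ivvv -c /tmp/test-htdig.conf

so that the dig touches nothing but the one document under test.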
|
From: Natalya K. <Ja...@gm...> - 2003-10-14 07:11:17
|
Ok, pdf-search runs! I am now trying to index .ppt and .xls files. In
htdig.conf I have:

external_parsers: application/rtf->text/html /srv/www/htdig/doc2html/doc2html.pl \
        text/rtf->text/html /srv/www/htdig/doc2html/doc2html.pl \
        application/pdf->text/html /srv/www/htdig/doc2html/doc2html.pl \
        application/vnd.ms-excel->text/html /srv/www/htdig/doc2html/doc2html.pl \
        application/vnd.ms-powerpoint->text/html /srv/www/htdig/doc2html/doc2html.pl\

ppthtml and xlhtml are working fine from the command line.
doc2html.pl with a .ppt file or .xls file as argument is working fine, too.

But when I run rundig, neither the .ppt files nor the .xls files are indexed!
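For reference, what I ran by hand was roughly this (the file names are
just examples from my test directory):

    /srv/www/htdig/doc2html/doc2html.pl test.ppt application/vnd.ms-powerpoint
    /srv/www/htdig/doc2html/doc2html.pl test.xls application/vnd.ms-excel

and both printed converted text to the screen.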
best regards
Natalya

> Glad that you have made progress.
>
> I don't recognize the "PRINT OUTPUNT!!!???" message, but to run
> doc2html.pl from the command line it is necessary to give two arguments:
>
>     doc2html.pl filename.pdf application/pdf
>
> If this fails try:
>
>     pdf2html.pl filename.pdf
>
> David Adams
> Corporate Information Services
> Information Systems Services
> University of Southampton
>
> ----- Original Message -----
> From: "Natalya Kolesnikova" <Ja...@gm...>
> To: "Martin Joisten" <web...@cl...>
> Cc: <htd...@li...>
> Sent: Friday, October 10, 2003 12:27 PM
> Subject: Re: [htdig] PDF-SEARCH
>
> > Ok, it runs with conv_doc.pl!!!! Thanks to all the people who helped me!!!!
> >
> > If I run doc2html.pl with a PDF file as argument from the command
> > line, I get PRINT OUTPUNT!!!???
> >
> > best regards
> > Natalya
> >
> > > Hi Natalya,
> > > then it seems that the path to perl is wrong and that's why the
> > > Perl script(s) don't work.
> > >
> > > Check out the first line of each Perl script (.pl) and correct the
> > > path to perl. Maybe perl isn't even installed ;-)
> > >
> > > Best wishes,
> > > Martin
> > >
> > > Natalya Kolesnikova schrieb:
> > > > Hello David,
> > > >
> > > > thank you very much for your support!
> > > >
> > > > Yes, htdig is reading a .pdf file. pdftotext and pdfinfo are
> > > > working ok, too. But when I run conv_doc.pl (or doc2html.pl)
> > > > from the command line with a PDF file as argument, I get the
> > > > error message:
> > > > bad interpreter: no such file or directory /usr/bin/perl.
> > > >
> > > > What is wrong here???
> > > >
> > > > best regards
> > > > Natalya
> > > >
> > > > > OK, so far we have established:
> > > > >
> > > > > 1) Htdig is reading a .PDF file.
> > > > > 2) You are attempting to use /usr/local/bin/conv_doc.pl to
> > > > >    convert it.
> > > > > 3) No text is being extracted from the .PDF file, so it is
> > > > >    not being indexed.
> > > > >
> > > > > This suggests that the fault is with
> > > > > /usr/local/bin/conv_doc.pl. Please try executing this from
> > > > > the command line:
> > > > >
> > > > >     /usr/local/bin/conv_doc.pl somepdffile.pdf
> > > > >
> > > > > where somepdffile.pdf is a PDF file from which it should be
> > > > > able to extract text. See what happens.
> > > > > This is a necessary step in the diagnosis.
> > > > >
> > > > > David Adams
> > > > >
> > > > > > Yes, I get the error message "Deleted: no excerpt"!!!
> > > > > >
> > > > > > Natalya
> > > > > >
> > > > > > > According to Natalya Kolesnikova:
> > > > > > > > Thank you, David, for your help!
> > > > > > > >
> > > > > > > > But when I run htmerge, I get the following message:
> > > > > > > > htmerge: Document database has no URLs. Check your
> > > > > > > > config file and try running htdig again.
> > > > > > >
> > > > > > > Are there any other htmerge error messages, such as a
> > > > > > > "Deleted: no excerpt" message? I suspect what's happening
> > > > > > > here is that htdig adds the single URL for the PDF file,
> > > > > > > which you specify in start_url, to the database, but when
> > > > > > > it tries to index it, it finds nothing to index. When
> > > > > > > htmerge sees that nothing was indexed for this one
> > > > > > > document, it removes it from the database, but then
> > > > > > > complains that there are no URLs left in the database.
> > > > > > > Seeing all the htmerge error messages (try htmerge -v
> > > > > > > after htdig) would give us a more complete picture.
> > > > > > >
> > > > > > > Please follow through on Dave's and my suggestions below...
> > > > > > >
> > > > > > > [David's external_parsers and white-space suggestions,
> > > > > > > quoted here in the original, are trimmed - see his message
> > > > > > > of 2003-10-08 15:00 in this thread]
> > > > > > >
> > > > > > > My first hunch is that this isn't the problem, because if
> > > > > > > htdig didn't see the full external_parsers definition
> > > > > > > (all 3 lines of it), it likely would be trying to use
> > > > > > > acroread and the PDF:: class, so we'd see messages from
> > > > > > > there. However, it's an easy thing to check for, and
> > > > > > > always a good idea to pay close attention to in any case,
> > > > > > > so please do have a look at these lines.
> > > > > > >
> > > > > > > This is a very important test. Your first test, with the
> > > > > > > start_url set to the single introduction_to_IPR.pdf URL,
> > > > > > > showed that it failed with this single PDF file, which
> > > > > > > suggests a problem either with that PDF file or with the
> > > > > > > setup of the external parser. The next step is to find
> > > > > > > out which is at fault, and this test will do that. If it
> > > > > > > fails on the introduction_to_IPR.pdf file (i.e. it
> > > > > > > produces no output), try it on a few other files as well.
> > > > > > > If it doesn't work on any of them, I'd suspect that
> > > > > > > conv_doc.pl is not properly configured. In this case, you
> > > > > > > should try pdftotext directly on these PDF files to see
> > > > > > > if that works.
> > > > > > >
> > > > > > > If it produces output for some PDF files, but not others,
> > > > > > > it may be that the ones for which it produces nothing
> > > > > > > actually contain no indexable text. Some PDF files
> > > > > > > contain only image data, including perhaps scanned pages
> > > > > > > that display as text, but in fact are only a "picture" of
> > > > > > > a page.
> > > > > > >
> > > > > > > Once you can get conv_doc.pl to spit out text when run
> > > > > > > manually, the following step will be to try htdig on
> > > > > > > those same PDF files, one at a time, using htdig -ivvvv
> > > > > > > (note: 4 "v" options this time, so htdig shows each word
> > > > > > > it parses). If you get that far, then the next stage
> > > > > > > would be to use your original start_url to index your
> > > > > > > whole site, and see if it will find all the PDF files.
> > > > > > > If it doesn't, see http://www.htdig.org/FAQ.html#q5.27
> > > > > > >
> > > > > > > --
> > > > > > > Gilles R. Detillieux E-mail: <gr...@sc...>
> > > > > > > Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
> > > > > > > Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
|
From: Gilles D. <gr...@sc...> - 2003-10-14 18:53:34
|
According to Natalya Kolesnikova:
> Ok, pdf-search runs!

Great!

> I am now trying to index .ppt and .xls files. In htdig.conf I have:
>
> external_parsers: application/rtf->text/html /srv/www/htdig/doc2html/doc2html.pl \
>         text/rtf->text/html /srv/www/htdig/doc2html/doc2html.pl \
>         application/pdf->text/html /srv/www/htdig/doc2html/doc2html.pl \
>         application/vnd.ms-excel->text/html /srv/www/htdig/doc2html/doc2html.pl \
>         application/vnd.ms-powerpoint->text/html /srv/www/htdig/doc2html/doc2html.pl\
>
> ppthtml and xlhtml are working fine from the command line.
> doc2html.pl with a .ppt file or .xls file as argument is working fine, too.
>
> But when I run rundig, neither the .ppt files nor the .xls files are indexed!

Again, it would be a good idea to run htdig -ivvv with start_url set to
the URL of a single .xls file, and then of a single .ppt file, just to
see how it deals with these. Pay special attention to the Content-Type
header that the server returns for each of these files, as not all web
servers follow the common convention of using application/vnd.ms-excel
and application/vnd.ms-powerpoint for these content types. I've seen
several different variations of these, especially for Excel files.

Also, never end the last line of a multi-line attribute definition with a
backslash, as it will cause htdig to swallow the following line as part
of the same definition.

The content types you define in your external_parsers definition must
match those your server actually returns. You can have multiple entries
in external_parsers for a given file type just to cover all bases as far
as possible content types a server might use, especially when indexing
several differently-configured web servers. E.g.:

external_parsers: \
        application/rtf->text/html /srv/www/htdig/doc2html/doc2html.pl \
        text/rtf->text/html /srv/www/htdig/doc2html/doc2html.pl \
        application/pdf->text/html /srv/www/htdig/doc2html/doc2html.pl \
        application/vnd.ms-excel->text/html /srv/www/htdig/doc2html/doc2html.pl \
        application/msexcel->text/html /srv/www/htdig/doc2html/doc2html.pl \
        application/excel->text/html /srv/www/htdig/doc2html/doc2html.pl \
        application/vnd.ms-powerpoint->text/html /srv/www/htdig/doc2html/doc2html.pl \
        application/mspowerpoint->text/html /srv/www/htdig/doc2html/doc2html.pl \
        application/powerpoint->text/html /srv/www/htdig/doc2html/doc2html.pl

You may also need to customise the doc2html.pl script to allow any
non-standard content types your server returns. Alternatively, if your
server is returning unusual content types, and you can configure the
server, then that may be the easiest/best fix.

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
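P.S. If you have wget handy, a quick way to see exactly which
Content-Type your server sends for a given file (the URL below is only an
example):

    wget -S --spider http://intranet.panasonic.de/path/to/some_file.xls

The -S option prints the server's response headers, so you can read the
Content-Type: line directly; it's the same header htdig shows in its
-ivvv output.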
|
From: Natalya K. <Ja...@gm...> - 2003-10-08 14:00:32
|
When I run htmerge, I get the following message:

htmerge: Document database has no URLs. Check your config file and try
running htdig again.

Thank you for your tips!

Natalya

> Thank you very much for your help!
> I don't get an error message, but .pdf files never show up in my search
> results!
> Here is the htdig -ivvv output when start_url is a single PDF file.
> What is wrong???
>
> [htdig -ivvv output and earlier replies quoted in full - trimmed; see
> Natalya's message of 2003-10-08 09:22 above]
|
From: David A. <D.J...@so...> - 2003-10-08 14:10:22
|
Thank you, that output establishes that htdig is reading a .pdf file.

The next question is: what is it doing with it?
To answer that we need to see what you have in your configuration file.

David Adams
Corporate Information Services
Information Systems Services
University of Southampton

----- Original Message -----
From: "Natalya Kolesnikova" <Ja...@gm...>
To: "Gilles Detillieux" <gr...@sc...>
Cc: <htd...@li...>
Sent: Wednesday, October 08, 2003 10:22 AM
Subject: Re: [htdig] PDF-SEARCH

> [htdig -ivvv output quoted in full - trimmed; see Natalya's message of
> 2003-10-08 09:22 above]
|
From: Natalya K. <Ja...@gm...> - 2003-10-08 14:19:44
|
Here is my htdig.conf:

#
# Example config file for ht://Dig.
#
# This configuration file is used by all the programs that make up ht://Dig.
# Please refer to the attribute reference manual for more details on what
# can be put into this file. (http://www.htdig.org/confindex.html)
# Note that most attributes have very reasonable default values so you
# really only have to add attributes here if you want to change the defaults.
#
# What follows are some of the common attributes you might want to change.
#

#
# Specify where the database files need to go. Make sure that there is
# plenty of free disk space available for the databases. They can get
# pretty big.
#
database_dir: /srv/www/htdig/db

#
# This specifies the URL where the robot (htdig) will start. You can specify
# multiple URLs here. Just separate them by some whitespace.
# The example here will cause the ht://Dig homepage and related pages to be
# indexed.
# You could also index all the URLs in a file like so:
#start_url: `${common_dir}/start.url`
#
start_url: http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf
#start_url: http://intranet.panasonic.de
star_image: /img_htdig/star.gif
star_blank: /img_htdig/star_blank.gif

#
# This attribute limits the scope of the indexing process. The default is to
# set it to the same as the start_url above. This way only pages that are on
# the sites specified in the start_url attribute will be indexed and it will
# reject any URLs that go outside of those sites.
#
# Keep in mind that the value for this attribute is just a list of string
# patterns. As long as URLs contain at least one of the patterns it will be
# seen as part of the scope of the index.
#
limit_urls_to: ${start_url}

#
# If there are particular pages that you definitely do NOT want to index, you
# can use the exclude_urls attribute. The value is a list of string patterns.
# If a URL matches any of the patterns, it will NOT be indexed. This is
# useful to exclude things like virtual web trees or database accesses. By
# default, all CGI URLs will be excluded. (Note that the /cgi-bin/ convention
# may not work on your web server. Check the path prefix used on your web
# server.)
#
exclude_urls: /cgi-bin/ .cgi

#
# Since ht://Dig does not (and cannot) parse every document type, this
# attribute is a list of strings (extensions) that will be ignored during
# indexing. These are *only* checked at the end of a URL, whereas
# exclude_url patterns are matched anywhere.
#
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
        .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css

#
# The string htdig will send in every request to identify the robot. Change
# this to your email address.
#
maintainer: kol...@pa...

#
# The excerpts that are displayed in long results rely on stored information
# in the index databases. The compiled default only stores 512 characters of
# text from each document (this excludes any HTML markup...) If you plan on
# using the excerpts you probably want to make this larger. The only concern
# here is that more disk space is going to be needed to store the additional
# information. Since disk space is cheap (! :-)) you might want to set this
# to a value so that a large percentage of the documents that you are going
# to be indexing are stored completely in the database. At SDSU we found
# that by setting this value to about 50k the index would get 97% of all
# documents completely and only 3% was cut off at 50k. You probably want to
# experiment with this value.
# Note that if you want to set this value low, you probably want to set the
# excerpt_show_top attribute to false so that the top excerpt_length
# characters of the document are always shown.
#
max_head_length: 10000

#
# To limit network connections, ht://Dig will only pull up to a certain limit
# of bytes. This prevents the indexing from dying because the server keeps
# sending information. However, several FAQs happen because people have files
# bigger than the default limit of 100KB. This sets the default a bit higher.
# (see <http://www.htdig.org/FAQ.html> for more)
#
max_doc_size: 10000000000000000000000

#
# Most people expect some sort of excerpt in results. By default, if the
# search words aren't found in context in the stored excerpt, htsearch shows
# the text defined in the no_excerpt_text attribute:
# (None of the search words were found in the top of this document.)
# This attribute instead will show the top of the excerpt.
#
no_excerpt_show_top: true

#
# Depending on your needs, you might want to enable some of the fuzzy search
# algorithms. There are several to choose from and you can use them in any
# combination you feel comfortable with. Each algorithm will get a weight
# assigned to it so that in combinations of algorithms, certain algorithms
# get preference over others. Note that the weights only affect the ranking
# of the results, not the actual searching.
# The available algorithms are:
#       accents
#       exact
#       endings
#       metaphone
#       prefix
#       soundex
#       substring
#       synonyms
# By default only the "exact" algorithm is used with weight 1.
# Note that if you are going to use the endings, metaphone, soundex, accents,
# or synonyms algorithms, you will need to run htfuzzy to generate
# the databases they use.
#
search_algorithm: exact:1 synonyms:0.5 endings:0.1

#
# The following are the templates used in the builtin search results
# The default is to use compiled versions of these files, which produces
# slightly faster results. However, uncommenting these lines makes it
# very easy to change the format of search results.
# See <http://www.htdig.org/hts_templates.html> for more details.
#
# template_map: Long long ${common_dir}/long.html \
#               Short short ${common_dir}/short.html
# template_name: long

#
# The following are used to change the text for the page index.
# The defaults are just boring text numbers. These images spice
# up the result pages quite a bit. (Feel free to do whatever, though)
#
next_page_text: <img src="/img_htdig/buttonr.gif" border="0" align="middle" width="30" height="30" alt="next">
no_next_page_text:
prev_page_text: <img src="/img_htdig/buttonl.gif" border="0" align="middle" width="30" height="30" alt="prev">
no_prev_page_text:
page_number_text: '<img src="/img_htdig/button1.gif" border="0" align="middle" width="30" height="30" alt="1">' \
        '<img src="/img_htdig/button2.gif" border="0" align="middle" width="30" height="30" alt="2">' \
        '<img src="/img_htdig/button3.gif" border="0" align="middle" width="30" height="30" alt="3">' \
        '<img src="/img_htdig/button4.gif" border="0" align="middle" width="30" height="30" alt="4">' \
        '<img src="/img_htdig/button5.gif" border="0" align="middle" width="30" height="30" alt="5">' \
        '<img src="/img_htdig/button6.gif" border="0" align="middle" width="30" height="30" alt="6">' \
        '<img src="/img_htdig/button7.gif" border="0" align="middle" width="30" height="30" alt="7">' \
        '<img src="/img_htdig/button8.gif" border="0" align="middle" width="30" height="30" alt="8">' \
        '<img src="/img_htdig/button9.gif" border="0" align="middle" width="30" height="30" alt="9">' \
        '<img src="/img_htdig/button10.gif" border="0" align="middle" width="30" height="30" alt="10">'

#
# To make the current page stand out, we will put a border around the
# image for that page.
#
no_page_number_text: '<img src="/img_htdig/button1.gif" border="2" align="middle" width="30" height="30" alt="1">' \
        '<img src="/img_htdig/button2.gif" border="2" align="middle" width="30" height="30" alt="2">' \
        '<img src="/img_htdig/button3.gif" border="2" align="middle" width="30" height="30" alt="3">' \
        '<img src="/img_htdig/button4.gif" border="2" align="middle" width="30" height="30" alt="4">' \
        '<img src="/img_htdig/button5.gif" border="2" align="middle" width="30" height="30" alt="5">' \
        '<img src="/img_htdig/button6.gif" border="2" align="middle" width="30" height="30" alt="6">' \
        '<img src="/img_htdig/button7.gif" border="2" align="middle" width="30" height="30" alt="7">' \
        '<img src="/img_htdig/button8.gif" border="2" align="middle" width="30" height="30" alt="8">' \
        '<img src="/img_htdig/button9.gif" border="2" align="middle" width="30" height="30" alt="9">' \
        '<img src="/img_htdig/button10.gif" border="2" align="middle" width="30" height="30" alt="10">'

external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
        application/postscript->text/html /usr/local/bin/conv_doc.pl \
        application/pdf->text/html /usr/local/bin/conv_doc.pl

# local variables:
# mode: text
# eval: (if (eq window-system 'x) (progn (setq font-lock-keywords (list '("^#.*" . font-lock-keyword-face) '("^[a-zA-Z][^ :]+" . font-lock-function-name-face) '("[+$]*:" . font-lock-comment-face) )) (font-lock-mode)))
# end:

> Thank you, that output establishes that htdig is reading a .pdf file.
>
> The next question is: what is it doing with it?
> To answer that we need to see what you have in your configuration file.
>
> David Adams
>
> [rest of quoted thread trimmed]
|
From: David A. <D.J...@so...> - 2003-10-08 15:00:20
|
Ok, your configuration file contains:
external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
application/postscript->text/html /usr/local/bin/conv_doc.pl \
application/pdf->text/html /usr/local/bin/conv_doc.pl
so you are using conv_doc.pl.
Please check one thing in your configuration file: make sure there are no
white space characters after the \ characters at the end of lines, this is
most important.
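One way to spot stray trailing blanks, if GNU cat is available (the
config file path below is just an example):

    cat -A /srv/www/htdig/conf/htdig.conf | grep conv_doc

With -A, cat marks the end of every line with a $, so a clean
continuation line shows up ending in \$ while a broken one ends in \ $
(backslash, space, then the $).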
If your configuration file is OK, then the problem must be with
/usr/local/bin/conv_doc.pl or the utilities it calls.
Try running /usr/local/bin/conv_doc.pl from the command line with a .PDF
file as argument and see what the result is.
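For example, with a PDF file that you know contains real text:

    /usr/local/bin/conv_doc.pl somepdffile.pdf

If the script and the utilities it calls are set up correctly, it should
print the extracted text (as simple HTML) on standard output; no output,
or an error message, points at the script itself or at the pdftotext
utility it relies on.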
----- Original Message -----
From: "Natalya Kolesnikova" <Ja...@gm...>
To: "David Adams" <D.J...@so...>
Cc: <gr...@sc...>; <htd...@li...>
Sent: Wednesday, October 08, 2003 3:19 PM
Subject: Re: [htdig] PDF-SEARCH
> Here is my htdig.conf:
>
> [Natalya's htdig.conf quoted in full - trimmed; see her message of
> 2003-10-08 14:19 above]
> '<img src="/img_htdig/button4.gif" border="2" align="middle" width="30"
> height="30" alt="4">' \
> '<img src="/img_htdig/button5.gif" border="2" align="middle" width="30"
> height="30" alt="5">' \
> '<img src="/img_htdig/button6.gif" border="2" align="middle" width="30"
> height="30" alt="6">' \
> '<img src="/img_htdig/button7.gif" border="2" align="middle" width="30"
> height="30" alt="7">' \
> '<img src="/img_htdig/button8.gif" border="2" align="middle" width="30"
> height="30" alt="8">' \
> '<img src="/img_htdig/button9.gif" border="2" align="middle" width="30"
> height="30" alt="9">' \
> '<img src="/img_htdig/button10.gif" border="2" align="middle" width="30"
> height="30" alt="10">'
>
> external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl
\
> application/postscript->text/html /usr/local/bin/conv_doc.pl
\
> application/pdf->text/html /usr/local/bin/conv_doc.pl
>
>
>
> # local variables:
> # mode: text
> # eval: (if (eq window-system 'x) (progn (setq font-lock-keywords (list
> '("^#.*" . font-lock-keyword-face) '("^[a-zA-Z][^ :]+" .
> font-lock-function-name-face) '("[+$]*:" . font-lock-comment-face) ))
(font-lock-mode)))
> # end:
>
....
|
|
From: Natalya K. <Ja...@gm...> - 2003-10-08 15:12:48
|
Thank you, David, for your help!

But when I run htmerge, I get the following message:
htmerge: Document database has no URLs. Check your config file and try
running htdig again.

Thank you for your tips!
Natalya

> Ok, your configuration file contains:
>
> external_parsers: application/msword->text/html      /usr/local/bin/conv_doc.pl \
>                   application/postscript->text/html  /usr/local/bin/conv_doc.pl \
>                   application/pdf->text/html         /usr/local/bin/conv_doc.pl
>
> so you are using conv_doc.pl.
>
> Please check one thing in your configuration file: make sure there are no
> white space characters after the \ characters at the end of lines, this is
> most important.
>
> If your configuration file is OK, then the problem must be with
> /usr/local/bin/conv_doc.pl or the utilities it calls.
> Try running /usr/local/bin/conv_doc.pl from the command line with a .PDF
> file as argument and see what the result is.
>
> ----- Original Message -----
> From: "Natalya Kolesnikova" <Ja...@gm...>
> To: "David Adams" <D.J...@so...>
> Cc: <gr...@sc...>; <htd...@li...>
> Sent: Wednesday, October 08, 2003 3:19 PM
> Subject: Re: [htdig] PDF-SEARCH
> ....
|
From: Gilles D. <gr...@sc...> - 2003-10-08 22:13:02
|
According to Natalya Kolesnikova:
> Thank you, David, for your help!
>
> But when I run htmerge, I get the following message:
> htmerge: Document database has no URLs. Check your config file and try
> running htdig again.

Are there any other htmerge error messages, such as a "Deleted: no excerpt"
message?  I suspect what's happening here is that htdig adds the single
URL for the PDF file, which you specify in start_url, to the database,
but when it tries to index it, it finds nothing to index.  When htmerge
sees that nothing was indexed for this one document, it removes it from
the database, but then complains that there are no URLs left in the
database.  Seeing all the htmerge error messages (try htmerge -v after
htdig) would give us a more complete picture.

Please follow through on Dave's and my suggestions below...

> > Ok, your configuration file contains:
> >
> > external_parsers: application/msword->text/html      /usr/local/bin/conv_doc.pl \
> >                   application/postscript->text/html  /usr/local/bin/conv_doc.pl \
> >                   application/pdf->text/html         /usr/local/bin/conv_doc.pl
> >
> > so you are using conv_doc.pl.
> >
> > Please check one thing in your configuration file: make sure there are no
> > white space characters after the \ characters at the end of lines, this is
> > most important.

My first hunch is that this isn't the problem, because if htdig didn't
see the full external_parsers definition (all 3 lines of it), it likely
would be trying to use acroread and the PDF:: class, so we'd see messages
from there.  However, it's an easy thing to check for, and always a good
idea to pay close attention to in any case, so please do have a look at
these lines.

> > If your configuration file is OK, then the problem must be with
> > /usr/local/bin/conv_doc.pl or the utilities it calls.
> > Try running /usr/local/bin/conv_doc.pl from the command line with a .PDF
> > file as argument and see what the result is.

This is a very important test.  Your first test, with the start_url set to
http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf
showed that it failed with this single PDF file, which suggests a problem
either with that PDF file or with the setup of the external parser.
The next step is to find out which is at fault, and this test will do
that.  If it fails on the introduction_to_IPR.pdf file (i.e. it produces
no output), try it on a few other files as well.  If it doesn't work on
any of them, I'd suspect that conv_doc.pl is not properly configured.
In this case, you should try pdftotext directly on these PDF files to
see if that works.

If it produces output for some PDF files, but not others, it may be that
the ones for which it produces nothing actually contain no indexable text.
Some PDF files contain only image data, including perhaps scanned pages
that display as text, but in fact are only a "picture" of a page.

Once you can get conv_doc.pl to spit out text when run manually,
the following step will be to try htdig on those same PDF files,
one at a time, using htdig -ivvvv (note: 4 "v" options this time,
so htdig shows each word it parses).  If you get that far, then the
next stage would be to use your original start_url to index your whole
site, and see if it will find all the PDF files.  If it doesn't, see
http://www.htdig.org/FAQ.html#q5.27

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
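The two-step check Gilles describes looks like this in practice (a sketch;
-c selects the config file in htdig 3.1.x, and the path shown is
hypothetical):

    htdig -ivvvv -c /srv/www/htdig/conf/htdig.conf 2>&1 | tee dig.log
    htmerge -v -c /srv/www/htdig/conf/htdig.conf

Saving the htdig output to dig.log makes it easier to search later for the
external parser's invocation and any words it emitted.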
|
From: Natalya K. <Ja...@gm...> - 2003-10-09 08:52:09
|
Yes, I get the error message "Deleted: no excerpt"!!!

Natalya

> According to Natalya Kolesnikova:
> > But when I run htmerge, I get the following message:
> > htmerge: Document database has no URLs. Check your config file and try
> > running htdig again.
>
> Are there any other htmerge error messages, such as a "Deleted: no excerpt"
> message?
> ....
|
From: David A. <D.J...@so...> - 2003-10-09 12:23:14
|
OK, so far we have established:
1) Htdig is reading a .PDF file
2) You are attempting to use /usr/local/bin/conv_doc.pl to convert it.
3) No text is being extracted from the .PDF file, so it is not being indexed.
This suggests that the fault is with /usr/local/bin/conv_doc.pl. Please try
executing this from the command line:
/usr/local/bin/conv_doc.pl somepdffile.pdf
where somepdffile.pdf is a PDF file from which it should be able to extract
text. See what happens.
This is a necessary step in the diagnosis.
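If it produces nothing for one file, a quick way to compare several files
at once (a sketch, run from a directory containing some test PDFs) is:

    for f in *.pdf; do
        printf '%s: ' "$f"
        /usr/local/bin/conv_doc.pl "$f" | wc -w   # 0 words = nothing extracted
    done

A zero word count for every file points at the script or its helper
utilities; a zero count for only some files points at those PDFs.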
David Adams
Corporate Information Services
Information Systems Services
University of Southampton
----- Original Message -----
From: "Natalya Kolesnikova" <Ja...@gm...>
To: "Gilles Detillieux" <gr...@sc...>
Cc: <D.J...@so...>; <htd...@li...>
Sent: Thursday, October 09, 2003 9:51 AM
Subject: Re: [htdig] PDF-SEARCH
> Yes, I get the error message "Deleted: no excerpt"!!!
>
> Natalya
> ....
|
|
From: Natalya K. <Ja...@gm...> - 2003-10-10 07:50:50
|
Hello David,

thank you very much for your support!

Yes, htdig is reading a .pdf file. pdftotext and pdfinfo are working OK, too.
But when I run conv_doc.pl (or doc2html.pl) from the command line with a
pdf-file as an argument, I get the error message:
bad interpreter: no such file or directory: /usr/bin/perl.

What is wrong here???

best regards
Natalya

> OK, so far we have established:
>
> 1) Htdig is reading a .PDF file
> 2) You are attempting to use /usr/local/bin/conv_doc.pl to convert it.
> 3) No text is being extracted from the .PDF file, so it is not being indexed.
>
> This suggests that the fault is with /usr/local/bin/conv_doc.pl. Please try
> executing this from the command line:
>
>     /usr/local/bin/conv_doc.pl somepdffile.pdf
> ....
|
From: Martin J. <web...@cl...> - 2003-10-10 08:02:13
|
Hi Natalya,

then it seems that the path to perl is wrong, and that's why the Perl
script(s) don't work.

Check the first line of each Perl script (.pl) and correct the path to
perl. Maybe perl isn't even installed ;-)

Best wishes,
Martin

Natalya Kolesnikova schrieb:
> But when I run conv_doc.pl (or doc2html.pl) from the command line with a
> pdf-file as an argument, I get the error message:
> bad interpreter: no such file or directory: /usr/bin/perl.
> ....
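That check takes three commands (a sketch; sed -i is a GNU sed option, and
the fix assumes perl really is where 'which' says):

    head -1 /usr/local/bin/conv_doc.pl     # the #! line the script asks for
    which perl                             # where perl actually lives
    # if the two disagree, point the #! line at the real perl:
    sed -i "1s|^#!.*|#!$(which perl)|" /usr/local/bin/conv_doc.pl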
|
From: Natalya K. <Ja...@gm...> - 2003-10-10 11:28:03
|
OK, it runs with conv_doc.pl!!!! Thanks to all the people who helped me!!!!

If I run doc2html.pl with a pdf-file as argument from the command line,
I get PRINT OUTPUNT!!!???

best regards
Natalya

> Hi Natalya,
>
> then it seems that the path to perl is wrong, and that's why the Perl
> script(s) don't work.
>
> Check the first line of each Perl script (.pl) and correct the path to
> perl. Maybe perl isn't even installed ;-)
> ....
|
From: David A. <D.J...@so...> - 2003-10-10 12:26:10
|
Glad that you have made progress.
I don't recognize the "PRINT OUTPUNT!!!???" message, but to run doc2html.pl
from the command line it is necessary to give two arguments:

    doc2html.pl filename.pdf application/pdf

If this fails, try:

    pdf2html.pl filename.pdf
David Adams
Corporate Information Services
Information Systems Services
University of Southampton
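For a quick sanity check of either invocation (a sketch; the file name is
just the one from this thread), piping through head shows at a glance
whether any text comes out:

    doc2html.pl introduction_to_IPR.pdf application/pdf | head
    pdf2html.pl introduction_to_IPR.pdf | head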
----- Original Message -----
From: "Natalya Kolesnikova" <Ja...@gm...>
To: "Martin Joisten" <web...@cl...>
Cc: <htd...@li...>
Sent: Friday, October 10, 2003 12:27 PM
Subject: Re: [htdig] PDF-SEARCH
> OK, it runs with conv_doc.pl!!!! Thanks to all the people who helped me!!!!
>
> If I run doc2html.pl with a pdf-file as argument from the command line,
> I get PRINT OUTPUNT!!!???
>
> best regards
> Natalya
> ....
|
|
From: Martin J. <web...@cl...> - 2003-10-08 14:41:55
|
Hi all, I have to admit not having followed this problem so far, but when Natalya writes "I don't get error message, but I have never .pdf-Files in my search-List!!!", I wonder if a simple misunderstanding is the cause for the trouble... For my understanding htdig doesn't index all the files in a subdirectory but only follows URLs which it finds on "webpages". So if no URL points to a PDF-File, no PDF will be indexed and therefore no PDF will show up in the search list. I wanted to index PDFs once and specially created a single PHP File that would browse through the subdirectories recursively and simple create a page with links to all the PDF Files found. I pointed htdig to this particular file and "voila" - all of the PDF Files were indexed. So maybe this is the problem - no links to the PDF Files. If this point had already been cleared in previous mails concerning this issue, I apologize for not having read these. All the best! Martin web...@cl... David Adams schrieb: > Thank you, that output establishes that htdig is reading a .pdf file. > > The next question is: what is it doing with it? > To answer that we need to see what you have in your configuration file. > > David Adams > Corporate Information Services > Information Systems Services > University of Southampton > > > ----- Original Message ----- > From: "Natalya Kolesnikova" <Ja...@gm...> > To: "Gilles Detillieux" <gr...@sc...> > Cc: <htd...@li...> > Sent: Wednesday, October 08, 2003 10:22 AM > Subject: Re: [htdig] PDF-SEARCH > > > >>Thank you very much for your help! >>I don't get error message, but I have never .pdf-Files in my > > search-List!!! > >>Hier is htdig -ivvv output when start_url is a single PDF file. >>What is wrong??? >> >>natalya.kolesnikova@intranet:~> htdig -ivvv >> >>1:1:http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/i >>ntroduction_to_IPR.pdf >>New server: intranet.panasonic.de, 80 >>Retrieval command for http://intranet.panasonic.de/robots.txt: GET >>/robots.txt H >>TTP/1.0 >>User-Agent: htdig/3.1.6 (kol...@pa...) 
David Adams wrote:

> Thank you, that output establishes that htdig is reading a .pdf file.
>
> The next question is: what is it doing with it?
> To answer that we need to see what you have in your configuration file.
>
> David Adams
> Corporate Information Services
> Information Systems Services
> University of Southampton
>
> ----- Original Message -----
> From: "Natalya Kolesnikova" <Ja...@gm...>
> To: "Gilles Detillieux" <gr...@sc...>
> Cc: <htd...@li...>
> Sent: Wednesday, October 08, 2003 10:22 AM
> Subject: Re: [htdig] PDF-SEARCH
>
>> Thank you very much for your help!
>> I don't get error message, but I have never .pdf-Files in my search-List!!!
>> Here is htdig -ivvv output when start_url is a single PDF file.
>> What is wrong???
>>
>> [htdig -ivvv output snipped -- identical to the log quoted in full
>> earlier in the thread]
>>
>>> According to Natalya Kolesnikova:
>>>> may be I am stupid, but it doesn't work by me! Can somebody help me?
>>>> I have tried with acroread and with external parser xpdf, but it
>>>> doesn't work!!!! I need the Installation Guide!!! :)))
>>>
>>> See http://www.htdig.org/FAQ.html#q4.9
>>>
>>> That is the installation guide for PDF indexing. If you've carefully read
>>> and implemented everything recommended there, and checked out FAQs 5.2
>>> and 5.37 as David recommended (twice), then please provide more details,
>>> such as what error messages you get, or give us an excerpt of htdig -ivvv
>>> output when start_url is set to point to just one single PDF file.
>>>
>>> There are dozens of potential points of failure in this process, so simply
>>> saying "it doesn't work" gives us no information that can help pinpoint
>>> which point of failure is the one that needs to be addressed.
>>>
>>> Also, make sure you have links in your HTML files to all PDF files you
>>> want to index. (See http://www.htdig.org/FAQ.html#q5.25)
>>>
>>> --
>>> Gilles R. Detillieux     E-mail: <gr...@sc...>
>>> Spinal Cord Research Centre       WWW: http://www.scrc.umanitoba.ca/
>>> Dept. Physiology, U. of Manitoba  Winnipeg, MB R3E 3J7 (Canada)
|
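[For reference, the PDF-related attributes David is asking about typically
look something like the excerpt below in an htdig 3.1.x configuration
file. The parser path is an assumption -- it depends on which of the FAQ
4.9 recipes was followed.]

  # Hand PDFs to an external parser. parse_doc.pl is one of the
  # scripts from the htdig contrib tree; the install path is a guess.
  external_parsers:   application/pdf /usr/local/bin/parse_doc.pl

  # Raise the retrieval limit so large PDFs are not truncated before
  # parsing (the test file in the log above is 335416 bytes).
  max_doc_size:       2000000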
|
From: Gilles D. <gr...@sc...> - 2003-10-08 21:45:04
|
According to Martin Joisten:
> I have to admit not having followed this problem so far, but when
> Natalya writes "I don't get error message, but I have never .pdf-Files
> in my search-List!!!", I wonder if a simple misunderstanding is the
> cause for the trouble...
>
> [rest of Martin's message snipped -- quoted in full above]

I raised the issue very briefly in my reply to Natalya yesterday. I.e.:

> >>> Also, make sure you have links in your HTML files to all PDF files you
> >>> want to index. (See http://www.htdig.org/FAQ.html#q5.25)

However, this is just one possibility among many possible trouble spots,
and the test results from attempting to index a single PDF, using the URL
of the PDF as start_url, suggest there's another problem at play here.
I think it's important to get htdig working with a single PDF before
tackling the bigger issue of whether it can find multiple ones.

--
Gilles R. Detillieux     E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB R3E 3J7 (Canada)
|
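[A minimal single-PDF test along the lines Gilles suggests might look like
this; the file locations and parser path are assumptions.]

  # /tmp/pdf-test.conf -- index exactly one PDF and nothing else
  database_dir:       /tmp/pdf-test-db
  start_url:          http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf
  external_parsers:   application/pdf /usr/local/bin/parse_doc.pl
  max_doc_size:       2000000

  $ htdig -ivvv -c /tmp/pdf-test.conf
  $ htmerge -c /tmp/pdf-test.conf

[If words from the PDF show up in /tmp/pdf-test-db/db.wordlist afterwards,
the parser is working. If the -ivvv output shows the document being
fetched, as in Natalya's log, but nothing useful lands in the database,
the external parser setup is the place to look.]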