From: Martin J. <web...@cl...> - 2003-10-08 14:41:55
Hi all,

I have to admit I haven't followed this problem so far, but when Natalya writes "I don't get error message, but I have never .pdf-Files in my search-List!!!", I wonder whether a simple misunderstanding is causing the trouble.

As I understand it, htdig doesn't index all the files in a subdirectory; it only follows URLs that it finds on web pages. So if no URL points to a PDF file, no PDF will be indexed, and therefore no PDF will show up in the search list.

I once wanted to index PDFs and created a single PHP file just for that purpose. It browsed through the subdirectories recursively and simply created a page with links to all the PDF files it found. I pointed htdig at this particular file and voilà, all of the PDF files were indexed.

So maybe this is the problem: no links to the PDF files. If this point has already been cleared up in previous mails on this issue, I apologize for not having read them.

All the best!
Martin
web...@cl...

David Adams wrote:
> Thank you, that output establishes that htdig is reading a .pdf file.
>
> The next question is: what is it doing with it?
> To answer that we need to see what you have in your configuration file.
>
> David Adams
> Corporate Information Services
> Information Systems Services
> University of Southampton
>
>
> ----- Original Message -----
> From: "Natalya Kolesnikova" <Ja...@gm...>
> To: "Gilles Detillieux" <gr...@sc...>
> Cc: <htd...@li...>
> Sent: Wednesday, October 08, 2003 10:22 AM
> Subject: Re: [htdig] PDF-SEARCH
>
>
>>Thank you very much for your help!
>>I don't get error message, but I have never .pdf-Files in my search-List!!!
>>Here is htdig -ivvv output when start_url is a single PDF file.
>>What is wrong???
>>
>>natalya.kolesnikova@intranet:~> htdig -ivvv
>>
>>1:1:http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf
>>New server: intranet.panasonic.de, 80
>>Retrieval command for http://intranet.panasonic.de/robots.txt: GET /robots.txt HTTP/1.0
>>User-Agent: htdig/3.1.6 (kol...@pa...)
>>Host: intranet.panasonic.de
>>
>>Header line: HTTP/1.1 200 OK
>>Header line: Date: Wed, 08 Oct 2003 08:36:24 GMT
>>Header line: Server: Apache/1.3.27 (Linux/SuSE) PHP/4.3.1
>>Header line: Last-Modified: Tue, 21 Aug 2001 22:00:00 GMT
>>Converted Tue, 21 Aug 2001 22:00:00 GMT to Tue, 21 Aug 2001 22:00:00
>>Header line: ETag: "44005-e7-3b82d9e0"
>>Header line: Accept-Ranges: bytes
>>Header line: Content-Length: 231
>>Header line: Connection: close
>>Header line: Content-Type: text/plain
>>Header line:
>>returnStatus = 0
>>Read 231 from document
>>Read a total of 231 bytes
>>Parsing robots.txt file using myname = htdig
>>Robots.txt line: # exclude help system from robots
>>Robots.txt line: User-agent: *
>>Found 'user-agent' line: *
>>Robots.txt line: Disallow: /manual/
>>Found 'disallow' line: /manual/
>>Robots.txt line: Disallow: /doc/
>>Found 'disallow' line: /doc/
>>Robots.txt line: Disallow: /gif/
>>Found 'disallow' line: /gif/
>>Robots.txt line: # but allow htdig to index our doc-tree
>>Robots.txt line: User-agent: susedig
>>Found 'user-agent' line: susedig
>>Robots.txt line: Disallow:
>>Robots.txt line: # disallow stress test
>>Robots.txt line: user-agent: stress-agent
>>Found 'user-agent' line: stress-agent
>>Robots.txt line: Disallow: /
>>Pattern: /manual/|/doc/|/gif/
>> pushed
>>pick: intranet.panasonic.de, # servers = 1
>>0:0:0:http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf: Retrieval command for http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf: GET /pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf HTTP/1.0
>>User-Agent: htdig/3.1.6 (kol...@pa...)
>>Host: intranet.panasonic.de
>>
>>Header line: HTTP/1.1 200 OK
>>Header line: Date: Wed, 08 Oct 2003 08:36:24 GMT
>>Header line: Server: Apache/1.3.27 (Linux/SuSE) PHP/4.3.1
>>Header line: Last-Modified: Fri, 29 Aug 2003 11:25:19 GMT
>>Converted Fri, 29 Aug 2003 11:25:19 GMT to Fri, 29 Aug 2003 11:25:19
>>Header line: ETag: "314005-51e38-3f4f381f"
>>Header line: Accept-Ranges: bytes
>>Header line: Content-Length: 335416
>>Header line: Connection: close
>>Header line: Content-Type: application/pdf
>>Header line:
>>returnStatus = 0
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 8192 from document
>>Read 7736 from document
>>Read a total of 335416 bytes
>> size = 335416
>>pick: intranet.panasonic.de, # servers = 1
>>natalya.kolesnikova@intranet:~>
>>
>>>According to Natalya Kolesnikova:
>>>>may be I am stupid, but it doesn't work by me! Can somebody help me? I have
>>>>tried with acroread and with external parser xpdf, but it doesn't work!!!!
>>>>I need the Installation Guide!!! :)))
>>>
>>>See http://www.htdig.org/FAQ.html#q4.9
>>>
>>>That is the installation guide for PDF indexing. If you've carefully read
>>>and implemented everything recommended there, and checked out FAQs 5.2
>>>and 5.37 as David recommended (twice), then please provide more details,
>>>such as what error messages you get, or give us an excerpt of htdig -ivvv
>>>output when start_url is set to point to just one single PDF file.
>>>
>>>There are dozens of potential points of failure in this process, so simply
>>>saying "it doesn't work" gives us no information that can help pinpoint
>>>which point of failure is the one that needs to be addressed.
>>>
>>>Also, make sure you have links in your HTML files to all PDF files you
>>>want to index. (See http://www.htdig.org/FAQ.html#q5.25)
>>>
>>>--
>>>Gilles R. Detillieux E-mail: <gr...@sc...>
>>>Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
>>>Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
>>>
>>>
>>>-------------------------------------------------------
>>>This sf.net email is sponsored by:ThinkGeek
>>>Welcome to geek heaven.
>>>http://thinkgeek.com/sf
>>>_______________________________________________
>>>ht://Dig general mailing list: <htd...@li...>
>>>ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
>>>List information (subscribe/unsubscribe, etc.)
>>>https://lists.sourceforge.net/lists/listinfo/htdig-general
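
For anyone who wants to try the link-page workaround Martin describes, here is a rough sketch of the idea. Martin's original was a PHP file that the thread does not show; this is a Python stand-in, not his code. The function name `build_pdf_index`, the directory path, and the base URL are all assumptions made for illustration.

```python
import html
import os
from urllib.parse import quote


def build_pdf_index(doc_root, base_url):
    """Return an HTML page linking to every PDF found under doc_root.

    Both arguments are hypothetical values for illustration: doc_root is
    the directory the web server publishes, base_url is the URL under
    which that directory is served.
    """
    links = []
    # os.walk descends into all subdirectories recursively.
    for dirpath, dirnames, filenames in os.walk(doc_root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            if not name.lower().endswith(".pdf"):
                continue
            # Path relative to the document root, with URL-style slashes.
            rel = os.path.relpath(os.path.join(dirpath, name), doc_root)
            rel_url = quote(rel.replace(os.sep, "/"))
            href = base_url.rstrip("/") + "/" + rel_url
            links.append(
                '<li><a href="%s">%s</a></li>'
                % (html.escape(href, quote=True), html.escape(rel))
            )
    return (
        "<html><head><title>PDF index</title></head><body>\n<ul>\n"
        + "\n".join(links)
        + "\n</ul>\n</body></html>\n"
    )
```

Writing the returned string to a file inside the document root and setting start_url to that file's URL should then give htdig a page of links to follow. Whether the PDFs end up searchable still depends on the PDF parser setup covered in the FAQ entries mentioned above.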