From: Natalya K. <Ja...@gm...> - 2003-10-08 14:19:44
Here is my htdig.conf:

#
# Example config file for ht://Dig.
#
# This configuration file is used by all the programs that make up ht://Dig.
# Please refer to the attribute reference manual for more details on what
# can be put into this file.  (http://www.htdig.org/confindex.html)
# Note that most attributes have very reasonable default values so you
# really only have to add attributes here if you want to change the defaults.
#
# What follows are some of the common attributes you might want to change.
#

#
# Specify where the database files need to go.  Make sure that there is
# plenty of free disk space available for the databases.  They can get
# pretty big.
#
database_dir: /srv/www/htdig/db

#
# This specifies the URL where the robot (htdig) will start.  You can specify
# multiple URLs here.  Just separate them by some whitespace.
# The example here will cause the ht://Dig homepage and related pages to be
# indexed.
# You could also index all the URLs in a file like so:
#start_url: `${common_dir}/start.url`
#
start_url: http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf
#start_url: http://intranet.panasonic.de

star_image: /img_htdig/star.gif
star_blank: /img_htdig/star_blank.gif

#
# This attribute limits the scope of the indexing process.  The default is to
# set it to the same as the start_url above.  This way only pages that are on
# the sites specified in the start_url attribute will be indexed and it will
# reject any URLs that go outside of those sites.
#
# Keep in mind that the value for this attribute is just a list of string
# patterns.  As long as URLs contain at least one of the patterns it will be
# seen as part of the scope of the index.
#
limit_urls_to: ${start_url}

#
# If there are particular pages that you definitely do NOT want to index, you
# can use the exclude_urls attribute.  The value is a list of string patterns.
# If a URL matches any of the patterns, it will NOT be indexed.  This is
# useful to exclude things like virtual web trees or database accesses.  By
# default, all CGI URLs will be excluded.  (Note that the /cgi-bin/ convention
# may not work on your web server.  Check the path prefix used on your web
# server.)
#
exclude_urls: /cgi-bin/ .cgi

#
# Since ht://Dig does not (and cannot) parse every document type, this
# attribute is a list of strings (extensions) that will be ignored during
# indexing.  These are *only* checked at the end of a URL, whereas
# exclude_urls patterns are matched anywhere.
#
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
        .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css

#
# The string htdig will send in every request to identify the robot.  Change
# this to your email address.
#
maintainer: kol...@pa...

#
# The excerpts that are displayed in long results rely on stored information
# in the index databases.  The compiled default only stores 512 characters of
# text from each document (this excludes any HTML markup...)  If you plan on
# using the excerpts you probably want to make this larger.  The only concern
# here is that more disk space is going to be needed to store the additional
# information.  Since disk space is cheap (! :-)) you might want to set this
# to a value so that a large percentage of the documents that you are going
# to be indexing are stored completely in the database.  At SDSU we found
# that by setting this value to about 50k the index would get 97% of all
# documents completely and only 3% was cut off at 50k.  You probably want to
# experiment with this value.
# Note that if you want to set this value low, you probably want to set the
# excerpt_show_top attribute to false so that the top excerpt_length characters
# of the document are always shown.
#
max_head_length: 10000

#
# To limit network connections, ht://Dig will only pull up to a certain limit
# of bytes.  This prevents the indexing from dying because the server keeps
# sending information.  However, several FAQs happen because people have files
# bigger than the default limit of 100KB.  This sets the default a bit higher.
# (see <http://www.htdig.org/FAQ.html> for more)
#
max_doc_size: 10000000000000000000000

#
# Most people expect some sort of excerpt in results.  By default, if the
# search words aren't found in context in the stored excerpt, htsearch shows
# the text defined in the no_excerpt_text attribute:
# (None of the search words were found in the top of this document.)
# This attribute instead will show the top of the excerpt.
#
no_excerpt_show_top: true

#
# Depending on your needs, you might want to enable some of the fuzzy search
# algorithms.  There are several to choose from and you can use them in any
# combination you feel comfortable with.  Each algorithm will get a weight
# assigned to it so that in combinations of algorithms, certain algorithms get
# preference over others.  Note that the weights only affect the ranking of
# the results, not the actual searching.
# The available algorithms are:
#       accents
#       exact
#       endings
#       metaphone
#       prefix
#       soundex
#       substring
#       synonyms
# By default only the "exact" algorithm is used with weight 1.
# Note that if you are going to use the endings, metaphone, soundex, accents,
# or synonyms algorithms, you will need to run htfuzzy to generate
# the databases they use.
#
search_algorithm: exact:1 synonyms:0.5 endings:0.1

#
# The following are the templates used in the builtin search results
# The default is to use compiled versions of these files, which produces
# slightly faster results.  However, uncommenting these lines makes it
# very easy to change the format of search results.
# See <http://www.htdig.org/hts_templates.html> for more details.
#
# template_map: Long long ${common_dir}/long.html \
#               Short short ${common_dir}/short.html
# template_name: long

#
# The following are used to change the text for the page index.
# The defaults are just boring text numbers.  These images spice
# up the result pages quite a bit.  (Feel free to do whatever, though)
#
next_page_text: <img src="/img_htdig/buttonr.gif" border="0" align="middle" width="30" height="30" alt="next">
no_next_page_text:
prev_page_text: <img src="/img_htdig/buttonl.gif" border="0" align="middle" width="30" height="30" alt="prev">
no_prev_page_text:
page_number_text: '<img src="/img_htdig/button1.gif" border="0" align="middle" width="30" height="30" alt="1">' \
        '<img src="/img_htdig/button2.gif" border="0" align="middle" width="30" height="30" alt="2">' \
        '<img src="/img_htdig/button3.gif" border="0" align="middle" width="30" height="30" alt="3">' \
        '<img src="/img_htdig/button4.gif" border="0" align="middle" width="30" height="30" alt="4">' \
        '<img src="/img_htdig/button5.gif" border="0" align="middle" width="30" height="30" alt="5">' \
        '<img src="/img_htdig/button6.gif" border="0" align="middle" width="30" height="30" alt="6">' \
        '<img src="/img_htdig/button7.gif" border="0" align="middle" width="30" height="30" alt="7">' \
        '<img src="/img_htdig/button8.gif" border="0" align="middle" width="30" height="30" alt="8">' \
        '<img src="/img_htdig/button9.gif" border="0" align="middle" width="30" height="30" alt="9">' \
        '<img src="/img_htdig/button10.gif" border="0" align="middle" width="30" height="30" alt="10">'
#
# To make the current page stand out, we will put a border around the
# image for that page.
#
no_page_number_text: '<img src="/img_htdig/button1.gif" border="2" align="middle" width="30" height="30" alt="1">' \
        '<img src="/img_htdig/button2.gif" border="2" align="middle" width="30" height="30" alt="2">' \
        '<img src="/img_htdig/button3.gif" border="2" align="middle" width="30" height="30" alt="3">' \
        '<img src="/img_htdig/button4.gif" border="2" align="middle" width="30" height="30" alt="4">' \
        '<img src="/img_htdig/button5.gif" border="2" align="middle" width="30" height="30" alt="5">' \
        '<img src="/img_htdig/button6.gif" border="2" align="middle" width="30" height="30" alt="6">' \
        '<img src="/img_htdig/button7.gif" border="2" align="middle" width="30" height="30" alt="7">' \
        '<img src="/img_htdig/button8.gif" border="2" align="middle" width="30" height="30" alt="8">' \
        '<img src="/img_htdig/button9.gif" border="2" align="middle" width="30" height="30" alt="9">' \
        '<img src="/img_htdig/button10.gif" border="2" align="middle" width="30" height="30" alt="10">'

external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
        application/postscript->text/html /usr/local/bin/conv_doc.pl \
        application/pdf->text/html /usr/local/bin/conv_doc.pl

# local variables:
# mode: text
# eval: (if (eq window-system 'x) (progn (setq font-lock-keywords (list '("^#.*" . font-lock-keyword-face) '("^[a-zA-Z][^ :]+" . font-lock-function-name-face) '("[+$]*:" . font-lock-comment-face) )) (font-lock-mode)))
# end:

> Thank you, that output establishes that htdig is reading a .pdf file.
>
> The next question is: what is it doing with it?
> To answer that we need to see what you have in your configuration file.
>
> David Adams
> Corporate Information Services
> Information Systems Services
> University of Southampton
>
>
> ----- Original Message -----
> From: "Natalya Kolesnikova" <Ja...@gm...>
> To: "Gilles Detillieux" <gr...@sc...>
> Cc: <htd...@li...>
> Sent: Wednesday, October 08, 2003 10:22 AM
> Subject: Re: [htdig] PDF-SEARCH
>
>
> > Thank you very much for your help!
> > I don't get an error message, but I never have .pdf files in my search list!!!
> > Here is the htdig -ivvv output when start_url is a single PDF file.
> > What is wrong???
> >
> > natalya.kolesnikova@intranet:~> htdig -ivvv
> >
> > 1:1:http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf
> > New server: intranet.panasonic.de, 80
> > Retrieval command for http://intranet.panasonic.de/robots.txt: GET /robots.txt HTTP/1.0
> > User-Agent: htdig/3.1.6 (kol...@pa...)
> > Host: intranet.panasonic.de
> >
> > Header line: HTTP/1.1 200 OK
> > Header line: Date: Wed, 08 Oct 2003 08:36:24 GMT
> > Header line: Server: Apache/1.3.27 (Linux/SuSE) PHP/4.3.1
> > Header line: Last-Modified: Tue, 21 Aug 2001 22:00:00 GMT
> > Converted Tue, 21 Aug 2001 22:00:00 GMT to Tue, 21 Aug 2001 22:00:00
> > Header line: ETag: "44005-e7-3b82d9e0"
> > Header line: Accept-Ranges: bytes
> > Header line: Content-Length: 231
> > Header line: Connection: close
> > Header line: Content-Type: text/plain
> > Header line:
> > returnStatus = 0
> > Read 231 from document
> > Read a total of 231 bytes
> > Parsing robots.txt file using myname = htdig
> > Robots.txt line: # exclude help system from robots
> > Robots.txt line: User-agent: *
> > Found 'user-agent' line: *
> > Robots.txt line: Disallow: /manual/
> > Found 'disallow' line: /manual/
> > Robots.txt line: Disallow: /doc/
> > Found 'disallow' line: /doc/
> > Robots.txt line: Disallow: /gif/
> > Found 'disallow' line: /gif/
> > Robots.txt line: # but allow htdig to index our doc-tree
> > Robots.txt line: User-agent: susedig
> > Found 'user-agent' line: susedig
> > Robots.txt line: Disallow:
> > Robots.txt line: # disallow stress test
> > Robots.txt line: user-agent: stress-agent
> > Found 'user-agent' line: stress-agent
> > Robots.txt line: Disallow: /
> > Pattern: /manual/|/doc/|/gif/
> > pushed
> > pick: intranet.panasonic.de, # servers = 1
> > 0:0:0:http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf: Retrieval command for http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf: GET /pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf HTTP/1.0
> > User-Agent: htdig/3.1.6 (kol...@pa...)
> > Host: intranet.panasonic.de
> >
> > Header line: HTTP/1.1 200 OK
> > Header line: Date: Wed, 08 Oct 2003 08:36:24 GMT
> > Header line: Server: Apache/1.3.27 (Linux/SuSE) PHP/4.3.1
> > Header line: Last-Modified: Fri, 29 Aug 2003 11:25:19 GMT
> > Converted Fri, 29 Aug 2003 11:25:19 GMT to Fri, 29 Aug 2003 11:25:19
> > Header line: ETag: "314005-51e38-3f4f381f"
> > Header line: Accept-Ranges: bytes
> > Header line: Content-Length: 335416
> > Header line: Connection: close
> > Header line: Content-Type: application/pdf
> > Header line:
> > returnStatus = 0
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 8192 from document
> > Read 7736 from document
> > Read a total of 335416 bytes
> > size = 335416
> > pick: intranet.panasonic.de, # servers = 1
> > natalya.kolesnikova@intranet:~>
> >
> > > According to Natalya Kolesnikova:
> > > > Maybe I am stupid, but it doesn't work for me! Can somebody help me? I have
> > > > tried with acroread and with the external parser xpdf, but it doesn't work!!!!
> > > > I need the Installation Guide!!! :)))
> > >
> > > See http://www.htdig.org/FAQ.html#q4.9
> > >
> > > That is the installation guide for PDF indexing.  If you've carefully read
> > > and implemented everything recommended there, and checked out FAQs 5.2
> > > and 5.37 as David recommended (twice), then please provide more details,
> > > such as what error messages you get, or give us an excerpt of htdig -ivvv
> > > output when start_url is set to point to just one single PDF file.
> > >
> > > There are dozens of potential points of failure in this process, so simply
> > > saying "it doesn't work" gives us no information that can help pinpoint
> > > which point of failure is the one that needs to be addressed.
> > >
> > > Also, make sure you have links in your HTML files to all PDF files you
> > > want to index.  (See http://www.htdig.org/FAQ.html#q5.25)
> > >
> > > --
> > > Gilles R. Detillieux              E-mail: <gr...@sc...>
> > > Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
> > > Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
> > >
> > >
> > > -------------------------------------------------------
> > > This sf.net email is sponsored by:ThinkGeek
> > > Welcome to geek heaven.
> > > http://thinkgeek.com/sf
> > > _______________________________________________
> > > ht://Dig general mailing list: <htd...@li...>
> > > ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
> > > List information (subscribe/unsubscribe, etc.)
> > > https://lists.sourceforge.net/lists/listinfo/htdig-general
> > >
> >
> > --
> > NEW FOR EVERYONE - GMX MediaCenter - for photos, music, files...
> > Photo album, File Sharing, MMS, multimedia greetings, GMX FotoService
> >
> > Sign up now for free at http://www.gmx.net
> >
> > +++ GMX - the first address for Mail, Message, More! +++
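
A quick way to cross-check the htdig -ivvv output quoted above is to add up the "Read N from document" lines and compare the sum with the Content-Length header and the "Read a total of ... bytes" line. The sketch below does only that, and it recognizes only the line shapes that appear in this log. The function name summarize_fetch and the dictionary keys are my own choices for illustration; they are not part of ht://Dig.

```python
import re

# Leading "> " quote markers added by mail clients.
QUOTE_PREFIX = re.compile(r"^(?:>\s?)+")


def summarize_fetch(log_text):
    """Summarize ONE document fetch from `htdig -ivvv` style output.

    Hypothetical helper: it only understands the line shapes seen in
    the log quoted in this message.
    """
    summary = {
        "content_type": None,
        "content_length": None,
        "chunk_bytes": 0,        # sum of all "Read N from document" lines
        "reported_total": None,  # value from "Read a total of N bytes"
    }
    for raw in log_text.splitlines():
        line = QUOTE_PREFIX.sub("", raw.strip()).strip()

        m = re.match(r"Header line: Content-Type: (\S+)", line)
        if m:
            summary["content_type"] = m.group(1)
            continue
        m = re.match(r"Header line: Content-Length: (\d+)$", line)
        if m:
            summary["content_length"] = int(m.group(1))
            continue
        m = re.match(r"Read (\d+) from document$", line)
        if m:
            summary["chunk_bytes"] += int(m.group(1))
            continue
        m = re.match(r"Read a total of (\d+) bytes$", line)
        if m:
            summary["reported_total"] = int(m.group(1))

    # The fetch is complete when all three byte counts agree.
    summary["complete"] = (
        summary["content_length"] is not None
        and summary["content_length"]
        == summary["chunk_bytes"]
        == summary["reported_total"]
    )
    return summary


# The PDF fetch from the log above, reduced to the lines that matter.
sample = "\n".join(
    [
        "> > Header line: Content-Length: 335416",
        "> > Header line: Content-Type: application/pdf",
    ]
    + ["> > Read 8192 from document"] * 40
    + [
        "> > Read 7736 from document",
        "> > Read a total of 335416 bytes",
    ]
)

if __name__ == "__main__":
    print(summarize_fetch(sample))
```

For the PDF fetch in this thread the three numbers agree (40 reads of 8192 bytes plus one of 7736 give 335416), so the download itself completed. That matches David's point that the open question is what htdig does with the file after retrieving it.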