From: Natalya K. <Ja...@gm...> - 2003-10-08 15:12:48
|
Thank you, David, for your help! But when I run htmerge, I get follow message: htmerge: Document database has no URLs. Check your config file and try running htdig again. Thank you for your tipps! Natalya > Ok, your configuration file contains: > > external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl > \ > application/postscript->text/html /usr/local/bin/conv_doc.pl > \ > application/pdf->text/html /usr/local/bin/conv_doc.pl > > so you are using conv_doc.pl. > > Please check one thing in your configuration file: make sure there are no > white space characters after the \ characters at the end of lines, this is > most important. > > If your configuration file is OK, then the problem must be with > /usr/local/bin/conv_doc.pl or the utilities it calls. > Try running /usr/local/bin/conv_doc.pl from the command line with a .PDF > file as argument and see what the result is. > > ----- Original Message ----- > From: "Natalya Kolesnikova" <Ja...@gm...> > To: "David Adams" <D.J...@so...> > Cc: <gr...@sc...>; <htd...@li...> > Sent: Wednesday, October 08, 2003 3:19 PM > Subject: Re: [htdig] PDF-SEARCH > > > > Hier is my htdig.conf: > > # > > # Example config file for ht://Dig. > > # > > # This configuration file is used by all the programs that make up > ht://Dig. > > # Please refer to the attribute reference manual for more details on > what > > # can be put into this file. (http://www.htdig.org/confindex.html) > > # Note that most attributes have very reasonable default values so you > > # really only have to add attributes here if you want to change the > > defaults. > > # > > # What follows are some of the common attributes you might want to > change. > > # > > > > # > > # Specify where the database files need to go. Make sure that there is > > # plenty of free disk space available for the databases. They can get > > # pretty big. > > # > > database_dir: /srv/www/htdig/db > > > > # > > # This specifies the URL where the robot (htdig) will start. You can > > specify > > # multiple URLs here. Just separate them by some whitespace. > > # The example here will cause the ht://Dig homepage and related pages to > be > > # indexed. > > # You could also index all the URLs in a file like so: > > #start_url: > > `${common_dir}/start.url` > > # > > start_url: > http://intranet.panasonic.de/pel/ipr/training_course/IPR_books_JPO/introduction_to_IPR.pdf > > #start_url: http://intranet.panasonic.de > > star_image: /img_htdig/star.gif > > star_blank: /img_htdig/star_blank.gif > > > > > > > > # > > # This attribute limits the scope of the indexing process. The default > is > > to > > # set it to the same as the start_url above. This way only pages that > are > > on > > # the sites specified in the start_url attribute will be indexed and it > will > > # reject any URLs that go outside of those sites. > > # > > # Keep in mind that the value for this attribute is just a list of > string > > # patterns. As long as URLs contain at least one of the patterns it will > be > > # seen as part of the scope of the index. > > # > > limit_urls_to: ${start_url} > > > > # > > # If there are particular pages that you definitely do NOT want to > index, > > you > > # can use the exclude_urls attribute. The value is a list of string > > patterns. > > # If a URL matches any of the patterns, it will NOT be indexed. This is > > # useful to exclude things like virtual web trees or database accesses. > By > > # default, all CGI URLs will be excluded. (Note that the /cgi-bin/ > > convention > > # may not work on your web server. Check the path prefix used on your > web > > # server.) > > # > > exclude_urls: /cgi-bin/ .cgi > > > > > > > > # > > # Since ht://Dig does not (and cannot) parse every document type, this > > # attribute is a list of strings (extensions) that will be ignored > during > > # indexing. These are *only* checked at the end of a URL, whereas > > # exclude_url patterns are matched anywhere. > > # > > bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \ > > .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css > > > > # > > # The string htdig will send in every request to identify the robot. > Change > > # this to your email address. > > # > > maintainer: kol...@pa... > > > > # > > # The excerpts that are displayed in long results rely on stored > information > > # in the index databases. The compiled default only stores 512 > characters > > of > > # text from each document (this excludes any HTML markup...) If you > plan > on > > # using the excerpts you probably want to make this larger. The only > > concern > > # here is that more disk space is going to be needed to store the > additional > > # information. Since disk space is cheap (! :-)) you might want to set > this > > # to a value so that a large percentage of the documents that you are > going > > # to be indexing are stored completely in the database. At SDSU we > found > > # that by setting this value to about 50k the index would get 97% of all > > # documents completely and only 3% was cut off at 50k. You probably > want > to > > # experiment with this value. > > # Note that if you want to set this value low, you probably want to set > the > > # excerpt_show_top attribute to false so that the top excerpt_length > > characters > > # of the document are always shown. > > # > > max_head_length: 10000 > > > > # > > # To limit network connections, ht://Dig will only pull up to a certain > > limit > > # of bytes. This prevents the indexing from dying because the server > keeps > > # sending information. However, several FAQs happen because people have > > files > > # bigger than the default limit of 100KB. This sets the default a bit > > higher. > > # (see <http://www.htdig.org/FAQ.html> for more) > > # > > max_doc_size: 10000000000000000000000 > > > > # > > # Most people expect some sort of excerpt in results. By default, if the > > # search words aren't found in context in the stored excerpt, htsearch > shows > > > > # the text defined in the no_excerpt_text attribute: > > # (None of the search words were found in the top of this document.) > > # This attribute instead will show the top of the excerpt. > > # > > no_excerpt_show_top: true > > > > # > > # Depending on your needs, you might want to enable some of the fuzzy > search > > # algorithms. There are several to choose from and you can use them in > any > > # combination you feel comfortable with. Each algorithm will get a > weight > > # assigned to it so that in combinations of algorithms, certain > algorithms > > get > > # preference over others. Note that the weights only affect the ranking > of > > # the results, not the actual searching. > > # The available algorithms are: > > # accents > > # exact > > # endings > > # metaphone > > # prefix > > # soundex > > # substring > > # synonyms > > # By default only the "exact" algorithm is used with weight 1. > > # Note that if you are going to use the endings, metaphone, soundex, > > accents, > > # or synonyms algorithms, you will need to run htfuzzy to generate > > # the databases they use. > > # > > search_algorithm: exact:1 synonyms:0.5 endings:0.1 > > > > # > > # The following are the templates used in the builtin search results > > # The default is to use compiled versions of these files, which produces > > # slightly faster results. However, uncommenting these lines makes it > > # very easy to change the format of search results. > > # See <http://www.htdig.org/hts_templates.html> for more details. > > # > > # template_map: Long long ${common_dir}/long.html \ > > # Short short ${common_dir}/short.html > > # template_name: long > > > > # > > # The following are used to change the text for the page index. > > # The defaults are just boring text numbers. These images spice > > # up the result pages quite a bit. (Feel free to do whatever, though) > > # > > next_page_text: <img src="/img_htdig/buttonr.gif" border="0" > align="middle" > > width="30" height="30" alt="next"> > > no_next_page_text: > > prev_page_text: <img src="/img_htdig/buttonl.gif" border="0" > align="middle" > > width="30" height="30" alt="prev"> > > no_prev_page_text: > > page_number_text: '<img src="/img_htdig/button1.gif" border="0" > > align="middle" width="30" height="30" alt="1">' \ > > '<img src="/img_htdig/button2.gif" border="0" align="middle" width="30" > > height="30" alt="2">' \ > > '<img src="/img_htdig/button3.gif" border="0" align="middle" width="30" > > height="30" alt="3">' \ > > '<img src="/img_htdig/button4.gif" border="0" align="middle" width="30" > > height="30" alt="4">' \ > > '<img src="/img_htdig/button5.gif" border="0" align="middle" width="30" > > height="30" alt="5">' \ > > '<img src="/img_htdig/button6.gif" border="0" align="middle" width="30" > > height="30" alt="6">' \ > > '<img src="/img_htdig/button7.gif" border="0" align="middle" width="30" > > height="30" alt="7">' \ > > '<img src="/img_htdig/button8.gif" border="0" align="middle" width="30" > > height="30" alt="8">' \ > > '<img src="/img_htdig/button9.gif" border="0" align="middle" width="30" > > height="30" alt="9">' \ > > '<img src="/img_htdig/button10.gif" border="0" align="middle" width="30" > > height="30" alt="10">' > > # > > # To make the current page stand out, we will put a border around the > > # image for that page. > > # > > no_page_number_text: '<img src="/img_htdig/button1.gif" border="2" > > align="middle" width="30" height="30" alt="1">' \ > > '<img src="/img_htdig/button2.gif" border="2" align="middle" width="30" > > height="30" alt="2">' \ > > '<img src="/img_htdig/button3.gif" border="2" align="middle" width="30" > > height="30" alt="3">' \ > > '<img src="/img_htdig/button4.gif" border="2" align="middle" width="30" > > height="30" alt="4">' \ > > '<img src="/img_htdig/button5.gif" border="2" align="middle" width="30" > > height="30" alt="5">' \ > > '<img src="/img_htdig/button6.gif" border="2" align="middle" width="30" > > height="30" alt="6">' \ > > '<img src="/img_htdig/button7.gif" border="2" align="middle" width="30" > > height="30" alt="7">' \ > > '<img src="/img_htdig/button8.gif" border="2" align="middle" width="30" > > height="30" alt="8">' \ > > '<img src="/img_htdig/button9.gif" border="2" align="middle" width="30" > > height="30" alt="9">' \ > > '<img src="/img_htdig/button10.gif" border="2" align="middle" width="30" > > height="30" alt="10">' > > > > external_parsers: application/msword->text/html > /usr/local/bin/conv_doc.pl > \ > > application/postscript->text/html > /usr/local/bin/conv_doc.pl > \ > > application/pdf->text/html /usr/local/bin/conv_doc.pl > > > > > > > > # local variables: > > # mode: text > > # eval: (if (eq window-system 'x) (progn (setq font-lock-keywords (list > > '("^#.*" . font-lock-keyword-face) '("^[a-zA-Z][^ :]+" . > > font-lock-function-name-face) '("[+$]*:" . font-lock-comment-face) )) > (font-lock-mode))) > > # end: > > > .... > -- NEU FÜR ALLE - GMX MediaCenter - für Fotos, Musik, Dateien... Fotoalbum, File Sharing, MMS, Multimedia-Gruß, GMX FotoService Jetzt kostenlos anmelden unter http://www.gmx.net +++ GMX - die erste Adresse für Mail, Message, More! +++ |