From: Marco H. <ma...@ec...> - 2007-01-15 09:26:34
Hi all,

A couple of years ago we implemented a website for one of our clients. The website serves as an information portal for regular visitors seeking information, but it also has workgroups for specialists who share information about different subjects. When we created the website we used htdig as a search engine to spider the public information as well as the information in the workgroups. Workgroups contain news and agenda, forums, webmail and documents.

The problem is that for a few months now the load on the server has increased. During indexing the server load averages 3.00 - 4.00 and above, and the indexing takes up to 12 hours! Now I know there is a lot of information on the website plus 600+ workgroups to index, so I'd expect htdig to take some time. In the end we end up with the usual htdig files: docdb = 206 MB, docs.index = 11 MB, wordlist = 222 MB and words db = 169 MB. Comparing these file sizes with those of another (and much smaller) website index, they are not much different: indexing the smaller website gives us a docdb file of 195 MB, a wordlist file of 195 MB and a words db file of 140 MB, and takes about 2 hours from start to finish.

I've tried to set up htdig to index incrementally, but that doesn't work well. After a while htdig stops indexing with an error: the work files grow past the 2 GB boundary. Cleaning up the work files before indexing helps, but then it's no longer an incremental index, is it? :-) Besides, the indexing still takes about 5-6 hours when run incrementally.

Can anyone help me with this problem, or does anyone have an idea what the cause might be? I've added server information at the bottom of this e-mail, followed by the configuration file.

Thanks in advance!

Greetings,

Marco Houtman
Ecommany B.V.
PS: Server information

SuSE Linux (I believe it's 8.2, but our service provider installed it for us and I do not have enough privileges to log in to the server and look up the exact version). Htdig 3.1.6 has been installed from source (I can see the source file and there's no RPM with the name htdig installed, afaik). The server hardware is about 3 years old now. I can't tell you exactly what the components are, but I know it was state-of-the-art equipment back then :-)

PS2: configuration

# common
root_dir:               /home/www/domain.nl/htdig
common_dir:             ${root_dir}/common
database_dir:           ${root_dir}/db
template_dir:           ${root_dir}/templates

# htdig
bad_extensions:         .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
                        .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm \
                        .mpg .mov .avi .css .js .inc
bad_word_list:          ${common_dir}/bad_words
create_url_list:        yes
exclude_urls:           /bestel/ /zoek/ /download/ /pdf/ /uploaded/ \
                        selectie= regio= type= letter= auteur= rubriek= start= \
                        tumor= uitgever= sort= regtoev= meth= status= fase= \
                        aktie= thesaurus
external_parsers:       application/msword /usr/local/bin/parse_doc.pl \
                        application/pdf /usr/local/bin/parse_pdf.pl
limit_urls_to:          http://www.domain.nl/
maintainer:             mye...@do...
max_doc_size:           5000000
max_head_length:        10000
max_hop_count:          10
start_url:              http://www.domain.nl/index.php
user_agent:             domain-digger

# htmerge

# htdump

# htload

# htfuzzy
endings_affix_file:     ${common_dir}/nederlands.aff
endings_dictionary:     ${common_dir}/nederlands.0

# htnotify

# htsearch
max_prefix_matches:     100
minimum_prefix_length:  2
no_excerpt_show_top:    true
nothing_found_file:     ${template_dir}/nomatch.html
prefix_match_character: *
search_algorithm:       exact:1 prefix:0.5 endings:0.1
search_results_footer:  ${template_dir}/footer.html
search_results_header:  ${template_dir}/header.html
syntax_error_file:      ${template_dir}/syntax.html
sort:                   score
template_map:           Long builtin-long builtin-long \
                        Short builtin-short builtin-short \
                        Website website ${template_dir}/website.html
template_name:          website
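Given how many URL parameters the exclude_urls list above already filters, one thing worth checking is whether a newer parameter is slipping through and multiplying URLs for the same pages. Since create_url_list is enabled, htdig writes out the list of URLs it encountered (by default a db.urls file under database_dir; the exact path below is an assumption based on the config above). A rough sketch:

```shell
#!/bin/sh
# Look for parameter-driven URL explosions in the URL list htdig
# writes when create_url_list is enabled. The file name below
# (db.urls under database_dir) is an assumption; check your setup.
URLS=/home/www/domain.nl/htdig/db/db.urls

# Collapse query strings so /page.php?start=10 and /page.php?start=20
# count as the same page, then list the most-hit pages first.
sed 's/?.*//' "$URLS" | sort | uniq -c | sort -rn | head -20
```

If one script shows up hundreds of times, its query parameters are candidates for the exclude_urls list.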
From: Jim C. <li...@yg...> - 2007-01-15 23:41:03
Hi -

My recommendation would be to remove any existing work files and then try reindexing with one or two -v options (the more v's, the more verbose the output). This will allow you to see what URLs are being indexed and give you some feel for the speed with which they are being indexed. If things slow way down part way through, it might be a resource issue.

Depending on various server-side details, it is possible to end up in a situation where the same page is associated with different URLs. This can create loops that will slow things down. Another possibility is that enough data has been added that you are hitting swap during the index run, which will slow things down a lot; monitoring memory use during the run should let you know if this is a problem.

I am sure there are other possibilities, but the best starting point would be to make sure there are no surprises in what is being indexed and to keep an eye on system resources while indexing.

Jim

On Jan 15, 2007, at 2:24 AM, Marco Houtman wrote:
> [...]

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
ht://Dig general mailing list: <htd...@li...>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general
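Building on the -v suggestion above: if the verbose output is captured to a file, a quick pass over it can show whether the same URLs are being fetched more than once, which would point at the duplicate-URL loop situation Jim describes. A rough sketch; the config path is an assumption, and since the exact -v log format varies between htdig versions, the only assumption about the log is that each fetched URL appears somewhere on a line:

```shell
#!/bin/sh
# Run the dig verbosely, keep the log, then print any URL that shows
# up more than once -- a hint of duplicate-URL loops. The config
# path is an assumption based on the earlier message; adjust it.
LOG=/tmp/htdig-run.log
htdig -v -v -c /home/www/domain.nl/htdig/common/htdig.conf > "$LOG" 2>&1

# Extract anything URL-shaped and print only the repeated entries.
grep -o 'http://[^ ]*' "$LOG" | sort | uniq -d
```

An empty result from the last pipeline suggests the slowdown is a resource issue rather than a crawling loop.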
From: Robert I. <rob...@vo...> - 2007-01-16 16:15:32
Have you tried setting up htDig 3.2.0b5, as that uses compression? You can set it up in a different location with a different name for htsearch (htsearch32). This is what I did when I moved from 3.1.6 to 3.2 and wanted to keep 3.1.6 running whilst I set up 3.2. It still hogs the server a bit, so I run it during the quiet time. I have it on RHES4 with 2 GB RAM and a single Xeon 3.06 CPU, but am adding another CPU soon, so it will be interesting to see what difference that makes.

Bob
___________________________________________________
Robert Isaac
Director/Web Admin
Volvo Owners Club

-----Original Message-----
From: htd...@li... [mailto:htd...@li...] On Behalf Of Jim Cole
Sent: 15 January 2007 23:41
To: Marco Houtman
Cc: htd...@li...
Subject: Re: [htdig] htdig indexing for a very long time

> [...]
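For anyone wanting to try the parallel setup Bob describes, the rough shape of it is sketched below. The install prefix, the cgi-bin path, and the htsearch32 name are all assumptions to be adapted; the point is only that a separate --prefix keeps 3.2.0b5 from touching the existing 3.1.6 install:

```shell
#!/bin/sh
# Sketch of installing htdig 3.2.0b5 alongside an existing 3.1.6.
# All paths here are assumptions -- adapt them to your layout.
cd htdig-3.2.0b5
./configure --prefix=/usr/local/htdig32
make
make install

# Copy the new htsearch into the web server's cgi-bin under a
# different name, so the old 3.1.6 htsearch keeps answering queries
# while the 3.2 databases are built and tested.
cp /usr/local/htdig32/bin/htsearch /srv/www/cgi-bin/htsearch32
```

Once the 3.2 results look right, the search form can simply be pointed at htsearch32, and the 3.1.6 install retired.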