From: Jim C. <li...@yg...> - 2007-01-15 23:41:03
Hi -

My recommendation would be to remove any existing work files and
then try reindexing with one or two -v options (the more v's, the more
verbose the output). This will allow you to see what URLs are being
indexed and give you some feel for the speed with which they are
being indexed. If things slow way down part way through, it might be
a resource issue.

Depending on various server-side details, it is possible to end up in
a situation where the same page is associated with different URLs.
This can create loops that will slow things down. Another possibility
is that enough data has been added that you are hitting swap during
the index run, which will slow things down a lot; monitoring memory
use during the run should let you know if this is a problem.

I am sure there are other possibilities, but the best starting point
would be to make sure there are no surprises in what is being indexed
and to keep an eye on system resources while indexing.

Jim

On Jan 15, 2007, at 2:24 AM, Marco Houtman wrote:

> Hi all,
>
> A couple of years ago we implemented a website for one of our
> clients. The website serves as an information portal for regular
> visitors seeking information, but it also has workgroups for
> specialists who share information about different subjects. When
> we created the website we used htdig as a search engine to spider
> the public information as well as the information in the
> workgroups. Workgroups contain news and agenda, forums, webmail and
> documents.
>
> The problem is that over the past few months the load on the server
> has increased. During indexing the server load average gets up to
> and above 3.00-4.00. The duration of the indexing: up to 12 hours!
> Now I know there is a lot of information on the website plus 600+
> workgroups to index, so I'd expect htdig to consume some time.
> In the end we end up with the usual htdig files: docdb = 206 MB,
> docs.index = 11 MB, wordlist = 222 MB and words db = 169 MB.
> Compared with the file sizes of another (much smaller) website's
> index, these are not much different. Indexing the smaller
> website gives us a docdb file of 195 MB, a wordlist file of 195 MB
> and a words db file of 140 MB. That indexing takes about 2 hours
> from start to finish.
>
> I've tried to set up htdig to index incrementally, but that doesn't
> work well. After a while htdig stops indexing with an error: the
> work files get past the 2GB boundary. Cleaning up the work files
> before the indexing works, but then it's no longer incremental,
> is it? :-) Besides, the indexing still takes about 5-6 hours
> when run incrementally.
>
> Can anyone help me with this problem? Or does anyone have an idea
> what the problem might be? I've added server information at the
> bottom of the e-mail, followed by the configuration file.
>
> Thanks in advance!
>
> Greetings,
>
> Marco Houtman
> Ecommany B.V.
>
> PS: Server information
>
> SuSE Linux (I believe it's 8.2, but our service provider installed
> it for us and I do not have enough privileges to log in to the
> server and look up the exact version)
>
> Htdig 3.1.6 has been installed from source (I can see the
> source file and there's no RPM with the name htdig installed afaik).
>
> Server hardware is about 3 years old now.
> I can't tell you exactly what the components are, but I know it
> was state of the art equipment back then :-)
>
> PS2: configuration
>
> # common
> root_dir: /home/www/domain.nl/htdig
>
> common_dir: ${root_dir}/common
> database_dir: ${root_dir}/db
> template_dir: ${root_dir}/templates
>
> # htdig
> bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
>     .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm \
>     .mpg .mov .avi .css .js .inc
> bad_word_list: ${common_dir}/bad_words
> create_url_list: yes
> exclude_urls: /bestel/ /zoek/ /download/ /pdf/ /uploaded/ \
>     selectie= regio= type= letter= auteur= rubriek= start= \
>     tumor= uitgever= sort= regtoev= meth= status= fase= \
>     aktie= thesaurus
> external_parsers: application/msword /usr/local/bin/parse_doc.pl \
>     application/pdf /usr/local/bin/parse_pdf.pl
> limit_urls_to: http://www.domain.nl/
> maintainer: mye...@do...
> max_doc_size: 5000000
> max_head_length: 10000
> max_hop_count: 10
> start_url: http://www.domain.nl/index.php
> user_agent: domain-digger
>
> # htmerge
>
> # htdump
>
> # htload
>
> # htfuzzy
> endings_affix_file: ${common_dir}/nederlands.aff
> endings_dictionary: ${common_dir}/nederlands.0
>
> # htnotify
>
> # htsearch
> max_prefix_matches: 100
> minimum_prefix_length: 2
> no_excerpt_show_top: true
> nothing_found_file: ${template_dir}/nomatch.html
> prefix_match_character: *
> search_algorithm: exact:1 prefix:0.5 endings:0.1
> search_results_footer: ${template_dir}/footer.html
> search_results_header: ${template_dir}/header.html
> syntax_error_file: ${template_dir}/syntax.html
> sort: score
> template_map: Long builtin-long builtin-long \
>     Short builtin-short builtin-short \
>     Website website ${template_dir}/website.html
> template_name: website
>
> -------------------------------------------------------------------------
> ht://Dig general mailing list: <htd...@li...>
> ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
> List information (subscribe/unsubscribe, etc.)
> https://lists.sourceforge.net/lists/listinfo/htdig-general
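
One way to check for the "same page under different URLs" situation described in the reply is to group the URLs htdig has seen by a normalized form. The sketch below is only an illustration, written in Python 3: it assumes a plain-text list with one URL per line (such as the list that create_url_list: yes is meant to produce; the exact file location depends on your configuration). The parameter names in SESSION_PARAMS and the normalization rules are hypothetical guesses and are not part of ht://Dig.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical query parameters that often carry session state.
# These names are assumptions; adjust them to what the site actually uses.
SESSION_PARAMS = {"phpsessid", "sid", "sessionid"}


def normalize(url):
    """Return a canonical form of url: lower-cased scheme and host,
    session-like parameters dropped, remaining parameters sorted."""
    parts = urlsplit(url.strip())
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    )
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def find_duplicates(urls):
    """Map each normalized URL to the distinct original URLs that
    collapse onto it, keeping only groups with more than one member."""
    groups = defaultdict(set)
    for url in urls:
        if url.strip():
            groups[normalize(url)].add(url.strip())
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


def load_urls(path):
    """Read one URL per line from a text file (the path is an assumption)."""
    with open(path) as handle:
        return [line for line in handle if line.strip()]
```

Calling find_duplicates(load_urls(...)) on the URL list and looking at the largest groups should show whether a session parameter or a reordered query string is multiplying the number of pages being indexed. If it is, adding the offending parameter to exclude_urls is one possible remedy.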
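
To follow up on the suggestion to watch memory during the index run, here is a minimal Python 3 sketch that reports how much swap is in use. It assumes a Linux system whose /proc/meminfo contains SwapTotal and SwapFree lines with values in kB; the function names are made up for this example. Tools such as top, free or vmstat give the same information without any scripting.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.
    Lines whose first field after the colon is not a number are skipped."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def swap_used_kb(meminfo):
    """Swap in use, computed as SwapTotal minus SwapFree (both in kB)."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


def read_swap_used_kb(path="/proc/meminfo"):
    """Read the given meminfo file and return the swap currently in use."""
    with open(path) as handle:
        return swap_used_kb(parse_meminfo(handle.read()))
```

Printing read_swap_used_kb() every minute or so while htdig runs would show whether swap use climbs as the index grows; a steady rise that coincides with the slowdown points toward the memory explanation rather than a URL loop.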