From: Bill A. <bil...@em...> - 2002-06-23 16:28:17
|
Hello all!

While running HtDig (version 3.1.6) with -i and -vvv, it runs for quite a while (like 19 hours) and then I get this:

    title: doc8995.pdf
    size: 6536
    pick: server.com.com, # servers = 1
    452364:452364:2:http://server.com.com/docs/dir76/doc8995.pdf
    Read 3957 from document
    Read a total of 3957 bytes
    title: doc8996.pdf
    File size limit exceeded
    [root@system Dir]#

The listing of the dir where dig data files reside:

    -rw-r--r-- 1 root root 1369112576 Jun 23 11:05 db.docdb
    -rw-r--r-- 1 root root 2147483647 Jun 23 11:05 db.wordlist

The significant portions of my .conf file:

    wordlist_cache_size: 50000000
    wordlist_compress: false
    allow_numbers: true
    valid_puctuation: -/
    max_head_length: 1500000
    max_doc_size: 150000000
    search_algorithm: exact:1 synonyms:0.1 endings:0.1

I have all of the external parsers defined as well.

All was working fine until last Tuesday when we added 1,252 new PDF docs and I performed a -a dig and was missing a lot of docs.

The report from HtDig -a script (found in contrib./scripts on HtDig web):

    rundig: Start time: Tue Jun 18 14:00:00 EDT 2002
    rundig: Done Digging: Tue Jun 18 14:48:11 EDT 2002
    htmerge: Total word count: 1511472
    htmerge: Total documents: 17493
    htmerge: Total size of documents (in K): 127347
    rundig: Done Merging: Tue Jun 18 15:29:22 EDT 2002
    rundig: End time: Tue Jun 18 15:36:25 EDT 2002

Output from the previous week (<< Note the differences! >>):

    rundig: Start time: Tue Jun 11 16:48:21 EDT 2002
    rundig: Done Digging: Tue Jun 11 17:19:12 EDT 2002
    htmerge: Total word count: 1504940
    htmerge: Total documents: 459082
    htmerge: Total size of documents (in K): 2115373
    rundig: Done Merging: Tue Jun 11 18:24:50 EDT 2002
    rundig: End time: Tue Jun 11 18:36:47 EDT 2002

That is why I tried the -i in an alternate dir. Does anything jump out at you as being very, very wrong in either the conf file or the output?

The files are stored on a 120 GB RAID array and I am only using like 15 GB of disk space in PDF files.
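
One detail in the directory listing may be worth checking: the reported size of db.wordlist, 2147483647 bytes, is exactly 2^31 - 1, the largest value a signed 32-bit file offset can hold. That would be consistent with the "File size limit exceeded" message coming from a 2 GB file size limit rather than from doc8996.pdf itself, though that reading is an assumption and not something confirmed here. A minimal Python sketch of the check follows; the helper names `at_32bit_limit` and `files_at_limit` are made up for illustration and are not part of HtDig.

```python
import os

# Largest offset representable in a signed 32-bit integer (2 GiB - 1).
MAX_32BIT_FILE_SIZE = 2**31 - 1


def at_32bit_limit(size_bytes):
    """Return True if a file size has reached the signed 32-bit maximum."""
    return size_bytes >= MAX_32BIT_FILE_SIZE


def files_at_limit(directory):
    """Return sorted names of regular files in `directory` at or above the limit."""
    hits = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and at_32bit_limit(os.path.getsize(path)):
            hits.append(name)
    return sorted(hits)


# Sizes copied from the directory listing in this message.
reported = {"db.docdb": 1369112576, "db.wordlist": 2147483647}
flagged = [name for name, size in sorted(reported.items()) if at_32bit_limit(size)]
print(flagged)  # prints ['db.wordlist']
```

To scan a live database directory instead of the hard-coded sizes, `files_at_limit` can be pointed at whatever directory holds the dig data files.
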
Total file count is 460996 PDF files in 89 dirs that are being indexed. I have tested each of the new PDF files and know that they are good. They were created in the exact same manner as the other 400,000+ PDF files. BTW, the size of the file it died on, doc8996.PDF, is only 4096 bytes and has been indexed successfully in the past.

System is Linux RedHat 7.3 with all RH patches (except for upgrading to HtDig 3.2), 1 GB RAM, more than 120 GB free disk space on RAID 5 array, Apache 1.3.23-11 webserver using fancy indexing so dig can find the files, xpdf is ver. 1.00-3.

Sorry for the length of this email, just wanted to supply as much info as I could.

Thanks for any input!

Bill Akins, CNE
Sr. OSA
Emory Healthcare
(404) 712-2879 - Office
12674 - PIC
bil...@em...
|