From: Gilles D. <gr...@sc...> - 2002-01-08 22:21:25
According to Neal Richter:
> 1. Is there any need to rebuild the index from scratch
> periodically? Some commercial search engines use incremental indexing and
> recommend that when the incremental portion of the index gets to be a
> given size (say 20%) the entire index is rebuilt.

htdig doesn't keep the incremental portion of the index in a separate
database, so there aren't any hard and fast rules about needing to rebuild
from scratch. However, it has been our experience that some things do seem
to degenerate over time, though not for any reason we've been able to pin
down and fix, so we do still recommend that you occasionally rebuild from
scratch. Some users do this only once a month or thereabouts. Others only
rebuild when odd problems develop. We don't have any stats on how long
anyone has run with daily updates to a database with no rebuilds.

> 2. Is it possible to turn stemming off for particular languages
> during run time? We have our own stemming tools.. (Porter Algorithm)

I'm not sure what you mean by stemming. If you mean the "endings" fuzzy
algorithm, then it can be turned off by changing the search_algorithm
attribute.

> 3. (Unicode) Is the index (the core of the index code) capable of
> doing multibyte searching? For example if a fully escaped version of a
> Japanese or other multibyte document was indexed.. and then searched with
> a properly escaped query.. would valid matches occur? (exclude any UI or
> upper level code in your thing here.)

No, currently htdig only supports 8-bit character sets that are supported
by a "locale" on your system.

> 4. (2 Gig Limit) Some of the archives will be at a million+
> documents in size with an average length exceeding 2K. Other than using
> XFS or JFS, the solution in this case is to use multiple index files?

Correct. The 3.1 series can only search one index file at a time, but you
can set it up to manually select which config file you want to use, and
each config file can select its own database.
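
As a rough illustration of the "one config file per index" setup described above, here is a minimal Python sketch that builds an htsearch URL selecting a config by name through htsearch's `config` and `words` CGI parameters. The base URL, the archive names, and the config names are all hypothetical placeholders, and the mapping itself is only an assumption about how a site might organise its archives.

```python
from urllib.parse import urlencode

# Hypothetical mapping from an archive name to an htdig config name.
# htsearch takes the config name without the ".conf" suffix.
ARCHIVE_CONFIGS = {
    "archive-a": "archive_a",
    "archive-b": "archive_b",
}


def htsearch_url(base_url, archive, words):
    """Build a search URL that selects one index via its config file."""
    config = ARCHIVE_CONFIGS[archive]  # raises KeyError for unknown archives
    query = urlencode({"config": config, "words": words})
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # The CGI location below is a placeholder, not a real endpoint.
    print(htsearch_url("http://example.com/cgi-bin/htsearch",
                       "archive-b", "spinal cord"))
```

Each search still hits a single index, so the caller (or a form on the search page) has to choose the archive up front.
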
The 3.2 beta series has preliminary support for "collections", where
results for a search of multiple index files are combined.

> 5. Is there a way to add a 'field' to the index? Ie.. multiple
> documents share a source-id & a query is given to return the documents
> with that source-id. This could be accomplished implicitly by modifying the
> source-id to be some special alpha-numeric character (DJ23KJD823).. but
> this has a small probability of giving false-positive search results.

At the moment, there isn't an easy way to add fields, although it's on the
wish list for 3.2. Depending on how you encode your source IDs, though, you
can reduce the probability of a false positive to almost 0. The encoded
source IDs could be placed in meta keywords tags in your documents, and
htdig would pick them up and index them like any other keywords, with a
scoring factor controlled by keywords_factor.

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
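
The source-ID encoding suggested in the answer to question 5 could be sketched as follows. This is only one possible scheme, not something htdig prescribes: the `sid` prefix, the SHA-1 digest, and the 12-character token length are all assumptions made for illustration. The token is kept short and purely alphanumeric as a precaution, in case the indexer splits words on punctuation or limits word length in your version.

```python
import hashlib
from html import escape

# Hypothetical choices: a fixed prefix plus a truncated hex digest, giving a
# lowercase alphanumeric token that is very unlikely to occur in normal text.
TOKEN_PREFIX = "sid"
DIGEST_CHARS = 9  # 16**9 (about 6.9e10) possible values


def source_token(source_id: str) -> str:
    """Map a source ID to a short, search-friendly token."""
    digest = hashlib.sha1(source_id.encode("utf-8")).hexdigest()
    return TOKEN_PREFIX + digest[:DIGEST_CHARS]


def keywords_meta_tag(source_id: str) -> str:
    """Render a meta keywords tag carrying the encoded source ID."""
    token = escape(source_token(source_id), quote=True)
    return '<meta name="keywords" content="{}">'.format(token)


if __name__ == "__main__":
    print(keywords_meta_tag("DJ23KJD823"))
```

Searching for the token returned by `source_token` should then match only documents tagged with that source ID, apart from the very small chance of two IDs sharing the same truncated digest.
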