Re: [htdig-dev] Incremental Index Efficiency, Unicode, & 2GIG limit

According to Neal Richter:
> 1.	Is there any need to rebuild the index from scratch
> periodically?  Some commercial search engines use incremental indexing and
> recommend that when the incremental portion of the index gets to be a
> given size (say 20%) the entire index is rebuilt.

htdig doesn't keep the incremental portion of the index in a separate
database, so there aren't any hard and fast rules about needing to rebuild
from scratch.  However, in our experience some things do seem to degrade
over time, though not from any cause we've been able to pin down and fix,
so we do still recommend that you occasionally rebuild from scratch.
Some users do this only once a month or thereabouts.  Others only rebuild
when odd problems develop.  We don't have any stats on how long anyone
has run with daily updates to a database with no rebuilds.

> 2.      Is it possible to turn stemming off for particular languages
> during run time?  We have our own stemming tools.. (Porter Algorithm)

I'm not sure what you mean by stemming.  If you mean the "endings" fuzzy
algorithm, then it can be turned off by removing it from the
search_algorithm attribute.
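
As a rough sketch of what that could look like in a config file (the
weights shown are illustrative, not a recommendation, and I'm assuming
your current setting lists exact, synonyms and endings):

```
# Sketch only: "endings" is left out of the list, so that fuzzy
# algorithm is not applied.  The weights here are illustrative.
search_algorithm: exact:1 synonyms:0.5
```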

> 3.      (Unicode) Is the index (the core of the index code) capable of
> doing multibyte searching?  For example if a fully escaped version of a
> Japanese or other multibyte document was indexed.. and then searched with
> a properly escaped query.. would valid matches occur? (exclude any UI or
> upper level code in your thinking here.)

No.  Currently htdig handles only 8-bit character sets for which your
system provides a "locale".

> 4.      (2 Gig Limit)  Some of the archives will be at a million+ 
> documents in size with an average length exceeding 2K.  Other than using
> XFS or JFS, the solution in this case is to use multiple index files?

Correct.  The 3.1 series can only search one index file at a time, but you
can set it up so that you manually select which config file to use, and
each config file can point to its own database.  The 3.2 beta series has
preliminary support for "collections", where results for a search of multiple
index files are combined.
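
For the 3.1 approach, here is a minimal sketch in Python of picking a
config file per archive when building the search request.  It relies on
htsearch's "config" and "words" form inputs; the host name, the archive
names and the config names are all hypothetical, and each config file is
assumed to set its own database_dir.

```python
from urllib.parse import urlencode

# Hypothetical location of the htsearch CGI on your server.
HTSEARCH_URL = "http://www.example.com/cgi-bin/htsearch"

# Hypothetical mapping from an archive name to the config file (given
# without its .conf suffix) that points at that archive's database.
ARCHIVE_CONFIGS = {
    "list-2000": "archive2000",
    "list-2001": "archive2001",
}


def search_url(archive, words):
    """Build an htsearch URL that searches one archive's index."""
    config = ARCHIVE_CONFIGS[archive]
    query = urlencode({"config": config, "words": words})
    return HTSEARCH_URL + "?" + query


if __name__ == "__main__":
    print(search_url("list-2001", "incremental index"))
```

Each request still searches only one index; merging results across
several indexes is what the 3.2 "collections" support is meant for.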

> 5.      Is there a way to add a 'field' to the index?  Ie.. multiple
> documents share a source-id & a query is given to return the documents
> with that source-id.  This could be accomplished implicitly by modifying the
> source-id to be some special alpha-numeric character (DJ23KJD823).. but
> this has a small probability of giving false-positive search results.

At the moment, there isn't an easy way to add fields, although it's on the
wish list for 3.2.  Depending on how you encode your source IDs, though,
you can reduce the probability of a false-positive to almost 0.  The
encoded source IDs could be placed in meta keywords tags in your documents,
and htdig would pick them up and index them as any other keywords, with a
scoring factor controlled by keywords_factor.
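
One possible way to encode the source IDs, sketched in Python under a
few assumptions of mine: the "zq" prefix is a hypothetical choice that
is unlikely to begin a real word, the token is kept to 12 lowercase
alphanumeric characters on the assumption that this fits within your
maximum_word_length setting, and SHA-1 is used only to spread the IDs
evenly, not for security.

```python
import hashlib

PREFIX = "zq"        # hypothetical prefix, unlikely to start a real word
TOKEN_LENGTH = 12    # assumed to fit within htdig's maximum_word_length


def source_token(source_id):
    """Map a source ID to a fixed-length lowercase alphanumeric token."""
    digest = hashlib.sha1(source_id.encode("utf-8")).hexdigest()
    return PREFIX + digest[:TOKEN_LENGTH - len(PREFIX)]


def meta_keywords_tag(source_id):
    """Return a meta keywords tag carrying the encoded source ID."""
    return '<meta name="keywords" content="%s">' % source_token(source_id)


if __name__ == "__main__":
    # Put this tag in every document that shares the source ID, then
    # search for the same token to retrieve those documents.
    print(meta_keywords_tag("DJ23KJD823"))
```

A query for the token would then only match documents carrying the tag,
unless ordinary text happens to contain the same 12-character string.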

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930