Re: [htdig-dev] Incremental Index Efficiency, Unicode, & 2GIG limit

On Tue, 8 Jan 2002, Geoff Hutchison wrote:

> I'm keeping this on-list so it's archived. I'm by far not the only active
> member in the group. (esp. judging from # of CVS commits...)

	OK.  Just trying to keep the dumb questions to the developer list
to a minimum. ;-)  Some of my questions are an attempt to get a feel for
how this developer group works.

> So if this is interesting to you, think about it, work out a proposal of
> how you think it might best be implemented and write the list--we'll give
> feedback and go from there.

	OK.  I need to do some code reading but I'll take a whack at a
proposal in the next week or so...

	Is the thinking that these changes (queryable index fields) would
take place in the mifluz indexing code?  Is mifluz the core index code in
3.2?

> 
> The release will happen when it's ready. The two main "chunks" that
> absolutely, positively need to be finished are a sync with the current
> mifluz code and the new htsearch framework. Beyond that, many things are
> up in the air--if people come along to do things like Unicode and they can
> be tested thoroughly, great. If not, it moves back.

	Just to be clear.. do you mean the release is moved back or that
the feature is postponed until the next release?  I'm guessing the latter.

	I'm trying to get a good idea of what feature goals the htdig
developers have for 3.2 [other than what was just given].

	Not to be pushy, but how complete does everyone consider 3.2 to be
currently?

	My main immediate task is to choose to use either 3.1.5 or
3.2Beta as a starting point.  I would obviously prefer to use 3.2Beta and
submit bug-fixes, feature improvements, & memory leak/corruption fixes
than do that effort on a previous version.. as well as test 3.2Beta on
very large data-sets..

	A ball-park ETA is fine..  I'm looking at an early spring
release for the next version of the software that will use this as an
archiving tool.  [See www.rightnow.com for more info on our software]
Note that this (our software) release does not require the archive to be
UTF-8 ready... just moving in that direction.

	With QA/Testing taking a big chunk of that time, I want to be
confident that the state of the 3.2Beta code as of mid-February is good
enough to send to QA... and of course get all the fixes from our QA back
to htdig ASAP ;-) !

	I'm gathering implicitly that the changes envisioned before the
3.2 release are improvements, i.e. the Beta code is fully functional now.
Correct?

	I'd be happy to assist in syncing with mifluz.

> At the moment, there is support only for 8-bit character
> chunks. Gilles told you the current state: if you have an 8-bit locale
> that supports the character set you want, you're probably going to be
> OK. Significant parts of the code rely on the 8-bit char assumption, which
> means that there will be some rewriting work beyond just the backend.

	Ah!  That's the magic phrase ("8-bit char assumption") I was
looking for.  My fault, I should have just asked it more clearly the first
time!
	Thanks for the clarification.
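
	For the archive, here is a small illustration of what the "8-bit
char assumption" means in practice once text is stored as UTF-8.  This is
a Python sketch, not htdig code; the function names are made up for the
example.  It only shows that byte counts and character counts diverge, and
that cutting a string at a byte offset can split a character.

```python
def byte_length(text: str) -> int:
    """Length as seen by code that treats one byte as one character."""
    return len(text.encode("utf-8"))


def char_length(text: str) -> int:
    """Length in Unicode code points."""
    return len(text)


def truncate_bytes(text: str, n: int) -> bytes:
    """Naive truncation at a byte offset, as 8-bit-oriented code might do."""
    return text.encode("utf-8")[:n]


def is_valid_utf8(data: bytes) -> bool:
    """True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    sample = "日本語"  # three characters, three bytes each in UTF-8
    print(char_length(sample), byte_length(sample))
    print(is_valid_utf8(truncate_bytes(sample, 3)))  # cut on a boundary
    print(is_valid_utf8(truncate_bytes(sample, 4)))  # cut mid-character
```

	Any code that sizes buffers, truncates excerpts, or walks a
string one char at a time would presumably need this kind of review.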

> My guess is that the biggest challenges for moving to Unicode are going to
> be removing the 8-bit char assumption and parsing "words." Updating the
> String class to be Unicode-aware (perhaps via UTF-8) is probably going to
> help a lot towards the first problem.

	The word-breaking problem for Asian languages like Japanese is a
hard one.  There are no 'spaces' in most Japanese text, so a
language-specific word-breaking algorithm is needed before any sort of
searchable index can be built.
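
	To make the problem concrete, here is a rough Python sketch (again
not htdig code, and the helper names are hypothetical).  Splitting on
whitespace finds the words in an English sentence but returns a Japanese
sentence as a single token.  The second helper shows overlapping character
bigrams, a crude dictionary-free fallback sometimes used for indexing such
text; it is not real word breaking and is shown only for contrast.

```python
def whitespace_words(text: str) -> list:
    """Split on whitespace, the way a space-delimited word parser would."""
    return text.split()


def char_bigrams(text: str) -> list:
    """Overlapping two-character chunks; a crude stand-in for word breaking."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


if __name__ == "__main__":
    english = "search engines index words"
    japanese = "検索エンジンは単語を索引付けする"

    print(whitespace_words(english))   # four separate words
    print(whitespace_words(japanese))  # the whole sentence as one "word"
    print(char_bigrams("検索エンジン"))
```

	A proper solution would replace char_bigrams with a
language-aware segmenter, which is the part that needs outside tools.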

	We've contracted with a large internationalization-software
company to assist us in our conversion to UTF-8.  We accomplish word
breaking for Japanese documents with software that they supply under
contract via NDA.  I'll ask our in-house Japanese developer about free
tools for word breaking.

-- 
Neal Richter 
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site