From: Neal R. <ne...@ri...> - 2002-01-09 07:32:10
On Tue, 8 Jan 2002, Geoff Hutchison wrote:

> I'm keeping this on-list so it's archived. I'm by far not the only active
> member in the group. (esp. judging from # of CVS commits...)

OK. Just trying to keep the dumb questions to the developer list to a
minimum. ;-) Some of my questions are an attempt to get a feel for how
this developer group works.

> So if this is interesting to you, think about it, work out a proposal of
> how you think it might best be implemented and write the list--we'll give
> feedback and go from there.

OK. I need to do some code reading, but I'll take a whack at a proposal in
the next week or so... Is the thinking that these changes (queryable index
fields) would take place in the mifluz indexing code? Is mifluz the core
index code in 3.2?

> The release will happen when it's ready. The two main "chunks" that
> absolutely, positively need to be finished are a sync with the current
> mifluz code and the new htsearch framework. Beyond that, many things are
> up in the air--if people come along to do things like Unicode and they can
> be tested thoroughly, great. If not, it moves back.

Just to be clear: do you mean the release is moved back, or that the
feature is postponed until the next release? I'm guessing the latter.

I'm trying to get a good idea of what feature goals the htdig developers
have for 3.2 [other than what was just given]. Not to be pushy, but how
complete does everyone consider 3.2 to be currently?

My main immediate task is to choose either 3.1.5 or 3.2Beta as a starting
point. I would obviously prefer to use 3.2Beta and submit bug fixes,
feature improvements, and memory leak/corruption fixes, as well as test
3.2Beta on very large data sets, rather than spend that effort on a
previous version.

A ball-park ETA is fine. I'm looking at an early spring release for the
next version of the software that will use this as an archiving tool.
[See www.rightnow.com for more info on our software]

Note that this (our software) release does not require the archive to be
UTF-8 ready... just moving in that direction. With QA/Testing taking a big
chunk of that time, I want to be confident that the state of the 3.2Beta
code as of mid-February is good enough to send to QA... and of course get
all the fixes from our QA back to htdig ASAP ;-) !

I'm gathering implicitly that the changes envisioned before the 3.2
release are improvements, i.e. the Beta code is fully functional now.
Correct?

I'd be happy to assist in syncing with mifluz.

> At the moment, there is support only for 8-bit character
> chunks. Gilles told you the current state: if you have an 8-bit locale
> that supports the character set you want, you're probably going to be
> OK. Significant parts of the code rely on the 8-bit char assumption, which
> means that there will be some rewriting work beyond just the backend.

Ah! That's the magic phrase ("8-bit char assumption") I was looking for.
My fault, I should have just asked it more clearly the first time! Thanks
for the clarification.

> My guess is that the biggest challenges for moving to Unicode are going to
> be removing the 8-bit char assumption and parsing "words." Updating the
> String class to be Unicode-aware (perhaps via UTF-8) is probably going to
> help a lot towards the first problem.

The word-breaking problem for Asian languages like Japanese is a hard one.
There are no 'spaces' in most Japanese text, so a language-specific
word-breaking algorithm is needed before any sort of searchable index can
be built.

We've contracted with a large internationalization-software company to
assist us in our conversion to UTF-8. We accomplish word breaking for
Japanese documents with software that they supply under contract via NDA.

I'll ask our in-house Japanese developer about free tools for word
breaking.

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
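
To illustrate the two problems discussed above (the 8-bit char assumption
and word breaking), here is a minimal Python sketch. It is not ht://Dig or
mifluz code. The function names and sample strings are made up for
illustration, and character bigrams are only a generic, dictionary-free
fallback, not the algorithm used by the commercial word breaker mentioned
above.

```python
def whitespace_words(text):
    """Split on whitespace, as an indexer for space-delimited languages might."""
    return text.split()


def char_bigrams(text):
    """Naive fallback tokenizer: overlapping two-character tokens.

    Whitespace is dropped first. This is an illustrative stand-in for a real
    language-specific word breaker, which would use a dictionary or a
    statistical model instead.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


english = "full text search"
japanese = "日本語の全文検索"  # "Japanese full-text search", written without spaces

# Whitespace splitting finds three words in the English sample...
print(whitespace_words(english))

# ...but treats the whole Japanese sample as a single "word".
print(whitespace_words(japanese))

# Bigrams at least give an index something smaller to match against.
print(char_bigrams(japanese))

# One character is not one byte in UTF-8: code that assumes 8-bit chars
# would see the byte count, not the character count.
print(len(japanese), "characters,", len(japanese.encode("utf-8")), "bytes")
```

Bigram indexing tends to trade precision for recall (it matches character
pairs that span real word boundaries), which is one reason a proper word
breaker is usually preferred when one is available.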