From: Neal R. <ne...@ri...> - 2005-10-20 21:21:51
> > After having looked at many commercial implementations of search engines
> > over the past few years and following Nutch a bit... I am still convinced
> > that HtDig has plenty of legs.
>
> I know what you mean. Every time I look at Nutch I decide
> to stick with htdig 3.1.6 a little longer. However, UTF-8 support
> is getting super critical and some time in 2006 I'm going to have
> to bite the bullet and do something.

Exactly the impetus for the 4.0 development. I need Unicode in 2006 as well.

> Neal, are you tracking the Java Lucene dev lists? There's
> some recent discussion with respect to index interoperability
> that may be relevant.

Not yet... just the CLucene list. I'll have a look.

We have been able to verify that the Java Lucene tool 'Luke' can read and
query the indexes produced by CLucene. Very cool.

The names of the searchable fields we are using at this point are likely
different from Nutch's. Might be worth a look to see how different.

If you look at the 4.0 CVS branch, we've devised a pretty cool method of
using an STL map container to hold the fieldname & fieldtext pairs with
index/noindex and store/nostore flags. These are filled per document
during htdig's parsing. It makes the htdig<->clucene interface very elegant.

Thanks
--
Neal Richter
Sr. Researcher and Machine Learning Lead
Software Development
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
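
For readers who have not looked at the 4.0 branch, here is a minimal sketch of the kind of per-document field map described above. It is not the code from the branch: the names `FieldEntry`, `DocumentFields`, `add_field` and `indexed_field_names` are hypothetical, the append-on-repeat behaviour is an assumption, and no CLucene calls are shown.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical record for one field: its text plus the two flags
// mentioned in the post (index/noindex and store/nostore).
struct FieldEntry {
    std::string text;  // field text collected while parsing
    bool index;        // true = index (searchable), false = noindex
    bool store;        // true = store the text, false = nostore
};

// Field name -> entry. One such map would be filled per parsed document.
using DocumentFields = std::map<std::string, FieldEntry>;

// Add a field, or append text if the field was already seen.
// Assumption: on a repeated name the flags from the first call are kept.
void add_field(DocumentFields& doc, const std::string& name,
               const std::string& text, bool index, bool store) {
    assert(!name.empty());  // a field needs a name
    auto it = doc.find(name);
    if (it == doc.end()) {
        doc.emplace(name, FieldEntry{text, index, store});
    } else {
        it->second.text += " " + text;
    }
}

// Names of the fields flagged for indexing, in the map's sorted key order.
// An indexer backend could walk the map the same way to build its document.
std::vector<std::string> indexed_field_names(const DocumentFields& doc) {
    std::vector<std::string> names;
    for (const auto& kv : doc) {
        if (kv.second.index) {
            names.push_back(kv.first);
        }
    }
    return names;
}
```

In this sketch the parser only calls `add_field` as it encounters content, and the indexing side only iterates the finished map, so neither side needs to know about the other's internals.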