Re: [htdig-dev] Checking in

> > Neal, are you tracking the Java Lucene dev lists? There's
> > some recent discussion with respect to index interoperability
> > that may be relevant.
>
>   Not yet... just the Clucene list.  I'll have a look.

Here are some starting points, maybe worth half an eyeball:

The UTF-8 interoperability thread
http://www.mail-archive.com/jav...@lu.../msg01970.html

Interoperability with Perl Lucene
http://www.mail-archive.com/jav...@lu.../msg02187.html

Features in the approaching Java Lucene 1.9
http://www.mail-archive.com/jav...@lu.../msg02284.html

Debian & Kaffe, Redhat & GCJ
http://www.mail-archive.com/jav...@lu.../msg02092.html

>   We have been able to verify that the Java Lucene tool 'luke' is able to
> read and query the indexes produced by CLucene.  Very cool.
>
>   The names of the searchable-fields we are using at this point is likely
> different than nutch.  Might be worth a look to see how different.

As of Nutch 0.7.1, the crawler + indexer is getting close. If it had
an easy-to-configure equivalent to HtDig's "local_urls" and
""  features I think it would probably be good
enough. Running Java for these operations does not feel like such
a big deal, and maybe there would be GCJ magic to ease the pain.

The search portion is a different story and requiring Tomcat is kind of
a pain in the butt. If some miracle occurred and htdig 4.0 and nutch
were super-compatible, I could imagine wanting to use htsearch against
a nutch built index. Dropping a search program into cgi-bin is really
convenient.

>   If you look at the 4.0 cvs branch, we've devised a pretty cool method of
> using an STL map container to hold the fieldname & fieldtext pairs with
> index/noindex and store/nostore flags.  These are filled per document
> during htdig's parsing.
>
>   It makes the htdig<->clucene interface very elegant.

I'm a straight C guy, so STL is a little beyond me. But I like the sound
of elegant and am tracking the blog.
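
For anyone else trying to picture the map-of-fields idea quoted above, here
is a rough sketch of what such a container might look like. This is only a
guess at the shape, not the code in the 4.0 cvs branch: the names FieldEntry,
DocFields, add_field and count_indexed are all made up for illustration, and
the append-on-repeat behaviour is an assumption.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical per-field record: the field text plus the two flags
// mentioned in the quoted message.
struct FieldEntry {
    std::string text;  // text collected for this field during parsing
    bool index;        // true = index, false = noindex
    bool store;        // true = store, false = nostore
};

// One map per document, keyed by field name (assumed layout).
using DocFields = std::map<std::string, FieldEntry>;

// Add text to a field. If the field already exists, the new text is
// appended with a separating space and the original flags are kept.
void add_field(DocFields& doc, const std::string& name,
               const std::string& text, bool index, bool store) {
    auto it = doc.find(name);
    if (it == doc.end()) {
        doc[name] = FieldEntry{text, index, store};
    } else {
        it->second.text += " " + text;
    }
}

// Count the fields that are flagged for indexing.
std::size_t count_indexed(const DocFields& doc) {
    std::size_t n = 0;
    for (const auto& kv : doc) {
        if (kv.second.index) {
            ++n;
        }
    }
    return n;
}
```

A parser would call add_field() as it walks a document, and the finished map
could then be handed across to the index writer in one pass, which is
presumably where the elegance comes from.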