From: Neal R. <ne...@ri...> - 2005-12-09 21:11:07
On Fri, 9 Dec 2005, Gustave Stresen-Reuter wrote:

> Neal,
>
> I've been reading, with interest, the posts on the blog. I have a few
> questions so far.
>
> - Is htdig a competitor to Nutch? If not, could you take a few minutes
> to clarify the differences between the two?

No! They will be complementary. I believe that HtDig is much easier to
manage and has clearer, more flexible configuration than Nutch. Nutch is
very powerful in terms of its scalability in both documents and
simultaneous searches. Nutch uses an Apache Tomcat server to service
requests. It can (and has) scaled to 200 million documents. It's written
in 100% Java as a full application built on Java Lucene. It's great.

However, I do believe that getting a Tomcat server up and running, along
with a JVM and the other associated infrastructure, is a bit beyond the
capabilities of a lot of our users. It's not quite as simple as compiling
and installing the binaries or installing a package. I may be
underestimating our users, but I base this assessment on reading the
htdig-general list.

HtDig 4.0 will be easy to configure and install, either from source or
via RPM or another package manager. It won't require a user to keep a
server daemon running. And it will continue to provide a wide variety of
flexible configuration options. The addition of the CLucene library
underneath will let HtDig scale well in the number of documents.

The way I see it, HtDig 4.0 is for the classic use case of a
site-specific search engine for modestly sized websites that don't get
tons of search hits per second. Nutch is for people who have large
document sets and/or lots of search hits per unit time and need a
multi-threaded server daemon to handle the load.

FYI: Doug Cutting, the leader of Nutch & Lucene, was one of the original
authors of Excite and has been doing IR for 15+ years. Doug's aims for
Nutch are much more ambitious.

> - What, if any, modifications to the ranking engine will be made in 4.0
> (saw the note about back-links and anchor texts - what about incoming
> links from other domains)?
>
> - It seems the goal is to create a library that can be included in
> other programs. Will the library include all the code for spidering,
> creating the indexes, and searching or just the database creation
> stuff, or something else...?

HtDig is an application for users. We are architecting 4.0 so that it
can also be used as a library in other applications. For a while KDE
used a wrapper around the htdig binaries to enable document searching.
That was a big ugly hack. I'd like to have something that anyone,
including other open source projects, can use to spider, index, and
search documents.

> - Are there any security considerations that should be addressed at
> this early stage (sanitizing of URL parameters, for example)

HtDig currently has a flexible AWK rule method for doing any URL
manipulation you can think up. I hope to provide a quick wrapper config
that will output an AWK rule to specifically strip a URL parameter (it's
already done in some PHP code I wrote; a rough sketch is in the P.S.
below).

--
Neal Richter
Sr. Researcher and Machine Learning Lead
Software Development
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
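
P.S. For the curious, here is a minimal sketch of the kind of AWK rule I'm
describing: it drops one named query parameter from every URL it reads.
The parameter name (PHPSESSID) and the stdin/stdout plumbing are purely
illustrative assumptions; the actual wrapper config, and how HtDig 4.0
would invoke such a rule, are still to be worked out.

  #!/usr/bin/awk -f
  # strip_param.awk - remove one query parameter (here "PHPSESSID")
  # from each URL read on stdin, printing the cleaned URL on stdout.
  BEGIN { FS = "?" }
  {
      if (NF < 2) { print; next }          # no query string: pass through
      n = split($2, kv, "&")
      out = ""
      for (i = 1; i <= n; i++)
          if (kv[i] !~ /^PHPSESSID=/)      # keep every other key=value pair
              out = (out == "") ? kv[i] : out "&" kv[i]
      if (out == "") print $1              # the parameter was the whole query
      else           print $1 "?" out
  }

Running something like "awk -f strip_param.awk urls.txt" prints the cleaned
list; a wrapper config would just need to fill in the parameter name.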