From: Neal R. <ne...@ri...> - 2005-12-09 21:11:07
On Fri, 9 Dec 2005, Gustave Stresen-Reuter wrote:

> Neal,
>
> I've been reading, with interest, the posts on the blog. I have a few
> questions so far.
>
> - Is htdig a competitor to Nutch? If not, could you take a few minutes
> to clarify the differences between the two?

No! They will be complementary. I believe that HtDig is much easier to
manage and has clearer, more flexible configuration than Nutch. Nutch is
very powerful in terms of its scalability in both documents and
simultaneous searches. Nutch uses an Apache Tomcat server to service
requests. It can (and has) scaled to 200 million documents. It's written
in 100% Java as a full application built on Java Lucene. It's great.

However, I do believe that getting a Tomcat server up and running, along
with a JVM and the other associated infrastructure, is a bit beyond the
capabilities of a lot of our users. It's not quite as simple as compiling
and installing the binaries or installing a package. I may be
underestimating our users, but I base this assessment on reading the
htdig-general list.

HtDig 4.0 will be easy to configure and install, either from source or
via RPM or another package manager. It won't require a user to keep a
server daemon running. And it will continue to provide a wide variety of
flexible configuration options. The addition of the CLucene library
underneath will let HtDig scale well in the number of documents.

The way I see it, HtDig 4.0 is for the classic use case of a
site-specific search engine for modestly sized websites that don't get
tons of search hits per second. Nutch is for people who have large
document sets and/or lots of search hits per unit time and need a
multi-threaded server daemon to handle the load.

FYI: Doug Cutting, the leader of Nutch & Lucene, was one of the original
authors of Excite and has been doing IR for 15+ years. Doug's aims for
Nutch are much more ambitious.

> - What, if any, modifications to the ranking engine will be made in 4.0
> (saw the note about back-links and anchor texts - what about incoming
> links from other domains)?
>
> - It seems the goal is to create a library that can be included in
> other programs. Will the library include all the code for spidering,
> creating the indexes, and searching or just the database creation
> stuff, or something else...?

HtDig is an application for users. We are architecting 4.0 so that it
can also be used as a library in other applications. For a while KDE
used a wrapper around the htdig binaries to enable document searching.
That was a big ugly hack. I'd like to have something that anyone,
including other open source projects, can use to spider, index, and
search documents.

> - Are there any security considerations that should be addressed at
> this early stage (sanitizing of URL parameters, for example)

HtDig currently has a flexible AWK rule method for doing any URL
manipulation you can think up. I hope to provide a quick wrapper config
that will output an AWK rule to specifically strip a URL parameter (it's
already done in some PHP code I wrote; a rough sketch is in the P.S.
below).

--
Neal Richter
Sr. Researcher and Machine Learning Lead
Software Development
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
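
P.S. For the curious, here is a minimal sketch of the kind of AWK rule I'm
describing: it drops one named query parameter from every URL it reads.
The parameter name (PHPSESSID) and the stdin/stdout plumbing are purely
illustrative assumptions; the actual wrapper config, and how HtDig 4.0
would invoke such a rule, are still to be worked out.

  #!/usr/bin/awk -f
  # strip_param.awk - remove one query parameter (here "PHPSESSID")
  # from each URL read on stdin, printing the cleaned URL on stdout.
  BEGIN { FS = "?" }
  {
      if (NF < 2) { print; next }          # no query string: pass through
      n = split($2, kv, "&")
      out = ""
      for (i = 1; i <= n; i++)
          if (kv[i] !~ /^PHPSESSID=/)      # keep every other key=value pair
              out = (out == "") ? kv[i] : out "&" kv[i]
      if (out == "") print $1              # the parameter was the whole query
      else           print $1 "?" out
  }

Running something like "awk -f strip_param.awk urls.txt" prints the cleaned
list; a wrapper config would just need to fill in the parameter name.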