From: Arnone, A. <aa...@ri...> - 2005-12-09 18:23:05
I'll try to answer (or dodge) some of these questions.

- Is htdig a competitor to Nutch? If not, could you take a few minutes to clarify the differences between the two?

This is a good one for Neal to answer. I can tell you that I'm expecting the new ht://Dig to epitomize a fast, lightweight and scalable domain-specific search engine. Nutch, Omega and similar projects all have their strengths (again, maybe Neal can talk about that), but one of the big strengths of ht://Dig is the vast array of options and settings that are available to the user. While some of these are going away because they are no longer applicable, we are committed to keeping as many of the nice bells and whistles as we can.

- What, if any, modifications to the ranking engine will be made in 4.0 (saw the note about back-links and anchor texts - what about incoming links from other domains)?

The ranking engine will be moved over to CLucene. Right now, the CLucene database can contain anything we want (the API is highly extensible), and we're working on making things like backlink counts and link descriptions work efficiently. As for external domain links, that is really outside the scope of ht://Dig, since it is primarily a single-site (or small group of sites) crawler.

- It seems the goal is to create a library that can be included in other programs. Will the library include all the code for spidering, creating the indexes, and searching or just the database creation stuff, or something else...?

Creating a library is exactly what we're shooting for. It will contain the ability to spider and push documents into a CLucene database. For searching, we essentially want to be able to stick any appropriate wrapper on top of ht://Dig and be able to do searches. I've written about this on the blog, but what I'd like to do is separate the htsearch options from the htdisplay options.
Search options can be sent down to the library, and search results can be returned in some kind of XML format to the wrapper. The wrapper can do whatever it wants with the results as far as CGI and pretty-printing go.

Since we're still in beta (or alpha since I keep writing stupid bugs), we're using Luke to verify index creation and validity. Luke (http://www.getopt.org/luke/) is a toolbox designed to interact with Java Lucene indexes, but since CLucene follows the standard, we use it for our own purposes.

- Are there any security considerations that should be addressed at this early stage (sanitizing of URL parameters, for example)?

Uhh... Neal?

Anyway, I'm planning on making a tag in CVS that everyone can download and try soon. There is an htdig_4_0 branch right now, but it is lacking certain parts, namely the CLucene back end. We're working on adding CLucene to the make scripts; right now we're doing builds the hard way.

I hope this answered some of your questions, and I hope that Neal can step in and answer a few more. I've been bad about updating the blog on a regular basis, but hopefully I can get myself in gear and let everyone know the day-to-day progress. Feel free to leave comments on my posts, too.

Anthony

-----Original Message-----
From: htd...@li... [mailto:htd...@li...] On Behalf Of Gustave Stresen-Reuter
Sent: Friday, December 09, 2005 5:08 AM
To: Richter, Neal
Cc: htd...@li...
Subject: Re: [htdig-dev] htdig 4.0 updates

Neal,

I've been reading, with interest, the posts on the blog. I have a few questions so far.

- Is htdig a competitor to Nutch? If not, could you take a few minutes to clarify the differences between the two?

- What, if any, modifications to the ranking engine will be made in 4.0 (saw the note about back-links and anchor texts - what about incoming links from other domains)?

- It seems the goal is to create a library that can be included in other programs.
Will the library include all the code for spidering, creating the indexes, and searching or just the database creation stuff, or something else...?

- Are there any security considerations that should be addressed at this early stage (sanitizing of URL parameters, for example)?

I'm not a C developer, but I'm more than happy to try building the project on Linux and Mac OS X (10.3). Is there a 4.0 branch in CVS or will we have to wait for you to tag it?

Thanks for the work.

Gustave (Ted) Stresen-Reuter

On Dec 8, 2005, at 6:05 PM, Neal Richter wrote:

> Hey all,
>
> We've been making good progress on HtDig 4.0
>
> You can see the progress updates on this blog.
>
> http://htdig.blogspot.com/
>
> Thanks.
>
> --
> Neal Richter
> Sr. Researcher and Machine Learning Lead
> Software Development
> RightNow Technologies, Inc.
> Customer Service for Every Web Site
> Office: 406-522-1485
>
> -------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
> for problems? Stop! Download the new AJAX search engine that makes
> searching your log files as easy as surfing the web. DOWNLOAD SPLUNK!
> http://ads.osdn.com/?ad_id=7637&alloc_id=16865&op=click
> _______________________________________________
> ht://Dig Developer mailing list:
> htd...@li...
> List information (subscribe/unsubscribe, etc.)
> https://lists.sourceforge.net/lists/listinfo/htdig-dev