From: Aaron B. <aa...@ar...> - 2011-08-28 21:50:44
Jon Walton <jon...@gm...> writes:

> I use NutchWAX to index WARC file content for analysis. I need to fix
> it to get around the JDK u23 gzip problem, but I noticed that
> development seems to have died. Is everyone using other solutions now
> such as Solr? If so, care to share any details?

It's not quite entirely dead, but pretty close to it. I can't speak for
everyone, but many (former) users of NutchWAX are in some state of
migration to a Solr-based implementation. IMO, the main challenge in
moving to Solr is replacing the NutchWAX 'import' step -- reading the
documents from (W)ARC files.

There is a branch on the public NutchWAX Subversion tree that has a fix
to handle the JDK u23 gzip change. This branch also contains a few
customizations specific to the way NutchWAX is still used in a few
deployments with particular needs. YMMV. The branch is:

http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_13-JIRA-WAX-75/archive

As for the future of NutchWAX, well, consider that NutchWAX has three
essential pieces:

1. Import: read (W)ARC files, get the documents, parse them, and
   extract the text, metadata, links, etc.
2. Index: read the output of step 1, perform text analysis and
   manipulation, and index with Solr/Lucene/etc.
3. Search/query: the live search service.

For us at the Archive, we still use NutchWAX for step 1, albeit with
modifications that are particular to our deployments.

As for step 2, we are now using some custom MapReduce code I wrote
which can index documents either directly with Lucene or push them
over the wire into a Solr server. That project can be found at:

https://github.com/aaronbinns/jbs

And as for step 3, some folks have moved on to Solr; at the Archive we
use a custom Lucene-based Java web application, purpose-built for
searching archived web pages:

https://github.com/aaronbinns/tnh

My plan is to replace NutchWAX for step 1. At the Archive, we have an
in-development "access" library which can read (W)ARCs and has lots of
goodies for doing things with (W)ARCs at scale in Hadoop. The idea is
that Wayback, full-text search, and other web access projects will all
use that core library. It's still a ways off, though.

Hope that helps,

Aaron

--
Aaron Binns
Senior Software Engineer, Web Group, Internet Archive
Program Officer, IIPC
aa...@ar...
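
For the "index" piece described in step 2 above, a minimal sketch of pushing
one parsed document over the wire into Solr with SolrJ (circa Solr 3.x, which
uses CommonsHttpSolrServer) might look like the following. The Solr URL and
the field names are assumptions for illustration only; they are not taken
from the jbs project or any particular schema.

    import java.io.IOException;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrPushSketch
    {
      public static void main( String[] args ) throws IOException, SolrServerException
      {
        // Hypothetical Solr endpoint; substitute your own server URL.
        SolrServer solr = new CommonsHttpSolrServer( "http://localhost:8983/solr" );

        // One document per archived capture.  Field names here (id, url,
        // date, content) are placeholders -- use whatever your schema defines.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField( "id",      "http://example.org/page.html 20110828000000" );
        doc.addField( "url",     "http://example.org/page.html" );
        doc.addField( "date",    "20110828000000" );
        doc.addField( "content", "text extracted from the (W)ARC record" );

        solr.add( doc );    // queue the document on the server
        solr.commit();      // make it visible to searches
      }
    }

In a MapReduce setting the add() calls would typically be batched in the
reducer and committed once at the end rather than per document, but the
wire-level interaction with Solr is the same.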