From: Aaron B. <aa...@ar...> - 2011-03-17 19:37:38
Gerard Suades i Méndez <gs...@ce...> writes:

> NutchWAX 0.13 official release.

Good :)

In that case, I can say that you do *not* need the crawl_parse, crawl_data, content sub-directories of the segments. You can safely delete them to save on disk space. For example:

  $ rm -rfv segments/*/c*

Also, the NutchWAX 0.13 official release *does* have the content of the documents stored in the index (in compressed form). This means that the indexes are 100% self-contained and you do not need the segments for the live search service.

However, NutchWAX 0.13 official release does *not* perform de-duplication during indexing. That feature was added to a branch I created from NW 0.13 but has not been officially released yet. The branch is

  http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_13-JIRA-WAX-75/archive

Aaron