Update of /cvsroot/archive-access/archive-access/projects/nutch/bin
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv21415/bin
Modified Files:
nutch
Log Message:
Implement '[ 1309781 ] Add in skipping certain types if > size' for Dan.
* bin/nutch
Add new col-dedup command.
* conf/nutch-site.xml.nutchwax
Remove dedup collection parameter. Not used.
* src/java/org/archive/access/nutch/CollectionDeleteDuplicates.java
A copy of nutch DeleteDuplicates that adds in hash of collection to url
and content md5. Have to make copy rather than subclass because the
original is not subclassable -- it's all private in awkward places.
Index: nutch
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/bin/nutch,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** nutch 15 Sep 2005 18:22:53 -0000 1.2
--- nutch 21 Oct 2005 04:16:12 -0000 1.3
***************
*** 39,42 ****
--- 39,43 ----
echo " merge merge several segment indexes"
echo " dedup remove duplicates from a set of segment indexes"
+ echo " col-dedup remove collection duplicates from segment indexes"
echo " updatedb update db from segments after fetching"
echo " updatesegs update segments with link data from the db"
***************
*** 151,154 ****
--- 152,159 ----
elif [ "$COMMAND" = "dedup" ] ; then
CLASS=org.apache.nutch.indexer.DeleteDuplicates
+ elif [ "$COMMAND" = "col-dedup" ] ; then
+ # Do a dedup that counts collection into url and content md5. Will
+ # ensure dedup done only within a collection.
+ CLASS=org.archive.access.nutch.CollectionDeleteDuplicates
elif [ "$COMMAND" = "updatedb" ] ; then
CLASS=org.apache.nutch.tools.UpdateDatabaseTool
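
The log message above says CollectionDeleteDuplicates folds a hash of the collection into the url and content md5 so that dedup only happens within a collection. The Java below is a rough sketch of that idea only; it is not taken from CollectionDeleteDuplicates.java. The class name `CollectionKeySketch`, the methods `md5Hex` and `collectionKey`, and the newline separator are all hypothetical, and the real class may combine the values differently.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration; not the code in CollectionDeleteDuplicates.java.
class CollectionKeySketch {

    // Hex-encoded MD5 of a UTF-8 string.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Dedup key scoped to a collection: the same url (or content) in two
    // different collections yields two different keys, so neither copy is
    // treated as a duplicate of the other.
    // Assumption: collection names never contain a newline.
    static String collectionKey(String collection, String value) {
        return md5Hex(collection + '\n' + value);
    }

    public static void main(String[] args) {
        String url = "http://example.org/";
        String a = collectionKey("colA", url);
        String b = collectionKey("colB", url);
        System.out.println("same key across collections: " + a.equals(b));
        System.out.println("same key within a collection: "
                + a.equals(collectionKey("colA", url)));
    }
}
```

With keys built this way, a dedup pass that compares keys would leave identical pages in different collections alone while still collapsing repeats inside one collection, which is the behaviour the new col-dedup command is described as providing.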