From: Michael S. <sta...@us...> - 2005-10-21 04:16:23
Update of /cvsroot/archive-access/archive-access/projects/nutch/bin
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv21415/bin

Modified Files:
	nutch
Log Message:
Implement '[ 1309781 ] Add in skipping certain types if > size' for Dan.
* bin/nutch
Add new col-dedup command.
* conf/nutch-site.xml.nutchwax
Remove dedup collection parameter. Not used.
* src/java/org/archive/access/nutch/CollectionDeleteDuplicates.java
A copy of nutch DeleteDuplicates that adds in hash of collection to url
and content md5. Have to make copy rather than subclass because
the original is not subclassable -- it's all private in awkward places.

Index: nutch
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/bin/nutch,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** nutch	15 Sep 2005 18:22:53 -0000	1.2
--- nutch	21 Oct 2005 04:16:12 -0000	1.3
***************
*** 39,42 ****
--- 39,43 ----
    echo "  merge       merge several segment indexes"
    echo "  dedup       remove duplicates from a set of segment indexes"
+   echo "  col-dedup   remove collection duplicates from segment indexes"
    echo "  updatedb    update db from segments after fetching"
    echo "  updatesegs  update segments with link data from the db"
***************
*** 151,154 ****
--- 152,159 ----
  elif [ "$COMMAND" = "dedup" ] ; then
    CLASS=org.apache.nutch.indexer.DeleteDuplicates
+ elif [ "$COMMAND" = "col-dedup" ] ; then
+   # Do a dedup that counts collection into url and content md5. Will
+   # ensure dedup done only within a collection.
+   CLASS=org.archive.access.nutch.CollectionDeleteDuplicates
  elif [ "$COMMAND" = "updatedb" ] ; then
    CLASS=org.apache.nutch.tools.UpdateDatabaseTool
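
The log message says the new class folds a hash of the collection into the url and content md5 so that dedup only happens within a collection. The sketch below illustrates that idea in shell; it is not the code in CollectionDeleteDuplicates.java. The function name scoped_key, the tab separator, and the sample collection names and URL are all assumptions made for illustration, and it assumes md5sum from GNU coreutils is on the PATH.

```shell
#!/bin/sh
# Hypothetical helper (not part of bin/nutch): derive a dedup key that is
# scoped to a collection. Assumes md5sum (GNU coreutils) is available.
scoped_key() {
    collection=$1
    value=$2    # a URL or a page's content
    # Hash the collection name together with the value, so equal values
    # that live in different collections produce different keys.
    printf '%s\t%s' "$collection" "$value" | md5sum | cut -d' ' -f1
}

# Illustrative inputs only.
a=$(scoped_key "collection-a" "http://example.org/")
a2=$(scoped_key "collection-a" "http://example.org/")
b=$(scoped_key "collection-b" "http://example.org/")

[ "$a" = "$a2" ] && echo "same collection: keys match, treated as duplicates"
[ "$a" != "$b" ] && echo "different collections: keys differ, both kept"
```

With keys built this way, a plain "equal key means duplicate" rule never removes a page from one collection because the same page exists in another, which matches the behaviour described in the col-dedup comment.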