From: Michael S. <sta...@us...> - 2005-10-21 04:16:20
Update of /cvsroot/archive-access/archive-access/projects/nutch/conf
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv21415/conf

Modified Files:
    nutch-site.xml.nutchwax
Log Message:
Implement '[ 1309781 ] Add in skipping certain types if > size' for Dan.
* bin/nutch
Add new col-dedup command.
* conf/nutch-site.xml.nutchwax
Remove dedup collection parameter. Not used.
* src/java/org/archive/access/nutch/CollectionDeleteDuplicates.java
A copy of nutch DeleteDuplicates that adds in hash of collection to url
and content md5. Have to make copy rather than subclass because the
original is not subclassable -- it's all private in awkward places.

Index: nutch-site.xml.nutchwax
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/conf/nutch-site.xml.nutchwax,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** nutch-site.xml.nutchwax 21 Oct 2005 00:42:14 -0000 1.1
--- nutch-site.xml.nutchwax 21 Oct 2005 04:16:12 -0000 1.2
***************
*** 142,152 ****
    value is -1 which says don't skip text/html docs.</description>
  </property>
- <property>
-   <name>archive.dedup.count.collection</name>
-   <value>false</value>
-   <description>If true, when deduping, compare collection names
-   as well as URL and content-md5 deduping.
-   </description>
- </property>
- 
  </nutch-conf>
--- 142,144 ----
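
The log message says CollectionDeleteDuplicates adds a hash of the collection to the URL and content MD5 used for deduping. The following is a minimal sketch of that idea only, not the committed code: the class name, the method names, and the choice to hash the collection name together with the URL or content MD5 are all assumptions made for illustration. The point it shows is that two records with the same URL (or the same content) only collide when they also belong to the same collection.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration; NOT the code in CollectionDeleteDuplicates.java.
public class CollectionDedupKeySketch {

    // Hex-encoded MD5 over the given parts, each followed by a NUL byte so
    // that ("ab", "c") and ("a", "bc") produce different digests.
    static String md5Hex(String... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String part : parts) {
                md.update(part.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Assumed key for URL-based deduping: collection name plus URL.
    static String urlKey(String collection, String url) {
        return md5Hex(collection, url);
    }

    // Assumed key for content-based deduping: collection name plus content MD5.
    static String contentKey(String collection, String contentMd5) {
        return md5Hex(collection, contentMd5);
    }

    public static void main(String[] args) {
        String url = "http://example.org/";
        // Same URL, same collection: keys match, so one record is a duplicate.
        System.out.println(urlKey("colA", url).equals(urlKey("colA", url)));
        // Same URL, different collection: keys differ, so both records are kept.
        System.out.println(urlKey("colA", url).equals(urlKey("colB", url)));
    }
}
```

Folding the collection into the key itself, rather than consulting a switch such as the removed archive.dedup.count.collection property, would make the comparison collection-aware unconditionally; whether the committed class works this way is not confirmed by the message above.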