Update of /cvsroot/archive-access/archive-access/projects/nutch/conf
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv21415/conf
Modified Files:
nutch-site.xml.nutchwax
Log Message:
Implement '[ 1309781 ] Add in skipping certain types if > size' for Dan.
* bin/nutch
Add new col-dedup command.
* conf/nutch-site.xml.nutchwax
Remove dedup collection parameter. Not used.
* src/java/org/archive/access/nutch/CollectionDeleteDuplicates.java
A copy of nutch DeleteDuplicates that adds in hash of collection to url
and content md5. Have to make a copy rather than subclass because the
original is not subclassable -- it's all private in awkward places.
Index: nutch-site.xml.nutchwax
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/conf/nutch-site.xml.nutchwax,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** nutch-site.xml.nutchwax 21 Oct 2005 00:42:14 -0000 1.1
--- nutch-site.xml.nutchwax 21 Oct 2005 04:16:12 -0000 1.2
***************
*** 142,152 ****
value is -1 which says don't skip text/html docs.</description>
</property>
- <property>
- <name>archive.dedup.count.collection</name>
- <value>false</value>
- <description>If true, when deduping, compare collection names
- as well as URL and content-md5 deduping.
- </description>
- </property>
-
</nutch-conf>
--- 142,144 ----
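
For illustration only: the log message describes CollectionDeleteDuplicates
as adding a hash of the collection to the url and content md5 used for
deduping. The sketch below shows one way such keys could be built so that
identical documents in different collections are not treated as duplicates.
It is not the code from CollectionDeleteDuplicates.java; the class name
CollectionDedupSketch, the methods md5Hex, urlKey and contentKey, and the
way the collection is mixed into the key are all assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch; names and key layout are assumptions, not NutchWAX code.
class CollectionDedupSketch {

    // Hex-encoded MD5 of a string (UTF-8 bytes).
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Dedup key for "same URL": collection hash combined with the URL hash.
    static String urlKey(String collection, String url) {
        return md5Hex(collection) + ":" + md5Hex(url);
    }

    // Dedup key for "same content": collection hash combined with content md5.
    static String contentKey(String collection, String content) {
        return md5Hex(collection) + ":" + md5Hex(content);
    }

    public static void main(String[] args) {
        String body = "<html>same page</html>";
        // Same content, same collection: keys match, so one copy is a duplicate.
        System.out.println(contentKey("colA", body).equals(contentKey("colA", body)));
        // Same content, different collections: keys differ, so both are kept.
        System.out.println(contentKey("colA", body).equals(contentKey("colB", body)));
    }
}
```

With keys built this way, whether the collection takes part in deduping is
decided by which command is run (the new col-dedup command versus the stock
dedup), which fits the removal of the unused
archive.dedup.count.collection parameter in the diff above.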