From: <sta...@us...> - 2007-02-14 02:33:12
Revision: 1489
          http://archive-access.svn.sourceforge.net/archive-access/?rev=1489&view=rev
Author:   stack-sf
Date:     2007-02-13 18:33:05 -0800 (Tue, 13 Feb 2007)

Log Message:
-----------
Part of '[ 1637951 ] [nutchwax] Redo reporting scripts as mapreduce jobs'
* conf/wax-default.xml
Override default hadoop log formatter. Turn off the purging of logs
and keep them around longer than 12 hours.
* src/java/org/archive/access/nutch/ImportArcs.java
Pass on empty split inputs.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/conf/wax-default.xml
    trunk/archive-access/projects/nutchwax/src/java/org/archive/access/nutch/ImportArcs.java

Modified: trunk/archive-access/projects/nutchwax/conf/wax-default.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/conf/wax-default.xml 2007-02-12 22:00:59 UTC (rev 1488)
+++ trunk/archive-access/projects/nutchwax/conf/wax-default.xml 2007-02-14 02:33:05 UTC (rev 1489)
@@ -154,7 +154,41 @@
 </description>
 </property>
 
+<!-- The below mapred.userlog configs. override defaults which purge
+anything beyond a 100k and anything over 12 hours old. Of note, if mapred
+is restarted, logs for tasks of same name are overwritten.
+-->
 <property>
+  <name>mapred.userlog.limit.kb</name>
+  <value>400</value>
+  <description>The maximum size of user-logs of each task.
+
+  We're using default split of 4 so 400 instead of 100 makes
+  for files of 100k each.
+  </description>
+</property>
+
+<property>
+  <name>mapred.userlog.purgesplits</name>
+  <value>false</value>
+  <description>Should the splits be purged disregarding the user-log size limit.
+
+  For now, don't purge logs. Default purges.
+  </description>
+</property>
+
+<property>
+  <name>mapred.userlog.retain.hours</name>
+  <value>168</value>
+  <description>The maximum time, in hours, for which the user-logs are to be
+  retained.
+
+
+  Keep them for a week rather than for 12 hours only, the default.
+  </description>
+</property>
+
+<property>
   <name>fetcher.store.content</name>
   <value>false</value>
   <description>If true, fetcher will store content.

Modified: trunk/archive-access/projects/nutchwax/src/java/org/archive/access/nutch/ImportArcs.java
===================================================================
--- trunk/archive-access/projects/nutchwax/src/java/org/archive/access/nutch/ImportArcs.java 2007-02-12 22:00:59 UTC (rev 1488)
+++ trunk/archive-access/projects/nutchwax/src/java/org/archive/access/nutch/ImportArcs.java 2007-02-14 02:33:05 UTC (rev 1489)
@@ -323,6 +323,10 @@
     }
 
     public void run() {
+      if (this.arcLocation == null || this.arcLocation.length() <= 0) {
+        return;
+      }
+
       ArchiveReader arc = null;
       // Need a thread that will keep updating TaskTracker during long
       // downloads else tasktracker will kill us.
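
For readers who want to see the "pass on empty split inputs" guard in isolation, here is a minimal sketch. It assumes a per-ARC worker that receives its input location as a string; the class ArcTaskSketch, the runAll helper, and the list used to record processed inputs are hypothetical stand-ins for illustration and are not part of ImportArcs.java. Only the null-or-empty check mirrors the condition in the diff above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a per-ARC worker; not the real ImportArcs code. */
final class ArcTaskSketch implements Runnable {
    private final String arcLocation;
    private final List<String> processed;

    ArcTaskSketch(String arcLocation, List<String> processed) {
        this.arcLocation = arcLocation;
        this.processed = processed;
    }

    @Override
    public void run() {
        // Same condition as in the diff: skip null or empty inputs
        // before any reader is opened.
        if (this.arcLocation == null || this.arcLocation.length() <= 0) {
            return;
        }
        // Placeholder for the real work (opening and indexing the ARC).
        this.processed.add(this.arcLocation);
    }

    /** Hypothetical helper: runs one task per location and returns the inputs that were not skipped. */
    static List<String> runAll(String... locations) {
        List<String> processed = new ArrayList<>();
        for (String location : locations) {
            new ArcTaskSketch(location, processed).run();
        }
        return processed;
    }
}
```

Calling ArcTaskSketch.runAll with a mix of real names, an empty string, and null returns only the non-empty names, so an empty line in a split is skipped quietly instead of failing the task.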