[Archive-access-cvs] SF.net SVN: archive-access: [1714] trunk/archive-access/projects/nutchwax


Revision: 1714
          http://archive-access.svn.sourceforge.net/archive-access/?rev=1714&view=rev
Author:   stack-sf
Date:     2007-04-10 11:12:24 -0700 (Tue, 10 Apr 2007)

Log Message:
-----------

M    nutchwax/xdocs/faq.fml
M    nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java
    Add examples of how to run multiple concurrent index sorts.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java
    trunk/archive-access/projects/nutchwax/xdocs/faq.fml

Modified: trunk/archive-access/projects/nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java
===================================================================

--- trunk/archive-access/projects/nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java	2007-04-10 17:30:07 UTC (rev 1713)
+++ trunk/archive-access/projects/nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java	2007-04-10 18:12:24 UTC (rev 1714)
@@ -37,7 +37,10 @@
  * Takes input that has per line the name of the class to run and the arguments
  * to pass.  Here is an example line for IndexMerger:
  * <code>org.apache.nutch.indexer.IndexMerger -workingdir /tmp index-new indexes
- * </code>. We run as many tasks as there are input lines.
+ * </code>. Here is one for IndexSorter:
+ * <code>org.apache.nutch.indexer.IndexSorter /home/stack/tmp/crawl</code>
+ * (Note that IndexSorter wants to refer to the local system; the indexes to
+ * sort must be on local disk). We run as many tasks as there are input lines.
  * 
  * @author stack
  */
@@ -234,8 +237,16 @@
         System.out.println("Examples:");
         System.out.println(" org.apache.nutch.indexer.IndexMerger " +
             "-workingdir /3/hadoop-tmp index-monday indexes-monday");
-        System.out.println(" (Note that named class must implement " +
-            "org.apache.hadoop.util.ToolBase)");
+        System.out.println(" Note that named class must implement " +
+            "org.apache.hadoop.util.ToolBase");
+        System.out.println();
+        System.out.println(" org.apache.nutch.indexer.IndexSorter " +
+            "/home/stack/tmp/crawl"); 
+        System.out.println(" Note that IndexSorter refers to local " +
+            "filesystem and not to hdfs and is RAM-bound. Set");
+        System.out.println(" task child RAM with the mapred.child.java.opts " +
+                "property in your hadoop-site.xml.");
+        
 	}
 	
 	public int run(String[] args) throws Exception {

Modified: trunk/archive-access/projects/nutchwax/xdocs/faq.fml
===================================================================
--- trunk/archive-access/projects/nutchwax/xdocs/faq.fml	2007-04-10 17:30:07 UTC (rev 1713)
+++ trunk/archive-access/projects/nutchwax/xdocs/faq.fml	2007-04-10 18:12:24 UTC (rev 1714)
@@ -80,20 +80,6 @@
 </answer>
 </faq>
 
-<faq id="sort">
-<title>How do I sort an index in NutchWAX</title>
-<question>How do I sort an index with NutchWAX</question>
-<answer><p>Sorting an index will usually return better
-quality results in less time.  Most of Nutch is built into the NutchWAX jar.
-To run the nutch indexer sorter, do the following:
-<pre>$ hadoop jar nutchwax.jar class org.apache.nutch.indexer.IndexerSorter</pre>
-</p>
-<p>When the index is sorted, you might as well set the
-searcher.max.hits to, e.g., 1000, since you are getting back the top ranked
-documents and limit the number of hits someone is allowed to see to 1000.</p>
-</answer>
-</faq>
-
 <faq id="segmentmerge">
 <title>How do I merge segments in NutchWAX</title>
 <question>How do I merge segments in NutchWAX</question>
@@ -118,9 +104,34 @@
 org.apache.nutch.indexer.IndexMerger -workingdir /tmp index-monday indexes-monday
 </pre>.
 </p>
+<p>In a similar fashion, it's possible to run multiple concurrent index sorts.
+Here is an example line from the inputs:
+<pre>org.apache.nutch.indexer.IndexSorter /home/stack/tmp/crawl</pre>
+Note that the IndexSorter references the local filesystem explicitly (your
+index cannot be in hdfs when you run the sort).  Also, index sorting is RAM-bound,
+so you will probably need to raise the RAM allocated to task children (set the
+mapred.child.java.opts property in your hadoop-site.xml).
+</p>
 </answer>
 </faq>
 
+<faq id="sort">
+<title>How do I sort an index in NutchWAX</title>
+<question>How do I sort an index with NutchWAX</question>
+<answer><p>Sorting an index will usually return better
+quality results in less time.  Most of Nutch is built into the NutchWAX jar.
+To run the nutch indexer sorter, do the following:
+<pre>$ hadoop jar nutchwax.jar class org.apache.nutch.indexer.IndexerSorter</pre>
+</p>
+<p>When the index is sorted, you might as well set the
+searcher.max.hits to, e.g., 1000, since you are getting back the top ranked
+documents and limit the number of hits someone is allowed to see to 1000.</p>
+<p>See the end of <a href="#segmentmerge">How do I merge segments in NutchWAX</a>
+for how to run multiple concurrent sorts.</p>
+</answer>
+</faq>
+
+
 <faq id="incremental">
 <question>Is it possible to do incremental updates?</question>
 <answer><p>Here is a sketch of how to do it for now.  Later we'll add better
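
The FAQ change above says to raise task child RAM through the mapred.child.java.opts property in hadoop-site.xml. A minimal sketch of such an entry follows. The -Xmx1024m heap value is an assumption for illustration and does not come from this commit; choose a size that your task nodes can actually spare.

```xml
<!-- hadoop-site.xml fragment; place inside the <configuration> element. -->
<!-- The heap size below is an assumed example value, not from the commit. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```

The setting applies to every task child JVM, so it presumably needs to be sized with the number of concurrent tasks per node in mind.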

