[Archive-access-cvs] SF.net SVN: archive-access: [1717] trunk/archive-access/projects/nutchwax

Revision: 1717
          http://archive-access.svn.sourceforge.net/archive-access/?rev=1717&view=rev
Author:   stack-sf
Date:     2007-04-10 13:36:21 -0700 (Tue, 10 Apr 2007)

Log Message:
-----------

M    nutchwax/xdocs/faq.fml
M    nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java
    Add an example of a distributed copy from hdfs to local filesystem.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java
    trunk/archive-access/projects/nutchwax/xdocs/faq.fml

Modified: trunk/archive-access/projects/nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java
===================================================================

--- trunk/archive-access/projects/nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java	2007-04-10 18:30:28 UTC (rev 1716)
+++ trunk/archive-access/projects/nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java	2007-04-10 20:36:21 UTC (rev 1717)
@@ -234,14 +234,27 @@
 		System.out.println(" <input>   Directory of input files with " +
 			"each line describing task to run");
 		System.out.println(" <output>  Output directory.");
-        System.out.println("Examples:");
+        System.out.println("Example input lines:");
+        System.out.println();
+        System.out.println(" An input line to specify a merge would look " +
+             "like:");
+        System.out.println();
         System.out.println(" org.apache.nutch.indexer.IndexMerger " +
             "-workingdir /3/hadoop-tmp index-monday indexes-monday");
+        System.out.println();
         System.out.println(" Note that named class must implement " +
-            "org.apache.hadoop.util.ToolBase");
+        "org.apache.hadoop.util.ToolBase");
         System.out.println();
+        System.out.println(" To copy from " +
+            "hdfs://HOST:PORT/user/stack/index-monday to");
+        System.out.println(" file:///0/searcher.dir/index:");
+        System.out.println();
+        System.out.println(" org.apache.hadoop.fs.FsShell -get " +
+            "/user/stack/index-monday /0/searcher.dir/index");
+        System.out.println();
         System.out.println(" org.apache.nutch.indexer.IndexSorter " +
             "/home/stack/tmp/crawl"); 
+        System.out.println();
         System.out.println(" Note that IndexSorter refers to local " +
             "filesystem and not to hdfs and is RAM-bound. Set");
         System.out.println(" task child RAM with the mapred.child.java.opts " +

Modified: trunk/archive-access/projects/nutchwax/xdocs/faq.fml
===================================================================
--- trunk/archive-access/projects/nutchwax/xdocs/faq.fml	2007-04-10 18:30:28 UTC (rev 1716)
+++ trunk/archive-access/projects/nutchwax/xdocs/faq.fml	2007-04-10 20:36:21 UTC (rev 1717)
@@ -46,8 +46,6 @@
 
 </part>
 
-
-
   <part id="indexing">
     <title>Indexing</title>
 
@@ -91,11 +89,56 @@
 Run the following to see the usage:
 <pre>% ${HADOOP_HOME}/bin/hadoop jar nutchwax-job-0.11.0-SNAPSHOT.jar class org.apache.nutch.segment.SegmentMerger ~/tmp/crawl/segments_merged/ ~/tmp/crawl/segments/20070406155807-test/ ~/tmp/crawl/segments/20070406155856-test/</pre>
 </p>
+</answer>
+</faq>
+
+<faq id="sort">
+<title>How do I sort an index in NutchWAX</title>
+<question>How do I sort an index with NutchWAX</question>
+<answer><p>Sorting an index will usually return better
+quality results in less time.  Most of Nutch is built into the NutchWAX jar.
+To run the nutch indexer sorter, do the following:
+<pre>$ hadoop jar nutchwax.jar class org.apache.nutch.indexer.IndexSorter</pre>
+</p>
+<p>Once the index is sorted, you might as well set
+searcher.max.hits to, e.g., 1000: the top ranked documents come back first,
+so this limits the number of hits anyone is allowed to see to 1000.</p>
+<p>See <a href="#multiples">How to run multiple merges/sorts/copies concurrently?</a>
+for how to run multiple concurrent sorts.</p>
+</answer>
+</faq>
+
+<faq id="multiples">
+<title>How to run multiple merges/sorts/copies concurrently?</title>
+<question>How to run multiple merges/sorts/copies concurrently?</question>
+<answer>
 <p>If creating multiple indices, you may want to make use of the NutchWAX facility
-that runs a mapreduce job to farm out the multiple index merges across the cluster
-so they run concurrently rather than in series.  For the usage on how to run 
-multiple concurrent jobs, run the following:
+that runs a mapreduce job to farm out multiple index merges, copies from hdfs to the
+local filesystem, and index sorts across the cluster so they run concurrently rather than
+in series.  For usage on running multiple concurrent jobs, run the following:
 <pre>stack@debord:~/workspace$ ${HADOOP_HOME}/bin/hadoop jar nutchwax.jar help multiple
+Usage: multiple &lt;input&gt; &lt;output&gt;
+Runs concurrently all commands listed in &lt;inputs&gt;.
+Arguments:
+ &lt;input&gt;   Directory of input files with each line describing task to run
+ &lt;output&gt;  Output directory.
+Example input lines:
+
+ An input line to specify a merge would look like:
+
+ org.apache.nutch.indexer.IndexMerger -workingdir /3/hadoop-tmp index-monday indexes-monday
+
+ Note that named class must implement org.apache.hadoop.util.ToolBase
+
+ To copy from hdfs://HOST:PORT/user/stack/index-monday to
+ file:///0/searcher.dir/index:
+
+ org.apache.hadoop.fs.FsShell -get /user/stack/index-monday /0/searcher.dir/index
+
+ org.apache.nutch.indexer.IndexSorter /home/stack/tmp/crawl
+
+ Note that IndexSorter refers to local filesystem and not to hdfs and is RAM-bound. Set
+ task child RAM with the mapred.child.java.opts property in your hadoop-site.xml.
 </pre>
 It takes inputs and outputs directories. The latter is usually not used but required
 by the framework.  The inputs directory contains files that list per line a job to
@@ -105,7 +148,14 @@
 <pre>
 org.apache.nutch.indexer.IndexMerger -workingdir /tmp index-monday indexes-monday
 </pre>
+If the inputs had a line per day of the week then we'd run seven tasks with
+each task merging a day's indices.  If the cluster had 7 machines, then the
+7 tasks would run concurrently.
 </p>
+<p>Here is how you would specify a copy task that copies <code>hdfs:///user/stack/index-monday</code>
+to <code>file:///0/searcher.dir/index</code>:
+<pre>org.apache.hadoop.fs.FsShell -get /user/stack/index-monday /0/searcher.dir/index</pre>
+</p>
 <p>In a similar fashion it's possible to run multiple concurrent index sorts.
 Here is an example line from the inputs:
 <pre>org.apache.nutch.indexer.IndexSorter /home/stack/tmp/crawl</pre>
@@ -117,23 +167,6 @@
 </answer>
 </faq>
 
-<faq id="sort">
-<title>How do I sort an index in NutchWAX</title>
-<question>How do I sort an index with NutchWAX</question>
-<answer><p>Sorting an index will usually return better
-quality results in less time.  Most of Nutch is built into the NutchWAX jar.
-To run the nutch indexer sorter, do the following:
-<pre>$ hadoop jar nutchwax.jar class org.apache.nutch.indexer.IndexerSorter</pre>
-</p>
-<p>When the index is sorted, you might as well set the
-searcher.max.hits to, e.g., 1000, since you are getting back the top ranked
-documents and limit the number of hits someone is allowed to see to 1000.</p>
-<p>See the end of <a href="#segmentmerge">How do I merge segments in NutchWAX</a>
-for how to run multiple concurrent sorts.</p>
-</answer>
-</faq>
-
-
 <faq id="incremental">
 <question>Is it possible to do incremental updates?</question>
 <answer><p>Here is a sketch of how to do it for now.  Later we'll add better
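
Note on the input-line format documented in this change: each line of an inputs file names a class followed by that class's arguments. The following is a minimal sketch of that shape only. It assumes a line is split on whitespace into a class name and an argument array, which the diff above does not show; the class TaskLineSketch and its methods are hypothetical and are not part of Multiple.java or NutchWAX.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only: not part of NutchWAX.
public class TaskLineSketch {

    // First whitespace-separated token: the class the task would run.
    static String className(String line) {
        return line.trim().split("\\s+")[0];
    }

    // Remaining tokens: the arguments that would be handed to that class.
    static String[] args(String line) {
        String[] tokens = line.trim().split("\\s+");
        return Arrays.copyOfRange(tokens, 1, tokens.length);
    }

    public static void main(String[] argv) {
        // The copy example line from the FAQ text in this change.
        String line = "org.apache.hadoop.fs.FsShell -get "
            + "/user/stack/index-monday /0/searcher.dir/index";
        System.out.println(className(line));
        System.out.println(Arrays.toString(args(line)));
    }
}
```

Under this assumption, a path containing whitespace could not be expressed on an input line; whether the real tool has that limitation is not something this diff confirms.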

