From: <sta...@us...> - 2007-04-10 20:36:21
Revision: 1717
http://archive-access.svn.sourceforge.net/archive-access/?rev=1717&view=rev
Author: stack-sf
Date: 2007-04-10 13:36:21 -0700 (Tue, 10 Apr 2007)

Log Message:
-----------
M nutchwax/xdocs/faq.fml
M nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java
Add an example of a distributed copy from hdfs to local filesystem.

Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java
trunk/archive-access/projects/nutchwax/xdocs/faq.fml

Modified: trunk/archive-access/projects/nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java
===================================================================
--- trunk/archive-access/projects/nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java 2007-04-10 18:30:28 UTC (rev 1716)
+++ trunk/archive-access/projects/nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java 2007-04-10 20:36:21 UTC (rev 1717)
@@ -234,14 +234,27 @@
     System.out.println(" <input> Directory of input files with " +
         "each line describing task to run");
     System.out.println(" <output> Output directory.");
-    System.out.println("Examples:");
+    System.out.println("Example input lines:");
+    System.out.println();
+    System.out.println(" An input line to specify a merge would look " +
+        "like:");
+    System.out.println();
     System.out.println(" org.apache.nutch.indexer.IndexMerger " +
         "-workingdir /3/hadoop-tmp index-monday indexes-monday");
+    System.out.println();
     System.out.println(" Note that named class must implement " +
-        "org.apache.hadoop.util.ToolBase");
+        "org.apache.hadoop.util.ToolBase");
     System.out.println();
+    System.out.println(" To copy from " +
+        "hdfs://HOST:PORT/user/stack/index-monday to");
+    System.out.println( " file:///0/searcher.dir/index:");
+    System.out.println();
+    System.out.println(" org.apache.hadoop.fs.FsShell " +
+        "/user/stack/index-monday /0/searcher.dir/index");
+    System.out.println();
     System.out.println(" org.apache.nutch.indexer.IndexSorter " +
         "/home/stack/tmp/crawl");
+    System.out.println();
     System.out.println(" Note that IndexSorter refers to local " +
         "filesystem and not to hdfs and is RAM-bound. Set");
     System.out.println(" task child RAM with the mapred.child.java.opts " +

Modified: trunk/archive-access/projects/nutchwax/xdocs/faq.fml
===================================================================
--- trunk/archive-access/projects/nutchwax/xdocs/faq.fml 2007-04-10 18:30:28 UTC (rev 1716)
+++ trunk/archive-access/projects/nutchwax/xdocs/faq.fml 2007-04-10 20:36:21 UTC (rev 1717)
@@ -46,8 +46,6 @@
 </part>
-
-
 <part id="indexing">
 <title>Indexing</title>
@@ -91,11 +89,56 @@
 Run the following to see the usage:
 <pre>% ${HADOOP_HOME}/bin/hadoop jar nutchwax-job-0.11.0-SNAPSHOT.jar class org.apache.nutch.segment.SegmentMerger ~/tmp/crawl/segments_merged/ ~/tmp/crawl/segments/20070406155807-test/ ~/tmp/crawl/segments/20070406155856-test/</pre>
 </p>
+</answer>
+</faq>
+
+<faq id="sort">
+<title>How do I sort an index in NutchWAX</title>
+<question>How do I sort an index with NutchWAX</question>
+<answer><p>Sorting an index will usually return better
+quality results in less time. Most of Nutch is built into the NutchWAX jar.
+To run the nutch indexer sorter, do the following:
+<pre>$ hadoop jar nutchwax.jar class org.apache.nutch.indexer.IndexerSorter</pre>
+</p>
+<p>When the index is sorted, you might as well set the
+searcher.max.hits to, e.g., 1000, since you are getting back the top ranked
+documents and limit the number of hits someone is allowed to see to 1000.</p>
+<p>See the end of <a href="#segmentmerge">How do I merge segments in NutchWAX</a>
+for how to run multiple concurrent sorts.</p>
+</answer>
+</faq>
+
+<faq id="multiples">
+<title>How to run multiple merges/sorts/copies concurrently?</title>
+<question>How to run multiple merges/sorts/copies concurrently</question>
+<answer>
 <p>If creating multiple indices, you may want to make use of the NutchWAX facility
-that runs a mapreduce job to farm out the multiple index merges across the cluster
-so they run concurrently rather than in series. For the usage on how to run
-multiple concurrent jobs, run the following:
+that runs a mapreduce job to farm out the multiple index merges, copy from hdfs to local,
+and index sorting across the cluster so they run concurrently rather than in series. For
+the usage on how to run multiple concurrent jobs, run the following:
 <pre>stack@debord:~/workspace$ ${HADOOP_HOME}/bin/hadoop jar nutchwax.jar help multiple
+Usage: multiple <input> <output>
+Runs concurrently all commands listed in <inputs>.
+Arguments:
+ <input> Directory of input files with each line describing task to run
+ <output> Output directory.
+Example input lines:
+
+ An input line to specify a merge would look like:
+
+ org.apache.nutch.indexer.IndexMerger -workingdir /3/hadoop-tmp index-monday indexes-monday
+
+ Note that named class must implement org.apache.hadoop.util.ToolBase
+
+ To copy from hdfs://HOST:PORT/user/stack/index-monday to
+ file:///0/searcher.dir/index:
+
+ org.apache.hadoop.fs.FsShell /user/stack/index-monday /0/searcher.dir/index
+
+ org.apache.nutch.indexer.IndexSorter /home/stack/tmp/crawl
+
+ Note that IndexSorter refers to local filesystem and not to hdfs and is RAM-bound. Set
+ task child RAM with the mapred.child.java.opts property in your hadoop-site.xml.
 </pre>
 It takes inputs and outputs directories. The latter is usually not used but
 required by the framework. The inputs directory contains files that list per line a job to
@@ -105,7 +148,14 @@
 <pre>
 org.apache.nutch.indexer.IndexMerger -workingdir /tmp index-monday indexes-monday
 </pre>
+If the inputs had a line per day of the week then we'd run seven tasks with
+each task merging a day's indices. If the cluster had 7 machines, then we'd the
+7 tasks would run concurrently.
 </p>
+<p>Here is how you would specify a copy task that copyied <code>hdfs:///user/stack/index-monday</code>
+to <code>file:///0/searcher.dir/index</code>:
+<pre>org.apache.hadoop.fs.FsShell -get /user/stack/index-monday /0/searcher.dir/index</pre>
+</p>
 <p>In a similar fashion its possible to run multiple concurrent index sorts.
 Here is an example line from the inputs:
 <pre>org.apache.nutch.indexer.IndexSorter /home/stack/tmp/crawl</pre>
@@ -117,23 +167,6 @@
 </answer>
 </faq>
-<faq id="sort">
-<title>How do I sort an index in NutchWAX</title>
-<question>How do I sort an index with NutchWAX</question>
-<answer><p>Sorting an index will usually return better
-quality results in less time. Most of Nutch is built into the NutchWAX jar.
-To run the nutch indexer sorter, do the following:
-<pre>$ hadoop jar nutchwax.jar class org.apache.nutch.indexer.IndexerSorter</pre>
-</p>
-<p>When the index is sorted, you might as well set the
-searcher.max.hits to, e.g., 1000, since you are getting back the top ranked
-documents and limit the number of hits someone is allowed to see to 1000.</p>
-<p>See the end of <a href="#segmentmerge">How do I merge segments in NutchWAX</a>
-for how to run multiple concurrent sorts.</p>
-</answer>
-</faq>
-
-
 <faq id="incremental">
 <question>Is it possible to do incremental updates?</question>
 <answer><p>Here is a sketch of how to do it for now. Later we'll add better