From: <sta...@us...> - 2007-04-10 18:13:42
Revision: 1714
          http://archive-access.svn.sourceforge.net/archive-access/?rev=1714&view=rev
Author:   stack-sf
Date:     2007-04-10 11:12:24 -0700 (Tue, 10 Apr 2007)

Log Message:
-----------
M nutchwax/xdocs/faq.fml
M nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java
    Add examples of how to run multiple concurrent index sorts.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java
    trunk/archive-access/projects/nutchwax/xdocs/faq.fml

Modified: trunk/archive-access/projects/nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java
===================================================================
--- trunk/archive-access/projects/nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java 2007-04-10 17:30:07 UTC (rev 1713)
+++ trunk/archive-access/projects/nutchwax/nutchwax-core/src/main/java/org/archive/access/nutch/Multiple.java 2007-04-10 18:12:24 UTC (rev 1714)
@@ -37,7 +37,10 @@
  * Takes input that has per line the name of the class to run and the arguments
  * to pass. Here is an example line for IndexMerger:
  * <code>org.apache.nutch.indexer.IndexMerger -workingdir /tmp index-new indexes
- * </code>. We run as many tasks as there are input lines.
+ * </code>. Here is one for IndexSorter:
+ * <code>org.apache.nutch.indexer.IndexSorter /home/stack/tmp/crawl</code>
+ * (Note that IndexSorter wants to refer to the local system; the indexes to
+ * sort must be on local disk). We run as many tasks as there are input lines.
  *
  * @author stack
  */
@@ -234,8 +237,16 @@
     System.out.println("Examples:");
     System.out.println(" org.apache.nutch.indexer.IndexMerger " +
       "-workingdir /3/hadoop-tmp index-monday indexes-monday");
-    System.out.println(" (Note that named class must implement " +
-      "org.apache.hadoop.util.ToolBase)");
+    System.out.println(" Note that named class must implement " +
+      "org.apache.hadoop.util.ToolBase");
+    System.out.println();
+    System.out.println(" org.apache.nutch.indexer.IndexSorter " +
+      "/home/stack/tmp/crawl");
+    System.out.println(" Note that IndexSorter refers to local " +
+      "filesystem and not to hdfs and is RAM-bound. Set");
+    System.out.println(" task child RAM with the mapred.child.java.opts " +
+      "property in your hadoop-site.xml.");
+
   }
 
   public int run(String[] args) throws Exception {

Modified: trunk/archive-access/projects/nutchwax/xdocs/faq.fml
===================================================================
--- trunk/archive-access/projects/nutchwax/xdocs/faq.fml 2007-04-10 17:30:07 UTC (rev 1713)
+++ trunk/archive-access/projects/nutchwax/xdocs/faq.fml 2007-04-10 18:12:24 UTC (rev 1714)
@@ -80,20 +80,6 @@
 </answer>
 </faq>
 
-<faq id="sort">
-<title>How do I sort an index in NutchWAX</title>
-<question>How do I sort an index with NutchWAX</question>
-<answer><p>Sorting an index will usually return better
-quality results in less time. Most of Nutch is built into the NutchWAX jar.
-To run the nutch indexer sorter, do the following:
-<pre>$ hadoop jar nutchwax.jar class org.apache.nutch.indexer.IndexerSorter</pre>
-</p>
-<p>When the index is sorted, you might as well set the
-searcher.max.hits to, e.g., 1000, since you are getting back the top ranked
-documents and limit the number of hits someone is allowed to see to 1000.</p>
-</answer>
-</faq>
-
 <faq id="segmentmerge">
 <title>How do I merge segments in NutchWAX</title>
 <question>How do I merge segments in NutchWAX</question>
@@ -118,9 +104,34 @@
 org.apache.nutch.indexer.IndexMerger -workingdir /tmp index-monday indexes-monday
 </pre>.
 </p>
+<p>In a similar fashion its possible to run multiple concurrent index sorts.
+Here is an example line from the inputs:
+<pre>org.apache.nutch.indexer.IndexSorter /home/stack/tmp/crawl</pre>
+Note that the IndexSorter references the local filesystem explicitly (Your
+index cannot be in hdfs when you run the sort). Also index sorting is RAM-bound
+so you will probably need to up the RAM allocated to task children (Set the
+mapred.child.java.opts property in your hadoop-site.xml).
+</p>
 </answer>
 </faq>
 
+<faq id="sort">
+<title>How do I sort an index in NutchWAX</title>
+<question>How do I sort an index with NutchWAX</question>
+<answer><p>Sorting an index will usually return better
+quality results in less time. Most of Nutch is built into the NutchWAX jar.
+To run the nutch indexer sorter, do the following:
+<pre>$ hadoop jar nutchwax.jar class org.apache.nutch.indexer.IndexerSorter</pre>
+</p>
+<p>When the index is sorted, you might as well set the
+searcher.max.hits to, e.g., 1000, since you are getting back the top ranked
+documents and limit the number of hits someone is allowed to see to 1000.</p>
+<p>See the end of <a href="#segmentmerge">How do I merge segments in NutchWAX</a>
+for how to run multiple concurrent sorts.</p>
+</answer>
+</faq>
+
+
 <faq id="incremental">
 <question>Is it possible to do incremental updates?</question>
 <answer><p>Here is a sketch of how to do it for now. Later we'll add better

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.