From: <bi...@us...> - 2010-03-18 21:55:58
Revision: 2976
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2976&view=rev
Author:   binzino
Date:     2010-03-18 21:55:45 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
Updated for NW 0.13.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt 2010-03-18 21:51:55 UTC (rev 2975)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt 2010-03-18 21:55:45 UTC (rev 2976)
@@ -1,6 +1,6 @@
 
 HOWTO-pagerank.txt
-2008-12-18
+2010-02-13
 Aaron Binns
 
 Table of Contents
@@ -30,22 +30,20 @@
 simplistic "page rank" information for scoring and sorting documents
 in the full-text search index.
 
-Nutch's 'invertlinks' step inverts links and stores them in the
-'linkdb' directory. We use these inlinks to boost the Lucene score of
-documents in proportion to the number of inlinks.
+NutchWAX's 'pagerankdb' command inverts and counts links to a page,
+storing the counts in a directory named 'pagerankdb'. This
+information is then used to update the boost values in the Lucene
+index in proportion to number of inlinks to each document.
 
 
 ======================================================================
 Generate PageRank
 ======================================================================
 
-After the Nutch 'invertlinks' step is performed, run the NutchWAX
-'pagerank' command to extract inlink information from the 'linkdb'
-
 For example
 
-  $ nutch invertlinks linkdb -dir segments
-  $ nutchwax pagerank pagerank.txt linkdb
+  $ nutchwax pagerankdb prdb -dir segments
+  $ nutchwax pagerank pagerank.txt prdb
 
 The resulting "pagerank.txt" file is a simple text file containing a
 count of the number of inlinks followed by the URL.
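The HOWTO text above describes "pagerank.txt" only as a text file holding
an inlink count followed by the URL. As a rough illustration of how such a
file might be inspected, here is a minimal Python sketch. It is not part of
NutchWAX. It assumes one record per line with the count and URL separated
by whitespace, which the HOWTO does not state, and the function names
parse_pagerank_lines and top_pages are hypothetical.

```python
from typing import Iterable, List, Tuple


def parse_pagerank_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Parse lines of the ASSUMED form '<inlink count><whitespace><URL>'."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the first run of whitespace only, keeping the rest as the URL.
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # line does not fit the assumed layout
        try:
            count = int(parts[0])
        except ValueError:
            continue  # first field is not an integer count
        records.append((count, parts[1]))
    return records


def top_pages(records: List[Tuple[int, str]], n: int = 10) -> List[Tuple[int, str]]:
    """Return the n records with the most inlinks, highest count first."""
    return sorted(records, key=lambda rec: rec[0], reverse=True)[:n]


if __name__ == "__main__":
    # Made-up sample data; real input would be the lines of pagerank.txt.
    sample = [
        "3 http://example.org/",
        "12 http://example.org/about",
        "1 http://example.org/contact",
    ]
    for count, url in top_pages(parse_pagerank_lines(sample), n=2):
        print(count, url)
```

Lines that do not match the assumed layout are skipped rather than
treated as errors, since the exact file format is not specified here.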