[Archive-access-cvs] SF.net SVN: archive-access:[2654] trunk/archive-access/projects/nutchwax/ arc


Revision: 2654
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2654&view=rev
Author:   binzino
Date:     2008-12-09 01:58:04 +0000 (Tue, 09 Dec 2008)

Log Message:
-----------
Added class-level javadoc description.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/plugin/scoring-nutchwax/src/java/org/archive/nutchwax/scoring/PageRankScoringFilter.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/plugin/scoring-nutchwax/src/java/org/archive/nutchwax/scoring/PageRankScoringFilter.java
===================================================================

--- trunk/archive-access/projects/nutchwax/archive/src/plugin/scoring-nutchwax/src/java/org/archive/nutchwax/scoring/PageRankScoringFilter.java	2008-12-09 01:42:08 UTC (rev 2653)
+++ trunk/archive-access/projects/nutchwax/archive/src/plugin/scoring-nutchwax/src/java/org/archive/nutchwax/scoring/PageRankScoringFilter.java	2008-12-09 01:58:04 UTC (rev 2654)
@@ -48,6 +48,58 @@
 import org.apache.nutch.scoring.ScoringFilterException;
 
 
+/**
+ * Simple scoring plugin that applies a PageRank multiple to the
+ * document score/boost during index time.  Only implements the
+ * <code>ScoringFilter</code> method associated with indexing; none of
+ * the other scoring methods are implemented.
+ * </p><p>
+ * Applies a simple log10 multiplier to the document score based on the
+ * base-10 log value of the number of inlinks.  For example, a page with
+ * 13,032 inlinks will have a multiplier of 5.  The actual formula is
+ * </p>
+ * <code>
+ *  initialScore *= ( floor( log10( # inlinks ) ) + 1 )
+ * </code>
+ * <p>
+ * We use floor() to get an integer value from the log10() function
+ * since we're only interested in order of magnitude.  We then add 1
+ * so that a page with &lt; 10 inlinks will have a multiplier of 1, and
+ * thus stay the same, 10-99 gets a multiplier of 2, 100-999 gets 3, and
+ * so forth.
+ * </p>
+ * <p>
+ * The number of inlinks for a page is not taken from the <code>inlinks</code>
+ * method parameter.  Rather a map of &lt;URL,rank&gt; values is read from
+ * an external file.  Confusing?  Yes.
+ * </p>
+ * <p>
+ * We use an external file because the <code>inlinks</code> will
+ * <strong>always</strong> be empty.  This is because the
+ * <code>linkdb</code> uses URLs where the <strong>key</strong> is not
+ * the URL but rather the URL+digest.  Thus the URLs in the
+ * <code>linkdb</code> never match the keys and Hadoop doesn't pass
+ * in the expected <code>linkdb</code> information.
+ * </p>
+ * <p>
+ * We work around this by using a NutchWAX command-line tool to
+ * extract the relevant PageRank information from the
+ * <code>linkdb</code> and write to an external file.  We then read
+ * that external file here and use the information contained therein.
+ * </p>
+ * <p>
+ * Yes, this is a hassle.  But it's the best we have right now.
+ * </p>
+ * <h2>Implementation note</h2>
+ * <p>
+ * Since the scoring plugins are used <em>only</em> during the
+ * <code>reduce</code> step during indexing, we delay the
+ * initialization of the &lt;URL,rank&gt; map until the first call to
+ * the <code>indexerScore</code> method.  This way, we don't spend the
+ * effort to read the external file when we are instantiated during
+ * the <code>map</code> phase.
+ * </p>
+ */
 public class PageRankScoringFilter implements ScoringFilter
 {
   public static final Log LOG = LogFactory.getLog( PageRankScoringFilter.class );
@@ -247,5 +299,4 @@
     return pageranks;
   }
 
-
 }
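
The multiplier formula described in the Javadoc above can be illustrated with a small standalone sketch. This is not the code from PageRankScoringFilter.java; the class name PageRankBoostSketch and the methods multiplier and boost are hypothetical, and treating a count below 1 as a multiplier of 1 is an assumption, since log10 is undefined there and the Javadoc does not say how that case is handled.

```java
/**
 * Hypothetical sketch of the formula from the Javadoc:
 *   initialScore *= ( floor( log10( # inlinks ) ) + 1 )
 * Not the actual PageRankScoringFilter implementation.
 */
final class PageRankBoostSketch {

    private PageRankBoostSketch() {
    }

    /** Returns floor(log10(inlinks)) + 1 for inlinks >= 1. */
    static int multiplier(long inlinks) {
        if (inlinks < 1) {
            // Assumption: pages with no recorded inlinks keep their score.
            return 1;
        }
        // Math.log10 is exact for powers of ten, so 10, 100, 1000, ...
        // land on the expected side of each boundary.
        return (int) Math.floor(Math.log10((double) inlinks)) + 1;
    }

    /** Applies the multiplier to an initial document score. */
    static float boost(float initialScore, long inlinks) {
        return initialScore * multiplier(inlinks);
    }
}
```

With this sketch, multiplier(13032) evaluates to 5, which matches the example in the Javadoc, and counts of 9, 10 and 100 give 1, 2 and 3 respectively.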

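The implementation note in the Javadoc (delay reading the external file until the first scoring call) might look roughly like the sketch below. Everything here is assumed for illustration: the class LazyRankMapSketch, the ReaderSource interface, the rankFor and loadCount methods, and the file layout of one "key count" pair per line separated by whitespace. The real file format and the real indexerScore signature are not shown in this commit.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of lazily loading a <key,rank> map on first use. */
final class LazyRankMapSketch {

    /** Hypothetical hook that opens the external rank file. */
    interface ReaderSource {
        Reader open() throws IOException;
    }

    private final ReaderSource source;
    private Map<String, Integer> ranks; // stays null until the first lookup
    private int loadCount = 0;

    LazyRankMapSketch(ReaderSource source) {
        this.source = source;
    }

    /** Returns the stored count for the key, or 0 if the key is unknown. */
    synchronized int rankFor(String key) {
        if (ranks == null) {
            // First call: only now pay the cost of reading the file.
            ranks = load();
        }
        Integer rank = ranks.get(key);
        return rank == null ? 0 : rank;
    }

    /** Number of times the file has been read; at most 1 in this sketch. */
    synchronized int loadCount() {
        return loadCount;
    }

    private Map<String, Integer> load() {
        Map<String, Integer> map = new HashMap<>();
        try (BufferedReader in = new BufferedReader(source.open())) {
            String line;
            while ((line = in.readLine()) != null) {
                // Assumed layout: "<key> <count>", whitespace separated.
                String[] fields = line.trim().split("\\s+");
                if (fields.length != 2) {
                    continue; // skip blank or malformed lines
                }
                try {
                    map.put(fields[0], Integer.parseInt(fields[1]));
                } catch (NumberFormatException e) {
                    // skip lines whose count is not an integer
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        loadCount++;
        return map;
    }
}
```

An instance created during the map phase never calls rankFor, so it never opens the file; only instances that actually score documents trigger the load, and they do so once.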




