From: <bi...@us...> - 2008-12-16 03:00:15
Revision: 2671
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2671&view=rev
Author:   binzino
Date:     2008-12-16 03:00:10 +0000 (Tue, 16 Dec 2008)

Log Message:
-----------
Updated documentation for 0.12.3 release.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt
    trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt
    trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
    trunk/archive-access/projects/nutchwax/archive/README.txt
    trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt    2008-12-16 02:59:10 UTC (rev 2670)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt    2008-12-16 03:00:10 UTC (rev 2671)
@@ -157,62 +157,36 @@

======================================================================
-Index
+Index and Index merging
======================================================================

-The only chage we make to the indexing step is the destination of the
-index directory.
+Perform the index step as normal, yielding an 'indexes' directory.

-By default, Nutch expects the per-segment index directory to live in a
-sub-directory called 'indexes' and the index command is accordingly
+E.g.

  $ nutch index indexes crawldb linkdb segments/*

-Resulting in an index directory structure of the form
+Then, merge the 'indexes' directory into a single Lucene index by
+invoking the Nutch 'merge' command

-  indexes/part-00000
+  $ nutch merge index indexes

-For de-duplication, we use a slightly different directory structure,
-which will be used by a de-duplication-aware NutchWaxBean at
-search-time. The directory structure we use is:
-  pindexes/<segment>/part-00000
-
-Using the segment name is not strictly required, but it is a good
-practice and is strongly recommended. This way the segment and its
-corresponding index directory are easily matched.
-
-Let's assume that the segment directory created during the import is
-named
-
-  segments/20080703050349
-
-In that case, our index command becomes:
-
-  $ nutch index pindexes/20080703050349 crawldb linkdb segments/20080703050349
-
-Upon completion, the Lucene index is created in
-
-  pindexes/20080703050349/part-0000
-
-This index is exactly the same as one normally created by Nutch, the
-only difference is the location.
-
-
======================================================================
Add Revisit Dates
======================================================================

-Now that we have the Nutch index, we add the revisit dates to it.
+Now that we have a single, merged index, we create a "parallel" index
+directory which contains the additional revisit dates.
Examine the "all.dup" file again, it has lines of the form

-  example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034
-  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800
-  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312
-  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132204
-  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132213
-  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080619132911
+  example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034
+  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800
+  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312
+  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132204
+  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132213
+  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080619132911

These are the revisit dates that need to be added to the records in the Lucene index. When we generated the index, only the date of the
@@ -220,35 +194,47 @@

As explained in README-dedup.txt, modifying the Lucene index to actually add these dates is infeasible. What we do is create a
-parallel index next to the main index (the part-00000 created above)
-that contains all the dates for each record.
+parallel index next to the merged index that contains all the dates
+for each record.

The NutchWAX 'add-dates' command creates this parallel index for us.

-  $ nutchwax add-dates pindexes/20080703050349/part-0000 \
-                       pindexes/20080703050349/part-0000 \
-                       pindexes/20080703050349/dates \
+  $ nutchwax add-dates index \
+                       index \
+                       dates \
                        all.dup

-Yes, the part-0000 argument does appear twice. This is beacuse it is
+Yes, the 'index' argument does appear twice. This is because it is
both the "key" index and the "source" index.
-
Suppose we did another crawl and had even more dates to add to the existing index. In that case we would run

-  $ nutchwax add-dates pindexes/20080703050349/part-0000 \
-                       pindexes/20080703050349/dates \
-                       pindexes/20080703050349/new-dates \
+  $ nutchwax add-dates index \
+                       dates \
+                       new-dates \
                        new-crawl.dup

-  $ rm -r pindexes/20080703050349/dates
-  $ mv pindexes/20080703050349/new-dates pindexes/20080703050349/dates
+  $ rm -r dates
+  $ mv new-dates dates

This copies the existing dates from "dates" to "new-dates" and adds additional ones from "new-crawl.dup" along the way. Then we replace the previous "dates" index with the new one.

+Now, Nutch doesn't know what to do with the extra 'dates' parallel
+index, but NutchWAX does and it requires them to be arranged
+in a directory structure of the following form:
+  pindexes/<name>/dates
+                 /index
+
+Where "name" is any name of your choosing. For example,
+
+  $ mkdir -p pindexes/200812180000
+  $ mv dates pindexes/200812180000/
+  $ mv index pindexes/200812180000/
+
+
WARC
----
This step is the same for ARCs and WARCs.
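As a rough sketch of the parallel-index idea described above (this is not NutchWAX source code; at search-time the de-duplication-aware NutchWaxBean combines the indexes itself), the Java snippet below shows how a main index and its 'dates' index could be opened together with Lucene 2.x's ParallelReader. The pindexes/200812180000 paths simply reuse the example layout above, and ParallelReader assumes both indexes hold the same documents in the same order.

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.ParallelReader;
  import org.apache.lucene.search.IndexSearcher;

  public class ParallelIndexSketch
  {
    public static void main( String[] args ) throws Exception
    {
      // Example layout from above; substitute your own <name>.
      String main  = "pindexes/200812180000/index";
      String dates = "pindexes/200812180000/dates";

      // Open the merged index and the parallel 'dates' index together.
      ParallelReader reader = new ParallelReader( );
      reader.add( IndexReader.open( main  ) );   // url, title, content, ...
      reader.add( IndexReader.open( dates ) );   // the extra revisit-date field(s)

      // Queries against this searcher see the fields of both indexes
      // as if they were stored in a single Lucene index.
      IndexSearcher searcher = new IndexSearcher( reader );
      System.out.println( "documents: " + searcher.maxDoc( ) );
      searcher.close( );
    }
  }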
@@ -318,6 +304,8 @@

  <listener>
    <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
+  </listener>
+  <listener>
    <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class>
  </listener>

Added: trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt    (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt    2008-12-16 03:00:10 UTC (rev 2671)
@@ -0,0 +1,129 @@
+
+HOWTO-pagerank.txt
+2008-12-18
+Aaron Binns
+
+Table of Contents
+  o Prerequisites
+  o Overview
+  o Generate PageRank
+  o PageRank Scoring and Boosting
+  o Configuration and Indexing
+
+
+======================================================================
+Prerequisites
+======================================================================
+
+This HOWTO assumes you've already read the main NutchWAX HOWTO and are
+familiar with importing and indexing archive files with NutchWAX.
+
+Also, we assume that you are familiar with deploying the Nutch(WAX)
+web application into a servlet container such as Tomcat.
+
+
+======================================================================
+Overview
+======================================================================
+
+NutchWAX provides a pair of tools for extracting and utilizing
+simplistic "page rank" information for scoring and sorting documents
+in the full-text search index.
+
+Nutch's 'invertlinks' step inverts links and stores them in the
+'linkdb' directory. We use the inlinks to boost the Lucene score of
+documents in proportion to the number of inlinks.
+
+
+======================================================================
+Generate PageRank
+======================================================================
+
+After the Nutch 'invertlinks' step is performed, run the NutchWAX
+'pagerank' command to extract inlink information from the 'linkdb'.
+
+For example
+
+  $ nutch invertlinks linkdb -dir segments
+  $ nutchwax pagerank pagerank.txt linkdb
+
+The resulting "pagerank.txt" file is a simple text file containing
+a count of the number of inlinks followed by the URL.
+
+  $ sort -n pagerank.txt | tail
+  367762 http://informe.presidencia.gob.mx/
+  367809 http://comovamos.presidencia.gob.mx/
+  367852 http://ocho.presidencia.gob.mx/
+  372681 http://www.gob.mx/
+  398073 http://pnd.presidencia.gob.mx/
+  399321 http://zedillo.presidencia.gob.mx/
+  496993 http://www.google-analytics.com/urchin.js
+  702448 http://www.elbalero.gob.mx/
+  703517 http://www.mexicoenlinea.gob.mx/
+  764195 http://www.brasil.gov.br
+
+In the above example, the most linked-to URL has 764195 inlinks.
+
+
+======================================================================
+PageRank Scoring and Boosting
+======================================================================
+
+During indexing, the NutchWAX PageRankScoringFilter uses the page rank
+information to boost the Lucene document scores in proportion to the
+number of inlinks.
+
+The formula used for boosting the Lucene document score is a simple
+log10()-based calculation
+
+  boost = log10( # inlinks ) + 1
+
+In Lucene, the boost is a multiplier where a boost of 1.0 means "no
+change" or "no boost" for the document score. By default, all
+documents have a boost of 1.0 unless a scoring filter changes it.
+
+Thus, we add 1 to the log10() value so that our boost scores start at
+1.0 and go up from there.
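As a quick, illustrative check of the formula above (plain Java, not part of NutchWAX), the snippet below computes the same boost values that appear in the table which follows.

  public class BoostDemo
  {
    // boost = log10( # inlinks ) + 1
    static double boost( long inlinks )
    {
      return Math.log10( (double) inlinks ) + 1.0;
    }

    public static void main( String[] args )
    {
      long[] samples = { 1, 10, 82, 100, 532, 1000, 14892 };
      for ( long n : samples )
      {
        // e.g. "    1000 inlinks -> boost 4.00"
        System.out.printf( "%8d inlinks -> boost %.2f%n", n, boost( n ) );
      }
    }
  }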
+
+The use of log10() gives us a linear boost based on the order of
+magnitude of the number of inlinks. Consider the following boost
+scores as determined by our formula:
+
+  # inlinks   boost
+          1    1.00
+         10    2.00
+         82    2.91
+        100    3.00
+        532    3.72
+       1000    4.00
+      14892    5.17
+
+A document with 1000 inlinks will have its score boosted 4x compared
+to a document with 1 inlink.
+
+
+======================================================================
+Configuration and Indexing
+======================================================================
+
+To use the PageRankScoringFilter during indexing, replace the Nutch
+OPIC scoring filter in the Nutch(WAX) configuration:
+
+nutch-site.xml
+  <property>
+    <name>plugin.includes</name>
+    <value>protocol-http|parse-(text|html|js|pdf)|index-(basic|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-nutchwax|urlfilter-nutchwax</value>
+  </property>
+
+Where we change 'scoring-opic' to 'scoring-nutchwax'.
+
+Then, when we invoke the Nutch(WAX) 'index' command, we specify the
+location of the page rank file. For example,
+
+  $ nutch index \
+      -Dnutchwax.scoringfilter.pagerank.ranks=pagerank.txt \
+      indexes \
+      linkdb \
+      crawldb \
+      segments/*
+

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt    2008-12-16 02:59:10 UTC (rev 2670)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt    2008-12-16 03:00:10 UTC (rev 2671)
@@ -1,13 +1,15 @@

HOWTO-xslt.txt
-2008-07-25
+2008-12-18
Aaron Binns

Table of Contents
  o Prerequisites
    - NutchWAX HOWTO.txt
  o Overview
+  o NutchWAX OpenSearchServlet
  o XSLTFilter and web.xml
+  o Sample


======================================================================
@@ -31,9 +33,10 @@
Servlet : OpenSearchServlet

If you read the OpenSearchServlet.java source code and the search.jsp
-page, you'll notice a lot of similarity, if not duplication of code.
+page, you'll notice a lot of similarity, if not outright duplication
+of code.

-The Internet Archive Web Team plans to improve and expand upon the
+The Internet Archive Web Team has improved and expanded upon the
existing OpenSearchServlet interface as well as adding more XML-based
capabilities, including replacements for the existing JSP pages. In
short, moving away from JSP and toward XML.
@@ -48,6 +51,21 @@

======================================================================
+NutchWAX OpenSearchServlet
+======================================================================
+
+NutchWAX contains an enhanced OpenSearch servlet which is a drop-in
+replacement for the default Nutch OpenSearch servlet. To use the
+NutchWAX implementation, modify the 'web.xml'
+
+from:
+  <servlet-class>org.apache.nutch.searcher.OpenSearchServlet</servlet-class>
+
+to:
+  <servlet-class>org.archive.nutchwax.OpenSearchServlet</servlet-class>
+
+
+======================================================================
XSLTFilter and web.xml
======================================================================
@@ -55,11 +73,11 @@
OpenSearchServlet is straightforward. Simply add the XSLTFilter to
the servlet's path and specify the XSL transform to apply.
-For example, consider the default Nutch web.xml
+For example, consider the default NutchWAX web.xml

  <servlet>
    <servlet-name>OpenSearch</servlet-name>
-    <servlet-class>org.apache.nutch.searcher.OpenSearchServlet</servlet-class>
+    <servlet-class>org.archive.nutchwax.OpenSearchServlet</servlet-class>
  </servlet>

  <servlet-mapping>
@@ -68,13 +86,13 @@
  </servlet-mapping>

Let's say we want to retain the '/opensearch' path for the XML output,
-and add the human-friendly HTML page at '/coolsearch'
+and add the human-friendly HTML page at '/search'

First, we add an additional 'servlet-mapping' for our new path:

  <servlet-mapping>
    <servlet-name>OpenSearch</servlet-name>
-    <url-pattern>/coolsearch</url-pattern>
+    <url-pattern>/search</url-pattern>
  </servlet-mapping>

Then, we add the XSLTFilter, passing it a URL to the XSLT file
@@ -93,7 +111,7 @@

  <filter-mapping>
    <filter-name>XSLT Filter</filter-name>
-    <url-pattern>/coolsearch</url-pattern>
+    <url-pattern>/search</url-pattern>
  </filter-mapping>

This way, we have two URLs, which run the exact same
@@ -101,11 +119,11 @@
output whereas the other produces human-friendly HTML output.

  OpenSearch XML      : http://someserver/opensearch?query=foo
-  Human-friendly HTML : http://someserver/coolsearch?query=foo
+  Human-friendly HTML : http://someserver/search?query=foo


======================================================================
-Samples
+Sample
======================================================================

You can find sample 'web.xml' and 'search.xsl' files in

Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt    2008-12-16 02:59:10 UTC (rev 2670)
+++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt    2008-12-16 03:00:10 UTC (rev 2671)
@@ -1,6 +1,6 @@

INSTALL.txt
-2008-10-01
+2008-12-18
Aaron Binns

This installation guide assumes the reader is already familiar with
@@ -43,7 +43,7 @@
-------------
As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
Nutch doesn't have a 1.0 release package yet, so we have to use the
-Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.2 is
+Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.3 is
built against is:

  701524

Modified: trunk/archive-access/projects/nutchwax/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README.txt    2008-12-16 02:59:10 UTC (rev 2670)
+++ trunk/archive-access/projects/nutchwax/archive/README.txt    2008-12-16 03:00:10 UTC (rev 2671)
@@ -1,9 +1,9 @@

README.txt
-2008-10-01
+2008-12-18
Aaron Binns

-Welcome to NutchWAX 0.12.2!
+Welcome to NutchWAX 0.12.3!

NutchWAX is a set of add-ons to Nutch in order to index and search
archived web data.
@@ -60,6 +60,15 @@
    Filtering plugin which can be used to exclude URLs from import. It
    can be used as part of a NutchWAX de-duplication scheme.

+  plugins/scoring-nutchwax
+
+    Scoring plugin for use at index-time which reads from an external
+    "pagerank.txt" file for scoring documents based on the log10 of the
+    number of inlinks to a document.
+
+    The use of this plugin is optional but can improve the quality of
+    search results, especially for very large collections.
+
  conf/nutch-site.xml

    Sample configuration properties file showing suggested settings for
@@ -131,6 +140,4 @@

    contentMetadata.set( NutchWax.DATE_KEY, meta.getDate() );
    ...
-  ======================================================================
-

Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt    2008-12-16 02:59:10 UTC (rev 2670)
+++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt    2008-12-16 03:00:10 UTC (rev 2671)
@@ -1,9 +1,9 @@

RELEASE-NOTES.TXT
-2008-10-13
+2008-12-18
Aaron Binns

-Release notes for NutchWAX 0.12.2
+Release notes for NutchWAX 0.12.3

For the most recent updates and information on NutchWAX, please visit
the project wiki at:
@@ -15,9 +15,14 @@
Overview
======================================================================

-NutchWAX 0.12.2 contains some minor enhancements and fixes to NutchWAX
-0.12.1.
+NutchWAX 0.12.3 contains numerous enhancements and fixes to 0.12.2
+  o PageRank calculation and scoring
+  o Enhanced OpenSearchServlet
+  o Improved XSLT sample for OpenSearch
+  o System init.d script for searcher slaves
+  o Enhanced searcher slave aware of NutchWAX extensions
+

======================================================================
Issues
======================================================================
@@ -28,23 +33,6 @@

Issues resolved in this release:

-WAX-19
-  Add strict/loose option to DateAdder for revisit lines with extra
-  data on end.
-
-WAX-21
-  Allow for blank lines and comment lines in manifest file.
-
-WAX-22
-  Various code clean-ups based on code review using PMD tool.
-
-WAX-23
-  Add a "field setter" filter to set a field to a static value in the
-  Lucene document during indexing.
-
-WAX-24
-  DateAdder fails due to uncaught exception in URL canonicalization
-
-WAX-25
-  Add utility/tool to dump unique values of a field in an index.
-
+WAX-26
+  Add XML elements containing all search URL params for self-link
+  generation
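To make the page-rank handling a little more concrete, the sketch below shows one way a ranks file in the format produced by 'nutchwax pagerank' (an inlink count followed by a URL on each line, as in HOWTO-pagerank.txt above) could be loaded and turned into a log10-based boost, mirroring what the scoring-nutchwax plugin is described as doing. It is an illustration only; the PageRankTable class and its methods are made up for this example and are not part of NutchWAX.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.util.HashMap;
  import java.util.Map;

  public class PageRankTable
  {
    private final Map<String,Long> inlinkCounts = new HashMap<String,Long>( );

    // Each non-blank line of the ranks file is: "<inlink count> <url>"
    public PageRankTable( String path ) throws Exception
    {
      BufferedReader in = new BufferedReader( new FileReader( path ) );
      String line;
      while ( ( line = in.readLine( ) ) != null )
      {
        line = line.trim( );
        if ( line.length( ) == 0 ) continue;

        String[] fields = line.split( "\\s+", 2 );
        inlinkCounts.put( fields[1], Long.parseLong( fields[0] ) );
      }
      in.close( );
    }

    // Same log10-based formula as in HOWTO-pagerank.txt; URLs that are
    // not listed in the ranks file get the default boost of 1.0.
    public float getBoost( String url )
    {
      Long count = inlinkCounts.get( url );
      if ( count == null || count.longValue( ) < 1 ) return 1.0f;
      return (float) ( Math.log10( count.doubleValue( ) ) + 1.0 );
    }
  }

For example, new PageRankTable( "pagerank.txt" ).getBoost( "http://www.brasil.gov.br" ) would return roughly 6.88 for the 764195-inlink entry shown earlier.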