Revision: 2680
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2680&view=rev
Author:   bradtofel
Date:     2009-01-29 23:52:10 +0000 (Thu, 29 Jan 2009)

Log Message:
-----------
BUGFIX(ACC-58): was not adding DateRangeFilter for UrlPrefix queries.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java	2008-12-18 19:12:47 UTC (rev 2679)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java	2009-01-29 23:52:10 UTC (rev 2680)
@@ -372,6 +372,7 @@
 			filter.addFilter(drFilter);
 		} else if(type == TYPE_URL) {
 			filter.addFilter(new UrlPrefixMatchFilter(keyUrl));
+			filter.addFilter(drFilter);
 		} else {
 			throw new BadQueryException("Unknown type");
 		}

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
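The one-line fix above restores symmetry between the two query branches: both the exact-URL and URL-prefix cases must apply the date-range filter. A minimal sketch of that pattern follows; the `RecordFilter`/`CompositeFilter` types and `buildFilter` helper are simplified stand-ins for illustration, not the actual Wayback classes.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for Wayback's filter types; names are illustrative only.
interface RecordFilter {
    boolean accepts(String url, String date);
}

class CompositeFilter implements RecordFilter {
    private final List<RecordFilter> filters = new ArrayList<>();

    void addFilter(RecordFilter f) { filters.add(f); }

    // A record passes only if every filter in the chain accepts it.
    public boolean accepts(String url, String date) {
        for (RecordFilter f : filters) {
            if (!f.accepts(url, date)) return false;
        }
        return true;
    }
}

public class Acc58Sketch {
    // Build the query filter chain; the bug was that drFilter was added only
    // in the exact-URL branch, so URL-prefix queries ignored date ranges.
    static CompositeFilter buildFilter(boolean prefixQuery, String keyUrl,
                                       String start, String end) {
        CompositeFilter filter = new CompositeFilter();
        // Dates as sortable 14-digit-style strings, so string comparison works.
        RecordFilter drFilter = (url, date) ->
            date.compareTo(start) >= 0 && date.compareTo(end) <= 0;
        if (!prefixQuery) {
            filter.addFilter((url, date) -> url.equals(keyUrl));
            filter.addFilter(drFilter);
        } else {
            filter.addFilter((url, date) -> url.startsWith(keyUrl));
            filter.addFilter(drFilter); // the ACC-58 fix: apply here as well
        }
        return filter;
    }

    public static void main(String[] args) {
        CompositeFilter f = buildFilter(true, "http://example.com/", "2005", "2007");
        System.out.println(f.accepts("http://example.com/a", "2006"));
        System.out.println(f.accepts("http://example.com/a", "2009"));
    }
}
```

Without the second `addFilter(drFilter)` call, the prefix branch returns every capture of matching URLs regardless of the requested date range, which is exactly the symptom the bugfix addresses.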
From: <bi...@us...> - 2008-12-18 19:12:56
Revision: 2679
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2679&view=rev
Author:   binzino
Date:     2008-12-18 19:12:47 +0000 (Thu, 18 Dec 2008)

Log Message:
-----------
Make NutchWAX 0.12.3 release tag.

Added Paths:
-----------
    tags/nutchwax-0_12_3/
    tags/nutchwax-0_12_3/archive/

Property changes on: tags/nutchwax-0_12_3/archive
___________________________________________________________________
Added: svn:mergeinfo
   +
From: <bi...@us...> - 2008-12-18 18:37:45
Revision: 2678
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2678&view=rev
Author:   binzino
Date:     2008-12-18 18:37:40 +0000 (Thu, 18 Dec 2008)

Log Message:
-----------
Updated documentation for 0.12.3 release.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt
    trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
    trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
    trunk/archive-access/projects/nutchwax/archive/README.txt
    trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt

Added: trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt	                        (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt	2008-12-18 18:37:40 UTC (rev 2678)
@@ -0,0 +1,392 @@
+
+BUILD-NOTES.txt
+2008-12-18
+Aaron Binns
+
+======================================================================
+Build notes
+======================================================================
+
+This document contains supplemental notes regarding the NutchWAX
+build, expanding upon the information found in the various READMEs and
+HOWTOs.
+
+======================================================================
+
+This 0.12.x release of NutchWAX is radically different in source-code
+form compared to the previous release, 0.10.
+
+One of the design goals of 0.12.x was to reduce or even eliminate the
+"copy/paste/edit" approach of 0.10.  The 0.10 (and prior) NutchWAX
+releases had to copy/paste/edit large chunks of Nutch source code in
+order to add the NutchWAX features.
+
+Also, the NutchWAX 0.12.x sources and build are designed to one day be
+added into mainline Nutch as a proper "contrib" package; then
+eventually be fully integrated into the core Nutch source code.
+
+======================================================================
+
+Most of the NutchWAX source code is relatively straightforward to those
+already familiar with the inner workings of Nutch.  Still, special
+attention on one class is worthwhile:
+
+  src/java/org/archive/nutchwax/Importer.java
+
+This is where ARC/WARC files are read and their documents are imported
+into a Nutch segment.
+
+It is inspired by:
+
+  nutch/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java
+
+on the Nutch SVN head.
+
+Our implementation differs in a few important ways:
+
+ o Rather than taking a directory with ARC files as input, we take a
+   manifest file with URLs to ARC files.  This way, the manifest is
+   split up among the distributed Hadoop jobs and the ARC files are
+   processed in whole by each worker.
+
+   In the Nutch SVN, the ArcSegmentCreator.java expects the input
+   directory to contain the ARC files and (AFAICT) splits them up and
+   distributes them across the Hadoop workers.
+
+ o We use the standard Internet Archive ARCReader and WARCReader
+   classes.  Thus, NutchWAX can read both ARC and WARC files, whereas
+   the ArcSegmentCreator class can only read ARC files.
+
+ o We add metadata fields to the document, which are then available
+   to the "index-nutchwax" plugin at indexing-time.
+
+     Importer.importRecord()
+       ...
+       contentMetadata.set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype() );
+       contentMetadata.set( NutchWax.ARCNAME_KEY, meta.getArcFile().getName() );
+       contentMetadata.set( NutchWax.COLLECTION_KEY, collectionName );
+       contentMetadata.set( NutchWax.DATE_KEY, meta.getDate() );
+       ...
+
+
+======================================================================
+Patching
+======================================================================
+
+When NutchWAX is built, a number of patches are automatically applied
+to the Nutch source and configuration files.
+
+----------------------------------------------------------------------
+The file
+
+  /opt/nutchwax-0.12.3/conf/tika-mimetypes.xml
+
+contains two errors: one where a mimetype is referenced before it is
+defined; and a second where a definition has an illegal character.
+
+These errors cause Nutch to not recognize certain mimetypes and
+therefore will ignore documents matching those mimetypes.
+
+There are two fixes:
+
+ 1. Move
+
+      <mime-type type="application/xml">
+        <alias type="text/xml" />
+        <glob pattern="*.xml" />
+      </mime-type>
+
+    definition higher up in the file, before the reference to it.
+
+ 2. Remove
+
+      <mime-type type="application/x-ms-dos-executable">
+        <alias type="application/x-dosexec;exe" />
+      </mime-type>
+
+    as the ';' character is illegal according to the comments in the
+    Nutch code.
+
+You can either apply these patches yourself, or copy an already-patched
+copy from:
+
+  /opt/nutchwax-0.12.3/contrib/archive/conf/tika-mimetypes.xml
+
+to
+
+  /opt/nutchwax-0.12.3/conf/tika-mimetypes.xml
+
+----------------------------------------------------------------------
+
+In the file 'conf/nutch-site.xml' we define some properties to
+over-ride the values in 'conf/nutch-default.xml'.
+
+--------------------------------------------------
+plugin.includes
+--------------------------------------------------
+Change the list of plugins from:
+
+  protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
+
+to
+
+  protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax
+
+In short, we add:
+
+  index-nutchwax
+  query-nutchwax
+  urlfilter-nutchwax
+  parse-pdf
+
+and remove:
+
+  urlfilter-regex
+  urlnormalizer-(pass|regex|basic)
+
+The only *required* changes are the additions of the NutchWAX index
+and query plugins.  The rest are optional, but recommended.
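Nutch evaluates `plugin.includes` as a regular expression over plugin ids, so a quick way to sanity-check an edited value is to match candidate ids against it with ordinary `java.util.regex` (this check is an editor's illustration, not a Nutch API):

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    public static void main(String[] args) {
        // The NutchWAX plugin.includes value recommended in the notes above.
        Pattern includes = Pattern.compile(
            "protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)"
          + "|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic"
          + "|urlfilter-nutchwax");

        // Plugins the change is meant to activate...
        String[] wanted  = { "index-nutchwax", "query-nutchwax",
                             "urlfilter-nutchwax", "parse-pdf" };
        // ...and plugins the change is meant to drop.
        String[] dropped = { "urlfilter-regex", "urlnormalizer-basic" };

        for (String id : wanted)
            System.out.println(id + " -> " + includes.matcher(id).matches());
        for (String id : dropped)
            System.out.println(id + " -> " + includes.matcher(id).matches());
    }
}
```

A plugin is activated only when its id matches the whole pattern, which is why `matches()` (full-string match) rather than `find()` is the right check here.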
+
+The "parse-pdf" plugin is added simply because we have lots of PDFs in
+our archives and we want to index them.  We sometimes remove the
+"parse-js" plugin if we don't care to index JavaScript files.
+
+We also remove the default Nutch URL filtering and normalizing plugins
+because we do not need the URLs normalized nor filtered.  We trust
+that the tool that produced the ARC/WARC file will have normalized the
+URLs contained therein according to its own rules so there's no need
+to normalize here.  Also, we don't filter by URL since we want to
+index as much of the ARC/WARC file as we have parsers for.
+
+We do, however, add the NutchWAX URL filter.  If de-duplication is
+being performed upon import, this plugin is required.  It performs URL
+filtering of the list of ARC records to exclude based on
+URL+digest+date.
+
+--------------------------------------------------
+indexingfilter.order
+--------------------------------------------------
+
+Add this property with a value of
+
+  org.apache.nutch.indexer.basic.BasicIndexingFilter
+  org.archive.nutchwax.index.ConfigurableIndexingFilter
+
+So that the NutchWAX indexing filter is run after the Nutch basic
+indexing filter.
+
+A full explanation is given in "README-dedup.txt".
+
+--------------------------------------------------
+mime.type.magic
+--------------------------------------------------
+We disable mimetype detection in Nutch for two reasons:
+
+1. The ARC/WARC file specifies the Content-Type of the document.  We
+   trust that the tool that created the ARC/WARC file got it right.
+
+2. The implementation in Nutch can use a lot of memory as the *entire*
+   document is read into memory as a byte[], then converted to a
+   String, then checked against the MIME database.  This can lead to
+   out of memory errors for large files, such as music and video.
+
+To disable, simply set the property value to false.
+
+  <property>
+    <name>mime.type.magic</name>
+    <value>false</value>
+  </property>
+
+--------------------------------------------------
+nutchwax.filter.index
+--------------------------------------------------
+Configure the 'index-nutchwax' plugin.  Specify how the metadata
+fields added by the Importer are mapped to the Lucene documents during
+indexing.
+
+The specifications here are of the form:
+
+  src-key:lowercase:store:tokenize:exclusive:dest-key
+
+where the only required part is the "src-key", the rest will assume
+the following defaults:
+
+  lowercase = true
+  store     = true
+  tokenize  = false
+  exclusive = true
+  dest-key  = src-key
+
+We recommend:
+
+<property>
+  <name>nutchwax.filter.index</name>
+  <value>
+    url:false:true:true
+    url:false:true:false:true:exacturl
+    orig:false
+    digest:false
+    filename:false
+    fileoffset:false
+    collection
+    date
+    type
+    length
+  </value>
+</property>
+
+The "url", "orig" and "digest" values are required, the rest are
+optional, but strongly recommended.
+
+--------------------------------------------------
+nutchwax.filter.query
+--------------------------------------------------
+Configure the 'query-nutchwax' plugin.  Specify which fields to make
+searchable via "field:[term|phrase]" query syntax, and whether they
+are "raw" fields or not.
+
+The specification format is one of:
+
+  field:<name>:<boost>
+  raw:<name>:<lowercase>:<boost>
+  group:<name>:<lowercase>:<delimiter>:<boost>
+
+Default values are
+
+  lowercase = true
+  delimiter = ","
+  boost     = 1.0f
+
+There is no "lowercase" property for "field" specification because the
+Nutch FieldQueryFilter doesn't expose the option, unlike the
+RawFieldQueryFilter.
+
+The "group" fields are raw fields that can accept multiple values,
+separated by a delimiter.  Multiple values appearing in a query are
+automagically translated into required OR-groups, such as
+
+  collection:"193,221,36"  =>  +(collection:193 collection:221 collection:36)
+
+NOTE: We do *not* use this filter for handling "date" queries, there
+is a specific filter for that: DateQueryFilter
+
+We recommend:
+
+<property>
+  <name>nutchwax.filter.query</name>
+  <value>
+    raw:digest:false
+    raw:filename:false
+    raw:fileoffset:false
+    raw:exacturl:false
+    group:collection
+    group:type
+    field:anchor
+    field:content
+    field:host
+    field:title
+  </value>
+</property>
+
+
+--------------------------------------------------
+nutchwax.urlfilter.wayback.exclusions
+--------------------------------------------------
+File containing the exclusion list for importing.
+
+Normally, this is specified on the command line when the NutchWAX
+Importer is invoked.  It can be specified here if preferred.
+
+--------------------------------------------------
+nutchwax.urlfilter.wayback.canonicalizer
+--------------------------------------------------
+
+For CDX-based de-duplication, the same URL canonicalization algorithm
+must be used here as was used to generate the CDX files.
+
+The default canonicalizer in Wayback's '(w)arc-indexer' utility
+is
+
+  org.archive.wayback.util.url.AggressiveUrlCanonicalizer
+
+which is the value provided in "nutch-site.xml".
+
+If the '(w)arc-indexer' is executed with the "-i" (identity)
+command-line option, then the matching canonicalizer
+
+  org.archive.wayback.util.url.IdentityUrlCanonicalizer
+
+must be specified here.
+
+--------------------------------------------------
+nutchwax.filter.http.status
+--------------------------------------------------
+This property configures a filter with a list of ranges
+of HTTP status codes to allow.
+
+Typically, most NutchWAX implementors do not wish to import and index
+404, 500, 302 and other non-success pages.  This is an inclusion
+filter, meaning that only ARC records with an HTTP status code
+matching any of the values will be imported.
+
+There is a special "unknown" value which can be used to include ARC
+records that don't have an HTTP status code (for whatever reason).
+
+The default setting provided in nutch-site.xml is to allow any 2XX
+success code:
+
+  <property>
+    <name>nutchwax.filter.http.status</name>
+    <value>
+      200-299
+    </value>
+  </property>
+
+But some other examples are:
+
+  Allow any 2XX success code *and* redirects, use:
+    <property>
+      <name>nutchwax.filter.http.status</name>
+      <value>
+        200-299
+        300-399
+      </value>
+    </property>
+
+  Be really strict about only certain codes, use:
+    <property>
+      <name>nutchwax.filter.http.status</name>
+      <value>
+        200
+        301
+        302
+        304
+      </value>
+    </property>
+
+  Mix of ranges and specific codes, including the "unknown"
+    <property>
+      <name>nutchwax.filter.http.status</name>
+      <value>
+        Unknown
+        200
+        300-399
+      </value>
+    </property>
+
+--------------------------------------------------
+nutchwax.import.content.limit
+--------------------------------------------------
+Similar to Nutch's
+
+  file.content.limit
+  http.content.limit
+  ftp.content.limit
+
+properties, this specifies a limit on the size of a document imported
+via NutchWAX.
+
+We recommend setting this to a size compatible with the memory
+capacity of the computers performing the import.  Something in the
+1-4MB range is typical.

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt	2008-12-16 19:53:25 UTC (rev 2677)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt	2008-12-18 18:37:40 UTC (rev 2678)
@@ -31,7 +31,7 @@
 in the full-text search index.
 
 Nutch's 'invertlinks' step inverts links and stores them in the
-'linkdb' directory.  We use the inlinks to boost the Lucene score of
+'linkdb' directory.  We use these inlinks to boost the Lucene score of
 documents in proportion to the number of inlinks.

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2008-12-16 19:53:25 UTC (rev 2677)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2008-12-18 18:37:40 UTC (rev 2678)
@@ -5,9 +5,8 @@
 
 Table of Contents
   o Prerequisites
-    - Nutch(WAX) installation
+    - NutchWAX installation
     - ARC/WARC files
-  o Configuration & Patching
   o Create a manifest
   o Import, Invert and Index
   o Search
@@ -27,7 +26,7 @@
 
    This HOWTO assumes it is installed in
 
-     /opt/nutch-1.0-dev
+     /opt/nutchwax-0.12.3
 
 2. ARC/WARC files.
@@ -40,348 +39,6 @@
 
 ======================================================================
-Patching
-======================================================================
-
-The vanilla NutchWAX as built according to the INSTALL.txt guide is
-not quite ready to be used out-of-the-box.
-
-Before you can use NutchWAX, you must first patch a bug that exists in
-the current Nutch SVN head.
-
-The file
-
-  /opt/nutch-1.0-dev/conf/tika-mimetypes.xml
-
-contains two errors: one where a mimetype is referenced before it is
-defined; and a second where a definition has an illegal character.
-
-These errors cause Nutch to not recognize certain mimetypes and
-therefore will ignore documents matching those mimetypes.
-
-There are two fixes:
-
- 1. Move
-
-      <mime-type type="application/xml">
-        <alias type="text/xml" />
-        <glob pattern="*.xml" />
-      </mime-type>
-
-    definition higher up in the file, before the reference to it.
-
- 2. Remove
-
-      <mime-type type="application/x-ms-dos-executable">
-        <alias type="application/x-dosexec;exe" />
-      </mime-type>
-
-    as the ';' character is illegal according to the comments in the
-    Nutch code.
-
-You can either apply these patches yourself, or copy an already-patched
-copy from:
-
-  /opt/nutch-1.0-dev/contrib/archive/conf/tika-mimetypes.xml
-
-to
-
-  /opt/nutch-1.0-dev/conf/tika-mimetypes.xml
-
-
-======================================================================
-Configuring
-======================================================================
-
-Since we assume that you are already familiar with Nutch, then you
-should already be familiar with configuring it.  The configuration
-is mainly defined in
-
-  /opt/nutch-1.0-dev/conf/nutch-default.xml
-
-NutchWAX requires the modification of two existing properties and the
-addition of two new ones.
-
-All of the modifications described below can be found in:
-
-  /opt/nutch-1.0-dev/contrib/archive/conf/nutch-site.xml
-
-You can either apply the configuration changes yourself, or copy that
-file to
-
-  /opt/nutch-1.0-dev/conf/nutch-site.xml
-
---------------------------------------------------
-plugin.includes
---------------------------------------------------
-Change the list of plugins from:
-
-  protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
-
-to
-
-  protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax
-
-In short, we add:
-
-  index-nutchwax
-  query-nutchwax
-  urlfilter-nutchwax
-  parse-pdf
-
-and remove:
-
-  urlfilter-regex
-  urlnormalizer-(pass|regex|basic)
-
-The only *required* changes are the additions of the NutchWAX index
-and query plugins.  The rest are optional, but recommended.
-
-The "parse-pdf" plugin is added simply because we have lots of PDFs in
-our archives and we want to index them.  We sometimes remove the
-"parse-js" plugin if we don't care to index JavaScript files.
-
-We also remove the default Nutch URL filtering and normalizing plugins
-because we do not need the URLs normalized nor filtered.  We trust
-that the tool that produced the ARC/WARC file will have normalized the
-URLs contained therein according to its own rules so there's no need
-to normalize here.  Also, we don't filter by URL since we want to
-index as much of the ARC/WARC file as we have parsers for.
-
-We do, however, add the NutchWAX URL filter.  If de-duplication is
-being performed upon import, this plugin is required.  It performs URL
-filtering of the list of ARC records to exclude based on
-URL+digest+date.
-
---------------------------------------------------
-indexingfilter.order
---------------------------------------------------
-
-Add this property with a value of
-
-  org.apache.nutch.indexer.basic.BasicIndexingFilter
-  org.archive.nutchwax.index.ConfigurableIndexingFilter
-
-So that the NutchWAX indexing filter is run after the Nutch basic
-indexing filter.
-
-A full explanation is given in "README-dedup.txt".
-
---------------------------------------------------
-mime.type.magic
---------------------------------------------------
-We disable mimetype detection in Nutch for two reasons:
-
-1. The ARC/WARC file specifies the Content-Type of the document.  We
-   trust that the tool that created the ARC/WARC file got it right.
-
-2. The implementation in Nutch can use a lot of memory as the *entire*
-   document is read into memory as a byte[], then converted to a
-   String, then checked against the MIME database.  This can lead to
-   out of memory errors for large files, such as music and video.
-
-To disable, simply set the property value to false.
-
-  <property>
-    <name>mime.type.magic</name>
-    <value>false</value>
-  </property>
-
---------------------------------------------------
-nutchwax.filter.index
---------------------------------------------------
-Configure the 'index-nutchwax' plugin.  Specify how the metadata
-fields added by the Importer are mapped to the Lucene documents during
-indexing.
-
-The specifications here are of the form:
-
-  src-key:lowercase:store:tokenize:exclusive:dest-key
-
-where the only required part is the "src-key", the rest will assume
-the following defaults:
-
-  lowercase = true
-  store     = true
-  tokenize  = false
-  exclusive = true
-  dest-key  = src-key
-
-We recommend:
-
-<property>
-  <name>nutchwax.filter.index</name>
-  <value>
-    url:false:true:true
-    url:flase:true:false:true:exacturl
-    orig:false
-    digest:false
-    filename:false
-    fileoffset:false
-    collection
-    date
-    type
-    length
-  </value>
-</property>
-
-The "url", "orig" and "digest" values are required, the rest are
-optional, but strongly recommended.
-
---------------------------------------------------
-nutchwax.filter.query
---------------------------------------------------
-Configure the 'query-nutchwax' plugin.  Specify which fields to make
-searchable via "field:[term|phrase]" query syntax, and whether they
-are "raw" fields or not.
-
-The specification format is one of:
-
-  field:<name>:<boost>
-  raw:<name>:<lowercase>:<boost>
-  group:<name>:<lowercase>:<delimiter>:<boost>
-
-Default values are
-
-  lowercase = true
-  delimiter = ","
-  boost     = 1.0f
-
-There is no "lowercase" property for "field" specification because the
-Nutch FieldQueryFilter doesn't expose the option, unlike the
-RawFieldQueryFilter.
-
-The "group" fields are raw fields that can accept multiple values,
-separated by a delimiter.  Multiple values appearing in a query are
-automagically translated into required OR-groups, such as
-
-  collection:"193,221,36"  =>  +(collection:193 collection:221 collection:36)
-
-NOTE: We do *not* use this filter for handling "date" queries, there
-is a specific filter for that: DateQueryFilter
-
-We recommend:
-
-<property>
-  <name>nutchwax.filter.query</name>
-  <value>
-    raw:digest:false
-    raw:filename:false
-    raw:fileoffset:false
-    raw:exacturl:false
-    group:collection
-    group:type
-    field:anchor
-    field:content
-    field:host
-    field:title
-  </value>
-</property>
-
-
---------------------------------------------------
-nutchwax.urlfilter.wayback.exclusions
---------------------------------------------------
-File containing the exclusion list for importing.
-
-Normally, this is specified on the command line with the NutchWAX
-Importer is invoked.  It can be specified here if preferred.
-
---------------------------------------------------
-nutchwax.urlfilter.wayback.canonicalizer
---------------------------------------------------
-
-For CDX-based de-duplication, the same URL canonicalization algorithm
-must be used here as was used to generate the CDX files.
-
-The default canonicalizer in Wayback's '(w)arc-indexer' utility
-is
-
-  org.archive.wayback.util.url.AggressiveUrlCanonicalizer
-
-which is the value provided in "nutch-site.xml".
-
-If the '(w)arc-indexer' is executed with the "-i" (identity)
-command-line option, then the matching canonicalizer
-
-  org.archive.wayback.util.url.IdentityUrlCanonicalizer
-
-must be specified here.
-
---------------------------------------------------
-nutchwax.filter.http.status
---------------------------------------------------
-This property configures a filter with a list of ranges
-of HTTP status codes to allow.
-
-Typically, most NutchWAX implementors do not wish to import and index
-404, 500, 302 and other non-success pages.  This is an inclusion
-filter, meaning that only ARC records with an HTTP status code
-matching any of the values will be imported.
-
-There is a special "unknown" value which can be used to include ARC
-records that don't have an HTTP status code (for whatever reason).
-
-The default setting provided in nutch-site.xml is to allow any 2XX
-success code:
-
-  <property>
-    <name>nutchwax.filter.http.status</name>
-    <value>
-      200-299
-    </value>
-  </property>
-
-But some other examples are:
-
-  Allow any 2XX success code *and* redirects, use:
-    <property>
-      <name>nutchwax.filter.http.status</name>
-      <value>
-        200-299
-        300-399
-      </value>
-    </property>
-
-  Be really strict about only certain codes, use:
-    <property>
-      <name>nutchwax.filter.http.status</name>
-      <value>
-        200
-        301
-        302
-        304
-      </value>
-    </property>
-
-  Mix of ranges and specific codes, including the "unknown"
-    <property>
-      <name>nutchwax.filter.http.status</name>
-      <value>
-        Unknown
-        200
-        300-399
-      </value>
-    </property>
-
---------------------------------------------------
-nutchwax.import.content.limit
---------------------------------------------------
-Similar to Nutch's
-
-  file.content.limit
-  http.content.limit
-  ftp.content.limit
-
-properties, this specifies a limit on the size of a document imported
-via NutchWAX.
-
-We recommend setting this to a size compatible with the memory
-capacity of the computers performing the import.  Something in the
-1-4MB range is typical.
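The status-code syntax described in the `nutchwax.filter.http.status` notes above (ranges like `200-299`, single codes, and the special "unknown" token) is simple to model. A hypothetical sketch of such an inclusion filter follows; `HttpStatusFilterSketch` and its `allows` method are illustrative names, not the actual NutchWAX class:

```java
public class HttpStatusFilterSketch {
    // One whitespace-separated token per entry: either "lo-hi", a single
    // code, or the special "unknown" marker (case-insensitive).
    static boolean allows(String config, Integer status) {
        for (String token : config.trim().split("\\s+")) {
            if (token.equalsIgnoreCase("unknown")) {
                if (status == null) return true;  // record had no status line
            } else if (token.contains("-")) {
                String[] range = token.split("-");
                int lo = Integer.parseInt(range[0]);
                int hi = Integer.parseInt(range[1]);
                if (status != null && status >= lo && status <= hi) return true;
            } else if (status != null && status == Integer.parseInt(token)) {
                return true;
            }
        }
        return false; // inclusion filter: anything unmatched is skipped
    }

    public static void main(String[] args) {
        String config = "Unknown 200 300-399";
        System.out.println(allows(config, 200));   // listed code
        System.out.println(allows(config, 302));   // inside 300-399
        System.out.println(allows(config, 404));   // not listed
        System.out.println(allows(config, null));  // "unknown" records pass
    }
}
```

The key behavior is that the filter is inclusive: a record is imported only if its status matches some token, and records with no status at all pass only when "unknown" is listed.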
-
-
-======================================================================
 Create a manifest
 ======================================================================
@@ -411,10 +68,10 @@
 
   $ mkdir crawl
   $ cd crawl
-  $ /opt/nutch-1.0-dev/bin/nutchwax import ../manifest
-  $ /opt/nutch-1.0-dev/bin/nutch updatedb crawldb -dir segments
-  $ /opt/nutch-1.0-dev/bin/nutch invertlinks linkdb -dir segments
-  $ /opt/nutch-1.0-dev/bin/nutch index indexes crawldb linkdb segments/*
+  $ /opt/nutchwax-0.12.3/bin/nutchwax import ../manifest
+  $ /opt/nutchwax-0.12.3/bin/nutch updatedb crawldb -dir segments
+  $ /opt/nutchwax-0.12.3/bin/nutch invertlinks linkdb -dir segments
+  $ /opt/nutchwax-0.12.3/bin/nutch index indexes crawldb linkdb segments/*
 
   $ ls -F1
   crawldb/
   indexes/
@@ -439,7 +96,7 @@
 
   $ cd ../
   $ ls -F1
   crawl/
 
-  $ /opt/nutch-1.0-dev/bin/nutch org.apache.nutch.searcher.NutchBean computer
+  $ /opt/nutchwax-0.12.3/bin/nutch org.apache.nutch.searcher.NutchBean computer
 
 This calls the NutchBean to execute a simple keyword search for
 "computer".  Use whatever query term you think appears in the
@@ -450,17 +107,9 @@
 Web Deployment
 ======================================================================
 
-As users of Nutch are aware, the web application (nutch-1.0-dev.war)
-bundled with Nutch contains duplicate copies of the configuration
-files.
+The Nutch(WAX) web application is bundled with NutchWAX as
 
-So, all patches and configuration changes that we made to the
-files in
+  /opt/nutchwax-0.12.3/nutch-1.0-dev.war
 
- /opt/nutch-1.0-dev/conf
-
-will have to be duplicated in the Nutch webapp when it is deployed.
-
-This is not due to NutchWAX, this is a "feature" of regular Nutch.  I
-just thought it would be good to remind everyone since we did make
-configuration changes for NutchWAX.
+Simply deploy that web application in the same fashion as with
+Nutch.
Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt	2008-12-16 19:53:25 UTC (rev 2677)
+++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt	2008-12-18 18:37:40 UTC (rev 2678)
@@ -3,10 +3,22 @@
 2008-12-18
 Aaron Binns
 
+Table of Contents
+  o Introduction
+  o Build from source
+    - SVN: Nutch 1.0-dev
+    - SVN: NutchWAX
+    - Build and Install
+  o Install binary package
+
+
+======================================================================
+Introduction
+======================================================================
+
 This installation guide assumes the reader is already familiar with
 building, packaging and deploying Nutch 1.0-dev.
 
-
 The NutchWAX 0.12 source and build system are designed to integrate
 into the existing Nutch 1.0-dev source and build.
@@ -20,12 +32,12 @@
 proper, then builds NutchWAX components and integrates them into the
 Nutch build directory.
 
-We recommend that you execute all build commands from the NutchWAX
-directory.  This way, NutchWAX will ensure that any and all
+In order to build NutchWAX, execute all build commands from the
+NutchWAX directory.  This way, NutchWAX will ensure that any and all
 dependencies in Nutch will be properly built and kept up-to-date.
 
 Towards this goal, we have duplicated the most common build targets
-from the Nutch 'build.xml' file to the NutchWAX 'build.xml' file,
-such as:
+from the Nutch 'build.xml' file to the NutchWAX 'build.xml' file, such
+as:
 
   o compile
   o jar
@@ -39,8 +51,15 @@
 sub-directory as normal.
 
-Nutch-1.0-dev
--------------
+======================================================================
+Build from Source
+======================================================================
+
+To build from source, you must check-out the Nutch and NutchWAX sources
+from their respective 'subversion' source control servers.
+
+SVN: nutch-1.0-dev
+------------------
 As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
 Nutch doesn't have a 1.0 release package yet, so we have to use the
 Nutch SVN trunk.  The specific SVN revision that NutchWAX 0.12.3 is
@@ -53,9 +72,12 @@
 
   $ svn checkout -r 701524 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
   $ cd nutch
 
+Please be sure to check-out this specific version of the Nutch source.
+If you just grab the head of the trunk, there may be newer and
+incompatible changes to Nutch.
 
-NutchWAX
---------
+SVN: NutchWAX
+-------------
 Once you have Nutch-1.0-dev checked-out, check-out NutchWAX into
 Nutch's "contrib" directory.
@@ -65,7 +87,6 @@
 
 This will create a sub-directory named "archive" containing the
 NutchWAX sources.
 
-
 Build and install
 -----------------
 Assuming you already have the required tool-set for building Nutch,
@@ -91,3 +112,18 @@
 
   $ cd /opt
   $ tar xvfz nutch-1.0-dev.tar.gz
+  $ mv nutch-1.0-dev nutchwax-0.12.3
+
+
+======================================================================
+Install binary package
+======================================================================
+
+Alternatively, grab a "binary" release package from the Internet
+Archive's NutchWAX home page.
+
+Install it simply by untarring it, for example:
+
+  $ cd /opt
+  $ tar xvfz nutchwax-0.12.3.tar.gz
+

Modified: trunk/archive-access/projects/nutchwax/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README.txt	2008-12-16 19:53:25 UTC (rev 2677)
+++ trunk/archive-access/projects/nutchwax/archive/README.txt	2008-12-18 18:37:40 UTC (rev 2678)
@@ -3,6 +3,16 @@
 2008-12-18
 Aaron Binns
 
+Table of Contents
+  o Introduction
+  o Build and Install
+  o Tutorial
+
+
+======================================================================
+Introduction
+======================================================================
+
 Welcome to NutchWAX 0.12.3!
NutchWAX is a set of add-ons to Nutch in order to index and search @@ -17,7 +27,6 @@ Since NutchWAX is a set of add-ons to Nutch, you should already be familiar with Nutch before using NutchWAX. -====================================================================== The goal of NutchWAX is to enable full-text indexing and searching of documents stored in web archive file formats (ARC and WARC). @@ -26,13 +35,13 @@ to Nutch to read documents directly from ARC/WARC files. We call this process "importing" archive files. -Importing produces a Nutch segment, similar to Nutch crawling the -documents itself. In this scenario, document importing replaces the +Importing produces a Nutch segment, the same as when Nutch is used to +crawl documents itself. In essence, document importing replaces the conventional "generate/fetch/update" cycle of Nutch. Once the archival documents have been imported into a segment, the -regular Nutch commands to update the 'crawldb', invert the links and -index the document contents can proceed as normal. +regular Nutch commands to index the document contents can proceed as +normal. ====================================================================== @@ -71,73 +80,25 @@ conf/nutch-site.xml - Sample configuration properties file showing suggested settings for - Nutch and NutchWAX. + Additional configuration properties for NutchWAX, including + over-rides for properties defined in 'nutch-default.xml' There is no separate 'lib/nutchwax.jar' file for NutchWAX. NutchWAX is distributed in source code form and is intended to be built in conjunction with Nutch. -See "INSTALL.txt" for details on building NutchWAX and Nutch. -See "HOWTO.txt" for a quick tutorial on importing, indexing and -searching a set of documents in a web archive file. - ====================================================================== - -This 0.12.x release of NutchWAX is radically different in source-code -form compared to the previous release, 0.10. 
- -One of the design goals of 0.12.x was to reduce or even eliminate the -"copy/paste/edit" approach of 0.10. The 0.10 (and prior) NutchWAX -releases had to copy/paste/edit large chunks of Nutch source code in -order to add the NutchWAX features. - -Also, the NutchWAX 0.12.x sources and build are designed to one day be -added into mainline Nutch as a proper "contrib" package; then -eventually be fully integrated into the core Nutch source code. - +Build and Install ====================================================================== -Most of the NutchWAX source code is relatively straightfoward to those -already familiar with the inner workings of Nutch. Still, special -attention on one class is worth while: +See "INSTALL.txt" for detailed instructions to build NutchWAX from +source or install a binary package. - src/java/org/archive/nutchwax/Importer.java -This is where ARC/WARC files are read and their documents are imported -into a Nutch segment. - -It is inspired by: - - nutch/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java - -on the Nutch SVN head. - -Our implementation differs in a few important ways: - - o Rather than taking a directory with ARC files as input, we take a - manifest file with URLs to ARC files. This way, the manifest is - split up among the distributed Hadoop jobs and the ARC files are - processed in whole by each worker. - - In the Nutch SVN, the ArcSegmentCreator.java expects the input - directory to contain the ARC files and (AFAICT) splits them up and - distributes them across the Hadoop workers. - - o We use the standard Internet Archive ARCReader and WARCReader - classes. Thus, NutchWAX can read both ARC and WARC files, whereas - the ArcSegmentCreator class can only read ARC files. - - o We add metadata fields to the document, which are then available - to the "index-nutchwax" plugin at indexing-time. - - Importer.importRecord() - ... 
- contentMetadata.set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype() ); - contentMetadata.set( NutchWax.ARCNAME_KEY, meta.getArcFile().getName() ); - contentMetadata.set( NutchWax.COLLECTION_KEY, collectionName ); - contentMetadata.set( NutchWax.DATE_KEY, meta.getDate() ); - ... - ====================================================================== +Tutorial +====================================================================== + +See "HOWTO.txt" for a quick tutorial on importing, indexing and +searching a set of documents in a web archive file. Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-12-16 19:53:25 UTC (rev 2677) +++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-12-18 18:37:40 UTC (rev 2678) @@ -21,8 +21,45 @@ o Enhanced OpenSearchServlet o Improved XSLT sample for OpenSearch o System init.d script for searcher slaves - o Enhanced searcher slave aware of NutchWAX extensions + o Enhanced searcher slave which supports NutchWAX extensions + +One of the major changes to 0.12.3 is not a feature, enhancement or +bug-fix, but the way the NutchWAX source is "integrated" into the +Nutch source. + +Yes, the NutchWAX source is still kept in the contrib/archive +sub-directory, but when you invoke a build command from the +NutchWAX directory, such as + + $ cd nutch/contrib/archive + $ ant tar + +Many files from the NutchWAX source tree are copied directly into the +Nutch source tree before the build process begins. + +The reason for this is to make NutchWAX easier to use. + +In previous versions of NutchWAX, once 'ant' build command was +finished, the operator had to manually patch configuration files in +the Nutch directory. Upon a subsequent build, the files would be +over-written by Nutch's and would have to be patched again. + +It was a major hassle and complication. 
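The copy-before-build scheme described above can be sketched in miniature. This is only an illustration of the idea — the real work is done by Ant copy tasks in the NutchWAX 'build.xml' — and the directory layout below is hypothetical:

```python
import pathlib
import shutil
import tempfile

def overlay(src_root, dst_root):
    """Copy every file under src_root into dst_root, overwriting any
    duplicates -- the same net effect the NutchWAX build has when it
    copies its files into the Nutch source tree before building."""
    copied = []
    for src in src_root.rglob("*"):
        if src.is_file():
            rel = src.relative_to(src_root)
            dst = dst_root / rel
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            copied.append(str(rel))
    return sorted(copied)

# Tiny demonstration with throwaway directories (hypothetical paths).
with tempfile.TemporaryDirectory() as tmp:
    tmp = pathlib.Path(tmp)
    (tmp / "contrib/archive/src/nutch/conf").mkdir(parents=True)
    (tmp / "contrib/archive/src/nutch/conf/nutch-site.xml").write_text("<configuration/>")
    (tmp / "nutch").mkdir()
    result = overlay(tmp / "contrib/archive/src/nutch", tmp / "nutch")
    print(result)
```

Because existing files are overwritten on every build, local changes belong in the NutchWAX source tree — as the release notes advise — not in the Nutch copy.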
+ +Another impetus for copying files into the Nutch source was to patch +bugs and make enhancements in the Nutch Java code which couldn't be +effectively done keeping the sources separate. When an 'ant' build +command is run a few Java files are copied from the NutchWAX source +tree into the Nutch source tree. + +In release 0.12.3, the NutchWAX build file: 'build.xml' handles all of +this. Simply execute your build commands from 'contrib/archive' as +instructed in the HOWTO and no longer worry about patching +configuration files. If you wish to alter the NutchWAX configuration +file, make those changes in the NutchWAX source tree. + + ====================================================================== Issues ====================================================================== This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-12-16 19:53:29
Revision: 2677 http://archive-access.svn.sourceforge.net/archive-access/?rev=2677&view=rev Author: binzino Date: 2008-12-16 19:53:25 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Changed nutchwax.FetchedSegments.perCollection default value to false. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2008-12-16 19:52:42 UTC (rev 2676) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2008-12-16 19:53:25 UTC (rev 2677) @@ -144,7 +144,7 @@ --> <property> <name>nutchwax.FetchedSegments.perCollection</name> - <value>true</value> + <value>false</value> </property> <!-- The following are over-rides of property values in This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
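The 'nutchwax.FetchedSegments.perCollection' toggle changed above is an ordinary Hadoop-style name/value property. A minimal sketch of how such a pair can be read — illustrative only; Nutch actually resolves properties through Hadoop's Configuration class, and 'get_bool' is a made-up helper:

```python
import xml.etree.ElementTree as ET

SITE_XML = """
<configuration>
  <property>
    <name>nutchwax.FetchedSegments.perCollection</name>
    <value>false</value>
  </property>
</configuration>
"""

def get_bool(conf_xml, name, default=False):
    # Scan <property> elements for a matching <name> and parse its <value>.
    root = ET.fromstring(conf_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value", "").strip().lower() == "true"
    return default

per_collection = get_bool(SITE_XML, "nutchwax.FetchedSegments.perCollection")
print(per_collection)  # → False
```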
From: <bi...@us...> - 2008-12-16 19:52:45
Revision: 2676 http://archive-access.svn.sourceforge.net/archive-access/?rev=2676&view=rev Author: binzino Date: 2008-12-16 19:52:42 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Removed references to web and conf sub-dirs in "onlypack" target since they are now rolled into set of files copied into Nutch. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/build.xml Modified: trunk/archive-access/projects/nutchwax/archive/build.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/build.xml 2008-12-16 07:38:28 UTC (rev 2675) +++ trunk/archive-access/projects/nutchwax/archive/build.xml 2008-12-16 19:52:42 UTC (rev 2676) @@ -104,14 +104,6 @@ <!-- This one does a little more after calling down to the relevant Nutch target. After Nutch has copied everything into the distribution directory, we add our script, libraries, etc. - - Rather than over-write the standard Nutch configuration files, - we place ours in a newly created directory - - contrib/archive/conf - - and let the individual user decide whether or not to - incorporate our modifications. --> <target name="package" depends="jar, job, war, javadoc" > <ant dir="${nutch.dir}" target="package" inheritAll="false" /> @@ -131,22 +123,12 @@ <fileset dir="${dist.dir}/bin"/> </chmod> - <mkdir dir="${dist.dir}/contrib/archive/conf"/> - <copy todir="${dist.dir}/contrib/archive/conf"> - <fileset dir="conf" /> - </copy> - <copy todir="${dist.dir}/contrib/archive"> <fileset dir="."> <include name="*.txt" /> </fileset> </copy> - <mkdir dir="${dist.dir}/contrib/archive/web"/> - <copy todir="${dist.dir}/contrib/archive/web"> - <fileset dir="src/web" /> - </copy> - <mkdir dir="${dist.dir}/contrib/archive/etc"/> <copy todir="${dist.dir}/contrib/archive/etc"> <fileset dir="src/etc" /> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-12-16 07:38:30
Revision: 2675 http://archive-access.svn.sourceforge.net/archive-access/?rev=2675&view=rev Author: binzino Date: 2008-12-16 07:38:28 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Fixed bug in web.xml related to <listener> tags. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml 2008-12-16 06:41:44 UTC (rev 2674) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml 2008-12-16 07:38:28 UTC (rev 2675) @@ -24,6 +24,8 @@ <listener> <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class> +</listener> +<listener> <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class> </listener> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
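The bug fixed above is structural: under the Servlet 2.3 DTD a <listener> element contains exactly one <listener-class>, so the two classes had to be split into separate <listener> elements. A small illustrative check (not part of NutchWAX) that flags the broken shape:

```python
import xml.etree.ElementTree as ET

# Shape before r2675: two listener-class children in one listener.
BROKEN = """<web-app><listener>
  <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
  <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class>
</listener></web-app>"""

# Shape after r2675: one listener-class per listener element.
FIXED = """<web-app><listener>
  <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
</listener><listener>
  <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class>
</listener></web-app>"""

def listeners_valid(web_xml):
    # Valid iff every <listener> holds exactly one <listener-class>.
    root = ET.fromstring(web_xml)
    return all(len(l.findall("listener-class")) == 1
               for l in root.iter("listener"))

print(listeners_valid(BROKEN), listeners_valid(FIXED))  # → False True
```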
From: <bi...@us...> - 2008-12-16 06:41:48
Revision: 2674 http://archive-access.svn.sourceforge.net/archive-access/?rev=2674&view=rev Author: binzino Date: 2008-12-16 06:41:44 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Moved web files into src/nutch sub-tree so they will be copied into Nutch corresponding sources directories for inclusion in Nutch ant build targets. Added Paths: ----------- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/search.xsl trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml Removed Paths: ------------- trunk/archive-access/projects/nutchwax/archive/src/web/ Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/search.xsl =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/search.xsl (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/search.xsl 2008-12-16 06:41:44 UTC (rev 2674) @@ -0,0 +1,281 @@ +<?xml version="1.0" encoding="utf-8" ?> +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
+--> +<xsl:stylesheet + version="1.0" + xmlns:xsl="http://www.w3.org/1999/XSL/Transform" + xmlns:nutch="http://www.nutch.org/opensearchrss/1.0/" + xmlns:opensearch="http://a9.com/-/spec/opensearchrss/1.0/" +> +<xsl:output method="xml" /> + +<xsl:template match="rss/channel"> + <html xmlns="http://www.w3.org/1999/xhtml"> + <head> + <title><xsl:value-of select="title" /></title> + <style media="all" lang="en" type="text/css"> + body + { + padding : 20px; + margin : 0; + font-family : Verdana; sans-serif; + font-size : 9pt; + color : #000000; + background-color: #ffffff; + } + .pageTitle + { + font-size : 125% ; + font-weight : bold ; + text-align : center ; + padding-bottom : 2em ; + } + .searchForm + { + margin : 20px 0 5px 0; + padding-bottom : 0px; + border-bottom : 1px solid black; + } + .searchResult + { + margin : 0; + padding : 0; + } + .searchResult h1 + { + margin : 0 0 5px 0 ; + padding : 0 ; + font-size : 120%; + } + .searchResult .details + { + font-size: 80%; + color: green; + } + .searchResult .dates + { + font-size: 80%; + } + .searchResult .dates a + { + color: #3366cc; + } + form#searchForm + { + margin : 0; padding: 0 0 10px 0; + } + .searchFields + { + padding : 3px 0; + } + .searchFields input + { + margin : 0 0 0 15px; + padding : 0; + } + input#query + { + margin : 0; + } + ol + { + margin : 5px 0 0 0; + padding : 0 0 0 2em; + } + ol li + { + margin : 0 0 15px 0; + } + </style> + </head> + <body> + <!-- Page header: title and search form --> + <div class="pageTitle" > + NutchWAX Sample XSLT + </div> + <div> + This simple XSLT demonstrates the transformation of OpenSearch XML results into a fully-functional, human-friendly HTML search page. No JSP needed. 
+ </div> + <div class="searchForm"> + <form id="searchForm" name="searchForm" method="get" action="search" > + <span class="searchFields"> + Search for + <input id="query" name="query" type="text" size="40" value="{nutch:query}" /> + + <!-- Create hidden form fields for the rest of the URL parameters --> + <xsl:for-each select="nutch:urlParams/nutch:param[@name!='start' and @name!='query']"> + <xsl:element name="input" namespace="http://www.w3.org/1999/xhtml"> + <xsl:attribute name="type">hidden</xsl:attribute> + <xsl:attribute name="name" ><xsl:value-of select="@name" /></xsl:attribute> + <xsl:attribute name="value"><xsl:value-of select="@value" /></xsl:attribute> + </xsl:element> + </xsl:for-each> + + <input type="submit" value="Search"/> + </span> + </form> + </div> + <div style="font-size: 8pt; margin:0; padding:0 0 0.5em 0;">Results <xsl:value-of select="opensearch:startIndex + 1" />-<xsl:value-of select="opensearch:startIndex + opensearch:itemsPerPage" /> of about <xsl:value-of select="opensearch:totalResults" /> <span style="margin-left: 1em;"></span></div> + <!-- Search results --> + <ol start="{opensearch:startIndex + 1}"> + <xsl:apply-templates select="item" /> + </ol> + <!-- Generate list of page links --> + <center> + <xsl:call-template name="pageLinks"> + <xsl:with-param name="labelPrevious" select="'«'" /> + <xsl:with-param name="labelNext" select="'»'" /> + </xsl:call-template> + </center> + </body> +</html> +</xsl:template> + + +<!-- ====================================================================== + NutchWAX XSLT template/fuction library. + + The idea is that the above xhtml code is what most NutchWAX users + will modify to tailor to their own look and feel. The stuff + below implements the core logic for generating results lists, + page links, etc. + + Hopefully NutchWAX web developers will be able to easily edit the + above xhtml and css and won't have to change the below. 
+ ====================================================================== --> + +<!-- Template to emit a search result as an HTML list item (<li/>). + --> +<xsl:template match="item"> + <li> + <div class="searchResult"> + <h1><a href="{concat('http://wayback.archive-it.org/',nutch:collection,'/',nutch:date,'/',link)}"><xsl:value-of select="title" /></a></h1> + <div> + <xsl:value-of select="description" /> + </div> + <div class="details"> + <xsl:value-of select="link" /> - <xsl:value-of select="round( nutch:length div 1024 )"/>k - <xsl:value-of select="nutch:type" /> + </div> + <div class="dates"> + <a href="{concat('http://wayback.archive-it.org/',nutch:collection,'/*/',link)}">All versions</a> - <a href="?query={../nutch:query} site:{nutch:site}&hitsPerSite=0">More from <xsl:value-of select="nutch:site" /></a> + </div> + </div> + </li> +</xsl:template> + +<!-- Template to emit a date in YYYY/MM/DD format + --> +<xsl:template match="nutch:date" > + <xsl:value-of select="substring(.,1,4)" /><xsl:text>-</xsl:text><xsl:value-of select="substring(.,5,2)" /><xsl:text>-</xsl:text><xsl:value-of select="substring(.,7,2)" /><xsl:text> </xsl:text> +</xsl:template> + +<!-- Template to emit a list of numbered page links, *including* + "previous" and "next" links on either end, using the given labels. 
+ Parameters: + labelPrevious Link text for "previous page" link + labelNext Link text for "next page" link + --> +<xsl:template name="pageLinks"> + <xsl:param name="labelPrevious" /> + <xsl:param name="labelNext" /> + <!-- If we are on any page past the first, emit a "previous" link --> + <xsl:if test="(floor(opensearch:startIndex div opensearch:itemsPerPage) + 1) != 1"> + <xsl:call-template name="pageLink"> + <xsl:with-param name="pageNum" select="floor(opensearch:startIndex div opensearch:itemsPerPage)" /> + <xsl:with-param name="linkText" select="$labelPrevious" /> + </xsl:call-template> + <xsl:text> </xsl:text> + </xsl:if> + <!-- Now, emit numbered page links --> + <xsl:choose> + <xsl:when test="(floor(opensearch:startIndex div opensearch:itemsPerPage) + 1) < 11"> + <xsl:call-template name="numberedPageLinks" > + <xsl:with-param name="begin" select="1" /> + <xsl:with-param name="end" select="21" /> + <xsl:with-param name="current" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" /> + </xsl:call-template> + </xsl:when> + <xsl:otherwise> + <xsl:call-template name="numberedPageLinks" > + <xsl:with-param name="begin" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1 - 10" /> + <xsl:with-param name="end" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1 + 11" /> + <xsl:with-param name="current" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" /> + </xsl:call-template> + </xsl:otherwise> + </xsl:choose> + <!-- Lastly, emit a "next" link. --> + <xsl:text> </xsl:text> + <xsl:call-template name="pageLink"> + <xsl:with-param name="pageNum" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 2" /> + <xsl:with-param name="linkText" select="$labelNext" /> + </xsl:call-template> +</xsl:template> + +<!-- Template to emit a list of numbered links to results pages. 
+ Parameters: + begin starting # inclusive + end ending # exclusive + current the current page, don't emit a link + --> +<xsl:template name="numberedPageLinks"> + <xsl:param name="begin" /> + <xsl:param name="end" /> + <xsl:param name="current" /> + <xsl:if test="$begin < $end"> + <xsl:choose> + <xsl:when test="$begin = $current" > + <xsl:value-of select="$current" /> + </xsl:when> + <xsl:otherwise> + <xsl:call-template name="pageLink" > + <xsl:with-param name="pageNum" select="$begin" /> + <xsl:with-param name="linkText" select="$begin" /> + </xsl:call-template> + </xsl:otherwise> + </xsl:choose> + <xsl:text> </xsl:text> + <xsl:call-template name="numberedPageLinks"> + <xsl:with-param name="begin" select="$begin + 1" /> + <xsl:with-param name="end" select="$end" /> + <xsl:with-param name="current" select="$current" /> + </xsl:call-template> + </xsl:if> +</xsl:template> + +<!-- Template to emit a single page link. All of the URL parameters + listed in the OpenSearch results are included in the link. 
+ Parmeters: + pageNum page number of the link + linkText text of the link + --> +<xsl:template name="pageLink"> + <xsl:param name="pageNum" /> + <xsl:param name="linkText" /> + <xsl:element name="a" namespace="http://www.w3.org/1999/xhtml"> + <xsl:attribute name="href"> + <xsl:text>?</xsl:text> + <xsl:for-each select="nutch:urlParams/nutch:param[@name!='start']"> + <xsl:value-of select="@name" /><xsl:text>=</xsl:text><xsl:value-of select="@value" /> + <xsl:text>&</xsl:text> + </xsl:for-each> + <xsl:text>start=</xsl:text><xsl:value-of select="($pageNum -1) * opensearch:itemsPerPage" /> + </xsl:attribute> + <xsl:value-of select="$linkText" /> + </xsl:element> +</xsl:template> + +</xsl:stylesheet> Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml 2008-12-16 06:41:44 UTC (rev 2674) @@ -0,0 +1,80 @@ +<?xml version="1.0" encoding="ISO-8859-1"?> +<!DOCTYPE web-app + PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN" + "http://java.sun.com/dtd/web-app_2_3.dtd"> +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ See the License for the specific language governing permissions and + limitations under the License. +--> +<web-app> + +<!-- order is very important here --> + +<listener> + <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class> + <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class> +</listener> + +<servlet> + <servlet-name>Cached</servlet-name> + <servlet-class>org.apache.nutch.servlet.Cached</servlet-class> +</servlet> + +<servlet> + <servlet-name>OpenSearch</servlet-name> + <servlet-class>org.archive.nutchwax.OpenSearchServlet</servlet-class> +</servlet> + +<servlet-mapping> + <servlet-name>Cached</servlet-name> + <url-pattern>/servlet/cached</url-pattern> +</servlet-mapping> + +<servlet-mapping> + <servlet-name>OpenSearch</servlet-name> + <url-pattern>/opensearch</url-pattern> +</servlet-mapping> + +<servlet-mapping> + <servlet-name>OpenSearch</servlet-name> + <url-pattern>/search</url-pattern> +</servlet-mapping> + +<filter> + <filter-name>XSLT Filter</filter-name> + <filter-class>org.archive.nutchwax.XSLTFilter</filter-class> + <init-param> + <param-name>xsltUrl</param-name> + <param-value>style/search.xsl</param-value> + </init-param> +</filter> + +<filter-mapping> + <filter-name>XSLT Filter</filter-name> + <url-pattern>/search</url-pattern> +</filter-mapping> + +<welcome-file-list> + <welcome-file>search.html</welcome-file> + <welcome-file>index.html</welcome-file> + <welcome-file>index.jsp</welcome-file> +</welcome-file-list> + +<taglib> + <taglib-uri>http://jakarta.apache.org/taglibs/i18n</taglib-uri> + <taglib-location>/WEB-INF/taglibs-i18n.tld</taglib-location> + </taglib> + +</web-app> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
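The page-link arithmetic in 'search.xsl' above reduces to two formulas: the current page number is floor(startIndex div itemsPerPage) + 1, and a link's 'start' parameter is (pageNum - 1) * itemsPerPage. A quick sketch of the same math outside XSLT:

```python
def current_page(start_index, items_per_page):
    # Mirrors the XSLT: floor(opensearch:startIndex div opensearch:itemsPerPage) + 1
    return start_index // items_per_page + 1

def start_param(page_num, items_per_page):
    # Mirrors the XSLT: ($pageNum - 1) * opensearch:itemsPerPage
    return (page_num - 1) * items_per_page

print(current_page(0, 10), current_page(20, 10), start_param(3, 10))  # → 1 3 20
```

Note the two functions round-trip: the 'start' value emitted for the current page reproduces the startIndex the page was rendered from.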
From: <bi...@us...> - 2008-12-16 06:24:10
Revision: 2673 http://archive-access.svn.sourceforge.net/archive-access/?rev=2673&view=rev Author: binzino Date: 2008-12-16 06:24:01 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Moved conf sub-dir so that it's automatically copied over into Nutch directory during build. This way the NutchWAX extensions are automatically included in the Nutch build. Operators/users don't have to do hand-editing of Nutch conf files to get NutchWAX enhancements. Added Paths: ----------- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/ Removed Paths: ------------- trunk/archive-access/projects/nutchwax/archive/conf/ Property changes on: trunk/archive-access/projects/nutchwax/archive/src/nutch/conf ___________________________________________________________________ Added: svn:mergeinfo + This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-12-16 04:58:23
Revision: 2672 http://archive-access.svn.sourceforge.net/archive-access/?rev=2672&view=rev Author: binzino Date: 2008-12-16 04:58:21 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Changed to use NutchWAX OpenSearchServlet instead of Nutch's. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/web/web.xml Modified: trunk/archive-access/projects/nutchwax/archive/src/web/web.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/web/web.xml 2008-12-16 03:00:10 UTC (rev 2671) +++ trunk/archive-access/projects/nutchwax/archive/src/web/web.xml 2008-12-16 04:58:21 UTC (rev 2672) @@ -34,7 +34,7 @@ <servlet> <servlet-name>OpenSearch</servlet-name> - <servlet-class>org.apache.nutch.searcher.OpenSearchServlet</servlet-class> + <servlet-class>org.archive.nutchwax.OpenSearchServlet</servlet-class> </servlet> <servlet-mapping> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-12-16 03:00:15
Revision: 2671 http://archive-access.svn.sourceforge.net/archive-access/?rev=2671&view=rev Author: binzino Date: 2008-12-16 03:00:10 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Updated documentation for 0.12.3 release. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt trunk/archive-access/projects/nutchwax/archive/INSTALL.txt trunk/archive-access/projects/nutchwax/archive/README.txt trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt Added Paths: ----------- trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt 2008-12-16 02:59:10 UTC (rev 2670) +++ trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt 2008-12-16 03:00:10 UTC (rev 2671) @@ -157,62 +157,36 @@ ====================================================================== -Index +Index and Index merging ====================================================================== -The only chage we make to the indexing step is the destination of the -index directory. +Perform the index step as normal, yielding an 'indexes' directory. -By default, Nutch expects the per-segment index directory to live in a -sub-directory called 'indexes' and the index command is accordingly +E.g. $ nutch index indexes crawldb linkdb segments/* -Resulting in an index directory structure of the form +Then, merge the 'indexes' directory into a single Lucene index by +invoking the Nutch 'merge' command - indexes/part-00000 + $ nutch merge index indexes -For de-duplication, we use a slightly different directory structure, -which will be used by a de-duplication-aware NutchWaxBean at -search-time. 
The directory structure we use is: - pindexes/<segment>/part-00000 - -Using the segment name is not strictly required, but it is a good -practice and is strongly recommended. This way the segment and its -corresponding index directory are easily matched. - -Let's assume that the segment directory created during the import is -named - - segments/20080703050349 - -In that case, our index command becomes: - - $ nutch index pindexes/20080703050349 crawldb linkdb segments/20080703050349 - -Upon completion, the Lucene index is created in - - pindexes/20080703050349/part-0000 - -This index is exactly the same as one normally created by Nutch, the -only difference is the location. - - ====================================================================== Add Revisit Dates ====================================================================== -Now that we have the Nutch index, we add the revisit dates to it. +Now that we have a single, merged index, we create a "parallel" index +directory which contains the additional revisit dates. 
Examine the "all.dup" file again, it has lines of the form - example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132204 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132213 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080619132911 + example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034 + example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800 + example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312 + example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132204 + example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132213 + example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080619132911 These are the revisit dates that need to be added to the records in the Lucene index. When we generated the index, only the date of the @@ -220,35 +194,47 @@ As explained in README-dedup.txt, modifying the Lucene index to actually add these dates is infeasible. What we do is create a -parallel index next to the main index (the part-00000 created above) -that contains all the dates for each record. +parallel index next to the merged index that contains all the dates +for each record. The NutchWAX 'add-dates' command creates this parallel index for us. - $ nutchwax add-dates pindexes/20080703050349/part-0000 \ - pindexes/20080703050349/part-0000 \ - pindexes/20080703050349/dates \ + $ nutchwax add-dates index \ + index \ + dates \ all.dup -Yes, the part-0000 argument does appear twice. This is beacuse it is +Yes, the 'index' argument does appear twice. This is beacuse it is both the "key" index and the "source" index. 
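The 'all.dup' records shown above group naturally by URL and content digest, with one capture date per line. A small sketch of collecting the revisit dates per (URL, digest) pair — this mirrors the shape of the input that 'add-dates' consumes, not its actual implementation:

```python
from collections import defaultdict

# A few sample lines in the all.dup format: url, sha1 digest, capture date.
ALL_DUP = """\
example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034
example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800
example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312
"""

dates = defaultdict(list)
for line in ALL_DUP.splitlines():
    url, digest, date = line.split()
    dates[(url, digest)].append(date)

# Each distinct capture (same URL and digest) accumulates its revisit dates.
for key, ds in sorted(dates.items()):
    print(key[1], len(ds))
```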
- Suppose we did another crawl and had even more dates to add to the existing index. In that case we would run - $ nutchwax add-dates pindexes/20080703050349/part-0000 \ - pindexes/20080703050349/dates \ - pindexes/20080703050349/new-dates \ + $ nutchwax add-dates index \ + dates \ + new-dates \ new-crawl.dup - $ rm -r pindexes/20080703050349/dates - $ mv pindexes/20080703050349/new-dates pindexes/20080703050349/dates + $ rm -r dates + $ mv new-dates dates This copies the existing dates from "dates" to "new-dates" and adds additional ones from "new-crawl.dup" along the way. Then we replace the previous "dates" index with the new one. +Now, Nutch doesn't know what to do with the extra 'dates' parallel +index, but NutchWAX does and it requires them to be arranged +in a directory structure of the following form: + pindexes/<name>/dates + /index + +Where "name" is any name of your choosing. For example, + + $ mkdir -p pindexes/200812180000 + $ mv dates pindexes/200812180000/ + $ mv index pindexes/200812180000/ + + WARC ---- This step is the same for ARCs and WARCs. 
@@ -318,6 +304,8 @@ <listener> <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class> + </listener> + <listener> <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class> </listener> Added: trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt 2008-12-16 03:00:10 UTC (rev 2671) @@ -0,0 +1,129 @@ + +HOWTO-pagerank.txt +2008-12-18 +Aaron Binns + +Table of Contents + o Prerequisites + o Overview + o Generate PageRank + o PageRank Scoring and Boosting + o Configuration and Indexing + + +====================================================================== +Prerequisites +====================================================================== + +This HOWTO assumes you've already read the main NutchWAX HOWTO and are +familiar with importing and indexing archive files with NutchWAX. + +Also, we assume that you are familiar with deploying the Nutch(WAX) +web application into a servlet container such as Tomcat. + + +====================================================================== +Overview +====================================================================== + +NutchWAX provides a pair of tools for extracting and utilizing +simplistic "page rank" information for scoring and sorting documents +in the full-text search index. + +Nutch's 'invertlinks' step inverts links and stores them in the +'linkdb' directory. We use the inlinks to boost the Lucene score of +documents in proportion to the number of inlinks. 
+
+
+======================================================================
+Generate PageRank
+======================================================================
+
+After the Nutch 'invertlinks' step is performed, run the NutchWAX
+'pagerank' command to extract inlink information from the 'linkdb'.
+
+For example:
+
+  $ nutch invertlinks linkdb -dir segments
+  $ nutchwax pagerank pagerank.txt linkdb
+
+The resulting "pagerank.txt" file is a simple text file containing
+a count of the number of inlinks followed by the URL.
+
+  $ sort -n pagerank.txt | tail
+  367762 http://informe.presidencia.gob.mx/
+  367809 http://comovamos.presidencia.gob.mx/
+  367852 http://ocho.presidencia.gob.mx/
+  372681 http://www.gob.mx/
+  398073 http://pnd.presidencia.gob.mx/
+  399321 http://zedillo.presidencia.gob.mx/
+  496993 http://www.google-analytics.com/urchin.js
+  702448 http://www.elbalero.gob.mx/
+  703517 http://www.mexicoenlinea.gob.mx/
+  764195 http://www.brasil.gov.br
+
+In the above example, the most linked-to URL has 764195 inlinks.
+
+
+======================================================================
+PageRank Scoring and Boosting
+======================================================================
+
+During indexing, the NutchWAX PageRankScoringFilter uses the page rank
+information to boost the Lucene document's score in proportion to the
+number of inlinks.
+
+The formula used for boosting the Lucene document score is a simple
+log10()-based calculation:
+
+  boost = log10( # inlinks ) + 1
+
+In Lucene, the boost is a multiplier where a boost of 1.0 means "no
+change" or "no boost" for the document score. By default, all
+documents have a boost of 1.0 unless a scoring filter changes it.
+
+Thus, we add 1 to the log10() value so that our boost scores start at
+1.0 and go up from there.
+
+The use of log10() gives us a linear boost based on the order of
+magnitude of the number of inlinks.
Consider the following boost
+scores as determined by our formula:
+
+  # inlinks    boost
+          1     1.00
+         10     2.00
+         82     2.91
+        100     3.00
+        532     3.72
+       1000     4.00
+      14892     5.17
+
+A document with 1000 inlinks will have its score boosted 4x compared
+to a document with 1 inlink.
+
+
+======================================================================
+Configuration and Indexing
+======================================================================
+
+To use the PageRankScoringFilter during indexing, replace the Nutch
+OPIC scoring filter in the Nutch(WAX) configuration:
+
+nutch-site.xml
+  <property>
+    <name>plugin.includes</name>
+    <value>protocol-http|parse-(text|html|js|pdf)|index-(basic|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-nutchwax|urlfilter-nutchwax</value>
+  </property>
+
+Where we change 'scoring-opic' to 'scoring-nutchwax'.
+
+Then, when we invoke the Nutch(WAX) 'index' command, we specify the
+location of the page rank file. For example,
+
+  $ nutch index \
+      -Dnutchwax.scoringfilter.pagerank.ranks=pagerank.txt \
+      indexes \
+      linkdb \
+      crawldb \
+      segments/*
+

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt	2008-12-16 02:59:10 UTC (rev 2670)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt	2008-12-16 03:00:10 UTC (rev 2671)
@@ -1,13 +1,15 @@

HOWTO-xslt.txt
-2008-07-25
+2008-12-18
Aaron Binns

Table of Contents
  o Prerequisites
    - NutchWAX HOWTO.txt
  o Overview
+ o NutchWAX OpenSearchServlet
  o XSLTFilter and web.xml
+ o Sample

======================================================================
@@ -31,9 +33,10 @@
Servlet : OpenSearchServlet

If you read the OpenSearchServlet.java source code and the search.jsp
-page, you'll notice a lot of similarity, if not duplication of code.
+page, you'll notice a lot of similarity, if not outright duplication
+of code.
-The Internet Archive Web Team plans to improve and expand upon the +The Internet Archive Web Team has improved and expanded upon the existing OpenSearchServlet interface as well as adding more XML-based capabilities, including replacements for the existing JSP pages. In short, moving away from JSP and toward XML. @@ -48,6 +51,21 @@ ====================================================================== +NutchWAX OpenSearchServlet +====================================================================== + +NutchWAX contains an enhanced OpenSearch servlet which is a drop-in +replacement for the default Nutch OpenSearch servlet. To use the +NutchWAX implementation, modify the 'web.xml' + +from: + <servlet-class>org.apache.nutch.searcher.OpenSearchServlet</servlet-class> + +to: + <servlet-class>org.archive.nutchwax.OpenSearchServlet</servlet-class> + + +====================================================================== XSLTFilter and web.xml ====================================================================== @@ -55,11 +73,11 @@ OpenSearchServlet is straightforward. Simply add the XSLTFilter to the servlet's path and specify the XSL transform to apply. 
-For example, consider the default Nutch web.xml +For example, consider the default NutchWAX web.xml <servlet> <servlet-name>OpenSearch</servlet-name> - <servlet-class>org.apache.nutch.searcher.OpenSearchServlet</servlet-class> + <servlet-class>org.archive.nutchwax.OpenSearchServlet</servlet-class> </servlet> <servlet-mapping> @@ -68,13 +86,13 @@ </servlet-mapping> Let's say we want to retain the '/opensearch' path for the XML output, -and add the human-friendly HTML page at '/coolsearch' +and add the human-friendly HTML page at '/search' First, we add an additional 'servlet-mapping' for our new path: <servlet-mapping> <servlet-name>OpenSearch</servlet-name> - <url-pattern>/coolsearch</url-pattern> + <url-pattern>/search</url-pattern> </servlet-mapping> Then, we add the XSLTFilter, passing it a URL to the XSLT file @@ -93,7 +111,7 @@ <filter-mapping> <filter-name>XSLT Filter</filter-name> - <url-pattern>/coolsearch</url-pattern> + <url-pattern>/search</url-pattern> </filter-mapping> This way, we have two URLs, which run the exact same @@ -101,11 +119,11 @@ output whereas the other produces human-friendly HTML output. 
OpenSearch XML : http://someserver/opensearch?query=foo - Human-friendly HTML : http://someserver/coolsearch?query=foo + Human-friendly HTML : http://someserver/search?query=foo ====================================================================== -Samples +Sample ====================================================================== You can find sample 'web.xml' and 'search.xsl' files in Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2008-12-16 02:59:10 UTC (rev 2670) +++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2008-12-16 03:00:10 UTC (rev 2671) @@ -1,6 +1,6 @@ INSTALL.txt -2008-10-01 +2008-12-18 Aaron Binns This installation guide assumes the reader is already familiar with @@ -43,7 +43,7 @@ ------------- As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev. Nutch doesn't have a 1.0 release package yet, so we have to use the -Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.2 is +Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.3 is built against is: 701524 Modified: trunk/archive-access/projects/nutchwax/archive/README.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/README.txt 2008-12-16 02:59:10 UTC (rev 2670) +++ trunk/archive-access/projects/nutchwax/archive/README.txt 2008-12-16 03:00:10 UTC (rev 2671) @@ -1,9 +1,9 @@ README.txt -2008-10-01 +2008-12-18 Aaron Binns -Welcome to NutchWAX 0.12.2! +Welcome to NutchWAX 0.12.3! NutchWAX is a set of add-ons to Nutch in order to index and search archived web data. @@ -60,6 +60,15 @@ Filtering plugin which can be used to exclude URLs from import. It can be used as part of a NutchWAX de-duplication scheme. 
+ plugins/scoring-nutchwax + + Scoring plugin for use at index-time which reads from an external + "pagerank.txt" file for scoring documents based on the log10 of the + number of inlinks to a document. + + The use of this plugin is optional but can improve the quality of + search results, especially for very large collections. + conf/nutch-site.xml Sample configuration properties file showing suggested settings for @@ -131,6 +140,4 @@ contentMetadata.set( NutchWax.DATE_KEY, meta.getDate() ); ... - ====================================================================== - Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-12-16 02:59:10 UTC (rev 2670) +++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-12-16 03:00:10 UTC (rev 2671) @@ -1,9 +1,9 @@ RELEASE-NOTES.TXT -2008-10-13 +2008-12-18 Aaron Binns -Release notes for NutchWAX 0.12.2 +Release notes for NutchWAX 0.12.3 For the most recent updates and information on NutchWAX, please visit the project wiki at: @@ -15,9 +15,14 @@ Overview ====================================================================== -NutchWAX 0.12.2 contains some minor enhancements and fixes to NutchWAX -0.12.1. +NutchWAX 0.12.3 contains numerous enhancements and fixes to 0.12.2 + o PageRank calculation and scoring + o Enhanced OpenSearchServlet + o Improved XSLT sample for OpenSearch + o System init.d script for searcher slaves + o Enhanced searcher slave aware of NutchWAX extensions + ====================================================================== Issues ====================================================================== @@ -28,23 +33,6 @@ Issues resolved in this release: -WAX-19 - Add strict/loose option to DateAdder for revisit lines with extra - data on end. - -WAX-21 - Allow for blank lines and comment lines in manifest file. 
- -WAX-22 - Various code clean-ups based on code review using PMD tool. - -WAX-23 - Add a "field setter" filter to set a field to a static value in the - Lucene document during indexing. - -WAX-24 - DateAdder fails due to uncaught exception in URL canonicalization - -WAX-25 - Add utility/tool to dump unique values of a field in an index. - +WAX-26 + Add XML elements containing all search URL params for self-link + generation This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
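The log10()-based boost formula from HOWTO-pagerank.txt above is easy to check numerically. A minimal sketch follows — for illustration only; the actual PageRankScoringFilter is Java, not Python:

```python
import math

def boost(inlinks):
    """Boost for a document: log10(# inlinks) + 1, so 1 inlink -> 1.0 (no boost)."""
    return math.log10(inlinks) + 1.0

# Reproduce a few rows of the table in HOWTO-pagerank.txt:
for n in (1, 10, 82, 100, 1000):
    print(n, round(boost(n), 2))
```

Because the boost grows with the order of magnitude of the inlink count, a page needs roughly 10x the inlinks to gain one additional unit of boost.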
From: <bi...@us...> - 2008-12-16 02:59:13
Revision: 2670 http://archive-access.svn.sourceforge.net/archive-access/?rev=2670&view=rev Author: binzino Date: 2008-12-16 02:59:10 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Added a command for running the PageRanker tool. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/bin/nutchwax Modified: trunk/archive-access/projects/nutchwax/archive/bin/nutchwax =================================================================== --- trunk/archive-access/projects/nutchwax/archive/bin/nutchwax 2008-12-16 02:43:25 UTC (rev 2669) +++ trunk/archive-access/projects/nutchwax/archive/bin/nutchwax 2008-12-16 02:59:10 UTC (rev 2670) @@ -50,6 +50,10 @@ shift ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.DumpParallelIndex $@ ;; + pagerank) + shift + ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.PageRanker $@ + ;; *) echo "" echo "Usage: nutchwax COMMAND" @@ -57,6 +61,7 @@ echo " import Import ARCs into a new Nutch segment" echo " add-dates Add dates to a parallel index" echo " dumpindex Dump an index or set of parallel indices to stdout" + echo " pagerank Generate pagerank file for URLs in a 'linkdb'." echo "" exit 1 ;; This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
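The pagerank.txt file produced by the new 'pagerank' command is just "<count> <url>" lines, so the `sort -n pagerank.txt | tail` step shown in HOWTO-pagerank.txt can be mimicked in a few lines. This is a sketch only; the real tool is org.archive.nutchwax.tools.PageRanker:

```python
def top_ranked(pagerank_lines, n=2):
    """Return the n most linked-to (count, url) pairs, like `sort -n | tail`."""
    ranked = []
    for line in pagerank_lines:
        count, url = line.split(None, 1)
        ranked.append((int(count), url.strip()))
    ranked.sort()
    return ranked[-n:]

lines = [
    "496993 http://www.google-analytics.com/urchin.js",
    "764195 http://www.brasil.gov.br",
    "703517 http://www.mexicoenlinea.gob.mx/",
]

top = top_ranked(lines)
```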
From: <bi...@us...> - 2008-12-16 02:43:28
Revision: 2669 http://archive-access.svn.sourceforge.net/archive-access/?rev=2669&view=rev Author: binzino Date: 2008-12-16 02:43:25 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Removed Nutch OPIC scoring filter and replaced with NutchWAX PageRank scoring filter. Also added a comment about the HTTP code filter. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml Modified: trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml 2008-12-16 02:42:20 UTC (rev 2668) +++ trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml 2008-12-16 02:43:25 UTC (rev 2669) @@ -10,7 +10,7 @@ <!-- Add 'index-nutchwax' and 'query-nutchwax' to plugin list. --> <!-- Also, add 'parse-pdf' --> <!-- Remove 'urlfilter-regex' and 'normalizer-(pass|regex|basic)' --> - <value>protocol-http|parse-(text|html|js|pdf)|index-(basic|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax</value> + <value>protocol-http|parse-(text|html|js|pdf)|index-(basic|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-nutchwax|urlfilter-nutchwax</value> </property> <!-- The indexing filter order *must* be specified in order for @@ -115,6 +115,9 @@ <description>Implementation of URL canonicalizer to use.</description> </property> +<!-- Only pass URLs with an HTTP status in this range. Used by the + NutchWAX importer. + --> <property> <name>nutchwax.filter.http.status</name> <value> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
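The comment added to nutch-site.xml above describes a status-range gate applied by the NutchWAX importer. Conceptually this is just a range check, sketched below; the range endpoints here are hypothetical examples, not values taken from the truncated property:

```python
def passes_status_filter(status, low=200, high=300):
    """Accept a capture only if its HTTP status falls in [low, high).

    low/high are illustrative stand-ins for whatever range the
    nutchwax.filter.http.status property actually configures.
    """
    return low <= status < high

# Under these example bounds, a 200 OK capture is imported; a 404 is dropped.
```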
From: <bi...@us...> - 2008-12-16 02:42:28
Revision: 2668 http://archive-access.svn.sourceforge.net/archive-access/?rev=2668&view=rev Author: binzino Date: 2008-12-16 02:42:20 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Removed unused member variable. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java 2008-12-15 21:39:28 UTC (rev 2667) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java 2008-12-16 02:42:20 UTC (rev 2668) @@ -45,13 +45,13 @@ { public static final Log LOG = LogFactory.getLog(PageRanker.class); - public static final String DONE_NAME = "merge.done"; - - public PageRanker() { + public PageRanker() + { } - public PageRanker(Configuration conf) { + public PageRanker(Configuration conf) + { setConf(conf); } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-12-15 22:55:54
Revision: 2667 http://archive-access.svn.sourceforge.net/archive-access/?rev=2667&view=rev Author: binzino Date: 2008-12-15 21:39:28 +0000 (Mon, 15 Dec 2008) Log Message: ----------- Copy the src/etc directory to the build/package directory, just like we do with conf and web. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/build.xml Modified: trunk/archive-access/projects/nutchwax/archive/build.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/build.xml 2008-12-15 17:47:01 UTC (rev 2666) +++ trunk/archive-access/projects/nutchwax/archive/build.xml 2008-12-15 21:39:28 UTC (rev 2667) @@ -147,6 +147,11 @@ <fileset dir="src/web" /> </copy> + <mkdir dir="${dist.dir}/contrib/archive/etc"/> + <copy todir="${dist.dir}/contrib/archive/etc"> + <fileset dir="src/etc" /> + </copy> + </target> </project> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-12-15 17:47:04
Revision: 2666 http://archive-access.svn.sourceforge.net/archive-access/?rev=2666&view=rev Author: binzino Date: 2008-12-15 17:47:01 +0000 (Mon, 15 Dec 2008) Log Message: ----------- Oops, fix bug where I accidentally removed closing tag in previous edit. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml Modified: trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml 2008-12-15 02:19:53 UTC (rev 2665) +++ trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml 2008-12-15 17:47:01 UTC (rev 2666) @@ -147,7 +147,6 @@ <!-- The following are over-rides of property values in nutch-default which the Internet Archive uses in most NutchWAX projects. --> - <property> <name>io.map.index.skip</name> <value>32</value> @@ -167,3 +166,5 @@ <name>searcher.summary.length</name> <value>80</value> </property> + +</configuration> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-12-15 02:19:55
Revision: 2665 http://archive-access.svn.sourceforge.net/archive-access/?rev=2665&view=rev Author: binzino Date: 2008-12-15 02:19:53 +0000 (Mon, 15 Dec 2008) Log Message: ----------- Added some property values which we commonly use in deployments. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml Modified: trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml 2008-12-15 01:47:48 UTC (rev 2664) +++ trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml 2008-12-15 02:19:53 UTC (rev 2665) @@ -144,4 +144,26 @@ <value>true</value> </property> -</configuration> +<!-- The following are over-rides of property values in + nutch-default which the Internet Archive uses in + most NutchWAX projects. --> + +<property> + <name>io.map.index.skip</name> + <value>32</value> +</property> + +<property> + <name>searcher.max.hits</name> + <value>1000</value> +</property> + +<property> + <name>searcher.summary.context</name> + <value>8</value> +</property> + +<property> + <name>searcher.summary.length</name> + <value>80</value> +</property> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-12-15 02:11:18
Revision: 2664
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2664&view=rev
Author:   binzino
Date:     2008-12-15 01:47:48 +0000 (Mon, 15 Dec 2008)

Log Message:
-----------
Added own version of OpenSearch servlet which adds some XML elements
and has a few other enhancements. Also revised the sample XSLT to take
advantage of these changes in the OpenSearch servlet.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/web/style/search.xsl

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java

Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java	                        (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java	2008-12-15 01:47:48 UTC (rev 2664)
@@ -0,0 +1,372 @@
+
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */ + +package org.archive.nutchwax; + +import java.io.IOException; +import java.net.URLEncoder; +import java.util.Map; +import java.util.HashMap; +import java.util.Set; +import java.util.HashSet; + +import javax.servlet.ServletException; +import javax.servlet.ServletConfig; +import javax.servlet.http.HttpServlet; +import javax.servlet.http.HttpServletRequest; +import javax.servlet.http.HttpServletResponse; + +import javax.xml.parsers.*; + +import org.apache.hadoop.conf.Configuration; +import org.apache.nutch.util.NutchConfiguration; +import org.w3c.dom.*; +import javax.xml.transform.TransformerFactory; +import javax.xml.transform.Transformer; +import javax.xml.transform.dom.DOMSource; +import javax.xml.transform.stream.StreamResult; + +import org.apache.nutch.searcher.Hit; +import org.apache.nutch.searcher.HitDetails; +import org.apache.nutch.searcher.Hits; +import org.apache.nutch.searcher.NutchBean; +import org.apache.nutch.searcher.Query; +import org.apache.nutch.searcher.Summary; + +/** + * Present search results using A9's OpenSearch extensions to RSS, + * plus a few Nutch-specific extensions. 
+ */ +public class OpenSearchServlet extends HttpServlet +{ + private static final Map NS_MAP = new HashMap(); + private int MAX_HITS_PER_PAGE; + + static { + NS_MAP.put("opensearch", "http://a9.com/-/spec/opensearchrss/1.0/"); + NS_MAP.put("nutch", "http://www.nutch.org/opensearchrss/1.0/"); + } + + private static final Set SKIP_DETAILS = new HashSet(); + static { + SKIP_DETAILS.add("url"); // redundant with RSS link + SKIP_DETAILS.add("title"); // redundant with RSS title + } + + private NutchBean bean; + private Configuration conf; + + public void init(ServletConfig config) throws ServletException { + try { + this.conf = NutchConfiguration.get(config.getServletContext()); + bean = NutchBean.get(config.getServletContext(), this.conf); + } catch (IOException e) { + throw new ServletException(e); + } + MAX_HITS_PER_PAGE = conf.getInt("searcher.max.hits.per.page", -1); + } + + public void doGet(HttpServletRequest request, HttpServletResponse response) + throws ServletException, IOException { + + long responseTime = System.nanoTime( ); + + if (NutchBean.LOG.isInfoEnabled()) { + NutchBean.LOG.info("query request from " + request.getRemoteAddr()); + } + + // get parameters from request + request.setCharacterEncoding("UTF-8"); + String queryString = request.getParameter("query"); + if (queryString == null) + queryString = ""; + String urlQuery = URLEncoder.encode(queryString, "UTF-8"); + + // the query language + String queryLang = request.getParameter("lang"); + + int start = 0; // first hit to display + String startString = request.getParameter("start"); + if (startString != null) + start = Integer.parseInt(startString); + + int hitsPerPage = 10; // number of hits to display + String hitsString = request.getParameter("hitsPerPage"); + if (hitsString != null) + hitsPerPage = Integer.parseInt(hitsString); + if(MAX_HITS_PER_PAGE > 0 && hitsPerPage > MAX_HITS_PER_PAGE) + hitsPerPage = MAX_HITS_PER_PAGE; + + String sort = request.getParameter("sort"); + boolean reverse = 
sort != null && "true".equals(request.getParameter("reverse")); + + // De-Duplicate handling. Look for duplicates field and for how many + // duplicates per results to return. Default duplicates field is 'site' + // and duplicates per results default is '2'. + String dedupField = request.getParameter("dedupField"); + if (dedupField == null || dedupField.length() == 0) { + dedupField = "site"; + } + int hitsPerDup = 2; + String hitsPerDupString = request.getParameter("hitsPerDup"); + String hitsPerSiteString = request.getParameter("hitsPerSite"); + if (hitsPerDupString != null && hitsPerDupString.length() > 0) { + hitsPerDup = Integer.parseInt(hitsPerDupString); + } else { + // If 'hitsPerSite' present, use that value. + if (hitsPerSiteString != null && hitsPerSiteString.length() > 0) { + hitsPerDup = Integer.parseInt(hitsPerSiteString); + } + } + + // Make up query string for use later drawing the 'rss' logo. + String params = "&hitsPerPage=" + hitsPerPage + + (queryLang == null ? "" : "&lang=" + queryLang) + + (sort == null ? "" : "&sort=" + sort + (reverse? "&reverse=true": "") + + (dedupField == null ? 
"" : "&dedupField=" + dedupField)); + + Query query = Query.parse(queryString, queryLang, this.conf); + if (NutchBean.LOG.isInfoEnabled()) { + NutchBean.LOG.info("query: " + queryString); + NutchBean.LOG.info("lang: " + queryLang); + } + + // execute the query + Hits hits; + try { + hits = bean.search(query, start + hitsPerPage, hitsPerDup, dedupField, sort, reverse); + } catch (IOException e) { + if (NutchBean.LOG.isWarnEnabled()) { + NutchBean.LOG.warn("Search Error", e); + } + hits = new Hits(0,new Hit[0]); + } + + if (NutchBean.LOG.isInfoEnabled()) { + NutchBean.LOG.info("total hits: " + hits.getTotal()); + } + + responseTime = System.nanoTime( ) - responseTime; + + // generate xml results + int end = (int)Math.min(hits.getLength(), start + hitsPerPage); + int length = end-start; + + Hit[] show = hits.getHits(start, end-start); + HitDetails[] details = bean.getDetails(show); + Summary[] summaries = bean.getSummary(details, query); + + String requestUrl = request.getRequestURL().toString(); + String base = requestUrl.substring(0, requestUrl.lastIndexOf('/')); + + + try { + DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); + factory.setNamespaceAware(true); + Document doc = factory.newDocumentBuilder().newDocument(); + + Element rss = addNode(doc, doc, "rss"); + addAttribute(doc, rss, "version", "2.0"); + addAttribute(doc, rss, "xmlns:opensearch", + (String)NS_MAP.get("opensearch")); + addAttribute(doc, rss, "xmlns:nutch", (String)NS_MAP.get("nutch")); + + Element channel = addNode(doc, rss, "channel"); + + addNode(doc, channel, "title", "Nutch: " + queryString); + addNode(doc, channel, "description", "Nutch search results for query: " + + queryString); + addNode(doc, channel, "link", + base+"/search.jsp" + +"?query="+urlQuery + +"&start="+start + +"&hitsPerDup="+hitsPerDup + +params); + + addNode(doc, channel, "opensearch", "totalResults", ""+hits.getTotal()); + addNode(doc, channel, "opensearch", "startIndex", ""+start); + addNode(doc, 
channel, "opensearch", "itemsPerPage", ""+hitsPerPage); + + addNode(doc, channel, "nutch", "query", queryString); + addNode(doc, channel, "nutch", "responseTime", Double.toString( ((long) responseTime / 1000 / 1000 ) / 1000.0 ) ); + + // Add a <nutch:urlParams> element containing a list of all the URL parameters. + Element urlParams = doc.createElementNS((String)NS_MAP.get("nutch"), "nutch:urlParams" ); + channel.appendChild( urlParams ); + + for ( Map.Entry<String,String[]> e : ((Map<String,String[]>) request.getParameterMap( )).entrySet( ) ) + { + String key = e.getKey( ); + for ( String value : e.getValue( ) ) + { + Element urlParam = doc.createElementNS((String)NS_MAP.get("nutch"), "nutch:param" ); + addAttribute( doc, urlParam, "name", key ); + addAttribute( doc, urlParam, "value", value ); + urlParams.appendChild(urlParam); + } + } + + // Hmm, we should indicate whether or not the "totalResults" + // number as being exact some other way; perhaps just have a + // <nutch:totalIsExact>true</nutch:totalIsExact> element. + /* + if ((hits.totalIsExact() && end < hits.getTotal()) // more hits to show + || (!hits.totalIsExact() && (hits.getLength() > start+hitsPerPage))){ + addNode(doc, channel, "nutch", "nextPage", requestUrl + +"?query="+urlQuery + +"&start="+end + +"&hitsPerDup="+hitsPerDup + +params); + } + */ + + // Same here, this seems odd. 
+ /* + if ((!hits.totalIsExact() && (hits.getLength() <= start+hitsPerPage))) { + addNode(doc, channel, "nutch", "showAllHits", requestUrl + +"?query="+urlQuery + +"&hitsPerDup="+0 + +params); + } + */ + + for (int i = 0; i < length; i++) { + Hit hit = show[i]; + HitDetails detail = details[i]; + String title = detail.getValue("title"); + String url = detail.getValue("url"); + String id = "idx=" + hit.getIndexNo() + "&id=" + hit.getIndexDocNo(); + + if (title == null || title.equals("")) { // use url for docs w/o title + title = url; + } + + Element item = addNode(doc, channel, "item"); + + addNode(doc, item, "title", title); + if (summaries[i] != null) { + addNode(doc, item, "description", summaries[i].toString() ); + } + addNode(doc, item, "link", url); + + addNode(doc, item, "nutch", "site", hit.getDedupValue()); + + addNode(doc, item, "nutch", "cache", base+"/cached.jsp?"+id); + addNode(doc, item, "nutch", "explain", base+"/explain.jsp?"+id + +"&query="+urlQuery+"&lang="+queryLang); + + // Probably don't need this as the XML processor/front-end can + // easily do this themselves. 
+ if (hit.moreFromDupExcluded()) { + addNode(doc, item, "nutch", "moreFromSite", requestUrl + +"?query=" + +URLEncoder.encode("site:"+hit.getDedupValue() + +" "+queryString, "UTF-8") + +"&hitsPerSite="+0 + +params); + } + + for (int j = 0; j < detail.getLength(); j++) { // add all from detail + String field = detail.getField(j); + if (!SKIP_DETAILS.contains(field)) + addNode(doc, item, "nutch", field, detail.getValue(j)); + } + } + + // dump DOM tree + + DOMSource source = new DOMSource(doc); + TransformerFactory transFactory = TransformerFactory.newInstance(); + Transformer transformer = transFactory.newTransformer(); + transformer.setOutputProperty("indent", "yes"); + StreamResult result = new StreamResult(response.getOutputStream()); + response.setContentType("text/xml"); + transformer.transform(source, result); + + } catch (javax.xml.parsers.ParserConfigurationException e) { + throw new ServletException(e); + } catch (javax.xml.transform.TransformerException e) { + throw new ServletException(e); + } + + } + + private static Element addNode(Document doc, Node parent, String name) { + Element child = doc.createElement(name); + parent.appendChild(child); + return child; + } + + private static void addNode(Document doc, Node parent, + String name, String text) { + if ( text == null ) text = ""; + Element child = doc.createElement(name); + child.appendChild(doc.createTextNode(getLegalXml(text))); + parent.appendChild(child); + } + + private static void addNode(Document doc, Node parent, + String ns, String name, String text) { + if ( text == null ) text = ""; + Element child = doc.createElementNS((String)NS_MAP.get(ns), ns+":"+name); + child.appendChild(doc.createTextNode(getLegalXml(text))); + parent.appendChild(child); + } + + private static void addAttribute(Document doc, Element node, + String name, String value) { + Attr attribute = doc.createAttribute(name); + attribute.setValue(getLegalXml(value)); + node.getAttributes().setNamedItem(attribute); + } + + /* + 
* Ensure string is legal xml. + * @param text String to verify. + * @return Passed <code>text</code> or a new string with illegal + * characters removed if any found in <code>text</code>. + * @see http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char + */ + protected static String getLegalXml(final String text) { + if (text == null) { + return null; + } + StringBuffer buffer = null; + for (int i = 0; i < text.length(); i++) { + char c = text.charAt(i); + if (!isLegalXml(c)) { + if (buffer == null) { + // Start up a buffer. Copy characters here from now on + // now we've found at least one bad character in original. + buffer = new StringBuffer(text.length()); + buffer.append(text.substring(0, i)); + } + } else { + if (buffer != null) { + buffer.append(c); + } + } + } + return (buffer != null)? buffer.toString(): text; + } + + private static boolean isLegalXml(final char c) { + return c == 0x9 || c == 0xa || c == 0xd || (c >= 0x20 && c <= 0xd7ff) + || (c >= 0xe000 && c <= 0xfffd) || (c >= 0x10000 && c <= 0x10ffff); + } + +} Modified: trunk/archive-access/projects/nutchwax/archive/src/web/style/search.xsl =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/web/style/search.xsl 2008-12-14 21:10:33 UTC (rev 2663) +++ trunk/archive-access/projects/nutchwax/archive/src/web/style/search.xsl 2008-12-15 01:47:48 UTC (rev 2664) @@ -115,42 +115,49 @@ <span class="searchFields"> Search for <input id="query" name="query" type="text" size="40" value="{nutch:query}" /> + + <!-- Create hidden form fields for the rest of the URL parameters --> + <xsl:for-each select="nutch:urlParams/nutch:param[@name!='start' and @name!='query']"> + <xsl:element name="input" namespace="http://www.w3.org/1999/xhtml"> + <xsl:attribute name="type">hidden</xsl:attribute> + <xsl:attribute name="name" ><xsl:value-of select="@name" /></xsl:attribute> + <xsl:attribute name="value"><xsl:value-of select="@value" /></xsl:attribute> + 
</xsl:element> + </xsl:for-each> + <input type="submit" value="Search"/> </span> </form> </div> - <div style="font-size: 8pt; margin:0; padding:0 0 0.5em 0;">Results <xsl:value-of select="opensearch:startIndex + 1" />-<xsl:value-of select="opensearch:startIndex + opensearch:itemsPerPage" /> of about <xsl:value-of select="opensearch:totalResults" /> <span style="margin-left: 1em;"><a href="{nutch:nextPage}">Next</a></span></div> + <div style="font-size: 8pt; margin:0; padding:0 0 0.5em 0;">Results <xsl:value-of select="opensearch:startIndex + 1" />-<xsl:value-of select="opensearch:startIndex + opensearch:itemsPerPage" /> of about <xsl:value-of select="opensearch:totalResults" /> <span style="margin-left: 1em;"></span></div> <!-- Search results --> <ol start="{opensearch:startIndex + 1}"> <xsl:apply-templates select="item" /> </ol> <!-- Generate list of page links --> <center> - <xsl:if test="(floor(opensearch:startIndex div opensearch:itemsPerPage) + 1) != 1"> - <a href="search?query={nutch:query}&start={(floor(opensearch:startIndex div opensearch:itemsPerPage) - 1) * opensearch:itemsPerPage}">«</a><xsl:text> </xsl:text> - </xsl:if> - <xsl:choose> - <xsl:when test="(floor(opensearch:startIndex div opensearch:itemsPerPage) + 1) < 11"> - <xsl:call-template name="pageLinks" > - <xsl:with-param name="begin" select="1" /> - <xsl:with-param name="end" select="21" /> - <xsl:with-param name="current" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" /> - </xsl:call-template> - </xsl:when> - <xsl:otherwise> - <xsl:call-template name="pageLinks" > - <xsl:with-param name="begin" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1 - 10" /> - <xsl:with-param name="end" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1 + 11" /> - <xsl:with-param name="current" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" /> - </xsl:call-template> - </xsl:otherwise> - </xsl:choose> - <a href="{nutch:nextPage}">»</a> + 
<xsl:call-template name="pageLinks"> + <xsl:with-param name="labelPrevious" select="'«'" /> + <xsl:with-param name="labelNext" select="'»'" /> + </xsl:call-template> </center> </body> </html> </xsl:template> + +<!-- ====================================================================== + NutchWAX XSLT template/function library. + + The idea is that the above xhtml code is what most NutchWAX users + will modify to tailor to their own look and feel. The stuff + below implements the core logic for generating results lists, + page links, etc. + + Hopefully NutchWAX web developers will be able to easily edit the + above xhtml and css and won't have to change the below. + ====================================================================== --> + <!-- Template to emit a search result as an HTML list item (<li/>). --> <xsl:template match="item"> @@ -176,32 +183,99 @@ <xsl:value-of select="substring(.,1,4)" /><xsl:text>-</xsl:text><xsl:value-of select="substring(.,5,2)" /><xsl:text>-</xsl:text><xsl:value-of select="substring(.,7,2)" /><xsl:text> </xsl:text> </xsl:template> -<!-- Template to generate a list of numbered links to results pages. +<!-- Template to emit a list of numbered page links, *including* + "previous" and "next" links on either end, using the given labels.
Parameters: + labelPrevious Link text for "previous page" link + labelNext Link text for "next page" link + --> +<xsl:template name="pageLinks"> + <xsl:param name="labelPrevious" /> + <xsl:param name="labelNext" /> + <!-- If we are on any page past the first, emit a "previous" link --> + <xsl:if test="(floor(opensearch:startIndex div opensearch:itemsPerPage) + 1) != 1"> + <xsl:call-template name="pageLink"> + <xsl:with-param name="pageNum" select="floor(opensearch:startIndex div opensearch:itemsPerPage)" /> + <xsl:with-param name="linkText" select="$labelPrevious" /> + </xsl:call-template> + <xsl:text> </xsl:text> + </xsl:if> + <!-- Now, emit numbered page links --> + <xsl:choose> + <xsl:when test="(floor(opensearch:startIndex div opensearch:itemsPerPage) + 1) < 11"> + <xsl:call-template name="numberedPageLinks" > + <xsl:with-param name="begin" select="1" /> + <xsl:with-param name="end" select="21" /> + <xsl:with-param name="current" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" /> + </xsl:call-template> + </xsl:when> + <xsl:otherwise> + <xsl:call-template name="numberedPageLinks" > + <xsl:with-param name="begin" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1 - 10" /> + <xsl:with-param name="end" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1 + 11" /> + <xsl:with-param name="current" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" /> + </xsl:call-template> + </xsl:otherwise> + </xsl:choose> + <!-- Lastly, emit a "next" link. --> + <xsl:text> </xsl:text> + <xsl:call-template name="pageLink"> + <xsl:with-param name="pageNum" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 2" /> + <xsl:with-param name="linkText" select="$labelNext" /> + </xsl:call-template> +</xsl:template> + +<!-- Template to emit a list of numbered links to results pages. 
+ Parameters: begin starting # inclusive end ending # exclusive current the current page, don't emit a link --> -<xsl:template name="pageLinks"> +<xsl:template name="numberedPageLinks"> <xsl:param name="begin" /> <xsl:param name="end" /> <xsl:param name="current" /> <xsl:if test="$begin < $end"> - <xsl:choose> - <xsl:when test="$begin = $current" > - <xsl:value-of select="$current" /> - </xsl:when> - <xsl:otherwise> - <a href="?query={nutch:query}&start={($begin -1) * opensearch:itemsPerPage}&hitsPerPage={opensearch:itemsPerPage}"><xsl:value-of select="$begin" /></a> - </xsl:otherwise> - </xsl:choose> - <xsl:text> </xsl:text> - <xsl:call-template name="pageLinks"> - <xsl:with-param name="begin" select="$begin + 1" /> - <xsl:with-param name="end" select="$end" /> - <xsl:with-param name="current" select="$current" /> + <xsl:choose> + <xsl:when test="$begin = $current" > + <xsl:value-of select="$current" /> + </xsl:when> + <xsl:otherwise> + <xsl:call-template name="pageLink" > + <xsl:with-param name="pageNum" select="$begin" /> + <xsl:with-param name="linkText" select="$begin" /> </xsl:call-template> + </xsl:otherwise> + </xsl:choose> + <xsl:text> </xsl:text> + <xsl:call-template name="numberedPageLinks"> + <xsl:with-param name="begin" select="$begin + 1" /> + <xsl:with-param name="end" select="$end" /> + <xsl:with-param name="current" select="$current" /> + </xsl:call-template> </xsl:if> </xsl:template> +<!-- Template to emit a single page link. All of the URL parameters + listed in the OpenSearch results are included in the link. 
+ Parameters: + pageNum page number of the link + linkText text of the link + --> +<xsl:template name="pageLink"> + <xsl:param name="pageNum" /> + <xsl:param name="linkText" /> + <xsl:element name="a" namespace="http://www.w3.org/1999/xhtml"> + <xsl:attribute name="href"> + <xsl:text>?</xsl:text> + <xsl:for-each select="nutch:urlParams/nutch:param[@name!='start']"> + <xsl:value-of select="@name" /><xsl:text>=</xsl:text><xsl:value-of select="@value" /> + <xsl:text>&</xsl:text> + </xsl:for-each> + <xsl:text>start=</xsl:text><xsl:value-of select="($pageNum -1) * opensearch:itemsPerPage" /> + </xsl:attribute> + <xsl:value-of select="$linkText" /> + </xsl:element> +</xsl:template> + </xsl:stylesheet> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
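The getLegalXml/isLegalXml helpers added to OpenSearchServlet in this commit can be exercised in isolation. The standalone sketch below (the class name XmlCharFilter is illustrative, not from the NutchWAX source) mirrors the filtering logic: characters outside XML 1.0's legal Char ranges are dropped, a buffer is allocated only once a bad character is found, and the original string is returned untouched when every character is legal.

```java
// Sketch of the XML character filtering used by OpenSearchServlet.getLegalXml.
// Tab, LF, CR, and the legal Unicode BMP ranges pass through; everything
// else (e.g. control characters) is silently dropped.
public class XmlCharFilter {

    static boolean isLegalXml(final char c) {
        return c == 0x9 || c == 0xa || c == 0xd
            || (c >= 0x20 && c <= 0xd7ff)
            || (c >= 0xe000 && c <= 0xfffd);
    }

    static String getLegalXml(final String text) {
        if (text == null) {
            return null;
        }
        StringBuilder buffer = null; // allocated lazily, on the first bad char
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!isLegalXml(c)) {
                if (buffer == null) {
                    buffer = new StringBuilder(text.length());
                    buffer.append(text, 0, i); // copy the clean prefix
                }
            } else if (buffer != null) {
                buffer.append(c);
            }
        }
        return (buffer != null) ? buffer.toString() : text;
    }

    public static void main(String[] args) {
        String input = "ok" + (char) 0x01 + "bad"; // embed an illegal control char
        System.out.println(getLegalXml(input));    // prints "okbad"
    }
}
```

One detail of the committed version worth noting: its final clause, `(c >= 0x10000 && c <= 0x10ffff)`, can never be true for a Java `char` (whose maximum value is 0xFFFF); supplementary characters arrive as surrogate pairs in the 0xD800-0xDFFF gap, so this filter strips them. The sketch omits that dead clause.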
Revision: 2663 http://archive-access.svn.sourceforge.net/archive-access/?rev=2663&view=rev Author: binzino Date: 2008-12-14 21:10:33 +0000 (Sun, 14 Dec 2008) Log Message: ----------- Fixed bug where no settings lead to NPE due to uninitialized member variable. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/FieldSetter.java Modified: trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/FieldSetter.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/FieldSetter.java 2008-12-12 05:12:36 UTC (rev 2662) +++ trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/FieldSetter.java 2008-12-14 21:10:33 UTC (rev 2663) @@ -20,6 +20,7 @@ */ package org.archive.nutchwax.index; +import java.util.Collections; import java.util.List; import java.util.ArrayList; @@ -69,7 +70,7 @@ public static final Log LOG = LogFactory.getLog( FieldSetter.class ); private Configuration conf; - private List<FieldSetting> settings; + private List<FieldSetting> settings = Collections.emptyList();  public void setConf( Configuration conf ) {
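The one-line fix in revision 2663 replaces a null `settings` field with an immutable empty list, so code that reads the settings before `setConf` has populated them no longer throws a `NullPointerException`. A minimal sketch of the pattern (the class, field, and method names below are illustrative stand-ins, not the actual NutchWAX `FieldSetter`):

```java
import java.util.Collections;
import java.util.List;

// Illustrative stand-in for a configurable component whose settings list
// may be read before any configuration has been applied.
public class SafeDefaults {

    // Before the fix: "private List<String> settings;" left this null,
    // so settings.size() below would throw a NullPointerException.
    private List<String> settings = Collections.emptyList();

    public void configure(List<String> values) {
        this.settings = values;
    }

    public int settingCount() {
        return settings.size(); // safe even if configure() was never called
    }

    public static void main(String[] args) {
        SafeDefaults s = new SafeDefaults();
        System.out.println(s.settingCount()); // prints 0 (no NPE)
        s.configure(List.of("a", "b"));
        System.out.println(s.settingCount()); // prints 2
    }
}
```

`Collections.emptyList()` is a shared immutable instance, so the default costs no allocation per object; any real configuration simply overwrites the reference.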
From: <bi...@us...> - 2008-12-12 05:12:41
Revision: 2662 http://archive-access.svn.sourceforge.net/archive-access/?rev=2662&view=rev Author: binzino Date: 2008-12-12 05:12:36 +0000 (Fri, 12 Dec 2008) Log Message: ----------- Fixed rsync args to exclude .svn subdirs and other stuff we don't want to copy over into the Nutch source tree. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/build.xml Modified: trunk/archive-access/projects/nutchwax/archive/build.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/build.xml 2008-12-11 22:59:27 UTC (rev 2661) +++ trunk/archive-access/projects/nutchwax/archive/build.xml 2008-12-12 05:12:36 UTC (rev 2662) @@ -28,9 +28,7 @@ <target name="nutch-compile-core"> <!-- First, copy over Nutch source overlays --> <exec executable="rsync"> - <arg value="-vac"/> - <arg value="--exclude"/> - <arg value="*~"/> + <arg value="-vacC"/> <arg value="src/nutch/"/> <arg value="../../"/> </exec>
From: <bi...@us...> - 2008-12-11 22:59:31
Revision: 2661 http://archive-access.svn.sourceforge.net/archive-access/?rev=2661&view=rev Author: binzino Date: 2008-12-11 22:59:27 +0000 (Thu, 11 Dec 2008) Log Message: ----------- Add use of 'rsync' to copy Nutch source over-rides into Nutch main source dir before compilation. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/build.xml Modified: trunk/archive-access/projects/nutchwax/archive/build.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/build.xml 2008-12-11 22:58:28 UTC (rev 2660) +++ trunk/archive-access/projects/nutchwax/archive/build.xml 2008-12-11 22:59:27 UTC (rev 2661) @@ -26,6 +26,14 @@ <property name="dist.dir" value="${build.dir}/nutch-1.0-dev" /> <target name="nutch-compile-core"> + <!-- First, copy over Nutch source overlays --> + <exec executable="rsync"> + <arg value="-vac"/> + <arg value="--exclude"/> + <arg value="*~"/> + <arg value="src/nutch/"/> + <arg value="../../"/> + </exec> <ant dir="${nutch.dir}" target="compile-core" inheritAll="false" /> </target>
From: <bi...@us...> - 2008-12-11 22:58:33
Revision: 2660 http://archive-access.svn.sourceforge.net/archive-access/?rev=2660&view=rev Author: binzino Date: 2008-12-11 22:58:28 +0000 (Thu, 11 Dec 2008) Log Message: ----------- Initial checkin of Nutch source-files that are over-ridden and copied into the Nutch source tree when compiling. Added Paths: ----------- trunk/archive-access/projects/nutchwax/archive/src/nutch/ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java 2008-12-11 22:58:28 UTC (rev 2660) @@ -0,0 +1,375 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.nutch.searcher; + +import java.io.IOException; +import java.io.Reader; +import java.io.InputStream; +import java.io.InputStreamReader; +import java.io.BufferedReader; + +import java.util.HashMap; +import java.util.Map; +import java.util.Iterator; + +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; + +import org.apache.commons.lang.StringUtils; +import org.apache.hadoop.io.*; +import org.apache.hadoop.fs.*; +import org.apache.nutch.protocol.*; +import org.apache.nutch.parse.*; +import org.apache.nutch.util.HadoopFSUtil; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.mapred.*; +import org.apache.hadoop.mapred.lib.*; +import org.apache.nutch.crawl.*; + +/** Implements {@link HitSummarizer} and {@link HitContent} for a set of + * fetched segments. 
*/ +public class FetchedSegments implements HitSummarizer, HitContent +{ + public static final Log LOG = LogFactory.getLog(FetchedSegments.class); + + private static class Segment implements Closeable { + + private static final Partitioner PARTITIONER = new HashPartitioner(); + + private FileSystem fs; + private Path segmentDir; + + private MapFile.Reader[] content; + private MapFile.Reader[] parseText; + private MapFile.Reader[] parseData; + private MapFile.Reader[] crawl; + private Configuration conf; + + public Segment(FileSystem fs, Path segmentDir, Configuration conf) throws IOException { + this.fs = fs; + this.segmentDir = segmentDir; + this.conf = conf; + } + + public CrawlDatum getCrawlDatum(Text url) throws IOException { + synchronized (this) { + if (crawl == null) + crawl = getReaders(CrawlDatum.FETCH_DIR_NAME); + } + return (CrawlDatum)getEntry(crawl, url, new CrawlDatum()); + } + + public byte[] getContent(Text url) throws IOException { + synchronized (this) { + if (content == null) + content = getReaders(Content.DIR_NAME); + } + return ((Content)getEntry(content, url, new Content())).getContent(); + } + + public ParseData getParseData(Text url) throws IOException { + synchronized (this) { + if (parseData == null) + parseData = getReaders(ParseData.DIR_NAME); + } + return (ParseData)getEntry(parseData, url, new ParseData()); + } + + public ParseText getParseText(Text url) throws IOException { + synchronized (this) { + if (parseText == null) + parseText = getReaders(ParseText.DIR_NAME); + } + return (ParseText)getEntry(parseText, url, new ParseText()); + } + + private MapFile.Reader[] getReaders(String subDir) throws IOException { + return MapFileOutputFormat.getReaders(fs, new Path(segmentDir, subDir), this.conf); + } + + private Writable getEntry(MapFile.Reader[] readers, Text url, + Writable entry) throws IOException { + return MapFileOutputFormat.getEntry(readers, PARTITIONER, url, entry); + } + + public void close() throws IOException { + if 
(content != null) { closeReaders(content); } + if (parseText != null) { closeReaders(parseText); } + if (parseData != null) { closeReaders(parseData); } + if (crawl != null) { closeReaders(crawl); } + } + + private void closeReaders(MapFile.Reader[] readers) throws IOException { + for (int i = 0; i < readers.length; i++) { + readers[i].close(); + } + } + + } + + private HashMap segments = new HashMap( ); + private boolean perCollection = false; + private Summarizer summarizer; + + /** Construct given a directory containing fetcher output. */ + public FetchedSegments(FileSystem fs, String segmentsDir, Configuration conf) throws IOException + { + this.summarizer = new SummarizerFactory(conf).getSummarizer(); + + Path[] segmentDirs = HadoopFSUtil.getPaths( fs.listStatus(new Path(segmentsDir), HadoopFSUtil.getPassDirectoriesFilter(fs)) ); + if ( segmentDirs == null ) + { + LOG.warn( "No segment directories: " + segmentsDir ); + return ; + } + + this.perCollection = conf.getBoolean( "nutchwax.FetchedSegments.perCollection", false ); + + LOG.info( "Per-collection segments: " + this.perCollection ); + + for ( int i = 0; i < segmentDirs.length; i++ ) + { + if ( this.perCollection ) + { + // Assume segmentDir is actually a 'collection' dir which + // contains a list of segments, such as: + // crawl/segments/194/segment-foo + // /segment-bar + // /segment-baz + // crawl/segments/366/segment-frotz + // /segment-fizzle + // /segment-bizzle + // The '194' and '366' are collection dirs, which contain the + // actual segment dirs. + Path collectionDir = segmentDirs[i]; + + Map perCollectionSegments = (Map) this.segments.get( collectionDir.getName( ) ); + if ( perCollectionSegments == null ) + { + perCollectionSegments = new HashMap( ); + this.segments.put( collectionDir.getName( ), perCollectionSegments ); + } + + // Now, get a list of all the sub-dirs of the collectionDir, + // and create segments for them, adding them to the + // per-collection map. 
+ Path[] perCollectionSegmentDirs = HadoopFSUtil.getPaths( fs.listStatus( collectionDir, HadoopFSUtil.getPassDirectoriesFilter(fs) ) ); + for ( Path segmentDir : perCollectionSegmentDirs ) + { + perCollectionSegments.put( segmentDir.getName( ), new Segment( fs, segmentDir, conf ) ); + } + + addRemaps( fs, collectionDir, (Map<String,Segment>) perCollectionSegments ); + } + else + { + Path segmentDir = segmentDirs[i]; + segments.put(segmentDir.getName(), new Segment(fs, segmentDir, conf)); + } + } + + // If we not-doing perCollection segments, process a single + // "remap" file for the "segments" dir. + if ( ! this.perCollection ) + { + addRemaps( fs, new Path(segmentsDir), (Map<String,Segment>) segments ); + } + + LOG.info( "segments: " + segments ); + } + + protected void addRemaps( FileSystem fs, Path segmentDir, Map<String,Segment> segments ) + throws IOException + { + Path segmentRemapFile = new Path( segmentDir, "remap" ); + + if ( ! fs.exists( segmentRemapFile ) ) + { + LOG.warn( "Remap file doesn't exist: " + segmentRemapFile ); + + return ; + } + + // InputStream is = segmentRemapFile.getFileSystem( conf ).open( segmentRemapFile ); + InputStream is = fs.open( segmentRemapFile ); + + BufferedReader reader = new BufferedReader( new InputStreamReader( is, "UTF-8" ) ); + + String line; + while ( (line = reader.readLine()) != null ) + { + String fields[] = line.trim( ).split( "\\s+" ); + + if ( fields.length < 2 ) + { + LOG.warn( "Malformed remap line, not enough fields ("+fields.length+"): " + line ); + continue ; + } + + // Look for the "to" name in the segments. 
+ Segment toSegment = segments.get( fields[1] ); + if ( toSegment == null ) + { + LOG.warn( "Segment remap destination doesn't exist: " + fields[1] ); + } + else + { + LOG.warn( "Remap: " + fields[0] + " => " + fields[1] ); + segments.put( fields[0], toSegment ); + } + } + } + + + public String[] getSegmentNames() { + return (String[])segments.keySet().toArray(new String[segments.size()]); + } + + public byte[] getContent(HitDetails details) throws IOException { + return getSegment(details).getContent(getUrl(details)); + } + + public ParseData getParseData(HitDetails details) throws IOException { + return getSegment(details).getParseData(getUrl(details)); + } + + public long getFetchDate(HitDetails details) throws IOException { + return getSegment(details).getCrawlDatum(getUrl(details)) + .getFetchTime(); + } + + public ParseText getParseText(HitDetails details) throws IOException { + return getSegment(details).getParseText(getUrl(details)); + } + + public Summary getSummary(HitDetails details, Query query) + throws IOException { + + if (this.summarizer == null) { return new Summary(); } + + Segment segment = getSegment(details); + ParseText parseText = segment.getParseText(getUrl(details)); + String text = (parseText != null) ? 
parseText.getText() : ""; + + return this.summarizer.getSummary(text, query); + } + + private class SummaryThread extends Thread { + private HitDetails details; + private Query query; + + private Summary summary; + private Throwable throwable; + + public SummaryThread(HitDetails details, Query query) { + this.details = details; + this.query = query; + } + + public void run() { + try { + this.summary = getSummary(details, query); + } catch (Throwable throwable) { + this.throwable = throwable; + } + } + + } + + + public Summary[] getSummary(HitDetails[] details, Query query) + throws IOException { + SummaryThread[] threads = new SummaryThread[details.length]; + for (int i = 0; i < threads.length; i++) { + threads[i] = new SummaryThread(details[i], query); + threads[i].start(); + } + + Summary[] results = new Summary[details.length]; + for (int i = 0; i < threads.length; i++) { + try { + threads[i].join(); + } catch (InterruptedException e) { + throw new RuntimeException(e); + } + if (threads[i].throwable instanceof IOException) { + throw (IOException)threads[i].throwable; + } else if (threads[i].throwable != null) { + throw new RuntimeException(threads[i].throwable); + } + results[i] = threads[i].summary; + } + return results; + } + + + private Segment getSegment(HitDetails details) + { + if ( this.perCollection ) + { + LOG.info( "getSegment: " + details ); + LOG.info( " collection: " + details.getValue("collection") ); + LOG.info( " segment : " + details.getValue("segment") ); + + String collectionId = details.getValue("collection"); + String segmentName = details.getValue("segment"); + + Map perCollectionSegments = (Map) this.segments.get( collectionId ); + + Segment segment = (Segment) perCollectionSegments.get( segmentName ); + + if ( segment == null ) + { + LOG.warn( "Didn't find segment: collection=" + collectionId + " segment=" + segmentName ); + } + + return segment; + } + else + { + LOG.info( "getSegment: " + details ); + LOG.info( " segment : " + 
details.getValue("segment") ); + + String segmentName = details.getValue( "segment" ); + Segment segment = (Segment) segments.get( segmentName ); + + if ( segment == null ) + { + LOG.warn( "Didn't find segment: " + segmentName ); + } + + return segment; + } + } + + private Text getUrl(HitDetails details) { + String url = details.getValue("orig"); + if (StringUtils.isBlank(url)) { + url = details.getValue("url"); + } + return new Text(url); + } + + public void close() throws IOException { + Iterator iterator = segments.values().iterator(); + while (iterator.hasNext()) { + ((Segment) iterator.next()).close(); + } + } + +} Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java 2008-12-11 22:58:28 UTC (rev 2660) @@ -0,0 +1,179 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.nutch.searcher; + +import java.io.File; +import java.io.IOException; +import java.util.List; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.io.FloatWritable; +import org.apache.hadoop.io.IntWritable; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.io.WritableComparable; +import org.apache.lucene.document.Document; +import org.apache.lucene.document.Field; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.MultiReader; +import org.apache.lucene.search.FieldCache; +import org.apache.lucene.search.FieldDoc; +import org.apache.lucene.search.ScoreDoc; +import org.apache.lucene.search.TopDocs; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FSDirectory; +import org.apache.nutch.indexer.FsDirectory; +import org.apache.nutch.indexer.NutchSimilarity; + +/** Implements {@link Searcher} and {@link HitDetailer} for either a single + * merged index, or a set of indexes. */ +public class IndexSearcher implements Searcher, HitDetailer { + + private org.apache.lucene.search.Searcher luceneSearcher; + private org.apache.lucene.index.IndexReader reader; + private LuceneQueryOptimizer optimizer; + private FileSystem fs; + private Configuration conf; + private QueryFilters queryFilters; + + /** Construct given a number of indexes. */ + public IndexSearcher(Path[] indexDirs, Configuration conf) throws IOException { + IndexReader[] readers = new IndexReader[indexDirs.length]; + this.conf = conf; + this.fs = FileSystem.get(conf); + for (int i = 0; i < indexDirs.length; i++) { + readers[i] = IndexReader.open(getDirectory(indexDirs[i])); + } + init(new MultiReader(readers), conf); + } + + /** Construct given a single merged index. 
*/ + public IndexSearcher(Path index, Configuration conf) + throws IOException { + this.conf = conf; + this.fs = FileSystem.get(conf); + init(IndexReader.open(getDirectory(index)), conf); + } + + private void init(IndexReader reader, Configuration conf) throws IOException { + this.reader = reader; + this.luceneSearcher = new org.apache.lucene.search.IndexSearcher(reader); + this.luceneSearcher.setSimilarity(new NutchSimilarity()); + this.optimizer = new LuceneQueryOptimizer(conf); + this.queryFilters = new QueryFilters(conf); + } + + private Directory getDirectory(Path file) throws IOException { + if ("file".equals(this.fs.getUri().getScheme())) { + Path qualified = file.makeQualified(FileSystem.getLocal(conf)); + File fsLocal = new File(qualified.toUri()); + return FSDirectory.getDirectory(fsLocal.getAbsolutePath()); + } else { + return new FsDirectory(this.fs, file, false, this.conf); + } + } + + public Hits search(Query query, int numHits, + String dedupField, String sortField, boolean reverse) + + throws IOException { + org.apache.lucene.search.BooleanQuery luceneQuery = + this.queryFilters.filter(query); + + System.out.println( "Nutch query: " + query ); + System.out.println( "Lucene query: " + luceneQuery ); + + return translateHits + (optimizer.optimize(luceneQuery, luceneSearcher, numHits, + sortField, reverse), + dedupField, sortField); + } + + public String getExplanation(Query query, Hit hit) throws IOException { + return luceneSearcher.explain(this.queryFilters.filter(query), + hit.getIndexDocNo()).toHtml(); + } + + public HitDetails getDetails(Hit hit) throws IOException { + + Document doc = luceneSearcher.doc(hit.getIndexDocNo()); + + List docFields = doc.getFields(); + String[] fields = new String[docFields.size()]; + String[] values = new String[docFields.size()]; + for (int i = 0; i < docFields.size(); i++) { + Field field = (Field)docFields.get(i); + fields[i] = field.name(); + values[i] = field.stringValue(); + } + + return new HitDetails(fields, 
values); + } + + public HitDetails[] getDetails(Hit[] hits) throws IOException { + HitDetails[] results = new HitDetails[hits.length]; + for (int i = 0; i < hits.length; i++) + results[i] = getDetails(hits[i]); + return results; + } + + private Hits translateHits(TopDocs topDocs, + String dedupField, String sortField) + throws IOException { + + String[] dedupValues = null; + if (dedupField != null) + dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField); + + ScoreDoc[] scoreDocs = topDocs.scoreDocs; + int length = scoreDocs.length; + Hit[] hits = new Hit[length]; + for (int i = 0; i < length; i++) { + + int doc = scoreDocs[i].doc; + + WritableComparable sortValue; // convert value to writable + if (sortField == null) { + sortValue = new FloatWritable(scoreDocs[i].score); + } else { + Object raw = ((FieldDoc)scoreDocs[i]).fields[0]; + if (raw instanceof Integer) { + sortValue = new IntWritable(((Integer)raw).intValue()); + } else if (raw instanceof Float) { + sortValue = new FloatWritable(((Float)raw).floatValue()); + } else if (raw instanceof String) { + sortValue = new Text((String)raw); + } else { + throw new RuntimeException("Unknown sort value type!"); + } + } + + String dedupValue = dedupValues == null ? 
null : dedupValues[doc]; + + hits[i] = new Hit(doc, sortValue, dedupValue); + } + return new Hits(topDocs.totalHits, hits); + } + + public void close() throws IOException { + if (luceneSearcher != null) { luceneSearcher.close(); } + if (reader != null) { reader.close(); } + } + +} Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java 2008-12-11 22:58:28 UTC (rev 2660) @@ -0,0 +1,333 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.nutch.searcher; + +import java.io.IOException; +import java.net.URLEncoder; +import java.util.Map; +import java.util.HashMap; +import java.util.Set; +import java.util.HashSet; + +import javax.servlet.ServletException; +import javax.servlet.ServletConfig; +import javax.servlet.http.HttpServlet; +import javax.servlet.http.HttpServletRequest; +import javax.servlet.http.HttpServletResponse; + +import javax.xml.parsers.*; + +import org.apache.hadoop.conf.Configuration; +import org.apache.nutch.util.NutchConfiguration; +import org.w3c.dom.*; +import javax.xml.transform.TransformerFactory; +import javax.xml.transform.Transformer; +import javax.xml.transform.dom.DOMSource; +import javax.xml.transform.stream.StreamResult; + + +/** Present search results using A9's OpenSearch extensions to RSS, plus a few + * Nutch-specific extensions. */ +public class OpenSearchServlet extends HttpServlet { + private static final Map NS_MAP = new HashMap(); + private int MAX_HITS_PER_PAGE; + + static { + NS_MAP.put("opensearch", "http://a9.com/-/spec/opensearchrss/1.0/"); + NS_MAP.put("nutch", "http://www.nutch.org/opensearchrss/1.0/"); + } + + private static final Set SKIP_DETAILS = new HashSet(); + static { + SKIP_DETAILS.add("url"); // redundant with RSS link + SKIP_DETAILS.add("title"); // redundant with RSS title + } + + private NutchBean bean; + private Configuration conf; + + public void init(ServletConfig config) throws ServletException { + try { + this.conf = NutchConfiguration.get(config.getServletContext()); + bean = NutchBean.get(config.getServletContext(), this.conf); + } catch (IOException e) { + throw new ServletException(e); + } + MAX_HITS_PER_PAGE = conf.getInt("searcher.max.hits.per.page", -1); + } + + public void doGet(HttpServletRequest request, HttpServletResponse response) + throws ServletException, IOException { + + if (NutchBean.LOG.isInfoEnabled()) { + NutchBean.LOG.info("query request from " + request.getRemoteAddr()); + } + + // get 
parameters from request + request.setCharacterEncoding("UTF-8"); + String queryString = request.getParameter("query"); + if (queryString == null) + queryString = ""; + String urlQuery = URLEncoder.encode(queryString, "UTF-8"); + + // the query language + String queryLang = request.getParameter("lang"); + + int start = 0; // first hit to display + String startString = request.getParameter("start"); + if (startString != null) + start = Integer.parseInt(startString); + + int hitsPerPage = 10; // number of hits to display + String hitsString = request.getParameter("hitsPerPage"); + if (hitsString != null) + hitsPerPage = Integer.parseInt(hitsString); + if(MAX_HITS_PER_PAGE > 0 && hitsPerPage > MAX_HITS_PER_PAGE) + hitsPerPage = MAX_HITS_PER_PAGE; + + String sort = request.getParameter("sort"); + boolean reverse = + sort!=null && "true".equals(request.getParameter("reverse")); + + // De-Duplicate handling. Look for duplicates field and for how many + // duplicates per results to return. Default duplicates field is 'site' + // and duplicates per results default is '2'. + String dedupField = request.getParameter("dedupField"); + if (dedupField == null || dedupField.length() == 0) { + dedupField = "site"; + } + int hitsPerDup = 2; + String hitsPerDupString = request.getParameter("hitsPerDup"); + if (hitsPerDupString != null && hitsPerDupString.length() > 0) { + hitsPerDup = Integer.parseInt(hitsPerDupString); + } else { + // If 'hitsPerSite' present, use that value. + String hitsPerSiteString = request.getParameter("hitsPerSite"); + if (hitsPerSiteString != null && hitsPerSiteString.length() > 0) { + hitsPerDup = Integer.parseInt(hitsPerSiteString); + } + } + + // Make up query string for use later drawing the 'rss' logo. + String params = "&hitsPerPage=" + hitsPerPage + + (queryLang == null ? "" : "&lang=" + queryLang) + + (sort == null ? "" : "&sort=" + sort + (reverse? "&reverse=true": "") + + (dedupField == null ? 
"" : "&dedupField=" + dedupField)); + + Query query = Query.parse(queryString, queryLang, this.conf); + if (NutchBean.LOG.isInfoEnabled()) { + NutchBean.LOG.info("query: " + queryString); + NutchBean.LOG.info("lang: " + queryLang); + } + + // execute the query + Hits hits; + try { + hits = bean.search(query, start + hitsPerPage, hitsPerDup, dedupField, + sort, reverse); + } catch (IOException e) { + if (NutchBean.LOG.isWarnEnabled()) { + NutchBean.LOG.warn("Search Error", e); + } + hits = new Hits(0,new Hit[0]); + } + + if (NutchBean.LOG.isInfoEnabled()) { + NutchBean.LOG.info("total hits: " + hits.getTotal()); + } + + // generate xml results + int end = (int)Math.min(hits.getLength(), start + hitsPerPage); + int length = end-start; + + Hit[] show = hits.getHits(start, end-start); + HitDetails[] details = bean.getDetails(show); + Summary[] summaries = bean.getSummary(details, query); + + String requestUrl = request.getRequestURL().toString(); + String base = requestUrl.substring(0, requestUrl.lastIndexOf('/')); + + + try { + DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); + factory.setNamespaceAware(true); + Document doc = factory.newDocumentBuilder().newDocument(); + + Element rss = addNode(doc, doc, "rss"); + addAttribute(doc, rss, "version", "2.0"); + addAttribute(doc, rss, "xmlns:opensearch", + (String)NS_MAP.get("opensearch")); + addAttribute(doc, rss, "xmlns:nutch", (String)NS_MAP.get("nutch")); + + Element channel = addNode(doc, rss, "channel"); + + addNode(doc, channel, "title", "Nutch: " + queryString); + addNode(doc, channel, "description", "Nutch search results for query: " + + queryString); + addNode(doc, channel, "link", + base+"/search.jsp" + +"?query="+urlQuery + +"&start="+start + +"&hitsPerDup="+hitsPerDup + +params); + + addNode(doc, channel, "opensearch", "totalResults", ""+hits.getTotal()); + addNode(doc, channel, "opensearch", "startIndex", ""+start); + addNode(doc, channel, "opensearch", "itemsPerPage", ""+hitsPerPage); 
+ + addNode(doc, channel, "nutch", "query", queryString); + + + if ((hits.totalIsExact() && end < hits.getTotal()) // more hits to show + || (!hits.totalIsExact() && (hits.getLength() > start+hitsPerPage))){ + addNode(doc, channel, "nutch", "nextPage", requestUrl + +"?query="+urlQuery + +"&start="+end + +"&hitsPerDup="+hitsPerDup + +params); + } + + if ((!hits.totalIsExact() && (hits.getLength() <= start+hitsPerPage))) { + addNode(doc, channel, "nutch", "showAllHits", requestUrl + +"?query="+urlQuery + +"&hitsPerDup="+0 + +params); + } + + for (int i = 0; i < length; i++) { + Hit hit = show[i]; + HitDetails detail = details[i]; + String title = detail.getValue("title"); + String url = detail.getValue("url"); + String id = "idx=" + hit.getIndexNo() + "&id=" + hit.getIndexDocNo(); + + if (title == null || title.equals("")) { // use url for docs w/o title + title = url; + } + + Element item = addNode(doc, channel, "item"); + + addNode(doc, item, "title", title); + if (summaries[i] != null) { + addNode(doc, item, "description", summaries[i].toString() ); + } + addNode(doc, item, "link", url); + + addNode(doc, item, "nutch", "site", hit.getDedupValue()); + + addNode(doc, item, "nutch", "cache", base+"/cached.jsp?"+id); + addNode(doc, item, "nutch", "explain", base+"/explain.jsp?"+id + +"&query="+urlQuery+"&lang="+queryLang); + + if (hit.moreFromDupExcluded()) { + addNode(doc, item, "nutch", "moreFromSite", requestUrl + +"?query=" + +URLEncoder.encode("site:"+hit.getDedupValue() + +" "+queryString, "UTF-8") + +"&hitsPerSite="+0 + +params); + } + + for (int j = 0; j < detail.getLength(); j++) { // add all from detail + String field = detail.getField(j); + if (!SKIP_DETAILS.contains(field)) + addNode(doc, item, "nutch", field, detail.getValue(j)); + } + } + + // dump DOM tree + + DOMSource source = new DOMSource(doc); + TransformerFactory transFactory = TransformerFactory.newInstance(); + Transformer transformer = transFactory.newTransformer(); + 
transformer.setOutputProperty("indent", "yes"); + StreamResult result = new StreamResult(response.getOutputStream()); + response.setContentType("text/xml"); + transformer.transform(source, result); + + } catch (javax.xml.parsers.ParserConfigurationException e) { + throw new ServletException(e); + } catch (javax.xml.transform.TransformerException e) { + throw new ServletException(e); + } + + } + + private static Element addNode(Document doc, Node parent, String name) { + Element child = doc.createElement(name); + parent.appendChild(child); + return child; + } + + private static void addNode(Document doc, Node parent, + String name, String text) { + Element child = doc.createElement(name); + child.appendChild(doc.createTextNode(getLegalXml(text))); + parent.appendChild(child); + } + + private static void addNode(Document doc, Node parent, + String ns, String name, String text) { + Element child = doc.createElementNS((String)NS_MAP.get(ns), ns+":"+name); + child.appendChild(doc.createTextNode(getLegalXml(text))); + parent.appendChild(child); + } + + private static void addAttribute(Document doc, Element node, + String name, String value) { + Attr attribute = doc.createAttribute(name); + attribute.setValue(getLegalXml(value)); + node.getAttributes().setNamedItem(attribute); + } + + /* + * Ensure string is legal xml. + * @param text String to verify. + * @return Passed <code>text</code> or a new string with illegal + * characters removed if any found in <code>text</code>. + * @see http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char + */ + protected static String getLegalXml(final String text) { + if (text == null) { + return null; + } + StringBuffer buffer = null; + for (int i = 0; i < text.length(); i++) { + char c = text.charAt(i); + if (!isLegalXml(c)) { + if (buffer == null) { + // Start up a buffer. Copy characters here from now on + // now we've found at least one bad character in original. 
+ buffer = new StringBuffer(text.length()); + buffer.append(text.substring(0, i)); + } + } else { + if (buffer != null) { + buffer.append(c); + } + } + } + return (buffer != null)? buffer.toString(): text; + } + + private static boolean isLegalXml(final char c) { + return c == 0x9 || c == 0xa || c == 0xd || (c >= 0x20 && c <= 0xd7ff) + || (c >= 0xe000 && c <= 0xfffd) || (c >= 0x10000 && c <= 0x10ffff); + } + +} This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
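The `getLegalXml`/`isLegalXml` pair in the servlet above can be read in isolation as a filter over the XML 1.0 `Char` production. The sketch below restates it as a standalone class (the class name is illustrative, not from the patch). One observation: the original's `(c >= 0x10000 && c <= 0x10ffff)` clause can never match, because a Java `char` is a single 16-bit UTF-16 unit.

```java
// Standalone sketch of the XML-legal-character filtering done by
// getLegalXml/isLegalXml in OpenSearchServlet above (illustrative class name).
public class LegalXmlDemo {

    // True if c is allowed by the XML 1.0 Char production
    // (see http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char).
    static boolean isLegalXml(final char c) {
        return c == 0x9 || c == 0xa || c == 0xd
            || (c >= 0x20 && c <= 0xd7ff)
            || (c >= 0xe000 && c <= 0xfffd);
    }

    // Return text with illegal characters removed. Like the original, the
    // buffer is only allocated once the first bad character is found, so
    // clean strings are returned as-is with no copy.
    static String getLegalXml(final String text) {
        if (text == null) {
            return null;
        }
        StringBuilder buffer = null;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!isLegalXml(c)) {
                if (buffer == null) {
                    buffer = new StringBuilder(text.length());
                    buffer.append(text, 0, i);
                }
            } else if (buffer != null) {
                buffer.append(c);
            }
        }
        return buffer != null ? buffer.toString() : text;
    }

    public static void main(String[] args) {
        System.out.println(getLegalXml("ok\u0000bad")); // prints "okbad"
    }
}
```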
From: <bi...@us...> - 2008-12-11 22:21:49
Revision: 2659 http://archive-access.svn.sourceforge.net/archive-access/?rev=2659&view=rev Author: binzino Date: 2008-12-11 22:21:44 +0000 (Thu, 11 Dec 2008) Log Message: ----------- Added property for per-collection segments. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml Modified: trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml 2008-12-10 05:02:19 UTC (rev 2658) +++ trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml 2008-12-11 22:21:44 UTC (rev 2659) @@ -134,4 +134,14 @@ <value>1048576</value> </property> +<!-- Enable per-collection segment sub-dirs, e.g. + segments/<collectionId>/segment1 + /segment2 + ... + --> +<property> + <name>nutchwax.FetchedSegments.perCollection</name> + <value>true</value> +</property> + </configuration>
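The effect of `nutchwax.FetchedSegments.perCollection` is a layout change, as the comment in the patch shows. The sketch below is not code from the patch; the method and names are assumptions used only to illustrate the resulting segment paths.

```java
// Illustrative sketch (not from the patch): with per-collection segments
// enabled, segment paths gain a <collectionId> component.
public class SegmentPathDemo {

    static String segmentPath(boolean perCollection, String collectionId, String segment) {
        return perCollection
            ? "segments/" + collectionId + "/" + segment
            : "segments/" + segment;
    }

    public static void main(String[] args) {
        // With the property set to true, as in the nutch-site.xml above:
        System.out.println(segmentPath(true, "web-2008", "segment1"));
        // -> segments/web-2008/segment1
    }
}
```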
From: <bi...@us...> - 2008-12-10 05:02:22
Revision: 2658 http://archive-access.svn.sourceforge.net/archive-access/?rev=2658&view=rev Author: binzino Date: 2008-12-10 05:02:19 +0000 (Wed, 10 Dec 2008) Log Message: ----------- Initial revision. Added Paths: ----------- trunk/archive-access/projects/nutchwax/archive/src/etc/ trunk/archive-access/projects/nutchwax/archive/src/etc/init.d/ trunk/archive-access/projects/nutchwax/archive/src/etc/init.d/searcher-slave trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java Added: trunk/archive-access/projects/nutchwax/archive/src/etc/init.d/searcher-slave =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/etc/init.d/searcher-slave (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/etc/init.d/searcher-slave 2008-12-10 05:02:19 UTC (rev 2658) @@ -0,0 +1,63 @@ +#! /bin/sh +# +# ----------------------------------- +# Initscript for NutchWAX searcher slave +# ----------------------------------- + +set -e + +PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin +DESC="NutchWAX searcher slave" +NAME="searcher-slave" + +DAEMON="/3/search/nutchwax-0.12.2/bin/nutch org.archive.nutchwax.DistributedSearch\$Server 9000 /3/search/deploy" +NUTCH_HOME=/3/search/nutchwax-0.12.2 +JAVA_HOME=/usr +export NUTCH_HEAPSIZE=2500 +PIDFILE=/var/run/$NAME.pid +SCRIPTNAME=/etc/init.d/$NAME + +# Gracefully exit if the package has been removed. 
+test -x /usr/bin/java || exit 0 + +# --------------------------------------- +# Function that starts the daemon/service +# --------------------------------------- +d_start() +{ +start-stop-daemon --start -b -m -c webcrawl:webcrawl --pidfile $PIDFILE --exec $DAEMON +} + +# -------------------------------------- +# Function that stops the daemon/service +# -------------------------------------- +d_stop() +{ +start-stop-daemon --stop --pidfile $PIDFILE +} + +case "$1" in +start) +echo -n "Starting $DESC: $NAME" +d_start +echo "." +;; +stop) +echo -n "Stopping $DESC: $NAME" +d_stop +echo "." +;; +restart|force-reload) +echo -n "Restarting $DESC: $NAME" +d_stop +sleep 1 +d_start +echo "." +;; +*) +echo "Usage: $SCRIPTNAME {start|stop|restart|force-reload}" >&2 +exit 1 +;; +esac + +exit 0 Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java 2008-12-10 05:02:19 UTC (rev 2658) @@ -0,0 +1,208 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.archive.nutchwax.tools; + +import java.io.*; +import java.util.*; + +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; + +import org.apache.hadoop.io.*; +import org.apache.hadoop.fs.*; +import org.apache.hadoop.mapred.FileAlreadyExistsException; +import org.apache.hadoop.util.*; +import org.apache.hadoop.conf.*; +import org.apache.hadoop.util.ReflectionUtils; + +import org.apache.nutch.crawl.Inlinks; +import org.apache.nutch.util.HadoopFSUtil; +import org.apache.nutch.util.LogUtil; +import org.apache.nutch.util.NutchConfiguration; + +import org.apache.lucene.store.Directory; +import org.apache.lucene.index.IndexWriter; + +/** + * + */ +public class PageRanker extends Configured implements Tool +{ + public static final Log LOG = LogFactory.getLog(PageRanker.class); + + public static final String DONE_NAME = "merge.done"; + + public PageRanker() { + + } + + public PageRanker(Configuration conf) { + setConf(conf); + } + + /** + * Create an index for the input files in the named directory. + */ + public static void main(String[] args) + throws Exception + { + int res = ToolRunner.run(NutchConfiguration.create(), new PageRanker(), args); + System.exit(res); + } + + /** + * + */ + public int run(String[] args) + throws Exception + { + String usage = "Usage: PageRanker [OPTIONS] outputFile <linkdb|paths>\n" + + "Emit PageRank values for URLs in linkDb(s). Suitable for use with\n" + + "PageRank scoring filter.\n" + + "\n" + + "OPTIONS:\n" + + " -p Use exact path as given, don't assume it's a typical\n" + + " linkdb with \"current/part-nnnnn\" subdirs.\n" + + " -t threshold Do not emit records with less than this many inlinks.\n" + + " Default value 10." 
+ ; + if ( args.length < 1 ) + { + System.err.println( "Usage: " + usage ); + return -1; + } + + boolean exactPath = false; + int threshold = 10; + + int pos = 0; + for ( ; pos < args.length && args[pos].charAt(0) == '-' ; pos++ ) + { + if ( args[pos].equals( "-p" ) ) + { + exactPath = true; + } + if ( args[pos].equals( "-t" ) ) + { + pos++; + if ( args.length - pos < 1 ) + { + System.err.println( "Error: missing argument to -t option" ); + return -1; + } + try + { + threshold = Integer.parseInt( args[pos] ); + } + catch ( NumberFormatException nfe ) + { + System.err.println( "Error: bad value for -t option: " + args[pos] ); + return -1; + } + } + } + + Configuration conf = getConf( ); + FileSystem fs = FileSystem.get( conf ); + + if ( pos >= args.length ) + { + System.err.println( "Error: missing outputFile" ); + return -1; + } + + Path outputPath = new Path( args[pos++] ); + if ( fs.exists( outputPath ) ) + { + System.err.println( "Erorr: outputFile already exists: " + outputPath ); + return -1; + } + + PrintWriter output = new PrintWriter( new OutputStreamWriter( fs.create( outputPath ).getWrappedStream( ), "UTF-8" ) ); + + if ( pos >= args.length ) + { + System.err.println( "Error: missing linkdb" ); + return -1; + } + + List<Path> mapfiles = new ArrayList<Path>(); + + // If we are using exact paths, add each one to the list. + // Otherwise, assume the given path is to a linkdb and look for + // <linkdbPath>/current/part-nnnnn sub-dirs. 
+ if ( exactPath ) + { + for ( ; pos < args.length ; pos++ ) + { + mapfiles.add( new Path( args[pos] ) ); + } + } + else + { + FileStatus[] fstats = fs.listStatus( new Path(args[pos]+"/current"), HadoopFSUtil.getPassDirectoriesFilter(fs)); + mapfiles.addAll(Arrays.asList(HadoopFSUtil.getPaths(fstats))); + } + + System.out.println( "mapfiles = " + mapfiles ); + try + { + for ( Path p : mapfiles ) + { + MapFile.Reader reader = new MapFile.Reader( fs, p.toString(), conf ); + + WritableComparable key = (WritableComparable) ReflectionUtils.newInstance( reader.getKeyClass() , conf ); + Writable value = (Writable) ReflectionUtils.newInstance( reader.getValueClass(), conf ); + + while ( reader.next( key, value ) ) + { + if ( key instanceof Text && value instanceof Inlinks ) + { + Text toUrl = (Text) key; + Inlinks inlinks = (Inlinks) value; + + if ( inlinks.size( ) < threshold ) + { + continue ; + } + + String toUrlString = toUrl.toString( ); + + // HACK: Should make this into some externally configurable regex. + if ( toUrlString.startsWith( "http" ) ) + { + output.println( inlinks.size( ) + " " + toUrl.toString() ); + } + } + } + } + + return 0; + } + catch ( Exception e ) + { + LOG.fatal( "PageRanker: " + StringUtils.stringifyException( e ) ); + return -1; + } + finally + { + output.flush( ); + output.close( ); + } + } +} This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
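The core of PageRanker's output loop above is small: skip records below the `-t` inlink threshold, drop non-http keys (the "HACK" scheme filter in the tool), and emit `<count> <url>` lines. A minimal sketch of that step, with plain collections standing in for the Hadoop MapFile iteration:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of PageRanker's emit step (collection types stand in for
// the MapFile.Reader loop in the real tool).
public class PageRankEmitDemo {

    static List<String> emit(Map<String, Integer> inlinkCounts, int threshold) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : inlinkCounts.entrySet()) {
            if (e.getValue() < threshold) continue;       // below -t threshold
            if (!e.getKey().startsWith("http")) continue; // scheme filter
            out.add(e.getValue() + " " + e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("http://example.org/", 25);
        counts.put("http://example.org/rare", 3);
        for (String line : emit(counts, 10)) {
            System.out.println(line); // prints "25 http://example.org/"
        }
    }
}
```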
Revision: 2657 http://archive-access.svn.sourceforge.net/archive-access/?rev=2657&view=rev Author: binzino Date: 2008-12-10 05:01:14 +0000 (Wed, 10 Dec 2008) Log Message: ----------- Removed use of floor() in calculating the boost multiplier. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/plugin/scoring-nutchwax/src/java/org/archive/nutchwax/scoring/PageRankScoringFilter.java Modified: trunk/archive-access/projects/nutchwax/archive/src/plugin/scoring-nutchwax/src/java/org/archive/nutchwax/scoring/PageRankScoringFilter.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/plugin/scoring-nutchwax/src/java/org/archive/nutchwax/scoring/PageRankScoringFilter.java 2008-12-10 04:59:10 UTC (rev 2656) +++ trunk/archive-access/projects/nutchwax/archive/src/plugin/scoring-nutchwax/src/java/org/archive/nutchwax/scoring/PageRankScoringFilter.java 2008-12-10 05:01:14 UTC (rev 2657) @@ -56,17 +56,14 @@ * </p><p> * Applies a simple log10 multipler to the document score based on the * base-10 log value of the number of inlinks. For example, a page with - * 13,032 inlinks will have a score/boost of 5. The actual formula is + * 13,032 inlinks will have a score/boost of 5.115. The actual formula is * </p> * <code> - * initialScore *= ( floor( log10( # inlinks ) ) + 1 ) + * newScore = initialScore * ( log10( # inlinks ) + 1 ) * </code> * <p> - * We use floor() to get an integer value from the log10() function - * since we're only interested in order of magnitude. We then add 1 - * so that a page with < 10 inlins will have a multipler of 1, and - * thus stay the same, 10-100 gets a multipler of 2, 100-1000 is 3, and - * so forth. + * We add the extra 1 for pages with only 1 inlink since log10(1)=0 and we + * don't want a 0 multiplier.
* </p> * <p> * The number of inlinks for a page is not taken from the <code>inlinks</code> @@ -115,8 +112,6 @@ public void setConf( Configuration conf ) { this.conf = conf; - - //this.ranks = getPageRanks( conf ); } public void injectedScore(Text url, CrawlDatum datum) @@ -181,7 +176,7 @@ return initScore; } - String keyParts[] = key.toString( ).split( "\\s+" ); + String keyParts[] = key.toString( ).split( "\\s+", 2 ); if ( keyParts.length != 2 ) { @@ -201,7 +196,7 @@ return initScore; } - float newScore = initScore * (float) ( Math.floor( Math.log( rank ) ) + 1 ); + float newScore = initScore * (float) ( Math.log( rank ) + 1 ); LOG.info( "PageRankScoringFilter: initScore = " + newScore + " ; key = " + key ); This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
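The documented multiplier from the patch above, as standalone code. One discrepancy worth flagging: the updated javadoc says log10, but the patched line calls `Math.log`, which is the natural log in Java. This sketch follows the documented base-10 form.

```java
// newScore = initScore * (log10(#inlinks) + 1), per the javadoc in the
// commit above (the patched code itself uses Math.log, the natural log).
public class PageRankBoostDemo {

    static float boost(float initScore, long inlinks) {
        return initScore * (float) (Math.log10(inlinks) + 1);
    }

    public static void main(String[] args) {
        // 13,032 inlinks -> multiplier log10(13032) + 1, about 5.115,
        // matching the example in the updated javadoc.
        System.out.printf("%.3f%n", boost(1.0f, 13032));
    }
}
```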
From: <bi...@us...> - 2008-12-10 04:59:14
Revision: 2656 http://archive-access.svn.sourceforge.net/archive-access/?rev=2656&view=rev Author: binzino Date: 2008-12-10 04:59:10 +0000 (Wed, 10 Dec 2008) Log Message: ----------- Fixed bug to pass back return code of invoked command. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/bin/nutchwax Modified: trunk/archive-access/projects/nutchwax/archive/bin/nutchwax =================================================================== --- trunk/archive-access/projects/nutchwax/archive/bin/nutchwax 2008-12-10 04:58:24 UTC (rev 2655) +++ trunk/archive-access/projects/nutchwax/archive/bin/nutchwax 2008-12-10 04:59:10 UTC (rev 2656) @@ -62,4 +62,5 @@ ;; esac -exit 0 +# Return the exit code of the command invoked. +exit $?
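The pattern this commit fixes is easy to reproduce. A wrapper script whose last line is a hard-coded `exit 0` reports success even when the command it invoked failed; ending with `exit $?` immediately after the command propagates the real status. The function names below are illustrative, not from the `nutchwax` script.

```shell
#!/bin/sh
# Sketch of the wrapper bug fixed above (illustrative names).

broken_wrapper() (
  false     # stands in for the invoked nutchwax command failing
  exit 0    # bug: masks the failure
)

fixed_wrapper() (
  false
  exit $?   # fix: return the exit code of the command invoked
)

broken_wrapper && echo "broken wrapper reported success despite the failure"
fixed_wrapper || echo "fixed wrapper propagated failure status $?"
```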