Messages per month:

| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2005 | – | – | – | – | – | – | 1 | 10 | 36 | 339 | 103 | 152 |
| 2006 | 141 | 102 | 125 | 203 | 57 | 30 | 139 | 46 | 64 | 105 | 34 | 162 |
| 2007 | 81 | 57 | 141 | 72 | 9 | 1 | 144 | 88 | 40 | 43 | 34 | 20 |
| 2008 | 44 | 45 | 16 | 36 | 8 | 77 | 177 | 66 | 8 | 33 | 13 | 37 |
| 2009 | 2 | 5 | 8 | – | 36 | 19 | 46 | 8 | 1 | 66 | 61 | 10 |
| 2010 | 13 | 16 | 38 | 76 | 47 | 32 | 35 | 45 | 20 | 61 | 24 | 16 |
| 2011 | 22 | 34 | 11 | 8 | 24 | 23 | 11 | 42 | 81 | 48 | 21 | 20 |
| 2012 | 30 | 25 | 4 | 6 | 1 | 5 | 5 | 8 | 6 | 6 | – | – |
From: <bra...@us...> - 2009-05-20 00:41:51
Revision: 2706
http://archive-access.svn.sourceforge.net/archive-access/?rev=2706&view=rev
Author: bradtofel
Date: 2009-05-20 00:41:15 +0000 (Wed, 20 May 2009)
Log Message:
-----------
TWEAK: added getter for ResultURIConverter
Modified Paths:
--------------
trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/UIResults.java
Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/UIResults.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/UIResults.java 2009-05-20 00:40:05 UTC (rev 2705)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/UIResults.java 2009-05-20 00:41:15 UTC (rev 2706)
@@ -123,8 +123,14 @@
/*
* GENERAL GETTERS:
*/
-
/**
+ * @return the uriConverter
+ */
+ public ResultURIConverter getUriConverter() {
+ return uriConverter;
+ }
+
+ /**
* @return Returns the wbRequest.
*/
public WaybackRequest getWbRequest() {
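For illustration, a minimal sketch (not part of this commit) of how the new getter might be used from custom view code. The helper class is hypothetical, and it assumes ResultURIConverter exposes a makeReplayURI(datespec, url) method for building replay links:

    import org.archive.wayback.ResultURIConverter;
    import org.archive.wayback.core.UIResults;

    // Hypothetical helper: build a replay link for a capture using the
    // converter now exposed by UIResults.getUriConverter().
    public final class ReplayLinks {
        public static String replayLink(UIResults results,
                                        String datespec, String url) {
            ResultURIConverter converter = results.getUriConverter();
            return converter.makeReplayURI(datespec, url);
        }
    }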
From: <bi...@us...> - 2009-05-05 22:18:28
Revision: 2704
http://archive-access.svn.sourceforge.net/archive-access/?rev=2704&view=rev
Author: binzino
Date: 2009-05-05 22:17:48 +0000 (Tue, 05 May 2009)
Log Message:
-----------
Oops, didn't have the updated versions checked-in when I did the
release copy. Fixed.
Added Paths:
-----------
tags/nutchwax-0_12_4/archive/README.txt
tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt
Removed Paths:
-------------
tags/nutchwax-0_12_4/archive/README.txt
tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt
Deleted: tags/nutchwax-0_12_4/archive/README.txt
===================================================================
--- tags/nutchwax-0_12_4/archive/README.txt 2009-05-05 21:46:40 UTC (rev 2703)
+++ tags/nutchwax-0_12_4/archive/README.txt 2009-05-05 22:17:48 UTC (rev 2704)
@@ -1,104 +0,0 @@
-
-README.txt
-2008-03-08
-Aaron Binns
-
-Table of Contents
- o Introduction
- o Build and Install
- o Tutorial
-
-
-======================================================================
-Introduction
-======================================================================
-
-Welcome to NutchWAX 0.12.4!
-
-NutchWAX is a set of add-ons to Nutch in order to index and search
-archived web data.
-
-These add-ons are developed and maintained by the Internet Archive Web
-Team in conjunction with a broad community of contributors, partners
-and end-users.
-
-The name "NutchWAX" stands for "Nutch (W)eb (A)rchive e(X)tensions".
-
-Since NutchWAX is a set of add-ons to Nutch, you should already be
-familiar with Nutch before using NutchWAX.
-
-
-The goal of NutchWAX is to enable full-text indexing and searching of
-documents stored in web archive file formats (ARC and WARC).
-
-The way we achieve that goal is by providing plugins and add-on tools
-to Nutch to read documents directly from ARC/WARC files. We call this
-process "importing" archive files.
-
-Importing produces a Nutch segment, the same as when Nutch is used to
-crawl documents itself. In essence, document importing replaces the
-conventional "generate/fetch/update" cycle of Nutch.
-
-Once the archival documents have been imported into a segment, the
-regular Nutch commands to index the document contents can proceed as
-normal.
-
-======================================================================
-
-The main NutchWAX add-ons are:
-
- bin/nutchwax
-
- A shell script that is used to run the NutchWAX commands, such as
- document importing.
-
- This is patterned after the 'bin/nutch' shell script.
-
- plugins/index-nutchwax
-
- Indexing plugin which adds NutchWAX-specific metadata fields to the
- indexed document.
-
- plugins/query-nutchwax
-
- Query plugin which allows for querying against the metadata fields
- added by 'index-nutchwax'.
-
- plugins/urlfilter-nutchwax
-
- Filtering plugin which can be used to exclude URLs from import. It
- can be used as part of a NutchWAX de-duplication scheme.
-
- plugins/scoring-nutchwax
-
- Scoring plugin for use at index-time which reads from an external
- "pagerank.txt" file for scoring documents based on the log10 of the
- number of inlinks to a document.
-
- The use of this plugin is optional but can improve the quality of
- search results, especially for very large collections.
-
- conf/nutch-site.xml
-
- Additional configuration properties for NutchWAX, including
- over-rides for properties defined in 'nutch-default.xml'
-
-There is no separate 'lib/nutchwax.jar' file for NutchWAX. NutchWAX
-is distributed in source code form and is intended to be built in
-conjunction with Nutch.
-
-
-======================================================================
-Build and Install
-======================================================================
-
-See "INSTALL.txt" for detailed instructions to build NutchWAX from
-source or install a binary package.
-
-
-======================================================================
-Tutorial
-======================================================================
-
-See "HOWTO.txt" for a quick tutorial on importing, indexing and
-searching a set of documents in a web archive file.
Copied: tags/nutchwax-0_12_4/archive/README.txt (from rev 2703, trunk/archive-access/projects/nutchwax/archive/README.txt)
===================================================================
--- tags/nutchwax-0_12_4/archive/README.txt (rev 0)
+++ tags/nutchwax-0_12_4/archive/README.txt 2009-05-05 22:17:48 UTC (rev 2704)
@@ -0,0 +1,104 @@
+
+README.txt
+2009-05-05
+Aaron Binns
+
+Table of Contents
+ o Introduction
+ o Build and Install
+ o Tutorial
+
+
+======================================================================
+Introduction
+======================================================================
+
+Welcome to NutchWAX 0.12.4!
+
+NutchWAX is a set of add-ons to Nutch in order to index and search
+archived web data.
+
+These add-ons are developed and maintained by the Internet Archive Web
+Team in conjunction with a broad community of contributors, partners
+and end-users.
+
+The name "NutchWAX" stands for "Nutch (W)eb (A)rchive e(X)tensions".
+
+Since NutchWAX is a set of add-ons to Nutch, you should already be
+familiar with Nutch before using NutchWAX.
+
+
+The goal of NutchWAX is to enable full-text indexing and searching of
+documents stored in web archive file formats (ARC and WARC).
+
+The way we achieve that goal is by providing plugins and add-on tools
+to Nutch to read documents directly from ARC/WARC files. We call this
+process "importing" archive files.
+
+Importing produces a Nutch segment, the same as when Nutch is used to
+crawl documents itself. In essence, document importing replaces the
+conventional "generate/fetch/update" cycle of Nutch.
+
+Once the archival documents have been imported into a segment, the
+regular Nutch commands to index the document contents can proceed as
+normal.
+
+======================================================================
+
+The main NutchWAX add-ons are:
+
+ bin/nutchwax
+
+ A shell script that is used to run the NutchWAX commands, such as
+ document importing.
+
+ This is patterned after the 'bin/nutch' shell script.
+
+ plugins/index-nutchwax
+
+ Indexing plugin which adds NutchWAX-specific metadata fields to the
+ indexed document.
+
+ plugins/query-nutchwax
+
+ Query plugin which allows for querying against the metadata fields
+ added by 'index-nutchwax'.
+
+ plugins/urlfilter-nutchwax
+
+ Filtering plugin which can be used to exclude URLs from import. It
+ can be used as part of a NutchWAX de-duplication scheme.
+
+ plugins/scoring-nutchwax
+
+ Scoring plugin for use at index-time which reads from an external
+ "pagerank.txt" file for scoring documents based on the log10 of the
+ number of inlinks to a document.
+
+ The use of this plugin is optional but can improve the quality of
+ search results, especially for very large collections.
+
+ conf/nutch-site.xml
+
+ Additional configuration properties for NutchWAX, including
+ over-rides for properties defined in 'nutch-default.xml'
+
+There is no separate 'lib/nutchwax.jar' file for NutchWAX. NutchWAX
+is distributed in source code form and is intended to be built in
+conjunction with Nutch.
+
+
+======================================================================
+Build and Install
+======================================================================
+
+See "INSTALL.txt" for detailed instructions to build NutchWAX from
+source or install a binary package.
+
+
+======================================================================
+Tutorial
+======================================================================
+
+See "HOWTO.txt" for a quick tutorial on importing, indexing and
+searching a set of documents in a web archive file.
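The import workflow the README describes can also be driven programmatically. A minimal sketch, assuming Importer follows the same Hadoop Tool pattern as the other NutchWAX tools; the driver class and manifest path are placeholders:

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.nutch.util.NutchConfiguration;
    import org.archive.nutchwax.Importer;

    // Hypothetical driver: equivalent to "bin/nutchwax import manifest".
    public class ImportDriver {
        public static void main(String[] args) throws Exception {
            int rc = ToolRunner.run(NutchConfiguration.create(),
                                    new Importer(),
                                    new String[] { "manifest" });
            System.exit(rc);
        }
    }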
Deleted: tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt
===================================================================
--- tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt 2009-05-05 21:46:40 UTC (rev 2703)
+++ tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt 2009-05-05 22:17:48 UTC (rev 2704)
@@ -1,58 +0,0 @@
-
-RELEASE-NOTES.TXT
-2008-03-08
-Aaron Binns
-
-Release notes for NutchWAX 0.12.4
-
-For the most recent updates and information on NutchWAX,
-please visit the project wiki at:
-
- http://webteam.archive.org/confluence/display/search/NutchWAX
-
-
-======================================================================
-Overview
-======================================================================
-
-NutchWAX 0.12.4 contains numerous enhancements and fixes to 0.12.3
-
- o Option to omit storing of content during import.
- o Support for per-collection segments in master/slave config.
- o Additional diagnostic/log messages to help troubleshoot common
- deployment mistakes.
- o PageRankDb similar to LinkDb but only keeping inlink counts.
- o Improved paging through results, handling "paging past the end".
-
-
-======================================================================
-Issues
-======================================================================
-
-For an up-to-date list of NutchWAX issues:
-
- http://webteam.archive.org/jira/browse/WAX
-
-Issues resolved in this release:
-
-WAX-27 Sensible output for requesting page of results past the end.
-
-WAX-34 Add option to omit storing of content in segment
-
-WAX-35 Add pagerankdb similar to linkdb but which only keeps counts
- rather than actual inlinks.
-
-WAX-36 Some additional diagnostics on connecting results to segments
- and snippets would be very helpful.
-
-WAX-37 Per-collection segments not supported in distributed
- master-slave configuration.
-
-WAX-38 Build omits neessary libraries from .job file.
-
-WAX-39 Write more efficient, specialized segment parse_text merging.
-
-
-
-
-
Copied: tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt (from rev 2703, trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt)
===================================================================
--- tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt (rev 0)
+++ tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt 2009-05-05 22:17:48 UTC (rev 2704)
@@ -0,0 +1,57 @@
+
+RELEASE-NOTES.TXT
+2009-05-05
+Aaron Binns
+
+Release notes for NutchWAX 0.12.4
+
+For the most recent updates and information on NutchWAX,
+please visit the project wiki at:
+
+ http://webteam.archive.org/confluence/display/search/NutchWAX
+
+
+======================================================================
+Overview
+======================================================================
+
+NutchWAX 0.12.4 contains numerous enhancements and fixes to 0.12.3
+
+ o Option to omit storing of content during import.
+ o Support for per-collection segments in master/slave config.
+ o Additional diagnostic/log messages to help troubleshoot common
+ deployment mistakes.
+ o PageRankDb similar to LinkDb but only keeping inlink counts.
+ o Improved paging through results, handling "paging past the end".
+
+
+======================================================================
+Issues
+======================================================================
+
+For an up-to-date list of NutchWAX issues:
+
+ http://webteam.archive.org/jira/browse/WAX
+
+Issues resolved in this release:
+
+WAX-27 Sensible output for requesting page of results past the end.
+
+WAX-34 Add option to omit storing of content in segment
+
+WAX-35 Add pagerankdb similar to linkdb but which only keeps counts
+ rather than actual inlinks.
+
+WAX-36 Some additional diagnostics on connecting results to segments
+ and snippets would be very helpful.
+
+WAX-37 Per-collection segments not supported in distributed
+ master-slave configuration.
+
+WAX-38 Build omits neessary libraries from .job file.
+
+WAX-39 Write more efficient, specialized segment parse_text merging.
+
+WAX-41 Option to enable/disable the FIELDCACHE in the Nutch IndexSearcher
+
+WAX-42 Add option to continue importing if an arcfile cannot be read.
From: <bi...@us...> - 2009-05-05 21:46:49
Revision: 2703
http://archive-access.svn.sourceforge.net/archive-access/?rev=2703&view=rev
Author: binzino
Date: 2009-05-05 21:46:40 +0000 (Tue, 05 May 2009)
Log Message:
-----------
Updated for NutchWAX 0.12.4 release.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/README.txt
trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt
Modified: trunk/archive-access/projects/nutchwax/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README.txt 2009-05-05 21:44:29 UTC (rev 2702)
+++ trunk/archive-access/projects/nutchwax/archive/README.txt 2009-05-05 21:46:40 UTC (rev 2703)
@@ -1,6 +1,6 @@
README.txt
-2008-03-08
+2009-05-05
Aaron Binns
Table of Contents
Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2009-05-05 21:44:29 UTC (rev 2702)
+++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2009-05-05 21:46:40 UTC (rev 2703)
@@ -1,6 +1,6 @@
RELEASE-NOTES.TXT
-2008-03-08
+2009-05-05
Aaron Binns
Release notes for NutchWAX 0.12.4
@@ -52,7 +52,6 @@
WAX-39 Write more efficient, specialized segment parse_text merging.
+WAX-41 Option to enable/disable the FIELDCACHE in the Nutch IndexSearcher
-
-
-
+WAX-42 Add option to continue importing if an arcfile cannot be read.
From: <bi...@us...> - 2009-05-05 21:44:37
Revision: 2702
http://archive-access.svn.sourceforge.net/archive-access/?rev=2702&view=rev
Author: binzino
Date: 2009-05-05 21:44:29 +0000 (Tue, 05 May 2009)
Log Message:
-----------
NutchWAX 0.12.4 release.
Added Paths:
-----------
tags/nutchwax-0_12_4/
tags/nutchwax-0_12_4/archive/
From: <bi...@us...> - 2009-05-05 21:15:55
Revision: 2701
http://archive-access.svn.sourceforge.net/archive-access/?rev=2701&view=rev
Author: binzino
Date: 2009-05-05 21:15:28 +0000 (Tue, 05 May 2009)
Log Message:
-----------
Changed the default location to look for search.xsl. It will likely need editing post-deployment, however.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml
Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml 2009-05-05 21:14:39 UTC (rev 2700)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml 2009-05-05 21:15:28 UTC (rev 2701)
@@ -59,7 +59,7 @@
<filter-class>org.archive.nutchwax.XSLTFilter</filter-class>
<init-param>
<param-name>xsltUrl</param-name>
- <param-value>style/search.xsl</param-value>
+ <param-value>webapps/nutchwax-0.12.4/search.xsl</param-value>
</init-param>
</filter>
From: <bi...@us...> - 2009-05-05 21:15:47
Revision: 2700
http://archive-access.svn.sourceforge.net/archive-access/?rev=2700&view=rev
Author: binzino
Date: 2009-05-05 21:14:39 +0000 (Tue, 05 May 2009)
Log Message:
-----------
Fix typo
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt
Modified: trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt 2009-05-05 20:24:22 UTC (rev 2699)
+++ trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt 2009-05-05 21:14:39 UTC (rev 2700)
@@ -222,7 +222,7 @@
<name>nutchwax.filter.index</name>
<value>
url:false:true:true
- url:flase:true:false:true:exacturl
+ url:false:true:false:true:exacturl
orig:false
digest:false
filename:false
From: <bi...@us...> - 2009-05-05 20:24:28
Revision: 2699
http://archive-access.svn.sourceforge.net/archive-access/?rev=2699&view=rev
Author: binzino
Date: 2009-05-05 20:24:22 +0000 (Tue, 05 May 2009)
Log Message:
-----------
WAX-42. Add option to continue/abort importing after read error on
archive file.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java
Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java 2009-05-05 20:20:45 UTC (rev 2698)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java 2009-05-05 20:24:22 UTC (rev 2699)
@@ -210,6 +210,15 @@
reporter.progress();
}
}
+ catch ( Exception e )
+ {
+ LOG.warn( "Error processing archive file: " + arcUrl, e );
+
+ if ( jobConf.getBoolean( "nutchwax.import.abortOnArchiveReadError", false ) )
+ {
+ throw new IOException( e );
+ }
+ }
finally
{
r.close();
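The new behavior is opt-in. A minimal sketch of enabling it before submitting an import job; only the property name comes from this patch, and the JobConf wiring around it is illustrative:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.nutch.util.NutchConfiguration;

    public class AbortOnReadError {
        public static void main(String[] args) {
            JobConf job = new JobConf(NutchConfiguration.create());
            // Default is false: log a warning and continue importing.
            job.setBoolean("nutchwax.import.abortOnArchiveReadError", true);
            System.out.println(
                job.getBoolean("nutchwax.import.abortOnArchiveReadError",
                               false));
        }
    }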
From: <bi...@us...> - 2009-05-05 20:20:48
Revision: 2698
http://archive-access.svn.sourceforge.net/archive-access/?rev=2698&view=rev
Author: binzino
Date: 2009-05-05 20:20:45 +0000 (Tue, 05 May 2009)
Log Message:
-----------
Fixed typo.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2009-05-05 19:24:16 UTC (rev 2697)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2009-05-05 20:20:45 UTC (rev 2698)
@@ -186,7 +186,7 @@
<property>
<name>searcher.fieldcache</name>
- <property>true</property>
+ <value>true</value>
</property>
</configuration>
From: <bi...@us...> - 2009-05-05 19:25:06
Revision: 2697
http://archive-access.svn.sourceforge.net/archive-access/?rev=2697&view=rev
Author: binzino
Date: 2009-05-05 19:24:16 +0000 (Tue, 05 May 2009)
Log Message:
-----------
Fix WAX-41. Added option to use fieldcache or not when handling
searches using 'dedup' feature.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java
Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2009-05-05 17:52:47 UTC (rev 2696)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2009-05-05 19:24:16 UTC (rev 2697)
@@ -184,4 +184,9 @@
<value>80</value>
</property>
+<property>
+ <name>searcher.fieldcache</name>
+ <property>true</property>
+</property>
+
</configuration>
Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java 2009-05-05 17:52:47 UTC (rev 2696)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java 2009-05-05 19:24:16 UTC (rev 2697)
@@ -136,9 +136,9 @@
private Hits translateHits(TopDocs topDocs,
String dedupField, String sortField)
throws IOException {
-
+
String[] dedupValues = null;
- if (dedupField != null)
+ if (dedupField != null && this.conf.getBoolean( "searcher.fieldcache", true ) )
dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);
ScoreDoc[] scoreDocs = topDocs.scoreDocs;
@@ -164,7 +164,33 @@
}
}
- String dedupValue = dedupValues == null ? null : dedupValues[doc];
+ String dedupValue = "";
+ if ( dedupValues != null )
+ {
+ dedupValue = dedupValues[doc];
+ }
+ else
+ {
+ if ( "site".equals( dedupField ) )
+ {
+ String exactUrl = reader.document( doc ).get( "exacturl");
+ try
+ {
+ java.net.URL u = new java.net.URL( exactUrl );
+ dedupValue = u.getHost();
+
+ System.out.println("Dedup value hack:" + dedupValue);
+ }
+ catch ( java.net.MalformedURLException e )
+ {
+ // Eat it.
+ }
+ }
+ else
+ {
+ dedupValue = reader.document( doc ).get( dedupField );
+ }
+ }
hits[i] = new Hit(doc, sortValue, dedupValue);
}
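Note that the nutch-site.xml hunk above writes <property>true</property> where <value>true</value> was intended; r2698, earlier in this digest, corrects it. A minimal sketch of the toggle that translateHits() reads (property name and default are taken from the patch; the setup code is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class FieldCacheToggle {
        public static void main(String[] args) {
            Configuration conf = NutchConfiguration.create();
            // false: dedup values are read per document from stored
            // fields (less heap, more per-hit I/O). Default is true.
            conf.setBoolean("searcher.fieldcache", false);
            System.out.println(conf.getBoolean("searcher.fieldcache", true));
        }
    }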
From: <bi...@us...> - 2009-05-05 17:53:28
Revision: 2696
http://archive-access.svn.sourceforge.net/archive-access/?rev=2696&view=rev
Author: binzino
Date: 2009-05-05 17:52:47 +0000 (Tue, 05 May 2009)
Log Message:
-----------
Fix typo.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2009-05-05 17:52:20 UTC (rev 2695)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2009-05-05 17:52:47 UTC (rev 2696)
@@ -44,7 +44,7 @@
<name>nutchwax.filter.index</name>
<value>
url:false:true:true
- url:flase:true:false:true:exacturl
+ url:false:true:false:true:exacturl
orig:false
digest:false
filename:false
From: <bi...@us...> - 2009-05-05 17:53:03
Revision: 2695
http://archive-access.svn.sourceforge.net/archive-access/?rev=2695&view=rev
Author: binzino
Date: 2009-05-05 17:52:20 +0000 (Tue, 05 May 2009)
Log Message:
-----------
Fix typo
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/java/org/apache/lucene/index/ArchiveParallelReader.java
Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/apache/lucene/index/ArchiveParallelReader.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/apache/lucene/index/ArchiveParallelReader.java 2009-03-08 22:59:46 UTC (rev 2694)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/apache/lucene/index/ArchiveParallelReader.java 2009-05-05 17:52:20 UTC (rev 2695)
@@ -248,7 +248,7 @@
* searching behavior where a field is only searched in the first
* index that has the field.</p>
* <p>This differs from the bundled Lucene <code>ParallelReader</code>,
- * which adds all vales from every index that has the field.</p>
+ * which adds all values from every index that has the field.</p>
* <p>The <code>fieldSelector<code> parameter is ignored.</p>
* <h3>Implementation Notes</h3>
* <p>Since getting the document from the reader is the expensive
From: <bi...@us...> - 2009-03-08 22:59:48
Revision: 2694
http://archive-access.svn.sourceforge.net/archive-access/?rev=2694&view=rev
Author: binzino
Date: 2009-03-08 22:59:46 +0000 (Sun, 08 Mar 2009)
Log Message:
-----------
Added commands to drive recently added tools.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/bin/nutchwax
Modified: trunk/archive-access/projects/nutchwax/archive/bin/nutchwax
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/bin/nutchwax 2009-03-08 21:43:33 UTC (rev 2693)
+++ trunk/archive-access/projects/nutchwax/archive/bin/nutchwax 2009-03-08 22:59:46 UTC (rev 2694)
@@ -42,6 +42,14 @@
shift
${NUTCH_HOME}/bin/nutch org.archive.nutchwax.Importer $@
;;
+ pagerankdb)
+ shift
+ ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.PageRankDb $@
+ ;;
+ pagerankdbmerger)
+ shift
+ ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.PageRankDbMerger $@
+ ;;
add-dates)
shift
${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.DateAdder $@
@@ -50,18 +58,25 @@
shift
${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.DumpParallelIndex $@
;;
- pagerank)
+ pageranker)
shift
${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.PageRanker $@
;;
+ parsetextmerger)
+ shift
+ ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.ParseTextCombiner $@
+ ;;
*)
echo ""
echo "Usage: nutchwax COMMAND"
echo "where COMMAND is one of:"
- echo " import Import ARCs into a new Nutch segment"
- echo " add-dates Add dates to a parallel index"
- echo " dumpindex Dump an index or set of parallel indices to stdout"
- echo " pagerank Generate pagerank file for URLs in a 'linkdb'."
+ echo " import Import ARCs into a new Nutch segment"
+ echo " pagerankdb Generate pagerankdb for a segment"
+ echo " pagerankdbmerger Merge multiple pagerankdbs"
+ echo " pageranker Generate pagerank.txt file from 'pagerankdb's or 'linkdb's"
+ echo " parsetextmerger Merge segement parse_text/part-nnnnn directories."
+ echo " add-dates Add dates to a parallel index"
+ echo " dumpindex Dump an index or set of parallel indices to stdout"
echo ""
exit 1
;;
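With this patch applied, invoking the script without a recognized command prints the expanded usage summary, e.g.:

    $ ${NUTCH_HOME}/bin/nutchwax

    Usage: nutchwax COMMAND
    where COMMAND is one of:
      import            Import ARCs into a new Nutch segment
      pagerankdb        Generate pagerankdb for a segment
      pagerankdbmerger  Merge multiple pagerankdbs
      pageranker        Generate pagerank.txt file from 'pagerankdb's or 'linkdb's
      parsetextmerger   Merge segment parse_text/part-nnnnn directories.
      add-dates         Add dates to a parallel index
      dumpindex         Dump an index or set of parallel indices to stdout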
From: <bi...@us...> - 2009-03-08 21:43:45
Revision: 2693
http://archive-access.svn.sourceforge.net/archive-access/?rev=2693&view=rev
Author: binzino
Date: 2009-03-08 21:43:33 +0000 (Sun, 08 Mar 2009)
Log Message:
-----------
Updated documentation for 0.12.4 release.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt
trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
trunk/archive-access/projects/nutchwax/archive/README.txt
trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt
Modified: trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt 2009-03-08 20:44:25 UTC (rev 2692)
+++ trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt 2009-03-08 21:43:33 UTC (rev 2693)
@@ -79,7 +79,7 @@
----------------------------------------------------------------------
The file
- /opt/nutchwax-0.12.3/conf/tika-mimetypes.xml
+ /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml
contains two errors: one where a mimetype is referenced before it is
defined; and a second where a definition has an illegal character.
@@ -110,11 +110,11 @@
You can either apply these patches yourself, or copy an already-patched
copy from:
- /opt/nutchwax-0.12.3/contrib/archive/conf/tika-mimetypes.xml
+ /opt/nutchwax-0.12.4/contrib/archive/conf/tika-mimetypes.xml
to
- /opt/nutchwax-0.12.3/conf/tika-mimetypes.xml
+ /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml
----------------------------------------------------------------------
@@ -166,7 +166,6 @@
--------------------------------------------------
indexingfilter.order
--------------------------------------------------
-
Add this property with a value of
org.apache.nutch.indexer.basic.BasicIndexingFilter
@@ -300,7 +299,6 @@
--------------------------------------------------
nutchwax.urlfilter.wayback.canonicalizer
--------------------------------------------------
-
For CDX-based de-duplication, the same URL canonicalization algorithm
must be used here as was used to generate the CDX files.
@@ -390,3 +388,43 @@
capacity of the computers performing the import. Something in the
1-4MB range is typical.
+--------------------------------------------------
+nutchwax.FetchedSegments.perCollection
+--------------------------------------------------
+Enable per-collection segment sub-dirs, e.g.
+
+ segments/<collectionId>/segment1
+ /segment2
+ ...
+
+Default value: false
+
+For example,
+
+ <property>
+ <name>nutchwax.FetchedSegments.perCollection</name>
+ <value>true</value>
+ </property>
+
+--------------------------------------------------
+nutchwax.import.content.store
+--------------------------------------------------
+Whether or not we store the full content in the segment's "content"
+directory. Most NutchWAX users are also using Wayback to serve the
+archived content, so there's no need for NutchWAX to keep a "cached"
+copy as well.
+
+Setting to 'true' yields the same bahavior as in previous versions of
+NutchWAX, and as in Nutch. The content is stored in the segment's
+"content" directory.
+
+Setting to 'false' results in an empty "content" directory in the
+segment. The content is not stored.
+
+Default value is 'false'.
+
+ <property>
+ <name>nutchwax.import.store.content</name>
+ <value>false</value>
+ </property>
+
Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2009-03-08 20:44:25 UTC (rev 2692)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2009-03-08 21:43:33 UTC (rev 2693)
@@ -26,7 +26,7 @@
This HOWTO assumes it is installed in
- /opt/nutchwax-0.12.3
+ /opt/nutchwax-0.12.4
2. ARC/WARC files.
@@ -68,10 +68,10 @@
$ mkdir crawl
$ cd crawl
- $ /opt/nutchwax-0.12.3/bin/nutchwax import ../manifest
- $ /opt/nutchwax-0.12.3/bin/nutch updatedb crawldb -dir segments
- $ /opt/nutchwax-0.12.3/bin/nutch invertlinks linkdb -dir segments
- $ /opt/nutchwax-0.12.3/bin/nutch index indexes crawldb linkdb segments/*
+ $ /opt/nutchwax-0.12.4/bin/nutchwax import ../manifest
+ $ /opt/nutchwax-0.12.4/bin/nutch updatedb crawldb -dir segments
+ $ /opt/nutchwax-0.12.4/bin/nutch invertlinks linkdb -dir segments
+ $ /opt/nutchwax-0.12.4/bin/nutch index indexes crawldb linkdb segments/*
$ ls -F1
crawldb/
indexes/
@@ -96,7 +96,7 @@
$ cd ../
$ ls -F1
crawl/
- $ /opt/nutchwax-0.12.3/bin/nutch org.apache.nutch.searcher.NutchBean computer
+ $ /opt/nutchwax-0.12.4/bin/nutch org.apache.nutch.searcher.NutchBean computer
This calls the NutchBean to execute a simple keyword search for
"computer". Use whatever query term you think appears in the
@@ -109,7 +109,7 @@
The Nutch(WAX) web application is bundled with NutchWAX as
- /opt/nutchwax-0.12.3/nutch-1.0-dev.war
+ /opt/nutchwax-0.12.4/nutch-1.0-dev.war
Simply deploy that web application in the same fashion as with
Nutch.
Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2009-03-08 20:44:25 UTC (rev 2692)
+++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2009-03-08 21:43:33 UTC (rev 2693)
@@ -1,6 +1,6 @@
INSTALL.txt
-2008-12-18
+2009-03-08
Aaron Binns
Table of Contents
@@ -10,6 +10,7 @@
- SVN: NutchWAX
- Build and Install
o Install binary package
+ o Install start-up scripts
======================================================================
@@ -62,7 +63,7 @@
------------------
As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
Nutch doesn't have a 1.0 release package yet, so we have to use the
-Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.3 is
+Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.4 is
built against is:
701524
@@ -78,14 +79,14 @@
SVN: NutchWAX
-------------
-Once you have Nutch-1.0-dev checked-out, check-out NutchWAX into
-Nutch's "contrib" directory.
+Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.4
+source into Nutch's "contrib" directory.
$ cd contrib
- $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/nutchwax/archive
+ $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_4/archive
This will create a sub-directory named "archive" containing the
-NutchWAX sources.
+NutchWAX 0.12.4 sources.
Build and install
-----------------
@@ -112,7 +113,7 @@
$ cd /opt
$ tar xvfz nutch-1.0-dev.tar.gz
- $ mv nutch-1.0-dev nutchwax-0.12.3
+ $ mv nutch-1.0-dev nutchwax-0.12.4
======================================================================
@@ -125,5 +126,50 @@
Install it simply by untarring it, for example:
$ cd /opt
- $ tar xvfz nutchwax-0.12.3.tar.gz
+ $ tar xvfz nutchwax-0.12.4.tar.gz
+
+======================================================================
+Install start-up scripts
+======================================================================
+
+NutchWAX 0.12.4 comes with a Unix init.d script which can be used to
+automatically start the searcher slaves for a multi-node search
+configuration.
+
+Assuming you installed NutchWAX as
+
+ /opt/nutchwax-0.12.4
+
+the script is found at
+
+ /opt/nutchwax-0.12.4/contrib/archive/etc/init.d/searcher-slave
+
+This script can be placed in /etc/init.d then added to the list of
+startup scripts to run at bootup by using commands appropriate to your
+Linux distribution.
+
+You must edit a few of the environment variables defined in the
+'searcher-slave' specifying where NutchWAX is installed and where the
+index(s) are deployed. In 'searcher-slave' you will find the:
+
+ export NUTCH_HOME=TODO
+ export DEPLOYMENT_DIR=TODO
+
+edit those appropriately for your system.
+
+
+The "master" in the multi-node search deployment is the NutchWAX
+webapp running in a webapp server, such as Tomcat or Jetty.
+
+Jetty comes with a start/stop script appropriate for use as an init.d
+script, similar to the 'searcher-slave' script described above. If you
+use Jetty, create a symlink
+
+ /etc/init.d/jetty.sh -> /opt/jetty/bin/jetty.sh
+
+Then add this script to the list of startup scripts to run at bootup
+by using commands appropriate to your Linux distribution.
+
+Follow the instructions from Jetty on the deployment of the NutchWAX
+webapp (nutch-1.0-dev.war) in the Jetty web application server.
Modified: trunk/archive-access/projects/nutchwax/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README.txt 2009-03-08 20:44:25 UTC (rev 2692)
+++ trunk/archive-access/projects/nutchwax/archive/README.txt 2009-03-08 21:43:33 UTC (rev 2693)
@@ -1,6 +1,6 @@
README.txt
-2008-12-18
+2008-03-08
Aaron Binns
Table of Contents
@@ -13,7 +13,7 @@
Introduction
======================================================================
-Welcome to NutchWAX 0.12.3!
+Welcome to NutchWAX 0.12.4!
NutchWAX is a set of add-ons to Nutch in order to index and search
archived web data.
Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2009-03-08 20:44:25 UTC (rev 2692)
+++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2009-03-08 21:43:33 UTC (rev 2693)
@@ -1,9 +1,9 @@
RELEASE-NOTES.TXT
-2008-12-18
+2008-03-08
Aaron Binns
-Release notes for NutchWAX 0.12.3
+Release notes for NutchWAX 0.12.4
For the most recent updates and information on NutchWAX,
please visit the project wiki at:
@@ -15,61 +15,44 @@
Overview
======================================================================
-NutchWAX 0.12.3 contains numerous enhancements and fixes to 0.12.2
+NutchWAX 0.12.4 contains numerous enhancements and fixes to 0.12.3
- o PageRank calculation and scoring
- o Enhanced OpenSearchServlet
- o Improved XSLT sample for OpenSearch
- o System init.d script for searcher slaves
- o Enhanced searcher slave which supports NutchWAX extensions
+ o Option to omit storing of content during import.
+ o Support for per-collection segments in master/slave config.
+ o Additional diagnostic/log messages to help troubleshoot common
+ deployment mistakes.
+ o PageRankDb similar to LinkDb but only keeping inlink counts.
+ o Improved paging through results, handling "paging past the end".
-One of the major changes to 0.12.3 is not a feature, enhancement or
-bug-fix, but the way the NutchWAX source is "integrated" into the
-Nutch source.
+======================================================================
+Issues
+======================================================================
-Yes, the NutchWAX source is still kept in the contrib/archive
-sub-directory, but when you invoke a build command from the
-NutchWAX directory, such as
+For an up-to-date list of NutchWAX issues:
- $ cd nutch/contrib/archive
- $ ant tar
+ http://webteam.archive.org/jira/browse/WAX
-Many files from the NutchWAX source tree are copied directly into the
-Nutch source tree before the build process begins.
+Issues resolved in this release:
-The reason for this is to make NutchWAX easier to use.
+WAX-27 Sensible output for requesting page of results past the end.
-In previous versions of NutchWAX, once 'ant' build command was
-finished, the operator had to manually patch configuration files in
-the Nutch directory. Upon a subsequent build, the files would be
-over-written by Nutch's and would have to be patched again.
+WAX-34 Add option to omit storing of content in segment
-It was a major hassle and complication.
+WAX-35 Add pagerankdb similar to linkdb but which only keeps counts
+ rather than actual inlinks.
-Another impetus for copying files into the Nutch source was to patch
-bugs and make enhancements in the Nutch Java code which couldn't be
-effectively done keeping the sources separate. When an 'ant' build
-command is run a few Java files are copied from the NutchWAX source
-tree into the Nutch source tree.
+WAX-36 Some additional diagnostics on connecting results to segments
+ and snippets would be very helpful.
-In release 0.12.3, the NutchWAX build file: 'build.xml' handles all of
-this. Simply execute your build commands from 'contrib/archive' as
-instructed in the HOWTO and no longer worry about patching
-configuration files. If you wish to alter the NutchWAX configuration
-file, make those changes in the NutchWAX source tree.
+WAX-37 Per-collection segments not supported in distributed
+ master-slave configuration.
+WAX-38 Build omits neessary libraries from .job file.
-======================================================================
-Issues
-======================================================================
+WAX-39 Write more efficient, specialized segment parse_text merging.
-For an up-to-date list of NutchWAX issues:
- http://webteam.archive.org/jira/browse/WAX
-Issues resolved in this release:
-WAX-26
- Add XML elements containing all search URL params for self-link
- generation
+
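One inconsistency in the BUILD-NOTES hunk above is worth flagging: the section heading reads "nutchwax.import.content.store" while the example property (and r2689's nutch-site.xml, later in this digest) uses "nutchwax.import.store.content". A minimal sketch of both import-time switches at their documented defaults (the wiring is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class ImportSwitches {
        public static void main(String[] args) {
            Configuration conf = NutchConfiguration.create();
            // Skip storing cached content in the segment (Wayback
            // usually serves the archived content instead).
            conf.setBoolean("nutchwax.import.store.content", false);
            // Per-collection segment sub-dirs, e.g.
            // segments/<collectionId>/segment1
            conf.setBoolean("nutchwax.FetchedSegments.perCollection", false);
        }
    }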
From: <bi...@us...> - 2009-03-08 20:44:50
Revision: 2692
http://archive-access.svn.sourceforge.net/archive-access/?rev=2692&view=rev
Author: binzino
Date: 2009-03-08 20:44:25 +0000 (Sun, 08 Mar 2009)
Log Message:
-----------
First cut. Works, but isn't the prettiest code I've ever written.
Added Paths:
-----------
trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/ParseTextCombiner.java
Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/ParseTextCombiner.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/ParseTextCombiner.java (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/ParseTextCombiner.java 2009-03-08 20:44:25 UTC (rev 2692)
@@ -0,0 +1,216 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.archive.nutchwax.tools;
+
+import java.io.*;
+import java.util.*;
+
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+
+import org.apache.hadoop.io.*;
+import org.apache.hadoop.fs.*;
+import org.apache.hadoop.mapred.FileAlreadyExistsException;
+import org.apache.hadoop.util.*;
+import org.apache.hadoop.conf.*;
+import org.apache.hadoop.util.ReflectionUtils;
+
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.util.HadoopFSUtil;
+import org.apache.nutch.util.LogUtil;
+import org.apache.nutch.util.NutchConfiguration;
+
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.index.IndexWriter;
+
+/**
+ * <p>This is a one-off/hack to (hopefully) efficiently combine
+ * multiple "parse_text/part-nnnnn" map files into a single map file.
+ * Using the Nutch 'mergesegs' takes far too long in practice, and
+ * often fails to complete due to memory constraints.
+ * </p>
+ * <p>This class takes advantage of the fact that the
+ * "parse_text/part-nnnnn" directories are Hadoop MapFiles. To merge
+ * them, all we have to do is read key/value pairs from each one and
+ * write them back out in sorted order.
+ * </p>
+ */
+public class ParseTextCombiner extends Configured implements Tool
+{
+ public static final Log LOG = LogFactory.getLog(ParseTextCombiner.class);
+
+ private boolean verbose = false;
+
+ public ParseTextCombiner()
+ {
+
+ }
+
+ public ParseTextCombiner(Configuration conf)
+ {
+ setConf(conf);
+ }
+
+ /**
+ * Create an index for the input files in the named directory.
+ */
+ public static void main(String[] args)
+ throws Exception
+ {
+ int res = ToolRunner.run(NutchConfiguration.create(), new ParseTextCombiner(), args);
+ System.exit(res);
+ }
+
+ /**
+ *
+ */
+ public int run(String[] args)
+ throws Exception
+ {
+ String usage = "Usage: ParseTextCombiner [-v] output input...\n";
+
+ if ( args.length < 1 )
+ {
+ System.err.println( "Usage: " + usage );
+ return 1;
+ }
+
+ if ( args[0].equals( "-h" ) )
+ {
+ System.err.println( "Usage: " + usage );
+ return 1;
+ }
+
+ int argStart = 0;
+ if ( args[argStart].equals( "-v" ) )
+ {
+ verbose = true;
+ argStart = 1;
+ }
+
+ if ( args.length - argStart < 2 )
+ {
+ System.err.println( "Usage: " + usage );
+ return 1;
+ }
+
+ Configuration conf = getConf( );
+ FileSystem fs = FileSystem.get( conf );
+
+ Path outputPath = new Path( args[argStart] );
+ if ( fs.exists( outputPath ) )
+ {
+ System.err.println( "ERROR: output already exists: " + outputPath );
+ return -1;
+ }
+
+ MapFile.Reader[] readers = new MapFile.Reader[args.length - argStart - 1];
+ for ( int pos = argStart + 1 ; pos < args.length ; pos++ )
+ {
+ readers[pos - argStart - 1] = new MapFile.Reader( fs, args[pos], conf );
+ }
+
+ WritableComparable[] keys = new WritableComparable[readers.length];
+ Writable[] values = new Writable [readers.length];
+
+ WritableComparator wc = WritableComparator.get( readers[0].getKeyClass() );
+
+ MapFile.Writer writer = new MapFile.Writer( conf, fs, outputPath.toString(), readers[0].getKeyClass(), readers[0].getValueClass( ) );
+
+ int readCount = 0;
+ int writeCount = 0;
+
+ for ( int i = 0 ; i < readers.length ; i++ )
+ {
+ WritableComparable key = (WritableComparable) ReflectionUtils.newInstance( readers[i].getKeyClass(), conf );
+ Writable value = (Writable) ReflectionUtils.newInstance( readers[i].getValueClass(), conf );
+
+ if ( readers[i].next( key, value ) )
+ {
+ keys [i] = key;
+ values[i] = value;
+
+ readCount++;
+ if ( verbose ) System.out.println( "read: " + i + ": " + key );
+ }
+ else
+ {
+ // Not even one key/value pair in the map.
+ System.out.println( "WARN: No key/value pairs in mapfile: " + args[i+argStart+1] );
+ try { readers[i].close(); } catch ( IOException ioe ) { /* Don't care */ }
+ readers[i] = null;
+ }
+ }
+
+ while ( true )
+ {
+ int candidate = -1;
+
+ for ( int i = 0 ; i < keys.length ; i++ )
+ {
+ if ( keys[i] == null ) continue ;
+
+ if ( candidate < 0 )
+ {
+ candidate = i;
+ }
+ else if ( wc.compare( keys[i], keys[candidate] ) < 0 )
+ {
+ candidate = i;
+ }
+ }
+
+ if ( candidate < 0 )
+ {
+ if ( verbose ) System.out.println( "Candidate < 0, all done." );
+ break ;
+ }
+
+ // Candidate is the index of the "smallest" key.
+
+ // Write it out.
+ writer.append( keys[candidate], values[candidate] );
+ writeCount++;
+ if ( verbose ) System.out.println( "write: " + candidate + ": " + keys[candidate] );
+
+ // Now read in a new value from the corresponding reader.
+ if ( ! readers[candidate].next( keys[candidate], values[candidate] ) )
+ {
+ if ( verbose ) System.out.println( "No more key/value pairs in (" + candidate + "): " + args[candidate+argStart+1] );
+
+ // No more key/value pairs left in this reader.
+ try { readers[candidate].close(); } catch ( IOException ioe ) { /* Don't care */ }
+ readers[candidate] = null;
+ keys [candidate] = null;
+ values [candidate] = null;
+ }
+ else
+ {
+ readCount++;
+ if ( verbose ) System.out.println( "read: " + candidate + ": " + keys[candidate] );
+ }
+ }
+
+ System.out.println( "Total # records in : " + readCount );
+ System.out.println( "Total # records out: " + writeCount );
+
+ writer.close();
+
+ return 0;
+ }
+}
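Per the usage string above ("ParseTextCombiner [-v] output input...") and the parsetextmerger command added to bin/nutchwax in r2694, earlier in this digest, an invocation might look like this (paths are hypothetical):

    $ /opt/nutchwax-0.12.4/bin/nutchwax parsetextmerger \
          crawl/segments/20090308000000/parse_text-merged \
          crawl/segments/20090308000000/parse_text/part-00000 \
          crawl/segments/20090308000000/parse_text/part-00001

Because each part-nnnnn is an already-sorted Hadoop MapFile, the tool only has to stream a k-way merge of the readers' key/value pairs, so memory use stays flat regardless of input size.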
From: <bi...@us...> - 2009-03-08 02:54:26
Revision: 2691
http://archive-access.svn.sourceforge.net/archive-access/?rev=2691&view=rev
Author: binzino
Date: 2009-03-08 02:54:12 +0000 (Sun, 08 Mar 2009)
Log Message:
-----------
Added info on start/stop scripts to INSTALL.txt and also clarified the
parts of searcher-slave that need post-installation edits by the
administrator.
Modified Paths:
--------------
tags/nutchwax-0_12_3/archive/INSTALL.txt
tags/nutchwax-0_12_3/archive/src/etc/init.d/searcher-slave
Modified: tags/nutchwax-0_12_3/archive/INSTALL.txt
===================================================================
--- tags/nutchwax-0_12_3/archive/INSTALL.txt 2009-03-04 04:35:06 UTC (rev 2690)
+++ tags/nutchwax-0_12_3/archive/INSTALL.txt 2009-03-08 02:54:12 UTC (rev 2691)
@@ -1,6 +1,6 @@
INSTALL.txt
-2008-12-18
+2009-03-06
Aaron Binns
Table of Contents
@@ -127,3 +127,48 @@
$ cd /opt
$ tar xvfz nutchwax-0.12.3.tar.gz
+
+======================================================================
+Install start-up scripts
+======================================================================
+
+NutchWAX 0.12.3 comes with a Unix init.d script which can be used to
+automatically start the searcher slaves for a multi-node search
+configuration.
+
+Assuming you installed NutchWAX as
+
+ /opt/nutchwax-0.12.3
+
+the script is found at
+
+ /opt/nutchwax-0.12.3/contrib/archive/etc/init.d/searcher-slave
+
+This script can be placed in /etc/init.d then added to the list of
+startup scripts to run at bootup by using commands appropriate to your
+Linux distribution.
+
+You must edit a few of the environment variables defined in the
+'searcher-slave' specifying where NutchWAX is installed and where the
+index(s) are deployed. In 'searcher-slave' you will find the:
+
+ export NUTCH_HOME=TODO
+ export DEPLOYMENT_DIR=TODO
+
+edit those appropriately for your system.
+
+
+The "master" in the multi-node search deployment is the NutchWAX
+webapp running in a webapp server, such as Tomcat or Jetty.
+
+Jetty comes with a start/stop script appropriate for use as an init.d
+script, similar to the 'searcher-slave' script described above. If you
+use Jetty, create a symlink
+
+ /etc/init.d/jetty.sh -> /opt/jetty/bin/jetty.sh
+
+Then add this script to the list of startup scripts to run at bootup
+by using commands appropriate to your Linux distribution.
+
+Follow the instructions from Jetty on the deployment of the NutchWAX
+webapp (nutch-1.0-dev.war) in the Jetty web application server.
Modified: tags/nutchwax-0_12_3/archive/src/etc/init.d/searcher-slave
===================================================================
--- tags/nutchwax-0_12_3/archive/src/etc/init.d/searcher-slave 2009-03-04 04:35:06 UTC (rev 2690)
+++ tags/nutchwax-0_12_3/archive/src/etc/init.d/searcher-slave 2009-03-08 02:54:12 UTC (rev 2691)
@@ -10,10 +10,11 @@
DESC="NutchWAX searcher slave"
NAME="searcher-slave"
-DAEMON="/3/search/nutchwax-0.12.2/bin/nutch org.archive.nutchwax.DistributedSearch\$Server 9000 /3/search/deploy"
-NUTCH_HOME=/3/search/nutchwax-0.12.2
-JAVA_HOME=/usr
+export NUTCH_HOME=TODO
+export DEPLOYMENT_DIR=TODO
+export JAVA_HOME=/usr
export NUTCH_HEAPSIZE=2500
+DAEMON="${NUTCH_HOME}/bin/nutch org.archive.nutchwax.DistributedSearch\$Server 9000 ${DEPLOYMENT_DIR}"
PIDFILE=/var/run/$NAME.pid
SCRIPTNAME=/etc/init.d/$NAME
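A hypothetical post-installation edit of the two TODO variables (values are examples only; DAEMON then resolves against them automatically):

    export NUTCH_HOME=/opt/nutchwax-0.12.3
    export DEPLOYMENT_DIR=/search/deploy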
From: <bi...@us...> - 2009-03-04 04:35:07
Revision: 2690
http://archive-access.svn.sourceforge.net/archive-access/?rev=2690&view=rev
Author: binzino
Date: 2009-03-04 04:35:06 +0000 (Wed, 04 Mar 2009)
Log Message:
-----------
Fix JIRA WAX-38. Added rules to "job" target to add our libraries to
the .job file.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/build.xml
Modified: trunk/archive-access/projects/nutchwax/archive/build.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/build.xml 2009-03-04 01:18:44 UTC (rev 2689)
+++ trunk/archive-access/projects/nutchwax/archive/build.xml 2009-03-04 04:35:06 UTC (rev 2690)
@@ -15,7 +15,7 @@
See the License for the specific language governing permissions and
limitations under the License.
-->
-<project name="nutchwax" default="job">
+<project name="nutchwax" default="jar">
<property name="nutch.dir" value="../../" />
@@ -23,8 +23,9 @@
<property name="lib.dir" value="lib" />
<property name="build.dir" value="${nutch.dir}/build" />
<!-- HACK: Need to import default.properties like Nutch does -->
- <property name="dist.dir" value="${build.dir}/nutch-1.0-dev" />
-
+ <property name="final.name" value="nutch-1.0-dev" />
+ <property name="dist.dir" value="${build.dir}/${final.name}" />
+
<target name="nutch-compile-core">
<!-- First, copy over Nutch source overlays -->
<exec executable="rsync">
@@ -83,6 +84,11 @@
<target name="job" depends="compile">
<ant dir="${nutch.dir}" target="job" inheritAll="false" />
+
+ <!-- Add our NutchWAX libs to the .job created by Nutch's build. -->
+ <jar jarfile="${build.dir}/${final.name}.job" update="true">
+ <zipfileset dir="lib" prefix="lib" includes="*.jar"/>
+ </jar>
</target>
<target name="war" depends="compile">
From: <bi...@us...> - 2009-03-04 01:18:45
Revision: 2689
http://archive-access.svn.sourceforge.net/archive-access/?rev=2689&view=rev
Author: binzino
Date: 2009-03-04 01:18:44 +0000 (Wed, 04 Mar 2009)
Log Message:
-----------
Added boolean configuration property nutchwax.import.store.content to
determine whether or not the Importer stores the full content in the
segment's "content" directory.
Removed a useless debug message from the end of the Import job.
Removed searcher.max.hits from nutch-site.xml as it actually causes
lots of problems with search-time site-based de-dup.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java
trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java 2009-03-03 20:34:38 UTC (rev 2688)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java 2009-03-04 01:18:44 UTC (rev 2689)
@@ -456,8 +456,12 @@
try
{
- output.collect( key, new NutchWritable( datum ) );
- output.collect( key, new NutchWritable( content ) );
+ output.collect( key, new NutchWritable( datum ) );
+
+ if ( jobConf.getBoolean( "nutchwax.import.store.content", false ) )
+ {
+ output.collect( key, new NutchWritable( content ) );
+ }
if ( parseResult != null )
{
@@ -649,9 +653,6 @@
RunningJob rj = JobClient.runJob( job );
- // Emit job id and status.
- System.out.println( "JOB_STATUS: " + rj.getID( ) + ": " + (rj.isSuccessful( ) ? "SUCCESS" : "FAIL" ) );
-
return rj.isSuccessful( ) ? 0 : 1;
}
catch ( Exception e )
Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2009-03-03 20:34:38 UTC (rev 2688)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2009-03-04 01:18:44 UTC (rev 2689)
@@ -137,6 +137,25 @@
<value>1048576</value>
</property>
+<!-- Whether or not we store the full content in the segment's
+ "content" directory. Most NutchWAX users are also using Wayback
+ to serve the archived content, so there's no need for NutchWAX to
+ keep a "cached" copy as well.
+
+ Setting to 'true' yields the same bahavior as in previous
+ versions of NutchWAX, and as in Nutch. The content is stored in
+ the segment's "content" directory.
+
+ Setting to 'false' results in an empty "content" directory in the
+ segment. The content is not stored.
+
+ Default value is 'false'.
+ -->
+<property>
+ <name>nutchwax.import.store.content</name>
+ <value>false</value>
+</property>
+
<!-- Enable per-collection segment sub-dirs, e.g.
segments/<collectionId>/segment1
/segment2
@@ -156,11 +175,6 @@
</property>
<property>
- <name>searcher.max.hits</name>
- <value>1000</value>
-</property>
-
-<property>
<name>searcher.summary.context</name>
<value>8</value>
</property>
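A minimal sketch of how the new property is resolved at run time, assuming the nutch-site.xml above is on the classpath (the class name is hypothetical; the getBoolean lookup is the same one the Importer performs):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class StoreContentCheck {
      public static void main(String[] args) {
        // NutchConfiguration.create() layers nutch-site.xml over the defaults.
        Configuration conf = NutchConfiguration.create();
        // Second argument is the fallback when the property is absent.
        boolean store = conf.getBoolean("nutchwax.import.store.content", false);
        System.out.println("Importer stores raw content: " + store);
      }
    }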
From: <bi...@us...> - 2009-03-03 20:34:43
Revision: 2688
http://archive-access.svn.sourceforge.net/archive-access/?rev=2688&view=rev
Author: binzino
Date: 2009-03-03 20:34:38 +0000 (Tue, 03 Mar 2009)
Log Message:
-----------
Re-worked the page link generation to handle last-page and
paging-off-the-end.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/search.xsl
Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/search.xsl
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/search.xsl 2009-03-03 18:20:14 UTC (rev 2687)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/search.xsl 2009-03-03 20:34:38 UTC (rev 2688)
@@ -192,37 +192,73 @@
<xsl:template name="pageLinks">
<xsl:param name="labelPrevious" />
<xsl:param name="labelNext" />
+ <xsl:variable name="startPage" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" />
+ <xsl:variable name="lastPage" select="floor(opensearch:totalResults div opensearch:itemsPerPage) + 1" />
<!-- If we are on any page past the first, emit a "previous" link -->
- <xsl:if test="(floor(opensearch:startIndex div opensearch:itemsPerPage) + 1) != 1">
+ <xsl:if test="$startPage != 1">
<xsl:call-template name="pageLink">
- <xsl:with-param name="pageNum" select="floor(opensearch:startIndex div opensearch:itemsPerPage)" />
+ <xsl:with-param name="pageNum" select="$startPage - 1" />
<xsl:with-param name="linkText" select="$labelPrevious" />
</xsl:call-template>
<xsl:text> </xsl:text>
</xsl:if>
<!-- Now, emit numbered page links -->
<xsl:choose>
- <xsl:when test="(floor(opensearch:startIndex div opensearch:itemsPerPage) + 1) < 11">
- <xsl:call-template name="numberedPageLinks" >
- <xsl:with-param name="begin" select="1" />
- <xsl:with-param name="end" select="21" />
- <xsl:with-param name="current" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" />
- </xsl:call-template>
+ <!-- We are on pages 1-10. Emit links -->
+ <xsl:when test="$startPage < 11">
+ <xsl:choose>
+ <xsl:when test="$lastPage < 21">
+ <xsl:call-template name="numberedPageLinks" >
+ <xsl:with-param name="begin" select="1" />
+ <xsl:with-param name="end" select="$lastPage + 1" />
+ <xsl:with-param name="current" select="$startPage" />
+ </xsl:call-template>
+ </xsl:when>
+ <xsl:otherwise>
+ <xsl:call-template name="numberedPageLinks" >
+ <xsl:with-param name="begin" select="1" />
+ <xsl:with-param name="end" select="21" />
+ <xsl:with-param name="current" select="$startPage" />
+ </xsl:call-template>
+ </xsl:otherwise>
+ </xsl:choose>
</xsl:when>
+ <!-- We are past page 10, but not to the last page yet. Emit links for 10 pages before and 10 pages after -->
+ <xsl:when test="$startPage < $lastPage">
+ <xsl:choose>
+ <xsl:when test="$lastPage < ($startPage + 11)">
+ <xsl:call-template name="numberedPageLinks" >
+ <xsl:with-param name="begin" select="$startPage - 10" />
+ <xsl:with-param name="end" select="$lastPage + 1" />
+ <xsl:with-param name="current" select="$startPage" />
+ </xsl:call-template>
+ </xsl:when>
+ <xsl:otherwise>
+ <xsl:call-template name="numberedPageLinks" >
+ <xsl:with-param name="begin" select="$startPage - 10" />
+ <xsl:with-param name="end" select="$startPage + 11" />
+ <xsl:with-param name="current" select="$startPage" />
+ </xsl:call-template>
+ </xsl:otherwise>
+ </xsl:choose>
+ </xsl:when>
+ <!-- This covers the case where we are on (or past) the last page -->
<xsl:otherwise>
<xsl:call-template name="numberedPageLinks" >
- <xsl:with-param name="begin" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1 - 10" />
- <xsl:with-param name="end" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1 + 11" />
- <xsl:with-param name="current" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" />
+ <xsl:with-param name="begin" select="$startPage - 10" />
+ <xsl:with-param name="end" select="$lastPage + 1" />
+ <xsl:with-param name="current" select="$startPage" />
</xsl:call-template>
</xsl:otherwise>
</xsl:choose>
<!-- Lastly, emit a "next" link. -->
<xsl:text> </xsl:text>
- <xsl:call-template name="pageLink">
- <xsl:with-param name="pageNum" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 2" />
- <xsl:with-param name="linkText" select="$labelNext" />
- </xsl:call-template>
+ <xsl:if test="$startPage < $lastPage">
+ <xsl:call-template name="pageLink">
+ <xsl:with-param name="pageNum" select="$startPage + 1" />
+ <xsl:with-param name="linkText" select="$labelNext" />
+ </xsl:call-template>
+ </xsl:if>
</xsl:template>
<!-- Template to emit a list of numbered links to results pages.
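The XPath arithmetic is easier to follow outside diff form. This sketch redoes the window computation in plain Java with hypothetical numbers (class name and values are illustrative only; integer division mirrors XPath's floor(a div b) for positive operands):

    public class PagingMath {
      public static void main(String[] args) {
        int startIndex = 240, itemsPerPage = 10, totalResults = 287; // hypothetical
        int startPage = startIndex / itemsPerPage + 1;   // floor(240 div 10) + 1 = 25
        int lastPage  = totalResults / itemsPerPage + 1; // floor(287 div 10) + 1 = 29
        // Past page 10 and not yet on the last page: a window of up to 10 pages
        // on either side, clipped on the right at lastPage ('end' is exclusive).
        int begin = startPage - 10;                         // 15
        int end   = Math.min(startPage + 11, lastPage + 1); // 30
        boolean emitNext = startPage < lastPage;            // true
        System.out.println(begin + ".." + (end - 1) + " next=" + emitNext);
      }
    }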
From: <bi...@us...> - 2009-03-03 18:20:21
Revision: 2687
http://archive-access.svn.sourceforge.net/archive-access/?rev=2687&view=rev
Author: binzino
Date: 2009-03-03 18:20:14 +0000 (Tue, 03 Mar 2009)
Log Message:
-----------
Fixed handling of start and end of search results so that we detect
"paging off the end" and return an empty result set rather than an
exception.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java
Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java 2009-02-28 01:26:25 UTC (rev 2686)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java 2009-03-03 18:20:14 UTC (rev 2687)
@@ -162,18 +162,30 @@
responseTime = System.nanoTime( ) - responseTime;
- // generate xml results
- int end = (int)Math.min(hits.getLength(), start + hitsPerPage);
- int length = end-start;
+ // The 'end' is usually just the end of the current page
+ // (start+hitsPerPage); but if we are on the last page
+ // of de-duped results, then the end is hits.getLength().
+ int end = Math.min( hits.getLength( ), start + hitsPerPage );
- Hit[] show = hits.getHits(start, end-start);
- HitDetails[] details = bean.getDetails(show);
- Summary[] summaries = bean.getSummary(details, query);
+ // The length is usually just (end-start), unless the start
+ // position is past the end of the results -- which is common when
+ // de-duping. The user could easily jump past the true end of the
+ // de-dup'd results. If the start is past the end, we use a
+ // length of '0' to produce an empty results page.
+ int length = Math.max( end-start, 0 );
+ // Usually, the total results is the total number of non-de-duped
+ // results. However, if we are on the last page of de-duped results,
+ // then we know our de-dup'd total is hits.getLength().
+ long totalResults = hits.getLength( ) < (start+hitsPerPage) ? hits.getLength( ) : hits.getTotal( );
+
+ Hit[] show = hits.getHits(start, length );
+ HitDetails[] details = bean.getDetails(show);
+ Summary[] summaries = bean.getSummary(details, query);
+
String requestUrl = request.getRequestURL().toString();
String base = requestUrl.substring(0, requestUrl.lastIndexOf('/'));
-
try {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
@@ -197,8 +209,8 @@
+"&hitsPerDup="+hitsPerDup
+params);
- addNode(doc, channel, "opensearch", "totalResults", ""+hits.getTotal());
- addNode(doc, channel, "opensearch", "startIndex", ""+start);
+ addNode(doc, channel, "opensearch", "totalResults", ""+totalResults);
+ addNode(doc, channel, "opensearch", "startIndex", ""+start);
addNode(doc, channel, "opensearch", "itemsPerPage", ""+hitsPerPage);
addNode(doc, channel, "nutch", "query", queryString);
Revision: 2686
http://archive-access.svn.sourceforge.net/archive-access/?rev=2686&view=rev
Author: binzino
Date: 2009-02-28 01:26:25 +0000 (Sat, 28 Feb 2009)
Log Message:
-----------
Added here with local edits to handle perCollection segments in a
distributed setup. Also added info/diagnostic messages to help
diagnose common deployment errors.
Added Paths:
-----------
trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/DistributedSearch.java
Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/DistributedSearch.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/DistributedSearch.java (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/DistributedSearch.java 2009-02-28 01:26:25 UTC (rev 2686)
@@ -0,0 +1,483 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.searcher;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.lang.reflect.Method;
+import java.net.InetSocketAddress;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.StringTokenizer;
+import java.util.TreeSet;
+
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.ipc.RPC;
+import org.apache.hadoop.ipc.VersionedProtocol;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.util.NutchConfiguration;
+
+/** Implements the search API over IPC connections. */
+public class DistributedSearch {
+ public static final Log LOG = LogFactory.getLog(DistributedSearch.class);
+
+ private DistributedSearch() {} // no public ctor
+
+ /** The distributed search protocol. */
+ public static interface Protocol
+ extends Searcher, HitDetailer, HitSummarizer, HitContent, HitInlinks, VersionedProtocol {
+
+ /** The name of the segments searched by this node. */
+ String[] getSegmentNames();
+ }
+
+ /** The search server. */
+ public static class Server {
+
+ private Server() {}
+
+ /** Runs a search server. */
+ public static void main(String[] args) throws Exception {
+ String usage = "DistributedSearch$Server <port> <index dir>";
+
+ if (args.length == 0 || args.length > 2) {
+ System.err.println(usage);
+ System.exit(-1);
+ }
+
+ int port = Integer.parseInt(args[0]);
+ Path directory = new Path(args[1]);
+
+ Configuration conf = NutchConfiguration.create();
+
+ org.apache.hadoop.ipc.Server server = getServer(conf, directory, port);
+ server.start();
+ server.join();
+ }
+
+ static org.apache.hadoop.ipc.Server getServer(Configuration conf, Path directory, int port) throws IOException{
+ NutchBean bean = new NutchBean(conf, directory);
+ int numHandlers = conf.getInt("searcher.num.handlers", 10);
+ return RPC.getServer(bean, "0.0.0.0", port, numHandlers, true, conf);
+ }
+
+ }
+
+ /** The search client. */
+ public static class Client extends Thread
+ implements Searcher, HitDetailer, HitSummarizer, HitContent, HitInlinks,
+ Runnable {
+
+ private InetSocketAddress[] defaultAddresses;
+ private boolean[] liveServer;
+ private HashMap segmentToAddress = new HashMap();
+
+ private boolean running = true;
+ private Configuration conf;
+ private boolean perCollection = false;
+
+ private Path file;
+ private long timestamp;
+ private FileSystem fs;
+
+ /** Construct a client talking to servers listed in the named file.
+ * Each line in the file lists a server hostname and port, separated by
+ * whitespace.
+ */
+ public Client(Path file, Configuration conf)
+ throws IOException {
+ this(readConfig(file, conf), conf);
+ this.file = file;
+ this.timestamp = fs.getFileStatus(file).getModificationTime();
+ }
+
+ private static InetSocketAddress[] readConfig(Path path, Configuration conf)
+ throws IOException {
+ FileSystem fs = FileSystem.get(conf);
+ BufferedReader reader =
+ new BufferedReader(new InputStreamReader(fs.open(path)));
+ try {
+ ArrayList addrs = new ArrayList();
+ String line;
+ while ((line = reader.readLine()) != null) {
+ StringTokenizer tokens = new StringTokenizer(line);
+ if (tokens.hasMoreTokens()) {
+ String host = tokens.nextToken();
+ if (tokens.hasMoreTokens()) {
+ String port = tokens.nextToken();
+ addrs.add(new InetSocketAddress(host, Integer.parseInt(port)));
+ if (LOG.isInfoEnabled()) {
+ LOG.info("Client adding server " + host + ":" + port);
+ }
+ }
+ }
+ }
+ return (InetSocketAddress[])
+ addrs.toArray(new InetSocketAddress[addrs.size()]);
+ } finally {
+ reader.close();
+ }
+ }
+
+ /** Construct a client talking to the named servers. */
+ public Client(InetSocketAddress[] addresses, Configuration conf) throws IOException {
+ this.conf = conf;
+ this.defaultAddresses = addresses;
+ this.liveServer = new boolean[addresses.length];
+ this.fs = FileSystem.get(conf);
+
+ this.perCollection = this.conf.getBoolean( "nutchwax.FetchedSegments.perCollection", false );
+
+ updateSegments();
+ setDaemon(true);
+ start();
+ }
+
+ private static final Method GET_SEGMENTS;
+ private static final Method SEARCH;
+ private static final Method DETAILS;
+ private static final Method SUMMARY;
+ static {
+ try {
+ GET_SEGMENTS = Protocol.class.getMethod
+ ("getSegmentNames", new Class[] {});
+ SEARCH = Protocol.class.getMethod
+ ("search", new Class[] { Query.class, Integer.TYPE, String.class,
+ String.class, Boolean.TYPE});
+ DETAILS = Protocol.class.getMethod
+ ("getDetails", new Class[] { Hit.class});
+ SUMMARY = Protocol.class.getMethod
+ ("getSummary", new Class[] { HitDetails.class, Query.class});
+ } catch (NoSuchMethodException e) {
+ throw new RuntimeException(e);
+ }
+ }
+
+ /**
+ * Check to see if search-servers file has been modified
+ *
+ * @throws IOException
+ */
+ public boolean isFileModified()
+ throws IOException {
+
+ if (file != null) {
+ long modTime = fs.getFileStatus(file).getModificationTime();
+ if (timestamp < modTime) {
+ this.timestamp = fs.getFileStatus(file).getModificationTime();
+ return true;
+ }
+ }
+
+ return false;
+ }
+
+ /** Updates segment names.
+ *
+ * @throws IOException
+ */
+ public void updateSegments() throws IOException {
+
+ int liveServers = 0;
+ int liveSegments = 0;
+
+ if (isFileModified()) {
+ defaultAddresses = readConfig(file, conf);
+ }
+
+ // Create new array of flags so they can all be updated at once.
+ boolean[] updatedLiveServer = new boolean[defaultAddresses.length];
+
+ // build segmentToAddress map
+ Object[][] params = new Object[defaultAddresses.length][0];
+ String[][] results =
+ (String[][])RPC.call(GET_SEGMENTS, params, defaultAddresses, this.conf);
+
+ for (int i = 0; i < results.length; i++) { // process results of call
+ InetSocketAddress addr = defaultAddresses[i];
+ String[] segments = results[i];
+ if (segments == null) {
+ updatedLiveServer[i] = false;
+ if (LOG.isWarnEnabled()) {
+ LOG.warn("Client: no segments from: " + addr);
+ }
+ continue;
+ }
+
+ for (int j = 0; j < segments.length; j++) {
+ if (LOG.isTraceEnabled()) {
+ LOG.trace("Client: segment "+segments[j]+" at "+addr);
+ }
+ segmentToAddress.put(segments[j], addr);
+ }
+
+ updatedLiveServer[i] = true;
+ liveServers++;
+ liveSegments += segments.length;
+ }
+
+ // Now update live server flags.
+ this.liveServer = updatedLiveServer;
+
+ if (LOG.isInfoEnabled()) {
+ LOG.info("STATS: "+liveServers+" servers, "+liveSegments+" segments.");
+ }
+ }
+
+ /** Return the names of segments searched. */
+ public String[] getSegmentNames() {
+ return (String[])
+ segmentToAddress.keySet().toArray(new String[segmentToAddress.size()]);
+ }
+
+ public Hits search(final Query query, final int numHits,
+ final String dedupField, final String sortField,
+ final boolean reverse) throws IOException {
+ // Get the list of live servers. It would be nice to build this
+ // list in updateSegments(), but that would create concurrency issues.
+ // We grab a local reference to the live server flags in case it
+ // is updated while we are building our list of liveAddresses.
+ boolean[] savedLiveServer = this.liveServer;
+ int numLive = 0;
+ for (int i = 0; i < savedLiveServer.length; i++) {
+ if (savedLiveServer[i])
+ numLive++;
+ }
+ InetSocketAddress[] liveAddresses = new InetSocketAddress[numLive];
+ int[] liveIndexNos = new int[numLive];
+ int k = 0;
+ for (int i = 0; i < savedLiveServer.length; i++) {
+ if (savedLiveServer[i]) {
+ liveAddresses[k] = defaultAddresses[i];
+ liveIndexNos[k] = i;
+ k++;
+ }
+ }
+
+ Object[][] params = new Object[liveAddresses.length][5];
+ for (int i = 0; i < params.length; i++) {
+ params[i][0] = query;
+ params[i][1] = new Integer(numHits);
+ params[i][2] = dedupField;
+ params[i][3] = sortField;
+ params[i][4] = Boolean.valueOf(reverse);
+ }
+ Hits[] results = (Hits[])RPC.call(SEARCH, params, liveAddresses, this.conf);
+
+ TreeSet queue; // cull top hits from results
+
+ if (sortField == null || reverse) {
+ queue = new TreeSet(new Comparator() {
+ public int compare(Object o1, Object o2) {
+ return ((Comparable)o2).compareTo(o1); // reverse natural order
+ }
+ });
+ } else {
+ queue = new TreeSet();
+ }
+
+ long totalHits = 0;
+ Comparable maxValue = null;
+ for (int i = 0; i < results.length; i++) {
+ Hits hits = results[i];
+ if (hits == null) continue;
+ totalHits += hits.getTotal();
+ for (int j = 0; j < hits.getLength(); j++) {
+ Hit h = hits.getHit(j);
+ if (maxValue == null ||
+ ((reverse || sortField == null)
+ ? h.getSortValue().compareTo(maxValue) >= 0
+ : h.getSortValue().compareTo(maxValue) <= 0)) {
+ queue.add(new Hit(liveIndexNos[i], h.getIndexDocNo(),
+ h.getSortValue(), h.getDedupValue()));
+ if (queue.size() > numHits) { // if hit queue overfull
+ queue.remove(queue.last()); // remove lowest in hit queue
+ maxValue = ((Hit)queue.last()).getSortValue(); // reset maxValue
+ }
+ }
+ }
+ }
+ return new Hits(totalHits, (Hit[])queue.toArray(new Hit[queue.size()]));
+ }
+
+ // version for hadoop-0.5.0.jar
+ public static final long versionID = 1L;
+
+ private Protocol getRemote(Hit hit) throws IOException {
+ return (Protocol)
+ RPC.getProxy(Protocol.class, versionID, defaultAddresses[hit.getIndexNo()], conf);
+ }
+
+ private Protocol getRemote(HitDetails hit) throws IOException {
+ InetSocketAddress address =
+ (InetSocketAddress)segmentToAddress.get(hit.getValue("segment"));
+ return (Protocol)RPC.getProxy(Protocol.class, versionID, address, conf);
+ }
+
+ public String getExplanation(Query query, Hit hit) throws IOException {
+ return getRemote(hit).getExplanation(query, hit);
+ }
+
+ public HitDetails getDetails(Hit hit) throws IOException {
+ return getRemote(hit).getDetails(hit);
+ }
+
+ public HitDetails[] getDetails(Hit[] hits) throws IOException {
+ InetSocketAddress[] addrs = new InetSocketAddress[hits.length];
+ Object[][] params = new Object[hits.length][1];
+ for (int i = 0; i < hits.length; i++) {
+ addrs[i] = defaultAddresses[hits[i].getIndexNo()];
+ params[i][0] = hits[i];
+ }
+ return (HitDetails[])RPC.call(DETAILS, params, addrs, conf);
+ }
+
+
+ public Summary getSummary(HitDetails hit, Query query) throws IOException {
+ return getRemote(hit).getSummary(hit, query);
+ }
+
+
+ /* DIFF: Added handling for perCollection segments. Also info
+ * messages about each hit to help diagnose typical
+ * deployment errors.
+ */
+ public Summary[] getSummary(HitDetails[] hits, Query query) throws IOException
+ {
+ try
+ {
+ InetSocketAddress[] addrs = new InetSocketAddress[hits.length];
+ Object[][] params = new Object[hits.length][2];
+ for (int i = 0; i < hits.length; i++)
+ {
+ HitDetails hit = hits[i];
+ if ( this.perCollection )
+ {
+ addrs[i] = (InetSocketAddress)segmentToAddress.get(hit.getValue("collection"));
+ LOG.info( "Hit: " + hit + " addr: " + addrs[i] + " collection:" + hit.getValue("collection") );
+ }
+ else
+ {
+ addrs[i] = (InetSocketAddress)segmentToAddress.get(hit.getValue("segment"));
+ LOG.info( "Hit: " + hit + " addr: " + addrs[i] + " segment:" + hit.getValue("segment") );
+ }
+ params[i][0] = hit;
+ params[i][1] = query;
+ }
+ return (Summary[])RPC.call(SUMMARY, params, addrs, conf);
+ }
+ catch ( Exception e )
+ {
+ LOG.warn( "Error getting summaries: ", e );
+ return new Summary[hits.length];
+ }
+ }
+
+ public byte[] getContent(HitDetails hit) throws IOException {
+ return getRemote(hit).getContent(hit);
+ }
+
+ public ParseData getParseData(HitDetails hit) throws IOException {
+ return getRemote(hit).getParseData(hit);
+ }
+
+ public ParseText getParseText(HitDetails hit) throws IOException {
+ return getRemote(hit).getParseText(hit);
+ }
+
+ public String[] getAnchors(HitDetails hit) throws IOException {
+ return getRemote(hit).getAnchors(hit);
+ }
+
+ public Inlinks getInlinks(HitDetails hit) throws IOException {
+ return getRemote(hit).getInlinks(hit);
+ }
+
+ public long getFetchDate(HitDetails hit) throws IOException {
+ return getRemote(hit).getFetchDate(hit);
+ }
+
+ public static void main(String[] args) throws Exception {
+ String usage = "DistributedSearch$Client query <host> <port> ...";
+
+ if (args.length == 0) {
+ System.err.println(usage);
+ System.exit(-1);
+ }
+
+ Query query = Query.parse(args[0], NutchConfiguration.create());
+
+ InetSocketAddress[] addresses = new InetSocketAddress[(args.length-1)/2];
+ for (int i = 0; i < (args.length-1)/2; i++) {
+ addresses[i] =
+ new InetSocketAddress(args[i*2+1], Integer.parseInt(args[i*2+2]));
+ }
+
+ Client client = new Client(addresses, NutchConfiguration.create());
+ //client.setTimeout(Integer.MAX_VALUE);
+
+ Hits hits = client.search(query, 10, null, null, false);
+ System.out.println("Total hits: " + hits.getTotal());
+ for (int i = 0; i < hits.getLength(); i++) {
+ System.out.println(" "+i+" "+ client.getDetails(hits.getHit(i)));
+ }
+
+ }
+
+ public void run() {
+ while (running){
+ try{
+ Thread.sleep(10000);
+ } catch (InterruptedException ie){
+ if (LOG.isInfoEnabled()) {
+ LOG.info("Thread sleep interrupted.");
+ }
+ }
+ try{
+ if (LOG.isInfoEnabled()) {
+ LOG.info("Querying segments from search servers...");
+ }
+ updateSegments();
+ } catch (IOException ioe) {
+ if (LOG.isWarnEnabled()) { LOG.warn("No search servers available!"); }
+ liveServer = new boolean[defaultAddresses.length];
+ }
+ }
+ }
+
+ /**
+ * Stops the watchdog thread.
+ */
+ public void close() {
+ running = false;
+ interrupt();
+ }
+
+ public boolean[] getLiveServer() {
+ return liveServer;
+ }
+ }
+}
\ No newline at end of file
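A minimal client-side usage sketch modeled on the main() method above; the host name, port, and query string are placeholders:

    package org.apache.nutch.searcher;

    import java.net.InetSocketAddress;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class SearchClientExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        InetSocketAddress[] addrs = {
          new InetSocketAddress("search1.example.org", 1234) // placeholder host/port
        };
        DistributedSearch.Client client = new DistributedSearch.Client(addrs, conf);
        Hits hits = client.search(Query.parse("archive", conf), 10, null, null, false);
        System.out.println("Total hits: " + hits.getTotal());
        for (int i = 0; i < hits.getLength(); i++) {
          System.out.println(" " + i + " " + client.getDetails(hits.getHit(i)));
        }
        client.close(); // stops the watchdog thread
      }
    }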
From: <bi...@us...> - 2009-02-28 01:23:12
Revision: 2685
http://archive-access.svn.sourceforge.net/archive-access/?rev=2685&view=rev
Author: binzino
Date: 2009-02-28 01:23:10 +0000 (Sat, 28 Feb 2009)
Log Message:
-----------
Improved error handling with better diagnostic messages to help catch
common deployment mistakes.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java
Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java 2009-02-28 01:18:32 UTC (rev 2684)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java 2009-02-28 01:23:10 UTC (rev 2685)
@@ -262,10 +262,26 @@
if (this.summarizer == null) { return new Summary(); }
+ String text = "";
Segment segment = getSegment(details);
- ParseText parseText = segment.getParseText(getUrl(details));
- String text = (parseText != null) ? parseText.getText() : "";
-
+
+ if ( segment != null )
+ {
+ try
+ {
+ ParseText parseText = segment.getParseText(getUrl(details));
+ text = (parseText != null) ? parseText.getText() : "";
+ }
+ catch ( Exception e )
+ {
+ LOG.error( "segment = " + segment.segmentDir, e );
+ }
+ }
+ else
+ {
+ LOG.warn( "No segment for: " + details );
+ }
+
return this.summarizer.getSummary(text, query);
}
@@ -330,12 +346,19 @@
String segmentName = details.getValue("segment");
Map perCollectionSegments = (Map) this.segments.get( collectionId );
+
+ if ( perCollectionSegments == null )
+ {
+ LOG.warn( "Cannot find per-collection segments for: " + collectionId );
+
+ return null;
+ }
Segment segment = (Segment) perCollectionSegments.get( segmentName );
if ( segment == null )
{
- LOG.warn( "Didn't find segment: collection=" + collectionId + " segment=" + segmentName );
+ LOG.warn( "Cannot find segment: collection=" + collectionId + " segment=" + segmentName );
}
return segment;
@@ -350,7 +373,7 @@
if ( segment == null )
{
- LOG.warn( "Didn't find segment: " + segmentName );
+ LOG.warn( "Cannot find segment: " + segmentName );
}
return segment;
From: <bi...@us...> - 2009-02-28 01:18:34
Revision: 2684
http://archive-access.svn.sourceforge.net/archive-access/?rev=2684&view=rev
Author: binzino
Date: 2009-02-28 01:18:32 +0000 (Sat, 28 Feb 2009)
Log Message:
-----------
Initial revision.
Added Paths:
-----------
trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/BuildIndex.java
Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/BuildIndex.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/BuildIndex.java (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/BuildIndex.java 2009-02-28 01:18:32 UTC (rev 2684)
@@ -0,0 +1,79 @@
+/*
+ * Copyright (C) 2008 Internet Archive.
+ *
+ * This file is part of the archive-access tools project
+ * (http://sourceforge.net/projects/archive-access).
+ *
+ * The archive-access tools are free software; you can redistribute them and/or
+ * modify them under the terms of the GNU Lesser Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or any
+ * later version.
+ *
+ * The archive-access tools are distributed in the hope that they will be
+ * useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser
+ * Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser Public License along with
+ * the archive-access tools; if not, write to the Free Software Foundation,
+ * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+package org.archive.nutchwax.tools;
+
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.analysis.WhitespaceAnalyzer;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.util.NutchConfiguration;
+
+
+/**
+ * A nice command-line hack to generate a Lucene index of N documents,
+ * each with one field set to the same value. This value is both
+ * stored and tokenized/indexed.
+ */
+public class BuildIndex extends Configured implements Tool
+{
+ public int run( String[] args ) throws Exception
+ {
+ if ( args.length < 4 )
+ {
+ System.out.println( "BuildIndex index field value count" );
+ System.exit( 0 );
+ }
+
+ String indexDir = args[0].trim();
+ String fieldKey = args[1].trim();
+ String fieldValue = args[2].trim();
+ int count = Integer.parseInt( args[3].trim() );
+
+ IndexWriter writer = new IndexWriter( indexDir, new WhitespaceAnalyzer( ), true );
+
+ for ( int i = 0 ; i < count ; i++ )
+ {
+ Document newDoc = new Document( );
+ newDoc.add( new Field( fieldKey, fieldValue, Field.Store.YES, Field.Index.TOKENIZED ) );
+
+ writer.addDocument( newDoc );
+ }
+
+ writer.close( );
+
+ return 0;
+ }
+
+ /**
+ * Runs using the Hadoop ToolRunner, which means it accepts the
+ * standard Hadoop command-line options.
+ */
+ public static void main( String args[] ) throws Exception
+ {
+ int result = ToolRunner.run( NutchConfiguration.create(), new BuildIndex(), args );
+
+ System.exit( result );
+ }
+
+}
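A hypothetical invocation would be "bin/nutch org.archive.nutchwax.tools.BuildIndex test-index site example.org 1000", matching the usage string above. A sketch for checking the resulting document count, assuming the same pre-3.0 Lucene API the tool itself uses (class and index names are illustrative):

    import org.apache.lucene.index.IndexReader;

    public class CountDocs {
      public static void main(String[] args) throws Exception {
        // Same-era Lucene API as BuildIndex above (String-path variant).
        IndexReader reader = IndexReader.open("test-index");
        System.out.println("docs: " + reader.numDocs()); // expect the 'count' argument
        reader.close();
      }
    }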
From: <bi...@us...> - 2009-02-23 03:54:54
Revision: 2683
http://archive-access.svn.sourceforge.net/archive-access/?rev=2683&view=rev
Author: binzino
Date: 2009-02-23 03:54:47 +0000 (Mon, 23 Feb 2009)
Log Message:
-----------
Added PageRank* classes to mirror the Nutch LinkDb classes but only
/count/ the inlinks, not preserve them.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java
Added Paths:
-----------
trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDb.java
trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDbFilter.java
trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDbMerger.java
Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDb.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDb.java (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDb.java 2009-02-23 03:54:47 UTC (rev 2683)
@@ -0,0 +1,366 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.archive.nutchwax;
+
+import java.io.*;
+import java.util.*;
+import java.net.*;
+
+// Commons Logging imports
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+
+import org.apache.hadoop.io.*;
+import org.apache.hadoop.fs.*;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.conf.*;
+import org.apache.hadoop.mapred.*;
+import org.apache.hadoop.util.*;
+
+import org.apache.nutch.crawl.LinkDbFilter;
+import org.apache.nutch.net.URLFilters;
+import org.apache.nutch.net.URLNormalizers;
+import org.apache.nutch.parse.*;
+import org.apache.nutch.util.HadoopFSUtil;
+import org.apache.nutch.util.LockUtil;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.NutchJob;
+
+/**
+ * <p>Maintains an inverted link map, listing incoming links for each
+ * url.</p>
+ * <p>Aaron Binns @ archive.org: see comments in PageRankDbMerger.</p>
+*/
+public class PageRankDb extends Configured
+ implements Tool, Mapper<Text, ParseData, Text, IntWritable>
+{
+ public static final Log LOG = LogFactory.getLog(PageRankDb.class);
+
+ public static final String CURRENT_NAME = "current";
+ public static final String LOCK_NAME = ".locked";
+
+ private int maxAnchorLength;
+ private boolean ignoreInternalLinks;
+ private URLFilters urlFilters;
+ private URLNormalizers urlNormalizers;
+
+ public PageRankDb( )
+ {
+ }
+
+ public PageRankDb( Configuration conf )
+ {
+ setConf(conf);
+ }
+
+ public void configure( JobConf job )
+ {
+ ignoreInternalLinks = job.getBoolean("db.ignore.internal.links", true);
+ if (job.getBoolean(LinkDbFilter.URL_FILTERING, false))
+ {
+ urlFilters = new URLFilters(job);
+ }
+ if (job.getBoolean(LinkDbFilter.URL_NORMALIZING, false))
+ {
+ urlNormalizers = new URLNormalizers(job, URLNormalizers.SCOPE_LINKDB);
+ }
+ }
+
+ public void close( )
+ {
+ }
+
+ public void map( Text key, ParseData parseData, OutputCollector<Text, IntWritable> output, Reporter reporter )
+ throws IOException
+ {
+ String fromUrl = key.toString();
+ String fromHost = getHost(fromUrl);
+
+ if (urlNormalizers != null)
+ {
+ try
+ {
+ fromUrl = urlNormalizers.normalize(fromUrl, URLNormalizers.SCOPE_LINKDB); // normalize the url
+ }
+ catch (Exception e)
+ {
+ LOG.warn("Skipping " + fromUrl + ":" + e);
+ fromUrl = null;
+ }
+ }
+ if (fromUrl != null && urlFilters != null)
+ {
+ try
+ {
+ fromUrl = urlFilters.filter(fromUrl); // filter the url
+ }
+ catch (Exception e)
+ {
+ LOG.warn("Skipping " + fromUrl + ":" + e);
+ fromUrl = null;
+ }
+ }
+ if (fromUrl == null) return;
+
+ Outlink[] outlinks = parseData.getOutlinks();
+
+ for (int i = 0; i < outlinks.length; i++)
+ {
+ Outlink outlink = outlinks[i];
+ String toUrl = outlink.getToUrl();
+
+ if (ignoreInternalLinks)
+ {
+ String toHost = getHost(toUrl);
+ if (toHost == null || toHost.equals(fromHost))
+ { // internal link
+ continue; // skip it
+ }
+ }
+ if (urlNormalizers != null)
+ {
+ try
+ {
+ toUrl = urlNormalizers.normalize(toUrl, URLNormalizers.SCOPE_LINKDB); // normalize the url
+ }
+ catch (Exception e)
+ {
+ LOG.warn("Skipping " + toUrl + ":" + e);
+ toUrl = null;
+ }
+ }
+ if (toUrl != null && urlFilters != null)
+ {
+ try
+ {
+ toUrl = urlFilters.filter(toUrl); // filter the url
+ }
+ catch (Exception e)
+ {
+ LOG.warn("Skipping " + toUrl + ":" + e);
+ toUrl = null;
+ }
+ }
+
+ if (toUrl == null) continue;
+
+ // DIFF: We just emit a count of '1' for the toUrl, rather than
+ // the list of inlinks as in LinkDb.
+ output.collect( new Text(toUrl), new IntWritable( 1 ) );
+ }
+ }
+
+ private String getHost(String url)
+ {
+ try
+ {
+ return new URL(url).getHost().toLowerCase();
+ }
+ catch (MalformedURLException e)
+ {
+ return null;
+ }
+ }
+
+ public void invert(Path pageRankDb, final Path segmentsDir, boolean normalize, boolean filter, boolean force) throws IOException
+ {
+ final FileSystem fs = FileSystem.get(getConf());
+ FileStatus[] files = fs.listStatus(segmentsDir, HadoopFSUtil.getPassDirectoriesFilter(fs));
+ invert(pageRankDb, HadoopFSUtil.getPaths(files), normalize, filter, force);
+ }
+
+ public void invert(Path pageRankDb, Path[] segments, boolean normalize, boolean filter, boolean force) throws IOException
+ {
+
+ Path lock = new Path(pageRankDb, LOCK_NAME);
+ FileSystem fs = FileSystem.get(getConf());
+ LockUtil.createLockFile(fs, lock, force);
+ Path currentPageRankDb = new Path(pageRankDb, CURRENT_NAME);
+ if (LOG.isInfoEnabled())
+ {
+ LOG.info("PageRankDb: starting");
+ LOG.info("PageRankDb: pageRankDb: " + pageRankDb);
+ LOG.info("PageRankDb: URL normalize: " + normalize);
+ LOG.info("PageRankDb: URL filter: " + filter);
+ }
+ JobConf job = PageRankDb.createJob(getConf(), pageRankDb, normalize, filter);
+ for (int i = 0; i < segments.length; i++)
+ {
+ if (LOG.isInfoEnabled())
+ {
+ LOG.info("PageRankDb: adding segment: " + segments[i]);
+ }
+ FileInputFormat.addInputPath(job, new Path(segments[i], ParseData.DIR_NAME));
+ }
+ try
+ {
+ JobClient.runJob(job);
+ }
+ catch (IOException e)
+ {
+ LockUtil.removeLockFile(fs, lock);
+ throw e;
+ }
+ if (fs.exists(currentPageRankDb))
+ {
+ if (LOG.isInfoEnabled())
+ {
+ LOG.info("PageRankDb: merging with existing pageRankDb: " + pageRankDb);
+ }
+ // try to merge
+ Path newPageRankDb = FileOutputFormat.getOutputPath(job);
+ job = PageRankDbMerger.createMergeJob(getConf(), pageRankDb, normalize, filter);
+ FileInputFormat.addInputPath(job, currentPageRankDb);
+ FileInputFormat.addInputPath(job, newPageRankDb);
+ try
+ {
+ JobClient.runJob(job);
+ }
+ catch (IOException e)
+ {
+ LockUtil.removeLockFile(fs, lock);
+ fs.delete(newPageRankDb, true);
+ throw e;
+ }
+ fs.delete(newPageRankDb, true);
+ }
+ PageRankDb.install(job, pageRankDb);
+ if (LOG.isInfoEnabled())
+ { LOG.info("PageRankDb: done"); }
+ }
+
+ private static JobConf createJob(Configuration config, Path pageRankDb, boolean normalize, boolean filter)
+ {
+ Path newPageRankDb = new Path("pagerankdb-" + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
+
+ JobConf job = new NutchJob(config);
+ job.setJobName("pagerankdb " + pageRankDb);
+
+ job.setInputFormat(SequenceFileInputFormat.class);
+
+ job.setMapperClass(PageRankDb.class);
+ job.setCombinerClass(PageRankDbMerger.class);
+ // if we don't run the mergeJob, perform normalization/filtering now
+ if (normalize || filter)
+ {
+ try
+ {
+ FileSystem fs = FileSystem.get(config);
+ if (!fs.exists(pageRankDb))
+ {
+ job.setBoolean(LinkDbFilter.URL_FILTERING, filter);
+ job.setBoolean(LinkDbFilter.URL_NORMALIZING, normalize);
+ }
+ }
+ catch (Exception e)
+ {
+ LOG.warn("PageRankDb createJob: " + e);
+ }
+ }
+ job.setReducerClass(PageRankDbMerger.class);
+
+ FileOutputFormat.setOutputPath(job, newPageRankDb);
+ job.setOutputFormat(MapFileOutputFormat.class);
+ job.setBoolean("mapred.output.compress", false);
+ job.setOutputKeyClass(Text.class);
+
+ // DIFF: Use IntWritable instead of Inlinks as the output value type.
+ job.setOutputValueClass(IntWritable.class);
+
+ return job;
+ }
+
+ public static void install(JobConf job, Path pageRankDb) throws IOException
+ {
+ Path newPageRankDb = FileOutputFormat.getOutputPath(job);
+ FileSystem fs = new JobClient(job).getFs();
+ Path old = new Path(pageRankDb, "old");
+ Path current = new Path(pageRankDb, CURRENT_NAME);
+ if (fs.exists(current))
+ {
+ if (fs.exists(old)) fs.delete(old, true);
+ fs.rename(current, old);
+ }
+ fs.mkdirs(pageRankDb);
+ fs.rename(newPageRankDb, current);
+ if (fs.exists(old)) fs.delete(old, true);
+ LockUtil.removeLockFile(fs, new Path(pageRankDb, LOCK_NAME));
+ }
+
+ public static void main(String[] args) throws Exception
+ {
+ int res = ToolRunner.run(NutchConfiguration.create(), new PageRankDb(), args);
+ System.exit(res);
+ }
+
+ public int run(String[] args) throws Exception
+ {
+ if (args.length < 2)
+ {
+ System.err.println("Usage: PageRankDb <pagerankdb> (-dir <segmentsDir> | <seg1> <seg2> ...) [-force] [-noNormalize] [-noFilter]");
+ System.err.println("\tpagerankdb\toutput PageRankDb to create or update");
+ System.err.println("\t-dir segmentsDir\tparent directory of several segments, OR");
+ System.err.println("\tseg1 seg2 ...\t list of segment directories");
+ System.err.println("\t-force\tforce update even if PageRankDb appears to be locked (CAUTION advised)");
+ System.err.println("\t-noNormalize\tdon't normalize link URLs");
+ System.err.println("\t-noFilter\tdon't apply URLFilters to link URLs");
+ return -1;
+ }
+ Path segDir = null;
+ final FileSystem fs = FileSystem.get(getConf());
+ Path db = new Path(args[0]);
+ ArrayList<Path> segs = new ArrayList<Path>();
+ boolean filter = true;
+ boolean normalize = true;
+ boolean force = false;
+ for (int i = 1; i < args.length; i++)
+ {
+ if (args[i].equals("-dir"))
+ {
+ segDir = new Path(args[++i]);
+ FileStatus[] files = fs.listStatus(segDir, HadoopFSUtil.getPassDirectoriesFilter(fs));
+ if (files != null) segs.addAll(Arrays.asList(HadoopFSUtil.getPaths(files)));
+ break;
+ }
+ else if (args[i].equalsIgnoreCase("-noNormalize"))
+ {
+ normalize = false;
+ }
+ else if (args[i].equalsIgnoreCase("-noFilter"))
+ {
+ filter = false;
+ }
+ else if (args[i].equalsIgnoreCase("-force"))
+ {
+ force = true;
+ }
+ else segs.add(new Path(args[i]));
+ }
+ try
+ {
+ invert(db, segs.toArray(new Path[segs.size()]), normalize, filter, force);
+ return 0;
+ }
+ catch (Exception e)
+ {
+ LOG.fatal("PageRankDb: " + StringUtils.stringifyException(e));
+ return -1;
+ }
+ }
+
+}
Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDbFilter.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDbFilter.java (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDbFilter.java 2009-02-23 03:54:47 UTC (rev 2683)
@@ -0,0 +1,118 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.archive.nutchwax;
+
+import java.io.IOException;
+import java.util.Iterator;
+
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.IntWritable;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.Mapper;
+import org.apache.hadoop.mapred.OutputCollector;
+import org.apache.hadoop.mapred.Reporter;
+import org.apache.nutch.net.URLFilters;
+import org.apache.nutch.net.URLNormalizers;
+
+/**
+ * <p>This class provides a way to separate the URL normalization
+ * and filtering steps from the rest of LinkDb manipulation code.</p>
+ * <p>Aaron Binns @ archive.org: see comments in PageRankDbMerger.</p>
+ *
+ * @author Andrzej Bialecki
+ * @author Aaron Binns (archive.org)
+ */
+public class PageRankDbFilter implements Mapper<Text, IntWritable, Text, IntWritable>
+{
+ public static final String URL_FILTERING = "linkdb.url.filters";
+
+ public static final String URL_NORMALIZING = "linkdb.url.normalizer";
+
+ public static final String URL_NORMALIZING_SCOPE = "linkdb.url.normalizer.scope";
+
+ private boolean filter;
+
+ private boolean normalize;
+
+ private URLFilters filters;
+
+ private URLNormalizers normalizers;
+
+ private String scope;
+
+ public static final Log LOG = LogFactory.getLog(PageRankDbFilter.class);
+
+ private Text newKey = new Text();
+
+ public void configure(JobConf job)
+ {
+ filter = job.getBoolean(URL_FILTERING, false);
+ normalize = job.getBoolean(URL_NORMALIZING, false);
+ if (filter)
+ {
+ filters = new URLFilters(job);
+ }
+ if (normalize)
+ {
+ scope = job.get(URL_NORMALIZING_SCOPE, URLNormalizers.SCOPE_LINKDB);
+ normalizers = new URLNormalizers(job, scope);
+ }
+ }
+
+ public void close()
+ {
+ }
+
+ public void map(Text key, IntWritable value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
+ {
+ String url = key.toString();
+ // Inlinks result = new Inlinks();
+ if (normalize)
+ {
+ try
+ {
+ url = normalizers.normalize(url, scope); // normalize the url
+ }
+ catch (Exception e)
+ {
+ LOG.warn("Skipping " + url + ":" + e);
+ url = null;
+ }
+ }
+ if (url != null && filter)
+ {
+ try
+ {
+ url = filters.filter(url); // filter the url
+ }
+ catch (Exception e)
+ {
+ LOG.warn("Skipping " + url + ":" + e);
+ url = null;
+ }
+ }
+ if (url == null) return; // didn't pass the filters
+
+ // DIFF: Now that normalizers and filters have run, just emit the
+ // <url,value> pair. No processing to be done on the value.
+ Text newKey = new Text( url );
+ output.collect( newKey, value );
+ }
+}
Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDbMerger.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDbMerger.java (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDbMerger.java 2009-02-23 03:54:47 UTC (rev 2683)
@@ -0,0 +1,199 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.archive.nutchwax;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.Random;
+
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.IntWritable;
+import org.apache.hadoop.mapred.FileInputFormat;
+import org.apache.hadoop.mapred.FileOutputFormat;
+import org.apache.hadoop.mapred.JobClient;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.MapFileOutputFormat;
+import org.apache.hadoop.mapred.OutputCollector;
+import org.apache.hadoop.mapred.Reducer;
+import org.apache.hadoop.mapred.Reporter;
+import org.apache.hadoop.mapred.SequenceFileInputFormat;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.crawl.LinkDbFilter;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.NutchJob;
+
+/**
+ * This tool merges several PageRankDb-s into one, optionally filtering
+ * URLs through the current URLFilters, to skip prohibited URLs and
+ * links.
+ *
+ * <p>It's possible to use this tool just for filtering - in that case
+ * only one PageRankDb should be specified in arguments.</p>
+ * <p>If more than one PageRankDb contains information about the same URL,
+ * all inlinks are accumulated, but only at most <code>db.max.inlinks</code>
+ * inlinks will ever be added.</p>
+ * <p>If activated, URLFilters will be applied to both the target URLs and
+ * to any incoming link URL. If a target URL is prohibited, all
+ * inlinks to that target will be removed, including the target URL. If
+ * some of incoming links are prohibited, only they will be removed, and they
+ * won't count when checking the above-mentioned maximum limit.</p>
+ * <p>Aaron Binns @ archive.org:
+ * <blockquote>
+ * Copy/paste/edit from LinkDbMerger. We only care about the inlink
+ * <em>count</em>, not the inlinks themselves. In fact, trying to
+ * retain the inlinks doesn't scale when processing 100s of millions
+ * of documents. In large part, due to the fact that the Inlinks
+ * object wants to keep all of the inlinks in memory at once,
+ * i.e. in a Set. This doesn't work when we have 600 million
+ * documents and a single URL could easily have a million inlinks.
+ * </blockquote></p>
+ *
+ * @author Andrzej Bialecki
+ * @author Aaron Binns (archive.org)
+ */
+public class PageRankDbMerger extends Configured
+ implements Tool, Reducer<Text, IntWritable, Text, IntWritable>
+{
+ private static final Log LOG = LogFactory.getLog(PageRankDbMerger.class);
+
+ private int maxInlinks;
+
+ public PageRankDbMerger()
+ {
+
+ }
+
+ public PageRankDbMerger(Configuration conf)
+ {
+ setConf(conf);
+ }
+
+ public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
+ {
+ // DIFF: Simply sum the count values for the key.
+ int count = 0;
+ while ( values.hasNext( ) )
+ {
+ count += values.next( ).get( );
+ }
+ output.collect( key, new IntWritable( count ) );
+ }
+
+ public void configure(JobConf job)
+ {
+ maxInlinks = job.getInt("db.max.inlinks", 10000);
+ }
+
+ public void close() throws IOException
+ { }
+
+ public void merge(Path output, Path[] dbs, boolean normalize, boolean filter) throws Exception
+ {
+ JobConf job = createMergeJob(getConf(), output, normalize, filter);
+ for (int i = 0; i < dbs.length; i++)
+ {
+ FileInputFormat.addInputPath(job, new Path(dbs[i], PageRankDb.CURRENT_NAME));
+ }
+ JobClient.runJob(job);
+ FileSystem fs = FileSystem.get(getConf());
+ fs.mkdirs(output);
+ fs.rename(FileOutputFormat.getOutputPath(job), new Path(output, PageRankDb.CURRENT_NAME));
+ }
+
+ public static JobConf createMergeJob(Configuration config, Path pageRankDb, boolean normalize, boolean filter)
+ {
+ Path newPageRankDb =
+ new Path("pagerankdb-merge-" +
+ Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
+
+ JobConf job = new NutchJob(config);
+ job.setJobName("pagerankdb merge " + pageRankDb);
+
+ job.setInputFormat(SequenceFileInputFormat.class);
+
+ job.setMapperClass(PageRankDbFilter.class);
+ job.setBoolean(LinkDbFilter.URL_NORMALIZING, normalize);
+ job.setBoolean(LinkDbFilter.URL_FILTERING, filter);
+ job.setReducerClass(PageRankDbMerger.class);
+
+ FileOutputFormat.setOutputPath(job, newPageRankDb);
+ job.setOutputFormat(MapFileOutputFormat.class);
+ job.setBoolean("mapred.output.compress", true);
+ job.setOutputKeyClass(Text.class);
+
+ // DIFF: Use IntWritable instead of Inlinks as the output value type.
+ job.setOutputValueClass(IntWritable.class);
+
+ return job;
+ }
+
+ /**
+ * @param args
+ */
+ public static void main(String[] args) throws Exception
+ {
+ int res = ToolRunner.run(NutchConfiguration.create(), new PageRankDbMerger(), args);
+ System.exit(res);
+ }
+
+ public int run(String[] args) throws Exception
+ {
+ if (args.length < 2)
+ {
+ System.err.println("Usage: PageRankDbMerger <output_pagerankdb> <pagerankdb1> [<pagerankdb2> <pagerankdb3> ...] [-normalize] [-filter]");
+ System.err.println("\toutput_pagerankdb\toutput PageRankDb");
+ System.err.println("\tpagerankdb1 ...\tinput PageRankDb-s (single input PageRankDb is ok)");
+ System.err.println("\t-normalize\tuse URLNormalizer on both fromUrls and toUrls in pagerankdb(s) (usually not needed)");
+ System.err.println("\t-filter\tuse URLFilters on both fromUrls and toUrls in pagerankdb(s)");
+ return -1;
+ }
+ Path output = new Path(args[0]);
+ ArrayList<Path> dbs = new ArrayList<Path>();
+ boolean normalize = false;
+ boolean filter = false;
+ for (int i = 1; i < args.length; i++)
+ {
+ if (args[i].equals("-filter"))
+ {
+ filter = true;
+ } else if (args[i].equals("-normalize"))
+ {
+ normalize = true;
+ } else dbs.add(new Path(args[i]));
+ }
+ try
+ {
+ merge(output, dbs.toArray(new Path[dbs.size()]), normalize, filter);
+ return 0;
+ }
+ catch (Exception e)
+ {
+ LOG.fatal("PageRankDbMerger: " + StringUtils.stringifyException(e));
+ return -1;
+ }
+ }
+
+}
Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java 2009-02-10 22:19:48 UTC (rev 2682)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java 2009-02-23 03:54:47 UTC (rev 2683)
@@ -133,8 +133,6 @@
return -1;
}
- PrintWriter output = new PrintWriter( new OutputStreamWriter( fs.create( outputPath ).getWrappedStream( ), "UTF-8" ) );
-
if ( pos >= args.length )
{
System.err.println( "Error: missing linkdb" );
@@ -155,11 +153,17 @@
}
else
{
- FileStatus[] fstats = fs.listStatus( new Path(args[pos]+"/current"), HadoopFSUtil.getPassDirectoriesFilter(fs));
- mapfiles.addAll(Arrays.asList(HadoopFSUtil.getPaths(fstats)));
+ for ( ; pos < args.length ; pos++ )
+ {
+ FileStatus[] fstats = fs.listStatus( new Path(args[pos]+"/current"), HadoopFSUtil.getPassDirectoriesFilter(fs));
+ mapfiles.addAll(Arrays.asList(HadoopFSUtil.getPaths(fstats)));
+ }
}
System.out.println( "mapfiles = " + mapfiles );
+
+ PrintWriter output = new PrintWriter( new OutputStreamWriter( fs.create( outputPath ).getWrappedStream( ), "UTF-8" ) );
+
try
{
for ( Path p : mapfiles )
@@ -171,24 +175,28 @@
while ( reader.next( key, value ) )
{
- if ( key instanceof Text && value instanceof Inlinks )
+ if ( ! (key instanceof Text) ) continue ;
+
+ String toUrl = ((Text) key).toString( );
+
+ // HACK: Should make this into some externally configurable regex.
+ if ( ! toUrl.startsWith( "http" ) ) continue;
+
+ int count = -1;
+ if ( value instanceof IntWritable )
{
- Text toUrl = (Text) key;
+ count = ( (IntWritable) value ).get( );
+ }
+ else if ( value instanceof Inlinks )
+ {
Inlinks inlinks = (Inlinks) value;
- if ( inlinks.size( ) < threshold )
- {
- continue ;
- }
+ count = inlinks.size( );
+ }
+
+ if ( count < threshold ) continue ;
- String toUrlString = toUrl.toString( );
-
- // HACK: Should make this into some externally configurable regex.
- if ( toUrlString.startsWith( "http" ) )
- {
- output.println( inlinks.size( ) + " " + toUrl.toString() );
- }
- }
+ output.println( count + " " + toUrl );
}
}
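The net effect of the map/combine/reduce chain is plain inlink counting. A toy walk-through with made-up values (class name and URL are illustrative only):

    public class InlinkCountDemo {
      public static void main(String[] args) {
        // Map phase: each surviving outlink emits (toUrl, 1) instead of
        // LinkDb's (toUrl, Inlinks). Suppose three pages link to the same
        // hypothetical URL, so the reducer sees the values [1, 1, 1]:
        int[] values = { 1, 1, 1 };
        int count = 0;
        for (int v : values) {  // PageRankDbMerger.reduce just sums
          count += v;
        }
        // output.collect(key, new IntWritable(count)) would then emit
        // ("http://example.org/a", 3) -- a count, never a set of inlinks.
        System.out.println(count); // 3
      }
    }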
From: <bra...@us...> - 2009-02-10 22:52:06
Revision: 2682
http://archive-access.svn.sourceforge.net/archive-access/?rev=2682&view=rev
Author: bradtofel
Date: 2009-02-10 22:19:48 +0000 (Tue, 10 Feb 2009)
Log Message:
-----------
TWEAK: updated heritrix commons to 2.0.2 which has several bug fixes.
TWEAK: updated org.mozilla.juniversalchardet to 1.0.3 which has index OOB error fix.
Modified Paths:
--------------
trunk/archive-access/projects/wayback/wayback-core/pom.xml
Modified: trunk/archive-access/projects/wayback/wayback-core/pom.xml
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/pom.xml 2009-01-31 00:57:49 UTC (rev 2681)
+++ trunk/archive-access/projects/wayback/wayback-core/pom.xml 2009-02-10 22:19:48 UTC (rev 2682)
@@ -57,7 +57,7 @@
<dependency>
<groupId>org.archive.heritrix</groupId>
<artifactId>commons</artifactId>
- <version>2.0.1-SNAPSHOT</version>
+ <version>2.0.2-SNAPSHOT</version>
</dependency>
<dependency>
<groupId>org.archive.access-control</groupId>
@@ -67,7 +67,7 @@
<dependency>
<groupId>org.mozilla</groupId>
<artifactId>juniversalchardet</artifactId>
- <version>1.0</version>
+ <version>1.0.3</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
Revision: 2681
http://archive-access.svn.sourceforge.net/archive-access/?rev=2681&view=rev
Author: bradtofel
Date: 2009-01-31 00:57:49 +0000 (Sat, 31 Jan 2009)
Log Message:
-----------
BUGFIX(ACC-60): now we omit sending the original Content-Length HTTP header.
Modified Paths:
--------------
trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/RedirectRewritingHttpHeaderProcessor.java
Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/RedirectRewritingHttpHeaderProcessor.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/RedirectRewritingHttpHeaderProcessor.java 2009-01-29 23:52:10 UTC (rev 2680)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/RedirectRewritingHttpHeaderProcessor.java 2009-01-31 00:57:49 UTC (rev 2681)
@@ -62,7 +62,9 @@
// first stick it in as-is, or with prefix, then maybe we'll overwrite
// with the later logic.
if(prefix == null) {
- output.put(key, value);
+ if(!keyUp.equals(HTTP_LENGTH_HEADER_UP)) {
+ output.put(key, value);
+ }
} else {
output.put(prefix + key, value);
}
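A small sketch of why the header has to go, with a hypothetical archived value; it assumes HTTP_LENGTH_HEADER_UP is the upper-cased "CONTENT-LENGTH" constant (class name and values are illustrative):

    import java.util.HashMap;
    import java.util.Map;

    public class DropLengthHeader {
      public static void main(String[] args) {
        Map<String, String> output = new HashMap<String, String>();
        // Hypothetical archived header: the recorded body was 8192 bytes, but
        // the rewritten replay body is a different size, so forwarding this
        // value would make clients truncate or hang. The servlet container
        // recomputes the correct length itself.
        String key = "Content-Length", value = "8192";
        if (!key.toUpperCase().equals("CONTENT-LENGTH")) {
          output.put(key, value); // other headers still pass through unchanged
        }
        System.out.println(output); // {} -- Content-Length was omitted
      }
    }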