Messages per month:

| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2005 | – | – | – | – | – | – | 1 | 10 | 36 | 339 | 103 | 152 |
| 2006 | 141 | 102 | 125 | 203 | 57 | 30 | 139 | 46 | 64 | 105 | 34 | 162 |
| 2007 | 81 | 57 | 141 | 72 | 9 | 1 | 144 | 88 | 40 | 43 | 34 | 20 |
| 2008 | 44 | 45 | 16 | 36 | 8 | 77 | 177 | 66 | 8 | 33 | 13 | 37 |
| 2009 | 2 | 5 | 8 | – | 36 | 19 | 46 | 8 | 1 | 66 | 61 | 10 |
| 2010 | 13 | 16 | 38 | 76 | 47 | 32 | 35 | 45 | 20 | 61 | 24 | 16 |
| 2011 | 22 | 34 | 11 | 8 | 24 | 23 | 11 | 42 | 81 | 48 | 21 | 20 |
| 2012 | 30 | 25 | 4 | 6 | 1 | 5 | 5 | 8 | 6 | 6 | – | – |
From: <bra...@us...> - 2009-05-20 00:41:51
Revision: 2706
http://archive-access.svn.sourceforge.net/archive-access/?rev=2706&view=rev
Author: bradtofel
Date: 2009-05-20 00:41:15 +0000 (Wed, 20 May 2009)
Log Message:
-----------
TWEAK: added getter for ResultURIConverter
Modified Paths:
--------------
trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/UIResults.java
Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/UIResults.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/UIResults.java 2009-05-20 00:40:05 UTC (rev 2705)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/UIResults.java 2009-05-20 00:41:15 UTC (rev 2706)
@@ -123,8 +123,14 @@
/*
* GENERAL GETTERS:
*/
-
/**
+ * @return the uriConverter
+ */
+ public ResultURIConverter getUriConverter() {
+ return uriConverter;
+ }
+
+ /**
* @return Returns the wbRequest.
*/
public WaybackRequest getWbRequest() {
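For illustration, a minimal sketch (not part of this commit) of how the new getter might be used from custom view code. The helper class is hypothetical, and it assumes ResultURIConverter exposes a makeReplayURI(datespec, url) method for building replay links:

    import org.archive.wayback.ResultURIConverter;
    import org.archive.wayback.core.UIResults;

    // Hypothetical helper: build a replay link for a capture using the
    // converter now exposed by UIResults.getUriConverter().
    public final class ReplayLinks {
        public static String replayLink(UIResults results,
                                        String datespec, String url) {
            ResultURIConverter converter = results.getUriConverter();
            return converter.makeReplayURI(datespec, url);
        }
    }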
From: <bi...@us...> - 2009-05-05 22:18:28
Revision: 2704
http://archive-access.svn.sourceforge.net/archive-access/?rev=2704&view=rev
Author: binzino
Date: 2009-05-05 22:17:48 +0000 (Tue, 05 May 2009)
Log Message:
-----------
Oops, didn't have the updated versions checked-in when I did the
release copy. Fixed.
Added Paths:
-----------
tags/nutchwax-0_12_4/archive/README.txt
tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt
Removed Paths:
-------------
tags/nutchwax-0_12_4/archive/README.txt
tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt
Deleted: tags/nutchwax-0_12_4/archive/README.txt
===================================================================
--- tags/nutchwax-0_12_4/archive/README.txt 2009-05-05 21:46:40 UTC (rev 2703)
+++ tags/nutchwax-0_12_4/archive/README.txt 2009-05-05 22:17:48 UTC (rev 2704)
@@ -1,104 +0,0 @@
-
-README.txt
-2008-03-08
-Aaron Binns
-
-Table of Contents
- o Introduction
- o Build and Install
- o Tutorial
-
-
-======================================================================
-Introduction
-======================================================================
-
-Welcome to NutchWAX 0.12.4!
-
-NutchWAX is a set of add-ons to Nutch in order to index and search
-archived web data.
-
-These add-ons are developed and maintained by the Internet Archive Web
-Team in conjunction with a broad community of contributors, partners
-and end-users.
-
-The name "NutchWAX" stands for "Nutch (W)eb (A)rchive e(X)tensions".
-
-Since NutchWAX is a set of add-ons to Nutch, you should already be
-familiar with Nutch before using NutchWAX.
-
-
-The goal of NutchWAX is to enable full-text indexing and searching of
-documents stored in web archive file formats (ARC and WARC).
-
-The way we achieve that goal is by providing plugins and add-on tools
-to Nutch to read documents directly from ARC/WARC files. We call this
-process "importing" archive files.
-
-Importing produces a Nutch segment, the same as when Nutch is used to
-crawl documents itself. In essence, document importing replaces the
-conventional "generate/fetch/update" cycle of Nutch.
-
-Once the archival documents have been imported into a segment, the
-regular Nutch commands to index the document contents can proceed as
-normal.
-
-======================================================================
-
-The main NutchWAX add-ons are:
-
- bin/nutchwax
-
- A shell script that is used to run the NutchWAX commands, such as
- document importing.
-
- This is patterned after the 'bin/nutch' shell script.
-
- plugins/index-nutchwax
-
- Indexing plugin which adds NutchWAX-specific metadata fields to the
- indexed document.
-
- plugins/query-nutchwax
-
- Query plugin which allows for querying against the metadata fields
- added by 'index-nutchwax'.
-
- plugins/urlfilter-nutchwax
-
- Filtering plugin which can be used to exclude URLs from import. It
- can be used as part of a NutchWAX de-duplication scheme.
-
- plugins/scoring-nutchwax
-
- Scoring plugin for use at index-time which reads from an external
- "pagerank.txt" file for scoring documents based on the log10 of the
- number of inlinks to a document.
-
- The use of this plugin is optional but can improve the quality of
- search results, especially for very large collections.
-
- conf/nutch-site.xml
-
- Additional configuration properties for NutchWAX, including
- over-rides for properties defined in 'nutch-default.xml'
-
-There is no separate 'lib/nutchwax.jar' file for NutchWAX. NutchWAX
-is distributed in source code form and is intended to be built in
-conjunction with Nutch.
-
-
-======================================================================
-Build and Install
-======================================================================
-
-See "INSTALL.txt" for detailed instructions to build NutchWAX from
-source or install a binary package.
-
-
-======================================================================
-Tutorial
-======================================================================
-
-See "HOWTO.txt" for a quick tutorial on importing, indexing and
-searching a set of documents in a web archive file.
Copied: tags/nutchwax-0_12_4/archive/README.txt (from rev 2703, trunk/archive-access/projects/nutchwax/archive/README.txt)
===================================================================
--- tags/nutchwax-0_12_4/archive/README.txt (rev 0)
+++ tags/nutchwax-0_12_4/archive/README.txt 2009-05-05 22:17:48 UTC (rev 2704)
@@ -0,0 +1,104 @@
+
+README.txt
+2009-05-05
+Aaron Binns
+
+Table of Contents
+ o Introduction
+ o Build and Install
+ o Tutorial
+
+
+======================================================================
+Introduction
+======================================================================
+
+Welcome to NutchWAX 0.12.4!
+
+NutchWAX is a set of add-ons to Nutch in order to index and search
+archived web data.
+
+These add-ons are developed and maintained by the Internet Archive Web
+Team in conjunction with a broad community of contributors, partners
+and end-users.
+
+The name "NutchWAX" stands for "Nutch (W)eb (A)rchive e(X)tensions".
+
+Since NutchWAX is a set of add-ons to Nutch, you should already be
+familiar with Nutch before using NutchWAX.
+
+
+The goal of NutchWAX is to enable full-text indexing and searching of
+documents stored in web archive file formats (ARC and WARC).
+
+The way we achieve that goal is by providing plugins and add-on tools
+to Nutch to read documents directly from ARC/WARC files. We call this
+process "importing" archive files.
+
+Importing produces a Nutch segment, the same as when Nutch is used to
+crawl documents itself. In essence, document importing replaces the
+conventional "generate/fetch/update" cycle of Nutch.
+
+Once the archival documents have been imported into a segment, the
+regular Nutch commands to index the document contents can proceed as
+normal.
+
+======================================================================
+
+The main NutchWAX add-ons are:
+
+ bin/nutchwax
+
+ A shell script that is used to run the NutchWAX commands, such as
+ document importing.
+
+ This is patterned after the 'bin/nutch' shell script.
+
+ plugins/index-nutchwax
+
+ Indexing plugin which adds NutchWAX-specific metadata fields to the
+ indexed document.
+
+ plugins/query-nutchwax
+
+ Query plugin which allows for querying against the metadata fields
+ added by 'index-nutchwax'.
+
+ plugins/urlfilter-nutchwax
+
+ Filtering plugin which can be used to exclude URLs from import. It
+ can be used as part of a NutchWAX de-duplication scheme.
+
+ plugins/scoring-nutchwax
+
+ Scoring plugin for use at index-time which reads from an external
+ "pagerank.txt" file for scoring documents based on the log10 of the
+ number of inlinks to a document.
+
+ The use of this plugin is optional but can improve the quality of
+ search results, especially for very large collections.
+
+ conf/nutch-site.xml
+
+ Additional configuration properties for NutchWAX, including
+ over-rides for properties defined in 'nutch-default.xml'
+
+There is no separate 'lib/nutchwax.jar' file for NutchWAX. NutchWAX
+is distributed in source code form and is intended to be built in
+conjunction with Nutch.
+
+
+======================================================================
+Build and Install
+======================================================================
+
+See "INSTALL.txt" for detailed instructions to build NutchWAX from
+source or install a binary package.
+
+
+======================================================================
+Tutorial
+======================================================================
+
+See "HOWTO.txt" for a quick tutorial on importing, indexing and
+searching a set of documents in a web archive file.
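The import workflow the README describes can also be driven programmatically. A minimal sketch, assuming Importer follows the same Hadoop Tool pattern as the other NutchWAX tools; the driver class and manifest path are placeholders:

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.nutch.util.NutchConfiguration;
    import org.archive.nutchwax.Importer;

    // Hypothetical driver: equivalent to "bin/nutchwax import manifest".
    public class ImportDriver {
        public static void main(String[] args) throws Exception {
            int rc = ToolRunner.run(NutchConfiguration.create(),
                                    new Importer(),
                                    new String[] { "manifest" });
            System.exit(rc);
        }
    }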
Deleted: tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt
===================================================================
--- tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt 2009-05-05 21:46:40 UTC (rev 2703)
+++ tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt 2009-05-05 22:17:48 UTC (rev 2704)
@@ -1,58 +0,0 @@
-
-RELEASE-NOTES.TXT
-2008-03-08
-Aaron Binns
-
-Release notes for NutchWAX 0.12.4
-
-For the most recent updates and information on NutchWAX,
-please visit the project wiki at:
-
- http://webteam.archive.org/confluence/display/search/NutchWAX
-
-
-======================================================================
-Overview
-======================================================================
-
-NutchWAX 0.12.4 contains numerous enhancements and fixes to 0.12.3
-
- o Option to omit storing of content during import.
- o Support for per-collection segments in master/slave config.
- o Additional diagnostic/log messages to help troubleshoot common
- deployment mistakes.
- o PageRankDb similar to LinkDb but only keeping inlink counts.
- o Improved paging through results, handling "paging past the end".
-
-
-======================================================================
-Issues
-======================================================================
-
-For an up-to-date list of NutchWAX issues:
-
- http://webteam.archive.org/jira/browse/WAX
-
-Issues resolved in this release:
-
-WAX-27 Sensible output for requesting page of results past the end.
-
-WAX-34 Add option to omit storing of content in segment
-
-WAX-35 Add pagerankdb similar to linkdb but which only keeps counts
- rather than actual inlinks.
-
-WAX-36 Some additional diagnostics on connecting results to segments
- and snippets would be very helpful.
-
-WAX-37 Per-collection segments not supported in distributed
- master-slave configuration.
-
-WAX-38 Build omits neessary libraries from .job file.
-
-WAX-39 Write more efficient, specialized segment parse_text merging.
-
-
-
-
-
Copied: tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt (from rev 2703, trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt)
===================================================================
--- tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt (rev 0)
+++ tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt 2009-05-05 22:17:48 UTC (rev 2704)
@@ -0,0 +1,57 @@
+
+RELEASE-NOTES.TXT
+2009-05-05
+Aaron Binns
+
+Release notes for NutchWAX 0.12.4
+
+For the most recent updates and information on NutchWAX,
+please visit the project wiki at:
+
+ http://webteam.archive.org/confluence/display/search/NutchWAX
+
+
+======================================================================
+Overview
+======================================================================
+
+NutchWAX 0.12.4 contains numerous enhancements and fixes to 0.12.3
+
+ o Option to omit storing of content during import.
+ o Support for per-collection segments in master/slave config.
+ o Additional diagnostic/log messages to help troubleshoot common
+ deployment mistakes.
+ o PageRankDb similar to LinkDb but only keeping inlink counts.
+ o Improved paging through results, handling "paging past the end".
+
+
+======================================================================
+Issues
+======================================================================
+
+For an up-to-date list of NutchWAX issues:
+
+ http://webteam.archive.org/jira/browse/WAX
+
+Issues resolved in this release:
+
+WAX-27 Sensible output for requesting page of results past the end.
+
+WAX-34 Add option to omit storing of content in segment
+
+WAX-35 Add pagerankdb similar to linkdb but which only keeps counts
+ rather than actual inlinks.
+
+WAX-36 Some additional diagnostics on connecting results to segments
+ and snippets would be very helpful.
+
+WAX-37 Per-collection segments not supported in distributed
+ master-slave configuration.
+
+WAX-38 Build omits neessary libraries from .job file.
+
+WAX-39 Write more efficient, specialized segment parse_text merging.
+
+WAX-41 Option to enable/disable the FIELDCACHE in the Nutch IndexSearcher
+
+WAX-42 Add option to continue importing if an arcfile cannot be read.
From: <bi...@us...> - 2009-05-05 21:46:49
Revision: 2703
http://archive-access.svn.sourceforge.net/archive-access/?rev=2703&view=rev
Author: binzino
Date: 2009-05-05 21:46:40 +0000 (Tue, 05 May 2009)
Log Message:
-----------
Updated for NutchWAX 0.12.4 release.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/README.txt
trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt
Modified: trunk/archive-access/projects/nutchwax/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README.txt 2009-05-05 21:44:29 UTC (rev 2702)
+++ trunk/archive-access/projects/nutchwax/archive/README.txt 2009-05-05 21:46:40 UTC (rev 2703)
@@ -1,6 +1,6 @@
README.txt
-2008-03-08
+2009-05-05
Aaron Binns
Table of Contents
Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2009-05-05 21:44:29 UTC (rev 2702)
+++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2009-05-05 21:46:40 UTC (rev 2703)
@@ -1,6 +1,6 @@
RELEASE-NOTES.TXT
-2008-03-08
+2009-05-05
Aaron Binns
Release notes for NutchWAX 0.12.4
@@ -52,7 +52,6 @@
WAX-39 Write more efficient, specialized segment parse_text merging.
+WAX-41 Option to enable/disable the FIELDCACHE in the Nutch IndexSearcher
-
-
-
+WAX-42 Add option to continue importing if an arcfile cannot be read.
From: <bi...@us...> - 2009-05-05 21:44:37
Revision: 2702
http://archive-access.svn.sourceforge.net/archive-access/?rev=2702&view=rev
Author: binzino
Date: 2009-05-05 21:44:29 +0000 (Tue, 05 May 2009)
Log Message:
-----------
NutchWAX 0.12.4 release.
Added Paths:
-----------
tags/nutchwax-0_12_4/
tags/nutchwax-0_12_4/archive/
From: <bi...@us...> - 2009-05-05 21:15:55
Revision: 2701
http://archive-access.svn.sourceforge.net/archive-access/?rev=2701&view=rev
Author: binzino
Date: 2009-05-05 21:15:28 +0000 (Tue, 05 May 2009)
Log Message:
-----------
Changed the default location to look for search.xsl. It will likely need editing post-deployment, however.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml
Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml 2009-05-05 21:14:39 UTC (rev 2700)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml 2009-05-05 21:15:28 UTC (rev 2701)
@@ -59,7 +59,7 @@
<filter-class>org.archive.nutchwax.XSLTFilter</filter-class>
<init-param>
<param-name>xsltUrl</param-name>
- <param-value>style/search.xsl</param-value>
+ <param-value>webapps/nutchwax-0.12.4/search.xsl</param-value>
</init-param>
</filter>
From: <bi...@us...> - 2009-05-05 21:15:47
Revision: 2700
http://archive-access.svn.sourceforge.net/archive-access/?rev=2700&view=rev
Author: binzino
Date: 2009-05-05 21:14:39 +0000 (Tue, 05 May 2009)
Log Message:
-----------
Fix typo
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt
Modified: trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt 2009-05-05 20:24:22 UTC (rev 2699)
+++ trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt 2009-05-05 21:14:39 UTC (rev 2700)
@@ -222,7 +222,7 @@
<name>nutchwax.filter.index</name>
<value>
url:false:true:true
- url:flase:true:false:true:exacturl
+ url:false:true:false:true:exacturl
orig:false
digest:false
filename:false
From: <bi...@us...> - 2009-05-05 20:24:28
Revision: 2699
http://archive-access.svn.sourceforge.net/archive-access/?rev=2699&view=rev
Author: binzino
Date: 2009-05-05 20:24:22 +0000 (Tue, 05 May 2009)
Log Message:
-----------
WAX-42. Add option to continue/abort importing after read error on
archive file.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java
Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java 2009-05-05 20:20:45 UTC (rev 2698)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java 2009-05-05 20:24:22 UTC (rev 2699)
@@ -210,6 +210,15 @@
reporter.progress();
}
}
+ catch ( Exception e )
+ {
+ LOG.warn( "Error processing archive file: " + arcUrl, e );
+
+ if ( jobConf.getBoolean( "nutchwax.import.abortOnArchiveReadError", false ) )
+ {
+ throw new IOException( e );
+ }
+ }
finally
{
r.close();
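The new behavior is opt-in. A minimal sketch of enabling it before submitting an import job; only the property name comes from this patch, and the JobConf wiring around it is illustrative:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.nutch.util.NutchConfiguration;

    public class AbortOnReadError {
        public static void main(String[] args) {
            JobConf job = new JobConf(NutchConfiguration.create());
            // Default is false: log a warning and continue importing.
            job.setBoolean("nutchwax.import.abortOnArchiveReadError", true);
            System.out.println(
                job.getBoolean("nutchwax.import.abortOnArchiveReadError",
                               false));
        }
    }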
From: <bi...@us...> - 2009-05-05 20:20:48
Revision: 2698
http://archive-access.svn.sourceforge.net/archive-access/?rev=2698&view=rev
Author: binzino
Date: 2009-05-05 20:20:45 +0000 (Tue, 05 May 2009)
Log Message:
-----------
Fixed typo.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2009-05-05 19:24:16 UTC (rev 2697)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2009-05-05 20:20:45 UTC (rev 2698)
@@ -186,7 +186,7 @@
<property>
<name>searcher.fieldcache</name>
- <property>true</property>
+ <value>true</value>
</property>
</configuration>
From: <bi...@us...> - 2009-05-05 19:25:06
Revision: 2697
http://archive-access.svn.sourceforge.net/archive-access/?rev=2697&view=rev
Author: binzino
Date: 2009-05-05 19:24:16 +0000 (Tue, 05 May 2009)
Log Message:
-----------
Fix WAX-41. Added option to use fieldcache or not when handling
searches using 'dedup' feature.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java
Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2009-05-05 17:52:47 UTC (rev 2696)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2009-05-05 19:24:16 UTC (rev 2697)
@@ -184,4 +184,9 @@
<value>80</value>
</property>
+<property>
+ <name>searcher.fieldcache</name>
+ <property>true</property>
+</property>
+
</configuration>
Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java 2009-05-05 17:52:47 UTC (rev 2696)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java 2009-05-05 19:24:16 UTC (rev 2697)
@@ -136,9 +136,9 @@
private Hits translateHits(TopDocs topDocs,
String dedupField, String sortField)
throws IOException {
-
+
String[] dedupValues = null;
- if (dedupField != null)
+ if (dedupField != null && this.conf.getBoolean( "searcher.fieldcache", true ) )
dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);
ScoreDoc[] scoreDocs = topDocs.scoreDocs;
@@ -164,7 +164,33 @@
}
}
- String dedupValue = dedupValues == null ? null : dedupValues[doc];
+ String dedupValue = "";
+ if ( dedupValues != null )
+ {
+ dedupValue = dedupValues[doc];
+ }
+ else
+ {
+ if ( "site".equals( dedupField ) )
+ {
+ String exactUrl = reader.document( doc ).get( "exacturl");
+ try
+ {
+ java.net.URL u = new java.net.URL( exactUrl );
+ dedupValue = u.getHost();
+
+ System.out.println("Dedup value hack:" + dedupValue);
+ }
+ catch ( java.net.MalformedURLException e )
+ {
+ // Eat it.
+ }
+ }
+ else
+ {
+ dedupValue = reader.document( doc ).get( dedupField );
+ }
+ }
hits[i] = new Hit(doc, sortValue, dedupValue);
}
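Note that the nutch-site.xml hunk above writes <property>true</property> where <value>true</value> was intended; r2698, earlier in this digest, corrects it. A minimal sketch of the toggle that translateHits() reads (property name and default are taken from the patch; the setup code is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class FieldCacheToggle {
        public static void main(String[] args) {
            Configuration conf = NutchConfiguration.create();
            // false: dedup values are read per document from stored
            // fields (less heap, more per-hit I/O). Default is true.
            conf.setBoolean("searcher.fieldcache", false);
            System.out.println(conf.getBoolean("searcher.fieldcache", true));
        }
    }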
From: <bi...@us...> - 2009-05-05 17:53:28
Revision: 2696
http://archive-access.svn.sourceforge.net/archive-access/?rev=2696&view=rev
Author: binzino
Date: 2009-05-05 17:52:47 +0000 (Tue, 05 May 2009)
Log Message:
-----------
Fix typo.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2009-05-05 17:52:20 UTC (rev 2695)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2009-05-05 17:52:47 UTC (rev 2696)
@@ -44,7 +44,7 @@
<name>nutchwax.filter.index</name>
<value>
url:false:true:true
- url:flase:true:false:true:exacturl
+ url:false:true:false:true:exacturl
orig:false
digest:false
filename:false
From: <bi...@us...> - 2009-05-05 17:53:03
Revision: 2695
http://archive-access.svn.sourceforge.net/archive-access/?rev=2695&view=rev
Author: binzino
Date: 2009-05-05 17:52:20 +0000 (Tue, 05 May 2009)
Log Message:
-----------
Fix typo
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/java/org/apache/lucene/index/ArchiveParallelReader.java
Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/apache/lucene/index/ArchiveParallelReader.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/apache/lucene/index/ArchiveParallelReader.java 2009-03-08 22:59:46 UTC (rev 2694)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/apache/lucene/index/ArchiveParallelReader.java 2009-05-05 17:52:20 UTC (rev 2695)
@@ -248,7 +248,7 @@
* searching behavior where a field is only searched in the first
* index that has the field.</p>
* <p>This differs from the bundled Lucene <code>ParallelReader</code>,
- * which adds all vales from every index that has the field.</p>
+ * which adds all values from every index that has the field.</p>
* <p>The <code>fieldSelector<code> parameter is ignored.</p>
* <h3>Implementation Notes</h3>
* <p>Since getting the document from the reader is the expensive
From: <bi...@us...> - 2009-03-08 22:59:48
Revision: 2694
http://archive-access.svn.sourceforge.net/archive-access/?rev=2694&view=rev
Author: binzino
Date: 2009-03-08 22:59:46 +0000 (Sun, 08 Mar 2009)
Log Message:
-----------
Added commands to drive recently added tools.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/bin/nutchwax
Modified: trunk/archive-access/projects/nutchwax/archive/bin/nutchwax
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/bin/nutchwax 2009-03-08 21:43:33 UTC (rev 2693)
+++ trunk/archive-access/projects/nutchwax/archive/bin/nutchwax 2009-03-08 22:59:46 UTC (rev 2694)
@@ -42,6 +42,14 @@
shift
${NUTCH_HOME}/bin/nutch org.archive.nutchwax.Importer $@
;;
+ pagerankdb)
+ shift
+ ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.PageRankDb $@
+ ;;
+ pagerankdbmerger)
+ shift
+ ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.PageRankDbMerger $@
+ ;;
add-dates)
shift
${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.DateAdder $@
@@ -50,18 +58,25 @@
shift
${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.DumpParallelIndex $@
;;
- pagerank)
+ pageranker)
shift
${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.PageRanker $@
;;
+ parsetextmerger)
+ shift
+ ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.ParseTextCombiner $@
+ ;;
*)
echo ""
echo "Usage: nutchwax COMMAND"
echo "where COMMAND is one of:"
- echo " import Import ARCs into a new Nutch segment"
- echo " add-dates Add dates to a parallel index"
- echo " dumpindex Dump an index or set of parallel indices to stdout"
- echo " pagerank Generate pagerank file for URLs in a 'linkdb'."
+ echo " import Import ARCs into a new Nutch segment"
+ echo " pagerankdb Generate pagerankdb for a segment"
+ echo " pagerankdbmerger Merge multiple pagerankdbs"
+ echo " pageranker Generate pagerank.txt file from 'pagerankdb's or 'linkdb's"
+ echo " parsetextmerger Merge segement parse_text/part-nnnnn directories."
+ echo " add-dates Add dates to a parallel index"
+ echo " dumpindex Dump an index or set of parallel indices to stdout"
echo ""
exit 1
;;
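With this patch applied, invoking the script without a recognized command prints the expanded usage summary, e.g.:

    $ ${NUTCH_HOME}/bin/nutchwax

    Usage: nutchwax COMMAND
    where COMMAND is one of:
      import            Import ARCs into a new Nutch segment
      pagerankdb        Generate pagerankdb for a segment
      pagerankdbmerger  Merge multiple pagerankdbs
      pageranker        Generate pagerank.txt file from 'pagerankdb's or 'linkdb's
      parsetextmerger   Merge segment parse_text/part-nnnnn directories.
      add-dates         Add dates to a parallel index
      dumpindex         Dump an index or set of parallel indices to stdout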
From: <bi...@us...> - 2009-03-08 21:43:45
Revision: 2693
http://archive-access.svn.sourceforge.net/archive-access/?rev=2693&view=rev
Author: binzino
Date: 2009-03-08 21:43:33 +0000 (Sun, 08 Mar 2009)
Log Message:
-----------
Updated documentation for 0.12.4 release.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt
trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
trunk/archive-access/projects/nutchwax/archive/README.txt
trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt
Modified: trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt 2009-03-08 20:44:25 UTC (rev 2692)
+++ trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt 2009-03-08 21:43:33 UTC (rev 2693)
@@ -79,7 +79,7 @@
----------------------------------------------------------------------
The file
- /opt/nutchwax-0.12.3/conf/tika-mimetypes.xml
+ /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml
contains two errors: one where a mimetype is referenced before it is
defined; and a second where a definition has an illegal character.
@@ -110,11 +110,11 @@
You can either apply these patches yourself, or copy an already-patched
copy from:
- /opt/nutchwax-0.12.3/contrib/archive/conf/tika-mimetypes.xml
+ /opt/nutchwax-0.12.4/contrib/archive/conf/tika-mimetypes.xml
to
- /opt/nutchwax-0.12.3/conf/tika-mimetypes.xml
+ /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml
----------------------------------------------------------------------
@@ -166,7 +166,6 @@
--------------------------------------------------
indexingfilter.order
--------------------------------------------------
-
Add this property with a value of
org.apache.nutch.indexer.basic.BasicIndexingFilter
@@ -300,7 +299,6 @@
--------------------------------------------------
nutchwax.urlfilter.wayback.canonicalizer
--------------------------------------------------
-
For CDX-based de-duplication, the same URL canonicalization algorithm
must be used here as was used to generate the CDX files.
@@ -390,3 +388,43 @@
capacity of the computers performing the import. Something in the
1-4MB range is typical.
+--------------------------------------------------
+nutchwax.FetchedSegments.perCollection
+--------------------------------------------------
+Enable per-collection segment sub-dirs, e.g.
+
+ segments/<collectionId>/segment1
+ /segment2
+ ...
+
+Default value: false
+
+For example,
+
+ <property>
+ <name>nutchwax.FetchedSegments.perCollection</name>
+ <value>true</value>
+ </property>
+
+--------------------------------------------------
+nutchwax.import.content.store
+--------------------------------------------------
+Whether or not we store the full content in the segment's "content"
+directory. Most NutchWAX users are also using Wayback to serve the
+archived content, so there's no need for NutchWAX to keep a "cached"
+copy as well.
+
+Setting to 'true' yields the same bahavior as in previous versions of
+NutchWAX, and as in Nutch. The content is stored in the segment's
+"content" directory.
+
+Setting to 'false' results in an empty "content" directory in the
+segment. The content is not stored.
+
+Default value is 'false'.
+
+ <property>
+ <name>nutchwax.import.store.content</name>
+ <value>false</value>
+ </property>
+
Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2009-03-08 20:44:25 UTC (rev 2692)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2009-03-08 21:43:33 UTC (rev 2693)
@@ -26,7 +26,7 @@
This HOWTO assumes it is installed in
- /opt/nutchwax-0.12.3
+ /opt/nutchwax-0.12.4
2. ARC/WARC files.
@@ -68,10 +68,10 @@
$ mkdir crawl
$ cd crawl
- $ /opt/nutchwax-0.12.3/bin/nutchwax import ../manifest
- $ /opt/nutchwax-0.12.3/bin/nutch updatedb crawldb -dir segments
- $ /opt/nutchwax-0.12.3/bin/nutch invertlinks linkdb -dir segments
- $ /opt/nutchwax-0.12.3/bin/nutch index indexes crawldb linkdb segments/*
+ $ /opt/nutchwax-0.12.4/bin/nutchwax import ../manifest
+ $ /opt/nutchwax-0.12.4/bin/nutch updatedb crawldb -dir segments
+ $ /opt/nutchwax-0.12.4/bin/nutch invertlinks linkdb -dir segments
+ $ /opt/nutchwax-0.12.4/bin/nutch index indexes crawldb linkdb segments/*
$ ls -F1
crawldb/
indexes/
@@ -96,7 +96,7 @@
$ cd ../
$ ls -F1
crawl/
- $ /opt/nutchwax-0.12.3/bin/nutch org.apache.nutch.searcher.NutchBean computer
+ $ /opt/nutchwax-0.12.4/bin/nutch org.apache.nutch.searcher.NutchBean computer
This calls the NutchBean to execute a simple keyword search for
"computer". Use whatever query term you think appears in the
@@ -109,7 +109,7 @@
The Nutch(WAX) web application is bundled with NutchWAX as
- /opt/nutchwax-0.12.3/nutch-1.0-dev.war
+ /opt/nutchwax-0.12.4/nutch-1.0-dev.war
Simply deploy that web application in the same fashion as with
Nutch.
Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2009-03-08 20:44:25 UTC (rev 2692)
+++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2009-03-08 21:43:33 UTC (rev 2693)
@@ -1,6 +1,6 @@
INSTALL.txt
-2008-12-18
+2009-03-08
Aaron Binns
Table of Contents
@@ -10,6 +10,7 @@
- SVN: NutchWAX
- Build and Install
o Install binary package
+ o Install start-up scripts
======================================================================
@@ -62,7 +63,7 @@
------------------
As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
Nutch doesn't have a 1.0 release package yet, so we have to use the
-Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.3 is
+Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.4 is
built against is:
701524
@@ -78,14 +79,14 @@
SVN: NutchWAX
-------------
-Once you have Nutch-1.0-dev checked-out, check-out NutchWAX into
-Nutch's "contrib" directory.
+Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.4
+source into Nutch's "contrib" directory.
$ cd contrib
- $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/nutchwax/archive
+ $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_4/archive
This will create a sub-directory named "archive" containing the
-NutchWAX sources.
+NutchWAX 0.12.4 sources.
Build and install
-----------------
@@ -112,7 +113,7 @@
$ cd /opt
$ tar xvfz nutch-1.0-dev.tar.gz
- $ mv nutch-1.0-dev nutchwax-0.12.3
+ $ mv nutch-1.0-dev nutchwax-0.12.4
======================================================================
@@ -125,5 +126,50 @@
Install it simply by untarring it, for example:
$ cd /opt
- $ tar xvfz nutchwax-0.12.3.tar.gz
+ $ tar xvfz nutchwax-0.12.4.tar.gz
+
+======================================================================
+Install start-up scripts
+======================================================================
+
+NutchWAX 0.12.4 comes with a Unix init.d script which can be used to
+automatically start the searcher slaves for a multi-node search
+configuration.
+
+Assuming you installed NutchWAX as
+
+ /opt/nutchwax-0.12.4
+
+the script is found at
+
+ /opt/nutchwax-0.12.4/contrib/archive/etc/init.d/searcher-slave
+
+This script can be placed in /etc/init.d then added to the list of
+startup scripts to run at bootup by using commands appropriate to your
+Linux distribution.
+
+You must edit a few of the environment variables defined in the
+'searcher-slave' specifying where NutchWAX is installed and where the
+index(s) are deployed. In 'searcher-slave' you will find the:
+
+ export NUTCH_HOME=TODO
+ export DEPLOYMENT_DIR=TODO
+
+edit those appropriately for your system.
+
+
+The "master" in the multi-node search deployment is the NutchWAX
+webapp running in a webapp server, such as Tomcat or Jetty.
+
+Jetty comes with a start/stop script appropriate for use as an init.d
+script, similar to the 'searcher-slave' script described above. If you
+use Jetty, create a symlink
+
+ /etc/init.d/jetty.sh -> /opt/jetty/bin/jetty.sh
+
+Then add this script to the list of startup scripts to run at bootup
+by using commands appropriate to your Linux distribution.
+
+Follow the instructions from Jetty on the deployment of the NutchWAX
+webapp (nutch-1.0-dev.war) in the Jetty web application server.
Modified: trunk/archive-access/projects/nutchwax/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README.txt 2009-03-08 20:44:25 UTC (rev 2692)
+++ trunk/archive-access/projects/nutchwax/archive/README.txt 2009-03-08 21:43:33 UTC (rev 2693)
@@ -1,6 +1,6 @@
README.txt
-2008-12-18
+2008-03-08
Aaron Binns
Table of Contents
@@ -13,7 +13,7 @@
Introduction
======================================================================
-Welcome to NutchWAX 0.12.3!
+Welcome to NutchWAX 0.12.4!
NutchWAX is a set of add-ons to Nutch in order to index and search
archived web data.
Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2009-03-08 20:44:25 UTC (rev 2692)
+++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2009-03-08 21:43:33 UTC (rev 2693)
@@ -1,9 +1,9 @@
RELEASE-NOTES.TXT
-2008-12-18
+2008-03-08
Aaron Binns
-Release notes for NutchWAX 0.12.3
+Release notes for NutchWAX 0.12.4
For the most recent updates and information on NutchWAX,
please visit the project wiki at:
@@ -15,61 +15,44 @@
Overview
======================================================================
-NutchWAX 0.12.3 contains numerous enhancements and fixes to 0.12.2
+NutchWAX 0.12.4 contains numerous enhancements and fixes to 0.12.3
- o PageRank calculation and scoring
- o Enhanced OpenSearchServlet
- o Improved XSLT sample for OpenSearch
- o System init.d script for searcher slaves
- o Enhanced searcher slave which supports NutchWAX extensions
+ o Option to omit storing of content during import.
+ o Support for per-collection segments in master/slave config.
+ o Additional diagnostic/log messages to help troubleshoot common
+ deployment mistakes.
+ o PageRankDb similar to LinkDb but only keeping inlink counts.
+ o Improved paging through results, handling "paging past the end".
-One of the major changes to 0.12.3 is not a feature, enhancement or
-bug-fix, but the way the NutchWAX source is "integrated" into the
-Nutch source.
+======================================================================
+Issues
+======================================================================
-Yes, the NutchWAX source is still kept in the contrib/archive
-sub-directory, but when you invoke a build command from the
-NutchWAX directory, such as
+For an up-to-date list of NutchWAX issues:
- $ cd nutch/contrib/archive
- $ ant tar
+ http://webteam.archive.org/jira/browse/WAX
-Many files from the NutchWAX source tree are copied directly into the
-Nutch source tree before the build process begins.
+Issues resolved in this release:
-The reason for this is to make NutchWAX easier to use.
+WAX-27 Sensible output for requesting page of results past the end.
-In previous versions of NutchWAX, once 'ant' build command was
-finished, the operator had to manually patch configuration files in
-the Nutch directory. Upon a subsequent build, the files would be
-over-written by Nutch's and would have to be patched again.
+WAX-34 Add option to omit storing of content in segment
-It was a major hassle and complication.
+WAX-35 Add pagerankdb similar to linkdb but which only keeps counts
+ rather than actual inlinks.
-Another impetus for copying files into the Nutch source was to patch
-bugs and make enhancements in the Nutch Java code which couldn't be
-effectively done keeping the sources separate. When an 'ant' build
-command is run a few Java files are copied from the NutchWAX source
-tree into the Nutch source tree.
+WAX-36 Some additional diagnostics on connecting results to segments
+ and snippets would be very helpful.
-In release 0.12.3, the NutchWAX build file: 'build.xml' handles all of
-this. Simply execute your build commands from 'contrib/archive' as
-instructed in the HOWTO and no longer worry about patching
-configuration files. If you wish to alter the NutchWAX configuration
-file, make those changes in the NutchWAX source tree.
+WAX-37 Per-collection segments not supported in distributed
+ master-slave configuration.
+WAX-38 Build omits neessary libraries from .job file.
-======================================================================
-Issues
-======================================================================
+WAX-39 Write more efficient, specialized segment parse_text merging.
-For an up-to-date list of NutchWAX issues:
- http://webteam.archive.org/jira/browse/WAX
-Issues resolved in this release:
-WAX-26
- Add XML elements containing all search URL params for self-link
- generation
+
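One inconsistency in the BUILD-NOTES hunk above is worth flagging: the section heading reads "nutchwax.import.content.store" while the example property (and r2689's nutch-site.xml, later in this digest) uses "nutchwax.import.store.content". A minimal sketch of both import-time switches at their documented defaults (the wiring is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class ImportSwitches {
        public static void main(String[] args) {
            Configuration conf = NutchConfiguration.create();
            // Skip storing cached content in the segment (Wayback
            // usually serves the archived content instead).
            conf.setBoolean("nutchwax.import.store.content", false);
            // Per-collection segment sub-dirs, e.g.
            // segments/<collectionId>/segment1
            conf.setBoolean("nutchwax.FetchedSegments.perCollection", false);
        }
    }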
From: <bi...@us...> - 2009-03-08 20:44:50
Revision: 2692
http://archive-access.svn.sourceforge.net/archive-access/?rev=2692&view=rev
Author: binzino
Date: 2009-03-08 20:44:25 +0000 (Sun, 08 Mar 2009)
Log Message:
-----------
First cut. Works, but isn't the prettiest code I've ever written.
Added Paths:
-----------
trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/ParseTextCombiner.java
Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/ParseTextCombiner.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/ParseTextCombiner.java (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/ParseTextCombiner.java 2009-03-08 20:44:25 UTC (rev 2692)
@@ -0,0 +1,216 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.archive.nutchwax.tools;
+
+import java.io.*;
+import java.util.*;
+
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+
+import org.apache.hadoop.io.*;
+import org.apache.hadoop.fs.*;
+import org.apache.hadoop.mapred.FileAlreadyExistsException;
+import org.apache.hadoop.util.*;
+import org.apache.hadoop.conf.*;
+import org.apache.hadoop.util.ReflectionUtils;
+
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.util.HadoopFSUtil;
+import org.apache.nutch.util.LogUtil;
+import org.apache.nutch.util.NutchConfiguration;
+
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.index.IndexWriter;
+
+/**
+ * <p>This is a one-off/hack to (hopefully) efficiently combine
+ * multiple "parse_text/part-nnnnn" map files into a single map file.
+ * Using the Nutch 'mergesegs' takes far too long in practice, and
+ * often fails to complete due to memory constraints.
+ * </p>
+ * <p>This class takes advantage of the fact that the
+ * "parse_text/part-nnnnn" directories are Hadoop MapFiles. To merge
+ * them, all we have to do is read key/value pairs from each one and
+ * write them back out in sorted order.
+ * </p>
+ */
+public class ParseTextCombiner extends Configured implements Tool
+{
+ public static final Log LOG = LogFactory.getLog(ParseTextCombiner.class);
+
+ private boolean verbose = false;
+
+ public ParseTextCombiner()
+ {
+
+ }
+
+ public ParseTextCombiner(Configuration conf)
+ {
+ setConf(conf);
+ }
+
+ /**
+ * Create an index for the input files in the named directory.
+ */
+ public static void main(String[] args)
+ throws Exception
+ {
+ int res = ToolRunner.run(NutchConfiguration.create(), new ParseTextCombiner(), args);
+ System.exit(res);
+ }
+
+ /**
+ *
+ */
+ public int run(String[] args)
+ throws Exception
+ {
+ String usage = "Usage: ParseTextCombiner [-v] output input...\n";
+
+ if ( args.length < 1 )
+ {
+ System.err.println( "Usage: " + usage );
+ return 1;
+ }
+
+ if ( args[0].equals( "-h" ) )
+ {
+ System.err.println( "Usage: " + usage );
+ return 1;
+ }
+
+ int argStart = 0;
+ if ( args[argStart].equals( "-v" ) )
+ {
+ verbose = true;
+ argStart = 1;
+ }
+
+ if ( args.length - argStart < 2 )
+ {
+ System.err.println( "Usage: " + usage );
+ return 1;
+ }
+
+ Configuration conf = getConf( );
+ FileSystem fs = FileSystem.get( conf );
+
+ Path outputPath = new Path( args[argStart] );
+ if ( fs.exists( outputPath ) )
+ {
+ System.err.println( "ERROR: output already exists: " + outputPath );
+ return -1;
+ }
+
+ MapFile.Reader[] readers = new MapFile.Reader[args.length - argStart - 1];
+ for ( int pos = argStart + 1 ; pos < args.length ; pos++ )
+ {
+ readers[pos - argStart - 1] = new MapFile.Reader( fs, args[pos], conf );
+ }
+
+ WritableComparable[] keys = new WritableComparable[readers.length];
+ Writable[] values = new Writable [readers.length];
+
+ WritableComparator wc = WritableComparator.get( readers[0].getKeyClass() );
+
+ MapFile.Writer writer = new MapFile.Writer( conf, fs, outputPath.toString(), readers[0].getKeyClass(), readers[0].getValueClass( ) );
+
+ int readCount = 0;
+ int writeCount = 0;
+
+ for ( int i = 0 ; i < readers.length ; i++ )
+ {
+ WritableComparable key = (WritableComparable) ReflectionUtils.newInstance( readers[i].getKeyClass(), conf );
+ Writable value = (Writable) ReflectionUtils.newInstance( readers[i].getValueClass(), conf );
+
+ if ( readers[i].next( key, value ) )
+ {
+ keys [i] = key;
+ values[i] = value;
+
+ readCount++;
+ if ( verbose ) System.out.println( "read: " + i + ": " + key );
+ }
+ else
+ {
+ // Not even one key/value pair in the map.
+ System.out.println( "WARN: No key/value pairs in mapfile: " + args[i+argStart+1] );
+ try { readers[i].close(); } catch ( IOException ioe ) { /* Don't care */ }
+ readers[i] = null;
+ }
+ }
+
+ while ( true )
+ {
+ int candidate = -1;
+
+ for ( int i = 0 ; i < keys.length ; i++ )
+ {
+ if ( keys[i] == null ) continue ;
+
+ if ( candidate < 0 )
+ {
+ candidate = i;
+ }
+ else if ( wc.compare( keys[i], keys[candidate] ) < 0 )
+ {
+ candidate = i;
+ }
+ }
+
+ if ( candidate < 0 )
+ {
+ if ( verbose ) System.out.println( "Candidate < 0, all done." );
+ break ;
+ }
+
+ // Candidate is the index of the "smallest" key.
+
+ // Write it out.
+ writer.append( keys[candidate], values[candidate] );
+ writeCount++;
+ if ( verbose ) System.out.println( "write: " + candidate + ": " + keys[candidate] );
+
+ // Now read in a new value from the corresponding reader.
+ if ( ! readers[candidate].next( keys[candidate], values[candidate] ) )
+ {
+ if ( verbose ) System.out.println( "No more key/value pairs in (" + candidate + "): " + args[candidate+argStart+1] );
+
+ // No more key/value pairs left in this reader.
+ try { readers[candidate].close(); } catch ( IOException ioe ) { /* Don't care */ }
+ readers[candidate] = null;
+ keys [candidate] = null;
+ values [candidate] = null;
+ }
+ else
+ {
+ readCount++;
+ if ( verbose ) System.out.println( "read: " + candidate + ": " + keys[candidate] );
+ }
+ }
+
+ System.out.println( "Total # records in : " + readCount );
+ System.out.println( "Total # records out: " + writeCount );
+
+ writer.close();
+
+ return 0;
+ }
+}
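Per the usage string above ("ParseTextCombiner [-v] output input...") and the parsetextmerger command added to bin/nutchwax in r2694, earlier in this digest, an invocation might look like this (paths are hypothetical):

    $ /opt/nutchwax-0.12.4/bin/nutchwax parsetextmerger \
          crawl/segments/20090308000000/parse_text-merged \
          crawl/segments/20090308000000/parse_text/part-00000 \
          crawl/segments/20090308000000/parse_text/part-00001

Because each part-nnnnn is an already-sorted Hadoop MapFile, the tool only has to stream a k-way merge of the readers' key/value pairs, so memory use stays flat regardless of input size.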
From: <bi...@us...> - 2009-03-08 02:54:26
Revision: 2691
http://archive-access.svn.sourceforge.net/archive-access/?rev=2691&view=rev
Author: binzino
Date: 2009-03-08 02:54:12 +0000 (Sun, 08 Mar 2009)
Log Message:
-----------
Added info on start/stop scripts to INSTALL.txt and also clarified the
parts of searcher-slave that need post-installation edits by the
administrator.
Modified Paths:
--------------
tags/nutchwax-0_12_3/archive/INSTALL.txt
tags/nutchwax-0_12_3/archive/src/etc/init.d/searcher-slave
Modified: tags/nutchwax-0_12_3/archive/INSTALL.txt
===================================================================
--- tags/nutchwax-0_12_3/archive/INSTALL.txt 2009-03-04 04:35:06 UTC (rev 2690)
+++ tags/nutchwax-0_12_3/archive/INSTALL.txt 2009-03-08 02:54:12 UTC (rev 2691)
@@ -1,6 +1,6 @@
INSTALL.txt
-2008-12-18
+2009-03-06
Aaron Binns
Table of Contents
@@ -127,3 +127,48 @@
$ cd /opt
$ tar xvfz nutchwax-0.12.3.tar.gz
+
+======================================================================
+Install start-up scripts
+======================================================================
+
+NutchWAX 0.12.3 comes with a Unix init.d script which can be used to
+automatically start the searcher slaves for a multi-node search
+configuration.
+
+Assuming you installed NutchWAX as
+
+ /opt/nutchwax-0.12.3
+
+the script is found at
+
+ /opt/nutchwax-0.12.3/contrib/archive/etc/init.d/searcher-slave
+
+This script can be placed in /etc/init.d then added to the list of
+startup scripts to run at bootup by using commands appropriate to your
+Linux distribution.
+
+You must edit a few of the environment variables defined in the
+'searcher-slave' specifying where NutchWAX is installed and where the
+index(s) are deployed. In 'searcher-slave' you will find the:
+
+ export NUTCH_HOME=TODO
+ export DEPLOYMENT_DIR=TODO
+
+edit those appropriately for your system.
+
+
+The "master" in the multi-node search deployment is the NutchWAX
+webapp running in a webapp server, such as Tomcat or Jetty.
+
+Jetty comes with a start/stop script appropriate for use as an init.d
+script, similar to the 'searcher-slave' script described above. If you
+use Jetty, create a symlink
+
+ /etc/init.d/jetty.sh -> /opt/jetty/bin/jetty.sh
+
+Then add this script to the list of startup scripts to run at bootup
+by using commands appropriate to your Linux distribution.
+
+Follow the instructions from Jetty on the deployment of the NutchWAX
+webapp (nutch-1.0-dev.war) in the Jetty web application server.
Modified: tags/nutchwax-0_12_3/archive/src/etc/init.d/searcher-slave
===================================================================
--- tags/nutchwax-0_12_3/archive/src/etc/init.d/searcher-slave 2009-03-04 04:35:06 UTC (rev 2690)
+++ tags/nutchwax-0_12_3/archive/src/etc/init.d/searcher-slave 2009-03-08 02:54:12 UTC (rev 2691)
@@ -10,10 +10,11 @@
DESC="NutchWAX searcher slave"
NAME="searcher-slave"
-DAEMON="/3/search/nutchwax-0.12.2/bin/nutch org.archive.nutchwax.DistributedSearch\$Server 9000 /3/search/deploy"
-NUTCH_HOME=/3/search/nutchwax-0.12.2
-JAVA_HOME=/usr
+export NUTCH_HOME=TODO
+export DEPLOYMENT_DIR=TODO
+export JAVA_HOME=/usr
export NUTCH_HEAPSIZE=2500
+DAEMON="${NUTCH_HOME}/bin/nutch org.archive.nutchwax.DistributedSearch\$Server 9000 ${DEPLOYMENT_DIR}"
PIDFILE=/var/run/$NAME.pid
SCRIPTNAME=/etc/init.d/$NAME
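A hypothetical post-installation edit of the two TODO variables (values are examples only; DAEMON then resolves against them automatically):

    export NUTCH_HOME=/opt/nutchwax-0.12.3
    export DEPLOYMENT_DIR=/search/deploy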
From: <bi...@us...> - 2009-03-04 04:35:07
Revision: 2690
http://archive-access.svn.sourceforge.net/archive-access/?rev=2690&view=rev
Author: binzino
Date: 2009-03-04 04:35:06 +0000 (Wed, 04 Mar 2009)
Log Message:
-----------
Fix JIRA WAX-38. Added rules to "job" target to add our libraries to
the .job file.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/build.xml
Modified: trunk/archive-access/projects/nutchwax/archive/build.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/build.xml 2009-03-04 01:18:44 UTC (rev 2689)
+++ trunk/archive-access/projects/nutchwax/archive/build.xml 2009-03-04 04:35:06 UTC (rev 2690)
@@ -15,7 +15,7 @@
See the License for the specific language governing permissions and
limitations under the License.
-->
-<project name="nutchwax" default="job">
+<project name="nutchwax" default="jar">
<property name="nutch.dir" value="../../" />
@@ -23,8 +23,9 @@
<property name="lib.dir" value="lib" />
<property name="build.dir" value="${nutch.dir}/build" />
<!-- HACK: Need to import default.properties like Nutch does -->
- <property name="dist.dir" value="${build.dir}/nutch-1.0-dev" />
-
+ <property name="final.name" value="nutch-1.0-dev" />
+ <property name="dist.dir" value="${build.dir}/${final.name}" />
+
<target name="nutch-compile-core">
<!-- First, copy over Nutch source overlays -->
<exec executable="rsync">
@@ -83,6 +84,11 @@
<target name="job" depends="compile">
<ant dir="${nutch.dir}" target="job" inheritAll="false" />
+
+ <!-- Add our NutchWAX libs to the .job created by Nutch's build. -->
+ <jar jarfile="${build.dir}/${final.name}.job" update="true">
+ <zipfileset dir="lib" prefix="lib" includes="*.jar"/>
+ </jar>
</target>
<target name="war" depends="compile">
From: <bi...@us...> - 2009-03-04 01:18:45
Revision: 2689
http://archive-access.svn.sourceforge.net/archive-access/?rev=2689&view=rev
Author: binzino
Date: 2009-03-04 01:18:44 +0000 (Wed, 04 Mar 2009)
Log Message:
-----------
Added boolean configuration property nutchwax.import.store.content to
determine whether or not the Importer stores the full content in the
segment's "content" directory.
Removed a useless debug message from the end of the Import job.
Removed searcher.max.hits from nutch-site.xml as it actually causes
lots of problems with search-time site-based de-dup.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java
trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java 2009-03-03 20:34:38 UTC (rev 2688)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java 2009-03-04 01:18:44 UTC (rev 2689)
@@ -456,8 +456,12 @@
try
{
- output.collect( key, new NutchWritable( datum ) );
- output.collect( key, new NutchWritable( content ) );
+ output.collect( key, new NutchWritable( datum ) );
+
+ if ( jobConf.getBoolean( "nutchwax.import.store.content", false ) )
+ {
+ output.collect( key, new NutchWritable( content ) );
+ }
if ( parseResult != null )
{
@@ -649,9 +653,6 @@
RunningJob rj = JobClient.runJob( job );
- // Emit job id and status.
- System.out.println( "JOB_STATUS: " + rj.getID( ) + ": " + (rj.isSuccessful( ) ? "SUCCESS" : "FAIL" ) );
-
return rj.isSuccessful( ) ? 0 : 1;
}
catch ( Exception e )
Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2009-03-03 20:34:38 UTC (rev 2688)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2009-03-04 01:18:44 UTC (rev 2689)
@@ -137,6 +137,25 @@
<value>1048576</value>
</property>
+<!-- Whether or not we store the full content in the segment's
+ "content" directory. Most NutchWAX users are also using Wayback
+ to serve the archived content, so there's no need for NutchWAX to
+ keep a "cached" copy as well.
+
+ Setting to 'true' yields the same bahavior as in previous
+ versions of NutchWAX, and as in Nutch. The content is stored in
+ the segment's "content" directory.
+
+ Setting to 'false' results in an empty "content" directory in the
+ segment. The content is not stored.
+
+ Default value is 'false'.
+ -->
+<property>
+ <name>nutchwax.import.store.content</name>
+ <value>false</value>
+</property>
+
<!-- Enable per-collection segment sub-dirs, e.g.
segments/<collectionId>/segment1
/segment2
@@ -156,11 +175,6 @@
</property>
<property>
- <name>searcher.max.hits</name>
- <value>1000</value>
-</property>
-
-<property>
<name>searcher.summary.context</name>
<value>8</value>
</property>
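A minimal sketch of how the new property is resolved at run time, assuming the nutch-site.xml above is on the classpath (the class name is hypothetical; the getBoolean lookup is the same one the Importer performs):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class StoreContentCheck {
      public static void main(String[] args) {
        // NutchConfiguration.create() layers nutch-site.xml over the defaults.
        Configuration conf = NutchConfiguration.create();
        // Second argument is the fallback when the property is absent.
        boolean store = conf.getBoolean("nutchwax.import.store.content", false);
        System.out.println("Importer stores raw content: " + store);
      }
    }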
From: <bi...@us...> - 2009-03-03 20:34:43
Revision: 2688
http://archive-access.svn.sourceforge.net/archive-access/?rev=2688&view=rev
Author: binzino
Date: 2009-03-03 20:34:38 +0000 (Tue, 03 Mar 2009)
Log Message:
-----------
Re-worked the page link generation to handle last-page and
paging-off-the-end.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/search.xsl
Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/search.xsl
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/search.xsl 2009-03-03 18:20:14 UTC (rev 2687)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/search.xsl 2009-03-03 20:34:38 UTC (rev 2688)
@@ -192,37 +192,73 @@
<xsl:template name="pageLinks">
<xsl:param name="labelPrevious" />
<xsl:param name="labelNext" />
+ <xsl:variable name="startPage" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" />
+ <xsl:variable name="lastPage" select="floor(opensearch:totalResults div opensearch:itemsPerPage) + 1" />
<!-- If we are on any page past the first, emit a "previous" link -->
- <xsl:if test="(floor(opensearch:startIndex div opensearch:itemsPerPage) + 1) != 1">
+ <xsl:if test="$startPage != 1">
<xsl:call-template name="pageLink">
- <xsl:with-param name="pageNum" select="floor(opensearch:startIndex div opensearch:itemsPerPage)" />
+ <xsl:with-param name="pageNum" select="$startPage - 1" />
<xsl:with-param name="linkText" select="$labelPrevious" />
</xsl:call-template>
<xsl:text> </xsl:text>
</xsl:if>
<!-- Now, emit numbered page links -->
<xsl:choose>
- <xsl:when test="(floor(opensearch:startIndex div opensearch:itemsPerPage) + 1) < 11">
- <xsl:call-template name="numberedPageLinks" >
- <xsl:with-param name="begin" select="1" />
- <xsl:with-param name="end" select="21" />
- <xsl:with-param name="current" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" />
- </xsl:call-template>
+ <!-- We are on pages 1-10. Emit links -->
+ <xsl:when test="$startPage < 11">
+ <xsl:choose>
+ <xsl:when test="$lastPage < 21">
+ <xsl:call-template name="numberedPageLinks" >
+ <xsl:with-param name="begin" select="1" />
+ <xsl:with-param name="end" select="$lastPage + 1" />
+ <xsl:with-param name="current" select="$startPage" />
+ </xsl:call-template>
+ </xsl:when>
+ <xsl:otherwise>
+ <xsl:call-template name="numberedPageLinks" >
+ <xsl:with-param name="begin" select="1" />
+ <xsl:with-param name="end" select="21" />
+ <xsl:with-param name="current" select="$startPage" />
+ </xsl:call-template>
+ </xsl:otherwise>
+ </xsl:choose>
</xsl:when>
+ <!-- We are past page 10, but not to the last page yet. Emit links for 10 pages before and 10 pages after -->
+ <xsl:when test="$startPage < $lastPage">
+ <xsl:choose>
+ <xsl:when test="$lastPage < ($startPage + 11)">
+ <xsl:call-template name="numberedPageLinks" >
+ <xsl:with-param name="begin" select="$startPage - 10" />
+ <xsl:with-param name="end" select="$lastPage + 1" />
+ <xsl:with-param name="current" select="$startPage" />
+ </xsl:call-template>
+ </xsl:when>
+ <xsl:otherwise>
+ <xsl:call-template name="numberedPageLinks" >
+ <xsl:with-param name="begin" select="$startPage - 10" />
+ <xsl:with-param name="end" select="$startPage + 11" />
+ <xsl:with-param name="current" select="$startPage" />
+ </xsl:call-template>
+ </xsl:otherwise>
+ </xsl:choose>
+ </xsl:when>
+ <!-- This covers the case where we are on (or past) the last page -->
<xsl:otherwise>
<xsl:call-template name="numberedPageLinks" >
- <xsl:with-param name="begin" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1 - 10" />
- <xsl:with-param name="end" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1 + 11" />
- <xsl:with-param name="current" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" />
+ <xsl:with-param name="begin" select="$startPage - 10" />
+ <xsl:with-param name="end" select="$lastPage + 1" />
+ <xsl:with-param name="current" select="$startPage" />
</xsl:call-template>
</xsl:otherwise>
</xsl:choose>
<!-- Lastly, emit a "next" link. -->
<xsl:text> </xsl:text>
- <xsl:call-template name="pageLink">
- <xsl:with-param name="pageNum" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 2" />
- <xsl:with-param name="linkText" select="$labelNext" />
- </xsl:call-template>
+ <xsl:if test="$startPage < $lastPage">
+ <xsl:call-template name="pageLink">
+ <xsl:with-param name="pageNum" select="$startPage + 1" />
+ <xsl:with-param name="linkText" select="$labelNext" />
+ </xsl:call-template>
+ </xsl:if>
</xsl:template>
<!-- Template to emit a list of numbered links to results pages.
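The XPath arithmetic is easier to follow outside diff form. This sketch redoes the window computation in plain Java with hypothetical numbers (class name and values are illustrative only; integer division mirrors XPath's floor(a div b) for positive operands):

    public class PagingMath {
      public static void main(String[] args) {
        int startIndex = 240, itemsPerPage = 10, totalResults = 287; // hypothetical
        int startPage = startIndex / itemsPerPage + 1;   // floor(240 div 10) + 1 = 25
        int lastPage  = totalResults / itemsPerPage + 1; // floor(287 div 10) + 1 = 29
        // Past page 10 and not yet on the last page: a window of up to 10 pages
        // on either side, clipped on the right at lastPage ('end' is exclusive).
        int begin = startPage - 10;                         // 15
        int end   = Math.min(startPage + 11, lastPage + 1); // 30
        boolean emitNext = startPage < lastPage;            // true
        System.out.println(begin + ".." + (end - 1) + " next=" + emitNext);
      }
    }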
From: <bi...@us...> - 2009-03-03 18:20:21
Revision: 2687
http://archive-access.svn.sourceforge.net/archive-access/?rev=2687&view=rev
Author: binzino
Date: 2009-03-03 18:20:14 +0000 (Tue, 03 Mar 2009)
Log Message:
-----------
Fixed handling of start and end of search results so that we detect
"paging off the end" and return an empty result set rather than an
exception.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java
Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java 2009-02-28 01:26:25 UTC (rev 2686)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java 2009-03-03 18:20:14 UTC (rev 2687)
@@ -162,18 +162,30 @@
responseTime = System.nanoTime( ) - responseTime;
- // generate xml results
- int end = (int)Math.min(hits.getLength(), start + hitsPerPage);
- int length = end-start;
+ // The 'end' is usually just the end of the current page
+ // (start+hitsPerPage); but if we are on the last page
+ // of de-duped results, then the end is hits.getLength().
+ int end = Math.min( hits.getLength( ), start + hitsPerPage );
- Hit[] show = hits.getHits(start, end-start);
- HitDetails[] details = bean.getDetails(show);
- Summary[] summaries = bean.getSummary(details, query);
+ // The length is usually just (end-start), unless the start
+ // position is past the end of the results -- which is common when
+ // de-duping. The user could easily jump past the true end of the
+ // de-dup'd results. If the start is past the end, we use a
+ // length of '0' to produce an empty results page.
+ int length = Math.max( end-start, 0 );
+ // Usually, the total results is the total number of non-de-duped
+ // results. However, if we are on the last page of de-duped results,
+ // then we know our de-dup'd total is hits.getLength().
+ long totalResults = hits.getLength( ) < (start+hitsPerPage) ? hits.getLength( ) : hits.getTotal( );
+
+ Hit[] show = hits.getHits(start, length );
+ HitDetails[] details = bean.getDetails(show);
+ Summary[] summaries = bean.getSummary(details, query);
+
String requestUrl = request.getRequestURL().toString();
String base = requestUrl.substring(0, requestUrl.lastIndexOf('/'));
-
try {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
@@ -197,8 +209,8 @@
+"&hitsPerDup="+hitsPerDup
+params);
- addNode(doc, channel, "opensearch", "totalResults", ""+hits.getTotal());
- addNode(doc, channel, "opensearch", "startIndex", ""+start);
+ addNode(doc, channel, "opensearch", "totalResults", ""+totalResults);
+ addNode(doc, channel, "opensearch", "startIndex", ""+start);
addNode(doc, channel, "opensearch", "itemsPerPage", ""+hitsPerPage);
addNode(doc, channel, "nutch", "query", queryString);
Revision: 2686
http://archive-access.svn.sourceforge.net/archive-access/?rev=2686&view=rev
Author: binzino
Date: 2009-02-28 01:26:25 +0000 (Sat, 28 Feb 2009)
Log Message:
-----------
Added here with local edits to handle perCollection segments in a
distributed setup. Also added info/diagnostic messages to help
diagnose common deployment errors.
Added Paths:
-----------
trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/DistributedSearch.java
Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/DistributedSearch.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/DistributedSearch.java (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/DistributedSearch.java 2009-02-28 01:26:25 UTC (rev 2686)
@@ -0,0 +1,483 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.searcher;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.lang.reflect.Method;
+import java.net.InetSocketAddress;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.StringTokenizer;
+import java.util.TreeSet;
+
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.ipc.RPC;
+import org.apache.hadoop.ipc.VersionedProtocol;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.util.NutchConfiguration;
+
+/** Implements the search API over IPC connections. */
+public class DistributedSearch {
+ public static final Log LOG = LogFactory.getLog(DistributedSearch.class);
+
+ private DistributedSearch() {} // no public ctor
+
+ /** The distributed search protocol. */
+ public static interface Protocol
+ extends Searcher, HitDetailer, HitSummarizer, HitContent, HitInlinks, VersionedProtocol {
+
+ /** The name of the segments searched by this node. */
+ String[] getSegmentNames();
+ }
+
+ /** The search server. */
+ public static class Server {
+
+ private Server() {}
+
+ /** Runs a search server. */
+ public static void main(String[] args) throws Exception {
+ String usage = "DistributedSearch$Server <port> <index dir>";
+
+ if (args.length == 0 || args.length > 2) {
+ System.err.println(usage);
+ System.exit(-1);
+ }
+
+ int port = Integer.parseInt(args[0]);
+ Path directory = new Path(args[1]);
+
+ Configuration conf = NutchConfiguration.create();
+
+ org.apache.hadoop.ipc.Server server = getServer(conf, directory, port);
+ server.start();
+ server.join();
+ }
+
+ static org.apache.hadoop.ipc.Server getServer(Configuration conf, Path directory, int port) throws IOException{
+ NutchBean bean = new NutchBean(conf, directory);
+ int numHandlers = conf.getInt("searcher.num.handlers", 10);
+ return RPC.getServer(bean, "0.0.0.0", port, numHandlers, true, conf);
+ }
+
+ }
+
+ /** The search client. */
+ public static class Client extends Thread
+ implements Searcher, HitDetailer, HitSummarizer, HitContent, HitInlinks,
+ Runnable {
+
+ private InetSocketAddress[] defaultAddresses;
+ private boolean[] liveServer;
+ private HashMap segmentToAddress = new HashMap();
+
+ private boolean running = true;
+ private Configuration conf;
+ private boolean perCollection = false;
+
+ private Path file;
+ private long timestamp;
+ private FileSystem fs;
+
+ /** Construct a client talking to servers listed in the named file.
+ * Each line in the file lists a server hostname and port, separated by
+ * whitespace.
+ */
+ public Client(Path file, Configuration conf)
+ throws IOException {
+ this(readConfig(file, conf), conf);
+ this.file = file;
+ this.timestamp = fs.getFileStatus(file).getModificationTime();
+ }
+
+ private static InetSocketAddress[] readConfig(Path path, Configuration conf)
+ throws IOException {
+ FileSystem fs = FileSystem.get(conf);
+ BufferedReader reader =
+ new BufferedReader(new InputStreamReader(fs.open(path)));
+ try {
+ ArrayList addrs = new ArrayList();
+ String line;
+ while ((line = reader.readLine()) != null) {
+ StringTokenizer tokens = new StringTokenizer(line);
+ if (tokens.hasMoreTokens()) {
+ String host = tokens.nextToken();
+ if (tokens.hasMoreTokens()) {
+ String port = tokens.nextToken();
+ addrs.add(new InetSocketAddress(host, Integer.parseInt(port)));
+ if (LOG.isInfoEnabled()) {
+ LOG.info("Client adding server " + host + ":" + port);
+ }
+ }
+ }
+ }
+ return (InetSocketAddress[])
+ addrs.toArray(new InetSocketAddress[addrs.size()]);
+ } finally {
+ reader.close();
+ }
+ }
+
+ /** Construct a client talking to the named servers. */
+ public Client(InetSocketAddress[] addresses, Configuration conf) throws IOException {
+ this.conf = conf;
+ this.defaultAddresses = addresses;
+ this.liveServer = new boolean[addresses.length];
+ this.fs = FileSystem.get(conf);
+
+ this.perCollection = this.conf.getBoolean( "nutchwax.FetchedSegments.perCollection", false );
+
+ updateSegments();
+ setDaemon(true);
+ start();
+ }
+
+ private static final Method GET_SEGMENTS;
+ private static final Method SEARCH;
+ private static final Method DETAILS;
+ private static final Method SUMMARY;
+ static {
+ try {
+ GET_SEGMENTS = Protocol.class.getMethod
+ ("getSegmentNames", new Class[] {});
+ SEARCH = Protocol.class.getMethod
+ ("search", new Class[] { Query.class, Integer.TYPE, String.class,
+ String.class, Boolean.TYPE});
+ DETAILS = Protocol.class.getMethod
+ ("getDetails", new Class[] { Hit.class});
+ SUMMARY = Protocol.class.getMethod
+ ("getSummary", new Class[] { HitDetails.class, Query.class});
+ } catch (NoSuchMethodException e) {
+ throw new RuntimeException(e);
+ }
+ }
+
+ /**
+ * Check to see if search-servers file has been modified
+ *
+ * @throws IOException
+ */
+ public boolean isFileModified()
+ throws IOException {
+
+ if (file != null) {
+ long modTime = fs.getFileStatus(file).getModificationTime();
+ if (timestamp < modTime) {
+ this.timestamp = fs.getFileStatus(file).getModificationTime();
+ return true;
+ }
+ }
+
+ return false;
+ }
+
+ /** Updates segment names.
+ *
+ * @throws IOException
+ */
+ public void updateSegments() throws IOException {
+
+ int liveServers = 0;
+ int liveSegments = 0;
+
+ if (isFileModified()) {
+ defaultAddresses = readConfig(file, conf);
+ }
+
+ // Create new array of flags so they can all be updated at once.
+ boolean[] updatedLiveServer = new boolean[defaultAddresses.length];
+
+ // build segmentToAddress map
+ Object[][] params = new Object[defaultAddresses.length][0];
+ String[][] results =
+ (String[][])RPC.call(GET_SEGMENTS, params, defaultAddresses, this.conf);
+
+ for (int i = 0; i < results.length; i++) { // process results of call
+ InetSocketAddress addr = defaultAddresses[i];
+ String[] segments = results[i];
+ if (segments == null) {
+ updatedLiveServer[i] = false;
+ if (LOG.isWarnEnabled()) {
+ LOG.warn("Client: no segments from: " + addr);
+ }
+ continue;
+ }
+
+ for (int j = 0; j < segments.length; j++) {
+ if (LOG.isTraceEnabled()) {
+ LOG.trace("Client: segment "+segments[j]+" at "+addr);
+ }
+ segmentToAddress.put(segments[j], addr);
+ }
+
+ updatedLiveServer[i] = true;
+ liveServers++;
+ liveSegments += segments.length;
+ }
+
+ // Now update live server flags.
+ this.liveServer = updatedLiveServer;
+
+ if (LOG.isInfoEnabled()) {
+ LOG.info("STATS: "+liveServers+" servers, "+liveSegments+" segments.");
+ }
+ }
+
+ /** Return the names of segments searched. */
+ public String[] getSegmentNames() {
+ return (String[])
+ segmentToAddress.keySet().toArray(new String[segmentToAddress.size()]);
+ }
+
+ public Hits search(final Query query, final int numHits,
+ final String dedupField, final String sortField,
+ final boolean reverse) throws IOException {
+ // Get the list of live servers. It would be nice to build this
+ // list in updateSegments(), but that would create concurrency issues.
+ // We grab a local reference to the live server flags in case it
+ // is updated while we are building our list of liveAddresses.
+ boolean[] savedLiveServer = this.liveServer;
+ int numLive = 0;
+ for (int i = 0; i < savedLiveServer.length; i++) {
+ if (savedLiveServer[i])
+ numLive++;
+ }
+ InetSocketAddress[] liveAddresses = new InetSocketAddress[numLive];
+ int[] liveIndexNos = new int[numLive];
+ int k = 0;
+ for (int i = 0; i < savedLiveServer.length; i++) {
+ if (savedLiveServer[i]) {
+ liveAddresses[k] = defaultAddresses[i];
+ liveIndexNos[k] = i;
+ k++;
+ }
+ }
+
+ Object[][] params = new Object[liveAddresses.length][5];
+ for (int i = 0; i < params.length; i++) {
+ params[i][0] = query;
+ params[i][1] = new Integer(numHits);
+ params[i][2] = dedupField;
+ params[i][3] = sortField;
+ params[i][4] = Boolean.valueOf(reverse);
+ }
+ Hits[] results = (Hits[])RPC.call(SEARCH, params, liveAddresses, this.conf);
+
+ TreeSet queue; // cull top hits from results
+
+ if (sortField == null || reverse) {
+ queue = new TreeSet(new Comparator() {
+ public int compare(Object o1, Object o2) {
+ return ((Comparable)o2).compareTo(o1); // reverse natural order
+ }
+ });
+ } else {
+ queue = new TreeSet();
+ }
+
+ long totalHits = 0;
+ Comparable maxValue = null;
+ for (int i = 0; i < results.length; i++) {
+ Hits hits = results[i];
+ if (hits == null) continue;
+ totalHits += hits.getTotal();
+ for (int j = 0; j < hits.getLength(); j++) {
+ Hit h = hits.getHit(j);
+ if (maxValue == null ||
+ ((reverse || sortField == null)
+ ? h.getSortValue().compareTo(maxValue) >= 0
+ : h.getSortValue().compareTo(maxValue) <= 0)) {
+ queue.add(new Hit(liveIndexNos[i], h.getIndexDocNo(),
+ h.getSortValue(), h.getDedupValue()));
+ if (queue.size() > numHits) { // if hit queue overfull
+ queue.remove(queue.last()); // remove lowest in hit queue
+ maxValue = ((Hit)queue.last()).getSortValue(); // reset maxValue
+ }
+ }
+ }
+ }
+ return new Hits(totalHits, (Hit[])queue.toArray(new Hit[queue.size()]));
+ }
+
+ // version for hadoop-0.5.0.jar
+ public static final long versionID = 1L;
+
+ private Protocol getRemote(Hit hit) throws IOException {
+ return (Protocol)
+ RPC.getProxy(Protocol.class, versionID, defaultAddresses[hit.getIndexNo()], conf);
+ }
+
+ private Protocol getRemote(HitDetails hit) throws IOException {
+ InetSocketAddress address =
+ (InetSocketAddress)segmentToAddress.get(hit.getValue("segment"));
+ return (Protocol)RPC.getProxy(Protocol.class, versionID, address, conf);
+ }
+
+ public String getExplanation(Query query, Hit hit) throws IOException {
+ return getRemote(hit).getExplanation(query, hit);
+ }
+
+ public HitDetails getDetails(Hit hit) throws IOException {
+ return getRemote(hit).getDetails(hit);
+ }
+
+ public HitDetails[] getDetails(Hit[] hits) throws IOException {
+ InetSocketAddress[] addrs = new InetSocketAddress[hits.length];
+ Object[][] params = new Object[hits.length][1];
+ for (int i = 0; i < hits.length; i++) {
+ addrs[i] = defaultAddresses[hits[i].getIndexNo()];
+ params[i][0] = hits[i];
+ }
+ return (HitDetails[])RPC.call(DETAILS, params, addrs, conf);
+ }
+
+
+ public Summary getSummary(HitDetails hit, Query query) throws IOException {
+ return getRemote(hit).getSummary(hit, query);
+ }
+
+
+ /* DIFF: Added handling for perCollection segments. Also info
+ * messages about each hit to help diagnose typical
+ * deployment errors.
+ */
+ public Summary[] getSummary(HitDetails[] hits, Query query) throws IOException
+ {
+ try
+ {
+ InetSocketAddress[] addrs = new InetSocketAddress[hits.length];
+ Object[][] params = new Object[hits.length][2];
+ for (int i = 0; i < hits.length; i++)
+ {
+ HitDetails hit = hits[i];
+ if ( this.perCollection )
+ {
+ addrs[i] = (InetSocketAddress)segmentToAddress.get(hit.getValue("collection"));
+ LOG.info( "Hit: " + hit + " addr: " + addrs[i] + " collection:" + hit.getValue("collection") );
+ }
+ else
+ {
+ addrs[i] = (InetSocketAddress)segmentToAddress.get(hit.getValue("segment"));
+ LOG.info( "Hit: " + hit + " addr: " + addrs[i] + " segment:" + hit.getValue("segment") );
+ }
+ params[i][0] = hit;
+ params[i][1] = query;
+ }
+ return (Summary[])RPC.call(SUMMARY, params, addrs, conf);
+ }
+ catch ( Exception e )
+ {
+ LOG.warn( "Error getting summaries: ", e );
+ return new Summary[hits.length];
+ }
+ }
+
+ public byte[] getContent(HitDetails hit) throws IOException {
+ return getRemote(hit).getContent(hit);
+ }
+
+ public ParseData getParseData(HitDetails hit) throws IOException {
+ return getRemote(hit).getParseData(hit);
+ }
+
+ public ParseText getParseText(HitDetails hit) throws IOException {
+ return getRemote(hit).getParseText(hit);
+ }
+
+ public String[] getAnchors(HitDetails hit) throws IOException {
+ return getRemote(hit).getAnchors(hit);
+ }
+
+ public Inlinks getInlinks(HitDetails hit) throws IOException {
+ return getRemote(hit).getInlinks(hit);
+ }
+
+ public long getFetchDate(HitDetails hit) throws IOException {
+ return getRemote(hit).getFetchDate(hit);
+ }
+
+ public static void main(String[] args) throws Exception {
+ String usage = "DistributedSearch$Client query <host> <port> ...";
+
+ if (args.length == 0) {
+ System.err.println(usage);
+ System.exit(-1);
+ }
+
+ Query query = Query.parse(args[0], NutchConfiguration.create());
+
+ InetSocketAddress[] addresses = new InetSocketAddress[(args.length-1)/2];
+ for (int i = 0; i < (args.length-1)/2; i++) {
+ addresses[i] =
+ new InetSocketAddress(args[i*2+1], Integer.parseInt(args[i*2+2]));
+ }
+
+ Client client = new Client(addresses, NutchConfiguration.create());
+ //client.setTimeout(Integer.MAX_VALUE);
+
+ Hits hits = client.search(query, 10, null, null, false);
+ System.out.println("Total hits: " + hits.getTotal());
+ for (int i = 0; i < hits.getLength(); i++) {
+ System.out.println(" "+i+" "+ client.getDetails(hits.getHit(i)));
+ }
+
+ }
+
+ public void run() {
+ while (running){
+ try{
+ Thread.sleep(10000);
+ } catch (InterruptedException ie){
+ if (LOG.isInfoEnabled()) {
+ LOG.info("Thread sleep interrupted.");
+ }
+ }
+ try{
+ if (LOG.isInfoEnabled()) {
+ LOG.info("Querying segments from search servers...");
+ }
+ updateSegments();
+ } catch (IOException ioe) {
+ if (LOG.isWarnEnabled()) { LOG.warn("No search servers available!"); }
+ liveServer = new boolean[defaultAddresses.length];
+ }
+ }
+ }
+
+ /**
+ * Stops the watchdog thread.
+ */
+ public void close() {
+ running = false;
+ interrupt();
+ }
+
+ public boolean[] getLiveServer() {
+ return liveServer;
+ }
+ }
+}
\ No newline at end of file
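A minimal client-side usage sketch modeled on the main() method above; the host name, port, and query string are placeholders:

    package org.apache.nutch.searcher;

    import java.net.InetSocketAddress;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class SearchClientExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        InetSocketAddress[] addrs = {
          new InetSocketAddress("search1.example.org", 1234) // placeholder host/port
        };
        DistributedSearch.Client client = new DistributedSearch.Client(addrs, conf);
        Hits hits = client.search(Query.parse("archive", conf), 10, null, null, false);
        System.out.println("Total hits: " + hits.getTotal());
        for (int i = 0; i < hits.getLength(); i++) {
          System.out.println(" " + i + " " + client.getDetails(hits.getHit(i)));
        }
        client.close(); // stops the watchdog thread
      }
    }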
From: <bi...@us...> - 2009-02-28 01:23:12
Revision: 2685
http://archive-access.svn.sourceforge.net/archive-access/?rev=2685&view=rev
Author: binzino
Date: 2009-02-28 01:23:10 +0000 (Sat, 28 Feb 2009)
Log Message:
-----------
Improved error handling with better diagnostic messages to help catch
common deployment mistakes.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java
Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java 2009-02-28 01:18:32 UTC (rev 2684)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java 2009-02-28 01:23:10 UTC (rev 2685)
@@ -262,10 +262,26 @@
if (this.summarizer == null) { return new Summary(); }
+ String text = "";
Segment segment = getSegment(details);
- ParseText parseText = segment.getParseText(getUrl(details));
- String text = (parseText != null) ? parseText.getText() : "";
-
+
+ if ( segment != null )
+ {
+ try
+ {
+ ParseText parseText = segment.getParseText(getUrl(details));
+ text = (parseText != null) ? parseText.getText() : "";
+ }
+ catch ( Exception e )
+ {
+ LOG.error( "segment = " + segment.segmentDir, e );
+ }
+ }
+ else
+ {
+ LOG.warn( "No segment for: " + details );
+ }
+
return this.summarizer.getSummary(text, query);
}
@@ -330,12 +346,19 @@
String segmentName = details.getValue("segment");
Map perCollectionSegments = (Map) this.segments.get( collectionId );
+
+ if ( perCollectionSegments == null )
+ {
+ LOG.warn( "Cannot find per-collection segments for: " + collectionId );
+
+ return null;
+ }
Segment segment = (Segment) perCollectionSegments.get( segmentName );
if ( segment == null )
{
- LOG.warn( "Didn't find segment: collection=" + collectionId + " segment=" + segmentName );
+ LOG.warn( "Cannot find segment: collection=" + collectionId + " segment=" + segmentName );
}
return segment;
@@ -350,7 +373,7 @@
if ( segment == null )
{
- LOG.warn( "Didn't find segment: " + segmentName );
+ LOG.warn( "Cannot find segment: " + segmentName );
}
return segment;
From: <bi...@us...> - 2009-02-28 01:18:34
Revision: 2684
http://archive-access.svn.sourceforge.net/archive-access/?rev=2684&view=rev
Author: binzino
Date: 2009-02-28 01:18:32 +0000 (Sat, 28 Feb 2009)
Log Message:
-----------
Initial revision.
Added Paths:
-----------
trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/BuildIndex.java
Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/BuildIndex.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/BuildIndex.java (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/BuildIndex.java 2009-02-28 01:18:32 UTC (rev 2684)
@@ -0,0 +1,79 @@
+/*
+ * Copyright (C) 2008 Internet Archive.
+ *
+ * This file is part of the archive-access tools project
+ * (http://sourceforge.net/projects/archive-access).
+ *
+ * The archive-access tools are free software; you can redistribute them and/or
+ * modify them under the terms of the GNU Lesser Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or any
+ * later version.
+ *
+ * The archive-access tools are distributed in the hope that they will be
+ * useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser
+ * Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser Public License along with
+ * the archive-access tools; if not, write to the Free Software Foundation,
+ * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+package org.archive.nutchwax.tools;
+
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.analysis.WhitespaceAnalyzer;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.util.NutchConfiguration;
+
+
+/**
+ * A nice command-line hack to generate a Lucene index of N documents,
+ * each with one field set to the same value. This value is both
+ * stored and tokenized/indexed.
+ */
+public class BuildIndex extends Configured implements Tool
+{
+ public int run( String[] args ) throws Exception
+ {
+ if ( args.length < 4 )
+ {
+ System.out.println( "BuildIndex index field value count" );
+ System.exit( 0 );
+ }
+
+ String indexDir = args[0].trim();
+ String fieldKey = args[1].trim();
+ String fieldValue = args[2].trim();
+ int count = Integer.parseInt( args[3].trim() );
+
+ IndexWriter writer = new IndexWriter( indexDir, new WhitespaceAnalyzer( ), true );
+
+ for ( int i = 0 ; i < count ; i++ )
+ {
+ Document newDoc = new Document( );
+ newDoc.add( new Field( fieldKey, fieldValue, Field.Store.YES, Field.Index.TOKENIZED ) );
+
+ writer.addDocument( newDoc );
+ }
+
+ writer.close( );
+
+ return 0;
+ }
+
+ /**
+ * Runs using the Hadoop ToolRunner, which means it accepts the
+ * standard Hadoop command-line options.
+ */
+ public static void main( String args[] ) throws Exception
+ {
+ int result = ToolRunner.run( NutchConfiguration.create(), new BuildIndex(), args );
+
+ System.exit( result );
+ }
+
+}
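A hypothetical invocation would be "bin/nutch org.archive.nutchwax.tools.BuildIndex test-index site example.org 1000", matching the usage string above. A sketch for checking the resulting document count, assuming the same pre-3.0 Lucene API the tool itself uses (class and index names are illustrative):

    import org.apache.lucene.index.IndexReader;

    public class CountDocs {
      public static void main(String[] args) throws Exception {
        // Same-era Lucene API as BuildIndex above (String-path variant).
        IndexReader reader = IndexReader.open("test-index");
        System.out.println("docs: " + reader.numDocs()); // expect the 'count' argument
        reader.close();
      }
    }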
From: <bi...@us...> - 2009-02-23 03:54:54
Revision: 2683
http://archive-access.svn.sourceforge.net/archive-access/?rev=2683&view=rev
Author: binzino
Date: 2009-02-23 03:54:47 +0000 (Mon, 23 Feb 2009)
Log Message:
-----------
Added PageRank* classes to mirror the Nutch LinkDb classes but only
/count/ the inlinks, not preserve them.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java
Added Paths:
-----------
trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDb.java
trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDbFilter.java
trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDbMerger.java
Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDb.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDb.java (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDb.java 2009-02-23 03:54:47 UTC (rev 2683)
@@ -0,0 +1,366 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.archive.nutchwax;
+
+import java.io.*;
+import java.util.*;
+import java.net.*;
+
+// Commons Logging imports
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+
+import org.apache.hadoop.io.*;
+import org.apache.hadoop.fs.*;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.conf.*;
+import org.apache.hadoop.mapred.*;
+import org.apache.hadoop.util.*;
+
+import org.apache.nutch.crawl.LinkDbFilter;
+import org.apache.nutch.net.URLFilters;
+import org.apache.nutch.net.URLNormalizers;
+import org.apache.nutch.parse.*;
+import org.apache.nutch.util.HadoopFSUtil;
+import org.apache.nutch.util.LockUtil;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.NutchJob;
+
+/**
+ * <p>Maintains an inverted link map, listing incoming links for each
+ * url.</p>
+ * <p>Aaron Binns @ archive.org: see comments in PageRankDbMerger.</p>
+*/
+public class PageRankDb extends Configured
+ implements Tool, Mapper<Text, ParseData, Text, IntWritable>
+{
+ public static final Log LOG = LogFactory.getLog(PageRankDb.class);
+
+ public static final String CURRENT_NAME = "current";
+ public static final String LOCK_NAME = ".locked";
+
+ private int maxAnchorLength;
+ private boolean ignoreInternalLinks;
+ private URLFilters urlFilters;
+ private URLNormalizers urlNormalizers;
+
+ public PageRankDb( )
+ {
+ }
+
+ public PageRankDb( Configuration conf )
+ {
+ setConf(conf);
+ }
+
+ public void configure( JobConf job )
+ {
+ ignoreInternalLinks = job.getBoolean("db.ignore.internal.links", true);
+ if (job.getBoolean(LinkDbFilter.URL_FILTERING, false))
+ {
+ urlFilters = new URLFilters(job);
+ }
+ if (job.getBoolean(LinkDbFilter.URL_NORMALIZING, false))
+ {
+ urlNormalizers = new URLNormalizers(job, URLNormalizers.SCOPE_LINKDB);
+ }
+ }
+
+ public void close( )
+ {
+ }
+
+ public void map( Text key, ParseData parseData, OutputCollector<Text, IntWritable> output, Reporter reporter )
+ throws IOException
+ {
+ String fromUrl = key.toString();
+ String fromHost = getHost(fromUrl);
+
+ if (urlNormalizers != null)
+ {
+ try
+ {
+ fromUrl = urlNormalizers.normalize(fromUrl, URLNormalizers.SCOPE_LINKDB); // normalize the url
+ }
+ catch (Exception e)
+ {
+ LOG.warn("Skipping " + fromUrl + ":" + e);
+ fromUrl = null;
+ }
+ }
+ if (fromUrl != null && urlFilters != null)
+ {
+ try
+ {
+ fromUrl = urlFilters.filter(fromUrl); // filter the url
+ }
+ catch (Exception e)
+ {
+ LOG.warn("Skipping " + fromUrl + ":" + e);
+ fromUrl = null;
+ }
+ }
+ if (fromUrl == null) return;
+
+ Outlink[] outlinks = parseData.getOutlinks();
+
+ for (int i = 0; i < outlinks.length; i++)
+ {
+ Outlink outlink = outlinks[i];
+ String toUrl = outlink.getToUrl();
+
+ if (ignoreInternalLinks)
+ {
+ String toHost = getHost(toUrl);
+ if (toHost == null || toHost.equals(fromHost))
+ { // internal link
+ continue; // skip it
+ }
+ }
+ if (urlNormalizers != null)
+ {
+ try
+ {
+ toUrl = urlNormalizers.normalize(toUrl, URLNormalizers.SCOPE_LINKDB); // normalize the url
+ }
+ catch (Exception e)
+ {
+ LOG.warn("Skipping " + toUrl + ":" + e);
+ toUrl = null;
+ }
+ }
+ if (toUrl != null && urlFilters != null)
+ {
+ try
+ {
+ toUrl = urlFilters.filter(toUrl); // filter the url
+ }
+ catch (Exception e)
+ {
+ LOG.warn("Skipping " + toUrl + ":" + e);
+ toUrl = null;
+ }
+ }
+
+ if (toUrl == null) continue;
+
+ // DIFF: We just emit a count of '1' for the toUrl, rather than
+ // the list of inlinks as in LinkDb.
+ output.collect( new Text(toUrl), new IntWritable( 1 ) );
+ }
+ }
+
+ private String getHost(String url)
+ {
+ try
+ {
+ return new URL(url).getHost().toLowerCase();
+ }
+ catch (MalformedURLException e)
+ {
+ return null;
+ }
+ }
+
+ public void invert(Path pageRankDb, final Path segmentsDir, boolean normalize, boolean filter, boolean force) throws IOException
+ {
+ final FileSystem fs = FileSystem.get(getConf());
+ FileStatus[] files = fs.listStatus(segmentsDir, HadoopFSUtil.getPassDirectoriesFilter(fs));
+ invert(pageRankDb, HadoopFSUtil.getPaths(files), normalize, filter, force);
+ }
+
+ public void invert(Path pageRankDb, Path[] segments, boolean normalize, boolean filter, boolean force) throws IOException
+ {
+
+ Path lock = new Path(pageRankDb, LOCK_NAME);
+ FileSystem fs = FileSystem.get(getConf());
+ LockUtil.createLockFile(fs, lock, force);
+ Path currentPageRankDb = new Path(pageRankDb, CURRENT_NAME);
+ if (LOG.isInfoEnabled())
+ {
+ LOG.info("PageRankDb: starting");
+ LOG.info("PageRankDb: pageRankDb: " + pageRankDb);
+ LOG.info("PageRankDb: URL normalize: " + normalize);
+ LOG.info("PageRankDb: URL filter: " + filter);
+ }
+ JobConf job = PageRankDb.createJob(getConf(), pageRankDb, normalize, filter);
+ for (int i = 0; i < segments.length; i++)
+ {
+ if (LOG.isInfoEnabled())
+ {
+ LOG.info("PageRankDb: adding segment: " + segments[i]);
+ }
+ FileInputFormat.addInputPath(job, new Path(segments[i], ParseData.DIR_NAME));
+ }
+ try
+ {
+ JobClient.runJob(job);
+ }
+ catch (IOException e)
+ {
+ LockUtil.removeLockFile(fs, lock);
+ throw e;
+ }
+ if (fs.exists(currentPageRankDb))
+ {
+ if (LOG.isInfoEnabled())
+ {
+ LOG.info("PageRankDb: merging with existing pageRankDb: " + pageRankDb);
+ }
+ // try to merge
+ Path newPageRankDb = FileOutputFormat.getOutputPath(job);
+ job = PageRankDbMerger.createMergeJob(getConf(), pageRankDb, normalize, filter);
+ FileInputFormat.addInputPath(job, currentPageRankDb);
+ FileInputFormat.addInputPath(job, newPageRankDb);
+ try
+ {
+ JobClient.runJob(job);
+ }
+ catch (IOException e)
+ {
+ LockUtil.removeLockFile(fs, lock);
+ fs.delete(newPageRankDb, true);
+ throw e;
+ }
+ fs.delete(newPageRankDb, true);
+ }
+ PageRankDb.install(job, pageRankDb);
+ if (LOG.isInfoEnabled())
+ { LOG.info("PageRankDb: done"); }
+ }
+
+ private static JobConf createJob(Configuration config, Path pageRankDb, boolean normalize, boolean filter)
+ {
+ Path newPageRankDb = new Path("pagerankdb-" + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
+
+ JobConf job = new NutchJob(config);
+ job.setJobName("pagerankdb " + pageRankDb);
+
+ job.setInputFormat(SequenceFileInputFormat.class);
+
+ job.setMapperClass(PageRankDb.class);
+ job.setCombinerClass(PageRankDbMerger.class);
+ // if we don't run the mergeJob, perform normalization/filtering now
+ if (normalize || filter)
+ {
+ try
+ {
+ FileSystem fs = FileSystem.get(config);
+ if (!fs.exists(pageRankDb))
+ {
+ job.setBoolean(LinkDbFilter.URL_FILTERING, filter);
+ job.setBoolean(LinkDbFilter.URL_NORMALIZING, normalize);
+ }
+ }
+ catch (Exception e)
+ {
+ LOG.warn("PageRankDb createJob: " + e);
+ }
+ }
+ job.setReducerClass(PageRankDbMerger.class);
+
+ FileOutputFormat.setOutputPath(job, newPageRankDb);
+ job.setOutputFormat(MapFileOutputFormat.class);
+ job.setBoolean("mapred.output.compress", false);
+ job.setOutputKeyClass(Text.class);
+
+ // DIFF: Use IntWritable instead of Inlinks as the output value type.
+ job.setOutputValueClass(IntWritable.class);
+
+ return job;
+ }
+
+ public static void install(JobConf job, Path pageRankDb) throws IOException
+ {
+ Path newPageRankDb = FileOutputFormat.getOutputPath(job);
+ FileSystem fs = new JobClient(job).getFs();
+ Path old = new Path(pageRankDb, "old");
+ Path current = new Path(pageRankDb, CURRENT_NAME);
+ if (fs.exists(current))
+ {
+ if (fs.exists(old)) fs.delete(old, true);
+ fs.rename(current, old);
+ }
+ fs.mkdirs(pageRankDb);
+ fs.rename(newPageRankDb, current);
+ if (fs.exists(old)) fs.delete(old, true);
+ LockUtil.removeLockFile(fs, new Path(pageRankDb, LOCK_NAME));
+ }
+
+ public static void main(String[] args) throws Exception
+ {
+ int res = ToolRunner.run(NutchConfiguration.create(), new PageRankDb(), args);
+ System.exit(res);
+ }
+
+ public int run(String[] args) throws Exception
+ {
+ if (args.length < 2)
+ {
+ System.err.println("Usage: PageRankDb <pagerankdb> (-dir <segmentsDir> | <seg1> <seg2> ...) [-force] [-noNormalize] [-noFilter]");
+ System.err.println("\tpagerankdb\toutput PageRankDb to create or update");
+ System.err.println("\t-dir segmentsDir\tparent directory of several segments, OR");
+ System.err.println("\tseg1 seg2 ...\t list of segment directories");
+ System.err.println("\t-force\tforce update even if PageRankDb appears to be locked (CAUTION advised)");
+ System.err.println("\t-noNormalize\tdon't normalize link URLs");
+ System.err.println("\t-noFilter\tdon't apply URLFilters to link URLs");
+ return -1;
+ }
+ Path segDir = null;
+ final FileSystem fs = FileSystem.get(getConf());
+ Path db = new Path(args[0]);
+ ArrayList<Path> segs = new ArrayList<Path>();
+ boolean filter = true;
+ boolean normalize = true;
+ boolean force = false;
+ for (int i = 1; i < args.length; i++)
+ {
+ if (args[i].equals("-dir"))
+ {
+ segDir = new Path(args[++i]);
+ FileStatus[] files = fs.listStatus(segDir, HadoopFSUtil.getPassDirectoriesFilter(fs));
+ if (files != null) segs.addAll(Arrays.asList(HadoopFSUtil.getPaths(files)));
+ break;
+ }
+ else if (args[i].equalsIgnoreCase("-noNormalize"))
+ {
+ normalize = false;
+ }
+ else if (args[i].equalsIgnoreCase("-noFilter"))
+ {
+ filter = false;
+ }
+ else if (args[i].equalsIgnoreCase("-force"))
+ {
+ force = true;
+ }
+ else segs.add(new Path(args[i]));
+ }
+ try
+ {
+ invert(db, segs.toArray(new Path[segs.size()]), normalize, filter, force);
+ return 0;
+ }
+ catch (Exception e)
+ {
+ LOG.fatal("PageRankDb: " + StringUtils.stringifyException(e));
+ return -1;
+ }
+ }
+
+}
Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDbFilter.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDbFilter.java (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDbFilter.java 2009-02-23 03:54:47 UTC (rev 2683)
@@ -0,0 +1,118 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.archive.nutchwax;
+
+import java.io.IOException;
+import java.util.Iterator;
+
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.IntWritable;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.Mapper;
+import org.apache.hadoop.mapred.OutputCollector;
+import org.apache.hadoop.mapred.Reporter;
+import org.apache.nutch.net.URLFilters;
+import org.apache.nutch.net.URLNormalizers;
+
+/**
+ * <p>This class provides a way to separate the URL normalization
+ * and filtering steps from the rest of LinkDb manipulation code.</p>
+ * <p>Aaron Binns @ archive.org: see comments in PageRankDbMerger.</p>
+ *
+ * @author Andrzej Bialecki
+ * @author Aaron Binns (archive.org)
+ */
+public class PageRankDbFilter implements Mapper<Text, IntWritable, Text, IntWritable>
+{
+ public static final String URL_FILTERING = "linkdb.url.filters";
+
+ public static final String URL_NORMALIZING = "linkdb.url.normalizer";
+
+ public static final String URL_NORMALIZING_SCOPE = "linkdb.url.normalizer.scope";
+
+ private boolean filter;
+
+ private boolean normalize;
+
+ private URLFilters filters;
+
+ private URLNormalizers normalizers;
+
+ private String scope;
+
+ public static final Log LOG = LogFactory.getLog(PageRankDbFilter.class);
+
+ private Text newKey = new Text();
+
+ public void configure(JobConf job)
+ {
+ filter = job.getBoolean(URL_FILTERING, false);
+ normalize = job.getBoolean(URL_NORMALIZING, false);
+ if (filter)
+ {
+ filters = new URLFilters(job);
+ }
+ if (normalize)
+ {
+ scope = job.get(URL_NORMALIZING_SCOPE, URLNormalizers.SCOPE_LINKDB);
+ normalizers = new URLNormalizers(job, scope);
+ }
+ }
+
+ public void close()
+ {
+ }
+
+ public void map(Text key, IntWritable value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
+ {
+ String url = key.toString();
+ // Inlinks result = new Inlinks();
+ if (normalize)
+ {
+ try
+ {
+ url = normalizers.normalize(url, scope); // normalize the url
+ }
+ catch (Exception e)
+ {
+ LOG.warn("Skipping " + url + ":" + e);
+ url = null;
+ }
+ }
+ if (url != null && filter)
+ {
+ try
+ {
+ url = filters.filter(url); // filter the url
+ }
+ catch (Exception e)
+ {
+ LOG.warn("Skipping " + url + ":" + e);
+ url = null;
+ }
+ }
+ if (url == null) return; // didn't pass the filters
+
+ // DIFF: Now that normalizers and filters have run, just emit the
+ // <url,value> pair. No processing to be done on the value.
+ Text newKey = new Text( url );
+ output.collect( newKey, value );
+ }
+}
Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDbMerger.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDbMerger.java (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/PageRankDbMerger.java 2009-02-23 03:54:47 UTC (rev 2683)
@@ -0,0 +1,199 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.archive.nutchwax;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.Random;
+
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.IntWritable;
+import org.apache.hadoop.mapred.FileInputFormat;
+import org.apache.hadoop.mapred.FileOutputFormat;
+import org.apache.hadoop.mapred.JobClient;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.MapFileOutputFormat;
+import org.apache.hadoop.mapred.OutputCollector;
+import org.apache.hadoop.mapred.Reducer;
+import org.apache.hadoop.mapred.Reporter;
+import org.apache.hadoop.mapred.SequenceFileInputFormat;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.crawl.LinkDbFilter;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.NutchJob;
+
+/**
+ * This tool merges several PageRankDb-s into one, optionally filtering
+ * URLs through the current URLFilters, to skip prohibited URLs and
+ * links.
+ *
+ * <p>It's possible to use this tool just for filtering - in that case
+ * only one PageRankDb should be specified in arguments.</p>
+ * <p>If more than one PageRankDb contains information about the same URL,
+ * all inlinks are accumulated, but only at most <code>db.max.inlinks</code>
+ * inlinks will ever be added.</p>
+ * <p>If activated, URLFilters will be applied to both the target URLs and
+ * to any incoming link URL. If a target URL is prohibited, all
+ * inlinks to that target will be removed, including the target URL. If
+ * some of incoming links are prohibited, only they will be removed, and they
+ * won't count when checking the above-mentioned maximum limit.</p>
+ * <p>Aaron Binns @ archive.org:
+ * <blockquote>
+ * Copy/paste/edit from LinkDbMerger. We only care about the inlink
+ * <em>count</em>, not the inlinks themselves. In fact, trying to
+ * retain the inlinks doesn't scale when processing 100s of millions
+ * of documents. In large part, due to the fact that the Inlinks
+ * object wants to keep all of the inlinks in memory at once,
+ * i.e. in a Set. This doesn't work when we have 600 million
+ * documents and a single URL could easily have a million inlinks.
+ * </blockquote></p>
+ *
+ * @author Andrzej Bialecki
+ * @author Aaron Binns (archive.org)
+ */
+public class PageRankDbMerger extends Configured
+ implements Tool, Reducer<Text, IntWritable, Text, IntWritable>
+{
+ private static final Log LOG = LogFactory.getLog(PageRankDbMerger.class);
+
+ private int maxInlinks;
+
+ public PageRankDbMerger()
+ {
+
+ }
+
+ public PageRankDbMerger(Configuration conf)
+ {
+ setConf(conf);
+ }
+
+ public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
+ {
+ // DIFF: Simply sum the count values for the key.
+ int count = 0;
+ while ( values.hasNext( ) )
+ {
+ count += values.next( ).get( );
+ }
+ output.collect( key, new IntWritable( count ) );
+ }
+
+ public void configure(JobConf job)
+ {
+ maxInlinks = job.getInt("db.max.inlinks", 10000);
+ }
+
+ public void close() throws IOException
+ { }
+
+ public void merge(Path output, Path[] dbs, boolean normalize, boolean filter) throws Exception
+ {
+ JobConf job = createMergeJob(getConf(), output, normalize, filter);
+ for (int i = 0; i < dbs.length; i++)
+ {
+ FileInputFormat.addInputPath(job, new Path(dbs[i], PageRankDb.CURRENT_NAME));
+ }
+ JobClient.runJob(job);
+ FileSystem fs = FileSystem.get(getConf());
+ fs.mkdirs(output);
+ fs.rename(FileOutputFormat.getOutputPath(job), new Path(output, PageRankDb.CURRENT_NAME));
+ }
+
+ public static JobConf createMergeJob(Configuration config, Path pageRankDb, boolean normalize, boolean filter)
+ {
+ Path newPageRankDb =
+ new Path("pagerankdb-merge-" +
+ Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
+
+ JobConf job = new NutchJob(config);
+ job.setJobName("pagerankdb merge " + pageRankDb);
+
+ job.setInputFormat(SequenceFileInputFormat.class);
+
+ job.setMapperClass(PageRankDbFilter.class);
+ job.setBoolean(LinkDbFilter.URL_NORMALIZING, normalize);
+ job.setBoolean(LinkDbFilter.URL_FILTERING, filter);
+ job.setReducerClass(PageRankDbMerger.class);
+
+ FileOutputFormat.setOutputPath(job, newPageRankDb);
+ job.setOutputFormat(MapFileOutputFormat.class);
+ job.setBoolean("mapred.output.compress", true);
+ job.setOutputKeyClass(Text.class);
+
+ // DIFF: Use IntWritable instead of Inlinks as the output value type.
+ job.setOutputValueClass(IntWritable.class);
+
+ return job;
+ }
+
+ /**
+ * @param args
+ */
+ public static void main(String[] args) throws Exception
+ {
+ int res = ToolRunner.run(NutchConfiguration.create(), new PageRankDbMerger(), args);
+ System.exit(res);
+ }
+
+ public int run(String[] args) throws Exception
+ {
+ if (args.length < 2)
+ {
+ System.err.println("Usage: PageRankDbMerger <output_pagerankdb> <pagerankdb1> [<pagerankdb2> <pagerankdb3> ...] [-normalize] [-filter]");
+ System.err.println("\toutput_pagerankdb\toutput PageRankDb");
+ System.err.println("\tpagerankdb1 ...\tinput PageRankDb-s (single input PageRankDb is ok)");
+ System.err.println("\t-normalize\tuse URLNormalizer on both fromUrls and toUrls in pagerankdb(s) (usually not needed)");
+ System.err.println("\t-filter\tuse URLFilters on both fromUrls and toUrls in pagerankdb(s)");
+ return -1;
+ }
+ Path output = new Path(args[0]);
+ ArrayList<Path> dbs = new ArrayList<Path>();
+ boolean normalize = false;
+ boolean filter = false;
+ for (int i = 1; i < args.length; i++)
+ {
+ if (args[i].equals("-filter"))
+ {
+ filter = true;
+ } else if (args[i].equals("-normalize"))
+ {
+ normalize = true;
+ } else dbs.add(new Path(args[i]));
+ }
+ try
+ {
+ merge(output, dbs.toArray(new Path[dbs.size()]), normalize, filter);
+ return 0;
+ }
+ catch (Exception e)
+ {
+ LOG.fatal("PageRankDbMerger: " + StringUtils.stringifyException(e));
+ return -1;
+ }
+ }
+
+}
Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java 2009-02-10 22:19:48 UTC (rev 2682)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java 2009-02-23 03:54:47 UTC (rev 2683)
@@ -133,8 +133,6 @@
return -1;
}
- PrintWriter output = new PrintWriter( new OutputStreamWriter( fs.create( outputPath ).getWrappedStream( ), "UTF-8" ) );
-
if ( pos >= args.length )
{
System.err.println( "Error: missing linkdb" );
@@ -155,11 +153,17 @@
}
else
{
- FileStatus[] fstats = fs.listStatus( new Path(args[pos]+"/current"), HadoopFSUtil.getPassDirectoriesFilter(fs));
- mapfiles.addAll(Arrays.asList(HadoopFSUtil.getPaths(fstats)));
+ for ( ; pos < args.length ; pos++ )
+ {
+ FileStatus[] fstats = fs.listStatus( new Path(args[pos]+"/current"), HadoopFSUtil.getPassDirectoriesFilter(fs));
+ mapfiles.addAll(Arrays.asList(HadoopFSUtil.getPaths(fstats)));
+ }
}
System.out.println( "mapfiles = " + mapfiles );
+
+ PrintWriter output = new PrintWriter( new OutputStreamWriter( fs.create( outputPath ).getWrappedStream( ), "UTF-8" ) );
+
try
{
for ( Path p : mapfiles )
@@ -171,24 +175,28 @@
while ( reader.next( key, value ) )
{
- if ( key instanceof Text && value instanceof Inlinks )
+ if ( ! (key instanceof Text) ) continue ;
+
+ String toUrl = ((Text) key).toString( );
+
+ // HACK: Should make this into some externally configurable regex.
+ if ( ! toUrl.startsWith( "http" ) ) continue;
+
+ int count = -1;
+ if ( value instanceof IntWritable )
{
- Text toUrl = (Text) key;
+ count = ( (IntWritable) value ).get( );
+ }
+ else if ( value instanceof Inlinks )
+ {
Inlinks inlinks = (Inlinks) value;
- if ( inlinks.size( ) < threshold )
- {
- continue ;
- }
+ count = inlinks.size( );
+ }
+
+ if ( count < threshold ) continue ;
- String toUrlString = toUrl.toString( );
-
- // HACK: Should make this into some externally configurable regex.
- if ( toUrlString.startsWith( "http" ) )
- {
- output.println( inlinks.size( ) + " " + toUrl.toString() );
- }
- }
+ output.println( count + " " + toUrl );
}
}
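The net effect of the map/combine/reduce chain is plain inlink counting. A toy walk-through with made-up values (class name and URL are illustrative only):

    public class InlinkCountDemo {
      public static void main(String[] args) {
        // Map phase: each surviving outlink emits (toUrl, 1) instead of
        // LinkDb's (toUrl, Inlinks). Suppose three pages link to the same
        // hypothetical URL, so the reducer sees the values [1, 1, 1]:
        int[] values = { 1, 1, 1 };
        int count = 0;
        for (int v : values) {  // PageRankDbMerger.reduce just sums
          count += v;
        }
        // output.collect(key, new IntWritable(count)) would then emit
        // ("http://example.org/a", 3) -- a count, never a set of inlinks.
        System.out.println(count); // 3
      }
    }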
From: <bra...@us...> - 2009-02-10 22:52:06
Revision: 2682
http://archive-access.svn.sourceforge.net/archive-access/?rev=2682&view=rev
Author: bradtofel
Date: 2009-02-10 22:19:48 +0000 (Tue, 10 Feb 2009)
Log Message:
-----------
TWEAK: updated heritrix commons to 2.0.2 which has several bug fixes.
TWEAK: updated org.mozilla.juniversalchardet to 1.0.3 which has index OOB error fix.
Modified Paths:
--------------
trunk/archive-access/projects/wayback/wayback-core/pom.xml
Modified: trunk/archive-access/projects/wayback/wayback-core/pom.xml
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/pom.xml 2009-01-31 00:57:49 UTC (rev 2681)
+++ trunk/archive-access/projects/wayback/wayback-core/pom.xml 2009-02-10 22:19:48 UTC (rev 2682)
@@ -57,7 +57,7 @@
<dependency>
<groupId>org.archive.heritrix</groupId>
<artifactId>commons</artifactId>
- <version>2.0.1-SNAPSHOT</version>
+ <version>2.0.2-SNAPSHOT</version>
</dependency>
<dependency>
<groupId>org.archive.access-control</groupId>
@@ -67,7 +67,7 @@
<dependency>
<groupId>org.mozilla</groupId>
<artifactId>juniversalchardet</artifactId>
- <version>1.0</version>
+ <version>1.0.3</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
Revision: 2681
http://archive-access.svn.sourceforge.net/archive-access/?rev=2681&view=rev
Author: bradtofel
Date: 2009-01-31 00:57:49 +0000 (Sat, 31 Jan 2009)
Log Message:
-----------
BUGFIX(ACC-60): now we omit sending the original Content-Length HTTP header.
Modified Paths:
--------------
trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/RedirectRewritingHttpHeaderProcessor.java
Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/RedirectRewritingHttpHeaderProcessor.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/RedirectRewritingHttpHeaderProcessor.java 2009-01-29 23:52:10 UTC (rev 2680)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/RedirectRewritingHttpHeaderProcessor.java 2009-01-31 00:57:49 UTC (rev 2681)
@@ -62,7 +62,9 @@
// first stick it in as-is, or with prefix, then maybe we'll overwrite
// with the later logic.
if(prefix == null) {
- output.put(key, value);
+ if(!keyUp.equals(HTTP_LENGTH_HEADER_UP)) {
+ output.put(key, value);
+ }
} else {
output.put(prefix + key, value);
}
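A small sketch of why the header has to go, with a hypothetical archived value; it assumes HTTP_LENGTH_HEADER_UP is the upper-cased "CONTENT-LENGTH" constant (class name and values are illustrative):

    import java.util.HashMap;
    import java.util.Map;

    public class DropLengthHeader {
      public static void main(String[] args) {
        Map<String, String> output = new HashMap<String, String>();
        // Hypothetical archived header: the recorded body was 8192 bytes, but
        // the rewritten replay body is a different size, so forwarding this
        // value would make clients truncate or hang. The servlet container
        // recomputes the correct length itself.
        String key = "Content-Length", value = "8192";
        if (!key.toUpperCase().equals("CONTENT-LENGTH")) {
          output.put(key, value); // other headers still pass through unchanged
        }
        System.out.println(output); // {} -- Content-Length was omitted
      }
    }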