archive-access-cvs Mailing List for Web Archive Access Utilities (Page 34)

archive-access-cvs — CVS commits

[Archive-access-cvs] SF.net SVN: archive-access:[2980] trunk/archive-access/projects/nutchwax/ archive/RELEASE-NOTES.txt

From: <bi...@us...> - 2010-03-18 22:43:10

Revision: 2980
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2980&view=rev
Author:   binzino
Date:     2010-03-18 22:43:04 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
Updated for NW 0.13.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt

Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt	2010-03-18 22:40:39 UTC (rev 2979)
+++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt	2010-03-18 22:43:04 UTC (rev 2980)
@@ -1,57 +1,56 @@
 
 RELEASE-NOTES.TXT
-2009-05-05
+2010-02-13
 Aaron Binns
 
-Release notes for NutchWAX 0.12.4
+Release notes for NutchWAX 0.13
 
 For the most recent updates and information on NutchWAX,
 please visit the project wiki at:
 
-  http://webteam.archive.org/confluence/display/search/NutchWAX
+  http://webarchive.jira.com/wiki/display/search/NutchWAX
 
-
 ======================================================================
 Overview
 ======================================================================
 
-NutchWAX 0.12.4 contains numerous enhancements and fixes to 0.12.3
+NutchWAX 0.13 is an update of the NutchWAX code to the Nutch 1.0
+release.
 
-  o Option to omit storing of content during import.
-  o Support for per-collection segments in master/slave config.
-  o Additional diagnostic/log messages to help troubleshoot common
-    deployment mistakes.
-  o PageRankDb similar to LinkDb but only keeping inlink counts.
-  o Improved paging through results, handling "paging past the end".
+This release also allows for field values to be stored in the index in
+compressed form.  Simply change the field storage specification in the
+'nutchwax.filter.index' property from "true" to "compress".  
 
+For example,
 
+<property>
+  <name>nutchwax.filter.index</name>
+  <value>
+    title:false:true:tokenized
+    content:false:compress:tokenized
+    ...
+  </value>
+</property>
+
+This stores the entire content field in the Lucene index, using
+compression.
+
 ======================================================================
 Issues
 ======================================================================
 
 For an up-to-date list of NutchWAX issues:
 
-  http://webteam.archive.org/jira/browse/WAX
+  http://webarchive.jira.com/browse/WAX
 
 Issues resolved in this release:
 
-WAX-27 Sensible output for requesting page of results past the end.
+WAX-74  Add support for storing fields in compressed form.
 
-WAX-34 Add option to omit storing of content in segment
+WAX-73  Change default value of searcher.fieldcache in nutch-site.xml to 'false'
 
-WAX-35 Add pagerankdb similar to linkdb but which only keeps counts
-       rather than actual inlinks.
+WAX-72  Simplify build system to copy NW files into Nutch dirs and use Nutch build.xml
 
-WAX-36 Some additional diagnostics on connecting results to segments
-       and snippets would be very helpful.
+WAX-71  NutchWAX-required libraries not included in nutch-1.0.job
 
-WAX-37 Per-collection segments not supported in distributed
-       master-slave configuration.
-
-WAX-38 Build omits neessary libraries from .job file.
-
-WAX-39 Write more efficient, specialized segment parse_text merging.
-
-WAX-41 Option to enable/disable the FIELDCACHE in the Nutch IndexSearcher
-
-WAX-42 Add option to continue importing if an arcfile cannot be read.
+WAX-69  Class not found when importing within a Hadoop MR job.


This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.

[Archive-access-cvs] SF.net SVN: archive-access:[2979] trunk/archive-access/projects/nutchwax/ archive/src

From: <bi...@us...> - 2010-03-18 22:40:45

Revision: 2979
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2979&view=rev
Author:   binzino
Date:     2010-03-18 22:40:39 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
WAX-74.  Add support for storing field value in compressed form.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
    trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml	2010-03-18 22:11:53 UTC (rev 2978)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml	2010-03-18 22:40:39 UTC (rev 2979)
@@ -44,11 +44,10 @@
   <name>nutchwax.filter.index</name>
   <value>
     title:false:true:tokenized
-    content:false:false:tokenized
+    content:false:compress:tokenized
     site:false:false:untokenized
 
     url:false:true:tokenized
-    digest:false:true:no
 
     collection:true:true:no_norms
     date:true:true:no_norms

Modified: trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java	2010-03-18 22:11:53 UTC (rev 2978)
+++ trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java	2010-03-18 22:40:39 UTC (rev 2979)
@@ -36,6 +36,7 @@
 import org.apache.nutch.indexer.NutchDocument;
 import org.apache.nutch.indexer.lucene.LuceneWriter;
 import org.apache.nutch.indexer.lucene.LuceneWriter.INDEX;
+import org.apache.nutch.indexer.lucene.LuceneWriter.STORE;
 import org.apache.nutch.metadata.Metadata;
 import org.apache.nutch.parse.Parse;
 
@@ -74,7 +75,7 @@
 
         String  srcKey     = spec[0];
         boolean lowerCase  = true;
-        boolean store      = true;
+        STORE   store      = STORE.YES;
         INDEX   index      = INDEX.TOKENIZED;
         boolean exclusive  = true;
         String  destKey    = srcKey;
@@ -91,7 +92,10 @@
                         "no_norms".   equals(spec[3]) ? INDEX.NO_NORMS :
                         INDEX.NO;
           case 3:
-            store     = Boolean.parseBoolean( spec[2] );
+            //store     = Boolean.parseBoolean( spec[2] );
+            store     = "true".    equals(spec[2]) ? STORE.YES :
+                        "compress".equals(spec[2]) ? STORE.COMPRESS :
+                        STORE.NO;
           case 2:
             lowerCase = Boolean.parseBoolean( spec[1] );
           case 1:
@@ -109,12 +113,12 @@
   {
     String  srcKey;
     boolean lowerCase;
-    boolean store;
+    STORE   store;
     INDEX   index;
     boolean exclusive;
     String  destKey;
 
-    public FieldSpecification( String srcKey, boolean lowerCase, boolean store, INDEX index, boolean exclusive, String destKey )
+    public FieldSpecification( String srcKey, boolean lowerCase, STORE store, INDEX index, boolean exclusive, String destKey )
     {
       this.srcKey    = srcKey;
       this.lowerCase = lowerCase;
@@ -147,6 +151,12 @@
             try
               {
                 value = (new URL( meta.get( "url" ) ) ).getHost( );
+
+                // Strip off any "www." header.
+                if ( value.startsWith( "www." ) )
+                  {
+                    value = value.substring( 4 );
+                  }
               }
             catch ( MalformedURLException mue ) { /* Eat it */ }
           }
@@ -171,6 +181,11 @@
             int p = value.indexOf( ';' );
             if ( p >= 0 ) value = value.substring( 0, p );
           }
+        else if ( "collection".equals( spec.srcKey ) )
+          {
+            // Use value given in config first, otherwise what's in the metadata object.
+            value = conf.get( "nutchwax.index.collection", meta.get( spec.srcKey ) );
+          }
         else
           {
             value = meta.get( spec.srcKey );
@@ -188,7 +203,7 @@
             doc.removeField( spec.destKey );
           }
 
-        if ( spec.store || spec.index != INDEX.NO )
+        if ( spec.store != STORE.NO || spec.index != INDEX.NO )
           {
             doc.add( spec.destKey, value );
           }
@@ -202,13 +217,13 @@
   {
     for ( FieldSpecification spec : this.fieldSpecs )
       {
-        if ( ! spec.store && spec.index == INDEX.NO )
+        if ( spec.store == STORE.NO && spec.index == INDEX.NO )
           {
             continue ;
           }
 
         LuceneWriter.addFieldOptions( spec.destKey, 
-                                      spec.store ? LuceneWriter.STORE.YES : LuceneWriter.STORE.NO,
+                                      spec.store,
                                       spec.index,
                                       conf );
       }

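A minimal sketch of how one 'nutchwax.filter.index' entry could be parsed after this change, using the defaults documented in BUILD-NOTES.txt. The Store and Index enums are hypothetical stand-ins for LuceneWriter.STORE and LuceneWriter.INDEX, the class and method names are illustrative only, and the "tokenized"/"untokenized" mappings are assumed from the values used in nutch-site.xml rather than taken from the lines shown in the diff.

```java
public class FieldSpecSketch {
    /** Hypothetical stand-in for LuceneWriter.STORE. */
    public enum Store { YES, COMPRESS, NO }

    /** Hypothetical stand-in for LuceneWriter.INDEX. */
    public enum Index { TOKENIZED, UNTOKENIZED, NO_NORMS, NO }

    /** One parsed entry of the form src-key:lowercase:store:index:exclusive:dest-key. */
    public static final class FieldSpec {
        public final String srcKey;
        public final boolean lowerCase;
        public final Store store;
        public final Index index;
        public final boolean exclusive;
        public final String destKey;

        FieldSpec(String srcKey, boolean lowerCase, Store store,
                  Index index, boolean exclusive, String destKey) {
            this.srcKey = srcKey;
            this.lowerCase = lowerCase;
            this.store = store;
            this.index = index;
            this.exclusive = exclusive;
            this.destKey = destKey;
        }
    }

    /** Mirrors the diff: "true" and "compress" are recognized, anything else means NO. */
    public static Store parseStore(String token) {
        if ("true".equals(token)) return Store.YES;
        if ("compress".equals(token)) return Store.COMPRESS;
        return Store.NO;
    }

    /** Assumed mapping for the index token; unknown values fall back to NO. */
    public static Index parseIndex(String token) {
        if ("tokenized".equals(token)) return Index.TOKENIZED;
        if ("untokenized".equals(token)) return Index.UNTOKENIZED;
        if ("no_norms".equals(token)) return Index.NO_NORMS;
        return Index.NO;
    }

    /** Only src-key is required; missing parts keep the documented defaults. */
    public static FieldSpec parse(String entry) {
        String[] spec = entry.trim().split(":");
        String srcKey = spec[0];
        boolean lowerCase = true;
        Store store = Store.YES;
        Index index = Index.TOKENIZED;
        boolean exclusive = true;
        String destKey = srcKey;

        if (spec.length > 1) lowerCase = Boolean.parseBoolean(spec[1]);
        if (spec.length > 2) store = parseStore(spec[2]);
        if (spec.length > 3) index = parseIndex(spec[3]);
        if (spec.length > 4) exclusive = Boolean.parseBoolean(spec[4]);
        if (spec.length > 5) destKey = spec[5];

        return new FieldSpec(srcKey, lowerCase, store, index, exclusive, destKey);
    }

    public static void main(String[] args) {
        FieldSpec spec = parse("content:false:compress:tokenized");
        System.out.println(spec.srcKey + " store=" + spec.store + " index=" + spec.index);
    }
}
```

Because any storage token other than "true" or "compress" maps to "not stored", existing entries that use "false" keep their meaning.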

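The "site" handling added in this revision can be sketched in isolation as follows. The class and method names are hypothetical, and returning null for a malformed URL is an assumption made for the sketch; the committed code simply swallows the exception and leaves the value unchanged.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class SiteHostSketch {
    /**
     * Returns the host part of the URL with a leading "www." removed,
     * or null if the URL cannot be parsed (an assumption of this sketch).
     */
    @SuppressWarnings("deprecation") // the URL(String) constructor is deprecated on newer JDKs
    public static String siteOf(String url) {
        try {
            String host = new URL(url).getHost();
            // Strip the four characters of "www." so that www.example.org
            // and example.org are indexed under the same site value.
            return host.startsWith("www.") ? host.substring(4) : host;
        } catch (MalformedURLException mue) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(siteOf("http://www.example.org/index.html"));
    }
}
```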
[Archive-access-cvs] SF.net SVN: archive-access:[2978] trunk/archive-access/projects/nutchwax/ archive/HOWTO-xslt.txt

From: <bi...@us...> - 2010-03-18 22:12:05

Revision: 2978
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2978&view=rev
Author:   binzino
Date:     2010-03-18 22:11:53 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
Update for NW 0.13.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt	2010-03-18 22:10:35 UTC (rev 2977)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt	2010-03-18 22:11:53 UTC (rev 2978)
@@ -1,6 +1,6 @@
 
 HOWTO-xslt.txt
-2008-12-18
+2009-06-25
 Aaron Binns
 
 Table of Contents
@@ -128,8 +128,5 @@
 
 You can find sample 'web.xml' and 'search.xsl' files in 
 
-  contrib/archive/web
-
-in the compiled Nutch package.  Or in this source tree under
-
-  src/web
+  ./src/nutch/src/web/jsp/search.xsl
+  ./src/nutch/src/web/web.xml


[Archive-access-cvs] SF.net SVN: archive-access:[2977] trunk/archive-access/projects/nutchwax/ archive/HOWTO.txt

From: <bi...@us...> - 2010-03-18 22:11:07

Revision: 2977
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2977&view=rev
Author:   binzino
Date:     2010-03-18 22:10:35 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
Updated for NW 0.13.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2010-03-18 21:55:45 UTC (rev 2976)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2010-03-18 22:10:35 UTC (rev 2977)
@@ -1,17 +1,18 @@
 
 HOWTO.txt
-2008-07-28
+2010-02-13
 Aaron Binns
 
 Table of Contents
  o Prerequisites
    - NutchWAX installation
    - ARC/WARC files
- o Create a manifest
- o Import, Invert and Index
- o Search
- o Web deployment
-   - Don't forget to config & patch again
+ o Build index
+   - Stand-alone
+   - Hadoop
+ o Search index
+   - Single server
+   - Master/slave servers
 
 ======================================================================
 Prerequisites
@@ -26,7 +27,7 @@
 
     This HOWTO assumes it is installed in
 
-      /opt/nutchwax-0.12.4
+      /opt/nutchwax-0.13
 
  2. ARC/WARC files.
 
@@ -60,32 +61,28 @@
 
 
 ======================================================================
-Import, Invert and Index
+Build Index
 ======================================================================
 
-The steps to import the files, invert the link and index the documents
-are rather simple:
+Building the index consists of two required steps with one recommended
+optional step.
 
-  $ mkdir crawl
-  $ cd crawl
-  $ /opt/nutchwax-0.12.4/bin/nutchwax import ../manifest
-  $ /opt/nutchwax-0.12.4/bin/nutch updatedb crawldb -dir segments
-  $ /opt/nutchwax-0.12.4/bin/nutch invertlinks linkdb  -dir segments
-  $ /opt/nutchwax-0.12.4/bin/nutch index indexes crawldb linkdb segments/*
-  $ ls -F1
-  crawldb/
-  indexes/
-  linkdb/
-  segments/
+  1. Import
+  2. Index
+  3. Pagerank  (optional)
 
-To those already familiar with Nutch, these steps should be quite
-familiar.
+Performing these steps using the 'nutchwax' command-line driver
+is rather straightforward:
 
-The first step, we call NutchWAX's "import" command which creates the
-Nutch segment containing the documents in the ARC/WARC files listed in
-the manifest.  The rest is the same as regular Nutch.
+  $ /opt/nutchwax-0.13/bin/nutchwax import     manifest.txt
+  $ /opt/nutchwax-0.13/bin/nutchwax index      indexes segments/*
+  $ /opt/nutchwax-0.13/bin/nutchwax merge      index   indexes
 
+  $ /opt/nutchwax-0.13/bin/nutchwax pagerankdb pagerankdb segments/*
+  $ /opt/nutchwax-0.13/bin/nutchwax pageranker ranks.txt  pagerankdb
+  $ /opt/nutchwax-0.13/bin/nutchwax reboost    ranks.txt  index
 
+
 ======================================================================
 Search
 ======================================================================
@@ -96,9 +93,9 @@
   $ cd ../
   $ ls -F1
   crawl/
-  $ /opt/nutchwax-0.12.4/bin/nutch org.apache.nutch.searcher.NutchBean computer
+  $ /opt/nutchwax-0.13/bin/nutchwax search computer
 
-This calls the NutchBean to execute a simple keyword search for
+This calls the NutchWaxBean to execute a simple keyword search for
 "computer".  Use whatever query term you think appears in the
 documents you imported.
 
@@ -109,7 +106,7 @@
 
 The Nutch(WAX) web application is bundled with NutchWAX as
 
-  /opt/nutchwax-0.12.4/nutch-1.0-dev.war
+  /opt/nutchwax-0.13/nutch-1.0-dev.war
 
 Simply deploy that web application in the same fashion as with
 Nutch.


[Archive-access-cvs] SF.net SVN: archive-access:[2976] trunk/archive-access/projects/nutchwax/ archive/HOWTO-pagerank.txt

From: <bi...@us...> - 2010-03-18 21:55:58

Revision: 2976
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2976&view=rev
Author:   binzino
Date:     2010-03-18 21:55:45 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
Updated for NW 0.13.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt	2010-03-18 21:51:55 UTC (rev 2975)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt	2010-03-18 21:55:45 UTC (rev 2976)
@@ -1,6 +1,6 @@
 
 HOWTO-pagerank.txt
-2008-12-18
+2010-02-13
 Aaron Binns
 
 Table of Contents
@@ -30,22 +30,20 @@
 simplistic "page rank" information for scoring and sorting documents
 in the full-text search index.
 
-Nutch's 'invertlinks' step inverts links and stores them in the
-'linkdb' directory.  We use these inlinks to boost the Lucene score of
-documents in proportion to the number of inlinks.
+NutchWAX's 'pagerankdb' command inverts and counts links to a page,
+storing the counts in a directory named 'pagerankdb'.  This
+information is then used to update the boost values in the Lucene
+index in proportion to the number of inlinks to each document.
 
 
 ======================================================================
 Generate PageRank
 ======================================================================
 
-After the Nutch 'invertlinks' step is performed, run the NutchWAX
-'pagerank' command to extract inlink information from the 'linkdb'
-
 For example
 
-  $ nutch invertlinks linkdb -dir segments
-  $ nutchwax pagerank pagerank.txt linkdb
+  $ nutchwax pagerankdb prdb -dir segments
+  $ nutchwax pagerank pagerank.txt prdb
 
 The resulting "pagerank.txt" file is a simple text file containing
 a count of the number of inlinks followed by the URL. 

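A minimal sketch of reading one line of the "pagerank.txt" file described above. The text only says that each line holds an inlink count followed by the URL; the whitespace separator, the class name, and the method name are assumptions made for illustration.

```java
import java.util.AbstractMap;
import java.util.Map;

public class PageRankLineSketch {
    /**
     * Parses a line of the assumed form "<inlink-count> <url>" and
     * returns the URL mapped to its inlink count.
     */
    public static Map.Entry<String, Integer> parseLine(String line) {
        // Split on the first run of whitespace only, so the URL is kept whole.
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected '<count> <url>': " + line);
        }
        int count = Integer.parseInt(parts[0]);
        return new AbstractMap.SimpleImmutableEntry<>(parts[1], count);
    }

    public static void main(String[] args) {
        Map.Entry<String, Integer> entry = parseLine("42 http://example.org/");
        System.out.println(entry.getKey() + " has " + entry.getValue() + " inlinks");
    }
}
```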

[Archive-access-cvs] SF.net SVN: archive-access:[2975] trunk/archive-access/projects/nutchwax/ archive/INSTALL.txt

From: <bi...@us...> - 2010-03-18 21:52:09

Revision: 2975
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2975&view=rev
Author:   binzino
Date:     2010-03-18 21:51:55 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
Updated for NW 0.13.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/INSTALL.txt

Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt	2010-03-18 19:27:14 UTC (rev 2974)
+++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt	2010-03-18 21:51:55 UTC (rev 2975)
@@ -1,38 +1,39 @@
 
 INSTALL.txt
-2009-03-08
+2010-02-13
 Aaron Binns
 
 Table of Contents
  o Introduction
  o Build from source
-    - SVN: Nutch 1.0-dev
+    - SVN: Nutch 1.0
     - SVN: NutchWAX
     - Build and Install
  o Install binary package
- o Install start-up scripts
 
 
 ======================================================================
 Introduction
 ======================================================================
 
-This installation guide assumes the reader is already familiar with
-building, packaging and deploying Nutch 1.0-dev.
+This installation guide assumes the reader is not familiar with Nutch
+and is looking for step-by-step instructions on building and
+installing NutchWAX.
 
-The NutchWAX 0.12 source and build system are designed to integrate
-into the existing Nutch 1.0-dev source and build.
 
-The long-term goal is for the NutchWAX components to be fully
-integrated into mainline Nutch.  As a stepping-stone toward that goal,
-we have packaged the NutchWAX source to be dropped into the Nutch
-"contrib" directory and built from there.
+======================================================================
+Build from Source
+======================================================================
 
-Like Nutch, NutchWAX 0.12 uses a simple 'ant' build script.  The
-NutchWAX build script calls out to the Nutch script to build Nutch
-proper, then builds NutchWAX components and integrates them into the
-Nutch build directory.
+The NutchWAX source is packaged as a 'contrib' package for Nutch.
+To build from source, you must check out both the Nutch and
+NutchWAX sources.
 
+Like Nutch, NutchWAX uses a simple 'ant' build script.  The NutchWAX
+build script calls out to the Nutch script to build the Nutch
+components, then builds the NutchWAX components and integrates them
+into the Nutch build directory.
+
 In order to build NutchWAX, execute all build commands from the
 NutchWAX directory.  This way, NutchWAX will ensure that any and all
 dependencies in Nutch will be properly built and kept up-to-date.
@@ -46,130 +47,64 @@
   o tar
   o clean
 
-Again, the idea is that if you're already used to building Nutch, you
-can easily transition to building Nutch and NutchWAX together.  All of
-the build artifacts will still be placed in Nutch's 'build'
-sub-directory as normal.
 
+SVN: nutch-1.0
+--------------
+NutchWAX 0.13 is built against Nutch-1.0.
 
-======================================================================
-Build from Source
-======================================================================
-
-To build from source, you must check-out the Nutch and NutchWAX sources
-from their respective 'subversion' source control servers.
-
-SVN: nutch-1.0-dev
-------------------
-As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
-Nutch doesn't have a 1.0 release package yet, so we have to use the
-Nutch SVN trunk.  The specific SVN revision that NutchWAX 0.12.4 is
-built against is:
-
-  701524
-
 To checkout this revision of Nutch, use:
 
- $ svn checkout -r 701524 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
+ $ svn checkout http://svn.apache.org/repos/asf/lucene/nutch/tags/release-1.0 nutch
  $ cd nutch
 
-Please be sure to check-out this specific version of the Nutch source.
+Please be sure to check-out this specific release of the Nutch source.
 If you just grab the head of the trunk, there may be newer and
-incompatible changed to Nutch.
+incompatible changes to Nutch.
 
 SVN: NutchWAX
 -------------
-Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.4
+Once you have Nutch-1.0 checked-out, check-out the NutchWAX 0.13
 source into Nutch's "contrib" directory.
 
  $ cd contrib
- $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_4/archive
+ $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_13/archive
 
 This will create a sub-directory named "archive" containing the
-NutchWAX 0.12.4 sources.
+NutchWAX 0.13 sources.
 
 Build and install
 -----------------
-Assuming you already have the required tool-set for building Nutch,
-building NutchWAX is a snap.
+Simply execute the 'ant' build command in the NutchWAX
+source tree:
 
-Simply execute the same 'ant' build command in
-
-  nutch/contrib/archive
-
-as you normally would and everything will build as normal.
-
-For example
-
   $ cd nutch/contrib/archive
   $ ant tar
 
 This command will build all of Nutch, then the NutchWAX add-ons and
-finally will package everything up into the "nutch-1.0-dev.tar.gz"
-release package.
+finally will package everything up into the "nutch-1.0.tar.gz" release
+package, which is placed in the Nutch 'build' subdir:
 
-Then, install the "nutch-1.0-dev.tar.gz" tarball as normal.  For
+  # Assuming we are still in nutch/contrib/archive
+  $ ls ../../build/nutch-1.0.tar.gz
+  ../../build/nutch-1.0.tar.gz
+
+Then, install the "nutch-1.0.tar.gz" tarball as normal.  For
 example:
 
   $ cd /opt
-  $ tar xvfz nutch-1.0-dev.tar.gz
-  $ mv nutch-1.0-dev nutchwax-0.12.4
+  $ tar xvfz nutch-1.0.tar.gz
+  $ mv nutch-1.0 nutchwax-0.13
 
 
 ======================================================================
 Install binary package
 ======================================================================
 
-Alternatively, grab a "binary" release package from the Internet
-Archive's NutchWAX home page.
+Alternatively, grab a pre-compiled (binary) release package from the
+Internet Archive's NutchWAX home page.
 
 Install it simply by untarring it, for example:
 
   $ cd /opt
-  $ tar xvfz nutchwax-0.12.4.tar.gz
+  $ tar xvfz nutchwax-0.13.tar.gz
 
-
-======================================================================
-Install start-up scripts
-======================================================================
-
-NutchWAX 0.12.4 comes with a Unix init.d script which can be used to
-automatically start the searcher slaves for a multi-node search
-configuration.
-
-Assuming you installed NutchWAX as
-
-  /opt/nutchwax-0.12.4
-
-the script is found at
-
-  /opt/nutchwax-0.12.4/contrib/archive/etc/init.d/searcher-slave
-
-This script can be placed in /etc/init.d then added to the list of
-startup scripts to run at bootup by using commands appropriate to your
-Linux distribution.
-
-You must edit a few of the environment variables defined in the
-'searcher-slave' specifying where NutchWAX is installed and where the
-index(s) are deployed.  In 'searcher-slave' you will find the:
-
-  export NUTCH_HOME=TODO
-  export DEPLOYMENT_DIR=TODO
-
-edit those appropriately for your system.
-
-
-The "master" in the multi-node search deployment is the NutchWAX
-webapp running in a webapp server, such as Tomcat or Jetty.
-
-Jetty comes with a start/stop script appropriate for use as an init.d
-script, similar to the 'searcher-slave' script described above.  If you
-use Jetty, create a symlink 
-
-  /etc/init.d/jetty.sh  -> /opt/jetty/bin/jetty.sh
-
-Then add this script to the list of startup scripts to run at bootup
-by using commands appropriate to your Linux distribution.
-
-Follow the instructions from Jetty on the deployment of the NutchWAX
-webapp (nutch-1.0-dev.war) in the Jetty web application server.


[Archive-access-cvs] SF.net SVN: archive-access:[2974] trunk/archive-access/projects/nutchwax/ archive/README.txt

From: <bi...@us...> - 2010-03-18 19:27:20

Revision: 2974
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2974&view=rev
Author:   binzino
Date:     2010-03-18 19:27:14 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
Updated for NW 0.13.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/README.txt

Modified: trunk/archive-access/projects/nutchwax/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README.txt	2010-03-18 19:26:44 UTC (rev 2973)
+++ trunk/archive-access/projects/nutchwax/archive/README.txt	2010-03-18 19:27:14 UTC (rev 2974)
@@ -1,6 +1,6 @@
 
 README.txt
-2009-05-05
+2010-02-13
 Aaron Binns
 
 Table of Contents
@@ -13,7 +13,7 @@
 Introduction
 ======================================================================
 
-Welcome to NutchWAX 0.12.4!
+Welcome to NutchWAX 0.13!
 
 NutchWAX is a set of add-ons to Nutch in order to index and search
 archived web data.
@@ -24,10 +24,7 @@
 
 The name "NutchWAX" stands for "Nutch (W)eb (A)rchive e(X)tensions".
 
-Since NutchWAX is a set of add-ons to Nutch, you should already be
-familiar with Nutch before using NutchWAX.
 
-
 The goal of NutchWAX is to enable full-text indexing and searching of
 documents stored in web archive file formats (ARC and WARC).
 


[Archive-access-cvs] SF.net SVN: archive-access:[2973] trunk/archive-access/projects/nutchwax/ archive/BUILD-NOTES.txt

From: <bi...@us...> - 2010-03-18 19:26:54

Revision: 2973
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2973&view=rev
Author:   binzino
Date:     2010-03-18 19:26:44 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
Updated to match NW 0.13.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt

Modified: trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt	2010-03-16 21:37:14 UTC (rev 2972)
+++ trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt	2010-03-18 19:26:44 UTC (rev 2973)
@@ -1,6 +1,6 @@
 
 BUILD-NOTES.txt
-2008-12-18
+2010-02-13
 Aaron Binns
 
 ======================================================================
@@ -13,15 +13,15 @@
 
 ======================================================================
 
-This 0.12.x release of NutchWAX is radically different in source-code
+This 0.13 release of NutchWAX is radically different in source-code
 form compared to the previous release, 0.10.
 
-One of the design goals of 0.12.x was to reduce or even eliminate the
+One of the design goals of 0.13 was to reduce or even eliminate the
 "copy/paste/edit" approach of 0.10.  The 0.10 (and prior) NutchWAX
 releases had to copy/paste/edit large chunks of Nutch source code in
 order to add the NutchWAX features.
 
-Also, the NutchWAX 0.12.x sources and build are designed to one day be
+Also, the NutchWAX 0.13 sources and build are designed to one day be
 added into mainline Nutch as a proper "contrib" package; then
 eventually be fully integrated into the core Nutch source code.
 
@@ -77,47 +77,7 @@
 to the Nutch source and configuration files.
 
 ----------------------------------------------------------------------
-The file
 
-  /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml
-
-contains two errors: one where a mimetype is referenced before it is
-defined; and a second where a definition has an illegal character.
-
-These errors cause Nutch to not recognize certain mimetypes and
-therefore will ignore documents matching those mimetypes.
-
-There are two fixes:
-
- 1. Move
-
-	<mime-type type="application/xml">
-		<alias type="text/xml" />
-		<glob pattern="*.xml" />
-	</mime-type>
-
-    definition higher up in the file, before the reference to it.
-
- 2. Remove
-
-	<mime-type type="application/x-ms-dos-executable">
-		<alias type="application/x-dosexec;exe" />
-	</mime-type>
-
-    as the ';' character is illegal according to the comments in the
-    Nutch code.
-
-You can either apply these patches yourself, or copy an already-patched
-copy from:
-
-  /opt/nutchwax-0.12.4/contrib/archive/conf/tika-mimetypes.xml
-
-to 
-
-  /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml
-
-----------------------------------------------------------------------
-
 In the file 'conf/nutch-site.xml' we define some properties to
 over-ride the values in 'conf/nutch-default.xml'.
 
@@ -130,27 +90,37 @@
 
 to
 
-  protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax
+  protocol-http|parse-(text|html|js|pdf)|index-nutchwax|query-(basic|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax
 
 In short, we add:
 
-  index-nutchwax
-  query-nutchwax
-  urlfilter-nutchwax
-  parse-pdf
+ parse-pdf
+ index-nutchwax
+ query-nutchwax
+ urlfilter-nutchwax
 
 and remove:
 
-  urlfilter-regex
-  urlnormalizer-(pass|regex|basic)
+ index-basic
+ index-anchor
+ query-site
+ query-url
+ urlfilter-regex
+ urlnormalizer-(pass|regex|basic)
 
-The only *required* changes are the additions of the NutchWAX index
-and query plugins.  The rest are optional, but recommended.
 
 The "parse-pdf" plugin is added simply because we have lots of PDFs in
 our archives and we want to index them.  We sometimes remove the
 "parse-js" plugin if we don't care to index JavaScript files.
 
+The Nutch index-basic and index-anchor filters are removed and
+replaced with the NutchWAX index-nutchwax filter.  Similarly, we
+remove the Nutch query-site and query-url filters, replacing them with
+the single NutchWAX query-nutchwax filter.  By using the configurable
+NutchWAX filters for indexing and querying, we get more powerful and
+consistent behavior across metadata fields.  Note that we do retain
+the Nutch query-basic filter however.
+
 We also remove the default Nutch URL filtering and normalizing plugins
 because we do not need the URLs normalized nor filtered.  We trust
 that the tool that produced the ARC/WARC file will have normalized the
@@ -166,6 +136,14 @@
 --------------------------------------------------
 indexingfilter.order
 --------------------------------------------------
+If we use the indexing filters as specified in the previous section,
+then this property can remain unset.  However, if you choose to use
+the Nutch index-basic filter, then you *must* specify the order in
+which the filters will be used.  If you don't then the filters will be
+applied in a random order (per Nutch's design) and since one may
+over-write the values of another you won't know what values will
+result.  In that case, you need to specify the order.
+
 Add this property with a value of
 
     org.apache.nutch.indexer.basic.BasicIndexingFilter
@@ -174,8 +152,6 @@
 So that the NutchWAX indexing filter is run after the Nutch basic
 indexing filter.
 
-A full explanation is given in "README-dedup.txt".
-
 --------------------------------------------------
 mime.type.magic
 --------------------------------------------------
@@ -205,37 +181,44 @@
 
 The specifications here are of the form:
 
-  src-key:lowercase:store:tokenize:exclusive:dest-key
+  src-key:lowercase:store:index:exclusive:dest-key
 
 where the only required part is the "src-key", the rest will assume
 the following defaults:
 
   lowercase = true
   store     = true
-  tokenize  = false
+  index     = tokenized
   exclusive = true
   dest-key  = src-key
 
+For the 'index' property, the possible values are:
+  tokenized
+  untokenized
+  no_norms
+  no
+
+corresponding to the Lucene options of the same names.
+
 We recommend:
 
 <property>
   <name>nutchwax.filter.index</name>
   <value>
-    url:false:true:true
-    url:false:true:false:true:exacturl
-    orig:false
-    digest:false
-    filename:false
-    fileoffset:false
-    collection
-    date
-    type
-    length
+    title:false:true:tokenized
+    content:false:false:tokenized
+    site:false:false:untokenized
+
+    url:false:true:tokenized
+    digest:false:true:no
+
+    collection:true:true:no_norms
+    date:true:true:no_norms
+    type:true:true:no_norms
+    length:false:true:no
   </value>
 </property>
 
-The "url", "orig" and "digest" values are required, the rest are
-optional, but strongly recommended.
 
 --------------------------------------------------
 nutchwax.filter.query
@@ -274,15 +257,10 @@
 <property>
   <name>nutchwax.filter.query</name>
   <value>
-    raw:digest:false
-    raw:filename:false
-    raw:fileoffset:false
-    raw:exacturl:false
     group:collection
+    group:site:false
     group:type
-    field:anchor
     field:content
-    field:host
     field:title
   </value>
 </property>
@@ -428,3 +406,31 @@
     <value>false</value>
   </property>
 
+
+--------------------------------------------------
+searcher.fieldcache
+--------------------------------------------------
+
+NutchWAX contains a patch controlling the use of a "fieldcache" in the
+Nutch searcher.  Without this patch Nutch will read the entire set of
+hostnames from the index into an in-memory cache.  This cache is then
+consulted when performing de-duplication of results per the
+"hitsPerSite" feature.
+
+For small-to-medium indexes, this can improve performance as the
+de-duplication information is entirely in memory and no disk access is
+required.
+
+However, for large indexes, in the tens of gigabytes in size, reading
+the entire set of hostnames into an in-memory cache can exhaust the
+Java heap.  In this case, omitting the cache altogether and just
+reading the values off disk as needed is better.
+
+The NutchWAX patch controls the use of this cache based on this property
+value.  If set to false, then the cache is not used at all.
+
+<property>
+  <name>searcher.fieldcache</name>
+  <value>false</value>
+</property>
+
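To make the "src-key:lowercase:store:index:exclusive:dest-key" form from the release notes above concrete, here is a minimal sketch in Java of how such a specification could be split and defaulted. The class FieldSpec and its parse method are hypothetical names used only for illustration; this is not NutchWAX's actual parser. Only the order of the parts, the defaults, and the four 'index' values are taken from the notes. Treating an empty middle part as "use the default" is an assumption.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; this class is not part of NutchWAX. */
public class FieldSpec
{
  /** Values the release notes list for the 'index' part. */
  public static final List<String> INDEX_VALUES =
    Arrays.asList( "tokenized", "untokenized", "no_norms", "no" );

  public final String  srcKey;
  public final boolean lowercase;
  public final boolean store;
  public final String  index;
  public final boolean exclusive;
  public final String  destKey;

  private FieldSpec( String srcKey, boolean lowercase, boolean store,
                     String index, boolean exclusive, String destKey )
  {
    this.srcKey    = srcKey;
    this.lowercase = lowercase;
    this.store     = store;
    this.index     = index;
    this.exclusive = exclusive;
    this.destKey   = destKey;
  }

  /** Returns parts[i], or the default if that part is absent or empty (assumption). */
  private static String part( String[] parts, int i, String defaultValue )
  {
    return ( i < parts.length && parts[i].length( ) > 0 ) ? parts[i] : defaultValue;
  }

  /** Parses one specification line, applying the defaults from the notes. */
  public static FieldSpec parse( String spec )
  {
    String[] parts = spec.trim( ).split( ":" );

    // Only the src-key is required.
    String srcKey = part( parts, 0, "" );
    if ( srcKey.length( ) == 0 )
      {
        throw new IllegalArgumentException( "src-key is required: " + spec );
      }

    String index = part( parts, 3, "tokenized" );
    if ( ! INDEX_VALUES.contains( index ) )
      {
        throw new IllegalArgumentException( "Unknown index value: " + index );
      }

    return new FieldSpec( srcKey,
                          Boolean.parseBoolean( part( parts, 1, "true" ) ),
                          Boolean.parseBoolean( part( parts, 2, "true" ) ),
                          index,
                          Boolean.parseBoolean( part( parts, 4, "true" ) ),
                          part( parts, 5, srcKey ) );
  }
}
```

Under these assumptions, FieldSpec.parse( "digest:false:true:no" ) gives lowercase=false, store=true and index=no, with exclusive and dest-key falling back to true and "digest"; a bare "collection" takes every default.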


This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.

[Archive-access-cvs] SF.net SVN: archive-access:[2972] trunk/archive-access/projects/nutchwax/ archive/lib

From: <bi...@us...> - 2010-03-16 21:37:21

Revision: 2972
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2972&view=rev
Author:   binzino
Date:     2010-03-16 21:37:14 +0000 (Tue, 16 Mar 2010)

Log Message:
-----------
Removed unnecessary libraries.

Removed Paths:
-------------
    trunk/archive-access/projects/nutchwax/archive/lib/jdom.LICENSE
    trunk/archive-access/projects/nutchwax/archive/lib/jdom.jar

Deleted: trunk/archive-access/projects/nutchwax/archive/lib/jdom.LICENSE
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/lib/jdom.LICENSE	2010-03-16 21:28:15 UTC (rev 2971)
+++ trunk/archive-access/projects/nutchwax/archive/lib/jdom.LICENSE	2010-03-16 21:37:14 UTC (rev 2972)
@@ -1,56 +0,0 @@
-/*-- 
-
- $Id: LICENSE.txt,v 1.11 2004/02/06 09:32:57 jhunter Exp $
-
- Copyright (C) 2000-2004 Jason Hunter & Brett McLaughlin.
- All rights reserved.
- 
- Redistribution and use in source and binary forms, with or without
- modification, are permitted provided that the following conditions
- are met:
- 
- 1. Redistributions of source code must retain the above copyright
-    notice, this list of conditions, and the following disclaimer.
- 
- 2. Redistributions in binary form must reproduce the above copyright
-    notice, this list of conditions, and the disclaimer that follows 
-    these conditions in the documentation and/or other materials 
-    provided with the distribution.
-
- 3. The name "JDOM" must not be used to endorse or promote products
-    derived from this software without prior written permission.  For
-    written permission, please contact <request_AT_jdom_DOT_org>.
- 
- 4. Products derived from this software may not be called "JDOM", nor
-    may "JDOM" appear in their name, without prior written permission
-    from the JDOM Project Management <request_AT_jdom_DOT_org>.
- 
- In addition, we request (but do not require) that you include in the 
- end-user documentation provided with the redistribution and/or in the 
- software itself an acknowledgement equivalent to the following:
-     "This product includes software developed by the
-      JDOM Project (http://www.jdom.org/)."
- Alternatively, the acknowledgment may be graphical using the logos 
- available at http://www.jdom.org/images/logos.
-
- THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
- WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
- OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- DISCLAIMED.  IN NO EVENT SHALL THE JDOM AUTHORS OR THE PROJECT
- CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
- LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
- USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
- OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
- OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- SUCH DAMAGE.
-
- This software consists of voluntary contributions made by many 
- individuals on behalf of the JDOM Project and was originally 
- created by Jason Hunter <jhunter_AT_jdom_DOT_org> and
- Brett McLaughlin <brett_AT_jdom_DOT_org>.  For more information
- on the JDOM Project, please see <http://www.jdom.org/>. 
-
- */
-

Deleted: trunk/archive-access/projects/nutchwax/archive/lib/jdom.jar
===================================================================
(Binary files differ)



[Archive-access-cvs] SF.net SVN: archive-access:[2971] trunk/archive-access/projects/nutchwax/ archive/src/java/org/archive/nutchwax

From: <bi...@us...> - 2010-03-16 21:28:28

Revision: 2971
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2971&view=rev
Author:   binzino
Date:     2010-03-16 21:28:15 +0000 (Tue, 16 Mar 2010)

Log Message:
-----------
Removed from this release.  Might make a re-appearance in a future release.

Removed Paths:
-------------
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java

Deleted: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java	2010-02-23 00:50:11 UTC (rev 2970)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java	2010-03-16 21:28:15 UTC (rev 2971)
@@ -1,355 +0,0 @@
-/**
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.archive.nutchwax;
-
-import java.io.IOException;
-import java.io.BufferedReader;
-import java.io.InputStreamReader;
-import java.io.FileInputStream;
-import java.util.Comparator;
-import java.util.Collections;
-import java.util.List;
-import java.util.ArrayList;
-import java.util.LinkedList;
-
-import org.apache.commons.logging.Log;
-import org.apache.commons.logging.LogFactory;
-
-import org.jdom.Document;
-import org.jdom.Element;
-import org.jdom.Namespace;
-import org.jdom.output.XMLOutputter;
-
-
-/** 
- * 
- */   
-public class OpenSearchMaster
-{
-  public static final Log LOG = LogFactory.getLog( OpenSearchMaster.class );
-
-  List<OpenSearchSlave> slaves = new ArrayList<OpenSearchSlave>( );
-  long timeout = 0;
-
-  public OpenSearchMaster( String slavesFile, long timeout )
-    throws IOException
-  {
-    this( slavesFile );
-    this.timeout = timeout;
-  }
-
-  public OpenSearchMaster( String slavesFile )
-    throws IOException
-  {
-    BufferedReader r = null;
-    try
-      {
-        r = new BufferedReader( new InputStreamReader( new FileInputStream( slavesFile ), "utf-8" ) );
-
-        String line;
-        while ( (line = r.readLine()) != null )
-          {
-            line = line.trim();
-            if ( line.length() == 0 || line.charAt( 0 ) == '#' )
-              {
-                // Ignore it.
-                continue ;
-              }
-
-            OpenSearchSlave slave = new OpenSearchSlave( line );
-
-            this.slaves.add( slave );            
-          }
-      }
-    finally
-      {
-        try { if ( r != null ) r.close(); } catch ( IOException ioe ) { }
-      }
-    
-  }
-
-  public Document query( String query, int startIndex, int numResults, int hitsPerSite )
-  {
-    long startTime = System.currentTimeMillis( );
-    
-    List<SlaveQueryThread> slaveThreads = new ArrayList<SlaveQueryThread>( this.slaves.size() );
-
-    for ( OpenSearchSlave slave : this.slaves )
-      {
-        SlaveQueryThread sqt = new SlaveQueryThread( slave, query, 0, (startIndex+numResults), hitsPerSite );
-
-        sqt.start( );
-
-        slaveThreads.add( sqt );        
-      }
-
-    waitForThreads( slaveThreads, this.timeout );
-
-    LinkedList<Element> items = new LinkedList<Element>( );
-    long totalResults = 0;
-
-    for ( SlaveQueryThread sqt : slaveThreads )
-      {
-        if ( sqt.throwable != null )
-          {
-            continue ;
-          }
-
-        try
-          {
-            // Dump all the results ("item" elements) into a single list.
-            Element channel = sqt.response.getRootElement( ).getChild( "channel" );
-            items.addAll( (List<Element>) channel.getChildren( "item" ) );
-            channel.removeChildren( "item" );
-            
-            totalResults += Integer.parseInt( channel.getChild( "totalResults", Namespace.getNamespace( "http://a9.com/-/spec/opensearchrss/1.0/" ) ).getTextTrim( ) );
-          }
-        catch ( Exception e ) 
-          {
-            LOG.error( "Error processing response from slave: " + sqt.slave, e );
-          }
-        
-      }
-
-    if ( items.size( ) > 0 && hitsPerSite > 0 )
-      {
-        Collections.sort( items, new ElementSiteThenScoreComparator( ) );
-
-        LinkedList<Element> collapsed = new LinkedList<Element>( );
-        
-        collapsed.add( items.removeFirst( ) );
-        
-        int count = 1;
-        for ( Element item : items )
-          {
-            String lastSite = collapsed.getLast( ).getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ).getTextTrim( );
-
-            if ( lastSite.length( ) == 0 ||
-                 !lastSite.equals( item.getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ).getTextTrim( ) ) )
-              {
-                collapsed.add( item );
-                count = 1;                
-              }
-            else if ( count < hitsPerSite )
-              {
-                collapsed.add( item );
-                count++;
-              }
-          }
-
-        // Replace the list of items with the collapsed list.
-        items = collapsed;
-      }
-
-    Collections.sort( items, new ElementScoreComparator( ) );
-
-    // Build the final results OpenSearch XML document.
-    Element channel = new Element( "channel" );
-    channel.addContent( new Element( "title"       ) );
-    channel.addContent( new Element( "description" ) );
-    channel.addContent( new Element( "link"        ) );
-
-    Element eTotalResults = new Element( "totalResults", Namespace.getNamespace( "http://a9.com/-/spec/opensearchrss/1.0/" ) );
-    Element eStartIndex   = new Element( "startIndex",   Namespace.getNamespace( "http://a9.com/-/spec/opensearchrss/1.0/" ) );
-    Element eItemsPerPage = new Element( "itemsPerPage", Namespace.getNamespace( "http://a9.com/-/spec/opensearchrss/1.0/" ) );
-
-    eTotalResults.setText( Long.toString( totalResults ) );
-    eStartIndex.  setText( Long.toString( startIndex   ) );
-    eItemsPerPage.setText( Long.toString( numResults   ) );
-
-    channel.addContent( eTotalResults );
-    channel.addContent( eStartIndex   );
-    channel.addContent( eItemsPerPage );
-
-    // Get a sub-list of only the items we want: [startIndex,(startIndex+numResults)]
-    List<Element> subList = items.subList( Math.min(  startIndex,             items.size( ) ),
-                                           Math.min( (startIndex+numResults), items.size( ) ) );
-    channel.addContent( subList );
-
-    Element rss = new Element( "rss" );
-    rss.addContent( channel );
-
-    return new Document( rss );
-  }
-
-
-  /**
-   * Convenience method to wait for a collection of threads to complete,
-   * or until a timeout after a startTime expires.
-   */
-  private void waitForThreads( List<SlaveQueryThread> threads, long timeout )
-  {
-    for ( Thread t : threads )
-      {
-        try
-          {
-            t.join( timeout );
-          }
-        catch ( InterruptedException ie ) 
-          {
-            break;
-          }
-      }
-  }
-
-  
-  public static void main( String args[] )
-    throws Exception
-  {
-    String usage = "OpenSearchMaster [OPTIONS] SLAVES.txt query"
-      + "\n\t-h <n>    Hits per site"
-      + "\n\t-n <n>    Number of results"
-      + "\n\t-s <n>    Start index"
-      + "\n";
-    
-    if ( args.length < 2 )
-      {
-        System.err.println( usage );
-        System.exit( 1 );
-      }
-
-    String slavesFile = args[args.length - 2];
-    String query      = args[args.length - 1];
-    
-    int startIndex  = 0;
-    int hitsPerSite = 0;
-    int numHits     = 10;
-    for ( int i = 0 ; i < args.length - 2 ; i++ )
-      {
-        try
-          {
-            if ( "-h".equals( args[i] ) )
-              {
-                i++;
-                hitsPerSite = Integer.parseInt( args[i] );
-              }
-            if ( "-n".equals( args[i] ) )
-              {
-                i++;
-                numHits = Integer.parseInt( args[i] );
-              }
-            if ( "-s".equals( args[i] ) )
-              {
-                i++;
-                startIndex = Integer.parseInt( args[i] );
-              }
-          }
-        catch ( NumberFormatException nfe ) 
-          {
-            System.err.println( "Error: not a numeric value: " + args[i] );
-            System.err.println( usage );
-            System.exit( 1 );
-          }
-      }
-
-    OpenSearchMaster master = new OpenSearchMaster( slavesFile );
-
-    Document doc = master.query( query, startIndex, numHits, hitsPerSite );
-
-    (new XMLOutputter()).output( doc, System.out );
-  }
-
-}
-
-
-class SlaveQueryThread extends Thread
-{
-  OpenSearchSlave slave;
-
-  String query;
-  int    startIndex;
-  int    numResults;
-  int    hitsPerSite;
-
-  Document        response;
-  Throwable       throwable;
-
-
-  SlaveQueryThread( OpenSearchSlave slave, String query, int startIndex, int numResults, int hitsPerSite )
-  {
-    this.slave       = slave;
-    this.query       = query;
-    this.startIndex  = startIndex;
-    this.numResults  = numResults;
-    this.hitsPerSite = hitsPerSite;
-  }
-
-  public void run( )
-  {
-    try
-      {
-        this.response = this.slave.query( this.query, this.startIndex, this.numResults, this.hitsPerSite );
-      }
-    catch ( Throwable t )
-      {
-        this.throwable = t;
-      }
-  }
-}
-
-
-class ElementScoreComparator implements Comparator<Element>
-{
-  public int compare( Element e1, Element e2 )
-  {
-    if ( e1 == e2 )   return 0;
-    if ( e1 == null ) return 1;
-    if ( e2 == null ) return -1;
-
-    Element score1 = e1.getChild( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" )  );
-    Element score2 = e2.getChild( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" )  );
-
-    if ( score1 == score2 ) return 0;
-    if ( score1 == null )   return 1;
-    if ( score2 == null )   return -1;
-
-    String text1 = score1.getText().trim();
-    String text2 = score2.getText().trim();
-
-    float value1 = 0.0f;
-    float value2 = 0.0f;
-
-    try { value1 = Float.parseFloat( text1 ); } catch ( NumberFormatException nfe ) { }
-    try { value2 = Float.parseFloat( text2 ); } catch ( NumberFormatException nfe ) { }
-
-    if ( value1 == value2 ) return 0;
-
-    return value1 > value2 ? -1 : 1;
-  }
-}
-
-class ElementSiteThenScoreComparator extends ElementScoreComparator
-{
-  public int compare( Element e1, Element e2 )
-  {
-    if ( e1 == e2 )   return 0;
-    if ( e1 == null ) return 1;
-    if ( e2 == null ) return -1;
-
-    String site1 = e1.getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ).getTextTrim();
-    String site2 = e2.getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ).getTextTrim();
-    
-    if ( site1.equals( site2 ) )
-      {
-        // Sites are equal, then compare scores.
-        return super.compare( e1, e2 );
-      }
-
-    return site1.compareTo( site2 );
-  }
-}
\ No newline at end of file
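A note on waitForThreads in the file above: its comment describes waiting "until a timeout after a startTime expires", but each t.join( timeout ) call gets the full timeout, so with N slaves the worst-case wait is N times the timeout, and a timeout of 0 makes Thread.join wait indefinitely. The following is a minimal sketch of a single shared deadline, assuming that is the intended behaviour. The class DeadlineJoin and the method joinAll are hypothetical names, not part of NutchWAX.

```java
import java.util.List;

/** Hypothetical illustration only; this class is not part of NutchWAX. */
public class DeadlineJoin
{
  /**
   * Waits for all threads, but no longer than timeoutMillis in total.
   * A timeout of 0 or less means wait indefinitely, mirroring Thread.join( 0 ).
   * Returns true if every thread had finished when the wait ended.
   */
  public static boolean joinAll( List<? extends Thread> threads, long timeoutMillis )
    throws InterruptedException
  {
    // One deadline shared by all threads, instead of a fresh timeout per thread.
    long deadline = System.nanoTime( ) + timeoutMillis * 1000000L;

    for ( Thread t : threads )
      {
        if ( timeoutMillis <= 0 )
          {
            t.join( );
            continue ;
          }

        long remaining = ( deadline - System.nanoTime( ) ) / 1000000L;
        if ( remaining <= 0 )
          {
            // Budget used up.  Calling join( 0 ) here would wait forever.
            break ;
          }

        t.join( remaining );
      }

    for ( Thread t : threads )
      {
        if ( t.isAlive( ) ) return false;
      }

    return true;
  }
}
```

A caller could use the boolean result to log which slaves were still running when the deadline passed, rather than silently treating them as having returned nothing.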

Deleted: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java	2010-02-23 00:50:11 UTC (rev 2970)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java	2010-03-16 21:28:15 UTC (rev 2971)
@@ -1,148 +0,0 @@
-/**
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.archive.nutchwax;
-
-import java.io.BufferedReader;
-import java.io.FileInputStream;
-import java.io.IOException;
-import java.io.InputStreamReader;
-import java.util.ArrayList;
-import java.util.List;
-import java.util.Map;
-import javax.servlet.ServletConfig;
-import javax.servlet.ServletException;
-import javax.servlet.http.HttpServlet;
-import javax.servlet.http.HttpServletRequest;
-import javax.servlet.http.HttpServletResponse;
-
-import org.jdom.Document;
-import org.jdom.Element;
-import org.jdom.Namespace;
-import org.jdom.output.XMLOutputter;
-
-/** 
- * 
- */   
-public class OpenSearchMasterServlet extends HttpServlet 
-{
-  OpenSearchMaster master;
-  
-  int hitsPerSite = 0;
-
-  public void init( ServletConfig config )
-    throws ServletException 
-  {
-    String slavesFile = config.getInitParameter( "slaves" );
-
-    if ( slavesFile == null || slavesFile.trim().length() == 0 )
-      {
-        throw new ServletException( "Required init parameter missing: slaves" );
-      }
-
-    int timeout     = getInteger( config.getInitParameter( "timeout"     ), 0 );
-    int hitsPerSite = getInteger( config.getInitParameter( "hitsPerSite" ), 0 );
-
-    try
-      {
-        this.master = new OpenSearchMaster( slavesFile, timeout );
-      }
-    catch ( IOException ioe )
-      {
-        throw new ServletException( ioe );
-      }
-    
-  }
-
-  public void destroy( )
-  {
-    
-  }
-
-  public void doGet( HttpServletRequest request, HttpServletResponse response )
-    throws ServletException, IOException 
-  {
-    long responseTime = System.nanoTime( );
-
-    request.setCharacterEncoding( "UTF-8" );
-
-    String query       = getString ( request.getParameter( "query" ), "" );
-    int    startIndex  = getInteger( request.getParameter( "start" ), 0  );
-    int    numHits     = getInteger( request.getParameter( "hitsPerPage" ), 10 );
-    int    hitsPerSite = getInteger( request.getParameter( "hitsPerSite" ), this.hitsPerSite );
-
-    Document doc = this.master.query( query, startIndex, numHits, hitsPerSite );
-
-    Element eUrlParams = new Element( "urlParams", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) );
-
-    for ( Map.Entry<String,String[]> e : ((Map<String,String[]>) request.getParameterMap( )).entrySet( ) )
-      {
-        String key = e.getKey( );
-        for ( String value : e.getValue( ) )
-          {
-            Element eParam = new Element( "param", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) );
-            eParam.setAttribute( "name",  key   );
-            eParam.setAttribute( "value", value );
-            eUrlParams.addContent( eParam );
-          }
-      }
-
-    doc.getRootElement( ).getChild( "channel" ).addContent( eUrlParams );
-
-    (new XMLOutputter()).output( doc, response.getOutputStream( ) );
-  }
-
-  String getString ( String value, String defaultValue )
-  {
-    if ( value != null )
-      {
-        value = value.trim();
-
-        if ( value.length( ) != 0 )
-          {
-            return value;
-          }
-      }
-    
-    return defaultValue;
-  }
-
-  int getInteger( String value, int defaultValue )
-  {
-    if ( value != null )
-      {
-        value = value.trim();
-        
-        if ( value.length( ) != 0 )
-          {
-            try
-              {
-                int i = Integer.parseInt( value );
-
-                return i;
-              }
-            catch ( NumberFormatException nfe )
-              {
-                // TODO: log?
-              }
-          }
-      }
-    
-    return defaultValue;
-  }
-
-}

Deleted: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java	2010-02-23 00:50:11 UTC (rev 2970)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java	2010-03-16 21:28:15 UTC (rev 2971)
@@ -1,218 +0,0 @@
-/**
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.archive.nutchwax;
-
-import java.io.IOException;
-import java.io.InputStream;
-import java.io.UnsupportedEncodingException;
-import java.net.HttpURLConnection;
-import java.net.MalformedURLException;
-import java.net.URL;
-import java.net.URLConnection;
-import java.net.URLEncoder;
-import java.util.List;
-
-import org.apache.commons.logging.Log;
-import org.apache.commons.logging.LogFactory;
-
-import org.jdom.Document;
-import org.jdom.Element;
-import org.jdom.Namespace;
-import org.jdom.input.SAXBuilder;
-import org.jdom.output.XMLOutputter;
-
-/** 
- * 
- */   
-public class OpenSearchSlave
-{
-  public static final Log LOG = LogFactory.getLog( OpenSearchSlave.class );
-
-  private String urlTemplate;
-
-  public OpenSearchSlave( String urlTemplate )
-  {
-    this.urlTemplate = urlTemplate;
-  }
-
-  public Document query( String query, int startIndex, int requestedNumResults, int hitsPerSite )
-    throws Exception
-  {
-    URL url = buildRequestUrl( query, startIndex, requestedNumResults, hitsPerSite );
-    
-    InputStream is = null;
-    try
-      {
-        LOG.info( "Querying slave: " + url );
-
-        is = getInputStream( url );
-        
-        Document doc = (new SAXBuilder()).build( is );
-
-        doc = validate( doc );
-
-        return doc;
-      }
-    catch ( Exception e )
-      {
-        LOG.error( url.toString(), e );
-        throw e;
-      }
-    finally
-      {
-        // Ensure the InputStream is closed, which should trigger the
-        // underlying HTTP connection to be cleaned-up.
-        try { if ( is != null ) is.close( ); } catch ( IOException ioe ) { } // Not much we can do
-      }
-  }
-
-  private Document validate( Document doc )
-    throws Exception
-  {
-    if ( doc.getRootElement( ) == null ) throw new Exception( "Invalid OpenSearch response: missing /rss" );
-    Element root = doc.getRootElement( );
-    
-    if ( ! "rss".equals( root.getName( ) ) ) throw new Exception( "Invalid OpenSearch response: missing /rss" );
-    Element channel = root.getChild( "channel" );
-    
-    if ( channel == null ) throw new Exception( "Invalid OpenSearch response: missing /rss/channel" );
-
-    for ( Element item : (List<Element>) channel.getChildren( "item" ) )
-      {
-        Element site = item.getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) );
-        if ( site == null )
-          {
-            item.addContent( new Element( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ) );
-          }
-        
-        Element score = item.getChild( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) );
-        if ( score == null )
-          {
-            item.addContent( new Element( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ) );
-          }
-      }
-
-    return doc;
-  }
-
-  /**
-   * 
-   */
-  public URL buildRequestUrl( String query, int startIndex, int requestedNumResults, int hitsPerSite )
-    throws MalformedURLException, UnsupportedEncodingException
-  {
-    String url = this.urlTemplate;
-    
-    // Note about replaceAll: In the Java regex library, the replacement string has a few
-    // special characters: \ and $.  Forunately, since we URL-encode the replacement string,
-    // any occurance of \ or $ is converted to %xy form.  So we don't have to worry about it. :)
-    url = url.replaceAll( "[{]searchTerms[}]", URLEncoder.encode( query, "utf-8" ) );
-    url = url.replaceAll( "[{]count[}]"      , String.valueOf( requestedNumResults ) );
-    url = url.replaceAll( "[{]startIndex[}]" , String.valueOf( startIndex ) );
-    url = url.replaceAll( "[{]hitsPerSite[}]", String.valueOf( hitsPerSite ) );
-
-    // We don't know about any optional parameters, so we remove them (per the OpenSearch spec.)
-    url = url.replaceAll( "[{][^}]+[?][}]", "" );
-    
-    return new URL( url );
-  }
-
-
-  public InputStream getInputStream( URL url )
-    throws IOException
-  {
-    URLConnection connection = url.openConnection( );
-    connection.setDoOutput( false );
-    connection.setRequestProperty( "User-Agent", "Mozilla/4.0 (compatible; NutchWAX OpenSearchMaster)" );
-    connection.connect( );
-
-    if ( connection instanceof HttpURLConnection )
-      {
-        HttpURLConnection hc = (HttpURLConnection) connection;
-
-        switch ( hc.getResponseCode( ) )
-          {
-          case 200:
-            // All good.
-            break;
-          default:
-            // Problems!  Bail out.
-            throw new IOException( "HTTP error from " + url + ": " + hc.getResponseMessage( ) );
-          }
-      }
-
-    InputStream is = connection.getInputStream( );
-
-    return is;
-  }
-
-  public String toString()
-  {
-    return this.urlTemplate;
-  }
-
-  public static void main( String args[] )
-    throws Exception
-  {
-    String usage = "OpenSearchSlave [OPTIONS] urlTemplate query"
-      + "\n\t-h <n>   Hits per site"
-      + "\n\t-n <n>   Number of results"
-      + "\n";
-
-    if ( args.length < 2 )
-      {
-        System.err.println( usage );
-        System.exit( 1 );
-      }
-
-    String urlTemplate = args[args.length - 2];
-    String query       = args[args.length - 1];
-
-    int hitsPerSite = 0;
-    int numHits     = 10;
-    for ( int i = 0 ; i < args.length - 2 ; i++ )
-      {
-        try
-          {
-            if ( "-h".equals( args[i] ) )
-              {
-                i++;
-                hitsPerSite = Integer.parseInt( args[i] );
-              }
-            if ( "-n".equals( args[i] ) )
-              {
-                i++;
-                numHits = Integer.parseInt( args[i] );
-              }
-          }
-        catch ( NumberFormatException nfe ) 
-          {
-            System.err.println( "Error: not a numeric value: " + args[i] );
-            System.err.println( usage );
-            System.exit( 1 );
-          }
-      }
-
-    OpenSearchSlave osl = new OpenSearchSlave( urlTemplate );
-    
-    Document doc = osl.query( query, 0, numHits, hitsPerSite );
-
-    (new XMLOutputter()).output( doc, System.out );
-  }
-
-}
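
The template substitution in the removed code above relies on URL-encoding to keep `\` and `$` out of the `replaceAll` replacement string. A minimal sketch of the same idea using literal replacement, so that no escaping is needed in the first place, might look like the following. The class name `TemplateFillSketch` and the optional `{language?}` parameter are illustrative assumptions, not part of the commit; the template URL is the one from slaves.txt.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class TemplateFillSketch
{
  // Fill an OpenSearch-style URL template.  (Hypothetical helper, not from the commit.)
  static String fill( String template, String query, int startIndex, int count, int hitsPerSite )
  {
    String encoded;
    try
      {
        encoded = URLEncoder.encode( query, "utf-8" );
      }
    catch ( UnsupportedEncodingException uee )
      {
        // utf-8 is always supported by the JVM, so this should not happen.
        throw new IllegalStateException( uee );
      }

    String url = template;

    // String.replace() treats both arguments literally, so '\' and '$'
    // have no special meaning here, unlike in replaceAll().
    url = url.replace( "{searchTerms}", encoded );
    url = url.replace( "{count}",       String.valueOf( count ) );
    url = url.replace( "{startIndex}",  String.valueOf( startIndex ) );
    url = url.replace( "{hitsPerSite}", String.valueOf( hitsPerSite ) );

    // Remove any remaining optional parameters of the form {name?}.
    url = url.replaceAll( "[{][^}]+[?][}]", "" );

    return url;
  }

  public static void main( String[] args )
  {
    // "{language?}" is an assumed optional parameter, added only to show the removal step.
    String template = "http://localhost:8080/nw/opensearch?query={searchTerms}&start={startIndex}"
      + "&hitsPerPage={count}&hitsPerSite={hitsPerSite}&lang={language?}";

    System.out.println( fill( template, "cost $5 \\ more", 0, 10, 1 ) );
  }
}
```

The query is still URL-encoded because the result has to be a valid URL; the encoding is just no longer doing double duty as regex-replacement escaping.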


This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.

[Archive-access-cvs] SF.net SVN: archive-access:[2969] trunk/archive-access/projects/nutchwax/ archive/src/java/org/archive/nutchwax/OpenSearchSlave.java

From: <bi...@us...> - 2010-02-23 00:50:21

Revision: 2969
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2969&view=rev
Author:   binzino
Date:     2010-02-23 00:25:39 +0000 (Tue, 23 Feb 2010)

Log Message:
-----------
Simplified addition of empty <score/> element if there is no score.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java	2010-02-22 22:39:00 UTC (rev 2968)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java	2010-02-23 00:25:39 UTC (rev 2969)
@@ -91,10 +91,7 @@
         Element score = item.getChild( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) );
         if ( score == null )
           {
-            score = new Element( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) );
-            score.setText( "" );
-
-            item.addContent( score );
+            item.addContent( new Element( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ) );
           }
       }
 
@@ -206,4 +203,4 @@
     (new XMLOutputter()).output( doc, System.out );
   }
 
-}
\ No newline at end of file
+}



[Archive-access-cvs] SF.net SVN: archive-access:[2970] trunk/archive-access/projects/nutchwax/ archive/src/java/org/archive/nutchwax

From: <bi...@us...> - 2010-02-23 00:50:17

Revision: 2970
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2970&view=rev
Author:   binzino
Date:     2010-02-23 00:50:11 +0000 (Tue, 23 Feb 2010)

Log Message:
-----------
Additional logging, especially for error conditions.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java	2010-02-23 00:25:39 UTC (rev 2969)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java	2010-02-23 00:50:11 UTC (rev 2970)
@@ -27,6 +27,9 @@
 import java.util.ArrayList;
 import java.util.LinkedList;
 
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+
 import org.jdom.Document;
 import org.jdom.Element;
 import org.jdom.Namespace;
@@ -38,8 +41,10 @@
  */   
 public class OpenSearchMaster
 {
+  public static final Log LOG = LogFactory.getLog( OpenSearchMaster.class );
+
   List<OpenSearchSlave> slaves = new ArrayList<OpenSearchSlave>( );
-  long timeout = 30 * 1000;
+  long timeout = 0;
 
   public OpenSearchMaster( String slavesFile, long timeout )
     throws IOException
@@ -102,22 +107,21 @@
       {
         if ( sqt.throwable != null )
           {
-            // TODO: Handle problems with slaves
             continue ;
           }
 
-        // Dump all the results ("item" elements) into a single list.
-        Element channel = sqt.response.getRootElement( ).getChild( "channel" );
-        items.addAll( (List<Element>) channel.getChildren( "item" ) );
-        channel.removeChildren( "item" );
-
         try
           {
+            // Dump all the results ("item" elements) into a single list.
+            Element channel = sqt.response.getRootElement( ).getChild( "channel" );
+            items.addAll( (List<Element>) channel.getChildren( "item" ) );
+            channel.removeChildren( "item" );
+            
             totalResults += Integer.parseInt( channel.getChild( "totalResults", Namespace.getNamespace( "http://a9.com/-/spec/opensearchrss/1.0/" ) ).getTextTrim( ) );
           }
         catch ( Exception e ) 
           {
-            // TODO: Log error getting total.
+            LOG.error( "Error processing response from slave: " + sqt.slave, e );
           }
         
       }
@@ -146,10 +150,6 @@
                 collapsed.add( item );
                 count++;
               }
-            else
-              {
-                // TODO: Log collapse of item.
-              }
           }
 
         // Replace the list of items with the collapsed list.

Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java	2010-02-23 00:25:39 UTC (rev 2969)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java	2010-02-23 00:50:11 UTC (rev 2970)
@@ -27,6 +27,9 @@
 import java.net.URLEncoder;
 import java.util.List;
 
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+
 import org.jdom.Document;
 import org.jdom.Element;
 import org.jdom.Namespace;
@@ -38,6 +41,8 @@
  */   
 public class OpenSearchSlave
 {
+  public static final Log LOG = LogFactory.getLog( OpenSearchSlave.class );
+
   private String urlTemplate;
 
   public OpenSearchSlave( String urlTemplate )
@@ -53,6 +58,8 @@
     InputStream is = null;
     try
       {
+        LOG.info( "Querying slave: " + url );
+
         is = getInputStream( url );
         
         Document doc = (new SAXBuilder()).build( is );
@@ -61,6 +68,11 @@
 
         return doc;
       }
+    catch ( Exception e )
+      {
+        LOG.error( url.toString(), e );
+        throw e;
+      }
     finally
       {
         // Ensure the InputStream is closed, which should trigger the



[Archive-access-cvs] SF.net SVN: archive-access:[2968] trunk/archive-access/projects/nutchwax/ archive/src/nutch/build.xml

From: <bi...@us...> - 2010-02-22 22:39:06

Revision: 2968
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2968&view=rev
Author:   binzino
Date:     2010-02-22 22:39:00 +0000 (Mon, 22 Feb 2010)

Log Message:
-----------
Add jdom.jar to .war file.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/nutch/build.xml

Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/build.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/build.xml	2010-02-22 22:28:00 UTC (rev 2967)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/build.xml	2010-02-22 22:39:00 UTC (rev 2968)
@@ -193,6 +193,7 @@
         <include name="commons-lang-*.jar"/>
         <include name="commons-logging-*.jar"/>
         <include name="log4j-*.jar"/>
+        <include name="jdom*.jar"/>
       </lib>
       <lib dir="${build.dir}">
 	      <include name="${final.name}.jar"/>



[Archive-access-cvs] SF.net SVN: archive-access:[2967] trunk/archive-access/projects/nutchwax/ archive/src/nutch/src/web/jsp/slaves.txt

From: <bi...@us...> - 2010-02-22 22:28:06

Revision: 2967
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2967&view=rev
Author:   binzino
Date:     2010-02-22 22:28:00 +0000 (Mon, 22 Feb 2010)

Log Message:
-----------
Initial revision.

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/slaves.txt

Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/slaves.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/slaves.txt	                        (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/slaves.txt	2010-02-22 22:28:00 UTC (rev 2967)
@@ -0,0 +1 @@
+http://localhost:8080/nw/opensearch?query={searchTerms}&start={startIndex}&hitsPerPage={count}&hitsPerSite={hitsPerSite}



[Archive-access-cvs] SF.net SVN: archive-access:[2966] trunk/archive-access/projects/nutchwax/ archive/src/nutch/src/web/web.xml

From: <bi...@us...> - 2010-02-22 22:27:44

Revision: 2966
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2966&view=rev
Author:   binzino
Date:     2010-02-22 22:27:37 +0000 (Mon, 22 Feb 2010)

Log Message:
-----------
Added configuration of OpenSearchMasterServlet.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml

Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml	2010-02-22 22:25:58 UTC (rev 2965)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml	2010-02-22 22:27:37 UTC (rev 2966)
@@ -20,31 +20,25 @@
 -->
 <web-app>
 
-<!-- order is very important here -->
-
 <listener>
   <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
 </listener>
-<listener>
-  <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class>
-</listener>
 
 <servlet>
-  <servlet-name>Cached</servlet-name>
-  <servlet-class>org.apache.nutch.servlet.Cached</servlet-class>
+  <servlet-name>OpenSearch</servlet-name>
+  <servlet-class>org.archive.nutchwax.OpenSearchServlet</servlet-class>
 </servlet>
 
 <servlet>
-  <servlet-name>OpenSearch</servlet-name>
-  <servlet-class>org.archive.nutchwax.OpenSearchServlet</servlet-class>
+  <servlet-name>OpenSearchMaster</servlet-name>
+  <servlet-class>org.archive.nutchwax.OpenSearchMasterServlet</servlet-class>
+  <init-param>
+    <param-name>slaves</param-name>
+    <param-value>webapps/nw/slaves.txt</param-value>
+  </init-param>
 </servlet>
 
 <servlet-mapping>
-  <servlet-name>Cached</servlet-name>
-  <url-pattern>/servlet/cached</url-pattern>
-</servlet-mapping>
-
-<servlet-mapping>
   <servlet-name>OpenSearch</servlet-name>
   <url-pattern>/opensearch</url-pattern>
 </servlet-mapping>
@@ -54,12 +48,22 @@
   <url-pattern>/search</url-pattern>
 </servlet-mapping>
 
+<servlet-mapping>
+  <servlet-name>OpenSearchMaster</servlet-name>
+  <url-pattern>/mopensearch</url-pattern>
+</servlet-mapping>
+
+<servlet-mapping>
+  <servlet-name>OpenSearchMaster</servlet-name>
+  <url-pattern>/msearch</url-pattern>
+</servlet-mapping>
+
 <filter>
   <filter-name>XSLT Filter</filter-name>
   <filter-class>org.archive.nutchwax.XSLTFilter</filter-class>
   <init-param>
     <param-name>xsltUrl</param-name>
-    <param-value>webapps/nutchwax-0.12.4/search.xsl</param-value>
+    <param-value>webapps/nw/search.xsl</param-value>
   </init-param>
 </filter>
 
@@ -68,6 +72,11 @@
   <url-pattern>/search</url-pattern>
 </filter-mapping>
 
+<filter-mapping>
+  <filter-name>XSLT Filter</filter-name>
+  <url-pattern>/msearch</url-pattern>
+</filter-mapping>
+
 <welcome-file-list>
   <welcome-file>search.html</welcome-file>
   <welcome-file>index.html</welcome-file>



[Archive-access-cvs] SF.net SVN: archive-access:[2965] trunk/archive-access/projects/nutchwax/ archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java

From: <bi...@us...> - 2010-02-22 22:26:05

Revision: 2965
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2965&view=rev
Author:   binzino
Date:     2010-02-22 22:25:58 +0000 (Mon, 22 Feb 2010)

Log Message:
-----------
Initial fully functional revision.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java	2010-02-22 22:20:45 UTC (rev 2964)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java	2010-02-22 22:25:58 UTC (rev 2965)
@@ -17,36 +17,132 @@
 
 package org.archive.nutchwax;
 
+import java.io.BufferedReader;
+import java.io.FileInputStream;
 import java.io.IOException;
-import java.io.BufferedReader;
 import java.io.InputStreamReader;
-import java.io.FileInputStream;
+import java.util.ArrayList;
 import java.util.List;
-import java.util.ArrayList;
+import java.util.Map;
+import javax.servlet.ServletConfig;
 import javax.servlet.ServletException;
-import javax.servlet.ServletConfig;
 import javax.servlet.http.HttpServlet;
 import javax.servlet.http.HttpServletRequest;
 import javax.servlet.http.HttpServletResponse;
 
+import org.jdom.Document;
+import org.jdom.Element;
+import org.jdom.Namespace;
+import org.jdom.output.XMLOutputter;
 
 /** 
  * 
  */   
 public class OpenSearchMasterServlet extends HttpServlet 
 {
+  OpenSearchMaster master;
+  
+  int hitsPerSite = 0;
 
   public void init( ServletConfig config )
     throws ServletException 
   {
+    String slavesFile = config.getInitParameter( "slaves" );
+
+    if ( slavesFile == null || slavesFile.trim().length() == 0 )
+      {
+        throw new ServletException( "Required init parameter missing: slaves" );
+      }
+
+    int timeout     = getInteger( config.getInitParameter( "timeout"     ), 0 );
+    int hitsPerSite = getInteger( config.getInitParameter( "hitsPerSite" ), 0 );
+
+    try
+      {
+        this.master = new OpenSearchMaster( slavesFile, timeout );
+      }
+    catch ( IOException ioe )
+      {
+        throw new ServletException( ioe );
+      }
     
+  }
+
+  public void destroy( )
+  {
     
   }
 
   public void doGet( HttpServletRequest request, HttpServletResponse response )
     throws ServletException, IOException 
   {
+    long responseTime = System.nanoTime( );
 
+    request.setCharacterEncoding( "UTF-8" );
+
+    String query       = getString ( request.getParameter( "query" ), "" );
+    int    startIndex  = getInteger( request.getParameter( "start" ), 0  );
+    int    numHits     = getInteger( request.getParameter( "hitsPerPage" ), 10 );
+    int    hitsPerSite = getInteger( request.getParameter( "hitsPerSite" ), this.hitsPerSite );
+
+    Document doc = this.master.query( query, startIndex, numHits, hitsPerSite );
+
+    Element eUrlParams = new Element( "urlParams", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) );
+
+    for ( Map.Entry<String,String[]> e : ((Map<String,String[]>) request.getParameterMap( )).entrySet( ) )
+      {
+        String key = e.getKey( );
+        for ( String value : e.getValue( ) )
+          {
+            Element eParam = new Element( "param", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) );
+            eParam.setAttribute( "name",  key   );
+            eParam.setAttribute( "value", value );
+            eUrlParams.addContent( eParam );
+          }
+      }
+
+    doc.getRootElement( ).getChild( "channel" ).addContent( eUrlParams );
+
+    (new XMLOutputter()).output( doc, response.getOutputStream( ) );
   }
 
+  String getString ( String value, String defaultValue )
+  {
+    if ( value != null )
+      {
+        value = value.trim();
+
+        if ( value.length( ) != 0 )
+          {
+            return value;
+          }
+      }
+    
+    return defaultValue;
+  }
+
+  int getInteger( String value, int defaultValue )
+  {
+    if ( value != null )
+      {
+        value = value.trim();
+        
+        if ( value.length( ) != 0 )
+          {
+            try
+              {
+                int i = Integer.parseInt( value );
+
+                return i;
+              }
+            catch ( NumberFormatException nfe )
+              {
+                // TODO: log?
+              }
+          }
+      }
+    
+    return defaultValue;
+  }
+
 }
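
The `getInteger` helper above falls back to a default when a parameter is missing, blank, or not numeric. A stand-alone sketch of that pattern is shown below. One thing worth noting from the diff: `init()` declares a local `int hitsPerSite`, which appears to shadow the field of the same name, so the configured value may never reach `doGet()`. The sketch assigns to the field instead. The class name `ParamDefaultsSketch` and the simplified `init( String )` signature are assumptions made for illustration only.

```java
class ParamDefaultsSketch
{
  // Field that request handling would later read as its default.
  int hitsPerSite = 0;

  // Parse an integer parameter, returning defaultValue when the input
  // is null, blank, or not a valid integer.
  static int getInteger( String value, int defaultValue )
  {
    if ( value == null )
      {
        return defaultValue;
      }

    value = value.trim( );

    if ( value.length( ) == 0 )
      {
        return defaultValue;
      }

    try
      {
        return Integer.parseInt( value );
      }
    catch ( NumberFormatException nfe )
      {
        return defaultValue;
      }
  }

  // Simplified stand-in for init( ServletConfig ): takes the raw parameter value directly.
  void init( String hitsPerSiteParam )
  {
    // Assign to the field via "this." rather than declaring a new local variable.
    this.hitsPerSite = getInteger( hitsPerSiteParam, 0 );
  }

  public static void main( String[] args )
  {
    ParamDefaultsSketch s = new ParamDefaultsSketch( );
    s.init( " 3 " );
    System.out.println( "hitsPerSite field: " + s.hitsPerSite );
    System.out.println( "bad input falls back to: " + getInteger( "abc", 10 ) );
  }
}
```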



[Archive-access-cvs] SF.net SVN: archive-access:[2964] trunk/archive-access/projects/nutchwax/ archive/src/java/org/archive/nutchwax/OpenSearchMaster.java

From: <bi...@us...> - 2010-02-22 22:20:51

Revision: 2964
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2964&view=rev
Author:   binzino
Date:     2010-02-22 22:20:45 +0000 (Mon, 22 Feb 2010)

Log Message:
-----------
Added use of namespace when processing 'score' elements.  Fixed timeout handling to allow for unlimited timeout.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java	2010-02-22 22:19:42 UTC (rev 2963)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java	2010-02-22 22:20:45 UTC (rev 2964)
@@ -93,7 +93,7 @@
         slaveThreads.add( sqt );        
       }
 
-    waitForThreads( slaveThreads, this.timeout, startTime );
+    waitForThreads( slaveThreads, this.timeout );
 
     LinkedList<Element> items = new LinkedList<Element>( );
     long totalResults = 0;
@@ -192,22 +192,13 @@
    * Convenience method to wait for a collection of threads to complete,
    * or until a timeout after a startTime expires.
    */
-  private void waitForThreads( List<SlaveQueryThread> threads, long timeout, long startTime )
+  private void waitForThreads( List<SlaveQueryThread> threads, long timeout )
   {
     for ( Thread t : threads )
       {
-        long timeRemaining = timeout - (System.currentTimeMillis( ) - startTime);
-        
-        // If we are out of time, don't wait for any more threads.
-        if ( timeRemaining <= 0 )
-          {
-            break; 
-          }
-        
-        // Otherwise, wait for the next unfinished thread to finish.
         try
           {
-            t.join( timeRemaining );
+            t.join( timeout );
           }
         catch ( InterruptedException ie ) 
           {
@@ -320,8 +311,8 @@
     if ( e1 == null ) return 1;
     if ( e2 == null ) return -1;
 
-    Element score1 = e1.getChild( "score" );
-    Element score2 = e2.getChild( "score" );
+    Element score1 = e1.getChild( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" )  );
+    Element score2 = e2.getChild( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" )  );
 
     if ( score1 == score2 ) return 0;
     if ( score1 == null )   return 1;



[Archive-access-cvs] SF.net SVN: archive-access:[2963] trunk/archive-access/projects/nutchwax/ archive/src/java/org/archive/nutchwax/OpenSearchServlet.java

From: <bi...@us...> - 2010-02-22 22:19:48

Revision: 2963
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2963&view=rev
Author:   binzino
Date:     2010-02-22 22:19:42 +0000 (Mon, 22 Feb 2010)

Log Message:
-----------
Removed extra 'nutch:' prefix from urlParams and param elements in output.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java	2010-02-22 05:18:57 UTC (rev 2962)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java	2010-02-22 22:19:42 UTC (rev 2963)
@@ -201,7 +201,7 @@
       addNode(doc, channel, "nutch", "responseTime", Double.toString( ((long) responseTime / 1000 / 1000 ) / 1000.0 ) );
 
       // Add a <nutch:urlParams> element containing a list of all the URL parameters.
-      Element urlParams = doc.createElementNS( NS_MAP.get("nutch"), "nutch:urlParams" );
+      Element urlParams = doc.createElementNS( NS_MAP.get("nutch"), "urlParams" );
       channel.appendChild( urlParams );
 
       for ( Map.Entry<String,String[]> e : ((Map<String,String[]>) request.getParameterMap( )).entrySet( ) )
@@ -209,7 +209,7 @@
           String key = e.getKey( );
           for ( String value : e.getValue( ) )
             {
-              Element urlParam = doc.createElementNS(NS_MAP.get("nutch"), "nutch:param" );
+              Element urlParam = doc.createElementNS(NS_MAP.get("nutch"), "param" );
               addAttribute( doc, urlParam, "name",  key   );
               addAttribute( doc, urlParam, "value", value );
               urlParams.appendChild(urlParam);
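
The second argument to `createElementNS` is a qualified name, so any `prefix:` part becomes the element's prefix while the namespace URI comes from the first argument. The commit message does not say how the extra prefix showed up in the output, so the sketch below only illustrates the standard DOM behaviour of the two forms. The class name `QualifiedNameSketch` is hypothetical; the namespace URI is the one used elsewhere in these diffs.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class QualifiedNameSketch
{
  static final String NUTCH_NS = "http://www.nutch.org/opensearchrss/1.0/";

  // Create an empty, namespace-aware DOM document.
  static Document newDocument( )
  {
    try
      {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance( );
        factory.setNamespaceAware( true );
        return factory.newDocumentBuilder( ).newDocument( );
      }
    catch ( ParserConfigurationException pce )
      {
        throw new IllegalStateException( pce );
      }
  }

  public static void main( String[] args )
  {
    Document doc = newDocument( );

    // Qualified name with a prefix: the prefix is stored on the element.
    Element prefixed = doc.createElementNS( NUTCH_NS, "nutch:urlParams" );

    // Qualified name without a prefix: same namespace, no prefix.
    Element plain = doc.createElementNS( NUTCH_NS, "urlParams" );

    System.out.println( "prefixed: prefix=" + prefixed.getPrefix( )
                        + " localName=" + prefixed.getLocalName( )
                        + " nodeName=" + prefixed.getNodeName( ) );
    System.out.println( "plain:    prefix=" + plain.getPrefix( )
                        + " localName=" + plain.getLocalName( )
                        + " nodeName=" + plain.getNodeName( ) );
  }
}
```

Both elements are in the same namespace and have the same local name; only the prefix carried by the node differs.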



[Archive-access-cvs] SF.net SVN: archive-access:[2962] trunk/archive-access/projects/nutchwax/ archive/src/java/org/archive/nutchwax/OpenSearchServlet.java

From: <bi...@us...> - 2010-02-22 05:19:04

Revision: 2962
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2962&view=rev
Author:   binzino
Date:     2010-02-22 05:18:57 +0000 (Mon, 22 Feb 2010)

Log Message:
-----------
Added result score to OpenSearch output.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java	2010-02-22 05:18:00 UTC (rev 2961)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java	2010-02-22 05:18:57 UTC (rev 2962)
@@ -46,6 +46,7 @@
 import org.apache.nutch.searcher.NutchBean;
 import org.apache.nutch.searcher.Query;
 import org.apache.nutch.searcher.Summary;
+import org.apache.hadoop.io.FloatWritable;
 
 /** 
  * Present search results using A9's OpenSearch extensions to RSS,
@@ -183,9 +184,8 @@
  
       Element rss = addNode(doc, doc, "rss");
       addAttribute(doc, rss, "version", "2.0");
-      addAttribute(doc, rss, "xmlns:opensearch",
-                   NS_MAP.get("opensearch"));
-      addAttribute(doc, rss, "xmlns:nutch", NS_MAP.get("nutch"));
+      addAttribute(doc, rss, "xmlns:opensearch", NS_MAP.get("opensearch"));
+      addAttribute(doc, rss, "xmlns:nutch",      NS_MAP.get("nutch"));
 
       Element channel = addNode(doc, rss, "channel");
     
@@ -201,7 +201,7 @@
       addNode(doc, channel, "nutch", "responseTime", Double.toString( ((long) responseTime / 1000 / 1000 ) / 1000.0 ) );
 
       // Add a <nutch:urlParams> element containing a list of all the URL parameters.
-      Element urlParams = doc.createElementNS(NS_MAP.get("nutch"), "nutch:urlParams" );
+      Element urlParams = doc.createElementNS( NS_MAP.get("nutch"), "nutch:urlParams" );
       channel.appendChild( urlParams );
 
       for ( Map.Entry<String,String[]> e : ((Map<String,String[]>) request.getParameterMap( )).entrySet( ) )
@@ -219,9 +219,9 @@
       for (int i = 0; i < length; i++) {
         Hit hit = show[i];
         HitDetails detail = details[i];
+        String score = Float.toString( ((FloatWritable)hit.getSortValue( )).get() );
         String title = detail.getValue("title");
-        String url = detail.getValue("url");
-        String id = "idx=" + hit.getIndexNo() + "&id=" + hit.getUniqueKey();
+        String url   = detail.getValue("url");
       
         if (title == null || title.equals("")) {   // use url for docs w/o title
           title = url;
@@ -229,6 +229,7 @@
         
         Element item = addNode(doc, channel, "item");
 
+        addNode(doc, item, "nutch", "score", score );
         addNode(doc, item, "title", title);
         if (summaries[i] != null) {
           addNode(doc, item, "description", summaries[i].toString() );



[Archive-access-cvs] SF.net SVN: archive-access:[2961] trunk/archive-access/projects/nutchwax/ archive/src/nutch/src/java/org/apache/nutch/searcher/NutchBean.java

From: <bi...@us...> - 2010-02-22 05:18:07

Revision: 2961
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2961&view=rev
Author:   binzino
Date:     2010-02-22 05:18:00 +0000 (Mon, 22 Feb 2010)

Log Message:
-----------
Added result score to output in main().

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/NutchBean.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/NutchBean.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/NutchBean.java	2010-02-22 05:17:20 UTC (rev 2960)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/NutchBean.java	2010-02-22 05:18:00 UTC (rev 2961)
@@ -33,6 +33,7 @@
 import org.apache.nutch.parse.*;
 import org.apache.nutch.crawl.Inlinks;
 import org.apache.nutch.util.NutchConfiguration;
+import org.apache.hadoop.io.FloatWritable;
 
 /**
  * One stop shopping for search-related functionality.
@@ -443,6 +444,8 @@
         {
           System.out.println( " " 
                               + i 
+                              + " " 
+                              + Float.toString( ((FloatWritable) show[i].getSortValue( )).get() )
                               + " "
                               + java.util.Arrays.asList( details[i].getValues( "segment" ) )
                               + " " 



[Archive-access-cvs] SF.net SVN: archive-access:[2960] trunk/archive-access/projects/nutchwax/ archive

From: <bi...@us...> - 2010-02-22 05:17:29

Revision: 2960
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2960&view=rev
Author:   binzino
Date:     2010-02-22 05:17:20 +0000 (Mon, 22 Feb 2010)

Log Message:
-----------
Initial revision of OpenSearch master/slave system.  Work-in-progress.

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/lib/jdom.LICENSE
    trunk/archive-access/projects/nutchwax/archive/lib/jdom.jar
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java

Added: trunk/archive-access/projects/nutchwax/archive/lib/jdom.LICENSE
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/lib/jdom.LICENSE	                        (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/lib/jdom.LICENSE	2010-02-22 05:17:20 UTC (rev 2960)
@@ -0,0 +1,56 @@
+/*-- 
+
+ $Id: LICENSE.txt,v 1.11 2004/02/06 09:32:57 jhunter Exp $
+
+ Copyright (C) 2000-2004 Jason Hunter & Brett McLaughlin.
+ All rights reserved.
+ 
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted provided that the following conditions
+ are met:
+ 
+ 1. Redistributions of source code must retain the above copyright
+    notice, this list of conditions, and the following disclaimer.
+ 
+ 2. Redistributions in binary form must reproduce the above copyright
+    notice, this list of conditions, and the disclaimer that follows 
+    these conditions in the documentation and/or other materials 
+    provided with the distribution.
+
+ 3. The name "JDOM" must not be used to endorse or promote products
+    derived from this software without prior written permission.  For
+    written permission, please contact <request_AT_jdom_DOT_org>.
+ 
+ 4. Products derived from this software may not be called "JDOM", nor
+    may "JDOM" appear in their name, without prior written permission
+    from the JDOM Project Management <request_AT_jdom_DOT_org>.
+ 
+ In addition, we request (but do not require) that you include in the 
+ end-user documentation provided with the redistribution and/or in the 
+ software itself an acknowledgement equivalent to the following:
+     "This product includes software developed by the
+      JDOM Project (http://www.jdom.org/)."
+ Alternatively, the acknowledgment may be graphical using the logos 
+ available at http://www.jdom.org/images/logos.
+
+ THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
+ WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ DISCLAIMED.  IN NO EVENT SHALL THE JDOM AUTHORS OR THE PROJECT
+ CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ SUCH DAMAGE.
+
+ This software consists of voluntary contributions made by many 
+ individuals on behalf of the JDOM Project and was originally 
+ created by Jason Hunter <jhunter_AT_jdom_DOT_org> and
+ Brett McLaughlin <brett_AT_jdom_DOT_org>.  For more information
+ on the JDOM Project, please see <http://www.jdom.org/>. 
+
+ */
+

Added: trunk/archive-access/projects/nutchwax/archive/lib/jdom.jar
===================================================================
(Binary files differ)


Property changes on: trunk/archive-access/projects/nutchwax/archive/lib/jdom.jar
___________________________________________________________________
Added: svn:mime-type
   + application/octet-stream

Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java	                        (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java	2010-02-22 05:17:20 UTC (rev 2960)
@@ -0,0 +1,364 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.archive.nutchwax;
+
+import java.io.IOException;
+import java.io.BufferedReader;
+import java.io.InputStreamReader;
+import java.io.FileInputStream;
+import java.util.Comparator;
+import java.util.Collections;
+import java.util.List;
+import java.util.ArrayList;
+import java.util.LinkedList;
+
+import org.jdom.Document;
+import org.jdom.Element;
+import org.jdom.Namespace;
+import org.jdom.output.XMLOutputter;
+
+
+/** 
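+ * Sends a query to every OpenSearch slave in parallel and merges the responses into one OpenSearch RSS document.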
+ */   
+public class OpenSearchMaster
+{
+  List<OpenSearchSlave> slaves = new ArrayList<OpenSearchSlave>( );
+  long timeout = 30 * 1000;
+
+  public OpenSearchMaster( String slavesFile, long timeout )
+    throws IOException
+  {
+    this( slavesFile );
+    this.timeout = timeout;
+  }
+
+  public OpenSearchMaster( String slavesFile )
+    throws IOException
+  {
+    BufferedReader r = null;
+    try
+      {
+        r = new BufferedReader( new InputStreamReader( new FileInputStream( slavesFile ), "utf-8" ) );
+
+        String line;
+        while ( (line = r.readLine()) != null )
+          {
+            line = line.trim();
+            if ( line.length() == 0 || line.charAt( 0 ) == '#' )
+              {
+                // Ignore it.
+                continue ;
+              }
+
+            OpenSearchSlave slave = new OpenSearchSlave( line );
+
+            this.slaves.add( slave );            
+          }
+      }
+    finally
+      {
+        try { if ( r != null ) r.close(); } catch ( IOException ioe ) { }
+      }
+    
+  }
+
+  public Document query( String query, int startIndex, int numResults, int hitsPerSite )
+  {
+    long startTime = System.currentTimeMillis( );
+    
+    List<SlaveQueryThread> slaveThreads = new ArrayList<SlaveQueryThread>( this.slaves.size() );
+
+    for ( OpenSearchSlave slave : this.slaves )
+      {
+        SlaveQueryThread sqt = new SlaveQueryThread( slave, query, 0, (startIndex+numResults), hitsPerSite );
+
+        sqt.start( );
+
+        slaveThreads.add( sqt );        
+      }
+
+    waitForThreads( slaveThreads, this.timeout, startTime );
+
+    LinkedList<Element> items = new LinkedList<Element>( );
+    long totalResults = 0;
+
+    for ( SlaveQueryThread sqt : slaveThreads )
+      {
+        if ( sqt.throwable != null || sqt.response == null )
+          {
+            // TODO: Handle problems with slaves (failed, or still running after the timeout)
+            continue ;
+          }
+
+        // Dump all the results ("item" elements) into a single list.
+        Element channel = sqt.response.getRootElement( ).getChild( "channel" );
+        items.addAll( (List<Element>) channel.getChildren( "item" ) );
+        channel.removeChildren( "item" );
+
+        try
+          {
+            totalResults += Integer.parseInt( channel.getChild( "totalResults", Namespace.getNamespace( "http://a9.com/-/spec/opensearchrss/1.0/" ) ).getTextTrim( ) );
+          }
+        catch ( Exception e ) 
+          {
+            // TODO: Log error getting total.
+          }
+        
+      }
+
+    if ( items.size( ) > 0 && hitsPerSite > 0 )
+      {
+        Collections.sort( items, new ElementSiteThenScoreComparator( ) );
+
+        LinkedList<Element> collapsed = new LinkedList<Element>( );
+        
+        collapsed.add( items.removeFirst( ) );
+        
+        int count = 1;
+        for ( Element item : items )
+          {
+            String lastSite = collapsed.getLast( ).getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ).getTextTrim( );
+
+            if ( lastSite.length( ) == 0 ||
+                 !lastSite.equals( item.getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ).getTextTrim( ) ) )
+              {
+                collapsed.add( item );
+                count = 1;                
+              }
+            else if ( count < hitsPerSite )
+              {
+                collapsed.add( item );
+                count++;
+              }
+            else
+              {
+                // TODO: Log collapse of item.
+              }
+          }
+
+        // Replace the list of items with the collapsed list.
+        items = collapsed;
+      }
+
+    Collections.sort( items, new ElementScoreComparator( ) );
+
+    // Build the final results OpenSearch XML document.
+    Element channel = new Element( "channel" );
+    channel.addContent( new Element( "title"       ) );
+    channel.addContent( new Element( "description" ) );
+    channel.addContent( new Element( "link"        ) );
+
+    Element eTotalResults = new Element( "totalResults", Namespace.getNamespace( "http://a9.com/-/spec/opensearchrss/1.0/" ) );
+    Element eStartIndex   = new Element( "startIndex",   Namespace.getNamespace( "http://a9.com/-/spec/opensearchrss/1.0/" ) );
+    Element eItemsPerPage = new Element( "itemsPerPage", Namespace.getNamespace( "http://a9.com/-/spec/opensearchrss/1.0/" ) );
+
+    eTotalResults.setText( Long.toString( totalResults ) );
+    eStartIndex.  setText( Long.toString( startIndex   ) );
+    eItemsPerPage.setText( Long.toString( numResults   ) );
+
+    channel.addContent( eTotalResults );
+    channel.addContent( eStartIndex   );
+    channel.addContent( eItemsPerPage );
+
+    // Get a sub-list of only the items we want: [startIndex, startIndex+numResults), end exclusive.
+    List<Element> subList = items.subList( Math.min(  startIndex,             items.size( ) ),
+                                           Math.min( (startIndex+numResults), items.size( ) ) );
+    channel.addContent( subList );
+
+    Element rss = new Element( "rss" );
+    rss.addContent( channel );
+
+    return new Document( rss );
+  }
+
+
+  /**
+   * Convenience method to wait for a collection of threads to complete,
+   * or until a timeout after a startTime expires.
+   */
+  private void waitForThreads( List<SlaveQueryThread> threads, long timeout, long startTime )
+  {
+    for ( Thread t : threads )
+      {
+        long timeRemaining = timeout - (System.currentTimeMillis( ) - startTime);
+        
+        // If we are out of time, don't wait for any more threads.
+        if ( timeRemaining <= 0 )
+          {
+            break; 
+          }
+        
+        // Otherwise, wait for the next unfinished thread to finish.
+        try
+          {
+            t.join( timeRemaining );
+          }
+        catch ( InterruptedException ie ) 
+          {
+            break;
+          }
+      }
+  }
+
+  
+  public static void main( String args[] )
+    throws Exception
+  {
+    String usage = "OpenSearchMaster [OPTIONS] SLAVES.txt query"
+      + "\n\t-h <n>    Hits per site"
+      + "\n\t-n <n>    Number of results"
+      + "\n\t-s <n>    Start index"
+      + "\n";
+    
+    if ( args.length < 2 )
+      {
+        System.err.println( usage );
+        System.exit( 1 );
+      }
+
+    String slavesFile = args[args.length - 2];
+    String query      = args[args.length - 1];
+    
+    int startIndex  = 0;
+    int hitsPerSite = 0;
+    int numHits     = 10;
+    for ( int i = 0 ; i < args.length - 2 ; i++ )
+      {
+        try
+          {
+            if ( "-h".equals( args[i] ) )
+              {
+                i++;
+                hitsPerSite = Integer.parseInt( args[i] );
+              }
+            if ( "-n".equals( args[i] ) )
+              {
+                i++;
+                numHits = Integer.parseInt( args[i] );
+              }
+            if ( "-s".equals( args[i] ) )
+              {
+                i++;
+                startIndex = Integer.parseInt( args[i] );
+              }
+          }
+        catch ( NumberFormatException nfe ) 
+          {
+            System.err.println( "Error: not a numeric value: " + args[i] );
+            System.err.println( usage );
+            System.exit( 1 );
+          }
+      }
+
+    OpenSearchMaster master = new OpenSearchMaster( slavesFile );
+
+    Document doc = master.query( query, startIndex, numHits, hitsPerSite );
+
+    (new XMLOutputter()).output( doc, System.out );
+  }
+
+}
+
+
+class SlaveQueryThread extends Thread
+{
+  OpenSearchSlave slave;
+
+  String query;
+  int    startIndex;
+  int    numResults;
+  int    hitsPerSite;
+
+  Document        response;
+  Throwable       throwable;
+
+
+  SlaveQueryThread( OpenSearchSlave slave, String query, int startIndex, int numResults, int hitsPerSite )
+  {
+    this.slave       = slave;
+    this.query       = query;
+    this.startIndex  = startIndex;
+    this.numResults  = numResults;
+    this.hitsPerSite = hitsPerSite;
+  }
+
+  public void run( )
+  {
+    try
+      {
+        this.response = this.slave.query( this.query, this.startIndex, this.numResults, this.hitsPerSite );
+      }
+    catch ( Throwable t )
+      {
+        this.throwable = t;
+      }
+  }
+}
+
+
+class ElementScoreComparator implements Comparator<Element>
+{
+  public int compare( Element e1, Element e2 )
+  {
+    if ( e1 == e2 )   return 0;
+    if ( e1 == null ) return 1;
+    if ( e2 == null ) return -1;
+
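+    Element score1 = e1.getChild( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) );
+    Element score2 = e2.getChild( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) );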
+
+    if ( score1 == score2 ) return 0;
+    if ( score1 == null )   return 1;
+    if ( score2 == null )   return -1;
+
+    String text1 = score1.getText().trim();
+    String text2 = score2.getText().trim();
+
+    float value1 = 0.0f;
+    float value2 = 0.0f;
+
+    try { value1 = Float.parseFloat( text1 ); } catch ( NumberFormatException nfe ) { }
+    try { value2 = Float.parseFloat( text2 ); } catch ( NumberFormatException nfe ) { }
+
+    if ( value1 == value2 ) return 0;
+
+    return value1 > value2 ? -1 : 1;
+  }
+}
+
+class ElementSiteThenScoreComparator extends ElementScoreComparator
+{
+  public int compare( Element e1, Element e2 )
+  {
+    if ( e1 == e2 )   return 0;
+    if ( e1 == null ) return 1;
+    if ( e2 == null ) return -1;
+
+    String site1 = e1.getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ).getTextTrim();
+    String site2 = e2.getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ).getTextTrim();
+    
+    if ( site1.equals( site2 ) )
+      {
+        // Sites are equal, then compare scores.
+        return super.compare( e1, e2 );
+      }
+
+    return site1.compareTo( site2 );
+  }
+}
\ No newline at end of file
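
Note on waitForThreads() in OpenSearchMaster.java: the waiting strategy can be sketched in isolation. This is a minimal sketch, not the committed code. The class name DeadlineJoinSketch and the method joinAll are hypothetical, and unlike the original it returns how many threads had finished, so a caller could tell whether some workers were still running when the deadline passed.

```java
import java.util.List;

final class DeadlineJoinSketch
{
  /**
   * Joins each thread in turn, never waiting past startTime + timeoutMs
   * overall.  Returns the number of threads that were no longer alive
   * when they were checked.  (Hypothetical helper, for illustration only.)
   */
  static int joinAll( List<Thread> threads, long timeoutMs, long startTime )
  {
    int finished = 0;

    for ( Thread t : threads )
      {
        // The time budget is shared by all threads, so recompute it each time.
        long remaining = timeoutMs - ( System.currentTimeMillis( ) - startTime );

        if ( remaining > 0 )
          {
            try
              {
                t.join( remaining );
              }
            catch ( InterruptedException ie )
              {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread( ).interrupt( );
                break;
              }
          }

        // Out of time or not, a thread that already ended still counts.
        if ( ! t.isAlive( ) ) finished++;
      }

    return finished;
  }
}
```

A caller would pass the same start time it recorded before starting the threads, for example DeadlineJoinSketch.joinAll( threads, 30 * 1000, startTime ), and treat any thread that is still alive afterwards as having produced no response.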

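Note on the hits-per-site collapsing in OpenSearchMaster.query(): the logic can be tried without JDOM. The following is a rough sketch under assumptions. Hit is a hypothetical stand-in for an "item" element carrying its site and score, and SiteCollapseSketch.collapse is an illustrative name, not part of NutchWAX.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SiteCollapseSketch
{
  /** Hypothetical stand-in for an OpenSearch "item" element. */
  static final class Hit
  {
    final String site;
    final float  score;

    Hit( String site, float score )
    {
      this.site  = site;
      this.score = score;
    }
  }

  /** Keeps at most hitsPerSite hits per site, then orders by score. */
  static List<Hit> collapse( List<Hit> hits, int hitsPerSite )
  {
    List<Hit> sorted = new ArrayList<Hit>( hits );

    if ( hitsPerSite > 0 )
      {
        // Group by site; within a site, best score first.
        Collections.sort( sorted, ( a, b ) -> {
            int c = a.site.compareTo( b.site );
            return c != 0 ? c : Float.compare( b.score, a.score );
          } );

        List<Hit> kept = new ArrayList<Hit>( );
        String lastSite = null;
        int count = 0;
        for ( Hit h : sorted )
          {
            // As in the diff, hits with an empty site are never collapsed.
            boolean sameSite = lastSite != null
              && lastSite.length( ) > 0
              && lastSite.equals( h.site );

            count = sameSite ? count + 1 : 1;
            if ( count <= hitsPerSite ) kept.add( h );
            lastSite = h.site;
          }
        sorted = kept;
      }

    // The final ordering ignores the site: highest score first.
    Collections.sort( sorted, ( a, b ) -> Float.compare( b.score, a.score ) );
    return sorted;
  }
}
```

With hitsPerSite set to 0 nothing is collapsed and the hits are only sorted by score, which mirrors the hitsPerSite > 0 guard in query().
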
Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java	                        (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java	2010-02-22 05:17:20 UTC (rev 2960)
@@ -0,0 +1,52 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.archive.nutchwax;
+
+import java.io.IOException;
+import java.io.BufferedReader;
+import java.io.InputStreamReader;
+import java.io.FileInputStream;
+import java.util.List;
+import java.util.ArrayList;
+import javax.servlet.ServletException;
+import javax.servlet.ServletConfig;
+import javax.servlet.http.HttpServlet;
+import javax.servlet.http.HttpServletRequest;
+import javax.servlet.http.HttpServletResponse;
+
+
+/** 
+ * 
+ */   
+public class OpenSearchMasterServlet extends HttpServlet 
+{
+
+  public void init( ServletConfig config )
+    throws ServletException 
+  {
+    
+    
+  }
+
+  public void doGet( HttpServletRequest request, HttpServletResponse response )
+    throws ServletException, IOException 
+  {
+
+  }
+
+}

Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java	                        (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java	2010-02-22 05:17:20 UTC (rev 2960)
@@ -0,0 +1,209 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.archive.nutchwax;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.UnsupportedEncodingException;
+import java.net.HttpURLConnection;
+import java.net.MalformedURLException;
+import java.net.URL;
+import java.net.URLConnection;
+import java.net.URLEncoder;
+import java.util.List;
+
+import org.jdom.Document;
+import org.jdom.Element;
+import org.jdom.Namespace;
+import org.jdom.input.SAXBuilder;
+import org.jdom.output.XMLOutputter;
+
+/** 
+ * Queries one remote OpenSearch server, identified by a URL template, and parses its RSS response.
+ */   
+public class OpenSearchSlave
+{
+  private String urlTemplate;
+
+  public OpenSearchSlave( String urlTemplate )
+  {
+    this.urlTemplate = urlTemplate;
+  }
+
+  public Document query( String query, int startIndex, int requestedNumResults, int hitsPerSite )
+    throws Exception
+  {
+    URL url = buildRequestUrl( query, startIndex, requestedNumResults, hitsPerSite );
+    
+    InputStream is = null;
+    try
+      {
+        is = getInputStream( url );
+        
+        Document doc = (new SAXBuilder()).build( is );
+
+        doc = validate( doc );
+
+        return doc;
+      }
+    finally
+      {
+        // Ensure the InputStream is closed, which should trigger the
+        // underlying HTTP connection to be cleaned-up.
+        try { if ( is != null ) is.close( ); } catch ( IOException ioe ) { } // Not much we can do
+      }
+  }
+
+  private Document validate( Document doc )
+    throws Exception
+  {
+    if ( doc.getRootElement( ) == null ) throw new Exception( "Invalid OpenSearch response: missing /rss" );
+    Element root = doc.getRootElement( );
+    
+    if ( ! "rss".equals( root.getName( ) ) ) throw new Exception( "Invalid OpenSearch response: missing /rss" );
+    Element channel = root.getChild( "channel" );
+    
+    if ( channel == null ) throw new Exception( "Invalid OpenSearch response: missing /rss/channel" );
+
+    for ( Element item : (List<Element>) channel.getChildren( "item" ) )
+      {
+        Element site = item.getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) );
+        if ( site == null )
+          {
+            item.addContent( new Element( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ) );
+          }
+        
+        Element score = item.getChild( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) );
+        if ( score == null )
+          {
+            score = new Element( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) );
+            score.setText( "" );
+
+            item.addContent( score );
+          }
+      }
+
+    return doc;
+  }
+
+  /**
+   * Fills the {searchTerms}, {count}, {startIndex} and {hitsPerSite} placeholders of the URL template.
+   */
+  public URL buildRequestUrl( String query, int startIndex, int requestedNumResults, int hitsPerSite )
+    throws MalformedURLException, UnsupportedEncodingException
+  {
+    String url = this.urlTemplate;
+    
+    // Note about replaceAll: In the Java regex library, the replacement string has a few
+    // special characters: \ and $.  Fortunately, since we URL-encode the replacement string,
+    // any occurrence of \ or $ is converted to %xy form.  So we don't have to worry about it. :)
+    url = url.replaceAll( "[{]searchTerms[}]", URLEncoder.encode( query, "utf-8" ) );
+    url = url.replaceAll( "[{]count[}]"      , String.valueOf( requestedNumResults ) );
+    url = url.replaceAll( "[{]startIndex[}]" , String.valueOf( startIndex ) );
+    url = url.replaceAll( "[{]hitsPerSite[}]", String.valueOf( hitsPerSite ) );
+
+    // We don't know about any optional parameters, so we remove them (per the OpenSearch spec.)
+    url = url.replaceAll( "[{][^}]+[?][}]", "" );
+    
+    return new URL( url );
+  }
+
+
+  public InputStream getInputStream( URL url )
+    throws IOException
+  {
+    URLConnection connection = url.openConnection( );
+    connection.setDoOutput( false );
+    connection.setRequestProperty( "User-Agent", "Mozilla/4.0 (compatible; NutchWAX OpenSearchMaster)" );
+    connection.connect( );
+
+    if ( connection instanceof HttpURLConnection )
+      {
+        HttpURLConnection hc = (HttpURLConnection) connection;
+
+        switch ( hc.getResponseCode( ) )
+          {
+          case 200:
+            // All good.
+            break;
+          default:
+            // Problems!  Bail out.
+            throw new IOException( "HTTP error from " + url + ": " + hc.getResponseMessage( ) );
+          }
+      }
+
+    InputStream is = connection.getInputStream( );
+
+    return is;
+  }
+
+  public String toString()
+  {
+    return this.urlTemplate;
+  }
+
+  public static void main( String args[] )
+    throws Exception
+  {
+    String usage = "OpenSearchSlave [OPTIONS] urlTemplate query"
+      + "\n\t-h <n>   Hits per site"
+      + "\n\t-n <n>   Number of results"
+      + "\n";
+
+    if ( args.length < 2 )
+      {
+        System.err.println( usage );
+        System.exit( 1 );
+      }
+
+    String urlTemplate = args[args.length - 2];
+    String query       = args[args.length - 1];
+
+    int hitsPerSite = 0;
+    int numHits     = 10;
+    for ( int i = 0 ; i < args.length - 2 ; i++ )
+      {
+        try
+          {
+            if ( "-h".equals( args[i] ) )
+              {
+                i++;
+                hitsPerSite = Integer.parseInt( args[i] );
+              }
+            if ( "-n".equals( args[i] ) )
+              {
+                i++;
+                numHits = Integer.parseInt( args[i] );
+              }
+          }
+        catch ( NumberFormatException nfe ) 
+          {
+            System.err.println( "Error: not a numeric value: " + args[i] );
+            System.err.println( usage );
+            System.exit( 1 );
+          }
+      }
+
+    OpenSearchSlave osl = new OpenSearchSlave( urlTemplate );
+    
+    Document doc = osl.query( query, 0, numHits, hitsPerSite );
+
+    (new XMLOutputter()).output( doc, System.out );
+  }
+
+}
\ No newline at end of file

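Note on buildRequestUrl() in OpenSearchSlave.java: the placeholder substitution can be illustrated on its own. This is a sketch under assumptions, not the committed code. The class name UrlTemplateSketch and the example host are hypothetical, and the sketch swaps replaceAll for String.replace on the named placeholders. String.replace substitutes literally, so the special replacement characters \ and $ do not matter; the regex is kept only for stripping leftover optional {name?} parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class UrlTemplateSketch
{
  /** Fills an OpenSearch-style URL template and returns the request URL as text. */
  static String fill( String template, String query, int startIndex, int count, int hitsPerSite )
    throws UnsupportedEncodingException
  {
    String url = template;

    // Literal replacement: no regex, so nothing in the value needs escaping.
    url = url.replace( "{searchTerms}", URLEncoder.encode( query, "UTF-8" ) );
    url = url.replace( "{count}",       String.valueOf( count ) );
    url = url.replace( "{startIndex}",  String.valueOf( startIndex ) );
    url = url.replace( "{hitsPerSite}", String.valueOf( hitsPerSite ) );

    // Remove any optional parameters, written {name?}, that we do not know about.
    url = url.replaceAll( "[{][^}]+[?][}]", "" );

    return url;
  }
}
```

For example, filling "http://host.example/search?q={searchTerms}&n={count}&s={startIndex}&h={hitsPerSite}&l={language?}" with the query "web archive $1", start index 0, count 10 and 2 hits per site yields "http://host.example/search?q=web+archive+%241&n=10&s=0&h=2&l=".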


[Archive-access-cvs] SF.net SVN: archive-access:[2959] trunk/archive-access/projects/nutchwax/ archive/src/nutch/build.xml

From: <bi...@us...> - 2010-02-20 03:26:10

Revision: 2959
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2959&view=rev
Author:   binzino
Date:     2010-02-20 03:26:03 +0000 (Sat, 20 Feb 2010)

Log Message:
-----------
Whoops, this should have gone in the previous commit.  I missed it.

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/src/nutch/build.xml

Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/build.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/build.xml	                        (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/build.xml	2010-02-20 03:26:03 UTC (rev 2959)
@@ -0,0 +1,640 @@
+<?xml version="1.0"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+<project name="Nutch" default="job">
+
+  <!-- Load all the default properties, and any the user wants    -->
+  <!-- to contribute (without having to type -D or edit this file -->
+  <property file="${user.home}/build.properties" />
+  <property file="${basedir}/build.properties" />
+  <property file="${basedir}/default.properties" />
+  <property name="test.junit.output.format" value="plain"/>
+ 
+  <!-- the normal classpath -->
+  <path id="classpath">
+    <pathelement location="${build.classes}"/>
+    <fileset dir="${lib.dir}">
+      <include name="*.jar" />
+    </fileset>
+  </path>
+
+  <!-- the unit test classpath -->
+  <dirname property="plugins.classpath.dir" file="${build.plugins}"/>
+  <path id="test.classpath">
+    <pathelement location="${test.build.classes}" />
+    <pathelement location="${conf.dir}"/>
+    <pathelement location="${test.src.dir}"/>
+    <pathelement location="${plugins.classpath.dir}"/>
+    <path refid="classpath"/>
+    <pathelement location="${build.dir}/${final.name}.job" />
+  </path>
+
+  <!-- xmlcatalog definition for xslt task -->
+  <xmlcatalog id="docDTDs">
+     <dtd publicId="-//W3C//DTD XHTML 1.0 Transitional//EN"            
+          location="${xmlcatalog.dir}/xhtml1-transitional.dtd"/> 
+  </xmlcatalog> 
+
+  <!-- ====================================================== -->
+  <!-- Stuff needed by all targets                            -->
+  <!-- ====================================================== -->
+  <target name="init">
+    <mkdir dir="${build.dir}"/>
+    <mkdir dir="${build.classes}"/>
+
+    <mkdir dir="${test.build.dir}"/>
+    <mkdir dir="${test.build.classes}"/>
+
+    <touch datetime="01/25/1971 2:00 pm">
+      <fileset dir="${conf.dir}" includes="**/*.template"/>
+    </touch>
+
+    <copy todir="${conf.dir}" verbose="true">
+      <fileset dir="${conf.dir}" includes="**/*.template"/>
+      <mapper type="glob" from="*.template" to="*"/>
+    </copy>
+
+    <!-- unpack hadoop scripts from hadoop jar into bin directory -->
+    <mkdir dir="${build.dir}/hadoop"/>
+    <unjar dest="${build.dir}/hadoop">
+      <fileset dir="${lib.dir}" includes="hadoop*.jar"/>
+      <patternset includes="bin.tgz"/>
+    </unjar>
+    
+    <untar src="${build.dir}/hadoop/bin.tgz" dest="bin" compression="gzip"/>
+    <!-- fix broken library paths with spaces -->
+    <replace file="bin/hadoop" token="PlatformName" value="PlatformName | sed -e 's/ /_/g'"/>
+    <chmod dir="bin" perm="ugo+rx" includes="*.sh,hadoop"/>
+
+    <!-- unpack hadoop webapp from hadoop jar into build directory -->
+    <mkdir dir="${build.dir}/webapps"/>
+    <unjar dest="${build.dir}">
+      <fileset dir="${lib.dir}" includes="hadoop*.jar"/>
+      <patternset includes="webapps/**"/>
+    </unjar>
+
+  </target>
+
+  <!-- ====================================================== -->
+  <!-- Compile the Java files                                 -->
+  <!-- ====================================================== -->
+  <target name="compile" depends="compile-core, compile-plugins"/>
+
+  <target name="compile-core" depends="init">
+    <javac 
+     encoding="${build.encoding}" 
+     srcdir="${src.dir}"
+     includes="**/*.java"
+     destdir="${build.classes}"
+     debug="${javac.debug}"
+     optimize="${javac.optimize}"
+     target="${javac.version}"
+     source="${javac.version}"
+     deprecation="${javac.deprecation}">
+      <classpath refid="classpath"/>
+    </javac>    
+  </target>
+
+  <target name="compile-plugins">
+    <ant dir="src/plugin" target="deploy" inheritAll="false"/>
+  </target>
+
+  <target name="generate-src" depends="init">
+    <javacc target="${src.dir}/org/apache/nutch/analysis/NutchAnalysis.jj"
+            javacchome="${javacc.home}">
+    </javacc>
+
+    <fixcrlf srcdir="${src.dir}" eol="lf" includes="**/*.java"/>
+
+  </target>
+
+  <target name="dynamic" depends="generate-src, compile">
+  </target>
+
+  <!-- ================================================================== -->
+  <!-- Make nutch.jar                                                     -->
+  <!-- ================================================================== -->
+  <!--                                                                    -->
+  <!-- ================================================================== -->
+  <target name="jar" depends="compile-core">
+    <copy file="${conf.dir}/nutch-default.xml"
+          todir="${build.classes}"/>
+    <copy file="${conf.dir}/nutch-site.xml"
+          todir="${build.classes}"/>
+    <jar jarfile="${build.dir}/${final.name}.jar"
+         basedir="${build.classes}">
+      <manifest>
+      </manifest>
+    </jar>
+  </target>
+
+  <!-- ================================================================== -->
+  <!-- Make job jar                                                       -->
+  <!-- ================================================================== -->
+  <!--                                                                    -->
+  <!-- ================================================================== -->
+  <target name="job" depends="compile">
+    <jar jarfile="${build.dir}/${final.name}.job">
+      <zipfileset dir="${build.classes}"/>
+      <zipfileset dir="${conf.dir}" excludes="*.template,hadoop*.*"/>
+      <zipfileset dir="${lib.dir}" prefix="lib"
+                  includes="**/*.jar" excludes="hadoop-*.jar"/>
+      <zipfileset dir="${build.plugins}" prefix="plugins"/>
+    </jar>
+  </target>
+
+  <!-- ================================================================== -->
+  <!-- Make nutch.war                                                     -->
+  <!-- ================================================================== -->
+  <!--                                                                    -->
+  <!-- ================================================================== -->
+  <target name="war" depends="jar,compile,generate-docs">
+
+    <!-- generate the nutch.xml (servlet context) file -->
+    <xslt in="${basedir}/conf/nutch-default.xml"
+          out="${build.dir}/nutch.xml"
+          style="${basedir}/conf/context.xsl">
+        <xmlcatalog refid="docDTDs"/>
+    	<outputproperty name="indent" value="yes"/>
+    </xslt>
+    <war destfile="${build.dir}/${final.name}.war"
+    	webxml="${web.src.dir}/web.xml">
+      <fileset dir="${web.src.dir}/jsp"/>
+      <zipfileset dir="${docs.src}" includes="include/*.html"/>
+      <zipfileset dir="${build.docs}" includes="*/include/*.html"/>
+      <fileset dir="${docs.dir}"/>
+      <lib dir="${lib.dir}">
+        <include name="lucene*.jar"/>
+        <include name="taglibs-*.jar"/>
+        <include name="hadoop-*.jar"/>
+        <include name="dom4j-*.jar"/>
+        <include name="xerces-*.jar"/>
+        <include name="tika-*.jar"/>
+        <include name="apache-solr-*.jar"/>
+        <include name="commons-httpclient-*.jar"/>
+        <include name="commons-codec-*.jar"/>
+        <include name="commons-collections-*.jar"/>
+        <include name="commons-beanutils-*.jar"/>
+        <include name="commons-cli-*.jar"/>
+        <include name="commons-lang-*.jar"/>
+        <include name="commons-logging-*.jar"/>
+        <include name="log4j-*.jar"/>
+      </lib>
+      <lib dir="${build.dir}">
+	      <include name="${final.name}.jar"/>
+      </lib>
+      <classes dir="${conf.dir}" excludes="**/*.template"/>
+      <classes dir="${web.src.dir}/locale"/>
+      <classes file="${web.src.dir}/log4j.properties"/>
+      <zipfileset prefix="WEB-INF/classes/plugins" dir="${build.plugins}"/>
+      <webinf dir="${lib.dir}">
+	      <include name="taglibs-*.tld"/>
+      </webinf>
+    </war>
+   </target>
+
+
+  <!-- ================================================================== -->
+  <!-- Compile test code                                                  --> 
+  <!-- ================================================================== -->
+  <target name="compile-core-test" depends="compile-core">
+    <javac 
+     encoding="${build.encoding}" 
+     srcdir="${test.src.dir}"
+     includes="org/apache/nutch/**/*.java"
+     destdir="${test.build.classes}"
+     debug="${javac.debug}"
+     optimize="${javac.optimize}"
+     target="${javac.version}"
+     source="${javac.version}"
+     deprecation="${javac.deprecation}">
+      <classpath refid="test.classpath"/>
+    </javac>    
+  </target>
+
+  <!-- ================================================================== -->
+  <!-- Run code checks (PMD)                                              --> 
+  <!-- ================================================================== -->
+  <target name="pmd" depends="compile">
+	<property name="pmd.report" location="${build.dir}/pmd-report.html" />
+	<taskdef name="pmd" classname="net.sourceforge.pmd.ant.PMDTask">
+	  <classpath>
+		  <fileset dir="${lib.dir}">
+            <include name="pmd-ext/*.jar" />
+            <include name="xerces*.jar" />
+          </fileset>
+	  </classpath>
+	</taskdef>
+	<pmd shortFilenames="true" failonerror="true" failOnRuleViolation="false"
+		 encoding="${build.encoding}" failuresPropertyName="pmd.failures">
+	  <ruleset>unusedcode</ruleset>
+          <!--ruleset>basic</ruleset-->
+          <!--ruleset>optimizations</ruleset-->
+      <formatter type="html" toFile="${pmd.report}" />
+	  <!-- <formatter type="xml" toFile="${tempbuild}/$report_pmd.xml"/> -->
+	<fileset dir="${basedir}/src">
+        	<include name="java/**/*.java"/>
+	        <include name="plugin/**/*.java"/>
+		<!-- Exclude generated sources -->
+		<exclude name="**/NutchAnalysis.java" />
+		<exclude name="**/NutchAnalysisTokenManager.java" />
+      </fileset>
+    </pmd>
+	<condition property="pmd.stop" value="true">
+      <and>
+        <isset property="pmd.failures" />
+          <not>
+            <equals arg1="0" arg2="${pmd.failures}" trim="true" />
+          </not>
+      </and>
+	</condition>
+	<fail if="pmd.stop">FAILURE: PMD shows ${pmd.failures} rule violations. See ${pmd.report} for details.</fail>
+  </target>
+
+  <!-- ================================================================== -->
+  <!-- Run unit tests                                                     --> 
+  <!-- ================================================================== -->
+  <target name="test" depends="test-core, test-plugins"/>
+
+  <target name="test-core" depends="job, compile-core-test">
+
+    <delete dir="${test.build.data}"/>
+    <mkdir dir="${test.build.data}"/>
+    <!-- 
+     copy resources needed in junit tests
+    -->
+    <copy todir="${test.build.data}">
+      <fileset dir="src/testresources" includes="**/*"/>
+    </copy>
+    <copy file="${test.src.dir}/nutch-site.xml"
+          todir="${test.build.classes}"/>
+
+    <copy file="${test.src.dir}/log4j.properties"
+          todir="${test.build.classes}"/>
+
+    <junit printsummary="yes" haltonfailure="no" fork="yes" dir="${basedir}"
+      errorProperty="tests.failed" failureProperty="tests.failed" maxmemory="1000m">
+      <sysproperty key="test.build.data" value="${test.build.data}"/>
+      <sysproperty key="test.src.dir" value="${test.src.dir}"/>
+      <classpath refid="test.classpath"/>
+      <formatter type="${test.junit.output.format}" />
+      <batchtest todir="${test.build.dir}" unless="testcase">
+        <fileset dir="${test.src.dir}"
+                 includes="**/Test*.java" excludes="**/${test.exclude}.java" />
+      </batchtest>
+      <batchtest todir="${test.build.dir}" if="testcase">
+        <fileset dir="${test.src.dir}" includes="**/${testcase}.java"/>
+      </batchtest>
+    </junit>
+
+    <fail if="tests.failed">Tests failed!</fail>
+
+  </target>   
+
+  <target name="test-plugins" depends="compile">
+    <ant dir="src/plugin" target="test" inheritAll="false"/>
+  </target>
+
+  <target name="nightly" depends="test, tar">
+  </target>
+
+  <!-- ================================================================== -->
+  <!-- Documentation                                                      -->
+  <!-- ================================================================== -->
+  <target name="javadoc" depends="compile">
+    <mkdir dir="${build.javadoc}"/>
+    <javadoc
+      overview="${src.dir}/overview.html"
+      destdir="${build.javadoc}"
+      author="true"
+      version="true"
+      use="true"
+      windowtitle="${Name} ${version} API"
+      doctitle="${Name} ${version} API"
+      bottom="Copyright &amp;copy; ${year} The Apache Software Foundation"
+      >
+        <arg value="${javadoc.proxy.host}"/>
+        <arg value="${javadoc.proxy.port}"/>
+
+      <packageset dir="${src.dir}"/>
+      <packageset dir="${plugins.dir}/lib-http/src/java"/>
+      <packageset dir="${plugins.dir}/lib-parsems/src/java"/>
+      <packageset dir="${plugins.dir}/lib-regex-filter/src/java"/>
+      <packageset dir="${plugins.dir}/microformats-reltag/src/java"/>
+      <packageset dir="${plugins.dir}/ontology/src/java"/>
+      <packageset dir="${plugins.dir}/protocol-file/src/java"/>
+      <packageset dir="${plugins.dir}/protocol-ftp/src/java"/>
+      <packageset dir="${plugins.dir}/protocol-http/src/java"/>
+      <packageset dir="${plugins.dir}/protocol-httpclient/src/java"/>
+      <packageset dir="${plugins.dir}/parse-ext/src/java"/>
+      <packageset dir="${plugins.dir}/parse-html/src/java"/>
+      <packageset dir="${plugins.dir}/parse-js/src/java"/>
+      <packageset dir="${plugins.dir}/parse-text/src/java"/>
+      <packageset dir="${plugins.dir}/parse-pdf/src/java"/>
+<!--  <packageset dir="${plugins.dir}/parse-rtf/src/java"/> plugin excluded from build due to licensing issues-->
+<!--  <packageset dir="${plugins.dir}/parse-mp3/src/java"/> plugin excluded from build due to licensing issues-->
+      <packageset dir="${plugins.dir}/parse-msexcel/src/java"/>
+      <packageset dir="${plugins.dir}/parse-mspowerpoint/src/java"/>
+      <packageset dir="${plugins.dir}/parse-msword/src/java"/>
+      <packageset dir="${plugins.dir}/parse-oo/src/java"/>
+      <packageset dir="${plugins.dir}/parse-rss/src/java"/>
+      <packageset dir="${plugins.dir}/parse-swf/src/java"/>
+      <packageset dir="${plugins.dir}/parse-zip/src/java"/>
+      <packageset dir="${plugins.dir}/index-basic/src/java"/>
+      <packageset dir="${plugins.dir}/index-more/src/java"/>
+      <packageset dir="${plugins.dir}/query-basic/src/java"/>
+      <packageset dir="${plugins.dir}/query-more/src/java"/>
+      <packageset dir="${plugins.dir}/query-site/src/java"/>
+      <packageset dir="${plugins.dir}/query-url/src/java"/>
+      <packageset dir="${plugins.dir}/scoring-opic/src/java"/>
+      <packageset dir="${plugins.dir}/summary-basic/src/java"/>
+      <packageset dir="${plugins.dir}/summary-lucene/src/java"/>
+      <packageset dir="${plugins.dir}/urlfilter-automaton/src/java"/>
+      <packageset dir="${plugins.dir}/urlfilter-regex/src/java"/>
+      <packageset dir="${plugins.dir}/urlfilter-prefix/src/java"/>
+      <packageset dir="${plugins.dir}/creativecommons/src/java"/>
+      <packageset dir="${plugins.dir}/languageidentifier/src/java"/>
+      <packageset dir="${plugins.dir}/clustering-carrot2/src/java"/>
+      <packageset dir="${plugins.dir}/ontology/src/java"/>
+      
+      <packageset dir="${plugins.dir}/index-nutchwax/src/java"/>
+      <packageset dir="${plugins.dir}/query-nutchwax/src/java"/>
+      <packageset dir="${plugins.dir}/scoring-nutchwax/src/java"/>
+      <packageset dir="${plugins.dir}/urlfilter-nutchwax/src/java"/>
+
+      <link href="${javadoc.link.java}"/>
+      <link href="${javadoc.link.lucene}"/>
+      <link href="${javadoc.link.hadoop}"/>
+      
+      <classpath refid="classpath"/>
+    	<classpath>
+    		<fileset dir="${plugins.dir}" >
+    			<include name="**/*.jar"/>
+    		</fileset>
+    	</classpath>
+    	
+      <group title="Core" packages="org.apache.nutch.*"/>
+      <group title="Plugins API" packages="${plugins.api}"/>
+      <group title="Protocol Plugins" packages="${plugins.protocol}"/>
+      <group title="URL Filter Plugins" packages="${plugins.urlfilter}"/>
+      <group title="Scoring Plugins" packages="${plugins.scoring}"/>
+      <group title="Parse Plugins" packages="${plugins.parse}"/>
+      <group title="Analysis Plugins" packages="${plugins.analysis}"/>
+      <group title="Indexing Filter Plugins" packages="${plugins.index}"/>
+      <group title="Query Filter Plugins" packages="${plugins.query}"/>
+      <group title="Summary Plugins" packages="${plugins.summary}"/>
+      <group title="Clustering Plugins" packages="${plugins.clustering}"/>
+      <group title="Ontology Plugins" packages="${plugins.ontology}"/>
+      <group title="Misc. Plugins" packages="${plugins.misc}"/>
+    </javadoc>
+    <!-- Copy the plugin.dtd file to the plugin doc-files dir -->
+    <copy file="${plugins.dir}/plugin.dtd"
+          todir="${build.javadoc}/org/apache/nutch/plugin/doc-files"/>
+  </target>	
+	
+  <target name="default-doc">
+    <style basedir="${conf.dir}" destdir="${docs.dir}"
+           includes="nutch-default.xml" style="conf/nutch-conf.xsl"/>
+  </target>
+
+  <target name="generate-locale" if="doc.locale">
+    <echo message="Generating docs for locale=${doc.locale}"/>
+
+    <mkdir dir="${build.docs}/${doc.locale}/include"/>
+    <xslt in="${docs.src}/include/${doc.locale}/header.xml"
+          out="${build.docs}/${doc.locale}/include/header.html"
+          style="${docs.src}/style/nutch-header.xsl">
+        <xmlcatalog refid="docDTDs"/>
+    </xslt>
+
+    <dependset>
+       <srcfileset dir="${docs.src}/include/${doc.locale}" includes="*.xml"/>
+       <srcfileset dir="${docs.src}/style" includes="*.xsl"/>
+       <targetfileset dir="${docs.dir}/${doc.locale}" includes="*.html"/>
+    </dependset>  
+
+    <copy file="${docs.src}/style/nutch-page.xsl"
+          todir="${build.docs}/${doc.locale}"
+          preservelastmodified="true"/>
+
+    <xslt basedir="${docs.src}/pages/${doc.locale}"
+          destdir="${docs.dir}/${doc.locale}"
+          includes="*.xml"
+          style="${build.docs}/${doc.locale}/nutch-page.xsl">
+         <xmlcatalog refid="docDTDs"/>
+    </xslt>
+  </target>
+
+
+  <target name="generate-docs" depends="init">
+    <dependset>
+       <srcfileset dir="${docs.src}/include" includes="*.html"/>
+       <targetfileset dir="${docs.dir}" includes="**/*.html"/>
+    </dependset>  
+
+    <mkdir dir="${build.docs}/include"/>
+    <copy todir="${build.docs}/include">
+      <fileset dir="${docs.src}/include"/>
+    </copy>
+
+    <antcall target="generate-locale">
+      <param name="doc.locale" value="ca"/>
+    </antcall>
+
+    <antcall target="generate-locale">
+      <param name="doc.locale" value="de"/>
+    </antcall>
+
+    <antcall target="generate-locale">
+      <param name="doc.locale" value="en"/>
+    </antcall>
+    
+    <antcall target="generate-locale">
+      <param name="doc.locale" value="es"/>
+    </antcall>
+    
+    <antcall target="generate-locale">
+      <param name="doc.locale" value="fi"/>
+    </antcall>
+    
+    <antcall target="generate-locale">
+      <param name="doc.locale" value="fr"/>
+    </antcall>
+    
+    <antcall target="generate-locale">
+      <param name="doc.locale" value="hu"/>
+    </antcall>
+
+    <antcall target="generate-locale">
+      <param name="doc.locale" value="it"/>
+    </antcall>
+
+    <antcall target="generate-locale">
+      <param name="doc.locale" value="jp"/>
+    </antcall>
+
+    <antcall target="generate-locale">
+      <param name="doc.locale" value="ms"/>
+    </antcall>
+
+    <antcall target="generate-locale">
+      <param name="doc.locale" value="nl"/>
+    </antcall>
+
+    <antcall target="generate-locale">
+      <param name="doc.locale" value="pl"/>
+    </antcall>
+
+    <antcall target="generate-locale">
+      <param name="doc.locale" value="pt"/>
+    </antcall>
+
+    <antcall target="generate-locale">
+      <param name="doc.locale" value="sh"/>
+    </antcall>
+
+    <antcall target="generate-locale">
+      <param name="doc.locale" value="sr"/>
+    </antcall>
+
+    <antcall target="generate-locale">
+      <param name="doc.locale" value="sv"/>
+    </antcall>
+
+    <antcall target="generate-locale">
+      <param name="doc.locale" value="th"/>
+    </antcall>
+
+    <antcall target="generate-locale">
+      <param name="doc.locale" value="zh"/>
+    </antcall>
+
+    <fixcrlf srcdir="${docs.dir}" eol="lf" encoding="utf-8"
+             includes="**/*.html"/>
+
+  </target>
+
+  <!-- ================================================================== -->
+  <!-- D I S T R I B U T I O N                                            -->
+  <!-- ================================================================== -->
+  <!--                                                                    -->
+  <!-- ================================================================== -->
+  <target name="package" depends="jar, job, war, javadoc">
+    <mkdir dir="${dist.dir}"/>
+    <mkdir dir="${dist.dir}/lib"/>
+    <mkdir dir="${dist.dir}/bin"/>
+    <mkdir dir="${dist.dir}/docs"/>
+    <mkdir dir="${dist.dir}/docs/api"/>
+    <mkdir dir="${dist.dir}/plugins"/>
+
+    <copy todir="${dist.dir}/lib" includeEmptyDirs="false">
+      <fileset dir="lib"/>
+    </copy>
+
+    <copy todir="${dist.dir}/plugins">
+      <fileset dir="${build.plugins}"/>
+    </copy>
+
+    <copy todir="${dist.dir}/webapps">
+      <fileset dir="${build.webapps}"/>
+    </copy>
+
+    <copy file="${build.dir}/${final.name}.jar" todir="${dist.dir}"/>
+    <copy file="${build.dir}/${final.name}.job" todir="${dist.dir}"/>
+    <copy file="${build.dir}/${final.name}.war" todir="${dist.dir}"/>
+
+    <copy todir="${dist.dir}/bin">
+      <fileset dir="bin"/>
+    </copy>
+
+    <copy todir="${dist.dir}/conf">
+      <fileset dir="${conf.dir}" excludes="**/*.template"/>
+    </copy>
+
+    <chmod perm="ugo+x" type="file">
+        <fileset dir="${dist.dir}/bin"/>
+    </chmod>
+
+    <copy todir="${dist.dir}/docs">
+      <fileset dir="${docs.dir}"/>
+    </copy>
+
+    <copy todir="${dist.dir}/docs/api">
+      <fileset dir="${build.javadoc}"/>
+    </copy>
+
+    <copy todir="${dist.dir}">
+      <fileset dir=".">
+        <include name="*.txt" />
+        <include name="KEYS" />
+      </fileset>
+    </copy>
+
+    <copy todir="${dist.dir}/src" includeEmptyDirs="true">
+      <fileset dir="src"/>
+    </copy>
+
+    <copy todir="${dist.dir}/" file="build.xml"/>
+    <copy todir="${dist.dir}/" file="default.properties"/>
+
+  </target>
+
+  <!-- ================================================================== -->
+  <!-- Make release tarball                                               -->
+  <!-- ================================================================== -->
+  <target name="tar" depends="package">
+    <tar compression="gzip" longfile="gnu"
+      destfile="${build.dir}/${final.name}.tar.gz">
+      <tarfileset dir="${build.dir}" mode="664">
+	<exclude name="${final.name}/bin/*" />
+        <include name="${final.name}/**" />
+      </tarfileset>
+      <tarfileset dir="${build.dir}" mode="755">
+        <include name="${final.name}/bin/*" />
+      </tarfileset>
+    </tar>
+  </target>
+	
+  <!-- ================================================================== -->
+  <!-- Clean.  Delete the build files, and their directories              -->
+  <!-- ================================================================== -->
+  <target name="clean">
+    <delete dir="${build.dir}"/>
+  </target>
+
+  <!-- ================================================================== -->
+  <!-- RAT targets                                                        -->
+  <!-- ================================================================== -->
+  <target name="rat-sources-typedef">
+    <typedef resource="org/apache/rat/anttasks/antlib.xml" >
+      <classpath>
+        <fileset dir="." includes="rat*.jar"/>
+      </classpath>
+    </typedef>
+  </target>
+
+  <target name="rat-sources" depends="rat-sources-typedef"
+	  description="runs the tasks over src/java">
+    <rat:report xmlns:rat="antlib:org.apache.rat.anttasks">
+      <fileset dir="src">
+      	<include name="java/**/*"/>
+      	<include name="plugin/**/src/**/*"/>
+      </fileset>
+    </rat:report>
+  </target>
+	
+</project>


[Archive-access-cvs] SF.net SVN: archive-access:[2958] trunk/archive-access/projects/nutchwax/archive

From: <bi...@us...> - 2010-02-20 03:21:06

Revision: 2958
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2958&view=rev
Author:   binzino
Date:     2010-02-20 03:20:59 +0000 (Sat, 20 Feb 2010)

Log Message:
-----------
WAX-72 and WAX-71: Re-did build system.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/build.xml

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/src/nutch/src/plugin/
    trunk/archive-access/projects/nutchwax/archive/src/nutch/src/plugin/build.xml

Removed Paths:
-------------
    trunk/archive-access/projects/nutchwax/archive/src/plugin/build-plugin.xml
    trunk/archive-access/projects/nutchwax/archive/src/plugin/build.xml

Modified: trunk/archive-access/projects/nutchwax/archive/build.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/build.xml	2010-02-20 03:18:57 UTC (rev 2957)
+++ trunk/archive-access/projects/nutchwax/archive/build.xml	2010-02-20 03:20:59 UTC (rev 2958)
@@ -25,81 +25,52 @@
   <!-- HACK: Need to import default.properties like Nutch does -->
   <property name="final.name" value="nutch-1.0" />
   <property name="dist.dir"  value="${build.dir}/${final.name}" />
- 
-  <target name="nutch-compile-core">
-    <!-- First, copy over Nutch source overlays -->
+
+  <target name="init">
     <exec executable="rsync">
       <arg value="-vacC"/>
       <arg value="src/nutch/"/>
       <arg value="../../"/>
     </exec>
-    <ant dir="${nutch.dir}" target="compile-core" inheritAll="false" />
+    <exec executable="rsync">
+      <arg value="-vacC"/>
+      <arg value="lib/"/>
+      <arg value="../../lib/"/>
+    </exec>
+    <exec executable="rsync">
+      <arg value="-vacC"/>
+      <arg value="bin/"/>
+      <arg value="../../bin/"/>
+    </exec>
+    <exec executable="rsync">
+      <arg value="-vacC"/>
+      <arg value="src/java/"/>
+      <arg value="../../src/java/"/>
+    </exec>
+    <exec executable="rsync">
+      <arg value="-vacC"/>
+      <arg value="src/plugin/"/>
+      <arg value="../../src/plugin/"/>
+    </exec>
   </target>
-
-  <target name="nutch-compile-plugins">
-    <ant dir="${nutch.dir}" target="compile-plugins" inheritAll="false" />
-  </target>
-
-  <target name="compile-core" depends="nutch-compile-core">
-    <javac 
-           destdir="${build.dir}/classes"
-           debug="true"
-           verbose="false"
-           source="1.5"
-           target="1.5"
-           encoding="UTF-8"
-           fork="true"
-           nowarn="true"
-           deprecation="false">
-      <src path="${src.dir}/java" />
-      <include name="**/*.java" />
-      <classpath>
-        <pathelement location="${build.dir}/classes" />
-        <fileset dir="${lib.dir}">
-          <include name="*.jar"/>
-        </fileset>
-        <fileset dir="${nutch.dir}/lib">
-          <include name="*.jar"/>
-        </fileset>
-      </classpath>
-    </javac>
-  </target>
-
-  <target name="compile-plugins">
-    <ant dir="src/plugin" target="deploy" inheritAll="false" />
-  </target>
-
-  <!--
-      These targets all call down to the corresponding target in the
-      Nutch build.xml file.  This way all of the 'ant' build commands
-      can be executed from this directory and everything should get
-      built as expected.
-    -->
-  <target name="compile" depends="compile-core, compile-plugins, nutch-compile-plugins">
-  </target>
-
-  <target name="jar" depends="compile-core">
+ 
+  <target name="jar" depends="init">
     <ant dir="${nutch.dir}" target="jar" inheritAll="false" />
   </target>
 
-  <target name="job" depends="compile">
+  <target name="job" depends="init">
     <ant dir="${nutch.dir}" target="job" inheritAll="false" />
-
-    <!-- Add our NutchWAX libs to the .job created by Nutch's build. -->
-    <jar jarfile="${build.dir}/${final.name}.job" update="true">
-      <zipfileset dir="lib" prefix="lib" includes="*.jar"/>
-    </jar>
   </target>
 
-  <target name="war" depends="compile">
+  <target name="war" depends="init">
     <ant dir="${nutch.dir}" target="war" inheritAll="false" />
   </target>
 
-  <target name="javadoc" depends="compile">
+  <target name="javadoc" depends="init">
     <ant dir="${nutch.dir}" target="javadoc" inheritAll="false" />
   </target>
 
-  <target name="tar" depends="package">
+  <target name="tar" depends="init">
     <ant dir="${nutch.dir}" target="tar" inheritAll="false" />
   </target>
 
@@ -107,24 +78,12 @@
     <ant dir="${nutch.dir}" target="clean" inheritAll="false" />
   </target>
 
-  <!-- This one does a little more after calling down to the relevant
-       Nutch target.  After Nutch has copied everything into the
-       distribution directory, we add our script, libraries, etc.
-    -->
-  <target name="package" depends="jar, job, war, javadoc" >
+  <target name="package" depends="init">
     <ant dir="${nutch.dir}" target="package" inheritAll="false" />
     <ant target="onlypack" />
   </target>
 
   <target name="onlypack">
-    <copy todir="${dist.dir}/lib" includeEmptyDirs="false">
-      <fileset dir="lib"/>
-    </copy>
-
-    <copy todir="${dist.dir}/bin">
-      <fileset dir="bin"/>
-    </copy>
-
     <chmod perm="ugo+x" type="file">
         <fileset dir="${dist.dir}/bin"/>
     </chmod>

Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/plugin/build.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/plugin/build.xml	                        (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/plugin/build.xml	2010-02-20 03:20:59 UTC (rev 2958)
@@ -0,0 +1,204 @@
+<?xml version="1.0"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+<project name="Nutch" default="deploy-core" basedir=".">
+
+  <target name="deploy-core">
+    <ant target="compile-core" inheritall="false" dir="../.."/>
+    <ant target="deploy"/>
+  </target>
+
+  <!-- ====================================================== -->
+  <!-- Build & deploy all the plugin jars.                    -->
+  <!-- ====================================================== -->
+  <target name="deploy">
+     <ant dir="clustering-carrot2" target="deploy"/>
+     <ant dir="creativecommons" target="deploy"/>
+     <ant dir="feed" target="deploy"/>
+     <ant dir="index-basic" target="deploy"/>
+     <ant dir="index-anchor" target="deploy"/>
+     <ant dir="index-more" target="deploy"/>
+  	 <ant dir="field-basic" target="deploy"/>
+  	 <ant dir="field-boost" target="deploy"/>
+     <ant dir="languageidentifier" target="deploy"/>
+     <ant dir="lib-http" target="deploy"/>
+     <ant dir="lib-jakarta-poi" target="deploy"/>
+     <ant dir="lib-lucene-analyzers" target="deploy"/>
+     <ant dir="lib-nekohtml" target="deploy"/>
+     <ant dir="lib-parsems" target="deploy"/>
+     <ant dir="lib-regex-filter" target="deploy"/>
+     <ant dir="lib-xml" target="deploy"/>
+     <ant dir="microformats-reltag" target="deploy"/>
+     <ant dir="nutch-extensionpoints" target="deploy"/>
+     <ant dir="ontology" target="deploy"/>
+     <ant dir="protocol-file" target="deploy"/>
+     <ant dir="protocol-ftp" target="deploy"/>
+     <ant dir="protocol-http" target="deploy"/>
+     <ant dir="protocol-httpclient" target="deploy"/>
+     <ant dir="parse-ext" target="deploy"/>
+     <ant dir="parse-html" target="deploy"/>
+     <ant dir="parse-js" target="deploy"/>
+     <!-- <ant dir="parse-mp3" target="deploy"/> -->
+     <ant dir="parse-msexcel" target="deploy"/>
+     <ant dir="parse-mspowerpoint" target="deploy"/>
+     <ant dir="parse-msword" target="deploy"/>
+     <ant dir="parse-oo" target="deploy"/>
+     <ant dir="parse-pdf" target="deploy"/>
+     <ant dir="parse-rss" target="deploy"/>
+     <!-- <ant dir="parse-rtf" target="deploy"/> -->
+     <ant dir="parse-swf" target="deploy"/>
+     <ant dir="parse-text" target="deploy"/>
+     <ant dir="parse-zip" target="deploy"/>
+     <ant dir="query-basic" target="deploy"/>
+     <ant dir="query-more" target="deploy"/>
+     <ant dir="query-site" target="deploy"/>
+  	 <ant dir="query-custom" target="deploy"/>
+     <ant dir="query-url" target="deploy"/>
+     <ant dir="response-json" target="deploy"/>
+     <ant dir="response-xml" target="deploy"/>
+     <ant dir="scoring-opic" target="deploy"/>
+  	 <ant dir="scoring-link" target="deploy"/>
+     <ant dir="summary-basic" target="deploy"/>
+     <ant dir="subcollection" target="deploy"/>
+     <ant dir="summary-lucene" target="deploy"/>
+     <ant dir="tld" target="deploy"/>
+     <ant dir="urlfilter-automaton" target="deploy"/>
+     <ant dir="urlfilter-domain" target="deploy" />
+     <ant dir="urlfilter-prefix" target="deploy"/>
+     <ant dir="urlfilter-regex" target="deploy"/>
+     <ant dir="urlfilter-suffix" target="deploy"/>
+     <ant dir="urlfilter-validator" target="deploy"/>
+     <ant dir="urlnormalizer-basic" target="deploy"/>
+     <ant dir="urlnormalizer-pass" target="deploy"/>
+     <ant dir="urlnormalizer-regex" target="deploy"/>
+
+     <ant dir="index-nutchwax" target="deploy" />
+     <ant dir="query-nutchwax" target="deploy" />
+     <ant dir="scoring-nutchwax" target="deploy" />
+     <ant dir="urlfilter-nutchwax" target="deploy" />
+
+  </target>
+
+  <!-- ====================================================== -->
+  <!-- Test all of the plugins.                               -->
+  <!-- ====================================================== -->
+  <target name="test">
+    <parallel threadCount="2">
+     <ant dir="creativecommons" target="test"/>
+     <ant dir="index-more" target="test"/>
+     <ant dir="languageidentifier" target="test"/>
+     <ant dir="lib-http" target="test"/>
+     <ant dir="ontology" target="test"/>
+     <ant dir="protocol-httpclient" target="test"/>
+     <!--ant dir="parse-ext" target="test"/-->
+     <ant dir="parse-html" target="test"/>
+     <!-- <ant dir="parse-mp3" target="test"/> -->
+     <ant dir="parse-msexcel" target="test"/>
+     <ant dir="parse-mspowerpoint" target="test"/>
+     <ant dir="parse-msword" target="test"/>
+     <ant dir="parse-oo" target="test"/>
+     <ant dir="parse-pdf" target="test"/>
+     <ant dir="parse-rss" target="test"/>
+     <ant dir="feed" target="test"/>
+     <!-- <ant dir="parse-rtf" target="test"/> -->
+     <ant dir="parse-swf" target="test"/>
+     <ant dir="parse-zip" target="test"/>
+     <ant dir="query-url" target="test"/>
+     <ant dir="subcollection" target="test"/>
+     <ant dir="urlfilter-automaton" target="test"/>
+     <ant dir="urlfilter-domain" target="test" />
+     <ant dir="urlfilter-regex" target="test"/>
+     <ant dir="urlfilter-suffix" target="test"/>
+     <ant dir="urlnormalizer-basic" target="test"/>
+     <ant dir="urlnormalizer-pass" target="test"/>
+     <ant dir="urlnormalizer-regex" target="test"/>
+    </parallel>
+  </target>
+
+  <!-- ====================================================== -->
+  <!-- Clean all of the plugins.                              -->
+  <!-- ====================================================== -->
+  <target name="clean">
+    <ant dir="analysis-de" target="clean"/>
+    <ant dir="analysis-fr" target="clean"/>
+    <ant dir="clustering-carrot2" target="clean"/>
+    <ant dir="creativecommons" target="clean"/>
+    <ant dir="feed" target="clean"/>
+    <ant dir="index-basic" target="clean"/>
+    <ant dir="index-anchor" target="clean"/>
+    <ant dir="index-more" target="clean"/>
+    <ant dir="field-basic" target="clean"/>
+    <ant dir="field-boost" target="clean"/>  	
+    <ant dir="languageidentifier" target="clean"/>
+    <ant dir="lib-commons-httpclient" target="clean"/>
+    <ant dir="lib-http" target="clean"/>
+    <ant dir="lib-jakarta-poi" target="clean"/>
+    <ant dir="lib-lucene-analyzers" target="clean"/>
+    <ant dir="lib-nekohtml" target="clean"/>
+    <ant dir="lib-parsems" target="clean"/>
+    <ant dir="lib-regex-filter" target="clean"/>
+    <ant dir="lib-xml" target="clean"/>
+    <ant dir="microformats-reltag" target="clean"/>
+    <ant dir="nutch-extensionpoints" target="clean"/>
+    <ant dir="ontology" target="clean"/>
+    <ant dir="protocol-file" target="clean"/>
+    <ant dir="protocol-ftp" target="clean"/>
+    <ant dir="protocol-http" target="clean"/>
+    <ant dir="protocol-httpclient" target="clean"/>
+    <ant dir="parse-ext" target="clean"/>
+    <ant dir="parse-html" target="clean"/>
+    <ant dir="parse-js" target="clean"/>
+    <ant dir="parse-mp3" target="clean"/>
+    <ant dir="parse-msexcel" target="clean"/>
+    <ant dir="parse-mspowerpoint" target="clean"/>
+    <ant dir="parse-msword" target="clean"/>
+    <ant dir="parse-oo" target="clean"/>
+    <ant dir="parse-pdf" target="clean"/>
+    <ant dir="parse-rss" target="clean"/>
+    <ant dir="parse-rtf" target="clean"/>
+    <ant dir="parse-swf" target="clean"/>
+    <ant dir="parse-text" target="clean"/>
+    <ant dir="parse-zip" target="clean"/>
+    <ant dir="query-basic" target="clean"/>
+    <ant dir="query-more" target="clean"/>
+    <ant dir="query-site" target="clean"/>
+    <ant dir="query-url" target="clean"/>
+  	<ant dir="query-custom" target="clean"/>
+    <ant dir="response-json" target="clean"/>
+    <ant dir="response-xml" target="clean"/>
+    <ant dir="scoring-opic" target="clean"/>
+  	<ant dir="scoring-link" target="clean"/>
+    <ant dir="subcollection" target="clean"/>
+    <ant dir="summary-basic" target="clean"/>
+    <ant dir="summary-lucene" target="clean"/>
+    <ant dir="tld" target="clean"/>
+    <ant dir="urlfilter-automaton" target="clean"/>
+    <ant dir="urlfilter-domain" target="clean" />
+    <ant dir="urlfilter-prefix" target="clean"/>
+    <ant dir="urlfilter-regex" target="clean"/>
+    <ant dir="urlfilter-suffix" target="clean"/>
+    <ant dir="urlfilter-validator" target="clean"/>
+    <ant dir="urlnormalizer-basic" target="clean"/>
+    <ant dir="urlnormalizer-pass" target="clean"/>
+    <ant dir="urlnormalizer-regex" target="clean"/>
+
+    <ant dir="index-nutchwax" target="clean" />
+    <ant dir="query-nutchwax" target="clean" />
+    <ant dir="scoring-nutchwax" target="clean" />
+    <ant dir="urlfilter-nutchwax" target="clean" />
+  </target>
+</project>


Property changes on: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/plugin/build.xml
___________________________________________________________________
Added: svn:executable
   + *

Deleted: trunk/archive-access/projects/nutchwax/archive/src/plugin/build-plugin.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/plugin/build-plugin.xml	2010-02-20 03:18:57 UTC (rev 2957)
+++ trunk/archive-access/projects/nutchwax/archive/src/plugin/build-plugin.xml	2010-02-20 03:20:59 UTC (rev 2958)
@@ -1,216 +0,0 @@
-<?xml version="1.0"?>
-<!--
- Licensed to the Apache Software Foundation (ASF) under one or more
- contributor license agreements.  See the NOTICE file distributed with
- this work for additional information regarding copyright ownership.
- The ASF licenses this file to You under the Apache License, Version 2.0
- (the "License"); you may not use this file except in compliance with
- the License.  You may obtain a copy of the License at
-
-     http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
--->
-<!-- Imported by plugin build.xml files to define default targets. -->
-<project>
-
-  <property name="name" value="${ant.project.name}"/>
-  <property name="root" value="${basedir}"/>
-
-  <!-- load plugin-specific properties first -->
-  <property file="${user.home}/${name}.build.properties" />
-  <property file="${root}/build.properties" />
-
-  <property name="nutch.root" location="${root}/../../../../../"/>
-
-  <property name="src.dir" location="${root}/src/java"/>
-  <property name="src.test" location="${root}/src/test"/>
-
-  <available file="${src.test}" type="dir" property="test.available"/>
-
-  <property name="conf.dir" location="${nutch.root}/conf"/>
-
-  <property name="build.dir" location="${nutch.root}/build/${name}"/>
-  <property name="build.classes" location="${build.dir}/classes"/>
-  <property name="build.test" location="${build.dir}/test"/>
-
-  <property name="deploy.dir" location="${nutch.root}/build/plugins/${name}"/>
-
-  <!-- load nutch defaults last so that they can be overridden above -->
-  <property file="${nutch.root}/default.properties" />
-
-  <path id="plugin.deps"/>
-
-  <fileset id="lib.jars" dir="${root}" includes="lib/*.jar"/>
-
-  <!-- the normal classpath -->
-  <path id="classpath">
-    <pathelement location="${build.classes}"/>
-    <fileset refid="lib.jars"/>
-    <pathelement location="${nutch.root}/build/classes"/>
-    <fileset dir="${nutch.root}/lib">
-      <include name="*.jar" />
-    </fileset>
-    <!-- This is the contrib/archive/lib directory -->
-    <fileset dir="../../../lib">
-      <include name="*.jar" />
-    </fileset>
-    <path refid="plugin.deps"/>
-  </path>
-
-  <!-- the unit test classpath -->
-  <path id="test.classpath">
-    <pathelement location="${build.test}" />
-    <pathelement location="${nutch.root}/build/test/classes"/>
-    <pathelement location="${nutch.root}/src/test"/>
-    <pathelement location="${conf.dir}"/>
-    <pathelement location="${nutch.root}/build"/>
-    <path refid="classpath"/>
-  </path>
-
-  <!-- ====================================================== -->
-  <!-- Stuff needed by all targets                            -->
-  <!-- ====================================================== -->
-  <target name="init">
-    <mkdir dir="${build.dir}"/>
-    <mkdir dir="${build.classes}"/>
-    <mkdir dir="${build.test}"/>
-
-    <antcall target="init-plugin"/>
-  </target>
-
-  <!-- to be overridden by sub-projects --> 
-  <target name="init-plugin"/>
-
-  <!--
-   ! Used to build plugin compilation dependencies
-   ! (to be overridden by plugins)
-   !-->
-  <target name="deps-jar"/>
-
-  <!--
-   ! Used to deploy plugin runtime dependencies
-   ! (to be overridden by plugins)
-   !-->
-  <target name="deps-test"/>
-
-  <!-- ====================================================== -->
-  <!-- Compile the Java files                                 -->
-  <!-- ====================================================== -->
-  <target name="compile" depends="init,deps-jar">
-    <echo message="Compiling plugin: ${name}"/>
-    <javac 
-     encoding="${build.encoding}" 
-     srcdir="${src.dir}"
-     includes="**/*.java"
-     destdir="${build.classes}"
-     debug="${javac.debug}"
-     optimize="${javac.optimize}"
-     target="${javac.version}"
-     source="${javac.version}"
-     deprecation="${javac.deprecation}">
-      <classpath refid="classpath"/>
-    </javac>
-  </target>
-
-  <target name="compile-core">
-    <ant target="compile-core" inheritall="false" dir="${nutch.root}"/>
-    <ant target="compile"/>
-  </target>
-  
-  <!-- ================================================================== -->
-  <!-- Make plugin .jar                                                   -->
-  <!-- ================================================================== -->
-  <!--                                                                    -->
-  <!-- ================================================================== -->
-  <target name="jar" depends="compile">
-    <jar
-      jarfile="${build.dir}/${name}.jar"
-      basedir="${build.classes}"
-    />
-  </target>
-
-  <target name="jar-core" depends="compile-core">
-    <jar
-        jarfile="${build.dir}/${name}.jar"
-        basedir="${build.classes}"
-        />
-  </target>
-
-  <!-- ================================================================== -->
-  <!-- Deploy plugin to ${deploy.dir}                                     -->
-  <!-- ================================================================== -->
-  <!--                                                                    -->
-  <!-- ================================================================== -->
-  <target name="deploy" depends="jar, deps-test">
-    <mkdir dir="${deploy.dir}"/>
-    <copy file="plugin.xml" todir="${deploy.dir}" 
-          preservelastmodified="true"/>
-    <available property="lib-available"
-                 file="${build.dir}/${name}.jar"/>
-    <antcall target="copy-generated-lib"/>
-    <copy todir="${deploy.dir}" flatten="true">
-      <fileset refid="lib.jars"/>
-    </copy>
-  </target>
-	
-  <target name="copy-generated-lib" if="lib-available">
-    <copy file="${build.dir}/${name}.jar" todir="${deploy.dir}" failonerror="false"/>
-  </target>
-
-  <!-- ================================================================== -->
-  <!-- Compile test code                                                  --> 
-  <!-- ================================================================== -->
-  <target name="compile-test" depends="compile" if="test.available">
-    <javac 
-     encoding="${build.encoding}" 
-     srcdir="${src.test}"
-     includes="**/*.java"
-     destdir="${build.test}"
-     debug="${javac.debug}"
-     optimize="${javac.optimize}"
-     target="${javac.version}"
-     source="${javac.version}"
-     deprecation="${javac.deprecation}">
-      <classpath refid="test.classpath"/>
-    </javac>    
-  </target>
-
-  <!-- ================================================================== -->
-  <!-- Run unit tests                                                     --> 
-  <!-- ================================================================== -->
-  <target name="test" depends="compile-test, deploy" if="test.available">
-    <echo message="Testing plugin: ${name}"/>
-
-    <junit printsummary="yes" haltonfailure="no" fork="yes"
-      errorProperty="tests.failed" failureProperty="tests.failed">
-      <sysproperty key="test.data" value="${build.test}/data"/>
-      <sysproperty key="test.input" value="${root}/data"/>
-      <classpath refid="test.classpath"/>
-      <formatter type="plain" />
-      <batchtest todir="${build.test}" unless="testcase">
-        <fileset dir="${src.test}"
-                 includes="**/Test*.java" excludes="**/${test.exclude}.java" />
-      </batchtest>
-      <batchtest todir="${build.test}" if="testcase">
-        <fileset dir="${src.test}" includes="**/${testcase}.java"/>
-      </batchtest>
-    </junit>
-
-    <fail if="tests.failed">Tests failed!</fail>
-
-  </target>   
-
-  <!-- ================================================================== -->
-  <!-- Clean.  Delete the build files, and their directories              -->
-  <!-- ================================================================== -->
-  <target name="clean">
-    <delete dir="${build.dir}"/>
-    <delete dir="${deploy.dir}"/>
-  </target>
-
-</project>

Deleted: trunk/archive-access/projects/nutchwax/archive/src/plugin/build.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/plugin/build.xml	2010-02-20 03:18:57 UTC (rev 2957)
+++ trunk/archive-access/projects/nutchwax/archive/src/plugin/build.xml	2010-02-20 03:20:59 UTC (rev 2958)
@@ -1,45 +0,0 @@
-<?xml version="1.0"?>
-<!--
- Licensed to the Apache Software Foundation (ASF) under one or more
- contributor license agreements.  See the NOTICE file distributed with
- this work for additional information regarding copyright ownership.
- The ASF licenses this file to You under the Apache License, Version 2.0
- (the "License"); you may not use this file except in compliance with
- the License.  You may obtain a copy of the License at
-
-     http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
--->
-<project name="nutchwax" default="deploy-core" basedir=".">
-
-  <target name="deploy-core">
-    <ant target="compile-core" inheritall="false" dir="../../../../"/>
-    <ant target="deploy"/>
-  </target>
-
-  <!-- ====================================================== -->
-  <!-- Build & deploy all the plugin jars.                    -->
-  <!-- ====================================================== -->
-  <target name="deploy">
-    <ant dir="index-nutchwax"     target="deploy"/>
-    <ant dir="query-nutchwax"     target="deploy"/>
-    <ant dir="urlfilter-nutchwax" target="deploy"/>
-    <ant dir="scoring-nutchwax"   target="deploy"/>
-  </target>
-
-  <!-- ====================================================== -->
-  <!-- Clean all of the plugins.                              -->
-  <!-- ====================================================== -->
-  <target name="clean">
-    <ant dir="index-nutchwax"     target="clean"/>
-    <ant dir="query-nutchwax"     target="clean"/>
-    <ant dir="urlfilter-nutchwax" target="clean"/>
-    <ant dir="scoring-nutchwax"   target="clean"/>
-  </target>
-
-</project>


This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.

[Archive-access-cvs] SF.net SVN: archive-access:[2957] trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml

From: <bi...@us...> - 2010-02-20 03:19:04

Revision: 2957
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2957&view=rev
Author:   binzino
Date:     2010-02-20 03:18:57 +0000 (Sat, 20 Feb 2010)

Log Message:
-----------
WAX-73.  Change fieldcache to false.  Also added scoring-nutchwax to the plugin list even though we don't normally use it.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml

Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml	2010-02-12 20:54:15 UTC (rev 2956)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml	2010-02-20 03:18:57 UTC (rev 2957)
@@ -10,7 +10,7 @@
   <!-- Add 'index-nutchwax' and 'query-nutchwax' to plugin list. -->
   <!-- Also, add 'parse-pdf' -->
   <!-- Remove 'urlfilter-regex' and 'normalizer-(pass|regex|basic)' -->
-  <value>protocol-http|parse-(text|html|js|pdf)|index-nutchwax|query-(basic|nutchwax)|summary-basic|urlfilter-nutchwax</value>
+  <value>protocol-http|parse-(text|html|js|pdf)|index-nutchwax|query-(basic|nutchwax)|summary-basic|scoring-nutchwax|urlfilter-nutchwax</value>
 </property>
 
 <!-- 
@@ -182,7 +182,7 @@
 
 <property>
   <name>searcher.fieldcache</name>
-  <value>true</value>
+  <value>false</value>
 </property>
 
 </configuration>
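
Note on the plugin list change above: the value is treated here as a regular expression that a plugin id must match in full, which is how Nutch's plugin.includes setting is generally understood to work. The sketch below is not NutchWAX code. The class and method names (PluginIncludesCheck, included) are hypothetical; only the two expressions are copied from the diff. It shows that 'scoring-nutchwax' is only selected by the new value.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
  // Old and new plugin list values, copied verbatim from the diff.
  static final String BEFORE =
    "protocol-http|parse-(text|html|js|pdf)|index-nutchwax|query-(basic|nutchwax)|summary-basic|urlfilter-nutchwax";
  static final String AFTER =
    "protocol-http|parse-(text|html|js|pdf)|index-nutchwax|query-(basic|nutchwax)|summary-basic|scoring-nutchwax|urlfilter-nutchwax";

  // Hypothetical helper: true when the whole plugin id matches the expression.
  static boolean included(String includes, String pluginId) {
    return Pattern.compile(includes).matcher(pluginId).matches();
  }

  public static void main(String[] args) {
    System.out.println(included(BEFORE, "scoring-nutchwax")); // false
    System.out.println(included(AFTER, "scoring-nutchwax"));  // true
    System.out.println(included(AFTER, "parse-pdf"));         // true
  }
}
```

Matching the full id rather than a substring is the assumption that matters here: a partial match would let 'urlfilter-regex' slip in through the 'urlfilter-' prefix, which the comment in nutch-site.xml says should be removed.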



[Archive-access-cvs] SF.net SVN: archive-access:[2956] trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/DistributedSegmentBean.java

From: <bi...@us...> - 2010-02-12 20:54:23

Revision: 2956
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2956&view=rev
Author:   binzino
Date:     2010-02-12 20:54:15 +0000 (Fri, 12 Feb 2010)

Log Message:
-----------
Added logic to handle per-collection segments.

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/DistributedSegmentBean.java

Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/DistributedSegmentBean.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/DistributedSegmentBean.java	                        (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/DistributedSegmentBean.java	2010-02-12 20:54:15 UTC (rev 2956)
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.searcher;
+
+import java.io.IOException;
+import java.net.InetSocketAddress;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.Callable;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ConcurrentMap;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.ipc.RPC;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseText;
+
+public class DistributedSegmentBean implements SegmentBean {
+
+  private static final ExecutorService executor =
+    Executors.newCachedThreadPool();
+
+  private final ScheduledExecutorService pingService;
+
+  private class DistSummmaryTask implements Callable<Summary[]> {
+    private int id;
+
+    private HitDetails[] details;
+    private Query query;
+
+    public DistSummmaryTask(int id) {
+      this.id = id;
+    }
+
+    public Summary[] call() throws Exception {
+      if (details == null) {
+        return null;
+      }
+      return beans[id].getSummary(details, query);
+    }
+
+    public void setSummaryArgs(HitDetails[] details, Query query) {
+      this.details = details;
+      this.query = query;
+    }
+
+  }
+
+  private class SegmentWorker implements Runnable {
+    private int id;
+
+    public SegmentWorker(int id) {
+      this.id = id;
+    }
+
+    public void run()  {
+      try {
+        String[] segments = beans[id].getSegmentNames();
+        for (String segment : segments) {
+          segmentMap.put(segment, id);
+        }
+      } catch (IOException e) {
+        // remove all segments this bean was serving
+        Iterator<Map.Entry<String, Integer>> i =
+          segmentMap.entrySet().iterator();
+        while (i.hasNext()) {
+          Map.Entry<String, Integer> entry = i.next();
+          int curId = entry.getValue();
+          if (curId == this.id) {
+            i.remove();
+          }
+        }
+      }
+    }
+  }
+
+  private long timeout;
+
+  private SegmentBean[] beans;
+
+  private boolean perCollection = false;
+
+  private ConcurrentMap<String, Integer> segmentMap;
+
+  private List<Callable<Summary[]>> summaryTasks;
+
+  private List<SegmentWorker> segmentWorkers;
+
+  public DistributedSegmentBean(Configuration conf, Path serversConfig)
+  throws IOException {
+    this.timeout = conf.getLong("ipc.client.timeout", 60000);
+    this.perCollection = conf.getBoolean( "nutchwax.FetchedSegments.perCollection", false );
+
+    List<SegmentBean> beanList = new ArrayList<SegmentBean>();
+
+    List<InetSocketAddress> segmentServers =
+        NutchBean.readAddresses(serversConfig, conf);
+
+    for (InetSocketAddress addr : segmentServers) {
+      SegmentBean bean = (RPCSegmentBean) RPC.getProxy(RPCSegmentBean.class,
+          FetchedSegments.VERSION, addr, conf);
+      beanList.add(bean);
+    }
+
+    beans = beanList.toArray(new SegmentBean[beanList.size()]);
+
+    summaryTasks = new ArrayList<Callable<Summary[]>>(beans.length);
+    segmentWorkers = new ArrayList<SegmentWorker>(beans.length);
+
+    for (int i = 0; i < beans.length; i++) {
+      summaryTasks.add(new DistSummmaryTask(i));
+      segmentWorkers.add(new SegmentWorker(i));
+    }
+
+    segmentMap = new ConcurrentHashMap<String, Integer>();
+
+    pingService = Executors.newScheduledThreadPool(beans.length);
+    for (SegmentWorker worker : segmentWorkers) {
+      pingService.scheduleAtFixedRate(worker, 0, 30, TimeUnit.SECONDS);
+    }
+  }
+
+  private SegmentBean getBean(HitDetails details) {
+    String key = details.getValue( perCollection ? "collection":"segment" );
+    return beans[segmentMap.get(key)];
+  }
+
+  public String[] getSegmentNames() {
+    return segmentMap.keySet().toArray(new String[segmentMap.size()]);
+  }
+
+  public byte[] getContent(HitDetails details) throws IOException {
+    return getBean(details).getContent(details);
+  }
+
+  public long getFetchDate(HitDetails details) throws IOException {
+    return getBean(details).getFetchDate(details);
+  }
+
+  public ParseData getParseData(HitDetails details) throws IOException {
+    return getBean(details).getParseData(details);
+  }
+
+  public ParseText getParseText(HitDetails details) throws IOException {
+    return getBean(details).getParseText(details);
+  }
+
+  public void close() throws IOException {
+    executor.shutdown();
+    pingService.shutdown();
+    for (SegmentBean bean : beans) {
+      bean.close();
+    }
+  }
+
+  public Summary getSummary(HitDetails details, Query query)
+  throws IOException {
+    return getBean(details).getSummary(details, query);
+  }
+
+  @SuppressWarnings("unchecked")
+  public Summary[] getSummary(HitDetails[] detailsArr, Query query)
+  throws IOException {
+    List<HitDetails>[] detailsList = new ArrayList[summaryTasks.size()];
+    for (int i = 0; i < detailsList.length; i++) {
+      detailsList[i] = new ArrayList<HitDetails>();
+    }
+    for (HitDetails details : detailsArr) {
+      String key = details.getValue( perCollection ? "collection":"segment" );
+      detailsList[segmentMap.get(key)].add(details);
+    }
+    for (int i = 0; i < summaryTasks.size(); i++) {
+      DistSummmaryTask task = (DistSummmaryTask)summaryTasks.get(i);
+      if (detailsList[i].size() > 0) {
+        HitDetails[] taskDetails =
+          detailsList[i].toArray(new HitDetails[detailsList[i].size()]);
+        task.setSummaryArgs(taskDetails, query);
+      } else {
+        task.setSummaryArgs(null, null);
+      }
+    }
+
+    List<Future<Summary[]>> summaries;
+    try {
+       summaries =
+         executor.invokeAll(summaryTasks, timeout, TimeUnit.MILLISECONDS);
+    } catch (InterruptedException e) {
+      throw new RuntimeException(e);
+    }
+
+    List<Summary> summaryList = new ArrayList<Summary>();
+    for (Future<Summary[]> f : summaries) {
+      Summary[] summaryArray;
+      try {
+         summaryArray = f.get();
+         if (summaryArray == null) {
+           continue;
+         }
+         for (Summary summary : summaryArray) {
+           summaryList.add(summary);
+         }
+      } catch (Exception e) {
+        if (e.getCause() instanceof IOException) {
+          throw (IOException) e.getCause();
+        }
+        throw new RuntimeException(e);
+      }
+    }
+
+    return summaryList.toArray(new Summary[summaryList.size()]);
+  }
+
+}
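
Note on the per-collection logic in the file above: the routing idea can be shown in isolation, without the RPC proxies or thread pools. This is a minimal sketch under stated assumptions, not the committed implementation. SegmentRoutingSketch, register and beanIndexFor are hypothetical names, and a plain Map of field values stands in for HitDetails. The field names "collection" and "segment" and the perCollection flag come from the diff.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class SegmentRoutingSketch {
  // Maps an advertised name (segment or collection) to a server index.
  private final ConcurrentMap<String, Integer> segmentMap =
    new ConcurrentHashMap<String, Integer>();
  private final boolean perCollection;

  SegmentRoutingSketch(boolean perCollection) {
    this.perCollection = perCollection;
  }

  // Stand-in for what a periodic worker would record for one server.
  void register(String name, int beanId) {
    segmentMap.put(name, beanId);
  }

  // Returns the server index for a hit, or -1 when the hit lacks the
  // field or no server has advertised that name.
  int beanIndexFor(Map<String, String> hitFields) {
    String key = hitFields.get(perCollection ? "collection" : "segment");
    // ConcurrentHashMap rejects null keys, so guard before the lookup.
    Integer id = (key == null) ? null : segmentMap.get(key);
    return (id == null) ? -1 : id;
  }

  public static void main(String[] args) {
    SegmentRoutingSketch router = new SegmentRoutingSketch(true);
    router.register("news-2010", 1);

    Map<String, String> hit = new HashMap<String, String>();
    hit.put("segment", "20100212205415");
    hit.put("collection", "news-2010");

    System.out.println(router.beanIndexFor(hit)); // 1
  }
}
```

Returning -1 for an unknown name is a choice made for this sketch only. It makes the missing-server case explicit instead of failing on a null lookup result.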

