From: <bi...@us...> - 2010-03-18 22:43:10

Revision: 2980
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2980&view=rev
Author:   binzino
Date:     2010-03-18 22:43:04 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
Updated for NW 0.13.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt

Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2010-03-18 22:40:39 UTC (rev 2979)
+++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2010-03-18 22:43:04 UTC (rev 2980)
@@ -1,57 +1,56 @@
 
 RELEASE-NOTES.TXT
-2009-05-05
+2010-02-13
 Aaron Binns
 
-Release notes for NutchWAX 0.12.4
+Release notes for NutchWAX 0.13
 
 For the most recent updates and information on NutchWAX,
 please visit the project wiki at:
 
-  http://webteam.archive.org/confluence/display/search/NutchWAX
+  http://webarchive.jira.com/wiki/display/search/NutchWAX
 
-
 ======================================================================
 Overview
 ======================================================================
 
-NutchWAX 0.12.4 contains numerous enhancements and fixes to 0.12.3
+NutchWAX 0.13 is an update of NutchWAX code the Nutch 1.0
+release.
 
-  o Option to omit storing of content during import.
-  o Support for per-collection segments in master/slave config.
-  o Additional diagnostic/log messages to help troubleshoot common
-    deployment mistakes.
-  o PageRankDb similar to LinkDb but only keeping inlink counts.
-  o Improved paging through results, handling "paging past the end".
+This release also allows for field values to be stored in the index in
+compressed form. Simply change the field storage specification in the
+'nutchwax.filter.index' property from "true" to "compress".
+For example,
+<property>
+  <name>nutchwax.filter.index</name>
+  <value>
+    title:false:true:tokenized
+    content:false:compress:tokenized
+    ...
+  </value>
+</property>
+
+This stores the entire content field in the Lucene index, using
+compression.
+
 ======================================================================
 Issues
 ======================================================================
 
 For an up-to-date list of NutchWAX issues:
 
-  http://webteam.archive.org/jira/browse/WAX
+  http://webarchive.jira.com/browse/WAX
 
 Issues resolved in this release:
 
-WAX-27 Sensible output for requesting page of results past the end.
+WAX-74 Add support for storing fields in compressed form.
 
-WAX-34 Add option to omit storing of content in segment
+WAX-73 Change default value of searcher.fieldcache in nutch-site.xml to 'false'
 
-WAX-35 Add pagerankdb similar to linkdb but which only keeps counts
-       rather than actual inlinks.
+WAX-72 Simply build system to copy NW files into Nutch dirs and use Nutch build.xml
 
-WAX-36 Some additional diagnostics on connecting results to segments
-       and snippets would be very helpful.
+WAX-71 NutchWAX-required libraries not included in nutch-1.0.job
 
-WAX-37 Per-collection segments not supported in distributed
-       master-slave configuration.
-
-WAX-38 Build omits neessary libraries from .job file.
-
-WAX-39 Write more efficient, specialized segment parse_text merging.
-
-WAX-41 Option to enable/disable the FIELDCACHE in the Nutch IndexSearcher
-
-WAX-42 Add option to continue importing if an arcfile cannot be read.
+WAX-69 Class not found when importing within a Hadoop MR job.

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.

From: <bi...@us...> - 2010-03-18 22:40:45

Revision: 2979
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2979&view=rev
Author:   binzino
Date:     2010-03-18 22:40:39 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
WAX-74. Add support for storing field value in compressed form.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
    trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2010-03-18 22:11:53 UTC (rev 2978)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2010-03-18 22:40:39 UTC (rev 2979)
@@ -44,11 +44,10 @@
   <name>nutchwax.filter.index</name>
   <value>
     title:false:true:tokenized
-    content:false:false:tokenized
+    content:false:compress:tokenized
     site:false:false:untokenized
 
     url:false:true:tokenized
-
     digest:false:true:no
     collection:true:true:no_norms
     date:true:true:no_norms

Modified: trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java 2010-03-18 22:11:53 UTC (rev 2978)
+++ trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java 2010-03-18 22:40:39 UTC (rev 2979)
@@ -36,6 +36,7 @@
 import org.apache.nutch.indexer.NutchDocument;
 import org.apache.nutch.indexer.lucene.LuceneWriter;
 import org.apache.nutch.indexer.lucene.LuceneWriter.INDEX;
+import org.apache.nutch.indexer.lucene.LuceneWriter.STORE;
 import org.apache.nutch.metadata.Metadata;
 import org.apache.nutch.parse.Parse;
 
@@ -74,7 +75,7 @@
 
         String srcKey = spec[0];
         boolean lowerCase = true;
-        boolean store = true;
+        STORE store = STORE.YES;
         INDEX index = INDEX.TOKENIZED;
         boolean exclusive = true;
         String destKey = srcKey;
@@ -91,7 +92,10 @@
               "no_norms". equals(spec[3]) ? INDEX.NO_NORMS :
               INDEX.NO;
           case 3:
-            store = Boolean.parseBoolean( spec[2] );
+            //store = Boolean.parseBoolean( spec[2] );
+            store = "true". equals(spec[2]) ? STORE.YES :
+              "compress".equals(spec[2]) ? STORE.COMPRESS :
+              STORE.NO;
           case 2:
             lowerCase = Boolean.parseBoolean( spec[1] );
           case 1:
@@ -109,12 +113,12 @@
   {
     String srcKey;
     boolean lowerCase;
-    boolean store;
+    STORE store;
     INDEX index;
     boolean exclusive;
     String destKey;
 
-    public FieldSpecification( String srcKey, boolean lowerCase, boolean store, INDEX index, boolean exclusive, String destKey )
+    public FieldSpecification( String srcKey, boolean lowerCase, STORE store, INDEX index, boolean exclusive, String destKey )
     {
       this.srcKey = srcKey;
       this.lowerCase = lowerCase;
@@ -147,6 +151,12 @@
             try
               {
                 value = (new URL( meta.get( "url" ) ) ).getHost( );
+
+                // Strip off any "www." header.
+                if ( value.startsWith( "www." ) )
+                  {
+                    value = value.substring( 4 );
+                  }
               }
             catch ( MalformedURLException mue ) { /* Eat it */ }
           }
@@ -171,6 +181,11 @@
             int p = value.indexOf( ';' );
             if ( p >= 0 ) value = value.substring( 0, p );
           }
+        else if ( "collection".equals( spec.srcKey ) )
+          {
+            // Use value given in config first, otherwise what's in the metadata object.
+            value = conf.get( "nutchwax.index.collection", meta.get( spec.srcKey ) );
+          }
         else
           {
             value = meta.get( spec.srcKey );
@@ -188,7 +203,7 @@
             doc.removeField( spec.destKey );
           }
 
-        if ( spec.store || spec.index != INDEX.NO )
+        if ( spec.store != STORE.NO || spec.index != INDEX.NO )
           {
             doc.add( spec.destKey, value );
           }
@@ -202,13 +217,13 @@
   {
     for ( FieldSpecification spec : this.fieldSpecs )
       {
-        if ( ! spec.store && spec.index == INDEX.NO )
+        if ( spec.store == STORE.NO && spec.index == INDEX.NO )
           {
             continue ;
           }
 
         LuceneWriter.addFieldOptions( spec.destKey,
-                                      spec.store ? LuceneWriter.STORE.YES : LuceneWriter.STORE.NO,
+                                      spec.store,
                                       spec.index,
                                       conf );
       }

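The field-specification parsing touched by revision 2979 can be sketched outside of Nutch roughly as follows. This is a minimal sketch under stated assumptions, not the actual ConfigurableIndexingFilter: the class and method names are hypothetical, the `Store` and `Index` enums are local stand-ins for `LuceneWriter.STORE` and `LuceneWriter.INDEX`, and the mapping of the index strings is assumed from the values listed in BUILD-NOTES.txt (only the `no_norms` and fallback branches are visible in the diff).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not the actual ConfigurableIndexingFilter.
final class FieldSpecSketch {
    // Local stand-ins for LuceneWriter.STORE and LuceneWriter.INDEX.
    enum Store { YES, COMPRESS, NO }
    enum Index { TOKENIZED, UNTOKENIZED, NO_NORMS, NO }

    static final class Spec {
        final String srcKey;
        final boolean lowerCase;
        final Store store;
        final Index index;
        final boolean exclusive;
        final String destKey;

        Spec(String srcKey, boolean lowerCase, Store store, Index index,
             boolean exclusive, String destKey) {
            this.srcKey = srcKey;
            this.lowerCase = lowerCase;
            this.store = store;
            this.index = index;
            this.exclusive = exclusive;
            this.destKey = destKey;
        }
    }

    // Same three-way mapping as the diff: "true", "compress", anything else.
    static Store parseStore(String s) {
        return "true".equals(s) ? Store.YES
             : "compress".equals(s) ? Store.COMPRESS
             : Store.NO;
    }

    // Assumed mapping, based on the values listed in BUILD-NOTES.txt.
    static Index parseIndex(String s) {
        return "tokenized".equals(s) ? Index.TOKENIZED
             : "untokenized".equals(s) ? Index.UNTOKENIZED
             : "no_norms".equals(s) ? Index.NO_NORMS
             : Index.NO;
    }

    // Parses one "src-key:lowercase:store:index:exclusive:dest-key" entry.
    // Only src-key is required; missing parts take the documented defaults.
    static Spec parse(String entry) {
        String[] p = entry.trim().split(":");
        String srcKey = p[0];
        boolean lowerCase = p.length > 1 ? Boolean.parseBoolean(p[1]) : true;
        Store store = p.length > 2 ? parseStore(p[2]) : Store.YES;
        Index index = p.length > 3 ? parseIndex(p[3]) : Index.TOKENIZED;
        boolean exclusive = p.length > 4 ? Boolean.parseBoolean(p[4]) : true;
        String destKey = p.length > 5 ? p[5] : srcKey;
        return new Spec(srcKey, lowerCase, store, index, exclusive, destKey);
    }

    // Parses a whitespace-separated property value into a list of specs.
    static List<Spec> parseAll(String value) {
        List<Spec> specs = new ArrayList<>();
        for (String entry : value.trim().split("\\s+")) {
            if (!entry.isEmpty()) {
                specs.add(parse(entry));
            }
        }
        return specs;
    }

    public static void main(String[] args) {
        String value = "title:false:true:tokenized content:false:compress:tokenized digest:false:true:no";
        for (Spec s : parseAll(value)) {
            System.out.println(s.destKey + " store=" + s.store + " index=" + s.index);
        }
    }
}
```

The sketch uses explicit length checks where the original relies on switch fall-through; the effect on defaults should be the same, but that equivalence is an assumption about code not shown in the diff.
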
From: <bi...@us...> - 2010-03-18 22:12:05

Revision: 2978
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2978&view=rev
Author:   binzino
Date:     2010-03-18 22:11:53 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
Update for NW 0.13.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt 2010-03-18 22:10:35 UTC (rev 2977)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt 2010-03-18 22:11:53 UTC (rev 2978)
@@ -1,6 +1,6 @@
 
 HOWTO-xslt.txt
-2008-12-18
+2009-06-25
 Aaron Binns
 
 Table of Contents
@@ -128,8 +128,5 @@
 
 You can find sample 'web.xml' and 'search.xsl' files in
 
-  contrib/archive/web
-
-in the compiled Nutch package. Or in this source tree under
-
-  src/web
+  ./src/nutch/src/web/jsp/search.xsl
+  ./src/nutch/src/web/web.xml

From: <bi...@us...> - 2010-03-18 22:11:07

Revision: 2977
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2977&view=rev
Author:   binzino
Date:     2010-03-18 22:10:35 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
Updated for NW 0.13.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2010-03-18 21:55:45 UTC (rev 2976)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2010-03-18 22:10:35 UTC (rev 2977)
@@ -1,17 +1,18 @@
 
 HOWTO.txt
-2008-07-28
+2010-02-13
 Aaron Binns
 
 Table of Contents
   o Prerequisites
     - NutchWAX installation
     - ARC/WARC files
-  o Create a manifest
-  o Import, Invert and Index
-  o Search
-  o Web deployment
-    - Don't forget to config & patch again
+  o Build index
+    - Stand-alone
+    - Hadoop
+  o Search index
+    - Single server
+    - Master/slave servers
 
 ======================================================================
 Prerequisites
@@ -26,7 +27,7 @@
 
 This HOWTO assumes it is installed in
 
-  /opt/nutchwax-0.12.4
+  /opt/nutchwax-0.13
 
 2. ARC/WARC files.
 
@@ -60,32 +61,28 @@
 
 
 ======================================================================
-Import, Invert and Index
+Build Index
 ======================================================================
 
-The steps to import the files, invert the link and index the documents
-are rather simple:
+Building the index consists of two required steps with one recommended
+optional step.
 
-  $ mkdir crawl
-  $ cd crawl
-  $ /opt/nutchwax-0.12.4/bin/nutchwax import ../manifest
-  $ /opt/nutchwax-0.12.4/bin/nutch updatedb crawldb -dir segments
-  $ /opt/nutchwax-0.12.4/bin/nutch invertlinks linkdb -dir segments
-  $ /opt/nutchwax-0.12.4/bin/nutch index indexes crawldb linkdb segments/*
-  $ ls -F1
-  crawldb/
-  indexes/
-  linkdb/
-  segments/
+  1. Import
+  2. Index
+  3. Pagerank (optional)
 
-To those already familiar with Nutch, these steps should be quite
-familiar.
+Performing these steps using the 'nutchwax' command-line driver
+are rather straightforward:
 
-The first step, we call NutchWAX's "import" command which creates the
-Nutch segment containing the documents in the ARC/WARC files listed in
-the manifest. The rest is the same as regular Nutch.
+  $ /opt/nutchwax-0.13/bin/nutchwax import manifest.txt
+  $ /opt/nutchwax-0.13/bin/nutchwax index indexes segments/*
+  $ /opt/nutchwax-0.13/bin/nutchwax merge index indexes
+  $ /opt/nutchwax-0.13/bin/nutchwax pagerankdb pagerankdb segments/*
+  $ /opt/nutchwax-0.13/bin/nutchwax pageranker ranks.txt pagerankdb
+  $ /opt/nutchwax-0.13/bin/nutchwax reboost ranks.txt index
 
 
+
 ======================================================================
 Search
 ======================================================================
@@ -96,9 +93,9 @@
   $ cd ../
   $ ls -F1
   crawl/
-  $ /opt/nutchwax-0.12.4/bin/nutch org.apache.nutch.searcher.NutchBean computer
+  $ /opt/nutchwax-0.13/bin/nutchwax search computer
 
-This calls the NutchBean to execute a simple keyword search for
+This calls the NutchWaxBean to execute a simple keyword search for
 "computer". Use whatever query term you think appears in the
 documents you imported.
 
@@ -109,7 +106,7 @@
 
 The Nutch(WAX) web application is bundled with NutchWAX as
 
-  /opt/nutchwax-0.12.4/nutch-1.0-dev.war
+  /opt/nutchwax-0.13/nutch-1.0-dev.war
 
 Simply deploy that web application in the same fashion as with Nutch.

From: <bi...@us...> - 2010-03-18 21:55:58

Revision: 2976
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2976&view=rev
Author:   binzino
Date:     2010-03-18 21:55:45 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
Updated for NW 0.13.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt 2010-03-18 21:51:55 UTC (rev 2975)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt 2010-03-18 21:55:45 UTC (rev 2976)
@@ -1,6 +1,6 @@
 
 HOWTO-pagerank.txt
-2008-12-18
+2010-02-13
 Aaron Binns
 
 Table of Contents
@@ -30,22 +30,20 @@
 simplistic "page rank" information for scoring and sorting documents
 in the full-text search index.
 
-Nutch's 'invertlinks' step inverts links and stores them in the
-'linkdb' directory. We use these inlinks to boost the Lucene score of
-documents in proportion to the number of inlinks.
+NutchWAX's 'pagerankdb' command inverts and counts links to a page,
+storing the counts in a directory named 'pagerankdb'. This
+information is then used to update the boost values in the Lucene
+index in proportion to number of inlinks to each document.
 
 
 ======================================================================
 Generate PageRank
 ======================================================================
 
-After the Nutch 'invertlinks' step is performed, run the NutchWAX
-'pagerank' command to extract inlink information from the 'linkdb'
-
 For example
 
-  $ nutch invertlinks linkdb -dir segments
-  $ nutchwax pagerank pagerank.txt linkdb
+  $ nutchwax pagerankdb prdb -dir segments
+  $ nutchwax pagerank pagerank.txt prdb
 
 The resulting "pagerank.txt" file is a simple text file containing a
 count of the number of inlinks followed by the URL.

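The "pagerank.txt" layout described in HOWTO-pagerank.txt (an inlink count followed by the URL) can be read with a few lines of Java. This is only an illustrative sketch: the class name is hypothetical, and the assumption that the count and the URL are separated by whitespace, one pair per line, is mine; the HOWTO does not state the separator.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader for a "pagerank.txt"-style file; not part of NutchWAX.
final class PageRankFileSketch {

    // Parses lines of the assumed form "<inlink-count> <url>".
    // Blank lines and lines without both parts are skipped.
    static Map<String, Integer> parse(String text) {
        Map<String, Integer> inlinks = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // Split only at the first run of whitespace: count, then URL.
            String[] parts = trimmed.split("\\s+", 2);
            if (parts.length < 2) {
                continue;
            }
            inlinks.put(parts[1], Integer.parseInt(parts[0]));
        }
        return inlinks;
    }

    public static void main(String[] args) {
        String sample = "12 http://example.org/\n3 http://example.org/about\n";
        for (Map.Entry<String, Integer> e : parse(sample).entrySet()) {
            System.out.println(e.getValue() + " inlinks -> " + e.getKey());
        }
    }
}
```

How the counts are turned into Lucene boost values is not shown in this message, so the sketch stops at parsing.
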
From: <bi...@us...> - 2010-03-18 21:52:09

Revision: 2975
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2975&view=rev
Author:   binzino
Date:     2010-03-18 21:51:55 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
Updated for NW 0.13.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/INSTALL.txt

Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2010-03-18 19:27:14 UTC (rev 2974)
+++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2010-03-18 21:51:55 UTC (rev 2975)
@@ -1,38 +1,39 @@
 
 INSTALL.txt
-2009-03-08
+2010-02-13
 Aaron Binns
 
 Table of Contents
   o Introduction
   o Build from source
-    SVN: Nutch 1.0-dev
+    - SVN: Nutch 1.0
     - SVN: NutchWAX
     - Build and Install
   o Install binary package
-  o Install start-up scripts
 
 
 ======================================================================
 Introduction
 ======================================================================
 
-This installation guide assumes the reader is already familiar with
-building, packaging and deploying Nutch 1.0-dev.
+This installation gues assumes the reader is not familiar with Nutch
+and is looking for step-by-step instructions on building and
+installing NutchWAX.
 
-The NutchWAX 0.12 source and build system are designed to integrate
-into the existing Nutch 1.0-dev source and build.
 
-The long-term goal is for the NutchWAX components to be fully
-integrated into mainline Nutch. As a stepping-stone toward that goal,
-we have packaged the NutchWAX source to be dropped into the Nutch
-"contrib" directory and built from there.
+======================================================================
+Build from Source
+======================================================================
 
-Like Nutch, NutchWAX 0.12 uses a simple 'ant' build script. The
-NutchWAX build script calls out to the Nutch script to build Nutch
-proper, then builds NutchWAX components and integrates them into the
-Nutch build directory.
+The NutchWAX source is packaged as a 'contrib' package for Nutch.
+To build from source, you must checkout both the Nutch and
+NutchWAX sources.
 
+Like Nutch, NutchWAX uses a simple 'ant' build script. The NutchWAX
+build script calls out to the Nutch script to build the Nutch
+components, then builds the NutchWAX components and integrates them
+into the Nutch build directory.
+
 In order to build NutchWAX, execute all build commands from the
 NutchWAX directory. This way, NutchWAX will ensure that any and all
 dependencies in Nutch will be properly built and kept up-to-date.
@@ -46,130 +47,64 @@
   o tar
   o clean
 
-Again, the idea is that if you're already used to building Nutch, you
-can easily transition to building Nutch and NutchWAX together. All of
-the build artifacts will still be placed in Nutch's 'build'
-sub-directory as normal.
 
+SVN: nutch-1.0
+--------------
+NutchWAX 0.13 is built against Nutch-1.0.
 
-======================================================================
-Build from Source
-======================================================================
-
-To build from source, you must check-out the Nutch and NutchWAX sources
-from their respective 'subversion' source control servers.
-
-SVN: nutch-1.0-dev
------------------
-As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
-Nutch doesn't have a 1.0 release package yet, so we have to use the
-Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.4 is
-built against is:
-
-  701524
-
 To checkout this revision of Nutch, use:
 
-  $ svn checkout -r 701524 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
+  $ svn checkout http://svn.apache.org/repos/asf/lucene/nutch/tags/release-1.0 nutch
   $ cd nutch
 
-Please be sure to check-out this specific version of the Nutch source.
+Please be sure to check-out this specific release of the Nutch source.
 If you just grab the head of the trunk, there may be newer and
-incompatible changed to Nutch.
+incompatible changes to Nutch.
 
 
 SVN: NutchWAX
 -------------
-Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.4
+Once you have Nutch-1.0 checked-out, check-out the NutchWAX 0.13
 source into Nutch's "contrib" directory.
 
   $ cd contrib
-  $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_4/archive
+  $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_13/archive
 
 This will create a sub-directory named "archive" containing the
-NutchWAX 0.12.4 sources.
+NutchWAX 0.13 sources.
 
 
 Build and install
 -----------------
-Assuming you already have the required tool-set for building Nutch,
-building NutchWAX is a snap.
+Simply execute the same 'ant' build command in the NutchWAX
+source tree
 
-Simply execute the same 'ant' build command in
-
-  nutch/contrib/archive
-
-as you normally would and everything will build as normal.
-
-For example
-
   $ cd nutch/contrib/archive
   $ ant tar
 
 This command will build all of Nutch, then the NutchWAX add-ons and
-finally will package everything up into the "nutch-1.0-dev.tar.gz"
-release package.
+finally will package everything up into the "nutch-1.0.tar.gz" release
+package, which is placed in the Nutch 'build' subdir:
 
-Then, install the "nutch-1.0-dev.tar.gz" tarball as normal. For
+  # Assuming we are still in nutch/contrib/archive
+  $ ls ../../build/nutch-1.0.tar.gz
+  ../../build/nutch-1.0.tar.gz
+
+Then, install the "nutch-1.0.tar.gz" tarball as normal. For
 example:
 
   $ cd /opt
-  $ tar xvfz nutch-1.0-dev.tar.gz
-  $ mv nutch-1.0-dev nutchwax-0.12.4
+  $ tar xvfz nutch-1.0.tar.gz
+  $ mv nutch-1.0 nutchwax-0.13
 
 
 ======================================================================
 Install binary package
 ======================================================================
 
-Alternatively, grab a "binary" release package from the Internet
-Archive's NutchWAX home page.
+Alternatively, grab a pre-compiled (binary) release package from the
+Internet Archive's NutchWAX home page.
 
 Install it simply by untarring it, for example:
 
   $ cd /opt
-  $ tar xvfz nutchwax-0.12.4.tar.gz
+  $ tar xvfz nutchwax-0.13.tar.gz
 
-
-======================================================================
-Install start-up scripts
-======================================================================
-
-NutchWAX 0.12.4 comes with a Unix init.d script which can be used to
-automatically start the searcher slaves for a multi-node search
-configuration.
-
-Assuming you installed NutchWAX as
-
-  /opt/nutchwax-0.12.4
-
-the script is found at
-
-  /opt/nutchwax-0.12.4/contrib/archive/etc/init.d/searcher-slave
-
-This script can be placed in /etc/init.d then added to the list of
-startup scripts to run at bootup by using commands appropriate to your
-Linux distribution.
-
-You must edit a few of the environment variables defined in the
-'searcher-slave' specifying where NutchWAX is installed and where the
-index(s) are deployed. In 'searcher-slave' you will find the:
-
-  export NUTCH_HOME=TODO
-  export DEPLOYMENT_DIR=TODO
-
-edit those appropriately for your system.
-
-
-The "master" in the multi-node search deployment is the NutchWAX
-webapp running in a webapp server, such as Tomcat or Jetty.
-
-Jetty comes with a start/stop script appropriate for use as an init.d
-script, similar to the 'searcher-slave' script described above. If you
-use Jetty, create a symlink
-
-  /etc/init.d/jetty.sh -> /opt/jetty/bin/jetty.sh
-
-Then add this script to the list of startup scripts to run at bootup
-by using commands appropriate to your Linux distribution.
-
-Follow the instructions from Jetty on the deployment of the NutchWAX
-webapp (nutch-1.0-dev.war) in the Jetty web application server.

From: <bi...@us...> - 2010-03-18 19:27:20

Revision: 2974
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2974&view=rev
Author:   binzino
Date:     2010-03-18 19:27:14 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
Updated for NW 0.13.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/README.txt

Modified: trunk/archive-access/projects/nutchwax/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README.txt 2010-03-18 19:26:44 UTC (rev 2973)
+++ trunk/archive-access/projects/nutchwax/archive/README.txt 2010-03-18 19:27:14 UTC (rev 2974)
@@ -1,6 +1,6 @@
 
 README.txt
-2009-05-05
+2010-02-13
 Aaron Binns
 
 Table of Contents
@@ -13,7 +13,7 @@
 Introduction
 ======================================================================
 
-Welcome to NutchWAX 0.12.4!
+Welcome to NutchWAX 0.13!
 
 NutchWAX is a set of add-ons to Nutch in order to index and search
 archived web data.
@@ -24,10 +24,7 @@
 
 The name "NutchWAX" stands for "Nutch (W)eb (A)rchive e(X)tensions".
 
-Since NutchWAX is a set of add-ons to Nutch, you should already be
-familiar with Nutch before using NutchWAX.
 
-
 The goal of NutchWAX is to enable full-text indexing and searching of
 documents stored in web archive file formats (ARC and WARC).
 

From: <bi...@us...> - 2010-03-18 19:26:54
|
Revision: 2973 http://archive-access.svn.sourceforge.net/archive-access/?rev=2973&view=rev Author: binzino Date: 2010-03-18 19:26:44 +0000 (Thu, 18 Mar 2010) Log Message: ----------- Updated to match NW 0.13. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt Modified: trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt 2010-03-16 21:37:14 UTC (rev 2972) +++ trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt 2010-03-18 19:26:44 UTC (rev 2973) @@ -1,6 +1,6 @@ BUILD-NOTES.txt -2008-12-18 +2010-02-13 Aaron Binns ====================================================================== @@ -13,15 +13,15 @@ ====================================================================== -This 0.12.x release of NutchWAX is radically different in source-code +This 0.13 release of NutchWAX is radically different in source-code form compared to the previous release, 0.10. -One of the design goals of 0.12.x was to reduce or even eliminate the +One of the design goals of 0.13 was to reduce or even eliminate the "copy/paste/edit" approach of 0.10. The 0.10 (and prior) NutchWAX releases had to copy/paste/edit large chunks of Nutch source code in order to add the NutchWAX features. -Also, the NutchWAX 0.12.x sources and build are designed to one day be +Also, the NutchWAX 0.13 sources and build are designed to one day be added into mainline Nutch as a proper "contrib" package; then eventually be fully integrated into the core Nutch source code. @@ -77,47 +77,7 @@ to the Nutch source and configuration files. ---------------------------------------------------------------------- -The file - /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml - -contains two errors: one where a mimetype is referenced before it is -defined; and a second where a definition has an illegal character. 
- -These errors cause Nutch to not recognize certain mimetypes and -therefore will ignore documents matching those mimetypes. - -There are two fixes: - - 1. Move - - <mime-type type="application/xml"> - <alias type="text/xml" /> - <glob pattern="*.xml" /> - </mime-type> - - definition higher up in the file, before the reference to it. - - 2. Remove - - <mime-type type="application/x-ms-dos-executable"> - <alias type="application/x-dosexec;exe" /> - </mime-type> - - as the ';' character is illegal according to the comments in the - Nutch code. - -You can either apply these patches yourself, or copy an already-patched -copy from: - - /opt/nutchwax-0.12.4/contrib/archive/conf/tika-mimetypes.xml - -to - - /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml - ----------------------------------------------------------------------- - In the file 'conf/nutch-site.xml' we define some properties to over-ride the values in 'conf/nutch-default.xml'. @@ -130,27 +90,37 @@ to - protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax + protocol-http|parse-(text|html|js|pdf)|index-nutchwax|query-(basic|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax In short, we add: - index-nutchwax - query-nutchwax - urlfilter-nutchwax - parse-pdf + parse-pdf + index-nutchwax + query-nutchwax + urlfilter-nutchwax and remove: - urlfilter-regex - urlnormalizer-(pass|regex|basic) + index-basic + index-anchor + query-site + query-url + urlfilter-regex + urlnormalizer-(pass|regex|basic) -The only *required* changes are the additions of the NutchWAX index -and query plugins. The rest are optional, but recommended. The "parse-pdf" plugin is added simply because we have lots of PDFs in our archives and we want to index them. We sometimes remove the "parse-js" plugin if we don't care to index JavaScript files. 
+The Nutch index-basic and index-anchor filters are removed and +replaced with the NutchWAX index-nutchwax filter. Similarly, we +remove the Nutch query-site and query-url filters, replacing them with +the single NutchWAX query-nutchwax filter. By using the configurable +NutchWAX filters for indexing and querying, we get more powerful and +consistent behavior across metadata fields. Note that we do retain +the Nutch query-basic filter however. + We also remove the default Nutch URL filtering and normalizing plugins because we do not need the URLs normalized nor filtered. We trust that the tool that produced the ARC/WARC file will have normalized the @@ -166,6 +136,14 @@ -------------------------------------------------- indexingfilter.order -------------------------------------------------- +If we use the indexing filters as specified in the previous section, +then this property can remain unset. However, if you choose to use +the Nutch index-basic filter, then you *must* specify the order in +which the filters will be used. If you don't then the filters will be +applied in a random order (per Nutch's design) and since one may +over-write the values of another you won't know what values will +result. In that case, you need to specify the order. + Add this property with a value of org.apache.nutch.indexer.basic.BasicIndexingFilter @@ -174,8 +152,6 @@ So that the NutchWAX indexing filter is run after the Nutch basic indexing filter. -A full explanation is given in "README-dedup.txt". 
-
 --------------------------------------------------
 mime.type.magic
 --------------------------------------------------
@@ -205,37 +181,44 @@
 The specifications here are of the form:

-  src-key:lowercase:store:tokenize:exclusive:dest-key
+  src-key:lowercase:store:index:exclusive:dest-key

 where the only required part is the "src-key", the rest will assume
 the following defaults:

   lowercase = true
   store = true
-  tokenize = false
+  index = tokenized
   exclusive = true
   dest-key = src-key

+For the 'index' property, the possible values are:
+  tokenized
+  untokenized
+  no_norms
+  no
+
+corresponding to the Lucene options of the same names.
+
 We recommend:

 <property>
   <name>nutchwax.filter.index</name>
   <value>
-  url:false:true:true
-  url:false:true:false:true:exacturl
-  orig:false
-  digest:false
-  filename:false
-  fileoffset:false
-  collection
-  date
-  type
-  length
+  title:false:true:tokenized
+  content:false:false:tokenized
+  site:false:false:untokenized
+
+  url:false:true:tokenized
+  digest:false:true:no
+
+  collection:true:true:no_norms
+  date:true:true:no_norms
+  type:true:true:no_norms
+  length:false:true:no
   </value>
 </property>

-The "url", "orig" and "digest" values are required, the rest are
-optional, but strongly recommended.

 --------------------------------------------------
 nutchwax.filter.query
@@ -274,15 +257,10 @@
 <property>
   <name>nutchwax.filter.query</name>
   <value>
-  raw:digest:false
-  raw:filename:false
-  raw:fileoffset:false
-  raw:exacturl:false
   group:collection
+  group:site:false
   group:type
-  field:anchor
   field:content
-  field:host
   field:title
   </value>
 </property>
@@ -428,3 +406,31 @@
   <value>false</value>
 </property>

+
+--------------------------------------------------
+searcher.fieldcache
+--------------------------------------------------
+
+NutchWAX contains a patch controlling the use of a "fieldcache" in the
+Nutch searcher. Without this patch Nutch will read the entire set of
+hostnames from the index into an in-memory cache. This cache is then
+consulted when performing de-duplication of results per the
+"hitsPerSite" feature.
+
+For small-to-medium indexes, this can improve performance as the
+de-duplication information is entirely in memory and no disk access is
+required.
+
+However, for large indexes, in the tens of gigabytes in size, reading
+the entire set of hostnames into an in-memory cache can exhaust the
+Java heap. In this case, omitting the cache all together and just
+reading the values off disk as needed is better.
+
+The NutchWAX patch controls the use of this cache based on this property
+value. If set to false, then the cache is not used at all.
+
+<property>
+  <name>searcher.fieldcache</name>
+  <value>false</value>
+</property>
+

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
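The field specification format described in the release notes above (src-key:lowercase:store:index:exclusive:dest-key, with only "src-key" required) can be illustrated with a small parser. This is only a sketch of how the documented defaults could be applied; the class and method names (FilterSpecSketch, FieldSpec, parseSpec, parseAll) are hypothetical and are not taken from the NutchWAX source, whose actual parsing code may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of NutchWAX. */
final class FilterSpecSketch
{
  /** Parsed form of one "src-key:lowercase:store:index:exclusive:dest-key" entry. */
  static final class FieldSpec
  {
    final String  srcKey;
    final boolean lowercase;
    final String  store;     // "true", "false" or "compress" per the release notes
    final String  index;     // tokenized, untokenized, no_norms or no
    final boolean exclusive;
    final String  destKey;

    FieldSpec( String srcKey, boolean lowercase, String store,
               String index, boolean exclusive, String destKey )
    {
      this.srcKey    = srcKey;
      this.lowercase = lowercase;
      this.store     = store;
      this.index     = index;
      this.exclusive = exclusive;
      this.destKey   = destKey;
    }
  }

  /** Returns parts[i] if present and non-empty, otherwise the default. */
  static String part( String[] parts, int i, String defaultValue )
  {
    return ( i < parts.length && parts[i].length( ) > 0 ) ? parts[i] : defaultValue;
  }

  /** Parses one entry, applying the defaults listed in the release notes. */
  static FieldSpec parseSpec( String spec )
  {
    String[] parts = spec.trim( ).split( ":" );
    if ( parts.length == 0 || parts[0].length( ) == 0 )
    {
      throw new IllegalArgumentException( "src-key is required: " + spec );
    }

    String srcKey = parts[0];
    return new FieldSpec( srcKey,
                          Boolean.parseBoolean( part( parts, 1, "true" ) ),
                          part( parts, 2, "true" ),
                          part( parts, 3, "tokenized" ),
                          Boolean.parseBoolean( part( parts, 4, "true" ) ),
                          part( parts, 5, srcKey ) );
  }

  /** Parses a whitespace-separated property value into a list of entries. */
  static List<FieldSpec> parseAll( String value )
  {
    List<FieldSpec> specs = new ArrayList<FieldSpec>( );
    for ( String token : value.trim( ).split( "\\s+" ) )
    {
      if ( token.length( ) > 0 ) specs.add( parseSpec( token ) );
    }
    return specs;
  }

  public static void main( String[] args )
  {
    String value = "title:false:true:tokenized\n site:false:false:untokenized\n collection";
    for ( FieldSpec s : parseAll( value ) )
    {
      System.out.println( s.srcKey + " store=" + s.store + " index=" + s.index + " dest=" + s.destKey );
    }
  }
}
```

Keeping "store" as a string rather than a boolean leaves room for the "compress" value that this release introduces alongside "true" and "false".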
From: <bi...@us...> - 2010-03-16 21:37:21
|
Revision: 2972
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2972&view=rev
Author:   binzino
Date:     2010-03-16 21:37:14 +0000 (Tue, 16 Mar 2010)

Log Message:
-----------
Removed unnecessary libraries.

Removed Paths:
-------------
    trunk/archive-access/projects/nutchwax/archive/lib/jdom.LICENSE
    trunk/archive-access/projects/nutchwax/archive/lib/jdom.jar

Deleted: trunk/archive-access/projects/nutchwax/archive/lib/jdom.LICENSE
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/lib/jdom.LICENSE 2010-03-16 21:28:15 UTC (rev 2971)
+++ trunk/archive-access/projects/nutchwax/archive/lib/jdom.LICENSE 2010-03-16 21:37:14 UTC (rev 2972)
@@ -1,56 +0,0 @@
-/*--
-
- $Id: LICENSE.txt,v 1.11 2004/02/06 09:32:57 jhunter Exp $
-
- Copyright (C) 2000-2004 Jason Hunter & Brett McLaughlin.
- All rights reserved.
-
- Redistribution and use in source and binary forms, with or without
- modification, are permitted provided that the following conditions
- are met:
-
- 1. Redistributions of source code must retain the above copyright
- notice, this list of conditions, and the following disclaimer.
-
- 2. Redistributions in binary form must reproduce the above copyright
- notice, this list of conditions, and the disclaimer that follows
- these conditions in the documentation and/or other materials
- provided with the distribution.
-
- 3. The name "JDOM" must not be used to endorse or promote products
- derived from this software without prior written permission. For
- written permission, please contact <request_AT_jdom_DOT_org>.
-
- 4. Products derived from this software may not be called "JDOM", nor
- may "JDOM" appear in their name, without prior written permission
- from the JDOM Project Management <request_AT_jdom_DOT_org>.
-
- In addition, we request (but do not require) that you include in the
- end-user documentation provided with the redistribution and/or in the
- software itself an acknowledgement equivalent to the following:
- "This product includes software developed by the
- JDOM Project (http://www.jdom.org/)."
- Alternatively, the acknowledgment may be graphical using the logos
- available at http://www.jdom.org/images/logos.
-
- THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
- WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
- OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- DISCLAIMED. IN NO EVENT SHALL THE JDOM AUTHORS OR THE PROJECT
- CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
- LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
- USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
- OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
- OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- SUCH DAMAGE.
-
- This software consists of voluntary contributions made by many
- individuals on behalf of the JDOM Project and was originally
- created by Jason Hunter <jhunter_AT_jdom_DOT_org> and
- Brett McLaughlin <brett_AT_jdom_DOT_org>. For more information
- on the JDOM Project, please see <http://www.jdom.org/>.
-
- */
-

Deleted: trunk/archive-access/projects/nutchwax/archive/lib/jdom.jar
===================================================================
(Binary files differ)
From: <bi...@us...> - 2010-03-16 21:28:28
|
Revision: 2971 http://archive-access.svn.sourceforge.net/archive-access/?rev=2971&view=rev Author: binzino Date: 2010-03-16 21:28:15 +0000 (Tue, 16 Mar 2010) Log Message: ----------- Removed from this release. Might make a re-appearance in a future release. Removed Paths: ------------- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java Deleted: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java 2010-02-23 00:50:11 UTC (rev 2970) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java 2010-03-16 21:28:15 UTC (rev 2971) @@ -1,355 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.archive.nutchwax; - -import java.io.IOException; -import java.io.BufferedReader; -import java.io.InputStreamReader; -import java.io.FileInputStream; -import java.util.Comparator; -import java.util.Collections; -import java.util.List; -import java.util.ArrayList; -import java.util.LinkedList; - -import org.apache.commons.logging.Log; -import org.apache.commons.logging.LogFactory; - -import org.jdom.Document; -import org.jdom.Element; -import org.jdom.Namespace; -import org.jdom.output.XMLOutputter; - - -/** - * - */ -public class OpenSearchMaster -{ - public static final Log LOG = LogFactory.getLog( OpenSearchMaster.class ); - - List<OpenSearchSlave> slaves = new ArrayList<OpenSearchSlave>( ); - long timeout = 0; - - public OpenSearchMaster( String slavesFile, long timeout ) - throws IOException - { - this( slavesFile ); - this.timeout = timeout; - } - - public OpenSearchMaster( String slavesFile ) - throws IOException - { - BufferedReader r = null; - try - { - r = new BufferedReader( new InputStreamReader( new FileInputStream( slavesFile ), "utf-8" ) ); - - String line; - while ( (line = r.readLine()) != null ) - { - line = line.trim(); - if ( line.length() == 0 || line.charAt( 0 ) == '#' ) - { - // Ignore it. 
- continue ; - } - - OpenSearchSlave slave = new OpenSearchSlave( line ); - - this.slaves.add( slave ); - } - } - finally - { - try { if ( r != null ) r.close(); } catch ( IOException ioe ) { } - } - - } - - public Document query( String query, int startIndex, int numResults, int hitsPerSite ) - { - long startTime = System.currentTimeMillis( ); - - List<SlaveQueryThread> slaveThreads = new ArrayList<SlaveQueryThread>( this.slaves.size() ); - - for ( OpenSearchSlave slave : this.slaves ) - { - SlaveQueryThread sqt = new SlaveQueryThread( slave, query, 0, (startIndex+numResults), hitsPerSite ); - - sqt.start( ); - - slaveThreads.add( sqt ); - } - - waitForThreads( slaveThreads, this.timeout ); - - LinkedList<Element> items = new LinkedList<Element>( ); - long totalResults = 0; - - for ( SlaveQueryThread sqt : slaveThreads ) - { - if ( sqt.throwable != null ) - { - continue ; - } - - try - { - // Dump all the results ("item" elements) into a single list. - Element channel = sqt.response.getRootElement( ).getChild( "channel" ); - items.addAll( (List<Element>) channel.getChildren( "item" ) ); - channel.removeChildren( "item" ); - - totalResults += Integer.parseInt( channel.getChild( "totalResults", Namespace.getNamespace( "http://a9.com/-/spec/opensearchrss/1.0/" ) ).getTextTrim( ) ); - } - catch ( Exception e ) - { - LOG.error( "Error processing response from slave: " + sqt.slave, e ); - } - - } - - if ( items.size( ) > 0 && hitsPerSite > 0 ) - { - Collections.sort( items, new ElementSiteThenScoreComparator( ) ); - - LinkedList<Element> collapsed = new LinkedList<Element>( ); - - collapsed.add( items.removeFirst( ) ); - - int count = 1; - for ( Element item : items ) - { - String lastSite = collapsed.getLast( ).getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ).getTextTrim( ); - - if ( lastSite.length( ) == 0 || - !lastSite.equals( item.getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) 
).getTextTrim( ) ) ) - { - collapsed.add( item ); - count = 1; - } - else if ( count < hitsPerSite ) - { - collapsed.add( item ); - count++; - } - } - - // Replace the list of items with the collapsed list. - items = collapsed; - } - - Collections.sort( items, new ElementScoreComparator( ) ); - - // Build the final results OpenSearch XML document. - Element channel = new Element( "channel" ); - channel.addContent( new Element( "title" ) ); - channel.addContent( new Element( "description" ) ); - channel.addContent( new Element( "link" ) ); - - Element eTotalResults = new Element( "totalResults", Namespace.getNamespace( "http://a9.com/-/spec/opensearchrss/1.0/" ) ); - Element eStartIndex = new Element( "startIndex", Namespace.getNamespace( "http://a9.com/-/spec/opensearchrss/1.0/" ) ); - Element eItemsPerPage = new Element( "itemsPerPage", Namespace.getNamespace( "http://a9.com/-/spec/opensearchrss/1.0/" ) ); - - eTotalResults.setText( Long.toString( totalResults ) ); - eStartIndex. setText( Long.toString( startIndex ) ); - eItemsPerPage.setText( Long.toString( numResults ) ); - - channel.addContent( eTotalResults ); - channel.addContent( eStartIndex ); - channel.addContent( eItemsPerPage ); - - // Get a sub-list of only the items we want: [startIndex,(startIndex+numResults)] - List<Element> subList = items.subList( Math.min( startIndex, items.size( ) ), - Math.min( (startIndex+numResults), items.size( ) ) ); - channel.addContent( subList ); - - Element rss = new Element( "rss" ); - rss.addContent( channel ); - - return new Document( rss ); - } - - - /** - * Convenience method to wait for a collection of threads to complete, - * or until a timeout after a startTime expires. 
- */ - private void waitForThreads( List<SlaveQueryThread> threads, long timeout ) - { - for ( Thread t : threads ) - { - try - { - t.join( timeout ); - } - catch ( InterruptedException ie ) - { - break; - } - } - } - - - public static void main( String args[] ) - throws Exception - { - String usage = "OpenSearchMaster [OPTIONS] SLAVES.txt query" - + "\n\t-h <n> Hits per site" - + "\n\t-n <n> Number of results" - + "\n\t-s <n> Start index" - + "\n"; - - if ( args.length < 2 ) - { - System.err.println( usage ); - System.exit( 1 ); - } - - String slavesFile = args[args.length - 2]; - String query = args[args.length - 1]; - - int startIndex = 0; - int hitsPerSite = 0; - int numHits = 10; - for ( int i = 0 ; i < args.length - 2 ; i++ ) - { - try - { - if ( "-h".equals( args[i] ) ) - { - i++; - hitsPerSite = Integer.parseInt( args[i] ); - } - if ( "-n".equals( args[i] ) ) - { - i++; - numHits = Integer.parseInt( args[i] ); - } - if ( "-s".equals( args[i] ) ) - { - i++; - startIndex = Integer.parseInt( args[i] ); - } - } - catch ( NumberFormatException nfe ) - { - System.err.println( "Error: not a numeric value: " + args[i] ); - System.err.println( usage ); - System.exit( 1 ); - } - } - - OpenSearchMaster master = new OpenSearchMaster( slavesFile ); - - Document doc = master.query( query, startIndex, numHits, hitsPerSite ); - - (new XMLOutputter()).output( doc, System.out ); - } - -} - - -class SlaveQueryThread extends Thread -{ - OpenSearchSlave slave; - - String query; - int startIndex; - int numResults; - int hitsPerSite; - - Document response; - Throwable throwable; - - - SlaveQueryThread( OpenSearchSlave slave, String query, int startIndex, int numResults, int hitsPerSite ) - { - this.slave = slave; - this.query = query; - this.startIndex = startIndex; - this.numResults = numResults; - this.hitsPerSite = hitsPerSite; - } - - public void run( ) - { - try - { - this.response = this.slave.query( this.query, this.startIndex, this.numResults, this.hitsPerSite ); - } - 
catch ( Throwable t ) - { - this.throwable = t; - } - } -} - - -class ElementScoreComparator implements Comparator<Element> -{ - public int compare( Element e1, Element e2 ) - { - if ( e1 == e2 ) return 0; - if ( e1 == null ) return 1; - if ( e2 == null ) return -1; - - Element score1 = e1.getChild( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ); - Element score2 = e2.getChild( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ); - - if ( score1 == score2 ) return 0; - if ( score1 == null ) return 1; - if ( score2 == null ) return -1; - - String text1 = score1.getText().trim(); - String text2 = score2.getText().trim(); - - float value1 = 0.0f; - float value2 = 0.0f; - - try { value1 = Float.parseFloat( text1 ); } catch ( NumberFormatException nfe ) { } - try { value2 = Float.parseFloat( text2 ); } catch ( NumberFormatException nfe ) { } - - if ( value1 == value2 ) return 0; - - return value1 > value2 ? -1 : 1; - } -} - -class ElementSiteThenScoreComparator extends ElementScoreComparator -{ - public int compare( Element e1, Element e2 ) - { - if ( e1 == e2 ) return 0; - if ( e1 == null ) return 1; - if ( e2 == null ) return -1; - - String site1 = e1.getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ).getTextTrim(); - String site2 = e2.getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ).getTextTrim(); - - if ( site1.equals( site2 ) ) - { - // Sites are equal, then compare scores. 
- return super.compare( e1, e2 ); - } - - return site1.compareTo( site2 ); - } -} \ No newline at end of file Deleted: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java 2010-02-23 00:50:11 UTC (rev 2970) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java 2010-03-16 21:28:15 UTC (rev 2971) @@ -1,148 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.archive.nutchwax; - -import java.io.BufferedReader; -import java.io.FileInputStream; -import java.io.IOException; -import java.io.InputStreamReader; -import java.util.ArrayList; -import java.util.List; -import java.util.Map; -import javax.servlet.ServletConfig; -import javax.servlet.ServletException; -import javax.servlet.http.HttpServlet; -import javax.servlet.http.HttpServletRequest; -import javax.servlet.http.HttpServletResponse; - -import org.jdom.Document; -import org.jdom.Element; -import org.jdom.Namespace; -import org.jdom.output.XMLOutputter; - -/** - * - */ -public class OpenSearchMasterServlet extends HttpServlet -{ - OpenSearchMaster master; - - int hitsPerSite = 0; - - public void init( ServletConfig config ) - throws ServletException - { - String slavesFile = config.getInitParameter( "slaves" ); - - if ( slavesFile == null || slavesFile.trim().length() == 0 ) - { - throw new ServletException( "Required init parameter missing: slaves" ); - } - - int timeout = getInteger( config.getInitParameter( "timeout" ), 0 ); - int hitsPerSite = getInteger( config.getInitParameter( "hitsPerSite" ), 0 ); - - try - { - this.master = new OpenSearchMaster( slavesFile, timeout ); - } - catch ( IOException ioe ) - { - throw new ServletException( ioe ); - } - - } - - public void destroy( ) - { - - } - - public void doGet( HttpServletRequest request, HttpServletResponse response ) - throws ServletException, IOException - { - long responseTime = System.nanoTime( ); - - request.setCharacterEncoding( "UTF-8" ); - - String query = getString ( request.getParameter( "query" ), "" ); - int startIndex = getInteger( request.getParameter( "start" ), 0 ); - int numHits = getInteger( request.getParameter( "hitsPerPage" ), 10 ); - int hitsPerSite = getInteger( request.getParameter( "hitsPerSite" ), this.hitsPerSite ); - - Document doc = this.master.query( query, startIndex, numHits, hitsPerSite ); - - Element eUrlParams = new Element( "urlParams", 
Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ); - - for ( Map.Entry<String,String[]> e : ((Map<String,String[]>) request.getParameterMap( )).entrySet( ) ) - { - String key = e.getKey( ); - for ( String value : e.getValue( ) ) - { - Element eParam = new Element( "param", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ); - eParam.setAttribute( "name", key ); - eParam.setAttribute( "value", value ); - eUrlParams.addContent( eParam ); - } - } - - doc.getRootElement( ).getChild( "channel" ).addContent( eUrlParams ); - - (new XMLOutputter()).output( doc, response.getOutputStream( ) ); - } - - String getString ( String value, String defaultValue ) - { - if ( value != null ) - { - value = value.trim(); - - if ( value.length( ) != 0 ) - { - return value; - } - } - - return defaultValue; - } - - int getInteger( String value, int defaultValue ) - { - if ( value != null ) - { - value = value.trim(); - - if ( value.length( ) != 0 ) - { - try - { - int i = Integer.parseInt( value ); - - return i; - } - catch ( NumberFormatException nfe ) - { - // TODO: log? - } - } - } - - return defaultValue; - } - -} Deleted: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java 2010-02-23 00:50:11 UTC (rev 2970) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java 2010-03-16 21:28:15 UTC (rev 2971) @@ -1,218 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. 
- * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.archive.nutchwax; - -import java.io.IOException; -import java.io.InputStream; -import java.io.UnsupportedEncodingException; -import java.net.HttpURLConnection; -import java.net.MalformedURLException; -import java.net.URL; -import java.net.URLConnection; -import java.net.URLEncoder; -import java.util.List; - -import org.apache.commons.logging.Log; -import org.apache.commons.logging.LogFactory; - -import org.jdom.Document; -import org.jdom.Element; -import org.jdom.Namespace; -import org.jdom.input.SAXBuilder; -import org.jdom.output.XMLOutputter; - -/** - * - */ -public class OpenSearchSlave -{ - public static final Log LOG = LogFactory.getLog( OpenSearchSlave.class ); - - private String urlTemplate; - - public OpenSearchSlave( String urlTemplate ) - { - this.urlTemplate = urlTemplate; - } - - public Document query( String query, int startIndex, int requestedNumResults, int hitsPerSite ) - throws Exception - { - URL url = buildRequestUrl( query, startIndex, requestedNumResults, hitsPerSite ); - - InputStream is = null; - try - { - LOG.info( "Querying slave: " + url ); - - is = getInputStream( url ); - - Document doc = (new SAXBuilder()).build( is ); - - doc = validate( doc ); - - return doc; - } - catch ( Exception e ) - { - LOG.error( url.toString(), e ); - throw e; - } - finally - { - // Ensure the InputStream is closed, which should trigger the - // underlying HTTP 
connection to be cleaned-up. - try { if ( is != null ) is.close( ); } catch ( IOException ioe ) { } // Not much we can do - } - } - - private Document validate( Document doc ) - throws Exception - { - if ( doc.getRootElement( ) == null ) throw new Exception( "Invalid OpenSearch response: missing /rss" ); - Element root = doc.getRootElement( ); - - if ( ! "rss".equals( root.getName( ) ) ) throw new Exception( "Invalid OpenSearch response: missing /rss" ); - Element channel = root.getChild( "channel" ); - - if ( channel == null ) throw new Exception( "Invalid OpenSearch response: missing /rss/channel" ); - - for ( Element item : (List<Element>) channel.getChildren( "item" ) ) - { - Element site = item.getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ); - if ( site == null ) - { - item.addContent( new Element( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ) ); - } - - Element score = item.getChild( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ); - if ( score == null ) - { - item.addContent( new Element( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ) ); - } - } - - return doc; - } - - /** - * - */ - public URL buildRequestUrl( String query, int startIndex, int requestedNumResults, int hitsPerSite ) - throws MalformedURLException, UnsupportedEncodingException - { - String url = this.urlTemplate; - - // Note about replaceAll: In the Java regex library, the replacement string has a few - // special characters: \ and $. Forunately, since we URL-encode the replacement string, - // any occurance of \ or $ is converted to %xy form. So we don't have to worry about it. 
:) - url = url.replaceAll( "[{]searchTerms[}]", URLEncoder.encode( query, "utf-8" ) ); - url = url.replaceAll( "[{]count[}]" , String.valueOf( requestedNumResults ) ); - url = url.replaceAll( "[{]startIndex[}]" , String.valueOf( startIndex ) ); - url = url.replaceAll( "[{]hitsPerSite[}]", String.valueOf( hitsPerSite ) ); - - // We don't know about any optional parameters, so we remove them (per the OpenSearch spec.) - url = url.replaceAll( "[{][^}]+[?][}]", "" ); - - return new URL( url ); - } - - - public InputStream getInputStream( URL url ) - throws IOException - { - URLConnection connection = url.openConnection( ); - connection.setDoOutput( false ); - connection.setRequestProperty( "User-Agent", "Mozilla/4.0 (compatible; NutchWAX OpenSearchMaster)" ); - connection.connect( ); - - if ( connection instanceof HttpURLConnection ) - { - HttpURLConnection hc = (HttpURLConnection) connection; - - switch ( hc.getResponseCode( ) ) - { - case 200: - // All good. - break; - default: - // Problems! Bail out. 
- throw new IOException( "HTTP error from " + url + ": " + hc.getResponseMessage( ) ); - } - } - - InputStream is = connection.getInputStream( ); - - return is; - } - - public String toString() - { - return this.urlTemplate; - } - - public static void main( String args[] ) - throws Exception - { - String usage = "OpenSearchSlave [OPTIONS] urlTemplate query" - + "\n\t-h <n> Hits per site" - + "\n\t-n <n> Number of results" - + "\n"; - - if ( args.length < 2 ) - { - System.err.println( usage ); - System.exit( 1 ); - } - - String urlTemplate = args[args.length - 2]; - String query = args[args.length - 1]; - - int hitsPerSite = 0; - int numHits = 10; - for ( int i = 0 ; i < args.length - 2 ; i++ ) - { - try - { - if ( "-h".equals( args[i] ) ) - { - i++; - hitsPerSite = Integer.parseInt( args[i] ); - } - if ( "-n".equals( args[i] ) ) - { - i++; - numHits = Integer.parseInt( args[i] ); - } - } - catch ( NumberFormatException nfe ) - { - System.err.println( "Error: not a numeric value: " + args[i] ); - System.err.println( usage ); - System.exit( 1 ); - } - } - - OpenSearchSlave osl = new OpenSearchSlave( urlTemplate ); - - Document doc = osl.query( query, 0, numHits, hitsPerSite ); - - (new XMLOutputter()).output( doc, System.out ); - } - -} This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2010-02-23 00:50:21
|
Revision: 2969
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2969&view=rev
Author:   binzino
Date:     2010-02-23 00:25:39 +0000 (Tue, 23 Feb 2010)

Log Message:
-----------
Simplified addition of empty <score/> element if there is no score.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java 2010-02-22 22:39:00 UTC (rev 2968)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java 2010-02-23 00:25:39 UTC (rev 2969)
@@ -91,10 +91,7 @@
         Element score = item.getChild( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) );
         if ( score == null )
           {
-            score = new Element( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) );
-            score.setText( "" );
-
-            item.addContent( score );
+            item.addContent( new Element( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ) );
           }
       }

@@ -206,4 +203,4 @@
     (new XMLOutputter()).output( doc, System.out );
   }

-}
\ No newline at end of file
+}
From: <bi...@us...> - 2010-02-23 00:50:17
|
Revision: 2970
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2970&view=rev
Author:   binzino
Date:     2010-02-23 00:50:11 +0000 (Tue, 23 Feb 2010)

Log Message:
-----------
Additional logging, especially for error conditions.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java 2010-02-23 00:25:39 UTC (rev 2969)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java 2010-02-23 00:50:11 UTC (rev 2970)
@@ -27,6 +27,9 @@
 import java.util.ArrayList;
 import java.util.LinkedList;

+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+
 import org.jdom.Document;
 import org.jdom.Element;
 import org.jdom.Namespace;
@@ -38,8 +41,10 @@
  */
 public class OpenSearchMaster
 {
+  public static final Log LOG = LogFactory.getLog( OpenSearchMaster.class );
+
   List<OpenSearchSlave> slaves = new ArrayList<OpenSearchSlave>( );
-  long timeout = 30 * 1000;
+  long timeout = 0;

   public OpenSearchMaster( String slavesFile, long timeout )
     throws IOException
@@ -102,22 +107,21 @@
       {
         if ( sqt.throwable != null )
           {
-            // TODO: Handle problems with slaves
             continue ;
           }

-        // Dump all the results ("item" elements) into a single list.
-        Element channel = sqt.response.getRootElement( ).getChild( "channel" );
-        items.addAll( (List<Element>) channel.getChildren( "item" ) );
-        channel.removeChildren( "item" );
-
         try
           {
+            // Dump all the results ("item" elements) into a single list.
+            Element channel = sqt.response.getRootElement( ).getChild( "channel" );
+            items.addAll( (List<Element>) channel.getChildren( "item" ) );
+            channel.removeChildren( "item" );
+
             totalResults += Integer.parseInt( channel.getChild( "totalResults", Namespace.getNamespace( "http://a9.com/-/spec/opensearchrss/1.0/" ) ).getTextTrim( ) );
           }
         catch ( Exception e )
           {
-            // TODO: Log error getting total.
+            LOG.error( "Error processing response from slave: " + sqt.slave, e );
           }

       }
@@ -146,10 +150,6 @@
                 collapsed.add( item );
                 count++;
               }
-            else
-              {
-                // TODO: Log collapse of item.
-              }
           }

         // Replace the list of items with the collapsed list.

Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java 2010-02-23 00:25:39 UTC (rev 2969)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java 2010-02-23 00:50:11 UTC (rev 2970)
@@ -27,6 +27,9 @@
 import java.net.URLEncoder;
 import java.util.List;

+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+
 import org.jdom.Document;
 import org.jdom.Element;
 import org.jdom.Namespace;
@@ -38,6 +41,8 @@
  */
 public class OpenSearchSlave
 {
+  public static final Log LOG = LogFactory.getLog( OpenSearchSlave.class );
+
   private String urlTemplate;

   public OpenSearchSlave( String urlTemplate )
@@ -53,6 +58,8 @@
     InputStream is = null;
     try
       {
+        LOG.info( "Querying slave: " + url );
+
         is = getInputStream( url );

         Document doc = (new SAXBuilder()).build( is );
@@ -61,6 +68,11 @@

         return doc;
       }
+    catch ( Exception e )
+      {
+        LOG.error( url.toString(), e );
+        throw e;
+      }
     finally
       {
         // Ensure the InputStream is closed, which should trigger the
From: <bi...@us...> - 2010-02-22 22:39:06
|
Revision: 2968
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2968&view=rev
Author:   binzino
Date:     2010-02-22 22:39:00 +0000 (Mon, 22 Feb 2010)

Log Message:
-----------
Add jdom.jar to .war file.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/nutch/build.xml

Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/build.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/build.xml 2010-02-22 22:28:00 UTC (rev 2967)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/build.xml 2010-02-22 22:39:00 UTC (rev 2968)
@@ -193,6 +193,7 @@
         <include name="commons-lang-*.jar"/>
         <include name="commons-logging-*.jar"/>
         <include name="log4j-*.jar"/>
+        <include name="jdom*.jar"/>
       </lib>
       <lib dir="${build.dir}">
         <include name="${final.name}.jar"/>
From: <bi...@us...> - 2010-02-22 22:28:06
Revision: 2967
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2967&view=rev
Author:   binzino
Date:     2010-02-22 22:28:00 +0000 (Mon, 22 Feb 2010)

Log Message:
-----------
Initial revision.

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/slaves.txt

Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/slaves.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/slaves.txt (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/slaves.txt 2010-02-22 22:28:00 UTC (rev 2967)
@@ -0,0 +1 @@
+http://localhost:8080/nw/opensearch?query={searchTerms}&start={startIndex}&hitsPerPage={count}&hitsPerSite={hitsPerSite}
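
The slaves.txt entry above is a URL template with four placeholders. As a rough illustration of how such a template could be expanded into a request URL, here is a minimal sketch. The class name SlaveUrlTemplateSketch and the method expand() are hypothetical and are not the project's buildRequestUrl(); only the placeholder names come from the template itself. It uses String.replace, which treats the replacement literally, and URL-encodes the query.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical sketch: expands a slaves.txt-style URL template.
// Not the NutchWAX implementation; names here are assumptions.
class SlaveUrlTemplateSketch
{
  static String expand( String template, String query, int start, int count, int hitsPerSite )
  {
    String encoded;
    try
      {
        // Encode the user query so '&', '=', spaces, etc. cannot break the URL.
        encoded = URLEncoder.encode( query, "UTF-8" );
      }
    catch ( UnsupportedEncodingException uee )
      {
        // UTF-8 is required to be supported by every JVM.
        throw new IllegalStateException( uee );
      }

    // String.replace() substitutes literally (no regex), so the
    // replacement text needs no extra escaping.
    return template
      .replace( "{searchTerms}", encoded )
      .replace( "{startIndex}",  Integer.toString( start ) )
      .replace( "{count}",       Integer.toString( count ) )
      .replace( "{hitsPerSite}", Integer.toString( hitsPerSite ) );
  }

  public static void main( String[] args )
  {
    String template = "http://localhost:8080/nw/opensearch?query={searchTerms}"
                    + "&start={startIndex}&hitsPerPage={count}&hitsPerSite={hitsPerSite}";

    System.out.println( expand( template, "web archive", 0, 10, 2 ) );
  }
}
```

With the template from slaves.txt, a query of "web archive" would be sent as query=web+archive, since URLEncoder applies form encoding.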
From: <bi...@us...> - 2010-02-22 22:27:44
Revision: 2966
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2966&view=rev
Author:   binzino
Date:     2010-02-22 22:27:37 +0000 (Mon, 22 Feb 2010)

Log Message:
-----------
Added configuration of OpenSearchMasterServlet.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml

Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml 2010-02-22 22:25:58 UTC (rev 2965)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml 2010-02-22 22:27:37 UTC (rev 2966)
@@ -20,31 +20,25 @@
 -->
 <web-app>
 
-<!-- order is very important here -->
-
 <listener>
   <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
 </listener>
-<listener>
-  <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class>
-</listener>
 
 <servlet>
-  <servlet-name>Cached</servlet-name>
-  <servlet-class>org.apache.nutch.servlet.Cached</servlet-class>
+  <servlet-name>OpenSearch</servlet-name>
+  <servlet-class>org.archive.nutchwax.OpenSearchServlet</servlet-class>
 </servlet>
 
 <servlet>
-  <servlet-name>OpenSearch</servlet-name>
-  <servlet-class>org.archive.nutchwax.OpenSearchServlet</servlet-class>
+  <servlet-name>OpenSearchMaster</servlet-name>
+  <servlet-class>org.archive.nutchwax.OpenSearchMasterServlet</servlet-class>
+  <init-param>
+    <param-name>slaves</param-name>
+    <param-value>webapps/nw/slaves.txt</param-value>
+  </init-param>
 </servlet>
 
 <servlet-mapping>
-  <servlet-name>Cached</servlet-name>
-  <url-pattern>/servlet/cached</url-pattern>
-</servlet-mapping>
-
-<servlet-mapping>
   <servlet-name>OpenSearch</servlet-name>
   <url-pattern>/opensearch</url-pattern>
 </servlet-mapping>
@@ -54,12 +48,22 @@
   <url-pattern>/search</url-pattern>
 </servlet-mapping>
 
+<servlet-mapping>
+  <servlet-name>OpenSearchMaster</servlet-name>
+  <url-pattern>/mopensearch</url-pattern>
+</servlet-mapping>
+
+<servlet-mapping>
+  <servlet-name>OpenSearchMaster</servlet-name>
+  <url-pattern>/msearch</url-pattern>
+</servlet-mapping>
+
 <filter>
   <filter-name>XSLT Filter</filter-name>
   <filter-class>org.archive.nutchwax.XSLTFilter</filter-class>
   <init-param>
     <param-name>xsltUrl</param-name>
-    <param-value>webapps/nutchwax-0.12.4/search.xsl</param-value>
+    <param-value>webapps/nw/search.xsl</param-value>
   </init-param>
 </filter>
 
@@ -68,6 +72,11 @@
   <url-pattern>/search</url-pattern>
 </filter-mapping>
 
+<filter-mapping>
+  <filter-name>XSLT Filter</filter-name>
+  <url-pattern>/msearch</url-pattern>
+</filter-mapping>
+
 <welcome-file-list>
   <welcome-file>search.html</welcome-file>
   <welcome-file>index.html</welcome-file>
From: <bi...@us...> - 2010-02-22 22:26:05
Revision: 2965 http://archive-access.svn.sourceforge.net/archive-access/?rev=2965&view=rev Author: binzino Date: 2010-02-22 22:25:58 +0000 (Mon, 22 Feb 2010) Log Message: ----------- Initial fully functional revision. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java 2010-02-22 22:20:45 UTC (rev 2964) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java 2010-02-22 22:25:58 UTC (rev 2965) @@ -17,36 +17,132 @@ package org.archive.nutchwax; +import java.io.BufferedReader; +import java.io.FileInputStream; import java.io.IOException; -import java.io.BufferedReader; import java.io.InputStreamReader; -import java.io.FileInputStream; +import java.util.ArrayList; import java.util.List; -import java.util.ArrayList; +import java.util.Map; +import javax.servlet.ServletConfig; import javax.servlet.ServletException; -import javax.servlet.ServletConfig; import javax.servlet.http.HttpServlet; import javax.servlet.http.HttpServletRequest; import javax.servlet.http.HttpServletResponse; +import org.jdom.Document; +import org.jdom.Element; +import org.jdom.Namespace; +import org.jdom.output.XMLOutputter; /** * */ public class OpenSearchMasterServlet extends HttpServlet { + OpenSearchMaster master; + + int hitsPerSite = 0; public void init( ServletConfig config ) throws ServletException { + String slavesFile = config.getInitParameter( "slaves" ); + + if ( slavesFile == null || slavesFile.trim().length() == 0 ) + { + throw new ServletException( "Required init parameter missing: slaves" ); + } + + int timeout = getInteger( config.getInitParameter( "timeout" 
), 0 ); + int hitsPerSite = getInteger( config.getInitParameter( "hitsPerSite" ), 0 ); + + try + { + this.master = new OpenSearchMaster( slavesFile, timeout ); + } + catch ( IOException ioe ) + { + throw new ServletException( ioe ); + } + } + + public void destroy( ) + { } public void doGet( HttpServletRequest request, HttpServletResponse response ) throws ServletException, IOException { + long responseTime = System.nanoTime( ); + request.setCharacterEncoding( "UTF-8" ); + + String query = getString ( request.getParameter( "query" ), "" ); + int startIndex = getInteger( request.getParameter( "start" ), 0 ); + int numHits = getInteger( request.getParameter( "hitsPerPage" ), 10 ); + int hitsPerSite = getInteger( request.getParameter( "hitsPerSite" ), this.hitsPerSite ); + + Document doc = this.master.query( query, startIndex, numHits, hitsPerSite ); + + Element eUrlParams = new Element( "urlParams", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ); + + for ( Map.Entry<String,String[]> e : ((Map<String,String[]>) request.getParameterMap( )).entrySet( ) ) + { + String key = e.getKey( ); + for ( String value : e.getValue( ) ) + { + Element eParam = new Element( "param", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ); + eParam.setAttribute( "name", key ); + eParam.setAttribute( "value", value ); + eUrlParams.addContent( eParam ); + } + } + + doc.getRootElement( ).getChild( "channel" ).addContent( eUrlParams ); + + (new XMLOutputter()).output( doc, response.getOutputStream( ) ); } + String getString ( String value, String defaultValue ) + { + if ( value != null ) + { + value = value.trim(); + + if ( value.length( ) != 0 ) + { + return value; + } + } + + return defaultValue; + } + + int getInteger( String value, int defaultValue ) + { + if ( value != null ) + { + value = value.trim(); + + if ( value.length( ) != 0 ) + { + try + { + int i = Integer.parseInt( value ); + + return i; + } + catch ( NumberFormatException nfe ) + { + // 
TODO: log? + } + } + } + + return defaultValue; + } + } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
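
One detail in the init() shown in this revision may be worth a second look: the line "int hitsPerSite = getInteger( ... )" declares a local variable with the same name as the hitsPerSite field, so, as far as this diff shows, the field would keep its default of 0. The following is a minimal sketch of that Java scoping behavior, not the servlet itself; the class InitShadowingSketch and its methods are hypothetical stand-ins.

```java
// Hypothetical sketch of local-variable shadowing; not the servlet code.
class InitShadowingSketch
{
  int hitsPerSite = 0;

  // Mirrors the pattern in the diff: the 'int' keyword declares a new
  // local that hides the field, so the field is never assigned.
  void initShadowed( String param )
  {
    int hitsPerSite = parse( param, 0 );
  }

  // Assigning through 'this' updates the field as presumably intended.
  void initAssigned( String param )
  {
    this.hitsPerSite = parse( param, 0 );
  }

  // Simplified stand-in for the servlet's getInteger() helper.
  static int parse( String value, int defaultValue )
  {
    if ( value == null ) return defaultValue;
    try
      {
        return Integer.parseInt( value.trim( ) );
      }
    catch ( NumberFormatException nfe )
      {
        return defaultValue;
      }
  }

  public static void main( String[] args )
  {
    InitShadowingSketch s = new InitShadowingSketch( );

    s.initShadowed( "3" );
    System.out.println( "after shadowed init: " + s.hitsPerSite );

    s.initAssigned( "3" );
    System.out.println( "after assigned init: " + s.hitsPerSite );
  }
}
```

Whether the servlet's behavior matters in practice depends on whether a hitsPerSite init parameter is ever configured; the web.xml in this series only sets slaves.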
From: <bi...@us...> - 2010-02-22 22:20:51
Revision: 2964
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2964&view=rev
Author:   binzino
Date:     2010-02-22 22:20:45 +0000 (Mon, 22 Feb 2010)

Log Message:
-----------
Added use of namespace when processing 'score' elements.
Fixed timeout handling to allow for unlimited timeout.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java 2010-02-22 22:19:42 UTC (rev 2963)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java 2010-02-22 22:20:45 UTC (rev 2964)
@@ -93,7 +93,7 @@
         slaveThreads.add( sqt );
       }
 
-    waitForThreads( slaveThreads, this.timeout, startTime );
+    waitForThreads( slaveThreads, this.timeout );
 
     LinkedList<Element> items = new LinkedList<Element>( );
     long totalResults = 0;
@@ -192,22 +192,13 @@
    * Convenience method to wait for a collection of threads to complete,
    * or until a timeout after a startTime expires.
    */
-  private void waitForThreads( List<SlaveQueryThread> threads, long timeout, long startTime )
+  private void waitForThreads( List<SlaveQueryThread> threads, long timeout )
   {
     for ( Thread t : threads )
       {
-        long timeRemaining = timeout - (System.currentTimeMillis( ) - startTime);
-
-        // If we are out of time, don't wait for any more threads.
-        if ( timeRemaining <= 0 )
-          {
-            break;
-          }
-
-        // Otherwise, wait for the next unfinished thread to finish.
         try
           {
-            t.join( timeRemaining );
+            t.join( timeout );
           }
         catch ( InterruptedException ie )
           {
@@ -320,8 +311,8 @@
     if ( e1 == null ) return 1;
     if ( e2 == null ) return -1;
 
-    Element score1 = e1.getChild( "score" );
-    Element score2 = e2.getChild( "score" );
+    Element score1 = e1.getChild( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) );
+    Element score2 = e2.getChild( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) );
 
     if ( score1 == score2 ) return 0;
     if ( score1 == null ) return 1;
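
The revised waitForThreads() passes the timeout straight to Thread.join(), and join(0) waits indefinitely, which is what allows an unlimited timeout. A side effect is that the timeout now applies to each thread in turn rather than to the batch as a whole. The sketch below shows one possible alternative that keeps a single overall deadline while still treating 0 as "no limit". It is an illustration only, not the project's code; JoinWithDeadlineSketch and joinAll() are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical alternative to waitForThreads(); not the NutchWAX code.
class JoinWithDeadlineSketch
{
  /**
   * Waits for the given threads. A timeout <= 0 means wait without limit;
   * otherwise all threads share one deadline measured from the call.
   * Returns the number of threads that had finished when waiting stopped.
   */
  static int joinAll( List<? extends Thread> threads, long timeoutMillis )
  {
    long deadline = System.currentTimeMillis( ) + timeoutMillis;

    for ( Thread t : threads )
      {
        try
          {
            if ( timeoutMillis <= 0 )
              {
                t.join( );               // Unlimited wait.
              }
            else
              {
                long remaining = deadline - System.currentTimeMillis( );
                if ( remaining <= 0 ) break;   // Out of time for the batch.
                t.join( remaining );
              }
          }
        catch ( InterruptedException ie )
          {
            // Preserve the interrupt for callers and stop waiting.
            Thread.currentThread( ).interrupt( );
            break;
          }
      }

    int finished = 0;
    for ( Thread t : threads )
      {
        if ( ! t.isAlive( ) ) finished++;
      }
    return finished;
  }

  public static void main( String[] args )
  {
    List<Thread> threads = new ArrayList<Thread>( );
    for ( int i = 0 ; i < 3 ; i++ )
      {
        Thread t = new Thread( );   // A Thread with no target returns immediately.
        t.start( );
        threads.add( t );
      }

    System.out.println( "finished: " + joinAll( threads, 0 ) );
  }
}
```

The trade-off is the one the commit log hints at: a shared deadline bounds total latency, while per-thread join( timeout ) is simpler and naturally supports 0 as unlimited.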
From: <bi...@us...> - 2010-02-22 22:19:48
Revision: 2963
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2963&view=rev
Author:   binzino
Date:     2010-02-22 22:19:42 +0000 (Mon, 22 Feb 2010)

Log Message:
-----------
Removed extra 'nutch:' prefix from urlParams and param elements in output.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java 2010-02-22 05:18:57 UTC (rev 2962)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java 2010-02-22 22:19:42 UTC (rev 2963)
@@ -201,7 +201,7 @@
       addNode(doc, channel, "nutch", "responseTime", Double.toString( ((long) responseTime / 1000 / 1000 ) / 1000.0 ) );
 
       // Add a <nutch:urlParams> element containing a list of all the URL parameters.
-      Element urlParams = doc.createElementNS( NS_MAP.get("nutch"), "nutch:urlParams" );
+      Element urlParams = doc.createElementNS( NS_MAP.get("nutch"), "urlParams" );
       channel.appendChild( urlParams );
 
       for ( Map.Entry<String,String[]> e : ((Map<String,String[]>) request.getParameterMap( )).entrySet( ) )
@@ -209,7 +209,7 @@
           String key = e.getKey( );
           for ( String value : e.getValue( ) )
             {
-              Element urlParam = doc.createElementNS(NS_MAP.get("nutch"), "nutch:param" );
+              Element urlParam = doc.createElementNS(NS_MAP.get("nutch"), "param" );
               addAttribute( doc, urlParam, "name", key );
               addAttribute( doc, urlParam, "value", value );
               urlParams.appendChild(urlParam);
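
For readers less familiar with the W3C DOM API used here: the second argument of createElementNS() is a qualified name, and its prefix is optional. The element belongs to the given namespace either way; only the stored prefix and node name differ. The following self-contained sketch, using only the JDK's DOM classes, illustrates this. The class ElementNsSketch is hypothetical, and the sketch does not reproduce how the servlet serializes its output.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch of createElementNS() with and without a prefix.
class ElementNsSketch
{
  static final String NUTCH_NS = "http://www.nutch.org/opensearchrss/1.0/";

  // Creates a detached element in the Nutch namespace using the given
  // qualified name, e.g. "nutch:urlParams" or "urlParams".
  static Element create( String qualifiedName )
  {
    try
      {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance( );
        factory.setNamespaceAware( true );
        Document doc = factory.newDocumentBuilder( ).newDocument( );

        return doc.createElementNS( NUTCH_NS, qualifiedName );
      }
    catch ( ParserConfigurationException pce )
      {
        throw new IllegalStateException( pce );
      }
  }

  public static void main( String[] args )
  {
    for ( String name : new String[] { "nutch:urlParams", "urlParams" } )
      {
        Element e = create( name );
        System.out.println( "nodeName="   + e.getNodeName( )
                          + " prefix="    + e.getPrefix( )
                          + " localName=" + e.getLocalName( )
                          + " ns="        + e.getNamespaceURI( ) );
      }
  }
}
```

Both elements report the same namespace URI and local name, so namespace-aware consumers should treat them identically; how the prefix appears in the serialized XML depends on the serializer in use.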
From: <bi...@us...> - 2010-02-22 05:19:04
Revision: 2962
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2962&view=rev
Author:   binzino
Date:     2010-02-22 05:18:57 +0000 (Mon, 22 Feb 2010)

Log Message:
-----------
Added result score to OpenSearch output.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java 2010-02-22 05:18:00 UTC (rev 2961)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java 2010-02-22 05:18:57 UTC (rev 2962)
@@ -46,6 +46,7 @@
 import org.apache.nutch.searcher.NutchBean;
 import org.apache.nutch.searcher.Query;
 import org.apache.nutch.searcher.Summary;
+import org.apache.hadoop.io.FloatWritable;
 
 /**
  * Present search results using A9's OpenSearch extensions to RSS,
@@ -183,9 +184,8 @@
 
       Element rss = addNode(doc, doc, "rss");
       addAttribute(doc, rss, "version", "2.0");
-      addAttribute(doc, rss, "xmlns:opensearch",
-                   NS_MAP.get("opensearch"));
-      addAttribute(doc, rss, "xmlns:nutch", NS_MAP.get("nutch"));
+      addAttribute(doc, rss, "xmlns:opensearch", NS_MAP.get("opensearch"));
+      addAttribute(doc, rss, "xmlns:nutch",      NS_MAP.get("nutch"));
 
       Element channel = addNode(doc, rss, "channel");
 
@@ -201,7 +201,7 @@
       addNode(doc, channel, "nutch", "responseTime", Double.toString( ((long) responseTime / 1000 / 1000 ) / 1000.0 ) );
 
       // Add a <nutch:urlParams> element containing a list of all the URL parameters.
-      Element urlParams = doc.createElementNS(NS_MAP.get("nutch"), "nutch:urlParams" );
+      Element urlParams = doc.createElementNS( NS_MAP.get("nutch"), "nutch:urlParams" );
       channel.appendChild( urlParams );
 
       for ( Map.Entry<String,String[]> e : ((Map<String,String[]>) request.getParameterMap( )).entrySet( ) )
@@ -219,9 +219,9 @@
       for (int i = 0; i < length; i++) {
         Hit hit = show[i];
         HitDetails detail = details[i];
+        String score = Float.toString( ((FloatWritable)hit.getSortValue( )).get() );
         String title = detail.getValue("title");
-        String url = detail.getValue("url");
-        String id = "idx=" + hit.getIndexNo() + "&id=" + hit.getUniqueKey();
+        String url   = detail.getValue("url");
 
         if (title == null || title.equals("")) {      // use url for docs w/o title
           title = url;
@@ -229,6 +229,7 @@
 
         Element item = addNode(doc, channel, "item");
 
+        addNode(doc, item, "nutch", "score", score );
         addNode(doc, item, "title", title);
         if (summaries[i] != null) {
           addNode(doc, item, "description", summaries[i].toString() );
From: <bi...@us...> - 2010-02-22 05:18:07
Revision: 2961
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2961&view=rev
Author:   binzino
Date:     2010-02-22 05:18:00 +0000 (Mon, 22 Feb 2010)

Log Message:
-----------
Added result score to output in main().

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/NutchBean.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/NutchBean.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/NutchBean.java 2010-02-22 05:17:20 UTC (rev 2960)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/NutchBean.java 2010-02-22 05:18:00 UTC (rev 2961)
@@ -33,6 +33,7 @@
 import org.apache.nutch.parse.*;
 import org.apache.nutch.crawl.Inlinks;
 import org.apache.nutch.util.NutchConfiguration;
+import org.apache.hadoop.io.FloatWritable;
 
 /**
  * One stop shopping for search-related functionality.
@@ -443,6 +444,8 @@
       {
         System.out.println( " "
                             + i
+                            + " "
+                            + Float.toString( ((FloatWritable) show[i].getSortValue( )).get() )
                             + " "
                             + java.util.Arrays.asList( details[i].getValues( "segment" ) )
                             + " "
From: <bi...@us...> - 2010-02-22 05:17:29
Revision: 2960 http://archive-access.svn.sourceforge.net/archive-access/?rev=2960&view=rev Author: binzino Date: 2010-02-22 05:17:20 +0000 (Mon, 22 Feb 2010) Log Message: ----------- Initial revision of OpenSearch master/slave system. Work-in-progress. Added Paths: ----------- trunk/archive-access/projects/nutchwax/archive/lib/jdom.LICENSE trunk/archive-access/projects/nutchwax/archive/lib/jdom.jar trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java Added: trunk/archive-access/projects/nutchwax/archive/lib/jdom.LICENSE =================================================================== --- trunk/archive-access/projects/nutchwax/archive/lib/jdom.LICENSE (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/lib/jdom.LICENSE 2010-02-22 05:17:20 UTC (rev 2960) @@ -0,0 +1,56 @@ +/*-- + + $Id: LICENSE.txt,v 1.11 2004/02/06 09:32:57 jhunter Exp $ + + Copyright (C) 2000-2004 Jason Hunter & Brett McLaughlin. + All rights reserved. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions + are met: + + 1. Redistributions of source code must retain the above copyright + notice, this list of conditions, and the following disclaimer. + + 2. Redistributions in binary form must reproduce the above copyright + notice, this list of conditions, and the disclaimer that follows + these conditions in the documentation and/or other materials + provided with the distribution. + + 3. The name "JDOM" must not be used to endorse or promote products + derived from this software without prior written permission. For + written permission, please contact <request_AT_jdom_DOT_org>. + + 4. 
Products derived from this software may not be called "JDOM", nor + may "JDOM" appear in their name, without prior written permission + from the JDOM Project Management <request_AT_jdom_DOT_org>. + + In addition, we request (but do not require) that you include in the + end-user documentation provided with the redistribution and/or in the + software itself an acknowledgement equivalent to the following: + "This product includes software developed by the + JDOM Project (http://www.jdom.org/)." + Alternatively, the acknowledgment may be graphical using the logos + available at http://www.jdom.org/images/logos. + + THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED + WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + DISCLAIMED. IN NO EVENT SHALL THE JDOM AUTHORS OR THE PROJECT + CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF + USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT + OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + SUCH DAMAGE. + + This software consists of voluntary contributions made by many + individuals on behalf of the JDOM Project and was originally + created by Jason Hunter <jhunter_AT_jdom_DOT_org> and + Brett McLaughlin <brett_AT_jdom_DOT_org>. For more information + on the JDOM Project, please see <http://www.jdom.org/>. 
+ + */ + Added: trunk/archive-access/projects/nutchwax/archive/lib/jdom.jar =================================================================== (Binary files differ) Property changes on: trunk/archive-access/projects/nutchwax/archive/lib/jdom.jar ___________________________________________________________________ Added: svn:mime-type + application/octet-stream Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMaster.java 2010-02-22 05:17:20 UTC (rev 2960) @@ -0,0 +1,364 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.archive.nutchwax; + +import java.io.IOException; +import java.io.BufferedReader; +import java.io.InputStreamReader; +import java.io.FileInputStream; +import java.util.Comparator; +import java.util.Collections; +import java.util.List; +import java.util.ArrayList; +import java.util.LinkedList; + +import org.jdom.Document; +import org.jdom.Element; +import org.jdom.Namespace; +import org.jdom.output.XMLOutputter; + + +/** + * + */ +public class OpenSearchMaster +{ + List<OpenSearchSlave> slaves = new ArrayList<OpenSearchSlave>( ); + long timeout = 30 * 1000; + + public OpenSearchMaster( String slavesFile, long timeout ) + throws IOException + { + this( slavesFile ); + this.timeout = timeout; + } + + public OpenSearchMaster( String slavesFile ) + throws IOException + { + BufferedReader r = null; + try + { + r = new BufferedReader( new InputStreamReader( new FileInputStream( slavesFile ), "utf-8" ) ); + + String line; + while ( (line = r.readLine()) != null ) + { + line = line.trim(); + if ( line.length() == 0 || line.charAt( 0 ) == '#' ) + { + // Ignore it. 
+ continue ; + } + + OpenSearchSlave slave = new OpenSearchSlave( line ); + + this.slaves.add( slave ); + } + } + finally + { + try { if ( r != null ) r.close(); } catch ( IOException ioe ) { } + } + + } + + public Document query( String query, int startIndex, int numResults, int hitsPerSite ) + { + long startTime = System.currentTimeMillis( ); + + List<SlaveQueryThread> slaveThreads = new ArrayList<SlaveQueryThread>( this.slaves.size() ); + + for ( OpenSearchSlave slave : this.slaves ) + { + SlaveQueryThread sqt = new SlaveQueryThread( slave, query, 0, (startIndex+numResults), hitsPerSite ); + + sqt.start( ); + + slaveThreads.add( sqt ); + } + + waitForThreads( slaveThreads, this.timeout, startTime ); + + LinkedList<Element> items = new LinkedList<Element>( ); + long totalResults = 0; + + for ( SlaveQueryThread sqt : slaveThreads ) + { + if ( sqt.throwable != null ) + { + // TODO: Handle problems with slaves + continue ; + } + + // Dump all the results ("item" elements) into a single list. + Element channel = sqt.response.getRootElement( ).getChild( "channel" ); + items.addAll( (List<Element>) channel.getChildren( "item" ) ); + channel.removeChildren( "item" ); + + try + { + totalResults += Integer.parseInt( channel.getChild( "totalResults", Namespace.getNamespace( "http://a9.com/-/spec/opensearchrss/1.0/" ) ).getTextTrim( ) ); + } + catch ( Exception e ) + { + // TODO: Log error getting total. 
+ } + + } + + if ( items.size( ) > 0 && hitsPerSite > 0 ) + { + Collections.sort( items, new ElementSiteThenScoreComparator( ) ); + + LinkedList<Element> collapsed = new LinkedList<Element>( ); + + collapsed.add( items.removeFirst( ) ); + + int count = 1; + for ( Element item : items ) + { + String lastSite = collapsed.getLast( ).getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ).getTextTrim( ); + + if ( lastSite.length( ) == 0 || + !lastSite.equals( item.getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ).getTextTrim( ) ) ) + { + collapsed.add( item ); + count = 1; + } + else if ( count < hitsPerSite ) + { + collapsed.add( item ); + count++; + } + else + { + // TODO: Log collapse of item. + } + } + + // Replace the list of items with the collapsed list. + items = collapsed; + } + + Collections.sort( items, new ElementScoreComparator( ) ); + + // Build the final results OpenSearch XML document. + Element channel = new Element( "channel" ); + channel.addContent( new Element( "title" ) ); + channel.addContent( new Element( "description" ) ); + channel.addContent( new Element( "link" ) ); + + Element eTotalResults = new Element( "totalResults", Namespace.getNamespace( "http://a9.com/-/spec/opensearchrss/1.0/" ) ); + Element eStartIndex = new Element( "startIndex", Namespace.getNamespace( "http://a9.com/-/spec/opensearchrss/1.0/" ) ); + Element eItemsPerPage = new Element( "itemsPerPage", Namespace.getNamespace( "http://a9.com/-/spec/opensearchrss/1.0/" ) ); + + eTotalResults.setText( Long.toString( totalResults ) ); + eStartIndex. 
setText( Long.toString( startIndex ) ); + eItemsPerPage.setText( Long.toString( numResults ) ); + + channel.addContent( eTotalResults ); + channel.addContent( eStartIndex ); + channel.addContent( eItemsPerPage ); + + // Get a sub-list of only the items we want: [startIndex,(startIndex+numResults)] + List<Element> subList = items.subList( Math.min( startIndex, items.size( ) ), + Math.min( (startIndex+numResults), items.size( ) ) ); + channel.addContent( subList ); + + Element rss = new Element( "rss" ); + rss.addContent( channel ); + + return new Document( rss ); + } + + + /** + * Convenience method to wait for a collection of threads to complete, + * or until a timeout after a startTime expires. + */ + private void waitForThreads( List<SlaveQueryThread> threads, long timeout, long startTime ) + { + for ( Thread t : threads ) + { + long timeRemaining = timeout - (System.currentTimeMillis( ) - startTime); + + // If we are out of time, don't wait for any more threads. + if ( timeRemaining <= 0 ) + { + break; + } + + // Otherwise, wait for the next unfinished thread to finish. 
+ try + { + t.join( timeRemaining ); + } + catch ( InterruptedException ie ) + { + break; + } + } + } + + + public static void main( String args[] ) + throws Exception + { + String usage = "OpenSearchMaster [OPTIONS] SLAVES.txt query" + + "\n\t-h <n> Hits per site" + + "\n\t-n <n> Number of results" + + "\n\t-s <n> Start index" + + "\n"; + + if ( args.length < 2 ) + { + System.err.println( usage ); + System.exit( 1 ); + } + + String slavesFile = args[args.length - 2]; + String query = args[args.length - 1]; + + int startIndex = 0; + int hitsPerSite = 0; + int numHits = 10; + for ( int i = 0 ; i < args.length - 2 ; i++ ) + { + try + { + if ( "-h".equals( args[i] ) ) + { + i++; + hitsPerSite = Integer.parseInt( args[i] ); + } + if ( "-n".equals( args[i] ) ) + { + i++; + numHits = Integer.parseInt( args[i] ); + } + if ( "-s".equals( args[i] ) ) + { + i++; + startIndex = Integer.parseInt( args[i] ); + } + } + catch ( NumberFormatException nfe ) + { + System.err.println( "Error: not a numeric value: " + args[i] ); + System.err.println( usage ); + System.exit( 1 ); + } + } + + OpenSearchMaster master = new OpenSearchMaster( slavesFile ); + + Document doc = master.query( query, startIndex, numHits, hitsPerSite ); + + (new XMLOutputter()).output( doc, System.out ); + } + +} + + +class SlaveQueryThread extends Thread +{ + OpenSearchSlave slave; + + String query; + int startIndex; + int numResults; + int hitsPerSite; + + Document response; + Throwable throwable; + + + SlaveQueryThread( OpenSearchSlave slave, String query, int startIndex, int numResults, int hitsPerSite ) + { + this.slave = slave; + this.query = query; + this.startIndex = startIndex; + this.numResults = numResults; + this.hitsPerSite = hitsPerSite; + } + + public void run( ) + { + try + { + this.response = this.slave.query( this.query, this.startIndex, this.numResults, this.hitsPerSite ); + } + catch ( Throwable t ) + { + this.throwable = t; + } + } +} + + +class ElementScoreComparator implements 
Comparator<Element> +{ + public int compare( Element e1, Element e2 ) + { + if ( e1 == e2 ) return 0; + if ( e1 == null ) return 1; + if ( e2 == null ) return -1; + + Element score1 = e1.getChild( "score" ); + Element score2 = e2.getChild( "score" ); + + if ( score1 == score2 ) return 0; + if ( score1 == null ) return 1; + if ( score2 == null ) return -1; + + String text1 = score1.getText().trim(); + String text2 = score2.getText().trim(); + + float value1 = 0.0f; + float value2 = 0.0f; + + try { value1 = Float.parseFloat( text1 ); } catch ( NumberFormatException nfe ) { } + try { value2 = Float.parseFloat( text2 ); } catch ( NumberFormatException nfe ) { } + + if ( value1 == value2 ) return 0; + + return value1 > value2 ? -1 : 1; + } +} + +class ElementSiteThenScoreComparator extends ElementScoreComparator +{ + public int compare( Element e1, Element e2 ) + { + if ( e1 == e2 ) return 0; + if ( e1 == null ) return 1; + if ( e2 == null ) return -1; + + String site1 = e1.getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ).getTextTrim(); + String site2 = e2.getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ).getTextTrim(); + + if ( site1.equals( site2 ) ) + { + // Sites are equal, then compare scores. + return super.compare( e1, e2 ); + } + + return site1.compareTo( site2 ); + } +} \ No newline at end of file Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchMasterServlet.java 2010-02-22 05:17:20 UTC (rev 2960) @@ -0,0 +1,52 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. 
See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.archive.nutchwax; + +import java.io.IOException; +import java.io.BufferedReader; +import java.io.InputStreamReader; +import java.io.FileInputStream; +import java.util.List; +import java.util.ArrayList; +import javax.servlet.ServletException; +import javax.servlet.ServletConfig; +import javax.servlet.http.HttpServlet; +import javax.servlet.http.HttpServletRequest; +import javax.servlet.http.HttpServletResponse; + + +/** + * + */ +public class OpenSearchMasterServlet extends HttpServlet +{ + + public void init( ServletConfig config ) + throws ServletException + { + + + } + + public void doGet( HttpServletRequest request, HttpServletResponse response ) + throws ServletException, IOException + { + + } + +} Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchSlave.java 2010-02-22 05:17:20 UTC (rev 2960) @@ -0,0 +1,209 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. 
See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.archive.nutchwax; + +import java.io.IOException; +import java.io.InputStream; +import java.io.UnsupportedEncodingException; +import java.net.HttpURLConnection; +import java.net.MalformedURLException; +import java.net.URL; +import java.net.URLConnection; +import java.net.URLEncoder; +import java.util.List; + +import org.jdom.Document; +import org.jdom.Element; +import org.jdom.Namespace; +import org.jdom.input.SAXBuilder; +import org.jdom.output.XMLOutputter; + +/** + * + */ +public class OpenSearchSlave +{ + private String urlTemplate; + + public OpenSearchSlave( String urlTemplate ) + { + this.urlTemplate = urlTemplate; + } + + public Document query( String query, int startIndex, int requestedNumResults, int hitsPerSite ) + throws Exception + { + URL url = buildRequestUrl( query, startIndex, requestedNumResults, hitsPerSite ); + + InputStream is = null; + try + { + is = getInputStream( url ); + + Document doc = (new SAXBuilder()).build( is ); + + doc = validate( doc ); + + return doc; + } + finally + { + // Ensure the InputStream is closed, which should trigger the + // underlying HTTP connection to be cleaned-up. 
+ try { if ( is != null ) is.close( ); } catch ( IOException ioe ) { } // Not much we can do + } + } + + private Document validate( Document doc ) + throws Exception + { + if ( doc.getRootElement( ) == null ) throw new Exception( "Invalid OpenSearch response: missing /rss" ); + Element root = doc.getRootElement( ); + + if ( ! "rss".equals( root.getName( ) ) ) throw new Exception( "Invalid OpenSearch response: missing /rss" ); + Element channel = root.getChild( "channel" ); + + if ( channel == null ) throw new Exception( "Invalid OpenSearch response: missing /rss/channel" ); + + for ( Element item : (List<Element>) channel.getChildren( "item" ) ) + { + Element site = item.getChild( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ); + if ( site == null ) + { + item.addContent( new Element( "site", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ) ); + } + + Element score = item.getChild( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ); + if ( score == null ) + { + score = new Element( "score", Namespace.getNamespace( "http://www.nutch.org/opensearchrss/1.0/" ) ); + score.setText( "" ); + + item.addContent( score ); + } + } + + return doc; + } + + /** + * + */ + public URL buildRequestUrl( String query, int startIndex, int requestedNumResults, int hitsPerSite ) + throws MalformedURLException, UnsupportedEncodingException + { + String url = this.urlTemplate; + + // Note about replaceAll: In the Java regex library, the replacement string has a few + // special characters: \ and $. Fortunately, since we URL-encode the replacement string, + // any occurrence of \ or $ is converted to %xy form. So we don't have to worry about it.
:) + url = url.replaceAll( "[{]searchTerms[}]", URLEncoder.encode( query, "utf-8" ) ); + url = url.replaceAll( "[{]count[}]" , String.valueOf( requestedNumResults ) ); + url = url.replaceAll( "[{]startIndex[}]" , String.valueOf( startIndex ) ); + url = url.replaceAll( "[{]hitsPerSite[}]", String.valueOf( hitsPerSite ) ); + + // We don't know about any optional parameters, so we remove them (per the OpenSearch spec.) + url = url.replaceAll( "[{][^}]+[?][}]", "" ); + + return new URL( url ); + } + + + public InputStream getInputStream( URL url ) + throws IOException + { + URLConnection connection = url.openConnection( ); + connection.setDoOutput( false ); + connection.setRequestProperty( "User-Agent", "Mozilla/4.0 (compatible; NutchWAX OpenSearchMaster)" ); + connection.connect( ); + + if ( connection instanceof HttpURLConnection ) + { + HttpURLConnection hc = (HttpURLConnection) connection; + + switch ( hc.getResponseCode( ) ) + { + case 200: + // All good. + break; + default: + // Problems! Bail out. 
+ throw new IOException( "HTTP error from " + url + ": " + hc.getResponseMessage( ) ); + } + } + + InputStream is = connection.getInputStream( ); + + return is; + } + + public String toString() + { + return this.urlTemplate; + } + + public static void main( String args[] ) + throws Exception + { + String usage = "OpenSearchSlave [OPTIONS] urlTemplate query" + + "\n\t-h <n> Hits per site" + + "\n\t-n <n> Number of results" + + "\n"; + + if ( args.length < 2 ) + { + System.err.println( usage ); + System.exit( 1 ); + } + + String urlTemplate = args[args.length - 2]; + String query = args[args.length - 1]; + + int hitsPerSite = 0; + int numHits = 10; + for ( int i = 0 ; i < args.length - 2 ; i++ ) + { + try + { + if ( "-h".equals( args[i] ) ) + { + i++; + hitsPerSite = Integer.parseInt( args[i] ); + } + if ( "-n".equals( args[i] ) ) + { + i++; + numHits = Integer.parseInt( args[i] ); + } + } + catch ( NumberFormatException nfe ) + { + System.err.println( "Error: not a numeric value: " + args[i] ); + System.err.println( usage ); + System.exit( 1 ); + } + } + + OpenSearchSlave osl = new OpenSearchSlave( urlTemplate ); + + Document doc = osl.query( query, 0, numHits, hitsPerSite ); + + (new XMLOutputter()).output( doc, System.out ); + } + +} \ No newline at end of file This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
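Editorial note on the URL template handling in OpenSearchSlave.buildRequestUrl above: the committed code relies on URL-encoding to neutralize the \ and $ characters that replaceAll treats specially in a replacement string. A minimal sketch of an alternative follows, assuming one wanted to avoid that dependency by using String.replace, which treats both arguments literally. The class name UrlTemplateSketch and the method name expand are hypothetical and are not part of the NutchWAX sources; only the template parameter names and the optional-parameter regex are taken from the diff.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical illustration only; not part of the NutchWAX commit.
class UrlTemplateSketch
{
  // Expands an OpenSearch-style URL template such as
  //   http://host/search?q={searchTerms}&n={count}
  // using literal substitution for the known parameters.
  static String expand( String template, String query, int startIndex, int count, int hitsPerSite )
    throws UnsupportedEncodingException
  {
    String url = template;

    // String.replace does no regex processing, so \ and $ in the
    // replacement value need no special treatment.
    url = url.replace( "{searchTerms}", URLEncoder.encode( query, "UTF-8" ) );
    url = url.replace( "{count}",       String.valueOf( count ) );
    url = url.replace( "{startIndex}",  String.valueOf( startIndex ) );
    url = url.replace( "{hitsPerSite}", String.valueOf( hitsPerSite ) );

    // Remove any remaining optional parameters, written as {name?},
    // using the same regex as the committed code.
    url = url.replaceAll( "[{][^}]+[?][}]", "" );

    return url;
  }
}
```

For example, expanding "http://h/s?q={searchTerms}&n={count}&s={startIndex}&h={hitsPerSite}&l={language?}" with the query "a $b", startIndex 5, count 10 and hitsPerSite 2 should give "http://h/s?q=a+%24b&n=10&s=5&h=2&l=". The query is still URL-encoded because it has to be for the URL to be valid, not because the substitution requires it.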
From: <bi...@us...> - 2010-02-20 03:26:10
|
Revision: 2959 http://archive-access.svn.sourceforge.net/archive-access/?rev=2959&view=rev Author: binzino Date: 2010-02-20 03:26:03 +0000 (Sat, 20 Feb 2010) Log Message: ----------- Whoops, this should have gone in the previous commit. I missed it. Added Paths: ----------- trunk/archive-access/projects/nutchwax/archive/src/nutch/build.xml Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/build.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/build.xml (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/build.xml 2010-02-20 03:26:03 UTC (rev 2959) @@ -0,0 +1,640 @@ +<?xml version="1.0"?> +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
+--> +<project name="Nutch" default="job"> + + <!-- Load all the default properties, and any the user wants --> + <!-- to contribute (without having to type -D or edit this file --> + <property file="${user.home}/build.properties" /> + <property file="${basedir}/build.properties" /> + <property file="${basedir}/default.properties" /> + <property name="test.junit.output.format" value="plain"/> + + <!-- the normal classpath --> + <path id="classpath"> + <pathelement location="${build.classes}"/> + <fileset dir="${lib.dir}"> + <include name="*.jar" /> + </fileset> + </path> + + <!-- the unit test classpath --> + <dirname property="plugins.classpath.dir" file="${build.plugins}"/> + <path id="test.classpath"> + <pathelement location="${test.build.classes}" /> + <pathelement location="${conf.dir}"/> + <pathelement location="${test.src.dir}"/> + <pathelement location="${plugins.classpath.dir}"/> + <path refid="classpath"/> + <pathelement location="${build.dir}/${final.name}.job" /> + </path> + + <!-- xmlcatalog definition for xslt task --> + <xmlcatalog id="docDTDs"> + <dtd publicId="-//W3C//DTD XHTML 1.0 Transitional//EN" + location="${xmlcatalog.dir}/xhtml1-transitional.dtd"/> + </xmlcatalog> + + <!-- ====================================================== --> + <!-- Stuff needed by all targets --> + <!-- ====================================================== --> + <target name="init"> + <mkdir dir="${build.dir}"/> + <mkdir dir="${build.classes}"/> + + <mkdir dir="${test.build.dir}"/> + <mkdir dir="${test.build.classes}"/> + + <touch datetime="01/25/1971 2:00 pm"> + <fileset dir="${conf.dir}" includes="**/*.template"/> + </touch> + + <copy todir="${conf.dir}" verbose="true"> + <fileset dir="${conf.dir}" includes="**/*.template"/> + <mapper type="glob" from="*.template" to="*"/> + </copy> + + <!-- unpack hadoop scripts from hadoop jar into bin directory --> + <mkdir dir="${build.dir}/hadoop"/> + <unjar dest="${build.dir}/hadoop"> + <fileset dir="${lib.dir}" 
includes="hadoop*.jar"/> + <patternset includes="bin.tgz"/> + </unjar> + + <untar src="${build.dir}/hadoop/bin.tgz" dest="bin" compression="gzip"/> + <!-- fix broken library paths with spaces --> + <replace file="bin/hadoop" token="PlatformName" value="PlatformName | sed -e 's/ /_/g'"/> + <chmod dir="bin" perm="ugo+rx" includes="*.sh,hadoop"/> + + <!-- unpack hadoop webapp from hadoop jar into build directory --> + <mkdir dir="${build.dir}/webapps"/> + <unjar dest="${build.dir}"> + <fileset dir="${lib.dir}" includes="hadoop*.jar"/> + <patternset includes="webapps/**"/> + </unjar> + + </target> + + <!-- ====================================================== --> + <!-- Compile the Java files --> + <!-- ====================================================== --> + <target name="compile" depends="compile-core, compile-plugins"/> + + <target name="compile-core" depends="init"> + <javac + encoding="${build.encoding}" + srcdir="${src.dir}" + includes="**/*.java" + destdir="${build.classes}" + debug="${javac.debug}" + optimize="${javac.optimize}" + target="${javac.version}" + source="${javac.version}" + deprecation="${javac.deprecation}"> + <classpath refid="classpath"/> + </javac> + </target> + + <target name="compile-plugins"> + <ant dir="src/plugin" target="deploy" inheritAll="false"/> + </target> + + <target name="generate-src" depends="init"> + <javacc target="${src.dir}/org/apache/nutch/analysis/NutchAnalysis.jj" + javacchome="${javacc.home}"> + </javacc> + + <fixcrlf srcdir="${src.dir}" eol="lf" includes="**/*.java"/> + + </target> + + <target name="dynamic" depends="generate-src, compile"> + </target> + + <!-- ================================================================== --> + <!-- Make nutch.jar --> + <!-- ================================================================== --> + <!-- --> + <!-- ================================================================== --> + <target name="jar" depends="compile-core"> + <copy file="${conf.dir}/nutch-default.xml" + 
todir="${build.classes}"/> + <copy file="${conf.dir}/nutch-site.xml" + todir="${build.classes}"/> + <jar jarfile="${build.dir}/${final.name}.jar" + basedir="${build.classes}"> + <manifest> + </manifest> + </jar> + </target> + + <!-- ================================================================== --> + <!-- Make job jar --> + <!-- ================================================================== --> + <!-- --> + <!-- ================================================================== --> + <target name="job" depends="compile"> + <jar jarfile="${build.dir}/${final.name}.job"> + <zipfileset dir="${build.classes}"/> + <zipfileset dir="${conf.dir}" excludes="*.template,hadoop*.*"/> + <zipfileset dir="${lib.dir}" prefix="lib" + includes="**/*.jar" excludes="hadoop-*.jar"/> + <zipfileset dir="${build.plugins}" prefix="plugins"/> + </jar> + </target> + + <!-- ================================================================== --> + <!-- Make nutch.war --> + <!-- ================================================================== --> + <!-- --> + <!-- ================================================================== --> + <target name="war" depends="jar,compile,generate-docs"> + + <!-- generate the nutch.xml (servlet context) file --> + <xslt in="${basedir}/conf/nutch-default.xml" + out="${build.dir}/nutch.xml" + style="${basedir}/conf/context.xsl"> + <xmlcatalog refid="docDTDs"/> + <outputproperty name="indent" value="yes"/> + </xslt> + <war destfile="${build.dir}/${final.name}.war" + webxml="${web.src.dir}/web.xml"> + <fileset dir="${web.src.dir}/jsp"/> + <zipfileset dir="${docs.src}" includes="include/*.html"/> + <zipfileset dir="${build.docs}" includes="*/include/*.html"/> + <fileset dir="${docs.dir}"/> + <lib dir="${lib.dir}"> + <include name="lucene*.jar"/> + <include name="taglibs-*.jar"/> + <include name="hadoop-*.jar"/> + <include name="dom4j-*.jar"/> + <include name="xerces-*.jar"/> + <include name="tika-*.jar"/> + <include name="apache-solr-*.jar"/> + <include 
name="commons-httpclient-*.jar"/> + <include name="commons-codec-*.jar"/> + <include name="commons-collections-*.jar"/> + <include name="commons-beanutils-*.jar"/> + <include name="commons-cli-*.jar"/> + <include name="commons-lang-*.jar"/> + <include name="commons-logging-*.jar"/> + <include name="log4j-*.jar"/> + </lib> + <lib dir="${build.dir}"> + <include name="${final.name}.jar"/> + </lib> + <classes dir="${conf.dir}" excludes="**/*.template"/> + <classes dir="${web.src.dir}/locale"/> + <classes file="${web.src.dir}/log4j.properties"/> + <zipfileset prefix="WEB-INF/classes/plugins" dir="${build.plugins}"/> + <webinf dir="${lib.dir}"> + <include name="taglibs-*.tld"/> + </webinf> + </war> + </target> + + + <!-- ================================================================== --> + <!-- Compile test code --> + <!-- ================================================================== --> + <target name="compile-core-test" depends="compile-core"> + <javac + encoding="${build.encoding}" + srcdir="${test.src.dir}" + includes="org/apache/nutch/**/*.java" + destdir="${test.build.classes}" + debug="${javac.debug}" + optimize="${javac.optimize}" + target="${javac.version}" + source="${javac.version}" + deprecation="${javac.deprecation}"> + <classpath refid="test.classpath"/> + </javac> + </target> + + <!-- ================================================================== --> + <!-- Run code checks (PMD) --> + <!-- ================================================================== --> + <target name="pmd" depends="compile"> + <property name="pmd.report" location="${build.dir}/pmd-report.html" /> + <taskdef name="pmd" classname="net.sourceforge.pmd.ant.PMDTask"> + <classpath> + <fileset dir="${lib.dir}"> + <include name="pmd-ext/*.jar" /> + <include name="xerces*.jar" /> + </fileset> + </classpath> + </taskdef> + <pmd shortFilenames="true" failonerror="true" failOnRuleViolation="false" + encoding="${build.encoding}" failuresPropertyName="pmd.failures"> + 
<ruleset>unusedcode</ruleset> + <!--ruleset>basic</ruleset--> + <!--ruleset>optimizations</ruleset--> + <formatter type="html" toFile="${pmd.report}" /> + <!-- <formatter type="xml" toFile="${tempbuild}/$report_pmd.xml"/> --> + <fileset dir="${basedir}/src"> + <include name="java/**/*.java"/> + <include name="plugin/**/*.java"/> + <!-- Exclude generated sources --> + <exclude name="**/NutchAnalysis.java" /> + <exclude name="**/NutchAnalysisTokenManager.java" /> + </fileset> + </pmd> + <condition property="pmd.stop" value="true"> + <and> + <isset property="pmd.failures" /> + <not> + <equals arg1="0" arg2="${pmd.failures}" trim="true" /> + </not> + </and> + </condition> + <fail if="pmd.stop">FAILURE: PMD shows ${pmd.failures} rule violations. See ${pmd.report} for details.</fail> + </target> + + <!-- ================================================================== --> + <!-- Run unit tests --> + <!-- ================================================================== --> + <target name="test" depends="test-core, test-plugins"/> + + <target name="test-core" depends="job, compile-core-test"> + + <delete dir="${test.build.data}"/> + <mkdir dir="${test.build.data}"/> + <!-- + copy resources needed in junit tests + --> + <copy todir="${test.build.data}"> + <fileset dir="src/testresources" includes="**/*"/> + </copy> + <copy file="${test.src.dir}/nutch-site.xml" + todir="${test.build.classes}"/> + + <copy file="${test.src.dir}/log4j.properties" + todir="${test.build.classes}"/> + + <junit printsummary="yes" haltonfailure="no" fork="yes" dir="${basedir}" + errorProperty="tests.failed" failureProperty="tests.failed" maxmemory="1000m"> + <sysproperty key="test.build.data" value="${test.build.data}"/> + <sysproperty key="test.src.dir" value="${test.src.dir}"/> + <classpath refid="test.classpath"/> + <formatter type="${test.junit.output.format}" /> + <batchtest todir="${test.build.dir}" unless="testcase"> + <fileset dir="${test.src.dir}" + includes="**/Test*.java" 
excludes="**/${test.exclude}.java" /> + </batchtest> + <batchtest todir="${test.build.dir}" if="testcase"> + <fileset dir="${test.src.dir}" includes="**/${testcase}.java"/> + </batchtest> + </junit> + + <fail if="tests.failed">Tests failed!</fail> + + </target> + + <target name="test-plugins" depends="compile"> + <ant dir="src/plugin" target="test" inheritAll="false"/> + </target> + + <target name="nightly" depends="test, tar"> + </target> + + <!-- ================================================================== --> + <!-- Documentation --> + <!-- ================================================================== --> + <target name="javadoc" depends="compile"> + <mkdir dir="${build.javadoc}"/> + <javadoc + overview="${src.dir}/overview.html" + destdir="${build.javadoc}" + author="true" + version="true" + use="true" + windowtitle="${Name} ${version} API" + doctitle="${Name} ${version} API" + bottom="Copyright &copy; ${year} The Apache Software Foundation" + > + <arg value="${javadoc.proxy.host}"/> + <arg value="${javadoc.proxy.port}"/> + + <packageset dir="${src.dir}"/> + <packageset dir="${plugins.dir}/lib-http/src/java"/> + <packageset dir="${plugins.dir}/lib-parsems/src/java"/> + <packageset dir="${plugins.dir}/lib-regex-filter/src/java"/> + <packageset dir="${plugins.dir}/microformats-reltag/src/java"/> + <packageset dir="${plugins.dir}/ontology/src/java"/> + <packageset dir="${plugins.dir}/protocol-file/src/java"/> + <packageset dir="${plugins.dir}/protocol-ftp/src/java"/> + <packageset dir="${plugins.dir}/protocol-http/src/java"/> + <packageset dir="${plugins.dir}/protocol-httpclient/src/java"/> + <packageset dir="${plugins.dir}/parse-ext/src/java"/> + <packageset dir="${plugins.dir}/parse-html/src/java"/> + <packageset dir="${plugins.dir}/parse-js/src/java"/> + <packageset dir="${plugins.dir}/parse-text/src/java"/> + <packageset dir="${plugins.dir}/parse-pdf/src/java"/> +<!-- <packageset dir="${plugins.dir}/parse-rtf/src/java"/> plugin excluded from build 
due to licensing issues--> +<!-- <packageset dir="${plugins.dir}/parse-mp3/src/java"/> plugin excluded from build due to licensing issues--> + <packageset dir="${plugins.dir}/parse-msexcel/src/java"/> + <packageset dir="${plugins.dir}/parse-mspowerpoint/src/java"/> + <packageset dir="${plugins.dir}/parse-msword/src/java"/> + <packageset dir="${plugins.dir}/parse-oo/src/java"/> + <packageset dir="${plugins.dir}/parse-rss/src/java"/> + <packageset dir="${plugins.dir}/parse-swf/src/java"/> + <packageset dir="${plugins.dir}/parse-zip/src/java"/> + <packageset dir="${plugins.dir}/index-basic/src/java"/> + <packageset dir="${plugins.dir}/index-more/src/java"/> + <packageset dir="${plugins.dir}/query-basic/src/java"/> + <packageset dir="${plugins.dir}/query-more/src/java"/> + <packageset dir="${plugins.dir}/query-site/src/java"/> + <packageset dir="${plugins.dir}/query-url/src/java"/> + <packageset dir="${plugins.dir}/scoring-opic/src/java"/> + <packageset dir="${plugins.dir}/summary-basic/src/java"/> + <packageset dir="${plugins.dir}/summary-lucene/src/java"/> + <packageset dir="${plugins.dir}/urlfilter-automaton/src/java"/> + <packageset dir="${plugins.dir}/urlfilter-regex/src/java"/> + <packageset dir="${plugins.dir}/urlfilter-prefix/src/java"/> + <packageset dir="${plugins.dir}/creativecommons/src/java"/> + <packageset dir="${plugins.dir}/languageidentifier/src/java"/> + <packageset dir="${plugins.dir}/clustering-carrot2/src/java"/> + <packageset dir="${plugins.dir}/ontology/src/java"/> + + <packageset dir="${plugins.dir}/index-nutchwax/src/java"/> + <packageset dir="${plugins.dir}/query-nutchwax/src/java"/> + <packageset dir="${plugins.dir}/scoring-nutchwax/src/java"/> + <packageset dir="${plugins.dir}/urlfilter-nutchwax/src/java"/> + + <link href="${javadoc.link.java}"/> + <link href="${javadoc.link.lucene}"/> + <link href="${javadoc.link.hadoop}"/> + + <classpath refid="classpath"/> + <classpath> + <fileset dir="${plugins.dir}" > + <include name="**/*.jar"/> + 
</fileset> + </classpath> + + <group title="Core" packages="org.apache.nutch.*"/> + <group title="Plugins API" packages="${plugins.api}"/> + <group title="Protocol Plugins" packages="${plugins.protocol}"/> + <group title="URL Filter Plugins" packages="${plugins.urlfilter}"/> + <group title="Scoring Plugins" packages="${plugins.scoring}"/> + <group title="Parse Plugins" packages="${plugins.parse}"/> + <group title="Analysis Plugins" packages="${plugins.analysis}"/> + <group title="Indexing Filter Plugins" packages="${plugins.index}"/> + <group title="Query Filter Plugins" packages="${plugins.query}"/> + <group title="Summary Plugins" packages="${plugins.summary}"/> + <group title="Clustering Plugins" packages="${plugins.clustering}"/> + <group title="Ontology Plugins" packages="${plugins.ontology}"/> + <group title="Misc. Plugins" packages="${plugins.misc}"/> + </javadoc> + <!-- Copy the plugin.dtd file to the plugin doc-files dir --> + <copy file="${plugins.dir}/plugin.dtd" + todir="${build.javadoc}/org/apache/nutch/plugin/doc-files"/> + </target> + + <target name="default-doc"> + <style basedir="${conf.dir}" destdir="${docs.dir}" + includes="nutch-default.xml" style="conf/nutch-conf.xsl"/> + </target> + + <target name="generate-locale" if="doc.locale"> + <echo message="Generating docs for locale=${doc.locale}"/> + + <mkdir dir="${build.docs}/${doc.locale}/include"/> + <xslt in="${docs.src}/include/${doc.locale}/header.xml" + out="${build.docs}/${doc.locale}/include/header.html" + style="${docs.src}/style/nutch-header.xsl"> + <xmlcatalog refid="docDTDs"/> + </xslt> + + <dependset> + <srcfileset dir="${docs.src}/include/${doc.locale}" includes="*.xml"/> + <srcfileset dir="${docs.src}/style" includes="*.xsl"/> + <targetfileset dir="${docs.dir}/${doc.locale}" includes="*.html"/> + </dependset> + + <copy file="${docs.src}/style/nutch-page.xsl" + todir="${build.docs}/${doc.locale}" + preservelastmodified="true"/> + + <xslt basedir="${docs.src}/pages/${doc.locale}" + 
destdir="${docs.dir}/${doc.locale}" + includes="*.xml" + style="${build.docs}/${doc.locale}/nutch-page.xsl"> + <xmlcatalog refid="docDTDs"/> + </xslt> + </target> + + + <target name="generate-docs" depends="init"> + <dependset> + <srcfileset dir="${docs.src}/include" includes="*.html"/> + <targetfileset dir="${docs.dir}" includes="**/*.html"/> + </dependset> + + <mkdir dir="${build.docs}/include"/> + <copy todir="${build.docs}/include"> + <fileset dir="${docs.src}/include"/> + </copy> + + <antcall target="generate-locale"> + <param name="doc.locale" value="ca"/> + </antcall> + + <antcall target="generate-locale"> + <param name="doc.locale" value="de"/> + </antcall> + + <antcall target="generate-locale"> + <param name="doc.locale" value="en"/> + </antcall> + + <antcall target="generate-locale"> + <param name="doc.locale" value="es"/> + </antcall> + + <antcall target="generate-locale"> + <param name="doc.locale" value="fi"/> + </antcall> + + <antcall target="generate-locale"> + <param name="doc.locale" value="fr"/> + </antcall> + + <antcall target="generate-locale"> + <param name="doc.locale" value="hu"/> + </antcall> + + <antcall target="generate-locale"> + <param name="doc.locale" value="it"/> + </antcall> + + <antcall target="generate-locale"> + <param name="doc.locale" value="jp"/> + </antcall> + + <antcall target="generate-locale"> + <param name="doc.locale" value="ms"/> + </antcall> + + <antcall target="generate-locale"> + <param name="doc.locale" value="nl"/> + </antcall> + + <antcall target="generate-locale"> + <param name="doc.locale" value="pl"/> + </antcall> + + <antcall target="generate-locale"> + <param name="doc.locale" value="pt"/> + </antcall> + + <antcall target="generate-locale"> + <param name="doc.locale" value="sh"/> + </antcall> + + <antcall target="generate-locale"> + <param name="doc.locale" value="sr"/> + </antcall> + + <antcall target="generate-locale"> + <param name="doc.locale" value="sv"/> + </antcall> + + <antcall 
target="generate-locale"> + <param name="doc.locale" value="th"/> + </antcall> + + <antcall target="generate-locale"> + <param name="doc.locale" value="zh"/> + </antcall> + + <fixcrlf srcdir="${docs.dir}" eol="lf" encoding="utf-8" + includes="**/*.html"/> + + </target> + + <!-- ================================================================== --> + <!-- D I S T R I B U T I O N --> + <!-- ================================================================== --> + <!-- --> + <!-- ================================================================== --> + <target name="package" depends="jar, job, war, javadoc"> + <mkdir dir="${dist.dir}"/> + <mkdir dir="${dist.dir}/lib"/> + <mkdir dir="${dist.dir}/bin"/> + <mkdir dir="${dist.dir}/docs"/> + <mkdir dir="${dist.dir}/docs/api"/> + <mkdir dir="${dist.dir}/plugins"/> + + <copy todir="${dist.dir}/lib" includeEmptyDirs="false"> + <fileset dir="lib"/> + </copy> + + <copy todir="${dist.dir}/plugins"> + <fileset dir="${build.plugins}"/> + </copy> + + <copy todir="${dist.dir}/webapps"> + <fileset dir="${build.webapps}"/> + </copy> + + <copy file="${build.dir}/${final.name}.jar" todir="${dist.dir}"/> + <copy file="${build.dir}/${final.name}.job" todir="${dist.dir}"/> + <copy file="${build.dir}/${final.name}.war" todir="${dist.dir}"/> + + <copy todir="${dist.dir}/bin"> + <fileset dir="bin"/> + </copy> + + <copy todir="${dist.dir}/conf"> + <fileset dir="${conf.dir}" excludes="**/*.template"/> + </copy> + + <chmod perm="ugo+x" type="file"> + <fileset dir="${dist.dir}/bin"/> + </chmod> + + <copy todir="${dist.dir}/docs"> + <fileset dir="${docs.dir}"/> + </copy> + + <copy todir="${dist.dir}/docs/api"> + <fileset dir="${build.javadoc}"/> + </copy> + + <copy todir="${dist.dir}"> + <fileset dir="."> + <include name="*.txt" /> + <include name="KEYS" /> + </fileset> + </copy> + + <copy todir="${dist.dir}/src" includeEmptyDirs="true"> + <fileset dir="src"/> + </copy> + + <copy todir="${dist.dir}/" file="build.xml"/> + <copy todir="${dist.dir}/" 
file="default.properties"/> + + </target> + + <!-- ================================================================== --> + <!-- Make release tarball --> + <!-- ================================================================== --> + <target name="tar" depends="package"> + <tar compression="gzip" longfile="gnu" + destfile="${build.dir}/${final.name}.tar.gz"> + <tarfileset dir="${build.dir}" mode="664"> + <exclude name="${final.name}/bin/*" /> + <include name="${final.name}/**" /> + </tarfileset> + <tarfileset dir="${build.dir}" mode="755"> + <include name="${final.name}/bin/*" /> + </tarfileset> + </tar> + </target> + + <!-- ================================================================== --> + <!-- Clean. Delete the build files, and their directories --> + <!-- ================================================================== --> + <target name="clean"> + <delete dir="${build.dir}"/> + </target> + + <!-- ================================================================== --> + <!-- RAT targets --> + <!-- ================================================================== --> + <target name="rat-sources-typedef"> + <typedef resource="org/apache/rat/anttasks/antlib.xml" > + <classpath> + <fileset dir="." includes="rat*.jar"/> + </classpath> + </typedef> + </target> + + <target name="rat-sources" depends="rat-sources-typedef" + description="runs the tasks over src/java"> + <rat:report xmlns:rat="antlib:org.apache.rat.anttasks"> + <fileset dir="src"> + <include name="java/**/*"/> + <include name="plugin/**/src/**/*"/> + </fileset> + </rat:report> + </target> + +</project> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2010-02-20 03:21:06
|
Revision: 2958 http://archive-access.svn.sourceforge.net/archive-access/?rev=2958&view=rev Author: binzino Date: 2010-02-20 03:20:59 +0000 (Sat, 20 Feb 2010) Log Message: ----------- WAX-72 and WAX-71: Re-did build system. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/build.xml Added Paths: ----------- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/plugin/ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/plugin/build.xml Removed Paths: ------------- trunk/archive-access/projects/nutchwax/archive/src/plugin/build-plugin.xml trunk/archive-access/projects/nutchwax/archive/src/plugin/build.xml Modified: trunk/archive-access/projects/nutchwax/archive/build.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/build.xml 2010-02-20 03:18:57 UTC (rev 2957) +++ trunk/archive-access/projects/nutchwax/archive/build.xml 2010-02-20 03:20:59 UTC (rev 2958) @@ -25,81 +25,52 @@ <!-- HACK: Need to import default.properties like Nutch does --> <property name="final.name" value="nutch-1.0" /> <property name="dist.dir" value="${build.dir}/${final.name}" /> - - <target name="nutch-compile-core"> - <!-- First, copy over Nutch source overlays --> + + <target name="init"> <exec executable="rsync"> <arg value="-vacC"/> <arg value="src/nutch/"/> <arg value="../../"/> </exec> - <ant dir="${nutch.dir}" target="compile-core" inheritAll="false" /> + <exec executable="rsync"> + <arg value="-vacC"/> + <arg value="lib/"/> + <arg value="../../lib/"/> + </exec> + <exec executable="rsync"> + <arg value="-vacC"/> + <arg value="bin/"/> + <arg value="../../bin/"/> + </exec> + <exec executable="rsync"> + <arg value="-vacC"/> + <arg value="src/java/"/> + <arg value="../../src/java/"/> + </exec> + <exec executable="rsync"> + <arg value="-vacC"/> + <arg value="src/plugin/"/> + <arg value="../../src/plugin/"/> + </exec> </target> - - <target name="nutch-compile-plugins"> - <ant 
dir="${nutch.dir}" target="compile-plugins" inheritAll="false" /> - </target> - - <target name="compile-core" depends="nutch-compile-core"> - <javac - destdir="${build.dir}/classes" - debug="true" - verbose="false" - source="1.5" - target="1.5" - encoding="UTF-8" - fork="true" - nowarn="true" - deprecation="false"> - <src path="${src.dir}/java" /> - <include name="**/*.java" /> - <classpath> - <pathelement location="${build.dir}/classes" /> - <fileset dir="${lib.dir}"> - <include name="*.jar"/> - </fileset> - <fileset dir="${nutch.dir}/lib"> - <include name="*.jar"/> - </fileset> - </classpath> - </javac> - </target> - - <target name="compile-plugins"> - <ant dir="src/plugin" target="deploy" inheritAll="false" /> - </target> - - <!-- - These targets all call down to the corresponding target in the - Nutch build.xml file. This way all of the 'ant' build commands - can be executed from this directory and everything should get - built as expected. - --> - <target name="compile" depends="compile-core, compile-plugins, nutch-compile-plugins"> - </target> - - <target name="jar" depends="compile-core"> + + <target name="jar" depends="init"> <ant dir="${nutch.dir}" target="jar" inheritAll="false" /> </target> - <target name="job" depends="compile"> + <target name="job" depends="init"> <ant dir="${nutch.dir}" target="job" inheritAll="false" /> - - <!-- Add our NutchWAX libs to the .job created by Nutch's build. 
--> - <jar jarfile="${build.dir}/${final.name}.job" update="true"> - <zipfileset dir="lib" prefix="lib" includes="*.jar"/> - </jar> </target> - <target name="war" depends="compile"> + <target name="war" depends="init"> <ant dir="${nutch.dir}" target="war" inheritAll="false" /> </target> - <target name="javadoc" depends="compile"> + <target name="javadoc" depends="init"> <ant dir="${nutch.dir}" target="javadoc" inheritAll="false" /> </target> - <target name="tar" depends="package"> + <target name="tar" depends="init"> <ant dir="${nutch.dir}" target="tar" inheritAll="false" /> </target> @@ -107,24 +78,12 @@ <ant dir="${nutch.dir}" target="clean" inheritAll="false" /> </target> - <!-- This one does a little more after calling down to the relevant - Nutch target. After Nutch has copied everything into the - distribution directory, we add our script, libraries, etc. - --> - <target name="package" depends="jar, job, war, javadoc" > + <target name="package" depends="init"> <ant dir="${nutch.dir}" target="package" inheritAll="false" /> <ant target="onlypack" /> </target> <target name="onlypack"> - <copy todir="${dist.dir}/lib" includeEmptyDirs="false"> - <fileset dir="lib"/> - </copy> - - <copy todir="${dist.dir}/bin"> - <fileset dir="bin"/> - </copy> - <chmod perm="ugo+x" type="file"> <fileset dir="${dist.dir}/bin"/> </chmod> Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/plugin/build.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/plugin/build.xml (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/plugin/build.xml 2010-02-20 03:20:59 UTC (rev 2958) @@ -0,0 +1,204 @@ +<?xml version="1.0"?> +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. 
+ The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> +<project name="Nutch" default="deploy-core" basedir="."> + + <target name="deploy-core"> + <ant target="compile-core" inheritall="false" dir="../.."/> + <ant target="deploy"/> + </target> + + <!-- ====================================================== --> + <!-- Build & deploy all the plugin jars. --> + <!-- ====================================================== --> + <target name="deploy"> + <ant dir="clustering-carrot2" target="deploy"/> + <ant dir="creativecommons" target="deploy"/> + <ant dir="feed" target="deploy"/> + <ant dir="index-basic" target="deploy"/> + <ant dir="index-anchor" target="deploy"/> + <ant dir="index-more" target="deploy"/> + <ant dir="field-basic" target="deploy"/> + <ant dir="field-boost" target="deploy"/> + <ant dir="languageidentifier" target="deploy"/> + <ant dir="lib-http" target="deploy"/> + <ant dir="lib-jakarta-poi" target="deploy"/> + <ant dir="lib-lucene-analyzers" target="deploy"/> + <ant dir="lib-nekohtml" target="deploy"/> + <ant dir="lib-parsems" target="deploy"/> + <ant dir="lib-regex-filter" target="deploy"/> + <ant dir="lib-xml" target="deploy"/> + <ant dir="microformats-reltag" target="deploy"/> + <ant dir="nutch-extensionpoints" target="deploy"/> + <ant dir="ontology" target="deploy"/> + <ant dir="protocol-file" target="deploy"/> + <ant dir="protocol-ftp" target="deploy"/> + <ant dir="protocol-http" target="deploy"/> + <ant dir="protocol-httpclient" 
target="deploy"/> + <ant dir="parse-ext" target="deploy"/> + <ant dir="parse-html" target="deploy"/> + <ant dir="parse-js" target="deploy"/> + <!-- <ant dir="parse-mp3" target="deploy"/> --> + <ant dir="parse-msexcel" target="deploy"/> + <ant dir="parse-mspowerpoint" target="deploy"/> + <ant dir="parse-msword" target="deploy"/> + <ant dir="parse-oo" target="deploy"/> + <ant dir="parse-pdf" target="deploy"/> + <ant dir="parse-rss" target="deploy"/> + <!-- <ant dir="parse-rtf" target="deploy"/> --> + <ant dir="parse-swf" target="deploy"/> + <ant dir="parse-text" target="deploy"/> + <ant dir="parse-zip" target="deploy"/> + <ant dir="query-basic" target="deploy"/> + <ant dir="query-more" target="deploy"/> + <ant dir="query-site" target="deploy"/> + <ant dir="query-custom" target="deploy"/> + <ant dir="query-url" target="deploy"/> + <ant dir="response-json" target="deploy"/> + <ant dir="response-xml" target="deploy"/> + <ant dir="scoring-opic" target="deploy"/> + <ant dir="scoring-link" target="deploy"/> + <ant dir="summary-basic" target="deploy"/> + <ant dir="subcollection" target="deploy"/> + <ant dir="summary-lucene" target="deploy"/> + <ant dir="tld" target="deploy"/> + <ant dir="urlfilter-automaton" target="deploy"/> + <ant dir="urlfilter-domain" target="deploy" /> + <ant dir="urlfilter-prefix" target="deploy"/> + <ant dir="urlfilter-regex" target="deploy"/> + <ant dir="urlfilter-suffix" target="deploy"/> + <ant dir="urlfilter-validator" target="deploy"/> + <ant dir="urlnormalizer-basic" target="deploy"/> + <ant dir="urlnormalizer-pass" target="deploy"/> + <ant dir="urlnormalizer-regex" target="deploy"/> + + <ant dir="index-nutchwax" target="deploy" /> + <ant dir="query-nutchwax" target="deploy" /> + <ant dir="scoring-nutchwax" target="deploy" /> + <ant dir="urlfilter-nutchwax" target="deploy" /> + + </target> + + <!-- ====================================================== --> + <!-- Test all of the plugins. 
--> + <!-- ====================================================== --> + <target name="test"> + <parallel threadCount="2"> + <ant dir="creativecommons" target="test"/> + <ant dir="index-more" target="test"/> + <ant dir="languageidentifier" target="test"/> + <ant dir="lib-http" target="test"/> + <ant dir="ontology" target="test"/> + <ant dir="protocol-httpclient" target="test"/> + <!--ant dir="parse-ext" target="test"/--> + <ant dir="parse-html" target="test"/> + <!-- <ant dir="parse-mp3" target="test"/> --> + <ant dir="parse-msexcel" target="test"/> + <ant dir="parse-mspowerpoint" target="test"/> + <ant dir="parse-msword" target="test"/> + <ant dir="parse-oo" target="test"/> + <ant dir="parse-pdf" target="test"/> + <ant dir="parse-rss" target="test"/> + <ant dir="feed" target="test"/> + <!-- <ant dir="parse-rtf" target="test"/> --> + <ant dir="parse-swf" target="test"/> + <ant dir="parse-zip" target="test"/> + <ant dir="query-url" target="test"/> + <ant dir="subcollection" target="test"/> + <ant dir="urlfilter-automaton" target="test"/> + <ant dir="urlfilter-domain" target="test" /> + <ant dir="urlfilter-regex" target="test"/> + <ant dir="urlfilter-suffix" target="test"/> + <ant dir="urlnormalizer-basic" target="test"/> + <ant dir="urlnormalizer-pass" target="test"/> + <ant dir="urlnormalizer-regex" target="test"/> + </parallel> + </target> + + <!-- ====================================================== --> + <!-- Clean all of the plugins. 
--> + <!-- ====================================================== --> + <target name="clean"> + <ant dir="analysis-de" target="clean"/> + <ant dir="analysis-fr" target="clean"/> + <ant dir="clustering-carrot2" target="clean"/> + <ant dir="creativecommons" target="clean"/> + <ant dir="feed" target="clean"/> + <ant dir="index-basic" target="clean"/> + <ant dir="index-anchor" target="clean"/> + <ant dir="index-more" target="clean"/> + <ant dir="field-basic" target="clean"/> + <ant dir="field-boost" target="clean"/> + <ant dir="languageidentifier" target="clean"/> + <ant dir="lib-commons-httpclient" target="clean"/> + <ant dir="lib-http" target="clean"/> + <ant dir="lib-jakarta-poi" target="clean"/> + <ant dir="lib-lucene-analyzers" target="clean"/> + <ant dir="lib-nekohtml" target="clean"/> + <ant dir="lib-parsems" target="clean"/> + <ant dir="lib-regex-filter" target="clean"/> + <ant dir="lib-xml" target="clean"/> + <ant dir="microformats-reltag" target="clean"/> + <ant dir="nutch-extensionpoints" target="clean"/> + <ant dir="ontology" target="clean"/> + <ant dir="protocol-file" target="clean"/> + <ant dir="protocol-ftp" target="clean"/> + <ant dir="protocol-http" target="clean"/> + <ant dir="protocol-httpclient" target="clean"/> + <ant dir="parse-ext" target="clean"/> + <ant dir="parse-html" target="clean"/> + <ant dir="parse-js" target="clean"/> + <ant dir="parse-mp3" target="clean"/> + <ant dir="parse-msexcel" target="clean"/> + <ant dir="parse-mspowerpoint" target="clean"/> + <ant dir="parse-msword" target="clean"/> + <ant dir="parse-oo" target="clean"/> + <ant dir="parse-pdf" target="clean"/> + <ant dir="parse-rss" target="clean"/> + <ant dir="parse-rtf" target="clean"/> + <ant dir="parse-swf" target="clean"/> + <ant dir="parse-text" target="clean"/> + <ant dir="parse-zip" target="clean"/> + <ant dir="query-basic" target="clean"/> + <ant dir="query-more" target="clean"/> + <ant dir="query-site" target="clean"/> + <ant dir="query-url" target="clean"/> + <ant 
dir="query-custom" target="clean"/> + <ant dir="response-json" target="clean"/> + <ant dir="response-xml" target="clean"/> + <ant dir="scoring-opic" target="clean"/> + <ant dir="scoring-link" target="clean"/> + <ant dir="subcollection" target="clean"/> + <ant dir="summary-basic" target="clean"/> + <ant dir="summary-lucene" target="clean"/> + <ant dir="tld" target="clean"/> + <ant dir="urlfilter-automaton" target="clean"/> + <ant dir="urlfilter-domain" target="clean" /> + <ant dir="urlfilter-prefix" target="clean"/> + <ant dir="urlfilter-regex" target="clean"/> + <ant dir="urlfilter-suffix" target="clean"/> + <ant dir="urlfilter-validator" target="clean"/> + <ant dir="urlnormalizer-basic" target="clean"/> + <ant dir="urlnormalizer-pass" target="clean"/> + <ant dir="urlnormalizer-regex" target="clean"/> + + <ant dir="index-nutchwax" target="clean" /> + <ant dir="query-nutchwax" target="clean" /> + <ant dir="scoring-nutchwax" target="clean" /> + <ant dir="urlfilter-nutchwax" target="clean" /> + </target> +</project> Property changes on: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/plugin/build.xml ___________________________________________________________________ Added: svn:executable + * Deleted: trunk/archive-access/projects/nutchwax/archive/src/plugin/build-plugin.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/plugin/build-plugin.xml 2010-02-20 03:18:57 UTC (rev 2957) +++ trunk/archive-access/projects/nutchwax/archive/src/plugin/build-plugin.xml 2010-02-20 03:20:59 UTC (rev 2958) @@ -1,216 +0,0 @@ -<?xml version="1.0"?> -<!-- - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. 
- The ASF licenses this file to You under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. ---> -<!-- Imported by plugin build.xml files to define default targets. --> -<project> - - <property name="name" value="${ant.project.name}"/> - <property name="root" value="${basedir}"/> - - <!-- load plugin-specific properties first --> - <property file="${user.home}/${name}.build.properties" /> - <property file="${root}/build.properties" /> - - <property name="nutch.root" location="${root}/../../../../../"/> - - <property name="src.dir" location="${root}/src/java"/> - <property name="src.test" location="${root}/src/test"/> - - <available file="${src.test}" type="dir" property="test.available"/> - - <property name="conf.dir" location="${nutch.root}/conf"/> - - <property name="build.dir" location="${nutch.root}/build/${name}"/> - <property name="build.classes" location="${build.dir}/classes"/> - <property name="build.test" location="${build.dir}/test"/> - - <property name="deploy.dir" location="${nutch.root}/build/plugins/${name}"/> - - <!-- load nutch defaults last so that they can be overridden above --> - <property file="${nutch.root}/default.properties" /> - - <path id="plugin.deps"/> - - <fileset id="lib.jars" dir="${root}" includes="lib/*.jar"/> - - <!-- the normal classpath --> - <path id="classpath"> - <pathelement location="${build.classes}"/> - <fileset refid="lib.jars"/> - <pathelement location="${nutch.root}/build/classes"/> - <fileset dir="${nutch.root}/lib"> - <include name="*.jar" /> - 
</fileset> - <!-- This is the contrib/archive/lib directory --> - <fileset dir="../../../lib"> - <include name="*.jar" /> - </fileset> - <path refid="plugin.deps"/> - </path> - - <!-- the unit test classpath --> - <path id="test.classpath"> - <pathelement location="${build.test}" /> - <pathelement location="${nutch.root}/build/test/classes"/> - <pathelement location="${nutch.root}/src/test"/> - <pathelement location="${conf.dir}"/> - <pathelement location="${nutch.root}/build"/> - <path refid="classpath"/> - </path> - - <!-- ====================================================== --> - <!-- Stuff needed by all targets --> - <!-- ====================================================== --> - <target name="init"> - <mkdir dir="${build.dir}"/> - <mkdir dir="${build.classes}"/> - <mkdir dir="${build.test}"/> - - <antcall target="init-plugin"/> - </target> - - <!-- to be overridden by sub-projects --> - <target name="init-plugin"/> - - <!-- - ! Used to build plugin compilation dependencies - ! (to be overridden by plugins) - !--> - <target name="deps-jar"/> - - <!-- - ! Used to deploy plugin runtime dependencies - ! 
(to be overridden by plugins) - !--> - <target name="deps-test"/> - - <!-- ====================================================== --> - <!-- Compile the Java files --> - <!-- ====================================================== --> - <target name="compile" depends="init,deps-jar"> - <echo message="Compiling plugin: ${name}"/> - <javac - encoding="${build.encoding}" - srcdir="${src.dir}" - includes="**/*.java" - destdir="${build.classes}" - debug="${javac.debug}" - optimize="${javac.optimize}" - target="${javac.version}" - source="${javac.version}" - deprecation="${javac.deprecation}"> - <classpath refid="classpath"/> - </javac> - </target> - - <target name="compile-core"> - <ant target="compile-core" inheritall="false" dir="${nutch.root}"/> - <ant target="compile"/> - </target> - - <!-- ================================================================== --> - <!-- Make plugin .jar --> - <!-- ================================================================== --> - <!-- --> - <!-- ================================================================== --> - <target name="jar" depends="compile"> - <jar - jarfile="${build.dir}/${name}.jar" - basedir="${build.classes}" - /> - </target> - - <target name="jar-core" depends="compile-core"> - <jar - jarfile="${build.dir}/${name}.jar" - basedir="${build.classes}" - /> - </target> - - <!-- ================================================================== --> - <!-- Deploy plugin to ${deploy.dir} --> - <!-- ================================================================== --> - <!-- --> - <!-- ================================================================== --> - <target name="deploy" depends="jar, deps-test"> - <mkdir dir="${deploy.dir}"/> - <copy file="plugin.xml" todir="${deploy.dir}" - preservelastmodified="true"/> - <available property="lib-available" - file="${build.dir}/${name}.jar"/> - <antcall target="copy-generated-lib"/> - <copy todir="${deploy.dir}" flatten="true"> - <fileset refid="lib.jars"/> - </copy> - 
</target> - - <target name="copy-generated-lib" if="lib-available"> - <copy file="${build.dir}/${name}.jar" todir="${deploy.dir}" failonerror="false"/> - </target> - - <!-- ================================================================== --> - <!-- Compile test code --> - <!-- ================================================================== --> - <target name="compile-test" depends="compile" if="test.available"> - <javac - encoding="${build.encoding}" - srcdir="${src.test}" - includes="**/*.java" - destdir="${build.test}" - debug="${javac.debug}" - optimize="${javac.optimize}" - target="${javac.version}" - source="${javac.version}" - deprecation="${javac.deprecation}"> - <classpath refid="test.classpath"/> - </javac> - </target> - - <!-- ================================================================== --> - <!-- Run unit tests --> - <!-- ================================================================== --> - <target name="test" depends="compile-test, deploy" if="test.available"> - <echo message="Testing plugin: ${name}"/> - - <junit printsummary="yes" haltonfailure="no" fork="yes" - errorProperty="tests.failed" failureProperty="tests.failed"> - <sysproperty key="test.data" value="${build.test}/data"/> - <sysproperty key="test.input" value="${root}/data"/> - <classpath refid="test.classpath"/> - <formatter type="plain" /> - <batchtest todir="${build.test}" unless="testcase"> - <fileset dir="${src.test}" - includes="**/Test*.java" excludes="**/${test.exclude}.java" /> - </batchtest> - <batchtest todir="${build.test}" if="testcase"> - <fileset dir="${src.test}" includes="**/${testcase}.java"/> - </batchtest> - </junit> - - <fail if="tests.failed">Tests failed!</fail> - - </target> - - <!-- ================================================================== --> - <!-- Clean. 
Delete the build files, and their directories --> - <!-- ================================================================== --> - <target name="clean"> - <delete dir="${build.dir}"/> - <delete dir="${deploy.dir}"/> - </target> - -</project> Deleted: trunk/archive-access/projects/nutchwax/archive/src/plugin/build.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/plugin/build.xml 2010-02-20 03:18:57 UTC (rev 2957) +++ trunk/archive-access/projects/nutchwax/archive/src/plugin/build.xml 2010-02-20 03:20:59 UTC (rev 2958) @@ -1,45 +0,0 @@ -<?xml version="1.0"?> -<!-- - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to You under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. ---> -<project name="nutchwax" default="deploy-core" basedir="."> - - <target name="deploy-core"> - <ant target="compile-core" inheritall="false" dir="../../../../"/> - <ant target="deploy"/> - </target> - - <!-- ====================================================== --> - <!-- Build & deploy all the plugin jars. 
--> - <!-- ====================================================== --> - <target name="deploy"> - <ant dir="index-nutchwax" target="deploy"/> - <ant dir="query-nutchwax" target="deploy"/> - <ant dir="urlfilter-nutchwax" target="deploy"/> - <ant dir="scoring-nutchwax" target="deploy"/> - </target> - - <!-- ====================================================== --> - <!-- Clean all of the plugins. --> - <!-- ====================================================== --> - <target name="clean"> - <ant dir="index-nutchwax" target="clean"/> - <ant dir="query-nutchwax" target="clean"/> - <ant dir="urlfilter-nutchwax" target="clean"/> - <ant dir="scoring-nutchwax" target="clean"/> - </target> - -</project> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2010-02-20 03:19:04

Revision: 2957
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2957&view=rev
Author:   binzino
Date:     2010-02-20 03:18:57 +0000 (Sat, 20 Feb 2010)

Log Message:
-----------
WAX-73. Change fieldcache to false. Also added scoring-nutchwax to the plugin list even though we don't normally use it.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml

Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml	2010-02-12 20:54:15 UTC (rev 2956)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml	2010-02-20 03:18:57 UTC (rev 2957)
@@ -10,7 +10,7 @@
   <!-- Add 'index-nutchwax' and 'query-nutchwax' to plugin list. -->
   <!-- Also, add 'parse-pdf' -->
   <!-- Remove 'urlfilter-regex' and 'normalizer-(pass|regex|basic)' -->
-  <value>protocol-http|parse-(text|html|js|pdf)|index-nutchwax|query-(basic|nutchwax)|summary-basic|urlfilter-nutchwax</value>
+  <value>protocol-http|parse-(text|html|js|pdf)|index-nutchwax|query-(basic|nutchwax)|summary-basic|scoring-nutchwax|urlfilter-nutchwax</value>
 </property>
 
 <!--
@@ -182,7 +182,7 @@
 
 <property>
   <name>searcher.fieldcache</name>
-  <value>true</value>
+  <value>false</value>
 </property>
 
 </configuration>
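The effect of the plugin-list change in rev 2957 can be checked outside of Nutch. The sketch below is only an illustration: it assumes the plugin list value is treated as a regular expression that must match a whole plugin id, and `PluginIncludesCheck` and `isIncluded` are hypothetical names, not part of Nutch or NutchWAX. The two pattern strings are copied from the diff.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Nutch or NutchWAX. */
class PluginIncludesCheck {

  // Plugin list value before rev 2957 (copied from the diff).
  static final String OLD_VALUE =
      "protocol-http|parse-(text|html|js|pdf)|index-nutchwax|"
      + "query-(basic|nutchwax)|summary-basic|urlfilter-nutchwax";

  // Plugin list value after rev 2957: 'scoring-nutchwax' is now listed.
  static final String NEW_VALUE =
      "protocol-http|parse-(text|html|js|pdf)|index-nutchwax|"
      + "query-(basic|nutchwax)|summary-basic|scoring-nutchwax|urlfilter-nutchwax";

  /** Assumption: a plugin is enabled when the regex matches its whole id. */
  static boolean isIncluded(String includesRegex, String pluginId) {
    return Pattern.compile(includesRegex).matcher(pluginId).matches();
  }

  public static void main(String[] args) {
    String[] ids = {"scoring-nutchwax", "parse-pdf", "urlfilter-regex"};
    for (String id : ids) {
      System.out.println(id
          + " old=" + isIncluded(OLD_VALUE, id)
          + " new=" + isIncluded(NEW_VALUE, id));
    }
  }
}
```

Under that assumption, only `scoring-nutchwax` changes state between the two values; `parse-pdf` is enabled in both and `urlfilter-regex` in neither.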
Revision: 2956
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2956&view=rev
Author:   binzino
Date:     2010-02-12 20:54:15 +0000 (Fri, 12 Feb 2010)

Log Message:
-----------
Added logic to handle per-collection segments.

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/DistributedSegmentBean.java

Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/DistributedSegmentBean.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/DistributedSegmentBean.java	                        (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/DistributedSegmentBean.java	2010-02-12 20:54:15 UTC (rev 2956)
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.searcher;
+
+import java.io.IOException;
+import java.net.InetSocketAddress;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.Callable;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ConcurrentMap;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.ipc.RPC;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseText;
+
+public class DistributedSegmentBean implements SegmentBean {
+
+  private static final ExecutorService executor =
+    Executors.newCachedThreadPool();
+
+  private final ScheduledExecutorService pingService;
+
+  private class DistSummmaryTask implements Callable<Summary[]> {
+    private int id;
+
+    private HitDetails[] details;
+    private Query query;
+
+    public DistSummmaryTask(int id) {
+      this.id = id;
+    }
+
+    public Summary[] call() throws Exception {
+      if (details == null) {
+        return null;
+      }
+      return beans[id].getSummary(details, query);
+    }
+
+    public void setSummaryArgs(HitDetails[] details, Query query) {
+      this.details = details;
+      this.query = query;
+    }
+
+  }
+
+  private class SegmentWorker implements Runnable {
+    private int id;
+
+    public SegmentWorker(int id) {
+      this.id = id;
+    }
+
+    public void run() {
+      try {
+        String[] segments = beans[id].getSegmentNames();
+        for (String segment : segments) {
+          segmentMap.put(segment, id);
+        }
+      } catch (IOException e) {
+        // remove all segments this bean was serving
+        Iterator<Map.Entry<String, Integer>> i =
+          segmentMap.entrySet().iterator();
+        while (i.hasNext()) {
+          Map.Entry<String, Integer> entry = i.next();
+          int curId = entry.getValue();
+          if (curId == this.id) {
+            i.remove();
+          }
+        }
+      }
+    }
+  }
+
+  private long timeout;
+
+  private SegmentBean[] beans;
+
+  private boolean perCollection = false;
+
+  private ConcurrentMap<String, Integer> segmentMap;
+
+  private List<Callable<Summary[]>> summaryTasks;
+
+  private List<SegmentWorker> segmentWorkers;
+
+  public DistributedSegmentBean(Configuration conf, Path serversConfig)
+  throws IOException {
+    this.timeout = conf.getLong("ipc.client.timeout", 60000);
+    this.perCollection = conf.getBoolean( "nutchwax.FetchedSegments.perCollection", false );
+
+    List<SegmentBean> beanList = new ArrayList<SegmentBean>();
+
+    List<InetSocketAddress> segmentServers =
+      NutchBean.readAddresses(serversConfig, conf);
+
+    for (InetSocketAddress addr : segmentServers) {
+      SegmentBean bean = (RPCSegmentBean) RPC.getProxy(RPCSegmentBean.class,
+          FetchedSegments.VERSION, addr, conf);
+      beanList.add(bean);
+    }
+
+    beans = beanList.toArray(new SegmentBean[beanList.size()]);
+
+    summaryTasks = new ArrayList<Callable<Summary[]>>(beans.length);
+    segmentWorkers = new ArrayList<SegmentWorker>(beans.length);
+
+    for (int i = 0; i < beans.length; i++) {
+      summaryTasks.add(new DistSummmaryTask(i));
+      segmentWorkers.add(new SegmentWorker(i));
+    }
+
+    segmentMap = new ConcurrentHashMap<String, Integer>();
+
+    pingService = Executors.newScheduledThreadPool(beans.length);
+    for (SegmentWorker worker : segmentWorkers) {
+      pingService.scheduleAtFixedRate(worker, 0, 30, TimeUnit.SECONDS);
+    }
+  }
+
+  private SegmentBean getBean(HitDetails details) {
+    String key = perCollection ? "collection":"segment";
+    return beans[segmentMap.get(key)];
+  }
+
+  public String[] getSegmentNames() {
+    return segmentMap.keySet().toArray(new String[segmentMap.size()]);
+  }
+
+  public byte[] getContent(HitDetails details) throws IOException {
+    return getBean(details).getContent(details);
+  }
+
+  public long getFetchDate(HitDetails details) throws IOException {
+    return getBean(details).getFetchDate(details);
+  }
+
+  public ParseData getParseData(HitDetails details) throws IOException {
+    return getBean(details).getParseData(details);
+  }
+
+  public ParseText getParseText(HitDetails details) throws IOException {
+    return getBean(details).getParseText(details);
+  }
+
+  public void close() throws IOException {
+    executor.shutdown();
+    pingService.shutdown();
+    for (SegmentBean bean : beans) {
+      bean.close();
+    }
+  }
+
+  public Summary getSummary(HitDetails details, Query query)
+  throws IOException {
+    return getBean(details).getSummary(details, query);
+  }
+
+  @SuppressWarnings("unchecked")
+  public Summary[] getSummary(HitDetails[] detailsArr, Query query)
+  throws IOException {
+    List<HitDetails>[] detailsList = new ArrayList[summaryTasks.size()];
+    for (int i = 0; i < detailsList.length; i++) {
+      detailsList[i] = new ArrayList<HitDetails>();
+    }
+    for (HitDetails details : detailsArr) {
+      String key = details.getValue( perCollection ? "collection":"segment" );
+      detailsList[segmentMap.get(key)].add(details);
+    }
+    for (int i = 0; i < summaryTasks.size(); i++) {
+      DistSummmaryTask task = (DistSummmaryTask)summaryTasks.get(i);
+      if (detailsList[i].size() > 0) {
+        HitDetails[] taskDetails =
+          detailsList[i].toArray(new HitDetails[detailsList[i].size()]);
+        task.setSummaryArgs(taskDetails, query);
+      } else {
+        task.setSummaryArgs(null, null);
+      }
+    }
+
+    List<Future<Summary[]>> summaries;
+    try {
+      summaries =
+        executor.invokeAll(summaryTasks, timeout, TimeUnit.MILLISECONDS);
+    } catch (InterruptedException e) {
+      throw new RuntimeException(e);
+    }
+
+    List<Summary> summaryList = new ArrayList<Summary>();
+    for (Future<Summary[]> f : summaries) {
+      Summary[] summaryArray;
+      try {
+        summaryArray = f.get();
+        if (summaryArray == null) {
+          continue;
+        }
+        for (Summary summary : summaryArray) {
+          summaryList.add(summary);
+        }
+      } catch (Exception e) {
+        if (e.getCause() instanceof IOException) {
+          throw (IOException) e.getCause();
+        }
+        throw new RuntimeException(e);
+      }
+    }
+
+    return summaryList.toArray(new Summary[summaryList.size()]);
+  }
+
+}
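The per-collection logic in rev 2956 comes down to which hit field supplies the lookup key into `segmentMap`: `collection` when `nutchwax.FetchedSegments.perCollection` is set, `segment` otherwise. A minimal sketch of that lookup follows. It rests on two assumptions. First, the key is taken to be the field's value, as in the batch `getSummary(HitDetails[], Query)`, rather than the field name itself, which is what the single-hit `getBean` in the diff appears to pass to `segmentMap.get`. Second, in per-collection mode the remote servers are assumed to advertise collection names through `getSegmentNames()`. `SegmentRouter`, `register`, and `route` are hypothetical names, and a plain `Map<String, String>` stands in for `HitDetails`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical stand-in for the routing step; not the NutchWAX class. */
class SegmentRouter {

  private final boolean perCollection;

  // Maps an advertised segment (or collection) name to a server index.
  private final ConcurrentMap<String, Integer> segmentMap =
      new ConcurrentHashMap<String, Integer>();

  SegmentRouter(boolean perCollection) {
    this.perCollection = perCollection;
  }

  /** Records that the server at index beanId serves the given name. */
  void register(String name, int beanId) {
    segmentMap.put(name, beanId);
  }

  /** Returns the server index for a hit, or -1 if no server is known. */
  int route(Map<String, String> hitFields) {
    // Pick the field, then look up its VALUE (assumed intended behavior).
    String field = perCollection ? "collection" : "segment";
    String key = hitFields.get(field);
    if (key == null) {
      return -1; // ConcurrentHashMap rejects null keys
    }
    Integer id = segmentMap.get(key);
    return id == null ? -1 : id;
  }

  public static void main(String[] args) {
    SegmentRouter router = new SegmentRouter(true);
    router.register("collectionA", 0);
    router.register("collectionB", 1);

    Map<String, String> hit = new HashMap<String, String>();
    hit.put("segment", "20100212205415");
    hit.put("collection", "collectionB");

    System.out.println(router.route(hit)); // prints 1
  }
}
```

Returning -1 for an unknown key is a choice made for this sketch; the committed code indexes `beans` directly with the map result, so a missing entry would surface as an exception instead.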