Revision: 2680
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2680&view=rev
Author:   bradtofel
Date:     2009-01-29 23:52:10 +0000 (Thu, 29 Jan 2009)

Log Message:
-----------
BUGFIX(ACC-58): was not adding DateRangeFilter for UrlPrefix queries.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java	2008-12-18 19:12:47 UTC (rev 2679)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java	2009-01-29 23:52:10 UTC (rev 2680)
@@ -372,6 +372,7 @@
 			filter.addFilter(drFilter);
 		} else if(type == TYPE_URL) {
 			filter.addFilter(new UrlPrefixMatchFilter(keyUrl));
+			filter.addFilter(drFilter);
 		} else {
 			throw new BadQueryException("Unknown type");
 		}

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
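The one-line fix above restores symmetry between the two query branches: both the exact-URL and URL-prefix cases must apply the date-range filter. A minimal sketch of that pattern follows; the `RecordFilter`/`CompositeFilter` types and `buildFilter` helper are simplified stand-ins for illustration, not the actual Wayback classes.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for Wayback's filter types; names are illustrative only.
interface RecordFilter {
    boolean accepts(String url, String date);
}

class CompositeFilter implements RecordFilter {
    private final List<RecordFilter> filters = new ArrayList<>();

    void addFilter(RecordFilter f) { filters.add(f); }

    // A record passes only if every filter in the chain accepts it.
    public boolean accepts(String url, String date) {
        for (RecordFilter f : filters) {
            if (!f.accepts(url, date)) return false;
        }
        return true;
    }
}

public class Acc58Sketch {
    // Build the query filter chain; the bug was that drFilter was added only
    // in the exact-URL branch, so URL-prefix queries ignored date ranges.
    static CompositeFilter buildFilter(boolean prefixQuery, String keyUrl,
                                       String start, String end) {
        CompositeFilter filter = new CompositeFilter();
        // Dates as sortable 14-digit-style strings, so string comparison works.
        RecordFilter drFilter = (url, date) ->
            date.compareTo(start) >= 0 && date.compareTo(end) <= 0;
        if (!prefixQuery) {
            filter.addFilter((url, date) -> url.equals(keyUrl));
            filter.addFilter(drFilter);
        } else {
            filter.addFilter((url, date) -> url.startsWith(keyUrl));
            filter.addFilter(drFilter); // the ACC-58 fix: apply here as well
        }
        return filter;
    }

    public static void main(String[] args) {
        CompositeFilter f = buildFilter(true, "http://example.com/", "2005", "2007");
        System.out.println(f.accepts("http://example.com/a", "2006"));
        System.out.println(f.accepts("http://example.com/a", "2009"));
    }
}
```

Without the second `addFilter(drFilter)` call, the prefix branch returns every capture of matching URLs regardless of the requested date range, which is exactly the symptom the bugfix addresses.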
From: <bi...@us...> - 2008-12-18 19:12:56
Revision: 2679
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2679&view=rev
Author:   binzino
Date:     2008-12-18 19:12:47 +0000 (Thu, 18 Dec 2008)

Log Message:
-----------
Make NutchWAX 0.12.3 release tag.

Added Paths:
-----------
    tags/nutchwax-0_12_3/
    tags/nutchwax-0_12_3/archive/

Property changes on: tags/nutchwax-0_12_3/archive
___________________________________________________________________
Added: svn:mergeinfo
   +
From: <bi...@us...> - 2008-12-18 18:37:45
Revision: 2678
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2678&view=rev
Author:   binzino
Date:     2008-12-18 18:37:40 +0000 (Thu, 18 Dec 2008)

Log Message:
-----------
Updated documentation for 0.12.3 release.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt
    trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
    trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
    trunk/archive-access/projects/nutchwax/archive/README.txt
    trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt

Added: trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt	                        (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt	2008-12-18 18:37:40 UTC (rev 2678)
@@ -0,0 +1,392 @@
+
+BUILD-NOTES.txt
+2008-12-18
+Aaron Binns
+
+======================================================================
+Build notes
+======================================================================
+
+This document contains supplemental notes regarding the NutchWAX
+build, expanding upon the information found in the various READMEs and
+HOWTOs.
+
+======================================================================
+
+This 0.12.x release of NutchWAX is radically different in source-code
+form compared to the previous release, 0.10.
+
+One of the design goals of 0.12.x was to reduce or even eliminate the
+"copy/paste/edit" approach of 0.10.  The 0.10 (and prior) NutchWAX
+releases had to copy/paste/edit large chunks of Nutch source code in
+order to add the NutchWAX features.
+
+Also, the NutchWAX 0.12.x sources and build are designed to one day be
+added into mainline Nutch as a proper "contrib" package; then
+eventually be fully integrated into the core Nutch source code.
+
+======================================================================
+
+Most of the NutchWAX source code is relatively straightforward to those
+already familiar with the inner workings of Nutch.  Still, special
+attention on one class is worthwhile:
+
+  src/java/org/archive/nutchwax/Importer.java
+
+This is where ARC/WARC files are read and their documents are imported
+into a Nutch segment.
+
+It is inspired by:
+
+  nutch/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java
+
+on the Nutch SVN head.
+
+Our implementation differs in a few important ways:
+
+ o Rather than taking a directory with ARC files as input, we take a
+   manifest file with URLs to ARC files.  This way, the manifest is
+   split up among the distributed Hadoop jobs and the ARC files are
+   processed in whole by each worker.
+
+   In the Nutch SVN, the ArcSegmentCreator.java expects the input
+   directory to contain the ARC files and (AFAICT) splits them up and
+   distributes them across the Hadoop workers.
+
+ o We use the standard Internet Archive ARCReader and WARCReader
+   classes.  Thus, NutchWAX can read both ARC and WARC files, whereas
+   the ArcSegmentCreator class can only read ARC files.
+
+ o We add metadata fields to the document, which are then available
+   to the "index-nutchwax" plugin at indexing-time.
+
+     Importer.importRecord()
+       ...
+       contentMetadata.set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype() );
+       contentMetadata.set( NutchWax.ARCNAME_KEY, meta.getArcFile().getName() );
+       contentMetadata.set( NutchWax.COLLECTION_KEY, collectionName );
+       contentMetadata.set( NutchWax.DATE_KEY, meta.getDate() );
+       ...
+
+
+======================================================================
+Patching
+======================================================================
+
+When NutchWAX is built, a number of patches are automatically applied
+to the Nutch source and configuration files.
+
+----------------------------------------------------------------------
+The file
+
+  /opt/nutchwax-0.12.3/conf/tika-mimetypes.xml
+
+contains two errors: one where a mimetype is referenced before it is
+defined; and a second where a definition has an illegal character.
+
+These errors cause Nutch to not recognize certain mimetypes and
+therefore will ignore documents matching those mimetypes.
+
+There are two fixes:
+
+ 1. Move
+
+      <mime-type type="application/xml">
+        <alias type="text/xml" />
+        <glob pattern="*.xml" />
+      </mime-type>
+
+    definition higher up in the file, before the reference to it.
+
+ 2. Remove
+
+      <mime-type type="application/x-ms-dos-executable">
+        <alias type="application/x-dosexec;exe" />
+      </mime-type>
+
+    as the ';' character is illegal according to the comments in the
+    Nutch code.
+
+You can either apply these patches yourself, or copy an already-patched
+copy from:
+
+  /opt/nutchwax-0.12.3/contrib/archive/conf/tika-mimetypes.xml
+
+to
+
+  /opt/nutchwax-0.12.3/conf/tika-mimetypes.xml
+
+----------------------------------------------------------------------
+
+In the file 'conf/nutch-site.xml' we define some properties to
+over-ride the values in 'conf/nutch-default.xml'.
+
+--------------------------------------------------
+plugin.includes
+--------------------------------------------------
+Change the list of plugins from:
+
+  protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
+
+to
+
+  protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax
+
+In short, we add:
+
+  index-nutchwax
+  query-nutchwax
+  urlfilter-nutchwax
+  parse-pdf
+
+and remove:
+
+  urlfilter-regex
+  urlnormalizer-(pass|regex|basic)
+
+The only *required* changes are the additions of the NutchWAX index
+and query plugins.  The rest are optional, but recommended.
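Nutch evaluates `plugin.includes` as a regular expression over plugin ids, so a quick way to sanity-check an edited value is to match candidate ids against it with ordinary `java.util.regex` (this check is an editor's illustration, not a Nutch API):

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    public static void main(String[] args) {
        // The NutchWAX plugin.includes value recommended in the notes above.
        Pattern includes = Pattern.compile(
            "protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)"
          + "|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic"
          + "|urlfilter-nutchwax");

        // Plugins the change is meant to activate...
        String[] wanted  = { "index-nutchwax", "query-nutchwax",
                             "urlfilter-nutchwax", "parse-pdf" };
        // ...and plugins the change is meant to drop.
        String[] dropped = { "urlfilter-regex", "urlnormalizer-basic" };

        for (String id : wanted)
            System.out.println(id + " -> " + includes.matcher(id).matches());
        for (String id : dropped)
            System.out.println(id + " -> " + includes.matcher(id).matches());
    }
}
```

A plugin is activated only when its id matches the whole pattern, which is why `matches()` (full-string match) rather than `find()` is the right check here.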
+
+The "parse-pdf" plugin is added simply because we have lots of PDFs in
+our archives and we want to index them.  We sometimes remove the
+"parse-js" plugin if we don't care to index JavaScript files.
+
+We also remove the default Nutch URL filtering and normalizing plugins
+because we do not need the URLs normalized nor filtered.  We trust
+that the tool that produced the ARC/WARC file will have normalized the
+URLs contained therein according to its own rules so there's no need
+to normalize here.  Also, we don't filter by URL since we want to
+index as much of the ARC/WARC file as we have parsers for.
+
+We do, however, add the NutchWAX URL filter.  If de-duplication is
+being performed upon import, this plugin is required.  It performs URL
+filtering of the list of ARC records to exclude based on
+URL+digest+date.
+
+--------------------------------------------------
+indexingfilter.order
+--------------------------------------------------
+
+Add this property with a value of
+
+  org.apache.nutch.indexer.basic.BasicIndexingFilter
+  org.archive.nutchwax.index.ConfigurableIndexingFilter
+
+So that the NutchWAX indexing filter is run after the Nutch basic
+indexing filter.
+
+A full explanation is given in "README-dedup.txt".
+
+--------------------------------------------------
+mime.type.magic
+--------------------------------------------------
+We disable mimetype detection in Nutch for two reasons:
+
+1. The ARC/WARC file specifies the Content-Type of the document.  We
+   trust that the tool that created the ARC/WARC file got it right.
+
+2. The implementation in Nutch can use a lot of memory as the *entire*
+   document is read into memory as a byte[], then converted to a
+   String, then checked against the MIME database.  This can lead to
+   out of memory errors for large files, such as music and video.
+
+To disable, simply set the property value to false.
+
+  <property>
+    <name>mime.type.magic</name>
+    <value>false</value>
+  </property>
+
+--------------------------------------------------
+nutchwax.filter.index
+--------------------------------------------------
+Configure the 'index-nutchwax' plugin.  Specify how the metadata
+fields added by the Importer are mapped to the Lucene documents during
+indexing.
+
+The specifications here are of the form:
+
+  src-key:lowercase:store:tokenize:exclusive:dest-key
+
+where the only required part is the "src-key", the rest will assume
+the following defaults:
+
+  lowercase = true
+  store     = true
+  tokenize  = false
+  exclusive = true
+  dest-key  = src-key
+
+We recommend:
+
+<property>
+  <name>nutchwax.filter.index</name>
+  <value>
+    url:false:true:true
+    url:false:true:false:true:exacturl
+    orig:false
+    digest:false
+    filename:false
+    fileoffset:false
+    collection
+    date
+    type
+    length
+  </value>
+</property>
+
+The "url", "orig" and "digest" values are required, the rest are
+optional, but strongly recommended.
+
+--------------------------------------------------
+nutchwax.filter.query
+--------------------------------------------------
+Configure the 'query-nutchwax' plugin.  Specify which fields to make
+searchable via "field:[term|phrase]" query syntax, and whether they
+are "raw" fields or not.
+
+The specification format is one of:
+
+  field:<name>:<boost>
+  raw:<name>:<lowercase>:<boost>
+  group:<name>:<lowercase>:<delimiter>:<boost>
+
+Default values are
+
+  lowercase = true
+  delimiter = ","
+  boost     = 1.0f
+
+There is no "lowercase" property for "field" specification because the
+Nutch FieldQueryFilter doesn't expose the option, unlike the
+RawFieldQueryFilter.
+
+The "group" fields are raw fields that can accept multiple values,
+separated by a delimiter.  Multiple values appearing in a query are
+automagically translated into required OR-groups, such as
+
+  collection:"193,221,36"  =>  +(collection:193 collection:221 collection:36)
+
+NOTE: We do *not* use this filter for handling "date" queries, there
+is a specific filter for that: DateQueryFilter
+
+We recommend:
+
+<property>
+  <name>nutchwax.filter.query</name>
+  <value>
+    raw:digest:false
+    raw:filename:false
+    raw:fileoffset:false
+    raw:exacturl:false
+    group:collection
+    group:type
+    field:anchor
+    field:content
+    field:host
+    field:title
+  </value>
+</property>
+
+
+--------------------------------------------------
+nutchwax.urlfilter.wayback.exclusions
+--------------------------------------------------
+File containing the exclusion list for importing.
+
+Normally, this is specified on the command line when the NutchWAX
+Importer is invoked.  It can be specified here if preferred.
+
+--------------------------------------------------
+nutchwax.urlfilter.wayback.canonicalizer
+--------------------------------------------------
+
+For CDX-based de-duplication, the same URL canonicalization algorithm
+must be used here as was used to generate the CDX files.
+
+The default canonicalizer in Wayback's '(w)arc-indexer' utility
+is
+
+  org.archive.wayback.util.url.AggressiveUrlCanonicalizer
+
+which is the value provided in "nutch-site.xml".
+
+If the '(w)arc-indexer' is executed with the "-i" (identity)
+command-line option, then the matching canonicalizer
+
+  org.archive.wayback.util.url.IdentityUrlCanonicalizer
+
+must be specified here.
+
+--------------------------------------------------
+nutchwax.filter.http.status
+--------------------------------------------------
+This property configures a filter with a list of ranges
+of HTTP status codes to allow.
+
+Typically, most NutchWAX implementors do not wish to import and index
+404, 500, 302 and other non-success pages.  This is an inclusion
+filter, meaning that only ARC records with an HTTP status code
+matching any of the values will be imported.
+
+There is a special "unknown" value which can be used to include ARC
+records that don't have an HTTP status code (for whatever reason).
+
+The default setting provided in nutch-site.xml is to allow any 2XX
+success code:
+
+  <property>
+    <name>nutchwax.filter.http.status</name>
+    <value>
+      200-299
+    </value>
+  </property>
+
+But some other examples are:
+
+  Allow any 2XX success code *and* redirects, use:
+    <property>
+      <name>nutchwax.filter.http.status</name>
+      <value>
+        200-299
+        300-399
+      </value>
+    </property>
+
+  Be really strict about only certain codes, use:
+    <property>
+      <name>nutchwax.filter.http.status</name>
+      <value>
+        200
+        301
+        302
+        304
+      </value>
+    </property>
+
+  Mix of ranges and specific codes, including the "unknown"
+    <property>
+      <name>nutchwax.filter.http.status</name>
+      <value>
+        Unknown
+        200
+        300-399
+      </value>
+    </property>
+
+--------------------------------------------------
+nutchwax.import.content.limit
+--------------------------------------------------
+Similar to Nutch's
+
+  file.content.limit
+  http.content.limit
+  ftp.content.limit
+
+properties, this specifies a limit on the size of a document imported
+via NutchWAX.
+
+We recommend setting this to a size compatible with the memory
+capacity of the computers performing the import.  Something in the
+1-4MB range is typical.

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt	2008-12-16 19:53:25 UTC (rev 2677)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt	2008-12-18 18:37:40 UTC (rev 2678)
@@ -31,7 +31,7 @@
 in the full-text search index.
 
 Nutch's 'invertlinks' step inverts links and stores them in the
-'linkdb' directory.  We use the inlinks to boost the Lucene score of
+'linkdb' directory.  We use these inlinks to boost the Lucene score of
 documents in proportion to the number of inlinks.

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2008-12-16 19:53:25 UTC (rev 2677)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2008-12-18 18:37:40 UTC (rev 2678)
@@ -5,9 +5,8 @@
 
 Table of Contents
   o Prerequisites
-    - Nutch(WAX) installation
+    - NutchWAX installation
     - ARC/WARC files
-  o Configuration & Patching
   o Create a manifest
   o Import, Invert and Index
   o Search
@@ -27,7 +26,7 @@
 
    This HOWTO assumes it is installed in
 
-     /opt/nutch-1.0-dev
+     /opt/nutchwax-0.12.3
 
 2. ARC/WARC files.
@@ -40,348 +39,6 @@
 
 ======================================================================
-Patching
-======================================================================
-
-The vanilla NutchWAX as built according to the INSTALL.txt guide is
-not quite ready to be used out-of-the-box.
-
-Before you can use NutchWAX, you must first patch a bug that exists in
-the current Nutch SVN head.
-
-The file
-
-  /opt/nutch-1.0-dev/conf/tika-mimetypes.xml
-
-contains two errors: one where a mimetype is referenced before it is
-defined; and a second where a definition has an illegal character.
-
-These errors cause Nutch to not recognize certain mimetypes and
-therefore will ignore documents matching those mimetypes.
-
-There are two fixes:
-
- 1. Move
-
-      <mime-type type="application/xml">
-        <alias type="text/xml" />
-        <glob pattern="*.xml" />
-      </mime-type>
-
-    definition higher up in the file, before the reference to it.
-
- 2. Remove
-
-      <mime-type type="application/x-ms-dos-executable">
-        <alias type="application/x-dosexec;exe" />
-      </mime-type>
-
-    as the ';' character is illegal according to the comments in the
-    Nutch code.
-
-You can either apply these patches yourself, or copy an already-patched
-copy from:
-
-  /opt/nutch-1.0-dev/contrib/archive/conf/tika-mimetypes.xml
-
-to
-
-  /opt/nutch-1.0-dev/conf/tika-mimetypes.xml
-
-
-======================================================================
-Configuring
-======================================================================
-
-Since we assume that you are already familiar with Nutch, then you
-should already be familiar with configuring it.  The configuration
-is mainly defined in
-
-  /opt/nutch-1.0-dev/conf/nutch-default.xml
-
-NutchWAX requires the modification of two existing properties and the
-addition of two new ones.
-
-All of the modifications described below can be found in:
-
-  /opt/nutch-1.0-dev/contrib/archive/conf/nutch-site.xml
-
-You can either apply the configuration changes yourself, or copy that
-file to
-
-  /opt/nutch-1.0-dev/conf/nutch-site.xml
-
---------------------------------------------------
-plugin.includes
---------------------------------------------------
-Change the list of plugins from:
-
-  protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
-
-to
-
-  protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax
-
-In short, we add:
-
-  index-nutchwax
-  query-nutchwax
-  urlfilter-nutchwax
-  parse-pdf
-
-and remove:
-
-  urlfilter-regex
-  urlnormalizer-(pass|regex|basic)
-
-The only *required* changes are the additions of the NutchWAX index
-and query plugins.  The rest are optional, but recommended.
-
-The "parse-pdf" plugin is added simply because we have lots of PDFs in
-our archives and we want to index them.  We sometimes remove the
-"parse-js" plugin if we don't care to index JavaScript files.
-
-We also remove the default Nutch URL filtering and normalizing plugins
-because we do not need the URLs normalized nor filtered.  We trust
-that the tool that produced the ARC/WARC file will have normalized the
-URLs contained therein according to its own rules so there's no need
-to normalize here.  Also, we don't filter by URL since we want to
-index as much of the ARC/WARC file as we have parsers for.
-
-We do, however, add the NutchWAX URL filter.  If de-duplication is
-being performed upon import, this plugin is required.  It performs URL
-filtering of the list of ARC records to exclude based on
-URL+digest+date.
-
---------------------------------------------------
-indexingfilter.order
---------------------------------------------------
-
-Add this property with a value of
-
-  org.apache.nutch.indexer.basic.BasicIndexingFilter
-  org.archive.nutchwax.index.ConfigurableIndexingFilter
-
-So that the NutchWAX indexing filter is run after the Nutch basic
-indexing filter.
-
-A full explanation is given in "README-dedup.txt".
-
---------------------------------------------------
-mime.type.magic
---------------------------------------------------
-We disable mimetype detection in Nutch for two reasons:
-
-1. The ARC/WARC file specifies the Content-Type of the document.  We
-   trust that the tool that created the ARC/WARC file got it right.
-
-2. The implementation in Nutch can use a lot of memory as the *entire*
-   document is read into memory as a byte[], then converted to a
-   String, then checked against the MIME database.  This can lead to
-   out of memory errors for large files, such as music and video.
-
-To disable, simply set the property value to false.
-
-  <property>
-    <name>mime.type.magic</name>
-    <value>false</value>
-  </property>
-
---------------------------------------------------
-nutchwax.filter.index
---------------------------------------------------
-Configure the 'index-nutchwax' plugin.  Specify how the metadata
-fields added by the Importer are mapped to the Lucene documents during
-indexing.
-
-The specifications here are of the form:
-
-  src-key:lowercase:store:tokenize:exclusive:dest-key
-
-where the only required part is the "src-key", the rest will assume
-the following defaults:
-
-  lowercase = true
-  store     = true
-  tokenize  = false
-  exclusive = true
-  dest-key  = src-key
-
-We recommend:
-
-<property>
-  <name>nutchwax.filter.index</name>
-  <value>
-    url:false:true:true
-    url:flase:true:false:true:exacturl
-    orig:false
-    digest:false
-    filename:false
-    fileoffset:false
-    collection
-    date
-    type
-    length
-  </value>
-</property>
-
-The "url", "orig" and "digest" values are required, the rest are
-optional, but strongly recommended.
-
---------------------------------------------------
-nutchwax.filter.query
---------------------------------------------------
-Configure the 'query-nutchwax' plugin.  Specify which fields to make
-searchable via "field:[term|phrase]" query syntax, and whether they
-are "raw" fields or not.
-
-The specification format is one of:
-
-  field:<name>:<boost>
-  raw:<name>:<lowercase>:<boost>
-  group:<name>:<lowercase>:<delimiter>:<boost>
-
-Default values are
-
-  lowercase = true
-  delimiter = ","
-  boost     = 1.0f
-
-There is no "lowercase" property for "field" specification because the
-Nutch FieldQueryFilter doesn't expose the option, unlike the
-RawFieldQueryFilter.
-
-The "group" fields are raw fields that can accept multiple values,
-separated by a delimiter.  Multiple values appearing in a query are
-automagically translated into required OR-groups, such as
-
-  collection:"193,221,36"  =>  +(collection:193 collection:221 collection:36)
-
-NOTE: We do *not* use this filter for handling "date" queries, there
-is a specific filter for that: DateQueryFilter
-
-We recommend:
-
-<property>
-  <name>nutchwax.filter.query</name>
-  <value>
-    raw:digest:false
-    raw:filename:false
-    raw:fileoffset:false
-    raw:exacturl:false
-    group:collection
-    group:type
-    field:anchor
-    field:content
-    field:host
-    field:title
-  </value>
-</property>
-
-
---------------------------------------------------
-nutchwax.urlfilter.wayback.exclusions
---------------------------------------------------
-File containing the exclusion list for importing.
-
-Normally, this is specified on the command line with the NutchWAX
-Importer is invoked.  It can be specified here if preferred.
-
---------------------------------------------------
-nutchwax.urlfilter.wayback.canonicalizer
---------------------------------------------------
-
-For CDX-based de-duplication, the same URL canonicalization algorithm
-must be used here as was used to generate the CDX files.
-
-The default canonicalizer in Wayback's '(w)arc-indexer' utility
-is
-
-  org.archive.wayback.util.url.AggressiveUrlCanonicalizer
-
-which is the value provided in "nutch-site.xml".
-
-If the '(w)arc-indexer' is executed with the "-i" (identity)
-command-line option, then the matching canonicalizer
-
-  org.archive.wayback.util.url.IdentityUrlCanonicalizer
-
-must be specified here.
-
---------------------------------------------------
-nutchwax.filter.http.status
---------------------------------------------------
-This property configures a filter with a list of ranges
-of HTTP status codes to allow.
-
-Typically, most NutchWAX implementors do not wish to import and index
-404, 500, 302 and other non-success pages.  This is an inclusion
-filter, meaning that only ARC records with an HTTP status code
-matching any of the values will be imported.
-
-There is a special "unknown" value which can be used to include ARC
-records that don't have an HTTP status code (for whatever reason).
-
-The default setting provided in nutch-site.xml is to allow any 2XX
-success code:
-
-  <property>
-    <name>nutchwax.filter.http.status</name>
-    <value>
-      200-299
-    </value>
-  </property>
-
-But some other examples are:
-
-  Allow any 2XX success code *and* redirects, use:
-    <property>
-      <name>nutchwax.filter.http.status</name>
-      <value>
-        200-299
-        300-399
-      </value>
-    </property>
-
-  Be really strict about only certain codes, use:
-    <property>
-      <name>nutchwax.filter.http.status</name>
-      <value>
-        200
-        301
-        302
-        304
-      </value>
-    </property>
-
-  Mix of ranges and specific codes, including the "unknown"
-    <property>
-      <name>nutchwax.filter.http.status</name>
-      <value>
-        Unknown
-        200
-        300-399
-      </value>
-    </property>
-
---------------------------------------------------
-nutchwax.import.content.limit
---------------------------------------------------
-Similar to Nutch's
-
-  file.content.limit
-  http.content.limit
-  ftp.content.limit
-
-properties, this specifies a limit on the size of a document imported
-via NutchWAX.
-
-We recommend setting this to a size compatible with the memory
-capacity of the computers performing the import.  Something in the
-1-4MB range is typical.
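The status-code syntax described in the `nutchwax.filter.http.status` notes above (ranges like `200-299`, single codes, and the special "unknown" token) is simple to model. A hypothetical sketch of such an inclusion filter follows; `HttpStatusFilterSketch` and its `allows` method are illustrative names, not the actual NutchWAX class:

```java
public class HttpStatusFilterSketch {
    // One whitespace-separated token per entry: either "lo-hi", a single
    // code, or the special "unknown" marker (case-insensitive).
    static boolean allows(String config, Integer status) {
        for (String token : config.trim().split("\\s+")) {
            if (token.equalsIgnoreCase("unknown")) {
                if (status == null) return true;  // record had no status line
            } else if (token.contains("-")) {
                String[] range = token.split("-");
                int lo = Integer.parseInt(range[0]);
                int hi = Integer.parseInt(range[1]);
                if (status != null && status >= lo && status <= hi) return true;
            } else if (status != null && status == Integer.parseInt(token)) {
                return true;
            }
        }
        return false; // inclusion filter: anything unmatched is skipped
    }

    public static void main(String[] args) {
        String config = "Unknown 200 300-399";
        System.out.println(allows(config, 200));   // listed code
        System.out.println(allows(config, 302));   // inside 300-399
        System.out.println(allows(config, 404));   // not listed
        System.out.println(allows(config, null));  // "unknown" records pass
    }
}
```

The key behavior is that the filter is inclusive: a record is imported only if its status matches some token, and records with no status at all pass only when "unknown" is listed.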
-
-
-======================================================================
 Create a manifest
 ======================================================================
@@ -411,10 +68,10 @@
 
   $ mkdir crawl
   $ cd crawl
-  $ /opt/nutch-1.0-dev/bin/nutchwax import ../manifest
-  $ /opt/nutch-1.0-dev/bin/nutch updatedb crawldb -dir segments
-  $ /opt/nutch-1.0-dev/bin/nutch invertlinks linkdb -dir segments
-  $ /opt/nutch-1.0-dev/bin/nutch index indexes crawldb linkdb segments/*
+  $ /opt/nutchwax-0.12.3/bin/nutchwax import ../manifest
+  $ /opt/nutchwax-0.12.3/bin/nutch updatedb crawldb -dir segments
+  $ /opt/nutchwax-0.12.3/bin/nutch invertlinks linkdb -dir segments
+  $ /opt/nutchwax-0.12.3/bin/nutch index indexes crawldb linkdb segments/*
 
   $ ls -F1
   crawldb/
   indexes/
@@ -439,7 +96,7 @@
 
   $ cd ../
   $ ls -F1
   crawl/
 
-  $ /opt/nutch-1.0-dev/bin/nutch org.apache.nutch.searcher.NutchBean computer
+  $ /opt/nutchwax-0.12.3/bin/nutch org.apache.nutch.searcher.NutchBean computer
 
 This calls the NutchBean to execute a simple keyword search for
 "computer".  Use whatever query term you think appears in the
@@ -450,17 +107,9 @@
 Web Deployment
 ======================================================================
 
-As users of Nutch are aware, the web application (nutch-1.0-dev.war)
-bundled with Nutch contains duplicate copies of the configuration
-files.
+The Nutch(WAX) web application is bundled with NutchWAX as
 
-So, all patches and configuration changes that we made to the
-files in
+  /opt/nutchwax-0.12.3/nutch-1.0-dev.war
 
- /opt/nutch-1.0-dev/conf
-
-will have to be duplicated in the Nutch webapp when it is deployed.
-
-This is not due to NutchWAX, this is a "feature" of regular Nutch.  I
-just thought it would be good to remind everyone since we did make
-configuration changes for NutchWAX.
+Simply deploy that web application in the same fashion as with
+Nutch.
Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt	2008-12-16 19:53:25 UTC (rev 2677)
+++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt	2008-12-18 18:37:40 UTC (rev 2678)
@@ -3,10 +3,22 @@
 2008-12-18
 Aaron Binns
 
+Table of Contents
+  o Introduction
+  o Build from source
+    - SVN: Nutch 1.0-dev
+    - SVN: NutchWAX
+    - Build and Install
+  o Install binary package
+
+
+======================================================================
+Introduction
+======================================================================
+
 This installation guide assumes the reader is already familiar with
 building, packaging and deploying Nutch 1.0-dev.
 
-
 The NutchWAX 0.12 source and build system are designed to integrate
 into the existing Nutch 1.0-dev source and build.
@@ -20,12 +32,12 @@
 proper, then builds NutchWAX components and integrates them into the
 Nutch build directory.
 
-We recommend that you execute all build commands from the NutchWAX
-directory.  This way, NutchWAX will ensure that any and all
+In order to build NutchWAX, execute all build commands from the
+NutchWAX directory.  This way, NutchWAX will ensure that any and all
 dependencies in Nutch will be properly built and kept up-to-date.
 
 Towards this goal, we have duplicated the most common build targets
-from the Nutch 'build.xml' file to the NutchWAX 'build.xml' file,
-such as:
+from the Nutch 'build.xml' file to the NutchWAX 'build.xml' file, such
+as:
 
   o compile
   o jar
@@ -39,8 +51,15 @@
 sub-directory as normal.
 
-Nutch-1.0-dev
--------------
+======================================================================
+Build from Source
+======================================================================
+
+To build from source, you must check-out the Nutch and NutchWAX sources
+from their respective 'subversion' source control servers.
+
+SVN: nutch-1.0-dev
+------------------
 As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
 Nutch doesn't have a 1.0 release package yet, so we have to use the
 Nutch SVN trunk.  The specific SVN revision that NutchWAX 0.12.3 is
@@ -53,9 +72,12 @@
 
   $ svn checkout -r 701524 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
   $ cd nutch
 
+Please be sure to check-out this specific version of the Nutch source.
+If you just grab the head of the trunk, there may be newer and
+incompatible changes to Nutch.
 
-NutchWAX
---------
+SVN: NutchWAX
+-------------
 Once you have Nutch-1.0-dev checked-out, check-out NutchWAX into
 Nutch's "contrib" directory.
@@ -65,7 +87,6 @@
 
 This will create a sub-directory named "archive" containing the
 NutchWAX sources.
 
-
 Build and install
 -----------------
 Assuming you already have the required tool-set for building Nutch,
@@ -91,3 +112,18 @@
 
   $ cd /opt
   $ tar xvfz nutch-1.0-dev.tar.gz
+  $ mv nutch-1.0-dev nutchwax-0.12.3
+
+
+======================================================================
+Install binary package
+======================================================================
+
+Alternatively, grab a "binary" release package from the Internet
+Archive's NutchWAX home page.
+
+Install it simply by untarring it, for example:
+
+  $ cd /opt
+  $ tar xvfz nutchwax-0.12.3.tar.gz
+

Modified: trunk/archive-access/projects/nutchwax/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README.txt	2008-12-16 19:53:25 UTC (rev 2677)
+++ trunk/archive-access/projects/nutchwax/archive/README.txt	2008-12-18 18:37:40 UTC (rev 2678)
@@ -3,6 +3,16 @@
 2008-12-18
 Aaron Binns
 
+Table of Contents
+  o Introduction
+  o Build and Install
+  o Tutorial
+
+
+======================================================================
+Introduction
+======================================================================
+
 Welcome to NutchWAX 0.12.3!
NutchWAX is a set of add-ons to Nutch in order to index and search @@ -17,7 +27,6 @@ Since NutchWAX is a set of add-ons to Nutch, you should already be familiar with Nutch before using NutchWAX. -====================================================================== The goal of NutchWAX is to enable full-text indexing and searching of documents stored in web archive file formats (ARC and WARC). @@ -26,13 +35,13 @@ to Nutch to read documents directly from ARC/WARC files. We call this process "importing" archive files. -Importing produces a Nutch segment, similar to Nutch crawling the -documents itself. In this scenario, document importing replaces the +Importing produces a Nutch segment, the same as when Nutch is used to +crawl documents itself. In essence, document importing replaces the conventional "generate/fetch/update" cycle of Nutch. Once the archival documents have been imported into a segment, the -regular Nutch commands to update the 'crawldb', invert the links and -index the document contents can proceed as normal. +regular Nutch commands to index the document contents can proceed as +normal. ====================================================================== @@ -71,73 +80,25 @@ conf/nutch-site.xml - Sample configuration properties file showing suggested settings for - Nutch and NutchWAX. + Additional configuration properties for NutchWAX, including + over-rides for properties defined in 'nutch-default.xml' There is no separate 'lib/nutchwax.jar' file for NutchWAX. NutchWAX is distributed in source code form and is intended to be built in conjunction with Nutch. -See "INSTALL.txt" for details on building NutchWAX and Nutch. -See "HOWTO.txt" for a quick tutorial on importing, indexing and -searching a set of documents in a web archive file. - ====================================================================== - -This 0.12.x release of NutchWAX is radically different in source-code -form compared to the previous release, 0.10. 
- -One of the design goals of 0.12.x was to reduce or even eliminate the -"copy/paste/edit" approach of 0.10. The 0.10 (and prior) NutchWAX -releases had to copy/paste/edit large chunks of Nutch source code in -order to add the NutchWAX features. - -Also, the NutchWAX 0.12.x sources and build are designed to one day be -added into mainline Nutch as a proper "contrib" package; then -eventually be fully integrated into the core Nutch source code. - +Build and Install ====================================================================== -Most of the NutchWAX source code is relatively straightfoward to those -already familiar with the inner workings of Nutch. Still, special -attention on one class is worth while: +See "INSTALL.txt" for detailed instructions to build NutchWAX from +source or install a binary package. - src/java/org/archive/nutchwax/Importer.java -This is where ARC/WARC files are read and their documents are imported -into a Nutch segment. - -It is inspired by: - - nutch/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java - -on the Nutch SVN head. - -Our implementation differs in a few important ways: - - o Rather than taking a directory with ARC files as input, we take a - manifest file with URLs to ARC files. This way, the manifest is - split up among the distributed Hadoop jobs and the ARC files are - processed in whole by each worker. - - In the Nutch SVN, the ArcSegmentCreator.java expects the input - directory to contain the ARC files and (AFAICT) splits them up and - distributes them across the Hadoop workers. - - o We use the standard Internet Archive ARCReader and WARCReader - classes. Thus, NutchWAX can read both ARC and WARC files, whereas - the ArcSegmentCreator class can only read ARC files. - - o We add metadata fields to the document, which are then available - to the "index-nutchwax" plugin at indexing-time. - - Importer.importRecord() - ... 
- contentMetadata.set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype() ); - contentMetadata.set( NutchWax.ARCNAME_KEY, meta.getArcFile().getName() ); - contentMetadata.set( NutchWax.COLLECTION_KEY, collectionName ); - contentMetadata.set( NutchWax.DATE_KEY, meta.getDate() ); - ... - ====================================================================== +Tutorial +====================================================================== + +See "HOWTO.txt" for a quick tutorial on importing, indexing and +searching a set of documents in a web archive file. Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-12-16 19:53:25 UTC (rev 2677) +++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-12-18 18:37:40 UTC (rev 2678) @@ -21,8 +21,45 @@ o Enhanced OpenSearchServlet o Improved XSLT sample for OpenSearch o System init.d script for searcher slaves - o Enhanced searcher slave aware of NutchWAX extensions + o Enhanced searcher slave which supports NutchWAX extensions + +One of the major changes to 0.12.3 is not a feature, enhancement or +bug-fix, but the way the NutchWAX source is "integrated" into the +Nutch source. + +Yes, the NutchWAX source is still kept in the contrib/archive +sub-directory, but when you invoke a build command from the +NutchWAX directory, such as + + $ cd nutch/contrib/archive + $ ant tar + +Many files from the NutchWAX source tree are copied directly into the +Nutch source tree before the build process begins. + +The reason for this is to make NutchWAX easier to use. + +In previous versions of NutchWAX, once 'ant' build command was +finished, the operator had to manually patch configuration files in +the Nutch directory. Upon a subsequent build, the files would be +over-written by Nutch's and would have to be patched again. + +It was a major hassle and complication. 
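The copy-before-build scheme described above can be sketched in miniature. This is only an illustration of the idea — the real work is done by Ant copy tasks in the NutchWAX 'build.xml' — and the directory layout below is hypothetical:

```python
import pathlib
import shutil
import tempfile

def overlay(src_root, dst_root):
    """Copy every file under src_root into dst_root, overwriting any
    duplicates -- the same net effect the NutchWAX build has when it
    copies its files into the Nutch source tree before building."""
    copied = []
    for src in src_root.rglob("*"):
        if src.is_file():
            rel = src.relative_to(src_root)
            dst = dst_root / rel
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            copied.append(str(rel))
    return sorted(copied)

# Tiny demonstration with throwaway directories (hypothetical paths).
with tempfile.TemporaryDirectory() as tmp:
    tmp = pathlib.Path(tmp)
    (tmp / "contrib/archive/src/nutch/conf").mkdir(parents=True)
    (tmp / "contrib/archive/src/nutch/conf/nutch-site.xml").write_text("<configuration/>")
    (tmp / "nutch").mkdir()
    result = overlay(tmp / "contrib/archive/src/nutch", tmp / "nutch")
    print(result)
```

Because existing files are overwritten on every build, local changes belong in the NutchWAX source tree — as the release notes advise — not in the Nutch copy.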
+ +Another impetus for copying files into the Nutch source was to patch +bugs and make enhancements in the Nutch Java code which couldn't be +effectively done keeping the sources separate. When an 'ant' build +command is run a few Java files are copied from the NutchWAX source +tree into the Nutch source tree. + +In release 0.12.3, the NutchWAX build file: 'build.xml' handles all of +this. Simply execute your build commands from 'contrib/archive' as +instructed in the HOWTO and no longer worry about patching +configuration files. If you wish to alter the NutchWAX configuration +file, make those changes in the NutchWAX source tree. + + ====================================================================== Issues ====================================================================== This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-12-16 19:53:29
Revision: 2677 http://archive-access.svn.sourceforge.net/archive-access/?rev=2677&view=rev Author: binzino Date: 2008-12-16 19:53:25 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Changed nutchwax.FetchedSegments.perCollection default value to false. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2008-12-16 19:52:42 UTC (rev 2676) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2008-12-16 19:53:25 UTC (rev 2677) @@ -144,7 +144,7 @@ --> <property> <name>nutchwax.FetchedSegments.perCollection</name> - <value>true</value> + <value>false</value> </property> <!-- The following are over-rides of property values in This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
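The 'nutchwax.FetchedSegments.perCollection' toggle changed above is an ordinary Hadoop-style name/value property. A minimal sketch of how such a pair can be read — illustrative only; Nutch actually resolves properties through Hadoop's Configuration class, and 'get_bool' is a made-up helper:

```python
import xml.etree.ElementTree as ET

SITE_XML = """
<configuration>
  <property>
    <name>nutchwax.FetchedSegments.perCollection</name>
    <value>false</value>
  </property>
</configuration>
"""

def get_bool(conf_xml, name, default=False):
    # Scan <property> elements for a matching <name> and parse its <value>.
    root = ET.fromstring(conf_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value", "").strip().lower() == "true"
    return default

per_collection = get_bool(SITE_XML, "nutchwax.FetchedSegments.perCollection")
print(per_collection)  # → False
```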
From: <bi...@us...> - 2008-12-16 19:52:45
Revision: 2676 http://archive-access.svn.sourceforge.net/archive-access/?rev=2676&view=rev Author: binzino Date: 2008-12-16 19:52:42 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Removed references to web and conf sub-dirs in "onlypack" target since they are now rolled into set of files copied into Nutch. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/build.xml Modified: trunk/archive-access/projects/nutchwax/archive/build.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/build.xml 2008-12-16 07:38:28 UTC (rev 2675) +++ trunk/archive-access/projects/nutchwax/archive/build.xml 2008-12-16 19:52:42 UTC (rev 2676) @@ -104,14 +104,6 @@ <!-- This one does a little more after calling down to the relevant Nutch target. After Nutch has copied everything into the distribution directory, we add our script, libraries, etc. - - Rather than over-write the standard Nutch configuration files, - we place ours in a newly created directory - - contrib/archive/conf - - and let the individual user decide whether or not to - incorporate our modifications. --> <target name="package" depends="jar, job, war, javadoc" > <ant dir="${nutch.dir}" target="package" inheritAll="false" /> @@ -131,22 +123,12 @@ <fileset dir="${dist.dir}/bin"/> </chmod> - <mkdir dir="${dist.dir}/contrib/archive/conf"/> - <copy todir="${dist.dir}/contrib/archive/conf"> - <fileset dir="conf" /> - </copy> - <copy todir="${dist.dir}/contrib/archive"> <fileset dir="."> <include name="*.txt" /> </fileset> </copy> - <mkdir dir="${dist.dir}/contrib/archive/web"/> - <copy todir="${dist.dir}/contrib/archive/web"> - <fileset dir="src/web" /> - </copy> - <mkdir dir="${dist.dir}/contrib/archive/etc"/> <copy todir="${dist.dir}/contrib/archive/etc"> <fileset dir="src/etc" /> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-12-16 07:38:30
Revision: 2675 http://archive-access.svn.sourceforge.net/archive-access/?rev=2675&view=rev Author: binzino Date: 2008-12-16 07:38:28 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Fixed bug in web.xml related to <listener> tags. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml 2008-12-16 06:41:44 UTC (rev 2674) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml 2008-12-16 07:38:28 UTC (rev 2675) @@ -24,6 +24,8 @@ <listener> <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class> +</listener> +<listener> <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class> </listener> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
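The bug fixed above is structural: under the Servlet 2.3 DTD a <listener> element contains exactly one <listener-class>, so the two classes had to be split into separate <listener> elements. A small illustrative check (not part of NutchWAX) that flags the broken shape:

```python
import xml.etree.ElementTree as ET

# Shape before r2675: two listener-class children in one listener.
BROKEN = """<web-app><listener>
  <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
  <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class>
</listener></web-app>"""

# Shape after r2675: one listener-class per listener element.
FIXED = """<web-app><listener>
  <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
</listener><listener>
  <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class>
</listener></web-app>"""

def listeners_valid(web_xml):
    # Valid iff every <listener> holds exactly one <listener-class>.
    root = ET.fromstring(web_xml)
    return all(len(l.findall("listener-class")) == 1
               for l in root.iter("listener"))

print(listeners_valid(BROKEN), listeners_valid(FIXED))  # → False True
```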
From: <bi...@us...> - 2008-12-16 06:41:48
Revision: 2674 http://archive-access.svn.sourceforge.net/archive-access/?rev=2674&view=rev Author: binzino Date: 2008-12-16 06:41:44 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Moved web files into src/nutch sub-tree so they will be copied into Nutch corresponding sources directories for inclusion in Nutch ant build targets. Added Paths: ----------- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/search.xsl trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml Removed Paths: ------------- trunk/archive-access/projects/nutchwax/archive/src/web/ Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/search.xsl =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/search.xsl (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/jsp/search.xsl 2008-12-16 06:41:44 UTC (rev 2674) @@ -0,0 +1,281 @@ +<?xml version="1.0" encoding="utf-8" ?> +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
+--> +<xsl:stylesheet + version="1.0" + xmlns:xsl="http://www.w3.org/1999/XSL/Transform" + xmlns:nutch="http://www.nutch.org/opensearchrss/1.0/" + xmlns:opensearch="http://a9.com/-/spec/opensearchrss/1.0/" +> +<xsl:output method="xml" /> + +<xsl:template match="rss/channel"> + <html xmlns="http://www.w3.org/1999/xhtml"> + <head> + <title><xsl:value-of select="title" /></title> + <style media="all" lang="en" type="text/css"> + body + { + padding : 20px; + margin : 0; + font-family : Verdana; sans-serif; + font-size : 9pt; + color : #000000; + background-color: #ffffff; + } + .pageTitle + { + font-size : 125% ; + font-weight : bold ; + text-align : center ; + padding-bottom : 2em ; + } + .searchForm + { + margin : 20px 0 5px 0; + padding-bottom : 0px; + border-bottom : 1px solid black; + } + .searchResult + { + margin : 0; + padding : 0; + } + .searchResult h1 + { + margin : 0 0 5px 0 ; + padding : 0 ; + font-size : 120%; + } + .searchResult .details + { + font-size: 80%; + color: green; + } + .searchResult .dates + { + font-size: 80%; + } + .searchResult .dates a + { + color: #3366cc; + } + form#searchForm + { + margin : 0; padding: 0 0 10px 0; + } + .searchFields + { + padding : 3px 0; + } + .searchFields input + { + margin : 0 0 0 15px; + padding : 0; + } + input#query + { + margin : 0; + } + ol + { + margin : 5px 0 0 0; + padding : 0 0 0 2em; + } + ol li + { + margin : 0 0 15px 0; + } + </style> + </head> + <body> + <!-- Page header: title and search form --> + <div class="pageTitle" > + NutchWAX Sample XSLT + </div> + <div> + This simple XSLT demonstrates the transformation of OpenSearch XML results into a fully-functional, human-friendly HTML search page. No JSP needed. 
+ </div> + <div class="searchForm"> + <form id="searchForm" name="searchForm" method="get" action="search" > + <span class="searchFields"> + Search for + <input id="query" name="query" type="text" size="40" value="{nutch:query}" /> + + <!-- Create hidden form fields for the rest of the URL parameters --> + <xsl:for-each select="nutch:urlParams/nutch:param[@name!='start' and @name!='query']"> + <xsl:element name="input" namespace="http://www.w3.org/1999/xhtml"> + <xsl:attribute name="type">hidden</xsl:attribute> + <xsl:attribute name="name" ><xsl:value-of select="@name" /></xsl:attribute> + <xsl:attribute name="value"><xsl:value-of select="@value" /></xsl:attribute> + </xsl:element> + </xsl:for-each> + + <input type="submit" value="Search"/> + </span> + </form> + </div> + <div style="font-size: 8pt; margin:0; padding:0 0 0.5em 0;">Results <xsl:value-of select="opensearch:startIndex + 1" />-<xsl:value-of select="opensearch:startIndex + opensearch:itemsPerPage" /> of about <xsl:value-of select="opensearch:totalResults" /> <span style="margin-left: 1em;"></span></div> + <!-- Search results --> + <ol start="{opensearch:startIndex + 1}"> + <xsl:apply-templates select="item" /> + </ol> + <!-- Generate list of page links --> + <center> + <xsl:call-template name="pageLinks"> + <xsl:with-param name="labelPrevious" select="'«'" /> + <xsl:with-param name="labelNext" select="'»'" /> + </xsl:call-template> + </center> + </body> +</html> +</xsl:template> + + +<!-- ====================================================================== + NutchWAX XSLT template/fuction library. + + The idea is that the above xhtml code is what most NutchWAX users + will modify to tailor to their own look and feel. The stuff + below implements the core logic for generating results lists, + page links, etc. + + Hopefully NutchWAX web developers will be able to easily edit the + above xhtml and css and won't have to change the below. 
+ ====================================================================== --> + +<!-- Template to emit a search result as an HTML list item (<li/>). + --> +<xsl:template match="item"> + <li> + <div class="searchResult"> + <h1><a href="{concat('http://wayback.archive-it.org/',nutch:collection,'/',nutch:date,'/',link)}"><xsl:value-of select="title" /></a></h1> + <div> + <xsl:value-of select="description" /> + </div> + <div class="details"> + <xsl:value-of select="link" /> - <xsl:value-of select="round( nutch:length div 1024 )"/>k - <xsl:value-of select="nutch:type" /> + </div> + <div class="dates"> + <a href="{concat('http://wayback.archive-it.org/',nutch:collection,'/*/',link)}">All versions</a> - <a href="?query={../nutch:query} site:{nutch:site}&hitsPerSite=0">More from <xsl:value-of select="nutch:site" /></a> + </div> + </div> + </li> +</xsl:template> + +<!-- Template to emit a date in YYYY/MM/DD format + --> +<xsl:template match="nutch:date" > + <xsl:value-of select="substring(.,1,4)" /><xsl:text>-</xsl:text><xsl:value-of select="substring(.,5,2)" /><xsl:text>-</xsl:text><xsl:value-of select="substring(.,7,2)" /><xsl:text> </xsl:text> +</xsl:template> + +<!-- Template to emit a list of numbered page links, *including* + "previous" and "next" links on either end, using the given labels. 
+ Parameters: + labelPrevious Link text for "previous page" link + labelNext Link text for "next page" link + --> +<xsl:template name="pageLinks"> + <xsl:param name="labelPrevious" /> + <xsl:param name="labelNext" /> + <!-- If we are on any page past the first, emit a "previous" link --> + <xsl:if test="(floor(opensearch:startIndex div opensearch:itemsPerPage) + 1) != 1"> + <xsl:call-template name="pageLink"> + <xsl:with-param name="pageNum" select="floor(opensearch:startIndex div opensearch:itemsPerPage)" /> + <xsl:with-param name="linkText" select="$labelPrevious" /> + </xsl:call-template> + <xsl:text> </xsl:text> + </xsl:if> + <!-- Now, emit numbered page links --> + <xsl:choose> + <xsl:when test="(floor(opensearch:startIndex div opensearch:itemsPerPage) + 1) < 11"> + <xsl:call-template name="numberedPageLinks" > + <xsl:with-param name="begin" select="1" /> + <xsl:with-param name="end" select="21" /> + <xsl:with-param name="current" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" /> + </xsl:call-template> + </xsl:when> + <xsl:otherwise> + <xsl:call-template name="numberedPageLinks" > + <xsl:with-param name="begin" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1 - 10" /> + <xsl:with-param name="end" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1 + 11" /> + <xsl:with-param name="current" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" /> + </xsl:call-template> + </xsl:otherwise> + </xsl:choose> + <!-- Lastly, emit a "next" link. --> + <xsl:text> </xsl:text> + <xsl:call-template name="pageLink"> + <xsl:with-param name="pageNum" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 2" /> + <xsl:with-param name="linkText" select="$labelNext" /> + </xsl:call-template> +</xsl:template> + +<!-- Template to emit a list of numbered links to results pages. 
+ Parameters: + begin starting # inclusive + end ending # exclusive + current the current page, don't emit a link + --> +<xsl:template name="numberedPageLinks"> + <xsl:param name="begin" /> + <xsl:param name="end" /> + <xsl:param name="current" /> + <xsl:if test="$begin < $end"> + <xsl:choose> + <xsl:when test="$begin = $current" > + <xsl:value-of select="$current" /> + </xsl:when> + <xsl:otherwise> + <xsl:call-template name="pageLink" > + <xsl:with-param name="pageNum" select="$begin" /> + <xsl:with-param name="linkText" select="$begin" /> + </xsl:call-template> + </xsl:otherwise> + </xsl:choose> + <xsl:text> </xsl:text> + <xsl:call-template name="numberedPageLinks"> + <xsl:with-param name="begin" select="$begin + 1" /> + <xsl:with-param name="end" select="$end" /> + <xsl:with-param name="current" select="$current" /> + </xsl:call-template> + </xsl:if> +</xsl:template> + +<!-- Template to emit a single page link. All of the URL parameters + listed in the OpenSearch results are included in the link. 
+ Parmeters: + pageNum page number of the link + linkText text of the link + --> +<xsl:template name="pageLink"> + <xsl:param name="pageNum" /> + <xsl:param name="linkText" /> + <xsl:element name="a" namespace="http://www.w3.org/1999/xhtml"> + <xsl:attribute name="href"> + <xsl:text>?</xsl:text> + <xsl:for-each select="nutch:urlParams/nutch:param[@name!='start']"> + <xsl:value-of select="@name" /><xsl:text>=</xsl:text><xsl:value-of select="@value" /> + <xsl:text>&</xsl:text> + </xsl:for-each> + <xsl:text>start=</xsl:text><xsl:value-of select="($pageNum -1) * opensearch:itemsPerPage" /> + </xsl:attribute> + <xsl:value-of select="$linkText" /> + </xsl:element> +</xsl:template> + +</xsl:stylesheet> Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/web/web.xml 2008-12-16 06:41:44 UTC (rev 2674) @@ -0,0 +1,80 @@ +<?xml version="1.0" encoding="ISO-8859-1"?> +<!DOCTYPE web-app + PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN" + "http://java.sun.com/dtd/web-app_2_3.dtd"> +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ See the License for the specific language governing permissions and + limitations under the License. +--> +<web-app> + +<!-- order is very important here --> + +<listener> + <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class> + <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class> +</listener> + +<servlet> + <servlet-name>Cached</servlet-name> + <servlet-class>org.apache.nutch.servlet.Cached</servlet-class> +</servlet> + +<servlet> + <servlet-name>OpenSearch</servlet-name> + <servlet-class>org.archive.nutchwax.OpenSearchServlet</servlet-class> +</servlet> + +<servlet-mapping> + <servlet-name>Cached</servlet-name> + <url-pattern>/servlet/cached</url-pattern> +</servlet-mapping> + +<servlet-mapping> + <servlet-name>OpenSearch</servlet-name> + <url-pattern>/opensearch</url-pattern> +</servlet-mapping> + +<servlet-mapping> + <servlet-name>OpenSearch</servlet-name> + <url-pattern>/search</url-pattern> +</servlet-mapping> + +<filter> + <filter-name>XSLT Filter</filter-name> + <filter-class>org.archive.nutchwax.XSLTFilter</filter-class> + <init-param> + <param-name>xsltUrl</param-name> + <param-value>style/search.xsl</param-value> + </init-param> +</filter> + +<filter-mapping> + <filter-name>XSLT Filter</filter-name> + <url-pattern>/search</url-pattern> +</filter-mapping> + +<welcome-file-list> + <welcome-file>search.html</welcome-file> + <welcome-file>index.html</welcome-file> + <welcome-file>index.jsp</welcome-file> +</welcome-file-list> + +<taglib> + <taglib-uri>http://jakarta.apache.org/taglibs/i18n</taglib-uri> + <taglib-location>/WEB-INF/taglibs-i18n.tld</taglib-location> + </taglib> + +</web-app> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
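The page-link arithmetic in 'search.xsl' above reduces to two formulas: the current page number is floor(startIndex div itemsPerPage) + 1, and a link's 'start' parameter is (pageNum - 1) * itemsPerPage. A quick sketch of the same math outside XSLT:

```python
def current_page(start_index, items_per_page):
    # Mirrors the XSLT: floor(opensearch:startIndex div opensearch:itemsPerPage) + 1
    return start_index // items_per_page + 1

def start_param(page_num, items_per_page):
    # Mirrors the XSLT: ($pageNum - 1) * opensearch:itemsPerPage
    return (page_num - 1) * items_per_page

print(current_page(0, 10), current_page(20, 10), start_param(3, 10))  # → 1 3 20
```

Note the two functions round-trip: the 'start' value emitted for the current page reproduces the startIndex the page was rendered from.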
From: <bi...@us...> - 2008-12-16 06:24:10
Revision: 2673 http://archive-access.svn.sourceforge.net/archive-access/?rev=2673&view=rev Author: binzino Date: 2008-12-16 06:24:01 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Moved conf sub-dir so that it's automatically copied over into Nutch directory during build. This way the NutchWAX extensions are automatically included in the Nutch build. Operators/users don't have to do hand-editing of Nutch conf files to get NutchWAX enhancements. Added Paths: ----------- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/ Removed Paths: ------------- trunk/archive-access/projects/nutchwax/archive/conf/ Property changes on: trunk/archive-access/projects/nutchwax/archive/src/nutch/conf ___________________________________________________________________ Added: svn:mergeinfo + This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-12-16 04:58:23
Revision: 2672 http://archive-access.svn.sourceforge.net/archive-access/?rev=2672&view=rev Author: binzino Date: 2008-12-16 04:58:21 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Changed to use NutchWAX OpenSearchServlet instead of Nutch's. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/web/web.xml Modified: trunk/archive-access/projects/nutchwax/archive/src/web/web.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/web/web.xml 2008-12-16 03:00:10 UTC (rev 2671) +++ trunk/archive-access/projects/nutchwax/archive/src/web/web.xml 2008-12-16 04:58:21 UTC (rev 2672) @@ -34,7 +34,7 @@ <servlet> <servlet-name>OpenSearch</servlet-name> - <servlet-class>org.apache.nutch.searcher.OpenSearchServlet</servlet-class> + <servlet-class>org.archive.nutchwax.OpenSearchServlet</servlet-class> </servlet> <servlet-mapping> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-12-16 03:00:15
Revision: 2671 http://archive-access.svn.sourceforge.net/archive-access/?rev=2671&view=rev Author: binzino Date: 2008-12-16 03:00:10 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Updated documentation for 0.12.3 release. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt trunk/archive-access/projects/nutchwax/archive/INSTALL.txt trunk/archive-access/projects/nutchwax/archive/README.txt trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt Added Paths: ----------- trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt 2008-12-16 02:59:10 UTC (rev 2670) +++ trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt 2008-12-16 03:00:10 UTC (rev 2671) @@ -157,62 +157,36 @@ ====================================================================== -Index +Index and Index merging ====================================================================== -The only chage we make to the indexing step is the destination of the -index directory. +Perform the index step as normal, yielding an 'indexes' directory. -By default, Nutch expects the per-segment index directory to live in a -sub-directory called 'indexes' and the index command is accordingly +E.g. $ nutch index indexes crawldb linkdb segments/* -Resulting in an index directory structure of the form +Then, merge the 'indexes' directory into a single Lucene index by +invoking the Nutch 'merge' command - indexes/part-00000 + $ nutch merge index indexes -For de-duplication, we use a slightly different directory structure, -which will be used by a de-duplication-aware NutchWaxBean at -search-time. 
The directory structure we use is: - pindexes/<segment>/part-00000 - -Using the segment name is not strictly required, but it is a good -practice and is strongly recommended. This way the segment and its -corresponding index directory are easily matched. - -Let's assume that the segment directory created during the import is -named - - segments/20080703050349 - -In that case, our index command becomes: - - $ nutch index pindexes/20080703050349 crawldb linkdb segments/20080703050349 - -Upon completion, the Lucene index is created in - - pindexes/20080703050349/part-0000 - -This index is exactly the same as one normally created by Nutch, the -only difference is the location. - - ====================================================================== Add Revisit Dates ====================================================================== -Now that we have the Nutch index, we add the revisit dates to it. +Now that we have a single, merged index, we create a "parallel" index +directory which contains the additional revisit dates. 
Examine the "all.dup" file again, it has lines of the form - example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132204 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132213 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080619132911 + example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034 + example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800 + example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312 + example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132204 + example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132213 + example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080619132911 These are the revisit dates that need to be added to the records in the Lucene index. When we generated the index, only the date of the @@ -220,35 +194,47 @@ As explained in README-dedup.txt, modifying the Lucene index to actually add these dates is infeasible. What we do is create a -parallel index next to the main index (the part-00000 created above) -that contains all the dates for each record. +parallel index next to the merged index that contains all the dates +for each record. The NutchWAX 'add-dates' command creates this parallel index for us. - $ nutchwax add-dates pindexes/20080703050349/part-0000 \ - pindexes/20080703050349/part-0000 \ - pindexes/20080703050349/dates \ + $ nutchwax add-dates index \ + index \ + dates \ all.dup -Yes, the part-0000 argument does appear twice. This is beacuse it is +Yes, the 'index' argument does appear twice. This is beacuse it is both the "key" index and the "source" index. 
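The 'all.dup' records shown above group naturally by URL and content digest, with one capture date per line. A small sketch of collecting the revisit dates per (URL, digest) pair — this mirrors the shape of the input that 'add-dates' consumes, not its actual implementation:

```python
from collections import defaultdict

# A few sample lines in the all.dup format: url, sha1 digest, capture date.
ALL_DUP = """\
example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034
example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800
example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312
"""

dates = defaultdict(list)
for line in ALL_DUP.splitlines():
    url, digest, date = line.split()
    dates[(url, digest)].append(date)

# Each distinct capture (same URL and digest) accumulates its revisit dates.
for key, ds in sorted(dates.items()):
    print(key[1], len(ds))
```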
- Suppose we did another crawl and had even more dates to add to the existing index. In that case we would run - $ nutchwax add-dates pindexes/20080703050349/part-0000 \ - pindexes/20080703050349/dates \ - pindexes/20080703050349/new-dates \ + $ nutchwax add-dates index \ + dates \ + new-dates \ new-crawl.dup - $ rm -r pindexes/20080703050349/dates - $ mv pindexes/20080703050349/new-dates pindexes/20080703050349/dates + $ rm -r dates + $ mv new-dates dates This copies the existing dates from "dates" to "new-dates" and adds additional ones from "new-crawl.dup" along the way. Then we replace the previous "dates" index with the new one. +Now, Nutch doesn't know what to do with the extra 'dates' parallel +index, but NutchWAX does and it requires them to be arranged +in a directory structure of the following form: + pindexes/<name>/dates + /index + +Where "name" is any name of your choosing. For example, + + $ mkdir -p pindexes/200812180000 + $ mv dates pindexes/200812180000/ + $ mv index pindexes/200812180000/ + + WARC ---- This step is the same for ARCs and WARCs. 
@@ -318,6 +304,8 @@ <listener> <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class> + </listener> + <listener> <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class> </listener> Added: trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt 2008-12-16 03:00:10 UTC (rev 2671) @@ -0,0 +1,129 @@ + +HOWTO-pagerank.txt +2008-12-18 +Aaron Binns + +Table of Contents + o Prerequisites + o Overview + o Generate PageRank + o PageRank Scoring and Boosting + o Configuration and Indexing + + +====================================================================== +Prerequisites +====================================================================== + +This HOWTO assumes you've already read the main NutchWAX HOWTO and are +familiar with importing and indexing archive files with NutchWAX. + +Also, we assume that you are familiar with deploying the Nutch(WAX) +web application into a servlet container such as Tomcat. + + +====================================================================== +Overview +====================================================================== + +NutchWAX provides a pair of tools for extracting and utilizing +simplistic "page rank" information for scoring and sorting documents +in the full-text search index. + +Nutch's 'invertlinks' step inverts links and stores them in the +'linkdb' directory. We use the inlinks to boost the Lucene score of +documents in proportion to the number of inlinks. 
+
+
+======================================================================
+Generate PageRank
+======================================================================
+
+After the Nutch 'invertlinks' step is performed, run the NutchWAX
+'pagerank' command to extract inlink information from the 'linkdb'.
+
+For example:
+
+  $ nutch invertlinks linkdb -dir segments
+  $ nutchwax pagerank pagerank.txt linkdb
+
+The resulting "pagerank.txt" file is a simple text file containing
+a count of the number of inlinks followed by the URL.
+
+  $ sort -n pagerank.txt | tail
+  367762 http://informe.presidencia.gob.mx/
+  367809 http://comovamos.presidencia.gob.mx/
+  367852 http://ocho.presidencia.gob.mx/
+  372681 http://www.gob.mx/
+  398073 http://pnd.presidencia.gob.mx/
+  399321 http://zedillo.presidencia.gob.mx/
+  496993 http://www.google-analytics.com/urchin.js
+  702448 http://www.elbalero.gob.mx/
+  703517 http://www.mexicoenlinea.gob.mx/
+  764195 http://www.brasil.gov.br
+
+In the above example, the most linked-to URL has 764195 inlinks.
+
+
+======================================================================
+PageRank Scoring and Boosting
+======================================================================
+
+During indexing, the NutchWAX PageRankScoringFilter uses the page rank
+information to boost the Lucene document's score in proportion to the
+number of inlinks.
+
+The formula used for boosting the Lucene document score is a simple
+log10()-based calculation:
+
+  boost = log10( # inlinks ) + 1
+
+In Lucene, the boost is a multiplier where a boost of 1.0 means "no
+change" or "no boost" for the document score. By default, all
+documents have a boost of 1.0 unless a scoring filter changes it.
+
+Thus, we add 1 to the log10() value so that our boost scores start at
+1.0 and go up from there.
+
+The use of log10() gives us a linear boost based on the order of
+magnitude of the number of inlinks.
Consider the following boost
+scores as determined by our formula:
+
+  # inlinks    boost
+          1     1.00
+         10     2.00
+         82     2.91
+        100     3.00
+        532     3.72
+       1000     4.00
+      14892     5.17
+
+A document with 1000 inlinks will have its score boosted 4x compared
+to a document with 1 inlink.
+
+
+======================================================================
+Configuration and Indexing
+======================================================================
+
+To use the PageRankScoringFilter during indexing, replace the Nutch
+OPIC scoring filter in the Nutch(WAX) configuration:
+
+nutch-site.xml
+  <property>
+    <name>plugin.includes</name>
+    <value>protocol-http|parse-(text|html|js|pdf)|index-(basic|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-nutchwax|urlfilter-nutchwax</value>
+  </property>
+
+Where we change 'scoring-opic' to 'scoring-nutchwax'.
+
+Then, when we invoke the Nutch(WAX) 'index' command, we specify the
+location of the page rank file. For example,
+
+  $ nutch index \
+      -Dnutchwax.scoringfilter.pagerank.ranks=pagerank.txt \
+      indexes \
+      linkdb \
+      crawldb \
+      segments/*
+

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt	2008-12-16 02:59:10 UTC (rev 2670)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt	2008-12-16 03:00:10 UTC (rev 2671)
@@ -1,13 +1,15 @@

HOWTO-xslt.txt
-2008-07-25
+2008-12-18
Aaron Binns

Table of Contents
  o Prerequisites
    - NutchWAX HOWTO.txt
  o Overview
+ o NutchWAX OpenSearchServlet
  o XSLTFilter and web.xml
+ o Sample

======================================================================
@@ -31,9 +33,10 @@
Servlet : OpenSearchServlet

If you read the OpenSearchServlet.java source code and the search.jsp
-page, you'll notice a lot of similarity, if not duplication of code.
+page, you'll notice a lot of similarity, if not outright duplication
+of code.
-The Internet Archive Web Team plans to improve and expand upon the +The Internet Archive Web Team has improved and expanded upon the existing OpenSearchServlet interface as well as adding more XML-based capabilities, including replacements for the existing JSP pages. In short, moving away from JSP and toward XML. @@ -48,6 +51,21 @@ ====================================================================== +NutchWAX OpenSearchServlet +====================================================================== + +NutchWAX contains an enhanced OpenSearch servlet which is a drop-in +replacement for the default Nutch OpenSearch servlet. To use the +NutchWAX implementation, modify the 'web.xml' + +from: + <servlet-class>org.apache.nutch.searcher.OpenSearchServlet</servlet-class> + +to: + <servlet-class>org.archive.nutchwax.OpenSearchServlet</servlet-class> + + +====================================================================== XSLTFilter and web.xml ====================================================================== @@ -55,11 +73,11 @@ OpenSearchServlet is straightforward. Simply add the XSLTFilter to the servlet's path and specify the XSL transform to apply. 
-For example, consider the default Nutch web.xml +For example, consider the default NutchWAX web.xml <servlet> <servlet-name>OpenSearch</servlet-name> - <servlet-class>org.apache.nutch.searcher.OpenSearchServlet</servlet-class> + <servlet-class>org.archive.nutchwax.OpenSearchServlet</servlet-class> </servlet> <servlet-mapping> @@ -68,13 +86,13 @@ </servlet-mapping> Let's say we want to retain the '/opensearch' path for the XML output, -and add the human-friendly HTML page at '/coolsearch' +and add the human-friendly HTML page at '/search' First, we add an additional 'servlet-mapping' for our new path: <servlet-mapping> <servlet-name>OpenSearch</servlet-name> - <url-pattern>/coolsearch</url-pattern> + <url-pattern>/search</url-pattern> </servlet-mapping> Then, we add the XSLTFilter, passing it a URL to the XSLT file @@ -93,7 +111,7 @@ <filter-mapping> <filter-name>XSLT Filter</filter-name> - <url-pattern>/coolsearch</url-pattern> + <url-pattern>/search</url-pattern> </filter-mapping> This way, we have two URLs, which run the exact same @@ -101,11 +119,11 @@ output whereas the other produces human-friendly HTML output. 
OpenSearch XML : http://someserver/opensearch?query=foo - Human-friendly HTML : http://someserver/coolsearch?query=foo + Human-friendly HTML : http://someserver/search?query=foo ====================================================================== -Samples +Sample ====================================================================== You can find sample 'web.xml' and 'search.xsl' files in Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2008-12-16 02:59:10 UTC (rev 2670) +++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2008-12-16 03:00:10 UTC (rev 2671) @@ -1,6 +1,6 @@ INSTALL.txt -2008-10-01 +2008-12-18 Aaron Binns This installation guide assumes the reader is already familiar with @@ -43,7 +43,7 @@ ------------- As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev. Nutch doesn't have a 1.0 release package yet, so we have to use the -Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.2 is +Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.3 is built against is: 701524 Modified: trunk/archive-access/projects/nutchwax/archive/README.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/README.txt 2008-12-16 02:59:10 UTC (rev 2670) +++ trunk/archive-access/projects/nutchwax/archive/README.txt 2008-12-16 03:00:10 UTC (rev 2671) @@ -1,9 +1,9 @@ README.txt -2008-10-01 +2008-12-18 Aaron Binns -Welcome to NutchWAX 0.12.2! +Welcome to NutchWAX 0.12.3! NutchWAX is a set of add-ons to Nutch in order to index and search archived web data. @@ -60,6 +60,15 @@ Filtering plugin which can be used to exclude URLs from import. It can be used as part of a NutchWAX de-duplication scheme. 
+ plugins/scoring-nutchwax + + Scoring plugin for use at index-time which reads from an external + "pagerank.txt" file for scoring documents based on the log10 of the + number of inlinks to a document. + + The use of this plugin is optional but can improve the quality of + search results, especially for very large collections. + conf/nutch-site.xml Sample configuration properties file showing suggested settings for @@ -131,6 +140,4 @@ contentMetadata.set( NutchWax.DATE_KEY, meta.getDate() ); ... - ====================================================================== - Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-12-16 02:59:10 UTC (rev 2670) +++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-12-16 03:00:10 UTC (rev 2671) @@ -1,9 +1,9 @@ RELEASE-NOTES.TXT -2008-10-13 +2008-12-18 Aaron Binns -Release notes for NutchWAX 0.12.2 +Release notes for NutchWAX 0.12.3 For the most recent updates and information on NutchWAX, please visit the project wiki at: @@ -15,9 +15,14 @@ Overview ====================================================================== -NutchWAX 0.12.2 contains some minor enhancements and fixes to NutchWAX -0.12.1. +NutchWAX 0.12.3 contains numerous enhancements and fixes to 0.12.2 + o PageRank calculation and scoring + o Enhanced OpenSearchServlet + o Improved XSLT sample for OpenSearch + o System init.d script for searcher slaves + o Enhanced searcher slave aware of NutchWAX extensions + ====================================================================== Issues ====================================================================== @@ -28,23 +33,6 @@ Issues resolved in this release: -WAX-19 - Add strict/loose option to DateAdder for revisit lines with extra - data on end. - -WAX-21 - Allow for blank lines and comment lines in manifest file. 
- -WAX-22 - Various code clean-ups based on code review using PMD tool. - -WAX-23 - Add a "field setter" filter to set a field to a static value in the - Lucene document during indexing. - -WAX-24 - DateAdder fails due to uncaught exception in URL canonicalization - -WAX-25 - Add utility/tool to dump unique values of a field in an index. - +WAX-26 + Add XML elements containing all search URL params for self-link + generation This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
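The log10()-based boost formula from HOWTO-pagerank.txt above is easy to check numerically. A minimal sketch follows — for illustration only; the actual PageRankScoringFilter is Java, not Python:

```python
import math

def boost(inlinks):
    """Boost for a document: log10(# inlinks) + 1, so 1 inlink -> 1.0 (no boost)."""
    return math.log10(inlinks) + 1.0

# Reproduce a few rows of the table in HOWTO-pagerank.txt:
for n in (1, 10, 82, 100, 1000):
    print(n, round(boost(n), 2))
```

Because the boost grows with the order of magnitude of the inlink count, a page needs roughly 10x the inlinks to gain one additional unit of boost.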
From: <bi...@us...> - 2008-12-16 02:59:13
Revision: 2670 http://archive-access.svn.sourceforge.net/archive-access/?rev=2670&view=rev Author: binzino Date: 2008-12-16 02:59:10 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Added a command for running the PageRanker tool. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/bin/nutchwax Modified: trunk/archive-access/projects/nutchwax/archive/bin/nutchwax =================================================================== --- trunk/archive-access/projects/nutchwax/archive/bin/nutchwax 2008-12-16 02:43:25 UTC (rev 2669) +++ trunk/archive-access/projects/nutchwax/archive/bin/nutchwax 2008-12-16 02:59:10 UTC (rev 2670) @@ -50,6 +50,10 @@ shift ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.DumpParallelIndex $@ ;; + pagerank) + shift + ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.PageRanker $@ + ;; *) echo "" echo "Usage: nutchwax COMMAND" @@ -57,6 +61,7 @@ echo " import Import ARCs into a new Nutch segment" echo " add-dates Add dates to a parallel index" echo " dumpindex Dump an index or set of parallel indices to stdout" + echo " pagerank Generate pagerank file for URLs in a 'linkdb'." echo "" exit 1 ;; This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
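The pagerank.txt file produced by the new 'pagerank' command is just "<count> <url>" lines, so the `sort -n pagerank.txt | tail` step shown in HOWTO-pagerank.txt can be mimicked in a few lines. This is a sketch only; the real tool is org.archive.nutchwax.tools.PageRanker:

```python
def top_ranked(pagerank_lines, n=2):
    """Return the n most linked-to (count, url) pairs, like `sort -n | tail`."""
    ranked = []
    for line in pagerank_lines:
        count, url = line.split(None, 1)
        ranked.append((int(count), url.strip()))
    ranked.sort()
    return ranked[-n:]

lines = [
    "496993 http://www.google-analytics.com/urchin.js",
    "764195 http://www.brasil.gov.br",
    "703517 http://www.mexicoenlinea.gob.mx/",
]

top = top_ranked(lines)
```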
From: <bi...@us...> - 2008-12-16 02:43:28
Revision: 2669 http://archive-access.svn.sourceforge.net/archive-access/?rev=2669&view=rev Author: binzino Date: 2008-12-16 02:43:25 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Removed Nutch OPIC scoring filter and replaced with NutchWAX PageRank scoring filter. Also added a comment about the HTTP code filter. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml Modified: trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml 2008-12-16 02:42:20 UTC (rev 2668) +++ trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml 2008-12-16 02:43:25 UTC (rev 2669) @@ -10,7 +10,7 @@ <!-- Add 'index-nutchwax' and 'query-nutchwax' to plugin list. --> <!-- Also, add 'parse-pdf' --> <!-- Remove 'urlfilter-regex' and 'normalizer-(pass|regex|basic)' --> - <value>protocol-http|parse-(text|html|js|pdf)|index-(basic|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax</value> + <value>protocol-http|parse-(text|html|js|pdf)|index-(basic|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-nutchwax|urlfilter-nutchwax</value> </property> <!-- The indexing filter order *must* be specified in order for @@ -115,6 +115,9 @@ <description>Implementation of URL canonicalizer to use.</description> </property> +<!-- Only pass URLs with an HTTP status in this range. Used by the + NutchWAX importer. + --> <property> <name>nutchwax.filter.http.status</name> <value> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
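The comment added to nutch-site.xml above describes a status-range gate applied by the NutchWAX importer. Conceptually this is just a range check, sketched below; the range endpoints here are hypothetical examples, not values taken from the truncated property:

```python
def passes_status_filter(status, low=200, high=300):
    """Accept a capture only if its HTTP status falls in [low, high).

    low/high are illustrative stand-ins for whatever range the
    nutchwax.filter.http.status property actually configures.
    """
    return low <= status < high

# Under these example bounds, a 200 OK capture is imported; a 404 is dropped.
```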
From: <bi...@us...> - 2008-12-16 02:42:28
Revision: 2668 http://archive-access.svn.sourceforge.net/archive-access/?rev=2668&view=rev Author: binzino Date: 2008-12-16 02:42:20 +0000 (Tue, 16 Dec 2008) Log Message: ----------- Removed unused member variable. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java 2008-12-15 21:39:28 UTC (rev 2667) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java 2008-12-16 02:42:20 UTC (rev 2668) @@ -45,13 +45,13 @@ { public static final Log LOG = LogFactory.getLog(PageRanker.class); - public static final String DONE_NAME = "merge.done"; - - public PageRanker() { + public PageRanker() + { } - public PageRanker(Configuration conf) { + public PageRanker(Configuration conf) + { setConf(conf); } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-12-15 22:55:54
Revision: 2667 http://archive-access.svn.sourceforge.net/archive-access/?rev=2667&view=rev Author: binzino Date: 2008-12-15 21:39:28 +0000 (Mon, 15 Dec 2008) Log Message: ----------- Copy the src/etc directory to the build/package directory, just like we do with conf and web. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/build.xml Modified: trunk/archive-access/projects/nutchwax/archive/build.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/build.xml 2008-12-15 17:47:01 UTC (rev 2666) +++ trunk/archive-access/projects/nutchwax/archive/build.xml 2008-12-15 21:39:28 UTC (rev 2667) @@ -147,6 +147,11 @@ <fileset dir="src/web" /> </copy> + <mkdir dir="${dist.dir}/contrib/archive/etc"/> + <copy todir="${dist.dir}/contrib/archive/etc"> + <fileset dir="src/etc" /> + </copy> + </target> </project> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-12-15 17:47:04
Revision: 2666 http://archive-access.svn.sourceforge.net/archive-access/?rev=2666&view=rev Author: binzino Date: 2008-12-15 17:47:01 +0000 (Mon, 15 Dec 2008) Log Message: ----------- Oops, fix bug where I accidentally removed closing tag in previous edit. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml Modified: trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml 2008-12-15 02:19:53 UTC (rev 2665) +++ trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml 2008-12-15 17:47:01 UTC (rev 2666) @@ -147,7 +147,6 @@ <!-- The following are over-rides of property values in nutch-default which the Internet Archive uses in most NutchWAX projects. --> - <property> <name>io.map.index.skip</name> <value>32</value> @@ -167,3 +166,5 @@ <name>searcher.summary.length</name> <value>80</value> </property> + +</configuration> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-12-15 02:19:55
Revision: 2665 http://archive-access.svn.sourceforge.net/archive-access/?rev=2665&view=rev Author: binzino Date: 2008-12-15 02:19:53 +0000 (Mon, 15 Dec 2008) Log Message: ----------- Added some property values which we commonly use in deployments. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml Modified: trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml 2008-12-15 01:47:48 UTC (rev 2664) +++ trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml 2008-12-15 02:19:53 UTC (rev 2665) @@ -144,4 +144,26 @@ <value>true</value> </property> -</configuration> +<!-- The following are over-rides of property values in + nutch-default which the Internet Archive uses in + most NutchWAX projects. --> + +<property> + <name>io.map.index.skip</name> + <value>32</value> +</property> + +<property> + <name>searcher.max.hits</name> + <value>1000</value> +</property> + +<property> + <name>searcher.summary.context</name> + <value>8</value> +</property> + +<property> + <name>searcher.summary.length</name> + <value>80</value> +</property> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-12-15 02:11:18
Revision: 2664
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2664&view=rev
Author:   binzino
Date:     2008-12-15 01:47:48 +0000 (Mon, 15 Dec 2008)

Log Message:
-----------
Added own version of OpenSearch servlet which adds some XML elements
and has a few other enhancements. Also revised the sample XSLT to take
advantage of these changes in the OpenSearch servlet.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/web/style/search.xsl

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java

Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java	                        (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java	2008-12-15 01:47:48 UTC (rev 2664)
@@ -0,0 +1,372 @@
+
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */ + +package org.archive.nutchwax; + +import java.io.IOException; +import java.net.URLEncoder; +import java.util.Map; +import java.util.HashMap; +import java.util.Set; +import java.util.HashSet; + +import javax.servlet.ServletException; +import javax.servlet.ServletConfig; +import javax.servlet.http.HttpServlet; +import javax.servlet.http.HttpServletRequest; +import javax.servlet.http.HttpServletResponse; + +import javax.xml.parsers.*; + +import org.apache.hadoop.conf.Configuration; +import org.apache.nutch.util.NutchConfiguration; +import org.w3c.dom.*; +import javax.xml.transform.TransformerFactory; +import javax.xml.transform.Transformer; +import javax.xml.transform.dom.DOMSource; +import javax.xml.transform.stream.StreamResult; + +import org.apache.nutch.searcher.Hit; +import org.apache.nutch.searcher.HitDetails; +import org.apache.nutch.searcher.Hits; +import org.apache.nutch.searcher.NutchBean; +import org.apache.nutch.searcher.Query; +import org.apache.nutch.searcher.Summary; + +/** + * Present search results using A9's OpenSearch extensions to RSS, + * plus a few Nutch-specific extensions. 
+ */ +public class OpenSearchServlet extends HttpServlet +{ + private static final Map NS_MAP = new HashMap(); + private int MAX_HITS_PER_PAGE; + + static { + NS_MAP.put("opensearch", "http://a9.com/-/spec/opensearchrss/1.0/"); + NS_MAP.put("nutch", "http://www.nutch.org/opensearchrss/1.0/"); + } + + private static final Set SKIP_DETAILS = new HashSet(); + static { + SKIP_DETAILS.add("url"); // redundant with RSS link + SKIP_DETAILS.add("title"); // redundant with RSS title + } + + private NutchBean bean; + private Configuration conf; + + public void init(ServletConfig config) throws ServletException { + try { + this.conf = NutchConfiguration.get(config.getServletContext()); + bean = NutchBean.get(config.getServletContext(), this.conf); + } catch (IOException e) { + throw new ServletException(e); + } + MAX_HITS_PER_PAGE = conf.getInt("searcher.max.hits.per.page", -1); + } + + public void doGet(HttpServletRequest request, HttpServletResponse response) + throws ServletException, IOException { + + long responseTime = System.nanoTime( ); + + if (NutchBean.LOG.isInfoEnabled()) { + NutchBean.LOG.info("query request from " + request.getRemoteAddr()); + } + + // get parameters from request + request.setCharacterEncoding("UTF-8"); + String queryString = request.getParameter("query"); + if (queryString == null) + queryString = ""; + String urlQuery = URLEncoder.encode(queryString, "UTF-8"); + + // the query language + String queryLang = request.getParameter("lang"); + + int start = 0; // first hit to display + String startString = request.getParameter("start"); + if (startString != null) + start = Integer.parseInt(startString); + + int hitsPerPage = 10; // number of hits to display + String hitsString = request.getParameter("hitsPerPage"); + if (hitsString != null) + hitsPerPage = Integer.parseInt(hitsString); + if(MAX_HITS_PER_PAGE > 0 && hitsPerPage > MAX_HITS_PER_PAGE) + hitsPerPage = MAX_HITS_PER_PAGE; + + String sort = request.getParameter("sort"); + boolean reverse = 
sort != null && "true".equals(request.getParameter("reverse")); + + // De-Duplicate handling. Look for duplicates field and for how many + // duplicates per results to return. Default duplicates field is 'site' + // and duplicates per results default is '2'. + String dedupField = request.getParameter("dedupField"); + if (dedupField == null || dedupField.length() == 0) { + dedupField = "site"; + } + int hitsPerDup = 2; + String hitsPerDupString = request.getParameter("hitsPerDup"); + String hitsPerSiteString = request.getParameter("hitsPerSite"); + if (hitsPerDupString != null && hitsPerDupString.length() > 0) { + hitsPerDup = Integer.parseInt(hitsPerDupString); + } else { + // If 'hitsPerSite' present, use that value. + if (hitsPerSiteString != null && hitsPerSiteString.length() > 0) { + hitsPerDup = Integer.parseInt(hitsPerSiteString); + } + } + + // Make up query string for use later drawing the 'rss' logo. + String params = "&hitsPerPage=" + hitsPerPage + + (queryLang == null ? "" : "&lang=" + queryLang) + + (sort == null ? "" : "&sort=" + sort + (reverse? "&reverse=true": "") + + (dedupField == null ? 
"" : "&dedupField=" + dedupField)); + + Query query = Query.parse(queryString, queryLang, this.conf); + if (NutchBean.LOG.isInfoEnabled()) { + NutchBean.LOG.info("query: " + queryString); + NutchBean.LOG.info("lang: " + queryLang); + } + + // execute the query + Hits hits; + try { + hits = bean.search(query, start + hitsPerPage, hitsPerDup, dedupField, sort, reverse); + } catch (IOException e) { + if (NutchBean.LOG.isWarnEnabled()) { + NutchBean.LOG.warn("Search Error", e); + } + hits = new Hits(0,new Hit[0]); + } + + if (NutchBean.LOG.isInfoEnabled()) { + NutchBean.LOG.info("total hits: " + hits.getTotal()); + } + + responseTime = System.nanoTime( ) - responseTime; + + // generate xml results + int end = (int)Math.min(hits.getLength(), start + hitsPerPage); + int length = end-start; + + Hit[] show = hits.getHits(start, end-start); + HitDetails[] details = bean.getDetails(show); + Summary[] summaries = bean.getSummary(details, query); + + String requestUrl = request.getRequestURL().toString(); + String base = requestUrl.substring(0, requestUrl.lastIndexOf('/')); + + + try { + DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); + factory.setNamespaceAware(true); + Document doc = factory.newDocumentBuilder().newDocument(); + + Element rss = addNode(doc, doc, "rss"); + addAttribute(doc, rss, "version", "2.0"); + addAttribute(doc, rss, "xmlns:opensearch", + (String)NS_MAP.get("opensearch")); + addAttribute(doc, rss, "xmlns:nutch", (String)NS_MAP.get("nutch")); + + Element channel = addNode(doc, rss, "channel"); + + addNode(doc, channel, "title", "Nutch: " + queryString); + addNode(doc, channel, "description", "Nutch search results for query: " + + queryString); + addNode(doc, channel, "link", + base+"/search.jsp" + +"?query="+urlQuery + +"&start="+start + +"&hitsPerDup="+hitsPerDup + +params); + + addNode(doc, channel, "opensearch", "totalResults", ""+hits.getTotal()); + addNode(doc, channel, "opensearch", "startIndex", ""+start); + addNode(doc, 
channel, "opensearch", "itemsPerPage", ""+hitsPerPage); + + addNode(doc, channel, "nutch", "query", queryString); + addNode(doc, channel, "nutch", "responseTime", Double.toString( ((long) responseTime / 1000 / 1000 ) / 1000.0 ) ); + + // Add a <nutch:urlParams> element containing a list of all the URL parameters. + Element urlParams = doc.createElementNS((String)NS_MAP.get("nutch"), "nutch:urlParams" ); + channel.appendChild( urlParams ); + + for ( Map.Entry<String,String[]> e : ((Map<String,String[]>) request.getParameterMap( )).entrySet( ) ) + { + String key = e.getKey( ); + for ( String value : e.getValue( ) ) + { + Element urlParam = doc.createElementNS((String)NS_MAP.get("nutch"), "nutch:param" ); + addAttribute( doc, urlParam, "name", key ); + addAttribute( doc, urlParam, "value", value ); + urlParams.appendChild(urlParam); + } + } + + // Hmm, we should indicate whether or not the "totalResults" + // number as being exact some other way; perhaps just have a + // <nutch:totalIsExact>true</nutch:totalIsExact> element. + /* + if ((hits.totalIsExact() && end < hits.getTotal()) // more hits to show + || (!hits.totalIsExact() && (hits.getLength() > start+hitsPerPage))){ + addNode(doc, channel, "nutch", "nextPage", requestUrl + +"?query="+urlQuery + +"&start="+end + +"&hitsPerDup="+hitsPerDup + +params); + } + */ + + // Same here, this seems odd. 
+ /* + if ((!hits.totalIsExact() && (hits.getLength() <= start+hitsPerPage))) { + addNode(doc, channel, "nutch", "showAllHits", requestUrl + +"?query="+urlQuery + +"&hitsPerDup="+0 + +params); + } + */ + + for (int i = 0; i < length; i++) { + Hit hit = show[i]; + HitDetails detail = details[i]; + String title = detail.getValue("title"); + String url = detail.getValue("url"); + String id = "idx=" + hit.getIndexNo() + "&id=" + hit.getIndexDocNo(); + + if (title == null || title.equals("")) { // use url for docs w/o title + title = url; + } + + Element item = addNode(doc, channel, "item"); + + addNode(doc, item, "title", title); + if (summaries[i] != null) { + addNode(doc, item, "description", summaries[i].toString() ); + } + addNode(doc, item, "link", url); + + addNode(doc, item, "nutch", "site", hit.getDedupValue()); + + addNode(doc, item, "nutch", "cache", base+"/cached.jsp?"+id); + addNode(doc, item, "nutch", "explain", base+"/explain.jsp?"+id + +"&query="+urlQuery+"&lang="+queryLang); + + // Probably don't need this as the XML processor/front-end can + // easily do this themselves. 
+ if (hit.moreFromDupExcluded()) { + addNode(doc, item, "nutch", "moreFromSite", requestUrl + +"?query=" + +URLEncoder.encode("site:"+hit.getDedupValue() + +" "+queryString, "UTF-8") + +"&hitsPerSite="+0 + +params); + } + + for (int j = 0; j < detail.getLength(); j++) { // add all from detail + String field = detail.getField(j); + if (!SKIP_DETAILS.contains(field)) + addNode(doc, item, "nutch", field, detail.getValue(j)); + } + } + + // dump DOM tree + + DOMSource source = new DOMSource(doc); + TransformerFactory transFactory = TransformerFactory.newInstance(); + Transformer transformer = transFactory.newTransformer(); + transformer.setOutputProperty("indent", "yes"); + StreamResult result = new StreamResult(response.getOutputStream()); + response.setContentType("text/xml"); + transformer.transform(source, result); + + } catch (javax.xml.parsers.ParserConfigurationException e) { + throw new ServletException(e); + } catch (javax.xml.transform.TransformerException e) { + throw new ServletException(e); + } + + } + + private static Element addNode(Document doc, Node parent, String name) { + Element child = doc.createElement(name); + parent.appendChild(child); + return child; + } + + private static void addNode(Document doc, Node parent, + String name, String text) { + if ( text == null ) text = ""; + Element child = doc.createElement(name); + child.appendChild(doc.createTextNode(getLegalXml(text))); + parent.appendChild(child); + } + + private static void addNode(Document doc, Node parent, + String ns, String name, String text) { + if ( text == null ) text = ""; + Element child = doc.createElementNS((String)NS_MAP.get(ns), ns+":"+name); + child.appendChild(doc.createTextNode(getLegalXml(text))); + parent.appendChild(child); + } + + private static void addAttribute(Document doc, Element node, + String name, String value) { + Attr attribute = doc.createAttribute(name); + attribute.setValue(getLegalXml(value)); + node.getAttributes().setNamedItem(attribute); + } + + /* + 
* Ensure string is legal xml. + * @param text String to verify. + * @return Passed <code>text</code> or a new string with illegal + * characters removed if any found in <code>text</code>. + * @see http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char + */ + protected static String getLegalXml(final String text) { + if (text == null) { + return null; + } + StringBuffer buffer = null; + for (int i = 0; i < text.length(); i++) { + char c = text.charAt(i); + if (!isLegalXml(c)) { + if (buffer == null) { + // Start up a buffer. Copy characters here from now on + // now we've found at least one bad character in original. + buffer = new StringBuffer(text.length()); + buffer.append(text.substring(0, i)); + } + } else { + if (buffer != null) { + buffer.append(c); + } + } + } + return (buffer != null)? buffer.toString(): text; + } + + private static boolean isLegalXml(final char c) { + return c == 0x9 || c == 0xa || c == 0xd || (c >= 0x20 && c <= 0xd7ff) + || (c >= 0xe000 && c <= 0xfffd) || (c >= 0x10000 && c <= 0x10ffff); + } + +} Modified: trunk/archive-access/projects/nutchwax/archive/src/web/style/search.xsl =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/web/style/search.xsl 2008-12-14 21:10:33 UTC (rev 2663) +++ trunk/archive-access/projects/nutchwax/archive/src/web/style/search.xsl 2008-12-15 01:47:48 UTC (rev 2664) @@ -115,42 +115,49 @@ <span class="searchFields"> Search for <input id="query" name="query" type="text" size="40" value="{nutch:query}" /> + + <!-- Create hidden form fields for the rest of the URL parameters --> + <xsl:for-each select="nutch:urlParams/nutch:param[@name!='start' and @name!='query']"> + <xsl:element name="input" namespace="http://www.w3.org/1999/xhtml"> + <xsl:attribute name="type">hidden</xsl:attribute> + <xsl:attribute name="name" ><xsl:value-of select="@name" /></xsl:attribute> + <xsl:attribute name="value"><xsl:value-of select="@value" /></xsl:attribute> + 
</xsl:element> + </xsl:for-each> + <input type="submit" value="Search"/> </span> </form> </div> - <div style="font-size: 8pt; margin:0; padding:0 0 0.5em 0;">Results <xsl:value-of select="opensearch:startIndex + 1" />-<xsl:value-of select="opensearch:startIndex + opensearch:itemsPerPage" /> of about <xsl:value-of select="opensearch:totalResults" /> <span style="margin-left: 1em;"><a href="{nutch:nextPage}">Next</a></span></div> + <div style="font-size: 8pt; margin:0; padding:0 0 0.5em 0;">Results <xsl:value-of select="opensearch:startIndex + 1" />-<xsl:value-of select="opensearch:startIndex + opensearch:itemsPerPage" /> of about <xsl:value-of select="opensearch:totalResults" /> <span style="margin-left: 1em;"></span></div> <!-- Search results --> <ol start="{opensearch:startIndex + 1}"> <xsl:apply-templates select="item" /> </ol> <!-- Generate list of page links --> <center> - <xsl:if test="(floor(opensearch:startIndex div opensearch:itemsPerPage) + 1) != 1"> - <a href="search?query={nutch:query}&start={(floor(opensearch:startIndex div opensearch:itemsPerPage) - 1) * opensearch:itemsPerPage}">«</a><xsl:text> </xsl:text> - </xsl:if> - <xsl:choose> - <xsl:when test="(floor(opensearch:startIndex div opensearch:itemsPerPage) + 1) < 11"> - <xsl:call-template name="pageLinks" > - <xsl:with-param name="begin" select="1" /> - <xsl:with-param name="end" select="21" /> - <xsl:with-param name="current" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" /> - </xsl:call-template> - </xsl:when> - <xsl:otherwise> - <xsl:call-template name="pageLinks" > - <xsl:with-param name="begin" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1 - 10" /> - <xsl:with-param name="end" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1 + 11" /> - <xsl:with-param name="current" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" /> - </xsl:call-template> - </xsl:otherwise> - </xsl:choose> - <a href="{nutch:nextPage}">»</a> + 
<xsl:call-template name="pageLinks"> + <xsl:with-param name="labelPrevious" select="'«'" /> + <xsl:with-param name="labelNext" select="'»'" /> + </xsl:call-template> </center> </body> </html> </xsl:template> + +<!-- ====================================================================== + NutchWAX XSLT template/function library. + + The idea is that the above xhtml code is what most NutchWAX users + will modify to tailor to their own look and feel. The stuff + below implements the core logic for generating results lists, + page links, etc. + + Hopefully NutchWAX web developers will be able to easily edit the + above xhtml and css and won't have to change the below. + ====================================================================== --> + <!-- Template to emit a search result as an HTML list item (<li/>). --> <xsl:template match="item"> @@ -176,32 +183,99 @@ <xsl:value-of select="substring(.,1,4)" /><xsl:text>-</xsl:text><xsl:value-of select="substring(.,5,2)" /><xsl:text>-</xsl:text><xsl:value-of select="substring(.,7,2)" /><xsl:text> </xsl:text> </xsl:template> -<!-- Template to generate a list of numbered links to results pages. +<!-- Template to emit a list of numbered page links, *including* + "previous" and "next" links on either end, using the given labels.
Parameters: + labelPrevious Link text for "previous page" link + labelNext Link text for "next page" link + --> +<xsl:template name="pageLinks"> + <xsl:param name="labelPrevious" /> + <xsl:param name="labelNext" /> + <!-- If we are on any page past the first, emit a "previous" link --> + <xsl:if test="(floor(opensearch:startIndex div opensearch:itemsPerPage) + 1) != 1"> + <xsl:call-template name="pageLink"> + <xsl:with-param name="pageNum" select="floor(opensearch:startIndex div opensearch:itemsPerPage)" /> + <xsl:with-param name="linkText" select="$labelPrevious" /> + </xsl:call-template> + <xsl:text> </xsl:text> + </xsl:if> + <!-- Now, emit numbered page links --> + <xsl:choose> + <xsl:when test="(floor(opensearch:startIndex div opensearch:itemsPerPage) + 1) < 11"> + <xsl:call-template name="numberedPageLinks" > + <xsl:with-param name="begin" select="1" /> + <xsl:with-param name="end" select="21" /> + <xsl:with-param name="current" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" /> + </xsl:call-template> + </xsl:when> + <xsl:otherwise> + <xsl:call-template name="numberedPageLinks" > + <xsl:with-param name="begin" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1 - 10" /> + <xsl:with-param name="end" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1 + 11" /> + <xsl:with-param name="current" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" /> + </xsl:call-template> + </xsl:otherwise> + </xsl:choose> + <!-- Lastly, emit a "next" link. --> + <xsl:text> </xsl:text> + <xsl:call-template name="pageLink"> + <xsl:with-param name="pageNum" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 2" /> + <xsl:with-param name="linkText" select="$labelNext" /> + </xsl:call-template> +</xsl:template> + +<!-- Template to emit a list of numbered links to results pages. 
+ Parameters: begin starting # inclusive end ending # exclusive current the current page, don't emit a link --> -<xsl:template name="pageLinks"> +<xsl:template name="numberedPageLinks"> <xsl:param name="begin" /> <xsl:param name="end" /> <xsl:param name="current" /> <xsl:if test="$begin < $end"> - <xsl:choose> - <xsl:when test="$begin = $current" > - <xsl:value-of select="$current" /> - </xsl:when> - <xsl:otherwise> - <a href="?query={nutch:query}&start={($begin -1) * opensearch:itemsPerPage}&hitsPerPage={opensearch:itemsPerPage}"><xsl:value-of select="$begin" /></a> - </xsl:otherwise> - </xsl:choose> - <xsl:text> </xsl:text> - <xsl:call-template name="pageLinks"> - <xsl:with-param name="begin" select="$begin + 1" /> - <xsl:with-param name="end" select="$end" /> - <xsl:with-param name="current" select="$current" /> + <xsl:choose> + <xsl:when test="$begin = $current" > + <xsl:value-of select="$current" /> + </xsl:when> + <xsl:otherwise> + <xsl:call-template name="pageLink" > + <xsl:with-param name="pageNum" select="$begin" /> + <xsl:with-param name="linkText" select="$begin" /> </xsl:call-template> + </xsl:otherwise> + </xsl:choose> + <xsl:text> </xsl:text> + <xsl:call-template name="numberedPageLinks"> + <xsl:with-param name="begin" select="$begin + 1" /> + <xsl:with-param name="end" select="$end" /> + <xsl:with-param name="current" select="$current" /> + </xsl:call-template> </xsl:if> </xsl:template> +<!-- Template to emit a single page link. All of the URL parameters + listed in the OpenSearch results are included in the link. 
+ Parameters: + pageNum page number of the link + linkText text of the link + --> +<xsl:template name="pageLink"> + <xsl:param name="pageNum" /> + <xsl:param name="linkText" /> + <xsl:element name="a" namespace="http://www.w3.org/1999/xhtml"> + <xsl:attribute name="href"> + <xsl:text>?</xsl:text> + <xsl:for-each select="nutch:urlParams/nutch:param[@name!='start']"> + <xsl:value-of select="@name" /><xsl:text>=</xsl:text><xsl:value-of select="@value" /> + <xsl:text>&</xsl:text> + </xsl:for-each> + <xsl:text>start=</xsl:text><xsl:value-of select="($pageNum -1) * opensearch:itemsPerPage" /> + </xsl:attribute> + <xsl:value-of select="$linkText" /> + </xsl:element> +</xsl:template> + </xsl:stylesheet> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
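The getLegalXml/isLegalXml helpers added to OpenSearchServlet in this commit can be exercised in isolation. The standalone sketch below (the class name XmlCharFilter is illustrative, not from the NutchWAX source) mirrors the filtering logic: characters outside XML 1.0's legal Char ranges are dropped, a buffer is allocated only once a bad character is found, and the original string is returned untouched when every character is legal.

```java
// Sketch of the XML character filtering used by OpenSearchServlet.getLegalXml.
// Tab, LF, CR, and the legal Unicode BMP ranges pass through; everything
// else (e.g. control characters) is silently dropped.
public class XmlCharFilter {

    static boolean isLegalXml(final char c) {
        return c == 0x9 || c == 0xa || c == 0xd
            || (c >= 0x20 && c <= 0xd7ff)
            || (c >= 0xe000 && c <= 0xfffd);
    }

    static String getLegalXml(final String text) {
        if (text == null) {
            return null;
        }
        StringBuilder buffer = null; // allocated lazily, on the first bad char
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!isLegalXml(c)) {
                if (buffer == null) {
                    buffer = new StringBuilder(text.length());
                    buffer.append(text, 0, i); // copy the clean prefix
                }
            } else if (buffer != null) {
                buffer.append(c);
            }
        }
        return (buffer != null) ? buffer.toString() : text;
    }

    public static void main(String[] args) {
        String input = "ok" + (char) 0x01 + "bad"; // embed an illegal control char
        System.out.println(getLegalXml(input));    // prints "okbad"
    }
}
```

One detail of the committed version worth noting: its final clause, `(c >= 0x10000 && c <= 0x10ffff)`, can never be true for a Java `char` (whose maximum value is 0xFFFF); supplementary characters arrive as surrogate pairs in the 0xD800-0xDFFF gap, so this filter strips them. The sketch omits that dead clause.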
Revision: 2663 http://archive-access.svn.sourceforge.net/archive-access/?rev=2663&view=rev Author: binzino Date: 2008-12-14 21:10:33 +0000 (Sun, 14 Dec 2008) Log Message: ----------- Fixed bug where no settings lead to NPE due to uninitialized member variable. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/FieldSetter.java Modified: trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/FieldSetter.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/FieldSetter.java 2008-12-12 05:12:36 UTC (rev 2662) +++ trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/FieldSetter.java 2008-12-14 21:10:33 UTC (rev 2663) @@ -20,6 +20,7 @@ */ package org.archive.nutchwax.index; +import java.util.Collections; import java.util.List; import java.util.ArrayList; @@ -69,7 +70,7 @@ public static final Log LOG = LogFactory.getLog( FieldSetter.class ); private Configuration conf; - private List<FieldSetting> settings; + private List<FieldSetting> settings = Collections.emptyList();  public void setConf( Configuration conf ) {
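The one-line fix in revision 2663 replaces a null `settings` field with an immutable empty list, so code that reads the settings before `setConf` has populated them no longer throws a `NullPointerException`. A minimal sketch of the pattern (the class, field, and method names below are illustrative stand-ins, not the actual NutchWAX `FieldSetter`):

```java
import java.util.Collections;
import java.util.List;

// Illustrative stand-in for a configurable component whose settings list
// may be read before any configuration has been applied.
public class SafeDefaults {

    // Before the fix: "private List<String> settings;" left this null,
    // so settings.size() below would throw a NullPointerException.
    private List<String> settings = Collections.emptyList();

    public void configure(List<String> values) {
        this.settings = values;
    }

    public int settingCount() {
        return settings.size(); // safe even if configure() was never called
    }

    public static void main(String[] args) {
        SafeDefaults s = new SafeDefaults();
        System.out.println(s.settingCount()); // prints 0 (no NPE)
        s.configure(List.of("a", "b"));
        System.out.println(s.settingCount()); // prints 2
    }
}
```

`Collections.emptyList()` is a shared immutable instance, so the default costs no allocation per object; any real configuration simply overwrites the reference.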
From: <bi...@us...> - 2008-12-12 05:12:41
Revision: 2662 http://archive-access.svn.sourceforge.net/archive-access/?rev=2662&view=rev Author: binzino Date: 2008-12-12 05:12:36 +0000 (Fri, 12 Dec 2008) Log Message: ----------- Fixed rsync args to exclude .svn subdirs and other stuff we don't want to copy over into the Nutch source tree. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/build.xml Modified: trunk/archive-access/projects/nutchwax/archive/build.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/build.xml 2008-12-11 22:59:27 UTC (rev 2661) +++ trunk/archive-access/projects/nutchwax/archive/build.xml 2008-12-12 05:12:36 UTC (rev 2662) @@ -28,9 +28,7 @@ <target name="nutch-compile-core"> <!-- First, copy over Nutch source overlays --> <exec executable="rsync"> - <arg value="-vac"/> - <arg value="--exclude"/> - <arg value="*~"/> + <arg value="-vacC"/> <arg value="src/nutch/"/> <arg value="../../"/> </exec>
From: <bi...@us...> - 2008-12-11 22:59:31
Revision: 2661 http://archive-access.svn.sourceforge.net/archive-access/?rev=2661&view=rev Author: binzino Date: 2008-12-11 22:59:27 +0000 (Thu, 11 Dec 2008) Log Message: ----------- Add use of 'rsync' to copy Nutch source over-rides into Nutch main source dir before compilation. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/build.xml Modified: trunk/archive-access/projects/nutchwax/archive/build.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/build.xml 2008-12-11 22:58:28 UTC (rev 2660) +++ trunk/archive-access/projects/nutchwax/archive/build.xml 2008-12-11 22:59:27 UTC (rev 2661) @@ -26,6 +26,14 @@ <property name="dist.dir" value="${build.dir}/nutch-1.0-dev" /> <target name="nutch-compile-core"> + <!-- First, copy over Nutch source overlays --> + <exec executable="rsync"> + <arg value="-vac"/> + <arg value="--exclude"/> + <arg value="*~"/> + <arg value="src/nutch/"/> + <arg value="../../"/> + </exec> <ant dir="${nutch.dir}" target="compile-core" inheritAll="false" /> </target>
From: <bi...@us...> - 2008-12-11 22:58:33
Revision: 2660 http://archive-access.svn.sourceforge.net/archive-access/?rev=2660&view=rev Author: binzino Date: 2008-12-11 22:58:28 +0000 (Thu, 11 Dec 2008) Log Message: ----------- Initial checkin of Nutch source-files that are over-ridden and copied into the Nutch source tree when compiling. Added Paths: ----------- trunk/archive-access/projects/nutchwax/archive/src/nutch/ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java 2008-12-11 22:58:28 UTC (rev 2660) @@ -0,0 +1,375 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.nutch.searcher; + +import java.io.IOException; +import java.io.Reader; +import java.io.InputStream; +import java.io.InputStreamReader; +import java.io.BufferedReader; + +import java.util.HashMap; +import java.util.Map; +import java.util.Iterator; + +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; + +import org.apache.commons.lang.StringUtils; +import org.apache.hadoop.io.*; +import org.apache.hadoop.fs.*; +import org.apache.nutch.protocol.*; +import org.apache.nutch.parse.*; +import org.apache.nutch.util.HadoopFSUtil; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.mapred.*; +import org.apache.hadoop.mapred.lib.*; +import org.apache.nutch.crawl.*; + +/** Implements {@link HitSummarizer} and {@link HitContent} for a set of + * fetched segments. 
*/ +public class FetchedSegments implements HitSummarizer, HitContent +{ + public static final Log LOG = LogFactory.getLog(FetchedSegments.class); + + private static class Segment implements Closeable { + + private static final Partitioner PARTITIONER = new HashPartitioner(); + + private FileSystem fs; + private Path segmentDir; + + private MapFile.Reader[] content; + private MapFile.Reader[] parseText; + private MapFile.Reader[] parseData; + private MapFile.Reader[] crawl; + private Configuration conf; + + public Segment(FileSystem fs, Path segmentDir, Configuration conf) throws IOException { + this.fs = fs; + this.segmentDir = segmentDir; + this.conf = conf; + } + + public CrawlDatum getCrawlDatum(Text url) throws IOException { + synchronized (this) { + if (crawl == null) + crawl = getReaders(CrawlDatum.FETCH_DIR_NAME); + } + return (CrawlDatum)getEntry(crawl, url, new CrawlDatum()); + } + + public byte[] getContent(Text url) throws IOException { + synchronized (this) { + if (content == null) + content = getReaders(Content.DIR_NAME); + } + return ((Content)getEntry(content, url, new Content())).getContent(); + } + + public ParseData getParseData(Text url) throws IOException { + synchronized (this) { + if (parseData == null) + parseData = getReaders(ParseData.DIR_NAME); + } + return (ParseData)getEntry(parseData, url, new ParseData()); + } + + public ParseText getParseText(Text url) throws IOException { + synchronized (this) { + if (parseText == null) + parseText = getReaders(ParseText.DIR_NAME); + } + return (ParseText)getEntry(parseText, url, new ParseText()); + } + + private MapFile.Reader[] getReaders(String subDir) throws IOException { + return MapFileOutputFormat.getReaders(fs, new Path(segmentDir, subDir), this.conf); + } + + private Writable getEntry(MapFile.Reader[] readers, Text url, + Writable entry) throws IOException { + return MapFileOutputFormat.getEntry(readers, PARTITIONER, url, entry); + } + + public void close() throws IOException { + if 
(content != null) { closeReaders(content); } + if (parseText != null) { closeReaders(parseText); } + if (parseData != null) { closeReaders(parseData); } + if (crawl != null) { closeReaders(crawl); } + } + + private void closeReaders(MapFile.Reader[] readers) throws IOException { + for (int i = 0; i < readers.length; i++) { + readers[i].close(); + } + } + + } + + private HashMap segments = new HashMap( ); + private boolean perCollection = false; + private Summarizer summarizer; + + /** Construct given a directory containing fetcher output. */ + public FetchedSegments(FileSystem fs, String segmentsDir, Configuration conf) throws IOException + { + this.summarizer = new SummarizerFactory(conf).getSummarizer(); + + Path[] segmentDirs = HadoopFSUtil.getPaths( fs.listStatus(new Path(segmentsDir), HadoopFSUtil.getPassDirectoriesFilter(fs)) ); + if ( segmentDirs == null ) + { + LOG.warn( "No segment directories: " + segmentsDir ); + return ; + } + + this.perCollection = conf.getBoolean( "nutchwax.FetchedSegments.perCollection", false ); + + LOG.info( "Per-collection segments: " + this.perCollection ); + + for ( int i = 0; i < segmentDirs.length; i++ ) + { + if ( this.perCollection ) + { + // Assume segmentDir is actually a 'collection' dir which + // contains a list of segments, such as: + // crawl/segments/194/segment-foo + // /segment-bar + // /segment-baz + // crawl/segments/366/segment-frotz + // /segment-fizzle + // /segment-bizzle + // The '194' and '366' are collection dirs, which contain the + // actual segment dirs. + Path collectionDir = segmentDirs[i]; + + Map perCollectionSegments = (Map) this.segments.get( collectionDir.getName( ) ); + if ( perCollectionSegments == null ) + { + perCollectionSegments = new HashMap( ); + this.segments.put( collectionDir.getName( ), perCollectionSegments ); + } + + // Now, get a list of all the sub-dirs of the collectionDir, + // and create segments for them, adding them to the + // per-collection map. 
+ Path[] perCollectionSegmentDirs = HadoopFSUtil.getPaths( fs.listStatus( collectionDir, HadoopFSUtil.getPassDirectoriesFilter(fs) ) ); + for ( Path segmentDir : perCollectionSegmentDirs ) + { + perCollectionSegments.put( segmentDir.getName( ), new Segment( fs, segmentDir, conf ) ); + } + + addRemaps( fs, collectionDir, (Map<String,Segment>) perCollectionSegments ); + } + else + { + Path segmentDir = segmentDirs[i]; + segments.put(segmentDir.getName(), new Segment(fs, segmentDir, conf)); + } + } + + // If we not-doing perCollection segments, process a single + // "remap" file for the "segments" dir. + if ( ! this.perCollection ) + { + addRemaps( fs, new Path(segmentsDir), (Map<String,Segment>) segments ); + } + + LOG.info( "segments: " + segments ); + } + + protected void addRemaps( FileSystem fs, Path segmentDir, Map<String,Segment> segments ) + throws IOException + { + Path segmentRemapFile = new Path( segmentDir, "remap" ); + + if ( ! fs.exists( segmentRemapFile ) ) + { + LOG.warn( "Remap file doesn't exist: " + segmentRemapFile ); + + return ; + } + + // InputStream is = segmentRemapFile.getFileSystem( conf ).open( segmentRemapFile ); + InputStream is = fs.open( segmentRemapFile ); + + BufferedReader reader = new BufferedReader( new InputStreamReader( is, "UTF-8" ) ); + + String line; + while ( (line = reader.readLine()) != null ) + { + String fields[] = line.trim( ).split( "\\s+" ); + + if ( fields.length < 2 ) + { + LOG.warn( "Malformed remap line, not enough fields ("+fields.length+"): " + line ); + continue ; + } + + // Look for the "to" name in the segments. 
+ Segment toSegment = segments.get( fields[1] ); + if ( toSegment == null ) + { + LOG.warn( "Segment remap destination doesn't exist: " + fields[1] ); + } + else + { + LOG.warn( "Remap: " + fields[0] + " => " + fields[1] ); + segments.put( fields[0], toSegment ); + } + } + } + + + public String[] getSegmentNames() { + return (String[])segments.keySet().toArray(new String[segments.size()]); + } + + public byte[] getContent(HitDetails details) throws IOException { + return getSegment(details).getContent(getUrl(details)); + } + + public ParseData getParseData(HitDetails details) throws IOException { + return getSegment(details).getParseData(getUrl(details)); + } + + public long getFetchDate(HitDetails details) throws IOException { + return getSegment(details).getCrawlDatum(getUrl(details)) + .getFetchTime(); + } + + public ParseText getParseText(HitDetails details) throws IOException { + return getSegment(details).getParseText(getUrl(details)); + } + + public Summary getSummary(HitDetails details, Query query) + throws IOException { + + if (this.summarizer == null) { return new Summary(); } + + Segment segment = getSegment(details); + ParseText parseText = segment.getParseText(getUrl(details)); + String text = (parseText != null) ? 
parseText.getText() : ""; + + return this.summarizer.getSummary(text, query); + } + + private class SummaryThread extends Thread { + private HitDetails details; + private Query query; + + private Summary summary; + private Throwable throwable; + + public SummaryThread(HitDetails details, Query query) { + this.details = details; + this.query = query; + } + + public void run() { + try { + this.summary = getSummary(details, query); + } catch (Throwable throwable) { + this.throwable = throwable; + } + } + + } + + + public Summary[] getSummary(HitDetails[] details, Query query) + throws IOException { + SummaryThread[] threads = new SummaryThread[details.length]; + for (int i = 0; i < threads.length; i++) { + threads[i] = new SummaryThread(details[i], query); + threads[i].start(); + } + + Summary[] results = new Summary[details.length]; + for (int i = 0; i < threads.length; i++) { + try { + threads[i].join(); + } catch (InterruptedException e) { + throw new RuntimeException(e); + } + if (threads[i].throwable instanceof IOException) { + throw (IOException)threads[i].throwable; + } else if (threads[i].throwable != null) { + throw new RuntimeException(threads[i].throwable); + } + results[i] = threads[i].summary; + } + return results; + } + + + private Segment getSegment(HitDetails details) + { + if ( this.perCollection ) + { + LOG.info( "getSegment: " + details ); + LOG.info( " collection: " + details.getValue("collection") ); + LOG.info( " segment : " + details.getValue("segment") ); + + String collectionId = details.getValue("collection"); + String segmentName = details.getValue("segment"); + + Map perCollectionSegments = (Map) this.segments.get( collectionId ); + + Segment segment = (Segment) perCollectionSegments.get( segmentName ); + + if ( segment == null ) + { + LOG.warn( "Didn't find segment: collection=" + collectionId + " segment=" + segmentName ); + } + + return segment; + } + else + { + LOG.info( "getSegment: " + details ); + LOG.info( " segment : " + 
details.getValue("segment") ); + + String segmentName = details.getValue( "segment" ); + Segment segment = (Segment) segments.get( segmentName ); + + if ( segment == null ) + { + LOG.warn( "Didn't find segment: " + segmentName ); + } + + return segment; + } + } + + private Text getUrl(HitDetails details) { + String url = details.getValue("orig"); + if (StringUtils.isBlank(url)) { + url = details.getValue("url"); + } + return new Text(url); + } + + public void close() throws IOException { + Iterator iterator = segments.values().iterator(); + while (iterator.hasNext()) { + ((Segment) iterator.next()).close(); + } + } + +} Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java 2008-12-11 22:58:28 UTC (rev 2660) @@ -0,0 +1,179 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.nutch.searcher; + +import java.io.File; +import java.io.IOException; +import java.util.List; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.io.FloatWritable; +import org.apache.hadoop.io.IntWritable; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.io.WritableComparable; +import org.apache.lucene.document.Document; +import org.apache.lucene.document.Field; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.MultiReader; +import org.apache.lucene.search.FieldCache; +import org.apache.lucene.search.FieldDoc; +import org.apache.lucene.search.ScoreDoc; +import org.apache.lucene.search.TopDocs; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FSDirectory; +import org.apache.nutch.indexer.FsDirectory; +import org.apache.nutch.indexer.NutchSimilarity; + +/** Implements {@link Searcher} and {@link HitDetailer} for either a single + * merged index, or a set of indexes. */ +public class IndexSearcher implements Searcher, HitDetailer { + + private org.apache.lucene.search.Searcher luceneSearcher; + private org.apache.lucene.index.IndexReader reader; + private LuceneQueryOptimizer optimizer; + private FileSystem fs; + private Configuration conf; + private QueryFilters queryFilters; + + /** Construct given a number of indexes. */ + public IndexSearcher(Path[] indexDirs, Configuration conf) throws IOException { + IndexReader[] readers = new IndexReader[indexDirs.length]; + this.conf = conf; + this.fs = FileSystem.get(conf); + for (int i = 0; i < indexDirs.length; i++) { + readers[i] = IndexReader.open(getDirectory(indexDirs[i])); + } + init(new MultiReader(readers), conf); + } + + /** Construct given a single merged index. 
*/ + public IndexSearcher(Path index, Configuration conf) + throws IOException { + this.conf = conf; + this.fs = FileSystem.get(conf); + init(IndexReader.open(getDirectory(index)), conf); + } + + private void init(IndexReader reader, Configuration conf) throws IOException { + this.reader = reader; + this.luceneSearcher = new org.apache.lucene.search.IndexSearcher(reader); + this.luceneSearcher.setSimilarity(new NutchSimilarity()); + this.optimizer = new LuceneQueryOptimizer(conf); + this.queryFilters = new QueryFilters(conf); + } + + private Directory getDirectory(Path file) throws IOException { + if ("file".equals(this.fs.getUri().getScheme())) { + Path qualified = file.makeQualified(FileSystem.getLocal(conf)); + File fsLocal = new File(qualified.toUri()); + return FSDirectory.getDirectory(fsLocal.getAbsolutePath()); + } else { + return new FsDirectory(this.fs, file, false, this.conf); + } + } + + public Hits search(Query query, int numHits, + String dedupField, String sortField, boolean reverse) + + throws IOException { + org.apache.lucene.search.BooleanQuery luceneQuery = + this.queryFilters.filter(query); + + System.out.println( "Nutch query: " + query ); + System.out.println( "Lucene query: " + luceneQuery ); + + return translateHits + (optimizer.optimize(luceneQuery, luceneSearcher, numHits, + sortField, reverse), + dedupField, sortField); + } + + public String getExplanation(Query query, Hit hit) throws IOException { + return luceneSearcher.explain(this.queryFilters.filter(query), + hit.getIndexDocNo()).toHtml(); + } + + public HitDetails getDetails(Hit hit) throws IOException { + + Document doc = luceneSearcher.doc(hit.getIndexDocNo()); + + List docFields = doc.getFields(); + String[] fields = new String[docFields.size()]; + String[] values = new String[docFields.size()]; + for (int i = 0; i < docFields.size(); i++) { + Field field = (Field)docFields.get(i); + fields[i] = field.name(); + values[i] = field.stringValue(); + } + + return new HitDetails(fields, 
values); + } + + public HitDetails[] getDetails(Hit[] hits) throws IOException { + HitDetails[] results = new HitDetails[hits.length]; + for (int i = 0; i < hits.length; i++) + results[i] = getDetails(hits[i]); + return results; + } + + private Hits translateHits(TopDocs topDocs, + String dedupField, String sortField) + throws IOException { + + String[] dedupValues = null; + if (dedupField != null) + dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField); + + ScoreDoc[] scoreDocs = topDocs.scoreDocs; + int length = scoreDocs.length; + Hit[] hits = new Hit[length]; + for (int i = 0; i < length; i++) { + + int doc = scoreDocs[i].doc; + + WritableComparable sortValue; // convert value to writable + if (sortField == null) { + sortValue = new FloatWritable(scoreDocs[i].score); + } else { + Object raw = ((FieldDoc)scoreDocs[i]).fields[0]; + if (raw instanceof Integer) { + sortValue = new IntWritable(((Integer)raw).intValue()); + } else if (raw instanceof Float) { + sortValue = new FloatWritable(((Float)raw).floatValue()); + } else if (raw instanceof String) { + sortValue = new Text((String)raw); + } else { + throw new RuntimeException("Unknown sort value type!"); + } + } + + String dedupValue = dedupValues == null ? 
null : dedupValues[doc]; + + hits[i] = new Hit(doc, sortValue, dedupValue); + } + return new Hits(topDocs.totalHits, hits); + } + + public void close() throws IOException { + if (luceneSearcher != null) { luceneSearcher.close(); } + if (reader != null) { reader.close(); } + } + +} Added: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java 2008-12-11 22:58:28 UTC (rev 2660) @@ -0,0 +1,333 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.nutch.searcher; + +import java.io.IOException; +import java.net.URLEncoder; +import java.util.Map; +import java.util.HashMap; +import java.util.Set; +import java.util.HashSet; + +import javax.servlet.ServletException; +import javax.servlet.ServletConfig; +import javax.servlet.http.HttpServlet; +import javax.servlet.http.HttpServletRequest; +import javax.servlet.http.HttpServletResponse; + +import javax.xml.parsers.*; + +import org.apache.hadoop.conf.Configuration; +import org.apache.nutch.util.NutchConfiguration; +import org.w3c.dom.*; +import javax.xml.transform.TransformerFactory; +import javax.xml.transform.Transformer; +import javax.xml.transform.dom.DOMSource; +import javax.xml.transform.stream.StreamResult; + + +/** Present search results using A9's OpenSearch extensions to RSS, plus a few + * Nutch-specific extensions. */ +public class OpenSearchServlet extends HttpServlet { + private static final Map NS_MAP = new HashMap(); + private int MAX_HITS_PER_PAGE; + + static { + NS_MAP.put("opensearch", "http://a9.com/-/spec/opensearchrss/1.0/"); + NS_MAP.put("nutch", "http://www.nutch.org/opensearchrss/1.0/"); + } + + private static final Set SKIP_DETAILS = new HashSet(); + static { + SKIP_DETAILS.add("url"); // redundant with RSS link + SKIP_DETAILS.add("title"); // redundant with RSS title + } + + private NutchBean bean; + private Configuration conf; + + public void init(ServletConfig config) throws ServletException { + try { + this.conf = NutchConfiguration.get(config.getServletContext()); + bean = NutchBean.get(config.getServletContext(), this.conf); + } catch (IOException e) { + throw new ServletException(e); + } + MAX_HITS_PER_PAGE = conf.getInt("searcher.max.hits.per.page", -1); + } + + public void doGet(HttpServletRequest request, HttpServletResponse response) + throws ServletException, IOException { + + if (NutchBean.LOG.isInfoEnabled()) { + NutchBean.LOG.info("query request from " + request.getRemoteAddr()); + } + + // get 
parameters from request + request.setCharacterEncoding("UTF-8"); + String queryString = request.getParameter("query"); + if (queryString == null) + queryString = ""; + String urlQuery = URLEncoder.encode(queryString, "UTF-8"); + + // the query language + String queryLang = request.getParameter("lang"); + + int start = 0; // first hit to display + String startString = request.getParameter("start"); + if (startString != null) + start = Integer.parseInt(startString); + + int hitsPerPage = 10; // number of hits to display + String hitsString = request.getParameter("hitsPerPage"); + if (hitsString != null) + hitsPerPage = Integer.parseInt(hitsString); + if(MAX_HITS_PER_PAGE > 0 && hitsPerPage > MAX_HITS_PER_PAGE) + hitsPerPage = MAX_HITS_PER_PAGE; + + String sort = request.getParameter("sort"); + boolean reverse = + sort!=null && "true".equals(request.getParameter("reverse")); + + // De-Duplicate handling. Look for duplicates field and for how many + // duplicates per results to return. Default duplicates field is 'site' + // and duplicates per results default is '2'. + String dedupField = request.getParameter("dedupField"); + if (dedupField == null || dedupField.length() == 0) { + dedupField = "site"; + } + int hitsPerDup = 2; + String hitsPerDupString = request.getParameter("hitsPerDup"); + if (hitsPerDupString != null && hitsPerDupString.length() > 0) { + hitsPerDup = Integer.parseInt(hitsPerDupString); + } else { + // If 'hitsPerSite' present, use that value. + String hitsPerSiteString = request.getParameter("hitsPerSite"); + if (hitsPerSiteString != null && hitsPerSiteString.length() > 0) { + hitsPerDup = Integer.parseInt(hitsPerSiteString); + } + } + + // Make up query string for use later drawing the 'rss' logo. + String params = "&hitsPerPage=" + hitsPerPage + + (queryLang == null ? "" : "&lang=" + queryLang) + + (sort == null ? "" : "&sort=" + sort + (reverse? "&reverse=true": "") + + (dedupField == null ? 
"" : "&dedupField=" + dedupField)); + + Query query = Query.parse(queryString, queryLang, this.conf); + if (NutchBean.LOG.isInfoEnabled()) { + NutchBean.LOG.info("query: " + queryString); + NutchBean.LOG.info("lang: " + queryLang); + } + + // execute the query + Hits hits; + try { + hits = bean.search(query, start + hitsPerPage, hitsPerDup, dedupField, + sort, reverse); + } catch (IOException e) { + if (NutchBean.LOG.isWarnEnabled()) { + NutchBean.LOG.warn("Search Error", e); + } + hits = new Hits(0,new Hit[0]); + } + + if (NutchBean.LOG.isInfoEnabled()) { + NutchBean.LOG.info("total hits: " + hits.getTotal()); + } + + // generate xml results + int end = (int)Math.min(hits.getLength(), start + hitsPerPage); + int length = end-start; + + Hit[] show = hits.getHits(start, end-start); + HitDetails[] details = bean.getDetails(show); + Summary[] summaries = bean.getSummary(details, query); + + String requestUrl = request.getRequestURL().toString(); + String base = requestUrl.substring(0, requestUrl.lastIndexOf('/')); + + + try { + DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); + factory.setNamespaceAware(true); + Document doc = factory.newDocumentBuilder().newDocument(); + + Element rss = addNode(doc, doc, "rss"); + addAttribute(doc, rss, "version", "2.0"); + addAttribute(doc, rss, "xmlns:opensearch", + (String)NS_MAP.get("opensearch")); + addAttribute(doc, rss, "xmlns:nutch", (String)NS_MAP.get("nutch")); + + Element channel = addNode(doc, rss, "channel"); + + addNode(doc, channel, "title", "Nutch: " + queryString); + addNode(doc, channel, "description", "Nutch search results for query: " + + queryString); + addNode(doc, channel, "link", + base+"/search.jsp" + +"?query="+urlQuery + +"&start="+start + +"&hitsPerDup="+hitsPerDup + +params); + + addNode(doc, channel, "opensearch", "totalResults", ""+hits.getTotal()); + addNode(doc, channel, "opensearch", "startIndex", ""+start); + addNode(doc, channel, "opensearch", "itemsPerPage", ""+hitsPerPage); 
+ + addNode(doc, channel, "nutch", "query", queryString); + + + if ((hits.totalIsExact() && end < hits.getTotal()) // more hits to show + || (!hits.totalIsExact() && (hits.getLength() > start+hitsPerPage))){ + addNode(doc, channel, "nutch", "nextPage", requestUrl + +"?query="+urlQuery + +"&start="+end + +"&hitsPerDup="+hitsPerDup + +params); + } + + if ((!hits.totalIsExact() && (hits.getLength() <= start+hitsPerPage))) { + addNode(doc, channel, "nutch", "showAllHits", requestUrl + +"?query="+urlQuery + +"&hitsPerDup="+0 + +params); + } + + for (int i = 0; i < length; i++) { + Hit hit = show[i]; + HitDetails detail = details[i]; + String title = detail.getValue("title"); + String url = detail.getValue("url"); + String id = "idx=" + hit.getIndexNo() + "&id=" + hit.getIndexDocNo(); + + if (title == null || title.equals("")) { // use url for docs w/o title + title = url; + } + + Element item = addNode(doc, channel, "item"); + + addNode(doc, item, "title", title); + if (summaries[i] != null) { + addNode(doc, item, "description", summaries[i].toString() ); + } + addNode(doc, item, "link", url); + + addNode(doc, item, "nutch", "site", hit.getDedupValue()); + + addNode(doc, item, "nutch", "cache", base+"/cached.jsp?"+id); + addNode(doc, item, "nutch", "explain", base+"/explain.jsp?"+id + +"&query="+urlQuery+"&lang="+queryLang); + + if (hit.moreFromDupExcluded()) { + addNode(doc, item, "nutch", "moreFromSite", requestUrl + +"?query=" + +URLEncoder.encode("site:"+hit.getDedupValue() + +" "+queryString, "UTF-8") + +"&hitsPerSite="+0 + +params); + } + + for (int j = 0; j < detail.getLength(); j++) { // add all from detail + String field = detail.getField(j); + if (!SKIP_DETAILS.contains(field)) + addNode(doc, item, "nutch", field, detail.getValue(j)); + } + } + + // dump DOM tree + + DOMSource source = new DOMSource(doc); + TransformerFactory transFactory = TransformerFactory.newInstance(); + Transformer transformer = transFactory.newTransformer(); + 
transformer.setOutputProperty("indent", "yes"); + StreamResult result = new StreamResult(response.getOutputStream()); + response.setContentType("text/xml"); + transformer.transform(source, result); + + } catch (javax.xml.parsers.ParserConfigurationException e) { + throw new ServletException(e); + } catch (javax.xml.transform.TransformerException e) { + throw new ServletException(e); + } + + } + + private static Element addNode(Document doc, Node parent, String name) { + Element child = doc.createElement(name); + parent.appendChild(child); + return child; + } + + private static void addNode(Document doc, Node parent, + String name, String text) { + Element child = doc.createElement(name); + child.appendChild(doc.createTextNode(getLegalXml(text))); + parent.appendChild(child); + } + + private static void addNode(Document doc, Node parent, + String ns, String name, String text) { + Element child = doc.createElementNS((String)NS_MAP.get(ns), ns+":"+name); + child.appendChild(doc.createTextNode(getLegalXml(text))); + parent.appendChild(child); + } + + private static void addAttribute(Document doc, Element node, + String name, String value) { + Attr attribute = doc.createAttribute(name); + attribute.setValue(getLegalXml(value)); + node.getAttributes().setNamedItem(attribute); + } + + /* + * Ensure string is legal xml. + * @param text String to verify. + * @return Passed <code>text</code> or a new string with illegal + * characters removed if any found in <code>text</code>. + * @see http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char + */ + protected static String getLegalXml(final String text) { + if (text == null) { + return null; + } + StringBuffer buffer = null; + for (int i = 0; i < text.length(); i++) { + char c = text.charAt(i); + if (!isLegalXml(c)) { + if (buffer == null) { + // Start up a buffer. Copy characters here from now on + // now we've found at least one bad character in original. 
+ buffer = new StringBuffer(text.length()); + buffer.append(text.substring(0, i)); + } + } else { + if (buffer != null) { + buffer.append(c); + } + } + } + return (buffer != null)? buffer.toString(): text; + } + + private static boolean isLegalXml(final char c) { + return c == 0x9 || c == 0xa || c == 0xd || (c >= 0x20 && c <= 0xd7ff) + || (c >= 0xe000 && c <= 0xfffd) || (c >= 0x10000 && c <= 0x10ffff); + } + +} This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
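The `getLegalXml`/`isLegalXml` pair in the servlet above can be read in isolation as a filter over the XML 1.0 `Char` production. The sketch below restates it as a standalone class (the class name is illustrative, not from the patch). One observation: the original's `(c >= 0x10000 && c <= 0x10ffff)` clause can never match, because a Java `char` is a single 16-bit UTF-16 unit.

```java
// Standalone sketch of the XML-legal-character filtering done by
// getLegalXml/isLegalXml in OpenSearchServlet above (illustrative class name).
public class LegalXmlDemo {

    // True if c is allowed by the XML 1.0 Char production
    // (see http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char).
    static boolean isLegalXml(final char c) {
        return c == 0x9 || c == 0xa || c == 0xd
            || (c >= 0x20 && c <= 0xd7ff)
            || (c >= 0xe000 && c <= 0xfffd);
    }

    // Return text with illegal characters removed. Like the original, the
    // buffer is only allocated once the first bad character is found, so
    // clean strings are returned as-is with no copy.
    static String getLegalXml(final String text) {
        if (text == null) {
            return null;
        }
        StringBuilder buffer = null;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!isLegalXml(c)) {
                if (buffer == null) {
                    buffer = new StringBuilder(text.length());
                    buffer.append(text, 0, i);
                }
            } else if (buffer != null) {
                buffer.append(c);
            }
        }
        return buffer != null ? buffer.toString() : text;
    }

    public static void main(String[] args) {
        System.out.println(getLegalXml("ok\u0000bad")); // prints "okbad"
    }
}
```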
From: <bi...@us...> - 2008-12-11 22:21:49
Revision: 2659 http://archive-access.svn.sourceforge.net/archive-access/?rev=2659&view=rev Author: binzino Date: 2008-12-11 22:21:44 +0000 (Thu, 11 Dec 2008) Log Message: ----------- Added property for per-collection segments. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml Modified: trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml 2008-12-10 05:02:19 UTC (rev 2658) +++ trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml 2008-12-11 22:21:44 UTC (rev 2659) @@ -134,4 +134,14 @@ <value>1048576</value> </property> +<!-- Enable per-collection segment sub-dirs, e.g. + segments/<collectionId>/segment1 + /segment2 + ... + --> +<property> + <name>nutchwax.FetchedSegments.perCollection</name> + <value>true</value> +</property> + </configuration>
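The effect of `nutchwax.FetchedSegments.perCollection` is a layout change, as the comment in the patch shows. The sketch below is not code from the patch; the method and names are assumptions used only to illustrate the resulting segment paths.

```java
// Illustrative sketch (not from the patch): with per-collection segments
// enabled, segment paths gain a <collectionId> component.
public class SegmentPathDemo {

    static String segmentPath(boolean perCollection, String collectionId, String segment) {
        return perCollection
            ? "segments/" + collectionId + "/" + segment
            : "segments/" + segment;
    }

    public static void main(String[] args) {
        // With the property set to true, as in the nutch-site.xml above:
        System.out.println(segmentPath(true, "web-2008", "segment1"));
        // -> segments/web-2008/segment1
    }
}
```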
From: <bi...@us...> - 2008-12-10 05:02:22
Revision: 2658 http://archive-access.svn.sourceforge.net/archive-access/?rev=2658&view=rev Author: binzino Date: 2008-12-10 05:02:19 +0000 (Wed, 10 Dec 2008) Log Message: ----------- Initial revision. Added Paths: ----------- trunk/archive-access/projects/nutchwax/archive/src/etc/ trunk/archive-access/projects/nutchwax/archive/src/etc/init.d/ trunk/archive-access/projects/nutchwax/archive/src/etc/init.d/searcher-slave trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java Added: trunk/archive-access/projects/nutchwax/archive/src/etc/init.d/searcher-slave =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/etc/init.d/searcher-slave (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/etc/init.d/searcher-slave 2008-12-10 05:02:19 UTC (rev 2658) @@ -0,0 +1,63 @@ +#! /bin/sh +# +# ----------------------------------- +# Initscript for NutchWAX searcher slave +# ----------------------------------- + +set -e + +PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin +DESC="NutchWAX searcher slave" +NAME="searcher-slave" + +DAEMON="/3/search/nutchwax-0.12.2/bin/nutch org.archive.nutchwax.DistributedSearch\$Server 9000 /3/search/deploy" +NUTCH_HOME=/3/search/nutchwax-0.12.2 +JAVA_HOME=/usr +export NUTCH_HEAPSIZE=2500 +PIDFILE=/var/run/$NAME.pid +SCRIPTNAME=/etc/init.d/$NAME + +# Gracefully exit if the package has been removed. 
+test -x /usr/bin/java || exit 0 + +# --------------------------------------- +# Function that starts the daemon/service +# --------------------------------------- +d_start() +{ +start-stop-daemon --start -b -m -c webcrawl:webcrawl --pidfile $PIDFILE --exec $DAEMON +} + +# -------------------------------------- +# Function that stops the daemon/service +# -------------------------------------- +d_stop() +{ +start-stop-daemon --stop --pidfile $PIDFILE +} + +case "$1" in +start) +echo -n "Starting $DESC: $NAME" +d_start +echo "." +;; +stop) +echo -n "Stopping $DESC: $NAME" +d_stop +echo "." +;; +restart|force-reload) +echo -n "Restarting $DESC: $NAME" +d_stop +sleep 1 +d_start +echo "." +;; +*) +echo "Usage: $SCRIPTNAME {start|stop|restart|force-reload}" >&2 +exit 1 +;; +esac + +exit 0 Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/PageRanker.java 2008-12-10 05:02:19 UTC (rev 2658) @@ -0,0 +1,208 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.archive.nutchwax.tools; + +import java.io.*; +import java.util.*; + +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; + +import org.apache.hadoop.io.*; +import org.apache.hadoop.fs.*; +import org.apache.hadoop.mapred.FileAlreadyExistsException; +import org.apache.hadoop.util.*; +import org.apache.hadoop.conf.*; +import org.apache.hadoop.util.ReflectionUtils; + +import org.apache.nutch.crawl.Inlinks; +import org.apache.nutch.util.HadoopFSUtil; +import org.apache.nutch.util.LogUtil; +import org.apache.nutch.util.NutchConfiguration; + +import org.apache.lucene.store.Directory; +import org.apache.lucene.index.IndexWriter; + +/** + * + */ +public class PageRanker extends Configured implements Tool +{ + public static final Log LOG = LogFactory.getLog(PageRanker.class); + + public static final String DONE_NAME = "merge.done"; + + public PageRanker() { + + } + + public PageRanker(Configuration conf) { + setConf(conf); + } + + /** + * Create an index for the input files in the named directory. + */ + public static void main(String[] args) + throws Exception + { + int res = ToolRunner.run(NutchConfiguration.create(), new PageRanker(), args); + System.exit(res); + } + + /** + * + */ + public int run(String[] args) + throws Exception + { + String usage = "Usage: PageRanker [OPTIONS] outputFile <linkdb|paths>\n" + + "Emit PageRank values for URLs in linkDb(s). Suitable for use with\n" + + "PageRank scoring filter.\n" + + "\n" + + "OPTIONS:\n" + + " -p Use exact path as given, don't assume it's a typical\n" + + " linkdb with \"current/part-nnnnn\" subdirs.\n" + + " -t threshold Do not emit records with less than this many inlinks.\n" + + " Default value 10." 
+ ; + if ( args.length < 1 ) + { + System.err.println( "Usage: " + usage ); + return -1; + } + + boolean exactPath = false; + int threshold = 10; + + int pos = 0; + for ( ; pos < args.length && args[pos].charAt(0) == '-' ; pos++ ) + { + if ( args[pos].equals( "-p" ) ) + { + exactPath = true; + } + if ( args[pos].equals( "-t" ) ) + { + pos++; + if ( args.length - pos < 1 ) + { + System.err.println( "Error: missing argument to -t option" ); + return -1; + } + try + { + threshold = Integer.parseInt( args[pos] ); + } + catch ( NumberFormatException nfe ) + { + System.err.println( "Error: bad value for -t option: " + args[pos] ); + return -1; + } + } + } + + Configuration conf = getConf( ); + FileSystem fs = FileSystem.get( conf ); + + if ( pos >= args.length ) + { + System.err.println( "Error: missing outputFile" ); + return -1; + } + + Path outputPath = new Path( args[pos++] ); + if ( fs.exists( outputPath ) ) + { + System.err.println( "Erorr: outputFile already exists: " + outputPath ); + return -1; + } + + PrintWriter output = new PrintWriter( new OutputStreamWriter( fs.create( outputPath ).getWrappedStream( ), "UTF-8" ) ); + + if ( pos >= args.length ) + { + System.err.println( "Error: missing linkdb" ); + return -1; + } + + List<Path> mapfiles = new ArrayList<Path>(); + + // If we are using exact paths, add each one to the list. + // Otherwise, assume the given path is to a linkdb and look for + // <linkdbPath>/current/part-nnnnn sub-dirs. 
+ if ( exactPath ) + { + for ( ; pos < args.length ; pos++ ) + { + mapfiles.add( new Path( args[pos] ) ); + } + } + else + { + FileStatus[] fstats = fs.listStatus( new Path(args[pos]+"/current"), HadoopFSUtil.getPassDirectoriesFilter(fs)); + mapfiles.addAll(Arrays.asList(HadoopFSUtil.getPaths(fstats))); + } + + System.out.println( "mapfiles = " + mapfiles ); + try + { + for ( Path p : mapfiles ) + { + MapFile.Reader reader = new MapFile.Reader( fs, p.toString(), conf ); + + WritableComparable key = (WritableComparable) ReflectionUtils.newInstance( reader.getKeyClass() , conf ); + Writable value = (Writable) ReflectionUtils.newInstance( reader.getValueClass(), conf ); + + while ( reader.next( key, value ) ) + { + if ( key instanceof Text && value instanceof Inlinks ) + { + Text toUrl = (Text) key; + Inlinks inlinks = (Inlinks) value; + + if ( inlinks.size( ) < threshold ) + { + continue ; + } + + String toUrlString = toUrl.toString( ); + + // HACK: Should make this into some externally configurable regex. + if ( toUrlString.startsWith( "http" ) ) + { + output.println( inlinks.size( ) + " " + toUrl.toString() ); + } + } + } + } + + return 0; + } + catch ( Exception e ) + { + LOG.fatal( "PageRanker: " + StringUtils.stringifyException( e ) ); + return -1; + } + finally + { + output.flush( ); + output.close( ); + } + } +} This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
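The core of PageRanker's output loop above is small: skip records below the `-t` inlink threshold, drop non-http keys (the "HACK" scheme filter in the tool), and emit `<count> <url>` lines. A minimal sketch of that step, with plain collections standing in for the Hadoop MapFile iteration:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of PageRanker's emit step (collection types stand in for
// the MapFile.Reader loop in the real tool).
public class PageRankEmitDemo {

    static List<String> emit(Map<String, Integer> inlinkCounts, int threshold) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : inlinkCounts.entrySet()) {
            if (e.getValue() < threshold) continue;       // below -t threshold
            if (!e.getKey().startsWith("http")) continue; // scheme filter
            out.add(e.getValue() + " " + e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("http://example.org/", 25);
        counts.put("http://example.org/rare", 3);
        for (String line : emit(counts, 10)) {
            System.out.println(line); // prints "25 http://example.org/"
        }
    }
}
```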
Revision: 2657 http://archive-access.svn.sourceforge.net/archive-access/?rev=2657&view=rev Author: binzino Date: 2008-12-10 05:01:14 +0000 (Wed, 10 Dec 2008) Log Message: ----------- Removed use of floor() in calculating the boost multiplier. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/plugin/scoring-nutchwax/src/java/org/archive/nutchwax/scoring/PageRankScoringFilter.java Modified: trunk/archive-access/projects/nutchwax/archive/src/plugin/scoring-nutchwax/src/java/org/archive/nutchwax/scoring/PageRankScoringFilter.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/plugin/scoring-nutchwax/src/java/org/archive/nutchwax/scoring/PageRankScoringFilter.java 2008-12-10 04:59:10 UTC (rev 2656) +++ trunk/archive-access/projects/nutchwax/archive/src/plugin/scoring-nutchwax/src/java/org/archive/nutchwax/scoring/PageRankScoringFilter.java 2008-12-10 05:01:14 UTC (rev 2657) @@ -56,17 +56,14 @@ * </p><p> * Applies a simple log10 multipler to the document score based on the * base-10 log value of the number of inlinks. For example, a page with - * 13,032 inlinks will have a score/boost of 5. The actual formula is + * 13,032 inlinks will have a score/boost of 5.115. The actual formula is * </p> * <code> - * initialScore *= ( floor( log10( # inlinks ) ) + 1 ) + * newScore = initialScore * ( log10( # inlinks ) + 1 ) * </code> * <p> - * We use floor() to get an integer value from the log10() function - * since we're only interested in order of magnitude. We then add 1 - * so that a page with < 10 inlins will have a multipler of 1, and - * thus stay the same, 10-100 gets a multipler of 2, 100-1000 is 3, and - * so forth. + * We add the extra 1 for pages with only 1 inlink since log10(1)=0 and we + * don't want a 0 multiplier.
* </p> * <p> * The number of inlinks for a page is not taken from the <code>inlinks</code> @@ -115,8 +112,6 @@ public void setConf( Configuration conf ) { this.conf = conf; - - //this.ranks = getPageRanks( conf ); } public void injectedScore(Text url, CrawlDatum datum) @@ -181,7 +176,7 @@ return initScore; } - String keyParts[] = key.toString( ).split( "\\s+" ); + String keyParts[] = key.toString( ).split( "\\s+", 2 ); if ( keyParts.length != 2 ) { @@ -201,7 +196,7 @@ return initScore; } - float newScore = initScore * (float) ( Math.floor( Math.log( rank ) ) + 1 ); + float newScore = initScore * (float) ( Math.log( rank ) + 1 ); LOG.info( "PageRankScoringFilter: initScore = " + newScore + " ; key = " + key ); This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
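The documented multiplier from the patch above, as standalone code. One discrepancy worth flagging: the updated javadoc says log10, but the patched line calls `Math.log`, which is the natural log in Java. This sketch follows the documented base-10 form.

```java
// newScore = initScore * (log10(#inlinks) + 1), per the javadoc in the
// commit above (the patched code itself uses Math.log, the natural log).
public class PageRankBoostDemo {

    static float boost(float initScore, long inlinks) {
        return initScore * (float) (Math.log10(inlinks) + 1);
    }

    public static void main(String[] args) {
        // 13,032 inlinks -> multiplier log10(13032) + 1, about 5.115,
        // matching the example in the updated javadoc.
        System.out.printf("%.3f%n", boost(1.0f, 13032));
    }
}
```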
From: <bi...@us...> - 2008-12-10 04:59:14
Revision: 2656 http://archive-access.svn.sourceforge.net/archive-access/?rev=2656&view=rev Author: binzino Date: 2008-12-10 04:59:10 +0000 (Wed, 10 Dec 2008) Log Message: ----------- Fixed bug to pass back return code of invoked command. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/bin/nutchwax Modified: trunk/archive-access/projects/nutchwax/archive/bin/nutchwax =================================================================== --- trunk/archive-access/projects/nutchwax/archive/bin/nutchwax 2008-12-10 04:58:24 UTC (rev 2655) +++ trunk/archive-access/projects/nutchwax/archive/bin/nutchwax 2008-12-10 04:59:10 UTC (rev 2656) @@ -62,4 +62,5 @@ ;; esac -exit 0 +# Return the exit code of the command invoked. +exit $?
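The pattern this commit fixes is easy to reproduce. A wrapper script whose last line is a hard-coded `exit 0` reports success even when the command it invoked failed; ending with `exit $?` immediately after the command propagates the real status. The function names below are illustrative, not from the `nutchwax` script.

```shell
#!/bin/sh
# Sketch of the wrapper bug fixed above (illustrative names).

broken_wrapper() (
  false     # stands in for the invoked nutchwax command failing
  exit 0    # bug: masks the failure
)

fixed_wrapper() (
  false
  exit $?   # fix: return the exit code of the command invoked
)

broken_wrapper && echo "broken wrapper reported success despite the failure"
fixed_wrapper || echo "fixed wrapper propagated failure status $?"
```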