From: <bra...@us...> - 2009-07-17 23:09:30
Revision: 2757
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2757&view=rev
Author:   bradtofel
Date:     2009-07-17 23:09:28 +0000 (Fri, 17 Jul 2009)

Log Message:
-----------
1.4.2 release

Added Paths:
-----------
    branches/wayback-1_4_2/

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
From: <bi...@us...> - 2009-07-09 17:34:59
Revision: 2754 http://archive-access.svn.sourceforge.net/archive-access/?rev=2754&view=rev Author: binzino Date: 2009-07-09 17:34:57 +0000 (Thu, 09 Jul 2009) Log Message: ----------- Updated for 0.12.6 release. Modified Paths: -------------- tags/nutchwax-0_12_6/archive/BUILD-NOTES.txt tags/nutchwax-0_12_6/archive/HOWTO.txt tags/nutchwax-0_12_6/archive/INSTALL.txt tags/nutchwax-0_12_6/archive/README.txt tags/nutchwax-0_12_6/archive/RELEASE-NOTES.txt Modified: tags/nutchwax-0_12_6/archive/BUILD-NOTES.txt =================================================================== --- tags/nutchwax-0_12_6/archive/BUILD-NOTES.txt 2009-07-09 00:50:58 UTC (rev 2753) +++ tags/nutchwax-0_12_6/archive/BUILD-NOTES.txt 2009-07-09 17:34:57 UTC (rev 2754) @@ -1,6 +1,6 @@ BUILD-NOTES.txt -2009-06-25 +2009-07-09 Aaron Binns ====================================================================== @@ -79,7 +79,7 @@ ---------------------------------------------------------------------- The file - /opt/nutchwax-0.12.5/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.6/conf/tika-mimetypes.xml contains two errors: one where a mimetype is referenced before it is defined; and a second where a definition has an illegal character. 
@@ -110,11 +110,11 @@ You can either apply these patches yourself, or copy an already-patched copy from: - /opt/nutchwax-0.12.5/contrib/archive/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.6/contrib/archive/conf/tika-mimetypes.xml to - /opt/nutchwax-0.12.5/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.6/conf/tika-mimetypes.xml ---------------------------------------------------------------------- Modified: tags/nutchwax-0_12_6/archive/HOWTO.txt =================================================================== --- tags/nutchwax-0_12_6/archive/HOWTO.txt 2009-07-09 00:50:58 UTC (rev 2753) +++ tags/nutchwax-0_12_6/archive/HOWTO.txt 2009-07-09 17:34:57 UTC (rev 2754) @@ -1,6 +1,6 @@ HOWTO.txt -2009-06-25 +2009-07-09 Aaron Binns Table of Contents @@ -26,7 +26,7 @@ This HOWTO assumes it is installed in - /opt/nutchwax-0.12.5 + /opt/nutchwax-0.12.6 2. ARC/WARC files. @@ -68,10 +68,10 @@ $ mkdir crawl $ cd crawl - $ /opt/nutchwax-0.12.5/bin/nutchwax import ../manifest - $ /opt/nutchwax-0.12.5/bin/nutch updatedb crawldb -dir segments - $ /opt/nutchwax-0.12.5/bin/nutch invertlinks linkdb -dir segments - $ /opt/nutchwax-0.12.5/bin/nutch index indexes crawldb linkdb segments/* + $ /opt/nutchwax-0.12.6/bin/nutchwax import ../manifest + $ /opt/nutchwax-0.12.6/bin/nutch updatedb crawldb -dir segments + $ /opt/nutchwax-0.12.6/bin/nutch invertlinks linkdb -dir segments + $ /opt/nutchwax-0.12.6/bin/nutch index indexes crawldb linkdb segments/* $ ls -F1 crawldb/ indexes/ @@ -96,7 +96,7 @@ $ cd ../ $ ls -F1 crawl/ - $ /opt/nutchwax-0.12.5/bin/nutch org.archive.nutchwax.NutchWaxBean computer + $ /opt/nutchwax-0.12.6/bin/nutchwax search computer This calls the NutchWaxBean to execute a simple keyword search for "computer". 
Use whatever query term you think appears in the @@ -109,7 +109,7 @@ The Nutch(WAX) web application is bundled with NutchWAX as - /opt/nutchwax-0.12.5/nutch-1.0-dev.war + /opt/nutchwax-0.12.6/nutch-1.0-dev.war Simply deploy that web application in the same fashion as with Nutch. Modified: tags/nutchwax-0_12_6/archive/INSTALL.txt =================================================================== --- tags/nutchwax-0_12_6/archive/INSTALL.txt 2009-07-09 00:50:58 UTC (rev 2753) +++ tags/nutchwax-0_12_6/archive/INSTALL.txt 2009-07-09 17:34:57 UTC (rev 2754) @@ -1,6 +1,6 @@ INSTALL.txt -2009-06-25 +2009-07-09 Aaron Binns Table of Contents @@ -63,10 +63,10 @@ ------------------ As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev. Although the Nutch project released 1.0 in early 2009, there were so -many changes that NutchWAX 0.12.5 is still built against pre-1.0 +many changes that NutchWAX 0.12.6 is still built against pre-1.0 codebase. -The specific SVN revision that NutchWAX 0.12.5 is built against is: +The specific SVN revision that NutchWAX 0.12.6 is built against is: 701524 @@ -81,14 +81,14 @@ SVN: NutchWAX ------------- -Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.5 +Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.6 source into Nutch's "contrib" directory. $ cd contrib - $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_5/archive + $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_6/archive This will create a sub-directory named "archive" containing the -NutchWAX 0.12.5 sources. +NutchWAX 0.12.6 sources. 
Build and install ----------------- @@ -115,7 +115,7 @@ $ cd /opt $ tar xvfz nutch-1.0-dev.tar.gz - $ mv nutch-1.0-dev nutchwax-0.12.5 + $ mv nutch-1.0-dev nutchwax-0.12.6 ====================================================================== @@ -128,24 +128,24 @@ Install it simply by untarring it, for example: $ cd /opt - $ tar xvfz nutchwax-0.12.5.tar.gz + $ tar xvfz nutchwax-0.12.6.tar.gz ====================================================================== Install start-up scripts ====================================================================== -NutchWAX 0.12.5 comes with a Unix init.d script which can be used to +NutchWAX 0.12.6 comes with a Unix init.d script which can be used to automatically start the searcher slaves for a multi-node search configuration. Assuming you installed NutchWAX as - /opt/nutchwax-0.12.5 + /opt/nutchwax-0.12.6 the script is found at - /opt/nutchwax-0.12.5/contrib/archive/etc/init.d/searcher-slave + /opt/nutchwax-0.12.6/contrib/archive/etc/init.d/searcher-slave This script can be placed in /etc/init.d then added to the list of startup scripts to run at bootup by using commands appropriate to your Modified: tags/nutchwax-0_12_6/archive/README.txt =================================================================== --- tags/nutchwax-0_12_6/archive/README.txt 2009-07-09 00:50:58 UTC (rev 2753) +++ tags/nutchwax-0_12_6/archive/README.txt 2009-07-09 17:34:57 UTC (rev 2754) @@ -1,6 +1,6 @@ README.txt -2009-06-25 +2009-07-09 Aaron Binns Table of Contents @@ -13,7 +13,7 @@ Introduction ====================================================================== -Welcome to NutchWAX 0.12.5! +Welcome to NutchWAX 0.12.6! NutchWAX is a set of add-ons to Nutch in order to index and search archived web data. 
Modified: tags/nutchwax-0_12_6/archive/RELEASE-NOTES.txt =================================================================== --- tags/nutchwax-0_12_6/archive/RELEASE-NOTES.txt 2009-07-09 00:50:58 UTC (rev 2753) +++ tags/nutchwax-0_12_6/archive/RELEASE-NOTES.txt 2009-07-09 17:34:57 UTC (rev 2754) @@ -1,9 +1,9 @@ RELEASE-NOTES.TXT -2009-06-25 +2009-07-09 Aaron Binns -Release notes for NutchWAX 0.12.5 +Release notes for NutchWAX 0.12.6 For the most recent updates and information on NutchWAX, please visit the project wiki at: @@ -15,74 +15,44 @@ Overview ====================================================================== -NutchWAX 0.12.5 contains numerous enhancements and fixes to 0.12.4 +NutchWAX 0.12.6 contains a few convenient enhancements to 0.12.5 - o Command-line options for NutchWaxBean to configure number of - results to emit and how many hits per site to allow. + o Addition of 'search' and 'merge' commands to the 'nutchwax' + command-line driver. Now one can do - o Change default configuration to use NutchWAX indexing and query - filters instead of Nutch-provided ones. This give more consistent - control over indexing and query behavior. + nutchwax search foo - o No longer store the unique document key (URL+digest) in a separate - field in the index. Since the URL and digest are stored, just use - them to synthesize the unique document key as needed. + instead of - o Trimmed down the default configuration of indexing and query - filters to only store and index the minimum information needed for - typical NutchWAX installations. + nutch org.archive.nutchwax.NutchWaxBean foo + Similarly, the new NutchWAX index merging, which supports + parallel indexes, can be invoked via -====================================================================== -Configuration changes -====================================================================== + nutchwax merge output-index input-index... 
-As mentioned in the overview, NutchWAX 0.12.5 has some important -changes to the default configuration. + o Merging of parallel indexes into a single index. -Previously, the indexing and query filter configuration utilized a -combination of filters from Nutch and NutchWAX. This was in line with -our goal of NutchWAX being a set of add-ons to Nutch. + NutchWAX has a copy/paste/enhanced version of the Nutch index + merger that now supports parallel indexes. This allows parallel + indexes to be merged into a single index. To use this feature, + add the "-p" option to the NutchWAX 'merge' command indicating the + input index directories contain parallel index sub-dirs. -However, in practice, the mixing of these filters often lead to -confusion since the NutchWAX filters could be configured via -properties in the Nutch configuration files whereas the Nutch filters -were hard-coded and less powerful. + nutchwax merge -p output-index input-pindexes... -Now, all the Nutch indexing filters have been removed and are replaced -with the single NutchWAX indexing filter. Similarly, all but one -Nutch query filter are removed, replaced by the configurable NutchWAX -query filter. We do retain the Nutch 'query-basic' filter as it -contains the logic for automatically applying a query to multiple -fields with proportionate weights; something not subsumed by the -NutchWAX query filter. + o Option to specify the directory where the index(es) and segments + live when doing a command-line search. + Previously the directory was obtained from the nutch-default.xml + configuration file. This is inconvenient when testing different + indexes as one would have to edit the config file each time to + specify a different index to search. -In addition to removing the Nutch filters, the NutchWAX index and -query filters are streamlined to only index and store the minimum set -of metadata fields for typical deployments. 
+ Now, the directory can be specified on the command line: -In previous versions of NutchWAX, the indexing filters were configured -to index and store nearly every piece of metadata available. Although -this seems desirable, it adds a lot of storage overhead to the index, -and can hamper run-time query speed just by having unnecessary -information in the index (more junk for the disk to seek around). + nutchwax search -d <dir> <query> -The NutchWAX 0.12.5 configuration omits the typically unnecessary -metadata fields from the index and only indexes those fields we think -are needed for typical searches. - -For example, while we do store the digest, we do not index it as it's -very unusual for someone to search for a document with a specific -SHA-1 digest value. You could decide you want that, in which case you -can edit the configuration and re-index the data. You would have to -correspondingly edit the query filter and its configuration to allow -for searching on that field as well. - -We have found that this streamlined indexing configuration yields -Lucene indexes about 25% smaller than with NutchWAX 0.12.4. - - ====================================================================== Issues ====================================================================== @@ -93,16 +63,9 @@ Issues resolved in this release: -WAX-45 Add ability to store but not index a field via - ConfigurableIndexingFilter. +WAX-51 Enhance index merging to combine parallel indexes. -WAX-46 Add option to DumpParallelIndex to output only single field. +WAX-52 Add option to NutchWaxBean to specify directory where + index+segments are to be found. -WAX-47 Stop storing document key in "orig" field in index, synthesize - it as needed from the "url" and "digest" fields. - -WAX-48 Use NutchWAX configurable query filter for site and url fields. - -WAX-49 Add "hitsPerSite" option to NutchWaxBean. - -WAX-50 Add "num hits to find" option to NutchWaxBean. 
+WAX-53 IndexMerging parallel indexes fails when index is empty.
From: <bi...@us...> - 2009-07-09 00:51:02
Revision: 2753
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2753&view=rev
Author:   binzino
Date:     2009-07-09 00:50:58 +0000 (Thu, 09 Jul 2009)

Log Message:
-----------
Fix WAX-53. Added check for empty fieldToReader.

Modified Paths:
--------------
    tags/nutchwax-0_12_6/archive/src/java/org/apache/lucene/index/ArchiveParallelReader.java

Modified: tags/nutchwax-0_12_6/archive/src/java/org/apache/lucene/index/ArchiveParallelReader.java
===================================================================
--- tags/nutchwax-0_12_6/archive/src/java/org/apache/lucene/index/ArchiveParallelReader.java	2009-07-07 22:07:17 UTC (rev 2752)
+++ tags/nutchwax-0_12_6/archive/src/java/org/apache/lucene/index/ArchiveParallelReader.java	2009-07-09 00:50:58 UTC (rev 2753)
@@ -472,6 +472,8 @@
     private TermEnum termEnum;

     public ParallelTermEnum() throws IOException {
+      if ( fieldToReader.isEmpty( ) ) return ;
+
       field = (String)fieldToReader.firstKey();
       if (field != null)
         termEnum = ((IndexReader)fieldToReader.get(field)).terms();
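The WAX-53 fix above works because `TreeMap.firstKey()` throws `NoSuchElementException` on an empty map, so the constructor must bail out before touching `fieldToReader`. A minimal, self-contained sketch of that guard pattern (class and method names here are hypothetical, not NutchWAX code):

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrates the guard added in r2753: calling firstKey() on an empty
// SortedMap throws NoSuchElementException, so check isEmpty() first.
public class ParallelTermGuard {

    // Returns the first field name in sorted order, or null when the map
    // is empty (mirrors the early "return" added to ParallelTermEnum).
    static String firstFieldOrNull(SortedMap<String, ?> fieldToReader) {
        if (fieldToReader.isEmpty()) {
            return null; // without this, firstKey() would throw
        }
        return fieldToReader.firstKey();
    }

    public static void main(String[] args) {
        SortedMap<String, String> empty = new TreeMap<>();
        System.out.println(firstFieldOrNull(empty));   // prints "null", no exception

        SortedMap<String, String> m = new TreeMap<>();
        m.put("url", "readerA");
        m.put("content", "readerB");
        System.out.println(firstFieldOrNull(m));       // prints "content" (sorted order)
    }
}
```

An empty parallel index produces exactly the empty-map case, which is why merging previously failed.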
From: <bi...@us...> - 2009-07-07 22:07:25
Revision: 2752 http://archive-access.svn.sourceforge.net/archive-access/?rev=2752&view=rev Author: binzino Date: 2009-07-07 22:07:17 +0000 (Tue, 07 Jul 2009) Log Message: ----------- WAX-52. Added -d <dir> option to NutchWaxBean. Also added commands for index merging and searching to the 'nutchwax' script. Modified Paths: -------------- tags/nutchwax-0_12_6/archive/bin/nutchwax tags/nutchwax-0_12_6/archive/src/java/org/archive/nutchwax/NutchWaxBean.java Modified: tags/nutchwax-0_12_6/archive/bin/nutchwax =================================================================== --- tags/nutchwax-0_12_6/archive/bin/nutchwax 2009-07-07 21:53:03 UTC (rev 2751) +++ tags/nutchwax-0_12_6/archive/bin/nutchwax 2009-07-07 22:07:17 UTC (rev 2752) @@ -50,22 +50,30 @@ shift ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.PageRankDbMerger $@ ;; + pageranker) + shift + ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.PageRanker $@ + ;; + parsetextmerger) + shift + ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.ParseTextCombiner $@ + ;; add-dates) shift ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.DateAdder $@ ;; + merge) + shift + ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.IndexMerger $@ + ;; dumpindex) shift ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.DumpParallelIndex $@ ;; - pageranker) + search) shift - ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.PageRanker $@ + ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.NutchWaxBean $@ ;; - parsetextmerger) - shift - ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.ParseTextCombiner $@ - ;; *) echo "" echo "Usage: nutchwax COMMAND" @@ -76,7 +84,9 @@ echo " pageranker Generate pagerank.txt file from 'pagerankdb's or 'linkdb's" echo " parsetextmerger Merge segement parse_text/part-nnnnn directories." 
echo " add-dates Add dates to a parallel index" + echo " merge Merge indexes or parallel indexes" echo " dumpindex Dump an index or set of parallel indices to stdout" + echo " search Query a search index" echo "" exit 1 ;; Modified: tags/nutchwax-0_12_6/archive/src/java/org/archive/nutchwax/NutchWaxBean.java =================================================================== --- tags/nutchwax-0_12_6/archive/src/java/org/archive/nutchwax/NutchWaxBean.java 2009-07-07 21:53:03 UTC (rev 2751) +++ tags/nutchwax-0_12_6/archive/src/java/org/archive/nutchwax/NutchWaxBean.java 2009-07-07 22:07:17 UTC (rev 2752) @@ -254,6 +254,7 @@ String usage = "NutchWaxBean [options] query" + "\n\t-h <n> Hits per site" + "\n\t-n <n> Number of results to find" + + "\n\t-d <dir> Search directory" + + "\n"; if ( args.length == 0 ) @@ -263,6 +264,7 @@ } String queryString = args[args.length - 1]; + String searchDir = null; int hitsPerSite = 0; int numHits = 10; for ( int i = 0 ; i < args.length - 1 ; i++ ) @@ -279,6 +281,11 @@ i++; numHits = Integer.parseInt( args[i] ); } + if ( "-d".equals( args[i] ) ) + { + i++; + searchDir = args[i]; + } } catch ( NumberFormatException nfe ) { @@ -290,9 +297,15 @@ Configuration conf = NutchConfiguration.create(); + if ( searchDir != null ) + { + conf.set( "searcher.dir", searchDir ); + } NutchBean bean = new NutchBean(conf); NutchBeanModifier.modify( bean ); + System.out.println( "Searching in directory: " + conf.get( "searcher.dir" ) ); + Query query = Query.parse(queryString, conf); System.out.println("Hits per site: " + hitsPerSite); Hits hits = bean.search(query, numHits, hitsPerSite);
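The NutchWaxBean diff in this commit uses a common CLI pattern: the last argument is the query, and the preceding `-h <n>`, `-n <n>` and `-d <dir>` flags are consumed pairwise in a single loop. A standalone sketch of that parsing logic (the `SearchOptions` class is illustrative, not part of NutchWAX):

```java
// Minimal sketch of the option-parsing pattern NutchWaxBean uses:
// the query is the final argument; earlier flags each consume one value.
public class SearchOptions {
    String searchDir = null; // -d <dir>, falls back to searcher.dir config
    int hitsPerSite = 0;     // -h <n>, 0 means unlimited
    int numHits = 10;        // -n <n>, default 10 results
    String query;

    static SearchOptions parse(String[] args) {
        SearchOptions o = new SearchOptions();
        o.query = args[args.length - 1];
        // Stop before the last argument: it is the query, not a flag.
        for (int i = 0; i < args.length - 1; i++) {
            if ("-h".equals(args[i])) {
                o.hitsPerSite = Integer.parseInt(args[++i]);
            } else if ("-n".equals(args[i])) {
                o.numHits = Integer.parseInt(args[++i]);
            } else if ("-d".equals(args[i])) {
                o.searchDir = args[++i];
            }
        }
        return o;
    }

    public static void main(String[] args) {
        SearchOptions o = parse(new String[] { "-d", "/tmp/idx", "-n", "20", "computer" });
        // prints "/tmp/idx 20 0 computer"
        System.out.println(o.searchDir + " " + o.numHits + " " + o.hitsPerSite + " " + o.query);
    }
}
```

As in the real patch, a non-null `-d` value would then override the `searcher.dir` configuration property before the bean is constructed.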
From: <bi...@us...> - 2009-07-07 21:53:07
Revision: 2751 http://archive-access.svn.sourceforge.net/archive-access/?rev=2751&view=rev Author: binzino Date: 2009-07-07 21:53:03 +0000 (Tue, 07 Jul 2009) Log Message: ----------- WAX-51. Copy/paste/enhance Nutch's IndexMerger to support merging of parallel indices. Added Paths: ----------- tags/nutchwax-0_12_6/archive/src/java/org/archive/nutchwax/IndexMerger.java Added: tags/nutchwax-0_12_6/archive/src/java/org/archive/nutchwax/IndexMerger.java =================================================================== --- tags/nutchwax-0_12_6/archive/src/java/org/archive/nutchwax/IndexMerger.java (rev 0) +++ tags/nutchwax-0_12_6/archive/src/java/org/archive/nutchwax/IndexMerger.java 2009-07-07 21:53:03 UTC (rev 2751) @@ -0,0 +1,211 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.archive.nutchwax; + +import java.io.*; +import java.util.*; + +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; + +import org.apache.hadoop.fs.*; +import org.apache.hadoop.mapred.FileAlreadyExistsException; +import org.apache.hadoop.util.*; +import org.apache.hadoop.conf.*; + +import org.apache.nutch.util.HadoopFSUtil; +import org.apache.nutch.util.LogUtil; +import org.apache.nutch.util.NutchConfiguration; +import org.apache.nutch.indexer.NutchSimilarity; +import org.apache.nutch.indexer.FsDirectory; + +import org.apache.lucene.store.Directory; +import org.apache.lucene.index.IndexWriter; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.ArchiveParallelReader; + +/************************************************************************* + * IndexMerger creates an index for the output corresponding to a + * single fetcher run. + * + * @author Doug Cutting + * @author Mike Cafarella + *************************************************************************/ +public class IndexMerger extends Configured implements Tool { + public static final Log LOG = LogFactory.getLog(IndexMerger.class); + + public static final String DONE_NAME = "merge.done"; + + public IndexMerger() { + + } + + public IndexMerger(Configuration conf) { + setConf(conf); + } + + /** + * Merge all input indexes to the single output index + */ + public void merge(IndexReader[] readers, Path outputIndex, Path localWorkingDir, boolean parallel) throws IOException { + LOG.info("merging indexes to: " + outputIndex); + + FileSystem localFs = FileSystem.getLocal(getConf()); + if (localFs.exists(localWorkingDir)) { + localFs.delete(localWorkingDir, true); + } + localFs.mkdirs(localWorkingDir); + + // Get local output target + // + FileSystem fs = FileSystem.get(getConf()); + if (fs.exists(outputIndex)) { + throw new FileAlreadyExistsException("Output directory " + outputIndex + " already exists!"); + } + + Path tmpLocalOutput 
= new Path(localWorkingDir, "merge-output"); + Path localOutput = fs.startLocalOutput(outputIndex, tmpLocalOutput); + + // + // Merge indices + // + IndexWriter writer = new IndexWriter(localOutput.toString(), null, true); + writer.setMergeFactor(getConf().getInt("indexer.mergeFactor", IndexWriter.DEFAULT_MERGE_FACTOR)); + writer.setMaxBufferedDocs(getConf().getInt("indexer.minMergeDocs", IndexWriter.DEFAULT_MAX_BUFFERED_DOCS)); + writer.setMaxMergeDocs(getConf().getInt("indexer.maxMergeDocs", IndexWriter.DEFAULT_MAX_MERGE_DOCS)); + writer.setTermIndexInterval(getConf().getInt("indexer.termIndexInterval", IndexWriter.DEFAULT_TERM_INDEX_INTERVAL)); + writer.setInfoStream(LogUtil.getDebugStream(LOG)); + writer.setUseCompoundFile(false); + writer.setSimilarity(new NutchSimilarity()); + writer.addIndexes(readers); + writer.close(); + + // + // Put target back + // + fs.completeLocalOutput(outputIndex, tmpLocalOutput); + LOG.info("done merging"); + } + + /** + * Create an index for the input files in the named directory. 
+ */ + public static void main(String[] args) throws Exception { + int res = ToolRunner.run(NutchConfiguration.create(), new IndexMerger(), args); + System.exit(res); + } + + public int run(String[] args) throws Exception { + String usage = "IndexMerger [-workingdir <workingdir>] [-p] outputIndex indexesDir...\n\t-p Input directories contain parallel indexes.\n"; + if (args.length < 2) + { + System.err.println("Usage: " + usage); + return -1; + } + + // + // Parse args, read all index directories to be processed + // + FileSystem fs = FileSystem.get(getConf()); + List<Path> indexDirs = new ArrayList<Path>(); + + Path workDir = new Path("indexmerger-" + System.currentTimeMillis()); + int i = 0; + + boolean parallel=false; + + while ( args[i].startsWith( "-" ) ) + { + if ( "-workingdir".equals(args[i]) ) + { + i++; + workDir = new Path(args[i++], "indexmerger-" + System.currentTimeMillis()); + } + else if ( "-p".equals(args[i]) ) + { + i++; + parallel=true; + } + } + + Path outputIndex = new Path(args[i++]); + + List<IndexReader> readers = new ArrayList<IndexReader>( ); + + if ( ! parallel ) + { + for (; i < args.length; i++) + { + FileStatus[] fstats = fs.listStatus(new Path(args[i]), HadoopFSUtil.getPassDirectoriesFilter(fs)); + + for ( Path p : HadoopFSUtil.getPaths(fstats) ) + { + LOG.info( "Adding reader for: " + p ); + readers.add( IndexReader.open( new FsDirectory( fs, p, false, getConf( ) ) ) ); + } + } + } + else + { + for (; i < args.length; i++) + { + FileStatus[] fstats = fs.listStatus(new Path(args[i]), HadoopFSUtil.getPassDirectoriesFilter(fs)); + Path parallelDirs[] = HadoopFSUtil.getPaths( fstats ); + + if ( parallelDirs.length < 1 ) + { + LOG.info( "No sub-directories, skipping: " + args[i] ); + + continue; + } + else + { + LOG.info( "Adding parallel reader for: " + args[i] ); + } + + ArchiveParallelReader preader = new ArchiveParallelReader( ); + + // Sort the parallelDirs so that we add them in order. Order + // matters to the ParallelReader. 
+ Arrays.sort( parallelDirs ); + + for ( Path p : parallelDirs ) + { + LOG.info( "  Adding to parallel reader: " + p.getName( ) ); + preader.add( IndexReader.open( new FsDirectory( fs, p, false, getConf( ) ) ) ); + } + + readers.add( preader ); + } + } + + // + // Merge the indices + // + + try { + merge(readers.toArray(new IndexReader[readers.size()]), outputIndex, workDir, parallel); + return 0; + } catch (Exception e) { + LOG.fatal("IndexMerger: " + StringUtils.stringifyException(e)); + return -1; + } finally { + FileSystem.getLocal(getConf()).delete(workDir, true); + } + } +}
From: <bi...@us...> - 2009-07-07 19:47:28
Revision: 2750
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2750&view=rev
Author:   binzino
Date:     2009-07-07 19:47:24 +0000 (Tue, 07 Jul 2009)

Log Message:
-----------
Created NutchWAX 0.12.6 tag/branch from 0.12.5.

Added Paths:
-----------
    tags/nutchwax-0_12_6/
From: <bi...@us...> - 2009-06-27 01:22:21
Revision: 2749
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2749&view=rev
Author:   binzino
Date:     2009-06-27 00:17:23 +0000 (Sat, 27 Jun 2009)

Log Message:
-----------
Changed default to index the URL field.

Modified Paths:
--------------
    tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt
    tags/nutchwax-0_12_5/archive/src/nutch/conf/nutch-site.xml

Modified: tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt
===================================================================
--- tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt	2009-06-25 23:02:50 UTC (rev 2748)
+++ tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt	2009-06-27 00:17:23 UTC (rev 2749)
@@ -249,7 +249,7 @@
       content:false:false:tokenized
       site:false:false:untokenized

-      url:false:true:no
+      url:false:true:tokenized
       digest:false:true:no

       collection:true:true:no_norms

Modified: tags/nutchwax-0_12_5/archive/src/nutch/conf/nutch-site.xml
===================================================================
--- tags/nutchwax-0_12_5/archive/src/nutch/conf/nutch-site.xml	2009-06-25 23:02:50 UTC (rev 2748)
+++ tags/nutchwax-0_12_5/archive/src/nutch/conf/nutch-site.xml	2009-06-27 00:17:23 UTC (rev 2749)
@@ -47,7 +47,7 @@
       content:false:false:tokenized
       site:false:false:untokenized

-      url:false:true:no
+      url:false:true:tokenized
       digest:false:true:no

       collection:true:true:no_norms
From: <bi...@us...> - 2009-06-25 23:02:56
Revision: 2748 http://archive-access.svn.sourceforge.net/archive-access/?rev=2748&view=rev Author: binzino Date: 2009-06-25 23:02:50 +0000 (Thu, 25 Jun 2009) Log Message: ----------- Changed version from 0.12.4 to 0.12.5. Modified Paths: -------------- tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt tags/nutchwax-0_12_5/archive/HOWTO.txt Modified: tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt =================================================================== --- tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt 2009-06-25 22:00:14 UTC (rev 2747) +++ tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt 2009-06-25 23:02:50 UTC (rev 2748) @@ -79,7 +79,7 @@ ---------------------------------------------------------------------- The file - /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.5/conf/tika-mimetypes.xml contains two errors: one where a mimetype is referenced before it is defined; and a second where a definition has an illegal character. @@ -110,11 +110,11 @@ You can either apply these patches yourself, or copy an already-patched copy from: - /opt/nutchwax-0.12.4/contrib/archive/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.5/contrib/archive/conf/tika-mimetypes.xml to - /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.5/conf/tika-mimetypes.xml ---------------------------------------------------------------------- Modified: tags/nutchwax-0_12_5/archive/HOWTO.txt =================================================================== --- tags/nutchwax-0_12_5/archive/HOWTO.txt 2009-06-25 22:00:14 UTC (rev 2747) +++ tags/nutchwax-0_12_5/archive/HOWTO.txt 2009-06-25 23:02:50 UTC (rev 2748) @@ -68,10 +68,10 @@ $ mkdir crawl $ cd crawl - $ /opt/nutchwax-0.12.4/bin/nutchwax import ../manifest - $ /opt/nutchwax-0.12.4/bin/nutch updatedb crawldb -dir segments - $ /opt/nutchwax-0.12.4/bin/nutch invertlinks linkdb -dir segments - $ /opt/nutchwax-0.12.4/bin/nutch index indexes crawldb linkdb segments/* + $ /opt/nutchwax-0.12.5/bin/nutchwax import ../manifest + $ 
/opt/nutchwax-0.12.5/bin/nutch updatedb crawldb -dir segments + $ /opt/nutchwax-0.12.5/bin/nutch invertlinks linkdb -dir segments + $ /opt/nutchwax-0.12.5/bin/nutch index indexes crawldb linkdb segments/* $ ls -F1 crawldb/ indexes/ @@ -96,7 +96,7 @@ $ cd ../ $ ls -F1 crawl/ - $ /opt/nutchwax-0.12.4/bin/nutch org.archive.nutchwax.NutchWaxBean computer + $ /opt/nutchwax-0.12.5/bin/nutch org.archive.nutchwax.NutchWaxBean computer This calls the NutchWaxBean to execute a simple keyword search for "computer". Use whatever query term you think appears in the @@ -109,7 +109,7 @@ The Nutch(WAX) web application is bundled with NutchWAX as - /opt/nutchwax-0.12.4/nutch-1.0-dev.war + /opt/nutchwax-0.12.5/nutch-1.0-dev.war Simply deploy that web application in the same fashion as with Nutch.
From: <bi...@us...> - 2009-06-25 22:00:15
Revision: 2747 http://archive-access.svn.sourceforge.net/archive-access/?rev=2747&view=rev Author: binzino Date: 2009-06-25 22:00:14 +0000 (Thu, 25 Jun 2009) Log Message: ----------- Updated for 0.12.5 release. Modified Paths: -------------- tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt tags/nutchwax-0_12_5/archive/HOWTO-xslt.txt tags/nutchwax-0_12_5/archive/HOWTO.txt tags/nutchwax-0_12_5/archive/INSTALL.txt tags/nutchwax-0_12_5/archive/README.txt tags/nutchwax-0_12_5/archive/RELEASE-NOTES.txt Modified: tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt =================================================================== --- tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt 2009-06-25 20:23:20 UTC (rev 2746) +++ tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt 2009-06-25 22:00:14 UTC (rev 2747) @@ -1,6 +1,6 @@ BUILD-NOTES.txt -2008-12-18 +2009-06-25 Aaron Binns ====================================================================== @@ -130,27 +130,37 @@ to - protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax + protocol-http|parse-(text|html|js|pdf)|index-nutchwax|query-(basic|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax In short, we add: - index-nutchwax - query-nutchwax - urlfilter-nutchwax - parse-pdf + parse-pdf + index-nutchwax + query-nutchwax + urlfilter-nutchwax and remove: - urlfilter-regex - urlnormalizer-(pass|regex|basic) + index-basic + index-anchor + query-site + query-url + urlfilter-regex + urlnormalizer-(pass|regex|basic) -The only *required* changes are the additions of the NutchWAX index -and query plugins. The rest are optional, but recommended. The "parse-pdf" plugin is added simply because we have lots of PDFs in our archives and we want to index them. We sometimes remove the "parse-js" plugin if we don't care to index JavaScript files. +The Nutch index-basic and index-anchor filters are removed and +replaced with the NutchWAX index-nutchwax filter. 
Similarly, we +remove the Nutch query-site and query-url filters, replacing them with +the single NutchWAX query-nutchwax filter. By using the configurable +NutchWAX filters for indexing and querying, we get more powerful and +consistent behavior across metadata fields. Note that we do retain +the Nutch query-basic filter however. + We also remove the default Nutch URL filtering and normalizing plugins because we do not need the URLs normalized nor filtered. We trust that the tool that produced the ARC/WARC file will have normalized the @@ -166,6 +176,14 @@ -------------------------------------------------- indexingfilter.order -------------------------------------------------- +If we use the indexing filters as specified in the previous section, +then this property can remain unset. However, if you choose to use +the Nutch index-basic filter, then you *must* specify the order in +which the filters will be used. If you don't then the filters will be +applied in a random order (per Nutch's design) and since one may +over-write the values of another you won't know what values will +result. In that case, you need to specify the order. + Add this property with a value of org.apache.nutch.indexer.basic.BasicIndexingFilter @@ -174,8 +192,6 @@ So that the NutchWAX indexing filter is run after the Nutch basic indexing filter. -A full explanation is given in "README-dedup.txt". 
- -------------------------------------------------- mime.type.magic -------------------------------------------------- @@ -205,37 +221,44 @@ The specifications here are of the form: - src-key:lowercase:store:tokenize:exclusive:dest-key + src-key:lowercase:store:index:exclusive:dest-key where the only required part is the "src-key", the rest will assume the following defaults: lowercase = true store = true - tokenize = false + index = tokenized exclusive = true dest-key = src-key +For the 'index' property, the possible values are: + tokenized + untokenized + no_norms + no + +corresponding to the Lucene options of the same names. + We recommend: <property> <name>nutchwax.filter.index</name> <value> - url:false:true:true - url:false:true:false:true:exacturl - orig:false - digest:false - filename:false - fileoffset:false - collection - date - type - length + title:false:true:tokenized + content:false:false:tokenized + site:false:false:untokenized + + url:false:true:no + digest:false:true:no + + collection:true:true:no_norms + date:true:true:no_norms + type:true:true:no_norms + length:false:true:no </value> </property> -The "url", "orig" and "digest" values are required, the rest are -optional, but strongly recommended. -------------------------------------------------- nutchwax.filter.query @@ -274,15 +297,10 @@ <property> <name>nutchwax.filter.query</name> <value> - raw:digest:false - raw:filename:false - raw:fileoffset:false - raw:exacturl:false group:collection + group:site:false group:type - field:anchor field:content - field:host field:title </value> </property> @@ -428,3 +446,31 @@ <value>false</value> </property> + +-------------------------------------------------- +searcher.fieldcache +-------------------------------------------------- + +NutchWAX contains a patch controlling the use of a "fieldcache" in the +Nutch searcher. Without this patch Nutch will read the entire set of +hostnames from the index into an in-memory cache. 
This cache is then +consulted when performing de-duplication of results per the +"hitsPerSite" feature. + +For small-to-medium indexes, this can improve performance as the +de-duplication information is entirely in memory and no disk access is +required. + +However, for large indexes, in the tens of gigabytes in size, reading +the entire set of hostnames into an in-memory cache can exhaust the +Java heap. In this case, omitting the cache all together and just +reading the values off disk as needed is better. + +The NutchWAX patch controls the use of this cache based on this property +value. If set to false, then the cache is not used at all. + +<property> + <name>searcher.fieldcache</name> + <value>true</value> +</property> + Modified: tags/nutchwax-0_12_5/archive/HOWTO-xslt.txt =================================================================== --- tags/nutchwax-0_12_5/archive/HOWTO-xslt.txt 2009-06-25 20:23:20 UTC (rev 2746) +++ tags/nutchwax-0_12_5/archive/HOWTO-xslt.txt 2009-06-25 22:00:14 UTC (rev 2747) @@ -1,6 +1,6 @@ HOWTO-xslt.txt -2008-12-18 +2009-06-25 Aaron Binns Table of Contents @@ -128,8 +128,5 @@ You can find sample 'web.xml' and 'search.xsl' files in - contrib/archive/web - -in the compiled Nutch package. Or in this source tree under - - src/web + ./src/nutch/src/web/jsp/search.xsl + ./src/nutch/src/web/web.xml Modified: tags/nutchwax-0_12_5/archive/HOWTO.txt =================================================================== --- tags/nutchwax-0_12_5/archive/HOWTO.txt 2009-06-25 20:23:20 UTC (rev 2746) +++ tags/nutchwax-0_12_5/archive/HOWTO.txt 2009-06-25 22:00:14 UTC (rev 2747) @@ -1,6 +1,6 @@ HOWTO.txt -2008-07-28 +2009-06-25 Aaron Binns Table of Contents @@ -26,7 +26,7 @@ This HOWTO assumes it is installed in - /opt/nutchwax-0.12.4 + /opt/nutchwax-0.12.5 2. ARC/WARC files. 
@@ -96,9 +96,9 @@ $ cd ../ $ ls -F1 crawl/ - $ /opt/nutchwax-0.12.4/bin/nutch org.apache.nutch.searcher.NutchBean computer + $ /opt/nutchwax-0.12.4/bin/nutch org.archive.nutchwax.NutchWaxBean computer -This calls the NutchBean to execute a simple keyword search for +This calls the NutchWaxBean to execute a simple keyword search for "computer". Use whatever query term you think appears in the documents you imported. Modified: tags/nutchwax-0_12_5/archive/INSTALL.txt =================================================================== --- tags/nutchwax-0_12_5/archive/INSTALL.txt 2009-06-25 20:23:20 UTC (rev 2746) +++ tags/nutchwax-0_12_5/archive/INSTALL.txt 2009-06-25 22:00:14 UTC (rev 2747) @@ -1,6 +1,6 @@ INSTALL.txt -2009-03-08 +2009-06-25 Aaron Binns Table of Contents @@ -62,10 +62,12 @@ SVN: nutch-1.0-dev ------------------ As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev. -Nutch doesn't have a 1.0 release package yet, so we have to use the -Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.4 is -built against is: +Although the Nutch project released 1.0 in early 2009, there were so +many changes that NutchWAX 0.12.5 is still built against pre-1.0 +codebase. +The specific SVN revision that NutchWAX 0.12.5 is built against is: + 701524 To checkout this revision of Nutch, use: @@ -79,14 +81,14 @@ SVN: NutchWAX ------------- -Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.4 +Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.5 source into Nutch's "contrib" directory. $ cd contrib - $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_4/archive + $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_5/archive This will create a sub-directory named "archive" containing the -NutchWAX 0.12.4 sources. +NutchWAX 0.12.5 sources. 
Build and install ----------------- @@ -113,7 +115,7 @@ $ cd /opt $ tar xvfz nutch-1.0-dev.tar.gz - $ mv nutch-1.0-dev nutchwax-0.12.4 + $ mv nutch-1.0-dev nutchwax-0.12.5 ====================================================================== @@ -126,24 +128,24 @@ Install it simply by untarring it, for example: $ cd /opt - $ tar xvfz nutchwax-0.12.4.tar.gz + $ tar xvfz nutchwax-0.12.5.tar.gz ====================================================================== Install start-up scripts ====================================================================== -NutchWAX 0.12.4 comes with a Unix init.d script which can be used to +NutchWAX 0.12.5 comes with a Unix init.d script which can be used to automatically start the searcher slaves for a multi-node search configuration. Assuming you installed NutchWAX as - /opt/nutchwax-0.12.4 + /opt/nutchwax-0.12.5 the script is found at - /opt/nutchwax-0.12.4/contrib/archive/etc/init.d/searcher-slave + /opt/nutchwax-0.12.5/contrib/archive/etc/init.d/searcher-slave This script can be placed in /etc/init.d then added to the list of startup scripts to run at bootup by using commands appropriate to your Modified: tags/nutchwax-0_12_5/archive/README.txt =================================================================== --- tags/nutchwax-0_12_5/archive/README.txt 2009-06-25 20:23:20 UTC (rev 2746) +++ tags/nutchwax-0_12_5/archive/README.txt 2009-06-25 22:00:14 UTC (rev 2747) @@ -1,6 +1,6 @@ README.txt -2009-05-05 +2009-06-25 Aaron Binns Table of Contents @@ -13,7 +13,7 @@ Introduction ====================================================================== -Welcome to NutchWAX 0.12.4! +Welcome to NutchWAX 0.12.5! NutchWAX is a set of add-ons to Nutch in order to index and search archived web data. 
Modified: tags/nutchwax-0_12_5/archive/RELEASE-NOTES.txt =================================================================== --- tags/nutchwax-0_12_5/archive/RELEASE-NOTES.txt 2009-06-25 20:23:20 UTC (rev 2746) +++ tags/nutchwax-0_12_5/archive/RELEASE-NOTES.txt 2009-06-25 22:00:14 UTC (rev 2747) @@ -1,9 +1,9 @@ RELEASE-NOTES.TXT -2009-05-05 +2009-06-25 Aaron Binns -Release notes for NutchWAX 0.12.4 +Release notes for NutchWAX 0.12.5 For the most recent updates and information on NutchWAX, please visit the project wiki at: @@ -15,17 +15,75 @@ Overview ====================================================================== -NutchWAX 0.12.4 contains numerous enhancements and fixes to 0.12.3 +NutchWAX 0.12.5 contains numerous enhancements and fixes to 0.12.4 - o Option to omit storing of content during import. - o Support for per-collection segments in master/slave config. - o Additional diagnostic/log messages to help troubleshoot common - deployment mistakes. - o PageRankDb similar to LinkDb but only keeping inlink counts. - o Improved paging through results, handling "paging past the end". + o Command-line options for NutchWaxBean to configure number of + results to emit and how many hits per site to allow. + o Change default configuration to use NutchWAX indexing and query + filters instead of Nutch-provided ones. This give more consistent + control over indexing and query behavior. + o No longer store the unique document key (URL+digest) in a separate + field in the index. Since the URL and digest are stored, just use + them to synthesize the unique document key as needed. + + o Trimmed down the default configuration of indexing and query + filters to only store and index the minimum information needed for + typical NutchWAX installations. 
+ + ====================================================================== +Configuration changes +====================================================================== + +As mentioned in the overview, NutchWAX 0.12.5 has some important +changes to the default configuration. + +Previously, the indexing and query filter configuration utilized a +combination of filters from Nutch and NutchWAX. This was in line with +our goal of NutchWAX being a set of add-ons to Nutch. + +However, in practice, the mixing of these filters often lead to +confusion since the NutchWAX filters could be configured via +properties in the Nutch configuration files whereas the Nutch filters +were hard-coded and less powerful. + +Now, all the Nutch indexing filters have been removed and are replaced +with the single NutchWAX indexing filter. Similarly, all but one +Nutch query filter are removed, replaced by the configurable NutchWAX +query filter. We do retain the Nutch 'query-basic' filter as it +contains the logic for automatically applying a query to multiple +fields with proportionate weights; something not subsumed by the +NutchWAX query filter. + + +In addition to removing the Nutch filters, the NutchWAX index and +query filters are streamlined to only index and store the minimum set +of metadata fields for typical deployments. + +In previous versions of NutchWAX, the indexing filters were configured +to index and store nearly every piece of metadata available. Although +this seems desirable, it adds a lot of storage overhead to the index, +and can hamper run-time query speed just by having unnecessary +information in the index (more junk for the disk to seek around). + +The NutchWAX 0.12.5 configuration omits the typically unnecessary +metadata fields from the index and only indexes those fields we think +are needed for typical searches. 
+ +For example, while we do store the digest, we do not index it as it's +very unusual for someone to search for a document with a specific +SHA-1 digest value. You could decide you want that, in which case you +can edit the configuration and re-index the data. You would have to +correspondingly edit the query filter and its configuration to allow +for searching on that field as well. + +We have found that this streamlined indexing configuration yields +Lucene indexes about 25% smaller than with NutchWAX 0.12.4. + + +====================================================================== Issues ====================================================================== @@ -35,23 +93,16 @@ Issues resolved in this release: -WAX-27 Sensible output for requesting page of results past the end. +WAX-45 Add ability to store but not index a field via + ConfigurableIndexingFilter. -WAX-34 Add option to omit storing of content in segment +WAX-46 Add option to DumpParallelIndex to output only single field. -WAX-35 Add pagerankdb similar to linkdb but which only keeps counts - rather than actual inlinks. +WAX-47 Stop storing document key in "orig" field in index, synthesize + it as needed from the "url" and "digest" fields. -WAX-36 Some additional diagnostics on connecting results to segments - and snippets would be very helpful. +WAX-48 Use NutchWAX configurable query filter for site and url fields. -WAX-37 Per-collection segments not supported in distributed - master-slave configuration. +WAX-49 Add "hitsPerSite" option to NutchWaxBean. -WAX-38 Build omits neessary libraries from .job file. - -WAX-39 Write more efficient, specialized segment parse_text merging. - -WAX-41 Option to enable/disable the FIELDCACHE in the Nutch IndexSearcher - -WAX-42 Add option to continue importing if an arcfile cannot be read. +WAX-50 Add "num hits to find" option to NutchWaxBean. 
This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
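[Editor's note] The BUILD-NOTES.txt diff in the commit above documents the `nutchwax.filter.index` spec format `src-key:lowercase:store:index:exclusive:dest-key`, where only `src-key` is required and the rest default to `true`, `true`, `tokenized`, `true`, and `src-key` respectively. A minimal standalone sketch of that parsing rule, under the assumption that the documented defaults are applied positionally (the `FieldSpec` class name is illustrative, not the actual NutchWAX `ConfigurableIndexingFilter` source):

```java
// Hypothetical re-implementation of the field-spec parsing described in
// BUILD-NOTES.txt; class and field names are assumptions for illustration.
public class FieldSpec {
    final String srcKey;
    final boolean lowerCase;  // default: true
    final boolean store;      // default: true
    final String index;       // default: "tokenized"; also untokenized, no_norms, no
    final boolean exclusive;  // default: true
    final String destKey;     // default: same as srcKey

    FieldSpec(String spec) {
        // Positional, colon-separated; missing positions take the defaults.
        String[] p = spec.trim().split(":");
        srcKey    = p[0];
        lowerCase = p.length > 1 ? Boolean.parseBoolean(p[1]) : true;
        store     = p.length > 2 ? Boolean.parseBoolean(p[2]) : true;
        index     = p.length > 3 ? p[3] : "tokenized";
        exclusive = p.length > 4 ? Boolean.parseBoolean(p[4]) : true;
        destKey   = p.length > 5 ? p[5] : srcKey;
    }

    public static void main(String[] args) {
        // A bare "collection" entry gets every default.
        FieldSpec c = new FieldSpec("collection");
        System.out.println(c.srcKey + " -> " + c.destKey + " index=" + c.index);
        // Fully specified: store "url" under "exacturl", unindexed.
        FieldSpec u = new FieldSpec("url:false:true:no:true:exacturl");
        System.out.println(u.srcKey + " -> " + u.destKey + " index=" + u.index);
    }
}
```

Under this reading, the recommended `collection:true:true:no_norms` entry is lowercased, stored, and indexed without norms under the same `collection` key.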
From: <bi...@us...> - 2009-06-25 20:23:25
Revision: 2746 http://archive-access.svn.sourceforge.net/archive-access/?rev=2746&view=rev Author: binzino Date: 2009-06-25 20:23:20 +0000 (Thu, 25 Jun 2009) Log Message: ----------- WAX-49, WAX-50: Added -h and -n options to specify number of hits-per-site and total number of hits requested. Modified Paths: -------------- tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/NutchWaxBean.java Modified: tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/NutchWaxBean.java =================================================================== --- tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/NutchWaxBean.java 2009-06-25 20:21:51 UTC (rev 2745) +++ tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/NutchWaxBean.java 2009-06-25 20:23:20 UTC (rev 2746) @@ -251,28 +251,59 @@ */ public static void main(String[] args) throws Exception { - String usage = "NutchWaxBean query"; + String usage = "NutchWaxBean [options] query" + + "\n\t-h <n> Hits per site" + + "\n\t-n <n> Number of results to find" + + "\n"; - if (args.length == 0) + if ( args.length == 0 ) { - System.err.println(usage); - System.exit(-1); + System.err.println( usage ); + System.exit( -1 ); } + + String queryString = args[args.length - 1]; + int hitsPerSite = 0; + int numHits = 10; + for ( int i = 0 ; i < args.length - 1 ; i++ ) + { + try + { + if ( "-h".equals( args[i] ) ) + { + i++; + hitsPerSite = Integer.parseInt( args[i] ); + } + if ( "-n".equals( args[i] ) ) + { + i++; + numHits = Integer.parseInt( args[i] ); + } + } + catch ( NumberFormatException nfe ) + { + System.err.println( "Error: not a numeric value: " + args[i] ); + System.err.println( usage ); + System.exit( -1 ); + } + } Configuration conf = NutchConfiguration.create(); NutchBean bean = new NutchBean(conf); NutchBeanModifier.modify( bean ); - Query query = Query.parse(args[0], conf); - Hits hits = bean.search(query, 10); - System.out.println("Total hits: " + hits.getTotal()); - int length = 
(int)Math.min(hits.getTotal(), 10); + Query query = Query.parse(queryString, conf); + System.out.println("Hits per site: " + hitsPerSite); + Hits hits = bean.search(query, numHits, hitsPerSite); + System.out.println("Total hits : " + hits.getTotal()); + System.out.println("Hits length: " + hits.getLength()); + int length = (int)Math.min(hits.getLength(), numHits); Hit[] show = hits.getHits(0, length); HitDetails[] details = bean.getDetails(show); Summary[] summaries = bean.getSummary(details, query); - for (int i = 0; i < hits.getLength(); i++) + for (int i = 0; i < length; i++) { // Use a slightly more verbose output than NutchBean. System.out.println( " " This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
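[Editor's note] The r2746 diff above adds `-h` (hits per site) and `-n` (number of results) options, with the query always taken as the last argument. The parsing loop can be sketched in isolation as follows; the class name is hypothetical (the real logic lives in `NutchWaxBean.main`), and the usage/error handling for non-numeric values shown in the diff is reduced to a comment here:

```java
// Hypothetical standalone sketch of the option parsing added in r2746.
public class NutchWaxOptions {
    int hitsPerSite = 0;   // -h; 0 means no per-site limit
    int numHits = 10;      // -n; default of 10 results
    final String query;    // last argument is always the query string

    NutchWaxOptions(String[] args) {
        query = args[args.length - 1];
        // Scan everything before the final (query) argument for options.
        // The real code wraps the parses in a NumberFormatException handler
        // that prints usage and exits; omitted in this sketch.
        for (int i = 0; i < args.length - 1; i++) {
            if ("-h".equals(args[i])) {
                hitsPerSite = Integer.parseInt(args[++i]);
            } else if ("-n".equals(args[i])) {
                numHits = Integer.parseInt(args[++i]);
            }
        }
    }

    public static void main(String[] args) {
        NutchWaxOptions o =
            new NutchWaxOptions(new String[]{"-h", "2", "-n", "50", "computer"});
        System.out.println(o.query + " hitsPerSite=" + o.hitsPerSite
                           + " numHits=" + o.numHits);
    }
}
```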
From: <bi...@us...> - 2009-06-25 20:21:52
Revision: 2745 http://archive-access.svn.sourceforge.net/archive-access/?rev=2745&view=rev Author: binzino Date: 2009-06-25 20:21:51 +0000 (Thu, 25 Jun 2009) Log Message: ----------- Since we have our own NutchWAX OpenSearchServlet, we no longer need any mods to the Nutch-provided one. Removed Paths: ------------- tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java Deleted: tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java =================================================================== --- tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java 2009-06-25 20:21:06 UTC (rev 2744) +++ tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java 2009-06-25 20:21:51 UTC (rev 2745) @@ -1,333 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.nutch.searcher; - -import java.io.IOException; -import java.net.URLEncoder; -import java.util.Map; -import java.util.HashMap; -import java.util.Set; -import java.util.HashSet; - -import javax.servlet.ServletException; -import javax.servlet.ServletConfig; -import javax.servlet.http.HttpServlet; -import javax.servlet.http.HttpServletRequest; -import javax.servlet.http.HttpServletResponse; - -import javax.xml.parsers.*; - -import org.apache.hadoop.conf.Configuration; -import org.apache.nutch.util.NutchConfiguration; -import org.w3c.dom.*; -import javax.xml.transform.TransformerFactory; -import javax.xml.transform.Transformer; -import javax.xml.transform.dom.DOMSource; -import javax.xml.transform.stream.StreamResult; - - -/** Present search results using A9's OpenSearch extensions to RSS, plus a few - * Nutch-specific extensions. */ -public class OpenSearchServlet extends HttpServlet { - private static final Map NS_MAP = new HashMap(); - private int MAX_HITS_PER_PAGE; - - static { - NS_MAP.put("opensearch", "http://a9.com/-/spec/opensearchrss/1.0/"); - NS_MAP.put("nutch", "http://www.nutch.org/opensearchrss/1.0/"); - } - - private static final Set SKIP_DETAILS = new HashSet(); - static { - SKIP_DETAILS.add("url"); // redundant with RSS link - SKIP_DETAILS.add("title"); // redundant with RSS title - } - - private NutchBean bean; - private Configuration conf; - - public void init(ServletConfig config) throws ServletException { - try { - this.conf = NutchConfiguration.get(config.getServletContext()); - bean = NutchBean.get(config.getServletContext(), this.conf); - } catch (IOException e) { - throw new ServletException(e); - } - MAX_HITS_PER_PAGE = conf.getInt("searcher.max.hits.per.page", -1); - } - - public void doGet(HttpServletRequest request, HttpServletResponse response) - throws ServletException, IOException { - - if (NutchBean.LOG.isInfoEnabled()) { - NutchBean.LOG.info("query request from " + request.getRemoteAddr()); - } - - // get 
parameters from request - request.setCharacterEncoding("UTF-8"); - String queryString = request.getParameter("query"); - if (queryString == null) - queryString = ""; - String urlQuery = URLEncoder.encode(queryString, "UTF-8"); - - // the query language - String queryLang = request.getParameter("lang"); - - int start = 0; // first hit to display - String startString = request.getParameter("start"); - if (startString != null) - start = Integer.parseInt(startString); - - int hitsPerPage = 10; // number of hits to display - String hitsString = request.getParameter("hitsPerPage"); - if (hitsString != null) - hitsPerPage = Integer.parseInt(hitsString); - if(MAX_HITS_PER_PAGE > 0 && hitsPerPage > MAX_HITS_PER_PAGE) - hitsPerPage = MAX_HITS_PER_PAGE; - - String sort = request.getParameter("sort"); - boolean reverse = - sort!=null && "true".equals(request.getParameter("reverse")); - - // De-Duplicate handling. Look for duplicates field and for how many - // duplicates per results to return. Default duplicates field is 'site' - // and duplicates per results default is '2'. - String dedupField = request.getParameter("dedupField"); - if (dedupField == null || dedupField.length() == 0) { - dedupField = "site"; - } - int hitsPerDup = 2; - String hitsPerDupString = request.getParameter("hitsPerDup"); - if (hitsPerDupString != null && hitsPerDupString.length() > 0) { - hitsPerDup = Integer.parseInt(hitsPerDupString); - } else { - // If 'hitsPerSite' present, use that value. - String hitsPerSiteString = request.getParameter("hitsPerSite"); - if (hitsPerSiteString != null && hitsPerSiteString.length() > 0) { - hitsPerDup = Integer.parseInt(hitsPerSiteString); - } - } - - // Make up query string for use later drawing the 'rss' logo. - String params = "&hitsPerPage=" + hitsPerPage + - (queryLang == null ? "" : "&lang=" + queryLang) + - (sort == null ? "" : "&sort=" + sort + (reverse? "&reverse=true": "") + - (dedupField == null ? 
"" : "&dedupField=" + dedupField)); - - Query query = Query.parse(queryString, queryLang, this.conf); - if (NutchBean.LOG.isInfoEnabled()) { - NutchBean.LOG.info("query: " + queryString); - NutchBean.LOG.info("lang: " + queryLang); - } - - // execute the query - Hits hits; - try { - hits = bean.search(query, start + hitsPerPage, hitsPerDup, dedupField, - sort, reverse); - } catch (IOException e) { - if (NutchBean.LOG.isWarnEnabled()) { - NutchBean.LOG.warn("Search Error", e); - } - hits = new Hits(0,new Hit[0]); - } - - if (NutchBean.LOG.isInfoEnabled()) { - NutchBean.LOG.info("total hits: " + hits.getTotal()); - } - - // generate xml results - int end = (int)Math.min(hits.getLength(), start + hitsPerPage); - int length = end-start; - - Hit[] show = hits.getHits(start, end-start); - HitDetails[] details = bean.getDetails(show); - Summary[] summaries = bean.getSummary(details, query); - - String requestUrl = request.getRequestURL().toString(); - String base = requestUrl.substring(0, requestUrl.lastIndexOf('/')); - - - try { - DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); - factory.setNamespaceAware(true); - Document doc = factory.newDocumentBuilder().newDocument(); - - Element rss = addNode(doc, doc, "rss"); - addAttribute(doc, rss, "version", "2.0"); - addAttribute(doc, rss, "xmlns:opensearch", - (String)NS_MAP.get("opensearch")); - addAttribute(doc, rss, "xmlns:nutch", (String)NS_MAP.get("nutch")); - - Element channel = addNode(doc, rss, "channel"); - - addNode(doc, channel, "title", "Nutch: " + queryString); - addNode(doc, channel, "description", "Nutch search results for query: " - + queryString); - addNode(doc, channel, "link", - base+"/search.jsp" - +"?query="+urlQuery - +"&start="+start - +"&hitsPerDup="+hitsPerDup - +params); - - addNode(doc, channel, "opensearch", "totalResults", ""+hits.getTotal()); - addNode(doc, channel, "opensearch", "startIndex", ""+start); - addNode(doc, channel, "opensearch", "itemsPerPage", ""+hitsPerPage); 
- - addNode(doc, channel, "nutch", "query", queryString); - - - if ((hits.totalIsExact() && end < hits.getTotal()) // more hits to show - || (!hits.totalIsExact() && (hits.getLength() > start+hitsPerPage))){ - addNode(doc, channel, "nutch", "nextPage", requestUrl - +"?query="+urlQuery - +"&start="+end - +"&hitsPerDup="+hitsPerDup - +params); - } - - if ((!hits.totalIsExact() && (hits.getLength() <= start+hitsPerPage))) { - addNode(doc, channel, "nutch", "showAllHits", requestUrl - +"?query="+urlQuery - +"&hitsPerDup="+0 - +params); - } - - for (int i = 0; i < length; i++) { - Hit hit = show[i]; - HitDetails detail = details[i]; - String title = detail.getValue("title"); - String url = detail.getValue("url"); - String id = "idx=" + hit.getIndexNo() + "&id=" + hit.getIndexDocNo(); - - if (title == null || title.equals("")) { // use url for docs w/o title - title = url; - } - - Element item = addNode(doc, channel, "item"); - - addNode(doc, item, "title", title); - if (summaries[i] != null) { - addNode(doc, item, "description", summaries[i].toString() ); - } - addNode(doc, item, "link", url); - - addNode(doc, item, "nutch", "site", hit.getDedupValue()); - - addNode(doc, item, "nutch", "cache", base+"/cached.jsp?"+id); - addNode(doc, item, "nutch", "explain", base+"/explain.jsp?"+id - +"&query="+urlQuery+"&lang="+queryLang); - - if (hit.moreFromDupExcluded()) { - addNode(doc, item, "nutch", "moreFromSite", requestUrl - +"?query=" - +URLEncoder.encode("site:"+hit.getDedupValue() - +" "+queryString, "UTF-8") - +"&hitsPerSite="+0 - +params); - } - - for (int j = 0; j < detail.getLength(); j++) { // add all from detail - String field = detail.getField(j); - if (!SKIP_DETAILS.contains(field)) - addNode(doc, item, "nutch", field, detail.getValue(j)); - } - } - - // dump DOM tree - - DOMSource source = new DOMSource(doc); - TransformerFactory transFactory = TransformerFactory.newInstance(); - Transformer transformer = transFactory.newTransformer(); - 
transformer.setOutputProperty("indent", "yes"); - StreamResult result = new StreamResult(response.getOutputStream()); - response.setContentType("text/xml"); - transformer.transform(source, result); - - } catch (javax.xml.parsers.ParserConfigurationException e) { - throw new ServletException(e); - } catch (javax.xml.transform.TransformerException e) { - throw new ServletException(e); - } - - } - - private static Element addNode(Document doc, Node parent, String name) { - Element child = doc.createElement(name); - parent.appendChild(child); - return child; - } - - private static void addNode(Document doc, Node parent, - String name, String text) { - Element child = doc.createElement(name); - child.appendChild(doc.createTextNode(getLegalXml(text))); - parent.appendChild(child); - } - - private static void addNode(Document doc, Node parent, - String ns, String name, String text) { - Element child = doc.createElementNS((String)NS_MAP.get(ns), ns+":"+name); - child.appendChild(doc.createTextNode(getLegalXml(text))); - parent.appendChild(child); - } - - private static void addAttribute(Document doc, Element node, - String name, String value) { - Attr attribute = doc.createAttribute(name); - attribute.setValue(getLegalXml(value)); - node.getAttributes().setNamedItem(attribute); - } - - /* - * Ensure string is legal xml. - * @param text String to verify. - * @return Passed <code>text</code> or a new string with illegal - * characters removed if any found in <code>text</code>. - * @see http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char - */ - protected static String getLegalXml(final String text) { - if (text == null) { - return null; - } - StringBuffer buffer = null; - for (int i = 0; i < text.length(); i++) { - char c = text.charAt(i); - if (!isLegalXml(c)) { - if (buffer == null) { - // Start up a buffer. Copy characters here from now on - // now we've found at least one bad character in original. 
- buffer = new StringBuffer(text.length()); - buffer.append(text.substring(0, i)); - } - } else { - if (buffer != null) { - buffer.append(c); - } - } - } - return (buffer != null)? buffer.toString(): text; - } - - private static boolean isLegalXml(final char c) { - return c == 0x9 || c == 0xa || c == 0xd || (c >= 0x20 && c <= 0xd7ff) - || (c >= 0xe000 && c <= 0xfffd) || (c >= 0x10000 && c <= 0x10ffff); - } - -} This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
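[Editor's note] The `OpenSearchServlet` deleted in r2745 above carried a small utility, `getLegalXml`/`isLegalXml`, for stripping characters that are illegal in XML 1.0 output. A standalone version of that check is below; note it keeps only the BMP ranges, since a Java `char` can never hold a code point above 0xFFFF (for the same reason, the deleted code's `0x10000`-`0x10ffff` clause was unreachable). The class name is illustrative:

```java
// Standalone sketch of the XML character filtering from the removed servlet.
public class LegalXml {
    // Character ranges per the XML 1.0 "Char" production, BMP only.
    static boolean isLegalXml(final char c) {
        return c == 0x9 || c == 0xa || c == 0xd
            || (c >= 0x20 && c <= 0xd7ff)
            || (c >= 0xe000 && c <= 0xfffd);
    }

    // Return text with any illegal characters dropped. The original kept
    // the buffer lazy (allocated only on the first bad character); this
    // version always copies, which is equivalent in output.
    static String getLegalXml(final String text) {
        if (text == null) {
            return null;
        }
        StringBuilder buf = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isLegalXml(c)) {
                buf.append(c);
            }
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // NUL is not legal XML and gets stripped.
        System.out.println(getLegalXml("ok\u0000text"));
    }
}
```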
From: <bi...@us...> - 2009-06-25 20:21:14
Revision: 2744
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2744&view=rev
Author:   binzino
Date:     2009-06-25 20:21:06 +0000 (Thu, 25 Jun 2009)

Log Message:
-----------
WAX-47: Use 'url' field rather than 'exacturl' field as the former
will (should) always be present whereas the latter may not.

Modified Paths:
--------------
    tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java

Modified: tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java
===================================================================
--- tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java	2009-06-23 21:35:00 UTC (rev 2743)
+++ tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java	2009-06-25 20:21:06 UTC (rev 2744)
@@ -173,10 +173,10 @@
     {
       if ( "site".equals( dedupField ) )
         {
-          String exactUrl = reader.document( doc ).get( "exacturl");
+          String url = reader.document( doc ).get( "url");
 
           try
             {
-              java.net.URL u = new java.net.URL( exactUrl );
+              java.net.URL u = new java.net.URL( url );
               dedupValue = u.getHost();
               System.out.println("Dedup value hack:" + dedupValue);
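[Editor's note] The r2744 change above switches the `site` dedup lookup from the `exacturl` field to `url`. The host extraction itself is plain `java.net.URL#getHost`; a minimal sketch of that step (class and method names are illustrative, and the real code logs rather than returning null on a bad URL):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class DedupHost {
    // Derive the per-site dedup value from a stored "url" field value,
    // as the patched IndexSearcher does. Returns null on a malformed URL.
    static String dedupValue(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(dedupValue("http://www.example.org/page.html"));
    }
}
```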
From: <bi...@us...> - 2009-06-23 21:35:02
Revision: 2743 http://archive-access.svn.sourceforge.net/archive-access/?rev=2743&view=rev Author: binzino Date: 2009-06-23 21:35:00 +0000 (Tue, 23 Jun 2009) Log Message: ----------- Fix WAX-45 and WAX-48. ConfigurableIndexingFilter can handle all the fields relevant to Nutch(WAX). Update the nute-site.xml accordingly. Also, remove the site and url query filters from nutch-site.xml and configure NutchWAX query filter to take over for them. Modified Paths: -------------- tags/nutchwax-0_12_5/archive/src/nutch/conf/nutch-site.xml tags/nutchwax-0_12_5/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java tags/nutchwax-0_12_5/archive/src/plugin/query-nutchwax/plugin.xml Modified: tags/nutchwax-0_12_5/archive/src/nutch/conf/nutch-site.xml =================================================================== --- tags/nutchwax-0_12_5/archive/src/nutch/conf/nutch-site.xml 2009-06-23 21:17:31 UTC (rev 2742) +++ tags/nutchwax-0_12_5/archive/src/nutch/conf/nutch-site.xml 2009-06-23 21:35:00 UTC (rev 2743) @@ -10,19 +10,18 @@ <!-- Add 'index-nutchwax' and 'query-nutchwax' to plugin list. --> <!-- Also, add 'parse-pdf' --> <!-- Remove 'urlfilter-regex' and 'normalizer-(pass|regex|basic)' --> - <value>protocol-http|parse-(text|html|js|pdf)|index-(basic|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-nutchwax|urlfilter-nutchwax</value> + <value>protocol-http|parse-(text|html|js|pdf)|index-nutchwax|query-(basic|nutchwax)|summary-basic|scoring-nutchwax|urlfilter-nutchwax</value> </property> -<!-- The indexing filter order *must* be specified in order for - NutchWAX's ConfigurableIndexingFilter to be called *after* the - BasicIndexingFilter. This is necessary so that the - ConfigurableIndexingFilter can over-write some of the values put - into the Lucene document by the BasicIndexingFilter. 
- - The over-written values are the 'url' and 'digest' fields, which - NutchWAX needs to handle specially in order for de-duplication to - work properly. - --> +<!-- + When using *only* the 'index-nutchwax' in 'plugin.includes' above, + we don't need to specify an order since there is only one plugin. + + However, if you choose to use the Nutch 'index-basic', then you have + to specify the order such that the NutchWAX ConfigurableIndexingFilter + is after it. Whichever plugin comes last over-writes the values + of those that come before it. + <property> <name>indexingfilter.order</name> <value> @@ -30,29 +29,31 @@ org.archive.nutchwax.index.ConfigurableIndexingFilter </value> </property> + --> <property> <!-- Configure the 'index-nutchwax' plugin. Specify how the metadata fields added by the Importer are mapped to the Lucene documents during indexing. - The specifications here are of the form "src-key:lowercase:store:tokenize:dest-key" + The specifications here are of the form "src-key:lowercase:store:index:dest-key" Where the only required part is the "src-key", the rest will assume the following defaults: lowercase = true store = true - tokenize = false + index = tokenized exclusive = true dest-key = src-key --> <name>nutchwax.filter.index</name> <value> - url:false:true:true - url:false:true:false:true:exacturl - orig:false - digest:false - filename:false - fileoffset:false - collection - date - type - length + title:false:true:tokenized + content:false:false:tokenized + site:false:false:untokenized + + url:false:true:no + digest:false:true:no + + collection:true:true:no_norms + date:true:true:no_norms + type:true:true:no_norms + length:false:true:no </value> </property> @@ -70,15 +71,10 @@ <!-- We do *not* use this filter for handling "date" queries, there is a specific filter for that: DateQueryFilter --> <name>nutchwax.filter.query</name> <value> - raw:digest:false - raw:filename:false - raw:fileoffset:false - raw:exacturl:false group:collection + 
group:site:false group:type - field:anchor field:content - field:host field:title </value> </property> Modified: tags/nutchwax-0_12_5/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java =================================================================== --- tags/nutchwax-0_12_5/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java 2009-06-23 21:17:31 UTC (rev 2742) +++ tags/nutchwax-0_12_5/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java 2009-06-23 21:35:00 UTC (rev 2743) @@ -20,6 +20,8 @@ */ package org.archive.nutchwax.index; +import java.net.MalformedURLException; +import java.net.URL; import java.util.List; import java.util.ArrayList; @@ -27,6 +29,7 @@ import org.apache.commons.logging.LogFactory; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; +import org.apache.lucene.document.Field.Index; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.io.Text; import org.apache.nutch.crawl.CrawlDatum; @@ -46,10 +49,14 @@ private Configuration conf; private List<FieldSpecification> fieldSpecs; + private int MAX_TITLE_LENGTH; + public void setConf( Configuration conf ) { this.conf = conf; - + + this.MAX_TITLE_LENGTH = conf.getInt("indexer.max.title.length", 100); + String filterSpecs = conf.get( "nutchwax.filter.index" ); if ( null == filterSpecs ) @@ -65,12 +72,12 @@ { String spec[] = filterSpec.split("[:]"); - String srcKey = spec[0]; - boolean lowerCase = true; - boolean store = true; - boolean tokenize = false; - boolean exclusive = true; - String destKey = srcKey; + String srcKey = spec[0]; + boolean lowerCase = true; + boolean store = true; + Index index = Index.TOKENIZED; + boolean exclusive = true; + String destKey = srcKey; switch ( spec.length ) { default: @@ -79,7 +86,10 @@ case 5: exclusive = Boolean.parseBoolean( spec[4] ); case 4: - tokenize = 
Boolean.parseBoolean( spec[3] ); + index = "tokenized". equals(spec[3]) ? Index.TOKENIZED : + "untokenized".equals(spec[3]) ? Index.UN_TOKENIZED : + "no_norms". equals(spec[3]) ? Index.NO_NORMS : + Index.NO; case 3: store = Boolean.parseBoolean( spec[2] ); case 2: @@ -89,9 +99,9 @@ ; } - LOG.info( "Add field specification: " + srcKey + ":" + lowerCase + ":" + store + ":" + tokenize + ":" + exclusive + ":" + destKey ); + LOG.info( "Add field specification: " + srcKey + ":" + lowerCase + ":" + store + ":" + index + ":" + exclusive + ":" + destKey ); - this.fieldSpecs.add( new FieldSpecification( srcKey, lowerCase, store, tokenize, exclusive, destKey ) ); + this.fieldSpecs.add( new FieldSpecification( srcKey, lowerCase, store, index, exclusive, destKey ) ); } } @@ -100,16 +110,16 @@ String srcKey; boolean lowerCase; boolean store; - boolean tokenize; + Index index; boolean exclusive; String destKey; - public FieldSpecification( String srcKey, boolean lowerCase, boolean store, boolean tokenize, boolean exclusive, String destKey ) + public FieldSpecification( String srcKey, boolean lowerCase, boolean store, Index index, boolean exclusive, String destKey ) { this.srcKey = srcKey; this.lowerCase = lowerCase; this.store = store; - this.tokenize = tokenize; + this.index = index; this.exclusive = exclusive; this.destKey = destKey; } @@ -124,14 +134,47 @@ * Transfer NutchWAX field values stored in the parsed content to * the Lucene document. 
*/ - public Document filter( Document doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks ) + public Document filter( Document doc, Parse parse, Text key, CrawlDatum datum, Inlinks inlinks ) throws IndexingException { Metadata meta = parse.getData().getContentMeta(); for ( FieldSpecification spec : this.fieldSpecs ) { - String value = meta.get( spec.srcKey ); + String value = null; + if ( "site".equals( spec.srcKey ) || "host".equals( spec.srcKey ) ) + { + try + { + value = (new URL( meta.get( "url" ) ) ).getHost( ); + } + catch ( MalformedURLException mue ) { /* Eat it */ } + } + else if ( "content".equals( spec.srcKey ) ) + { + value = parse.getText( ); + } + else if ( "title".equals( spec.srcKey ) ) + { + value = parse.getData().getTitle(); + if ( value.length() > MAX_TITLE_LENGTH ) // truncate title if needed + { + value = value.substring( 0, MAX_TITLE_LENGTH ); + } + } + else if ( "type".equals( spec.srcKey ) ) + { + value = meta.get( spec.srcKey ); + + if ( value == null ) continue ; + + int p = value.indexOf( ';' ); + if ( p >= 0 ) value = value.substring( 0, p ); + } + else + { + value = meta.get( spec.srcKey ); + } if ( value == null ) continue; @@ -144,11 +187,14 @@ { doc.removeFields( spec.destKey ); } - - doc.add( new Field( spec.destKey, - value, - spec.store ? Field.Store.YES : Field.Store.NO, - spec.tokenize ? Field.Index.TOKENIZED : Field.Index.UN_TOKENIZED ) ); + + if ( spec.store || spec.index != Index.NO ) + { + doc.add( new Field( spec.destKey, + value, + spec.store ? 
Field.Store.YES : Field.Store.NO, + spec.index ) ); + } } return doc; Modified: tags/nutchwax-0_12_5/archive/src/plugin/query-nutchwax/plugin.xml =================================================================== --- tags/nutchwax-0_12_5/archive/src/plugin/query-nutchwax/plugin.xml 2009-06-23 21:17:31 UTC (rev 2742) +++ tags/nutchwax-0_12_5/archive/src/plugin/query-nutchwax/plugin.xml 2009-06-23 21:35:00 UTC (rev 2743) @@ -40,8 +40,8 @@ point="org.apache.nutch.searcher.QueryFilter"> <implementation id="ConfigurableQueryFilter" class="org.archive.nutchwax.query.ConfigurableQueryFilter"> - <parameter name="raw-fields" value="collection,date,digest,exacturl,filename,fileoffset,type" /> - <parameter name="fields" value="anchor,content,host,title" /> + <parameter name="raw-fields" value="collection,site,type" /> + <parameter name="fields" value="content,title" /> </implementation> </extension> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
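The indexing-filter specs in the r2743 diff above follow the form `src-key:lowercase:store:index:exclusive:dest-key`, where only `src-key` is required and the rest fall back to defaults. A hedged sketch of how such a spec string could be parsed (this standalone class is illustrative; the real parser lives in ConfigurableIndexingFilter):

```java
// Illustrative parser for the "src-key:lowercase:store:index:exclusive:dest-key"
// specification form used by nutchwax.filter.index above. Only src-key is
// required; the defaults below mirror the ones documented in nutch-site.xml.
public class FieldSpecDemo {
    static String[] parse(String filterSpec) {
        String[] spec = filterSpec.split("[:]");
        String srcKey    = spec[0];
        String lowerCase = "true";
        String store     = "true";
        String index     = "tokenized";
        String exclusive = "true";
        String destKey   = srcKey;
        switch (spec.length) {          // deliberate fall-through, as in the plugin
            default:
            case 6: destKey   = spec[5];
            case 5: exclusive = spec[4];
            case 4: index     = spec[3];
            case 3: store     = spec[2];
            case 2: lowerCase = spec[1];
            case 1: break;
        }
        return new String[] { srcKey, lowerCase, store, index, exclusive, destKey };
    }

    public static void main(String[] args) {
        // "collection:true:true:no_norms" leaves exclusive and dest-key at defaults
        System.out.println(String.join(":", parse("collection:true:true:no_norms")));
    }
}
```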
From: <bi...@us...> - 2009-06-23 21:17:33
Revision: 2742 http://archive-access.svn.sourceforge.net/archive-access/?rev=2742&view=rev Author: binzino Date: 2009-06-23 21:17:31 +0000 (Tue, 23 Jun 2009) Log Message: ----------- Changed getUrl() to getKey() and added code to synthesize the key from the URL and the digest value rather than relying on the "orig" field holding the key. This is to eliminate storing the key explicitly when it can be easily computed; saving space in the index. Modified Paths: -------------- tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java Modified: tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java =================================================================== --- tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java 2009-06-23 21:15:29 UTC (rev 2741) +++ tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java 2009-06-23 21:17:31 UTC (rev 2742) @@ -241,20 +241,20 @@ } public byte[] getContent(HitDetails details) throws IOException { - return getSegment(details).getContent(getUrl(details)); + return getSegment(details).getContent(getKey(details)); } public ParseData getParseData(HitDetails details) throws IOException { - return getSegment(details).getParseData(getUrl(details)); + return getSegment(details).getParseData(getKey(details)); } public long getFetchDate(HitDetails details) throws IOException { - return getSegment(details).getCrawlDatum(getUrl(details)) + return getSegment(details).getCrawlDatum(getKey(details)) .getFetchTime(); } public ParseText getParseText(HitDetails details) throws IOException { - return getSegment(details).getParseText(getUrl(details)); + return getSegment(details).getParseText(getKey(details)); } public Summary getSummary(HitDetails details, Query query) @@ -269,7 +269,7 @@ { try { - ParseText parseText = segment.getParseText(getUrl(details)); + ParseText parseText = 
segment.getParseText(getKey(details)); text = (parseText != null) ? parseText.getText() : ""; } catch ( Exception e ) @@ -380,11 +380,8 @@ } } - private Text getUrl(HitDetails details) { - String url = details.getValue("orig"); - if (StringUtils.isBlank(url)) { - url = details.getValue("url"); - } + private Text getKey(HitDetails details) { + String url = details.getValue("url") + " " + details.getValue("digest"); return new Text(url); } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
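Revision 2742 above stops storing the segment key in the "orig" field and instead recomputes it as the 'url' and 'digest' values joined by a space. A minimal sketch of that synthesis (the method name follows the commit; the surrounding class is illustrative):

```java
// Sketch of the key synthesis from r2742: the per-record key that was
// previously stored in the "orig" field is now recomputed on demand as
// "<url> <digest>", saving space in the index.
public class KeyDemo {
    static String getKey(String url, String digest) {
        return url + " " + digest;
    }

    public static void main(String[] args) {
        System.out.println(getKey("http://example.org/", "sha1:ABCDEF"));
        // prints "http://example.org/ sha1:ABCDEF"
    }
}
```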
From: <bi...@us...> - 2009-06-23 21:15:36
Revision: 2741 http://archive-access.svn.sourceforge.net/archive-access/?rev=2741&view=rev Author: binzino Date: 2009-06-23 21:15:29 +0000 (Tue, 23 Jun 2009) Log Message: ----------- Removed output of (now) obsolete "orig" metadata field. Modified Paths: -------------- tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/NutchWaxBean.java Modified: tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/NutchWaxBean.java =================================================================== --- tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/NutchWaxBean.java 2009-06-23 21:13:28 UTC (rev 2740) +++ tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/NutchWaxBean.java 2009-06-23 21:15:29 UTC (rev 2741) @@ -282,8 +282,6 @@ + " " + java.util.Arrays.asList( details[i].getValues( "url" ) ) + " " - + java.util.Arrays.asList( details[i].getValues( "orig" ) ) - + " " + java.util.Arrays.asList( details[i].getValues( "digest" ) ) + " " + java.util.Arrays.asList( details[i].getValues( "date" ) ) This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2009-06-23 21:13:31
Revision: 2740 http://archive-access.svn.sourceforge.net/archive-access/?rev=2740&view=rev Author: binzino Date: 2009-06-23 21:13:28 +0000 (Tue, 23 Jun 2009) Log Message: ----------- Fix WAX-46. Added command-line option to only dump a single field. Also added option to only output the # of records in the index. Modified Paths: -------------- tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java Modified: tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java =================================================================== --- tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java 2009-06-22 21:29:05 UTC (rev 2739) +++ tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java 2009-06-23 21:13:28 UTC (rev 2740) @@ -23,6 +23,7 @@ import java.io.File; import java.util.Iterator; import java.util.Arrays; +import java.util.Collection; import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.ArchiveParallelReader; @@ -37,10 +38,19 @@ } int offset = 0; - if ( args[0].equals( "-f" ) ) + if ( args[0].equals( "-l" ) || args[0].equals( "-c" ) ) { offset = 1; } + if ( args[0].equals( "-f" ) ) + { + if ( args.length < 2 ) + { + System.out.println( "Error: missing argument to -f\n" ); + usageAndExit( ); + } + offset = 2; + } String dirs[] = new String[args.length - offset]; System.arraycopy( args, offset, dirs, 0, args.length - offset ); @@ -51,23 +61,51 @@ reader.add( IndexReader.open( dir ) ); } - if ( offset > 0 ) + if ( args[0].equals( "-l" ) ) { listFields( reader ); } + else if ( args[0].equals( "-c" ) ) + { + countDocs( reader ); + } + else if ( args[0].equals( "-f" ) ) + { + dumpIndex( reader, args[1] ); + } else { dumpIndex( reader ); } } + private static void dumpIndex( IndexReader reader, String fieldName ) throws Exception + { + Collection fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL); + + if ( ! 
fieldNames.contains( fieldName ) ) + { + System.out.println( "Field not in index: " + fieldName ); + System.exit( 2 ); + } + + int numDocs = reader.numDocs(); + + for (int i = 0; i < numDocs; i++) + { + System.out.println( Arrays.toString( reader.document(i).getValues( (String) fieldName ) ) ); + } + + } + private static void dumpIndex( IndexReader reader ) throws Exception { - Object[] fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL).toArray(); + Object[] fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL).toArray( ); + Arrays.sort( fieldNames ); - for (int i = 0; i < fieldNames.length; i++) + for ( int i = 0; i < fieldNames.length; i++ ) { - System.out.print(fieldNames[i] + "\t"); + System.out.print( fieldNames[i] + "\t" ); } System.out.println(); @@ -87,19 +125,27 @@ private static void listFields( IndexReader reader ) throws Exception { - Iterator it = reader.getFieldNames(IndexReader.FieldOption.ALL).iterator(); - - while (it.hasNext()) + Object[] fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL).toArray( ); + Arrays.sort( fieldNames ); + + for ( int i = 0; i < fieldNames.length; i++ ) { - System.out.println(it.next()); + System.out.println( fieldNames[i] ); } - - reader.close(); } + private static void countDocs( IndexReader reader ) throws Exception + { + System.out.println( reader.numDocs( ) ); + } + private static void usageAndExit() { - System.out.println("Usage: DumpParallelIndex [-f] index1 ... indexN"); + System.out.println( "Usage: DumpParallelIndex [option] index1 ... indexN" ); + System.out.println( "Options:" ); + System.out.println( " -c Emit document count" ); + System.out.println( " -f <fieldname> Only dump specified field" ); + System.out.println( " -l List fields in index" ); System.exit(1); } } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
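The WAX-46 options above shift the index-directory arguments by an offset that depends on the leading flag: `-l` and `-c` take no value, while `-f` also consumes a field name. A small standalone sketch of that offset logic (not the actual tool):

```java
import java.util.Arrays;

// Sketch of the argument-offset logic from DumpParallelIndex in r2740:
// "-l" and "-c" occupy one slot, "-f" occupies two (flag plus field name),
// and everything after the offset is an index directory.
public class ArgOffsetDemo {
    static int offset(String[] args) {
        if (args[0].equals("-l") || args[0].equals("-c")) return 1;
        if (args[0].equals("-f")) return 2;
        return 0;
    }

    public static void main(String[] args) {
        String[] cli = { "-f", "url", "index1", "index2" };
        int off = offset(cli);
        String[] dirs = Arrays.copyOfRange(cli, off, cli.length);
        System.out.println(off + " " + Arrays.toString(dirs));
        // prints "2 [index1, index2]"
    }
}
```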
From: <bi...@us...> - 2009-06-22 21:30:14
Revision: 2739 http://archive-access.svn.sourceforge.net/archive-access/?rev=2739&view=rev Author: binzino Date: 2009-06-22 21:29:05 +0000 (Mon, 22 Jun 2009) Log Message: ----------- Copied from nutchwax-0_12_4. Added Paths: ----------- tags/nutchwax-0_12_5/
From: <bi...@us...> - 2009-06-18 18:19:23
Revision: 2738 http://archive-access.svn.sourceforge.net/archive-access/?rev=2738&view=rev Author: binzino Date: 2009-06-18 18:19:19 +0000 (Thu, 18 Jun 2009) Log Message: ----------- WAX-46: Added -f option to specify a single field to dump. Also added, -c to emit count of records in an index. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java 2009-06-11 22:20:54 UTC (rev 2737) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java 2009-06-18 18:19:19 UTC (rev 2738) @@ -23,6 +23,7 @@ import java.io.File; import java.util.Iterator; import java.util.Arrays; +import java.util.Collection; import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.ArchiveParallelReader; @@ -37,10 +38,19 @@ } int offset = 0; - if ( args[0].equals( "-f" ) ) + if ( args[0].equals( "-l" ) || args[0].equals( "-c" ) ) { offset = 1; } + if ( args[0].equals( "-f" ) ) + { + if ( args.length < 2 ) + { + System.out.println( "Error: missing argument to -f\n" ); + usageAndExit( ); + } + offset = 2; + } String dirs[] = new String[args.length - offset]; System.arraycopy( args, offset, dirs, 0, args.length - offset ); @@ -51,23 +61,51 @@ reader.add( IndexReader.open( dir ) ); } - if ( offset > 0 ) + if ( args[0].equals( "-l" ) ) { listFields( reader ); } + else if ( args[0].equals( "-c" ) ) + { + countDocs( reader ); + } + else if ( args[0].equals( "-f" ) ) + { + dumpIndex( reader, args[1] ); + } else { dumpIndex( reader ); } } + private static void dumpIndex( IndexReader reader, String fieldName ) throws Exception + { + Collection fieldNames = 
reader.getFieldNames(IndexReader.FieldOption.ALL); + + if ( ! fieldNames.contains( fieldName ) ) + { + System.out.println( "Field not in index: " + fieldName ); + System.exit( 2 ); + } + + int numDocs = reader.numDocs(); + + for (int i = 0; i < numDocs; i++) + { + System.out.println( Arrays.toString( reader.document(i).getValues( (String) fieldName ) ) ); + } + + } + private static void dumpIndex( IndexReader reader ) throws Exception { - Object[] fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL).toArray(); + Object[] fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL).toArray( ); + Arrays.sort( fieldNames ); - for (int i = 0; i < fieldNames.length; i++) + for ( int i = 0; i < fieldNames.length; i++ ) { - System.out.print(fieldNames[i] + "\t"); + System.out.print( fieldNames[i] + "\t" ); } System.out.println(); @@ -87,19 +125,27 @@ private static void listFields( IndexReader reader ) throws Exception { - Iterator it = reader.getFieldNames(IndexReader.FieldOption.ALL).iterator(); - - while (it.hasNext()) + Object[] fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL).toArray( ); + Arrays.sort( fieldNames ); + + for ( int i = 0; i < fieldNames.length; i++ ) { - System.out.println(it.next()); + System.out.println( fieldNames[i] ); } - - reader.close(); } + private static void countDocs( IndexReader reader ) throws Exception + { + System.out.println( reader.numDocs( ) ); + } + private static void usageAndExit() { - System.out.println("Usage: DumpParallelIndex [-f] index1 ... indexN"); + System.out.println( "Usage: DumpParallelIndex [option] index1 ... indexN" ); + System.out.println( "Options:" ); + System.out.println( " -c Emit document count" ); + System.out.println( " -f <fieldname> Only dump specified field" ); + System.out.println( " -l List fields in index" ); System.exit(1); } } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bra...@us...> - 2009-06-11 22:21:12
Revision: 2737 http://archive-access.svn.sourceforge.net/archive-access/?rev=2737&view=rev Author: bradtofel Date: 2009-06-11 22:20:54 +0000 (Thu, 11 Jun 2009) Log Message: ----------- TWEAK: changed bad NotImplementedException to UnsupportedOperationException. Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/ResourceFileLocationDBLog.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/CompositeSortedIterator.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/PeekableIterator.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/ResourceFileLocationDBLog.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/ResourceFileLocationDBLog.java 2009-06-09 22:48:09 UTC (rev 2736) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/ResourceFileLocationDBLog.java 2009-06-11 22:20:54 UTC (rev 2737) @@ -34,8 +34,6 @@ import org.archive.wayback.util.CloseableIterator; import org.archive.wayback.util.flatfile.RecordIterator; -import sun.reflect.generics.reflectiveObjects.NotImplementedException; - /** * Simple log file tracking new names being added to a ResourceFileLocationDB. 
* @@ -169,7 +167,7 @@ * @see java.util.Iterator#remove() */ public void remove() { - throw new NotImplementedException(); + throw new UnsupportedOperationException(); } } } Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/CompositeSortedIterator.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/CompositeSortedIterator.java 2009-06-09 22:48:09 UTC (rev 2736) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/CompositeSortedIterator.java 2009-06-11 22:20:54 UTC (rev 2737) @@ -31,8 +31,6 @@ import java.util.NoSuchElementException; -import sun.reflect.generics.reflectiveObjects.NotImplementedException; - /** * Composite of multiple Iterators that returns the next from a series of * all component Iterators based on Comparator constructor argument. @@ -100,7 +98,7 @@ * @see java.util.Iterator#remove() */ public void remove() { - throw new NotImplementedException(); + throw new UnsupportedOperationException(); } /* (non-Javadoc) Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/PeekableIterator.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/PeekableIterator.java 2009-06-09 22:48:09 UTC (rev 2736) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/PeekableIterator.java 2009-06-11 22:20:54 UTC (rev 2737) @@ -27,8 +27,6 @@ import java.io.IOException; import java.util.Iterator; -import sun.reflect.generics.reflectiveObjects.NotImplementedException; - /** * * @@ -90,6 +88,6 @@ * @see java.util.Iterator#remove() */ public void remove() { - throw new NotImplementedException(); + throw new UnsupportedOperationException(); } } This was sent by the SourceForge.net 
collaborative development platform, the world's largest Open Source development site. |
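Revision 2737 above replaces the JDK-internal sun.reflect NotImplementedException with the standard java.lang.UnsupportedOperationException in read-only iterators. A minimal sketch of the convention (this wrapper class is illustrative, not Wayback code):

```java
import java.util.Iterator;
import java.util.List;

// Read-only iterator wrapper illustrating the fix in r2737: remove()
// throws the standard java.lang.UnsupportedOperationException rather
// than the JDK-internal NotImplementedException.
public class ReadOnlyIterator<T> implements Iterator<T> {
    private final Iterator<T> inner;

    public ReadOnlyIterator(Iterator<T> inner) { this.inner = inner; }

    public boolean hasNext() { return inner.hasNext(); }
    public T next()          { return inner.next(); }
    public void remove()     { throw new UnsupportedOperationException(); }

    public static void main(String[] args) {
        Iterator<String> it = new ReadOnlyIterator<>(List.of("a", "b").iterator());
        while (it.hasNext()) System.out.println(it.next());
    }
}
```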
From: <bra...@us...> - 2009-06-09 22:48:10
Revision: 2736 http://archive-access.svn.sourceforge.net/archive-access/?rev=2736&view=rev Author: bradtofel Date: 2009-06-09 22:48:09 +0000 (Tue, 09 Jun 2009) Log Message: ----------- BUGFIX: Conditional GET SearchResult Annotater was indication duplicate type was due to Digest match. Added support for HTTP-Duplicate to CaptureSearchResult, and now the ConditionalGetAnnotationSearchResultAdapter uses these methods to indicate the correct type of duplicate record. Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/CaptureSearchResult.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/CaptureSearchResult.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/CaptureSearchResult.java 2009-06-09 21:20:22 UTC (rev 2735) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/CaptureSearchResult.java 2009-06-09 22:48:09 UTC (rev 2736) @@ -203,6 +203,7 @@ public void setClosest(boolean value) { putBoolean(CAPTURE_CLOSEST_INDICATOR,value); } + public void flagDuplicateDigest(Date storedDate) { put(CAPTURE_DUPLICATE_ANNOTATION,CAPTURE_DUPLICATE_DIGEST); put(CAPTURE_DUPLICATE_STORED_TS,dateToTS(storedDate)); @@ -216,19 +217,40 @@ return (dupeType != null && dupeType.equals(CAPTURE_DUPLICATE_DIGEST)); } public Date getDuplicateDigestStoredDate() { - String dupeType = get(CAPTURE_DUPLICATE_ANNOTATION); - Date date = null; - if(dupeType != null && dupeType.equals(CAPTURE_DUPLICATE_DIGEST)) { - date = tsToDate(get(CAPTURE_DUPLICATE_STORED_TS)); + if(isDuplicateDigest()) { + return tsToDate(get(CAPTURE_DUPLICATE_STORED_TS)); } - return date; + return null; 
} public String getDuplicateDigestStoredTimestamp() { + if(isDuplicateDigest()) { + return get(CAPTURE_DUPLICATE_STORED_TS); + } + return null; + } + + public void flagDuplicateHTTP(Date storedDate) { + put(CAPTURE_DUPLICATE_ANNOTATION,CAPTURE_DUPLICATE_HTTP); + put(CAPTURE_DUPLICATE_STORED_TS,dateToTS(storedDate)); + } + public void flagDuplicateHTTP(String storedTS) { + put(CAPTURE_DUPLICATE_ANNOTATION,CAPTURE_DUPLICATE_HTTP); + put(CAPTURE_DUPLICATE_STORED_TS,storedTS); + } + public boolean isDuplicateHTTP() { String dupeType = get(CAPTURE_DUPLICATE_ANNOTATION); - String ts = null; - if(dupeType != null && dupeType.equals(CAPTURE_DUPLICATE_DIGEST)) { - ts = get(CAPTURE_DUPLICATE_STORED_TS); + return (dupeType != null && dupeType.equals(CAPTURE_DUPLICATE_HTTP)); + } + public Date getDuplicateHTTPStoredDate() { + if(isDuplicateHTTP()) { + return tsToDate(get(CAPTURE_DUPLICATE_STORED_TS)); } - return ts; + return null; } + public String getDuplicateHTTPStoredTimestamp() { + if(isDuplicateHTTP()) { + return get(CAPTURE_DUPLICATE_STORED_TS); + } + return null; + } } Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java 2009-06-09 21:20:22 UTC (rev 2735) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java 2009-06-09 22:48:09 UTC (rev 2736) @@ -78,7 +78,7 @@ o.setHttpCode(lastSeen.getHttpCode()); o.setMimeType(lastSeen.getMimeType()); o.setRedirectUrl(lastSeen.getRedirectUrl()); - o.flagDuplicateDigest(lastSeen.getCaptureTimestamp()); + o.flagDuplicateHTTP(lastSeen.getCaptureTimestamp()); return o; } This was sent by the 
SourceForge.net collaborative development platform, the world's largest Open Source development site. |
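The r2736 bugfix above distinguishes digest-based duplicates from HTTP conditional-GET duplicates by the annotation value stored on the result. A sketch of that flag/check pair (the real logic lives in CaptureSearchResult's field map; the constant values below are placeholders, not the actual ones):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the duplicate-type annotation from r2736. The annotation
// field records *why* a capture is a duplicate ("digest" vs "http"),
// and the stored timestamp is only meaningful for the matching type.
// Constant values are assumptions for illustration.
public class DupFlagsDemo {
    static final String ANNOTATION = "duplicate-annotation";
    static final String STORED_TS  = "duplicate-stored-ts";
    static final String DIGEST     = "digest";
    static final String HTTP       = "http";

    private final Map<String, String> fields = new HashMap<>();

    void flagDuplicateDigest(String ts) { fields.put(ANNOTATION, DIGEST); fields.put(STORED_TS, ts); }
    void flagDuplicateHTTP(String ts)   { fields.put(ANNOTATION, HTTP);   fields.put(STORED_TS, ts); }

    boolean isDuplicateDigest() { return DIGEST.equals(fields.get(ANNOTATION)); }
    boolean isDuplicateHTTP()   { return HTTP.equals(fields.get(ANNOTATION)); }

    String getDuplicateHTTPStoredTimestamp() {
        return isDuplicateHTTP() ? fields.get(STORED_TS) : null;
    }

    public static void main(String[] args) {
        DupFlagsDemo r = new DupFlagsDemo();
        r.flagDuplicateHTTP("20090609212022");
        System.out.println(r.isDuplicateHTTP() + " " + r.getDuplicateHTTPStoredTimestamp());
        // prints "true 20090609212022"
    }
}
```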
Revision: 2735 http://archive-access.svn.sourceforge.net/archive-access/?rev=2735&view=rev Author: bradtofel Date: 2009-06-09 21:20:22 +0000 (Tue, 09 Jun 2009) Log Message: ----------- TWEAK: removed unused import. Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java 2009-06-09 21:18:20 UTC (rev 2734) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java 2009-06-09 21:20:22 UTC (rev 2735) @@ -24,8 +24,6 @@ */ package org.archive.wayback.resourceindex.adapters; -import java.util.HashMap; - import org.archive.wayback.core.CaptureSearchResult; import org.archive.wayback.util.Adapter; This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
Revision: 2734 http://archive-access.svn.sourceforge.net/archive-access/?rev=2734&view=rev Author: bradtofel Date: 2009-06-09 21:18:20 +0000 (Tue, 09 Jun 2009) Log Message: ----------- FEATURE: added ConditionalGET annotation capability. Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java 2009-06-09 21:12:27 UTC (rev 2733) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java 2009-06-09 21:18:20 UTC (rev 2734) @@ -28,8 +28,6 @@ import java.util.Iterator; import org.apache.commons.httpclient.URIException; -import org.archive.net.UURI; -import org.archive.net.UURIFactory; import org.archive.wayback.ResourceIndex; import org.archive.wayback.UrlCanonicalizer; import org.archive.wayback.core.CaptureSearchResult; @@ -43,12 +41,12 @@ import org.archive.wayback.exception.BadQueryException; import org.archive.wayback.exception.ResourceIndexNotAvailableException; import org.archive.wayback.exception.ResourceNotInArchiveException; +import org.archive.wayback.resourceindex.adapters.ConditionalGetAnnotationSearchResultAdapter; import org.archive.wayback.resourceindex.adapters.CaptureToUrlSearchResultAdapter; import org.archive.wayback.resourceindex.adapters.DeduplicationSearchResultAnnotationAdapter; import org.archive.wayback.resourceindex.filters.CounterFilter; import org.archive.wayback.resourceindex.filters.DateRangeFilter; import org.archive.wayback.resourceindex.filters.DuplicateRecordFilter; -import org.archive.wayback.resourceindex.filters.EndDateFilter; import 
org.archive.wayback.resourceindex.filters.GuardRailFilter; import org.archive.wayback.resourceindex.filters.HostMatchFilter; import org.archive.wayback.resourceindex.filters.SchemeMatchFilter; @@ -101,7 +99,10 @@ CloseableIterator<CaptureSearchResult> captures = source.getPrefixIterator(k); if(dedupeRecords) { + // hack hack!!! captures = new AdaptedIterator<CaptureSearchResult, CaptureSearchResult> + (captures, new ConditionalGetAnnotationSearchResultAdapter()); + captures = new AdaptedIterator<CaptureSearchResult, CaptureSearchResult> (captures, new DeduplicationSearchResultAnnotationAdapter()); } return captures; @@ -126,14 +127,15 @@ CaptureSearchResults results = new CaptureSearchResults(); CaptureQueryFilterState filterState = - new CaptureQueryFilterState(wbRequest,canonicalizer, type, filter); + new CaptureQueryFilterState(wbRequest, canonicalizer, type, + getUserFilters(wbRequest)); String keyUrl = filterState.getKeyUrl(); CloseableIterator<CaptureSearchResult> itr = getCaptureIterator(keyUrl); // set up the common Filters: ObjectFilter<CaptureSearchResult> filter = filterState.getFilter(); itr = new ObjectFilterIterator<CaptureSearchResult>(itr,filter); - + // Windowing: WindowFilterState<CaptureSearchResult> window = new WindowFilterState<CaptureSearchResult>(wbRequest); @@ -154,6 +156,7 @@ cleanupIterator(itr); return results; } + public UrlSearchResults doUrlQuery(WaybackRequest wbRequest) throws ResourceIndexNotAvailableException, ResourceNotInArchiveException, BadQueryException, @@ -163,7 +166,7 @@ CaptureQueryFilterState filterState = new CaptureQueryFilterState(wbRequest,canonicalizer, - CaptureQueryFilterState.TYPE_URL, filter); + CaptureQueryFilterState.TYPE_URL, getUserFilters(wbRequest)); String keyUrl = filterState.getKeyUrl(); CloseableIterator<CaptureSearchResult> citr = getCaptureIterator(keyUrl); @@ -300,6 +303,27 @@ this.filter = filter; } + public ObjectFilterChain<CaptureSearchResult> getUserFilters(WaybackRequest request) { + 
ObjectFilterChain<CaptureSearchResult> userFilters = + new ObjectFilterChain<CaptureSearchResult>(); + + // has the user asked for only results on the exact host specified? + if(request.isExactHost()) { + userFilters.addFilter(new HostMatchFilter( + UrlOperations.urlToHost(request.getRequestUrl()))); + } + + if(request.isExactScheme()) { + userFilters.addFilter(new SchemeMatchFilter( + UrlOperations.urlToScheme(request.getRequestUrl()))); + } + if(filter != null) { + userFilters.addFilter(filter); + } + + return userFilters; + } + private class CaptureQueryFilterState { public final static int TYPE_REPLAY = 0; public final static int TYPE_CAPTURE = 1; @@ -315,7 +339,7 @@ public CaptureQueryFilterState(WaybackRequest request, UrlCanonicalizer canonicalizer, int type, - ObjectFilter<CaptureSearchResult> genericFilter) + ObjectFilterChain<CaptureSearchResult> userFilter) throws BadQueryException { String searchUrl = request.getRequestUrl(); @@ -346,12 +370,6 @@ preExclusionCounter = new CounterFilter(); DateRangeFilter drFilter = new DateRangeFilter(startDate,endDate); - if(genericFilter != null) { - filter.addFilter(genericFilter); - } - // has the user asked for only results on the exact host specified? 
- ObjectFilter<CaptureSearchResult> exactHost = - getExactHostFilter(request); // checks an exclusion service for every matching record ObjectFilter<CaptureSearchResult> exclusion = request.getExclusionFilter(); @@ -363,7 +381,7 @@ if(type == TYPE_REPLAY) { filter.addFilter(new UrlMatchFilter(keyUrl)); - filter.addFilter(new EndDateFilter(endDate)); + filter.addFilter(drFilter); SelfRedirectFilter selfRedirectFilter= new SelfRedirectFilter(); selfRedirectFilter.setCanonicalizer(canonicalizer); filter.addFilter(selfRedirectFilter); @@ -377,14 +395,10 @@ throw new BadQueryException("Unknown type"); } - if(exactHost != null) { - filter.addFilter(exactHost); + if(userFilter != null) { + filter.addFilters(userFilter.getFilters()); } - if(request.isExactScheme()) { - filter.addFilter(new SchemeMatchFilter( - UrlOperations.urlToScheme(request.getRequestUrl()))); - } // count how many results got to the ExclusionFilter: filter.addFilter(preExclusionCounter); @@ -425,26 +439,6 @@ } } - private static HostMatchFilter getExactHostFilter(WaybackRequest r) { - - HostMatchFilter filter = null; - if(r.isExactHost()) { - - String searchUrl = r.getRequestUrl(); - try { - - UURI searchURI = UURIFactory.getInstance(searchUrl); - String exactHost = searchURI.getHost(); - filter = new HostMatchFilter(exactHost); - - } catch (URIException e) { - // Really, this isn't gonna happen, we've already canonicalized - // it... should really optimize and do that just once. - e.printStackTrace(); - } - } - return filter; - } private class WindowFilterState<T> { int startResult; // calculated based on hits/page * pagenum int resultsPerPage; This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
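The refactoring in r2734 above replaces the ad-hoc exact-host and exact-scheme checks with a single user-filter chain built by getUserFilters(). A minimal, self-contained sketch of that filter-chain pattern follows; the Filter/FilterChain types here are simplified stand-ins for Wayback's ObjectFilter/ObjectFilterChain (the real interfaces also support an ABORT signal and per-filter counters):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: AND-composition of per-request filters, as in getUserFilters().
public class FilterChainSketch {
    interface Filter<T> { boolean include(T item); }

    static class FilterChain<T> implements Filter<T> {
        private final List<Filter<T>> filters = new ArrayList<>();
        void addFilter(Filter<T> f) { filters.add(f); }
        public boolean include(T item) {
            for (Filter<T> f : filters) {
                if (!f.include(item)) return false; // first EXCLUDE wins
            }
            return true;
        }
    }

    // Analogous to HostMatchFilter: keep only captures from one host.
    static Filter<String> hostFilter(String host) {
        return url -> url.contains("://" + host + "/");
    }

    // Analogous to SchemeMatchFilter: keep only captures with one scheme.
    static Filter<String> schemeFilter(String scheme) {
        return url -> url.startsWith(scheme + "://");
    }

    public static void main(String[] args) {
        FilterChain<String> userFilters = new FilterChain<>();
        userFilters.addFilter(hostFilter("example.org"));
        userFilters.addFilter(schemeFilter("http"));

        System.out.println(userFilters.include("http://example.org/page"));  // true
        System.out.println(userFilters.include("https://example.org/page")); // false
        System.out.println(userFilters.include("http://other.org/page"));    // false
    }
}
```

Building the chain once per request keeps doCaptureQuery and doUrlQuery from duplicating the host/scheme logic, which is the point of the diff above.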
Revision: 2733
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2733&view=rev
Author:   bradtofel
Date:     2009-06-09 21:12:27 +0000 (Tue, 09 Jun 2009)

Log Message:
-----------
INITIAL REV: class to annotate 304-dedupe WARC records with the values from the previous stored capture.

Added Paths:
-----------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java

Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java                        (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java 2009-06-09 21:12:27 UTC (rev 2733)
@@ -0,0 +1,101 @@
+/* ConditionalGetAnnotationSearchResultAdapter
+ *
+ * $Id$
+ *
+ * Created on 6:09:05 PM Mar 12, 2009.
+ *
+ * Copyright (C) 2009 Internet Archive.
+ *
+ * This file is part of wayback.
+ *
+ * wayback is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or
+ * any later version.
+ *
+ * wayback is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU Lesser Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser Public License
+ * along with wayback; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+package org.archive.wayback.resourceindex.adapters;
+
+import java.util.HashMap;
+
+import org.archive.wayback.core.CaptureSearchResult;
+import org.archive.wayback.util.Adapter;
+
+/**
+ * WARC file allows 2 forms of deduplication. The first actually downloads
+ * documents and compares their digest with a database of previous values. When
+ * a new capture of a document exactly matches the previous digest, an
+ * abbreviated record is stored in the WARC file. The second form uses an HTTP
+ * conditional GET request, sending previous values returned for a given URL
+ * (etag, last-modified, etc). In this case, the remote server either sends a
+ * new document (200) which is stored normally, or the server will return a
+ * 304 (Not Modified) response, which is stored in the WARC file.
+ *
+ * For the first record type, the wayback indexer will output a placeholder
+ * record that includes the digest of the last-stored record. For 304 responses,
+ * the indexer outputs a normal looking record, but the record will have a
+ * SHA1 digest which is easily distinguishable as an "empty" document. The SHA1
+ * is always:
+ *
+ *   3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
+ *
+ * This class will observe a stream of SearchResults, storing the values for
+ * the last seen non-empty SHA1 field. Any subsequent SearchResults with an
+ * empty SHA1 will be annotated, copying the values from the last non-empty
+ * record.
+ *
+ * This is highly experimental.
+ *
+ * @author brad
+ * @version $Date$, $Revision$
+ */
+
+public class ConditionalGetAnnotationSearchResultAdapter
+implements Adapter<CaptureSearchResult,CaptureSearchResult> {
+
+    private final static String EMPTY_VALUE = "-";
+    private final static String EMPTY_SHA1 = "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ";
+
+    private CaptureSearchResult lastSeen = null;
+
+    public ConditionalGetAnnotationSearchResultAdapter() {
+    }
+
+    private CaptureSearchResult annotate(CaptureSearchResult o) {
+        if(lastSeen == null) {
+            // TODO: log missing record digest reference
+            return null;
+        }
+        o.setFile(lastSeen.getFile());
+        o.setOffset(lastSeen.getOffset());
+        o.setDigest(lastSeen.getDigest());
+        o.setHttpCode(lastSeen.getHttpCode());
+        o.setMimeType(lastSeen.getMimeType());
+        o.setRedirectUrl(lastSeen.getRedirectUrl());
+        o.flagDuplicateDigest(lastSeen.getCaptureTimestamp());
+        return o;
+    }
+
+    private CaptureSearchResult remember(CaptureSearchResult o) {
+        lastSeen = o;
+        return o;
+    }
+
+    public CaptureSearchResult adapt(CaptureSearchResult o) {
+        if(o.getFile().equals(EMPTY_VALUE)) {
+            if(o.getDigest().equals(EMPTY_SHA1)) {
+                return annotate(o);
+            }
+            return o;
+        }
+        return remember(o);
+    }
+}
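The adapter added in r2733 can be illustrated with a toy model. In this sketch, Capture is a simplified, hypothetical stand-in for CaptureSearchResult (only two of the copied fields are modeled); it shows the stream behavior the javadoc describes, with a 304 placeholder back-filled from the most recent stored capture:

```java
// Sketch: back-fill a 304 placeholder record (file "-", "empty document"
// SHA1) from the last fully stored capture seen in the stream.
public class ConditionalGetSketch {
    static final String EMPTY_VALUE = "-";
    static final String EMPTY_SHA1 = "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ";

    static class Capture {
        String file, digest, timestamp;
        Capture(String file, String digest, String timestamp) {
            this.file = file;
            this.digest = digest;
            this.timestamp = timestamp;
        }
    }

    private static Capture lastSeen = null;

    static Capture adapt(Capture c) {
        if (c.file.equals(EMPTY_VALUE) && c.digest.equals(EMPTY_SHA1)) {
            if (lastSeen == null) {
                return null;             // no earlier capture to copy from
            }
            c.file = lastSeen.file;      // point replay at the stored copy
            c.digest = lastSeen.digest;  // report the real document digest
            return c;
        }
        lastSeen = c;                    // remember the last stored capture
        return c;
    }

    public static void main(String[] args) {
        Capture stored = new Capture("IA-0001.warc.gz", "SOMEREALDIGEST", "20090101000000");
        Capture dedup  = new Capture(EMPTY_VALUE, EMPTY_SHA1, "20090201000000");
        adapt(stored);
        Capture annotated = adapt(dedup);
        System.out.println(annotated.file + " " + annotated.digest);
        // -> IA-0001.warc.gz SOMEREALDIGEST
    }
}
```

The file name here ("IA-0001.warc.gz") and field set are illustrative only; the real adapter also copies offset, HTTP code, mime type, and redirect URL, and flags the duplicate digest with the original capture timestamp.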
Revision: 2732 http://archive-access.svn.sourceforge.net/archive-access/?rev=2732&view=rev Author: binzino Date: 2009-06-04 19:06:37 +0000 (Thu, 04 Jun 2009) Log Message: ----------- We have our own OpenSearchServlet in the org.archive.nutchwax package, so we no longer need to keep a patched version. Removed Paths: ------------- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java Deleted: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java 2009-06-04 18:02:50 UTC (rev 2731) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java 2009-06-04 19:06:37 UTC (rev 2732) @@ -1,333 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.nutch.searcher; - -import java.io.IOException; -import java.net.URLEncoder; -import java.util.Map; -import java.util.HashMap; -import java.util.Set; -import java.util.HashSet; - -import javax.servlet.ServletException; -import javax.servlet.ServletConfig; -import javax.servlet.http.HttpServlet; -import javax.servlet.http.HttpServletRequest; -import javax.servlet.http.HttpServletResponse; - -import javax.xml.parsers.*; - -import org.apache.hadoop.conf.Configuration; -import org.apache.nutch.util.NutchConfiguration; -import org.w3c.dom.*; -import javax.xml.transform.TransformerFactory; -import javax.xml.transform.Transformer; -import javax.xml.transform.dom.DOMSource; -import javax.xml.transform.stream.StreamResult; - - -/** Present search results using A9's OpenSearch extensions to RSS, plus a few - * Nutch-specific extensions. */ -public class OpenSearchServlet extends HttpServlet { - private static final Map NS_MAP = new HashMap(); - private int MAX_HITS_PER_PAGE; - - static { - NS_MAP.put("opensearch", "http://a9.com/-/spec/opensearchrss/1.0/"); - NS_MAP.put("nutch", "http://www.nutch.org/opensearchrss/1.0/"); - } - - private static final Set SKIP_DETAILS = new HashSet(); - static { - SKIP_DETAILS.add("url"); // redundant with RSS link - SKIP_DETAILS.add("title"); // redundant with RSS title - } - - private NutchBean bean; - private Configuration conf; - - public void init(ServletConfig config) throws ServletException { - try { - this.conf = NutchConfiguration.get(config.getServletContext()); - bean = NutchBean.get(config.getServletContext(), this.conf); - } catch (IOException e) { - throw new ServletException(e); - } - MAX_HITS_PER_PAGE = conf.getInt("searcher.max.hits.per.page", -1); - } - - public void doGet(HttpServletRequest request, HttpServletResponse response) - throws ServletException, IOException { - - if (NutchBean.LOG.isInfoEnabled()) { - NutchBean.LOG.info("query request from " + request.getRemoteAddr()); - } - - // get 
parameters from request - request.setCharacterEncoding("UTF-8"); - String queryString = request.getParameter("query"); - if (queryString == null) - queryString = ""; - String urlQuery = URLEncoder.encode(queryString, "UTF-8"); - - // the query language - String queryLang = request.getParameter("lang"); - - int start = 0; // first hit to display - String startString = request.getParameter("start"); - if (startString != null) - start = Integer.parseInt(startString); - - int hitsPerPage = 10; // number of hits to display - String hitsString = request.getParameter("hitsPerPage"); - if (hitsString != null) - hitsPerPage = Integer.parseInt(hitsString); - if(MAX_HITS_PER_PAGE > 0 && hitsPerPage > MAX_HITS_PER_PAGE) - hitsPerPage = MAX_HITS_PER_PAGE; - - String sort = request.getParameter("sort"); - boolean reverse = - sort!=null && "true".equals(request.getParameter("reverse")); - - // De-Duplicate handling. Look for duplicates field and for how many - // duplicates per results to return. Default duplicates field is 'site' - // and duplicates per results default is '2'. - String dedupField = request.getParameter("dedupField"); - if (dedupField == null || dedupField.length() == 0) { - dedupField = "site"; - } - int hitsPerDup = 2; - String hitsPerDupString = request.getParameter("hitsPerDup"); - if (hitsPerDupString != null && hitsPerDupString.length() > 0) { - hitsPerDup = Integer.parseInt(hitsPerDupString); - } else { - // If 'hitsPerSite' present, use that value. - String hitsPerSiteString = request.getParameter("hitsPerSite"); - if (hitsPerSiteString != null && hitsPerSiteString.length() > 0) { - hitsPerDup = Integer.parseInt(hitsPerSiteString); - } - } - - // Make up query string for use later drawing the 'rss' logo. - String params = "&hitsPerPage=" + hitsPerPage + - (queryLang == null ? "" : "&lang=" + queryLang) + - (sort == null ? "" : "&sort=" + sort + (reverse? "&reverse=true": "") + - (dedupField == null ? 
"" : "&dedupField=" + dedupField)); - - Query query = Query.parse(queryString, queryLang, this.conf); - if (NutchBean.LOG.isInfoEnabled()) { - NutchBean.LOG.info("query: " + queryString); - NutchBean.LOG.info("lang: " + queryLang); - } - - // execute the query - Hits hits; - try { - hits = bean.search(query, start + hitsPerPage, hitsPerDup, dedupField, - sort, reverse); - } catch (IOException e) { - if (NutchBean.LOG.isWarnEnabled()) { - NutchBean.LOG.warn("Search Error", e); - } - hits = new Hits(0,new Hit[0]); - } - - if (NutchBean.LOG.isInfoEnabled()) { - NutchBean.LOG.info("total hits: " + hits.getTotal()); - } - - // generate xml results - int end = (int)Math.min(hits.getLength(), start + hitsPerPage); - int length = end-start; - - Hit[] show = hits.getHits(start, end-start); - HitDetails[] details = bean.getDetails(show); - Summary[] summaries = bean.getSummary(details, query); - - String requestUrl = request.getRequestURL().toString(); - String base = requestUrl.substring(0, requestUrl.lastIndexOf('/')); - - - try { - DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); - factory.setNamespaceAware(true); - Document doc = factory.newDocumentBuilder().newDocument(); - - Element rss = addNode(doc, doc, "rss"); - addAttribute(doc, rss, "version", "2.0"); - addAttribute(doc, rss, "xmlns:opensearch", - (String)NS_MAP.get("opensearch")); - addAttribute(doc, rss, "xmlns:nutch", (String)NS_MAP.get("nutch")); - - Element channel = addNode(doc, rss, "channel"); - - addNode(doc, channel, "title", "Nutch: " + queryString); - addNode(doc, channel, "description", "Nutch search results for query: " - + queryString); - addNode(doc, channel, "link", - base+"/search.jsp" - +"?query="+urlQuery - +"&start="+start - +"&hitsPerDup="+hitsPerDup - +params); - - addNode(doc, channel, "opensearch", "totalResults", ""+hits.getTotal()); - addNode(doc, channel, "opensearch", "startIndex", ""+start); - addNode(doc, channel, "opensearch", "itemsPerPage", ""+hitsPerPage); 
- - addNode(doc, channel, "nutch", "query", queryString); - - - if ((hits.totalIsExact() && end < hits.getTotal()) // more hits to show - || (!hits.totalIsExact() && (hits.getLength() > start+hitsPerPage))){ - addNode(doc, channel, "nutch", "nextPage", requestUrl - +"?query="+urlQuery - +"&start="+end - +"&hitsPerDup="+hitsPerDup - +params); - } - - if ((!hits.totalIsExact() && (hits.getLength() <= start+hitsPerPage))) { - addNode(doc, channel, "nutch", "showAllHits", requestUrl - +"?query="+urlQuery - +"&hitsPerDup="+0 - +params); - } - - for (int i = 0; i < length; i++) { - Hit hit = show[i]; - HitDetails detail = details[i]; - String title = detail.getValue("title"); - String url = detail.getValue("url"); - String id = "idx=" + hit.getIndexNo() + "&id=" + hit.getIndexDocNo(); - - if (title == null || title.equals("")) { // use url for docs w/o title - title = url; - } - - Element item = addNode(doc, channel, "item"); - - addNode(doc, item, "title", title); - if (summaries[i] != null) { - addNode(doc, item, "description", summaries[i].toString() ); - } - addNode(doc, item, "link", url); - - addNode(doc, item, "nutch", "site", hit.getDedupValue()); - - addNode(doc, item, "nutch", "cache", base+"/cached.jsp?"+id); - addNode(doc, item, "nutch", "explain", base+"/explain.jsp?"+id - +"&query="+urlQuery+"&lang="+queryLang); - - if (hit.moreFromDupExcluded()) { - addNode(doc, item, "nutch", "moreFromSite", requestUrl - +"?query=" - +URLEncoder.encode("site:"+hit.getDedupValue() - +" "+queryString, "UTF-8") - +"&hitsPerSite="+0 - +params); - } - - for (int j = 0; j < detail.getLength(); j++) { // add all from detail - String field = detail.getField(j); - if (!SKIP_DETAILS.contains(field)) - addNode(doc, item, "nutch", field, detail.getValue(j)); - } - } - - // dump DOM tree - - DOMSource source = new DOMSource(doc); - TransformerFactory transFactory = TransformerFactory.newInstance(); - Transformer transformer = transFactory.newTransformer(); - 
transformer.setOutputProperty("indent", "yes"); - StreamResult result = new StreamResult(response.getOutputStream()); - response.setContentType("text/xml"); - transformer.transform(source, result); - - } catch (javax.xml.parsers.ParserConfigurationException e) { - throw new ServletException(e); - } catch (javax.xml.transform.TransformerException e) { - throw new ServletException(e); - } - - } - - private static Element addNode(Document doc, Node parent, String name) { - Element child = doc.createElement(name); - parent.appendChild(child); - return child; - } - - private static void addNode(Document doc, Node parent, - String name, String text) { - Element child = doc.createElement(name); - child.appendChild(doc.createTextNode(getLegalXml(text))); - parent.appendChild(child); - } - - private static void addNode(Document doc, Node parent, - String ns, String name, String text) { - Element child = doc.createElementNS((String)NS_MAP.get(ns), ns+":"+name); - child.appendChild(doc.createTextNode(getLegalXml(text))); - parent.appendChild(child); - } - - private static void addAttribute(Document doc, Element node, - String name, String value) { - Attr attribute = doc.createAttribute(name); - attribute.setValue(getLegalXml(value)); - node.getAttributes().setNamedItem(attribute); - } - - /* - * Ensure string is legal xml. - * @param text String to verify. - * @return Passed <code>text</code> or a new string with illegal - * characters removed if any found in <code>text</code>. - * @see http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char - */ - protected static String getLegalXml(final String text) { - if (text == null) { - return null; - } - StringBuffer buffer = null; - for (int i = 0; i < text.length(); i++) { - char c = text.charAt(i); - if (!isLegalXml(c)) { - if (buffer == null) { - // Start up a buffer. Copy characters here from now on - // now we've found at least one bad character in original. 
- buffer = new StringBuffer(text.length()); - buffer.append(text.substring(0, i)); - } - } else { - if (buffer != null) { - buffer.append(c); - } - } - } - return (buffer != null)? buffer.toString(): text; - } - - private static boolean isLegalXml(final char c) { - return c == 0x9 || c == 0xa || c == 0xd || (c >= 0x20 && c <= 0xd7ff) - || (c >= 0xe000 && c <= 0xfffd) || (c >= 0x10000 && c <= 0x10ffff); - } - -} This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
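The deleted servlet's getLegalXml/isLegalXml helpers encode the XML 1.0 Char production: any character outside those ranges is stripped before a text node is created. A standalone variant of that logic follows; note it iterates code points rather than chars (an assumption on my part, so supplementary-plane characters survive intact, a case the original's char-based loop could not actually reach):

```java
// Sketch: drop characters that are illegal in XML 1.0 text content.
public class LegalXmlSketch {
    static boolean isLegalXml(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || (c >= 0x10000 && c <= 0x10FFFF);
    }

    static String getLegalXml(String text) {
        if (text == null) return null;
        StringBuilder sb = new StringBuilder(text.length());
        // Filter code point by code point, keeping only legal XML chars.
        text.codePoints().filter(LegalXmlSketch::isLegalXml)
                         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0008 (backspace) is illegal in XML 1.0 and gets removed.
        System.out.println(getLegalXml("ok\u0008ay")); // -> okay
    }
}
```

This kind of scrub matters for crawled content: summaries extracted from arbitrary web pages can contain control characters that would make the generated RSS unparseable.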
From: <bi...@us...> - 2009-06-04 18:02:56

Revision: 2731 http://archive-access.svn.sourceforge.net/archive-access/?rev=2731&view=rev Author: binzino Date: 2009-06-04 18:02:50 +0000 (Thu, 04 Jun 2009) Log Message: ----------- Nutch 1.0 fixed their tika-mimetypes.xml, so we no longer need this patched/fixed version. Removed Paths: ------------- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/tika-mimetypes.xml Deleted: trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/tika-mimetypes.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/tika-mimetypes.xml 2009-05-20 02:55:09 UTC (rev 2730) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/tika-mimetypes.xml 2009-06-04 18:02:50 UTC (rev 2731) @@ -1,364 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!-- - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to You under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. - - Description: This xml file defines the valid mime types used by Tika. - The mime types within this file are based on the types in the mime-types.xml - file available in Apache Nutch. 
---> - -<mime-info> - - <mime-type type="text/plain"> - <magic priority="50"> - <match value="This is TeX," type="string" offset="0" /> - <match value="This is METAFONT," type="string" offset="0" /> - </magic> - <glob pattern="*.txt" /> - <glob pattern="*.asc" /> - </mime-type> - - <mime-type type="text/html"> - <magic priority="50"> - <match value="<!DOCTYPE HTML" type="string" - offset="0:64" /> - <match value="<!doctype html" type="string" - offset="0:64" /> - <match value="<HEAD" type="string" offset="0:64" /> - <match value="<head" type="string" offset="0:64" /> - <match value="<TITLE" type="string" offset="0:64" /> - <match value="<title" type="string" offset="0:64" /> - <match value="<html" type="string" offset="0:64" /> - <match value="<HTML" type="string" offset="0:64" /> - <match value="<BODY" type="string" offset="0" /> - <match value="<body" type="string" offset="0" /> - <match value="<TITLE" type="string" offset="0" /> - <match value="<title" type="string" offset="0" /> - <match value="<!--" type="string" offset="0" /> - <match value="<h1" type="string" offset="0" /> - <match value="<H1" type="string" offset="0" /> - <match value="<!doctype HTML" type="string" offset="0" /> - <match value="<!DOCTYPE html" type="string" offset="0" /> - </magic> - <glob pattern="*.html" /> - <glob pattern="*.htm" /> - </mime-type> - - <mime-type type="application/xml"> - <alias type="text/xml" /> - <glob pattern="*.xml" /> - </mime-type> - - <mime-type type="application/xhtml+xml"> - <sub-class-of type="text/xml" /> - <glob pattern="*.xhtml" /> - <root-XML namespaceURI='http://www.w3.org/1999/xhtml' - localName='html' /> - </mime-type> - - <mime-type type="application/vnd.ms-powerpoint"> - <glob pattern="*.ppz" /> - <glob pattern="*.ppt" /> - <glob pattern="*.pps" /> - <glob pattern="*.pot" /> - <magic priority="50"> - <match value="0xcfd0e011" type="little32" offset="0" /> - </magic> - </mime-type> - - <mime-type type="application/vnd.ms-excel"> - <magic priority="50"> 
- <match value="Microsoft Excel 5.0 Worksheet" type="string" - offset="2080" /> - </magic> - <glob pattern="*.xls" /> - <glob pattern="*.xlc" /> - <glob pattern="*.xll" /> - <glob pattern="*.xlm" /> - <glob pattern="*.xlw" /> - <glob pattern="*.xla" /> - <glob pattern="*.xlt" /> - <glob pattern="*.xld" /> - <alias type="application/msexcel" /> - </mime-type> - - <mime-type type="application/vnd.oasis.opendocument.text"> - <glob pattern="*.odt" /> - </mime-type> - - - <mime-type type="application/zip"> - <alias type="application/x-zip-compressed" /> - <magic priority="40"> - <match value="PK\003\004" type="string" offset="0" /> - </magic> - <glob pattern="*.zip" /> - </mime-type> - - <mime-type type="application/vnd.oasis.opendocument.text"> - <glob pattern="*.oth" /> - </mime-type> - - <mime-type type="application/msword"> - <magic priority="50"> - <match value="\x31\xbe\x00\x00" type="string" offset="0" /> - <match value="PO^Q`" type="string" offset="0" /> - <match value="\376\067\0\043" type="string" offset="0" /> - <match value="\333\245-\0\0\0" type="string" offset="0" /> - <match value="Microsoft Word 6.0 Document" type="string" - offset="2080" /> - <match value="Microsoft Word document data" type="string" - offset="2112" /> - </magic> - <glob pattern="*.doc" /> - <alias type="application/vnd.ms-word" /> - </mime-type> - - <mime-type type="application/octet-stream"> - <magic priority="50"> - <match value="\037\036" type="string" offset="0" /> - <match value="017437" type="host16" offset="0" /> - <match value="0x1fff" type="host16" offset="0" /> - <match value="\377\037" type="string" offset="0" /> - <match value="0145405" type="host16" offset="0" /> - </magic> - <glob pattern="*.bin" /> - </mime-type> - - <mime-type type="application/pdf"> - <magic priority="50"> - <match value="%PDF-" type="string" offset="0" /> - </magic> - <glob pattern="*.pdf" /> - <alias type="application/x-pdf" /> - </mime-type> - - <mime-type type="application/atom+xml"> - <root-XML 
localName="feed" - namespaceURI="http://purl.org/atom/ns#" /> - </mime-type> - - <mime-type type="application/mac-binhex40"> - <glob pattern="*.hqx" /> - </mime-type> - - <mime-type type="application/mac-compactpro"> - <glob pattern="*.cpt" /> - </mime-type> - - <mime-type type="application/rtf"> - <glob pattern="*.rtf"/> - <alias type="text/rtf" /> - </mime-type> - - <mime-type type="application/rss+xml"> - <alias type="text/rss" /> - <root-XML localName="rss" /> - <root-XML namespaceURI="http://purl.org/rss/1.0/" /> - <glob pattern="*.rss" /> - </mime-type> - - <!-- added in by mattmann --> - <mime-type type="application/x-mif"> - <alias type="application/vnd.mif" /> - </mime-type> - - <mime-type type="application/vnd.wap.wbxml"> - <glob pattern="*.wbxml" /> - </mime-type> - - <mime-type type="application/vnd.wap.wmlc"> - <_comment>Compiled WML Document</_comment> - <glob pattern="*.wmlc" /> - </mime-type> - - <mime-type type="application/vnd.wap.wmlscriptc"> - <_comment>Compiled WML Script</_comment> - <glob pattern="*.wmlsc" /> - </mime-type> - - <mime-type type="text/vnd.wap.wmlscript"> - <_comment>WML Script</_comment> - <glob pattern="*.wmls" /> - </mime-type> - - <mime-type type="application/x-bzip"> - <alias type="application/x-bzip2" /> - </mime-type> - - <mime-type type="application/x-bzip-compressed-tar"> - <glob pattern="*.tbz" /> - <glob pattern="*.tbz2" /> - </mime-type> - - <mime-type type="application/x-cdlink"> - <_comment>Virtual CD-ROM CD Image File</_comment> - <glob pattern="*.vcd" /> - </mime-type> - - <mime-type type="application/x-director"> - <_comment>Shockwave Movie</_comment> - <glob pattern="*.dcr" /> - <glob pattern="*.dir" /> - <glob pattern="*.dxr" /> - </mime-type> - - <mime-type type="application/x-futuresplash"> - <_comment>Macromedia FutureSplash File</_comment> - <glob pattern="*.spl" /> - </mime-type> - - <mime-type type="application/x-java"> - <alias type="application/java" /> - </mime-type> - - <mime-type 
type="application/x-koan"> - <_comment>SSEYO Koan File</_comment> - <glob pattern="*.skp" /> - <glob pattern="*.skd" /> - <glob pattern="*.skt" /> - <glob pattern="*.skm" /> - </mime-type> - - <mime-type type="application/x-latex"> - <_comment>LaTeX Source Document</_comment> - <glob pattern="*.latex" /> - </mime-type> - - <!-- JC CHANGED - <mime-type type="application/x-mif"> - <_comment>FrameMaker MIF document</_comment> - <glob pattern="*.mif"/> - </mime-type> --> - - <mime-type type="application/ogg"> - <alias type="application/x-ogg" /> - </mime-type> - - <mime-type type="application/x-rar"> - <alias type="application/x-rar-compressed" /> - </mime-type> - - <mime-type type="application/x-shellscript"> - <alias type="application/x-sh" /> - </mime-type> - - <mime-type type="application/xhtml+xml"> - <glob pattern="*.xht" /> - </mime-type> - - <mime-type type="audio/midi"> - <glob pattern="*.kar" /> - </mime-type> - - <mime-type type="audio/x-pn-realaudio"> - <alias type="audio/x-realaudio" /> - </mime-type> - - <mime-type type="image/tiff"> - <magic priority="50"> - <match value="0x4d4d2a00" type="string" offset="0" /> - <match value="0x49492a00" type="string" offset="0" /> - </magic> - </mime-type> - - <mime-type type="message/rfc822"> - <magic priority="50"> - <match type="string" value="Relay-Version:" offset="0" /> - <match type="string" value="#! rnews" offset="0" /> - <match type="string" value="N#! 
rnews" offset="0" /> - <match type="string" value="Forward to" offset="0" /> - <match type="string" value="Pipe to" offset="0" /> - <match type="string" value="Return-Path:" offset="0" /> - <match type="string" value="From:" offset="0" /> - <match type="string" value="Message-ID:" offset="0" /> - <match type="string" value="Date:" offset="0" /> - </magic> - </mime-type> - - <mime-type type="application/x-javascript"> - <glob pattern="*.js" /> - </mime-type> - - - <mime-type type="image/vnd.wap.wbmp"> - <_comment>Wireless Bitmap File Format</_comment> - <glob pattern="*.wbmp" /> - </mime-type> - - <mime-type type="image/x-psd"> - <alias type="image/photoshop" /> - </mime-type> - - <mime-type type="image/x-xcf"> - <alias type="image/xcf" /> - <magic priority="50"> - <match type="string" value="gimp xcf " offset="0" /> - </magic> - </mime-type> - - <mime-type type="application/x-shockwave-flash"> - <glob pattern="*.swf"/> - <magic priority="50"> - <match type="string" value="FWS" offset="0"/> - <match type="string" value="CWS" offset="0"/> - </magic> - </mime-type> - - <mime-type type="model/iges"> - <_comment> - Initial Graphics Exchange Specification Format - </_comment> - <glob pattern="*.igs" /> - <glob pattern="*.iges" /> - </mime-type> - - <mime-type type="model/mesh"> - <glob pattern="*.msh" /> - <glob pattern="*.mesh" /> - <glob pattern="*.silo" /> - </mime-type> - - <mime-type type="model/vrml"> - <glob pattern="*.vrml" /> - </mime-type> - - <mime-type type="text/x-tcl"> - <alias type="application/x-tcl" /> - </mime-type> - - <mime-type type="text/x-tex"> - <alias type="application/x-tex" /> - </mime-type> - - <mime-type type="text/x-texinfo"> - <alias type="application/x-texinfo" /> - </mime-type> - - <mime-type type="text/x-troff-me"> - <alias type="application/x-troff-me" /> - </mime-type> - - <mime-type type="video/vnd.mpegurl"> - <glob pattern="*.mxu" /> - </mime-type> - - <mime-type type="x-conference/x-cooltalk"> - <_comment>Cooltalk Audio</_comment> - 
<glob pattern="*.ice" /> - </mime-type> - -</mime-info> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
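The <magic> and <glob> rules in the deleted tika-mimetypes.xml drive content-type detection: byte patterns at fixed offsets are tried first, with file-extension globs as a fallback. A rough sketch of that matching order (the rule set below is a tiny illustrative subset hard-coded for the sketch, not Tika's actual API or priority handling):

```java
import java.nio.charset.StandardCharsets;

// Sketch: magic-byte sniffing first, extension globs as fallback.
public class MimeSniffSketch {
    // True if data carries exactly the magic bytes at the given offset.
    static boolean matchesAt(byte[] data, byte[] magic, int offset) {
        if (data.length < offset + magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[offset + i] != magic[i]) return false;
        }
        return true;
    }

    static String sniff(String name, byte[] data) {
        // <match value="%PDF-" type="string" offset="0"/>
        if (matchesAt(data, "%PDF-".getBytes(StandardCharsets.US_ASCII), 0))
            return "application/pdf";
        // <match value="PK\003\004" type="string" offset="0"/>
        if (matchesAt(data, new byte[]{'P', 'K', 3, 4}, 0))
            return "application/zip";
        // <glob pattern="*.txt"/> -- extension fallback
        if (name.endsWith(".txt"))
            return "text/plain";
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff("a.bin", pdf));       // -> application/pdf
        byte[] text = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff("notes.txt", text));  // -> text/plain
    }
}
```

Magic bytes win over the file name on purpose: in a web archive the URL's "extension" is frequently wrong or absent, which is why the XML above attaches a priority to each <magic> block.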