From: <bra...@us...> - 2009-07-17 23:09:30
Revision: 2757
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2757&view=rev
Author:   bradtofel
Date:     2009-07-17 23:09:28 +0000 (Fri, 17 Jul 2009)

Log Message:
-----------
1.4.2 release

Added Paths:
-----------
    branches/wayback-1_4_2/

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
From: <bi...@us...> - 2009-07-09 17:34:59
Revision: 2754 http://archive-access.svn.sourceforge.net/archive-access/?rev=2754&view=rev Author: binzino Date: 2009-07-09 17:34:57 +0000 (Thu, 09 Jul 2009) Log Message: ----------- Updated for 0.12.6 release. Modified Paths: -------------- tags/nutchwax-0_12_6/archive/BUILD-NOTES.txt tags/nutchwax-0_12_6/archive/HOWTO.txt tags/nutchwax-0_12_6/archive/INSTALL.txt tags/nutchwax-0_12_6/archive/README.txt tags/nutchwax-0_12_6/archive/RELEASE-NOTES.txt Modified: tags/nutchwax-0_12_6/archive/BUILD-NOTES.txt =================================================================== --- tags/nutchwax-0_12_6/archive/BUILD-NOTES.txt 2009-07-09 00:50:58 UTC (rev 2753) +++ tags/nutchwax-0_12_6/archive/BUILD-NOTES.txt 2009-07-09 17:34:57 UTC (rev 2754) @@ -1,6 +1,6 @@ BUILD-NOTES.txt -2009-06-25 +2009-07-09 Aaron Binns ====================================================================== @@ -79,7 +79,7 @@ ---------------------------------------------------------------------- The file - /opt/nutchwax-0.12.5/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.6/conf/tika-mimetypes.xml contains two errors: one where a mimetype is referenced before it is defined; and a second where a definition has an illegal character. 
@@ -110,11 +110,11 @@ You can either apply these patches yourself, or copy an already-patched copy from: - /opt/nutchwax-0.12.5/contrib/archive/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.6/contrib/archive/conf/tika-mimetypes.xml to - /opt/nutchwax-0.12.5/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.6/conf/tika-mimetypes.xml ---------------------------------------------------------------------- Modified: tags/nutchwax-0_12_6/archive/HOWTO.txt =================================================================== --- tags/nutchwax-0_12_6/archive/HOWTO.txt 2009-07-09 00:50:58 UTC (rev 2753) +++ tags/nutchwax-0_12_6/archive/HOWTO.txt 2009-07-09 17:34:57 UTC (rev 2754) @@ -1,6 +1,6 @@ HOWTO.txt -2009-06-25 +2009-07-09 Aaron Binns Table of Contents @@ -26,7 +26,7 @@ This HOWTO assumes it is installed in - /opt/nutchwax-0.12.5 + /opt/nutchwax-0.12.6 2. ARC/WARC files. @@ -68,10 +68,10 @@ $ mkdir crawl $ cd crawl - $ /opt/nutchwax-0.12.5/bin/nutchwax import ../manifest - $ /opt/nutchwax-0.12.5/bin/nutch updatedb crawldb -dir segments - $ /opt/nutchwax-0.12.5/bin/nutch invertlinks linkdb -dir segments - $ /opt/nutchwax-0.12.5/bin/nutch index indexes crawldb linkdb segments/* + $ /opt/nutchwax-0.12.6/bin/nutchwax import ../manifest + $ /opt/nutchwax-0.12.6/bin/nutch updatedb crawldb -dir segments + $ /opt/nutchwax-0.12.6/bin/nutch invertlinks linkdb -dir segments + $ /opt/nutchwax-0.12.6/bin/nutch index indexes crawldb linkdb segments/* $ ls -F1 crawldb/ indexes/ @@ -96,7 +96,7 @@ $ cd ../ $ ls -F1 crawl/ - $ /opt/nutchwax-0.12.5/bin/nutch org.archive.nutchwax.NutchWaxBean computer + $ /opt/nutchwax-0.12.6/bin/nutchwax search computer This calls the NutchWaxBean to execute a simple keyword search for "computer". 
Use whatever query term you think appears in the @@ -109,7 +109,7 @@ The Nutch(WAX) web application is bundled with NutchWAX as - /opt/nutchwax-0.12.5/nutch-1.0-dev.war + /opt/nutchwax-0.12.6/nutch-1.0-dev.war Simply deploy that web application in the same fashion as with Nutch. Modified: tags/nutchwax-0_12_6/archive/INSTALL.txt =================================================================== --- tags/nutchwax-0_12_6/archive/INSTALL.txt 2009-07-09 00:50:58 UTC (rev 2753) +++ tags/nutchwax-0_12_6/archive/INSTALL.txt 2009-07-09 17:34:57 UTC (rev 2754) @@ -1,6 +1,6 @@ INSTALL.txt -2009-06-25 +2009-07-09 Aaron Binns Table of Contents @@ -63,10 +63,10 @@ ------------------ As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev. Although the Nutch project released 1.0 in early 2009, there were so -many changes that NutchWAX 0.12.5 is still built against pre-1.0 +many changes that NutchWAX 0.12.6 is still built against pre-1.0 codebase. -The specific SVN revision that NutchWAX 0.12.5 is built against is: +The specific SVN revision that NutchWAX 0.12.6 is built against is: 701524 @@ -81,14 +81,14 @@ SVN: NutchWAX ------------- -Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.5 +Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.6 source into Nutch's "contrib" directory. $ cd contrib - $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_5/archive + $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_6/archive This will create a sub-directory named "archive" containing the -NutchWAX 0.12.5 sources. +NutchWAX 0.12.6 sources. 
Build and install ----------------- @@ -115,7 +115,7 @@ $ cd /opt $ tar xvfz nutch-1.0-dev.tar.gz - $ mv nutch-1.0-dev nutchwax-0.12.5 + $ mv nutch-1.0-dev nutchwax-0.12.6 ====================================================================== @@ -128,24 +128,24 @@ Install it simply by untarring it, for example: $ cd /opt - $ tar xvfz nutchwax-0.12.5.tar.gz + $ tar xvfz nutchwax-0.12.6.tar.gz ====================================================================== Install start-up scripts ====================================================================== -NutchWAX 0.12.5 comes with a Unix init.d script which can be used to +NutchWAX 0.12.6 comes with a Unix init.d script which can be used to automatically start the searcher slaves for a multi-node search configuration. Assuming you installed NutchWAX as - /opt/nutchwax-0.12.5 + /opt/nutchwax-0.12.6 the script is found at - /opt/nutchwax-0.12.5/contrib/archive/etc/init.d/searcher-slave + /opt/nutchwax-0.12.6/contrib/archive/etc/init.d/searcher-slave This script can be placed in /etc/init.d then added to the list of startup scripts to run at bootup by using commands appropriate to your Modified: tags/nutchwax-0_12_6/archive/README.txt =================================================================== --- tags/nutchwax-0_12_6/archive/README.txt 2009-07-09 00:50:58 UTC (rev 2753) +++ tags/nutchwax-0_12_6/archive/README.txt 2009-07-09 17:34:57 UTC (rev 2754) @@ -1,6 +1,6 @@ README.txt -2009-06-25 +2009-07-09 Aaron Binns Table of Contents @@ -13,7 +13,7 @@ Introduction ====================================================================== -Welcome to NutchWAX 0.12.5! +Welcome to NutchWAX 0.12.6! NutchWAX is a set of add-ons to Nutch in order to index and search archived web data. 
Modified: tags/nutchwax-0_12_6/archive/RELEASE-NOTES.txt =================================================================== --- tags/nutchwax-0_12_6/archive/RELEASE-NOTES.txt 2009-07-09 00:50:58 UTC (rev 2753) +++ tags/nutchwax-0_12_6/archive/RELEASE-NOTES.txt 2009-07-09 17:34:57 UTC (rev 2754) @@ -1,9 +1,9 @@ RELEASE-NOTES.TXT -2009-06-25 +2009-07-09 Aaron Binns -Release notes for NutchWAX 0.12.5 +Release notes for NutchWAX 0.12.6 For the most recent updates and information on NutchWAX, please visit the project wiki at: @@ -15,74 +15,44 @@ Overview ====================================================================== -NutchWAX 0.12.5 contains numerous enhancements and fixes to 0.12.4 +NutchWAX 0.12.6 contains a few convenient enhancements to 0.12.5 - o Command-line options for NutchWaxBean to configure number of - results to emit and how many hits per site to allow. + o Addition of 'search' and 'merge' commands to the 'nutchwax' + command-line driver. Now one can do - o Change default configuration to use NutchWAX indexing and query - filters instead of Nutch-provided ones. This give more consistent - control over indexing and query behavior. + nutchwax search foo - o No longer store the unique document key (URL+digest) in a separate - field in the index. Since the URL and digest are stored, just use - them to synthesize the unique document key as needed. + instead of - o Trimmed down the default configuration of indexing and query - filters to only store and index the minimum information needed for - typical NutchWAX installations. + nutch org.archive.nutchwax.NutchWaxBean foo + Similarly, the new NutchWAX index merging, which supports + parallel indexes, can be invoked via -====================================================================== -Configuration changes -====================================================================== + nutchwax merge output-index input-index... 
-As mentioned in the overview, NutchWAX 0.12.5 has some important -changes to the default configuration. + o Merging of parallel indexes into a single index. -Previously, the indexing and query filter configuration utilized a -combination of filters from Nutch and NutchWAX. This was in line with -our goal of NutchWAX being a set of add-ons to Nutch. + NutchWAX has a copy/paste/enhanced version of the Nutch index + merger that now supports parallel indexes. This allows parallel + indexes to be merged into a single index. To use this feature, + add the "-p" option to the NutchWAX 'merge' command indicating the + input index directories contain parallel index sub-dirs. -However, in practice, the mixing of these filters often lead to -confusion since the NutchWAX filters could be configured via -properties in the Nutch configuration files whereas the Nutch filters -were hard-coded and less powerful. + nutchwax merge -p output-index input-pindexes... -Now, all the Nutch indexing filters have been removed and are replaced -with the single NutchWAX indexing filter. Similarly, all but one -Nutch query filter are removed, replaced by the configurable NutchWAX -query filter. We do retain the Nutch 'query-basic' filter as it -contains the logic for automatically applying a query to multiple -fields with proportionate weights; something not subsumed by the -NutchWAX query filter. + o Option to specify the directory where the index(es) and segments + live when doing a command-line search. + Previously the directory was obtained from the nutch-default.xml + configuration file. This is inconvenient when testing different + indexes as one would have to edit the config file each time to + specify a different index to search. -In addition to removing the Nutch filters, the NutchWAX index and -query filters are streamlined to only index and store the minimum set -of metadata fields for typical deployments. 
+ Now, the directory can be specified on the command line: -In previous versions of NutchWAX, the indexing filters were configured -to index and store nearly every piece of metadata available. Although -this seems desirable, it adds a lot of storage overhead to the index, -and can hamper run-time query speed just by having unnecessary -information in the index (more junk for the disk to seek around). + nutchwax search -d <dir> <query> -The NutchWAX 0.12.5 configuration omits the typically unnecessary -metadata fields from the index and only indexes those fields we think -are needed for typical searches. - -For example, while we do store the digest, we do not index it as it's -very unusual for someone to search for a document with a specific -SHA-1 digest value. You could decide you want that, in which case you -can edit the configuration and re-index the data. You would have to -correspondingly edit the query filter and its configuration to allow -for searching on that field as well. - -We have found that this streamlined indexing configuration yields -Lucene indexes about 25% smaller than with NutchWAX 0.12.4. - - ====================================================================== Issues ====================================================================== @@ -93,16 +63,9 @@ Issues resolved in this release: -WAX-45 Add ability to store but not index a field via - ConfigurableIndexingFilter. +WAX-51 Enhance index merging to combine parallel indexes. -WAX-46 Add option to DumpParallelIndex to output only single field. +WAX-52 Add option to NutchWaxBean to specify directory where + index+segments are to be found. -WAX-47 Stop storing document key in "orig" field in index, synthesize - it as needed from the "url" and "digest" fields. - -WAX-48 Use NutchWAX configurable query filter for site and url fields. - -WAX-49 Add "hitsPerSite" option to NutchWaxBean. - -WAX-50 Add "num hits to find" option to NutchWaxBean. 
+WAX-53 IndexMerging parallel indexes fails when index is empty.
From: <bi...@us...> - 2009-07-09 00:51:02
Revision: 2753
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2753&view=rev
Author:   binzino
Date:     2009-07-09 00:50:58 +0000 (Thu, 09 Jul 2009)

Log Message:
-----------
Fix WAX-53. Added check for empty fieldToReader.

Modified Paths:
--------------
    tags/nutchwax-0_12_6/archive/src/java/org/apache/lucene/index/ArchiveParallelReader.java

Modified: tags/nutchwax-0_12_6/archive/src/java/org/apache/lucene/index/ArchiveParallelReader.java
===================================================================
--- tags/nutchwax-0_12_6/archive/src/java/org/apache/lucene/index/ArchiveParallelReader.java	2009-07-07 22:07:17 UTC (rev 2752)
+++ tags/nutchwax-0_12_6/archive/src/java/org/apache/lucene/index/ArchiveParallelReader.java	2009-07-09 00:50:58 UTC (rev 2753)
@@ -472,6 +472,8 @@
     private TermEnum termEnum;

     public ParallelTermEnum() throws IOException {
+      if ( fieldToReader.isEmpty( ) ) return ;
+
       field = (String)fieldToReader.firstKey();
       if (field != null)
         termEnum = ((IndexReader)fieldToReader.get(field)).terms();
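The WAX-53 fix above works because `TreeMap.firstKey()` throws `NoSuchElementException` on an empty map, so the constructor must bail out before touching `fieldToReader`. A minimal, self-contained sketch of that guard pattern (class and method names here are hypothetical, not NutchWAX code):

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrates the guard added in r2753: calling firstKey() on an empty
// SortedMap throws NoSuchElementException, so check isEmpty() first.
public class ParallelTermGuard {

    // Returns the first field name in sorted order, or null when the map
    // is empty (mirrors the early "return" added to ParallelTermEnum).
    static String firstFieldOrNull(SortedMap<String, ?> fieldToReader) {
        if (fieldToReader.isEmpty()) {
            return null; // without this, firstKey() would throw
        }
        return fieldToReader.firstKey();
    }

    public static void main(String[] args) {
        SortedMap<String, String> empty = new TreeMap<>();
        System.out.println(firstFieldOrNull(empty));   // prints "null", no exception

        SortedMap<String, String> m = new TreeMap<>();
        m.put("url", "readerA");
        m.put("content", "readerB");
        System.out.println(firstFieldOrNull(m));       // prints "content" (sorted order)
    }
}
```

An empty parallel index produces exactly the empty-map case, which is why merging previously failed.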
From: <bi...@us...> - 2009-07-07 22:07:25
Revision: 2752 http://archive-access.svn.sourceforge.net/archive-access/?rev=2752&view=rev Author: binzino Date: 2009-07-07 22:07:17 +0000 (Tue, 07 Jul 2009) Log Message: ----------- WAX-52. Added -d <dir> option to NutchWaxBean. Also added commands for index merging and searching to the 'nutchwax' script. Modified Paths: -------------- tags/nutchwax-0_12_6/archive/bin/nutchwax tags/nutchwax-0_12_6/archive/src/java/org/archive/nutchwax/NutchWaxBean.java Modified: tags/nutchwax-0_12_6/archive/bin/nutchwax =================================================================== --- tags/nutchwax-0_12_6/archive/bin/nutchwax 2009-07-07 21:53:03 UTC (rev 2751) +++ tags/nutchwax-0_12_6/archive/bin/nutchwax 2009-07-07 22:07:17 UTC (rev 2752) @@ -50,22 +50,30 @@ shift ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.PageRankDbMerger $@ ;; + pageranker) + shift + ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.PageRanker $@ + ;; + parsetextmerger) + shift + ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.ParseTextCombiner $@ + ;; add-dates) shift ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.DateAdder $@ ;; + merge) + shift + ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.IndexMerger $@ + ;; dumpindex) shift ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.DumpParallelIndex $@ ;; - pageranker) + search) shift - ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.PageRanker $@ + ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.NutchWaxBean $@ ;; - parsetextmerger) - shift - ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.ParseTextCombiner $@ - ;; *) echo "" echo "Usage: nutchwax COMMAND" @@ -76,7 +84,9 @@ echo " pageranker Generate pagerank.txt file from 'pagerankdb's or 'linkdb's" echo " parsetextmerger Merge segement parse_text/part-nnnnn directories." 
echo " add-dates Add dates to a parallel index" + echo " merge Merge indexes or parallel indexes" echo " dumpindex Dump an index or set of parallel indices to stdout" + echo " search Query a search index" echo "" exit 1 ;; Modified: tags/nutchwax-0_12_6/archive/src/java/org/archive/nutchwax/NutchWaxBean.java =================================================================== --- tags/nutchwax-0_12_6/archive/src/java/org/archive/nutchwax/NutchWaxBean.java 2009-07-07 21:53:03 UTC (rev 2751) +++ tags/nutchwax-0_12_6/archive/src/java/org/archive/nutchwax/NutchWaxBean.java 2009-07-07 22:07:17 UTC (rev 2752) @@ -254,6 +254,7 @@ String usage = "NutchWaxBean [options] query" + "\n\t-h <n> Hits per site" + "\n\t-n <n> Number of results to find" + + "\n\t-d <dir> Search directory" + + "\n"; if ( args.length == 0 ) @@ -263,6 +264,7 @@ } String queryString = args[args.length - 1]; + String searchDir = null; int hitsPerSite = 0; int numHits = 10; for ( int i = 0 ; i < args.length - 1 ; i++ ) @@ -279,6 +281,11 @@ i++; numHits = Integer.parseInt( args[i] ); } + if ( "-d".equals( args[i] ) ) + { + i++; + searchDir = args[i]; + } } catch ( NumberFormatException nfe ) { @@ -290,9 +297,15 @@ Configuration conf = NutchConfiguration.create(); + if ( searchDir != null ) + { + conf.set( "searcher.dir", searchDir ); + } NutchBean bean = new NutchBean(conf); NutchBeanModifier.modify( bean ); + System.out.println( "Searching in directory: " + conf.get( "searcher.dir" ) ); + Query query = Query.parse(queryString, conf); System.out.println("Hits per site: " + hitsPerSite); Hits hits = bean.search(query, numHits, hitsPerSite);
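The NutchWaxBean diff in this commit uses a common CLI pattern: the last argument is the query, and the preceding `-h <n>`, `-n <n>` and `-d <dir>` flags are consumed pairwise in a single loop. A standalone sketch of that parsing logic (the `SearchOptions` class is illustrative, not part of NutchWAX):

```java
// Minimal sketch of the option-parsing pattern NutchWaxBean uses:
// the query is the final argument; earlier flags each consume one value.
public class SearchOptions {
    String searchDir = null; // -d <dir>, falls back to searcher.dir config
    int hitsPerSite = 0;     // -h <n>, 0 means unlimited
    int numHits = 10;        // -n <n>, default 10 results
    String query;

    static SearchOptions parse(String[] args) {
        SearchOptions o = new SearchOptions();
        o.query = args[args.length - 1];
        // Stop before the last argument: it is the query, not a flag.
        for (int i = 0; i < args.length - 1; i++) {
            if ("-h".equals(args[i])) {
                o.hitsPerSite = Integer.parseInt(args[++i]);
            } else if ("-n".equals(args[i])) {
                o.numHits = Integer.parseInt(args[++i]);
            } else if ("-d".equals(args[i])) {
                o.searchDir = args[++i];
            }
        }
        return o;
    }

    public static void main(String[] args) {
        SearchOptions o = parse(new String[] { "-d", "/tmp/idx", "-n", "20", "computer" });
        // prints "/tmp/idx 20 0 computer"
        System.out.println(o.searchDir + " " + o.numHits + " " + o.hitsPerSite + " " + o.query);
    }
}
```

As in the real patch, a non-null `-d` value would then override the `searcher.dir` configuration property before the bean is constructed.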
From: <bi...@us...> - 2009-07-07 21:53:07
Revision: 2751 http://archive-access.svn.sourceforge.net/archive-access/?rev=2751&view=rev Author: binzino Date: 2009-07-07 21:53:03 +0000 (Tue, 07 Jul 2009) Log Message: ----------- WAX-51. Copy/paste/enhance Nutch's IndexMerger to support merging of parallel indices. Added Paths: ----------- tags/nutchwax-0_12_6/archive/src/java/org/archive/nutchwax/IndexMerger.java Added: tags/nutchwax-0_12_6/archive/src/java/org/archive/nutchwax/IndexMerger.java =================================================================== --- tags/nutchwax-0_12_6/archive/src/java/org/archive/nutchwax/IndexMerger.java (rev 0) +++ tags/nutchwax-0_12_6/archive/src/java/org/archive/nutchwax/IndexMerger.java 2009-07-07 21:53:03 UTC (rev 2751) @@ -0,0 +1,211 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.archive.nutchwax; + +import java.io.*; +import java.util.*; + +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; + +import org.apache.hadoop.fs.*; +import org.apache.hadoop.mapred.FileAlreadyExistsException; +import org.apache.hadoop.util.*; +import org.apache.hadoop.conf.*; + +import org.apache.nutch.util.HadoopFSUtil; +import org.apache.nutch.util.LogUtil; +import org.apache.nutch.util.NutchConfiguration; +import org.apache.nutch.indexer.NutchSimilarity; +import org.apache.nutch.indexer.FsDirectory; + +import org.apache.lucene.store.Directory; +import org.apache.lucene.index.IndexWriter; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.ArchiveParallelReader; + +/************************************************************************* + * IndexMerger creates an index for the output corresponding to a + * single fetcher run. + * + * @author Doug Cutting + * @author Mike Cafarella + *************************************************************************/ +public class IndexMerger extends Configured implements Tool { + public static final Log LOG = LogFactory.getLog(IndexMerger.class); + + public static final String DONE_NAME = "merge.done"; + + public IndexMerger() { + + } + + public IndexMerger(Configuration conf) { + setConf(conf); + } + + /** + * Merge all input indexes to the single output index + */ + public void merge(IndexReader[] readers, Path outputIndex, Path localWorkingDir, boolean parallel) throws IOException { + LOG.info("merging indexes to: " + outputIndex); + + FileSystem localFs = FileSystem.getLocal(getConf()); + if (localFs.exists(localWorkingDir)) { + localFs.delete(localWorkingDir, true); + } + localFs.mkdirs(localWorkingDir); + + // Get local output target + // + FileSystem fs = FileSystem.get(getConf()); + if (fs.exists(outputIndex)) { + throw new FileAlreadyExistsException("Output directory " + outputIndex + " already exists!"); + } + + Path tmpLocalOutput 
= new Path(localWorkingDir, "merge-output"); + Path localOutput = fs.startLocalOutput(outputIndex, tmpLocalOutput); + + // + // Merge indices + // + IndexWriter writer = new IndexWriter(localOutput.toString(), null, true); + writer.setMergeFactor(getConf().getInt("indexer.mergeFactor", IndexWriter.DEFAULT_MERGE_FACTOR)); + writer.setMaxBufferedDocs(getConf().getInt("indexer.minMergeDocs", IndexWriter.DEFAULT_MAX_BUFFERED_DOCS)); + writer.setMaxMergeDocs(getConf().getInt("indexer.maxMergeDocs", IndexWriter.DEFAULT_MAX_MERGE_DOCS)); + writer.setTermIndexInterval(getConf().getInt("indexer.termIndexInterval", IndexWriter.DEFAULT_TERM_INDEX_INTERVAL)); + writer.setInfoStream(LogUtil.getDebugStream(LOG)); + writer.setUseCompoundFile(false); + writer.setSimilarity(new NutchSimilarity()); + writer.addIndexes(readers); + writer.close(); + + // + // Put target back + // + fs.completeLocalOutput(outputIndex, tmpLocalOutput); + LOG.info("done merging"); + } + + /** + * Create an index for the input files in the named directory. 
+ */ + public static void main(String[] args) throws Exception { + int res = ToolRunner.run(NutchConfiguration.create(), new IndexMerger(), args); + System.exit(res); + } + + public int run(String[] args) throws Exception { + String usage = "IndexMerger [-workingdir <workingdir>] [-p] outputIndex indexesDir...\n\t-p Input directories contain parallel indexes.\n"; + if (args.length < 2) + { + System.err.println("Usage: " + usage); + return -1; + } + + // + // Parse args, read all index directories to be processed + // + FileSystem fs = FileSystem.get(getConf()); + List<Path> indexDirs = new ArrayList<Path>(); + + Path workDir = new Path("indexmerger-" + System.currentTimeMillis()); + int i = 0; + + boolean parallel=false; + + while ( args[i].startsWith( "-" ) ) + { + if ( "-workingdir".equals(args[i]) ) + { + i++; + workDir = new Path(args[i++], "indexmerger-" + System.currentTimeMillis()); + } + else if ( "-p".equals(args[i]) ) + { + i++; + parallel=true; + } + } + + Path outputIndex = new Path(args[i++]); + + List<IndexReader> readers = new ArrayList<IndexReader>( ); + + if ( ! parallel ) + { + for (; i < args.length; i++) + { + FileStatus[] fstats = fs.listStatus(new Path(args[i]), HadoopFSUtil.getPassDirectoriesFilter(fs)); + + for ( Path p : HadoopFSUtil.getPaths(fstats) ) + { + LOG.info( "Adding reader for: " + p ); + readers.add( IndexReader.open( new FsDirectory( fs, p, false, getConf( ) ) ) ); + } + } + } + else + { + for (; i < args.length; i++) + { + FileStatus[] fstats = fs.listStatus(new Path(args[i]), HadoopFSUtil.getPassDirectoriesFilter(fs)); + Path parallelDirs[] = HadoopFSUtil.getPaths( fstats ); + + if ( parallelDirs.length < 1 ) + { + LOG.info( "No sub-directories, skipping: " + args[i] ); + + continue; + } + else + { + LOG.info( "Adding parallel reader for: " + args[i] ); + } + + ArchiveParallelReader preader = new ArchiveParallelReader( ); + + // Sort the parallelDirs so that we add them in order. Order + // matters to the ParallelReader. 
+ Arrays.sort( parallelDirs ); + + for ( Path p : parallelDirs ) + { + LOG.info( "  Adding to parallel reader: " + p.getName( ) ); + preader.add( IndexReader.open( new FsDirectory( fs, p, false, getConf( ) ) ) ); + } + + readers.add( preader ); + } + } + + // + // Merge the indices + // + + try { + merge(readers.toArray(new IndexReader[readers.size()]), outputIndex, workDir, parallel); + return 0; + } catch (Exception e) { + LOG.fatal("IndexMerger: " + StringUtils.stringifyException(e)); + return -1; + } finally { + FileSystem.getLocal(getConf()).delete(workDir, true); + } + } +}
From: <bi...@us...> - 2009-07-07 19:47:28
Revision: 2750
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2750&view=rev
Author:   binzino
Date:     2009-07-07 19:47:24 +0000 (Tue, 07 Jul 2009)

Log Message:
-----------
Created NutchWAX 0.12.6 tag/branch from 0.12.5.

Added Paths:
-----------
    tags/nutchwax-0_12_6/
From: <bi...@us...> - 2009-06-27 01:22:21
Revision: 2749
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2749&view=rev
Author:   binzino
Date:     2009-06-27 00:17:23 +0000 (Sat, 27 Jun 2009)

Log Message:
-----------
Changed default to index the URL field.

Modified Paths:
--------------
    tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt
    tags/nutchwax-0_12_5/archive/src/nutch/conf/nutch-site.xml

Modified: tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt
===================================================================
--- tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt	2009-06-25 23:02:50 UTC (rev 2748)
+++ tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt	2009-06-27 00:17:23 UTC (rev 2749)
@@ -249,7 +249,7 @@
       content:false:false:tokenized
       site:false:false:untokenized

-      url:false:true:no
+      url:false:true:tokenized
       digest:false:true:no

       collection:true:true:no_norms

Modified: tags/nutchwax-0_12_5/archive/src/nutch/conf/nutch-site.xml
===================================================================
--- tags/nutchwax-0_12_5/archive/src/nutch/conf/nutch-site.xml	2009-06-25 23:02:50 UTC (rev 2748)
+++ tags/nutchwax-0_12_5/archive/src/nutch/conf/nutch-site.xml	2009-06-27 00:17:23 UTC (rev 2749)
@@ -47,7 +47,7 @@
       content:false:false:tokenized
       site:false:false:untokenized

-      url:false:true:no
+      url:false:true:tokenized
       digest:false:true:no

       collection:true:true:no_norms
From: <bi...@us...> - 2009-06-25 23:02:56
Revision: 2748 http://archive-access.svn.sourceforge.net/archive-access/?rev=2748&view=rev Author: binzino Date: 2009-06-25 23:02:50 +0000 (Thu, 25 Jun 2009) Log Message: ----------- Changed version from 0.12.4 to 0.12.5. Modified Paths: -------------- tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt tags/nutchwax-0_12_5/archive/HOWTO.txt Modified: tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt =================================================================== --- tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt 2009-06-25 22:00:14 UTC (rev 2747) +++ tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt 2009-06-25 23:02:50 UTC (rev 2748) @@ -79,7 +79,7 @@ ---------------------------------------------------------------------- The file - /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.5/conf/tika-mimetypes.xml contains two errors: one where a mimetype is referenced before it is defined; and a second where a definition has an illegal character. @@ -110,11 +110,11 @@ You can either apply these patches yourself, or copy an already-patched copy from: - /opt/nutchwax-0.12.4/contrib/archive/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.5/contrib/archive/conf/tika-mimetypes.xml to - /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.5/conf/tika-mimetypes.xml ---------------------------------------------------------------------- Modified: tags/nutchwax-0_12_5/archive/HOWTO.txt =================================================================== --- tags/nutchwax-0_12_5/archive/HOWTO.txt 2009-06-25 22:00:14 UTC (rev 2747) +++ tags/nutchwax-0_12_5/archive/HOWTO.txt 2009-06-25 23:02:50 UTC (rev 2748) @@ -68,10 +68,10 @@ $ mkdir crawl $ cd crawl - $ /opt/nutchwax-0.12.4/bin/nutchwax import ../manifest - $ /opt/nutchwax-0.12.4/bin/nutch updatedb crawldb -dir segments - $ /opt/nutchwax-0.12.4/bin/nutch invertlinks linkdb -dir segments - $ /opt/nutchwax-0.12.4/bin/nutch index indexes crawldb linkdb segments/* + $ /opt/nutchwax-0.12.5/bin/nutchwax import ../manifest + $ 
/opt/nutchwax-0.12.5/bin/nutch updatedb crawldb -dir segments + $ /opt/nutchwax-0.12.5/bin/nutch invertlinks linkdb -dir segments + $ /opt/nutchwax-0.12.5/bin/nutch index indexes crawldb linkdb segments/* $ ls -F1 crawldb/ indexes/ @@ -96,7 +96,7 @@ $ cd ../ $ ls -F1 crawl/ - $ /opt/nutchwax-0.12.4/bin/nutch org.archive.nutchwax.NutchWaxBean computer + $ /opt/nutchwax-0.12.5/bin/nutch org.archive.nutchwax.NutchWaxBean computer This calls the NutchWaxBean to execute a simple keyword search for "computer". Use whatever query term you think appears in the @@ -109,7 +109,7 @@ The Nutch(WAX) web application is bundled with NutchWAX as - /opt/nutchwax-0.12.4/nutch-1.0-dev.war + /opt/nutchwax-0.12.5/nutch-1.0-dev.war Simply deploy that web application in the same fashion as with Nutch.
From: <bi...@us...> - 2009-06-25 22:00:15
Revision: 2747 http://archive-access.svn.sourceforge.net/archive-access/?rev=2747&view=rev Author: binzino Date: 2009-06-25 22:00:14 +0000 (Thu, 25 Jun 2009) Log Message: ----------- Updated for 0.12.5 release. Modified Paths: -------------- tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt tags/nutchwax-0_12_5/archive/HOWTO-xslt.txt tags/nutchwax-0_12_5/archive/HOWTO.txt tags/nutchwax-0_12_5/archive/INSTALL.txt tags/nutchwax-0_12_5/archive/README.txt tags/nutchwax-0_12_5/archive/RELEASE-NOTES.txt Modified: tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt =================================================================== --- tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt 2009-06-25 20:23:20 UTC (rev 2746) +++ tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt 2009-06-25 22:00:14 UTC (rev 2747) @@ -1,6 +1,6 @@ BUILD-NOTES.txt -2008-12-18 +2009-06-25 Aaron Binns ====================================================================== @@ -130,27 +130,37 @@ to - protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax + protocol-http|parse-(text|html|js|pdf)|index-nutchwax|query-(basic|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax In short, we add: - index-nutchwax - query-nutchwax - urlfilter-nutchwax - parse-pdf + parse-pdf + index-nutchwax + query-nutchwax + urlfilter-nutchwax and remove: - urlfilter-regex - urlnormalizer-(pass|regex|basic) + index-basic + index-anchor + query-site + query-url + urlfilter-regex + urlnormalizer-(pass|regex|basic) -The only *required* changes are the additions of the NutchWAX index -and query plugins. The rest are optional, but recommended. The "parse-pdf" plugin is added simply because we have lots of PDFs in our archives and we want to index them. We sometimes remove the "parse-js" plugin if we don't care to index JavaScript files. +The Nutch index-basic and index-anchor filters are removed and +replaced with the NutchWAX index-nutchwax filter. 
Similarly, we +remove the Nutch query-site and query-url filters, replacing them with +the single NutchWAX query-nutchwax filter. By using the configurable +NutchWAX filters for indexing and querying, we get more powerful and +consistent behavior across metadata fields. Note that we do retain +the Nutch query-basic filter however. + We also remove the default Nutch URL filtering and normalizing plugins because we do not need the URLs normalized nor filtered. We trust that the tool that produced the ARC/WARC file will have normalized the @@ -166,6 +176,14 @@ -------------------------------------------------- indexingfilter.order -------------------------------------------------- +If we use the indexing filters as specified in the previous section, +then this property can remain unset. However, if you choose to use +the Nutch index-basic filter, then you *must* specify the order in +which the filters will be used. If you don't then the filters will be +applied in a random order (per Nutch's design) and since one may +over-write the values of another you won't know what values will +result. In that case, you need to specify the order. + Add this property with a value of org.apache.nutch.indexer.basic.BasicIndexingFilter @@ -174,8 +192,6 @@ So that the NutchWAX indexing filter is run after the Nutch basic indexing filter. -A full explanation is given in "README-dedup.txt". 
- -------------------------------------------------- mime.type.magic -------------------------------------------------- @@ -205,37 +221,44 @@ The specifications here are of the form: - src-key:lowercase:store:tokenize:exclusive:dest-key + src-key:lowercase:store:index:exclusive:dest-key where the only required part is the "src-key", the rest will assume the following defaults: lowercase = true store = true - tokenize = false + index = tokenized exclusive = true dest-key = src-key +For the 'index' property, the possible values are: + tokenized + untokenized + no_norms + no + +corresponding to the Lucene options of the same names. + We recommend: <property> <name>nutchwax.filter.index</name> <value> - url:false:true:true - url:false:true:false:true:exacturl - orig:false - digest:false - filename:false - fileoffset:false - collection - date - type - length + title:false:true:tokenized + content:false:false:tokenized + site:false:false:untokenized + + url:false:true:no + digest:false:true:no + + collection:true:true:no_norms + date:true:true:no_norms + type:true:true:no_norms + length:false:true:no </value> </property> -The "url", "orig" and "digest" values are required, the rest are -optional, but strongly recommended. -------------------------------------------------- nutchwax.filter.query @@ -274,15 +297,10 @@ <property> <name>nutchwax.filter.query</name> <value> - raw:digest:false - raw:filename:false - raw:fileoffset:false - raw:exacturl:false group:collection + group:site:false group:type - field:anchor field:content - field:host field:title </value> </property> @@ -428,3 +446,31 @@ <value>false</value> </property> + +-------------------------------------------------- +searcher.fieldcache +-------------------------------------------------- + +NutchWAX contains a patch controlling the use of a "fieldcache" in the +Nutch searcher. Without this patch Nutch will read the entire set of +hostnames from the index into an in-memory cache. 
This cache is then +consulted when performing de-duplication of results per the +"hitsPerSite" feature. + +For small-to-medium indexes, this can improve performance as the +de-duplication information is entirely in memory and no disk access is +required. + +However, for large indexes, in the tens of gigabytes in size, reading +the entire set of hostnames into an in-memory cache can exhaust the +Java heap. In this case, omitting the cache all together and just +reading the values off disk as needed is better. + +The NutchWAX patch controls the use of this cache based on this property +value. If set to false, then the cache is not used at all. + +<property> + <name>searcher.fieldcache</name> + <value>true</value> +</property> + Modified: tags/nutchwax-0_12_5/archive/HOWTO-xslt.txt =================================================================== --- tags/nutchwax-0_12_5/archive/HOWTO-xslt.txt 2009-06-25 20:23:20 UTC (rev 2746) +++ tags/nutchwax-0_12_5/archive/HOWTO-xslt.txt 2009-06-25 22:00:14 UTC (rev 2747) @@ -1,6 +1,6 @@ HOWTO-xslt.txt -2008-12-18 +2009-06-25 Aaron Binns Table of Contents @@ -128,8 +128,5 @@ You can find sample 'web.xml' and 'search.xsl' files in - contrib/archive/web - -in the compiled Nutch package. Or in this source tree under - - src/web + ./src/nutch/src/web/jsp/search.xsl + ./src/nutch/src/web/web.xml Modified: tags/nutchwax-0_12_5/archive/HOWTO.txt =================================================================== --- tags/nutchwax-0_12_5/archive/HOWTO.txt 2009-06-25 20:23:20 UTC (rev 2746) +++ tags/nutchwax-0_12_5/archive/HOWTO.txt 2009-06-25 22:00:14 UTC (rev 2747) @@ -1,6 +1,6 @@ HOWTO.txt -2008-07-28 +2009-06-25 Aaron Binns Table of Contents @@ -26,7 +26,7 @@ This HOWTO assumes it is installed in - /opt/nutchwax-0.12.4 + /opt/nutchwax-0.12.5 2. ARC/WARC files. 
@@ -96,9 +96,9 @@ $ cd ../ $ ls -F1 crawl/ - $ /opt/nutchwax-0.12.4/bin/nutch org.apache.nutch.searcher.NutchBean computer + $ /opt/nutchwax-0.12.4/bin/nutch org.archive.nutchwax.NutchWaxBean computer -This calls the NutchBean to execute a simple keyword search for +This calls the NutchWaxBean to execute a simple keyword search for "computer". Use whatever query term you think appears in the documents you imported. Modified: tags/nutchwax-0_12_5/archive/INSTALL.txt =================================================================== --- tags/nutchwax-0_12_5/archive/INSTALL.txt 2009-06-25 20:23:20 UTC (rev 2746) +++ tags/nutchwax-0_12_5/archive/INSTALL.txt 2009-06-25 22:00:14 UTC (rev 2747) @@ -1,6 +1,6 @@ INSTALL.txt -2009-03-08 +2009-06-25 Aaron Binns Table of Contents @@ -62,10 +62,12 @@ SVN: nutch-1.0-dev ------------------ As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev. -Nutch doesn't have a 1.0 release package yet, so we have to use the -Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.4 is -built against is: +Although the Nutch project released 1.0 in early 2009, there were so +many changes that NutchWAX 0.12.5 is still built against pre-1.0 +codebase. +The specific SVN revision that NutchWAX 0.12.5 is built against is: + 701524 To checkout this revision of Nutch, use: @@ -79,14 +81,14 @@ SVN: NutchWAX ------------- -Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.4 +Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.5 source into Nutch's "contrib" directory. $ cd contrib - $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_4/archive + $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_5/archive This will create a sub-directory named "archive" containing the -NutchWAX 0.12.4 sources. +NutchWAX 0.12.5 sources. 
Build and install ----------------- @@ -113,7 +115,7 @@ $ cd /opt $ tar xvfz nutch-1.0-dev.tar.gz - $ mv nutch-1.0-dev nutchwax-0.12.4 + $ mv nutch-1.0-dev nutchwax-0.12.5 ====================================================================== @@ -126,24 +128,24 @@ Install it simply by untarring it, for example: $ cd /opt - $ tar xvfz nutchwax-0.12.4.tar.gz + $ tar xvfz nutchwax-0.12.5.tar.gz ====================================================================== Install start-up scripts ====================================================================== -NutchWAX 0.12.4 comes with a Unix init.d script which can be used to +NutchWAX 0.12.5 comes with a Unix init.d script which can be used to automatically start the searcher slaves for a multi-node search configuration. Assuming you installed NutchWAX as - /opt/nutchwax-0.12.4 + /opt/nutchwax-0.12.5 the script is found at - /opt/nutchwax-0.12.4/contrib/archive/etc/init.d/searcher-slave + /opt/nutchwax-0.12.5/contrib/archive/etc/init.d/searcher-slave This script can be placed in /etc/init.d then added to the list of startup scripts to run at bootup by using commands appropriate to your Modified: tags/nutchwax-0_12_5/archive/README.txt =================================================================== --- tags/nutchwax-0_12_5/archive/README.txt 2009-06-25 20:23:20 UTC (rev 2746) +++ tags/nutchwax-0_12_5/archive/README.txt 2009-06-25 22:00:14 UTC (rev 2747) @@ -1,6 +1,6 @@ README.txt -2009-05-05 +2009-06-25 Aaron Binns Table of Contents @@ -13,7 +13,7 @@ Introduction ====================================================================== -Welcome to NutchWAX 0.12.4! +Welcome to NutchWAX 0.12.5! NutchWAX is a set of add-ons to Nutch in order to index and search archived web data. 
Modified: tags/nutchwax-0_12_5/archive/RELEASE-NOTES.txt =================================================================== --- tags/nutchwax-0_12_5/archive/RELEASE-NOTES.txt 2009-06-25 20:23:20 UTC (rev 2746) +++ tags/nutchwax-0_12_5/archive/RELEASE-NOTES.txt 2009-06-25 22:00:14 UTC (rev 2747) @@ -1,9 +1,9 @@ RELEASE-NOTES.TXT -2009-05-05 +2009-06-25 Aaron Binns -Release notes for NutchWAX 0.12.4 +Release notes for NutchWAX 0.12.5 For the most recent updates and information on NutchWAX, please visit the project wiki at: @@ -15,17 +15,75 @@ Overview ====================================================================== -NutchWAX 0.12.4 contains numerous enhancements and fixes to 0.12.3 +NutchWAX 0.12.5 contains numerous enhancements and fixes to 0.12.4 - o Option to omit storing of content during import. - o Support for per-collection segments in master/slave config. - o Additional diagnostic/log messages to help troubleshoot common - deployment mistakes. - o PageRankDb similar to LinkDb but only keeping inlink counts. - o Improved paging through results, handling "paging past the end". + o Command-line options for NutchWaxBean to configure number of + results to emit and how many hits per site to allow. + o Change default configuration to use NutchWAX indexing and query + filters instead of Nutch-provided ones. This give more consistent + control over indexing and query behavior. + o No longer store the unique document key (URL+digest) in a separate + field in the index. Since the URL and digest are stored, just use + them to synthesize the unique document key as needed. + + o Trimmed down the default configuration of indexing and query + filters to only store and index the minimum information needed for + typical NutchWAX installations. 
+ + ====================================================================== +Configuration changes +====================================================================== + +As mentioned in the overview, NutchWAX 0.12.5 has some important +changes to the default configuration. + +Previously, the indexing and query filter configuration utilized a +combination of filters from Nutch and NutchWAX. This was in line with +our goal of NutchWAX being a set of add-ons to Nutch. + +However, in practice, the mixing of these filters often lead to +confusion since the NutchWAX filters could be configured via +properties in the Nutch configuration files whereas the Nutch filters +were hard-coded and less powerful. + +Now, all the Nutch indexing filters have been removed and are replaced +with the single NutchWAX indexing filter. Similarly, all but one +Nutch query filter are removed, replaced by the configurable NutchWAX +query filter. We do retain the Nutch 'query-basic' filter as it +contains the logic for automatically applying a query to multiple +fields with proportionate weights; something not subsumed by the +NutchWAX query filter. + + +In addition to removing the Nutch filters, the NutchWAX index and +query filters are streamlined to only index and store the minimum set +of metadata fields for typical deployments. + +In previous versions of NutchWAX, the indexing filters were configured +to index and store nearly every piece of metadata available. Although +this seems desirable, it adds a lot of storage overhead to the index, +and can hamper run-time query speed just by having unnecessary +information in the index (more junk for the disk to seek around). + +The NutchWAX 0.12.5 configuration omits the typically unnecessary +metadata fields from the index and only indexes those fields we think +are needed for typical searches. 
+ +For example, while we do store the digest, we do not index it as it's +very unusual for someone to search for a document with a specific +SHA-1 digest value. You could decide you want that, in which case you +can edit the configuration and re-index the data. You would have to +correspondingly edit the query filter and its configuration to allow +for searching on that field as well. + +We have found that this streamlined indexing configuration yields +Lucene indexes about 25% smaller than with NutchWAX 0.12.4. + + +====================================================================== Issues ====================================================================== @@ -35,23 +93,16 @@ Issues resolved in this release: -WAX-27 Sensible output for requesting page of results past the end. +WAX-45 Add ability to store but not index a field via + ConfigurableIndexingFilter. -WAX-34 Add option to omit storing of content in segment +WAX-46 Add option to DumpParallelIndex to output only single field. -WAX-35 Add pagerankdb similar to linkdb but which only keeps counts - rather than actual inlinks. +WAX-47 Stop storing document key in "orig" field in index, synthesize + it as needed from the "url" and "digest" fields. -WAX-36 Some additional diagnostics on connecting results to segments - and snippets would be very helpful. +WAX-48 Use NutchWAX configurable query filter for site and url fields. -WAX-37 Per-collection segments not supported in distributed - master-slave configuration. +WAX-49 Add "hitsPerSite" option to NutchWaxBean. -WAX-38 Build omits neessary libraries from .job file. - -WAX-39 Write more efficient, specialized segment parse_text merging. - -WAX-41 Option to enable/disable the FIELDCACHE in the Nutch IndexSearcher - -WAX-42 Add option to continue importing if an arcfile cannot be read. +WAX-50 Add "num hits to find" option to NutchWaxBean. 
This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
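[Editor's note] The BUILD-NOTES.txt diff in the commit above documents the `nutchwax.filter.index` spec format `src-key:lowercase:store:index:exclusive:dest-key`, where only `src-key` is required and the rest default to `true`, `true`, `tokenized`, `true`, and `src-key` respectively. A minimal standalone sketch of that parsing rule, under the assumption that the documented defaults are applied positionally (the `FieldSpec` class name is illustrative, not the actual NutchWAX `ConfigurableIndexingFilter` source):

```java
// Hypothetical re-implementation of the field-spec parsing described in
// BUILD-NOTES.txt; class and field names are assumptions for illustration.
public class FieldSpec {
    final String srcKey;
    final boolean lowerCase;  // default: true
    final boolean store;      // default: true
    final String index;       // default: "tokenized"; also untokenized, no_norms, no
    final boolean exclusive;  // default: true
    final String destKey;     // default: same as srcKey

    FieldSpec(String spec) {
        // Positional, colon-separated; missing positions take the defaults.
        String[] p = spec.trim().split(":");
        srcKey    = p[0];
        lowerCase = p.length > 1 ? Boolean.parseBoolean(p[1]) : true;
        store     = p.length > 2 ? Boolean.parseBoolean(p[2]) : true;
        index     = p.length > 3 ? p[3] : "tokenized";
        exclusive = p.length > 4 ? Boolean.parseBoolean(p[4]) : true;
        destKey   = p.length > 5 ? p[5] : srcKey;
    }

    public static void main(String[] args) {
        // A bare "collection" entry gets every default.
        FieldSpec c = new FieldSpec("collection");
        System.out.println(c.srcKey + " -> " + c.destKey + " index=" + c.index);
        // Fully specified: store "url" under "exacturl", unindexed.
        FieldSpec u = new FieldSpec("url:false:true:no:true:exacturl");
        System.out.println(u.srcKey + " -> " + u.destKey + " index=" + u.index);
    }
}
```

Under this reading, the recommended `collection:true:true:no_norms` entry is lowercased, stored, and indexed without norms under the same `collection` key.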
From: <bi...@us...> - 2009-06-25 20:23:25
Revision: 2746 http://archive-access.svn.sourceforge.net/archive-access/?rev=2746&view=rev Author: binzino Date: 2009-06-25 20:23:20 +0000 (Thu, 25 Jun 2009) Log Message: ----------- WAX-49, WAX-50: Added -h and -n options to specify number of hits-per-site and total number of hits requested. Modified Paths: -------------- tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/NutchWaxBean.java Modified: tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/NutchWaxBean.java =================================================================== --- tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/NutchWaxBean.java 2009-06-25 20:21:51 UTC (rev 2745) +++ tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/NutchWaxBean.java 2009-06-25 20:23:20 UTC (rev 2746) @@ -251,28 +251,59 @@ */ public static void main(String[] args) throws Exception { - String usage = "NutchWaxBean query"; + String usage = "NutchWaxBean [options] query" + + "\n\t-h <n> Hits per site" + + "\n\t-n <n> Number of results to find" + + "\n"; - if (args.length == 0) + if ( args.length == 0 ) { - System.err.println(usage); - System.exit(-1); + System.err.println( usage ); + System.exit( -1 ); } + + String queryString = args[args.length - 1]; + int hitsPerSite = 0; + int numHits = 10; + for ( int i = 0 ; i < args.length - 1 ; i++ ) + { + try + { + if ( "-h".equals( args[i] ) ) + { + i++; + hitsPerSite = Integer.parseInt( args[i] ); + } + if ( "-n".equals( args[i] ) ) + { + i++; + numHits = Integer.parseInt( args[i] ); + } + } + catch ( NumberFormatException nfe ) + { + System.err.println( "Error: not a numeric value: " + args[i] ); + System.err.println( usage ); + System.exit( -1 ); + } + } Configuration conf = NutchConfiguration.create(); NutchBean bean = new NutchBean(conf); NutchBeanModifier.modify( bean ); - Query query = Query.parse(args[0], conf); - Hits hits = bean.search(query, 10); - System.out.println("Total hits: " + hits.getTotal()); - int length = 
(int)Math.min(hits.getTotal(), 10); + Query query = Query.parse(queryString, conf); + System.out.println("Hits per site: " + hitsPerSite); + Hits hits = bean.search(query, numHits, hitsPerSite); + System.out.println("Total hits : " + hits.getTotal()); + System.out.println("Hits length: " + hits.getLength()); + int length = (int)Math.min(hits.getLength(), numHits); Hit[] show = hits.getHits(0, length); HitDetails[] details = bean.getDetails(show); Summary[] summaries = bean.getSummary(details, query); - for (int i = 0; i < hits.getLength(); i++) + for (int i = 0; i < length; i++) { // Use a slightly more verbose output than NutchBean. System.out.println( " " This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
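[Editor's note] The r2746 diff above adds `-h` (hits per site) and `-n` (number of results) options, with the query always taken as the last argument. The parsing loop can be sketched in isolation as follows; the class name is hypothetical (the real logic lives in `NutchWaxBean.main`), and the usage/error handling for non-numeric values shown in the diff is reduced to a comment here:

```java
// Hypothetical standalone sketch of the option parsing added in r2746.
public class NutchWaxOptions {
    int hitsPerSite = 0;   // -h; 0 means no per-site limit
    int numHits = 10;      // -n; default of 10 results
    final String query;    // last argument is always the query string

    NutchWaxOptions(String[] args) {
        query = args[args.length - 1];
        // Scan everything before the final (query) argument for options.
        // The real code wraps the parses in a NumberFormatException handler
        // that prints usage and exits; omitted in this sketch.
        for (int i = 0; i < args.length - 1; i++) {
            if ("-h".equals(args[i])) {
                hitsPerSite = Integer.parseInt(args[++i]);
            } else if ("-n".equals(args[i])) {
                numHits = Integer.parseInt(args[++i]);
            }
        }
    }

    public static void main(String[] args) {
        NutchWaxOptions o =
            new NutchWaxOptions(new String[]{"-h", "2", "-n", "50", "computer"});
        System.out.println(o.query + " hitsPerSite=" + o.hitsPerSite
                           + " numHits=" + o.numHits);
    }
}
```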
From: <bi...@us...> - 2009-06-25 20:21:52
Revision: 2745 http://archive-access.svn.sourceforge.net/archive-access/?rev=2745&view=rev Author: binzino Date: 2009-06-25 20:21:51 +0000 (Thu, 25 Jun 2009) Log Message: ----------- Since we have our own NutchWAX OpenSearchServlet, we no longer need any mods to the Nutch-provided one. Removed Paths: ------------- tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java Deleted: tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java =================================================================== --- tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java 2009-06-25 20:21:06 UTC (rev 2744) +++ tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java 2009-06-25 20:21:51 UTC (rev 2745) @@ -1,333 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.nutch.searcher; - -import java.io.IOException; -import java.net.URLEncoder; -import java.util.Map; -import java.util.HashMap; -import java.util.Set; -import java.util.HashSet; - -import javax.servlet.ServletException; -import javax.servlet.ServletConfig; -import javax.servlet.http.HttpServlet; -import javax.servlet.http.HttpServletRequest; -import javax.servlet.http.HttpServletResponse; - -import javax.xml.parsers.*; - -import org.apache.hadoop.conf.Configuration; -import org.apache.nutch.util.NutchConfiguration; -import org.w3c.dom.*; -import javax.xml.transform.TransformerFactory; -import javax.xml.transform.Transformer; -import javax.xml.transform.dom.DOMSource; -import javax.xml.transform.stream.StreamResult; - - -/** Present search results using A9's OpenSearch extensions to RSS, plus a few - * Nutch-specific extensions. */ -public class OpenSearchServlet extends HttpServlet { - private static final Map NS_MAP = new HashMap(); - private int MAX_HITS_PER_PAGE; - - static { - NS_MAP.put("opensearch", "http://a9.com/-/spec/opensearchrss/1.0/"); - NS_MAP.put("nutch", "http://www.nutch.org/opensearchrss/1.0/"); - } - - private static final Set SKIP_DETAILS = new HashSet(); - static { - SKIP_DETAILS.add("url"); // redundant with RSS link - SKIP_DETAILS.add("title"); // redundant with RSS title - } - - private NutchBean bean; - private Configuration conf; - - public void init(ServletConfig config) throws ServletException { - try { - this.conf = NutchConfiguration.get(config.getServletContext()); - bean = NutchBean.get(config.getServletContext(), this.conf); - } catch (IOException e) { - throw new ServletException(e); - } - MAX_HITS_PER_PAGE = conf.getInt("searcher.max.hits.per.page", -1); - } - - public void doGet(HttpServletRequest request, HttpServletResponse response) - throws ServletException, IOException { - - if (NutchBean.LOG.isInfoEnabled()) { - NutchBean.LOG.info("query request from " + request.getRemoteAddr()); - } - - // get 
parameters from request - request.setCharacterEncoding("UTF-8"); - String queryString = request.getParameter("query"); - if (queryString == null) - queryString = ""; - String urlQuery = URLEncoder.encode(queryString, "UTF-8"); - - // the query language - String queryLang = request.getParameter("lang"); - - int start = 0; // first hit to display - String startString = request.getParameter("start"); - if (startString != null) - start = Integer.parseInt(startString); - - int hitsPerPage = 10; // number of hits to display - String hitsString = request.getParameter("hitsPerPage"); - if (hitsString != null) - hitsPerPage = Integer.parseInt(hitsString); - if(MAX_HITS_PER_PAGE > 0 && hitsPerPage > MAX_HITS_PER_PAGE) - hitsPerPage = MAX_HITS_PER_PAGE; - - String sort = request.getParameter("sort"); - boolean reverse = - sort!=null && "true".equals(request.getParameter("reverse")); - - // De-Duplicate handling. Look for duplicates field and for how many - // duplicates per results to return. Default duplicates field is 'site' - // and duplicates per results default is '2'. - String dedupField = request.getParameter("dedupField"); - if (dedupField == null || dedupField.length() == 0) { - dedupField = "site"; - } - int hitsPerDup = 2; - String hitsPerDupString = request.getParameter("hitsPerDup"); - if (hitsPerDupString != null && hitsPerDupString.length() > 0) { - hitsPerDup = Integer.parseInt(hitsPerDupString); - } else { - // If 'hitsPerSite' present, use that value. - String hitsPerSiteString = request.getParameter("hitsPerSite"); - if (hitsPerSiteString != null && hitsPerSiteString.length() > 0) { - hitsPerDup = Integer.parseInt(hitsPerSiteString); - } - } - - // Make up query string for use later drawing the 'rss' logo. - String params = "&hitsPerPage=" + hitsPerPage + - (queryLang == null ? "" : "&lang=" + queryLang) + - (sort == null ? "" : "&sort=" + sort + (reverse? "&reverse=true": "") + - (dedupField == null ? 
"" : "&dedupField=" + dedupField)); - - Query query = Query.parse(queryString, queryLang, this.conf); - if (NutchBean.LOG.isInfoEnabled()) { - NutchBean.LOG.info("query: " + queryString); - NutchBean.LOG.info("lang: " + queryLang); - } - - // execute the query - Hits hits; - try { - hits = bean.search(query, start + hitsPerPage, hitsPerDup, dedupField, - sort, reverse); - } catch (IOException e) { - if (NutchBean.LOG.isWarnEnabled()) { - NutchBean.LOG.warn("Search Error", e); - } - hits = new Hits(0,new Hit[0]); - } - - if (NutchBean.LOG.isInfoEnabled()) { - NutchBean.LOG.info("total hits: " + hits.getTotal()); - } - - // generate xml results - int end = (int)Math.min(hits.getLength(), start + hitsPerPage); - int length = end-start; - - Hit[] show = hits.getHits(start, end-start); - HitDetails[] details = bean.getDetails(show); - Summary[] summaries = bean.getSummary(details, query); - - String requestUrl = request.getRequestURL().toString(); - String base = requestUrl.substring(0, requestUrl.lastIndexOf('/')); - - - try { - DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); - factory.setNamespaceAware(true); - Document doc = factory.newDocumentBuilder().newDocument(); - - Element rss = addNode(doc, doc, "rss"); - addAttribute(doc, rss, "version", "2.0"); - addAttribute(doc, rss, "xmlns:opensearch", - (String)NS_MAP.get("opensearch")); - addAttribute(doc, rss, "xmlns:nutch", (String)NS_MAP.get("nutch")); - - Element channel = addNode(doc, rss, "channel"); - - addNode(doc, channel, "title", "Nutch: " + queryString); - addNode(doc, channel, "description", "Nutch search results for query: " - + queryString); - addNode(doc, channel, "link", - base+"/search.jsp" - +"?query="+urlQuery - +"&start="+start - +"&hitsPerDup="+hitsPerDup - +params); - - addNode(doc, channel, "opensearch", "totalResults", ""+hits.getTotal()); - addNode(doc, channel, "opensearch", "startIndex", ""+start); - addNode(doc, channel, "opensearch", "itemsPerPage", ""+hitsPerPage); 
- - addNode(doc, channel, "nutch", "query", queryString); - - - if ((hits.totalIsExact() && end < hits.getTotal()) // more hits to show - || (!hits.totalIsExact() && (hits.getLength() > start+hitsPerPage))){ - addNode(doc, channel, "nutch", "nextPage", requestUrl - +"?query="+urlQuery - +"&start="+end - +"&hitsPerDup="+hitsPerDup - +params); - } - - if ((!hits.totalIsExact() && (hits.getLength() <= start+hitsPerPage))) { - addNode(doc, channel, "nutch", "showAllHits", requestUrl - +"?query="+urlQuery - +"&hitsPerDup="+0 - +params); - } - - for (int i = 0; i < length; i++) { - Hit hit = show[i]; - HitDetails detail = details[i]; - String title = detail.getValue("title"); - String url = detail.getValue("url"); - String id = "idx=" + hit.getIndexNo() + "&id=" + hit.getIndexDocNo(); - - if (title == null || title.equals("")) { // use url for docs w/o title - title = url; - } - - Element item = addNode(doc, channel, "item"); - - addNode(doc, item, "title", title); - if (summaries[i] != null) { - addNode(doc, item, "description", summaries[i].toString() ); - } - addNode(doc, item, "link", url); - - addNode(doc, item, "nutch", "site", hit.getDedupValue()); - - addNode(doc, item, "nutch", "cache", base+"/cached.jsp?"+id); - addNode(doc, item, "nutch", "explain", base+"/explain.jsp?"+id - +"&query="+urlQuery+"&lang="+queryLang); - - if (hit.moreFromDupExcluded()) { - addNode(doc, item, "nutch", "moreFromSite", requestUrl - +"?query=" - +URLEncoder.encode("site:"+hit.getDedupValue() - +" "+queryString, "UTF-8") - +"&hitsPerSite="+0 - +params); - } - - for (int j = 0; j < detail.getLength(); j++) { // add all from detail - String field = detail.getField(j); - if (!SKIP_DETAILS.contains(field)) - addNode(doc, item, "nutch", field, detail.getValue(j)); - } - } - - // dump DOM tree - - DOMSource source = new DOMSource(doc); - TransformerFactory transFactory = TransformerFactory.newInstance(); - Transformer transformer = transFactory.newTransformer(); - 
transformer.setOutputProperty("indent", "yes"); - StreamResult result = new StreamResult(response.getOutputStream()); - response.setContentType("text/xml"); - transformer.transform(source, result); - - } catch (javax.xml.parsers.ParserConfigurationException e) { - throw new ServletException(e); - } catch (javax.xml.transform.TransformerException e) { - throw new ServletException(e); - } - - } - - private static Element addNode(Document doc, Node parent, String name) { - Element child = doc.createElement(name); - parent.appendChild(child); - return child; - } - - private static void addNode(Document doc, Node parent, - String name, String text) { - Element child = doc.createElement(name); - child.appendChild(doc.createTextNode(getLegalXml(text))); - parent.appendChild(child); - } - - private static void addNode(Document doc, Node parent, - String ns, String name, String text) { - Element child = doc.createElementNS((String)NS_MAP.get(ns), ns+":"+name); - child.appendChild(doc.createTextNode(getLegalXml(text))); - parent.appendChild(child); - } - - private static void addAttribute(Document doc, Element node, - String name, String value) { - Attr attribute = doc.createAttribute(name); - attribute.setValue(getLegalXml(value)); - node.getAttributes().setNamedItem(attribute); - } - - /* - * Ensure string is legal xml. - * @param text String to verify. - * @return Passed <code>text</code> or a new string with illegal - * characters removed if any found in <code>text</code>. - * @see http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char - */ - protected static String getLegalXml(final String text) { - if (text == null) { - return null; - } - StringBuffer buffer = null; - for (int i = 0; i < text.length(); i++) { - char c = text.charAt(i); - if (!isLegalXml(c)) { - if (buffer == null) { - // Start up a buffer. Copy characters here from now on - // now we've found at least one bad character in original. 
- buffer = new StringBuffer(text.length()); - buffer.append(text.substring(0, i)); - } - } else { - if (buffer != null) { - buffer.append(c); - } - } - } - return (buffer != null)? buffer.toString(): text; - } - - private static boolean isLegalXml(final char c) { - return c == 0x9 || c == 0xa || c == 0xd || (c >= 0x20 && c <= 0xd7ff) - || (c >= 0xe000 && c <= 0xfffd) || (c >= 0x10000 && c <= 0x10ffff); - } - -} This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
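[Editor's note] The `OpenSearchServlet` deleted in r2745 above carried a small utility, `getLegalXml`/`isLegalXml`, for stripping characters that are illegal in XML 1.0 output. A standalone version of that check is below; note it keeps only the BMP ranges, since a Java `char` can never hold a code point above 0xFFFF (for the same reason, the deleted code's `0x10000`-`0x10ffff` clause was unreachable). The class name is illustrative:

```java
// Standalone sketch of the XML character filtering from the removed servlet.
public class LegalXml {
    // Character ranges per the XML 1.0 "Char" production, BMP only.
    static boolean isLegalXml(final char c) {
        return c == 0x9 || c == 0xa || c == 0xd
            || (c >= 0x20 && c <= 0xd7ff)
            || (c >= 0xe000 && c <= 0xfffd);
    }

    // Return text with any illegal characters dropped. The original kept
    // the buffer lazy (allocated only on the first bad character); this
    // version always copies, which is equivalent in output.
    static String getLegalXml(final String text) {
        if (text == null) {
            return null;
        }
        StringBuilder buf = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isLegalXml(c)) {
                buf.append(c);
            }
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // NUL is not legal XML and gets stripped.
        System.out.println(getLegalXml("ok\u0000text"));
    }
}
```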
From: <bi...@us...> - 2009-06-25 20:21:14
Revision: 2744
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2744&view=rev
Author:   binzino
Date:     2009-06-25 20:21:06 +0000 (Thu, 25 Jun 2009)

Log Message:
-----------
WAX-47: Use 'url' field rather than 'exacturl' field as the former
will (should) always be present whereas the latter may not.

Modified Paths:
--------------
    tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java

Modified: tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java
===================================================================
--- tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java	2009-06-23 21:35:00 UTC (rev 2743)
+++ tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/IndexSearcher.java	2009-06-25 20:21:06 UTC (rev 2744)
@@ -173,10 +173,10 @@
     {
       if ( "site".equals( dedupField ) )
         {
-          String exactUrl = reader.document( doc ).get( "exacturl");
+          String url = reader.document( doc ).get( "url");
 
           try
             {
-              java.net.URL u = new java.net.URL( exactUrl );
+              java.net.URL u = new java.net.URL( url );
               dedupValue = u.getHost();
               System.out.println("Dedup value hack:" + dedupValue);
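[Editor's note] The r2744 change above switches the `site` dedup lookup from the `exacturl` field to `url`. The host extraction itself is plain `java.net.URL#getHost`; a minimal sketch of that step (class and method names are illustrative, and the real code logs rather than returning null on a bad URL):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class DedupHost {
    // Derive the per-site dedup value from a stored "url" field value,
    // as the patched IndexSearcher does. Returns null on a malformed URL.
    static String dedupValue(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(dedupValue("http://www.example.org/page.html"));
    }
}
```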
From: <bi...@us...> - 2009-06-23 21:35:02
Revision: 2743 http://archive-access.svn.sourceforge.net/archive-access/?rev=2743&view=rev Author: binzino Date: 2009-06-23 21:35:00 +0000 (Tue, 23 Jun 2009) Log Message: ----------- Fix WAX-45 and WAX-48. ConfigurableIndexingFilter can handle all the fields relevant to Nutch(WAX). Update the nute-site.xml accordingly. Also, remove the site and url query filters from nutch-site.xml and configure NutchWAX query filter to take over for them. Modified Paths: -------------- tags/nutchwax-0_12_5/archive/src/nutch/conf/nutch-site.xml tags/nutchwax-0_12_5/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java tags/nutchwax-0_12_5/archive/src/plugin/query-nutchwax/plugin.xml Modified: tags/nutchwax-0_12_5/archive/src/nutch/conf/nutch-site.xml =================================================================== --- tags/nutchwax-0_12_5/archive/src/nutch/conf/nutch-site.xml 2009-06-23 21:17:31 UTC (rev 2742) +++ tags/nutchwax-0_12_5/archive/src/nutch/conf/nutch-site.xml 2009-06-23 21:35:00 UTC (rev 2743) @@ -10,19 +10,18 @@ <!-- Add 'index-nutchwax' and 'query-nutchwax' to plugin list. --> <!-- Also, add 'parse-pdf' --> <!-- Remove 'urlfilter-regex' and 'normalizer-(pass|regex|basic)' --> - <value>protocol-http|parse-(text|html|js|pdf)|index-(basic|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-nutchwax|urlfilter-nutchwax</value> + <value>protocol-http|parse-(text|html|js|pdf)|index-nutchwax|query-(basic|nutchwax)|summary-basic|scoring-nutchwax|urlfilter-nutchwax</value> </property> -<!-- The indexing filter order *must* be specified in order for - NutchWAX's ConfigurableIndexingFilter to be called *after* the - BasicIndexingFilter. This is necessary so that the - ConfigurableIndexingFilter can over-write some of the values put - into the Lucene document by the BasicIndexingFilter. 
- - The over-written values are the 'url' and 'digest' fields, which - NutchWAX needs to handle specially in order for de-duplication to - work properly. - --> +<!-- + When using *only* the 'index-nutchwax' in 'plugin.includes' above, + we don't need to specify an order since there is only one plugin. + + However, if you choose to use the Nutch 'index-basic', then you have + to specify the order such that the NutchWAX ConfigurableIndexingFilter + is after it. Whichever plugin comes last over-writes the values + of those that come before it. + <property> <name>indexingfilter.order</name> <value> @@ -30,29 +29,31 @@ org.archive.nutchwax.index.ConfigurableIndexingFilter </value> </property> + --> <property> <!-- Configure the 'index-nutchwax' plugin. Specify how the metadata fields added by the Importer are mapped to the Lucene documents during indexing. - The specifications here are of the form "src-key:lowercase:store:tokenize:dest-key" + The specifications here are of the form "src-key:lowercase:store:index:dest-key" Where the only required part is the "src-key", the rest will assume the following defaults: lowercase = true store = true - tokenize = false + index = tokenized exclusive = true dest-key = src-key --> <name>nutchwax.filter.index</name> <value> - url:false:true:true - url:false:true:false:true:exacturl - orig:false - digest:false - filename:false - fileoffset:false - collection - date - type - length + title:false:true:tokenized + content:false:false:tokenized + site:false:false:untokenized + + url:false:true:no + digest:false:true:no + + collection:true:true:no_norms + date:true:true:no_norms + type:true:true:no_norms + length:false:true:no </value> </property> @@ -70,15 +71,10 @@ <!-- We do *not* use this filter for handling "date" queries, there is a specific filter for that: DateQueryFilter --> <name>nutchwax.filter.query</name> <value> - raw:digest:false - raw:filename:false - raw:fileoffset:false - raw:exacturl:false group:collection + 
group:site:false group:type - field:anchor field:content - field:host field:title </value> </property> Modified: tags/nutchwax-0_12_5/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java =================================================================== --- tags/nutchwax-0_12_5/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java 2009-06-23 21:17:31 UTC (rev 2742) +++ tags/nutchwax-0_12_5/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java 2009-06-23 21:35:00 UTC (rev 2743) @@ -20,6 +20,8 @@ */ package org.archive.nutchwax.index; +import java.net.MalformedURLException; +import java.net.URL; import java.util.List; import java.util.ArrayList; @@ -27,6 +29,7 @@ import org.apache.commons.logging.LogFactory; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; +import org.apache.lucene.document.Field.Index; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.io.Text; import org.apache.nutch.crawl.CrawlDatum; @@ -46,10 +49,14 @@ private Configuration conf; private List<FieldSpecification> fieldSpecs; + private int MAX_TITLE_LENGTH; + public void setConf( Configuration conf ) { this.conf = conf; - + + this.MAX_TITLE_LENGTH = conf.getInt("indexer.max.title.length", 100); + String filterSpecs = conf.get( "nutchwax.filter.index" ); if ( null == filterSpecs ) @@ -65,12 +72,12 @@ { String spec[] = filterSpec.split("[:]"); - String srcKey = spec[0]; - boolean lowerCase = true; - boolean store = true; - boolean tokenize = false; - boolean exclusive = true; - String destKey = srcKey; + String srcKey = spec[0]; + boolean lowerCase = true; + boolean store = true; + Index index = Index.TOKENIZED; + boolean exclusive = true; + String destKey = srcKey; switch ( spec.length ) { default: @@ -79,7 +86,10 @@ case 5: exclusive = Boolean.parseBoolean( spec[4] ); case 4: - tokenize = 
Boolean.parseBoolean( spec[3] ); + index = "tokenized". equals(spec[3]) ? Index.TOKENIZED : + "untokenized".equals(spec[3]) ? Index.UN_TOKENIZED : + "no_norms". equals(spec[3]) ? Index.NO_NORMS : + Index.NO; case 3: store = Boolean.parseBoolean( spec[2] ); case 2: @@ -89,9 +99,9 @@ ; } - LOG.info( "Add field specification: " + srcKey + ":" + lowerCase + ":" + store + ":" + tokenize + ":" + exclusive + ":" + destKey ); + LOG.info( "Add field specification: " + srcKey + ":" + lowerCase + ":" + store + ":" + index + ":" + exclusive + ":" + destKey ); - this.fieldSpecs.add( new FieldSpecification( srcKey, lowerCase, store, tokenize, exclusive, destKey ) ); + this.fieldSpecs.add( new FieldSpecification( srcKey, lowerCase, store, index, exclusive, destKey ) ); } } @@ -100,16 +110,16 @@ String srcKey; boolean lowerCase; boolean store; - boolean tokenize; + Index index; boolean exclusive; String destKey; - public FieldSpecification( String srcKey, boolean lowerCase, boolean store, boolean tokenize, boolean exclusive, String destKey ) + public FieldSpecification( String srcKey, boolean lowerCase, boolean store, Index index, boolean exclusive, String destKey ) { this.srcKey = srcKey; this.lowerCase = lowerCase; this.store = store; - this.tokenize = tokenize; + this.index = index; this.exclusive = exclusive; this.destKey = destKey; } @@ -124,14 +134,47 @@ * Transfer NutchWAX field values stored in the parsed content to * the Lucene document. 
*/ - public Document filter( Document doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks ) + public Document filter( Document doc, Parse parse, Text key, CrawlDatum datum, Inlinks inlinks ) throws IndexingException { Metadata meta = parse.getData().getContentMeta(); for ( FieldSpecification spec : this.fieldSpecs ) { - String value = meta.get( spec.srcKey ); + String value = null; + if ( "site".equals( spec.srcKey ) || "host".equals( spec.srcKey ) ) + { + try + { + value = (new URL( meta.get( "url" ) ) ).getHost( ); + } + catch ( MalformedURLException mue ) { /* Eat it */ } + } + else if ( "content".equals( spec.srcKey ) ) + { + value = parse.getText( ); + } + else if ( "title".equals( spec.srcKey ) ) + { + value = parse.getData().getTitle(); + if ( value.length() > MAX_TITLE_LENGTH ) // truncate title if needed + { + value = value.substring( 0, MAX_TITLE_LENGTH ); + } + } + else if ( "type".equals( spec.srcKey ) ) + { + value = meta.get( spec.srcKey ); + + if ( value == null ) continue ; + + int p = value.indexOf( ';' ); + if ( p >= 0 ) value = value.substring( 0, p ); + } + else + { + value = meta.get( spec.srcKey ); + } if ( value == null ) continue; @@ -144,11 +187,14 @@ { doc.removeFields( spec.destKey ); } - - doc.add( new Field( spec.destKey, - value, - spec.store ? Field.Store.YES : Field.Store.NO, - spec.tokenize ? Field.Index.TOKENIZED : Field.Index.UN_TOKENIZED ) ); + + if ( spec.store || spec.index != Index.NO ) + { + doc.add( new Field( spec.destKey, + value, + spec.store ? 
Field.Store.YES : Field.Store.NO, + spec.index ) ); + } } return doc; Modified: tags/nutchwax-0_12_5/archive/src/plugin/query-nutchwax/plugin.xml =================================================================== --- tags/nutchwax-0_12_5/archive/src/plugin/query-nutchwax/plugin.xml 2009-06-23 21:17:31 UTC (rev 2742) +++ tags/nutchwax-0_12_5/archive/src/plugin/query-nutchwax/plugin.xml 2009-06-23 21:35:00 UTC (rev 2743) @@ -40,8 +40,8 @@ point="org.apache.nutch.searcher.QueryFilter"> <implementation id="ConfigurableQueryFilter" class="org.archive.nutchwax.query.ConfigurableQueryFilter"> - <parameter name="raw-fields" value="collection,date,digest,exacturl,filename,fileoffset,type" /> - <parameter name="fields" value="anchor,content,host,title" /> + <parameter name="raw-fields" value="collection,site,type" /> + <parameter name="fields" value="content,title" /> </implementation> </extension> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
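The indexing-filter specs in the r2743 diff above follow the form `src-key:lowercase:store:index:exclusive:dest-key`, where only `src-key` is required and the rest fall back to defaults. A hedged sketch of how such a spec string could be parsed (this standalone class is illustrative; the real parser lives in ConfigurableIndexingFilter):

```java
// Illustrative parser for the "src-key:lowercase:store:index:exclusive:dest-key"
// specification form used by nutchwax.filter.index above. Only src-key is
// required; the defaults below mirror the ones documented in nutch-site.xml.
public class FieldSpecDemo {
    static String[] parse(String filterSpec) {
        String[] spec = filterSpec.split("[:]");
        String srcKey    = spec[0];
        String lowerCase = "true";
        String store     = "true";
        String index     = "tokenized";
        String exclusive = "true";
        String destKey   = srcKey;
        switch (spec.length) {          // deliberate fall-through, as in the plugin
            default:
            case 6: destKey   = spec[5];
            case 5: exclusive = spec[4];
            case 4: index     = spec[3];
            case 3: store     = spec[2];
            case 2: lowerCase = spec[1];
            case 1: break;
        }
        return new String[] { srcKey, lowerCase, store, index, exclusive, destKey };
    }

    public static void main(String[] args) {
        // "collection:true:true:no_norms" leaves exclusive and dest-key at defaults
        System.out.println(String.join(":", parse("collection:true:true:no_norms")));
    }
}
```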
From: <bi...@us...> - 2009-06-23 21:17:33
Revision: 2742 http://archive-access.svn.sourceforge.net/archive-access/?rev=2742&view=rev Author: binzino Date: 2009-06-23 21:17:31 +0000 (Tue, 23 Jun 2009) Log Message: ----------- Changed getUrl() to getKey() and added code to synthesize the key from the URL and the digest value rather than relying on the "orig" field holding the key. This is to eliminate storing the key explicitly when it can be easily computed; saving space in the index. Modified Paths: -------------- tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java Modified: tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java =================================================================== --- tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java 2009-06-23 21:15:29 UTC (rev 2741) +++ tags/nutchwax-0_12_5/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java 2009-06-23 21:17:31 UTC (rev 2742) @@ -241,20 +241,20 @@ } public byte[] getContent(HitDetails details) throws IOException { - return getSegment(details).getContent(getUrl(details)); + return getSegment(details).getContent(getKey(details)); } public ParseData getParseData(HitDetails details) throws IOException { - return getSegment(details).getParseData(getUrl(details)); + return getSegment(details).getParseData(getKey(details)); } public long getFetchDate(HitDetails details) throws IOException { - return getSegment(details).getCrawlDatum(getUrl(details)) + return getSegment(details).getCrawlDatum(getKey(details)) .getFetchTime(); } public ParseText getParseText(HitDetails details) throws IOException { - return getSegment(details).getParseText(getUrl(details)); + return getSegment(details).getParseText(getKey(details)); } public Summary getSummary(HitDetails details, Query query) @@ -269,7 +269,7 @@ { try { - ParseText parseText = segment.getParseText(getUrl(details)); + ParseText parseText = 
segment.getParseText(getKey(details)); text = (parseText != null) ? parseText.getText() : ""; } catch ( Exception e ) @@ -380,11 +380,8 @@ } } - private Text getUrl(HitDetails details) { - String url = details.getValue("orig"); - if (StringUtils.isBlank(url)) { - url = details.getValue("url"); - } + private Text getKey(HitDetails details) { + String url = details.getValue("url") + " " + details.getValue("digest"); return new Text(url); } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
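Revision 2742 above stops storing the segment key in the "orig" field and instead recomputes it as the 'url' and 'digest' values joined by a space. A minimal sketch of that synthesis (the method name follows the commit; the surrounding class is illustrative):

```java
// Sketch of the key synthesis from r2742: the per-record key that was
// previously stored in the "orig" field is now recomputed on demand as
// "<url> <digest>", saving space in the index.
public class KeyDemo {
    static String getKey(String url, String digest) {
        return url + " " + digest;
    }

    public static void main(String[] args) {
        System.out.println(getKey("http://example.org/", "sha1:ABCDEF"));
        // prints "http://example.org/ sha1:ABCDEF"
    }
}
```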
From: <bi...@us...> - 2009-06-23 21:15:36
Revision: 2741 http://archive-access.svn.sourceforge.net/archive-access/?rev=2741&view=rev Author: binzino Date: 2009-06-23 21:15:29 +0000 (Tue, 23 Jun 2009) Log Message: ----------- Removed output of (now) obsolete "orig" metadata field. Modified Paths: -------------- tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/NutchWaxBean.java Modified: tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/NutchWaxBean.java =================================================================== --- tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/NutchWaxBean.java 2009-06-23 21:13:28 UTC (rev 2740) +++ tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/NutchWaxBean.java 2009-06-23 21:15:29 UTC (rev 2741) @@ -282,8 +282,6 @@ + " " + java.util.Arrays.asList( details[i].getValues( "url" ) ) + " " - + java.util.Arrays.asList( details[i].getValues( "orig" ) ) - + " " + java.util.Arrays.asList( details[i].getValues( "digest" ) ) + " " + java.util.Arrays.asList( details[i].getValues( "date" ) ) This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2009-06-23 21:13:31
Revision: 2740 http://archive-access.svn.sourceforge.net/archive-access/?rev=2740&view=rev Author: binzino Date: 2009-06-23 21:13:28 +0000 (Tue, 23 Jun 2009) Log Message: ----------- Fix WAX-46. Added command-line option to only dump a single field. Also added option to only output the # of records in the index. Modified Paths: -------------- tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java Modified: tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java =================================================================== --- tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java 2009-06-22 21:29:05 UTC (rev 2739) +++ tags/nutchwax-0_12_5/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java 2009-06-23 21:13:28 UTC (rev 2740) @@ -23,6 +23,7 @@ import java.io.File; import java.util.Iterator; import java.util.Arrays; +import java.util.Collection; import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.ArchiveParallelReader; @@ -37,10 +38,19 @@ } int offset = 0; - if ( args[0].equals( "-f" ) ) + if ( args[0].equals( "-l" ) || args[0].equals( "-c" ) ) { offset = 1; } + if ( args[0].equals( "-f" ) ) + { + if ( args.length < 2 ) + { + System.out.println( "Error: missing argument to -f\n" ); + usageAndExit( ); + } + offset = 2; + } String dirs[] = new String[args.length - offset]; System.arraycopy( args, offset, dirs, 0, args.length - offset ); @@ -51,23 +61,51 @@ reader.add( IndexReader.open( dir ) ); } - if ( offset > 0 ) + if ( args[0].equals( "-l" ) ) { listFields( reader ); } + else if ( args[0].equals( "-c" ) ) + { + countDocs( reader ); + } + else if ( args[0].equals( "-f" ) ) + { + dumpIndex( reader, args[1] ); + } else { dumpIndex( reader ); } } + private static void dumpIndex( IndexReader reader, String fieldName ) throws Exception + { + Collection fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL); + + if ( ! 
fieldNames.contains( fieldName ) ) + { + System.out.println( "Field not in index: " + fieldName ); + System.exit( 2 ); + } + + int numDocs = reader.numDocs(); + + for (int i = 0; i < numDocs; i++) + { + System.out.println( Arrays.toString( reader.document(i).getValues( (String) fieldName ) ) ); + } + + } + private static void dumpIndex( IndexReader reader ) throws Exception { - Object[] fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL).toArray(); + Object[] fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL).toArray( ); + Arrays.sort( fieldNames ); - for (int i = 0; i < fieldNames.length; i++) + for ( int i = 0; i < fieldNames.length; i++ ) { - System.out.print(fieldNames[i] + "\t"); + System.out.print( fieldNames[i] + "\t" ); } System.out.println(); @@ -87,19 +125,27 @@ private static void listFields( IndexReader reader ) throws Exception { - Iterator it = reader.getFieldNames(IndexReader.FieldOption.ALL).iterator(); - - while (it.hasNext()) + Object[] fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL).toArray( ); + Arrays.sort( fieldNames ); + + for ( int i = 0; i < fieldNames.length; i++ ) { - System.out.println(it.next()); + System.out.println( fieldNames[i] ); } - - reader.close(); } + private static void countDocs( IndexReader reader ) throws Exception + { + System.out.println( reader.numDocs( ) ); + } + private static void usageAndExit() { - System.out.println("Usage: DumpParallelIndex [-f] index1 ... indexN"); + System.out.println( "Usage: DumpParallelIndex [option] index1 ... indexN" ); + System.out.println( "Options:" ); + System.out.println( " -c Emit document count" ); + System.out.println( " -f <fieldname> Only dump specified field" ); + System.out.println( " -l List fields in index" ); System.exit(1); } } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
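The WAX-46 options above shift the index-directory arguments by an offset that depends on the leading flag: `-l` and `-c` take no value, while `-f` also consumes a field name. A small standalone sketch of that offset logic (not the actual tool):

```java
import java.util.Arrays;

// Sketch of the argument-offset logic from DumpParallelIndex in r2740:
// "-l" and "-c" occupy one slot, "-f" occupies two (flag plus field name),
// and everything after the offset is an index directory.
public class ArgOffsetDemo {
    static int offset(String[] args) {
        if (args[0].equals("-l") || args[0].equals("-c")) return 1;
        if (args[0].equals("-f")) return 2;
        return 0;
    }

    public static void main(String[] args) {
        String[] cli = { "-f", "url", "index1", "index2" };
        int off = offset(cli);
        String[] dirs = Arrays.copyOfRange(cli, off, cli.length);
        System.out.println(off + " " + Arrays.toString(dirs));
        // prints "2 [index1, index2]"
    }
}
```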
From: <bi...@us...> - 2009-06-22 21:30:14
Revision: 2739 http://archive-access.svn.sourceforge.net/archive-access/?rev=2739&view=rev Author: binzino Date: 2009-06-22 21:29:05 +0000 (Mon, 22 Jun 2009) Log Message: ----------- Copied from nutchwax-0_12_4. Added Paths: ----------- tags/nutchwax-0_12_5/
From: <bi...@us...> - 2009-06-18 18:19:23
Revision: 2738 http://archive-access.svn.sourceforge.net/archive-access/?rev=2738&view=rev Author: binzino Date: 2009-06-18 18:19:19 +0000 (Thu, 18 Jun 2009) Log Message: ----------- WAX-46: Added -f option to specify a single field to dump. Also added, -c to emit count of records in an index. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java 2009-06-11 22:20:54 UTC (rev 2737) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java 2009-06-18 18:19:19 UTC (rev 2738) @@ -23,6 +23,7 @@ import java.io.File; import java.util.Iterator; import java.util.Arrays; +import java.util.Collection; import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.ArchiveParallelReader; @@ -37,10 +38,19 @@ } int offset = 0; - if ( args[0].equals( "-f" ) ) + if ( args[0].equals( "-l" ) || args[0].equals( "-c" ) ) { offset = 1; } + if ( args[0].equals( "-f" ) ) + { + if ( args.length < 2 ) + { + System.out.println( "Error: missing argument to -f\n" ); + usageAndExit( ); + } + offset = 2; + } String dirs[] = new String[args.length - offset]; System.arraycopy( args, offset, dirs, 0, args.length - offset ); @@ -51,23 +61,51 @@ reader.add( IndexReader.open( dir ) ); } - if ( offset > 0 ) + if ( args[0].equals( "-l" ) ) { listFields( reader ); } + else if ( args[0].equals( "-c" ) ) + { + countDocs( reader ); + } + else if ( args[0].equals( "-f" ) ) + { + dumpIndex( reader, args[1] ); + } else { dumpIndex( reader ); } } + private static void dumpIndex( IndexReader reader, String fieldName ) throws Exception + { + Collection fieldNames = 
reader.getFieldNames(IndexReader.FieldOption.ALL); + + if ( ! fieldNames.contains( fieldName ) ) + { + System.out.println( "Field not in index: " + fieldName ); + System.exit( 2 ); + } + + int numDocs = reader.numDocs(); + + for (int i = 0; i < numDocs; i++) + { + System.out.println( Arrays.toString( reader.document(i).getValues( (String) fieldName ) ) ); + } + + } + private static void dumpIndex( IndexReader reader ) throws Exception { - Object[] fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL).toArray(); + Object[] fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL).toArray( ); + Arrays.sort( fieldNames ); - for (int i = 0; i < fieldNames.length; i++) + for ( int i = 0; i < fieldNames.length; i++ ) { - System.out.print(fieldNames[i] + "\t"); + System.out.print( fieldNames[i] + "\t" ); } System.out.println(); @@ -87,19 +125,27 @@ private static void listFields( IndexReader reader ) throws Exception { - Iterator it = reader.getFieldNames(IndexReader.FieldOption.ALL).iterator(); - - while (it.hasNext()) + Object[] fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL).toArray( ); + Arrays.sort( fieldNames ); + + for ( int i = 0; i < fieldNames.length; i++ ) { - System.out.println(it.next()); + System.out.println( fieldNames[i] ); } - - reader.close(); } + private static void countDocs( IndexReader reader ) throws Exception + { + System.out.println( reader.numDocs( ) ); + } + private static void usageAndExit() { - System.out.println("Usage: DumpParallelIndex [-f] index1 ... indexN"); + System.out.println( "Usage: DumpParallelIndex [option] index1 ... indexN" ); + System.out.println( "Options:" ); + System.out.println( " -c Emit document count" ); + System.out.println( " -f <fieldname> Only dump specified field" ); + System.out.println( " -l List fields in index" ); System.exit(1); } } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bra...@us...> - 2009-06-11 22:21:12
Revision: 2737 http://archive-access.svn.sourceforge.net/archive-access/?rev=2737&view=rev Author: bradtofel Date: 2009-06-11 22:20:54 +0000 (Thu, 11 Jun 2009) Log Message: ----------- TWEAK: changed bad NotImplementedException to UnsupportedOperationException. Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/ResourceFileLocationDBLog.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/CompositeSortedIterator.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/PeekableIterator.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/ResourceFileLocationDBLog.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/ResourceFileLocationDBLog.java 2009-06-09 22:48:09 UTC (rev 2736) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/ResourceFileLocationDBLog.java 2009-06-11 22:20:54 UTC (rev 2737) @@ -34,8 +34,6 @@ import org.archive.wayback.util.CloseableIterator; import org.archive.wayback.util.flatfile.RecordIterator; -import sun.reflect.generics.reflectiveObjects.NotImplementedException; - /** * Simple log file tracking new names being added to a ResourceFileLocationDB. 
* @@ -169,7 +167,7 @@ * @see java.util.Iterator#remove() */ public void remove() { - throw new NotImplementedException(); + throw new UnsupportedOperationException(); } } } Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/CompositeSortedIterator.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/CompositeSortedIterator.java 2009-06-09 22:48:09 UTC (rev 2736) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/CompositeSortedIterator.java 2009-06-11 22:20:54 UTC (rev 2737) @@ -31,8 +31,6 @@ import java.util.NoSuchElementException; -import sun.reflect.generics.reflectiveObjects.NotImplementedException; - /** * Composite of multiple Iterators that returns the next from a series of * all component Iterators based on Comparator constructor argument. @@ -100,7 +98,7 @@ * @see java.util.Iterator#remove() */ public void remove() { - throw new NotImplementedException(); + throw new UnsupportedOperationException(); } /* (non-Javadoc) Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/PeekableIterator.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/PeekableIterator.java 2009-06-09 22:48:09 UTC (rev 2736) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/PeekableIterator.java 2009-06-11 22:20:54 UTC (rev 2737) @@ -27,8 +27,6 @@ import java.io.IOException; import java.util.Iterator; -import sun.reflect.generics.reflectiveObjects.NotImplementedException; - /** * * @@ -90,6 +88,6 @@ * @see java.util.Iterator#remove() */ public void remove() { - throw new NotImplementedException(); + throw new UnsupportedOperationException(); } } This was sent by the SourceForge.net 
collaborative development platform, the world's largest Open Source development site. |
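Revision 2737 above replaces the JDK-internal sun.reflect NotImplementedException with the standard java.lang.UnsupportedOperationException in read-only iterators. A minimal sketch of the convention (this wrapper class is illustrative, not Wayback code):

```java
import java.util.Iterator;
import java.util.List;

// Read-only iterator wrapper illustrating the fix in r2737: remove()
// throws the standard java.lang.UnsupportedOperationException rather
// than the JDK-internal NotImplementedException.
public class ReadOnlyIterator<T> implements Iterator<T> {
    private final Iterator<T> inner;

    public ReadOnlyIterator(Iterator<T> inner) { this.inner = inner; }

    public boolean hasNext() { return inner.hasNext(); }
    public T next()          { return inner.next(); }
    public void remove()     { throw new UnsupportedOperationException(); }

    public static void main(String[] args) {
        Iterator<String> it = new ReadOnlyIterator<>(List.of("a", "b").iterator());
        while (it.hasNext()) System.out.println(it.next());
    }
}
```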
From: <bra...@us...> - 2009-06-09 22:48:10
Revision: 2736 http://archive-access.svn.sourceforge.net/archive-access/?rev=2736&view=rev Author: bradtofel Date: 2009-06-09 22:48:09 +0000 (Tue, 09 Jun 2009) Log Message: ----------- BUGFIX: Conditional GET SearchResult Annotater was indication duplicate type was due to Digest match. Added support for HTTP-Duplicate to CaptureSearchResult, and now the ConditionalGetAnnotationSearchResultAdapter uses these methods to indicate the correct type of duplicate record. Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/CaptureSearchResult.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/CaptureSearchResult.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/CaptureSearchResult.java 2009-06-09 21:20:22 UTC (rev 2735) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/CaptureSearchResult.java 2009-06-09 22:48:09 UTC (rev 2736) @@ -203,6 +203,7 @@ public void setClosest(boolean value) { putBoolean(CAPTURE_CLOSEST_INDICATOR,value); } + public void flagDuplicateDigest(Date storedDate) { put(CAPTURE_DUPLICATE_ANNOTATION,CAPTURE_DUPLICATE_DIGEST); put(CAPTURE_DUPLICATE_STORED_TS,dateToTS(storedDate)); @@ -216,19 +217,40 @@ return (dupeType != null && dupeType.equals(CAPTURE_DUPLICATE_DIGEST)); } public Date getDuplicateDigestStoredDate() { - String dupeType = get(CAPTURE_DUPLICATE_ANNOTATION); - Date date = null; - if(dupeType != null && dupeType.equals(CAPTURE_DUPLICATE_DIGEST)) { - date = tsToDate(get(CAPTURE_DUPLICATE_STORED_TS)); + if(isDuplicateDigest()) { + return tsToDate(get(CAPTURE_DUPLICATE_STORED_TS)); } - return date; + return null; 
} public String getDuplicateDigestStoredTimestamp() { + if(isDuplicateDigest()) { + return get(CAPTURE_DUPLICATE_STORED_TS); + } + return null; + } + + public void flagDuplicateHTTP(Date storedDate) { + put(CAPTURE_DUPLICATE_ANNOTATION,CAPTURE_DUPLICATE_HTTP); + put(CAPTURE_DUPLICATE_STORED_TS,dateToTS(storedDate)); + } + public void flagDuplicateHTTP(String storedTS) { + put(CAPTURE_DUPLICATE_ANNOTATION,CAPTURE_DUPLICATE_HTTP); + put(CAPTURE_DUPLICATE_STORED_TS,storedTS); + } + public boolean isDuplicateHTTP() { String dupeType = get(CAPTURE_DUPLICATE_ANNOTATION); - String ts = null; - if(dupeType != null && dupeType.equals(CAPTURE_DUPLICATE_DIGEST)) { - ts = get(CAPTURE_DUPLICATE_STORED_TS); + return (dupeType != null && dupeType.equals(CAPTURE_DUPLICATE_HTTP)); + } + public Date getDuplicateHTTPStoredDate() { + if(isDuplicateHTTP()) { + return tsToDate(get(CAPTURE_DUPLICATE_STORED_TS)); } - return ts; + return null; } + public String getDuplicateHTTPStoredTimestamp() { + if(isDuplicateHTTP()) { + return get(CAPTURE_DUPLICATE_STORED_TS); + } + return null; + } } Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java 2009-06-09 21:20:22 UTC (rev 2735) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java 2009-06-09 22:48:09 UTC (rev 2736) @@ -78,7 +78,7 @@ o.setHttpCode(lastSeen.getHttpCode()); o.setMimeType(lastSeen.getMimeType()); o.setRedirectUrl(lastSeen.getRedirectUrl()); - o.flagDuplicateDigest(lastSeen.getCaptureTimestamp()); + o.flagDuplicateHTTP(lastSeen.getCaptureTimestamp()); return o; } This was sent by the 
SourceForge.net collaborative development platform, the world's largest Open Source development site. |
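The r2736 bugfix above distinguishes digest-based duplicates from HTTP conditional-GET duplicates by the annotation value stored on the result. A sketch of that flag/check pair (the real logic lives in CaptureSearchResult's field map; the constant values below are placeholders, not the actual ones):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the duplicate-type annotation from r2736. The annotation
// field records *why* a capture is a duplicate ("digest" vs "http"),
// and the stored timestamp is only meaningful for the matching type.
// Constant values are assumptions for illustration.
public class DupFlagsDemo {
    static final String ANNOTATION = "duplicate-annotation";
    static final String STORED_TS  = "duplicate-stored-ts";
    static final String DIGEST     = "digest";
    static final String HTTP       = "http";

    private final Map<String, String> fields = new HashMap<>();

    void flagDuplicateDigest(String ts) { fields.put(ANNOTATION, DIGEST); fields.put(STORED_TS, ts); }
    void flagDuplicateHTTP(String ts)   { fields.put(ANNOTATION, HTTP);   fields.put(STORED_TS, ts); }

    boolean isDuplicateDigest() { return DIGEST.equals(fields.get(ANNOTATION)); }
    boolean isDuplicateHTTP()   { return HTTP.equals(fields.get(ANNOTATION)); }

    String getDuplicateHTTPStoredTimestamp() {
        return isDuplicateHTTP() ? fields.get(STORED_TS) : null;
    }

    public static void main(String[] args) {
        DupFlagsDemo r = new DupFlagsDemo();
        r.flagDuplicateHTTP("20090609212022");
        System.out.println(r.isDuplicateHTTP() + " " + r.getDuplicateHTTPStoredTimestamp());
        // prints "true 20090609212022"
    }
}
```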
Revision: 2735 http://archive-access.svn.sourceforge.net/archive-access/?rev=2735&view=rev Author: bradtofel Date: 2009-06-09 21:20:22 +0000 (Tue, 09 Jun 2009) Log Message: ----------- TWEAK: removed unused import. Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java 2009-06-09 21:18:20 UTC (rev 2734) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java 2009-06-09 21:20:22 UTC (rev 2735) @@ -24,8 +24,6 @@ */ package org.archive.wayback.resourceindex.adapters; -import java.util.HashMap; - import org.archive.wayback.core.CaptureSearchResult; import org.archive.wayback.util.Adapter; This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
Revision: 2734 http://archive-access.svn.sourceforge.net/archive-access/?rev=2734&view=rev Author: bradtofel Date: 2009-06-09 21:18:20 +0000 (Tue, 09 Jun 2009) Log Message: ----------- FEATURE: added ConditionalGET annotation capability. Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java 2009-06-09 21:12:27 UTC (rev 2733) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java 2009-06-09 21:18:20 UTC (rev 2734) @@ -28,8 +28,6 @@ import java.util.Iterator; import org.apache.commons.httpclient.URIException; -import org.archive.net.UURI; -import org.archive.net.UURIFactory; import org.archive.wayback.ResourceIndex; import org.archive.wayback.UrlCanonicalizer; import org.archive.wayback.core.CaptureSearchResult; @@ -43,12 +41,12 @@ import org.archive.wayback.exception.BadQueryException; import org.archive.wayback.exception.ResourceIndexNotAvailableException; import org.archive.wayback.exception.ResourceNotInArchiveException; +import org.archive.wayback.resourceindex.adapters.ConditionalGetAnnotationSearchResultAdapter; import org.archive.wayback.resourceindex.adapters.CaptureToUrlSearchResultAdapter; import org.archive.wayback.resourceindex.adapters.DeduplicationSearchResultAnnotationAdapter; import org.archive.wayback.resourceindex.filters.CounterFilter; import org.archive.wayback.resourceindex.filters.DateRangeFilter; import org.archive.wayback.resourceindex.filters.DuplicateRecordFilter; -import org.archive.wayback.resourceindex.filters.EndDateFilter; import 
org.archive.wayback.resourceindex.filters.GuardRailFilter; import org.archive.wayback.resourceindex.filters.HostMatchFilter; import org.archive.wayback.resourceindex.filters.SchemeMatchFilter; @@ -101,7 +99,10 @@ CloseableIterator<CaptureSearchResult> captures = source.getPrefixIterator(k); if(dedupeRecords) { + // hack hack!!! captures = new AdaptedIterator<CaptureSearchResult, CaptureSearchResult> + (captures, new ConditionalGetAnnotationSearchResultAdapter()); + captures = new AdaptedIterator<CaptureSearchResult, CaptureSearchResult> (captures, new DeduplicationSearchResultAnnotationAdapter()); } return captures; @@ -126,14 +127,15 @@ CaptureSearchResults results = new CaptureSearchResults(); CaptureQueryFilterState filterState = - new CaptureQueryFilterState(wbRequest,canonicalizer, type, filter); + new CaptureQueryFilterState(wbRequest, canonicalizer, type, + getUserFilters(wbRequest)); String keyUrl = filterState.getKeyUrl(); CloseableIterator<CaptureSearchResult> itr = getCaptureIterator(keyUrl); // set up the common Filters: ObjectFilter<CaptureSearchResult> filter = filterState.getFilter(); itr = new ObjectFilterIterator<CaptureSearchResult>(itr,filter); - + // Windowing: WindowFilterState<CaptureSearchResult> window = new WindowFilterState<CaptureSearchResult>(wbRequest); @@ -154,6 +156,7 @@ cleanupIterator(itr); return results; } + public UrlSearchResults doUrlQuery(WaybackRequest wbRequest) throws ResourceIndexNotAvailableException, ResourceNotInArchiveException, BadQueryException, @@ -163,7 +166,7 @@ CaptureQueryFilterState filterState = new CaptureQueryFilterState(wbRequest,canonicalizer, - CaptureQueryFilterState.TYPE_URL, filter); + CaptureQueryFilterState.TYPE_URL, getUserFilters(wbRequest)); String keyUrl = filterState.getKeyUrl(); CloseableIterator<CaptureSearchResult> citr = getCaptureIterator(keyUrl); @@ -300,6 +303,27 @@ this.filter = filter; } + public ObjectFilterChain<CaptureSearchResult> getUserFilters(WaybackRequest request) { + 
ObjectFilterChain<CaptureSearchResult> userFilters = + new ObjectFilterChain<CaptureSearchResult>(); + + // has the user asked for only results on the exact host specified? + if(request.isExactHost()) { + userFilters.addFilter(new HostMatchFilter( + UrlOperations.urlToHost(request.getRequestUrl()))); + } + + if(request.isExactScheme()) { + userFilters.addFilter(new SchemeMatchFilter( + UrlOperations.urlToScheme(request.getRequestUrl()))); + } + if(filter != null) { + userFilters.addFilter(filter); + } + + return userFilters; + } + private class CaptureQueryFilterState { public final static int TYPE_REPLAY = 0; public final static int TYPE_CAPTURE = 1; @@ -315,7 +339,7 @@ public CaptureQueryFilterState(WaybackRequest request, UrlCanonicalizer canonicalizer, int type, - ObjectFilter<CaptureSearchResult> genericFilter) + ObjectFilterChain<CaptureSearchResult> userFilter) throws BadQueryException { String searchUrl = request.getRequestUrl(); @@ -346,12 +370,6 @@ preExclusionCounter = new CounterFilter(); DateRangeFilter drFilter = new DateRangeFilter(startDate,endDate); - if(genericFilter != null) { - filter.addFilter(genericFilter); - } - // has the user asked for only results on the exact host specified? 
- ObjectFilter<CaptureSearchResult> exactHost = - getExactHostFilter(request); // checks an exclusion service for every matching record ObjectFilter<CaptureSearchResult> exclusion = request.getExclusionFilter(); @@ -363,7 +381,7 @@ if(type == TYPE_REPLAY) { filter.addFilter(new UrlMatchFilter(keyUrl)); - filter.addFilter(new EndDateFilter(endDate)); + filter.addFilter(drFilter); SelfRedirectFilter selfRedirectFilter= new SelfRedirectFilter(); selfRedirectFilter.setCanonicalizer(canonicalizer); filter.addFilter(selfRedirectFilter); @@ -377,14 +395,10 @@ throw new BadQueryException("Unknown type"); } - if(exactHost != null) { - filter.addFilter(exactHost); + if(userFilter != null) { + filter.addFilters(userFilter.getFilters()); } - if(request.isExactScheme()) { - filter.addFilter(new SchemeMatchFilter( - UrlOperations.urlToScheme(request.getRequestUrl()))); - } // count how many results got to the ExclusionFilter: filter.addFilter(preExclusionCounter); @@ -425,26 +439,6 @@ } } - private static HostMatchFilter getExactHostFilter(WaybackRequest r) { - - HostMatchFilter filter = null; - if(r.isExactHost()) { - - String searchUrl = r.getRequestUrl(); - try { - - UURI searchURI = UURIFactory.getInstance(searchUrl); - String exactHost = searchURI.getHost(); - filter = new HostMatchFilter(exactHost); - - } catch (URIException e) { - // Really, this isn't gonna happen, we've already canonicalized - // it... should really optimize and do that just once. - e.printStackTrace(); - } - } - return filter; - } private class WindowFilterState<T> { int startResult; // calculated based on hits/page * pagenum int resultsPerPage; This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
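The refactoring in r2734 above replaces the ad-hoc exact-host and exact-scheme checks with a single user-filter chain built by getUserFilters(). A minimal, self-contained sketch of that filter-chain pattern follows; the Filter/FilterChain types here are simplified stand-ins for Wayback's ObjectFilter/ObjectFilterChain (the real interfaces also support an ABORT signal and per-filter counters):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: AND-composition of per-request filters, as in getUserFilters().
public class FilterChainSketch {
    interface Filter<T> { boolean include(T item); }

    static class FilterChain<T> implements Filter<T> {
        private final List<Filter<T>> filters = new ArrayList<>();
        void addFilter(Filter<T> f) { filters.add(f); }
        public boolean include(T item) {
            for (Filter<T> f : filters) {
                if (!f.include(item)) return false; // first EXCLUDE wins
            }
            return true;
        }
    }

    // Analogous to HostMatchFilter: keep only captures from one host.
    static Filter<String> hostFilter(String host) {
        return url -> url.contains("://" + host + "/");
    }

    // Analogous to SchemeMatchFilter: keep only captures with one scheme.
    static Filter<String> schemeFilter(String scheme) {
        return url -> url.startsWith(scheme + "://");
    }

    public static void main(String[] args) {
        FilterChain<String> userFilters = new FilterChain<>();
        userFilters.addFilter(hostFilter("example.org"));
        userFilters.addFilter(schemeFilter("http"));

        System.out.println(userFilters.include("http://example.org/page"));  // true
        System.out.println(userFilters.include("https://example.org/page")); // false
        System.out.println(userFilters.include("http://other.org/page"));    // false
    }
}
```

Building the chain once per request keeps doCaptureQuery and doUrlQuery from duplicating the host/scheme logic, which is the point of the diff above.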
Revision: 2733
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2733&view=rev
Author:   bradtofel
Date:     2009-06-09 21:12:27 +0000 (Tue, 09 Jun 2009)

Log Message:
-----------
INITIAL REV: class to annotate 304-dedupe WARC records with the values from the previous stored capture.

Added Paths:
-----------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java

Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java                        (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/ConditionalGetAnnotationSearchResultAdapter.java 2009-06-09 21:12:27 UTC (rev 2733)
@@ -0,0 +1,101 @@
+/* ConditionalGetAnnotationSearchResultAdapter
+ *
+ * $Id$
+ *
+ * Created on 6:09:05 PM Mar 12, 2009.
+ *
+ * Copyright (C) 2009 Internet Archive.
+ *
+ * This file is part of wayback.
+ *
+ * wayback is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or
+ * any later version.
+ *
+ * wayback is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU Lesser Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser Public License
+ * along with wayback; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+package org.archive.wayback.resourceindex.adapters;
+
+import java.util.HashMap;
+
+import org.archive.wayback.core.CaptureSearchResult;
+import org.archive.wayback.util.Adapter;
+
+/**
+ * WARC file allows 2 forms of deduplication. The first actually downloads
+ * documents and compares their digest with a database of previous values. When
+ * a new capture of a document exactly matches the previous digest, an
+ * abbreviated record is stored in the WARC file. The second form uses an HTTP
+ * conditional GET request, sending previous values returned for a given URL
+ * (etag, last-modified, etc). In this case, the remote server either sends a
+ * new document (200) which is stored normally, or the server will return a
+ * 304 (Not Modified) response, which is stored in the WARC file.
+ *
+ * For the first record type, the wayback indexer will output a placeholder
+ * record that includes the digest of the last-stored record. For 304 responses,
+ * the indexer outputs a normal looking record, but the record will have a
+ * SHA1 digest which is easily distinguishable as an "empty" document. The SHA1
+ * is always:
+ *
+ *   3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
+ *
+ * This class will observe a stream of SearchResults, storing the values for
+ * the last seen non-empty SHA1 field. Any subsequent SearchResults with an
+ * empty SHA1 will be annotated, copying the values from the last non-empty
+ * record.
+ *
+ * This is highly experimental.
+ *
+ * @author brad
+ * @version $Date$, $Revision$
+ */
+
+public class ConditionalGetAnnotationSearchResultAdapter
+implements Adapter<CaptureSearchResult,CaptureSearchResult> {
+
+    private final static String EMPTY_VALUE = "-";
+    private final static String EMPTY_SHA1 = "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ";
+
+    private CaptureSearchResult lastSeen = null;
+
+    public ConditionalGetAnnotationSearchResultAdapter() {
+    }
+
+    private CaptureSearchResult annotate(CaptureSearchResult o) {
+        if(lastSeen == null) {
+            // TODO: log missing record digest reference
+            return null;
+        }
+        o.setFile(lastSeen.getFile());
+        o.setOffset(lastSeen.getOffset());
+        o.setDigest(lastSeen.getDigest());
+        o.setHttpCode(lastSeen.getHttpCode());
+        o.setMimeType(lastSeen.getMimeType());
+        o.setRedirectUrl(lastSeen.getRedirectUrl());
+        o.flagDuplicateDigest(lastSeen.getCaptureTimestamp());
+        return o;
+    }
+
+    private CaptureSearchResult remember(CaptureSearchResult o) {
+        lastSeen = o;
+        return o;
+    }
+
+    public CaptureSearchResult adapt(CaptureSearchResult o) {
+        if(o.getFile().equals(EMPTY_VALUE)) {
+            if(o.getDigest().equals(EMPTY_SHA1)) {
+                return annotate(o);
+            }
+            return o;
+        }
+        return remember(o);
+    }
+}
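The adapter added in r2733 can be illustrated with a toy model. In this sketch, Capture is a simplified, hypothetical stand-in for CaptureSearchResult (only two of the copied fields are modeled); it shows the stream behavior the javadoc describes, with a 304 placeholder back-filled from the most recent stored capture:

```java
// Sketch: back-fill a 304 placeholder record (file "-", "empty document"
// SHA1) from the last fully stored capture seen in the stream.
public class ConditionalGetSketch {
    static final String EMPTY_VALUE = "-";
    static final String EMPTY_SHA1 = "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ";

    static class Capture {
        String file, digest, timestamp;
        Capture(String file, String digest, String timestamp) {
            this.file = file;
            this.digest = digest;
            this.timestamp = timestamp;
        }
    }

    private static Capture lastSeen = null;

    static Capture adapt(Capture c) {
        if (c.file.equals(EMPTY_VALUE) && c.digest.equals(EMPTY_SHA1)) {
            if (lastSeen == null) {
                return null;             // no earlier capture to copy from
            }
            c.file = lastSeen.file;      // point replay at the stored copy
            c.digest = lastSeen.digest;  // report the real document digest
            return c;
        }
        lastSeen = c;                    // remember the last stored capture
        return c;
    }

    public static void main(String[] args) {
        Capture stored = new Capture("IA-0001.warc.gz", "SOMEREALDIGEST", "20090101000000");
        Capture dedup  = new Capture(EMPTY_VALUE, EMPTY_SHA1, "20090201000000");
        adapt(stored);
        Capture annotated = adapt(dedup);
        System.out.println(annotated.file + " " + annotated.digest);
        // -> IA-0001.warc.gz SOMEREALDIGEST
    }
}
```

The file name here ("IA-0001.warc.gz") and field set are illustrative only; the real adapter also copies offset, HTTP code, mime type, and redirect URL, and flags the duplicate digest with the original capture timestamp.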
Revision: 2732 http://archive-access.svn.sourceforge.net/archive-access/?rev=2732&view=rev Author: binzino Date: 2009-06-04 19:06:37 +0000 (Thu, 04 Jun 2009) Log Message: ----------- We have our own OpenSearchServlet in the org.archive.nutchwax package, so we no longer need to keep a patched version. Removed Paths: ------------- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java Deleted: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java 2009-06-04 18:02:50 UTC (rev 2731) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/OpenSearchServlet.java 2009-06-04 19:06:37 UTC (rev 2732) @@ -1,333 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.nutch.searcher; - -import java.io.IOException; -import java.net.URLEncoder; -import java.util.Map; -import java.util.HashMap; -import java.util.Set; -import java.util.HashSet; - -import javax.servlet.ServletException; -import javax.servlet.ServletConfig; -import javax.servlet.http.HttpServlet; -import javax.servlet.http.HttpServletRequest; -import javax.servlet.http.HttpServletResponse; - -import javax.xml.parsers.*; - -import org.apache.hadoop.conf.Configuration; -import org.apache.nutch.util.NutchConfiguration; -import org.w3c.dom.*; -import javax.xml.transform.TransformerFactory; -import javax.xml.transform.Transformer; -import javax.xml.transform.dom.DOMSource; -import javax.xml.transform.stream.StreamResult; - - -/** Present search results using A9's OpenSearch extensions to RSS, plus a few - * Nutch-specific extensions. */ -public class OpenSearchServlet extends HttpServlet { - private static final Map NS_MAP = new HashMap(); - private int MAX_HITS_PER_PAGE; - - static { - NS_MAP.put("opensearch", "http://a9.com/-/spec/opensearchrss/1.0/"); - NS_MAP.put("nutch", "http://www.nutch.org/opensearchrss/1.0/"); - } - - private static final Set SKIP_DETAILS = new HashSet(); - static { - SKIP_DETAILS.add("url"); // redundant with RSS link - SKIP_DETAILS.add("title"); // redundant with RSS title - } - - private NutchBean bean; - private Configuration conf; - - public void init(ServletConfig config) throws ServletException { - try { - this.conf = NutchConfiguration.get(config.getServletContext()); - bean = NutchBean.get(config.getServletContext(), this.conf); - } catch (IOException e) { - throw new ServletException(e); - } - MAX_HITS_PER_PAGE = conf.getInt("searcher.max.hits.per.page", -1); - } - - public void doGet(HttpServletRequest request, HttpServletResponse response) - throws ServletException, IOException { - - if (NutchBean.LOG.isInfoEnabled()) { - NutchBean.LOG.info("query request from " + request.getRemoteAddr()); - } - - // get 
parameters from request - request.setCharacterEncoding("UTF-8"); - String queryString = request.getParameter("query"); - if (queryString == null) - queryString = ""; - String urlQuery = URLEncoder.encode(queryString, "UTF-8"); - - // the query language - String queryLang = request.getParameter("lang"); - - int start = 0; // first hit to display - String startString = request.getParameter("start"); - if (startString != null) - start = Integer.parseInt(startString); - - int hitsPerPage = 10; // number of hits to display - String hitsString = request.getParameter("hitsPerPage"); - if (hitsString != null) - hitsPerPage = Integer.parseInt(hitsString); - if(MAX_HITS_PER_PAGE > 0 && hitsPerPage > MAX_HITS_PER_PAGE) - hitsPerPage = MAX_HITS_PER_PAGE; - - String sort = request.getParameter("sort"); - boolean reverse = - sort!=null && "true".equals(request.getParameter("reverse")); - - // De-Duplicate handling. Look for duplicates field and for how many - // duplicates per results to return. Default duplicates field is 'site' - // and duplicates per results default is '2'. - String dedupField = request.getParameter("dedupField"); - if (dedupField == null || dedupField.length() == 0) { - dedupField = "site"; - } - int hitsPerDup = 2; - String hitsPerDupString = request.getParameter("hitsPerDup"); - if (hitsPerDupString != null && hitsPerDupString.length() > 0) { - hitsPerDup = Integer.parseInt(hitsPerDupString); - } else { - // If 'hitsPerSite' present, use that value. - String hitsPerSiteString = request.getParameter("hitsPerSite"); - if (hitsPerSiteString != null && hitsPerSiteString.length() > 0) { - hitsPerDup = Integer.parseInt(hitsPerSiteString); - } - } - - // Make up query string for use later drawing the 'rss' logo. - String params = "&hitsPerPage=" + hitsPerPage + - (queryLang == null ? "" : "&lang=" + queryLang) + - (sort == null ? "" : "&sort=" + sort + (reverse? "&reverse=true": "") + - (dedupField == null ? 
"" : "&dedupField=" + dedupField)); - - Query query = Query.parse(queryString, queryLang, this.conf); - if (NutchBean.LOG.isInfoEnabled()) { - NutchBean.LOG.info("query: " + queryString); - NutchBean.LOG.info("lang: " + queryLang); - } - - // execute the query - Hits hits; - try { - hits = bean.search(query, start + hitsPerPage, hitsPerDup, dedupField, - sort, reverse); - } catch (IOException e) { - if (NutchBean.LOG.isWarnEnabled()) { - NutchBean.LOG.warn("Search Error", e); - } - hits = new Hits(0,new Hit[0]); - } - - if (NutchBean.LOG.isInfoEnabled()) { - NutchBean.LOG.info("total hits: " + hits.getTotal()); - } - - // generate xml results - int end = (int)Math.min(hits.getLength(), start + hitsPerPage); - int length = end-start; - - Hit[] show = hits.getHits(start, end-start); - HitDetails[] details = bean.getDetails(show); - Summary[] summaries = bean.getSummary(details, query); - - String requestUrl = request.getRequestURL().toString(); - String base = requestUrl.substring(0, requestUrl.lastIndexOf('/')); - - - try { - DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); - factory.setNamespaceAware(true); - Document doc = factory.newDocumentBuilder().newDocument(); - - Element rss = addNode(doc, doc, "rss"); - addAttribute(doc, rss, "version", "2.0"); - addAttribute(doc, rss, "xmlns:opensearch", - (String)NS_MAP.get("opensearch")); - addAttribute(doc, rss, "xmlns:nutch", (String)NS_MAP.get("nutch")); - - Element channel = addNode(doc, rss, "channel"); - - addNode(doc, channel, "title", "Nutch: " + queryString); - addNode(doc, channel, "description", "Nutch search results for query: " - + queryString); - addNode(doc, channel, "link", - base+"/search.jsp" - +"?query="+urlQuery - +"&start="+start - +"&hitsPerDup="+hitsPerDup - +params); - - addNode(doc, channel, "opensearch", "totalResults", ""+hits.getTotal()); - addNode(doc, channel, "opensearch", "startIndex", ""+start); - addNode(doc, channel, "opensearch", "itemsPerPage", ""+hitsPerPage); 
- - addNode(doc, channel, "nutch", "query", queryString); - - - if ((hits.totalIsExact() && end < hits.getTotal()) // more hits to show - || (!hits.totalIsExact() && (hits.getLength() > start+hitsPerPage))){ - addNode(doc, channel, "nutch", "nextPage", requestUrl - +"?query="+urlQuery - +"&start="+end - +"&hitsPerDup="+hitsPerDup - +params); - } - - if ((!hits.totalIsExact() && (hits.getLength() <= start+hitsPerPage))) { - addNode(doc, channel, "nutch", "showAllHits", requestUrl - +"?query="+urlQuery - +"&hitsPerDup="+0 - +params); - } - - for (int i = 0; i < length; i++) { - Hit hit = show[i]; - HitDetails detail = details[i]; - String title = detail.getValue("title"); - String url = detail.getValue("url"); - String id = "idx=" + hit.getIndexNo() + "&id=" + hit.getIndexDocNo(); - - if (title == null || title.equals("")) { // use url for docs w/o title - title = url; - } - - Element item = addNode(doc, channel, "item"); - - addNode(doc, item, "title", title); - if (summaries[i] != null) { - addNode(doc, item, "description", summaries[i].toString() ); - } - addNode(doc, item, "link", url); - - addNode(doc, item, "nutch", "site", hit.getDedupValue()); - - addNode(doc, item, "nutch", "cache", base+"/cached.jsp?"+id); - addNode(doc, item, "nutch", "explain", base+"/explain.jsp?"+id - +"&query="+urlQuery+"&lang="+queryLang); - - if (hit.moreFromDupExcluded()) { - addNode(doc, item, "nutch", "moreFromSite", requestUrl - +"?query=" - +URLEncoder.encode("site:"+hit.getDedupValue() - +" "+queryString, "UTF-8") - +"&hitsPerSite="+0 - +params); - } - - for (int j = 0; j < detail.getLength(); j++) { // add all from detail - String field = detail.getField(j); - if (!SKIP_DETAILS.contains(field)) - addNode(doc, item, "nutch", field, detail.getValue(j)); - } - } - - // dump DOM tree - - DOMSource source = new DOMSource(doc); - TransformerFactory transFactory = TransformerFactory.newInstance(); - Transformer transformer = transFactory.newTransformer(); - 
transformer.setOutputProperty("indent", "yes"); - StreamResult result = new StreamResult(response.getOutputStream()); - response.setContentType("text/xml"); - transformer.transform(source, result); - - } catch (javax.xml.parsers.ParserConfigurationException e) { - throw new ServletException(e); - } catch (javax.xml.transform.TransformerException e) { - throw new ServletException(e); - } - - } - - private static Element addNode(Document doc, Node parent, String name) { - Element child = doc.createElement(name); - parent.appendChild(child); - return child; - } - - private static void addNode(Document doc, Node parent, - String name, String text) { - Element child = doc.createElement(name); - child.appendChild(doc.createTextNode(getLegalXml(text))); - parent.appendChild(child); - } - - private static void addNode(Document doc, Node parent, - String ns, String name, String text) { - Element child = doc.createElementNS((String)NS_MAP.get(ns), ns+":"+name); - child.appendChild(doc.createTextNode(getLegalXml(text))); - parent.appendChild(child); - } - - private static void addAttribute(Document doc, Element node, - String name, String value) { - Attr attribute = doc.createAttribute(name); - attribute.setValue(getLegalXml(value)); - node.getAttributes().setNamedItem(attribute); - } - - /* - * Ensure string is legal xml. - * @param text String to verify. - * @return Passed <code>text</code> or a new string with illegal - * characters removed if any found in <code>text</code>. - * @see http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char - */ - protected static String getLegalXml(final String text) { - if (text == null) { - return null; - } - StringBuffer buffer = null; - for (int i = 0; i < text.length(); i++) { - char c = text.charAt(i); - if (!isLegalXml(c)) { - if (buffer == null) { - // Start up a buffer. Copy characters here from now on - // now we've found at least one bad character in original. 
- buffer = new StringBuffer(text.length()); - buffer.append(text.substring(0, i)); - } - } else { - if (buffer != null) { - buffer.append(c); - } - } - } - return (buffer != null)? buffer.toString(): text; - } - - private static boolean isLegalXml(final char c) { - return c == 0x9 || c == 0xa || c == 0xd || (c >= 0x20 && c <= 0xd7ff) - || (c >= 0xe000 && c <= 0xfffd) || (c >= 0x10000 && c <= 0x10ffff); - } - -} This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
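The deleted servlet's getLegalXml/isLegalXml helpers encode the XML 1.0 Char production: any character outside those ranges is stripped before a text node is created. A standalone variant of that logic follows; note it iterates code points rather than chars (an assumption on my part, so supplementary-plane characters survive intact, a case the original's char-based loop could not actually reach):

```java
// Sketch: drop characters that are illegal in XML 1.0 text content.
public class LegalXmlSketch {
    static boolean isLegalXml(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || (c >= 0x10000 && c <= 0x10FFFF);
    }

    static String getLegalXml(String text) {
        if (text == null) return null;
        StringBuilder sb = new StringBuilder(text.length());
        // Filter code point by code point, keeping only legal XML chars.
        text.codePoints().filter(LegalXmlSketch::isLegalXml)
                         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0008 (backspace) is illegal in XML 1.0 and gets removed.
        System.out.println(getLegalXml("ok\u0008ay")); // -> okay
    }
}
```

This kind of scrub matters for crawled content: summaries extracted from arbitrary web pages can contain control characters that would make the generated RSS unparseable.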
From: <bi...@us...> - 2009-06-04 18:02:56

Revision: 2731 http://archive-access.svn.sourceforge.net/archive-access/?rev=2731&view=rev Author: binzino Date: 2009-06-04 18:02:50 +0000 (Thu, 04 Jun 2009) Log Message: ----------- Nutch 1.0 fixed their tika-mimetypes.xml, so we no longer need this patched/fixed version. Removed Paths: ------------- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/tika-mimetypes.xml Deleted: trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/tika-mimetypes.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/tika-mimetypes.xml 2009-05-20 02:55:09 UTC (rev 2730) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/tika-mimetypes.xml 2009-06-04 18:02:50 UTC (rev 2731) @@ -1,364 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!-- - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to You under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. - - Description: This xml file defines the valid mime types used by Tika. - The mime types within this file are based on the types in the mime-types.xml - file available in Apache Nutch. 
---> - -<mime-info> - - <mime-type type="text/plain"> - <magic priority="50"> - <match value="This is TeX," type="string" offset="0" /> - <match value="This is METAFONT," type="string" offset="0" /> - </magic> - <glob pattern="*.txt" /> - <glob pattern="*.asc" /> - </mime-type> - - <mime-type type="text/html"> - <magic priority="50"> - <match value="<!DOCTYPE HTML" type="string" - offset="0:64" /> - <match value="<!doctype html" type="string" - offset="0:64" /> - <match value="<HEAD" type="string" offset="0:64" /> - <match value="<head" type="string" offset="0:64" /> - <match value="<TITLE" type="string" offset="0:64" /> - <match value="<title" type="string" offset="0:64" /> - <match value="<html" type="string" offset="0:64" /> - <match value="<HTML" type="string" offset="0:64" /> - <match value="<BODY" type="string" offset="0" /> - <match value="<body" type="string" offset="0" /> - <match value="<TITLE" type="string" offset="0" /> - <match value="<title" type="string" offset="0" /> - <match value="<!--" type="string" offset="0" /> - <match value="<h1" type="string" offset="0" /> - <match value="<H1" type="string" offset="0" /> - <match value="<!doctype HTML" type="string" offset="0" /> - <match value="<!DOCTYPE html" type="string" offset="0" /> - </magic> - <glob pattern="*.html" /> - <glob pattern="*.htm" /> - </mime-type> - - <mime-type type="application/xml"> - <alias type="text/xml" /> - <glob pattern="*.xml" /> - </mime-type> - - <mime-type type="application/xhtml+xml"> - <sub-class-of type="text/xml" /> - <glob pattern="*.xhtml" /> - <root-XML namespaceURI='http://www.w3.org/1999/xhtml' - localName='html' /> - </mime-type> - - <mime-type type="application/vnd.ms-powerpoint"> - <glob pattern="*.ppz" /> - <glob pattern="*.ppt" /> - <glob pattern="*.pps" /> - <glob pattern="*.pot" /> - <magic priority="50"> - <match value="0xcfd0e011" type="little32" offset="0" /> - </magic> - </mime-type> - - <mime-type type="application/vnd.ms-excel"> - <magic priority="50"> 
- <match value="Microsoft Excel 5.0 Worksheet" type="string" - offset="2080" /> - </magic> - <glob pattern="*.xls" /> - <glob pattern="*.xlc" /> - <glob pattern="*.xll" /> - <glob pattern="*.xlm" /> - <glob pattern="*.xlw" /> - <glob pattern="*.xla" /> - <glob pattern="*.xlt" /> - <glob pattern="*.xld" /> - <alias type="application/msexcel" /> - </mime-type> - - <mime-type type="application/vnd.oasis.opendocument.text"> - <glob pattern="*.odt" /> - </mime-type> - - - <mime-type type="application/zip"> - <alias type="application/x-zip-compressed" /> - <magic priority="40"> - <match value="PK\003\004" type="string" offset="0" /> - </magic> - <glob pattern="*.zip" /> - </mime-type> - - <mime-type type="application/vnd.oasis.opendocument.text"> - <glob pattern="*.oth" /> - </mime-type> - - <mime-type type="application/msword"> - <magic priority="50"> - <match value="\x31\xbe\x00\x00" type="string" offset="0" /> - <match value="PO^Q`" type="string" offset="0" /> - <match value="\376\067\0\043" type="string" offset="0" /> - <match value="\333\245-\0\0\0" type="string" offset="0" /> - <match value="Microsoft Word 6.0 Document" type="string" - offset="2080" /> - <match value="Microsoft Word document data" type="string" - offset="2112" /> - </magic> - <glob pattern="*.doc" /> - <alias type="application/vnd.ms-word" /> - </mime-type> - - <mime-type type="application/octet-stream"> - <magic priority="50"> - <match value="\037\036" type="string" offset="0" /> - <match value="017437" type="host16" offset="0" /> - <match value="0x1fff" type="host16" offset="0" /> - <match value="\377\037" type="string" offset="0" /> - <match value="0145405" type="host16" offset="0" /> - </magic> - <glob pattern="*.bin" /> - </mime-type> - - <mime-type type="application/pdf"> - <magic priority="50"> - <match value="%PDF-" type="string" offset="0" /> - </magic> - <glob pattern="*.pdf" /> - <alias type="application/x-pdf" /> - </mime-type> - - <mime-type type="application/atom+xml"> - <root-XML 
localName="feed" - namespaceURI="http://purl.org/atom/ns#" /> - </mime-type> - - <mime-type type="application/mac-binhex40"> - <glob pattern="*.hqx" /> - </mime-type> - - <mime-type type="application/mac-compactpro"> - <glob pattern="*.cpt" /> - </mime-type> - - <mime-type type="application/rtf"> - <glob pattern="*.rtf"/> - <alias type="text/rtf" /> - </mime-type> - - <mime-type type="application/rss+xml"> - <alias type="text/rss" /> - <root-XML localName="rss" /> - <root-XML namespaceURI="http://purl.org/rss/1.0/" /> - <glob pattern="*.rss" /> - </mime-type> - - <!-- added in by mattmann --> - <mime-type type="application/x-mif"> - <alias type="application/vnd.mif" /> - </mime-type> - - <mime-type type="application/vnd.wap.wbxml"> - <glob pattern="*.wbxml" /> - </mime-type> - - <mime-type type="application/vnd.wap.wmlc"> - <_comment>Compiled WML Document</_comment> - <glob pattern="*.wmlc" /> - </mime-type> - - <mime-type type="application/vnd.wap.wmlscriptc"> - <_comment>Compiled WML Script</_comment> - <glob pattern="*.wmlsc" /> - </mime-type> - - <mime-type type="text/vnd.wap.wmlscript"> - <_comment>WML Script</_comment> - <glob pattern="*.wmls" /> - </mime-type> - - <mime-type type="application/x-bzip"> - <alias type="application/x-bzip2" /> - </mime-type> - - <mime-type type="application/x-bzip-compressed-tar"> - <glob pattern="*.tbz" /> - <glob pattern="*.tbz2" /> - </mime-type> - - <mime-type type="application/x-cdlink"> - <_comment>Virtual CD-ROM CD Image File</_comment> - <glob pattern="*.vcd" /> - </mime-type> - - <mime-type type="application/x-director"> - <_comment>Shockwave Movie</_comment> - <glob pattern="*.dcr" /> - <glob pattern="*.dir" /> - <glob pattern="*.dxr" /> - </mime-type> - - <mime-type type="application/x-futuresplash"> - <_comment>Macromedia FutureSplash File</_comment> - <glob pattern="*.spl" /> - </mime-type> - - <mime-type type="application/x-java"> - <alias type="application/java" /> - </mime-type> - - <mime-type 
type="application/x-koan"> - <_comment>SSEYO Koan File</_comment> - <glob pattern="*.skp" /> - <glob pattern="*.skd" /> - <glob pattern="*.skt" /> - <glob pattern="*.skm" /> - </mime-type> - - <mime-type type="application/x-latex"> - <_comment>LaTeX Source Document</_comment> - <glob pattern="*.latex" /> - </mime-type> - - <!-- JC CHANGED - <mime-type type="application/x-mif"> - <_comment>FrameMaker MIF document</_comment> - <glob pattern="*.mif"/> - </mime-type> --> - - <mime-type type="application/ogg"> - <alias type="application/x-ogg" /> - </mime-type> - - <mime-type type="application/x-rar"> - <alias type="application/x-rar-compressed" /> - </mime-type> - - <mime-type type="application/x-shellscript"> - <alias type="application/x-sh" /> - </mime-type> - - <mime-type type="application/xhtml+xml"> - <glob pattern="*.xht" /> - </mime-type> - - <mime-type type="audio/midi"> - <glob pattern="*.kar" /> - </mime-type> - - <mime-type type="audio/x-pn-realaudio"> - <alias type="audio/x-realaudio" /> - </mime-type> - - <mime-type type="image/tiff"> - <magic priority="50"> - <match value="0x4d4d2a00" type="string" offset="0" /> - <match value="0x49492a00" type="string" offset="0" /> - </magic> - </mime-type> - - <mime-type type="message/rfc822"> - <magic priority="50"> - <match type="string" value="Relay-Version:" offset="0" /> - <match type="string" value="#! rnews" offset="0" /> - <match type="string" value="N#! 
rnews" offset="0" /> - <match type="string" value="Forward to" offset="0" /> - <match type="string" value="Pipe to" offset="0" /> - <match type="string" value="Return-Path:" offset="0" /> - <match type="string" value="From:" offset="0" /> - <match type="string" value="Message-ID:" offset="0" /> - <match type="string" value="Date:" offset="0" /> - </magic> - </mime-type> - - <mime-type type="application/x-javascript"> - <glob pattern="*.js" /> - </mime-type> - - - <mime-type type="image/vnd.wap.wbmp"> - <_comment>Wireless Bitmap File Format</_comment> - <glob pattern="*.wbmp" /> - </mime-type> - - <mime-type type="image/x-psd"> - <alias type="image/photoshop" /> - </mime-type> - - <mime-type type="image/x-xcf"> - <alias type="image/xcf" /> - <magic priority="50"> - <match type="string" value="gimp xcf " offset="0" /> - </magic> - </mime-type> - - <mime-type type="application/x-shockwave-flash"> - <glob pattern="*.swf"/> - <magic priority="50"> - <match type="string" value="FWS" offset="0"/> - <match type="string" value="CWS" offset="0"/> - </magic> - </mime-type> - - <mime-type type="model/iges"> - <_comment> - Initial Graphics Exchange Specification Format - </_comment> - <glob pattern="*.igs" /> - <glob pattern="*.iges" /> - </mime-type> - - <mime-type type="model/mesh"> - <glob pattern="*.msh" /> - <glob pattern="*.mesh" /> - <glob pattern="*.silo" /> - </mime-type> - - <mime-type type="model/vrml"> - <glob pattern="*.vrml" /> - </mime-type> - - <mime-type type="text/x-tcl"> - <alias type="application/x-tcl" /> - </mime-type> - - <mime-type type="text/x-tex"> - <alias type="application/x-tex" /> - </mime-type> - - <mime-type type="text/x-texinfo"> - <alias type="application/x-texinfo" /> - </mime-type> - - <mime-type type="text/x-troff-me"> - <alias type="application/x-troff-me" /> - </mime-type> - - <mime-type type="video/vnd.mpegurl"> - <glob pattern="*.mxu" /> - </mime-type> - - <mime-type type="x-conference/x-cooltalk"> - <_comment>Cooltalk Audio</_comment> - 
<glob pattern="*.ice" /> - </mime-type> - -</mime-info> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
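The <magic> and <glob> rules in the deleted tika-mimetypes.xml drive content-type detection: byte patterns at fixed offsets are tried first, with file-extension globs as a fallback. A rough sketch of that matching order (the rule set below is a tiny illustrative subset hard-coded for the sketch, not Tika's actual API or priority handling):

```java
import java.nio.charset.StandardCharsets;

// Sketch: magic-byte sniffing first, extension globs as fallback.
public class MimeSniffSketch {
    // True if data carries exactly the magic bytes at the given offset.
    static boolean matchesAt(byte[] data, byte[] magic, int offset) {
        if (data.length < offset + magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[offset + i] != magic[i]) return false;
        }
        return true;
    }

    static String sniff(String name, byte[] data) {
        // <match value="%PDF-" type="string" offset="0"/>
        if (matchesAt(data, "%PDF-".getBytes(StandardCharsets.US_ASCII), 0))
            return "application/pdf";
        // <match value="PK\003\004" type="string" offset="0"/>
        if (matchesAt(data, new byte[]{'P', 'K', 3, 4}, 0))
            return "application/zip";
        // <glob pattern="*.txt"/> -- extension fallback
        if (name.endsWith(".txt"))
            return "text/plain";
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff("a.bin", pdf));       // -> application/pdf
        byte[] text = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff("notes.txt", text));  // -> text/plain
    }
}
```

Magic bytes win over the file name on purpose: in a web archive the URL's "extension" is frequently wrong or absent, which is why the XML above attaches a priority to each <magic> block.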