archive-access-cvs Mailing List for Web Archive Access Utilities (Page 57)

Brought to you by: binzino, bradtofel, gojomo, ia_igor, and 5 others

archive-access-cvs — CVS commits

[Archive-access-cvs] SF.net SVN: archive-access: [2405] tags/nutchwax-0_12/

From: <bi...@us...> - 2008-07-03 22:01:39

Revision: 2405
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2405&view=rev
Author:   binzino
Date:     2008-07-03 15:01:43 -0700 (Thu, 03 Jul 2008)

Log Message:
-----------
Create NutchWAX 0.12 release tag.

Added Paths:
-----------
    tags/nutchwax-0_12/

Copied: tags/nutchwax-0_12 (from rev 2404, trunk/archive-access/projects/nutchwax)


This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.

[Archive-access-cvs] SF.net SVN: archive-access: [2404] trunk/archive-access/projects/nutchwax/ archive/INSTALL.txt

From: <bi...@us...> - 2008-07-03 21:39:48

Revision: 2404
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2404&view=rev
Author:   binzino
Date:     2008-07-03 14:39:57 -0700 (Thu, 03 Jul 2008)

Log Message:
-----------
Updated with current SVN revision for Nutch that we build against.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/INSTALL.txt

Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt	2008-07-03 20:37:17 UTC (rev 2403)
+++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt	2008-07-03 21:39:57 UTC (rev 2404)
@@ -46,11 +46,11 @@
 Nutch SVN trunk.  The specific SVN revision that NutchWAX 0.12 is
 built against is:
 
-  673464
+  673823
 
 To checkout this revision of Nutch, use:
 
- $ svn checkout -r 673464 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
+ $ svn checkout -r 673823 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
  $ cd nutch
 
 


[Archive-access-cvs] SF.net SVN: archive-access: [2403] trunk/archive-access/projects/nutchwax/ archive/RELEASE-NOTES.txt

From: <bi...@us...> - 2008-07-03 20:37:13

Revision: 2403
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2403&view=rev
Author:   binzino
Date:     2008-07-03 13:37:17 -0700 (Thu, 03 Jul 2008)

Log Message:
-----------
Initial revision.

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt

Added: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt	                        (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt	2008-07-03 20:37:17 UTC (rev 2403)
@@ -0,0 +1,62 @@
+
+RELEASE-NOTES.TXT
+2008-07-03
+Aaron Binns
+
+Release notes for NutchWAX 0.12
+
+For the most recent updates and information on NutchWAX,
+please visit the project wiki at:
+
+  http://webteam.archive.org/confluence/display/search/NutchWAX
+
+
+======================================================================
+Overview
+======================================================================
+
+NutchWAX 0.12-beta-1 was released on June 2, 2008.  We anticipated
+releasing another beta mid-June with bug fixes and some minor
+enhancements based on feedback from the community.
+
+During internal testing by the Internet Archive Web Team, a few
+serious problems were found, the most critical being the failure to
+store different copies of the same URL when importing large batches of
+archive files.
+
+The NutchWAX team canceled the mid-month release in order to focus on
+fixing this problem.
+
+The good news is that not only has that problem been fixed, but the
+solution is part of a broader enhancement to manage the de-duplication
+of archive content during import and indexing.
+
+For more details on de-duplication in NutchWAX, please see
+
+  HOWTO-dedup.txt
+  README-dedup.txt
+
+
+======================================================================
+Issues
+======================================================================
+
+For an up-to-date list of NutchWAX issues:
+
+  http://webteam.archive.org/jira/browse/WAX
+
+Issues resolved in this release:
+
+WAX-9  Entire file not imported
+WAX-8  Investigate why so many PDFs fail to parse
+
+  Fixing the first one caused nearly all of the PDF parsing errors to
+  disappear.
+
+WAX-7  Change config so that URL filters are not applied during link inversion
+
+  This is easily achieved by using command-line options when invoking
+  the Nutch "invertlinks" command.
+
+WAX-3  Observe content size limit on importing
+WAX-2  Date queries cause TooManyClauses exceptions


[Archive-access-cvs] SF.net SVN: archive-access: [2402] trunk/archive-access/projects/nutchwax/ archive/HOWTO-dedup.txt

From: <bi...@us...> - 2008-07-03 18:53:14

Revision: 2402
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2402&view=rev
Author:   binzino
Date:     2008-07-03 11:53:12 -0700 (Thu, 03 Jul 2008)

Log Message:
-----------
Added comments regarding WARCs.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt	2008-07-03 18:29:09 UTC (rev 2401)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt	2008-07-03 18:53:12 UTC (rev 2402)
@@ -75,7 +75,7 @@
 
 
 ======================================================================
-Generate DUP
+Generate DUP/Revisits
 ======================================================================
 
 Now that we have 'all.cdx' containing a sorted list of all the records
@@ -98,6 +98,25 @@
 
 This file is then used as an exclusion filter for importing.
 
+
+WARC
+----
+If we are using WARC files with revisit records instead of ARC files,
+then we don't generate a list of duplicate records because there
+shouldn't be any.
+
+However, the revisit records in the WARC files do have the dates when
+a URL was revisited and seen to have not changed -- which is more or
+less the same thing as our "dup" lines above.
+
+For extracting these revisits from WARC CDX files, we use the
+'revisits' utility provided by NutchWAX
+
+  $ revisits all-warc.cdx > all-warc.dup
+
+The output of 'revisits' is in the same format as 'dedup-cdx'.
+
+
 ======================================================================
 Import
 ======================================================================
@@ -121,7 +140,13 @@
 If you examine the Nutch "hadoop.log" file, you will see INFO-level
 lines from the NutchWAX Importer showing which URLs were excluded.
 
+WARC
+----
+If you are importing WARC files with revisit records, then you
+typically won't need to provide an exclusion file as the WARC files
+were de-duplicated during the crawl.
 
+
 ======================================================================
 Update and Invert 
 ======================================================================
@@ -224,6 +249,15 @@
 the previous "dates" index with the new one.
 
 
+WARC
+----
+This step is the same for ARCs and WARCs.
+
+The only difference is that our "all.dup" file containing the list of
+revisit dates was created by different utilities: 'dedup-cdx' for ARCs
+and 'revisits' for WARCs.
+
+
 ======================================================================
 Search
 ======================================================================


[Archive-access-cvs] SF.net SVN: archive-access: [2401] trunk/archive-access/projects/nutchwax/ archive/HOWTO-dedup.txt

From: <bi...@us...> - 2008-07-03 18:29:05

Revision: 2401
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2401&view=rev
Author:   binzino
Date:     2008-07-03 11:29:09 -0700 (Thu, 03 Jul 2008)

Log Message:
-----------
Initial revision.

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt

Added: trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt	                        (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt	2008-07-03 18:29:09 UTC (rev 2401)
@@ -0,0 +1,289 @@
+
+HOWTO-dedup.txt
+2008-07-03
+Aaron Binns
+
+Table of Contents
+ o Prerequisites
+   - NutchWAX HOWTO.txt
+   - Wayback 1.2.1
+ o Overview
+ o Generate CDX
+ o Generate DUP
+ o Import
+ o Update and Invert
+ o Index
+ o Add Revisit Dates
+ o Search
+ o Web Deployment
+
+
+======================================================================
+Prerequisites
+======================================================================
+
+This de-duplication HOWTO assumes you've already read the main HOWTO
+and are familiar with importing and indexing archive files with
+NutchWAX.
+
+For de-duplication, the Wayback Machine tools are required.  This guide
+assumes you have Wayback 1.2.1 installed in
+
+  /opt/wayback-1.2.1
+
+
+======================================================================
+Overview
+======================================================================
+
+The README-dedup.txt explains the de-duplication process in greater
+detail, including implementation details.
+
+NutchWAX does not automagically detect and eliminate duplicate records
+when importing and indexing.  However, tools are provided to help the
+user implement a system to perform de-duplication.
+
+This guide describes one such system using the tools provided by
+NutchWAX and Wayback.
+
+
+======================================================================
+Generate CDX
+======================================================================
+
+The first step is to generate a list of duplicate records for a set of
+ARC files.
+
+This step is not necessary if your archive files are in WARC format
+and de-duplication was performed during the crawl.
+
+To generate the list of duplicates, we use the Wayback 'arc-indexer'
+with the NutchWAX 'dedup-cdx' utility.  The CDX files *must* be
+sorted.
+
+  $ arc-indexer foo.arc.gz | sort > foo.cdx
+  $ arc-indexer bar.arc.gz | sort > bar.cdx
+  $ arc-indexer baz.arc.gz | sort > baz.cdx
+
+Then we combine the CDX files into one sorted CDX containing all the
+records:
+
+  $ sort -m foo.cdx bar.cdx baz.cdx > all.cdx
+
+The "-m" option speeds up the sort by merging the already-sorted
+files.
+
+
+======================================================================
+Generate DUP
+======================================================================
+
+Now that we have 'all.cdx' containing a sorted list of all the records
+in the ARC files, we can generate a list of duplicates therein:
+
+  $ dedup-cdx all.cdx > all.dup
+
+This "all.dup" file contains lines of the form
+
+   example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034
+   example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800
+   example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312
+   example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132204
+   example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132213
+   example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080619132911
+
+Where each line is
+
+   URL digest date
+
+This file is then used as an exclusion filter for importing.
+
+======================================================================
+Import
+======================================================================
+
+The import process is essentially the same as in NutchWAX, but now
+we use "all.dup" as our exclusion list.
+
+First, we create a manifest
+
+  $ cat > manifest
+  foo.arc.gz test-collection
+  bar.arc.gz test-collection
+  baz.arc.gz test-collection
+  ^D
+
+  $ nutchwax import -e all.dup manifest
+
+The result will be a newly-created Nutch segment, same as importing
+without de-duplication.
+
+If you examine the Nutch "hadoop.log" file, you will see INFO-level
+lines from the NutchWAX Importer showing which URLs were excluded.
+
+
+======================================================================
+Update and Invert 
+======================================================================
+
+Perform the Nutch "updatedb" and "invertlinks" steps as normal.
+
+Nothing special/different to do here with respect to de-duplication.
+
+
+======================================================================
+Index
+======================================================================
+
+The only change we make to the indexing step is the destination of the
+index directory.
+
+By default, Nutch expects the per-segment index directory to live in a
+sub-directory called 'indexes' and the index command is accordingly
+
+  $ nutch index indexes crawldb linkdb segments/*
+
+Resulting in an index directory structure of the form
+
+    indexes/part-00000
+
+For de-duplication, we use a slightly different directory structure,
+which will be used by a de-duplication-aware NutchWaxBean at
+search-time.  The directory structure we use is:
+
+    pindexes/<segment>/part-00000
+
+Using the segment name is not strictly required, but it is a good
+practice and is strongly recommended.  This way the segment and its
+corresponding index directory are easily matched.
+
+Let's assume that the segment directory created during the import is
+named
+
+  segments/20080703050349
+
+In that case, our index command becomes:
+
+  $ nutch index pindexes/20080703050349 crawldb linkdb segments/20080703050349
+
+Upon completion, the Lucene index is created in
+
+  pindexes/20080703050349/part-00000
+
+This index is exactly the same as one normally created by Nutch; the
+only difference is the location.
+
+
+======================================================================
+Add Revisit Dates
+======================================================================
+
+Now that we have the Nutch index, we add the revisit dates to it.
+
+Examine the "all.dup" file again; it has lines of the form
+
+   example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034
+   example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800
+   example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312
+   example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132204
+   example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132213
+   example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080619132911
+
+These are the revisit dates that need to be added to the records in
+the Lucene index.  When we generated the index, only the date of the
+first visit was put in the index.  Now we have to add these.
+
+As explained in README-dedup.txt, modifying the Lucene index to
+actually add these dates is infeasible.  What we do is create a
+parallel index next to the main index (the part-00000 created above)
+that contains all the dates for each record.
+
+The NutchWAX 'add-dates' command creates this parallel index for us.
+
+  $ nutchwax add-dates pindexes/20080703050349/part-00000 \
+                       pindexes/20080703050349/part-00000 \
+                       pindexes/20080703050349/dates \
+                       all.dup
+
+Yes, the part-00000 argument does appear twice.  This is because it is
+both the "key" index and the "source" index.
+
+
+Suppose we did another crawl and had even more dates to add to the
+existing index.  In that case we would run
+
+  $ nutchwax add-dates pindexes/20080703050349/part-00000 \
+                       pindexes/20080703050349/dates \
+                       pindexes/20080703050349/new-dates \
+                       new-crawl.dup
+  $ rm -r pindexes/20080703050349/dates
+  $ mv pindexes/20080703050349/new-dates pindexes/20080703050349/dates
+
+This copies the existing dates from "dates" to "new-dates" and adds
+additional ones from "new-crawl.dup" along the way.  Then we replace
+the previous "dates" index with the new one.
+
+
+======================================================================
+Search
+======================================================================
+
+Test/debug searches can be run from the command-line, but instead of
+using the 'NutchBean' we use 'NutchWaxBean'.
+
+The "NutchWaxBean" extends NutchBean by adding support for parallel
+indexes.
+
+  $ nutch org.archive.nutchwax.NutchWaxBean <query>
+
+The "NutchWaxBean" also gives slightly more verbose and useful output:
+
+  $ nutch org.archive.nutchwax.NutchWaxBean carolina
+  Total hits: 247338
+   0 [20080702053119] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [20080618133218, 20080618133218]
+   ... Studios Blue Ridge Motion Pictures Carolina Pinnacle Creative Network EUE/Screen ... Trailblazer Studios Federal Tax Incentive Carolina Pinnacle Studios  ... 
+   1 [20080703023605] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [20080613200046, 20080618133218]
+
+The output consists of 
+
+  hit number
+  segment
+  url
+  key (which is url + digest)
+  digest
+  dates
+
+The most useful bit here for testing de-duplication is the list of
+dates.
+
+
+======================================================================
+Web Deployment
+======================================================================
+
+As noted in the HOWTO.txt document, when the nutch(wax) webapp is
+deployed, changes made to the configuration must also be applied to
+the deployed webapp.
+
+In addition to those configuration changes, the "web.xml" file must
+also be modified.
+
+In Nutch, the "web.xml" file contains a directive to call a static
+method on 'NutchBean' to initialize it.  In order to search the
+parallel indexes we have to use 'NutchWaxBean'.  This is done by
+modifying the "web.xml" to call a NutchWaxBean initializer after the
+NutchBean initializer.
+
+Change "web.xml" from
+
+  <listener>
+    <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
+  </listener>
+
+to:
+
+  <listener>
+    <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
+    <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class>
+  </listener>
+
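
The "Add Revisit Dates" step in the HOWTO above attaches every capture
date to its record, using the "URL digest date" lines of "all.dup".  As
a rough illustration only (this is not how 'add-dates' is implemented;
the function name and the in-memory dictionary are assumptions made for
the example), the grouping of dates per URL+digest key could be
sketched in Python as:

```python
from collections import defaultdict


def group_revisit_dates(lines):
    """Map (url, digest) -> sorted list of unique dates.

    Each input line is expected to be 'URL digest date', as in all.dup.
    Hypothetical helper for illustration; not part of NutchWAX.
    """
    dates = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        url, digest, date = parts
        dates[(url, digest)].add(date)
    # 14-digit timestamps sort chronologically when sorted as strings.
    return {key: sorted(values) for key, values in dates.items()}


sample = """\
example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034
example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800
example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312
""".splitlines()

for (url, digest), seen in group_revisit_dates(sample).items():
    print(url, digest, seen)
```

The key is the URL plus the digest, matching the "key (which is url +
digest)" column printed by NutchWaxBean.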


[Archive-access-cvs] SF.net SVN: archive-access: [2400] trunk/archive-access/projects/nutchwax/ archive/HOWTO.txt

From: <bi...@us...> - 2008-07-03 18:28:48

Revision: 2400
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2400&view=rev
Author:   binzino
Date:     2008-07-03 11:28:55 -0700 (Thu, 03 Jul 2008)

Log Message:
-----------
Added info on new configuration properties.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2008-07-03 02:03:41 UTC (rev 2399)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2008-07-03 18:28:55 UTC (rev 2400)
@@ -120,12 +120,13 @@
 
 to
 
-  protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic
+  protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax
 
 In short, we add:
 
   index-nutchwax
   query-nutchwax
+  urlfilter-nutchwax
   parse-pdf
 
 and remove:
@@ -136,19 +137,37 @@
 The only *required* changes are the additions of the NutchWAX index
 and query plugins.  The rest are optional, but recommended.
 
-The addition of the "parse-pdf" plugin is simply because we have lots
-of PDFs in our archives and we want to index them.  We sometimes
-remove the "parse-js" plugin if we don't care to index JavaScript
-files.
+The "parse-pdf" plugin is added simply because we have lots of PDFs in
+our archives and we want to index them.  We sometimes remove the
+"parse-js" plugin if we don't care to index JavaScript files.
 
-We also remove the URL filtering and normalizing plugins because we do
-not need the URLs normalized nor filtered.  We trust that the tool
-that produced the ARC/WARC file will have normalized the URLs
-contained therein according to its own rules so there's no need to
-normalize here.  Also, we don't filter by URL since we want to index
-as much of the ARC/WARC file as we have parsers for.
+We also remove the default Nutch URL filtering and normalizing plugins
+because we do not need the URLs normalized nor filtered.  We trust
+that the tool that produced the ARC/WARC file will have normalized the
+URLs contained therein according to its own rules so there's no need
+to normalize here.  Also, we don't filter by URL since we want to
+index as much of the ARC/WARC file as we have parsers for.
 
+We do, however, add the NutchWAX URL filter.  If de-duplication is
+being performed upon import, this plugin is required.  It performs URL
+filtering of the list of ARC records to exclude based on
+URL+digest+date.
+
 --------------------------------------------------
+indexingfilter.order
+--------------------------------------------------
+
+Add this property with a value of
+
+    org.apache.nutch.indexer.basic.BasicIndexingFilter
+    org.archive.nutchwax.index.ConfigurableIndexingFilter
+
+So that the NutchWAX indexing filter is run after the Nutch basic
+indexing filter.
+
+A full explanation is given in "README-dedup.txt".
+
+--------------------------------------------------
 mime.type.magic
 --------------------------------------------------
 We disable mimetype detection in Nutch for two reasons:
@@ -172,12 +191,12 @@
 nutchwax.filter.index
 --------------------------------------------------
 Configure the 'index-nutchwax' plugin.  Specify how the metadata
-fields added by the ArcsToSegment are mapped to the Lucene documents
-during indexing.
+fields added by the Importer are mapped to the Lucene documents during
+indexing.
 
 The specifications here are of the form:
 
-  src-key:lowercase:store:tokenize:dest-key
+  src-key:lowercase:store:tokenize:exclusive:dest-key
 
 where the only required part is the "src-key", the rest will assume
 the following defaults:
@@ -185,6 +204,7 @@
   lowercase = true
   store     = true
   tokenize  = false
+  exclusive = true
   dest-key  = src-key
 
 We recommend:
@@ -192,6 +212,9 @@
 <property>
   <name>nutchwax.filter.index</name>
   <value>
+    url:false:true:true
+    orig:false
+    digest:false
     arcname:false
     collection
     date
@@ -199,39 +222,50 @@
   </value>
 </property>
 
+The "url", "orig" and "digest" values are required, the rest are
+optional, but strongly recommended.
+
 --------------------------------------------------
 nutchwax.filter.query
 --------------------------------------------------
 Configure the 'query-nutchwax' plugin.  Specify which fields to make
-searchable via "[field]:[term|phrase]" query syntax, and whether they
+searchable via "field:[term|phrase]" query syntax, and whether they
 are "raw" fields or not.
 
-The specification format is 
+The specification format is one of:
 
-  raw:name:lowercase:boost 
-or
-  field:name:boost
+  field:<name>:<boost>
+  raw:<name>:<lowercase>:<boost>
+  group:<name>:<lowercase>:<delimiter>:<boost>
 
 Default values are
 
   lowercase = true
+  delimiter = ","
   boost     = 1.0f
 
 There is no "lowercase" property for "field" specification because the
 Nutch FieldQueryFilter doesn't expose the option, unlike the
 RawFieldQueryFilter.
 
-NTOE: We do *not* use this filter for handling "date" queries, there is a
-specific filter for that: DateQueryFilter
+The "group" fields are raw fields that can accept multiple values,
+separated by a delimiter.  Multiple values appearing in a query are
+automagically translated into required OR-groups, such as
 
+  collection:"193,221,36" => +(collection:193 collection:221 collection:36)
+
+NOTE: We do *not* use this filter for handling "date" queries, there
+is a specific filter for that: DateQueryFilter
+
 We recommend:
 
 <property>
   <name>nutchwax.filter.query</name>
   <value>
+    raw:digest:false
     raw:arcname:false
-    raw:collection
-    raw:type
+    group:collection
+    group:type
     field:anchor
     field:content
     field:host
@@ -240,6 +274,52 @@
 </property>
 
 
+--------------------------------------------------
+nutchwax.urlfilter.wayback.exclusions
+--------------------------------------------------
+File containing the exclusion list for importing.
+
+Normally, this is specified on the command line when the NutchWAX
+Importer is invoked.  It can be specified here if preferred.
+
+--------------------------------------------------
+nutchwax.urlfilter.wayback.canonicalizer
+--------------------------------------------------
+
+For CDX-based de-duplication, the same URL canonicalization algorithm
+must be used here as was used to generate the CDX files.
+
+The default canonicalizer in Wayback's '(w)arc-indexer' utility
+is 
+
+  org.archive.wayback.util.url.AggressiveUrlCanonicalizer
+
+which is the value provided in "nutch-site.xml".
+
+If the '(w)arc-indexer' is executed with the "-i" (identity)
+command-line option, then the matching canonicalizer
+
+  org.archive.wayback.util.url.IdentityUrlCanonicalizer
+
+must be specified here.
+
+--------------------------------------------------
+nutchwax.import.content.limit
+--------------------------------------------------
+Similar to Nutch's
+
+  file.content.limit
+  http.content.limit
+  ftp.content.limit
+
+properties, this specifies a limit on the size of a document imported
+via NutchWAX.
+
+We recommend setting this to a size compatible with the memory
+capacity of the computers performing the import.  Something in the
+1-4MB range is typical.
+
+
 ======================================================================
 Create a manifest
 ======================================================================
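
The "group" query fields described in the diff above turn a delimited
value into a required OR-group.  A minimal Python sketch of that
rewrite follows; it only mirrors the documented example and default
values, it is not the plugin's Java code, and the function name and the
handling of empty terms are assumptions.

```python
def expand_group(field, value, lowercase=True, delimiter=","):
    """Rewrite field:"a,b,c" into a required OR-group string.

    Hypothetical helper: defaults follow the documented defaults
    (lowercase = true, delimiter = ","); empty terms are dropped,
    which is an assumption of this sketch.
    """
    if lowercase:
        value = value.lower()
    terms = [term for term in value.split(delimiter) if term]
    if not terms:
        return ""
    clauses = " ".join(f"{field}:{term}" for term in terms)
    return f"+({clauses})"


print(expand_group("collection", "193,221,36"))
# prints: +(collection:193 collection:221 collection:36)
```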

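The 'nutchwax.filter.index' specification format shown in the diff
above, "src-key:lowercase:store:tokenize:exclusive:dest-key", leaves
every part but "src-key" optional.  The following Python sketch shows
how such a spec could be expanded with the documented defaults.  The
class and function names are hypothetical, and treating an empty part
as "use the default" is an assumption of this sketch rather than
documented behavior.

```python
from typing import NamedTuple


class IndexSpec(NamedTuple):
    src_key: str
    lowercase: bool
    store: bool
    tokenize: bool
    exclusive: bool
    dest_key: str


def parse_index_spec(spec):
    """Expand 'src-key[:lowercase[:store[:tokenize[:exclusive[:dest-key]]]]]'."""
    parts = spec.strip().split(":")
    if not parts[0]:
        raise ValueError("src-key is required")

    def flag(position, default):
        # A missing or empty part falls back to the documented default.
        if position < len(parts) and parts[position]:
            return parts[position].lower() == "true"
        return default

    dest_key = parts[5] if len(parts) > 5 and parts[5] else parts[0]
    return IndexSpec(
        src_key=parts[0],
        lowercase=flag(1, True),
        store=flag(2, True),
        tokenize=flag(3, False),
        exclusive=flag(4, True),
        dest_key=dest_key,
    )


for spec in ("url:false:true:true", "arcname:false", "collection"):
    print(parse_index_spec(spec))
```

Applied to the recommended values, "url:false:true:true" yields a
stored, tokenized, non-lowercased "url" field, while a bare
"collection" takes every default.
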

[Archive-access-cvs] SF.net SVN: archive-access: [2399] trunk/archive-access/projects/nutchwax/ archive/README.txt

From: <bi...@us...> - 2008-07-03 02:03:32

Revision: 2399
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2399&view=rev
Author:   binzino
Date:     2008-07-02 19:03:41 -0700 (Wed, 02 Jul 2008)

Log Message:
-----------
Updated with changes in RC-1.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/README.txt

Modified: trunk/archive-access/projects/nutchwax/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README.txt	2008-07-03 02:03:09 UTC (rev 2398)
+++ trunk/archive-access/projects/nutchwax/archive/README.txt	2008-07-03 02:03:41 UTC (rev 2399)
@@ -1,6 +1,6 @@
 
 README.txt
-2008-05-20
+2008-07-02
 Aaron Binns
 
 Welcome to NutchWAX 0.12!
@@ -22,13 +22,13 @@
 The goal of NutchWAX is to enable full-text indexing and searching of
 documents stored in web archive file formats (ARC and WARC).
 
-The way we achieve that goal is by providing add-on tools and plugins
+The way we achieve that goal is by providing plugins and add-on tools
 to Nutch to read documents directly from ARC/WARC files.  We call this
 process "importing" archive files.
 
-Importing produces a Nutch segment, the same as if Nutch had actually
-crawled the documents itself.  In this scenario, document importing
-replaces the conventional "generate/fetch/update" cycle of Nutch.
+Importing produces a Nutch segment, similar to Nutch crawling the
+documents itself.  In this scenario, document importing replaces the
+conventional "generate/fetch/update" cycle of Nutch.
 
 Once the archival documents have been imported into a segment, the
 regular Nutch commands to update the 'crawldb', invert the links and
@@ -36,12 +36,12 @@
 
 ======================================================================
 
-The NutchWAX add-ons consist of:
+The main NutchWAX add-ons are:
 
  bin/nutchwax
 
-   A shell script that is used to run the NutchWAX command-line tools,
-   such as document importing.
+   A shell script that is used to run the NutchWAX commands, such as
+   document importing.
 
    This is patterned after the 'bin/nutch' shell script.
 
@@ -55,6 +55,16 @@
    Query plugin which allows for querying against the metadata fields
    added by 'index-nutchwax'.
 
+ plugins/urlfilter-nutchwax
+
+   Filtering plugin which can be used to exclude URLs from import.  It
+   can be used as part of a NutchWAX de-duplication scheme.
+
+ conf/nutch-site.xml
+
+   Sample configuration properties file showing suggested settings for
+   Nutch and NutchWAX.
+
 There is no separate 'lib/nutchwax.jar' file for NutchWAX.  NutchWAX
 is distributed in source code form and is intended to be built in
 conjunction with Nutch.
@@ -84,7 +94,7 @@
 already familiar with the inner workings of Nutch.  Still, special
 attention on one class is worth while:
 
-  src/java/org/archive/nutchwax/ArcsToSegment.java
+  src/java/org/archive/nutchwax/Importer.java
 
 This is where ARC/WARC files are read and their documents are imported
 into a Nutch segment.
@@ -113,10 +123,14 @@
   o We add metadata fields to the document, which are then available
     to the "index-nutchwax" plugin at indexing-time.
 
-    ArcsToSegment.importRecord()
+    Importer.importRecord()
       ...
       contentMetadata.set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype()          );
       contentMetadata.set( NutchWax.ARCNAME_KEY,      meta.getArcFile().getName() );
       contentMetadata.set( NutchWax.COLLECTION_KEY,   collectionName              );
       contentMetadata.set( NutchWax.DATE_KEY,         meta.getDate()              );
       ...
+
+
+======================================================================
+



[Archive-access-cvs] SF.net SVN: archive-access: [2398] trunk/archive-access/projects/nutchwax/ archive/bin/revisits

From: <bi...@us...> - 2008-07-03 02:03:00

Revision: 2398
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2398&view=rev
Author:   binzino
Date:     2008-07-02 19:03:09 -0700 (Wed, 02 Jul 2008)

Log Message:
-----------
Changed sort to sort -u to only emit uniq revisits.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/bin/revisits

Modified: trunk/archive-access/projects/nutchwax/archive/bin/revisits
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/bin/revisits	2008-07-03 02:02:38 UTC (rev 2397)
+++ trunk/archive-access/projects/nutchwax/archive/bin/revisits	2008-07-03 02:03:09 UTC (rev 2398)
@@ -9,4 +9,4 @@
     exit 1;
 fi
 
-cat $@ | awk '{ if ( $9 == "-" ) print $1 " sha1:" $6 " " $2 }' | sort 
+cat $@ | awk '{ if ( $9 == "-" ) print $1 " sha1:" $6 " " $2 }' | sort -u
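
As a rough illustration of what this one-line pipeline emits, here is a sketch run on made-up CDX lines.  Only the awk expression and 'sort -u' come from the script; the function names and the sample data are hypothetical.  Fields 1, 2, 6 and 9 are assumed to hold the URL key, date, digest and file name ("-" for revisits), as the awk expression implies.

```shell
# Hypothetical helper mirroring the pipeline in bin/revisits; it reads
# CDX lines on stdin instead of from the files named on the command line.
revisits_sketch() {
  awk '{ if ( $9 == "-" ) print $1 " sha1:" $6 " " $2 }' | sort -u
}

# Made-up CDX lines with 9 whitespace-separated fields.  The first two
# are identical revisit lines; the third is an ordinary (non-revisit) line.
sample_cdx() {
  cat <<'EOF'
example.org/index.html 20071002000000 f3 f4 f5 ABC123 f7 f8 -
example.org/index.html 20071002000000 f3 f4 f5 ABC123 f7 f8 -
example.org/index.html 20071001000000 f3 f4 f5 ABC123 f7 f8 foo.warc.gz
example.org/index.html 20071005000000 f3 f4 f5 ABC123 f7 f8 -
EOF
}

sample_cdx | revisits_sketch
```

With plain 'sort' the repeated revisit line would be emitted twice; with 'sort -u' it appears once.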



[Archive-access-cvs] SF.net SVN: archive-access: [2397] trunk/archive-access/projects/nutchwax/ archive/INSTALL.txt

From: <bi...@us...> - 2008-07-03 02:02:28

Revision: 2397
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2397&view=rev
Author:   binzino
Date:     2008-07-02 19:02:38 -0700 (Wed, 02 Jul 2008)

Log Message:
-----------
Updated with latest Nutch SVN revision NW 0.12 built against.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/INSTALL.txt

Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt	2008-07-03 02:01:46 UTC (rev 2396)
+++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt	2008-07-03 02:02:38 UTC (rev 2397)
@@ -1,6 +1,6 @@
 
 INSTALL.txt
-2008-06-02
+2008-07-02
 Aaron Binns
 
 This installation guide assumes the reader is already familiar with
@@ -46,11 +46,11 @@
 Nutch SVN trunk.  The specific SVN revision that NutchWAX 0.12 is
 built against is:
 
-  650739
+  673464
 
 To checkout this revision of Nutch, use:
 
- $ svn checkout -r 650739 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
+ $ svn checkout -r 673464 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
  $ cd nutch
 
 



[Archive-access-cvs] SF.net SVN: archive-access: [2396] trunk/archive-access/projects/nutchwax/ archive/README-dedup.txt

From: <bi...@us...> - 2008-07-03 02:01:39

Revision: 2396
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2396&view=rev
Author:   binzino
Date:     2008-07-02 19:01:46 -0700 (Wed, 02 Jul 2008)

Log Message:
-----------
Initial revision.  Very rough draft.

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/README-dedup.txt

Added: trunk/archive-access/projects/nutchwax/archive/README-dedup.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README-dedup.txt	                        (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/README-dedup.txt	2008-07-03 02:01:46 UTC (rev 2396)
@@ -0,0 +1,697 @@
+
+README-dedup.txt
+2008-07-02
+Aaron Binns
+
+De-duplication and NutchWAX
+
+This document assumes that the reader is familiar with the topic of
+de-duplication with regards to archiving web data.  That said, let us
+review what we mean by de-duplication in NutchWAX.
+
+When archive files (ARC/WARC) are written, the tool used to create
+them may or may not prevent multiple copies of the same URL from being
+written.  Some archive file creation tools perform duplicate
+prevention, but many do not.
+
+What NutchWAX has to contend with is the scenario where one or more
+archive files that are imported and indexed have multiple copies of a
+URL.
+
+Ideally, NutchWAX would only import and index one unique version of
+the URL.  If the same version of the URL was seen a second, third,
+fourth, etc. time, then NutchWAX would simply update the existing
+record in its search index by adding the subsequent crawl dates to it.
+This way, if a URL was crawled 10 times and didn't change, there would
+only be one entry in the search index for it, but with 10 crawl dates
+associated with it.
+
+======================================================================
+
+This sounds simple enough, but in practice the implementation is not
+as straightforward as suggested by the above.
+
+For one, Nutch's underlying Lucene search indexes are not easily
+modified "in place".  That is, updating an existing record by adding
+an additional date to it is not easily accomplished via the Lucene
+public APIs.  The Lucene documentation informs us that records are not
+modified in place, but rather are deleted and re-added with the
+modified/new information.
+
+Doing a complete delete+re-add on a large Lucene database containing
+possibly millions of records is a computationally expensive process.
+Furthermore, since many fields in Nutch's Lucene indexes are not
+stored, it is infeasible to delete and re-add them w/o data loss.
+
+Fortunately, using parallel Lucene indexes and the ParallelIndexReader
+can help solve the problem.  More on that later.
+
+======================================================================
+
+Another challenge in handling duplicates is defining what makes for a
+unique version of a URL.
+
+Most tools, Nutch included, use the URL as a unique identifier for a
+page.  Since most tools don't care about old versions of pages,
+retaining only the latest version and using the URL to identify it is
+sufficient.
+
+However, for archive data, we need to use more than just the URL to
+identify a page; we need something that has the URL but also some
+notion of the *version* of the page.
+
+For example, consider a page like
+
+  http://www.cnn.com/index.html
+
+This page changes frequently.  If it were crawled 10 times, once per
+week, each crawl could capture a different version of the page.  We
+would have 10 different, unique versions.
+
+Now, if NutchWAX used *only* the URL as the unique identifier for the
+page, there would be no way to distinguish the first one from the
+second, from the third, etc.
+
+NutchWAX needs a unique identifier that has the URL and also some
+notion of the *version* of the page.  For that we use a digest of the
+page's content.  The digest is used as a version number of sorts.
+Each version will have a different digest.  So, if we need to find a
+specific version of the page, we can use the URL combined with the
+digest to uniquely identify it.
+
+Currently we use SHA-1 for digesting the content.
+
+Using URL+digest rather than just the URL as a unique page identifier
+is conceptually simple, but does have some repercussions within Nutch.
+Nutch assumes that the URL alone is a unique identifier and that
+assumption is coded into the software in various ways.  To use the
+URL+digest instead, we had to work around some of those hard-coded
+assumptions in various ways.  More on that later.
+
+======================================================================
+
+The next challenge is to know if a version of a URL (the URL+digest
+described above) has already been imported and indexed so that we
+don't import it again.
+
+To prevent the importing of multiple copies of the same version of a
+page, we could get the URL+digest of the page to be imported, then
+look in the existing Nutch index to see if we already have it.  If we
+do, we do not import it; instead we add the crawl date to the existing
+record in the search index.
+
+Now, the above describes two challenges:
+
+ 1. Searching the existing index to see if there is an existing record
+    to be updated.
+
+ 2. Updating an existing record.  This was discussed above and we do
+    have a solution, which we'll describe in more detail later.
+
+The first doesn't seem challenging and in theory it isn't.
+However, in practice it is difficult because for a large deployment,
+we usually have many Lucene indexes spread over many machines.  It's
+not as simple as opening up a single Lucene index on the local machine
+and searching for a matching URL+digest.  In one of the deployments at
+the Internet Archive, we have 100s of Lucene indexes spread over 5
+machines.
+
+Now, we could use the Nutch web search rather than accessing the
+Lucene indexes directly.  That is, to find out if we have already
+indexed a URL+digest, we could send an HTTP request to the Nutch
+search server asking if the URL+digest is already in the index or not.
+
+Although this is a workable solution, performing a search for each and
+every URL being imported would likely put too much strain on the
+search server and would slow down the importing process.  When some
+import & index jobs process 100s of millions of documents and take
+weeks to run, adding a 5-second HTTP request to each URL import is a
+significant cost.
+
+What would be ideal is a centralized database of all the URLs
+processed by NutchWAX.  Ideally, this centralized database would also
+be used by the archiver (e.g. Heritrix) to perform de-duplication
+during a crawl; and also by the Wayback for storing historical
+metadata.
+
+The de-duplication strategy described in this document utilizes the
+Wayback tools and CDX files as the central URL database for performing
+NutchWAX de-duplication.
+
+======================================================================
+
+Review
+------
+
+Our de-duplication strategy for NutchWAX as described so far has three
+key elements:
+
+ o Use URL+digest as a unique identifier for a unique version of a page.
+ o Use ParallelIndexReader to provide index record modification/update.
+ o Use Wayback and CDX files as a central database of URL processing state.
+
+======================================================================
+
+Using CDX files to detect duplicate pages in a set of archive files is
+fortunately rather straightforward.
+
+CDX files are text files with one line for each and every page
+(record) in an archive file.  These CDX lines have three bits of data
+we can use for detecting duplicate pages:
+
+ o URL
+ o digest
+ o date
+
+NutchWAX provides a 'dedup-cdx' script that reads a CDX file and
+produces a "duplicates" file containing the URL, digest and date of
+each duplicate copy of a unique version of a URL in the CDX file.
+
+For example, suppose we have a collection of 100 ARC files.  In those
+ARC files, the page
+
+  http://www.example.org/index.html
+
+appears 10 times, but only 5 of those are different, the other 5 are
+duplicate copies.  Suppose we have
+
+  Date         Digest    Content sample
+  2007-10-01   abc123    Hello, welcome to my page.
+  2007-10-02   abc123    Hello, welcome to my page.
+  2007-10-03   def456    Sorry I haven't updated this in a while.
+  2007-10-04   def456    Sorry I haven't updated this in a while.
+  2007-10-05   abc123    Hello, welcome to my page.
+  2007-10-06   abc123    Hello, welcome to my page.
+  2007-10-07   ghi789    Hey, I finally updated this.
+  2007-10-08   jkl012    Under construction.
+  2007-10-09   jkl012    Under construction.
+  2007-10-10   mno345    My homepage is great!
+
+Notice how we started with the "abc123" version, changed to the
+"def456" version then reverted back to the "abc123" version.  In this
+simple example, we have a webmaster who just can't make up his mind
+on what to say.
+
+The point is that our CDX file will have lines of the form
+
+  20071001 abc123 example.org/index.html
+  20071002 abc123 example.org/index.html
+  20071003 def456 example.org/index.html
+  20071004 def456 example.org/index.html
+  20071005 abc123 example.org/index.html
+  20071006 abc123 example.org/index.html
+  20071007 ghi789 example.org/index.html
+  20071008 jkl012 example.org/index.html
+  20071009 jkl012 example.org/index.html
+  20071010 mno345 example.org/index.html
+
+It's easy to find the duplicate lines in the CDX file.
+
+The NutchWAX 'dedup-cdx' script will extract the duplicates, writing out all
+the duplicate lines, except for the first.  For the above, the output is
+
+  20071002 abc123 example.org/index.html
+  20071004 def456 example.org/index.html
+  20071005 abc123 example.org/index.html
+  20071006 abc123 example.org/index.html
+  20071009 jkl012 example.org/index.html
+
+Only the 2nd, 3rd, etc. instances of a URL+digest line are printed.
+The first instance of "abc123" is not printed, nor is "ghi789" since it
+has no duplicates.
+
+Now what do we do with these?
+
+When importing archive files with NutchWAX, we pass it this list of
+duplicates, which it uses as an exclusion list.  Any URL+digest+date
+on the list is excluded from import, all others pass through.
+
+Looking at our CDX sample again
+
+  Date     Digest URL                    Import?
+  20071001 abc123 example.org/index.html   Y
+  20071002 abc123 example.org/index.html   N
+  20071003 def456 example.org/index.html   Y
+  20071004 def456 example.org/index.html   N
+  20071005 abc123 example.org/index.html   N
+  20071006 abc123 example.org/index.html   N
+  20071007 ghi789 example.org/index.html   Y
+  20071008 jkl012 example.org/index.html   Y
+  20071009 jkl012 example.org/index.html   N
+  20071010 mno345 example.org/index.html   Y
+
+Excellent, we've just prevented duplicate copies of the same version
+of a page from being imported!
+
+======================================================================
+
+But what about the fact that we crawled the page on 5 dates and it
+didn't change?  We want to record that somewhere, right?
+
+Yes.
+
+NutchWAX provides an "add-dates" command (in the 'nutchwax' script)
+for adding dates to an existing index by creating a parallel index for
+it.
+
+Using our "add-dates" command, we can add those crawl dates to the
+index so that each unique version of the page will have all the crawl
+dates associated with it.  For our above example, this results in:
+
+  Date      Digest URL
+  20071001, abc123 example.org/index.html
+  20071002,
+  20071005,
+  20071006
+
+  20071003, def456 example.org/index.html
+  20071004
+
+  20071007  ghi789 example.org/index.html
+
+  20071008, jkl012 example.org/index.html
+  20071009
+
+  20071010  mno345 example.org/index.html
+
+Voila!
+
+======================================================================
+
+Recap
+-----
+
+By using CDX files and the NutchWAX tools we are able to de-duplicate
+during import.  
+
+For example, for a list of arcs
+
+  $ wayback/bin/arc-indexer foo.arc.gz > foo.cdx
+  $ nutchwax/bin/dedup-cdx foo.cdx > foo.dup
+  $ echo "foo.arc.gz" > manifest
+  $ nutchwax/bin/nutchwax import -e foo.dup manifest
+  $ nutchwax/bin/nutch updatedb crawldb -dir segments
+  $ nutchwax/bin/nutch invertlinks linkdb -dir segments
+  $ nutchwax/bin/nutch index indexes crawldb linkdb segments/*
+  $ nutchwax/bin/nutchwax add-dates indexes/part-00000 indexes/part-00000 indexes/dates foo.dup
+
+The important steps are the creation of the "foo.dup" file
+containing the duplicate records, the use of that file to exclude
+duplicates during import, and the use of that same file for adding the
+crawl dates to the index.
+
+======================================================================
+
+Parallel Indexes
+
+Since updating an existing Lucene index is not feasible, we "virtually
+update" an index by using a modified version of the Lucene
+ParallelIndexReader.
+
+The basic idea is to take the metadata field you want to update and
+put it in a parallel index.  In DB table-speak, this would be moving a
+column to a separate table and using the record index/position as the
+foreign key to join the two tables.
+
+The NutchWAX 'add-dates' command does this for the date metadata
+field.  It will take an existing index and create a parallel index,
+adding dates listed in an external file.
+
+The command-line syntax is of the form:
+
+  nutchwax add-dates <key index> <source indices>... <dest index> <dates>
+
+Suppose we have an index created by the Nutch "index" command and we also have
+a list of crawl dates we want to add to it.  The index is in a sub-directory
+"indexes/part-00000" and the dates are in a file "dates.txt"
+
+  $ nutchwax add-dates indexes/part-00000 indexes/part-00000 indexes/dates dates.txt
+
+In this case our key index and source index are the same, since we
+want to preserve any dates in the original index and add the new dates
+to them.  But let's suppose we've already done this once, but then have even more
+dates to add, in a file "dates2.txt"
+
+  $ nutchwax add-dates indexes/part-00000 indexes/dates indexes/dates2 dates2.txt
+  $ rm -r indexes/dates
+  $ mv indexes/dates2 indexes/dates
+
+In this case, we copy the values from the existing "dates" index,
+adding the new dates to them.  Afterwards, we replace the old "dates"
+index with the new, fully up-to-date one.
+
+----------------------------------------------------------------------
+
+Using Parallel Index
+
+This is all well and good, but how do we make Nutch(WAX) use these
+parallel indices?
+
+NutchWAX provides a NutchWaxBean, which extends NutchBean by adding
+support for parallel indices.  The NutchWaxBean follows the NutchBean
+conventions by looking for a directory containing the indices in a
+directory named "crawl" or as specified in the "searcher.dir"
+configuration property.
+
+However, rather than looking for indices in "index" and "indexes",
+NutchWaxBean looks in "pindexes".  If that directory is found, it
+iterates through all sub-directories and expects each to contain a set
+of parallel indices within it.  A sample directory structure might
+look like:
+
+  crawl/pindexes/foo
+                    dates
+                    main
+                 bar
+                    dates
+                    main
+                 baz
+                    dates
+                    main
+
+where "dates" and "main" are parallel indexes.
+
+----------------------------------------------------------------------
+
+This is all fine and good when calling the NutchWaxBean from
+the command-line, but what about in a webapp?
+
+The NutchBean has a static method for self-initialization upon receipt
+of an application startup message from the servlet container.  We have
+a similar hook in NutchWaxBean, which is run after the NutchBean is
+initialized.
+
+The NutchWaxBean hook must be added to the Nutch web.xml file:
+
+  <listener>
+    <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
+    <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class>
+  </listener>
+
+If you don't do this, then the NutchBean won't use the
+ParallelIndexReader and your parallel indices won't be used.
+
+======================================================================
+
+WARC + revisit records
+
+The WARC format supports revisit records.  Revisit records are
+typically written by WARC writing tools (such as Heritrix) when a URL
+is visited a second, third, etc time and the content hasn't changed.
+
+Taking our example from above, whenever the page is crawled and hasn't
+changed, a revisit record would be written to the WARC file.
+
+For de-duplication, WARC files with revisit records are nice because
+the crawler is doing the duplicate detection for us.  Rather than write
+a duplicate copy of the page, it writes a record that has
+
+  URL
+  digest
+  date
+
+of the visit.  Now, if you look at the output of 'dedup-cdx' you'll
+notice the similarity.
+
+In fact, WARC revisit records can be used to create a list of
+additional crawl dates without having to actually perform the full CDX
+de-duplication (which can be computationally expensive).
+
+A CDX file generated from a WARC will have the 9th field set to "-"
+for revisit records.  We can use this to easily find those lines and
+generate a list of crawl dates for a URL+digest.
+
+NutchWAX comes with a script called 'revisits' that does precisely
+that.  It takes CDX files as input, finds the lines for the revisit
+records, then emits them in a form that can be used by the 'add-dates'
+command.
+
+For example
+
+  $ wayback/bin/warc-indexer foo.warc.gz > foo.cdx
+  $ nutchwax/bin/revisits foo.cdx > foo.dup
+  $ nutchwax/bin/nutchwax add-dates indexes/part-00000 indexes/part-00000 indexes/dates foo.dup
+
+Since the WARC files are known not to contain duplicates, we don't
+have to de-dup them in order to provide the importing process with an
+exclusion list.  However, we still use the 'revisits' script to
+generate a list of crawl dates for the revisit records so we can add
+them to the parallel index.
+
+======================================================================
+
+Doesn't NutchWAX (0.10) already handle duplicates?
+
+All this business about URL vs. URL+digest as a unique identifier for
+a version of a page may seem surprising to some.  Many users of
+NutchWAX have been importing and indexing ARC files and haven't seen a
+situation where a newer version of a URL over-writes an older one.
+
+That is true, in certain circumstances different versions of a page
+will peacefully co-exist in a Nutch deployment.
+
+* The key is in the grouping of ARC files for importing. *
+
+When I said that by default, only one version of a URL can live in a
+Nutch index I was being a bit general.  Actually, only one version of
+a URL can live in a Nutch *segment*.
+
+When a batch of ARC files are imported, a new segment is created.  If
+you are lucky, then ARC files containing duplicates will be imported
+in different batches and the different versions of the same URL will
+each live in a separate segment.
+
+Consider the most extreme case, where a NutchWAX user imports ARC
+files one-at-a-time.  The result would be a Nutch segment for each ARC
+file.  This would be nice because if there were 5 different versions
+of a URL in 5 different ARC files; then there will be 5 segments, each
+containing one of the 5 versions of the URL.  No conflicts among
+versions.
+
+However, using a one-segment-per-ARC plan is not practical since most
+NutchWAX users have 1000s, 10000s, 100000s or more ARC files.  Having
+100000 segment directories on disk is simply not practical.
+
+Most NutchWAX users import ARC files in groups that either correspond
+to distinct crawls, or groups that are sized according to memory
+and/or CPU limits.
+
+We can't rely on good fortune to provide us with ARC file batches that
+don't have multiple versions of a URL.
+
+----------------------------------------------------------------------
+
+The worst-case scenario is if all the ARC files for a single
+collection are imported in one batch.  In this case, they would all go
+into a single Nutch segment and only 1 version of each URL would be
+imported and indexed.  All other versions would be discarded.
+
+----------------------------------------------------------------------
+
+If Nutch does this "automatic deduplication" by URL within each
+segment, why does it have a "dedup" command?
+
+That command is designed to operate on a set of segment indexes.  The
+segments are deduped internally automatically, the "dedup" command
+removes duplicates across segments.
+
+----------------------------------------------------------------------
+
+The fact that a later version of a URL replaced an earlier one is not
+always easy to notice just by performing searches against the
+resulting index.  One would have to know the contents of the pages
+such that a query would be able to find one specific version -- or not
+if it wasn't there.
+
+And especially with large collections, if a version of a page is
+missing from the search index, it could easily go unnoticed for quite
+some time.
+
+
+One way to test an existing index is to use CDX files in conjunction
+with the NutchWAX 'dumpindex' command.  
+
+  o Generate a list of duplicate records from all CDX for the entire
+    collection.
+
+  o Using the Wayback, identify a URL that has many different
+    versions.  Choose a URL that will be indexed for full-text search,
+    such as a HTML, text or PDF document; not an image.
+
+  o Dump the entire Lucene index with NutchWAX 'dumpindex' and 
+    find all the records for the URL.
+
+Chances are some of the versions of the URL will be in the index but
+not all.
+
+
+======================================================================
+
+Is this all necessary?
+
+No.  If you don't want to de-duplicate ARC files during import and
+indexing you don't have to.
+
+You can continue to perform the import, update, invert and index steps
+like before and just live with the consequences of not de-duplicating.
+
+If you don't de-duplicate, you will just have redundant records in
+your search index.  This means that you'll have a search result hit
+for each copy of the page in the index.  If you imported the same page
+10 times, then a search query that finds that page will find all 10
+copies and return 10 identical search results -- one for each copy.
+
+
+In addition, the de-duplication feature and the add-dates feature with
+the parallel index are also independent of each other.  You can
+de-duplicate but decide to not use parallel indices to add dates to
+the records in the Lucene index.
+
+In this case, you would only have 1 date associated with each record:
+the date the record was imported.  Any information about subsequent
+revisits to the same version of the page would not be in the search
+index.
+
+
+Also, if you have a system of your own devising that keeps track of
+duplicates in archive files; have it output the duplicates files in
+the same form as the 'dedup-cdx' script.  The import command doesn't
+care where the exclusion list comes from, just that it has the correct
+format.
+
+
+======================================================================
+
+Implementation notes on URL+digest vs URL
+
+Although the use of 'dedup-cdx' and associated tools for de-duping and
+managing revisit dates are entirely optional and have no impact on
+Nutch(WAX) if not used, one area of change in NutchWAX that does
+impact Nutch is changing the unique 'key' for a document from URL to
+URL+digest.
+
+Without this change, you cannot have different versions of the same
+URL in a Nutch segment.  Such a limitation is simply incompatible with
+NutchWAX and archive files.  This change is not optional.
+
+The core of the change from URL to URL+digest happens in the NutchWAX
+Importer class.  In that class the segment is created and all the
+document-related information is added to it.  When a document is added
+to a segment, it is written to a Hadoop MapFile.
+
+Hadoop MapFiles act like Java Maps.  They are essentially key/value
+pairs.  In Nutch, the key is the URL and the value is a collection of
+information for that URL.
+
+In the Importer.java source code, where we add the information to the
+segment, we use 
+
+  <URL> <digest>
+
+as the key, such as
+
+  "http://www.example.org/index.html sha1:HJG5ZWG3MQQKHIN43BXJY3FUWP7WTU43"
+
+instead of simply
+
+  "http://www.example.org/index.html"
+
+We also stuff the URL into the document in a metadata field titled
+"url", which we use later in our indexing filter plugin.
+
+This is simple enough in the Importer code; it does, however, have a
+few consequences elsewhere in Nutch.  The places where it affects
+Nutch are where Nutch assumes
+
+   URL == key
+
+There are two places in particular where this assumption causes a
+problem because the URL is no longer the key.
+
+1. BasicIndexingFilter (index-basic plugin)
+
+In the call from Indexer.java to BasicIndexingFilter.java, the key is 
+treated as the URL:
+
+   Indexer.java:
+
+   249:  doc = this.filters.filter(doc, parse, key, fetchDatum, inlinks);
+
+   BasicIndexingFilter.java:
+
+   55 public Document filter(Document doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
+   56 throws IndexingException {
+   57
+   58  Text reprUrl = (Text) datum.getMetaData().get(Nutch.WRITABLE_REPR_URL_KEY);
+   59  String reprUrlString = reprUrl != null ? reprUrl.toString() : null;
+   60  String urlString = url.toString();
+    
+The Indexer passes the key, the BasicIndexingFilter treats it as the
+URL.
+
+Not only that, but the BasicIndexingFilter goes on to insert that
+urlString into the Lucene document in the "url" field.
+
+We work around this by configuring our NutchWAX indexing filter plugin
+to run *after* the BasicIndexingFilter and over-write the "url" field
+with the correct URL.
+
+We do this by setting the Nutch configuration property (in
+nutch-site.xml for example) with
+
+  <property>
+    <name>indexingfilter.order</name>
+    <value>
+      org.apache.nutch.indexer.basic.BasicIndexingFilter
+      org.archive.nutchwax.index.ConfigurableIndexingFilter
+     </value>
+  </property>
+
+Without this property, the indexing filters are run in an arbitrary
+order.  We need our ConfigurableIndexingFilter to run after the
+BasicIndexingFilter.
+
+The configuration for the ConfigurableIndexingFilter specifies that
+the "url" field will be filled with the value from the "url" metadata
+field (which we set in Importer.java remember) and over-write any
+previous value.
+
+
+2. FetchedSegments
+
+This class has a lovely little routine called "getUrl" which is used
+*not* to get the URL per se, rather it gets the URL from a Lucene
+document /in order to use it as a document key/.
+
+Let's take a look:
+
+  private Text getUrl(HitDetails details) {
+    String url = details.getValue("orig");
+    if (StringUtils.isBlank(url)) {
+      url = details.getValue("url");
+    }
+    return new Text(url);
+  }
+
+The problem is that we've stored the true URL in the "url" field, so
+the value returned is the true URL.  Now when the code that calls this
+method tries to use it as the key, it can't find the document since
+the key is "URL digest".
+
+Since this method is private and this code is rather deep inside of
+Nutch, over-riding it with a subclass isn't feasible.
+
+But, if you notice, getUrl does come with a little oddity where it
+first consults "orig" before "url".  We don't use "orig" for anything,
+so in our Importer, we set the "orig" metadata field to be the key.
+
+This way, when getUrl calls 
+
+    String url = details.getValue("orig");
+
+the key is found and everything is happy.
+
+Yes, it's a hack.  No, I'm not ashamed.
+
+======================================================================
+
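
The duplicate extraction that README-dedup.txt describes for 'dedup-cdx' can be sketched in a few lines of awk.  This is not the actual 'dedup-cdx' script, whose source is not part of this commit; it is a minimal sketch that assumes the simplified three-column "date digest URL" lines used in the README's example (real CDX lines have more fields).  The function names are hypothetical.

```shell
# Hypothetical sketch, not the real dedup-cdx: print every line whose
# digest+URL pair has already been seen, i.e. the 2nd, 3rd, ... copies.
dedup_sketch() {
  awk '{ key = $2 " " $3; if ( seen[key]++ ) print }'
}

# The simplified sample lines from the README (date digest URL).
readme_sample() {
  cat <<'EOF'
20071001 abc123 example.org/index.html
20071002 abc123 example.org/index.html
20071003 def456 example.org/index.html
20071004 def456 example.org/index.html
20071005 abc123 example.org/index.html
20071006 abc123 example.org/index.html
20071007 ghi789 example.org/index.html
20071008 jkl012 example.org/index.html
20071009 jkl012 example.org/index.html
20071010 mno345 example.org/index.html
EOF
}

readme_sample | dedup_sketch
```

On the README's sample this prints the same five lines listed there as the duplicates: the first capture of each version is kept for import and every later copy lands on the exclusion list.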

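The end state that the README describes for "add-dates" (each unique version of a page carrying all of its crawl dates) can be pictured with a small grouping step.  This sketch only illustrates the resulting grouping on the README's simplified "date digest URL" lines; it does not show how "add-dates" builds a parallel Lucene index, and the function names are hypothetical.

```shell
# Hypothetical sketch: collect all crawl dates per digest+URL pair.
# Output columns are "digest URL dates" so that the result sorts stably.
group_dates_sketch() {
  awk '{ key = $2 " " $3
         if ( key in dates ) { dates[key] = dates[key] "," $1 } else { dates[key] = $1 } }
       END { for ( k in dates ) print k " " dates[k] }' | LC_ALL=C sort
}

# The simplified sample lines from the README (date digest URL).
readme_lines() {
  cat <<'EOF'
20071001 abc123 example.org/index.html
20071002 abc123 example.org/index.html
20071003 def456 example.org/index.html
20071004 def456 example.org/index.html
20071005 abc123 example.org/index.html
20071006 abc123 example.org/index.html
20071007 ghi789 example.org/index.html
20071008 jkl012 example.org/index.html
20071009 jkl012 example.org/index.html
20071010 mno345 example.org/index.html
EOF
}

readme_lines | group_dates_sketch
```

The five output lines correspond to the five unique versions in the README's final table, with four dates for "abc123", two each for "def456" and "jkl012", and one each for the rest.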

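The README's "DB table-speak" description of parallel indexes, where a column is moved to a separate table and the record position serves as the join key, can be mimicked with two text files.  This is an analogy only: the files stand in for the "main" and "dates" indexes, nothing here touches Lucene, and the function name and file names are hypothetical.

```shell
# Hypothetical illustration only: two text files stand in for the "main"
# and "dates" parallel indexes; line N of one pairs with line N of the other.
parallel_join_sketch() {
  dir=$(mktemp -d) || return 1
  printf '%s\n' 'abc123 example.org/index.html' \
                'def456 example.org/index.html' > "$dir/main"
  printf '%s\n' '20071001,20071002,20071005,20071006' \
                '20071003,20071004' > "$dir/dates"
  # paste joins the two files line by line, i.e. by record position.
  paste -d ' ' "$dir/main" "$dir/dates"
  rm -r "$dir"
}

parallel_join_sketch
```

Replacing the "dates" file with an updated one changes what each record is joined with while the "main" file stays untouched, which mirrors why the README swaps in a new "dates" index instead of rewriting the main one.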

[Archive-access-cvs] SF.net SVN: archive-access: [2395] trunk/archive-access/projects/wayback/ wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/ WindowEndFilter.java

From: <bra...@us...> - 2008-07-02 01:13:16

Revision: 2395
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2395&view=rev
Author:   bradtofel
Date:     2008-07-01 18:13:24 -0700 (Tue, 01 Jul 2008)

Log Message:
-----------
FEATURE: added numSeen()

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/WindowEndFilter.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/WindowEndFilter.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/WindowEndFilter.java	2008-07-02 01:02:08 UTC (rev 2394)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/WindowEndFilter.java	2008-07-02 01:13:24 UTC (rev 2395)
@@ -48,6 +48,9 @@
 	public int getNumReturned() {
 		return numReturned;
 	}
+	public int getNumSeen() {
+		return numSeen;
+	}
 	/* (non-Javadoc)
 	 * @see org.archive.wayback.util.ObjectFilter#filterObject(java.lang.Object)
 	 */



[Archive-access-cvs] SF.net SVN: archive-access: [2394] trunk/archive-access/projects/wayback/ wayback-core/src/main/java/org/archive/wayback/resourceindex/ LocalResourceIndex.java

From: <bra...@us...> - 2008-07-02 01:02:11

Revision: 2394
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2394&view=rev
Author:   bradtofel
Date:     2008-07-01 18:02:08 -0700 (Tue, 01 Jul 2008)

Log Message:
-----------
BUGFIX(unreported): was not setting number of results requested

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java	2008-07-02 00:35:41 UTC (rev 2393)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java	2008-07-02 01:02:08 UTC (rev 2394)
@@ -478,7 +478,7 @@
 		}
 		public void annotateResults(SearchResults results) {
 			results.setFirstReturned(startResult);
-			results.setReturnedCount(resultsPerPage);
+			results.setNumRequested(resultsPerPage);
 
 			// how many went by the filters:
 			results.setMatchingCount(startFilter.getNumSeen());



[Archive-access-cvs] SF.net SVN: archive-access: [2393] trunk/archive-access/projects/wayback/ wayback-webapp/src/main/webapp/query

From: <bra...@us...> - 2008-07-02 00:35:32

Revision: 2393
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2393&view=rev
Author:   bradtofel
Date:     2008-07-01 17:35:41 -0700 (Tue, 01 Jul 2008)

Log Message:
-----------
BUGFIX(unreported) SearchResult count methods now return long values not int values.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/HTMLCaptureResults.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/HTMLUrlResults.jsp

Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/HTMLCaptureResults.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/HTMLCaptureResults.jsp	2008-07-02 00:33:10 UTC (rev 2392)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/HTMLCaptureResults.jsp	2008-07-02 00:35:41 UTC (rev 2393)
@@ -16,7 +16,7 @@
 
 String searchString = results.getSearchUrl();
 
-  int resultCount = results.getResultsReturned();
+long resultCount = results.getResultsReturned();
 
   Timestamp searchStartTs = results.getStartTimestamp();
   Timestamp searchEndTs = results.getEndTimestamp();

Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/HTMLUrlResults.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/HTMLUrlResults.jsp	2008-07-02 00:33:10 UTC (rev 2392)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/HTMLUrlResults.jsp	2008-07-02 00:35:41 UTC (rev 2393)
@@ -25,11 +25,11 @@
 //  new PathQuerySearchResultPartitioner(results.getResults(),
 //      results.getURIConverter());
 
-int firstResult = results.getFirstResult();
-int lastResult = results.getLastResult();
-int resultCount = results.getResultsMatching();
+long firstResult = results.getFirstResult();
+long lastResult = results.getLastResult();
+long resultCount = results.getResultsMatching();
 
-int totalCaptures = results.getResultsMatching();
+long totalCaptures = results.getResultsMatching();
 
 %>
 <%= fmt.format("PathPrefixQuery.showingResults",firstResult,lastResult,

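Why the JSP locals had to change type along with the getters can be shown with a small sketch. Java does not narrow long to int implicitly, so a local declared as int no longer compiles once the getter returns long. The class, the getter, and the English message pattern below are hypothetical; only the idea of long-valued counts passed to a {n,number,integer} pattern comes from the commit.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class LongCountSketch {

    // Hypothetical getter standing in for a count method that now returns long.
    static long getResultsMatching() {
        return 3_000_000_000L; // does not fit in an int
    }

    // Hypothetical English pattern; {n,number,integer} accepts Long arguments.
    static String summary(long first, long last, long total, String url) {
        MessageFormat fmt = new MessageFormat(
            "Showing {0,number,integer} to {1,number,integer} of {2,number,integer} results for {3}",
            Locale.ROOT);
        return fmt.format(new Object[] { first, last, total, url });
    }

    public static void main(String[] args) {
        // int resultCount = getResultsMatching();  // would not compile: possible lossy conversion
        long resultCount = getResultsMatching();
        System.out.println(resultCount > Integer.MAX_VALUE);
        System.out.println(summary(1L, 10L, 42L, "example.org/"));
    }
}
```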


[Archive-access-cvs] SF.net SVN: archive-access: [2392] trunk/archive-access/projects/wayback/ wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_de.properties

From: <bra...@us...> - 2008-07-02 00:33:11

Revision: 2392
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2392&view=rev
Author:   bradtofel
Date:     2008-07-01 17:33:10 -0700 (Tue, 01 Jul 2008)

Log Message:
-----------
German translation, thanks Andreas!

Added Paths:
-----------
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_de.properties

Added: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_de.properties
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_de.properties	                        (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_de.properties	2008-07-02 00:33:10 UTC (rev 2392)
@@ -0,0 +1,114 @@
+Exception.wayback.title=Wayback Fehler
+Exception.wayback.message=Ein unbekannter Fehler ist aufgetreten. {0}
+Exception.accessControl.title=Zugriffsfehler
+Exception.accessControl.message=Der Zugriff auf den Inhalt ist gesperrt. {0}
+Exception.authenticationControl.title=Authentisierungsfehler
+Exception.authenticationControl.message=Dieser Inhalt ist für den aktuellen Benutzer oder vom aktuellen Ort nicht m&ouml;glich. {0}
+Exception.badContent.title=Inhaltsfehler
+Exception.badContent.message=Der archivierte Inhalt konnte nicht wiedergegeben werden.
+Exception.badQuery.title=Anfragefehler
+Exception.badQuery.message=F&uuml;r die Anfrage fehlen Informationen oder konnte vom Server nicht verstanden werden. {0}
+Exception.betterRequest.title=Anfragefehler
+Exception.betterRequest.message=Die gemachte Anfrage kann durch einen andere Anfrage besser ausgedr&uuml;ckt werden. {0}
+Exception.configuration.title=Konfigurationsfehler
+Exception.configuration.message=Das Service wurde nicht korrekt konfiguriert. {0}
+Exception.resourceIndexNotAvailable.title=Der Ressourcen Index ist nicht verf&uuml;bar Exception
+Exception.resourceIndexNotAvailable.message=Der, f&uuml;r Anfrage notwendige Ressourcen Index ist zwischenzeitlich nicht verf&uuml;gbar. Bitte versuchen Sie es sp&auml;ter nocheinmal.
+Exception.resourceNotAvailable.title=Ressource ist nicht verf&uuml;gbar
+Exception.resourceNotAvailable.message=Die angeforderte Ressource ist zwischenzeitlich nicht verf&uuml;gbar. Bitte versuchen Sie es sp&auml;ter nocheinmal.
+Exception.resourceNotInArchive.title=Ressource ist nicht im Archiv
+Exception.resourceNotInArchive.message=Die angeforderte Ressource ist nicht im Archiv.
+
+UIGlobal.pageTitle=Internet Archive Wayback Machine
+UIGlobal.helpLink=Hilfe
+UIGlobal.enterWebAddress=Internet Adresse:
+UIGlobal.selectYearAll=Alle
+UIGlobal.urlSearchButton=Suche
+UIGlobal.advancedSearchLink=Erweiterte Suche
+UIGlobal.homeLink=Home
+UIGlobal.indexPage=Das ist der neue Wayback Machine Prototyp. Jede URL, die in den ARC Dateien verf&uuml;gbar ist, kann oben gesucht werden.
+UIGlobal.helpPage=Bitte beziehen sie sich auf <a href="{0}">Wayback FAQ</a>.
+
+MetaReplay.title=Document Metadata
+MetaReplay.HTTPHeaders=HTTP Headers
+MetaReplay.originalURL=Original URL
+MetaReplay.URLKey=URL Schl&uuml;ssel
+MetaReplay.captureDate=Speicherdatum
+MetaReplay.captureDateDisplay={0,date,dd.MM.yyyy HH:mm:ss}
+MetaReplay.archiveID=Archive ID
+MetaReplay.MIMEType=Mime Type
+MetaReplay.digest=Digest
+
+TimelineView.viewingVersion=Anzeige der Version {0,number,integer} von {1,number,integer}
+TimelineView.viewingVersionDate={0,date,dd.MM.yyyy HH:mm:ss}
+TimelineView.timeRange=Zeitraum
+TimelineView.timeRange.years=Jahre
+TimelineView.timeRange.twomonths=Monate
+TimelineView.timeRange.months=Monate
+TimelineView.timeRange.days=Tage
+TimelineView.timeRange.hours=Stunden
+TimelineView.timeRange.unknown=unbekannt
+TimelineView.timeRange.auto=Auto({0})
+TimelineView.metaDataCheck=Metadata:
+TimelineView.markDateTitle={0,date,dd.MM.yyyy HH:mm:ss}
+TimelineView.firstVersionTitle=Erste Version ({0,date,dd.MM.yyyy HH:mm:ss})
+TimelineView.prevVersionTitle=Vorherige Version ({0,date,dd.MM.yyyy HH:mm:ss})
+TimelineView.nextVersionTitle=N&auml;chste Version ({0,date,dd.MM.yyyy HH:mm:ss})
+TimelineView.lastVersionTitle=Letzte Version ({0,date,dd.MM.yyyy HH:mm:ss})
+TimelineView.frameSetTitle=WB-Zeitstrahl
+TimelineView.frameSetNoFramesMessage=Ein Browser der Frames unterst&uuml;tzt wird für die Anzeige ben&ouml;tigt.
+
+
+ReplayView.banner=Wayback - externe Links, Formulare und Suchabfragen werden für diese Kollektion nicht funktionieren. Url: {0} time: {1,date,dd.MM.yyyy HH:mm:ss}
+ReplayView.bannerHideLink=[versteckt]
+
+PathQuery.resultsSummary={0,number,integer} Resultate f&uuml;r {1}
+PathQuery.resultRange=zwischen {0,date,dd.MM.yyyy} und {1,date,dd.MM.yyyy}
+PathQuery.newVersionIndicator=(neue Version)
+PathQuery.redirectIndicator=(redirect)
+PathQuery.classicResultLinkText={0,date,dd.MM.yyyy}
+
+PathPrefixQuery.showingResults=Anzeige von {0,number,integer} bis {1,number,integer} von {2,number,integer} Resultaten f&uuml;r {3}
+PathPrefixQuery.unchangedIndicator=unver&auml;ndert
+
+PathQueryClassic.searchedFor=Suche nach <a href="{0}"><b>{0}</b></a>
+PathQueryClassic.searchResults=Suchergebnis f&uuml;r {0,date,dd.MM.yyyy} - {1,date,dd.MM.yyyy}
+PathQueryClassic.resultsSummary={0,choice,0#0 Treffer|1#1 Treffer|1<{0,number,integer} Treffer}
+PathQueryClassic.versionsSummary={0,choice,0#(0 Versionen)|1#(1 Version)|1<({0,number,integer} Versionen)}
+
+
+# 0 = number of unique versions of a page
+PathPrefixQuery.versionCount={0,choice,1#1 Version|1<{0,number,integer} Versionen}
+
+# shown when only a single capture of an URL is found in the index:
+# 0 = Date of capture
+PathPrefixQuery.singleCaptureDate=1 Seite von {0,date,dd.MM.yyyy}
+
+# shown when multiple captures of an URL are found in the index:
+# 0 = number of captures
+# 1 = Date of first capture
+# 2 = Date of last capture
+PathPrefixQuery.multiCaptureDate={0,choice,1#1 Seite|1<{0,number,integer} Seiten} zwischen {1,date,dd.MM.yyyy} und {2,date,dd.MM.yyyy}
+
+ResultPartition.columnSummary={0,choice,0#0 Seiten|1#1 Seite|1<{0,number,integer} Seiten}
+ResultPartitions.day={0,date,d.M.}
+ResultPartitions.hour={0,date,h a}
+ResultPartitions.month={0,date,M/yyyy}
+ResultPartitions.twoMonth={0,date,M/yyyy} - {1,date,M/yyyy}
+ResultPartitions.week={0,date,d.M.} - {1,date,d.M.}
+ResultPartitions.year={0,date,yyyy}
+
+ReplayView.javaScriptComment=\
+//     DATEI ARCHIVIERT AM {0,date,dd.MM.yyyy HH:mm:ss} UND EMPFANGEN VOM\n\
+//     INTERNET ARCHIVE AM {1,date,dd.MM.yyyy HH:mm:ss}.\n\
+//     JAVASCRIPT HINZUGEF&Uuml;ÜGT VON WAYBACK MACHINE, COPYRIGHT INTERNET ARCHIVE.\n\
+//\n\
+//     JEDER ANDERE INHALT IST EBENSO GESCH&Uuml;TZT DURCH COPYRIGHT (17 U.S.C.\n\
+//     SECTION 108(a)(3)).\n\
+\n
+
+AdvancedSearch.url=URL:
+AdvancedSearch.exactDate=Genaues Datum:
+AdvancedSearch.earliestDate=Fr&uuml;hestes Datum:
+AdvancedSearch.latestDate=Sp&auml;testes Datum:
+AdvancedSearch.submitButton=Suche

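The {0,choice,...} entries in the properties file above use java.text.MessageFormat pattern syntax. The sketch below shows how two of those patterns select singular and plural forms. It assumes the patterns are evaluated by MessageFormat with a German locale; how Wayback's StringFormatter loads and applies them is not shown in this commit, and the class and helper names are hypothetical.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class ChoicePatternSketch {

    // Patterns copied from the properties file above.
    static final String RESULTS_SUMMARY =
        "{0,choice,0#0 Treffer|1#1 Treffer|1<{0,number,integer} Treffer}";
    static final String VERSION_COUNT =
        "{0,choice,1#1 Version|1<{0,number,integer} Versionen}";

    // Hypothetical helper: format one pattern with the given arguments.
    static String render(String pattern, Object... args) {
        return new MessageFormat(pattern, Locale.GERMAN).format(args);
    }

    public static void main(String[] args) {
        System.out.println(render(RESULTS_SUMMARY, 0));
        System.out.println(render(RESULTS_SUMMARY, 1));
        System.out.println(render(RESULTS_SUMMARY, 5));
        System.out.println(render(VERSION_COUNT, 7));
    }
}
```

The "1<" branch matches every value greater than 1, and its text is itself a pattern, which is why the nested {0,number,integer} is expanded to the actual count.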


[Archive-access-cvs] SF.net SVN: archive-access: [2391] trunk/archive-access/projects/wayback/ wayback-webapp/src/main/webapp/jsp

From: <bra...@us...> - 2008-07-02 00:31:36

Revision: 2391
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2391&view=rev
Author:   bradtofel
Date:     2008-07-01 17:31:45 -0700 (Tue, 01 Jul 2008)

Log Message:
-----------
REMOVED: replaced with /query/(HTML|XML)(Url|Capture)Results.jsp

Removed Paths:
-------------
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/HTMLResults.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/XMLResults.jsp

Deleted: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/HTMLResults.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/HTMLResults.jsp	2008-07-02 00:30:47 UTC (rev 2390)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/HTMLResults.jsp	2008-07-02 00:31:45 UTC (rev 2391)
@@ -1,194 +0,0 @@
-<%@ page language="java" pageEncoding="utf-8" contentType="text/html;charset=utf-8"%>
-<%@ page import="java.util.Iterator" %>
-<%@ page import="java.util.ArrayList" %>
-<%@ page import="java.util.Date" %>
-<%@ page import="org.archive.wayback.WaybackConstants" %>
-<%@ page import="org.archive.wayback.core.SearchResult" %>
-<%@ page import="org.archive.wayback.core.Timestamp" %>
-<%@ page import="org.archive.wayback.core.UIResults" %>
-<%@ page import="org.archive.wayback.resourceindex.filters.CaptureToUrlResultFilter" %>
-
-<%@ page import="org.archive.wayback.query.UIQueryResults" %>
-<%@ page import="org.archive.wayback.util.StringFormatter" %>
-<jsp:include page="/template/UI-header.jsp" flush="true" />
-<%
-
-UIQueryResults results = (UIQueryResults) UIResults.getFromRequest(request);
-StringFormatter fmt = results.getFormatter();
-
-String searchString = results.getSearchUrl();
-
-
-if(results.isCaptureResults()) {
-
-	int resultCount = results.getResultsReturned();
-
-	Timestamp searchStartTs = results.getStartTimestamp();
-	Timestamp searchEndTs = results.getEndTimestamp();
-	Date searchStartDate = searchStartTs.getDate();
-	Date searchEndDate = searchEndTs.getDate();
-
-	Iterator itr = results.resultsIterator();
-	%>
-	<%= fmt.format("PathQuery.resultsSummary",resultCount,searchString) %>
-	<br></br>
-	<%= fmt.format("PathQuery.resultRange",searchStartDate,searchEndDate) %>
-	<hr></hr>
-	<%
-	boolean first = false;
-	String lastMD5 = null;
-	while(itr.hasNext()) {
-		SearchResult result = (SearchResult) itr.next();
-
-		String url = result.get(WaybackConstants.RESULT_URL);
-
-		String prettyDate = result.get(WaybackConstants.RESULT_CAPTURE_DATE);
-		String origHost = result.get(WaybackConstants.RESULT_ORIG_HOST);
-		String MD5 = result.get(WaybackConstants.RESULT_MD5_DIGEST);
-		String redirectFlag = (0 == result.get(
-			WaybackConstants.RESULT_REDIRECT_URL).compareTo("-")) 
-			?	"" : fmt.format("PathQuery.redirectIndicator");
-		String httpResponse = result.get(WaybackConstants.RESULT_HTTP_CODE);
-		String mimeType = result.get(WaybackConstants.RESULT_MIME_TYPE);
-
-		String arcFile = result.get(WaybackConstants.RESULT_ARC_FILE);
-		String arcOffset = result.get(WaybackConstants.RESULT_OFFSET);
-
-		String replayUrl = results.resultToReplayUrl(result);
-
-		boolean updated = false;
-		if(lastMD5 == null) {
-			lastMD5 = MD5;
-			updated = true;
-		} else if(0 != lastMD5.compareTo(MD5)) {
-			updated = true;
-			lastMD5 = MD5;
-		}
-		if(updated) {
-			%>
-			<a href="<%= replayUrl %>"><%= prettyDate %></a>
-			<span style="color:black;"><%= origHost %></span>
-			<span style="color:gray;"><%= httpResponse %></span>
-			<span style="color:brown;"><%= mimeType %></span>
-	<!--
-			<span style="color:red;"><%= arcFile %></span>
-			<span style="color:red;"><%= arcOffset %></span>
-	-->
-			<%= redirectFlag %>
-			<%= fmt.format("PathQuery.newVersionIndicator") %>
-
-			<br/>
-			<%
-		} else {
-			%>
-			&nbsp;&nbsp;&nbsp;<a href="<%= replayUrl %>"><%= prettyDate %></a>
-			<span style="color:green;"><%= origHost %></span>
-	<!--
-			<span style="color:red;"><%= arcFile %></span>
-			<span style="color:red;"><%= arcOffset %></span>
-	-->
-			<br/>
-			<%
-		}
-	}	
-	
-} else if(results.isUrlResults()) {
-
-	
-	
-	Date searchStartDate = results.getStartTimestamp().getDate();
-	Date searchEndDate = results.getEndTimestamp().getDate();
-	
-//	PathQuerySearchResultPartitioner partitioner = 
-//		new PathQuerySearchResultPartitioner(results.getResults(),
-//				results.getURIConverter());
-	
-	int firstResult = results.getFirstResult();
-	int lastResult = results.getLastResult();
-	int resultCount = results.getResultsMatching();
-	
-	int totalCaptures = results.getResultsMatching();
-	
-	%>
-	<%= fmt.format("PathPrefixQuery.showingResults",firstResult,lastResult,
-					resultCount,searchString) %>
-	<br/>
-
-	<hr></hr>
-	<%
-	Iterator itr = results.resultsIterator();
-	while(itr.hasNext()) {
-		SearchResult result = (SearchResult) itr.next();
-
-		String url = result.get(CaptureToUrlResultFilter.RESULT_ORIGINAL_URL);
-		String urlKey = result.get(CaptureToUrlResultFilter.RESULT_URL);
-		String firstDateTS = result.get(CaptureToUrlResultFilter.RESULT_FIRST_CAPTURE);
-		String lastDateTS = result.get(CaptureToUrlResultFilter.RESULT_LAST_CAPTURE);
-		int numCaptures = Integer.valueOf(result.get(CaptureToUrlResultFilter.RESULT_NUM_CAPTURES));
-		int numVersions = Integer.valueOf(result.get(CaptureToUrlResultFilter.RESULT_NUM_VERSIONS));
-
-		Date firstDate = results.timestampToDate(firstDateTS);
-		Date lastDate = results.timestampToDate(lastDateTS);
-		
-		if(numCaptures == 1) {
-			String anchor = results.makeReplayUrl(url,firstDateTS);
-			%>
-			<a href="<%= anchor %>">
-				<%= url %>
-			</a>
-			<span class="mainSearchText">
-				<%= fmt.format("PathPrefixQuery.versionCount",numVersions) %>
-			</span>
-			<br/>
-			<span class="mainSearchText">
-				<%= fmt.format("PathPrefixQuery.singleCaptureDate",firstDate) %>
-			</span>
-			<%
-			
-		} else {
-			String anchor = results.makeCaptureQueryUrl(url);
-			%>
-			<a href="<%= anchor %>">
-				<%= url %>
-			</a>
-			<span class="mainSearchText">
-				<%= fmt.format("PathPrefixQuery.versionCount",numVersions) %>
-			</span>
-			<br/>
-			<span class="mainSearchText">
-				<%= fmt.format("PathPrefixQuery.multiCaptureDate",numCaptures,firstDate,lastDate) %>
-			</span>
-			<%		
-		}
-		%>
-		<br/>
-		<br/>	
-		<%
-	}
-}
-// show page indicators:
-int curPage = results.getCurPage();
-if(curPage > results.getNumPages()) {
-	%>
-	<hr></hr>
-	<a href="<%= results.urlForPage(1) %>">First results</a>
-	<%
-} else if(results.getNumPages() > 1) {
-	%>
-	<hr></hr>
-	<%
-	for(int i = 1; i <= results.getNumPages(); i++) {
-		if(i == curPage) {
-			%>
-			<b><%= i %></b>
-			<%		
-		} else {
-			%>
-			<a href="<%= results.urlForPage(i) %>"><%= i %></a>
-			<%
-		}
-	}
-}
-%>
-
-<jsp:include page="/template/UI-footer.jsp" flush="true" />

Deleted: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/XMLResults.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/XMLResults.jsp	2008-07-02 00:30:47 UTC (rev 2390)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/XMLResults.jsp	2008-07-02 00:31:45 UTC (rev 2391)
@@ -1,58 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<%@ page language="java" pageEncoding="utf-8" contentType="text/xml;charset=utf-8"%>
-<%@ page import="java.util.Iterator" %>
-<%@ page import="java.util.ArrayList" %>
-<%@ page import="java.util.Properties" %>
-<%@ page import="java.util.Enumeration" %>
-<%@ page import="org.archive.wayback.WaybackConstants" %>
-<%@ page import="org.archive.wayback.core.SearchResults" %>
-<%@ page import="org.archive.wayback.core.SearchResult" %>
-<%@ page import="org.archive.wayback.core.Timestamp" %>
-<%@ page import="org.archive.wayback.core.UIResults" %>
-<%@ page import="org.archive.wayback.query.UIQueryResults" %>
-<%
-UIQueryResults uiResults = (UIQueryResults) UIResults.getFromRequest(request);
-SearchResults results = uiResults.getResults();
-Iterator itr = uiResults.resultsIterator();
-%>
-<wayback>
-	<request>
-<%
-	Properties p = results.getFilters();
-	for (Enumeration e = p.keys(); e.hasMoreElements();) {
-		String key = UIQueryResults.encodeXMLEntity((String) e.nextElement());
-		String value = UIQueryResults.encodeXMLContent((String) p.get(key));
-		%>
-		<<%= key %>><%= value %></<%= key %>>
-		<%
-	}
-	String type = WaybackConstants.RESULTS_TYPE_CAPTURE;
-	if(uiResults.isUrlResults()) {
-		type = WaybackConstants.RESULTS_TYPE_URL;
-	}
-%>
-    <<%= WaybackConstants.RESULTS_TYPE %>><%= type %></<%= WaybackConstants.RESULTS_TYPE %>>
-	</request>
-	<results>
-<%
-	while(itr.hasNext()) {
-		%>
-		<result>
-		<%
-		SearchResult result = (SearchResult) itr.next();
-		Properties p2 = result.getData();
-		for (Enumeration e = p2.keys(); e.hasMoreElements();) {
-			// TODO: encode!
-			String key = UIQueryResults.encodeXMLEntity((String) e.nextElement());
-			String value = UIQueryResults.encodeXMLContent((String) p2.get(key));
-			%>
-			<<%= key %>><%= value %></<%= key %>>
-			<%
-		}
-		%>
-		</result>
-		<%
-	}
-%>	
-	</results>
-</wayback>



[Archive-access-cvs] SF.net SVN: archive-access: [2390] trunk/archive-access/projects/wayback/ wayback-webapp/src/main/webapp

From: <bra...@us...> - 2008-07-02 00:30:38

Revision: 2390
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2390&view=rev
Author:   bradtofel
Date:     2008-07-01 17:30:47 -0700 (Tue, 01 Jul 2008)

Log Message:
-----------
MOVE: replay related .jsp files to /replay/

Added Paths:
-----------
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/Redirect.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/ResultMeta.jsp

Removed Paths:
-------------
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/Redirect.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/ResultMeta.jsp

Deleted: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/Redirect.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/Redirect.jsp	2008-07-02 00:25:51 UTC (rev 2389)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/Redirect.jsp	2008-07-02 00:30:47 UTC (rev 2390)
@@ -1,14 +0,0 @@
-<%@ page import="org.archive.wayback.core.Timestamp" %>
-
-<%
- String url = request.getParameter("url");
- String time = request.getParameter("time");
-  
- // Put time-mapping for this id, or if no id, the ip-addr.
- String id = request.getHeader("Proxy-Id");
- if(id == null)	id = request.getRemoteAddr();
- Timestamp.addTimestampForId(request.getContextPath(),id, time);
- 
- // Now redirect to the page the user wanted.
- response.sendRedirect(url);
-%>

Deleted: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/ResultMeta.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/ResultMeta.jsp	2008-07-02 00:25:51 UTC (rev 2389)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/ResultMeta.jsp	2008-07-02 00:30:47 UTC (rev 2390)
@@ -1,125 +0,0 @@
-<%@ page language="java" pageEncoding="utf-8" contentType="text/html;charset=utf-8"%>
-<%@ page import="java.util.Iterator" %>
-<%@ page import="java.util.Map" %>
-<%@ page import="org.archive.wayback.core.Timestamp" %>
-<%@ page import="org.archive.wayback.core.UIResults" %>
-<%@ page import="org.archive.wayback.replay.UIReplayResult" %>
-<%@ page import="org.archive.wayback.util.StringFormatter" %>
-<%
-
-UIReplayResult uiResults = (UIReplayResult) UIResults.getFromRequest(request);
-StringFormatter fmt = uiResults.getFormatter();
-
-String origUrl = uiResults.getOriginalUrl();
-String urlKey = uiResults.getUrlKey();
-String archiveID = uiResults.getArchiveID();
-Timestamp captureTS = uiResults.getCaptureTimestamp();
-String capturePrettyDateTime = fmt.format("MetaReplay.captureDateDisplay",
-	captureTS.getDate());
-String mimeType = uiResults.getMimeType();
-String digest = uiResults.getDigest();
-Map<String,String> headers = uiResults.getHttpHeaders();
-
-%>
-<html>
-	<head>
-		<title>
-			<%= fmt.format("MetaReplay.title") + urlKey +" / " +
-				capturePrettyDateTime %>
-		</title>
-	</head>
-	<body>
-		<h2>
-			<%= fmt.format("MetaReplay.title") %>
-		</h2>
-		<table>
-			<tr>
-				<td class="field-cell">
-					<%= fmt.format("MetaReplay.originalURL") %>
-				</td>
-				<td class="value-cell">
-					<b>
-						<%= origUrl %>
-					</b>
-				</td>
-			</tr>
-			<tr>
-				<td class="field-cell">
-					<%= fmt.format("MetaReplay.URLKey") %>
-				</td>
-				<td class="value-cell">
-					<b>
-						<%= urlKey %>
-					</b>
-				</td>
-			</tr>
-			<tr>
-				<td class="field-cell">
-					<%= fmt.format("MetaReplay.captureDate") %>
-				</td>
-				<td class="value-cell">
-					<b>
-						<%= capturePrettyDateTime %>
-					</b>
-				</td>
-			</tr>
-			<tr>
-				<td class="field-cell">
-					<%= fmt.format("MetaReplay.archiveID") %>
-				</td>
-				<td class="value-cell">
-					<b>
-						<%= archiveID %>
-					</b>
-				</td>
-			</tr>
-			<tr>
-				<td class="field-cell">
-					<%= fmt.format("MetaReplay.MIMEType") %>
-				</td>
-				<td class="value-cell">
-					<b>
-						<%= mimeType %>
-					</b>
-				</td>
-			</tr>
-			<tr>
-				<td class="field-cell">
-					<%= fmt.format("MetaReplay.digest") %>
-				</td>
-				<td class="value-cell">
-					<b>
-						<%= digest %>
-					</b>
-				</td>
-			</tr>
-		</table>
-		<p>
-			<h2>
-				<%= fmt.format("MetaReplay.HTTPHeaders") %>
-			</h2>
-			<table>
-			<%
-			Iterator<String> itr = headers.keySet().iterator();
-			while(itr.hasNext()) {
-				String key = itr.next();
-				String value = headers.get(key);
-				%>
-				<tr>
-					<td class="field-cell">
-						<%= key %>
-					</td>
-					<td class="value-cell">
-						<b>
-							<%= value %>
-						</b>
-					</td>
-				</tr>
-				<%
-			}
-			%>
-			</table>
-		
-	</body>
-</html>
-

Copied: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/Redirect.jsp (from rev 2055, trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/Redirect.jsp)
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/Redirect.jsp	                        (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/Redirect.jsp	2008-07-02 00:30:47 UTC (rev 2390)
@@ -0,0 +1,14 @@
+<%@ page import="org.archive.wayback.core.Timestamp" %>
+
+<%
+ String url = request.getParameter("url");
+ String time = request.getParameter("time");
+  
+ // Put time-mapping for this id, or if no id, the ip-addr.
+ String id = request.getHeader("Proxy-Id");
+ if(id == null)	id = request.getRemoteAddr();
+ Timestamp.addTimestampForId(request.getContextPath(),id, time);
+ 
+ // Now redirect to the page the user wanted.
+ response.sendRedirect(url);
+%>

Copied: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/ResultMeta.jsp (from rev 2228, trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/ResultMeta.jsp)
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/ResultMeta.jsp	                        (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/ResultMeta.jsp	2008-07-02 00:30:47 UTC (rev 2390)
@@ -0,0 +1,125 @@
+<%@ page language="java" pageEncoding="utf-8" contentType="text/html;charset=utf-8"%>
+<%@ page import="java.util.Iterator" %>
+<%@ page import="java.util.Map" %>
+<%@ page import="org.archive.wayback.core.Timestamp" %>
+<%@ page import="org.archive.wayback.core.UIResults" %>
+<%@ page import="org.archive.wayback.replay.UIReplayResult" %>
+<%@ page import="org.archive.wayback.util.StringFormatter" %>
+<%
+
+UIReplayResult uiResults = (UIReplayResult) UIResults.getFromRequest(request);
+StringFormatter fmt = uiResults.getFormatter();
+
+String origUrl = uiResults.getOriginalUrl();
+String urlKey = uiResults.getUrlKey();
+String archiveID = uiResults.getArchiveID();
+Timestamp captureTS = uiResults.getCaptureTimestamp();
+String capturePrettyDateTime = fmt.format("MetaReplay.captureDateDisplay",
+	captureTS.getDate());
+String mimeType = uiResults.getMimeType();
+String digest = uiResults.getDigest();
+Map<String,String> headers = uiResults.getHttpHeaders();
+
+%>
+<html>
+	<head>
+		<title>
+			<%= fmt.format("MetaReplay.title") + urlKey +" / " +
+				capturePrettyDateTime %>
+		</title>
+	</head>
+	<body>
+		<h2>
+			<%= fmt.format("MetaReplay.title") %>
+		</h2>
+		<table>
+			<tr>
+				<td class="field-cell">
+					<%= fmt.format("MetaReplay.originalURL") %>
+				</td>
+				<td class="value-cell">
+					<b>
+						<%= origUrl %>
+					</b>
+				</td>
+			</tr>
+			<tr>
+				<td class="field-cell">
+					<%= fmt.format("MetaReplay.URLKey") %>
+				</td>
+				<td class="value-cell">
+					<b>
+						<%= urlKey %>
+					</b>
+				</td>
+			</tr>
+			<tr>
+				<td class="field-cell">
+					<%= fmt.format("MetaReplay.captureDate") %>
+				</td>
+				<td class="value-cell">
+					<b>
+						<%= capturePrettyDateTime %>
+					</b>
+				</td>
+			</tr>
+			<tr>
+				<td class="field-cell">
+					<%= fmt.format("MetaReplay.archiveID") %>
+				</td>
+				<td class="value-cell">
+					<b>
+						<%= archiveID %>
+					</b>
+				</td>
+			</tr>
+			<tr>
+				<td class="field-cell">
+					<%= fmt.format("MetaReplay.MIMEType") %>
+				</td>
+				<td class="value-cell">
+					<b>
+						<%= mimeType %>
+					</b>
+				</td>
+			</tr>
+			<tr>
+				<td class="field-cell">
+					<%= fmt.format("MetaReplay.digest") %>
+				</td>
+				<td class="value-cell">
+					<b>
+						<%= digest %>
+					</b>
+				</td>
+			</tr>
+		</table>
+		<p>
+			<h2>
+				<%= fmt.format("MetaReplay.HTTPHeaders") %>
+			</h2>
+			<table>
+			<%
+			Iterator<String> itr = headers.keySet().iterator();
+			while(itr.hasNext()) {
+				String key = itr.next();
+				String value = headers.get(key);
+				%>
+				<tr>
+					<td class="field-cell">
+						<%= key %>
+					</td>
+					<td class="value-cell">
+						<b>
+							<%= value %>
+						</b>
+					</td>
+				</tr>
+				<%
+			}
+			%>
+			</table>
+		
+	</body>
+</html>
+



[Archive-access-cvs] SF.net SVN: archive-access: [2389] trunk/archive-access/projects/wayback/ wayback-webapp/src/main/webapp/query

From: <bra...@us...> - 2008-07-02 00:25:42

Revision: 2389
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2389&view=rev
Author:   bradtofel
Date:     2008-07-01 17:25:51 -0700 (Tue, 01 Jul 2008)

Log Message:
-----------
MOVE: moved all query related .jsps to /query/.
      separated URL and Capture query renderers into separate .jsp files
      now use UICaptureQueryResults and UIUrlQueryResults for context.

Added Paths:
-----------
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/HTMLCaptureResults.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/HTMLUrlResults.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/XMLCaptureResults.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/XMLUrlResults.jsp

Added: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/HTMLCaptureResults.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/HTMLCaptureResults.jsp	                        (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/HTMLCaptureResults.jsp	2008-07-02 00:25:51 UTC (rev 2389)
@@ -0,0 +1,114 @@
+<%@ page language="java" pageEncoding="utf-8" contentType="text/html;charset=utf-8"%>
+<%@ page import="java.util.Iterator" %>
+<%@ page import="java.util.ArrayList" %>
+<%@ page import="java.util.Date" %>
+<%@ page import="org.archive.wayback.WaybackConstants" %>
+<%@ page import="org.archive.wayback.core.CaptureSearchResult" %>
+<%@ page import="org.archive.wayback.core.Timestamp" %>
+<%@ page import="org.archive.wayback.core.UIResults" %>
+<%@ page import="org.archive.wayback.query.UICaptureQueryResults" %>
+<%@ page import="org.archive.wayback.util.StringFormatter" %>
+<jsp:include page="/template/UI-header.jsp" flush="true" />
+<%
+
+UICaptureQueryResults results = (UICaptureQueryResults) UIResults.getFromRequest(request);
+StringFormatter fmt = results.getFormatter();
+
+String searchString = results.getSearchUrl();
+
+  int resultCount = results.getResultsReturned();
+
+  Timestamp searchStartTs = results.getStartTimestamp();
+  Timestamp searchEndTs = results.getEndTimestamp();
+  Date searchStartDate = searchStartTs.getDate();
+  Date searchEndDate = searchEndTs.getDate();
+
+  Iterator<CaptureSearchResult> itr = results.resultsIterator();
+  %>
+  <%= fmt.format("PathQuery.resultsSummary",resultCount,searchString) %>
+  <br></br>
+  <%= fmt.format("PathQuery.resultRange",searchStartDate,searchEndDate) %>
+  <hr></hr>
+  <%
+  boolean first = false;
+  String lastMD5 = null;
+  while(itr.hasNext()) {
+	  CaptureSearchResult result = (CaptureSearchResult) itr.next();
+
+    String url = result.getUrlKey();
+
+    String prettyDate = result.getCaptureTimestamp();
+    String origHost = result.getOriginalHost();
+    String MD5 = result.getDigest();
+    String redirectFlag = (0 == result.getRedirectUrl().compareTo("-")) 
+      ? "" : fmt.format("PathQuery.redirectIndicator");
+    String httpResponse = result.getHttpCode();
+    String mimeType = result.getMimeType();
+
+    String arcFile = result.getFile();
+    String arcOffset = String.valueOf(result.getOffset());
+
+    String replayUrl = results.resultToReplayUrl(result);
+
+    boolean updated = false;
+    if(lastMD5 == null) {
+      lastMD5 = MD5;
+      updated = true;
+    } else if(0 != lastMD5.compareTo(MD5)) {
+      updated = true;
+      lastMD5 = MD5;
+    }
+    if(updated) {
+      %>
+      <a href="<%= replayUrl %>"><%= prettyDate %></a>
+      <span style="color:black;"><%= origHost %></span>
+      <span style="color:gray;"><%= httpResponse %></span>
+      <span style="color:brown;"><%= mimeType %></span>
+  <!--
+      <span style="color:red;"><%= arcFile %></span>
+      <span style="color:red;"><%= arcOffset %></span>
+  -->
+      <%= redirectFlag %>
+      <%= fmt.format("PathQuery.newVersionIndicator") %>
+
+      <br/>
+      <%
+    } else {
+      %>
+      &nbsp;&nbsp;&nbsp;<a href="<%= replayUrl %>"><%= prettyDate %></a>
+      <span style="color:green;"><%= origHost %></span>
+  <!--
+      <span style="color:red;"><%= arcFile %></span>
+      <span style="color:red;"><%= arcOffset %></span>
+  -->
+      <br/>
+      <%
+    }
+  } 
+
+// show page indicators:
+int curPage = results.getCurPage();
+if(curPage > results.getNumPages()) {
+  %>
+  <hr></hr>
+  <a href="<%= results.urlForPage(1) %>">First results</a>
+  <%
+} else if(results.getNumPages() > 1) {
+  %>
+  <hr></hr>
+  <%
+  for(int i = 1; i <= results.getNumPages(); i++) {
+    if(i == curPage) {
+      %>
+      <b><%= i %></b>
+      <%    
+    } else {
+      %>
+      <a href="<%= results.urlForPage(i) %>"><%= i %></a>
+      <%
+    }
+  }
+}
+%>
+
+<jsp:include page="/template/UI-footer.jsp" flush="true" />
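
Note on HTMLCaptureResults.jsp: the loop above flags a capture as a "new version" when its digest differs from the previous capture's digest. The standalone sketch below isolates that comparison so it can be run outside a JSP container. The class name DigestChangeSketch and the method markNewVersions are hypothetical and not part of the wayback code base, and digests are assumed to be non-null strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of wayback: mirrors the lastMD5 comparison
// in HTMLCaptureResults.jsp on a plain list of digest strings.
public class DigestChangeSketch {

    /**
     * Returns one flag per capture: true when the capture is the first one
     * or when its digest differs from the digest of the previous capture.
     * Assumes no digest in the list is null.
     */
    static List<Boolean> markNewVersions(List<String> digests) {
        List<Boolean> flags = new ArrayList<Boolean>();
        String lastDigest = null;
        for (String digest : digests) {
            boolean updated = (lastDigest == null) || !lastDigest.equals(digest);
            flags.add(updated);
            lastDigest = digest;
        }
        return flags;
    }

    public static void main(String[] args) {
        List<String> digests = Arrays.asList("aaa", "aaa", "bbb", "bbb", "aaa");
        // prints [true, false, true, false, true]
        System.out.println(markNewVersions(digests));
    }
}
```

Only adjacent captures are compared, so a digest that reappears after a different one is flagged again, as in the JSP.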

Added: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/HTMLUrlResults.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/HTMLUrlResults.jsp	                        (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/HTMLUrlResults.jsp	2008-07-02 00:25:51 UTC (rev 2389)
@@ -0,0 +1,116 @@
+<%@ page language="java" pageEncoding="utf-8" contentType="text/html;charset=utf-8"%>
+<%@ page import="java.util.Iterator" %>
+<%@ page import="java.util.ArrayList" %>
+<%@ page import="java.util.Date" %>
+<%@ page import="org.archive.wayback.WaybackConstants" %>
+<%@ page import="org.archive.wayback.core.Timestamp" %>
+<%@ page import="org.archive.wayback.core.UIResults" %>
+<%@ page import="org.archive.wayback.core.UrlSearchResult" %>
+<%@ page import="org.archive.wayback.query.UIUrlQueryResults" %>
+<%@ page import="org.archive.wayback.util.StringFormatter" %>
+<jsp:include page="/template/UI-header.jsp" flush="true" />
+<%
+
+UIUrlQueryResults results = (UIUrlQueryResults) UIResults.getFromRequest(request);
+StringFormatter fmt = results.getFormatter();
+
+String searchString = results.getSearchUrl();
+
+
+
+Date searchStartDate = results.getStartTimestamp().getDate();
+Date searchEndDate = results.getEndTimestamp().getDate();
+
+//PathQuerySearchResultPartitioner partitioner = 
+//  new PathQuerySearchResultPartitioner(results.getResults(),
+//      results.getURIConverter());
+
+int firstResult = results.getFirstResult();
+int lastResult = results.getLastResult();
+int resultCount = results.getResultsMatching();
+
+int totalCaptures = results.getResultsMatching();
+
+%>
+<%= fmt.format("PathPrefixQuery.showingResults",firstResult,lastResult,
+        resultCount,searchString) %>
+<br/>
+
+<hr></hr>
+<%
+Iterator<UrlSearchResult> itr = results.resultsIterator();
+while(itr.hasNext()) {
+  UrlSearchResult result = itr.next();
+
+  String urlKey = result.getUrlKey();
+  String originalUrl = result.getOriginalUrl();
+  String firstDateTS = result.getFirstCaptureTimestamp();
+  String lastDateTS = result.getLastCaptureTimestamp();
+  long numCaptures = result.getNumCaptures();
+  long numVersions = result.getNumVersions();
+
+  Date firstDate = results.timestampToDate(firstDateTS);
+  Date lastDate = results.timestampToDate(lastDateTS);
+  
+  if(numCaptures == 1) {
+    String anchor = results.makeReplayUrl(originalUrl,firstDateTS);
+    %>
+    <a href="<%= anchor %>">
+      <%= urlKey %>
+    </a>
+    <span class="mainSearchText">
+      <%= fmt.format("PathPrefixQuery.versionCount",numVersions) %>
+    </span>
+    <br/>
+    <span class="mainSearchText">
+      <%= fmt.format("PathPrefixQuery.singleCaptureDate",firstDate) %>
+    </span>
+    <%
+    
+  } else {
+    String anchor = results.makeCaptureQueryUrl(originalUrl);
+    %>
+    <a href="<%= anchor %>">
+      <%= urlKey %>
+    </a>
+    <span class="mainSearchText">
+      <%= fmt.format("PathPrefixQuery.versionCount",numVersions) %>
+    </span>
+    <br/>
+    <span class="mainSearchText">
+      <%= fmt.format("PathPrefixQuery.multiCaptureDate",numCaptures,firstDate,lastDate) %>
+    </span>
+    <%    
+  }
+  %>
+  <br/>
+  <br/> 
+  <%
+}
+
+// show page indicators:
+int curPage = results.getCurPage();
+if(curPage > results.getNumPages()) {
+  %>
+  <hr></hr>
+  <a href="<%= results.urlForPage(1) %>">First results</a>
+  <%
+} else if(results.getNumPages() > 1) {
+  %>
+  <hr></hr>
+  <%
+  for(int i = 1; i <= results.getNumPages(); i++) {
+    if(i == curPage) {
+      %>
+      <b><%= i %></b>
+      <%    
+    } else {
+      %>
+      <a href="<%= results.urlForPage(i) %>"><%= i %></a>
+      <%
+    }
+  }
+}
+%>
+
+<jsp:include page="/template/UI-footer.jsp" flush="true" />
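
Note on the page indicators: both HTML result pages end with the same three-way branch (current page past the last page, more than one page, single page). The sketch below reproduces that branch as a plain method that returns the markup as a string. PageIndicatorSketch, renderPageLinks, and the "?page=N" URL scheme are assumptions for illustration; the real pages call results.urlForPage(i), whose output format is not shown in this commit.

```java
// Hypothetical helper, not part of wayback: mirrors the paging branch at the
// end of HTMLCaptureResults.jsp and HTMLUrlResults.jsp.
public class PageIndicatorSketch {

    // Stand-in for results.urlForPage(i); the real URL format is an assumption.
    static String urlForPage(int page) {
        return "?page=" + page;
    }

    /**
     * Builds the page links: a single "First results" link when the current
     * page is past the last page, numbered links (current page in bold) when
     * there is more than one page, and nothing for a single page.
     */
    static String renderPageLinks(int curPage, int numPages) {
        StringBuilder sb = new StringBuilder();
        if (curPage > numPages) {
            sb.append("<a href=\"").append(urlForPage(1)).append("\">First results</a>");
        } else if (numPages > 1) {
            for (int i = 1; i <= numPages; i++) {
                if (i > 1) {
                    sb.append(' ');
                }
                if (i == curPage) {
                    sb.append("<b>").append(i).append("</b>");
                } else {
                    sb.append("<a href=\"").append(urlForPage(i)).append("\">")
                      .append(i).append("</a>");
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(renderPageLinks(2, 3));
    }
}
```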

Added: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/XMLCaptureResults.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/XMLCaptureResults.jsp	                        (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/XMLCaptureResults.jsp	2008-07-02 00:25:51 UTC (rev 2389)
@@ -0,0 +1,59 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<%@ page language="java" pageEncoding="utf-8" contentType="text/xml;charset=utf-8"%>
+<%@ page import="java.util.Iterator" %>
+<%@ page import="java.util.ArrayList" %>
+<%@ page import="java.util.Map" %>
+<%@ page import="java.util.Enumeration" %>
+<%@ page import="org.archive.wayback.WaybackConstants" %>
+<%@ page import="org.archive.wayback.core.CaptureSearchResult" %>
+<%@ page import="org.archive.wayback.core.CaptureSearchResults" %>
+<%@ page import="org.archive.wayback.core.Timestamp" %>
+<%@ page import="org.archive.wayback.core.UIResults" %>
+<%@ page import="org.archive.wayback.query.UICaptureQueryResults" %>
+<%
+UICaptureQueryResults uiResults = (UICaptureQueryResults) UIResults.getFromRequest(request);
+
+CaptureSearchResults results = uiResults.getResults();
+Iterator<CaptureSearchResult> itr = uiResults.resultsIterator();
+%>
+<wayback>
+  <request>
+<%
+  Map<String,String> p = results.getFilters();
+  Iterator<String> kitr = p.keySet().iterator();
+  while(kitr.hasNext()) {
+    String key = kitr.next();
+    String oKey = UIResults.encodeXMLEntity(key);
+    String oValue = UIResults.encodeXMLContent(p.get(key));
+    %>
+    <<%= oKey %>><%= oValue %></<%= oKey %>>
+    <%
+  }
+%>
+    <<%= WaybackConstants.RESULTS_TYPE %>><%= WaybackConstants.RESULTS_TYPE_CAPTURE %></<%= WaybackConstants.RESULTS_TYPE %>>
+  </request>
+  <results>
+<%
+  while(itr.hasNext()) {
+    %>
+    <result>
+    <%
+    CaptureSearchResult result = itr.next();
+    Map<String,String> p2 = result.toCanonicalStringMap();
+    kitr = p2.keySet().iterator();
+    
+    while(kitr.hasNext()) {
+       String key = kitr.next();
+       String oKey = UIResults.encodeXMLEntity(key);
+       String oValue = UIResults.encodeXMLContent(p2.get(key));
+      %>
+      <<%= oKey %>><%= oValue %></<%= oKey %>>
+      <%
+    }
+    %>
+    </result>
+    <%
+  }
+%>  
+  </results>
+</wayback>

Added: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/XMLUrlResults.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/XMLUrlResults.jsp	                        (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/XMLUrlResults.jsp	2008-07-02 00:25:51 UTC (rev 2389)
@@ -0,0 +1,59 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<%@ page language="java" pageEncoding="utf-8" contentType="text/xml;charset=utf-8"%>
+<%@ page import="java.util.Iterator" %>
+<%@ page import="java.util.ArrayList" %>
+<%@ page import="java.util.Map" %>
+<%@ page import="java.util.Enumeration" %>
+<%@ page import="org.archive.wayback.WaybackConstants" %>
+<%@ page import="org.archive.wayback.core.UrlSearchResults" %>
+<%@ page import="org.archive.wayback.core.UrlSearchResult" %>
+<%@ page import="org.archive.wayback.core.Timestamp" %>
+<%@ page import="org.archive.wayback.core.UIResults" %>
+<%@ page import="org.archive.wayback.query.UIUrlQueryResults" %>
+<%
+UIUrlQueryResults uiResults = (UIUrlQueryResults) UIResults.getFromRequest(request);
+
+UrlSearchResults results = uiResults.getResults();
+Iterator<UrlSearchResult> itr = uiResults.resultsIterator();
+%>
+<wayback>
+  <request>
+<%
+  Map<String,String> p = results.getFilters();
+  Iterator<String> kitr = p.keySet().iterator();
+  while(kitr.hasNext()) {
+	  String key = kitr.next();
+    String oKey = UIResults.encodeXMLEntity(key);
+    String oValue = UIResults.encodeXMLContent(p.get(key));
+    %>
+    <<%= oKey %>><%= oValue %></<%= oKey %>>
+    <%
+  }
+%>
+    <<%= WaybackConstants.RESULTS_TYPE %>><%= WaybackConstants.RESULTS_TYPE_URL %></<%= WaybackConstants.RESULTS_TYPE %>>
+  </request>
+  <results>
+<%
+  while(itr.hasNext()) {
+    %>
+    <result>
+    <%
+    UrlSearchResult result = itr.next();
+    Map<String,String> p2 = result.toCanonicalStringMap();
+    kitr = p2.keySet().iterator();
+    
+    while(kitr.hasNext()) {
+       String key = kitr.next();
+       String oKey = UIResults.encodeXMLEntity(key);
+       String oValue = UIResults.encodeXMLContent(p2.get(key));
+      %>
+      <<%= oKey %>><%= oValue %></<%= oKey %>>
+      <%
+    }
+    %>
+    </result>
+    <%
+  }
+%>  
+  </results>
+</wayback>
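
Note on the XML renderers: both XML pages walk a Map<String,String> and emit one element per entry, escaping the value before writing it. The sketch below shows that pattern in plain Java. XmlMapSketch, escapeXml, and toXmlElements are hypothetical names; the JSPs use UIResults.encodeXMLEntity and UIResults.encodeXMLContent, whose exact behavior is not shown in this commit, so the escaping here is only an assumed equivalent. Keys are assumed to already be valid XML element names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of wayback: emits one XML element per map
// entry, in the spirit of XMLCaptureResults.jsp and XMLUrlResults.jsp.
public class XmlMapSketch {

    // Escapes the five predefined XML entities in element content.
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Writes <key>escaped value</key>, one element per line, in map order.
    // Assumes every key is already a valid XML element name.
    static String toXmlElements(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            String name = e.getKey();
            sb.append('<').append(name).append('>')
              .append(escapeXml(e.getValue()))
              .append("</").append(name).append('>')
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("url", "http://example.org/?a=1&b=2");
        fields.put("mimetype", "text/html");
        System.out.print(toXmlElements(fields));
    }
}
```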



[Archive-access-cvs] SF.net SVN: archive-access: [2388] trunk/archive-access/projects/wayback/ wayback-webapp/src/main/webapp

From: <bra...@us...> - 2008-07-02 00:25:07

Revision: 2388
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2388&view=rev
Author:   bradtofel
Date:     2008-07-01 17:25:15 -0700 (Tue, 01 Jul 2008)

Log Message:
-----------
MOVE: moved all query-related .jsps to /query/.
      Separated URL and Capture query renderers into separate .jsp files,
      which now use UICaptureQueryResults and UIUrlQueryResults for context.

Added Paths:
-----------
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/CalendarResults.jsp

Removed Paths:
-------------
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/CalendarResults.jsp

Deleted: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/CalendarResults.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/CalendarResults.jsp	2008-07-02 00:22:06 UTC (rev 2387)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/CalendarResults.jsp	2008-07-02 00:25:15 UTC (rev 2388)
@@ -1,175 +0,0 @@
-<%@ page language="java" pageEncoding="utf-8" contentType="text/html;charset=utf-8"%>
-<%@ page import="java.util.ArrayList" %>
-<%@ page import="java.util.Date" %>
-<%@ page import="java.util.Iterator" %>
-<%@ page import="java.text.ParseException" %>
-<%@ page import="org.archive.wayback.WaybackConstants" %>
-<%@ page import="org.archive.wayback.core.SearchResult" %>
-<%@ page import="org.archive.wayback.core.Timestamp" %>
-<%@ page import="org.archive.wayback.core.UIResults" %>
-<%@ page import="org.archive.wayback.query.UIQueryResults" %>
-<%@ page import="org.archive.wayback.query.resultspartitioner.ResultsPartitionsFactory" %>
-<%@ page import="org.archive.wayback.query.resultspartitioner.ResultsPartition" %>
-<%@ page import="org.archive.wayback.util.StringFormatter" %>
-<jsp:include page="/template/UI-header.jsp" flush="true" />
-<%
-
-UIQueryResults results = (UIQueryResults) UIResults.getFromRequest(request);
-StringFormatter fmt = results.getFormatter();
-String searchString = results.getSearchUrl();
-
-Date searchStartDate = results.getStartTimestamp().getDate();
-Date searchEndDate = results.getEndTimestamp().getDate();
-int firstResult = results.getFirstResult();
-int lastResult = results.getLastResult();
-int resultCount = results.getResultsMatching();
-
-//Timestamp searchStartTs = results.getStartTimestamp();
-//Timestamp searchEndTs = results.getEndTimestamp();
-//String prettySearchStart = results.prettyDateFull(searchStartTs.getDate());
-//String prettySearchEnd = results.prettyDateFull(searchEndTs.getDate());
-
-ArrayList<ResultsPartition> partitions = ResultsPartitionsFactory.get(
-		results.getResults(),results.getWbRequest());
-int numPartitions = partitions.size();
-%>
-<table border="0" cellpadding="5" width="100%" class="mainSearchBanner" cellspacing="0">
-   <tr>
-      <td>
-            <%= fmt.format("PathQueryClassic.searchedFor",searchString) %>
-      </td>
-      <td align="right">
-            <%= fmt.format("PathQueryClassic.resultsSummary",resultCount) %>
-      </td>
-   </tr>
-</table>
-<br>
-
-
-<table border="0" width="100%">
-   <tr bgcolor="#CCCCCC">
-      <td colspan="<%= numPartitions %>" align="center" class="mainCalendar">
-         <%= fmt.format("PathQueryClassic.searchResults",searchStartDate,searchEndDate) %>
-      </td>
-   </tr>
-
-<!--    RESULT COLUMN HEADERS -->
-   <tr bgcolor="#CCCCCC">
-<%
-	for(int i = 0; i < numPartitions; i++) {
-		ResultsPartition partition = partitions.get(i);
-%>
-      <td align="center" class="mainBigBody">
-         <%= partition.getTitle() %>
-      </td>
-<%
-	}
-%>
-   </tr>
-<!--    /RESULT COLUMN HEADERS -->
-
-
-
-<!--    RESULT COLUMN COUNTS -->
-   <tr bgcolor="#CCCCCC">
-<%
-	for(int i = 0; i < numPartitions; i++) {
-		ResultsPartition partition = (ResultsPartition) partitions.get(i);
-%>
-      <td align="center" class="mainBigBody">
-         <%= fmt.format("ResultPartition.columnSummary",partition.resultsCount()) %>
-      </td>
-<%
-	}
-%>
-   </tr>
-<!--    /RESULT COLUMN COUNTS -->
-
-
-<!--    RESULT COLUMN DATA -->
-   <tr bgcolor="#EBEBEB">
-<%
-	boolean first = false;
-	String lastMD5 = null;
-
-	for(int i = 0; i < numPartitions; i++) {
-		ResultsPartition partition = (ResultsPartition) partitions.get(i);
-		ArrayList<SearchResult> partitionResults = partition.getMatches();
-%>
-      <td nowrap class="mainBody" valign="top">
-<%
-		if(partitionResults.size() == 0) {
-%>
-         &nbsp;
-<%
-		} else {
-
-		  for(int j = 0; j < partitionResults.size(); j++) {
-		  
-		  	SearchResult result = partitionResults.get(j);
-			String url = result.get(WaybackConstants.RESULT_URL);
-			String captureDate = result.get(WaybackConstants.RESULT_CAPTURE_DATE);
-			Timestamp captureTS = Timestamp.parseBefore(captureDate);
-			String prettyDate = fmt.format("PathQuery.classicResultLinkText",
-				captureTS.getDate());
-			String origHost = result.get(WaybackConstants.RESULT_ORIG_HOST);
-			String MD5 = result.get(WaybackConstants.RESULT_MD5_DIGEST);
-			String redirectFlag = (0 == result.get(
-				WaybackConstants.RESULT_REDIRECT_URL).compareTo("-")) 
-				?	"" : fmt.format("PathPrefixQuery.redirectIndicator");
-			String httpResponse = result.get(WaybackConstants.RESULT_HTTP_CODE);
-			String mimeType = result.get(WaybackConstants.RESULT_MIME_TYPE);
-		
-			String arcFile = result.get(WaybackConstants.RESULT_ARC_FILE);
-			String arcOffset = result.get(WaybackConstants.RESULT_OFFSET);
-		
-			String replayUrl = results.resultToReplayUrl(result);
-		
-			boolean updated = false;
-			if(lastMD5 == null) {
-				lastMD5 = MD5;
-				updated = true;
-			} else if(0 != lastMD5.compareTo(MD5)) {
-				updated = true;
-				lastMD5 = MD5;
-			}
-			String updateStar = updated ? "*" : "";
-%>
-         <a href="<%= replayUrl %>"><%= prettyDate %></a> <%= updateStar %><br></br>
-<%
-		  
-		  }
-		
-		}
-%>
-      </td>
-<%
-	}
-	
-%>
-   </tr>
-<!--    /RESULT COLUMN DATA -->
-</table>
-
-
-<%
-// show page indicators:
-if(results.getNumPages() > 1) {
-	int curPage = results.getCurPage();
-	%>
-	<hr></hr>
-	<%
-	for(int i = 1; i <= results.getNumPages(); i++) {
-		if(i == curPage) {
-			%>
-			<b><%= i %></b>
-			<%		
-		} else {
-			%>
-			<a href="<%= results.urlForPage(i) %>"><%= i %></a>
-			<%
-		}
-	}
-}
-%>
-<jsp:include page="/template/UI-footer.jsp" flush="true" />

Copied: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/CalendarResults.jsp (from rev 2228, trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/CalendarResults.jsp)
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/CalendarResults.jsp	                        (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/query/CalendarResults.jsp	2008-07-02 00:25:15 UTC (rev 2388)
@@ -0,0 +1,174 @@
+<%@ page language="java" pageEncoding="utf-8" contentType="text/html;charset=utf-8"%>
+<%@ page import="java.util.ArrayList" %>
+<%@ page import="java.util.Date" %>
+<%@ page import="java.util.Iterator" %>
+<%@ page import="java.text.ParseException" %>
+<%@ page import="org.archive.wayback.WaybackConstants" %>
+<%@ page import="org.archive.wayback.core.CaptureSearchResult" %>
+<%@ page import="org.archive.wayback.core.Timestamp" %>
+<%@ page import="org.archive.wayback.core.UIResults" %>
+<%@ page import="org.archive.wayback.query.UICaptureQueryResults" %>
+<%@ page import="org.archive.wayback.query.resultspartitioner.ResultsPartitionsFactory" %>
+<%@ page import="org.archive.wayback.query.resultspartitioner.ResultsPartition" %>
+<%@ page import="org.archive.wayback.util.StringFormatter" %>
+<jsp:include page="/template/UI-header.jsp" flush="true" />
+<%
+
+UICaptureQueryResults results = (UICaptureQueryResults) UIResults.getFromRequest(request);
+StringFormatter fmt = results.getFormatter();
+String searchString = results.getSearchUrl();
+
+Date searchStartDate = results.getStartTimestamp().getDate();
+Date searchEndDate = results.getEndTimestamp().getDate();
+long firstResult = results.getFirstResult();
+long lastResult = results.getLastResult();
+long resultCount = results.getResultsMatching();
+
+//Timestamp searchStartTs = results.getStartTimestamp();
+//Timestamp searchEndTs = results.getEndTimestamp();
+//String prettySearchStart = results.prettyDateFull(searchStartTs.getDate());
+//String prettySearchEnd = results.prettyDateFull(searchEndTs.getDate());
+
+ArrayList<ResultsPartition> partitions = ResultsPartitionsFactory.get(
+    results.getResults(),results.getWbRequest());
+int numPartitions = partitions.size();
+%>
+<table border="0" cellpadding="5" width="100%" class="mainSearchBanner" cellspacing="0">
+   <tr>
+      <td>
+            <%= fmt.format("PathQueryClassic.searchedFor",searchString) %>
+      </td>
+      <td align="right">
+            <%= fmt.format("PathQueryClassic.resultsSummary",resultCount) %>
+      </td>
+   </tr>
+</table>
+<br>
+
+
+<table border="0" width="100%">
+   <tr bgcolor="#CCCCCC">
+      <td colspan="<%= numPartitions %>" align="center" class="mainCalendar">
+         <%= fmt.format("PathQueryClassic.searchResults",searchStartDate,searchEndDate) %>
+      </td>
+   </tr>
+
+<!--    RESULT COLUMN HEADERS -->
+   <tr bgcolor="#CCCCCC">
+<%
+  for(int i = 0; i < numPartitions; i++) {
+    ResultsPartition partition = partitions.get(i);
+%>
+      <td align="center" class="mainBigBody">
+         <%= partition.getTitle() %>
+      </td>
+<%
+  }
+%>
+   </tr>
+<!--    /RESULT COLUMN HEADERS -->
+
+
+
+<!--    RESULT COLUMN COUNTS -->
+   <tr bgcolor="#CCCCCC">
+<%
+  for(int i = 0; i < numPartitions; i++) {
+    ResultsPartition partition = (ResultsPartition) partitions.get(i);
+%>
+      <td align="center" class="mainBigBody">
+         <%= fmt.format("ResultPartition.columnSummary",partition.resultsCount()) %>
+      </td>
+<%
+  }
+%>
+   </tr>
+<!--    /RESULT COLUMN COUNTS -->
+
+
+<!--    RESULT COLUMN DATA -->
+   <tr bgcolor="#EBEBEB">
+<%
+  boolean first = false;
+  String lastMD5 = null;
+
+  for(int i = 0; i < numPartitions; i++) {
+    ResultsPartition partition = (ResultsPartition) partitions.get(i);
+    ArrayList<CaptureSearchResult> partitionResults = partition.getMatches();
+%>
+      <td nowrap class="mainBody" valign="top">
+<%
+    if(partitionResults.size() == 0) {
+%>
+         &nbsp;
+<%
+    } else {
+
+      for(int j = 0; j < partitionResults.size(); j++) {
+      
+        CaptureSearchResult result = partitionResults.get(j);
+      String url = result.getUrlKey();
+      String captureDate = result.getCaptureTimestamp();
+      Timestamp captureTS = Timestamp.parseBefore(captureDate);
+      String prettyDate = fmt.format("PathQuery.classicResultLinkText",
+        captureTS.getDate());
+      String origHost = result.getOriginalHost();
+      String MD5 = result.getDigest();
+      String redirectFlag = (0 == result.getRedirectUrl().compareTo("-")) 
+        ? "" : fmt.format("PathPrefixQuery.redirectIndicator");
+      String httpResponse = result.getHttpCode();
+      String mimeType = result.getMimeType();
+    
+      String arcFile = result.getFile();
+      String arcOffset = String.valueOf(result.getOffset());
+    
+      String replayUrl = results.resultToReplayUrl(result);
+    
+      boolean updated = false;
+      if(lastMD5 == null) {
+        lastMD5 = MD5;
+        updated = true;
+      } else if(0 != lastMD5.compareTo(MD5)) {
+        updated = true;
+        lastMD5 = MD5;
+      }
+      String updateStar = updated ? "*" : "";
+%>
+         <a href="<%= replayUrl %>"><%= prettyDate %></a> <%= updateStar %><br></br>
+<%
+      
+      }
+    
+    }
+%>
+      </td>
+<%
+  }
+  
+%>
+   </tr>
+<!--    /RESULT COLUMN DATA -->
+</table>
+
+
+<%
+// show page indicators:
+if(results.getNumPages() > 1) {
+  int curPage = results.getCurPage();
+  %>
+  <hr></hr>
+  <%
+  for(int i = 1; i <= results.getNumPages(); i++) {
+    if(i == curPage) {
+      %>
+      <b><%= i %></b>
+      <%    
+    } else {
+      %>
+      <a href="<%= results.urlForPage(i) %>"><%= i %></a>
+      <%
+    }
+  }
+}
+%>
+<jsp:include page="/template/UI-footer.jsp" flush="true" />



[Archive-access-cvs] SF.net SVN: archive-access: [2387] trunk/archive-access/projects/wayback/ wayback-webapp/src/main/webapp/replay

From: <bra...@us...> - 2008-07-02 00:22:00

Revision: 2387
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2387&view=rev
Author:   bradtofel
Date:     2008-07-01 17:22:06 -0700 (Tue, 01 Jul 2008)

Log Message:
-----------
REFACTOR: now uses UIReplayResult object to extract context

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/ArchiveComment.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/ClientSideJSInsert.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/Disclaimer.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/JSLessTimeline.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/Timeline.jsp

Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/ArchiveComment.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/ArchiveComment.jsp	2008-07-02 00:17:37 UTC (rev 2386)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/ArchiveComment.jsp	2008-07-02 00:22:06 UTC (rev 2387)
@@ -2,12 +2,12 @@
 <%@ page import="java.util.Date" %>
 <%@ page import="org.archive.wayback.core.Timestamp" %>
 <%@ page import="org.archive.wayback.core.UIResults" %>
-<%@ page import="org.archive.wayback.query.UIQueryResults" %>
+<%@ page import="org.archive.wayback.replay.UIReplayResult" %>
 <%@ page import="org.archive.wayback.util.StringFormatter" %>
 <%
-UIQueryResults results = (UIQueryResults) UIResults.getFromRequest(request);
+UIReplayResult results = (UIReplayResult) UIResults.getFromRequest(request);
 StringFormatter fmt = results.getFormatter();
-Date exactDate = results.getExactRequestedTimestamp().getDate();
+Date exactDate = results.getResult().getCaptureDate();
 Date now = new Date();
 String prettyDateFormat = "{0,date,H:mm:ss MMM d, yyyy}";
 String prettyArchiveString = fmt.format(prettyDateFormat,exactDate);
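
Note on prettyDateFormat: the pattern "{0,date,H:mm:ss MMM d, yyyy}" uses java.text.MessageFormat syntax, where everything after the second comma is passed to the date subformat. The sketch below assumes that StringFormatter delegates to MessageFormat, which this commit does not confirm. PrettyDateSketch and formatCaptureDate are hypothetical names, and the locale and time zone are pinned only so that the output is reproducible.

```java
import java.text.MessageFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of wayback: formats a capture date with the
// same pattern string that ArchiveComment.jsp passes to fmt.format().
public class PrettyDateSketch {

    static String formatCaptureDate(Date captureDate) {
        // Same pattern as the JSP; commas after "date," belong to the subformat.
        MessageFormat mf = new MessageFormat("{0,date,H:mm:ss MMM d, yyyy}", Locale.US);

        // Replace the date subformat with an equivalent one pinned to UTC so
        // the result does not depend on the JVM's default time zone.
        SimpleDateFormat utcFormat = new SimpleDateFormat("H:mm:ss MMM d, yyyy", Locale.US);
        utcFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
        mf.setFormatByArgumentIndex(0, utcFormat);

        return mf.format(new Object[] { captureDate });
    }

    public static void main(String[] args) {
        // Epoch start in UTC: prints 0:00:00 Jan 1, 1970
        System.out.println(formatCaptureDate(new Date(0L)));
    }
}
```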

Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/ClientSideJSInsert.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/ClientSideJSInsert.jsp	2008-07-02 00:17:37 UTC (rev 2386)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/ClientSideJSInsert.jsp	2008-07-02 00:22:06 UTC (rev 2387)
@@ -4,13 +4,12 @@
 <%@ page import="org.archive.wayback.core.Timestamp" %>
 <%@ page import="org.archive.wayback.core.UIResults" %>
 <%@ page import="org.archive.wayback.core.WaybackRequest" %>
-<%@ page import="org.archive.wayback.query.UIQueryResults" %>
+<%@ page import="org.archive.wayback.replay.UIReplayResult" %>
 <%@ page import="org.archive.wayback.util.StringFormatter" %>
 <%
-UIQueryResults results = (UIQueryResults) UIResults.getFromRequest(request);
-ResultURIConverter uriConverter = results.getURIConverter();
-String requestDate = results.getExactRequestedTimestamp().getDateStr();
-String contextPath = uriConverter.makeReplayURI(requestDate, "");
+UIReplayResult results = (UIReplayResult) UIResults.getFromRequest(request);
+String requestDate = results.getResult().getCaptureTimestamp();
+String contextPath = results.makeReplayUrl("",requestDate);
 String contextRoot = request.getScheme() + "://" + request.getServerName() + ":" 
   + request.getServerPort() + request.getContextPath();
 

Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/Disclaimer.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/Disclaimer.jsp	2008-07-02 00:17:37 UTC (rev 2386)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/Disclaimer.jsp	2008-07-02 00:22:06 UTC (rev 2387)
@@ -2,21 +2,20 @@
 <%@ page import="java.util.Date" %>
 <%@ page import="org.archive.wayback.WaybackConstants" %>
 <%@ page import="org.archive.wayback.core.Timestamp" %>
-<%@ page import="org.archive.wayback.core.SearchResult" %>
+<%@ page import="org.archive.wayback.core.CaptureSearchResult" %>
 <%@ page import="org.archive.wayback.core.UIResults" %>
 <%@ page import="org.archive.wayback.core.WaybackRequest" %>
-<%@ page import="org.archive.wayback.query.UIQueryResults" %>
+<%@ page import="org.archive.wayback.replay.UIReplayResult" %>
 <%@ page import="org.archive.wayback.util.StringFormatter" %>
 <%
-UIQueryResults results = (UIQueryResults) UIResults.getFromRequest(request);
+UIReplayResult results = (UIReplayResult) UIResults.getFromRequest(request);
 
 StringFormatter fmt = results.getFormatter();
-SearchResult result = results.getResult();
+CaptureSearchResult result = results.getResult();
 String dupeMsg = "";
 if(result != null) {
-        String dupeType = result.get(WaybackConstants.RESULT_DUPLICATE_ANNOTATION);
-        if(dupeType != null) {
-                String dupeDate = result.get(WaybackConstants.RESULT_DUPLICATE_STORED_DATE);
+        if(result.isDuplicateDigest()) {
+                String dupeDate = result.getDuplicateDigestStoredTimestamp();
                 String prettyDate = "";
                 if(dupeDate != null) {
                 	  Timestamp dupeTS = Timestamp.parseBefore(dupeDate);
@@ -29,10 +28,10 @@
         }
 }
 
-Date requestDate = results.getExactRequestedTimestamp().getDate();
-String requestUrl = results.getSearchUrl();
+Date resultDate = result.getCaptureDate();
+String resultUrl = result.getOriginalUrl();
 
-String wmNotice = fmt.format("ReplayView.banner", requestUrl, requestDate);
+String wmNotice = fmt.format("ReplayView.banner", resultUrl, resultDate);
 String wmHideNotice = fmt.format("ReplayView.bannerHideLink");
 
 String contextRoot = request.getScheme() + "://" + request.getServerName() + ":"

Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/JSLessTimeline.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/JSLessTimeline.jsp	2008-07-02 00:17:37 UTC (rev 2386)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/JSLessTimeline.jsp	2008-07-02 00:22:06 UTC (rev 2387)
@@ -4,11 +4,12 @@
 <%@ page import="java.util.Date" %>
 <%@ page import="java.text.ParseException" %>
 <%@ page import="org.archive.wayback.WaybackConstants" %>
-<%@ page import="org.archive.wayback.core.SearchResult" %>
+<%@ page import="org.archive.wayback.core.CaptureSearchResult" %>
+<%@ page import="org.archive.wayback.core.CaptureSearchResults" %>
 <%@ page import="org.archive.wayback.core.Timestamp" %>
 <%@ page import="org.archive.wayback.core.UIResults" %>
 <%@ page import="org.archive.wayback.core.WaybackRequest" %>
-<%@ page import="org.archive.wayback.query.UIQueryResults" %>
+<%@ page import="org.archive.wayback.replay.UIReplayResult" %>
 <%@ page import="org.archive.wayback.query.resultspartitioner.ResultsTimelinePartitionsFactory" %>
 <%@ page import="org.archive.wayback.query.resultspartitioner.ResultsPartition" %>
 <%@ page import="org.archive.wayback.util.StringFormatter" %>
@@ -17,40 +18,38 @@
 String contextRoot = request.getScheme() + "://" + request.getServerName() + ":" 
 	+ request.getServerPort() + request.getContextPath();
 
-UIQueryResults results = (UIQueryResults) UIResults.getFromRequest(request);
+UIReplayResult results = (UIReplayResult) UIResults.getFromRequest(request);
 StringFormatter fmt = results.getFormatter();
-
-Timestamp searchStartTs = results.getStartTimestamp();
-Timestamp searchEndTs = results.getEndTimestamp();
-Timestamp exactTs = results.getExactRequestedTimestamp();
-String searchUrl = results.getSearchUrl();
-Date exactDate = exactTs.getDate();
-
-String exactDateStr = exactTs.getDateStr();
 WaybackRequest wbRequest = results.getWbRequest();
+CaptureSearchResults cResults = results.getResults();
+
+String exactDateStr = wbRequest.get(WaybackConstants.REQUEST_DATE);
+String searchUrl = wbRequest.get(WaybackConstants.REQUEST_URL);
 String resolution = wbRequest.get(WaybackConstants.REQUEST_RESOLUTION);
+String metaMode = wbRequest.get(WaybackConstants.REQUEST_META_MODE);
+
+Date exactDate = Timestamp.parseBefore(exactDateStr).getDate();
+
+
 if(resolution == null) {
 	resolution = WaybackConstants.REQUEST_RESOLUTION_AUTO;
 }
-String metaMode = wbRequest.get(WaybackConstants.REQUEST_META_MODE);
 String metaChecked = "";
 if(metaMode != null && metaMode.equals("yes")) {
 	metaChecked = "checked";
 }
 
-String searchString = results.getSearchUrl();
+CaptureSearchResult first = null;
+CaptureSearchResult prev = null;
+CaptureSearchResult next = null;
+CaptureSearchResult last = null;
 
-SearchResult first = null;
-SearchResult prev = null;
-SearchResult next = null;
-SearchResult last = null;
-
-int resultCount = results.getResultsReturned();
+long resultCount = cResults.getReturnedCount();
 int resultIndex = 1;
-Iterator<SearchResult> it = results.resultsIterator();
+Iterator<CaptureSearchResult> it = cResults.iterator();
 while(it.hasNext()) {
-	SearchResult res = it.next();
-	String resDateStr = res.get(WaybackConstants.RESULT_CAPTURE_DATE);
+	CaptureSearchResult res = it.next();
+	String resDateStr = res.getCaptureTimestamp();
 	int compared = resDateStr.compareTo(exactDateStr.substring(0,resDateStr.length()));
 	if(compared < 0) {
 		resultIndex++;
@@ -72,8 +71,7 @@
 String hoursOptSelected = "";
 String autoOptSelected = "";
 
-String minResolution = ResultsTimelinePartitionsFactory.getMinResolution(
-							results.getResults());
+String minResolution = ResultsTimelinePartitionsFactory.getMinResolution(cResults);
 
 String optimal = "";
 if(minResolution.equals(WaybackConstants.REQUEST_RESOLUTION_HOURS)) {
@@ -174,7 +172,7 @@
 						if(first != null) {
 							titleString = "title=\"" + 
 								fmt.format("TimelineView.firstVersionTitle",
-									results.resultToDate(first)) + "\"";
+									first.getCaptureDate()) + "\"";
 							%><a wmSpecial="1" href="<%= results.resultToReplayUrl(first) %>"><%
 						}
 						%><img <%= titleString %> wmSpecial="1" border=0 width=19 height=20 src="<%= contextRoot %>/images/first.jpg"><%
@@ -185,7 +183,7 @@
 						if(prev != null) {
 							titleString = "title=\"" + 
 								fmt.format("TimelineView.prevVersionTitle",
-									results.resultToDate(prev)) + "\"";
+										prev.getCaptureDate()) + "\"";
 							%><a wmSpecial="1" href="<%= results.resultToReplayUrl(prev) %>"><%
 						}
 						%><img <%= titleString %> wmSpecial="1" border=0 width=13 height=20 src="<%= contextRoot %>/images/prev.jpg"><%
@@ -204,15 +202,15 @@
 		String prettyDateTime = null;
 		if(numResults == 1) {
 			imageUrl = contextRoot + "/images/mark_one.jpg";
-		  	SearchResult result = (SearchResult) partitionResults.get(0);
+		  	CaptureSearchResult result = (CaptureSearchResult) partitionResults.get(0);
 			replayUrl = results.resultToReplayUrl(result);
-			prettyDateTime = fmt.format("TimelineView.markDateTitle",results.resultToDate(result));
+			prettyDateTime = fmt.format("TimelineView.markDateTitle",result.getCaptureDate());
 			
 		} else if (numResults > 1) {
 			imageUrl = contextRoot + "/images/mark_several.jpg";
-		  	SearchResult result = (SearchResult) partitionResults.get(numResults - 1);
+			CaptureSearchResult result = (CaptureSearchResult) partitionResults.get(numResults - 1);
 			replayUrl = results.resultToReplayUrl(result);
-			prettyDateTime = fmt.format("TimelineView.markDateTitle",results.resultToDate(result));
+			prettyDateTime = fmt.format("TimelineView.markDateTitle",result.getCaptureDate());
 
 		}
 		if((i > 0) && (i < numPartitions)) {
@@ -238,7 +236,7 @@
 						if(next != null) {
 							titleString = "title=\"" + 
 								fmt.format("TimelineView.nextVersionTitle",
-									results.resultToDate(next)) + "\"";
+									next.getCaptureDate()) + "\"";
 							%><a wmSpecial="1" href="<%= results.resultToReplayUrl(next) %>"><%
 						}
 						%><img wmSpecial="1" <%= titleString %> border=0 width=13 height=20 src="<%= contextRoot %>/images/next.jpg"><%
@@ -249,7 +247,7 @@
 						if(last != null) {
 							titleString = "title=\"" + 
 								fmt.format("TimelineView.lastVersionTitle",
-									results.resultToDate(last)) + "\"";
+									last.getCaptureDate()) + "\"";
 							%><a wmSpecial="1" href="<%= results.resultToReplayUrl(last) %>"><%
 						}
 						%><img wmSpecial="1" <%= titleString %> border=0 width=19 height=20 src="<%= contextRoot %>/images/last.jpg"><%

Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/Timeline.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/Timeline.jsp	2008-07-02 00:17:37 UTC (rev 2386)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/replay/Timeline.jsp	2008-07-02 00:22:06 UTC (rev 2387)
@@ -4,11 +4,12 @@
 <%@ page import="java.util.Date" %>
 <%@ page import="java.text.ParseException" %>
 <%@ page import="org.archive.wayback.WaybackConstants" %>
-<%@ page import="org.archive.wayback.core.SearchResult" %>
+<%@ page import="org.archive.wayback.core.CaptureSearchResult" %>
+<%@ page import="org.archive.wayback.core.CaptureSearchResults" %>
 <%@ page import="org.archive.wayback.core.Timestamp" %>
 <%@ page import="org.archive.wayback.core.UIResults" %>
 <%@ page import="org.archive.wayback.core.WaybackRequest" %>
-<%@ page import="org.archive.wayback.query.UIQueryResults" %>
+<%@ page import="org.archive.wayback.replay.UIReplayResult" %>
 <%@ page import="org.archive.wayback.query.resultspartitioner.ResultsTimelinePartitionsFactory" %>
 <%@ page import="org.archive.wayback.query.resultspartitioner.ResultsPartition" %>
 <%@ page import="org.archive.wayback.util.StringFormatter" %>
@@ -17,53 +18,51 @@
 String contextRoot = request.getScheme() + "://" + request.getServerName() + ":" 
 	+ request.getServerPort() + request.getContextPath();
 
-UIQueryResults results = (UIQueryResults) UIResults.getFromRequest(request);
+UIReplayResult results = (UIReplayResult) UIResults.getFromRequest(request);
 StringFormatter fmt = results.getFormatter();
-
-Timestamp searchStartTs = results.getStartTimestamp();
-Timestamp searchEndTs = results.getEndTimestamp();
-Timestamp exactTs = results.getExactRequestedTimestamp();
-String searchUrl = results.getSearchUrl();
-Date exactDate = exactTs.getDate();
-
-String exactDateStr = exactTs.getDateStr();
 WaybackRequest wbRequest = results.getWbRequest();
+CaptureSearchResults cResults = results.getResults();
+
+String exactDateStr = wbRequest.get(WaybackConstants.REQUEST_DATE);
+String searchUrl = wbRequest.get(WaybackConstants.REQUEST_URL);
 String resolution = wbRequest.get(WaybackConstants.REQUEST_RESOLUTION);
+String metaMode = wbRequest.get(WaybackConstants.REQUEST_META_MODE);
+
+Date exactDate = Timestamp.parseBefore(exactDateStr).getDate();
+
+
 if(resolution == null) {
-	resolution = WaybackConstants.REQUEST_RESOLUTION_AUTO;
+  resolution = WaybackConstants.REQUEST_RESOLUTION_AUTO;
 }
-String metaMode = wbRequest.get(WaybackConstants.REQUEST_META_MODE);
 String metaChecked = "";
 if(metaMode != null && metaMode.equals("yes")) {
-	metaChecked = "checked";
+  metaChecked = "checked";
 }
 
-String searchString = results.getSearchUrl();
+CaptureSearchResult first = null;
+CaptureSearchResult prev = null;
+CaptureSearchResult next = null;
+CaptureSearchResult last = null;
 
-SearchResult first = null;
-SearchResult prev = null;
-SearchResult next = null;
-SearchResult last = null;
-
-int resultCount = results.getResultsReturned();
+long resultCount = cResults.getReturnedCount();
 int resultIndex = 1;
-Iterator<SearchResult> it = results.resultsIterator();
+Iterator<CaptureSearchResult> it = cResults.iterator();
 while(it.hasNext()) {
-	SearchResult res = it.next();
-	String resDateStr = res.get(WaybackConstants.RESULT_CAPTURE_DATE);
-	int compared = resDateStr.compareTo(exactDateStr.substring(0,resDateStr.length()));
-	if(compared < 0) {
-		resultIndex++;
-		prev = res;
-		if(first == null) {
-			first = res;
-		}
-	} else if(compared > 0) {
-		last = res;
-		if(next == null) {
-			next = res;
-		}
-	}
+  CaptureSearchResult res = it.next();
+  String resDateStr = res.getCaptureTimestamp();
+  int compared = resDateStr.compareTo(exactDateStr.substring(0,resDateStr.length()));
+  if(compared < 0) {
+    resultIndex++;
+    prev = res;
+    if(first == null) {
+      first = res;
+    }
+  } else if(compared > 0) {
+    last = res;
+    if(next == null) {
+      next = res;
+    }
+  }
 }
 // string to indicate which select option is currently active
 String yearsOptSelected = "";
@@ -72,50 +71,49 @@
 String hoursOptSelected = "";
 String autoOptSelected = "";
 
-String minResolution = ResultsTimelinePartitionsFactory.getMinResolution(
-							results.getResults());
+String minResolution = ResultsTimelinePartitionsFactory.getMinResolution(cResults);
 
 String optimal = "";
 if(minResolution.equals(WaybackConstants.REQUEST_RESOLUTION_HOURS)) {
-	optimal = fmt.format("TimelineView.timeRange.hours");
+  optimal = fmt.format("TimelineView.timeRange.hours");
 } else if(minResolution.equals(WaybackConstants.REQUEST_RESOLUTION_DAYS)) {
-	optimal = fmt.format("TimelineView.timeRange.days");
+  optimal = fmt.format("TimelineView.timeRange.days");
 } else if(minResolution.equals(WaybackConstants.REQUEST_RESOLUTION_MONTHS)) {
-	optimal = fmt.format("TimelineView.timeRange.months");
+  optimal = fmt.format("TimelineView.timeRange.months");
 } else if(minResolution.equals(WaybackConstants.REQUEST_RESOLUTION_TWO_MONTHS)) {
-	  optimal = fmt.format("TimelineView.timeRange.twomonths");
+    optimal = fmt.format("TimelineView.timeRange.twomonths");
 } else if(minResolution.equals(WaybackConstants.REQUEST_RESOLUTION_YEARS)) {
-	optimal = fmt.format("TimelineView.timeRange.years");
+  optimal = fmt.format("TimelineView.timeRange.years");
 } else {
-	optimal = fmt.format("TimelineView.timeRange.unknown");
+  optimal = fmt.format("TimelineView.timeRange.unknown");
 }
 String autoOptString = fmt.format("TimelineView.timeRange.auto",optimal);
 
 ArrayList<ResultsPartition> partitions;
 if(resolution.equals(WaybackConstants.REQUEST_RESOLUTION_HOURS)) {
-	hoursOptSelected = "selected";
-	partitions = ResultsTimelinePartitionsFactory.getHour(results.getResults(),
-		wbRequest);
+  hoursOptSelected = "selected";
+  partitions = ResultsTimelinePartitionsFactory.getHour(results.getResults(),
+    wbRequest);
 } else if(resolution.equals(WaybackConstants.REQUEST_RESOLUTION_DAYS)) {
-	daysOptSelected = "selected";
-	partitions = ResultsTimelinePartitionsFactory.getDay(results.getResults(),
-		wbRequest);
+  daysOptSelected = "selected";
+  partitions = ResultsTimelinePartitionsFactory.getDay(results.getResults(),
+    wbRequest);
 } else if(resolution.equals(WaybackConstants.REQUEST_RESOLUTION_MONTHS)) {
-	monthsOptSelected = "selected";
-	partitions = ResultsTimelinePartitionsFactory.getMonth(results.getResults(),
-		wbRequest);
+  monthsOptSelected = "selected";
+  partitions = ResultsTimelinePartitionsFactory.getMonth(results.getResults(),
+    wbRequest);
 } else if(resolution.equals(WaybackConstants.REQUEST_RESOLUTION_TWO_MONTHS)) {
-	  monthsOptSelected = "selected";
-	  partitions = ResultsTimelinePartitionsFactory.getTwoMonth(results.getResults(),
-	    wbRequest);
+    monthsOptSelected = "selected";
+    partitions = ResultsTimelinePartitionsFactory.getTwoMonth(results.getResults(),
+      wbRequest);
 } else if(resolution.equals(WaybackConstants.REQUEST_RESOLUTION_YEARS)) {
-	yearsOptSelected = "selected";
-	partitions = ResultsTimelinePartitionsFactory.getYear(results.getResults(),
-		wbRequest);
+  yearsOptSelected = "selected";
+  partitions = ResultsTimelinePartitionsFactory.getYear(results.getResults(),
+    wbRequest);
 } else {
-	autoOptSelected = "selected";
-	partitions = ResultsTimelinePartitionsFactory.getAuto(results.getResults(),
-		wbRequest);
+  autoOptSelected = "selected";
+  partitions = ResultsTimelinePartitionsFactory.getAuto(results.getResults(),
+    wbRequest);
 }
 int numPartitions = partitions.size();
 ResultsPartition firstP = (ResultsPartition) partitions.get(0);
@@ -196,7 +194,7 @@
 						if(first != null) {
 							titleString = "title=\"" + 
 								fmt.format("TimelineView.firstVersionTitle",
-									results.resultToDate(first)) + "\"";
+									first.getCaptureDate()) + "\"";
 							%><a wmSpecial="1" href="<%= results.resultToReplayUrl(first) %>"><%
 						}
 						%><img <%= titleString %> wmSpecial="1" border=0 width=19 height=20 src="<%= contextRoot %>/images/first.jpg"><%
@@ -207,7 +205,7 @@
 						if(prev != null) {
 							titleString = "title=\"" + 
 								fmt.format("TimelineView.prevVersionTitle",
-									results.resultToDate(prev)) + "\"";
+									prev.getCaptureDate()) + "\"";
 							%><a wmSpecial="1" href="<%= results.resultToReplayUrl(prev) %>"><%
 						}
 						%><img <%= titleString %> wmSpecial="1" border=0 width=13 height=20 src="<%= contextRoot %>/images/prev.jpg"><%
@@ -226,15 +224,15 @@
 		String prettyDateTime = null;
 		if(numResults == 1) {
 			imageUrl = contextRoot + "/images/mark_one.jpg";
-		  	SearchResult result = (SearchResult) partitionResults.get(0);
+		  	CaptureSearchResult result = (CaptureSearchResult) partitionResults.get(0);
 			replayUrl = results.resultToReplayUrl(result);
-			prettyDateTime = fmt.format("TimelineView.markDateTitle",results.resultToDate(result));
+			prettyDateTime = fmt.format("TimelineView.markDateTitle",result.getCaptureDate());
 			
 		} else if (numResults > 1) {
 			imageUrl = contextRoot + "/images/mark_several.jpg";
-		  	SearchResult result = (SearchResult) partitionResults.get(numResults - 1);
+		  	CaptureSearchResult result = (CaptureSearchResult) partitionResults.get(numResults - 1);
 			replayUrl = results.resultToReplayUrl(result);
-			prettyDateTime = fmt.format("TimelineView.markDateTitle",results.resultToDate(result));
+			prettyDateTime = fmt.format("TimelineView.markDateTitle",result.getCaptureDate());
 
 		}
 		if((i > 0) && (i < numPartitions)) {
@@ -260,7 +258,7 @@
 						if(next != null) {
 							titleString = "title=\"" + 
 								fmt.format("TimelineView.nextVersionTitle",
-									results.resultToDate(next)) + "\"";
+									next.getCaptureDate()) + "\"";
 							%><a wmSpecial="1" href="<%= results.resultToReplayUrl(next) %>"><%
 						}
 						%><img wmSpecial="1" <%= titleString %> border=0 width=13 height=20 src="<%= contextRoot %>/images/next.jpg"><%
@@ -271,7 +269,7 @@
 						if(last != null) {
 							titleString = "title=\"" + 
 								fmt.format("TimelineView.lastVersionTitle",
-									results.resultToDate(last)) + "\"";
+									last.getCaptureDate()) + "\"";
 							%><a wmSpecial="1" href="<%= results.resultToReplayUrl(last) %>"><%
 						}
 						%><img wmSpecial="1" <%= titleString %> border=0 width=19 height=20 src="<%= contextRoot %>/images/last.jpg"><%
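Editorial note: both timeline JSPs in this revision walk the capture list once to pick the first, previous, next and last captures relative to the requested date. The sketch below reproduces that loop over plain 14-digit timestamp strings so it can be run outside a servlet container. The class `CaptureNeighbors`, its `Neighbors` holder and the `find` method are hypothetical names used only for illustration and are not part of wayback. The `Math.min` guard on the prefix length is also an addition: the JSP code calls `exactDateStr.substring(0, resDateStr.length())` directly, which would throw if the requested date string were shorter than the capture timestamp.

```java
import java.util.Arrays;
import java.util.List;

public class CaptureNeighbors {

    /** Hypothetical holder for the four navigation targets; any field may stay null. */
    static final class Neighbors {
        String first;
        String prev;
        String next;
        String last;
        int resultIndex = 1; // 1-based position of the requested capture
    }

    /**
     * Mirrors the loop in the timeline JSPs. Assumes the timestamps are
     * sorted ascending, as the JSP code implicitly does.
     */
    static Neighbors find(List<String> sortedTimestamps, String exact) {
        Neighbors n = new Neighbors();
        for (String ts : sortedTimestamps) {
            // Compare on the common prefix only (guard added for this sketch).
            int len = Math.min(ts.length(), exact.length());
            int compared = ts.substring(0, len).compareTo(exact.substring(0, len));
            if (compared < 0) {
                // Capture is earlier than the requested date.
                n.resultIndex++;
                n.prev = ts;
                if (n.first == null) {
                    n.first = ts;
                }
            } else if (compared > 0) {
                // Capture is later than the requested date.
                n.last = ts;
                if (n.next == null) {
                    n.next = ts;
                }
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> captures = Arrays.asList(
            "20080101000000", "20080301000000",
            "20080501000000", "20080701000000");
        Neighbors n = find(captures, "20080501000000");
        System.out.println(n.first + " " + n.prev + " " + n.next + " "
            + n.last + " " + n.resultIndex);
    }
}
```

As in the JSPs, `first` is set only when at least one capture precedes the requested date and `last` only when at least one follows it, so the corresponding links are omitted at either end of the timeline.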



[Archive-access-cvs] SF.net SVN: archive-access: [2386] trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp

From: <bra...@us...> - 2008-07-02 00:17:28

Revision: 2386
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2386&view=rev
Author:   bradtofel
Date:     2008-07-01 17:17:37 -0700 (Tue, 01 Jul 2008)

Log Message:
-----------
MOVED: exception related rendering .jsps to /exception/

Added Paths:
-----------
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/exception/
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/exception/CSSError.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/exception/HTMLError.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/exception/JavaScriptError.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/exception/XMLError.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/exception/error_image.gif

Removed Paths:
-------------
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/CSSError.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/HTMLError.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/JavaScriptError.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/XMLError.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/error_image.gif

Copied: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/exception/CSSError.jsp (from rev 2228, trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/CSSError.jsp)
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/exception/CSSError.jsp	                        (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/exception/CSSError.jsp	2008-07-02 00:17:37 UTC (rev 2386)
@@ -0,0 +1,18 @@
+<%@ page language="java" pageEncoding="utf-8" contentType="text/html;charset=utf-8"%>
+<%@ page import="org.archive.wayback.exception.WaybackException" %>
+<%@ page import="org.archive.wayback.core.UIResults" %>
+<%@ page import="org.archive.wayback.util.StringFormatter" %>
+<%
+
+WaybackException e = (WaybackException) request.getAttribute("exception");
+UIResults results = UIResults.getFromRequest(request);
+StringFormatter fmt = results.getFormatter();
+response.setStatus(e.getStatus());
+
+%>
+/* CSS wayback retrieval error:
+
+ Title:   <%= fmt.format(e.getTitleKey()) %>
+ Message: <%= fmt.format(e.getMessageKey()) %>
+ 
+ */

Copied: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/exception/HTMLError.jsp (from rev 2228, trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/HTMLError.jsp)
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/exception/HTMLError.jsp	                        (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/exception/HTMLError.jsp	2008-07-02 00:17:37 UTC (rev 2386)
@@ -0,0 +1,19 @@
+<%@ page language="java" pageEncoding="utf-8" contentType="text/html;charset=utf-8"%>
+<%@ page import="org.archive.wayback.exception.WaybackException" %>
+<%@ page import="org.archive.wayback.core.UIResults" %>
+<%@ page import="org.archive.wayback.util.StringFormatter" %>
+<%
+WaybackException e = (WaybackException) request.getAttribute("exception");
+e.setupResponse(response);
+%>
+<jsp:include page="/template/UI-header.jsp" flush="true" />
+<%
+
+UIResults results = UIResults.getFromRequest(request);
+StringFormatter fmt = results.getFormatter();
+
+%>
+
+<h2><%= fmt.format(e.getTitleKey()) %></h2>
+<p><b><%= fmt.format(e.getMessageKey(),e.getMessage()) %></b></p>
+<jsp:include page="/template/UI-footer.jsp" flush="true" />

Copied: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/exception/JavaScriptError.jsp (from rev 2228, trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/JavaScriptError.jsp)
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/exception/JavaScriptError.jsp	                        (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/exception/JavaScriptError.jsp	2008-07-02 00:17:37 UTC (rev 2386)
@@ -0,0 +1,16 @@
+<%@ page language="java" pageEncoding="utf-8" contentType="text/html;charset=utf-8"%>
+<%@ page import="org.archive.wayback.exception.WaybackException" %>
+<%@ page import="org.archive.wayback.core.UIResults" %>
+<%@ page import="org.archive.wayback.util.StringFormatter" %>
+<%
+
+WaybackException e = (WaybackException) request.getAttribute("exception");
+UIResults results = UIResults.getFromRequest(request);
+StringFormatter fmt = results.getFormatter();
+response.setStatus(e.getStatus());
+
+%>
+// Javascript wayback retrieval error:
+//
+// Title:   <%= fmt.format(e.getTitleKey()) %>
+// Message: <%= fmt.format(e.getMessageKey()) %>

Copied: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/exception/XMLError.jsp (from rev 2228, trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/XMLError.jsp)
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/exception/XMLError.jsp	                        (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/exception/XMLError.jsp	2008-07-02 00:17:37 UTC (rev 2386)
@@ -0,0 +1,19 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<%@ page language="java" pageEncoding="utf-8" contentType="text/xml;charset=utf-8"%>
+<%@ page import="org.archive.wayback.exception.WaybackException" %>
+<%@ page import="org.archive.wayback.core.UIResults" %>
+<%@ page import="org.archive.wayback.util.StringFormatter" %>
+<%
+
+WaybackException e = (WaybackException) request.getAttribute("exception");
+UIResults results = UIResults.getFromRequest(request);
+StringFormatter fmt = results.getFormatter();
+//response.setStatus(e.getStatus());
+
+%>
+<wayback>
+	<error>
+		<title><%= UIResults.encodeXMLContent(fmt.format(e.getTitleKey())) %></title>
+		<message><%= UIResults.encodeXMLContent(fmt.format(e.getMessageKey())) %></message>
+	</error>
+</wayback>

Copied: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/exception/error_image.gif (from rev 2055, trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/error_image.gif)
===================================================================
(Binary files differ)

Deleted: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/CSSError.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/CSSError.jsp	2008-07-02 00:16:07 UTC (rev 2385)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/CSSError.jsp	2008-07-02 00:17:37 UTC (rev 2386)
@@ -1,18 +0,0 @@
-<%@ page language="java" pageEncoding="utf-8" contentType="text/html;charset=utf-8"%>
-<%@ page import="org.archive.wayback.exception.WaybackException" %>
-<%@ page import="org.archive.wayback.core.UIResults" %>
-<%@ page import="org.archive.wayback.util.StringFormatter" %>
-<%
-
-WaybackException e = (WaybackException) request.getAttribute("exception");
-UIResults results = UIResults.getFromRequest(request);
-StringFormatter fmt = results.getFormatter();
-response.setStatus(e.getStatus());
-
-%>
-/* CSS wayback retrieval error:
-
- Title:   <%= fmt.format(e.getTitleKey()) %>
- Message: <%= fmt.format(e.getMessageKey()) %>
- 
- */

Deleted: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/HTMLError.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/HTMLError.jsp	2008-07-02 00:16:07 UTC (rev 2385)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/HTMLError.jsp	2008-07-02 00:17:37 UTC (rev 2386)
@@ -1,19 +0,0 @@
-<%@ page language="java" pageEncoding="utf-8" contentType="text/html;charset=utf-8"%>
-<%@ page import="org.archive.wayback.exception.WaybackException" %>
-<%@ page import="org.archive.wayback.core.UIResults" %>
-<%@ page import="org.archive.wayback.util.StringFormatter" %>
-<%
-WaybackException e = (WaybackException) request.getAttribute("exception");
-e.setupResponse(response);
-%>
-<jsp:include page="/template/UI-header.jsp" flush="true" />
-<%
-
-UIResults results = UIResults.getFromRequest(request);
-StringFormatter fmt = results.getFormatter();
-
-%>
-
-<h2><%= fmt.format(e.getTitleKey()) %></h2>
-<p><b><%= fmt.format(e.getMessageKey(),e.getMessage()) %></b></p>
-<jsp:include page="/template/UI-footer.jsp" flush="true" />

Deleted: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/JavaScriptError.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/JavaScriptError.jsp	2008-07-02 00:16:07 UTC (rev 2385)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/JavaScriptError.jsp	2008-07-02 00:17:37 UTC (rev 2386)
@@ -1,16 +0,0 @@
-<%@ page language="java" pageEncoding="utf-8" contentType="text/html;charset=utf-8"%>
-<%@ page import="org.archive.wayback.exception.WaybackException" %>
-<%@ page import="org.archive.wayback.core.UIResults" %>
-<%@ page import="org.archive.wayback.util.StringFormatter" %>
-<%
-
-WaybackException e = (WaybackException) request.getAttribute("exception");
-UIResults results = UIResults.getFromRequest(request);
-StringFormatter fmt = results.getFormatter();
-response.setStatus(e.getStatus());
-
-%>
-// Javascript wayback retrieval error:
-//
-// Title:   <%= fmt.format(e.getTitleKey()) %>
-// Message: <%= fmt.format(e.getMessageKey()) %>

Deleted: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/XMLError.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/XMLError.jsp	2008-07-02 00:16:07 UTC (rev 2385)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/XMLError.jsp	2008-07-02 00:17:37 UTC (rev 2386)
@@ -1,19 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<%@ page language="java" pageEncoding="utf-8" contentType="text/xml;charset=utf-8"%>
-<%@ page import="org.archive.wayback.exception.WaybackException" %>
-<%@ page import="org.archive.wayback.core.UIResults" %>
-<%@ page import="org.archive.wayback.util.StringFormatter" %>
-<%
-
-WaybackException e = (WaybackException) request.getAttribute("exception");
-UIResults results = UIResults.getFromRequest(request);
-StringFormatter fmt = results.getFormatter();
-//response.setStatus(e.getStatus());
-
-%>
-<wayback>
-	<error>
-		<title><%= UIResults.encodeXMLContent(fmt.format(e.getTitleKey())) %></title>
-		<message><%= UIResults.encodeXMLContent(fmt.format(e.getMessageKey())) %></message>
-	</error>
-</wayback>

Deleted: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/jsp/error_image.gif
===================================================================
(Binary files differ)



[Archive-access-cvs] SF.net SVN: archive-access: [2385] trunk/archive-access/projects/wayback/wayback-mapreduce-prereq/src/main/java/org/archive/wayback/resourceindex/indexer/hadoop/Driver.java

From: <bra...@us...> - 2008-07-02 00:15:59

Revision: 2385
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2385&view=rev
Author:   bradtofel
Date:     2008-07-01 17:16:07 -0700 (Tue, 01 Jul 2008)

Log Message:
-----------
REFACTOR: SearchResult => (Url|Capture)SearchResult

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-mapreduce-prereq/src/main/java/org/archive/wayback/resourceindex/indexer/hadoop/Driver.java

Modified: trunk/archive-access/projects/wayback/wayback-mapreduce-prereq/src/main/java/org/archive/wayback/resourceindex/indexer/hadoop/Driver.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-mapreduce-prereq/src/main/java/org/archive/wayback/resourceindex/indexer/hadoop/Driver.java	2008-07-02 00:15:22 UTC (rev 2384)
+++ trunk/archive-access/projects/wayback/wayback-mapreduce-prereq/src/main/java/org/archive/wayback/resourceindex/indexer/hadoop/Driver.java	2008-07-02 00:16:07 UTC (rev 2385)
@@ -24,8 +24,8 @@
 import org.archive.io.arc.ARCRecord;
 import org.archive.mapred.ARCMapRunner;
 import org.archive.mapred.ARCRecordMapper;
-import org.archive.wayback.core.SearchResult;
-import org.archive.wayback.resourcestore.ARCRecordToSearchResultAdapter;
+import org.archive.wayback.core.CaptureSearchResult;
+import org.archive.wayback.resourcestore.indexer.ARCRecordToSearchResultAdapter;
 import org.archive.wayback.resourceindex.cdx.SearchResultToCDXLineAdapter;
 
 /**
@@ -58,7 +58,7 @@
 			ObjectWritable ow = (ObjectWritable) value;
 			ARCRecord rec = (ARCRecord) ow.get();
 			String line;
-			SearchResult result = ARtoSR.adapt(rec);
+			CaptureSearchResult result = ARtoSR.adapt(rec);
 			if(result != null) {
 				line = SRtoCDX.adapt(result);
 				if(line != null) {
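Editorial note: the refactor named in this log message replaces string-keyed lookups such as `result.get(WaybackConstants.RESULT_CAPTURE_DATE)` with typed accessors such as `getCaptureTimestamp()` and `getCaptureDate()`, as the JSP diffs in revision 2387 show. The sketch below contrasts the two styles. `MapResult` and `TypedCapture` are hypothetical stand-ins written for this note, not the real `SearchResult` or `CaptureSearchResult` classes, and the key name `"capturedate"` and the UTC `yyyyMMddHHmmss` parsing are assumptions about the 14-digit timestamp form.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import java.util.TimeZone;

public class TypedResultSketch {

    /** Old style (hypothetical): every field is a string behind a key. */
    static final class MapResult {
        private final Map<String, String> data = new HashMap<>();
        void put(String key, String value) { data.put(key, value); }
        String get(String key) { return data.get(key); }
    }

    /** New style (hypothetical): fields are exposed through typed getters. */
    static final class TypedCapture {
        private final String captureTimestamp;
        private final String originalUrl;

        TypedCapture(String captureTimestamp, String originalUrl) {
            this.captureTimestamp = captureTimestamp;
            this.originalUrl = originalUrl;
        }

        String getCaptureTimestamp() { return captureTimestamp; }
        String getOriginalUrl() { return originalUrl; }

        /** Parses the 14-digit timestamp as UTC (an assumption of this sketch). */
        Date getCaptureDate() {
            try {
                return utcFormat().parse(captureTimestamp);
            } catch (ParseException e) {
                throw new IllegalStateException("bad timestamp: " + captureTimestamp, e);
            }
        }
    }

    /** SimpleDateFormat is not thread-safe, so build a fresh one per call. */
    static SimpleDateFormat utcFormat() {
        SimpleDateFormat f = new SimpleDateFormat("yyyyMMddHHmmss");
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        f.setLenient(false);
        return f;
    }

    public static void main(String[] args) {
        MapResult old = new MapResult();
        old.put("capturedate", "20080702001607");
        // A mistyped key silently yields null at run time.
        System.out.println(old.get("capturedat"));

        TypedCapture cap = new TypedCapture("20080702001607", "http://example.org/");
        // A mistyped getter name would be a compile error instead.
        System.out.println(utcFormat().format(cap.getCaptureDate()));
    }
}
```

The practical difference is where mistakes surface: a wrong key in the map style returns null at run time, while a wrong method name in the typed style fails at compile time.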



[Archive-access-cvs] SF.net SVN: archive-access: [2384] trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/BaseReplayDispatcher.java

From: <bra...@us...> - 2008-07-02 00:15:13

Revision: 2384
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2384&view=rev
Author:   bradtofel
Date:     2008-07-01 17:15:22 -0700 (Tue, 01 Jul 2008)

Log Message:
-----------
REMOVED: no longer needed with new simplified ReplayDispatcher interface.

Removed Paths:
-------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/BaseReplayDispatcher.java

Deleted: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/BaseReplayDispatcher.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/BaseReplayDispatcher.java	2008-07-01 23:56:58 UTC (rev 2383)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/BaseReplayDispatcher.java	2008-07-02 00:15:22 UTC (rev 2384)
@@ -1,210 +0,0 @@
-/* ReplayRendererDispatcher
- *
- * $Id$
- *
- * Created on 5:23:35 PM Aug 8, 2007.
- *
- * Copyright (C) 2007 Internet Archive.
- *
- * This file is part of wayback-core.
- *
- * wayback-core is free software; you can redistribute it and/or modify
- * it under the terms of the GNU Lesser Public License as published by
- * the Free Software Foundation; either version 2.1 of the License, or
- * any later version.
- *
- * wayback-core is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU Lesser Public License for more details.
- *
- * You should have received a copy of the GNU Lesser Public License
- * along with wayback-core; if not, write to the Free Software
- * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
- */
-package org.archive.wayback.replay;
-
-import java.io.IOException;
-import java.util.regex.Matcher;
-import java.util.regex.Pattern;
-
-import javax.servlet.RequestDispatcher;
-import javax.servlet.ServletException;
-import javax.servlet.http.HttpServletRequest;
-import javax.servlet.http.HttpServletResponse;
-
-import org.archive.wayback.ReplayDispatcher;
-import org.archive.wayback.ReplayRenderer;
-import org.archive.wayback.ResultURIConverter;
-import org.archive.wayback.WaybackConstants;
-import org.archive.wayback.core.Resource;
-import org.archive.wayback.core.SearchResult;
-import org.archive.wayback.core.SearchResults;
-import org.archive.wayback.core.UIResults;
-import org.archive.wayback.core.WaybackRequest;
-import org.archive.wayback.exception.WaybackException;
-
-/**
- * 
- * 
- * @author brad
- * @version $Date$, $Revision$
- */
-public abstract class BaseReplayDispatcher implements ReplayDispatcher {
-
-	private String errorJsp = "/jsp/HTMLError.jsp";
-	private String imageErrorJsp = "/jsp/HTMLError.jsp";
-	private String javascriptErrorJsp = "/jsp/JavaScriptError.jsp";
-	private String cssErrorJsp = "/jsp/CSSError.jsp";
-
-	protected final Pattern IMAGE_REGEX = Pattern
-			.compile(".*\\.(jpg|jpeg|gif|png|bmp|tiff|tif)$");
-
-	/* ERROR HANDLING RESPONSES: */
-
-	private boolean requestIsEmbedded(HttpServletRequest httpRequest,
-			WaybackRequest wbRequest) {
-		// without a wbRequest, assume it is not embedded: send back HTML
-		if (wbRequest == null) {
-			return false;
-		}
-		String referer = wbRequest.get(WaybackConstants.REQUEST_REFERER_URL);
-		return (referer != null && referer.length() > 0);
-	}
-
-	private boolean requestIsImage(HttpServletRequest httpRequest,
-			WaybackRequest wbRequest) {
-		String requestUrl = wbRequest.get(WaybackConstants.REQUEST_URL);
-		if (requestUrl == null)
-			return false;
-		Matcher matcher = IMAGE_REGEX.matcher(requestUrl);
-		return (matcher != null && matcher.matches());
-	}
-
-	private boolean requestIsJavascript(HttpServletRequest httpRequest,
-			WaybackRequest wbRequest) {
-
-		String requestUrl = wbRequest.get(WaybackConstants.REQUEST_URL);
-		return (requestUrl != null) && requestUrl.endsWith(".js");
-	}
-
-	private boolean requestIsCSS(HttpServletRequest httpRequest,
-			WaybackRequest wbRequest) {
-
-		String requestUrl = wbRequest.get(WaybackConstants.REQUEST_URL);
-		return (requestUrl != null) && requestUrl.endsWith(".css");
-	}
-
-	/*
-	 * (non-Javadoc)
-	 * 
-	 * @see org.archive.wayback.ReplayRenderer#renderException(javax.servlet.http.HttpServletRequest,
-	 *      javax.servlet.http.HttpServletResponse,
-	 *      org.archive.wayback.core.WaybackRequest,
-	 *      org.archive.wayback.exception.WaybackException)
-	 */
-	public void renderException(HttpServletRequest httpRequest,
-			HttpServletResponse httpResponse, WaybackRequest wbRequest,
-			WaybackException exception) throws ServletException, IOException {
-
-		// the "standard HTML" response handler:
-		String finalJspPath = errorJsp;
-
-		// try to not cause client errors by sending the HTML response if
-		// this request is ebedded, and is obviously one of the special types:
-		if (requestIsEmbedded(httpRequest, wbRequest)) {
-
-			if (requestIsJavascript(httpRequest, wbRequest)) {
-
-				finalJspPath = javascriptErrorJsp;
-
-			} else if (requestIsCSS(httpRequest, wbRequest)) {
-
-				finalJspPath = cssErrorJsp;
-
-			} else if (requestIsImage(httpRequest, wbRequest)) {
-
-				finalJspPath = imageErrorJsp;
-
-			}
-		}
-
-		httpRequest.setAttribute("exception", exception);
-		UIResults uiResults = new UIResults(wbRequest);
-		uiResults.storeInRequest(httpRequest, finalJspPath);
-
-		RequestDispatcher dispatcher = httpRequest
-				.getRequestDispatcher(finalJspPath);
-		if(dispatcher == null) {
-			throw new ServletException("Null dispatcher for " + finalJspPath);
-		}
-		dispatcher.forward(httpRequest, httpResponse);
-	}
-
-	/**
-	 * @param wbRequest
-	 * @param result
-	 * @param resource
-	 * @return the correct ReplayRenderer for the Resource
-	 */
-	public abstract ReplayRenderer getRenderer(WaybackRequest wbRequest,
-			SearchResult result, Resource resource);
-	
-	/*
-	 * (non-Javadoc)
-	 * 
-	 * @see org.archive.wayback.ReplayRenderer#renderResource(javax.servlet.http.HttpServletRequest,
-	 *      javax.servlet.http.HttpServletResponse,
-	 *      org.archive.wayback.core.WaybackRequest,
-	 *      org.archive.wayback.core.SearchResult,
-	 *      org.archive.wayback.core.Resource,
-	 *      org.archive.wayback.ResultURIConverter,
-	 *      org.archive.wayback.core.SearchResults)
-	 */
-	public void renderResource(HttpServletRequest httpRequest,
-			HttpServletResponse httpResponse, WaybackRequest wbRequest,
-			SearchResult result, Resource resource,
-			ResultURIConverter uriConverter, SearchResults results)
-			throws ServletException, IOException {
-		
-		ReplayRenderer renderer = getRenderer(wbRequest, result, resource);
-		try {
-			renderer.renderResource(httpRequest, httpResponse, wbRequest, result, 
-					resource, uriConverter, results);
-		} catch (WaybackException e) {
-			renderException(httpRequest, httpResponse, wbRequest, e);
-		}
-	}
-
-	public String getErrorJsp() {
-		return errorJsp;
-	}
-
-	public void setErrorJsp(String errorJsp) {
-		this.errorJsp = errorJsp;
-	}
-
-	public String getImageErrorJsp() {
-		return imageErrorJsp;
-	}
-
-	public void setImageErrorJsp(String imageErrorJsp) {
-		this.imageErrorJsp = imageErrorJsp;
-	}
-
-	public String getJavascriptErrorJsp() {
-		return javascriptErrorJsp;
-	}
-
-	public void setJavascriptErrorJsp(String javascriptErrorJsp) {
-		this.javascriptErrorJsp = javascriptErrorJsp;
-	}
-
-	public String getCssErrorJsp() {
-		return cssErrorJsp;
-	}
-
-	public void setCssErrorJsp(String cssErrorJsp) {
-		this.cssErrorJsp = cssErrorJsp;
-	}
-}
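
For readers skimming the removed file: the core of renderException is a choice of error page based on whether the request looks embedded (it carries a referer) and on the suffix of the requested URL. Below is a minimal stand-alone sketch of that choice. The class name ErrorPageSelector and the method select are hypothetical and are not part of wayback-core; the JSP paths and the image pattern are copied from the removed defaults, including the image page defaulting to the HTML error page.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; ErrorPageSelector is not a wayback-core class.
final class ErrorPageSelector {

    // Same extension list as IMAGE_REGEX in the removed class.
    private static final Pattern IMAGE =
            Pattern.compile(".*\\.(jpg|jpeg|gif|png|bmp|tiff|tif)$");

    static final String HTML_ERROR = "/jsp/HTMLError.jsp";
    static final String JS_ERROR = "/jsp/JavaScriptError.jsp";
    static final String CSS_ERROR = "/jsp/CSSError.jsp";
    // The removed class defaulted imageErrorJsp to the HTML page; mirrored here.
    static final String IMAGE_ERROR = "/jsp/HTMLError.jsp";

    /**
     * Picks an error page path. A request counts as embedded when it has a
     * non-empty referer; only embedded requests get a type-specific page.
     */
    static String select(String requestUrl, String referer) {
        boolean embedded = referer != null && !referer.isEmpty();
        if (!embedded || requestUrl == null) {
            return HTML_ERROR;
        }
        if (requestUrl.endsWith(".js")) {
            return JS_ERROR;
        }
        if (requestUrl.endsWith(".css")) {
            return CSS_ERROR;
        }
        if (IMAGE.matcher(requestUrl).matches()) {
            return IMAGE_ERROR;
        }
        return HTML_ERROR;
    }
}
```

Keeping the decision in one pure function, separate from the servlet forwarding, makes it testable without a servlet container; that separation is a choice of this sketch, not something the commit states.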



[Archive-access-cvs] SF.net SVN: archive-access: [2383] trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/CaptureToUrlResultFilter.java

From: <bra...@us...> - 2008-07-01 23:56:49

Revision: 2383
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2383&view=rev
Author:   bradtofel
Date:     2008-07-01 16:56:58 -0700 (Tue, 01 Jul 2008)

Log Message:
-----------
REFACTOR: replaced with adapter.

Removed Paths:
-------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/CaptureToUrlResultFilter.java

Deleted: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/CaptureToUrlResultFilter.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/CaptureToUrlResultFilter.java	2008-07-01 23:56:23 UTC (rev 2382)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/CaptureToUrlResultFilter.java	2008-07-01 23:56:58 UTC (rev 2383)
@@ -1,117 +0,0 @@
-/* CaptureToUrlResultFilter
- *
- * $Id$
- *
- * Created on 6:23:07 PM Apr 19, 2007.
- *
- * Copyright (C) 2007 Internet Archive.
- *
- * This file is part of wayback-core.
- *
- * wayback-core is free software; you can redistribute it and/or modify
- * it under the terms of the GNU Lesser Public License as published by
- * the Free Software Foundation; either version 2.1 of the License, or
- * any later version.
- *
- * wayback-core is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU Lesser Public License for more details.
- *
- * You should have received a copy of the GNU Lesser Public License
- * along with wayback-core; if not, write to the Free Software
- * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
- */
-package org.archive.wayback.resourceindex.filters;
-
-import java.util.HashMap;
-import java.util.Properties;
-
-import org.archive.wayback.WaybackConstants;
-import org.archive.wayback.core.SearchResult;
-import org.archive.wayback.util.ObjectFilter;
-
-/**
- *
- *
- * @author brad
- * @version $Date$, $Revision$
- */
-public class CaptureToUrlResultFilter implements ObjectFilter<SearchResult> {
-	private String currentUrl;
-	private String firstCapture;
-	private String lastCapture;
-	private int numCaptures;
-	private HashMap<String,Object> digests;
-	private SearchResult resultRef = null;
-
-	/**
-	 * 
-	 */
-	public final static String RESULT_URL = "result.url";
-	/**
-	 * 
-	 */
-	public final static String RESULT_FIRST_CAPTURE = "result.firstcapture";
-	/**
-	 * 
-	 */
-	public final static String RESULT_LAST_CAPTURE = "result.lastcapture";
-	/**
-	 * 
-	 */
-	public final static String RESULT_NUM_CAPTURES = "result.numcaptures";
-	/**
-	 * 
-	 */
-	public final static String RESULT_NUM_VERSIONS = "result.numversions";
-	/**
-	 * 
-	 */
-	public final static String RESULT_ORIGINAL_URL = "result.originalurl";
-	
-	private void fungeSearchResult(SearchResult result) {
-		String originalUrl = result.get(WaybackConstants.RESULT_URL);
-		currentUrl = result.get(WaybackConstants.RESULT_URL_KEY);
-		firstCapture = result.get(WaybackConstants.RESULT_CAPTURE_DATE);
-		lastCapture = result.get(WaybackConstants.RESULT_CAPTURE_DATE);
-		digests = new HashMap<String,Object>();
-		digests.put(result.get(WaybackConstants.RESULT_MD5_DIGEST),null);
-		numCaptures = 1;
-
-		Properties p = result.getData();
-		p.clear();
-		resultRef = result;
-		resultRef.put(RESULT_ORIGINAL_URL,originalUrl);
-		resultRef.put(RESULT_URL,currentUrl);
-		resultRef.put(RESULT_FIRST_CAPTURE,firstCapture);
-		resultRef.put(RESULT_LAST_CAPTURE,lastCapture);
-		resultRef.put(RESULT_NUM_CAPTURES,"1");
-		resultRef.put(RESULT_NUM_VERSIONS,"1");
-	}
-
-	public int filterObject(SearchResult r) {
-		String urlKey = r.get(WaybackConstants.RESULT_URL_KEY);
-		if(resultRef == null || !currentUrl.equals(urlKey)) {
-			fungeSearchResult(r);
-			return FILTER_INCLUDE;
-		}
-
-		// same url -- accumulate:
-		String captureDate = r.get(WaybackConstants.RESULT_CAPTURE_DATE);
-		if(captureDate.compareTo(firstCapture) < 0) {
-			firstCapture = captureDate;
-			resultRef.put(RESULT_FIRST_CAPTURE,firstCapture);
-		}
-		if(captureDate.compareTo(lastCapture) > 0) {
-			lastCapture = captureDate;
-			resultRef.put(RESULT_LAST_CAPTURE,lastCapture);
-		}
-		numCaptures++;
-		digests.put(r.get(WaybackConstants.RESULT_MD5_DIGEST), null);
-		resultRef.put(RESULT_NUM_CAPTURES,String.valueOf(numCaptures));
-		resultRef.put(RESULT_NUM_VERSIONS,String.valueOf(digests.size()));
-		return FILTER_EXCLUDE;
-	}
-
-}
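
The removed filter collapses a run of capture records that share a URL key into one summary: the first record of each run is kept and mutated, and later records in the run only update the first/last capture dates, the capture count, and the set of distinct digests. The following is a minimal sketch of the same accumulation written as a plain function over a list. All names here (UrlSummaryCollapser, Capture, UrlSummary, collapse) are hypothetical and do not exist in wayback-core. It assumes the input is already grouped by URL key and that timestamps are fixed-width strings, so that string comparison matches chronological order, as the removed compareTo calls also assume. A HashSet stands in for the HashMap with null values that the removed code used as a set.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone sketch; none of these types exist in wayback-core.
final class UrlSummaryCollapser {

    /** One capture record: URL key, timestamp, and content digest. */
    static final class Capture {
        final String urlKey;
        final String captureDate;
        final String digest;

        Capture(String urlKey, String captureDate, String digest) {
            this.urlKey = urlKey;
            this.captureDate = captureDate;
            this.digest = digest;
        }
    }

    /** Per-URL summary accumulated from a run of captures. */
    static final class UrlSummary {
        final String urlKey;
        String firstCapture;
        String lastCapture;
        int numCaptures;
        final Set<String> digests = new HashSet<>();

        UrlSummary(String urlKey, String firstDate) {
            this.urlKey = urlKey;
            this.firstCapture = firstDate;
            this.lastCapture = firstDate;
        }

        /** Distinct digests seen, i.e. the number of distinct versions. */
        int numVersions() {
            return digests.size();
        }
    }

    /** Collapses captures, assumed grouped by urlKey, into one summary per URL. */
    static List<UrlSummary> collapse(List<Capture> capturesGroupedByUrlKey) {
        List<UrlSummary> out = new ArrayList<>();
        UrlSummary current = null;
        for (Capture c : capturesGroupedByUrlKey) {
            if (current == null || !current.urlKey.equals(c.urlKey)) {
                // New URL key: start a fresh summary.
                current = new UrlSummary(c.urlKey, c.captureDate);
                out.add(current);
            } else {
                // Same URL key: widen the date range if needed.
                if (c.captureDate.compareTo(current.firstCapture) < 0) {
                    current.firstCapture = c.captureDate;
                }
                if (c.captureDate.compareTo(current.lastCapture) > 0) {
                    current.lastCapture = c.captureDate;
                }
            }
            current.numCaptures++;
            current.digests.add(c.digest);
        }
        return out;
    }
}
```

Unlike the removed filter, this sketch builds new summary objects rather than clearing and reusing the first input record, which avoids mutating the caller's data.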



[Archive-access-cvs] SF.net SVN: archive-access: [2382] trunk/archive-access/projects/wayback/wayback-core/src/test/java/org/archive/wayback/accesscontrol/staticmap/StaticMapExclusionFilterTest.java

From: <bra...@us...> - 2008-07-01 23:56:15

Revision: 2382
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2382&view=rev
Author:   bradtofel
Date:     2008-07-01 16:56:23 -0700 (Tue, 01 Jul 2008)

Log Message:
-----------
REFACTOR: SearchResult => (Url|Capture)SearchResult

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/test/java/org/archive/wayback/accesscontrol/staticmap/StaticMapExclusionFilterTest.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/test/java/org/archive/wayback/accesscontrol/staticmap/StaticMapExclusionFilterTest.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/test/java/org/archive/wayback/accesscontrol/staticmap/StaticMapExclusionFilterTest.java	2008-07-01 23:56:08 UTC (rev 2381)
+++ trunk/archive-access/projects/wayback/wayback-core/src/test/java/org/archive/wayback/accesscontrol/staticmap/StaticMapExclusionFilterTest.java	2008-07-01 23:56:23 UTC (rev 2382)
@@ -29,8 +29,7 @@
 import java.io.IOException;
 import java.util.Map;
 
-import org.archive.wayback.WaybackConstants;
-import org.archive.wayback.core.SearchResult;
+import org.archive.wayback.core.CaptureSearchResult;
 import org.archive.wayback.util.ObjectFilter;
 
 import junit.framework.TestCase;
@@ -72,21 +71,21 @@
 		String bases[] = {"http://www.peagreenboat.com/",
 							"http://peagreenboat.com/"};
 //		setTmpContents(bases);
-		ObjectFilter<SearchResult> filter = getFilter(bases);
-		assertTrue("unmassaged",isBlocked(filter,"www.peagreenboat.com"));
-		assertTrue("unmassaged",isBlocked(filter,"peagreenboat.com"));
-		assertFalse("other1",isBlocked(filter,"peagreenboatt.com"));
-		assertFalse("other2",isBlocked(filter,"peagreenboat.org"));
-		assertFalse("other3",isBlocked(filter,"www.peagreenboat.org"));
+		ObjectFilter<CaptureSearchResult> filter = getFilter(bases);
+		assertTrue("unmassaged",isBlocked(filter,"http://www.peagreenboat.com"));
+		assertTrue("unmassaged",isBlocked(filter,"http://peagreenboat.com"));
+		assertFalse("other1",isBlocked(filter,"http://peagreenboatt.com"));
+		assertFalse("other2",isBlocked(filter,"http://peagreenboat.org"));
+		assertFalse("other3",isBlocked(filter,"http://www.peagreenboat.org"));
 		// there is a problem with the SURTTokenizer... deal with ports!
-//		assertFalse("other4",isBlocked(filter,"www.peagreenboat.com:8080"));
-		assertTrue("subpath",isBlocked(filter,"www.peagreenboat.com/foo"));
-		assertTrue("emptypath",isBlocked(filter,"www.peagreenboat.com/"));
+//		assertFalse("other4",isBlocked(filter,"http://www.peagreenboat.com:8080"));
+		assertTrue("subpath",isBlocked(filter,"http://www.peagreenboat.com/foo"));
+		assertTrue("emptypath",isBlocked(filter,"http://www.peagreenboat.com/"));
 	}
 	
-	private boolean isBlocked(ObjectFilter<SearchResult> filter, String url) {
-		SearchResult result = new SearchResult();
-		result.put(WaybackConstants.RESULT_URL,url);
+	private boolean isBlocked(ObjectFilter<CaptureSearchResult> filter, String url) {
+		CaptureSearchResult result = new CaptureSearchResult();
+		result.setOriginalUrl(url);
 		int filterResult = filter.filterObject(result);
 		if(filterResult == ObjectFilter.FILTER_EXCLUDE) {
 			return true;
@@ -94,7 +93,7 @@
 		return false;
 	}
 	
-	private ObjectFilter<SearchResult> getFilter(String lines[]) 
+	private ObjectFilter<CaptureSearchResult> getFilter(String lines[]) 
 		throws IOException {
 		
 		setTmpContents(lines);



[Archive-access-cvs] SF.net SVN: archive-access: [2381] trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/webapp/AccessPoint.java

From: <bra...@us...> - 2008-07-01 23:55:59

Revision: 2381
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2381&view=rev
Author:   bradtofel
Date:     2008-07-01 16:56:08 -0700 (Tue, 01 Jul 2008)

Log Message:
-----------
REFACTOR: SearchResult => (Url|Capture)SearchResult

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/webapp/AccessPoint.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/webapp/AccessPoint.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/webapp/AccessPoint.java	2008-07-01 23:55:46 UTC (rev 2380)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/webapp/AccessPoint.java	2008-07-01 23:56:08 UTC (rev 2381)
@@ -41,11 +41,12 @@
 import org.archive.wayback.ResultURIConverter;
 import org.archive.wayback.WaybackConstants;
 import org.archive.wayback.accesscontrol.ExclusionFilterFactory;
+import org.archive.wayback.core.CaptureSearchResult;
 import org.archive.wayback.core.CaptureSearchResults;
 import org.archive.wayback.core.Resource;
-import org.archive.wayback.core.SearchResult;
 import org.archive.wayback.core.SearchResults;
 import org.archive.wayback.core.UIResults;
+import org.archive.wayback.core.UrlSearchResults;
 import org.archive.wayback.core.WaybackRequest;
 import org.archive.wayback.exception.AuthenticationControlException;
 import org.archive.wayback.exception.BaseExceptionRenderer;
@@ -230,7 +231,7 @@
 		WaybackRequest wbRequest = new WaybackRequest();
 		wbRequest.setContextPrefix(getAbsoluteLocalPrefix(httpRequest));
 		wbRequest.setContext(this);
-		UIResults uiResults = new UIResults(wbRequest);
+		UIResults uiResults = new UIResults(wbRequest,uriConverter);
 		String translated = "/" + translateRequestPathQuery(httpRequest);
 		uiResults.storeInRequest(httpRequest,translated);
 		RequestDispatcher dispatcher = null;
@@ -310,7 +311,7 @@
 			CaptureSearchResults captureResults = (CaptureSearchResults) results;
 	
 			// TODO: check which versions are actually accessible right now?
-			SearchResult closest = captureResults.getClosest(wbRequest);
+			CaptureSearchResult closest = captureResults.getClosest(wbRequest);
 			resource = collection.getResourceStore().retrieveResource(closest);
 			ReplayRenderer renderer = replay.getRenderer(wbRequest, closest, resource);
 			renderer.renderResource(httpRequest, httpResponse, wbRequest,
@@ -327,18 +328,19 @@
 	throws ServletException, IOException, WaybackException {
 
 		SearchResults results = collection.getResourceIndex().query(wbRequest);
-		if(results.getResultsType().equals(
-				WaybackConstants.RESULTS_TYPE_CAPTURE)) {
+		if(results instanceof CaptureSearchResults) {
 			CaptureSearchResults cResults = (CaptureSearchResults) results;
-			SearchResult closest = cResults.getClosest(wbRequest);
-			closest.put(WaybackConstants.RESULT_CLOSEST_INDICATOR, 
-					WaybackConstants.RESULT_CLOSEST_VALUE);
+			CaptureSearchResult closest = cResults.getClosest(wbRequest);
+			closest.setClosest(true);
+			query.renderCaptureResults(httpRequest,httpResponse,wbRequest,
+					cResults,uriConverter);
+
+		} else if(results instanceof UrlSearchResults) {
+			UrlSearchResults uResults = (UrlSearchResults) results;
 			query.renderUrlResults(httpRequest,httpResponse,wbRequest,
-					results,uriConverter);
-
+					uResults,uriConverter);
 		} else {
-			query.renderUrlPrefixResults(httpRequest,httpResponse,wbRequest,
-					results,uriConverter);
+			throw new WaybackException("Unknown index format");
 		}
 	}
 	


This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
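
The last hunk of the AccessPoint diff in r2381 replaces a string comparison on the results type with instanceof checks against the two concrete result classes, and turns the former catch-all else branch into an explicit failure. A minimal stand-alone sketch of that dispatch pattern follows. The types here (ResultDispatchSketch, Results, CaptureResults, UrlResults) are hypothetical stand-ins, not the wayback-core classes, and the sketch throws IllegalArgumentException where the commit throws WaybackException.

```java
// Hypothetical stand-alone sketch; these are not the wayback-core classes.
final class ResultDispatchSketch {

    /** Common supertype, standing in for a shared results base class. */
    static abstract class Results { }

    static final class CaptureResults extends Results {
        final int captureCount;

        CaptureResults(int captureCount) {
            this.captureCount = captureCount;
        }
    }

    static final class UrlResults extends Results {
        final int urlCount;

        UrlResults(int urlCount) {
            this.urlCount = urlCount;
        }
    }

    /** Dispatches on the runtime type instead of a string type tag. */
    static String render(Results results) {
        if (results instanceof CaptureResults) {
            CaptureResults c = (CaptureResults) results;
            return "capture results: " + c.captureCount;
        } else if (results instanceof UrlResults) {
            UrlResults u = (UrlResults) results;
            return "url results: " + u.urlCount;
        }
        // Unknown subtypes fail loudly instead of falling into a default branch.
        throw new IllegalArgumentException("Unknown index format");
    }
}
```

With type-based dispatch, each branch receives a value of the specific subtype, so the renderer signatures can require CaptureSearchResults or UrlSearchResults rather than the general SearchResults, as the renderCaptureResults and renderUrlResults calls in the diff do.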
