From: <bi...@us...> - 2008-07-03 02:01:39
Revision: 2396
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2396&view=rev
Author:   binzino
Date:     2008-07-02 19:01:46 -0700 (Wed, 02 Jul 2008)

Log Message:
-----------
Initial revision. Very rough draft.

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/README-dedup.txt

Added: trunk/archive-access/projects/nutchwax/archive/README-dedup.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README-dedup.txt        (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/README-dedup.txt        2008-07-03 02:01:46 UTC (rev 2396)
@@ -0,0 +1,697 @@

README-dedup.txt
2008-07-02
Aaron Binns

De-duplication and NutchWAX

This document assumes that the reader is familiar with the topic of
de-duplication with regard to archiving web data. That said, let us
review what we mean by de-duplication in NutchWAX.

When archive files (ARC/WARC) are written, the tool used to create
them may or may not prevent multiple copies of the same URL from being
written. Some archive file creation tools perform duplicate
prevention, but many do not.

What NutchWAX has to contend with is the scenario where the archive
files being imported and indexed contain multiple copies of a URL.

Ideally, NutchWAX would import and index only one unique version of
the URL. If the same version of the URL were seen a second, third,
fourth, etc. time, then NutchWAX would simply update the existing
record in its search index by adding the subsequent crawl dates to it.
This way, if a URL was crawled 10 times and didn't change, there would
be only one entry in the search index for it, but with 10 crawl dates
associated with it.

======================================================================

This sounds simple enough, but in practice the implementation is not
as straightforward as the above suggests.

For one, Nutch's underlying Lucene search indexes are not easily
modified "in place". That is, updating an existing record by adding
an additional date to it is not easily accomplished via the Lucene
public APIs. The Lucene documentation informs us that records are not
modified in place, but rather are deleted and re-added with the
modified/new information.

Doing a complete delete+re-add on a large Lucene database containing
possibly millions of records is a computationally expensive process.
Furthermore, since many fields in Nutch's Lucene indexes are not
stored, it is infeasible to delete and re-add them w/o data loss.

Fortunately, using parallel Lucene indexes and the ParallelIndexReader
can help solve the problem. More on that later.

======================================================================

Another challenge in handling duplicates is defining what makes for a
unique version of a URL.

Most tools, Nutch included, use the URL as a unique identifier for a
page. Since most tools don't care about old versions of pages,
retaining only the latest version and using the URL to identify it is
sufficient.

However, for archive data, we need to use more than just the URL to
identify a page; we need something that has the URL but also some
notion of the *version* of the page.

For example, consider a page like

  http://www.cnn.com/index.html

This page changes frequently. If it were crawled 10 times, once per
week, each crawl could capture a different version of the page. We
would have 10 different, unique versions.
Now, if NutchWAX used *only* the URL as the unique identifier for the
page, there would be no way to distinguish the first version from the
second, from the third, etc.

NutchWAX needs a unique identifier that has the URL and also some
notion of the *version* of the page. For that we use a digest of the
page's content. The digest is used as a version number of sorts. Each
version will have a different digest. So, if we need to find a
specific version of the page, we can use the URL combined with the
digest to uniquely identify it.

Currently we use SHA-1 for digesting the content.

Using URL+digest rather than just the URL as a unique page identifier
is conceptually simple, but it does have some repercussions within
Nutch. Nutch assumes that the URL alone is a unique identifier, and
that assumption is coded into the software in various ways. To use
URL+digest instead, we had to work around some of those hard-coded
assumptions. More on that later.

======================================================================

The next challenge is to know whether a version of a URL (the
URL+digest described above) has already been imported and indexed so
that we don't import it again.

To prevent the importing of multiple copies of the same version of a
page, we could get the URL+digest of the page to be imported, then
look in the existing Nutch index to see if we already have it. If we
do, we do not import it; instead, we add the crawl date to the
existing record in the search index.

Now, the above describes two challenges:

 1. Searching the existing index to see if there is an existing record
    to be updated.

 2. Updating an existing record. This was discussed above and we do
    have a solution, which we'll describe in more detail later.

The first doesn't seem challenging, and in theory it isn't. However,
in practice it is difficult because in a large deployment we usually
have many Lucene indexes spread over many machines. It's not as
simple as opening up a single Lucene index on the local machine and
searching for a matching URL+digest. In one of the deployments at the
Internet Archive, we have 100s of Lucene indexes spread over 5
machines.

Now, we could use the Nutch web search rather than accessing the
Lucene indexes directly. That is, to find out if we have already
indexed a URL+digest, we could send an HTTP request to the Nutch
search server asking whether the URL+digest is already in the index.

Although this is a workable solution, performing a search for each and
every URL being imported would likely put too much strain on the
search server and would slow down the importing process. When some
import & index jobs process 100s of millions of documents and take
weeks to run, adding a 5-second HTTP request to each URL import is a
significant cost.

What would be ideal is a centralized database of all the URLs
processed by NutchWAX. Ideally, this centralized database would also
be used by the archiver (e.g. Heritrix) to perform de-duplication
during a crawl, and also by the Wayback for storing historical
metadata.

The de-duplication strategy described in this document utilizes the
Wayback tools and CDX files as the central URL database for performing
NutchWAX de-duplication.
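To make the URL+digest identifier concrete, here is a minimal,
hypothetical sketch of forming such a key with SHA-1. It is not the
actual NutchWAX code, and it hex-encodes the digest for brevity; the
real tools emit a Base32 form such as
"sha1:HJG5ZWG3MQQKHIN43BXJY3FUWP7WTU43".

  import java.security.MessageDigest;

  public class UrlDigestKey
  {
    public static void main( String[] args ) throws Exception
    {
      String url     = "http://www.example.org/index.html";
      byte[] content = "Hello, welcome to my page.".getBytes( "UTF-8" );

      // Digest the page content with SHA-1.
      byte[] digest = MessageDigest.getInstance( "SHA-1" ).digest( content );

      // Hex-encode the digest bytes for display.
      StringBuilder hex = new StringBuilder();
      for ( byte b : digest )
        {
          hex.append( String.format( "%02x", b ) );
        }

      // The unique identifier is the URL plus the digest of its content.
      String key = url + " sha1:" + hex;

      System.out.println( key );
    }
  }

Two captures of the same URL with identical content produce the same
key; any change to the content produces a different digest and hence a
different key.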
======================================================================

Review
------

Our de-duplication strategy for NutchWAX as described so far has three
key elements:

 o Use URL+digest as a unique identifier for a unique version of a page.
 o Use ParallelIndexReader to provide index record modification/update.
 o Use Wayback and CDX files as a central database of URL processing state.

======================================================================

Using CDX files to detect duplicate pages in a set of archive files is
fortunately rather straightforward.

CDX files are text files with one line for each and every page
(record) in an archive file. These CDX lines have three bits of data
we can use for detecting duplicate pages:

 o URL
 o digest
 o date

NutchWAX provides a 'dedup-cdx' script that reads a CDX file and
produces a "duplicates" file containing the URL, digest and date of
each duplicate copy of a unique version of a URL in the CDX file.

For example, suppose we have a collection of 100 ARC files. In those
ARC files, the page

  http://www.example.org/index.html

appears 10 times, but only 5 of those copies are different; the other
5 are duplicates. Suppose we have

  Date        Digest  Content sample
  2007-10-01  abc123  Hello, welcome to my page.
  2007-10-02  abc123  Hello, welcome to my page.
  2007-10-03  def456  Sorry I haven't updated this in a while.
  2007-10-04  def456  Sorry I haven't updated this in a while.
  2007-10-05  abc123  Hello, welcome to my page.
  2007-10-06  abc123  Hello, welcome to my page.
  2007-10-07  ghi789  Hey, I finally updated this.
  2007-10-08  jkl012  Under construction.
  2007-10-09  jkl012  Under construction.
  2007-10-10  mno345  My homepage is great!

Notice how we started with the "abc123" version, changed to the
"def456" version, then reverted back to the "abc123" version. In this
simple example, we have a webmaster who just can't make up his mind on
what to say.

The point is that our CDX file will have lines of the form

  20071001 abc123 example.org/index.html
  20071002 abc123 example.org/index.html
  20071003 def456 example.org/index.html
  20071004 def456 example.org/index.html
  20071005 abc123 example.org/index.html
  20071006 abc123 example.org/index.html
  20071007 ghi789 example.org/index.html
  20071008 jkl012 example.org/index.html
  20071009 jkl012 example.org/index.html
  20071010 mno345 example.org/index.html

It's easy to find the duplicate lines in the CDX file.

The NutchWAX 'dedup-cdx' script will extract the duplicates, writing
out all the duplicate lines, except for the first. For the above, the
output is

  20071002 abc123 example.org/index.html
  20071004 def456 example.org/index.html
  20071005 abc123 example.org/index.html
  20071006 abc123 example.org/index.html
  20071009 jkl012 example.org/index.html

Only the 2nd, 3rd, etc. instances of a URL+digest line are printed.
The first instance of "abc123" is not printed, nor is "ghi789", since
it has no duplicates.

Now what do we do with these?

When importing archive files with NutchWAX, we pass it this list of
duplicates, which it uses as an exclusion list. Any URL+digest+date on
the list is excluded from import; all others pass through.

Looking at our CDX sample again

  Date      Digest  URL                     Import?
  20071001  abc123  example.org/index.html  Y
  20071002  abc123  example.org/index.html  N
  20071003  def456  example.org/index.html  Y
  20071004  def456  example.org/index.html  N
  20071005  abc123  example.org/index.html  N
  20071006  abc123  example.org/index.html  N
  20071007  ghi789  example.org/index.html  Y
  20071008  jkl012  example.org/index.html  Y
  20071009  jkl012  example.org/index.html  N
  20071010  mno345  example.org/index.html  Y

Excellent, we've just prevented duplicate copies of the same version
of a page from being imported!

======================================================================

But what about the fact that we crawled the page on those other 5
dates and it hadn't changed -- we want to record that somewhere,
right?

Yes.

NutchWAX provides an "add-dates" command (in the 'nutchwax' script)
for adding dates to an existing index by creating a parallel index for
it.

Using the "add-dates" command, we can add those crawl dates to the
index so that each unique version of the page will have all the crawl
dates associated with it. For our example above, the result is:

  Date       Digest  URL
  20071001,  abc123  example.org/index.html
  20071002,
  20071005,
  20071006

  20071003,  def456  example.org/index.html
  20071004

  20071007   ghi789  example.org/index.html

  20071008,  jkl012  example.org/index.html
  20071009

  20071010   mno345  example.org/index.html

Voila!

======================================================================

Recap
-----

By using CDX files and the NutchWAX tools we are able to de-duplicate
during import.

For example, for a list of ARC files:

  $ wayback/bin/arc-indexer foo.arc.gz > foo.cdx
  $ nutchwax/bin/dedup-cdx foo.cdx > foo.dup
  $ echo "foo.arc.gz" > manifest
  $ nutchwax/bin/nutchwax import -e foo.dup manifest
  $ nutchwax/bin/nutch updatedb crawldb -dir segments
  $ nutchwax/bin/nutch invertlinks linkdb -dir segments
  $ nutchwax/bin/nutch index indexes crawldb linkdb segments/*
  $ nutchwax/bin/nutchwax add-dates indexes/part-00000 indexes/part-00000 indexes/dates foo.dup

The important steps are the creation of the "foo.dup" file containing
the duplicate records, the use of that file to exclude duplicates
during import, and the use of that same file for adding the crawl
dates to the index.

======================================================================

Parallel Indexes

Since updating an existing Lucene index is not feasible, we "virtually
update" an index by using a modified version of the Lucene
ParallelIndexReader.

The basic idea is to take the metadata field you want to update and
put it in a parallel index. In DB table-speak, this would be moving a
column to a separate table and using the record index/position as the
foreign key to join the two tables.

The NutchWAX 'add-dates' command does this for the date metadata
field. It will take an existing index and create a parallel index,
adding dates listed in an external file.

The command-line syntax is of the form:

  nutchwax add-dates <key index> <source indices>... <dest index> <dates>

Suppose we have an index created by the Nutch "index" command and we
also have a list of crawl dates we want to add to it. The index is in
a sub-directory "indexes/part-00000" and the dates are in a file
"dates.txt"

  $ nutchwax add-dates indexes/part-00000 indexes/part-00000 indexes/dates dates.txt

In this case our key index and source index are the same, since we
want to preserve any dates in the original index and add the new dates
to them.
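At search time, the main index and the parallel "dates" index are
opened side by side and joined by document position. Here is a minimal
sketch of that idea using the stock Lucene ParallelReader; NutchWAX
actually uses its own modified reader, so this is only an illustration
of the concept. (The "Using Parallel Index" section below describes
how NutchWaxBean sets this up automatically.)

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.ParallelReader;
  import org.apache.lucene.search.IndexSearcher;

  public class OpenParallelIndexes
  {
    public static void main( String[] args ) throws Exception
    {
      // Open the main index and the parallel "dates" index. Both must
      // contain the same documents in the same order, so that document
      // position N in one lines up with document position N in the other.
      ParallelReader reader = new ParallelReader();
      reader.add( IndexReader.open( "indexes/part-00000" ) );
      reader.add( IndexReader.open( "indexes/dates" ) );

      // Searches against this searcher see the union of the fields from
      // both indexes, as if the dates had been written into the main index.
      IndexSearcher searcher = new IndexSearcher( reader );

      System.out.println( "documents: " + reader.numDocs() );

      searcher.close();
    }
  }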
Now suppose we've already done this once, but then have even more
dates to add, in a file "dates2.txt"

  $ nutchwax add-dates indexes/part-00000 indexes/dates indexes/dates2 dates2.txt
  $ rm -r indexes/dates
  $ mv indexes/dates2 indexes/dates

In this case, we copy the values from the existing "dates" index,
adding the new dates to them. Afterwards, we replace the old "dates"
index with the new, fully up-to-date one.

----------------------------------------------------------------------

Using Parallel Index

This is all well and good, but how do we make Nutch(WAX) use these
parallel indices?

NutchWAX provides a NutchWaxBean, which extends NutchBean by adding
support for parallel indices. The NutchWaxBean follows the NutchBean
conventions, looking for the indices in a directory named "crawl" or
in the directory specified by the "searcher.dir" configuration
property.

However, rather than looking for indices in "index" and "indexes",
NutchWaxBean looks in "pindexes". If that directory is found, it
iterates through all sub-directories and expects each to contain a set
of parallel indices within it. A sample directory structure might
look like:

  crawl/pindexes/foo
                   dates
                   main
                 bar
                   dates
                   main
                 baz
                   dates
                   main

where "dates" and "main" are parallel indexes.

----------------------------------------------------------------------

This is all fine and good when calling the NutchWaxBean from the
command-line, but what about in a webapp?

The NutchBean has a static method for self-initialization upon receipt
of an application startup message from the servlet container. We have
a similar hook in NutchWaxBean, which is run after the NutchBean is
initialized.

The NutchWaxBean hook must be added to the Nutch web.xml file:

  <listener>
    <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
    <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class>
  </listener>

If you don't do this, then the NutchBean won't use the
ParallelIndexReader and your parallel indices won't be used.

======================================================================

WARC + revisit records

The WARC format supports revisit records. Revisit records are
typically written by WARC writing tools (such as Heritrix) when a URL
is visited a second, third, etc. time and the content hasn't changed.

Taking our example from above, whenever the page is crawled and hasn't
changed, a revisit record would be written to the WARC file.

For de-duplication, WARC files with revisit records are nice because
the crawler is doing the duplicate detection for us. Rather than write
a duplicate copy of the page, it writes a record that has the

  URL
  digest
  date

of the visit. Now, if you look at the output of 'dedup-cdx' you'll
notice the similarity.

In fact, WARC revisit records can be used to create a list of
additional crawl dates without having to actually perform the full CDX
de-duplication (which can be computationally expensive).

A CDX file generated from a WARC will have the 9th field set to "-"
for revisit records. We can use this to easily find those lines and
generate a list of crawl dates for a URL+digest.

NutchWAX comes with a script called 'revisits' that does precisely
that. It takes CDX files as input, finds the lines for the revisit
records, then emits them in a form that can be used by the 'add-dates'
command.
For example

  $ wayback/bin/warc-indexer foo.warc.gz > foo.cdx
  $ nutchwax/bin/revisits foo.cdx > foo.dup
  $ nutchwax/bin/nutchwax add-dates indexes/part-00000 indexes/part-00000 indexes/dates foo.dup

Since the WARC files are known not to contain duplicates, we don't
have to de-dup them in order to provide the importing process with an
exclusion list. However, we still use the 'revisits' script to
generate a list of crawl dates for the revisit records so we can add
them to the parallel index.

======================================================================

Doesn't NutchWAX (0.10) already handle duplicates?

All this business about URL vs. URL+digest as a unique identifier for
a version of a page may seem surprising to some. Many users of
NutchWAX have been importing and indexing ARC files and haven't seen a
situation where a newer version of a URL over-writes an older one.

That is true; in certain circumstances different versions of a page
will peacefully co-exist in a Nutch deployment.

* The key is in the grouping of ARC files for importing. *

When I said that, by default, only one version of a URL can live in a
Nutch index, I was being a bit general. Actually, only one version of
a URL can live in a Nutch *segment*.

When a batch of ARC files is imported, a new segment is created. If
you are lucky, then ARC files containing duplicates will be imported
in different batches and the different versions of the same URL will
each live in a separate segment.

Consider the most extreme case, where a NutchWAX user imports ARC
files one at a time. The result would be a Nutch segment for each ARC
file. This would be nice because if there were 5 different versions
of a URL in 5 different ARC files, then there would be 5 segments,
each containing one of the 5 versions of the URL. No conflicts among
versions.

However, a one-segment-per-ARC plan is not practical since most
NutchWAX users have 1000s, 10000s, 100000s or more ARC files. Having
100000 segment directories on disk is simply not practical.

Most NutchWAX users import ARC files in groups that either correspond
to distinct crawls, or groups that are sized according to memory
and/or CPU limits.

We can't rely on good fortune to provide us with ARC file batches that
don't have multiple versions of a URL.

----------------------------------------------------------------------

The worst-case scenario is if all the ARC files for a single
collection are imported in one batch. In this case, they would all go
into a single Nutch segment and only 1 version of each URL would be
imported and indexed. All other versions would be discarded.

----------------------------------------------------------------------

If Nutch does this "automatic deduplication" by URL within each
segment, why does it have a "dedup" command?

That command is designed to operate on a set of segment indexes. The
segments are deduped internally automatically; the "dedup" command
removes duplicates across segments.

----------------------------------------------------------------------

The fact that a later version of a URL replaced an earlier one is not
always easy to notice just by performing searches against the
resulting index. One would have to know the contents of the pages
such that a query would be able to find one specific version -- or not
find it if it wasn't there.
And especially with large collections, if a version of a page is
missing from the search index, it could easily go unnoticed for quite
some time.


One way to test an existing index is to use CDX files in conjunction
with the NutchWAX 'dumpindex' command.

 o Generate a list of duplicate records from all the CDX files for the
   entire collection.

 o Using the Wayback, identify a URL that has many different
   versions. Choose a URL that will be indexed for full-text search,
   such as an HTML, text or PDF document; not an image.

 o Dump the entire Lucene index with NutchWAX 'dumpindex' and
   find all the records for the URL.

Chances are some of the versions of the URL will be in the index but
not all.


======================================================================

Is this all necessary?

No. If you don't want to de-duplicate ARC files during import and
indexing, you don't have to.

You can continue to perform the import, update, invert and index steps
like before and just live with the consequences of not de-duplicating.

If you don't de-duplicate, you will just have redundant records in
your search index. This means that you'll have a search result hit
for each copy of the page in the index. If you imported the same page
10 times, then a search query that finds that page will find all 10
copies and return 10 identical search results -- one for each copy.


In addition, the de-duplication feature and the add-dates feature with
the parallel index are independent of each other. You can
de-duplicate but decide not to use parallel indices to add dates to
the records in the Lucene index.

In this case, you would only have 1 date associated with each record:
the date the record was imported. Any information about subsequent
revisits to the same version of the page would not be in the search
index.


Also, if you have a system of your own devising that keeps track of
duplicates in archive files, have it output the duplicates file in the
same form as the 'dedup-cdx' script. The import command doesn't care
where the exclusion list comes from, just that it has the correct
format.


======================================================================

Implementation notes on URL+digest vs URL

Although the use of 'dedup-cdx' and associated tools for de-duping and
managing revisit dates is entirely optional and has no impact on
Nutch(WAX) if not used, one area of change in NutchWAX that does
impact Nutch is changing the unique 'key' for a document from URL to
URL+digest.

Without this change, you cannot have different versions of the same
URL in a Nutch segment. Such a limitation is simply incompatible with
NutchWAX and archive files. This change is not optional.

The core of the change from URL to URL+digest happens in the NutchWAX
Importer class. In that class the segment is created and all the
document-related information is added to it. When a document is added
to a segment, it is written to a Hadoop MapFile.

Hadoop MapFiles act like Java Maps. They are essentially key/value
pairs. In Nutch, the key is the URL and the value is a collection of
information for that URL.
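As a rough illustration of that key/value idea, here is a small,
hypothetical sketch using the Hadoop MapFile API of that era. It is
not the actual segment-writing code; the value is plain Text only to
keep the sketch short, whereas Nutch stores its own value types
(fetch, parse and crawl data).

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;

  public class MapFileSketch
  {
    public static void main( String[] args ) throws Exception
    {
      Configuration conf = new Configuration();
      FileSystem    fs   = FileSystem.getLocal( conf );

      // A MapFile is a sorted, on-disk map of key/value pairs.
      // In stock Nutch the key is the URL.
      MapFile.Writer writer =
        new MapFile.Writer( conf, fs, "example-mapfile", Text.class, Text.class );

      writer.append( new Text( "http://www.example.org/index.html" ),
                     new Text( "fetch/parse data would go here" ) );

      writer.close();
    }
  }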
In the Importer.java source code, where we add the information to the
segment, we use

  <URL> <digest>

as the key, such as

  "http://www.example.org/index.html sha1:HJG5ZWG3MQQKHIN43BXJY3FUWP7WTU43"

instead of simply

  "http://www.example.org/index.html"

We also stuff the URL into the document in a metadata field titled
"url", which we use later in our indexing filter plugin.

This is simple enough in the Importer code; it does, however, have a
few consequences elsewhere in Nutch. The places where it affects
Nutch are where Nutch assumes

  URL == key

There are two places in particular where this assumption causes a
problem because the URL is no longer the key.

1. BasicIndexingFilter (index-basic plugin)

In the call from Indexer.java to BasicIndexingFilter.java, the key is
treated as the URL:

  Indexer.java:

    doc = this.filters.filter(doc, parse, key, fetchDatum, inlinks);

  BasicIndexingFilter.java:

    public Document filter(Document doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {

      Text reprUrl = (Text) datum.getMetaData().get(Nutch.WRITABLE_REPR_URL_KEY);
      String reprUrlString = reprUrl != null ? reprUrl.toString() : null;
      String urlString = url.toString();

The Indexer passes the key; the BasicIndexingFilter treats it as the
URL.

Not only that, but the BasicIndexingFilter goes on to insert that
urlString into the Lucene document in the "url" field.

We work around this by configuring our NutchWAX indexing filter plugin
to run *after* the BasicIndexingFilter and over-write the "url" field
with the correct URL.

We do this by setting the Nutch configuration property (in
nutch-site.xml for example) with

  <property>
    <name>indexingfilter.order</name>
    <value>
      org.apache.nutch.indexer.basic.BasicIndexingFilter
      org.archive.nutchwax.index.ConfigurableIndexingFilter
    </value>
  </property>

Without this property, the indexing filters are run in an arbitrary
order. We need our ConfigurableIndexingFilter to run after the
BasicIndexingFilter.

The configuration for the ConfigurableIndexingFilter specifies that
the "url" field will be filled with the value from the "url" metadata
field (which we set in Importer.java, remember), over-writing any
previous value.


2. FetchedSegments

This class has a lovely little routine called "getUrl" which is used
*not* to get the URL per se; rather it gets the URL from a Lucene
document /in order to use it as a document key/.

Let's take a look:

  private Text getUrl(HitDetails details) {
    String url = details.getValue("orig");
    if (StringUtils.isBlank(url)) {
      url = details.getValue("url");
    }
    return new Text(url);
  }

The problem is that we've stored the true URL in the "url" field, so
the value returned is the true URL. Now when the code that calls this
method tries to use it as the key, it can't find the document since
the key is "URL digest".

Since this method is private and this code is rather deep inside of
Nutch, overriding it with a subclass isn't feasible.

But, if you notice, getUrl does come with a little oddity where it
first consults "orig" before "url". We don't use "orig" for anything,
so in our Importer, we set the "orig" metadata field to be the key.

This way, when getUrl calls

  String url = details.getValue("orig");

the key is found and everything is happy.

Yes, it's a hack. No, I'm not ashamed.
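For illustration, the two workarounds above boil down to setting a
pair of metadata fields in the Importer. The following is a
simplified, hypothetical sketch (only the "url" and "orig" field names
come from the discussion above), not the actual Importer.java code:

  import org.apache.nutch.metadata.Metadata;

  public class OrigFieldSketch
  {
    public static void main( String[] args )
    {
      String url = "http://www.example.org/index.html";
      String key = url + " sha1:HJG5ZWG3MQQKHIN43BXJY3FUWP7WTU43";

      Metadata metadata = new Metadata();

      // "url" carries the true URL; the ConfigurableIndexingFilter later
      // over-writes the Lucene "url" field with it.
      metadata.set( "url", url );

      // "orig" carries the URL+digest key, so FetchedSegments.getUrl()
      // returns the real document key when it consults "orig" first.
      metadata.set( "orig", key );

      System.out.println( metadata );
    }
  }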
======================================================================