From: <bi...@us...> - 2008-07-28 19:34:24
Revision: 2506
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2506&view=rev
Author:   binzino
Date:     2008-07-28 19:34:33 +0000 (Mon, 28 Jul 2008)

Log Message:
-----------
Corrected some type-o's.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/README-dedup.txt

Modified: trunk/archive-access/projects/nutchwax/archive/README-dedup.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README-dedup.txt    2008-07-28 19:29:16 UTC (rev 2505)
+++ trunk/archive-access/projects/nutchwax/archive/README-dedup.txt    2008-07-28 19:34:33 UTC (rev 2506)
@@ -29,7 +29,7 @@
 ======================================================================

 This sounds simple enough, but in practice the implementation is not
-as straightfoward as suggested by the above.
+as straightforward as suggested by the above.

 For one, Nutch's underlying Lucene search indexes are not easily
 modified "in place". That is, updating an existing record by adding
@@ -96,7 +96,7 @@

 To prevent the importing of multiple copies of the same version of a
 page, we could get the URL+digest of the page to be imported, then
-look in the existing Nutch index to see if we alread have it. If we
+look in the existing Nutch index to see if we already have it. If we
 do, do not import it, instead add the crawl date to the existing
 record in the search index.

@@ -109,7 +109,7 @@
 have a solution, which we'll describe in more detail later.

 The first doesn't seem challenging at first and in theory it isn't.
-However, in practice it is difficult becuase for a a large deployment,
+However, in practice it is difficult because for a a large deployment,
 we usually have many Lucene indexes spread over many machines. It's
 not as simple as opening up a single Lucene index on the local machine
 and searching for a matching URL+digest. In one of the deployments at
@@ -192,7 +192,7 @@
 simple example, we have an webmaster who just can't make up his mind
 on what to say.

-Thep point is that our CDX file will have lines of the form
+The point is that our CDX file will have lines of the form

   20071001 abc123 example.org/index.html
   20071002 abc123 example.org/index.html
@@ -376,7 +376,7 @@
 This is all fine and good when calling the NutchWaxBean from the
 command-line, but what about in a webapp?

-The NutchBean has a static method for self-initialization upon recipt
+The NutchBean has a static method for self-initialization upon receipt
 of a application startup message from the servlet container. We have
 a similar hook in NutchWaxBean, which is run after the NutchBean is
 initialized.
@@ -402,7 +402,7 @@
 Taking our example from above, whenever the page is crawled and
 hasn't changed, a revisit record would be written to the WARC file.

-For de-duplication, WARC files with revisit records are nice becuase
+For de-duplication, WARC files with revisit records are nice because
 the crawler is doing the duplicate detection for us.

 Rather than write a duplicate copy of the page, it writes a record that has
@@ -539,7 +539,7 @@
 your search index. This means that you'll have a search result hit
 for each copy of the page in the index. If you imported the same page
 10 times, then a search query that finds that page will find all 10
-copies and return 10 identical search results -- one for eaach copy.
+copies and return 10 identical search results -- one for each copy.


 In addition, the de-duplication feature and the add-dates feature with
@@ -548,7 +548,7 @@
 the records in the Lucene index.

 In this case, you would only have 1 date associated with each record:
-the date the record was imorted. Any information about subsequent
+the date the record was imported. Any information about subsequent
 revisits to the same version of the page would not be in the search
 index.

@@ -577,7 +577,7 @@
 The core of the change from URL to URL+digest happens in the NutchWAX
 Indexer class. In that class the segment is created and all the
 document-related information is added to it. When a document is added
-to a segment, it is written to a Haddop MapFile.
+to a segment, it is written to a Hadoop MapFile.

 Hadoop MapFiles act like Java Maps. They are essentially key/value
 pairs. In Nutch, the key is the URL and the value is a collection of
@@ -632,7 +632,7 @@
 Not only that, but the BasicIndexingFilter goes on to insert that
 urlString into the Lucene document in the "url" field.

-We work around this by configuring our NutchWAX indexin filter plugin
+We work around this by configuring our NutchWAX indexing filter plugin
 to run *after* the BasicIndexingFilter and over-write the "url" field
 with the correct URL.
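As an illustration of the CDX-based de-duplication idea described in the README excerpt above, the sketch below collapses CDX-style "date digest url" lines so that each unique URL+digest pair is kept once, with all of its capture dates collected together. This is not part of NutchWAX or of this commit; the class name and the three-column input format are assumptions for the example.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.LinkedHashMap;
  import java.util.Map;
  import java.util.TreeSet;

  // Reads CDX-style lines of the form "date digest url" and keeps one
  // entry per unique url+digest pair, collecting every capture date for
  // that pair -- the same collapse-and-keep-the-dates idea described in
  // the excerpt, independent of the NutchWAX code itself.
  public class CdxDedupSketch {

    public static void main( String[] args ) throws IOException {
      // Key is "url digest"; value is the sorted set of dates seen for it.
      Map<String, TreeSet<String>> seen =
        new LinkedHashMap<String, TreeSet<String>>();

      BufferedReader reader = new BufferedReader( new FileReader( args[0] ) );
      String line;
      while ( (line = reader.readLine()) != null ) {
        String[] fields = line.trim().split( "\\s+" );
        if ( fields.length < 3 ) continue;   // skip malformed lines

        String date   = fields[0];
        String digest = fields[1];
        String url    = fields[2];

        String key = url + " " + digest;
        TreeSet<String> dates = seen.get( key );
        if ( dates == null ) {
          dates = new TreeSet<String>();
          seen.put( key, dates );
        }
        dates.add( date );                   // a revisit only adds a date
      }
      reader.close();

      // One output line per unique url+digest, with all of its dates.
      for ( Map.Entry<String, TreeSet<String>> e : seen.entrySet() ) {
        System.out.println( e.getKey() + " " + e.getValue() );
      }
    }
  }

Fed the two example lines from the excerpt, this would print a single entry for example.org/index.html with digest abc123 and both dates, rather than two duplicate records.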