From: <bi...@us...> - 2008-07-28 19:34:24
Revision: 2506
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2506&view=rev
Author:   binzino
Date:     2008-07-28 19:34:33 +0000 (Mon, 28 Jul 2008)

Log Message:
-----------
Corrected some type-o's.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/README-dedup.txt

Modified: trunk/archive-access/projects/nutchwax/archive/README-dedup.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README-dedup.txt    2008-07-28 19:29:16 UTC (rev 2505)
+++ trunk/archive-access/projects/nutchwax/archive/README-dedup.txt    2008-07-28 19:34:33 UTC (rev 2506)
@@ -29,7 +29,7 @@
 ======================================================================

 This sounds simple enough, but in practice the implementation is not
-as straightfoward as suggested by the above.
+as straightforward as suggested by the above.

 For one, Nutch's underlying Lucene search indexes are not easily
 modified "in place". That is, updating an existing record by adding
@@ -96,7 +96,7 @@

 To prevent the importing of multiple copies of the same version of a
 page, we could get the URL+digest of the page to be imported, then
-look in the existing Nutch index to see if we alread have it. If we
+look in the existing Nutch index to see if we already have it. If we
 do, do not import it, instead add the crawl date to the existing
 record in the search index.

@@ -109,7 +109,7 @@
 have a solution, which we'll describe in more detail later.

 The first doesn't seem challenging at first and in theory it isn't.
-However, in practice it is difficult becuase for a a large deployment,
+However, in practice it is difficult because for a a large deployment,
 we usually have many Lucene indexes spread over many machines. It's
 not as simple as opening up a single Lucene index on the local machine
 and searching for a matching URL+digest. In one of the deployments at
@@ -192,7 +192,7 @@
 simple example, we have an webmaster who just can't make up his mind
 on what to say.

-Thep point is that our CDX file will have lines of the form
+The point is that our CDX file will have lines of the form

   20071001 abc123 example.org/index.html
   20071002 abc123 example.org/index.html
@@ -376,7 +376,7 @@
 This is all fine and good when calling the NutchWaxBean from the
 command-line, but what about in a webapp?

-The NutchBean has a static method for self-initialization upon recipt
+The NutchBean has a static method for self-initialization upon receipt
 of a application startup message from the servlet container. We have
 a similar hook in NutchWaxBean, which is run after the NutchBean is
 initialized.
@@ -402,7 +402,7 @@
 Taking our example from above, whenever the page is crawled and
 hasn't changed, a revisit record would be written to the WARC file.

-For de-duplication, WARC files with revisit records are nice becuase
+For de-duplication, WARC files with revisit records are nice because
 the crawler is doing the duplicate detection for us.

 Rather than write a duplicate copy of the page, it writes a record that has
@@ -539,7 +539,7 @@
 your search index. This means that you'll have a search result hit
 for each copy of the page in the index. If you imported the same page
 10 times, then a search query that finds that page will find all 10
-copies and return 10 identical search results -- one for eaach copy.
+copies and return 10 identical search results -- one for each copy.


 In addition, the de-duplication feature and the add-dates feature with
@@ -548,7 +548,7 @@
 the records in the Lucene index.

 In this case, you would only have 1 date associated with each record:
-the date the record was imorted. Any information about subsequent
+the date the record was imported. Any information about subsequent
 revisits to the same version of the page would not be in the search
 index.

@@ -577,7 +577,7 @@
 The core of the change from URL to URL+digest happens in the NutchWAX
 Indexer class. In that class the segment is created and all the
 document-related information is added to it. When a document is added
-to a segment, it is written to a Haddop MapFile.
+to a segment, it is written to a Hadoop MapFile.

 Hadoop MapFiles act like Java Maps. They are essentially key/value
 pairs. In Nutch, the key is the URL and the value is a collection of
@@ -632,7 +632,7 @@
 Not only that, but the BasicIndexingFilter goes on to insert that
 urlString into the Lucene document in the "url" field.

-We work around this by configuring our NutchWAX indexin filter plugin
+We work around this by configuring our NutchWAX indexing filter plugin
 to run *after* the BasicIndexingFilter and over-write the "url" field
 with the correct URL.
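As an illustration of the CDX-based de-duplication idea described in the README excerpt above, the sketch below collapses CDX-style "date digest url" lines so that each unique URL+digest pair is kept once, with all of its capture dates collected together. This is not part of NutchWAX or of this commit; the class name and the three-column input format are assumptions for the example.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.LinkedHashMap;
  import java.util.Map;
  import java.util.TreeSet;

  // Reads CDX-style lines of the form "date digest url" and keeps one
  // entry per unique url+digest pair, collecting every capture date for
  // that pair -- the same collapse-and-keep-the-dates idea described in
  // the excerpt, independent of the NutchWAX code itself.
  public class CdxDedupSketch {

    public static void main( String[] args ) throws IOException {
      // Key is "url digest"; value is the sorted set of dates seen for it.
      Map<String, TreeSet<String>> seen =
        new LinkedHashMap<String, TreeSet<String>>();

      BufferedReader reader = new BufferedReader( new FileReader( args[0] ) );
      String line;
      while ( (line = reader.readLine()) != null ) {
        String[] fields = line.trim().split( "\\s+" );
        if ( fields.length < 3 ) continue;   // skip malformed lines

        String date   = fields[0];
        String digest = fields[1];
        String url    = fields[2];

        String key = url + " " + digest;
        TreeSet<String> dates = seen.get( key );
        if ( dates == null ) {
          dates = new TreeSet<String>();
          seen.put( key, dates );
        }
        dates.add( date );                   // a revisit only adds a date
      }
      reader.close();

      // One output line per unique url+digest, with all of its dates.
      for ( Map.Entry<String, TreeSet<String>> e : seen.entrySet() ) {
        System.out.println( e.getKey() + " " + e.getValue() );
      }
    }
  }

Fed the two example lines from the excerpt, this would print a single entry for example.org/index.html with digest abc123 and both dates, rather than two duplicate records.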