From: <bi...@us...> - 2008-07-03 02:01:39
Revision: 2396
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2396&view=rev
Author:   binzino
Date:     2008-07-02 19:01:46 -0700 (Wed, 02 Jul 2008)

Log Message:
-----------
Initial revision. Very rough draft.

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/README-dedup.txt

Added: trunk/archive-access/projects/nutchwax/archive/README-dedup.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README-dedup.txt        (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/README-dedup.txt        2008-07-03 02:01:46 UTC (rev 2396)
@@ -0,0 +1,697 @@

README-dedup.txt
2008-07-02
Aaron Binns

De-duplication and NutchWAX

This document assumes that the reader is familiar with the topic of
de-duplication with regard to archiving web data. That said, let us
review what we mean by de-duplication in NutchWAX.

When archive files (ARC/WARC) are written, the tool used to create
them may or may not prevent multiple copies of the same URL from being
written. Some archive file creation tools perform duplicate
prevention, but many do not.

What NutchWAX has to contend with is the scenario where the archive
files being imported and indexed contain multiple copies of a URL.

Ideally, NutchWAX would import and index only one unique version of
the URL. If the same version of the URL were seen a second, third,
fourth, etc. time, then NutchWAX would simply update the existing
record in its search index by adding the subsequent crawl dates to it.
This way, if a URL was crawled 10 times and didn't change, there would
be only one entry in the search index for it, but with 10 crawl dates
associated with it.

======================================================================

This sounds simple enough, but in practice the implementation is not
as straightforward as the above suggests.

For one, Nutch's underlying Lucene search indexes are not easily
modified "in place". That is, updating an existing record by adding
an additional date to it is not easily accomplished via the Lucene
public APIs. The Lucene documentation informs us that records are not
modified in place, but rather are deleted and re-added with the
modified/new information.

Doing a complete delete+re-add on a large Lucene database containing
possibly millions of records is a computationally expensive process.
Furthermore, since many fields in Nutch's Lucene indexes are not
stored, it is infeasible to delete and re-add them w/o data loss.

Fortunately, using parallel Lucene indexes and the ParallelIndexReader
can help solve the problem. More on that later.

======================================================================

Another challenge in handling duplicates is defining what makes for a
unique version of a URL.

Most tools, Nutch included, use the URL as a unique identifier for a
page. Since most tools don't care about old versions of pages,
retaining only the latest version and using the URL to identify it is
sufficient.

However, for archive data, we need to use more than just the URL to
identify a page; we need something that has the URL but also some
notion of the *version* of the page.

For example, consider a page like

  http://www.cnn.com/index.html

This page changes frequently. If it were crawled 10 times, once per
week, each crawl could capture a different version of the page. We
would have 10 different, unique versions.
Now, if NutchWAX used *only* the URL as the unique identifier for the
page, there would be no way to distinguish the first version from the
second, from the third, etc.

NutchWAX needs a unique identifier that has the URL and also some
notion of the *version* of the page. For that we use a digest of the
page's content. The digest is used as a version number of sorts. Each
version will have a different digest. So, if we need to find a
specific version of the page, we can use the URL combined with the
digest to uniquely identify it.

Currently we use SHA-1 for digesting the content.

Using URL+digest rather than just the URL as a unique page identifier
is conceptually simple, but it does have some repercussions within
Nutch. Nutch assumes that the URL alone is a unique identifier, and
that assumption is coded into the software in various ways. To use
URL+digest instead, we had to work around some of those hard-coded
assumptions. More on that later.

======================================================================

The next challenge is to know whether a version of a URL (the
URL+digest described above) has already been imported and indexed so
that we don't import it again.

To prevent the importing of multiple copies of the same version of a
page, we could get the URL+digest of the page to be imported, then
look in the existing Nutch index to see if we already have it. If we
do, we do not import it; instead, we add the crawl date to the
existing record in the search index.

Now, the above describes two challenges:

 1. Searching the existing index to see if there is an existing record
    to be updated.

 2. Updating an existing record. This was discussed above and we do
    have a solution, which we'll describe in more detail later.

The first doesn't seem challenging, and in theory it isn't. However,
in practice it is difficult because in a large deployment we usually
have many Lucene indexes spread over many machines. It's not as
simple as opening up a single Lucene index on the local machine and
searching for a matching URL+digest. In one of the deployments at the
Internet Archive, we have 100s of Lucene indexes spread over 5
machines.

Now, we could use the Nutch web search rather than accessing the
Lucene indexes directly. That is, to find out if we have already
indexed a URL+digest, we could send an HTTP request to the Nutch
search server asking whether the URL+digest is already in the index.

Although this is a workable solution, performing a search for each and
every URL being imported would likely put too much strain on the
search server and would slow down the importing process. When some
import & index jobs process 100s of millions of documents and take
weeks to run, adding a 5-second HTTP request to each URL import is a
significant cost.

What would be ideal is a centralized database of all the URLs
processed by NutchWAX. Ideally, this centralized database would also
be used by the archiver (e.g. Heritrix) to perform de-duplication
during a crawl, and also by the Wayback for storing historical
metadata.

The de-duplication strategy described in this document utilizes the
Wayback tools and CDX files as the central URL database for performing
NutchWAX de-duplication.
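To make the URL+digest identifier concrete, here is a minimal,
hypothetical sketch of forming such a key with SHA-1. It is not the
actual NutchWAX code, and it hex-encodes the digest for brevity; the
real tools emit a Base32 form such as
"sha1:HJG5ZWG3MQQKHIN43BXJY3FUWP7WTU43".

  import java.security.MessageDigest;

  public class UrlDigestKey
  {
    public static void main( String[] args ) throws Exception
    {
      String url     = "http://www.example.org/index.html";
      byte[] content = "Hello, welcome to my page.".getBytes( "UTF-8" );

      // Digest the page content with SHA-1.
      byte[] digest = MessageDigest.getInstance( "SHA-1" ).digest( content );

      // Hex-encode the digest bytes for display.
      StringBuilder hex = new StringBuilder();
      for ( byte b : digest )
        {
          hex.append( String.format( "%02x", b ) );
        }

      // The unique identifier is the URL plus the digest of its content.
      String key = url + " sha1:" + hex;

      System.out.println( key );
    }
  }

Two captures of the same URL with identical content produce the same
key; any change to the content produces a different digest and hence a
different key.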
======================================================================

Review
------

Our de-duplication strategy for NutchWAX as described so far has three
key elements:

 o Use URL+digest as a unique identifier for a unique version of a page.
 o Use ParallelIndexReader to provide index record modification/update.
 o Use Wayback and CDX files as a central database of URL processing state.

======================================================================

Using CDX files to detect duplicate pages in a set of archive files is
fortunately rather straightforward.

CDX files are text files with one line for each and every page
(record) in an archive file. These CDX lines have three bits of data
we can use for detecting duplicate pages:

 o URL
 o digest
 o date

NutchWAX provides a 'dedup-cdx' script that reads a CDX file and
produces a "duplicates" file containing the URL, digest and date of
each duplicate copy of a unique version of a URL in the CDX file.

For example, suppose we have a collection of 100 ARC files. In those
ARC files, the page

  http://www.example.org/index.html

appears 10 times, but only 5 of those copies are different; the other
5 are duplicates. Suppose we have

  Date        Digest  Content sample
  2007-10-01  abc123  Hello, welcome to my page.
  2007-10-02  abc123  Hello, welcome to my page.
  2007-10-03  def456  Sorry I haven't updated this in a while.
  2007-10-04  def456  Sorry I haven't updated this in a while.
  2007-10-05  abc123  Hello, welcome to my page.
  2007-10-06  abc123  Hello, welcome to my page.
  2007-10-07  ghi789  Hey, I finally updated this.
  2007-10-08  jkl012  Under construction.
  2007-10-09  jkl012  Under construction.
  2007-10-10  mno345  My homepage is great!

Notice how we started with the "abc123" version, changed to the
"def456" version, then reverted back to the "abc123" version. In this
simple example, we have a webmaster who just can't make up his mind on
what to say.

The point is that our CDX file will have lines of the form

  20071001 abc123 example.org/index.html
  20071002 abc123 example.org/index.html
  20071003 def456 example.org/index.html
  20071004 def456 example.org/index.html
  20071005 abc123 example.org/index.html
  20071006 abc123 example.org/index.html
  20071007 ghi789 example.org/index.html
  20071008 jkl012 example.org/index.html
  20071009 jkl012 example.org/index.html
  20071010 mno345 example.org/index.html

It's easy to find the duplicate lines in the CDX file.

The NutchWAX 'dedup-cdx' script will extract the duplicates, writing
out all the duplicate lines, except for the first. For the above, the
output is

  20071002 abc123 example.org/index.html
  20071004 def456 example.org/index.html
  20071005 abc123 example.org/index.html
  20071006 abc123 example.org/index.html
  20071009 jkl012 example.org/index.html

Only the 2nd, 3rd, etc. instances of a URL+digest line are printed.
The first instance of "abc123" is not printed, nor is "ghi789", since
it has no duplicates.

Now what do we do with these?

When importing archive files with NutchWAX, we pass it this list of
duplicates, which it uses as an exclusion list. Any URL+digest+date on
the list is excluded from import; all others pass through.

Looking at our CDX sample again

  Date      Digest  URL                     Import?
  20071001  abc123  example.org/index.html  Y
  20071002  abc123  example.org/index.html  N
  20071003  def456  example.org/index.html  Y
  20071004  def456  example.org/index.html  N
  20071005  abc123  example.org/index.html  N
  20071006  abc123  example.org/index.html  N
  20071007  ghi789  example.org/index.html  Y
  20071008  jkl012  example.org/index.html  Y
  20071009  jkl012  example.org/index.html  N
  20071010  mno345  example.org/index.html  Y

Excellent, we've just prevented duplicate copies of the same version
of a page from being imported!

======================================================================

But what about the fact that we crawled the page on those other 5
dates and it hadn't changed -- we want to record that somewhere,
right?

Yes.

NutchWAX provides an "add-dates" command (in the 'nutchwax' script)
for adding dates to an existing index by creating a parallel index for
it.

Using the "add-dates" command, we can add those crawl dates to the
index so that each unique version of the page will have all the crawl
dates associated with it. For our example above, the result is:

  Date       Digest  URL
  20071001,  abc123  example.org/index.html
  20071002,
  20071005,
  20071006

  20071003,  def456  example.org/index.html
  20071004

  20071007   ghi789  example.org/index.html

  20071008,  jkl012  example.org/index.html
  20071009

  20071010   mno345  example.org/index.html

Voila!

======================================================================

Recap
-----

By using CDX files and the NutchWAX tools we are able to de-duplicate
during import.

For example, for a list of ARC files:

  $ wayback/bin/arc-indexer foo.arc.gz > foo.cdx
  $ nutchwax/bin/dedup-cdx foo.cdx > foo.dup
  $ echo "foo.arc.gz" > manifest
  $ nutchwax/bin/nutchwax import -e foo.dup manifest
  $ nutchwax/bin/nutch updatedb crawldb -dir segments
  $ nutchwax/bin/nutch invertlinks linkdb -dir segments
  $ nutchwax/bin/nutch index indexes crawldb linkdb segments/*
  $ nutchwax/bin/nutchwax add-dates indexes/part-00000 indexes/part-00000 indexes/dates foo.dup

The important steps are the creation of the "foo.dup" file containing
the duplicate records, the use of that file to exclude duplicates
during import, and the use of that same file for adding the crawl
dates to the index.

======================================================================

Parallel Indexes

Since updating an existing Lucene index is not feasible, we "virtually
update" an index by using a modified version of the Lucene
ParallelIndexReader.

The basic idea is to take the metadata field you want to update and
put it in a parallel index. In DB table-speak, this would be moving a
column to a separate table and using the record index/position as the
foreign key to join the two tables.

The NutchWAX 'add-dates' command does this for the date metadata
field. It will take an existing index and create a parallel index,
adding dates listed in an external file.

The command-line syntax is of the form:

  nutchwax add-dates <key index> <source indices>... <dest index> <dates>

Suppose we have an index created by the Nutch "index" command and we
also have a list of crawl dates we want to add to it. The index is in
a sub-directory "indexes/part-00000" and the dates are in a file
"dates.txt"

  $ nutchwax add-dates indexes/part-00000 indexes/part-00000 indexes/dates dates.txt

In this case our key index and source index are the same, since we
want to preserve any dates in the original index and add the new dates
to them.
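At search time, the main index and the parallel "dates" index are
opened side by side and joined by document position. Here is a minimal
sketch of that idea using the stock Lucene ParallelReader; NutchWAX
actually uses its own modified reader, so this is only an illustration
of the concept. (The "Using Parallel Index" section below describes
how NutchWaxBean sets this up automatically.)

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.ParallelReader;
  import org.apache.lucene.search.IndexSearcher;

  public class OpenParallelIndexes
  {
    public static void main( String[] args ) throws Exception
    {
      // Open the main index and the parallel "dates" index. Both must
      // contain the same documents in the same order, so that document
      // position N in one lines up with document position N in the other.
      ParallelReader reader = new ParallelReader();
      reader.add( IndexReader.open( "indexes/part-00000" ) );
      reader.add( IndexReader.open( "indexes/dates" ) );

      // Searches against this searcher see the union of the fields from
      // both indexes, as if the dates had been written into the main index.
      IndexSearcher searcher = new IndexSearcher( reader );

      System.out.println( "documents: " + reader.numDocs() );

      searcher.close();
    }
  }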
Now suppose we've already done this once, but then have even more
dates to add, in a file "dates2.txt"

  $ nutchwax add-dates indexes/part-00000 indexes/dates indexes/dates2 dates2.txt
  $ rm -r indexes/dates
  $ mv indexes/dates2 indexes/dates

In this case, we copy the values from the existing "dates" index,
adding the new dates to them. Afterwards, we replace the old "dates"
index with the new, fully up-to-date one.

----------------------------------------------------------------------

Using Parallel Index

This is all well and good, but how do we make Nutch(WAX) use these
parallel indices?

NutchWAX provides a NutchWaxBean, which extends NutchBean by adding
support for parallel indices. The NutchWaxBean follows the NutchBean
conventions, looking for the indices in a directory named "crawl" or
in the directory specified by the "searcher.dir" configuration
property.

However, rather than looking for indices in "index" and "indexes",
NutchWaxBean looks in "pindexes". If that directory is found, it
iterates through all sub-directories and expects each to contain a set
of parallel indices within it. A sample directory structure might
look like:

  crawl/pindexes/foo
                   dates
                   main
                 bar
                   dates
                   main
                 baz
                   dates
                   main

where "dates" and "main" are parallel indexes.

----------------------------------------------------------------------

This is all fine and good when calling the NutchWaxBean from the
command-line, but what about in a webapp?

The NutchBean has a static method for self-initialization upon receipt
of an application startup message from the servlet container. We have
a similar hook in NutchWaxBean, which is run after the NutchBean is
initialized.

The NutchWaxBean hook must be added to the Nutch web.xml file:

  <listener>
    <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
    <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class>
  </listener>

If you don't do this, then the NutchBean won't use the
ParallelIndexReader and your parallel indices won't be used.

======================================================================

WARC + revisit records

The WARC format supports revisit records. Revisit records are
typically written by WARC writing tools (such as Heritrix) when a URL
is visited a second, third, etc. time and the content hasn't changed.

Taking our example from above, whenever the page is crawled and hasn't
changed, a revisit record would be written to the WARC file.

For de-duplication, WARC files with revisit records are nice because
the crawler is doing the duplicate detection for us. Rather than write
a duplicate copy of the page, it writes a record that has the

  URL
  digest
  date

of the visit. Now, if you look at the output of 'dedup-cdx' you'll
notice the similarity.

In fact, WARC revisit records can be used to create a list of
additional crawl dates without having to actually perform the full CDX
de-duplication (which can be computationally expensive).

A CDX file generated from a WARC will have the 9th field set to "-"
for revisit records. We can use this to easily find those lines and
generate a list of crawl dates for a URL+digest.

NutchWAX comes with a script called 'revisits' that does precisely
that. It takes CDX files as input, finds the lines for the revisit
records, then emits them in a form that can be used by the 'add-dates'
command.
For example

  $ wayback/bin/warc-indexer foo.warc.gz > foo.cdx
  $ nutchwax/bin/revisits foo.cdx > foo.dup
  $ nutchwax/bin/nutchwax add-dates indexes/part-00000 indexes/part-00000 indexes/dates foo.dup

Since the WARC files are known not to contain duplicates, we don't
have to de-dup them in order to provide the importing process with an
exclusion list. However, we still use the 'revisits' script to
generate a list of crawl dates for the revisit records so we can add
them to the parallel index.

======================================================================

Doesn't NutchWAX (0.10) already handle duplicates?

All this business about URL vs. URL+digest as a unique identifier for
a version of a page may seem surprising to some. Many users of
NutchWAX have been importing and indexing ARC files and haven't seen a
situation where a newer version of a URL over-writes an older one.

That is true; in certain circumstances different versions of a page
will peacefully co-exist in a Nutch deployment.

* The key is in the grouping of ARC files for importing. *

When I said that, by default, only one version of a URL can live in a
Nutch index, I was being a bit general. Actually, only one version of
a URL can live in a Nutch *segment*.

When a batch of ARC files is imported, a new segment is created. If
you are lucky, then ARC files containing duplicates will be imported
in different batches and the different versions of the same URL will
each live in a separate segment.

Consider the most extreme case, where a NutchWAX user imports ARC
files one at a time. The result would be a Nutch segment for each ARC
file. This would be nice because if there were 5 different versions
of a URL in 5 different ARC files, then there would be 5 segments,
each containing one of the 5 versions of the URL. No conflicts among
versions.

However, a one-segment-per-ARC plan is not practical since most
NutchWAX users have 1000s, 10000s, 100000s or more ARC files. Having
100000 segment directories on disk is simply not practical.

Most NutchWAX users import ARC files in groups that either correspond
to distinct crawls, or groups that are sized according to memory
and/or CPU limits.

We can't rely on good fortune to provide us with ARC file batches that
don't have multiple versions of a URL.

----------------------------------------------------------------------

The worst-case scenario is if all the ARC files for a single
collection are imported in one batch. In this case, they would all go
into a single Nutch segment and only 1 version of each URL would be
imported and indexed. All other versions would be discarded.

----------------------------------------------------------------------

If Nutch does this "automatic deduplication" by URL within each
segment, why does it have a "dedup" command?

That command is designed to operate on a set of segment indexes. The
segments are deduped internally automatically; the "dedup" command
removes duplicates across segments.

----------------------------------------------------------------------

The fact that a later version of a URL replaced an earlier one is not
always easy to notice just by performing searches against the
resulting index. One would have to know the contents of the pages
such that a query would be able to find one specific version -- or not
find it if it wasn't there.
And especially with large collections, if a version of a page is
missing from the search index, it could easily go unnoticed for quite
some time.


One way to test an existing index is to use CDX files in conjunction
with the NutchWAX 'dumpindex' command.

 o Generate a list of duplicate records from all the CDX files for the
   entire collection.

 o Using the Wayback, identify a URL that has many different
   versions. Choose a URL that will be indexed for full-text search,
   such as an HTML, text or PDF document; not an image.

 o Dump the entire Lucene index with NutchWAX 'dumpindex' and
   find all the records for the URL.

Chances are some of the versions of the URL will be in the index but
not all.


======================================================================

Is this all necessary?

No. If you don't want to de-duplicate ARC files during import and
indexing, you don't have to.

You can continue to perform the import, update, invert and index steps
like before and just live with the consequences of not de-duplicating.

If you don't de-duplicate, you will just have redundant records in
your search index. This means that you'll have a search result hit
for each copy of the page in the index. If you imported the same page
10 times, then a search query that finds that page will find all 10
copies and return 10 identical search results -- one for each copy.


In addition, the de-duplication feature and the add-dates feature with
the parallel index are independent of each other. You can
de-duplicate but decide not to use parallel indices to add dates to
the records in the Lucene index.

In this case, you would only have 1 date associated with each record:
the date the record was imported. Any information about subsequent
revisits to the same version of the page would not be in the search
index.


Also, if you have a system of your own devising that keeps track of
duplicates in archive files, have it output the duplicates file in the
same form as the 'dedup-cdx' script. The import command doesn't care
where the exclusion list comes from, just that it has the correct
format.


======================================================================

Implementation notes on URL+digest vs URL

Although the use of 'dedup-cdx' and associated tools for de-duping and
managing revisit dates is entirely optional and has no impact on
Nutch(WAX) if not used, one area of change in NutchWAX that does
impact Nutch is changing the unique 'key' for a document from URL to
URL+digest.

Without this change, you cannot have different versions of the same
URL in a Nutch segment. Such a limitation is simply incompatible with
NutchWAX and archive files. This change is not optional.

The core of the change from URL to URL+digest happens in the NutchWAX
Importer class. In that class the segment is created and all the
document-related information is added to it. When a document is added
to a segment, it is written to a Hadoop MapFile.

Hadoop MapFiles act like Java Maps. They are essentially key/value
pairs. In Nutch, the key is the URL and the value is a collection of
information for that URL.
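As a rough illustration of that key/value idea, here is a small,
hypothetical sketch using the Hadoop MapFile API of that era. It is
not the actual segment-writing code; the value is plain Text only to
keep the sketch short, whereas Nutch stores its own value types
(fetch, parse and crawl data).

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;

  public class MapFileSketch
  {
    public static void main( String[] args ) throws Exception
    {
      Configuration conf = new Configuration();
      FileSystem    fs   = FileSystem.getLocal( conf );

      // A MapFile is a sorted, on-disk map of key/value pairs.
      // In stock Nutch the key is the URL.
      MapFile.Writer writer =
        new MapFile.Writer( conf, fs, "example-mapfile", Text.class, Text.class );

      writer.append( new Text( "http://www.example.org/index.html" ),
                     new Text( "fetch/parse data would go here" ) );

      writer.close();
    }
  }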
In the Importer.java source code, where we add the information to the
segment, we use

  <URL> <digest>

as the key, such as

  "http://www.example.org/index.html sha1:HJG5ZWG3MQQKHIN43BXJY3FUWP7WTU43"

instead of simply

  "http://www.example.org/index.html"

We also stuff the URL into the document in a metadata field titled
"url", which we use later in our indexing filter plugin.

This is simple enough in the Importer code; it does, however, have a
few consequences elsewhere in Nutch. The places where it affects
Nutch are where Nutch assumes

  URL == key

There are two places in particular where this assumption causes a
problem because the URL is no longer the key.

1. BasicIndexingFilter (index-basic plugin)

In the call from Indexer.java to BasicIndexingFilter.java, the key is
treated as the URL:

  Indexer.java:

    doc = this.filters.filter(doc, parse, key, fetchDatum, inlinks);

  BasicIndexingFilter.java:

    public Document filter(Document doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {

      Text reprUrl = (Text) datum.getMetaData().get(Nutch.WRITABLE_REPR_URL_KEY);
      String reprUrlString = reprUrl != null ? reprUrl.toString() : null;
      String urlString = url.toString();

The Indexer passes the key; the BasicIndexingFilter treats it as the
URL.

Not only that, but the BasicIndexingFilter goes on to insert that
urlString into the Lucene document in the "url" field.

We work around this by configuring our NutchWAX indexing filter plugin
to run *after* the BasicIndexingFilter and over-write the "url" field
with the correct URL.

We do this by setting the Nutch configuration property (in
nutch-site.xml for example) with

  <property>
    <name>indexingfilter.order</name>
    <value>
      org.apache.nutch.indexer.basic.BasicIndexingFilter
      org.archive.nutchwax.index.ConfigurableIndexingFilter
    </value>
  </property>

Without this property, the indexing filters are run in an arbitrary
order. We need our ConfigurableIndexingFilter to run after the
BasicIndexingFilter.

The configuration for the ConfigurableIndexingFilter specifies that
the "url" field will be filled with the value from the "url" metadata
field (which we set in Importer.java, remember), over-writing any
previous value.


2. FetchedSegments

This class has a lovely little routine called "getUrl" which is used
*not* to get the URL per se; rather it gets the URL from a Lucene
document /in order to use it as a document key/.

Let's take a look:

  private Text getUrl(HitDetails details) {
    String url = details.getValue("orig");
    if (StringUtils.isBlank(url)) {
      url = details.getValue("url");
    }
    return new Text(url);
  }

The problem is that we've stored the true URL in the "url" field, so
the value returned is the true URL. Now when the code that calls this
method tries to use it as the key, it can't find the document since
the key is "URL digest".

Since this method is private and this code is rather deep inside of
Nutch, overriding it with a subclass isn't feasible.

But, if you notice, getUrl does come with a little oddity where it
first consults "orig" before "url". We don't use "orig" for anything,
so in our Importer, we set the "orig" metadata field to be the key.

This way, when getUrl calls

  String url = details.getValue("orig");

the key is found and everything is happy.

Yes, it's a hack. No, I'm not ashamed.
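For illustration, the two workarounds above boil down to setting a
pair of metadata fields in the Importer. The following is a
simplified, hypothetical sketch (only the "url" and "orig" field names
come from the discussion above), not the actual Importer.java code:

  import org.apache.nutch.metadata.Metadata;

  public class OrigFieldSketch
  {
    public static void main( String[] args )
    {
      String url = "http://www.example.org/index.html";
      String key = url + " sha1:HJG5ZWG3MQQKHIN43BXJY3FUWP7WTU43";

      Metadata metadata = new Metadata();

      // "url" carries the true URL; the ConfigurableIndexingFilter later
      // over-writes the Lucene "url" field with it.
      metadata.set( "url", url );

      // "orig" carries the URL+digest key, so FetchedSegments.getUrl()
      // returns the real document key when it consults "orig" first.
      metadata.set( "orig", key );

      System.out.println( metadata );
    }
  }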
======================================================================