From: Aaron B. <aa...@ar...> - 2011-05-02 23:28:50
Gerard Suades i Méndez <gs...@ce...> writes:

>>> We tried both approaches for the entire ARC collection:
>>>
>>> a) IndexMerger Lucene API (inside TNH).
>>>    index size: 813GB
>>>
>>> b) Re-built the entire index giving as input both old and new NutchWAX
>>>    segments of the ARC files.
>>>    index size: 563GB
>>>
>>> is it normal that there is this difference of sizes in the indexes?
>
> If I don't get the wrong idea, NutchWAX 0.13 (official release), which
> is the version we've used in method b), doesn't de-duplicate. So if
> neither of the two methods de-duplicates, could it be any other reason
> for such a difference in indexes sizes?

De-duplication while index building:

  NO   NutchWAX 0.13
  YES  NutchWAX 0.13-JIRA-WAX-75
  YES  JBs

I double-checked the source code.  Sorry if I said something different
before.

So, it sounds like you re-built the entire index in one job using
NutchWAX 0.13 (no de-duplication) and it yielded a much smaller index.
That is strange.

Can you confirm the number of documents in the indexes?  You can use a
utility in TNH to dump out the counts of documents by mime-type and
compare the totals:

  $ mkdir tnh
  $ cd tnh
  $ jar xvf ../tnh-*.war
  $ java -cp WEB-INF/classes:WEB-INF/lib/lucene-core-3.0.1.jar TermDumper -c type

This will print out the number of documents for each mime-type.  They
should be the same (or at least the same total) for both indexes.

>>> 3.- We have only one collection for all the ARC files. We have our
>>> collection on open access and the service is load balanced through
>>> several nodes. That's the scenario in where several tomcats are
>>> accessing the same indexes.
>>>
>
> Yes, the index is on an NFS shared storage system accessed by several
> nodes.

Hmmm, you might consider running some performance tests with the index
on local disk.  Maybe I'm just an old Unix guy, but I would expect a big
performance hit when searching a Lucene index over NFS.

> We will try both JBs and NutchWAX-with-deduplication. By the way, the
> SVN branch you pointed out (
>
> http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_13-JIRA-WAX-75/archive
>
> ) is it possible that it is only suitable for nutch 1.1 and not for
> nutch 1.0 as it is said in INSTALL.txt?

Yes, you are correct.  Since this hasn't been made into an official
release, I have not updated all the documentation yet.

> * TNH
>   - Is it possible to define collections?

Yes, and there are two ways to do this:

1. Index each collection separately.

   If you have your (W)ARC files grouped by collection, process each one
   separately.  For example, if you had a "2011 election" collection and
   a "2010 world cup" collection, with separate groups of (W)ARCs for
   each, you would import and index each collection totally separately.
   Then, in TNH, you would put them as sibling directories in the root
   'index' directory, such as:

     /var/lib/tnh/index
       /2011-election
       /2010-world-cup

   and set your index directory in TNH's web.xml to be

     /var/lib/tnh/index

   TNH will automatically traverse the directory tree, finding the
   indexes and mapping them internally according to their directory
   names.  Then, on the URL, you can use the 'i' parameter to specify
   which index to search, such as:

     search?q=winner&i=2011-election
     search?q=winner&i=2010-world-cup

   Multiple 'i' parameters can be used to search multiple collections,
   and if no 'i' parameter is given, then all collections are searched
   by default.

2. When importing, put the name of the collection next to the (W)ARC
   URL, e.g.

     /mnt/data/warcs/foo.warc.gz  2011-election
     /mnt/data/warcs/bar.warc.gz  2010-world-cup

   During the import process, the Importer will decorate each record
   with a metadata field "collection" containing the appropriate value.
   Then, during indexing, that value is added to the Lucene index in a
   field named "collection".  This field can be searched by adding the
   'c' parameter to the OpenSearch URL, such as:

     search?q=winner&c=2011-election
     search?q=winner&c=2010-world-cup

   In this case, however, we only build one Lucene index, rather than
   one for each collection; the collection name is just another field,
   like mime-type.

I prefer method #1.  For us, it's much easier to manage.  In fact, in
our Archive-It.org hosted service, we have close to 2000 collections by
over 150 partners.  Keeping each collection in a separate index allows
us to manage them much more easily than if they were in one giant index
with 1.2 billion documents and 4.3 TB in size.  With separate indexes
for each collection, you have manageably sized indexes on disk, with the
ability to arbitrarily combine them at search time via the 'i'
parameter.

>   - How are the results sorted?

The results are ordered by rank, from best to worst.  The ranking is
determined by Lucene.  Also, 'site collapsing' is performed, so that if
you get multiple hits from one website, we only show the top 1, or 2, or
however many you want, specified by the 'h' parameter on the OpenSearch
URL:

  search?q=winner&h=3

would show up to 3 of the top hits from any one website.  This is pretty
much what web search companies have been doing for a long time.  That
way, if you search for "Facebook" on Yahoo!, you don't get the first
500000 hits from facebook.com, but a mix of things from their site, news
coverage about Facebook, etc.

> * JB/NutchWAX
>   - de-duplication is not possible in either tool (JB or NutchWAX) if
>     we want to add new crawls to an existing index. ¿?

The most straightforward way to do this is to re-build the entire index,
with all the data -- new and old -- in a single indexing job.  Both the
JBs and NutchWAX 0.13-JIRA-WAX-75 support this.

There is another way to do it, but it is more complicated.  You have to
analyze the CDX files, extract all the lines for duplicate captures, and
then feed that list to the 'import' command as an exclusion filter,
telling it which captures to ignore when importing.  In this case you
are de-duplicating at the front of the processing chain: during the
import stage.
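If you want to go that route, the CDX-analysis half could look something
like the sketch below.  This is untested and only illustrates the idea:
it prints the CDX lines of captures whose URL key and content digest
match an earlier capture.  The field positions (URL key = field 1,
digest = field 6) are an assumption based on the common
"CDX N b a m s k r M S V g" header, so check your own CDX header, and
you would still have to feed the output to 'import' in whatever
exclusion format it expects:

  #!/usr/bin/env python
  # Sketch: scan CDX files, keep the first capture of each
  # (URL key, digest) pair, and print the CDX lines of later duplicates.
  # Field positions assume the usual "CDX N b a m s k ..." layout.
  import sys

  def duplicate_lines(cdx_paths):
      seen = set()
      for path in cdx_paths:
          with open(path) as f:
              for line in f:
                  # skip the CDX format header line
                  if line.startswith(' CDX') or line.startswith('CDX '):
                      continue
                  fields = line.split()
                  if len(fields) < 6:
                      continue
                  key = (fields[0], fields[5])  # (url key, content digest)
                  if key in seen:
                      yield line.rstrip('\n')   # later capture of same content
                  else:
                      seen.add(key)

  if __name__ == '__main__':
      for dup in duplicate_lines(sys.argv[1:]):
          print(dup)

For a collection your size you would probably want to sort the CDX lines
by URL key and digest first (e.g. with 'sort') and stream through them,
rather than hold everything in a set in memory.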
>   - We've tried JB with hadoop 20.2, but it didn't end up well.
>     org.apache.hadoop.util.DiskChecker$DiskErrorException was thrown
>     so I guess there wasn't enough space in /tmp. If segments have
>     somewhere around 650 GB (having removed crawl_*/ and content/),
>     how much free space should be left on disk in order to carry out
>     the index process? any estimate size? based on our first try 2TB
>     doesn't seem to be enough.

Yes, lots of disk space is needed.  There is close to a 1:1 ratio
between the size of the segments and the size of the index.  Plus, with
Hadoop you keep a copy of the map output and reduce input in /tmp during
the Map/Reduce job.

For an index 500GB in size, here at IA we'd use our Hadoop cluster of at
least 10 machines.  How many machines are you using?

> * #crawls (WAYBACK 1.6.0/CDX)

I'll have to let Brad tackle this one.

Aaron