From: Aaron B. <aa...@ar...> - 2011-03-15 21:58:15
Gerard Suades i Méndez <gs...@ce...> writes:

> If segments needs to be kept in order to update the indexes with new
> crawls then we need to bear in mind that indexes+segments size
> represents somewhere around 50% of the all ARCs, specially in terms of
> scalability.

Are these numbers usual? That seems large to me. In your segments, which
of the following sub-directories appear?

  <segment>/crawl_data
            crawl_?
            parse_text
            parse_data
            content

I'll have to check the SVN revision in NutchWAX where I removed the
dependencies on the "crawl_*" and "content" sub-dirs. Those sub-dirs are
left-overs from Nutch and we don't use them at all for NutchWAX. If you
see those sub-dirs, re-calculate the sizes w/o them for a more realistic
measurement. Once I can confirm the SVN revision where I removed their
use in NutchWAX (they were never used by the JBs), you can delete them,
as they aren't used at all. In earlier versions of NutchWAX, there was
code to check that the sub-dirs were there, even though they weren't
used. For the JBs and more recent versions of NutchWAX, only

  parse_text
  parse_data

are used. In my indexing projects, after the (w)arcs are all imported, I
just do a

  $ rm -rf segments/*/c*

to remove the sub-dirs starting with 'c'.

> We tried both approaches for the entire ARC collection:
>
> a) IndexMerger Lucene API (inside TNH).
>    index size: 813GB
>
> b) Re-built the entire index giving as input both old and new NutchWAX
>    segments of the ARC files.
>    index size: 563GB
>
> is it normal that there is this difference of sizes in the indexes?

It's quite possible. If there were a lot of duplicate captures, you
could see such a large reduction in size. Method A preserves the
duplicates, whereas method B de-duplicates.

In later versions of NutchWAX and the JBs, de-duplication happens
automatically, but only *within a single indexing job*. If you index all
the segments in one job, they will be de-duplicated. If you index
subsets of segments, creating multiple indexes, there could be
duplicates across the indexes.

NutchWAX segments and their sub-directories are actually rather simple
data structures. They are in a compressed binary (Hadoop) format, so you
can't simply 'cat' them (see the reader sketch after the de-duplication
discussion below), but in essence they are:

  [<unique-id>, <set of key/value properties>], ...

Each record has a unique key, for which we use "<url> <hash>". The rest
of the record is simply key/value pairs of properties. In the parse_data
sub-dir, we have records of the form

  ['http://example.com/ 123456...',
   ["title" => "My webpage",
    "date"  => "20101202092343",
    "type"  => "text/html",
    ....]
  ]
  ['http://example.com/contact.html 3452...',
   ["title" => "Contact us",
    "date"  => "20101202092355",
    "type"  => "text/html",
    ....]
  ]

and in parse_text, we have

  ['http://example.com/ 123456...',
   ["body" => "Here is the body of the webpage."]
  ]
  ...

When indexing, with either NutchWAX or the JBs, the sub-dirs are opened
up and Hadoop combines the records from the sub-dirs, matching according
to unique-id. In NutchWAX and the JBs, we also detect multiple merged
records with the same unique-id and then perform our own merging: we
retain only one key/value pair for properties such as "title", and we
combine the values of the "date" property so that we have *all* the
capture dates for the unique version of a URL. For example, imagine that
we had two records

  ['http://example.com/ 123456...',
   ["title" => "My webpage",
    "date"  => "20090304101509",
    "type"  => "text/html",
    ....]
  ]
  ['http://example.com/ 123456...',
   ["title" => "My webpage",
    "date"  => "20101202092343",
    "type"  => "text/html",
    ....]
  ]

During the indexing process, these would be combined into a single
record with two capture dates

  ['http://example.com/ 123456...',
   ["title" => "My webpage",
    "date"  => ["20090304101509", "20101202092343"],
    "type"  => "text/html",
    ....]
  ]

but for all the other properties, we keep only one value. It doesn't
make sense to have the title or mime-type twice. This is the core of the
de-duplication process during indexing. But this de-duplication is done
by the Java code in the NutchWAX Indexer and the JBs Indexer; Lucene
doesn't know anything about it.
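Purely as an illustration (this is *not* the actual NutchWAX/JBs reducer
code; the RecordMerger class and the map-based record representation are
made up for the sketch), the merging boils down to something like:

    import java.util.*;

    public class RecordMerger {
      // Merge all records sharing one unique-id ("<url> <hash>").
      // "date" is multi-valued; everything else keeps its first value.
      public static Map<String, List<String>> merge(List<Map<String, String>> records) {
        Map<String, List<String>> merged = new HashMap<String, List<String>>();
        for (Map<String, String> record : records) {
          for (Map.Entry<String, String> e : record.entrySet()) {
            if ("date".equals(e.getKey())) {
              // Accumulate every distinct capture date.
              List<String> dates = merged.get("date");
              if (dates == null) {
                dates = new ArrayList<String>();
                merged.put("date", dates);
              }
              if (!dates.contains(e.getValue())) {
                dates.add(e.getValue());
              }
            } else if (!merged.containsKey(e.getKey())) {
              // Single-valued properties ("title", "type", ...): keep one.
              merged.put(e.getKey(), Collections.singletonList(e.getValue()));
            }
          }
        }
        return merged;
      }
    }

In the real jobs this happens in the reduce step, where Hadoop has
already grouped together all the records sharing a unique-id.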
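And since the sub-dirs are ordinary Hadoop SequenceFiles/MapFiles
underneath, you can dump their records if you're curious. A minimal
sketch, assuming Hadoop 0.20 and the Nutch/NutchWAX jars on the
classpath (the DumpSegment class itself is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    // Dumps the records of one segment data file, e.g.
    //   segments/<segment>/parse_data/part-00000/data
    public class DumpSegment {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader =
          new SequenceFile.Reader(fs, new Path(args[0]), conf);
        try {
          // Key is "<url> <hash>"; the value class (e.g. ParseData) is
          // instantiated reflectively, so it must be on the classpath.
          Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
          Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
          while (reader.next(key, val)) {
            System.out.println(key + "\t" + val);
          }
        } finally {
          reader.close();
        }
      }
    }

If I remember correctly, 'hadoop fs -text' also understands
SequenceFiles, so that's a quicker way to peek at a single data file.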
> 3.- We have only one collection for all the ARC files. We have our
> collection on open access and the service is load balanced through
> several nodes. That's the scenario in where several tomcats are
> accessing the same indexes.

Does that mean that each node has a local copy of the index? Or perhaps
the index is on an NFS share or SAN mounted on each node?

Lastly, the indexing process for the JBs is pretty much the same as for
NutchWAX. The command-lines are similar, but for the JBs, you have to
use the Hadoop command-line driver, whereas NutchWAX comes with its own.
E.g.

  $ nutchwax index indexes segments/*

vs.

  $ hadoop jar jbs-*.jar Indexer indexes segments/*

The version of Hadoop that we use is the Cloudera distribution, which is
based on Hadoop 0.20.2 with some Cloudera patches to fix bugs. I believe
you can use stock Hadoop 0.20.1 or 0.20.2 w/o any problems.

The JBs also does a better job of filtering out obvious crap, such as
"words" which do not contain any letters; e.g. "34983545$%23432" is
filtered out when indexing with the JBs. It also canonicalizes the
mime-types so that all the dozens of different known varieties of MS
Office mime-types are mapped to one standard set. It also omits
'robots.txt' files and ignores mime-types that probably don't have text
in them, such as "application/octet-stream".

I'd recommend giving the JBs a try, at least to test and compare against
the index built with NutchWAX; especially since the JBs does the
accented character collapsing.

Aaron

-- 
Aaron Binns
Senior Software Engineer, Web Group, Internet Archive
Program Officer, IIPC
aa...@ar...