From: Gerard S. i M. <gs...@ce...> - 2011-04-29 09:57:26
Aaron Binns wrote:
[...
> For the JBs and more recent versions of NutchWAX, only
>
>    parse_text
>    parse_data
>
> are used. In my indexing projects, after the (w)arcs are all imported,
> I just do a
>
>    rm -rf segments/*/c*
>
> to remove the sub-dirs starting with 'c'.
>
Done.

>> We tried both approaches for the entire ARC collection:
>>
>> a) IndexMerger Lucene API (inside TNH).
>>    Index size: 813 GB
>>
>> b) Re-built the entire index, giving as input both the old and the new
>>    NutchWAX segments of the ARC files.
>>    Index size: 563 GB
>>
>> Is it normal that there is such a difference between the index sizes?
>
> It's quite possible. If there were a lot of duplicate captures, then you
> could see such a large reduction in size. Method A would preserve the
> duplicates whereas method B de-duplicates.
>
> In later versions of NutchWAX and the JBs, de-duplication happens
> automatically, but only *within a single indexing job*. If you index all
> the segments in one job, then they will be de-duplicated.
>
> If you index subsets of segments, creating multiple indexes, then there
> could be duplicates across the indexes.
> ...]

There should be a lot of duplicate captures in those Top Level Domain /
agreement-institution crawls. If I'm not mistaken, NutchWAX 0.13 (the
official release), which is the version we used in method b), doesn't
de-duplicate. So if neither of the two methods de-duplicates, could there be
any other reason for such a difference in index sizes?

>> 3.- We have only one collection for all the ARC files. Our collection is
>> on open access and the service is load-balanced across several nodes.
>> That's the scenario in which several Tomcats access the same indexes.
>
> Does that mean that each node has a local copy of the index? Or perhaps
> the index is on an NFS share or SAN mounted on each node?
>
Yes, the index is on an NFS shared storage system accessed by several nodes.

> Lastly, the indexing process for the JBs is pretty much the same as for
> NutchWAX. The command-lines are similar, but for the JBs you have to use
> the Hadoop command-line driver, whereas NutchWAX comes with its own.
>
> E.g.
>
>    $ nutchwax index indexes segments/*
>
> vs.
>
>    $ hadoop jar jbs-*.jar Indexer indexes segments/*
>
> The version of Hadoop that we use is the Cloudera distribution, which is
> based on Hadoop 0.20.2 with some Cloudera patches to fix bugs. I believe
> you can use Hadoop 0.20.1 or 0.20.2 without any problems.
>
> The JBs also does a better job of filtering out obvious crap, such as
> "words" which do not contain any letters; for example, "34983545$%23432"
> is filtered out when indexing with the JBs.
>
> It also canonicalizes the mime-types so that the dozens of different
> known varieties of MS Office mime-types are all mapped to one standard
> set. It also omits 'robots.txt' files and ignores mime-types that
> probably don't have text in them, such as "application/octet-stream".
>
> I'd recommend giving the JBs a try, at least to test and compare against
> the index built with NutchWAX, especially since the JBs do the accented
> character collapsing.
>
We will try both the JBs and NutchWAX-with-deduplication. By the way, the SVN
branch you pointed out
( http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_13-JIRA-WAX-75/archive )
is it possible that it is only suitable for Nutch 1.1, and not for Nutch 1.0
as stated in INSTALL.txt?
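Just so we reproduce your procedure as closely as possible, this is roughly
the sequence we intend to run for the full re-index (the paths and the jar
name are simply what we have locally, so please correct me if any of it is
wrong):

   # after all the (w)arcs are imported, remove the segment sub-dirs starting with 'c'
   $ rm -rf segments/*/c*

   # index every segment in a single job so that de-duplication can apply
   $ hadoop jar jbs-*.jar Indexer indexes segments/*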
We would like to ask you a few more questions regarding:

* TNH
  - Is it possible to define collections?
  - How are the results sorted?

* JBs / NutchWAX
  - Is it correct that de-duplication is not possible in either tool (JBs or
    NutchWAX) if we want to add new crawls to an existing index?
  - We've tried the JBs with Hadoop 0.20.2, but it didn't end well:
    org.apache.hadoop.util.DiskChecker$DiskErrorException was thrown, so I
    guess there wasn't enough space in /tmp. If the segments take up roughly
    650 GB (having removed crawl_*/ and content/), how much free disk space
    should be left in order to carry out the indexing process? Any estimate?
    Based on our first try, 2 TB doesn't seem to be enough.

* Number of crawls (Wayback 1.6.0 / CDX)
  Wayback returns the number of crawls per URL through a URL query search;
  the crawls-per-URL number is displayed in ToolBar.jsp
  (data.getResultCount()). I can't see anything in the CDX index file that
  allows me to bind "crawl" <-> "URL", so I guess Wayback calculates it based
  on the ARC files... Is that right? Is there an easy way to find out how
  many crawls there are using the Wayback Java classes? I'm thinking of
  something similar to what is done in ToolBar.jsp, (somehow) taking
  advantage of data.getResultCount() but without specifying the URL in the
  query. It would be very helpful if you could point me to which classes I
  should look at. (I've put a small shell sketch of what I mean in the P.S.
  below.)

Thank you very much and best regards,

--
Gerard

......................................................................
 Gerard Suades Méndez
 C E / S / C A  Departament d'Aplicacions i Projectes
 Centre de Supercomputació de Catalunya
 Gran Capità, 2-4 (Edifici Nexus) · 08034 Barcelona
 T. 93 551 62 20 · F. 93 205 6979 · gs...@ce...
......................................................................
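P.S. Regarding the crawls-per-URL question: as a stopgap I can get an
approximate count straight from the CDX files on the shell, assuming each CDX
line is one capture, the first whitespace-separated field is the canonicalized
URL key, and any " CDX ..." header line is skipped (all of which is just my
reading of the format, so please correct me if I've got it wrong):

   # captures per URL key, most-captured first (rough sketch, no Wayback involved)
   $ grep -v '^ CDX' index.cdx | cut -d' ' -f1 | sort | uniq -c | sort -rn | head

   # number of distinct URLs in the index
   $ grep -v '^ CDX' index.cdx | cut -d' ' -f1 | sort -u | wc -l

What I'd really like, though, is to obtain the same numbers through the
Wayback Java classes, the way ToolBar.jsp does with data.getResultCount().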