Re: [Archive-access-discuss] Nutchwax OutOfMemoryError

Hi Aaron,

If segments need to be kept in order to update the indexes with new 
crawls, then we need to bear in mind that the combined size of indexes 
and segments is somewhere around 50% of the size of all the ARCs, which 
matters especially for scalability. Are these numbers usual?

1.- Regarding the index merge process, we don't have any 
de-duplication strategy right now because of the OOM errors we hit when 
we first built the indexes with NutchWAX. We were unable to build the 
indexes from scratch in a single job, so we had to split the work into 
several processes, each with a small set of segments (we discussed 
that at the beginning of this thread).

You pointed out that a minor revision of NutchWAX 0.13 added a feature 
that merges duplicate records during index-building. That might be 
helpful for trying a de-duplicating index build.

We tried both approaches for the entire ARC collection:

a) IndexMerger Lucene API (inside TNH).
index size: 813GB

b) Re-built the entire index giving as input both old and new NutchWAX 
segments of the ARC files.
index size: 563GB

Is it normal for the index sizes to differ this much?
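
To illustrate the question, here is a toy sketch of why a plain merge 
could end up larger than a rebuild, assuming the difference comes from 
duplicate records. The Doc class, its url and digest fields, and the 
choice of (url, digest) as the duplicate key are our assumptions for 
illustration only; this is not the NutchWAX or Lucene implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MergeSketch {
    // Hypothetical record; the field names are illustrative only.
    static final class Doc {
        final String url;
        final String digest;

        Doc(String url, String digest) {
            this.url = url;
            this.digest = digest;
        }

        // Assumed duplicate key: same URL and same content digest.
        String key() {
            return url + " " + digest;
        }
    }

    // Plain merge: concatenates both inputs and keeps every record,
    // so a record present in both inputs appears twice.
    static List<Doc> plainMerge(List<Doc> a, List<Doc> b) {
        List<Doc> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // De-duplicating build: keeps the first record seen for each key.
    static List<Doc> dedupBuild(List<Doc> a, List<Doc> b) {
        Map<String, Doc> seen = new LinkedHashMap<>();
        for (Doc d : plainMerge(a, b)) {
            seen.putIfAbsent(d.key(), d);
        }
        return new ArrayList<>(seen.values());
    }

    public static void main(String[] args) {
        List<Doc> oldCrawl = Arrays.asList(
            new Doc("http://example.org/", "d1"),
            new Doc("http://example.org/a", "d2"));
        List<Doc> newCrawl = Arrays.asList(
            new Doc("http://example.org/", "d1"),   // unchanged page, recrawled
            new Doc("http://example.org/b", "d3"));

        System.out.println("plain merge: " + plainMerge(oldCrawl, newCrawl).size());
        System.out.println("dedup build: " + dedupBuild(oldCrawl, newCrawl).size());
    }
}
```

If this is roughly what happens, the 250GB gap between a) and b) would 
correspond to records that appear in both the old and the new segments.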

The JBs sounds good. "Accented letter collapsing" is an interesting 
feature for web pages in Catalan, and it clearly meets our needs for a 
Catalonia web archive. We think it's worth trying. We would be glad if 
you could give us further information on The JBs process and on its 
de-duplication during index building. Is de-duplication enabled by 
default?
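
To check our understanding of accent collapsing, here is a minimal 
sketch using only the standard Java library. It is our guess at the 
general idea, not how The JBs actually implements it; the class name 
AccentFold and the method fold are made up for illustration.

```java
import java.text.Normalizer;

class AccentFold {
    // Decompose each character into base letter + combining marks (NFD),
    // then drop the combining marks (Unicode category M).
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute, \u00e0 is a-grave, \u00f3 is o-acute.
        String[] words = {"M\u00e9ndez", "Capit\u00e0", "Supercomputaci\u00f3"};
        for (String w : words) {
            System.out.println(w + " -> " + fold(w));
        }
    }
}
```

Indexing both the original term and its folded form, or folding both 
the indexed terms and the query, would let "Mendez" match "Méndez".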

2.- OK, our archive files are mainly text (~82%), so that kind of 
percentage is to be expected.

3.- We have only one collection for all the ARC files. Our collection 
is open access and the service is load balanced across several nodes. 
That is the scenario in which several Tomcats access the same indexes.

4.- Having read your "SOLR-Nutch Report", I understand the situation we 
are in now. Some of the "key problems" were also pointed out at IWAW2010 
in Vienna.

P.S.: Don't worry if your answers are very long.
P.S.2: This thread has been evolving through several topics; if you 
think it's better to answer in a different thread (JBs tool) with a new 
title, feel free to switch.

Thank you very much for your answers.

Best regards,

Gerard

Aaron Binns wrote:
> Gerard Suades i Méndez <gs...@ce...> writes
>> 1.- We have a new set of ARCs that we would like to include in full
>> text search. We were wondering if there is any special procedure to
>> update the already existing NutchWAX indexes with the new crawls. Any
>> idea for the merge process? Do we need to keep segments of old crawls
>> in order to generate the indexes of the new crawls before merging all
>> together?
>>     
>
> Yes, for *building* the indexes you need to keep the segments.  Only
> the TNH search service does not need the segments, as the index has all
> the information in it needed for search services.
>
> There are basically two ways to merge indexes, which one you choose
> depends on your de-duplication strategy.
>
> If you have two Lucene indexes A and B, you can just use the IndexMerger
> command in TNH to merge them together.  TNH provides a simple
> command-line wrapper around the Lucene index merging API call.  Since
> TNH is a webapp, you have to un-jar it to be able to use the Java
> command-line wrappers, for example
>
>   $ mkdir tnh
>   $ cd tnh
>   $ jar xf ~/tnh.war
>   $ export CLASSPATH=WEB-INF/classes:WEB-INF/lib/lucene-core-*.jar
>   $ java IndexMerger <merged> <index-A> <index-B>
>
> This simply calls the Lucene library index-merge function, so it does
> *not* know anything about de-duplication.  If you have the same record
> in both index A and index B, then you will have them both in the merged
> index.
>
> So, if you already have an index for your existing collection, then get
> some new (W)ARC files, you can index those separately and then merge the
> two indexes together.
>
>
> Another approach is to re-build the entire index, giving as inputs the
> initial NutchWAX segments and the new NutchWAX segment for the new
> (W)ARCs.  Then, you will have one single index with everything in it.
>
> In this case, any duplicate records can be detected and merged when the
> combined index is being built.  The merging of duplicate records during
> index-building was a feature put into a minor revision of NutchWAX 0.13.
> I'll have to look up the specific SVN revision.
>
>
> With regards to indexing, there is a side-project of mine similar to TNH
> which does a better job of index-building than NutchWAX.  This project
> is called "The JBs", which was the name of the band for the famous
> musician James Brown.
>
> One of the many improvements in The JBs is "accented letter
> collapsing": words with accented characters are indexed so that
> they can be found with or without the accent mark.  For example,
>
>   Méndez
>
> with NutchWAX it is put into the index exactly as "Méndez".  If someone
> searches for "Mendez", it will not be found.  But if the index is built
> with The JBs, then both "Méndez" and "Mendez" can be found.
>
> The JBs also performs merging of duplicates when building a single index
> from multiple NutchWAX segments.
>
> But, this email is getting rather long already, with more below, so I
> will conclude this section on The JBs.  We can discuss further if you
> are interested.
>
>   
>> 2.- Does the size of the index, which self-contains the segments
>> information, grow linearly with the size of the ARCs? At this moment the
>> index represents pretty much 7.5% of the whole collection's ARC size.
>>     
>
> It depends on the mix of file types in the original ARC files.  Only
> text types are put into the full-text search, so things like JPG, MP3,
> AVI, ZIP, etc. are omitted.  Your 7.5% number does not seem unusual to
> me.  In our full-text search for Archive-It.org, there are just over 1
> billion documents in the index and the on-disk index size is ~3.5TB, and
> the size of all the (W)ARC files is somewhere around 100TB.  But I know
> there are lots of large binary files, including lots of YouTube video in
> the Archive-It collection.
>
>   
>> 3.- Is it possible to install TNH in several Tomcats sharing the same
>> index? In other words, does TNH lock the index while searching as Wayback
>> used to?
>>     
>
> I don't remember if that specific use-case was tested.  It should work.
>
> TNH is built on Lucene and when TNH opens the index, it uses the Lucene
> API call to open the index in read-only mode; so there should be no
> exclusive locking and multiple TNH web application instances should be
> able to open the same index.
>
> However, TNH and the Lucene library do cache parts of the index in
> memory, so if you have multiple instances of the TNH web application, you
> will have multiple instances of the caches as well.
>
> An alternative approach might be to use a multi-index setup in a single
> TNH instance and use the "i=<indexname>" URL parameter to select which
> index to search.
>
> Maybe you can describe what you are trying to do with multiple TNH
> webapp instances reading the same index and I can provide some
> suggestions on how to implement it.
>
>   
>> 4.- Based on the results of our tests we are thinking of using TNH for
>> full text search instead of WERA. Is there any roadmap or a major
>> release planned for the future?
>>     
>
> No, there isn't any roadmap.  Well, the roadmap is to migrate everything
> to Apache SOLR, which merged projects with Lucene last year and is now
> considered *the* open-source full-text search platform.
>
> Unfortunately, there are some features missing from SOLR which are
> required for full-text search on web archives.  Also, we don't know yet
> how SOLR will scale, especially in a multi-server configuration.
>
> I produced a report for the IIPC covering the issues with migrating from
> NutchWAX to SOLR.
>
>   http://archive.org/~aaron/iipc/
>
>
> So, that leaves us in an intermediate state where NutchWAX's search
> service performance is not sufficient, but SOLR is not quite ready for
> full-scale migration.  The Internet Archive needs to decide if we commit
> to supporting TNH (with an official release) as an intermediate step in
> the migration path to SOLR.
>
> And if people are finding TNH useful and an adequate replacement for the
> NutchWAX search service, then we would have a stronger case to commit
> the resources to support an official TNH release.
>
>
>   

-- 
......................................................................
        __
       / /          Gerard Suades Méndez
 C E / S / C A      Departament d'Aplicacions i Projectes
     /_/            Centre de Supercomputació de Catalunya

  Gran Capità, 2-4 (Edifici Nexus) · 08034 Barcelona
  T. 93 551 62 20 · F.  93 205 6979 · gs...@ce...
......................................................................