Re: [Archive-access-discuss] Nutchwax OutOfMemoryError

Aaron Binns wrote:
> Gerard Suades i Méndez <gs...@ce...> writes
>> using dumper tool -c option: 146.235.591 documents
>>     
>
> Hmm, a 12GB machine should be able to serve a 146 million document
> index.
>
> In one of our deployments, we have a ~380 million document index spread
> (unevenly) across three nodes, each with 8GB RAM.  The sizes of each
> are:
>
>   114.555.371
>   152.748.262
>   114.567.931
>
> So if a 8GB RAM node can handle between 115-150 million documents, I
> would expect your 12GB machine could as well.
>
> Now, our deployment is using the "tnh" code I mentioned before; so
> that could be a differentiating factor.
>
> Also, since you are using 64-bit JVM, I strongly recommend using the JVM
> option:
>
>   -XX:+UseCompressedOops
>
> With this feature enabled, the JVM will use 32-bit object references
> rather than 64-bit.  As long as the number of *objects* in your system
> are below 2^32 (~4billion) then 32-bit references are sufficient.
>
> This can save a lot of memory since there are going to be hundreds of
> millions of references in the JVM's heap.
>
> For example, on the 8GB nodes in our ~380 million document deployment,
> the JVM options we use are:
>
>   JAVA_OPTS="-Djava.awt.headless=true -Xmx5000m -XX:+UseCompressedOops"
>   

We are using Java 1.6.0_12, which unfortunately has a bug with
UseCompressedOops that seems to be fixed in update 14. We will update
Java and try again.
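
In case it is useful to others on the list, here is a minimal sketch of
how one might check, from inside the JVM, which Java update is running
and whether UseCompressedOops is actually switched on. The class and
method names are made up for illustration, and the "update 14" threshold
is only our assumption from the bug reports mentioned above, not a
verified fact. The option lookup goes through the HotSpot diagnostic
MXBean, so it may report nothing on a non-HotSpot or 32-bit JVM.

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class CompressedOopsCheck {

    // Assumption taken from this thread: the UseCompressedOops bug in the
    // 1.6.0 line seems to be fixed from update 14 onwards.
    public static final int ASSUMED_FIXED_UPDATE = 14;

    /** Update number of a version string such as "1.6.0_12"; 0 if there is none. */
    public static int updateNumber(String version) {
        int idx = version.indexOf('_');
        if (idx < 0) {
            return 0;
        }
        String tail = version.substring(idx + 1);
        int end = 0;
        while (end < tail.length() && Character.isDigit(tail.charAt(end))) {
            end++;
        }
        return end == 0 ? 0 : Integer.parseInt(tail.substring(0, end));
    }

    /** True only for 1.6.0 versions older than the update assumed to carry the fix. */
    public static boolean predatesAssumedFix(String version) {
        return version.startsWith("1.6.0")
                && updateNumber(version) < ASSUMED_FIXED_UPDATE;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        System.out.println("java.version = " + version);
        System.out.println("predates assumed fix = " + predatesAssumedFix(version));
        try {
            // Proxy lookup by MXBean name; this form is also available on Java 6.
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            System.out.println("UseCompressedOops = "
                    + bean.getVMOption("UseCompressedOops").getValue());
        } catch (Exception e) {
            // Unknown option, non-HotSpot JVM, or 32-bit JVM.
            System.out.println("UseCompressedOops not reported by this JVM: " + e);
        }
    }
}
```

Running it with the same JAVA_OPTS as Tomcat should show whether the
flag was really picked up by the JVM.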

>> yes, fields stored in the index are: collection, content, date,
>> digest, length, segment, site, title, type and url.
>>     
>
> Since the 'content' field is stored in the index, if you use the "tnh"
> code, you don't need the Nutch(WAX) segments.  Everything is
> self-contained in the index.
>
> I just wanted to point this out in case you want to use "tnh" rather
> than NutchWAX for search serving.  I recommend "tnh" over NutchWAX, it's
> what we are using at the Archive now for all our deployments.
>   
We gave it a spin with our whole ARC collection, and TNH shows a
dramatic performance improvement over NutchWAX. Really impressive.
Congrats, good job ;)

As you said before, NutchWAX segments are no longer needed with TNH.

We would like to ask a few questions:

1.- We have a new set of ARCs that we would like to include in full-text
search. Is there a special procedure for updating the existing NutchWAX
indexes with the new crawls? Any suggestions for the merge process? Do
we need to keep the segments of the old crawls in order to generate the
indexes of the new crawls before merging everything together?

2.- Does the size of the self-contained index (the one that includes the
segment information) grow linearly with the size of the ARCs? At the
moment the index is about 7.5% of the total size of the collection's
ARCs.

3.- Is it possible to install TNH in several Tomcats sharing the same
index? In other words, does TNH lock the index while searching, as
Wayback used to?

4.- Based on the results of our tests, we are thinking of using TNH for
full-text search instead of WERA. Is there a roadmap, or a major
release planned?

-- Gerard

......................................................................
        __
       / /          Gerard Suades Méndez
 C E / S / C A      Departament d'Aplicacions i Projectes
     /_/            Centre de Supercomputació de Catalunya

  Gran Capità, 2-4 (Edifici Nexus) · 08034 Barcelona
  T. 93 551 62 20 · F.  93 205 6979 · gs...@ce...
......................................................................