From: Gerard S. i M. <gs...@ce...> - 2011-03-02 16:09:34
Aaron Binns wrote:
> Gerard Suades i Méndez <gs...@ce...> writes
>> using dumper tool -c option: 146.235.591 documents
>>
>
> Hmm, a 12GB machine should be able to serve a 146 million document
> index.
>
> In one of our deployments, we have a ~380 million document index spread
> (unevenly) across three nodes, each with 8GB RAM. The sizes of each
> are:
>
>   114.555.371
>   152.748.262
>   114.567.931
>
> So if a 8GB RAM node can handle between 115-150 million documents, I
> would expect your 12GB machine could as well.
>
> Now, our deployment is using the "tnh" code I mentioned before; so
> that could be a differentiating factor.
>
> Also, since you are using 64-bit JVM, I strongly recommend using the JVM
> option:
>
>   -XX:+UseCompressedOops
>
> With this feature enabled, the JVM will use 32-bit object references
> rather than 64-bit. As long as the number of *objects* in your system
> are below 2^32 (~4billion) then 32-bit references are sufficient.
>
> This can save a lot of memory since there are going to be hundreds of
> millions of references in the JVM's heap.
>
> For example, on the 8GB nodes in our ~380 million document deployment,
> the JVM options we use are:
>
>   JAVA_OPTS="-Djava.awt.headless=true -Xmx5000m -XX:+UseCompressedOops"
>

We are using Java 1.6.0_12, which unfortunately has a bug with
UseCompressedOops that seems to be fixed in update 14. We will update
our Java version and try again.

>> yes, fields stored in the index are: collection, content, date,
>> digest, length, segment, site, title, type and url.
>>
>
> Since the 'content' field is stored in the index, if you use the "tnh"
> code, you don't need the Nutch(WAX) segments. Everything is
> self-contained in the index.
>
> I just wanted to point this out in case you want to use "tnh" rather
> than NutchWAX for search serving. I recommend "tnh" over NutchWAX, it's
> what we are using at the Archive now for all our deployments.
>

We gave it a spin with the whole ARC collection, and TNH shows a
dramatic performance improvement over NutchWAX. Really impressive.
Congrats, good job ;)

As you said before, NutchWAX segments are no longer needed with TNH. We
would like to ask a few questions:

1.- We have a new set of ARC files that we would like to include in
full-text search. Is there any special procedure for updating the
existing NutchWAX indexes with the new crawls? Any ideas for the merge
process? Do we need to keep the segments of the old crawls in order to
generate the indexes for the new crawls before merging everything
together?

2.- Does the size of the index, which now contains the segment
information itself, grow linearly with the size of the ARC files? At the
moment the index is roughly 7.5% of the total size of the collection's
ARC files.

3.- Is it possible to install TNH in several Tomcat instances sharing
the same index? In other words, does TNH lock the index while searching,
as Wayback used to?

4.- Based on the results of our tests, we are thinking of using TNH for
full-text search instead of WERA. Is there a roadmap, or a major release
planned for the future?

--
Gerard

......................................................................
Gerard Suades Méndez
CESCA
Departament d'Aplicacions i Projectes
Centre de Supercomputació de Catalunya
Gran Capità, 2-4 (Edifici Nexus) · 08034 Barcelona
T. 93 551 62 20 · F. 93 205 6979 · gs...@ce...
......................................................................
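
Question 2 asks whether the index grows linearly with the ARC size. Nothing in this thread confirms that, but if it does hold, the 7.5% ratio quoted in the message gives a rough planning estimate. The sketch below only illustrates that arithmetic; the function name and the example ARC size are hypothetical and not part of TNH or NutchWAX.

```python
# Ratio reported in the message: the index is roughly 7.5% of the ARC size.
INDEX_TO_ARC_RATIO = 0.075


def estimate_index_size_gb(arc_size_gb, ratio=INDEX_TO_ARC_RATIO):
    """Estimate index size in GB, ASSUMING it grows linearly with ARC size.

    The linear-growth assumption is exactly what question 2 asks about,
    so treat the result as a rough guess, not a measured value.
    """
    if arc_size_gb < 0:
        raise ValueError("arc_size_gb must be non-negative")
    return arc_size_gb * ratio


if __name__ == "__main__":
    # Hypothetical example: a new crawl adding 1000 GB of ARC files.
    print(estimate_index_size_gb(1000))
```

If later crawls show a different ratio, passing it as the `ratio` argument is enough to redo the estimate.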