Hi Lee,
Thanks for your reply.
I have two questions about your response:
1 - After I deploy on 10 (n) machines, should I index each subset locally, in
parallel on the 10 machines, or in a distributed way (indexing the 10 subsets
sequentially)?
2 - If I split the ARCs, will the ranking values use local statistics from
the ARC subset, or global statistics from the whole collection and web graph?
If local, the ranking will not be normalized between subsets. If global, when
are these values merged? At runtime, while answering queries?
Regards,
_____
From: arc...@li...
[mailto:arc...@li...] On Behalf Of
John H. Lee
Sent: Tuesday, 5 February 2008 20:30
To: arc...@li...
Subject: Re: [Archive-access-discuss] how to partition the index?
Hi Miguel.
To use distributed search, you need to plan ahead a bit and generate
multiple indices. I don't know of a way to partition an existing large index
into smaller chunks.
For example, if you're indexing 100,000 ARCs and want to deploy on 10
machines, you should split your list of ARCs into 10 chunks of 10,000,
invoke ImportArcs for each chunk, and invoke NutchwaxIndexer for each chunk.
This will produce 10 segment/index pairs, each of which could be deployed on
one of your 10 machines.
For large jobs, I usually split the ARCs into groups of 1000. This produces
segment/index pairs that are small enough to be manageable and flexible when
it comes to deployment layout.
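
As a rough illustration of the splitting step only, here is a minimal Python
sketch. The file names and helper functions are hypothetical, not part of
NutchWAX, and the ImportArcs and NutchwaxIndexer invocations are deliberately
not shown; each resulting chunk would simply be handed to them as described
above.

```python
from typing import List


def split_into_n_chunks(items: List[str], n: int) -> List[List[str]]:
    """Split items into n contiguous chunks whose sizes differ by at most one."""
    if n <= 0:
        raise ValueError("n must be positive")
    base, extra = divmod(len(items), n)
    chunks = []
    start = 0
    for i in range(n):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def split_into_groups(items: List[str], group_size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most group_size items."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]


# Hypothetical ARC file names, standing in for a real list of 100,000 ARCs.
arcs = [f"crawl-{i:06d}.arc.gz" for i in range(100_000)]

# One chunk per machine: 10 chunks of 10,000 ARCs.
per_machine = split_into_n_chunks(arcs, 10)

# Smaller groups of 1000 ARCs, which can later be spread across machines.
groups = split_into_groups(arcs, 1000)
```

With groups of 1000, a 100,000-ARC collection yields 100 segment/index pairs,
so a 10-machine layout would hold about 10 pairs per machine and can be
rebalanced later without re-indexing.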
Hope this helps.
-J
On Feb 5, 2008, at 5:12 AM, Miguel Costa wrote:
Hi to all,
After reading the nutchwax + nutch documentation I can index ARC files and
search them using the nutchwax + wayback machine.
However, I would like to perform a distributed search, but I can't find any
documentation on how to partition the index into n parts/segments for n
machines.
On the other hand there is information explaining how to distribute search
using the search-servers.txt file, but I need to partition the index first.
Can anyone explain, or give me a clue about, how to partition an index for n
machines?
Regards,
Miguel Costa
-------------------------------------------------------------------------
Archive-access-discuss mailing list
Arc...@li...
https://lists.sourceforge.net/lists/listinfo/archive-access-discuss