Re: [Archive-access-discuss] how to partition the index?

Hi Miguel.

To use distributed search, you need to plan ahead a bit and generate  
multiple indices. I don't know of a way to partition an existing large  
index into smaller chunks.

For example, if you're indexing 100,000 ARCs and want to deploy on 10  
machines, you should split your list of ARCs into 10 chunks of 10,000,  
invoke ImportArcs for each chunk, and invoke NutchwaxIndexer for each  
chunk. This will produce 10 segment/index pairs, each of which could  
be deployed on one of your 10 machines.
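
The splitting step could be sketched in Python roughly as follows. This is only an illustration, not part of NutchWAX: the function name and the ARC file names are made up, and running ImportArcs and NutchwaxIndexer on each chunk is left out because their exact command lines depend on your installation.

```python
def split_into_chunks(arcs, n_chunks):
    """Split a list of ARC paths into n_chunks contiguous, near-equal parts.

    Hypothetical helper for illustration; each returned chunk is meant to be
    fed to one import run and one indexing run.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    base, extra = divmod(len(arcs), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one more item so sizes differ by at most 1.
        size = base + (1 if i < extra else 0)
        chunks.append(arcs[start:start + size])
        start += size
    return chunks


if __name__ == "__main__":
    # Placeholder names standing in for a real list of 100,000 ARC files.
    arcs = ["arc-%06d.arc.gz" % i for i in range(100000)]
    for number, chunk in enumerate(split_into_chunks(arcs, 10)):
        print("chunk %d: %d ARCs" % (number, len(chunk)))
```

Each chunk then becomes the input list for one segment/index pair.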

For large jobs, I usually split the ARCs into groups of 1000. This  
produces segment/index pairs that are small enough to be manageable  
and flexible when it comes to deployment layout.
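
As a rough sketch of why small groups help with layout: once the ARCs are in groups of 1000, the resulting segment/index pairs can be dealt out to however many machines you have. The helper names and the round-robin policy below are assumptions for illustration only; any assignment scheme would do.

```python
def group_by_size(arcs, group_size=1000):
    """Cut a list of ARC paths into consecutive groups of at most group_size."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [arcs[i:i + group_size] for i in range(0, len(arcs), group_size)]


def assign_round_robin(n_groups, n_machines):
    """Map machine number -> list of group numbers, dealt out in turn.

    Round-robin is just one possible layout policy (an assumption here).
    """
    if n_machines < 1:
        raise ValueError("n_machines must be at least 1")
    layout = {machine: [] for machine in range(n_machines)}
    for group in range(n_groups):
        layout[group % n_machines].append(group)
    return layout


if __name__ == "__main__":
    # Placeholder names standing in for a real list of ARC files.
    arcs = ["arc-%06d.arc.gz" % i for i in range(100000)]
    groups = group_by_size(arcs, 1000)
    layout = assign_round_robin(len(groups), 10)
    for machine, assigned in sorted(layout.items()):
        print("machine %d: %d segment/index pairs" % (machine, len(assigned)))
```

If the number of machines changes later, only the assignment needs to be redone, not the import and indexing.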

Hope this helps.

-J

On Feb 5, 2008, at 5:12 AM, Miguel Costa wrote:

> Hi to all,
>
> After reading the nutchwax + nutch documentation, I can index ARC
> files and search them using the nutchwax + wayback machine.
> However, I would like to perform a distributed search, but I can't
> find any documentation on how to partition the index into n
> parts/segments for n machines.
> There is information explaining how to distribute search using the
> search-servers.txt file, but I need to partition the index first.
> Can anyone explain or give me a clue on how to partition an index
> for n machines?
>
> Regards,
>
> Miguel Costa
>
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss