From: Michael S. <st...@ar...> - 2006-07-20 16:23:04
Natalia Torres wrote:
> Thanks Michael, I'll experiment indexing job this way.
>
> About indexing process ..
>
> I'm testing how it works (Heritrix+Hadoop+NutchWax+Wera) with our web
> and I'm running it in standalone mode with one crawled job (about 7 arc
> 700Mb).

How long is it taking you to index your 7 ARCs?

> I want to start a hadoop cluster but I don't know how many slaves put
> and hardware requirements to it. I'm looking for information about
> benchmarks, indexing performance .... to know more about hardware needed
> , but I don't find anything.

When the software settles more -- hadoop, nutch, and nutchwax -- I'll
put up some figures on our experience here at the Archive. Meantime,
here are a few coarse stats.

+ A cluster should have at least 3, probably 4, machines to make
  distribution worth the bother.
+ Here at the Archive, we have a rack of between 16 and 30 machines
  that we've been running and debugging indexing jobs on over the last
  several months. (The number of slaves participating varies because
  the hardware we use is not of the best quality, and these indexing
  jobs, which last days and checksum everything read and written, are
  a good way of finding flaky RAM sticks and erroring motherboards.)
  We find on this rack that total processing of an ARC, from ingest
  through indexing, takes about 3 minutes. (Machines are 4Gig 2GHz
  dual-core Athlons with 4x400 SATA disks.)

Other things to consider:

+ Make all slave nodes exactly the same -- same RAM and disk
  configuration. It'll save you headaches down the road.
+ Set up rsync so you can pull ARCs into your cluster with it. Once
  done, you can feed nutchwax lists of ARCs using rsync URLs. This
  way, you can leave your ARCs out on storage nodes and the indexing
  software will take care of making the ARCs local to the indexing
  cluster.
+ DFS cannot be trusted. It'll be fixed soon, but for now, as soon as
  an indexing job completes, back up the produced data -- segments and
  indices -- to local storage.

Yours,
St.Ack
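
For sizing a cluster from the coarse 3-minute figure above, a
back-of-envelope sketch in Python may help. It rests on assumptions that
are not stated in this thread: that the 3 minutes is per ARC per machine,
that each machine works on one ARC at a time, and that ARCs split evenly
across slaves. Job startup, index merging, and flaky nodes are ignored,
so treat the result as a lower bound. The function names are made up
for illustration.

```python
import math


def indexing_hours(num_arcs, machines, minutes_per_arc=3.0):
    """Rough wall-clock hours to process num_arcs on `machines` slaves.

    Assumption (not from the thread): minutes_per_arc is the cost of one
    ARC on one machine, and ARCs are spread evenly across machines.
    """
    if num_arcs < 0 or machines < 1:
        raise ValueError("need num_arcs >= 0 and machines >= 1")
    # The busiest machine determines how long the whole job takes.
    arcs_on_busiest_machine = math.ceil(num_arcs / machines)
    return arcs_on_busiest_machine * minutes_per_arc / 60.0


def machines_needed(num_arcs, deadline_hours, minutes_per_arc=3.0):
    """Smallest machine count that finishes within deadline_hours."""
    arcs_per_machine = math.floor(deadline_hours * 60.0 / minutes_per_arc)
    if arcs_per_machine < 1:
        raise ValueError("deadline is shorter than one ARC's processing time")
    return math.ceil(num_arcs / arcs_per_machine)


if __name__ == "__main__":
    print(indexing_hours(7, machines=1))           # 7-ARC standalone job
    print(indexing_hours(10000, machines=16))      # larger crawl, 16 slaves
    print(machines_needed(10000, deadline_hours=24))
```

Under these assumptions, 7 ARCs on a single machine come to 0.35 hours
(21 minutes), and 10,000 ARCs on 16 slaves to 31.25 hours. Measuring
minutes_per_arc on your own hardware first will give a more useful
number than the default.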