|
From: Natalia T. <nt...@ce...> - 2006-06-22 10:02:28
|
Hello,

I have a problem indexing new jobs with Hadoop and NutchWAX. The forum archive of this list doesn't work, so I can't find information about it.

I indexed a couple of jobs crawled with Heritrix to try the NutchWAX search and it seems to work. When I search for a word, results are shown, but when I click the title or "other versions" the URL is wrong. It's something like http://www.myurl.com/null/*/http//www.urlcrawled.com. Looking at examples on the Internet Archive web, I think that "null" in the path may be the collection name used at index time. Am I right? Why null? Is there any way to list the collections used at indexing time?

After trying this, I decided to add new jobs. When I try to index new jobs using the same command, an error appears because the indexes directory in the output dir already exists. How can I add jobs to this index?

Thanks,
Natalia
|
From: Michael S. <st...@ar...> - 2006-06-22 21:50:06
|
Natalia Torres wrote:
> Hello
>
> I have a problem indexing new jobs with Hadoop and NutchWAX. The forum
> archive of this list doesn't work, so I can't find information about it.

I wrote SourceForge asking where our archive is!

> I indexed a couple of jobs crawled with Heritrix to try the NutchWAX search
> and it seems to work. When I search for a word, results are shown, but when
> I click the title or "other versions" the URL is wrong. It's something
> like http://www.myurl.com/null/*/http//www.urlcrawled.com.

Is the host 'myurl.com' a server that will return the content of ARCs?

> Looking at examples on the Internet Archive web, I think that "null" in the
> path may be the collection name used at index time. Am I right? Why null?

Looks like your collection name is 'null'. If you do an explain of your search result, is there a 'collection' field, and if so, is its value null?

Did you use NutchWAX 0.6.2? With that version it was not possible to do an indexing without supplying a collection name -- supposedly. You can edit search.jsp and add in a collection name.

> Is there any way to list the collections used at indexing time?

See the explain above. Otherwise, use the Nutch tools to read the content of the metadata in your segments -- let me know and I'll supply more detail -- or you can look at the index produced using a tool like Luke (http://www.getopt.org/luke/) or some quick Lucene code that iterates over each document printing out the content of the collection field (see the sketch at the end of this message; sounds like yours is null though).

> After trying this, I decided to add new jobs. When I try to index new jobs
> using the same command, an error appears because the indexes directory
> in the output dir already exists.

Is this at the merge indices step? Try moving aside the old merged index -- i.e. DATA_DIR/index -- and retry running the single merge step.

The 'all' command for NutchWAX is for running through a complete indexing -- from start to finish. Adding increments needs work in NutchWAX. Adding documentation on how to do it, from my experience running a few here, will be the focus of the next NutchWAX release.

St.Ack

> How can I add jobs to this index?
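Below is a minimal sketch of the kind of quick Lucene code mentioned above: it walks every document in a Lucene index and prints the stored 'collection' field. The class name and the default index path are hypothetical examples; point it at your merged index directory and put the Lucene jar that ships with Nutch/NutchWAX on the classpath.

  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexReader;

  // Hypothetical utility: print the 'collection' field of every document in an index.
  public class DumpCollectionField {
    public static void main(String[] args) throws Exception {
      // Example default path -- pass your own index directory as the first argument.
      String indexDir = args.length > 0 ? args[0] : "/data/outputs/index";
      IndexReader reader = IndexReader.open(indexDir);
      try {
        for (int i = 0; i < reader.maxDoc(); i++) {
          if (reader.isDeleted(i)) {
            continue; // skip documents removed by deduplication
          }
          Document doc = reader.document(i);
          System.out.println(i + "\t" + doc.get("collection"));
        }
      } finally {
        reader.close();
      }
    }
  }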
|
From: Natalia T. <nt...@ce...> - 2006-06-23 11:56:27
|
When I index my jobs with NutchWAX 0.6.1 I use this command:

hadoop jar /usr/local/nutchwax-0.6.1/nutchwax-0.6.1.jar all /data/inputs/ /data/outputs ciencia

The help explains that input, output and collection are required. I put "ciencia" as the collection name, not null, but in the listed search results this name is not included in the path ...

If I edit the search.jsp page and add my collection name, then search doesn't work (it doesn't recognize this collection).

Natalia
|
From: Michael S. <st...@ar...> - 2006-06-23 15:33:46
|
Natalia Torres wrote:
> When I index my jobs with NutchWAX 0.6.1 I use this command:
>
> hadoop jar /usr/local/nutchwax-0.6.1/nutchwax-0.6.1.jar all
> /data/inputs/ /data/outputs ciencia
>
> The help explains that input, output and collection are required.
> I put "ciencia" as the collection name, not null, but in the listed search
> results this name is not included in the path ...
>
> If I edit the search.jsp page and add my collection name, then search
> doesn't work (it doesn't recognize this collection).

Just add it to the path that gets made as part of the unrolling of search results. Put 'ciencia' in place of the value of collection at that point -- around line 196, where we assign the archiveCollection value (see the sketch at the end of this message).

The collection name not being passed to the index is a bug. It looks like the fix is not in 0.6.1; it was fixed 2006/05/12. I'll make a 0.6.2 -- hopefully today.

St.Ack

P.S. Regarding why the archives are not in place, from SF support, per the site status page: ( 2006-06-20 12:41:07 - Mailing List Service ) On 2006-06-20 the Mailing List Archives were taken down for preventative maintenance that occurs about once every two years. We expect the duration of this downtime to last between 1 to 3 days.

> Natalia
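A sketch of the search.jsp workaround described above. The exact code around line 196 varies between versions, so the surrounding assignment shown here is an assumption; only the archiveCollection name and the 'ciencia' value come from this thread.

  <%
    // Original (hypothetical) assignment that picks the collection up from the
    // search hit; with the 0.6.1 bug this value comes back null.
    // String archiveCollection = detail.getValue("collection");

    // Workaround: hardcode the collection name used at index time.
    String archiveCollection = "ciencia";
  %>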
|
From: Michael S. <st...@ar...> - 2006-06-23 20:27:29
|
Looks like it will be a while before I can get to a release; I'm out next week. Meantime, I just took this build for a test run: http://crawltools.archive.org:8080/cruisecontrol/artifacts/HEAD-archive-access/20060623113807. It fixes at least your collection issue. It requires Hadoop 0.3.2.

Yours,
St.Ack
|
From: Natalia T. <nt...@ce...> - 2006-06-26 16:46:13
|
Hello,

Using this build it's now running and I can index!! I can read the explanation and the "more from the site" links, but I can't access the title link.

I have a doubt about the "collectionsHost" variable. It points to the server that will return the content of the ARCs. I put the ARC files (or arc.gz files) directly on this server, but the "title" and "other versions" links on the NutchWAX search results don't work. What information does this server have to offer?

Thanks,
Natalia
|
From: Natalia T. <nt...@ce...> - 2006-07-03 11:43:55
|
Hello,

I tried to add the new job by moving the indexes directory before starting the index process, and it works fine. Thanks!!

So, every time I want to index a new job I need to move the indexes directory? If I move this directory, does the NutchWAX search keep working? This process takes many hours ...

Natalia
|
From: Michael S. <st...@ar...> - 2006-07-07 00:41:45
|
Natalia Torres wrote:
> Hello
>
> I tried to add the new job by moving the indexes directory before starting
> the index process, and it works fine. Thanks!!
>
> So, every time I want to index a new job I need to move the indexes
> directory? If I move this directory, does the NutchWAX search keep working?

I presume you are using the 'all' command each time? It will complain if there are already indices in place from a previous run.

The 'all' command is a convenience. It assumes you want to do a single-pass indexing of a set of ARCs. Running the 'all' command to bring in a new set of ARCs will run through all steps and index all the new additions as well as reindex all ARCs added previously.

It sounds like you want to do incremental updates to your index. Experiment by calling the jobs that comprise the 'all' command individually. For example, run the import, passing it a directory that contains a file pointing to just the new ARCs you want to ingest. Then do 'update' and 'invert'. Next, run indexing on just the segments that were added by the ingest step, saving aside the indexes made previously first. Run your deduplication. Finally, merge the new indices with the old (see the sketch at the end of this message).

I'm currently working on tools and documentation to better support incremental updates to indices. They'll form the core of the next release (coming soon -- a month or so).

> This process takes many hours ...

Yes, it can. It depends on the number of ARCs you have. It also sounds like you are running in standalone mode. You might consider starting a small Hadoop cluster; that should improve your throughput.

Yours,
St.Ack

> Natalia
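A sketch of the incremental sequence described above, in the same style as the 'all' invocation earlier in the thread. The subcommand names and argument order here are assumptions drawn from the steps listed; check the usage the jar prints when run without arguments before relying on them.

  NUTCHWAX_JAR=/usr/local/nutchwax-0.6.1/nutchwax-0.6.1.jar
  OUTPUT=/data/outputs

  # Save aside the previously merged index so the merge step does not complain.
  mv $OUTPUT/index $OUTPUT/index.old

  # Import only the new ARCs (the inputs-new dir holds a file listing just them).
  hadoop jar $NUTCHWAX_JAR import /data/inputs-new $OUTPUT ciencia

  # Update the crawl db and invert links for the newly added segments.
  hadoop jar $NUTCHWAX_JAR update $OUTPUT
  hadoop jar $NUTCHWAX_JAR invert $OUTPUT

  # Index just the new segments, dedup, then merge the new indices with the old.
  hadoop jar $NUTCHWAX_JAR index $OUTPUT
  hadoop jar $NUTCHWAX_JAR dedup $OUTPUT
  hadoop jar $NUTCHWAX_JAR merge $OUTPUT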
|
From: Natalia T. <nt...@ce...> - 2006-07-07 11:45:11
|
Thanks Michael, I'll experiment with indexing jobs this way.

About the indexing process ... I'm testing how it works (Heritrix + Hadoop + NutchWAX + WERA) with our web, and I'm running it in standalone mode with one crawled job (about 7 ARCs, 700 MB).

I want to start a Hadoop cluster, but I don't know how many slaves to use or the hardware requirements for it. I'm looking for information about benchmarks, indexing performance and so on to learn more about the hardware needed, but I can't find anything.

Thanks,
Natalia
|
From: Michael S. <st...@ar...> - 2006-07-20 16:23:04
|
Natalia Torres wrote:
> Thanks Michael, I'll experiment with indexing jobs this way.
>
> About the indexing process ... I'm testing how it works (Heritrix + Hadoop +
> NutchWAX + WERA) with our web, and I'm running it in standalone mode with
> one crawled job (about 7 ARCs, 700 MB).

How long is it taking you to index your 7 ARCs?

> I want to start a Hadoop cluster, but I don't know how many slaves to use
> or the hardware requirements for it. I'm looking for information about
> benchmarks, indexing performance and so on to learn more about the hardware
> needed, but I can't find anything.

When the software settles more -- Hadoop, Nutch, and NutchWAX -- I'll put up some figures on our experience here at the Archive. Meantime, here are a few coarse stats:

+ A cluster should have at least 3, probably 4, machines to make distribution worth the bother.

+ Here at the Archive, we have a rack with between 16 and 30 machines that we've been running/debugging indexing jobs on over the last bunch of months (the number of slaves participating varies because the hardware we use is not of the best quality, and these indexing jobs, lasting days and checksumming everything read and written, are a good way of finding flaky RAM sticks and erroring motherboards). We find on this rack that total processing of an ARC, from ingest through indexing, takes about 3 minutes (machines are 4 GB, 2 GHz dual-core Athlons with 4x400 SATA disks).

Other things to consider:

+ Make all slave nodes exactly the same -- same RAM and disk configuration. It'll save you headaches down the road.

+ Set up rsync so you can pull ARCs into your cluster with it. Once done, you can feed NutchWAX lists of ARCs as rsync URLs. This way, you can leave your ARCs out on storage nodes and the indexing software will take care of making the ARCs local to the indexing cluster.

+ DFS cannot be trusted. It'll be fixed soon, but for now, as soon as an indexing job is completed, make a backup of the produced data -- segments and indices -- to local storage (see the sketch at the end of this message).

Yours,
St.Ack
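A sketch of the last two points above. The hostnames, paths and manifest file name are hypothetical examples, and the backup step assumes the 'hadoop dfs -copyToLocal' shell command of Hadoop of this vintage.

  # new-arcs.txt -- manifest of ARCs to ingest, one rsync URL per line; point the
  # import step at the directory containing it and the indexing software will
  # pull the ARCs local to the cluster itself, e.g.:
  #   rsync://storage1.example.org/arcs/EXAMPLE-20060601000000-00001.arc.gz
  #   rsync://storage1.example.org/arcs/EXAMPLE-20060601000000-00002.arc.gz

  # As soon as an indexing job completes, copy the produced data out of DFS to
  # local storage rather than trusting DFS to keep it.
  hadoop dfs -copyToLocal /data/outputs/index /backup/outputs/index
  hadoop dfs -copyToLocal /data/outputs/segments /backup/outputs/segments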