From: Michael S. <st...@ar...> - 2006-06-22 21:50:06
Natalia Torres wrote:
> Hello
>
> I have a problem indexing new jobs with Hadoop and NutchWAX. The forum
> archive of this list doesn't work, so I can't find information about it.

I wrote to SourceForge asking where our archive is!

> I indexed a couple of jobs crawled with Heritrix to try NutchWAX search
> and it seems to work.
>
> I search for a word in NutchWAX Search and the results are shown. But when
> I click the title or "other versions" the URL is wrong. It's something
> like http://www.myurl.com/null/*/http//www.urlcrawled.com.

Is the host 'myurl.com' a server that will return the content of ARCs?

> Surfing the examples on the Internet Archive web site, I think the "null"
> in the path may be the collection name used at index time. Am I right?
> Why null?

Looks like your collection name is 'null'. If you do an explain of your
search result, is there a 'collection' field, and if so, is its value
null? Did you use NutchWAX 0.6.2? With that version it was not possible
to do an indexing without supplying a collection name -- supposedly.
You can edit search.jsp and add in a collection name.

> Is there any way to list the collections used when indexing?

See the explain above. Otherwise, use the nutch tools to read the
metadata in your segments -- let me know and I'll supply more detail --
or you can look at the index produced using tools like luke
(http://www.getopt.org/luke/) or some quick lucene code that iterates
over each document, printing out the content of the 'collection' field
(sounds like yours is null though).

> After trying it I decided to add new jobs. When I try to index new jobs
> using the same command, an error appears because the indexes directory
> in the output dir exists.

Is this at the merge indices step? Try moving aside the old merged
index -- i.e. DATA_DIR/index -- and retry running the single merge step.
The 'all' command for NutchWAX is for running through a complete
indexing, from start to finish. Adding increments needs work in NutchWAX.
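A minimal sketch of the "move aside" step suggested above, in Python rather than by hand. It assumes the merged index is a directory literally named index under your data directory (DATA_DIR in the text above). The function name move_index_aside and the index.old-<timestamp> naming are hypothetical and not part of NutchWAX.

```python
import time
from pathlib import Path
from typing import Optional


def move_index_aside(data_dir: str) -> Optional[Path]:
    """Rename DATA_DIR/index to a timestamped backup name, if it exists.

    Returns the backup path, or None when there is no merged index to move.
    """
    index_dir = Path(data_dir) / "index"
    if not index_dir.exists():
        return None

    # Hypothetical naming scheme: index.old-YYYYMMDD-HHMMSS
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = index_dir.with_name(f"index.old-{stamp}")

    # Avoid clobbering a backup that was made within the same second.
    suffix = 1
    while backup.exists():
        backup = index_dir.with_name(f"index.old-{stamp}-{suffix}")
        suffix += 1

    index_dir.rename(backup)
    return backup
```

Calling move_index_aside("/path/to/DATA_DIR") keeps the old merged index around under a new name, so the merge step can then be rerun on its own and the previous index can be restored if the merge fails.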
Adding documentation on how to do this, based on my experience running
a few here, will be the focus of the next NutchWAX release.

St.Ack

> How can I add jobs to this index?
>
> Thanks
>
> Natalia
>
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss