From: Michael S. <st...@ar...> - 2006-03-16 16:46:59
(Forwarding to the list because it is likely of general interest)

Sverre Bang wrote:
> ...
>
> On Thu, 2006-03-16 at 05:39 +0000, Banu Gandhi wrote:
>
>> Hi Sverre,
>>
>> We are using NutchWAX as an indexing tool for our Web Archive (WERA and
>> HERITRIX).
>>
>> We wish to implement multiple indexing, as well as incremental
>> indexing. We keep our crawl files on a separate server.
>>
>>     * 1. I have some questions when I try to do Incremental
>>       Indexing.
>> I mount the ARC files from another server. I have created a queue. When I
>> segment it, it shows an error message that there is no such file in
>> the queue folder even though the arc files are linked properly in the arcs
>> folder.
>>

Please paste in the error, Banu, and the commands you run (you're not
using the indexarcs wrapper script?). That'll help with diagnosis.

>>
>> When I try to implement the update statements from new segments, I got
>> the message "FS not specified default LOCAL". How can I specify
>> this as not local? But the Update message shows the update finished
>> successfully.
>>

Be careful here. You probably want LOCAL for your case, at least for the
moment. The alternative is NDFS, the Nutch distributed file system, which
in later versions of Nutch (NutchWAX is based on Nutch 0.7) has evolved
into DFS, part of the new Apache Hadoop project. The message's phrasing is
ominous, as though you've left out some important specification, but it's
just a notice telling you which FS it's about to use.

>>
>> The same message is shown when I update segments from db.
>> After that, if I check the arcs folder of old segments, I can't see the
>> new arc files.
>>

The arcs folder or the queue folder?

>>
>> Can you explain to me where I made the mistake?
>>
>>     * 2. Can we maintain multiple indexed folders, meaning multiple
>>       arc file folders on the same machine, each indexed under a different
>>       folder? Can WERA access all the indexed folders for search
>>       results....
>>

I think WERA passes the ARCRetriever a full path, so multiple folders
should be possible (Sverre?). Do you have an idea of how many ARC files
you'll be dealing with? There'll be upper limits to how many ARCs you can
keep on a single machine, so a means of keeping them distributed over
multiple machines is needed. The open source wayback will have such a
facility and we'll slot it in, in place of ARCRetriever, when it's ready.

>>
>>     * 3. Regarding the scalability of NutchWAX, if I don't want to
>>       index the image files for full text searching, I wish to have
>>       just a URL link to the images. How can we do that?
>>

That's what currently happens. image/* and the like are passed to the
default parser. All it does is add meta info such as URL, type, etc. to
the index. These resource types are not 'indexed' in the way text/* are.

>>
>> And also please let me know where I can find the functionality part
>> of all the folders as well as scripts of NutchWAX other than the FAQ.
>>

I'm not clear what you're asking above. Please retry.

Thanks Banu,
St.Ack

>>
>> Thanks in advance.
>>
>> Best Regards,
>> Banu
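
P.S. To make the answer to question 3 concrete, here is a rough sketch of the behaviour described above: every record contributes meta info (URL, type) to the index, and only text/* records also contribute a searchable body. This is not NutchWAX code (NutchWAX is Java); the function name `build_index_fields` and the field names are hypothetical, chosen only for illustration.

```python
def build_index_fields(url, mime_type, text=None):
    """Hypothetical illustration, not the NutchWAX API.

    Returns the fields one archived record would add to the index.
    """
    # Every record contributes meta info, whatever its type.
    fields = {"url": url, "type": mime_type}
    # Only text/* records also contribute a full-text body.
    if mime_type.lower().startswith("text/") and text:
        fields["content"] = text
    return fields


page = build_index_fields("http://example.org/", "text/html", "hello archive")
image = build_index_fields("http://example.org/logo.png", "image/png")
```

In this sketch an image can still be found by its URL or type, but it carries no `content` field, so it never matches a full-text query.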