From: Chris V. <cv...@gm...> - 2007-10-09 19:47:44
Hi Brad,

Thanks for the response. Unfortunately, you won't be able to access our servers due to IP restrictions.

I tried your suggestion of pointing to nutch/opensearch (and nutchwax/opensearch) without success. No errors were produced, but it didn't provide the functionality I am looking for either.

I am currently implementing full text search and retrieval using a combination of NutchWax (and its index) for search and Wayback (with a separate CDX index) for retrieval. This works fine. I was hoping for a single index solution, but it sounds like you are using the same technique.

If you learn anything new from the NutchWax team, please pass it on.

Thanks,

Chris

On 9/26/07, Brad Tofel <br...@ar...> wrote:
>
> Hi Chris,
>
> I can't access your nutch service, so am unable to provide very detailed
> assistance. One quick thing to test is changing:
>
> http://chaz.hul.harvard.edu:10622/xmlquery
>
> to
>
> http://chaz.hul.harvard.edu:10622/nutch/opensearch
>
> As far as which components should be doing what -- NutchWax and Wayback
> have drifted a little bit from the point when they were integrated so
> that Wayback could utilize a NutchWax index as its ResourceIndex.
> Performance issues with the NutchWax index motivated us to:
>
> 1) build a Wayback installation with its own index, either CDX or BDB
> 2) modify search.jsp as you've done already so links generated by
>    NutchWax search result pages point into the wayback installation.
>
> I'm working with John Lee, who is currently running the NutchWax
> project, to get a better answer on how this will work going forward.
>
> Brad
>
> Chris Vicary wrote:
> > Hi,
> >
> > I am attempting to render nutchwax full text search results using the
> > open-source wayback machine. I have installed hadoop, nutchwax (0.10.0) and
> > wayback (0.8.0) - wayback and nutchwax are deployed in the same tomcat.
> > Creating and searching full-text indexes of arc files using nutchwax works
> > fine.
> > Unfortunately, I have been unsuccessful in rendering the result
> > resources. I attempted to follow the instructions for Wayback-NutchWAX at
> > http://archive-access.sourceforge.net/projects/nutch/wayback.html, but the
> > instructions seem to be based on an older version of wayback, and some of
> > the changes specified for wayback's web.xml do not apply to the newest
> > wayback version.
> >
> > The errors encountered depend on the configuration values I use, so here's a
> > rundown of the properties:
> >
> > hadoop-site.xml:
> >
> > searcher.dir points to a local nutchwax "outputs" directory (/tmp/outputs)
> > wax.host points to the host and port of the tomcat installation; it does not
> > include wayback context information (just host:port,
> > chaz.hul.harvard.edu:10622)
> >
> > search.jsp:
> >
> > made the change:
> >
> > < String archiveCollection =
> > detail.getValue("collection");
> > ---
> > > String archiveCollection = "wayback"; // detail.getValue("collection");
> >
> > wayback/WEB-INF/web.xml:
> >
> > The changes required for web.xml are to "[disable] wayback indexing of
> > ARCS, [comment] out the PipeLineFilter, and [enable] the Remote-Nutch
> > ResourceIndex option".
> >
> > The Local-ARC ResourceStore option is enabled, and all others are disabled.
> > resourcestore.autoindex is set to 0, and all physical paths have been
> > checked for accuracy.
> >
> > I was unable to find any reference to PipeLineFilter, so there was no need
> > to comment it out.
> >
> > I enabled the Remote-Nutch ResourceIndex option, and disabled all other
> > ResourceIndex options.
> > The Remote-Nutch option values are:
> >
> > <context-param>
> > <param-name>resourceindex.classname</param-name>
> > <param-value>
> > org.archive.wayback.resourceindex.NutchResourceIndex
> > </param-value>
> > <description>Class that implements ResourceIndex for this
> > Wayback</description>
> > </context-param>
> >
> > <context-param>
> > <param-name>resourceindex.baseurl</param-name>
> > <param-value>http://chaz.hul.harvard.edu:10622/nutchwax
> > </param-value>
> > <description>absolute URL to Nutch server</description>
> > </context-param>
> >
> > <context-param>
> > <param-name>maxresults</param-name>
> > <param-value>1000</param-value>
> > <description>
> > Maximum number of results to return from the ResourceIndex.
> > </description>
> > </context-param>
> >
> > With the current setup, I can perform a full-text query using nutchwax and
> > the result links seem to be of the correct form:
> > http://[host]:[port]/wayback/[date]/[uri]. But when I click on a link, I get
> > the error:
> >
> > Index not available
> >
> > *Unexpected SAX: White spaces are required between publicId and systemId.*
> >
> > in catalina.out, the stack trace is:
> >
> > [Fatal Error] ?query=date%3A19960101000000-20070919221459+exacturl%3Ahttp%3A%2F%2Fwww.aandw.net%2Fandrea%2Fhowtos%2Fsde_notes.txt&sort=date&reverse=true&hitsPerPage=10&start=0&dedupField=site&hitsPerDup=10&hitsPerSite=10:1:63:White spaces are required between publicId and systemId.
> > org.xml.sax.SAXParseException: White spaces are required between publicId
> > and systemId.
> >     at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:264)
> > 2007-09-19 18:14:59,244 INFO WaxDateQueryFilter - Found range date: 19960101000000, 20070919221459
> >     at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)
> >     at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:146)
> >     at org.archive.wayback.resourceindex.NutchResourceIndex.getHttpDocument(NutchResourceIndex.java:348)
> >     at org.archive.wayback.resourceindex.NutchResourceIndex.query(NutchResourceIndex.java:140)
> >     at org.archive.wayback.replay.ReplayServlet.doGet(ReplayServlet.java:122)
> >     ...
> >
> > if I set the resourceindex.baseurl property closer to the original value
> > like this:
> >
> > <context-param>
> > <param-name>resourceindex.baseurl</param-name>
> > <param-value>http://chaz.hul.harvard.edu:10622/xmlquery
> > </param-value>
> > <description>absolute URL to Nutch server</description>
> > </context-param>
> >
> > when I click on a result link, I get this error:
> >
> > Index not available
> >
> > http://chaz.hul.harvard.edu:10622/xmlquery?query=date%3A19960101000000-20070919222516+exacturl%3Ahttp...
> >
> > and the stack trace looks like this:
> >
> > INFO: initialized
> > org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter
> > java.io.FileNotFoundException:
> > http://chaz.hul.harvard.edu:10622/xmlquery?query=date%3A19960101000000-20070919222516+exacturl%3Ahttp%3A%2F%2Fwww.aandw.net%2Fandrea%2Fhowtos%2Fsde_notes.txt&sort=date&reverse=true&hitsPerPage=10&start=0&dedupField=site&hitsPerDup=10&hitsPerSite=10
> >     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1147)
> >     at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:973)
> >     at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:184)
> >     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:798)
> >     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
> >     at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
> >     at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:250)
> >     at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)
> >     at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:146)
> >     at org.archive.wayback.resourceindex.NutchResourceIndex.getHttpDocument(NutchResourceIndex.java:348)
> >     at org.archive.wayback.resourceindex.NutchResourceIndex.query(NutchResourceIndex.java:140)
> >     ...
> >
> > It seems like I have not configured the Remote-Nutch ResourceIndex
> > properties correctly, but I don't have much to go on to try to correct it.
> > Or perhaps I am not using nutchwax and wayback in the correct roles?
> >
> > Any help with this is greatly appreciated.
> >
> > Thanks,
> >
> > Chris
> >
> > _______________________________________________
> > Archive-access-discuss mailing list
> > Arc...@li...
> > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
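
For anyone debugging the same setup, the query URLs in the stack traces above can be rebuilt by hand and fetched from a machine that is allowed through the IP restriction, to see what each candidate base URL actually returns. The following is only a sketch: the parameter names and values are copied from the logged requests, the hosts and paths are the ones mentioned in this thread, and `build_index_query_url` is a hypothetical helper, not part of Wayback or NutchWax.

```python
from urllib.parse import urlencode


def build_index_query_url(base_url, target_url, start_ts, end_ts, hits=10):
    """Rebuild the index query URL seen in the logs for a given base URL.

    Hypothetical helper: it only mirrors the parameters visible in the
    logged requests and makes no claim about how Wayback builds them.
    """
    params = [
        ("query", f"date:{start_ts}-{end_ts} exacturl:{target_url}"),
        ("sort", "date"),
        ("reverse", "true"),
        ("hitsPerPage", str(hits)),
        ("start", "0"),
        ("dedupField", "site"),
        ("hitsPerDup", str(hits)),
        ("hitsPerSite", str(hits)),
    ]
    # urlencode percent-encodes ":" and "/" and turns the space into "+",
    # which matches the encoding of the logged URLs.
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Candidate base URLs discussed in this thread.
    for base in (
        "http://chaz.hul.harvard.edu:10622/xmlquery",
        "http://chaz.hul.harvard.edu:10622/nutch/opensearch",
        "http://chaz.hul.harvard.edu:10622/nutchwax/opensearch",
    ):
        print(build_index_query_url(
            base,
            "http://www.aandw.net/andrea/howtos/sde_notes.txt",
            "19960101000000",
            "20070919222516",
        ))
```

Fetching the printed URLs with a browser or curl should show whether an endpoint answers with XML, an HTML page, or an HTTP error. A java.io.FileNotFoundException from HttpURLConnection.getInputStream usually corresponds to an HTTP 404. The "White spaces are required between publicId and systemId" message is commonly seen when an XML parser is handed an HTML page whose DOCTYPE has a public identifier but no system identifier, though the thread does not confirm that this is what happened here.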