From: Brad T. <br...@ar...> - 2007-09-27 01:01:02
Hi Chris,

I can't access your nutch service, so I am unable to provide very detailed
assistance. One quick thing to test is changing:

  http://chaz.hul.harvard.edu:10622/xmlquery

to

  http://chaz.hul.harvard.edu:10622/nutch/opensearch

As for which components should be doing what: NutchWax and Wayback have
drifted a little since the point when they were integrated so that Wayback
could use a NutchWax index as its ResourceIndex. Performance issues with the
NutchWax index motivated us to:

1) build a Wayback installation with its own index, either CDX or BDB
2) modify search.jsp, as you've done already, so that links generated by
   NutchWax search result pages point into the Wayback installation.

I'm working with John Lee, who is currently running the NutchWax project, to
get a better answer on how this will work going forward.

Brad

Chris Vicary wrote:
> Hi,
>
> I am attempting to render nutchwax full text search results using the
> open-source wayback machine. I have installed hadoop, nutchwax (0.10.0) and
> wayback (0.8.0) - wayback and nutchwax are deployed in the same tomcat.
> Creating and searching full-text indexes of arc files using nutchwax works
> fine. Unfortunately, I have been unsuccessful in rendering the result
> resources. I attempted to follow the instructions for Wayback-NutchWAX at
> http://archive-access.sourceforge.net/projects/nutch/wayback.html, but the
> instructions seem to be based on an older version of wayback, and some of
> the changes specified for wayback's web.xml do not apply to the newest
> wayback version.
>
> The errors encountered depend on the configuration values I use, so here's a
> rundown of the properties:
>
> hadoop-site.xml:
>
> searcher.dir points to a local nutchwax "outputs" directory (/tmp/outputs)
> wax.host points to the host and port of the tomcat installation, it does not
> include wayback context information (just host:port,
> chaz.hul.harvard.edu:10622)
>
> search.jsp:
>
> made the change:
>
> < String archiveCollection = detail.getValue("collection");
> ---
> > String archiveCollection = "wayback"; // detail.getValue("collection");
>
> wayback/WEB-INF/web.xml:
>
> The changes required for web.xml are to "[disable] wayback indexing of
> ARCS, [comment] out the PipeLineFilter, and [enable] the Remote-Nutch
> ResourceIndex option".
>
> The Local-ARC ResourceStore option is enabled, and all others are disabled.
> resourcestore.autoindex is set to 0, and all physical paths have been
> checked for accuracy.
>
> I was unable to find any reference to PipeLineFilter, so there was no need
> to comment it out.
>
> I enabled the Remote-Nutch ResourceIndex option, and disabled all other
> ResourceIndex options. The Remote-Nutch option values are:
>
> <context-param>
>   <param-name>resourceindex.classname</param-name>
>   <param-value>org.archive.wayback.resourceindex.NutchResourceIndex
>   </param-value>
>   <description>Class that implements ResourceIndex for this
>   Wayback</description>
> </context-param>
>
> <context-param>
>   <param-name>resourceindex.baseurl</param-name>
>   <param-value>>http://chaz.hul.harvard.edu:10622/nutchwax
>   </param-value>
>   <description>absolute URL to Nutch server</description>
> </context-param>
>
> <context-param>
>   <param-name>maxresults</param-name>
>   <param-value>1000</param-value>
>   <description>
>     Maximum number of results to return from the ResourceIndex.
>   </description>
> </context-param>
>
> With the current setup, I can perform a full-text query using nutchwax and
> the result links seem to be of the correct form:
> http://[host]:[port]/wayback/[date]/[uri]. But when I click on a link, I get
> the error:
>
> Index not available
>
> *Unexpected SAX: White spaces are required between publicId and systemId.*
>
> in catalina.out, the stack trace is:
>
> [Fatal Error] ?query=date%3A19960101000000-20070919221459+exacturl%3Ahttp%3A%2F%2Fwww.aandw.net%2Fandrea%2Fhowtos%2Fsde_notes.txt&sort=date&reverse=true&hitsPerPage=10&start=0&dedupField=site&hitsPerDup=10&hitsPerSite=10:1:63:White spaces are required between publicId and systemId.
> org.xml.sax.SAXParseException: White spaces are required between publicId and systemId.
>     at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:264)
> 2007-09-19 18:14:59,244 INFO WaxDateQueryFilter - Found range date: 19960101000000, 20070919221459
>     at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)
>     at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:146)
>     at org.archive.wayback.resourceindex.NutchResourceIndex.getHttpDocument(NutchResourceIndex.java:348)
>     at org.archive.wayback.resourceindex.NutchResourceIndex.query(NutchResourceIndex.java:140)
>     at org.archive.wayback.replay.ReplayServlet.doGet(ReplayServlet.java:122)
>     ...
>
> if I set the resourceindex.baseurl property closer to the original value
> like this:
>
> <context-param>
>   <param-name>resourceindex.baseurl</param-name>
>   <param-value>http://chaz.hul.harvard.edu:10622/xmlquery
>   </param-value>
>   <description>absolute URL to Nutch server</description>
> </context-param>
>
> when I click on a result link, I get this error:
>
> Index not available *
> http://chaz.hul.harvard.edu:10622/xmlquery?query=date%3A19960101000000-20070919222516+exacturl%3Ahttp...
> *
>
> and the stack trace looks like this:
>
> INFO: initialized org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter
> java.io.FileNotFoundException: http://chaz.hul.harvard.edu:10622/xmlquery?query=date%3A19960101000000-20070919222516+exacturl%3Ahttp%3A%2F%2Fwww.aandw.net%2Fandrea%2Fhowtos%2Fsde_notes.txt&sort=date&reverse=true&hitsPerPage=10&start=0&dedupField=site&hitsPerDup=10&hitsPerSite=10
>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1147)
>     at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:973)
>     at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:184)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:798)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
>     at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
>     at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:250)
>     at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)
>     at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:146)
>     at org.archive.wayback.resourceindex.NutchResourceIndex.getHttpDocument(NutchResourceIndex.java:348)
>     at org.archive.wayback.resourceindex.NutchResourceIndex.query(NutchResourceIndex.java:140)
>     ...
>
> It seems like I have not configured the Remote-Nutch ResourceIndex
> properties correctly, but I don't have much to go on to try to correct it.
> Or perhaps I am not using nutchwax and wayback in the correct roles?
>
> Any help with this is greatly appreciated.
>
> Thanks,
>
> Chris
>
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
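
One way to test a candidate resourceindex.baseurl outside of Wayback is to rebuild the request shown in the catalina.out traces and check whether the response parses as XML at all. The SAX error quoted above is often a sign that the parser was handed something other than the XML it expected (an HTML page, for example), though that is only a guess for this setup. The following is a minimal Python sketch, not Wayback code: the helper names build_query_url and is_well_formed_xml are made up for illustration, and the parameter list is simply copied from the query string in the traces.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_query_url(base_url, url, start_date, end_date, hits=10):
    """Rebuild the kind of request seen in the catalina.out traces.

    Hypothetical helper: the parameter names and values are copied from
    the logged query string, not taken from the Wayback source.
    """
    params = [
        ("query", f"date:{start_date}-{end_date} exacturl:{url}"),
        ("sort", "date"),
        ("reverse", "true"),
        ("hitsPerPage", hits),
        ("start", 0),
        ("dedupField", "site"),
        ("hitsPerDup", hits),
        ("hitsPerSite", hits),
    ]
    # urlencode percent-encodes ':' and '/' and turns spaces into '+',
    # which matches the encoding visible in the logged URLs.
    return base_url.strip() + "?" + urlencode(params)


def is_well_formed_xml(body):
    """Return True if body parses as XML, False otherwise."""
    try:
        ET.fromstring(body)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(build_query_url(
        "http://chaz.hul.harvard.edu:10622/nutch/opensearch",
        "http://www.aandw.net/andrea/howtos/sde_notes.txt",
        "19960101000000",
        "20070919221459",
    ))
```

Fetching the printed URL with a browser, curl, or urllib.request.urlopen and passing the response body to is_well_formed_xml would show whether a given base URL answers with parseable XML or with something else, such as an HTML error page. A 404 at this step would correspond to the FileNotFoundException in the second trace.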