From: Chris V. <cv...@gm...> - 2007-09-19 22:45:21
Hi,

I am attempting to render nutchwax full-text search results using the open-source wayback machine. I have installed hadoop, nutchwax (0.10.0) and wayback (0.8.0); wayback and nutchwax are deployed in the same tomcat. Creating and searching full-text indexes of arc files using nutchwax works fine. Unfortunately, I have been unsuccessful in rendering the result resources.

I attempted to follow the instructions for Wayback-NutchWAX at http://archive-access.sourceforge.net/projects/nutch/wayback.html, but the instructions seem to be based on an older version of wayback, and some of the changes specified for wayback's web.xml do not apply to the newest wayback version.

The errors encountered depend on the configuration values I use, so here's a rundown of the properties:

hadoop-site.xml:
  searcher.dir points to a local nutchwax "outputs" directory (/tmp/outputs)
  wax.host points to the host and port of the tomcat installation; it does not include wayback context information (just host:port, chaz.hul.harvard.edu:10622)

search.jsp:
  made the change:
  < String archiveCollection = detail.getValue("collection");
  ---
  > String archiveCollection = "wayback"; // detail.getValue("collection");

wayback/WEB-INF/web.xml:
  The changes required for web.xml are to "[disable] wayback indexing of ARCS, [comment] out the PipeLineFilter, and [enable] the Remote-Nutch ResourceIndex option".
  The Local-ARC ResourceStore option is enabled, and all others are disabled. resourcestore.autoindex is set to 0, and all physical paths have been checked for accuracy.
  I was unable to find any reference to PipeLineFilter, so there was no need to comment it out.
  I enabled the Remote-Nutch ResourceIndex option, and disabled all other ResourceIndex options.

The Remote-Nutch option values are:

  <context-param>
    <param-name>resourceindex.classname</param-name>
    <param-value>org.archive.wayback.resourceindex.NutchResourceIndex
    </param-value>
    <description>Class that implements ResourceIndex for this Wayback</description>
  </context-param>
  <context-param>
    <param-name>resourceindex.baseurl</param-name>
    <param-value>>http://chaz.hul.harvard.edu:10622/nutchwax
    </param-value>
    <description>absolute URL to Nutch server</description>
  </context-param>
  <context-param>
    <param-name>maxresults</param-name>
    <param-value>1000</param-value>
    <description>
      Maximum number of results to return from the ResourceIndex.
    </description>
  </context-param>

With the current setup, I can perform a full-text query using nutchwax and the result links seem to be of the correct form: http://[host]:[port]/wayback/[date]/[uri]. But when I click on a link, I get the error:

  Index not available
  Unexpected SAX: White spaces are required between publicId and systemId.

In catalina.out, the stack trace is:

  [Fatal Error] ?query=date%3A19960101000000-20070919221459+exacturl%3Ahttp%3A%2F%2Fwww.aandw.net%2Fandrea%2Fhowtos%2Fsde_notes.txt&sort=date&reverse=true&hitsPerPage=10&start=0&dedupField=site&hitsPerDup=10&hitsPerSite=10:1:63:White spaces are required between publicId and systemId.
  org.xml.sax.SAXParseException: White spaces are required between publicId and systemId.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:264)
  2007-09-19 18:14:59,244 INFO WaxDateQueryFilter - Found range date: 19960101000000, 20070919221459
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:146)
    at org.archive.wayback.resourceindex.NutchResourceIndex.getHttpDocument(NutchResourceIndex.java:348)
    at org.archive.wayback.resourceindex.NutchResourceIndex.query(NutchResourceIndex.java:140)
    at org.archive.wayback.replay.ReplayServlet.doGet(ReplayServlet.java:122)
    ...

If I set the resourceindex.baseurl property closer to the original value like this:

  <context-param>
    <param-name>resourceindex.baseurl</param-name>
    <param-value>http://chaz.hul.harvard.edu:10622/xmlquery
    </param-value>
    <description>absolute URL to Nutch server</description>
  </context-param>

when I click on a result link, I get this error:

  Index not available
  http://chaz.hul.harvard.edu:10622/xmlquery?query=date%3A19960101000000-20070919222516+exacturl%3Ahttp...

and the stack trace looks like this:

  INFO: initialized org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter
  java.io.FileNotFoundException: http://chaz.hul.harvard.edu:10622/xmlquery?query=date%3A19960101000000-20070919222516+exacturl%3Ahttp%3A%2F%2Fwww.aandw.net%2Fandrea%2Fhowtos%2Fsde_notes.txt&sort=date&reverse=true&hitsPerPage=10&start=0&dedupField=site&hitsPerDup=10&hitsPerSite=10
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1147)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:973)
    at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:184)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:798)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:250)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:146)
    at org.archive.wayback.resourceindex.NutchResourceIndex.getHttpDocument(NutchResourceIndex.java:348)
    at org.archive.wayback.resourceindex.NutchResourceIndex.query(NutchResourceIndex.java:140)
    ...

It seems like I have not configured the Remote-Nutch ResourceIndex properties correctly, but I don't have much to go on to try to correct it. Or perhaps I am not using nutchwax and wayback in the correct roles?

Any help with this is greatly appreciated.

Thanks,
Chris
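
A minimal diagnostic sketch, not part of the setup described above: the two traces suggest that wayback fetches resourceindex.baseurl plus a query string and hands the response to an XML parser, so the same request can be replayed outside of tomcat to see what each candidate base URL actually returns. This assumes a plain HTTP GET is all that is involved; the function names (build_query_url, classify_response, probe) are hypothetical, and the query parameters are copied from the catalina.out lines rather than from any wayback documentation. Python's parser is expat rather than Xerces, so the error wording will differ, but a response that is HTML instead of XML (for example, a DOCTYPE with a public identifier and no system identifier) fails in both.

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_query_url(base_url, target_url, start_date, end_date):
    """Rebuild the query URL seen in the catalina.out traces (parameter list copied from the logs)."""
    params = [
        ("query", f"date:{start_date}-{end_date} exacturl:{target_url}"),
        ("sort", "date"),
        ("reverse", "true"),
        ("hitsPerPage", "10"),
        ("start", "0"),
        ("dedupField", "site"),
        ("hitsPerDup", "10"),
        ("hitsPerSite", "10"),
    ]
    # strip() drops whitespace such as the line break before </param-value>;
    # whether wayback itself trims the value is an assumption.
    return base_url.strip() + "?" + urlencode(params)


def classify_response(status, body):
    """Map an HTTP status and body onto the two failure modes in the traces."""
    if status == 404:
        return "not-found"   # Java reports this as FileNotFoundException
    if status >= 400:
        return "http-error"
    try:
        ET.fromstring(body)
    except ET.ParseError:
        return "not-xml"     # e.g. an HTML page; Java reports SAXParseException
    return "xml"


def probe(base_url, target_url, start_date, end_date):
    """Fetch the query URL and classify what comes back."""
    url = build_query_url(base_url, target_url, start_date, end_date)
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return classify_response(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        return classify_response(err.code, b"")
```

Calling probe("http://chaz.hul.harvard.edu:10622/nutchwax", "http://www.aandw.net/andrea/howtos/sde_notes.txt", "19960101000000", "20070919221459") and then the same with the /xmlquery base URL would show which of the two cases each value falls into; a base URL is only a plausible candidate if the result is "xml".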