From: Chris V. <cv...@gm...> - 2007-09-19 22:45:21
Hi,

I am attempting to render nutchwax full-text search results using the open-source wayback machine. I have installed hadoop, nutchwax (0.10.0) and wayback (0.8.0); wayback and nutchwax are deployed in the same tomcat. Creating and searching full-text indexes of arc files using nutchwax works fine. Unfortunately, I have been unsuccessful in rendering the result resources. I attempted to follow the instructions for Wayback-NutchWAX at http://archive-access.sourceforge.net/projects/nutch/wayback.html, but the instructions seem to be based on an older version of wayback, and some of the changes specified for wayback's web.xml do not apply to the newest wayback version.

The errors encountered depend on the configuration values I use, so here's a rundown of the properties:

hadoop-site.xml:

searcher.dir points to a local nutchwax "outputs" directory (/tmp/outputs).
wax.host points to the host and port of the tomcat installation; it does not include wayback context information (just host:port, chaz.hul.harvard.edu:10622).

search.jsp:

made the change:

    < String archiveCollection = detail.getValue("collection");
    ---
    > String archiveCollection = "wayback"; // detail.getValue("collection");

wayback/WEB-INF/web.xml:

The changes required for web.xml are to "[disable] wayback indexing of ARCS, [comment] out the PipeLineFilter, and [enable] the Remote-Nutch ResourceIndex option".

The Local-ARC ResourceStore option is enabled, and all others are disabled. resourcestore.autoindex is set to 0, and all physical paths have been checked for accuracy.

I was unable to find any reference to PipeLineFilter, so there was no need to comment it out.

I enabled the Remote-Nutch ResourceIndex option, and disabled all other ResourceIndex options. The Remote-Nutch option values are:

    <context-param>
      <param-name>resourceindex.classname</param-name>
      <param-value>org.archive.wayback.resourceindex.NutchResourceIndex</param-value>
      <description>Class that implements ResourceIndex for this Wayback</description>
    </context-param>

    <context-param>
      <param-name>resourceindex.baseurl</param-name>
      <param-value>>http://chaz.hul.harvard.edu:10622/nutchwax</param-value>
      <description>absolute URL to Nutch server</description>
    </context-param>

    <context-param>
      <param-name>maxresults</param-name>
      <param-value>1000</param-value>
      <description>Maximum number of results to return from the ResourceIndex.</description>
    </context-param>

With the current setup, I can perform a full-text query using nutchwax, and the result links seem to be of the correct form: http://[host]:[port]/wayback/[date]/[uri]. But when I click on a link, I get the error:

    Index not available
    Unexpected SAX: White spaces are required between publicId and systemId.

In catalina.out, the stack trace is:

    [Fatal Error] ?query=date%3A19960101000000-20070919221459+exacturl%3Ahttp%3A%2F%2Fwww.aandw.net%2Fandrea%2Fhowtos%2Fsde_notes.txt&sort=date&reverse=true&hitsPerPage=10&start=0&dedupField=site&hitsPerDup=10&hitsPerSite=10:1:63:White spaces are required between publicId and systemId.
    org.xml.sax.SAXParseException: White spaces are required between publicId and systemId.
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:264)
    2007-09-19 18:14:59,244 INFO WaxDateQueryFilter - Found range date: 19960101000000, 20070919221459
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:146)
        at org.archive.wayback.resourceindex.NutchResourceIndex.getHttpDocument(NutchResourceIndex.java:348)
        at org.archive.wayback.resourceindex.NutchResourceIndex.query(NutchResourceIndex.java:140)
        at org.archive.wayback.replay.ReplayServlet.doGet(ReplayServlet.java:122)
        ...

If I set the resourceindex.baseurl property closer to the original value, like this:

    <context-param>
      <param-name>resourceindex.baseurl</param-name>
      <param-value>http://chaz.hul.harvard.edu:10622/xmlquery</param-value>
      <description>absolute URL to Nutch server</description>
    </context-param>

then when I click on a result link, I get this error:

    Index not available
    http://chaz.hul.harvard.edu:10622/xmlquery?query=date%3A19960101000000-20070919222516+exacturl%3Ahttp...

and the stack trace looks like this:

    INFO: initialized org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter
    java.io.FileNotFoundException: http://chaz.hul.harvard.edu:10622/xmlquery?query=date%3A19960101000000-20070919222516+exacturl%3Ahttp%3A%2F%2Fwww.aandw.net%2Fandrea%2Fhowtos%2Fsde_notes.txt&sort=date&reverse=true&hitsPerPage=10&start=0&dedupField=site&hitsPerDup=10&hitsPerSite=10
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1147)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:973)
        at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:184)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:798)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:250)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:146)
        at org.archive.wayback.resourceindex.NutchResourceIndex.getHttpDocument(NutchResourceIndex.java:348)
        at org.archive.wayback.resourceindex.NutchResourceIndex.query(NutchResourceIndex.java:140)
        ...

It seems like I have not configured the Remote-Nutch ResourceIndex properties correctly, but I don't have much to go on to try to correct it. Or perhaps I am not using nutchwax and wayback in the correct roles?

Any help with this is greatly appreciated.

Thanks,

Chris
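One possible reading of the first trace, offered as a sketch rather than a confirmed diagnosis: an XML parser raises "White spaces are required between publicId and systemId" when it is handed an HTML page whose DOCTYPE carries a public identifier but no system identifier, which would happen if the configured base URL returned an ordinary HTML page instead of XML results. The reported position 1:63 is consistent with where an HTML 4.01 Transitional DOCTYPE of that shape would end, but nothing in the thread confirms what the server actually returned. The class name `HtmlDoctypeProbe` and the sample page below are made up for illustration and are not part of Wayback or NutchWAX.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

// Illustrative only: hypothetical helper, not Wayback or NutchWAX code.
final class HtmlDoctypeProbe {

    // Made-up sample: an HTML 4.01 page whose DOCTYPE has a public ID
    // but no system ID, which XML does not allow.
    static final String SAMPLE_HTML =
        "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">"
        + "<html><head><title>search</title></head><body></body></html>";

    /** Returns the parser's complaint, or null if the body parsed as XML. */
    static SAXParseException parseAsXml(String body) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(
                body.getBytes(StandardCharsets.UTF_8)));
            return null;
        } catch (SAXParseException e) {
            // The same exception type that appears in the catalina.out trace.
            return e;
        } catch (Exception e) {
            // Configuration or I/O problems are not the case being probed.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SAXParseException e = parseAsXml(SAMPLE_HTML);
        System.out.println(e == null
            ? "parsed as XML"
            : "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                + ": " + e.getMessage());
    }
}
```

Running the same check against the body actually returned by the configured base URL would show whether the endpoint is serving HTML rather than XML.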
From: Brad T. <br...@ar...> - 2007-09-27 01:01:02
Hi Chris,

I can't access your nutch service, so I am unable to provide very detailed assistance. One quick thing to test is changing:

    http://chaz.hul.harvard.edu:10622/xmlquery

to

    http://chaz.hul.harvard.edu:10622/nutch/opensearch

As for which components should be doing what: NutchWax and Wayback have drifted a little from the point when they were integrated so that Wayback could use a NutchWax index as its ResourceIndex. Performance issues with the NutchWax index motivated us to:

1) build a Wayback installation with its own index, either CDX or BDB
2) modify search.jsp, as you've done already, so that links generated by NutchWax search result pages point into the Wayback installation.

I'm working with John Lee, who is currently running the NutchWax project, to get a better answer on how this will work going forward.

Brad
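The second step above amounts to having each search hit link to a URL of the form http://[host]:[port]/wayback/[date]/[uri]. A minimal sketch of that composition follows; it is not the actual search.jsp code, the class and method names are hypothetical, and the capture timestamp in `main` is invented for the example (the thread does not say when the page was captured).

```java
// Illustrative only: hypothetical helper, not taken from search.jsp.
final class WaybackLink {

    /**
     * Builds http://[host]:[port]/[collection]/[date]/[uri].
     *
     * @param waxHost     host:port of the tomcat instance (the wax.host value)
     * @param collection  webapp context that serves replay, e.g. "wayback"
     * @param date14      capture timestamp as 14 digits, yyyyMMddHHmmss
     * @param originalUrl the archived page's original URL
     */
    static String replayUrl(String waxHost, String collection,
                            String date14, String originalUrl) {
        if (!date14.matches("\\d{14}")) {
            throw new IllegalArgumentException(
                "expected a 14-digit timestamp, got: " + date14);
        }
        return "http://" + waxHost + "/" + collection + "/" + date14
            + "/" + originalUrl;
    }

    public static void main(String[] args) {
        // The timestamp below is a placeholder, not a real capture date.
        System.out.println(replayUrl(
            "chaz.hul.harvard.edu:10622", "wayback", "20040101120000",
            "http://www.aandw.net/andrea/howtos/sde_notes.txt"));
    }
}
```

Hard-coding the collection as "wayback", as in the search.jsp change quoted earlier, corresponds to passing a fixed `collection` argument here.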
From: Chris V. <cv...@gm...> - 2007-10-09 19:47:44
Hi Brad,

Thanks for the response. Unfortunately, you won't be able to access our servers due to IP restrictions. I tried your suggestion of pointing to nutch/opensearch (and nutchwax/opensearch) without success. There were no errors produced, but it didn't provide the functionality I am looking for either.

I am currently implementing full-text search and retrieval using a combination of NutchWax (and its index) for search and Wayback (with a separate CDX index) for retrieval. This works fine. I was hoping for a single-index solution, but it sounds like you are using the same technique.

If you learn anything new from the NutchWax team, please pass it on.

Thanks,

Chris
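When comparing candidate base URLs such as xmlquery and nutch/opensearch, it can help to request by hand the same URL that Wayback asked for and inspect what comes back. The sketch below rebuilds the request string that appears in the FileNotFoundException trace earlier in the thread. The parameter layout is copied from that log line, not from the NutchResourceIndex source, so treat it as an assumption; the class name `IndexQueryUrl` is hypothetical, and the `URLEncoder.encode(String, Charset)` overload used here needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative only: reconstructs the URL seen in the log; this is not
// the code Wayback itself uses to build the request.
final class IndexQueryUrl {

    /** Joins a base URL with a date-range and exact-URL query. */
    static String build(String baseUrl, String from14, String to14,
                        String exactUrl) {
        String query = "date:" + from14 + "-" + to14
            + " exacturl:" + exactUrl;
        // Form encoding turns ':' into %3A, '/' into %2F and ' ' into '+',
        // which matches the encoded query shown in catalina.out.
        return baseUrl + "?query="
            + URLEncoder.encode(query, StandardCharsets.UTF_8)
            + "&sort=date&reverse=true&hitsPerPage=10&start=0"
            + "&dedupField=site&hitsPerDup=10&hitsPerSite=10";
    }

    public static void main(String[] args) {
        String[] candidates = {
            "http://chaz.hul.harvard.edu:10622/xmlquery",
            "http://chaz.hul.harvard.edu:10622/nutch/opensearch",
        };
        for (String base : candidates) {
            System.out.println(build(base, "19960101000000", "20070919222516",
                "http://www.aandw.net/andrea/howtos/sde_notes.txt"));
        }
    }
}
```

Fetching each printed URL with a browser or command-line HTTP client shows whether a given endpoint answers at all and whether the response is XML.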