From: Ignacio G. <igc...@gm...> - 2007-09-28 12:32:52
Michael, I do not know if it failed on the same record... the first time it failed I assumed that increasing the -Xmx parameters would solve it, since the OOME has happened before when indexing with Wayback. I will try to narrow it as much as I can if it fails again.

On 9/27/07, Michael Stack <st...@du...> wrote:
> What John says and then
>
> + The OOME exception stack trace might tell us something.
> + Is the OOME always in same place processing same record? If so, take
> a look at it in the ARC.
>
> St.Ack
>
> [...]

From: Ignacio G. <igc...@gm...> - 2007-09-28 12:29:06
I had already increased the -Xmx to 2Gb, and it still failed.

For everything else I am using the default settings and following the "get started" guide on the nutchwax site, so I am using:

    sudo {HADOOP_HOME}/bin/hadoop jar {NUTCHWAX_HOME}/nutchwax.jar all input output collection

And I believe I am using mapreduce...

The number of ARCs is 3521, with an average size of 30Mb/ARC.

I am trying right now breaking the job into several chunks, to see if that helps. If it fails again I will grab as much information as I can as to when exactly it failed.

Thank you.

On 9/27/07, John H. Lee <jl...@ar...> wrote:
> Hi Ignacio.
>
> It would be helpful if you posted the following information:
> [...]

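Splitting the import into chunks, as described above, comes down to splitting the list of ARC locations that the job reads into smaller manifests and running one import per manifest. A generic Java sketch of that split follows; it is not part of NutchWAX, and the file names and chunk size are illustrative only.

    // Generic sketch: split one manifest of ARC paths into smaller manifests
    // so each import job handles a more modest batch of ARCs.
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class SplitArcManifest {
        public static void main(String[] args) throws IOException {
            // "all-arcs.txt" is a hypothetical file listing one ARC path per line.
            List<String> arcs = Files.readAllLines(Paths.get("all-arcs.txt"));
            int chunkSize = 1000; // e.g. 1000-5000 ARCs per job, per the advice in this thread
            for (int i = 0; i < arcs.size(); i += chunkSize) {
                int end = Math.min(i + chunkSize, arcs.size());
                try (PrintWriter out = new PrintWriter("arcs-chunk-" + (i / chunkSize) + ".txt")) {
                    for (String arc : arcs.subList(i, end)) {
                        out.println(arc);
                    }
                }
            }
        }
    }
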
From: Michael S. <st...@du...> - 2007-09-27 22:47:42
What John says and then

+ The OOME exception stack trace might tell us something.
+ Is the OOME always in same place processing same record? If so, take a look at it in the ARC.

St.Ack

John H. Lee wrote:
> Hi Ignacio.
>
> It would be helpful if you posted the following information:
> [...]

From: John H. L. <jl...@ar...> - 2007-09-27 22:38:42
Hi Ignacio.

It would be helpful if you posted the following information:
- Are you using standalone or mapreduce?
- If mapreduce, what are your mapred.map.tasks and mapred.reduce.tasks properties set to?
- If mapreduce, how many slaves do you have and how much memory do they have?
- How many ARCs are you trying to index?
- Did the map reach 100% completion before the failure occurred?

Some things you may want to try:
- Set both -Xms and -Xmx to the maximum available on your systems
- Increase one or both of mapred.map.tasks and mapred.reduce.tasks, depending where the failure occurred
- Break your job up into smaller chunks of say, 1000 or 5000 ARCs

-J

On Sep 27, 2007, at 10:47 AM, Ignacio Garcia wrote:
> Hello,
>
> I've been doing some testing with nutchwax and I have never had any
> major problems.
> [...]

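For reference, the heap and task-count settings mentioned above are ordinary Hadoop job properties, normally edited in hadoop-site.xml; a minimal sketch of setting the same properties programmatically on a Hadoop 0.x-era JobConf, with purely illustrative values rather than recommendations from this thread:

    import org.apache.hadoop.mapred.JobConf;

    public class NutchwaxJobTuning {
        public static JobConf applyTuning(JobConf conf) {
            // Heap for each child task JVM; equivalent to setting
            // mapred.child.java.opts in hadoop-site.xml.
            conf.set("mapred.child.java.opts", "-Xms512m -Xmx2048m"); // illustrative sizes
            // More, smaller tasks spread the per-task memory load.
            conf.setNumMapTasks(100);    // illustrative value
            conf.setNumReduceTasks(28);  // illustrative value
            return conf;
        }
    }
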
From: Ignacio G. <igc...@gm...> - 2007-09-27 17:47:30
Hello,

I've been doing some testing with nutchwax and I have never had any major problems. However, right now I am trying to index a collection that is over 100 Gb big, and for some reason the indexing is crashing while it tries to populate 'crawldb'.

The job will run fine at the beginning, importing the information from the ARCs and creating the "segments" section.

The error I get is an outOfMemory error when the system is processing each of the part.xx in the segments previously created.

I tried increasing the following setting on the hadoop-default.xml config file: mapred.child.java.opts to 1GB, but it still failed in the same part.

Is there any way to reduce the amount of memory used by nutchwax/hadoop to make the process more efficient and be able to index such a collection?

Thank you.

From: Brad T. <br...@ar...> - 2007-09-27 01:01:02
Hi Chris,

I can't access your nutch service, so am unable to provide very detailed assistance. One quick thing to test is changing:

    http://chaz.hul.harvard.edu:10622/xmlquery

to

    http://chaz.hul.harvard.edu:10622/nutch/opensearch

As far as which components should be doing what -- NutchWax and Wayback have drifted a little bit from the point when they were integrated so that Wayback could utilize a NutchWax index as its ResourceIndex. Performance issues with the NutchWax index motivated us to:

1) build a Wayback installation with its own index, either CDX or BDB
2) modify search.jsp as you've done already, so links generated by NutchWax search result pages point into the wayback installation.

I'm working with John Lee, who is currently running the NutchWax project, to get a better answer on how this will work going forward.

Brad

Chris Vicary wrote:
> Hi,
>
> I am attempting to render nutchwax full text search results using the
> open-source wayback machine.
> [...]

From: Chris V. <cv...@gm...> - 2007-09-19 22:45:21
Hi,

I am attempting to render nutchwax full text search results using the open-source wayback machine. I have installed hadoop, nutchwax (0.10.0) and wayback (0.8.0) - wayback and nutchwax are deployed in the same tomcat. Creating and searching full-text indexes of arc files using nutchwax works fine. Unfortunately, I have been unsuccessful in rendering the result resources. I attempted to follow the instructions for Wayback-NutchWAX at http://archive-access.sourceforge.net/projects/nutch/wayback.html, but the instructions seem to be based on an older version of wayback, and some of the changes specified for wayback's web.xml do not apply to the newest wayback version.

The errors encountered depend on the configuration values I use, so here's a rundown of the properties:

hadoop-site.xml:

searcher.dir points to a local nutchwax "outputs" directory (/tmp/outputs). wax.host points to the host and port of the tomcat installation; it does not include wayback context information (just host:port, chaz.hul.harvard.edu:10622).

search.jsp:

made the change:

    <   String archiveCollection = detail.getValue("collection");
    ---
    >   String archiveCollection = "wayback"; // detail.getValue("collection");

wayback/WEB-INF/web.xml:

The changes required for web.xml are to "[disable] wayback indexing of ARCS, [comment] out the PipeLineFilter, and [enable] the Remove-Nutch ResourceIndex option".

The Local-ARC ResourceStore option is enabled, and all others are disabled. resourcestore.autoindex is set to 0, and all physical paths have been checked for accuracy.

I was unable to find any reference to PipeLineFilter, so there was no need to comment it out.

I enabled the Remote-Nutch ResourceIndex option, and disabled all other ResourceIndex options. The Remote-Nutch option values are:

    <context-param>
      <param-name>resourceindex.classname</param-name>
      <param-value>org.archive.wayback.resourceindex.NutchResourceIndex
      </param-value>
      <description>Class that implements ResourceIndex for this Wayback</description>
    </context-param>

    <context-param>
      <param-name>resourceindex.baseurl</param-name>
      <param-value>>http://chaz.hul.harvard.edu:10622/nutchwax
      </param-value>
      <description>absolute URL to Nutch server</description>
    </context-param>

    <context-param>
      <param-name>maxresults</param-name>
      <param-value>1000</param-value>
      <description>
        Maximum number of results to return from the ResourceIndex.
      </description>
    </context-param>

With the current setup, I can perform a full-text query using nutchwax and the result links seem to be of the correct form: http://[host]:[port]/wayback/[date]/[uri]. But when I click on a link, I get the error:

    Index not available
    Unexpected SAX: White spaces are required between publicId and systemId.

In catalina.out, the stack trace is:

    [Fatal Error] ?query=date%3A19960101000000-20070919221459+exacturl%3Ahttp%3A%2F%2Fwww.aandw.net%2Fandrea%2Fhowtos%2Fsde_notes.txt&sort=date&reverse=true&hitsPerPage=10&start=0&dedupField=site&hitsPerDup=10&hitsPerSite=10:1:63:White spaces are required between publicId and systemId.
    org.xml.sax.SAXParseException: White spaces are required between publicId and systemId.
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:264)
    2007-09-19 18:14:59,244 INFO WaxDateQueryFilter - Found range date: 19960101000000, 20070919221459
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:146)
        at org.archive.wayback.resourceindex.NutchResourceIndex.getHttpDocument(NutchResourceIndex.java:348)
        at org.archive.wayback.resourceindex.NutchResourceIndex.query(NutchResourceIndex.java:140)
        at org.archive.wayback.replay.ReplayServlet.doGet(ReplayServlet.java:122)
        ...

If I set the resourceindex.baseurl property closer to the original value, like this:

    <context-param>
      <param-name>resourceindex.baseurl</param-name>
      <param-value>http://chaz.hul.harvard.edu:10622/xmlquery
      </param-value>
      <description>absolute URL to Nutch server</description>
    </context-param>

then when I click on a result link, I get this error:

    Index not available
    http://chaz.hul.harvard.edu:10622/xmlquery?query=date%3A19960101000000-20070919222516+exacturl%3Ahttp...

and the stack trace looks like this:

    INFO: initialized org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter
    java.io.FileNotFoundException: http://chaz.hul.harvard.edu:10622/xmlquery?query=date%3A19960101000000-20070919222516+exacturl%3Ahttp%3A%2F%2Fwww.aandw.net%2Fandrea%2Fhowtos%2Fsde_notes.txt&sort=date&reverse=true&hitsPerPage=10&start=0&dedupField=site&hitsPerDup=10&hitsPerSite=10
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1147)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:973)
        at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:184)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:798)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:250)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:146)
        at org.archive.wayback.resourceindex.NutchResourceIndex.getHttpDocument(NutchResourceIndex.java:348)
        at org.archive.wayback.resourceindex.NutchResourceIndex.query(NutchResourceIndex.java:140)
        ...

It seems like I have not configured the Remote-Nutch ResourceIndex properties correctly, but I don't have much to go on to try to correct it. Or perhaps I am not using nutchwax and wayback in the correct roles?

Any help with this is greatly appreciated.

Thanks,

Chris

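For orientation, the <context-param> entries above are read back by the deployed servlet through the standard servlet API; the sketch below is not the actual Wayback code, just an illustration of how such values are consumed. Note that whatever appears between the <param-value> tags, including a stray leading '>', is returned verbatim as part of the string.

    import javax.servlet.ServletConfig;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;

    // Illustrative servlet showing how web.xml context-params reach the code.
    public class ContextParamSketch extends HttpServlet {
        private String resourceIndexClass;
        private String resourceIndexBaseUrl;
        private int maxResults;

        @Override
        public void init(ServletConfig config) throws ServletException {
            super.init(config);
            // Names match the <param-name> entries in the web.xml excerpt above.
            resourceIndexClass = getServletContext().getInitParameter("resourceindex.classname");
            resourceIndexBaseUrl = getServletContext().getInitParameter("resourceindex.baseurl");
            maxResults = Integer.parseInt(getServletContext().getInitParameter("maxresults"));
        }
    }
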
From: alexis a. <alx...@ya...> - 2007-09-05 03:04:44
Hi Sverre,

Thanks for confirming this fix. I was able to figure this out a couple of days ago and was testing it out. Unfortunately, I was missing some versions so I have to redo the entire indexing process.

Best Regards,
Alex

Sverre Bang <sve...@nb...> wrote:
> Hi Alex,
>
> I have looked into the Wera/Nutchwax incompatibility.
> [...]

From: Sverre B. <sve...@nb...> - 2007-09-04 13:33:57
Hi Alex,

I have looked into the Wera/Nutchwax incompatibility. It seems that the element nutch:arcdate returned by nutchwax has changed its name to nutch:tstamp. Since I'm not involved in the nutchwax development, I can't say when (and why) this happened.

Anyway, to patch wera download the file

    http://archive-access.cvs.sourceforge.net/*checkout*/archive-access/archive-access/projects/wera/src/webapps/wera/lib/seal/nutch.inc?revision=1.10

and replace the one in your wera installation (in <wera_inst_dir>/lib/seal/). The patch will support nutchwax released before and after the switch from arcdate to tstamp.

Please be aware that every different version of a url must be in a separate nutchwax collection, and deduplication must be skipped (all the versions of a particular url in the same collection would be treated as duplicates regardless of dedup or not). See http://archive-access.sourceforge.net/projects/nutch/faq.html#dedup

Regards
Sverre

On Wed, 2007-08-15 at 04:00 -0700, alexis artes wrote:
> Hi,
>
> Has anybody tried using Nutchwax0.10 and WERA together?
> [...]

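The compatibility idea behind the patch is simply to prefer the new element name and fall back to the old one when parsing the NutchWAX response. A minimal Java DOM sketch of that idea follows; the actual fix is the PHP nutch.inc file linked above, and this class is only an illustration.

    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class TstampFallback {
        // Prefer the newer nutch:tstamp element, fall back to nutch:arcdate.
        static String captureDate(Element item) {
            String value = firstText(item, "nutch:tstamp");   // newer NutchWAX
            if (value == null) {
                value = firstText(item, "nutch:arcdate");      // older NutchWAX
            }
            return value;
        }

        private static String firstText(Element parent, String tag) {
            NodeList nodes = parent.getElementsByTagName(tag);
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
        }
    }
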
From: alexis a. <alx...@ya...> - 2007-08-15 11:00:48
Hi,

Has anybody tried using Nutchwax0.10 and WERA together? We are encountering this problem: The resultset array obtained from documentLocator->findVersions() does not have the date field for all the files found. Wera will still be able to display the page but the timeline will be all messed up.

Was there any API change in Nutchwax0.10 concerning the searching of the index or delivery of resultset?

Best Regards,
Alex

From: Michael S. <st...@du...> - 2007-08-02 15:53:34
How large are the indices? Would deploying them side-by-side work for you? See the last paragraph at the end of this FAQ, http://archive-access.sourceforge.net/projects/nutchwax/faq.html#incremental, for pointers on how.

St.Ack

pangang wrote:
> Hello:
> We had finished the data indexing of 2006, and we have now done part
> of the 2007 data indexing. The problem we face is how these two parts
> of the data can be combined.
> [...]

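If a single physical index is wanted rather than side-by-side deployment, the Lucene-level part of the work is an addIndexes call; a hedged sketch assuming the Lucene 2.x API of that era, with illustrative paths. Note this only merges the Lucene index, not the Nutch segments, crawldb or linkdb, which is part of why the FAQ's side-by-side approach is usually the simpler route.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeYearIndexes {
        public static void main(String[] args) throws Exception {
            // Destination index; path is an illustrative example.
            Directory merged = FSDirectory.getDirectory("/search/index-merged");
            IndexWriter writer = new IndexWriter(merged, new StandardAnalyzer(), true);
            // Merge the per-year Lucene indexes into the destination.
            writer.addIndexes(new Directory[] {
                FSDirectory.getDirectory("/search/index-2006"),
                FSDirectory.getDirectory("/search/index-2007"),
            });
            writer.optimize();
            writer.close();
        }
    }
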
From: pangang <pa...@12...> - 2007-08-02 08:51:06
Hello:

We use the Hadoop external JAR of NutchWAX 0.10.0. We had finished the data indexing of 2006, and we have now done part of the 2007 data indexing. The problem we face is how these two parts of the data can be combined, so that the data indexing of 2006 and the part of 2007 can be used by searchers!

Thank you. We look forward to your answer.

MSN...@12...

From: Michael S. <st...@du...> - 2007-07-23 16:28:28
I do not believe such an index conversion tool exists (check the nutch list). Even if it did, I'd suggest you'd spend so much CPU running the conversion of index and supporting segments, you might as well start over (new nutch/hadoop runs much faster... about four times faster). Starting over, you can be sure of the process, more sure than you can be of a little-tested transform, and you will pick up improvements made since old nutch.

The ClassCastException in the below is because old nutchwax used an UTF8 class to represent Strings, a class since replaced by the Text class (your new nutch frontend is trying to use Text to represent a UTF8 class read from segment directories, I'm guessing).

St.Ack

Xavier Torelló wrote:
> Hi,
>
> First of all, thanks for your quick response :)
> [...]

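A quick way to see which era wrote a given segment is to ask its SequenceFiles which key class they were recorded with; old NutchWAX data reports org.apache.hadoop.io.UTF8 and newer data reports org.apache.hadoop.io.Text. The small sketch below uses the classic Hadoop API and is not part of NutchWAX; the path argument is whatever segment data file you point it at.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;

    public class KeyClassCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Reader reader =
                new SequenceFile.Reader(fs, new Path(args[0]), conf);
            // Records written with the old UTF8 key class cannot simply be
            // cast to Text when read back by newer code, hence the exception.
            System.out.println("stored key class: " + reader.getKeyClass().getName());
            reader.close();
        }
    }
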
From: <xto...@ce...> - 2007-07-23 07:22:00
Hi,

First of all, thanks for your quick response :)

The re-index option is not viable, since it is an expensive process considering that we have about 150gb in indices.

John talked about the option of converting the indexes. Does somebody know how to do this process?

Finally, the exception that appears when we try to make a request to nutchwax (via opensearch):

    java.lang.RuntimeException: java.lang.ClassCastException: org.apache.hadoop.io.Text
        org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:204)
        org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:344)
        org.archive.access.nutch.NutchwaxBean.getSummary(NutchwaxBean.java:52)
        org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:156)
        org.archive.access.nutch.NutchwaxOpenSearchServlet.doGet(NutchwaxOpenSearchServlet.java:76)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

Thanks,

--
xt

From: John H. L. <jl...@ar...> - 2007-07-20 20:23:02
Hi Xavier.

I believe the index format changed between those versions, so you may need to re-index your documents or convert the indices somehow.

If you send the stack traces associated with the two exceptions, you have a much better chance of getting a useful response from the list.

-J

On Jul 20, 2007, at 6:03 AM, Xavier Torelló wrote:
> Hi!
>
> We're currently upgraded nutch-wax from 0.7v to 0.10.0v.
> [...]

From: <xto...@ce...> - 2007-07-20 13:03:36
Hi!

We've currently upgraded nutch-wax from 0.7 to 0.10.0. When we try to make a query using the indexes created via hadoop 0.5, the process breaks showing two java exceptions.

Is it essential that we re-run the indexing process on the jobs created by Heritrix using the latest version of hadoop?

Does any procedure exist to update the format of the indexes so they work with the latest version of nutch-wax?

Thanks a lot.

Regards,
--
xt

From: alexis a. <alx...@ya...> - 2007-06-28 05:17:55
Hi,

We are encountering a new set of errors aside from the socket time out. Subsequent runs produce the following errors, which result in "Job Failed". We hope you can guide us in this issue.

    2007-06-27 08:48:04,001 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_0003_m_000115_0: java.io.IOException: Could not obtain block: blk_-8188170094415436519 file=/user/outputs/segments/20070626172746-test/parse_data/part-00023/data offset=33845248
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:563)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:675)
        at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:170)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
        at java.io.DataInputStream.readFully(DataInputStream.java:176)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:404)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:330)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:371)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:58)
        at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:183)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:49)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:195)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1075)

    2007-06-27 08:48:31,758 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_0003_m_000154_0' has been lost.
    2007-06-27 08:48:31,794 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_0003_m_000154_1' to tip tip_0003_m_000154, for tracker 'tracker_orange.com:50050'
    2007-06-27 08:48:32,544 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_0003_m_000146_0: java.lang.RuntimeException: Summer buffer overflow b.len=4096, off=0, summed=512, read=4096, bytesPerSum=1, inSum=512
        at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:100)
        at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:170)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
        at java.io.DataInputStream.readFully(DataInputStream.java:176)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:404)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:330)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:371)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:58)
        at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:183)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:49)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:195)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1075)
    Caused by: java.lang.ArrayIndexOutOfBoundsException
        at java.util.zip.CRC32.update(Unknown Source)
        at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:98)
        ... 15 more

alexis artes <alx...@ya...> wrote:
> Hi,
>
> We are having problems in doing an incremental indexing.
> [...]

From: alexis a. <alx...@ya...> - 2007-06-22 10:35:39
Hi,

We are having problems doing an incremental indexing. We initially indexed 3000 arcfiles and were trying to index 3000 more arcfiles when we encountered the following error.

2007-06-19 02:49:25,135 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_0001_r_000035_0: java.net.SocketTimeoutException: timed out waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:312)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
at org.apache.hadoop.dfs.$Proxy1.complete(Unknown Source)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1126)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at org.apache.hadoop.fs.FSDataOutputStream$Summer.close(FSDataOutputStream.java:97)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:160)
at org.apache.hadoop.io.MapFile$Writer.close(MapFile.java:118)
at org.archive.access.nutch.ImportArcs$WaxFetcherOutputFormat$1.close(ImportArcs.java:687)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:281)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1075)
We are using 28 nodes. Our configuration in hadoop-site.xml is as follows:
<property>
<name>fs.default.name</name>
<value>apple001:9000</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>apple001:9001</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/opt/hadoop-0.5.0/filesystem/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/opt/hadoop-0.5.0/filesystem/data</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/opt/hadoop-0.5.0/filesystem/mapreduce/local</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/opt/hadoop-0.5.0/temp/hadoop/mapred/system</value>
<description>The shared directory where MapReduce stores control files.
</description>
</property>
<property>
<name>mapred.temp.dir</name>
<value>/opt/hadoop-0.5.0/temp/hadoop/mapred/temp</value>
<description>A shared directory for temporary files.
</description>
</property>
<property>
<name>mapred.map.tasks</name>
<value>89</value>
<description>
define mapred.map tasks to be number of slave hosts
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>53</value>
<description>
define mapred.reduce tasks to be number of slave hosts
</description>
</property>
<property>
<name>mapred.tasktracker.tasks.maximum</name>
<value>2</value>
<description>The maximum number of tasks that will be run
simultaneously by a task tracker.
</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
Moreover, what is the maximum number of arc files that can be indexed in the same batch? We tried 6000 but we encountered errors.
Our System Configuration:
Scientific Linux CERN
2.4.21-32.0.1.EL.cernsmp
JDK1.5
Hadoop0.5
Nutchwax0.8
Best Regards,
Alex
From: John H. L. <jl...@ar...> - 2007-06-20 14:54:10
Hi Alexis.

NutchWAX 0.10.0 has lots of bug fixes and improvements over 0.8.0, so you may want to start by upgrading your installation.

Does your job complete any tasks before you see this error? Do you see any other errors in the logs? Specifically, do you see a BindException when you start-all.sh?

The more ARCs you index in a single job, the larger heap space you'll need both during indexing and during deployment. This depends, of course, on how much text is contained in the documents within the ARCs. I've been able to index and deploy batches of 12,000 ARCs with heap spaces around 3200m on 4GB machines.

Hope this helps.

-J

On Jun 20, 2007, at 4:19 AM, alexis artes wrote:
> Hi,
>
> We are having problems in doing an incremental indexing.
> [...]

From: alexis a. <alx...@ya...> - 2007-06-20 11:20:06
Hi,

We are having problems doing an incremental indexing. We initially indexed 3000 arcfiles and were trying to index 3000 more arcfiles when we encountered the following error.

2007-06-19 02:49:25,135 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_0001_r_000035_0: java.net.SocketTimeoutException: timed out waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:312)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
at org.apache.hadoop.dfs.$Proxy1.complete(Unknown Source)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1126)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at org.apache.hadoop.fs.FSDataOutputStream$Summer.close(FSDataOutputStream.java:97)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:160)
at org.apache.hadoop.io.MapFile$Writer.close(MapFile.java:118)
at org.archive.access.nutch.ImportArcs$WaxFetcherOutputFormat$1.close(ImportArcs.java:687)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:281)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1075)
We are using 28 nodes. Our configuration in hadoop-site.xml is as follows:
<property>
<name>fs.default.name</name>
<value>apple001:9000</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>apple001:9001</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/opt/hadoop-0.5.0/filesystem/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/opt/hadoop-0.5.0/filesystem/data</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/opt/hadoop-0.5.0/filesystem/mapreduce/local</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/opt/hadoop-0.5.0/temp/hadoop/mapred/system</value>
<description>The shared directory where MapReduce stores control files.
</description>
</property>
<property>
<name>mapred.temp.dir</name>
<value>/opt/hadoop-0.5.0/temp/hadoop/mapred/temp</value>
<description>A shared directory for temporary files.
</description>
</property>
<property>
<name>mapred.map.tasks</name>
<value>89</value>
<description>
define mapred.map tasks to be number of slave hosts
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>53</value>
<description>
define mapred.reduce tasks to be number of slave hosts
</description>
</property>
<property>
<name>mapred.tasktracker.tasks.maximum</name>
<value>2</value>
<description>The maximum number of tasks that will be run
simultaneously by a task tracker.
</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
Moreover, what is the maximum number of arc files that can be indexed in the same batch? We tried 6000 but we encountered errors.
Best Regards,
Alex
From: Gordon M. <go...@ar...> - 2007-06-08 18:49:55
|
Jim Dixon wrote:
> Is there anywhere a description of how the archives are structured? I
> believe that there is some degree of replication (between San Francisco,
> the Netherlands, and Alexandria in Egypt) and then a multi-tiered
> indexing system.
>
> Apologies if I somehow overlooked this, but there doesn't seem to be any
> information on the subject in the email archives or anywhere else.

There is not a good public writeup, but the broad outlines of the web
archive can be described:

- Web captures are stored in ARC files, essentially verbatim transcripts
of HTTP responses with a single line of per-response metadata including
date of capture and server IP address, concatenated together into files
of 100MB.

- As ARCs are brought in, from various crawls, they land on any of 1000+
machines at IA's US facility, based on which machine has space. (So,
contemporaneous ARCs usually land on the same banks of machines, but
there is no enforced mapping.) The machines are 4-hard-drive 1U
commodity linux machines, with plain independent disks and regular
filesystems.

- Sometimes, as with data collected in partnership with Alexa's crawling,
this material arrives 3-6 months after crawling. One master inventory
database remembers where the ARC is by its initial copy-in; other
inventory systems survey and verify actual machine contents at
occasional intervals.

- At occasional intervals (but again sometimes months after ARC arrival)
all new ARCs are scanned for the URL+date captures they contain, and
their contents are merged into a master index of holdings, which is
roughly:

  URL timestamp response-code tiny-checksum ARC-file offset-in-ARC

This master index is a flat file, one line per URL+date capture, split
in hundreds of shards across many machines. (It currently contains over
85 billion lines and will soon go over 100 billion.) When this merge
happens is when new material appears in the Wayback Machine.

- Wayback Machine requests to list holdings of a particular URL consult
contiguous ranges of this master index.

- Wayback Machine requests to view an exact URL+date (or, most often,
nearest-to-URL+date) seek a single best-match line in this master index,
then find which machine(s) currently hold that ARC, then contact that
machine for just that capture via an HTTP range request into the ARC.

- In 2002, the library of Alexandria received a complete mirror of the
data through part of 2001. In 2006, they again received a complete
mirror of the data through early 2006. At times, bi-directional patching
of each side's collection has occurred, but it is not currently an
automated process.

> Also, I understand that there are two versions of the wayback utility,
> a Java version in development, which is open source, and a Perl
> version, which is the one actually being used and which is closed
> source.
>
> Why is the Perl version closed source?

The legacy Wayback version relies on a mix of Perl and C code, was
co-developed with Alexa, and relies on some Alexa code we don't have
permission to put under a proper open source license. We could try to
replace just those parts, but there are other assumptions in the legacy
Wayback which limit its performance and extensibility. We wanted to
leave those behind, and so have been investing effort in the
open-source, Java Wayback project instead. The new code will replace the
legacy code on our public site this year.

- Gordon @ IA
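To make the flat-file index format concrete, a purely hypothetical line
(the URL, checksum, ARC name and offset below are invented for
illustration) might look like:

  www.example.com/page.html 20060321083015 200 K5XQW3JM IA-2006-03-sample.arc.gz 1234567

with the last two fields telling the Wayback front end which ARC file to
request and at what byte offset the range request should start.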
|
|
From: Jim D. <jd...@di...> - 2007-06-04 03:43:22
|
Is there anywhere a description of how the archives are structured? I
believe that there is some degree of replication (between San Francisco,
the Netherlands, and Alexandria in Egypt) and then a multi-tiered
indexing system.

Apologies if I somehow overlooked this, but there doesn't seem to be any
information on the subject in the email archives or anywhere else.

Also, I understand that there are two versions of the wayback utility, a
Java version in development, which is open source, and a Perl version,
which is the one actually being used and which is closed source. Why is
the Perl version closed source?

--
Jim Dixon jd...@gm...
cellphone 415 / 570 3608 |
|
From: Brad T. <br...@ar...> - 2007-05-21 19:25:04
|
There's pretty significant work going into the Wayback right now to
simplify configuration of multiple collections.
When completed, this should also minimize server resources used, so it
should be possible to host hundreds of collections on a modest server.
Before the next release is available, this can be accomplished using
multiple servlet contexts, and CDX files:
1) create a CDX file for each individual collection you want to be able
to search independently
2) deploy the war under the webapps directory with the name "COLLECTION.war"
3) edit the web.xml under the COLLECTION webapp, customizing the
ResourceIndex to use the appropriate CDX file for that collection with
the "resourceindex.cdxpaths" configuration parameter
4) to create an aggregate collection which searches multiple CDX files,
configure that collection to search all needed CDX files by separating
multiple CDX files with commas (",") in the "resourceindex.cdxpaths"
configuration parameter (see the sketch just after this list).
5) edit the WaybackUI.properties file under WEB-INF/classes to alter the
text displayed for the not-in-archive exception:
Exception.resourceNotInArchive.message=The Resource you requested is not
in this archive.
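As a concrete illustration of steps 3 and 4, the web.xml entry might look
roughly like the following sketch. The surrounding element (a
context-param here) and the CDX paths are assumptions, so check them
against the web.xml that ships inside the war:
<context-param>
  <param-name>resourceindex.cdxpaths</param-name>
  <!-- one CDX file for a single collection; separate several CDX files
       with commas to build an aggregate collection -->
  <param-value>/wayback/cdx/science.cdx,/wayback/cdx/sports.cdx</param-value>
</context-param>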
Let me know if you have problems or questions setting this up. We're
currently hosting dozens of collections on a single machine with 2GB of
RAM using this method. We use a simple shell script to generate and
customize each webapp based on a text file listing the collections needed.
With the new release, it will be possible to share rendering .jsp files
across multiple collections, which should simplify institution-level
.jsp customization, and to easily configure and use per-collection text
within those .jsp files.
Brad
Ignacio Garcia wrote:
> Hello,
>
> I have a question regarding collections within wayback.
> In the older perl versions, there was a way to specify different
> collections
> within wayback, and each collection will be handled as a separate set
> of arc
> files. Having specific messages identifying the collections and searching
> within collections only...
> I was wondering if the latest java versions have such functionality
> built
> in?
>
> What I'm trying to achieve is the following:
>
> Imagine I have a set of 100 arc files, and 25 are from crawls related
> with
> science magazine articles, 25 related with sports magazines and the
> other 50
> as misc. crawls.
> I would like to create 2 collections: One for science magazines and 1 for
> sport magazines.
> Once the collections are created, I would like to be able to search
> either
> ALL the arcs (100), or search by collection. I select one of the 2
> collections created and then only the specific arcs will be searched.
> Also, if I search for http://espn.magazine.com/* within the science
> magazines collection, and I get NO RESULTS, the message shown by wayback
> would have a specific message created for that particular collection,
> something like: No results within SCIENCE MAGAZINES collection.
>
> Since the old wayback was able to handle such configurations, I was
> wondering if this was still doable in the newest java versions, or if
> I need
> to modify the actual source code to fit my needs?
>
> Thank you.
|
|
From: Ignacio G. <igc...@gm...> - 2007-05-21 12:29:45
|
Hello,

I have a question regarding collections within wayback. In the older
perl versions, there was a way to specify different collections within
wayback, and each collection will be handled as a separate set of arc
files, having specific messages identifying the collections and
searching within collections only... I was wondering if the latest java
versions have such functionality built in?

What I'm trying to achieve is the following:

Imagine I have a set of 100 arc files, and 25 are from crawls related
with science magazine articles, 25 related with sports magazines and the
other 50 as misc. crawls. I would like to create 2 collections: One for
science magazines and 1 for sport magazines.

Once the collections are created, I would like to be able to search
either ALL the arcs (100), or search by collection. I select one of the
2 collections created and then only the specific arcs will be searched.
Also, if I search for http://espn.magazine.com/* within the science
magazines collection, and I get NO RESULTS, the message shown by wayback
would have a specific message created for that particular collection,
something like: No results within SCIENCE MAGAZINES collection.

Since the old wayback was able to handle such configurations, I was
wondering if this was still doable in the newest java versions, or if I
need to modify the actual source code to fit my needs?

Thank you. |
|
From: Brad T. <br...@ar...> - 2007-05-16 01:09:16
|
What are the memory settings for the Tomcat java process (-Xmx???m -Xms???m)? Have you tried increasing them? Brad |