From: Ignacio G. <igc...@gm...> - 2007-09-28 12:32:52
Michael, I do not know if it failed on the same record... the first time it failed I assumed that increasing the -Xmx parameters would solve it, since the OOME has happened before when indexing with Wayback. I will try to narrow it as much as I can if it fails again.

On 9/27/07, Michael Stack <st...@du...> wrote:
> What John says and then
>
> + The OOME exception stack trace might tell us something.
> + Is the OOME always in same place processing same record? If so, take
> a look at it in the ARC.
>
> St.Ack
>
> [...]

From: Ignacio G. <igc...@gm...> - 2007-09-28 12:29:06
I had already increased the -Xmx to 2Gb, and it still failed.

For everything else I am using the default settings and following the "get started" guide on the nutchwax site, so I am using:

    sudo {HADOOP_HOME}/bin/hadoop jar {NUTCHWAX_HOME}/nutchwax.jar all input output collection

And I believe I am using mapreduce...

The number of ARCs is 3521, with an average size of 30Mb/ARC.

I am trying right now breaking the job into several chunks, to see if that helps. If it fails again I will grab as much information as I can as to when exactly it failed.

Thank you.

On 9/27/07, John H. Lee <jl...@ar...> wrote:
> Hi Ignacio.
>
> It would be helpful if you posted the following information:
> [...]

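Splitting the import into chunks, as described above, comes down to splitting the list of ARC locations that the job reads into smaller manifests and running one import per manifest. A generic Java sketch of that split follows; it is not part of NutchWAX, and the file names and chunk size are illustrative only.

    // Generic sketch: split one manifest of ARC paths into smaller manifests
    // so each import job handles a more modest batch of ARCs.
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class SplitArcManifest {
        public static void main(String[] args) throws IOException {
            // "all-arcs.txt" is a hypothetical file listing one ARC path per line.
            List<String> arcs = Files.readAllLines(Paths.get("all-arcs.txt"));
            int chunkSize = 1000; // e.g. 1000-5000 ARCs per job, per the advice in this thread
            for (int i = 0; i < arcs.size(); i += chunkSize) {
                int end = Math.min(i + chunkSize, arcs.size());
                try (PrintWriter out = new PrintWriter("arcs-chunk-" + (i / chunkSize) + ".txt")) {
                    for (String arc : arcs.subList(i, end)) {
                        out.println(arc);
                    }
                }
            }
        }
    }
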
From: Michael S. <st...@du...> - 2007-09-27 22:47:42
What John says and then

+ The OOME exception stack trace might tell us something.
+ Is the OOME always in same place processing same record? If so, take a look at it in the ARC.

St.Ack

John H. Lee wrote:
> Hi Ignacio.
>
> It would be helpful if you posted the following information:
> [...]

From: John H. L. <jl...@ar...> - 2007-09-27 22:38:42
Hi Ignacio.

It would be helpful if you posted the following information:
- Are you using standalone or mapreduce?
- If mapreduce, what are your mapred.map.tasks and mapred.reduce.tasks properties set to?
- If mapreduce, how many slaves do you have and how much memory do they have?
- How many ARCs are you trying to index?
- Did the map reach 100% completion before the failure occurred?

Some things you may want to try:
- Set both -Xms and -Xmx to the maximum available on your systems
- Increase one or both of mapred.map.tasks and mapred.reduce.tasks, depending where the failure occurred
- Break your job up into smaller chunks of say, 1000 or 5000 ARCs

-J

On Sep 27, 2007, at 10:47 AM, Ignacio Garcia wrote:
> Hello,
>
> I've been doing some testing with nutchwax and I have never had any
> major problems.
> [...]

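For reference, the heap and task-count settings mentioned above are ordinary Hadoop job properties, normally edited in hadoop-site.xml; a minimal sketch of setting the same properties programmatically on a Hadoop 0.x-era JobConf, with purely illustrative values rather than recommendations from this thread:

    import org.apache.hadoop.mapred.JobConf;

    public class NutchwaxJobTuning {
        public static JobConf applyTuning(JobConf conf) {
            // Heap for each child task JVM; equivalent to setting
            // mapred.child.java.opts in hadoop-site.xml.
            conf.set("mapred.child.java.opts", "-Xms512m -Xmx2048m"); // illustrative sizes
            // More, smaller tasks spread the per-task memory load.
            conf.setNumMapTasks(100);    // illustrative value
            conf.setNumReduceTasks(28);  // illustrative value
            return conf;
        }
    }
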
From: Ignacio G. <igc...@gm...> - 2007-09-27 17:47:30
Hello,

I've been doing some testing with nutchwax and I have never had any major problems. However, right now I am trying to index a collection that is over 100 Gb big, and for some reason the indexing is crashing while it tries to populate 'crawldb'.

The job will run fine at the beginning, importing the information from the ARCs and creating the "segments" section.

The error I get is an outOfMemory error when the system is processing each of the part.xx in the segments previously created.

I tried increasing the following setting on the hadoop-default.xml config file: mapred.child.java.opts to 1GB, but it still failed in the same part.

Is there any way to reduce the amount of memory used by nutchwax/hadoop to make the process more efficient and be able to index such a collection?

Thank you.

From: Brad T. <br...@ar...> - 2007-09-27 01:01:02
Hi Chris,

I can't access your nutch service, so am unable to provide very detailed assistance. One quick thing to test is changing:

    http://chaz.hul.harvard.edu:10622/xmlquery

to

    http://chaz.hul.harvard.edu:10622/nutch/opensearch

As far as which components should be doing what -- NutchWax and Wayback have drifted a little bit from the point when they were integrated so that Wayback could utilize a NutchWax index as its ResourceIndex. Performance issues with the NutchWax index motivated us to:

1) build a Wayback installation with its own index, either CDX or BDB
2) modify search.jsp as you've done already, so links generated by NutchWax search result pages point into the wayback installation.

I'm working with John Lee, who is currently running the NutchWax project, to get a better answer on how this will work going forward.

Brad

Chris Vicary wrote:
> Hi,
>
> I am attempting to render nutchwax full text search results using the
> open-source wayback machine.
> [...]

From: Chris V. <cv...@gm...> - 2007-09-19 22:45:21
Hi,

I am attempting to render nutchwax full text search results using the open-source wayback machine. I have installed hadoop, nutchwax (0.10.0) and wayback (0.8.0) - wayback and nutchwax are deployed in the same tomcat. Creating and searching full-text indexes of arc files using nutchwax works fine. Unfortunately, I have been unsuccessful in rendering the result resources. I attempted to follow the instructions for Wayback-NutchWAX at http://archive-access.sourceforge.net/projects/nutch/wayback.html, but the instructions seem to be based on an older version of wayback, and some of the changes specified for wayback's web.xml do not apply to the newest wayback version.

The errors encountered depend on the configuration values I use, so here's a rundown of the properties:

hadoop-site.xml:

searcher.dir points to a local nutchwax "outputs" directory (/tmp/outputs). wax.host points to the host and port of the tomcat installation; it does not include wayback context information (just host:port, chaz.hul.harvard.edu:10622).

search.jsp:

made the change:

    <   String archiveCollection = detail.getValue("collection");
    ---
    >   String archiveCollection = "wayback"; // detail.getValue("collection");

wayback/WEB-INF/web.xml:

The changes required for web.xml are to "[disable] wayback indexing of ARCS, [comment] out the PipeLineFilter, and [enable] the Remove-Nutch ResourceIndex option".

The Local-ARC ResourceStore option is enabled, and all others are disabled. resourcestore.autoindex is set to 0, and all physical paths have been checked for accuracy.

I was unable to find any reference to PipeLineFilter, so there was no need to comment it out.

I enabled the Remote-Nutch ResourceIndex option, and disabled all other ResourceIndex options. The Remote-Nutch option values are:

    <context-param>
      <param-name>resourceindex.classname</param-name>
      <param-value>org.archive.wayback.resourceindex.NutchResourceIndex
      </param-value>
      <description>Class that implements ResourceIndex for this Wayback</description>
    </context-param>

    <context-param>
      <param-name>resourceindex.baseurl</param-name>
      <param-value>>http://chaz.hul.harvard.edu:10622/nutchwax
      </param-value>
      <description>absolute URL to Nutch server</description>
    </context-param>

    <context-param>
      <param-name>maxresults</param-name>
      <param-value>1000</param-value>
      <description>
        Maximum number of results to return from the ResourceIndex.
      </description>
    </context-param>

With the current setup, I can perform a full-text query using nutchwax and the result links seem to be of the correct form: http://[host]:[port]/wayback/[date]/[uri]. But when I click on a link, I get the error:

    Index not available
    Unexpected SAX: White spaces are required between publicId and systemId.

In catalina.out, the stack trace is:

    [Fatal Error] ?query=date%3A19960101000000-20070919221459+exacturl%3Ahttp%3A%2F%2Fwww.aandw.net%2Fandrea%2Fhowtos%2Fsde_notes.txt&sort=date&reverse=true&hitsPerPage=10&start=0&dedupField=site&hitsPerDup=10&hitsPerSite=10:1:63:White spaces are required between publicId and systemId.
    org.xml.sax.SAXParseException: White spaces are required between publicId and systemId.
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:264)
    2007-09-19 18:14:59,244 INFO WaxDateQueryFilter - Found range date: 19960101000000, 20070919221459
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:146)
        at org.archive.wayback.resourceindex.NutchResourceIndex.getHttpDocument(NutchResourceIndex.java:348)
        at org.archive.wayback.resourceindex.NutchResourceIndex.query(NutchResourceIndex.java:140)
        at org.archive.wayback.replay.ReplayServlet.doGet(ReplayServlet.java:122)
        ...

If I set the resourceindex.baseurl property closer to the original value, like this:

    <context-param>
      <param-name>resourceindex.baseurl</param-name>
      <param-value>http://chaz.hul.harvard.edu:10622/xmlquery
      </param-value>
      <description>absolute URL to Nutch server</description>
    </context-param>

then when I click on a result link, I get this error:

    Index not available
    http://chaz.hul.harvard.edu:10622/xmlquery?query=date%3A19960101000000-20070919222516+exacturl%3Ahttp...

and the stack trace looks like this:

    INFO: initialized org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter
    java.io.FileNotFoundException: http://chaz.hul.harvard.edu:10622/xmlquery?query=date%3A19960101000000-20070919222516+exacturl%3Ahttp%3A%2F%2Fwww.aandw.net%2Fandrea%2Fhowtos%2Fsde_notes.txt&sort=date&reverse=true&hitsPerPage=10&start=0&dedupField=site&hitsPerDup=10&hitsPerSite=10
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1147)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:973)
        at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:184)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:798)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:250)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:146)
        at org.archive.wayback.resourceindex.NutchResourceIndex.getHttpDocument(NutchResourceIndex.java:348)
        at org.archive.wayback.resourceindex.NutchResourceIndex.query(NutchResourceIndex.java:140)
        ...

It seems like I have not configured the Remote-Nutch ResourceIndex properties correctly, but I don't have much to go on to try to correct it. Or perhaps I am not using nutchwax and wayback in the correct roles?

Any help with this is greatly appreciated.

Thanks,

Chris

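For orientation, the <context-param> entries above are read back by the deployed servlet through the standard servlet API; the sketch below is not the actual Wayback code, just an illustration of how such values are consumed. Note that whatever appears between the <param-value> tags, including a stray leading '>', is returned verbatim as part of the string.

    import javax.servlet.ServletConfig;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;

    // Illustrative servlet showing how web.xml context-params reach the code.
    public class ContextParamSketch extends HttpServlet {
        private String resourceIndexClass;
        private String resourceIndexBaseUrl;
        private int maxResults;

        @Override
        public void init(ServletConfig config) throws ServletException {
            super.init(config);
            // Names match the <param-name> entries in the web.xml excerpt above.
            resourceIndexClass = getServletContext().getInitParameter("resourceindex.classname");
            resourceIndexBaseUrl = getServletContext().getInitParameter("resourceindex.baseurl");
            maxResults = Integer.parseInt(getServletContext().getInitParameter("maxresults"));
        }
    }
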
From: alexis a. <alx...@ya...> - 2007-09-05 03:04:44
Hi Sverre,

Thanks for confirming this fix. I was able to figure this out a couple of days ago and was testing it out. Unfortunately, I was missing some versions so I have to redo the entire indexing process.

Best Regards,
Alex

Sverre Bang <sve...@nb...> wrote:
> Hi Alex,
>
> I have looked into the Wera/Nutchwax incompatibility.
> [...]

From: Sverre B. <sve...@nb...> - 2007-09-04 13:33:57
Hi Alex,

I have looked into the Wera/Nutchwax incompatibility. It seems that the element nutch:arcdate returned by nutchwax has changed its name to nutch:tstamp. Since I'm not involved in the nutchwax development, I can't say when (and why) this happened.

Anyway, to patch wera download the file

    http://archive-access.cvs.sourceforge.net/*checkout*/archive-access/archive-access/projects/wera/src/webapps/wera/lib/seal/nutch.inc?revision=1.10

and replace the one in your wera installation (in <wera_inst_dir>/lib/seal/). The patch will support nutchwax released before and after the switch from arcdate to tstamp.

Please be aware that every different version of a url must be in a separate nutchwax collection, and deduplication must be skipped (all the versions of a particular url in the same collection would be treated as duplicates regardless of dedup or not). See http://archive-access.sourceforge.net/projects/nutch/faq.html#dedup

Regards
Sverre

On Wed, 2007-08-15 at 04:00 -0700, alexis artes wrote:
> Hi,
>
> Has anybody tried using Nutchwax0.10 and WERA together?
> [...]

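The compatibility idea behind the patch is simply to prefer the new element name and fall back to the old one when parsing the NutchWAX response. A minimal Java DOM sketch of that idea follows; the actual fix is the PHP nutch.inc file linked above, and this class is only an illustration.

    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class TstampFallback {
        // Prefer the newer nutch:tstamp element, fall back to nutch:arcdate.
        static String captureDate(Element item) {
            String value = firstText(item, "nutch:tstamp");   // newer NutchWAX
            if (value == null) {
                value = firstText(item, "nutch:arcdate");      // older NutchWAX
            }
            return value;
        }

        private static String firstText(Element parent, String tag) {
            NodeList nodes = parent.getElementsByTagName(tag);
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
        }
    }
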
From: alexis a. <alx...@ya...> - 2007-08-15 11:00:48
Hi,

Has anybody tried using Nutchwax0.10 and WERA together? We are encountering this problem: The resultset array obtained from documentLocator->findVersions() does not have the date field for all the files found. Wera will still be able to display the page but the timeline will be all messed up.

Was there any API change in Nutchwax0.10 concerning the searching of the index or delivery of resultset?

Best Regards,
Alex

From: Michael S. <st...@du...> - 2007-08-02 15:53:34
How large are the indices? Would deploying them side-by-side work for you? See the last paragraph at the end of this FAQ, http://archive-access.sourceforge.net/projects/nutchwax/faq.html#incremental, for pointers on how.

St.Ack

pangang wrote:
> Hello:
> We had finished the data indexing of 2006, and we have now done part
> of the 2007 data indexing. The problem we face is how these two parts
> of the data can be combined.
> [...]

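If a single physical index is wanted rather than side-by-side deployment, the Lucene-level part of the work is an addIndexes call; a hedged sketch assuming the Lucene 2.x API of that era, with illustrative paths. Note this only merges the Lucene index, not the Nutch segments, crawldb or linkdb, which is part of why the FAQ's side-by-side approach is usually the simpler route.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeYearIndexes {
        public static void main(String[] args) throws Exception {
            // Destination index; path is an illustrative example.
            Directory merged = FSDirectory.getDirectory("/search/index-merged");
            IndexWriter writer = new IndexWriter(merged, new StandardAnalyzer(), true);
            // Merge the per-year Lucene indexes into the destination.
            writer.addIndexes(new Directory[] {
                FSDirectory.getDirectory("/search/index-2006"),
                FSDirectory.getDirectory("/search/index-2007"),
            });
            writer.optimize();
            writer.close();
        }
    }
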
From: pangang <pa...@12...> - 2007-08-02 08:51:06
Hello:

We use the Hadoop external JAR of NutchWAX 0.10.0. We had finished the data indexing of 2006, and we have now done part of the 2007 data indexing. The problem we face is how these two parts of the data can be combined, so that the data indexing of 2006 and the part of 2007 can be used by searchers!

Thank you. We look forward to your answer.

MSN...@12...

From: Michael S. <st...@du...> - 2007-07-23 16:28:28
I do not believe such an index conversion tool exists (check the nutch list). Even if it did, I'd suggest you'd spend so much CPU running the conversion of index and supporting segments, you might as well start over (new nutch/hadoop runs much faster... about four times faster). Starting over, you can be sure of the process, more sure than you can be of a little-tested transform, and you will pick up improvements made since old nutch.

The ClassCastException in the below is because old nutchwax used an UTF8 class to represent Strings, a class since replaced by the Text class (your new nutch frontend is trying to use Text to represent a UTF8 class read from segment directories, I'm guessing).

St.Ack

Xavier Torelló wrote:
> Hi,
>
> First of all, thanks for your quick response :)
> [...]

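A quick way to see which era wrote a given segment is to ask its SequenceFiles which key class they were recorded with; old NutchWAX data reports org.apache.hadoop.io.UTF8 and newer data reports org.apache.hadoop.io.Text. The small sketch below uses the classic Hadoop API and is not part of NutchWAX; the path argument is whatever segment data file you point it at.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;

    public class KeyClassCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Reader reader =
                new SequenceFile.Reader(fs, new Path(args[0]), conf);
            // Records written with the old UTF8 key class cannot simply be
            // cast to Text when read back by newer code, hence the exception.
            System.out.println("stored key class: " + reader.getKeyClass().getName());
            reader.close();
        }
    }
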
From: <xto...@ce...> - 2007-07-23 07:22:00
Hi,

First of all, thanks for your quick response :)

The re-index option is not viable, since it is an expensive process considering that we have about 150gb in indices.

John talked about the option of converting the indexes. Does somebody know how to do this process?

Finally, the exception that appears when we try to make a request to nutchwax (via opensearch):

    java.lang.RuntimeException: java.lang.ClassCastException: org.apache.hadoop.io.Text
        org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:204)
        org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:344)
        org.archive.access.nutch.NutchwaxBean.getSummary(NutchwaxBean.java:52)
        org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:156)
        org.archive.access.nutch.NutchwaxOpenSearchServlet.doGet(NutchwaxOpenSearchServlet.java:76)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

Thanks,

--
xt

From: John H. L. <jl...@ar...> - 2007-07-20 20:23:02
Hi Xavier.

I believe the index format changed between those versions, so you may need to re-index your documents or convert the indices somehow.

If you send the stack traces associated with the two exceptions, you have a much better chance of getting a useful response from the list.

-J

On Jul 20, 2007, at 6:03 AM, Xavier Torelló wrote:
> Hi!
>
> We're currently upgraded nutch-wax from 0.7v to 0.10.0v.
> [...]

From: <xto...@ce...> - 2007-07-20 13:03:36
Hi!

We've currently upgraded nutch-wax from 0.7 to 0.10.0. When we try to make a query using the indexes created via hadoop 0.5, the process breaks showing two java exceptions.

Is it essential that we re-run the indexing process on the jobs created by Heritrix using the latest version of hadoop?

Does any procedure exist to update the format of the indexes so they work with the latest version of nutch-wax?

Thanks a lot.

Regards,
--
xt

From: alexis a. <alx...@ya...> - 2007-06-28 05:17:55
Hi,

We are encountering a new set of errors aside from the socket time out. Subsequent runs produce the following errors, which result in "Job Failed". We hope you can guide us in this issue.

    2007-06-27 08:48:04,001 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_0003_m_000115_0: java.io.IOException: Could not obtain block: blk_-8188170094415436519 file=/user/outputs/segments/20070626172746-test/parse_data/part-00023/data offset=33845248
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:563)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:675)
        at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:170)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
        at java.io.DataInputStream.readFully(DataInputStream.java:176)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:404)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:330)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:371)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:58)
        at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:183)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:49)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:195)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1075)

    2007-06-27 08:48:31,758 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_0003_m_000154_0' has been lost.
    2007-06-27 08:48:31,794 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_0003_m_000154_1' to tip tip_0003_m_000154, for tracker 'tracker_orange.com:50050'
    2007-06-27 08:48:32,544 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_0003_m_000146_0: java.lang.RuntimeException: Summer buffer overflow b.len=4096, off=0, summed=512, read=4096, bytesPerSum=1, inSum=512
        at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:100)
        at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:170)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
        at java.io.DataInputStream.readFully(DataInputStream.java:176)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:404)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:330)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:371)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:58)
        at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:183)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:49)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:195)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1075)
    Caused by: java.lang.ArrayIndexOutOfBoundsException
        at java.util.zip.CRC32.update(Unknown Source)
        at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:98)
        ... 15 more

alexis artes <alx...@ya...> wrote:
> Hi,
>
> We are having problems in doing an incremental indexing.
> [...]

From: alexis a. <alx...@ya...> - 2007-06-22 10:35:39
Hi,

We are having problems doing an incremental indexing. We initially indexed 3000 arcfiles and were trying to index 3000 more arcfiles when we encountered the following error.

2007-06-19 02:49:25,135 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_0001_r_000035_0: java.net.SocketTimeoutException: timed out waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:312)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
at org.apache.hadoop.dfs.$Proxy1.complete(Unknown Source)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1126)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at org.apache.hadoop.fs.FSDataOutputStream$Summer.close(FSDataOutputStream.java:97)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:160)
at org.apache.hadoop.io.MapFile$Writer.close(MapFile.java:118)
at org.archive.access.nutch.ImportArcs$WaxFetcherOutputFormat$1.close(ImportArcs.java:687)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:281)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1075)
We are using 28 nodes. Our configuration in hadoop-site.xml is as follows:
<property>
<name>fs.default.name</name>
<value>apple001:9000</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>apple001:9001</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/opt/hadoop-0.5.0/filesystem/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/opt/hadoop-0.5.0/filesystem/data</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/opt/hadoop-0.5.0/filesystem/mapreduce/local</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/opt/hadoop-0.5.0/temp/hadoop/mapred/system</value>
<description>The shared directory where MapReduce stores control files.
</description>
</property>
<property>
<name>mapred.temp.dir</name>
<value>/opt/hadoop-0.5.0/temp/hadoop/mapred/temp</value>
<description>A shared directory for temporary files.
</description>
</property>
<property>
<name>mapred.map.tasks</name>
<value>89</value>
<description>
define mapred.map tasks to be number of slave hosts
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>53</value>
<description>
define mapred.reduce tasks to be number of slave hosts
</description>
</property>
<property>
<name>mapred.tasktracker.tasks.maximum</name>
<value>2</value>
<description>The maximum number of tasks that will be run
simultaneously by a task tracker.
</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
Moreover, what is the maximum number of arc files that can be indexed in the same batch? We tried 6000 but we encountered errors.
Our System Configuration:
Scientific Linux CERN
2.4.21-32.0.1.EL.cernsmp
JDK1.5
Hadoop0.5
Nutchwax0.8
Best Regards,
Alex
From: John H. L. <jl...@ar...> - 2007-06-20 14:54:10
Hi Alexis.

NutchWAX 0.10.0 has lots of bug fixes and improvements over 0.8.0, so you may want to start by upgrading your installation.

Does your job complete any tasks before you see this error? Do you see any other errors in the logs? Specifically, do you see a BindException when you start-all.sh?

The more ARCs you index in a single job, the larger heap space you'll need both during indexing and during deployment. This depends, of course, on how much text is contained in the documents within the ARCs. I've been able to index and deploy batches of 12,000 ARCs with heap spaces around 3200m on 4GB machines.

Hope this helps.

-J

On Jun 20, 2007, at 4:19 AM, alexis artes wrote:
> Hi,
>
> We are having problems in doing an incremental indexing.
> [...]

From: alexis a. <alx...@ya...> - 2007-06-20 11:20:06
Hi,

We are having problems doing an incremental indexing. We initially indexed 3000 arcfiles and were trying to index 3000 more arcfiles when we encountered the following error.

2007-06-19 02:49:25,135 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_0001_r_000035_0: java.net.SocketTimeoutException: timed out waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:312)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
at org.apache.hadoop.dfs.$Proxy1.complete(Unknown Source)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1126)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at org.apache.hadoop.fs.FSDataOutputStream$Summer.close(FSDataOutputStream.java:97)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:160)
at org.apache.hadoop.io.MapFile$Writer.close(MapFile.java:118)
at org.archive.access.nutch.ImportArcs$WaxFetcherOutputFormat$1.close(ImportArcs.java:687)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:281)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1075)
We are using 28 nodes. Our configuration in hadoop-site.xml is as follows:
<property>
<name>fs.default.name</name>
<value>apple001:9000</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>apple001:9001</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/opt/hadoop-0.5.0/filesystem/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/opt/hadoop-0.5.0/filesystem/data</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/opt/hadoop-0.5.0/filesystem/mapreduce/local</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/opt/hadoop-0.5.0/temp/hadoop/mapred/system</value>
<description>The shared directory where MapReduce stores control files.
</description>
</property>
<property>
<name>mapred.temp.dir</name>
<value>/opt/hadoop-0.5.0/temp/hadoop/mapred/temp</value>
<description>A shared directory for temporary files.
</description>
</property>
<property>
<name>mapred.map.tasks</name>
<value>89</value>
<description>
define mapred.map tasks to be number of slave hosts
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>53</value>
<description>
define mapred.reduce tasks to be number of slave hosts
</description>
</property>
<property>
<name>mapred.tasktracker.tasks.maximum</name>
<value>2</value>
<description>The maximum number of tasks that will be run
simultaneously by a task tracker.
</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
Moreover, what is the maximum number of arc files that can be indexed in the same batch? We tried 6000 but we encountered errors.
Best Regards,
Alex
From: Gordon M. <go...@ar...> - 2007-06-08 18:49:55
|
Jim Dixon wrote:
> Is there anywhere a description of how the archives are structured? I
> believe that there is some degree of replication (between San Francisco,
> the Netherlands, and Alexandria in Egypt) and then a multi-tiered
> indexing system.
>
> Apologies if I somehow overlooked this, but there doesn't seem to be any
> information on the subject in the email archives or anywhere else.

There is not a good public writeup, but the broad outlines of the web
archive can be described:

- Web captures are stored in ARC files, essentially verbatim transcripts
of HTTP responses with a single line of per-response metadata including
date of capture and server IP address, concatenated together into files
of 100MB.

- As ARCs are brought in, from various crawls, they land on any of 1000+
machines at IA's US facility, based on which machine has space. (So,
contemporaneous ARCs usually land on the same banks of machines, but
there is no enforced mapping.) The machines are 4-hard-drive 1U
commodity linux machines, with plain independent disks and regular
filesystems.

- Sometimes, as with data collected in partnership with Alexa's crawling,
this material arrives 3-6 months after crawling. One master inventory
database remembers where the ARC is by its initial copy-in; other
inventory systems survey and verify actual machine contents at
occasional intervals.

- At occasional intervals (but again sometimes months after ARC arrival)
all new ARCs are scanned for the URL+date captures they contain, and
their contents are merged into a master index of holdings, which is
roughly:

  URL timestamp response-code tiny-checksum ARC-file offset-in-ARC

This master index is a flat file, one line per URL+date capture, split
in hundreds of shards across many machines. (It currently contains over
85 billion lines and will soon go over 100 billion.) When this merge
happens is when new material appears in the Wayback Machine.

- Wayback Machine requests to list holdings of a particular URL consult
contiguous ranges of this master index.

- Wayback Machine requests to view an exact URL+date (or, most often,
nearest-to-URL+date) seek a single best-match line in this master index,
then find which machine(s) currently hold that ARC, then contact that
machine for just that capture via an HTTP range request into the ARC.

- In 2002, the library of Alexandria received a complete mirror of the
data through part of 2001. In 2006, they again received a complete
mirror of the data through early 2006. At times, bi-directional patching
of each side's collection has occurred, but it is not currently an
automated process.

> Also, I understand that there are two versions of the wayback utility,
> a Java version in development, which is open source, and a Perl
> version, which is the one actually being used and which is closed
> source.
>
> Why is the Perl version closed source?

The legacy Wayback version relies on a mix of Perl and C code, was
co-developed with Alexa, and relies on some Alexa code we don't have
permission to put under a proper open source license. We could try to
replace just those parts, but there are other assumptions in the legacy
Wayback which limit its performance and extensibility. We wanted to
leave those behind, and so have been investing effort in the
open-source, Java Wayback project instead. The new code will replace the
legacy code on our public site this year.

- Gordon @ IA
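To make the flat-file index format concrete, a purely hypothetical line
(the URL, checksum, ARC name and offset below are invented for
illustration) might look like:

  www.example.com/page.html 20060321083015 200 K5XQW3JM IA-2006-03-sample.arc.gz 1234567

with the last two fields telling the Wayback front end which ARC file to
request and at what byte offset the range request should start.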
|
|
From: Jim D. <jd...@di...> - 2007-06-04 03:43:22
|
Is there anywhere a description of how the archives are structured? I
believe that there is some degree of replication (between San Francisco,
the Netherlands, and Alexandria in Egypt) and then a multi-tiered
indexing system.

Apologies if I somehow overlooked this, but there doesn't seem to be any
information on the subject in the email archives or anywhere else.

Also, I understand that there are two versions of the wayback utility, a
Java version in development, which is open source, and a Perl version,
which is the one actually being used and which is closed source. Why is
the Perl version closed source?

--
Jim Dixon jd...@gm...
cellphone 415 / 570 3608 |
|
From: Brad T. <br...@ar...> - 2007-05-21 19:25:04
|
There's pretty significant work going into the Wayback right now to
simplify configuration of multiple collections.
When completed, this should also minimize server resources used, so it
should be possible to host hundreds of collections on a modest server.
Before the next release is available, this can be accomplished using
multiple servlet contexts, and CDX files:
1) create a CDX file for each individual collection you want to be able
to search independently
2) deploy the war under the webapps directory with the name "COLLECTION.war"
3) edit the web.xml under the COLLECTION webapp, customizing the
ResourceIndex to use the appropriate CDX file for that collection with
the "resourceindex.cdxpaths" configuration parameter
4) to create an aggregate collection which searches multiple CDX files,
configure that collection to search all needed CDX files by separating
multiple CDX files with commas (",") in the "resourceindex.cdxpaths"
configuration parameter (see the sketch just after this list).
5) edit the WaybackUI.properties file under WEB-INF/classes to alter the
text displayed for the not-in-archive exception:
Exception.resourceNotInArchive.message=The Resource you requested is not
in this archive.
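As a concrete illustration of steps 3 and 4, the web.xml entry might look
roughly like the following sketch. The surrounding element (a
context-param here) and the CDX paths are assumptions, so check them
against the web.xml that ships inside the war:
<context-param>
  <param-name>resourceindex.cdxpaths</param-name>
  <!-- one CDX file for a single collection; separate several CDX files
       with commas to build an aggregate collection -->
  <param-value>/wayback/cdx/science.cdx,/wayback/cdx/sports.cdx</param-value>
</context-param>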
Let me know if you have problems or questions setting this up. We're
currently hosting dozens of collections on a single machine with 2GB of
RAM using this method. We use a simple shell script to generate and
customize each webapp based on a text file listing the collections needed.
With the new release, it will be possible to share rendering .jsp files
across multiple collections, which should simplify institution-level
.jsp customization, and to easily configure and use per-collection text
within those .jsp files.
Brad
Ignacio Garcia wrote:
> Hello,
>
> I have a question regarding collections within wayback.
> In the older perl versions, there was a way to specify different
> collections
> within wayback, and each collection will be handled as a separate set
> of arc
> files. Having specific messages identifying the collections and searching
> within collections only...
> I was wondering if the latest java versions have such functionality
> built
> in?
>
> What I'm trying to achieve is the following:
>
> Imagine I have a set of 100 arc files, and 25 are from crawls related
> with
> science magazine articles, 25 related with sports magazines and the
> other 50
> as misc. crawls.
> I would like to create 2 collections: One for science magazines and 1 for
> sport magazines.
> Once the collections are created, I would like to be able to search
> either
> ALL the arcs (100), or search by collection. I select one of the 2
> collections created and then only the specific arcs will be searched.
> Also, if I search for http://espn.magazine.com/* within the science
> magazines collection, and I get NO RESULTS, the message shown by wayback
> would have a specific message created for that particular collection,
> something like: No results within SCIENCE MAGAZINES collection.
>
> Since the old wayback was able to handle such configurations, I was
> wondering if this was still doable in the newest java versions, or if
> I need
> to modify the actual source code to fit my needs?
>
> Thank you.
|
|
From: Ignacio G. <igc...@gm...> - 2007-05-21 12:29:45
|
Hello,

I have a question regarding collections within wayback. In the older
perl versions, there was a way to specify different collections within
wayback, and each collection will be handled as a separate set of arc
files, having specific messages identifying the collections and
searching within collections only... I was wondering if the latest java
versions have such functionality built in?

What I'm trying to achieve is the following:

Imagine I have a set of 100 arc files, and 25 are from crawls related
with science magazine articles, 25 related with sports magazines and the
other 50 as misc. crawls. I would like to create 2 collections: One for
science magazines and 1 for sport magazines.

Once the collections are created, I would like to be able to search
either ALL the arcs (100), or search by collection. I select one of the
2 collections created and then only the specific arcs will be searched.
Also, if I search for http://espn.magazine.com/* within the science
magazines collection, and I get NO RESULTS, the message shown by wayback
would have a specific message created for that particular collection,
something like: No results within SCIENCE MAGAZINES collection.

Since the old wayback was able to handle such configurations, I was
wondering if this was still doable in the newest java versions, or if I
need to modify the actual source code to fit my needs?

Thank you. |
|
From: Brad T. <br...@ar...> - 2007-05-16 01:09:16
|
What are the memory settings for the Tomcat java process (-Xmx???m -Xms???m)? Have you tried increasing them? Brad |