From: Brad T. <br...@ar...> - 2007-01-13 00:29:02
|
This note is to announce the release of Wayback 0.8.0. It's available for download from SourceForge at http://sourceforge.net/project/showfiles.php?group_id=118427. Wayback 0.8.0 includes numerous bug fixes, improves character-detection reliability, and introduces a new ResourceIndex implementation using sorted CDX flat files, which allows far larger indexes to be used with the Wayback software. This new version also includes several new command-line tools for creating, maintaining, and transitioning between BDB and CDX indexes. The site documentation has also been significantly revised. This new version requires significant changes to the web.xml file -- the recommended transition strategy is to start with the new default web.xml and repeat any customizations made in previous versions. It also requires a new format of BDB data, which will need to be recreated. Yours, Internet Archive Webteam |
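The sorted CDX flat files mentioned in the announcement make large indexes practical because a lookup over sorted lines needs only binary search, not an in-memory database. A minimal sketch of the idea; the four-field line layout and the `lookup` helper are invented for illustration (real CDX files carry more fields):

```python
import bisect

# Toy CDX lines: sort key (canonicalized URL), capture timestamp,
# ARC file name, byte offset. Real CDX files carry more fields.
cdx_lines = sorted([
    "example.com/ 20061213014500 arcfile-1.arc 1024",
    "example.com/about 20061001120000 arcfile-1.arc 2048",
    "example.org/ 20060601080000 arcfile-2.arc 512",
])

def lookup(url_key):
    """Binary-search the sorted lines for all captures of url_key."""
    lo = bisect.bisect_left(cdx_lines, url_key + " ")
    hi = bisect.bisect_right(cdx_lines, url_key + " \x7f")
    return cdx_lines[lo:hi]

print(lookup("example.com/"))
```

On disk the same trick works by seeking into the sorted file, which is why a CDX index can grow far beyond what a BDB index comfortably holds.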
|
From: Michael S. <st...@ar...> - 2006-12-26 17:48:55
|
Artem Antonov wrote: > Hello, > > I'm a novice at the NutchWax. > I'm using the lateset version from the Sourceforge (0.8.0 Release). > > Please, could you give me a hint how I can parse ARC file from my Java > application using the NutchWax. Here is a pointer to the code that does parse of ARCs in NutchWAX: http://archive-access.sourceforge.net/projects/nutch/xref/org/archive/access/nutch/ImportArcs.html#434. To obtain an ARCReader, use the ArchiveReaderFactory: http://crawler.archive.org/apidocs/org/archive/io/ArchiveReaderFactory.html. Does this answer your question? St.Ack > > Thanks. > > Regards, > Artem Antonov. > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > ------------------------------------------------------------------------ > > ------------------------------------------------------------------------- > Take Surveys. Earn Cash. Influence the Future of IT > Join SourceForge.net's Techsay panel and you'll get the chance to share your > opinions on IT & business topics through brief surveys - and earn cash > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV > ------------------------------------------------------------------------ > > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > |
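For a sense of what ImportArcs and ArchiveReaderFactory are dealing with: an ARC v1 record begins with a single header line of five space-separated fields (URL, IP address, 14-digit capture date, MIME type, content length in bytes), followed by that many bytes of content. A toy header parser, independent of the org.archive classes; the function name and dict layout are just for illustration:

```python
def parse_arc_header(line):
    """Split an ARC v1 record header line into its five fields."""
    url, ip, date, mime, length = line.strip().split(" ")
    return {"url": url, "ip": ip, "date": date,
            "mime": mime, "length": int(length)}

hdr = parse_arc_header(
    "http://example.com/ 192.0.2.1 20061225121836 text/html 1234")
# The next hdr["length"] bytes of the file are the record body.
print(hdr["mime"], hdr["length"])
```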
|
From: Artem A. <ant...@ya...> - 2006-12-25 12:18:36
|
Hello, I'm a novice at the NutchWax. I'm using the latest version from SourceForge (0.8.0 Release). Please, could you give me a hint on how I can parse an ARC file from my Java application using the NutchWax. Thanks. Regards, Artem Antonov. |
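Artem's message, like several others on this list, was transmitted in quoted-printable encoding; raw archive dumps therefore show `=0A` escapes (newlines) and trailing `=` soft line breaks. Python's stdlib `quopri` module undoes both:

```python
import quopri

# A quoted-printable fragment: "=0A" encodes a newline and a trailing
# "=" before a line break is a soft wrap that should be removed.
raw = b"Do You Yahoo!?=0ATired of spam? Yahoo! Mail has the best spam =\nprotection around"
decoded = quopri.decodestring(raw).decode("utf-8")
print(decoded)
```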
|
From: Michael S. <st...@ar...> - 2006-12-20 20:01:47
|
Hey Kaisa. The script is built into the nutchwax.jar, but it's a bug that it's not found when you run in standalone mode (I'm guessing this is what you're doing, since if you run distributed -- or even pseudo-distributed -- the parse-pdf.sh script is found). As a workaround, you can download the script from http://archive-access.cvs.sourceforge.net/*checkout*/archive-access/archive-access/projects/nutch/src/plugin/parse-waxext/bin/parse-pdf.sh?content-type=text%2Fplain (or unjar the jar and get it from there) and put it where it can be found by the indexing job -- such as under a 'bin' directory in your current working directory (the latter being wherever you launched the indexing from) -- or you can try running in pseudo-distributed mode. I should fix this issue, but let's have 0.8 stew for a bit and see if any other issues show up before I spend time on a new release. Thanks Kaisa. St.Ack

Kaisa Kaunonen wrote: > Thanks for the new nutchwax release 0.8.0 > > I haven't yet studied it deeper, only test-indexed one collection. I had a problem with pdf files because a script 'parse-pdf' is missing. I can't find it in nutchwax-0.8.0/bin. Yes, I have xpdf installed in the path, but I guess this script is needed to launch it? > > Quote from logs => 'External command /bin/bash ./bin/parse-pdf.sh failed with error: /bin/bash: ./bin/parse-pdf.sh: No such file or directory..' > > Otherwise, it's very useful to now have incremental indexing and multiple collections in a single index. > > Best, > Kaisa > > ---------- Forwarded message ---------- > Date: Tue, 12 Dec 2006 17:45:20 -0800 > From: Michael Stack <st...@ar...> > To: arc...@li... > Subject: [Archive-access-discuss] [ANN] nutchwax-0.8.0 released > > This note is to announce release of NutchWAX 0.8.0. It's available for download from sourceforge at http://sourceforge.net/project/showfiles.php?group_id=118427&package_id=128933&release_id=470852. NutchWAX 0.8.0 is built against Nutch 0.8.1, released 09/24/2006. A version of this software was recently used to make an index of greater than 400 million documents. See Release Notes [http://archive-access.sourceforge.net/projects/nutch/articles/releasenotes.html] for significant changes and fixes since NutchWAX 0.6.0. The site documentation has also been significantly revised. > > Yours, > Internet Archive Webteam |
|
From: Kaisa K. <kau...@cc...> - 2006-12-20 11:02:12
|
Thanks for the new nutchwax release 0.8.0 I haven't yet studied it deeper, only test-indexed one collection. I had a problem with pdf files because a script 'parse-pdf' is missing. I can't find it in nutchwax-0.8.0/bin Yes, I have xpdf installed in path but I guess this script is needed to launch it? Quote from logs => 'External command /bin/bash ./bin/parse-pdf.sh failed with error: /bin/bash: ./bin/parse-pdf.sh: No such file or directory..' Otherwise, it's very useful to now have incremental indexing and multiple collections in a single index. Best, Kaisa ---------- Forwarded message ---------- Date: Tue, 12 Dec 2006 17:45:20 -0800 From: Michael Stack <st...@ar...> To: arc...@li... Subject: [Archive-access-discuss] [ANN] nutchwax-0.8.0 released This note is to announce release of NutchWAX 0.8.0. Its available for download from sourceforge at http://sourceforge.net/project/showfiles.php?group_id=118427&package_id=128933&release_id=470852. NutchWAX 0.8.0 is built against Nutch 0.8.1, released 09/24/2006. A version of this software was recently used to make an index of greater than 400 million documents. See Release Notes [http://archive-access.sourceforge.net/projects/nutch/articles/releasenotes.html] for significant changes and fixes since NutchWAX 0.6.0. The site documentation has also been significantly revised. Yours, Internet Archive Webteam |
|
From: Michael S. <st...@ar...> - 2006-12-13 01:45:25
|
This note is to announce the release of NutchWAX 0.8.0. It's available for download from SourceForge at http://sourceforge.net/project/showfiles.php?group_id=118427&package_id=128933&release_id=470852. NutchWAX 0.8.0 is built against Nutch 0.8.1, released 09/24/2006. A version of this software was recently used to make an index of greater than 400 million documents. See the Release Notes [http://archive-access.sourceforge.net/projects/nutch/articles/releasenotes.html] for significant changes and fixes since NutchWAX 0.6.0. The site documentation has also been significantly revised. Yours, Internet Archive Webteam |
|
From: Armel T. N. <arm...@id...> - 2006-12-06 23:43:48
|
Hi, I have set up Nutch to crawl my local filesystem. I set topN to 20 and depth to 2. But when Nutch re-crawls, it re-crawls the same files over and over again. The directory doesn't contain any other sub-directories; can someone tell me what might be the cause? There are more than 20 files in the directory, so why is Nutch only getting the same twenty files? Thanks, Armel

-----Original Message----- From: Michael Stack [mailto:st...@ar...] Sent: 06 December 2006 16:04 To: Shay Lawless Cc: nut...@lu...; nut...@lu...; arc...@li... Subject: Re: [Archive-access-discuss] Full List of Metadata Fields Hey Shay. Some friendly advice. Cross-posting a question will make you unpopular fast. It's best to start on the most appropriate-seeming list and only move on from there if you are getting no satisfaction. The below question looks best at home over on the archive-access list. Let me have a go at answering it there. Yours, St.Ack Shay Lawless wrote: > Hi all, > > I'm using NutchWax (Version 0.7.0-200611082313) and Wera (Version 0.5.0-200611082313) to index a collection of ARC files generated by a web crawl using the Heritrix web crawler (Version 1.4.0). > > When I check the metadata tag on the wera front-end, the following list of tags is displayed: > ARC Identifier > URL > Time of Archival > Last Modified Time > Mime-Type > File Status > Content Checksum > HTTP Header > > When I click on the explain link in the NutchWax front-end, the following list of tags is displayed: > Segment > Digest > Date > ARCDate > Encoding > Collection > ARCName > ARCOffset > ContentLength > PrimaryType > subType > URL > Title > Boost > > Is there a full list of the metadata fields that NutchWax/Nutch creates when indexing? I'm particularly interested in tags relating to the actual content on each page, i.e. content type, description, etc. When searching, does NutchWax/Nutch search across such tags or just across the parsed text of each page for occurrences of keywords? > > Any help you can provide would be greatly appreciated! > > Shay |
|
From: Michael S. <st...@ar...> - 2006-12-06 16:00:25
|
Hey Shay. Some friendly advice. Cross-posting a question will make you unpopular fast. It's best to start on the most appropriate-seeming list and only move on from there if you are getting no satisfaction. The below question looks best at home over on the archive-access list. Let me have a go at answering it there. Yours, St.Ack Shay Lawless wrote: > Hi all, > > I'm using NutchWax (Version 0.7.0-200611082313) and Wera (Version 0.5.0-200611082313) to index a collection of ARC files generated by a web crawl using the Heritrix web crawler (Version 1.4.0). > > When I check the metadata tag on the wera front-end, the following list of tags is displayed: > ARC Identifier > URL > Time of Archival > Last Modified Time > Mime-Type > File Status > Content Checksum > HTTP Header > > When I click on the explain link in the NutchWax front-end, the following list of tags is displayed: > Segment > Digest > Date > ARCDate > Encoding > Collection > ARCName > ARCOffset > ContentLength > PrimaryType > subType > URL > Title > Boost > > Is there a full list of the metadata fields that NutchWax/Nutch creates when indexing? I'm particularly interested in tags relating to the actual content on each page, i.e. content type, description, etc. When searching, does NutchWax/Nutch search across such tags or just across the parsed text of each page for occurrences of keywords? > > Any help you can provide would be greatly appreciated! > > Shay |
|
From: Shay L. <sea...@gm...> - 2006-12-06 15:31:49
|
Hi all, I'm using NutchWax (Version 0.7.0-200611082313) and Wera (Version 0.5.0-200611082313) to index a collection of ARC files generated by a web crawl using the Heritrix web crawler (Version 1.4.0). When I check the metadata tag on the wera front-end, the following list of tags is displayed: ARC Identifier, URL, Time of Archival, Last Modified Time, Mime-Type, File Status, Content Checksum, HTTP Header. When I click on the explain link in the NutchWax front-end, the following list of tags is displayed: Segment, Digest, Date, ARCDate, Encoding, Collection, ARCName, ARCOffset, ContentLength, PrimaryType, subType, URL, Title, Boost. Is there a full list of the metadata fields that NutchWax/Nutch creates when indexing? I'm particularly interested in tags relating to the actual content on each page, i.e. content type, description, etc. When searching, does NutchWax/Nutch search across such tags or just across the parsed text of each page for occurrences of keywords? Any help you can provide would be greatly appreciated! Shay |
|
From: Dang N. H. <dan...@ya...> - 2006-12-06 07:18:36
|
Hi everyone, My project used Wayback to render the webpages crawled by Heritrix. However, we encountered some problems related to server-side redirected links. The website that we want to crawl is a JSP site running on Tomcat. There are many links which are redirected by the server (not using js, but by the jsp script itself). I can check that Heritrix actually follows these redirected links and crawls these webpages into the ARC file. However, when we try to render using Wayback, we cannot follow these redirected links (and Wayback displays an error of "No resource available"). So I wonder whether any of you have a plan to fix it, and what is your approach? Thanks, Nam Hai |
|
From: Lukas M. <lma...@gm...> - 2006-11-27 16:35:58
|
On Sunday, 26 November 2006 at 04:57, AaRon wrote: > Hi, > > I have been following the Getting Started guide to get NutchWAX up and running in standalone configuration but I just keep getting the following error: > > 06/11/26 11:24:58 WARN mapred.LocalJobRunner: job_afrgrp > java.lang.ClassCastException: org.apache.nutch.crawl.CrawlDatum > at org.apache.nutch.indexer.Indexer$InputFormat$1.next(Indexer.java:67) I ran into the same problem. Lukas > at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:203) > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:215) > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:107) > Exception in thread "main" java.io.IOException: Job failed! > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399) > at org.archive.access.nutch.NutchwaxIndexer.index(NutchwaxIndexer.java:193) > at org.archive.access.nutch.Nutchwax.doIndexing(Nutchwax.java:241) > at org.archive.access.nutch.Nutchwax.doIndexing(Nutchwax.java:234) > at org.archive.access.nutch.Nutchwax.doAll(Nutchwax.java:154) > at org.archive.access.nutch.Nutchwax.doJob(Nutchwax.java:379) > at org.archive.access.nutch.Nutchwax.main(Nutchwax.java:651) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:585) > at org.apache.hadoop.util.RunJar.main(RunJar.java:149) > > I'd appreciate it if someone can help me on this. I'm using a hadoop-0.8.0 installation with a nightly build of nutchwax (nutchwax-0.7.0-200611202206). > > Thanks, > Aaron |
|
From: AaRon <aw...@gm...> - 2006-11-26 03:57:47
|
Hi,
I have been following the Getting Started guide to get NutchWAX up and
running in standalone configuration but I just keep getting the following
error:
06/11/26 11:24:58 WARN mapred.LocalJobRunner: job_afrgrp
java.lang.ClassCastException: org.apache.nutch.crawl.CrawlDatum
        at org.apache.nutch.indexer.Indexer$InputFormat$1.next(Indexer.java:67)
        at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:203)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:215)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:107)
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399)
        at org.archive.access.nutch.NutchwaxIndexer.index(NutchwaxIndexer.java:193)
        at org.archive.access.nutch.Nutchwax.doIndexing(Nutchwax.java:241)
        at org.archive.access.nutch.Nutchwax.doIndexing(Nutchwax.java:234)
        at org.archive.access.nutch.Nutchwax.doAll(Nutchwax.java:154)
        at org.archive.access.nutch.Nutchwax.doJob(Nutchwax.java:379)
        at org.archive.access.nutch.Nutchwax.main(Nutchwax.java:651)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:149)
I'd appreciate it if someone can help me on this. I'm using a hadoop-0.8.0 installation with a nightly build of nutchwax (nutchwax-0.7.0-200611202206).
Thanks,
Aaron
|
|
From: James G. <jg...@si...> - 2006-11-17 23:07:40
|
> 1) Images *usually* don't seem to be displayed. Aha, the image thing was my fault; seems the images I was missing were on a different domain (and I had restricted my crawl to the single domain). I'll have to do another test, but I think that's explained. jamesG James Grahn wrote: > A few quick comments: > > 1) Images *usually* don't seem to be displayed. Though I saved images > in one of the ARC files I'm using, they do not appear on the page in > WERA. I've also noticed this occurring on the WERA test site > http://nwa.nb.no/wera/ when I search for "library" and examine the front > page of the library of congress. > > Behavior: The image will appear, only to be replaced by the image's > "alt" tag as the page has its links remapped. > > Expected behavior: The image should reappear after the links are > remapped (because the image should be in the ARC). > > 2) There are some webpages that throw off the formatting of WERA. They > seem to be primarily textareas with html embedded. > > When indexed, they sometimes throw off the table formatting of WERA and > sometimes cause input boxes and submit buttons to appear on the search page. 
> > Always-valid examples: > http://cl.cnn.com/ctxtlink/jsp/cnn/cl/1.5/cnn-story-cl.jsp > http://sportsillustrated.cnn.com/.element/ssi/misc/2.0/contextual/story.html > http://www.cnn.com/.element/ssi/sect/1.3/WEATHER/weatherPageBox.html > http://www.cnn.com/WEATHER/ > http://cnn.dyn.cnn.com/intlWeatherBox.html > > > Perhaps-not-always-valid examples: > http://www.cnn.com/.element/ssi/www/breaking_news/1.1/banner.exclude.html > > An example of such offending html: > <textarea name="breakingNews"><!--breaking news banner--> > <div id="cnnBNBBreakingNews"> > <table cellpadding="0" cellspacing="0" border="0"> > <tr valign="middle"> > <td width="181" valign="top"><img > src="http://i.a.cnn.net/cnn/.element/img/1.5/ceiling/bnb/breaking_news.gif" > alt="" width="181" height="47" hspace="0" vspace="0" border="0"></td> > <td class="right"><div id="cnnNarrowBulletinText">Britney Spears > files for divorce from her husband Kevin Federline, citing > irreconcilable differences. </div></td> > </tr> > </table> > </div> > <!--/breaking news banner--> > </textarea> > > > 3) This xml file resulted in an abrupt end of a table in WERA: > http://edition.cnn.com/.element/img/1.3/swf/pipeline_mainpage/config_intl.xml > > The source for this was a crawl of CNN at a depth of 2 links. > Hopefully the examples are revealing. > > jamesG > > ------------------------------------------------------------------------- > Take Surveys. Earn Cash. Influence the Future of IT > Join SourceForge.net's Techsay panel and you'll get the chance to share your > opinions on IT & business topics through brief surveys - and earn cash > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > |
|
From: James G. <jg...@si...> - 2006-11-17 01:06:18
|
A few quick comments: 1) Images *usually* don't seem to be displayed. Though I saved images in one of the ARC files I'm using, they do not appear on the page in WERA. I've also noticed this occurring on the WERA test site http://nwa.nb.no/wera/ when I search for "library" and examine the front page of the library of congress. Behavior: The image will appear, only to be replaced by the image's "alt" tag as the page has its links remapped. Expected behavior: The image should reappear after the links are remapped (because the image should be in the ARC). 2) There are some webpages that throw off the formatting of WERA. They seem to be primarily textareas with html embedded. When indexed, they sometimes throw off the table formatting of WERA and sometimes cause input boxes and submit buttons to appear on the search page. Always-valid examples: http://cl.cnn.com/ctxtlink/jsp/cnn/cl/1.5/cnn-story-cl.jsp http://sportsillustrated.cnn.com/.element/ssi/misc/2.0/contextual/story.html http://www.cnn.com/.element/ssi/sect/1.3/WEATHER/weatherPageBox.html http://www.cnn.com/WEATHER/ http://cnn.dyn.cnn.com/intlWeatherBox.html Perhaps-not-always-valid examples: http://www.cnn.com/.element/ssi/www/breaking_news/1.1/banner.exclude.html An example of such offending html: <textarea name="breakingNews"><!--breaking news banner--> <div id="cnnBNBBreakingNews"> <table cellpadding="0" cellspacing="0" border="0"> <tr valign="middle"> <td width="181" valign="top"><img src="http://i.a.cnn.net/cnn/.element/img/1.5/ceiling/bnb/breaking_news.gif" alt="" width="181" height="47" hspace="0" vspace="0" border="0"></td> <td class="right"><div id="cnnNarrowBulletinText">Britney Spears files for divorce from her husband Kevin Federline, citing irreconcilable differences. 
</div></td> </tr> </table> </div> <!--/breaking news banner--> </textarea> 3) This xml file resulted in an abrupt end of a table in WERA: http://edition.cnn.com/.element/img/1.3/swf/pipeline_mainpage/config_intl.xml The source for this was a crawl of CNN at a depth of 2 links. Hopefully the examples are revealing. jamesG |
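The breakage James describes is classic markup injection: WERA embeds captured page fragments in its own result tables, so an unescaped `<textarea>` full of HTML spills live tags into the surrounding layout. A defensive sketch (not WERA's actual code) showing how escaping captured content before embedding keeps it inert:

```python
from html import escape

# A captured fragment containing live markup, as in the CNN example above.
captured = ('<textarea name="breakingNews"><table><tr><td>'
            'Breaking news</td></tr></table></textarea>')

# Embedding the fragment raw would open a table inside the results page;
# escaping turns the tags into harmless text.
cell = "<td>" + escape(captured) + "</td>"
print(cell.startswith("<td>&lt;textarea"))
```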
|
From: James G. <jg...@si...> - 2006-11-15 20:37:53
|
Forwarded by St.Ack's request. Michael Stack wrote: > I'm glad its working for you now. Suggestions for improving doc. so > others don't fall into your little wormhole? > Thanks James, > St.Ack The "Getting Started" document was great for initial testing of the system. I had a problem with Hadoop early, but I was using the hadoop that came packaged with nutch 0.8.1... which turned out to be version 0.4. I had assumed incorrectly that nutch itself would be using a recent version of hadoop. When I wanted to begin working with WERA and keep multiple versions of a page around, however, my resources were: St.Ack's response to someone else on this list, the bug report about keeping multiple versions of a webpage, and revisiting the "Getting Started" document (since it contained the listing of commands in order). So I'd say a guide outlining the steps to take to preserve multiple versions of a webpage would have been a plus. Current documentation about how to do incremental indexing would be nice too, as this is something I'll be working on soon (I suppose the old FAQ solution applies?). Outside of documentation, most of my desires from Heritrix/NutchWAX/WERA would be for automation and integration: - I'm looking forward to the automatic recrawling that I've seen on the roadmap of Heritrix. - A non-manual way of importing new crawls from Heritrix to NutchWAX would be desirable - It would have been really nice if WERA was Tomcat friendly, so WERA, NutchWAX, ArcRetriever, and even Heritrix could coexist on one server. - It would have also been nice if ArcRetriever had the same args as the wayback machine, so that either could be used with NutchWAX. (though perhaps they are compatible and I missed it?) But, as I said, I realize that most of these tools are pre-version-1.0, and I'm happy that they're around to begin with. jamesG |
|
From: James G. <jg...@si...> - 2006-11-10 03:15:42
|
I'm currently having the same problem that Natalia initially had... I'm using the nightly build (from a few days back) of nutchwax and am trying to build an index that will be used by wera. It seems to me that if you are going to store the crawls under different collection names, then you have to do multiple imports (with differing collection names), before proceeding through update, invert, index, dedup, and merge. I have been attempting to do this with multiple collections, using the optional "segments" arguments to keep the tools aware of the multiple collections. I've gone through several permutations of the command line arguments but have not had any luck yet; what's the proper sequence of commands to get this running? Thanks, James Michael Stack wrote: > Natalia Torres wrote: >> Hello >> >> I'm trying nutchwax+wera whith multiple crawls of some web pages. After >> index it I can't see it on wera. The Overview page only shows one crawl >> date. For us that's an important issue. >> >> >> I found it as a bug from july in the Nutchwax bug list (1518431 - Search >> multiple versions of one URL broken). >> >> >> There's a new version cooming soon? How can I solve it? >> >> > Did you give each crawl a different collection name or are they indexed > all with the same collection name? > > In nutch, the URL for a page is used as the key in mapreduce processing > (Keys are used to identify records and must be unique). It makes it so > you can only have one URL in a nutch index. While an URL as primary key > is far from optimal, its convenient having the key be an URL. It makes > it so the URL is easily available at various points during indexing > processing. > > In nutchwax, we've made it so that the key is collection-name + URL so > you can have multiple URLs as long as they are of different > collections. This is a climb-down from how it used to work in nutchwax > -- pre-mapreduce -- where you could have multiple URLs distingushed by > date alone. 
> > I'm wondering if a key of collection-name+URL is sufficient? It means > indexing, collection names must be carefully chosen. Otherwise, we need > to make the key uglier still: collection-name+URL+date. > > Yours, > St.Ack > P.S. Yes a new release is imminent. > > ------------------------------------------------------------------------- > Using Tomcat but need to do more? Need to support web services, security? > Get stuff done quickly with pre-integrated technology to make your job easier > Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo > http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642 > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > |
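St.Ack's key discussion is easy to demonstrate: when the MapReduce key is collection-name + URL, a re-crawl of the same URL in the same collection replaces the earlier capture, while widening the key to include the date keeps every version. A toy illustration using Python dicts as stand-ins for the index (the field names and sample values are invented):

```python
records = [
    ("mycoll", "http://example.com/", "20061001", "first capture"),
    ("mycoll", "http://example.com/", "20061109", "second capture"),
]

# Key = (collection, URL): the later capture overwrites the earlier one.
by_coll_url = {(c, u): body for c, u, d, body in records}

# Key = (collection, URL, date): both captures survive.
by_coll_url_date = {(c, u, d): body for c, u, d, body in records}

print(len(by_coll_url), len(by_coll_url_date))
```

This is why, short of the uglier collection-name+URL+date key, the workaround in the thread is to give each crawl its own collection name.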
|
From: Natalia T. <nt...@ce...> - 2006-11-09 16:56:44
|
Thanks Michael. All these crawls (varying numbers of crawls of more than 20 different URIs) are indexed in the same collection because I want to make specific searches on this collection (using "collection:mycollection query" as the search). If I index the crawls using a collection for each date on which the URLs were crawled, I can see them on the overview page. But how can I do the same search? More questions: How many collections can I create? Does the number of collections affect the response time when a search is made? Natalia |
|
From: Shay L. <sea...@gm...> - 2006-11-09 16:51:32
|
Happens every time I click the "RSS" link at the bottom of the NutchWax screen.
061109 165132 11 query request from 134.226.35.130
061109 165132 11 query: introduction select statement
061109 165132 11 searching for 20 raw hits
061109 165132 11 total hits: 111
061109 165132 11 SEVERE Servlet.service() for servlet NutchwaxOpenSearch threw exception
java.lang.StringIndexOutOfBoundsException: String index out of range: -87
        at java.lang.String.substring(String.java:1768)
        at org.archive.access.nutch.NutchwaxOpenSearchServlet.xmlize(NutchwaxOpenSearchServlet.java:372)
        at org.archive.access.nutch.NutchwaxOpenSearchServlet.getXmlStr(NutchwaxOpenSearchServlet.java:331)
        at org.archive.access.nutch.NutchwaxOpenSearchServlet.addNode(NutchwaxOpenSearchServlet.java:280)
        at org.archive.access.nutch.NutchwaxOpenSearchServlet.doGet(NutchwaxOpenSearchServlet.java:181)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
        at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
        at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
        at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
        at java.lang.Thread.run(Thread.java:595)
Thanks,
Shay
On 09/11/06, Michael Stack <st...@ar...> wrote:
>
> Happens on every URL?
>
> Paste in the full stacktrace. That might help figure the problem.
>
> St.Ack
>
> Shay Lawless wrote:
> > Hi,
> >
> > I have installed nutchWax (0.6.1) and it seems to be indexing and
> > searching my arc files fine. However when I click on the RSS tag I am
> > getting the following error message.
> >
> > SEVERE Servlet.service() for servlet NutchwaxOpenSearch threw exception
> > java.lang.StringIndexOutOfBoundsException: String index out of range:
> -58
> >
> > This appears to be a problem with the opensearch servlet generating
> > the rss version of the url. Any ideas on this?
> >
> > Thanks in advance
> >
> > Shay
|
|
From: Shay L. <sea...@gm...> - 2006-11-09 12:07:38
|
Hi,

I have installed NutchWAX (0.6.1) and it seems to be indexing and searching my arc files fine. However, when I click on the RSS tag I am getting the following error message:

SEVERE Servlet.service() for servlet NutchwaxOpenSearch threw exception
java.lang.StringIndexOutOfBoundsException: String index out of range: -58

This appears to be a problem with the opensearch servlet generating the RSS version of the URL. Any ideas on this?

Thanks in advance,
Shay
|
|
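[Editor's note] The negative value in "String index out of range: -58" is the classic signature of `String.substring(begin, end)` being called with an end index smaller than its begin index: on the JVMs of that era the exception message carries `end - begin`. A minimal sketch of how that arises when two markers are searched for in the wrong order -- the string and markers here are invented for illustration and say nothing about the servlet's actual parsing logic:

```java
public class NegativeSubstring {
    public static void main(String[] args) {
        // Hypothetical parsing step: extract the URL portion of a record key.
        String s = "collection/20061109/http://example.com/";
        int begin = s.indexOf("http://"); // 20
        int end = s.indexOf('/');         // 10 -- first '/', which is BEFORE begin
        try {
            s.substring(begin, end);      // end < begin => negative length
        } catch (StringIndexOutOfBoundsException e) {
            // Java 5 reports the negative difference, e.g. "String index out of range: -10"
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

A record whose fields sit at unexpected offsets (as with some archived URLs) would make the subtraction go negative only sometimes, which fits the error appearing on one installation and not another.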
From: Michael S. <st...@ar...> - 2006-11-08 17:24:04
|
Natalia Torres wrote:
> Hello
>
> I'm trying nutchwax+wera with multiple crawls of some web pages. After
> indexing I can't see them in WERA. The Overview page only shows one crawl
> date. For us that's an important issue.
>
> I found it as a bug from July in the NutchWAX bug list (1518431 - Search
> multiple versions of one URL broken).
>
> Is there a new version coming soon? How can I solve it?

Did you give each crawl a different collection name, or are they all indexed with the same collection name?

In nutch, the URL for a page is used as the key in mapreduce processing (keys are used to identify records and must be unique). It makes it so you can only have one URL in a nutch index. While a URL as primary key is far from optimal, it's convenient having the key be a URL: it makes the URL easily available at various points during indexing processing.

In nutchwax, we've made it so that the key is collection-name + URL, so you can have multiple URLs as long as they are of different collections. This is a climb-down from how it used to work in nutchwax -- pre-mapreduce -- where you could have multiple URLs distinguished by date alone.

I'm wondering if a key of collection-name+URL is sufficient? It means that, when indexing, collection names must be carefully chosen. Otherwise, we need to make the key uglier still: collection-name+URL+date.

Yours,
St.Ack

P.S. Yes, a new release is imminent.
|
|
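[Editor's note] The key scheme St.Ack describes above can be sketched as plain string composition. This is an illustrative sketch only -- the method names and the space separator are hypothetical, not NutchWAX's actual API:

```java
public class CollectionKey {
    // Hypothetical key builders illustrating the trade-off discussed above.

    // Current nutchwax scheme: one record per URL per collection, so two
    // crawls of the same URL in the same collection collapse onto one key.
    static String key(String collection, String url) {
        return collection + " " + url;
    }

    // The "uglier" scheme: folding in a 14-digit crawl timestamp would also
    // distinguish multiple crawl dates of the same URL within one collection.
    static String key(String collection, String url, String date14) {
        return collection + " " + url + " " + date14;
    }

    public static void main(String[] args) {
        System.out.println(key("mycollection", "http://example.com/"));
        System.out.println(key("mycollection", "http://example.com/", "20061108154000"));
    }
}
```

This illustrates why Natalia's many crawls of the same URIs, all indexed under one collection name, surface as a single version: under the two-part key they are the same record.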
From: Natalia T. <nt...@ce...> - 2006-11-08 15:40:00
|
Hello,

I'm trying nutchwax+wera with multiple crawls of some web pages. After indexing, I can't see them in WERA. The Overview page only shows one crawl date. For us that's an important issue.

I found it as a bug from July in the NutchWAX bug list (1518431 - Search multiple versions of one URL broken).

Is there a new version coming soon? How can I solve it?

Thanks,
Natalia
|
|
From: Michael S. <st...@ar...> - 2006-11-07 18:46:42
|
James Grahn wrote:
> Just FYI,
> My problem was resolved by switching to the nightly build of NutchWAX
> (at St.Ack's advice) and switching to the .5 version of hadoop (I think
> I was using 0.6.2).
>
> I now can generate a search page properly.
>
> A few problems remain, though.
> 1) All results link to a non-existent page:
> http://example.com/test/--dateOfCrawl--/http://--actualwebsite--.com/

Check out the 'Searching' section here:
http://archive-access.sourceforge.net/projects/nutch/apidocs/overview-summary.html.
It doesn't make mention of the 'wax.host' property you'll need to change -- I'll fix that -- but if you look at this file, available in the src version of nutchwax, it notes the property to change and others you might want to change too:
http://archive-access.cvs.sourceforge.net/*checkout*/archive-access/archive-access/projects/nutch/conf/hadoop-site.xml.template?revision=1.7

> 2) The "Other versions" link likewise directs me to example.com
>
> I have looked for a way to change that in the configuration, but
> couldn't find it. The "Other versions" has me curious though; is this
> going to be an integration point for something like WERA?

By default, we'll only show the most recent version of a page. If there are multiple versions in an index, we'll show all (set hitsPerDup to '0', which says show all -- usually hitsPerDup is 1).

> Additional problem:
> 3) Inaccurate "hits" count: the page claims to display results 1-3 out
> of 20, but the "next page" displays nothing ("Results 4-3"). This bug
> seems to originate from it not taking into account the pages hidden by
> the "more from cnn.com". Because I currently just have a single domain
> crawled, it's especially obvious.

Yeah. Known issue. Need to fix. Also in play is the fact that we only show one hit per site by default (add hitsPerSite=0 to your query string to confirm this, rather than 'more', is the issue).

> Also, I was wondering; would implementing something like query expansion
> be accomplished in the same manner as it is in nutch? That is, would
> changing the nutch configuration file in the webapps directory to
> perform query expansion work in NutchWAX?

Nutchwax includes near all of nutch. The only reason it wouldn't work would be because we've not built in a plugin or some conf file, or our jsp page diverges slightly from default nutch. If you let me know what's missing, I'll change the build scripts to include it.

Yours,
St.Ack
|
|
From: James G. <jg...@si...> - 2006-11-07 18:22:49
|
Just FYI,

My problem was resolved by switching to the nightly build of NutchWAX (at St.Ack's advice) and switching to the .5 version of hadoop (I think I was using 0.6.2). I now can generate a search page properly.

A few problems remain, though.

1) All results link to a non-existent page:
http://example.com/test/--dateOfCrawl--/http://--actualwebsite--.com/

2) The "Other versions" link likewise directs me to example.com

I have looked for a way to change that in the configuration, but couldn't find it. The "Other versions" has me curious though; is this going to be an integration point for something like WERA?

Additional problem:

3) Inaccurate "hits" count: the page claims to display results 1-3 out of 20, but the "next page" displays nothing ("Results 4-3"). This bug seems to originate from it not taking into account the pages hidden by the "more from cnn.com". Because I currently just have a single domain crawled, it's especially obvious.

Also, I was wondering; would implementing something like query expansion be accomplished in the same manner as it is in nutch? That is, would changing the nutch configuration file in the webapps directory to perform query expansion work in NutchWAX?

Thanks,
James

James Grahn wrote:
> Greets,
> I have been attempting to follow the tutorial to get NutchWAX up and
> running in standalone mode, but I've reached an error that confounds me.
>
> The printlns seem to indicate that NutchWAX does successfully import the
> ARC files.
>
> I see this line:
> opening /tmp/mirror/heretrix/IAH-20061026194403-00000.arc.gz
>
> And after many individual pages being imported, I see this line:
>
> 061102 115327 opening /tmp/mirror/heretrix/IAH-20061026194522-00001.arc.gz
>
> This is followed by more individual pages. So that seems fine. But no
> index is generated and the printlns end like this:
>
> ...
> 061102 115345 adding http://www.cnn.com/CNN/Programs/student.news/ 24869 text/html
> 061102 115345 adding http://www.cnn.com/CNN/Programs/people/ 367 text/html
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
>         at org.archive.access.nutch.ImportArcs.importArcs(ImportArcs.java:519)
>         at org.archive.access.nutch.IndexArcs.doImport(IndexArcs.java:154)
>         at org.archive.access.nutch.IndexArcs.doAll(IndexArcs.java:139)
>         at org.archive.access.nutch.IndexArcs.doJob(IndexArcs.java:246)
>         at org.archive.access.nutch.IndexArcs.main(IndexArcs.java:439)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:585)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:130)
>
> --------
>
> Any suggestions for this error? I am using a hadoop installation I
> acquired with the current version of nutch, and am running the "all"
> command as per the tutorial:
>
> ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test
>
> Thanks,
> James
|
|
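[Editor's note] James's problem 3 ("Results 4-3" on a 20-hit query) is consistent with a total-hits count computed before site-deduplication while the result pages are rendered after it. A toy illustration of that mismatch with invented data -- this is not the actual NutchWAX pagination code:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DedupPaging {
    public static void main(String[] args) {
        // 20 raw hits, all from the same site -- as in a single-domain crawl.
        List<String> rawHits = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            rawHits.add("cnn.com/page" + i);
        }
        int totalReported = rawHits.size(); // "out of 20" shown to the user

        // hitsPerSite=1 behaviour: keep only the first hit from each site.
        Set<String> seenSites = new LinkedHashSet<>();
        List<String> shown = new ArrayList<>();
        for (String hit : rawHits) {
            String site = hit.substring(0, hit.indexOf('/'));
            if (seenSites.add(site)) {
                shown.add(hit);
            }
        }

        // Page 2 asks for results starting at index 3 of the "20", but almost
        // everything was collapsed behind "more from cnn.com".
        System.out.println("claimed total: " + totalReported); // 20
        System.out.println("actually shown: " + shown.size()); // 1
    }
}
```

With a single-domain crawl the gap between the claimed total and the deduplicated list is maximal, which is why James sees it so clearly; St.Ack's suggestion of hitsPerSite=0 disables the per-site collapse and makes the two numbers agree.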
From: Kaisa K. <kau...@cc...> - 2006-11-06 11:54:44
|
I deleted everything old and started a new index, and now the log shines with the golden words 'nutchwax finished'. I vaguely remember deleting old indexes every now and then when testing different versions of hadoop+nutchwax, but probably didn't do it when it was really needed.

OK, the one-arc test went smoothly with hadoop-0.5.0 + nutchwax-0.7.0-200611030343. Next I'll try to index the whole of our library's recent mini-size music archive.

Many thanks,
Kaisa

On Sat, 4 Nov 2006, Michael Stack wrote:
> When you changed hadoop+nutchwax combinations, did you clean the target
> directory of all previous outputs? What I see in the log below is that the
> import works fine, but when we move to do the crawldb update, it's complaining
> that the sequencefiles it's being fed don't jibe with what it already
> digested. Was there a crawldb already in place made with a different
> version of hadoop?
>
> You should use the latest nutchwax build + hadoop-0.5.0. Current nutchwax is
> based on the nutch 0.8.1 release. Nutch 0.8.1 is built against hadoop-0.5.0.
> Nutch and Hadoop are moving at different rates.
> The latest nutchwax + hadoop-0.5.0 is what we're currently using internally,
> running a large indexing job: ~800 million documents. We're learning lots
> operating at this new scale. I'll try and summarize our findings and post
> them alongside the new release when it goes out (should happen when this big
> job completes -- in a week or so).
>
> Yours,
> St.Ack
>
> Kaisa Kaunonen wrote:
>> Hi all,
>>
>> I don't seem to find a combination of hadoop-0.5.0 and
>> nutchwax-0.6.x or nutchwax-0.7.x that would index on my
>> machines.
>>
>> hadoop-0.5.0 + nutchwax-0.6.1 (latest official) fails
>> (for different reasons than 0.7.0-200611030343)
>>
>> hadoop-0.5.0 + nutchwax-0.7.0-200611030343 (latest build artifact) fails
>>
>> Attached log from the 0.7.0 run when trying to index one arc.
>> The run stops by saying 'A record version mismatch occurred.
>> Expecting v3, found v5'
>>
>> Best,
>> Kaisa Kaunonen
>> Nat.Lib.Finland
|
|
From: Lukas M. <lma...@gm...> - 2006-11-05 19:51:16
|
On Friday, 3 November 2006 at 15:33, Shay Lawless wrote:
> Hi,
>
> I am using NutchWAX to index a series of ARC files created in a web crawl
> using the Heritrix crawler.

Which version of NutchWAX do you use?

> My problem occurs when I perform a query on NutchWAX and attempt to view
> the results: nutch attempts to send me to the URL in question rather than
> the archived content item. As a result I am getting an error, as the URL is
> not being correctly formed.
>
> Has anyone any experience with displaying content from an ARC content
> archive rather than directly from the URL? Do I require an ARC-access
> redisplay tool such as the 'Wayback Machine' to achieve this? If so, can anyone
> give advice on this or other similar tools for ARC redisplay?

arcretriever, part of WERA (previously NWA), allows retrieving an ARCRecord
through offset and arcname.

> Any help would be greatly appreciated; thanks in advance.
>
> Seamus

Lukas
|