From: Kaisa K. <kau...@cc...> - 2006-11-04 08:24:01

Hi all,

I don't seem to find a combination of hadoop-0.5.0 and nutchwax-0.6.x or nutchwax-0.7.x that will index on my machines.

hadoop-0.5.0 + nutchwax-0.6.1 (latest official) fails (for different reasons than 0.7.0-200611030343).
hadoop-0.5.0 + nutchwax-0.7.0-200611030343 (latest build artifact) fails.

Attached is a log from the 0.7.0 run when trying to index one ARC. The run stops with 'A record version mismatch occurred. Expecting v3, found v5'.

Best,
Kaisa Kaunonen
Nat.Lib.Finland

From: Michael S. <st...@ar...> - 2006-11-03 15:24:20

Shay Lawless wrote:
> Hi,
>
> I am using nutchWax to index a series of ARC files created in a
> webcrawl using the Heritrix crawler.
>
> My problem occurs when I perform a query on nutchWax and attempt to
> view the results: nutch attempts to send me to the URL in question
> rather than the archived content item. As a result I am getting an
> error as the URL is not being correctly formed.

That's right. You need something to serve up the archived content. Nutchwax has traditionally been paired with WERA: http://archive-access.sourceforge.net/projects/wera/. Check it out.

We also need to make it so Nutchwax works using the opensource wayback machine. It's been reported recently that the bridge between the two is broken at the moment. It needs to be fixed.

Yours,
St.Ack

> Has anyone any experience with displaying content from an ARC content
> archive rather than directly from the URL? Do I require an ARC-access
> redisplay tool such as the 'Wayback Machine' to achieve this? If so, can
> anyone give advice on this or other similar tools for ARC redisplay?
>
> Any help would be greatly appreciated. Thanks in advance,
>
> Seamus

From: Shay L. <sea...@gm...> - 2006-11-03 14:33:16

Hi,

I am using nutchWax to index a series of ARC files created in a webcrawl using the Heritrix crawler.

My problem occurs when I perform a query on nutchWax and attempt to view the results: nutch attempts to send me to the URL in question rather than the archived content item. As a result I am getting an error, as the URL is not being correctly formed.

Has anyone any experience with displaying content from an ARC content archive rather than directly from the URL? Do I require an ARC-access redisplay tool such as the 'Wayback Machine' to achieve this? If so, can anyone give advice on this or other similar tools for ARC redisplay?

Any help would be greatly appreciated. Thanks in advance,

Seamus

From: Kaisa K. <kau...@cc...> - 2006-11-03 07:56:24

I had something similar and was given the advice to use the very latest version of nutchwax with hadoop-0.5.0 (and not hadoop-0.7.2, for example).

On Thu, 2 Nov 2006, James Grahn wrote:
> Greets,
> I have been attempting to follow the tutorial to get NutchWAX up and
> running in standalone mode, but I've reached an error that confounds me.
>
> The printlns seem to indicate that NutchWAX does successfully import the
> ARC files. I see this line:
>
>   opening /tmp/mirror/heretrix/IAH-20061026194403-00000.arc.gz
>
> And after many individual pages being imported, I see this line:
>
>   061102 115327 opening /tmp/mirror/heretrix/IAH-20061026194522-00001.arc.gz
>
> This is followed by more individual pages. So that seems fine. But no
> index is generated and the printlns end like this:
>
>   ...
>   061102 115345 adding http://www.cnn.com/CNN/Programs/student.news/ 24869 text/html
>   061102 115345 adding http://www.cnn.com/CNN/Programs/people/ 367 text/html
>   Exception in thread "main" java.io.IOException: Job failed!
>           at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
>           at org.archive.access.nutch.ImportArcs.importArcs(ImportArcs.java:519)
>           at org.archive.access.nutch.IndexArcs.doImport(IndexArcs.java:154)
>           at org.archive.access.nutch.IndexArcs.doAll(IndexArcs.java:139)
>           at org.archive.access.nutch.IndexArcs.doJob(IndexArcs.java:246)
>           at org.archive.access.nutch.IndexArcs.main(IndexArcs.java:439)
>           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>           at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>           at java.lang.reflect.Method.invoke(Method.java:585)
>           at org.apache.hadoop.util.RunJar.main(RunJar.java:130)
>
> Any suggestions for this error? I am using a hadoop installation I
> acquired with the current version of nutch, and am running the "all"
> command as per the tutorial:
>
>   ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test
>
> Thanks,
> James

From: James G. <jg...@si...> - 2006-11-02 17:46:04

Greets,

I have been attempting to follow the tutorial to get NutchWAX up and running in standalone mode, but I've reached an error that confounds me.

The printlns seem to indicate that NutchWAX does successfully import the ARC files. I see this line:

  opening /tmp/mirror/heretrix/IAH-20061026194403-00000.arc.gz

And after many individual pages being imported, I see this line:

  061102 115327 opening /tmp/mirror/heretrix/IAH-20061026194522-00001.arc.gz

This is followed by more individual pages. So that seems fine. But no index is generated, and the printlns end like this:

  ...
  061102 115345 adding http://www.cnn.com/CNN/Programs/student.news/ 24869 text/html
  061102 115345 adding http://www.cnn.com/CNN/Programs/people/ 367 text/html
  Exception in thread "main" java.io.IOException: Job failed!
          at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
          at org.archive.access.nutch.ImportArcs.importArcs(ImportArcs.java:519)
          at org.archive.access.nutch.IndexArcs.doImport(IndexArcs.java:154)
          at org.archive.access.nutch.IndexArcs.doAll(IndexArcs.java:139)
          at org.archive.access.nutch.IndexArcs.doJob(IndexArcs.java:246)
          at org.archive.access.nutch.IndexArcs.main(IndexArcs.java:439)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:585)
          at org.apache.hadoop.util.RunJar.main(RunJar.java:130)

Any suggestions for this error? I am using a hadoop installation I acquired with the current version of nutch, and am running the "all" command as per the tutorial:

  ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test

Thanks,
James
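For anyone reproducing this setup: as a later post in this thread notes, the import step reads the inputs directory for a file listing the ARC locations, one per line. A minimal sketch, assuming local paths (the manifest file name and ARC names below are illustrative only):

  mkdir -p /tmp/inputs
  # one ARC location per line
  cat > /tmp/inputs/arcs.txt <<'EOF'
  /tmp/mirror/heretrix/IAH-20061026194403-00000.arc.gz
  /tmp/mirror/heretrix/IAH-20061026194522-00001.arc.gz
  EOF
  ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test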

From: Maximilian S. <sch...@ci...> - 2006-10-27 17:32:20

I've distilled Michael Stack's instructions into a shell script which I'd like to share. It seems to work quite well for me, but I've only used it on smaller archives (several hundred MBs) with the latest NutchWAX (CVS HEAD) and under Cygwin. Please let me know if it works for you and whether you still find everything with the new indices:

http://www.cip.ifi.lmu.de/~schoefma/howto/incremental_indexing_with_nutchwax/incr_index.sh

Usage:
  ./incr_index.sh input_dir target_dir [collection_name]
or
  ./incr_index.sh --arcs dir_with_arc_files target_dir [collection_name]

Example:
  ./incr_index.sh --arcs heritrix/jobs/MyJob-12345/arcs myarch/output mycoll

Preconditions:
- HADOOP_HOME and NUTCHWAX_HOME must be set.
- You need an existing index in "target_dir" to operate on, e.g. one generated by running NutchWAX's "all" task on a set of arc files.

Hints:
- Save your production index directory before running this script on it!
- When using Cygwin, use relative paths, especially for the input dir.
- Either shut down NutchWAX when running this script, or operate on a copy of your live index (to avoid permission-denied errors).

Return codes (usable by other scripts):
  0 - Everything went fine.
  1 - The script failed to start (directory not found, etc.).
  2 - The importing/indexing process was already started and the index in the target directory might have been damaged. You should restore it from your backup in this case.

- Max
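Since the script advertises machine-readable exit codes, a caller can branch on them; a minimal wrapper sketch (the backup path is an assumption, following the "save your index first" hint above):

  ./incr_index.sh --arcs heritrix/jobs/MyJob-12345/arcs myarch/output mycoll
  case $? in
    0) echo "index updated" ;;
    1) echo "failed to start: check directories and env vars" >&2 ;;
    2) echo "index may be damaged; restoring backup" >&2
       rm -rf myarch/output && cp -r myarch/output.bak myarch/output ;;
  esac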

From: Michael S. <st...@ar...> - 2006-10-26 14:50:52

Maximilian Schoefmann wrote:
> Hi,
> ...
> Am I missing some important configuration thing here or did the Nutch part
> of wayback just not get enough love the last months (-: ?

The latter is likely the problem. Let me talk to Brad (Mr. Wayback). It's a priority that it gets fixed.

St.Ack

> Cheers,
> Max

From: Maximilian S. <sch...@ci...> - 2006-10-26 12:32:07

Hi,

I'm trying to get Wayback with nutchWax working, but I'm running into this error from nutchwax: field "link" does not appear to be indexed.

Now I don't know whether nutchWax or Wayback is to blame here, or if "link" _should_ be in my index but isn't somehow?!

When I just remove &sort=link from the query URL, the query works fine. I found it being added here:

  org.archive.wayback.resourceindex.NutchResourceIndex.java (285):
  ms.append("&sort=link");

But even without the exception being thrown, I don't get any results, as the exact date is added to the query each time. And by exact I mean _second_! Even when I select "All" from the years select box, "date%3A20061231235959" is added (?); when I select "2003", date%3A20031231235959 is added. Nutch will then only search for documents with this specific timestamp.

Am I missing some important configuration thing here or did the Nutch part of wayback just not get enough love the last months (-: ?

Cheers,
Max

Full stacktrace:

  java.lang.RuntimeException: field "link" does not appear to be indexed
    org.apache.lucene.search.FieldCacheImpl.getAuto(FieldCacheImpl.java:356)
    org.apache.lucene.search.FieldSortedHitQueue.comparatorAuto(FieldSortedHitQueue.java:341)
    org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(FieldSortedHitQueue.java:184)
    org.apache.lucene.search.FieldSortedHitQueue.<init>(FieldSortedHitQueue.java:58)
    org.apache.lucene.search.TopFieldDocCollector.<init>(TopFieldDocCollector.java:44)
    org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:108)
    org.apache.lucene.search.Searcher.search(Searcher.java:76)
    org.apache.nutch.searcher.LuceneQueryOptimizer.optimize(LuceneQueryOptimizer.java:268)
    org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:95)
    org.apache.nutch.searcher.NutchBean.search(NutchBean.java:180)
    org.apache.nutch.searcher.NutchBean.search(NutchBean.java:242)
    org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:136)
    org.archive.access.nutch.NutchwaxOpenSearchServlet.doGet(NutchwaxOpenSearchServlet.java:69)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
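A quick way to confirm the sort parameter is the trigger is to hit the OpenSearch servlet directly, with and without it; a sketch, with host, port, and webapp path assumed (these will differ per deployment):

  curl 'http://localhost:8080/nutchwax/opensearch?query=foo'            # returns results
  curl 'http://localhost:8080/nutchwax/opensearch?query=foo&sort=link'  # throws the RuntimeException above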

From: Alex Wu <aw...@sd...> - 2006-10-24 19:06:20

Hi Brad,

Another update. Thank you for the input.

We modified org.archive.wayback.cdx.RemoteCDXIndex.java to search multiple remote indexes. One wayback instance (the frontend) has been configured to use RemoteCDX in the web.xml, and the other wayback instances are using the LocalBDB configuration.
[code block]
  // list of remote wayback XML-query endpoints to fan out to
  private String sdscSearchUrlBases[] = {
      "http://machine:9000/wayback/xmlquery",
      "http://machine:9002/wayback/xmlquery", ...};

  public SearchResults query(WaybackRequest wbRequest) {
      ...
      SearchResults searchResults = new SearchResults();
      // query each remote index in turn and accumulate the results
      for (int i = 0; i < sdscSearchUrlBases.length; i++) {
          try {
              doc = queryOneUrl(sdscSearchUrlBases[i], wbRequest);
              ...
          }
          ...
      }
      ...
  }
[end code block]
On Oct 18, 2006, at 2:58 PM, Brad Tofel wrote:
> Sorry for the delayed response -- lots of balls in the air right now..
>
> I just did a large check-in of the software that supports sorted
> flat files, but will not have time to update the docs for another
> week or so. There are some comments in the code and in the new
> web.xml (which has changed pretty significantly) that might be
> enough to make sense of how to use the new functionality.
>
>
> Currently, there is no support for querying multiple remote
> indexes, but it seems like this should be relatively
> straightforward, in the good-guy case, using the new software:
> you'd just need to make a RemoteSearchResultSource out of the
> RemoteResourceIndex, and modify the SearchResultSourceFactory to
> build a composite from several of them...
>
> I say "in the good-guy case" because the failure modes might get
> complicated in terms of timeouts, failed connections, etc. However,
> if your hardware is stable, then the "easy solution" I outlined
> might be good enough.
>
> I'll drop you another line when the documentation has been updated.
>
> Brad
>
> Bing Zhu wrote:
>> Dear Mr. Brad,
>>
>> This is Bing Zhu from the University of California, San Diego.
>>
>> We really appreciate you taking the time to answer our questions.
>>
>> Is it possible for a Wayback machine to query multiple index
>> sources (e.g. index info
>> in multiple Wayback machines) when using RemoteCDXIndex? If yes,
>> would you
>> let us know how to do so? Many thanks.
>>
>> Sincerely,
>> Bing
Alex Wu
858-534-5074

From: Maximilian S. <sch...@ci...> - 2006-10-11 09:33:49

Ok, CVS works. But someone should update the webpage, as the instructions there don't work anymore. This works (archive-access.cvs.sourcef... instead of cvs.sourcef...):

  cvs -d:pserver:ano...@ar...:/cvsroot/archive-access login
  cvs -z3 -d:pserver:ano...@ar...:/cvsroot/archive-access co -P archive-access

> It seems that the cruisecontrol server has been down for a while. As
> sourceforge CVS also doesn't work (neither directly nor with CVSGrab), this
> makes it a bit hard to test the latest versions.

From: Maximilian S. <sch...@ci...> - 2006-10-11 09:21:11

Hi,

It seems that the cruisecontrol server has been down for a while. As sourceforge CVS also doesn't work (neither directly nor with CVSGrab), this makes it a bit hard to test the latest versions.

Could it be that sourceforge neglects its CVS services a bit...? I'm seeing this quite often :-(

Max

From: Alex Wu <aw...@sd...> - 2006-10-10 19:11:16

Hi Brad,

Thank you for your input. I wanted to give an update on our experience with the wayback application within Tomcat.

We tried one setup where, on one machine, we ran 12 instances of the wayback application, each in its own Tomcat container, and gave about 2,700 ARC files to each instance. Each Tomcat was allocated 1GB of memory. This was done over the weekend, and over 30,000 ARCs were processed.

Another setup was tried on the same machine, where 3 Tomcat instances were run, each with 6 wayback applications. Each wayback application handles 2,700 ARC files. Each Tomcat was allocated 1 to 3 GB of memory. Within the instance of Tomcat with 3GB allocated, the result in about 48 hours was just over 3,000 ARCs processed. The other two Tomcat instances are mostly idle, having indexed/merged their respective sets of ARCs almost completely over the weekend.

We are experimenting with different setups that involve many variables, such as the varying size of ARC files, non-wayback load on the machine, etc., so it's difficult to give a more accurate performance comparison without controlling the variables more.

We modified org/archive/wayback/cdx/indexer/IndexPipeline.class slightly so that the indexing, queuing, and merging run in separate threads, sleeping at different intervals. With this, 3 indexing threads are running.

Lastly, I was not able to view the CVS at http://crawltools.archive.org:8080/cruisecontrol/buildresults/HEAD-archive-access ("Firefox can't establish a connection to the server at crawltools.archive.org:8080").

Thank you again,
Alex
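For reference, a sketch of the one-container-per-webapp setup described above -- CATALINA_BASE and JAVA_OPTS are standard Tomcat knobs, but the paths and heap size here are illustrative:

  # one Tomcat per wayback instance, each with its own CATALINA_BASE and 1GB heap
  for i in 1 2 3; do
    CATALINA_BASE=/srv/tomcat$i JAVA_OPTS="-Xmx1024m" \
      /opt/tomcat/bin/catalina.sh start
  done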

From: Lukas M. <mat...@ce...> - 2006-10-09 14:28:37

On Monday, 09 October 2006 16:13, Shay Lawless wrote:
> Hi all,
>
> I'm attempting to use NutchWax to index a number of .arc files generated by
> a web crawl. I can get the indexing step to run fine, and when I perform a
> keyword search results are returned and ranked by nutch. However when I
> click on any of the results, the content cannot be displayed. The message
> returned is as follows,

Did you fill in the collection name during the indexing process? Which version of NutchWax are you using?

lukas

> Not Found: The requested URL /null/20060930150000/http://blah.blah.com/ was
> not found on this server.
>
> Additionally, a 404 Not Found error was encountered while trying to use an
> ErrorDocument to handle the request.
>
> Any help you can provide would really be appreciated,
>
> Thanks,
>
> Séamus Lawless
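For reference, the "null" in that URL is where the collection name would normally appear. The indexing command earlier in this thread passes the collection name as the last argument; a sketch ("test" is just an example, and the link between an omitted name and the /null/ URLs is an inference from the error above):

  # the final argument is the collection name
  ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test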

From: Shay L. <sea...@gm...> - 2006-10-09 14:13:22

Hi all,

I'm attempting to use NutchWax to index a number of .arc files generated by a web crawl. I can get the indexing step to run fine, and when I perform a keyword search results are returned and ranked by nutch. However, when I click on any of the results, the content cannot be displayed. The message returned is as follows:

  Not Found
  The requested URL /null/20060930150000/http://blah.blah.com/ was not found on this server.

  Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.

Any help you can provide would really be appreciated,

Thanks,

Séamus Lawless

From: Brad T. <br...@ar...> - 2006-09-26 20:20:28

Hi Alex,

Good questions, all of them.

First off, your collection is larger than any collection we've implemented using the current WM, but we are in the process, right now, of creating an installation of about 5TB, or about 50K ARCs, so you're not completely out in front of the crowd.

Firstly, the BDBJE has performance issues at larger scales when inserting in random order, both in insert and in subsequent lookup. We haven't yet done serious performance analysis on this. Our solution has been to externally sort the index data. This makes insert linear in performance, and lookup performance has been good on BDBJEs created this way (see the answer to #2 below for a few more hints on implementing this, or the online User Manual in the near future). I'll add some notes on how we've been implementing this to the User Manual.

0.8.0, which will hopefully be available soon, will include modules for distributing an index across multiple nodes, in alphabetic regions. This code is mostly done now, but is not checked in. 0.8.0 will also include several new index-related features, including:

+ the capability to use sorted flat files as a Wayback index (which will allow external sort tools to be used to generate the index; long term (1.0.0) we're planning on using Hadoop for this)
+ the capability to merge results found from multiple index sources, which could involve multiple sorted flat files and a BDBJE, for example.

We expect that the combination of these features will allow indexes of arbitrarily large sizes to be created and searched efficiently. Today, 48K ARCs is pushing the edge.

I can probably do a check-in in the next few days of most of the functionality I've described above, if you're interested in helping to test this new software.

Specific answers to your questions below.

Alex Wu wrote:
> Hi,
>
> We have a project with about 48000 ARC files, and would like inputs on
> the best way to implement the wayback machine 0.6.0.
>
> Our setup is Tomcat 5.5.17, JDK 1.5, 1GB memory for the JVM. We have only
> 6000 ARCs indexed at this point over a 1-week period. We would like to
> increase this rate significantly.
>
> Some questions we have are:
>
> 1. Suggested environment setup for this number of ARC files and greater.

Your current setup should be fine for this, but when the distributed index option is available, it would be advisable to move to this configuration.

> 2. Parallel indexing option for the current version, or additional
> tools that will allow for this.

The pipeline-client command line tool has a new option to generate a flat-file version of the index data on STDOUT. This process could be executed in parallel across multiple nodes, and their outputs sorted and merged together to form a single flat file. This flat file can be used today with the BDBJE option, by manually placing the file into the "toBeMerged" directory on the host holding the index. We've seen acceptable performance inserting large sorted files in this manner. With the new flat-file binary-searching ResourceIndex code, this sorted flat file could be used as-is, bypassing the BDBJE altogether. I'll let you know when it's checked in.

> 3. The index is tied to the machine name. How to avoid this.

Not sure what you mean. Do you mean there is data internal to the BDBJE that is aware of the host where it was created and cannot be used on other hosts? Can you elaborate?

> 4. Is it possible to have multiple wayback installations, each with
> its own JVM, use the same arc files and/or index.

Yes. We have a couple of installations that include front-end UIs for Proxy, Timeline, and Archival URL replay modes on top of the same index, where each installation uses a RemoteCDXIndex. I'll add some documentation to the User Manual outlining this configuration in the next day or two.

> 5. The user manual at
> http://archive-access.sourceforge.net/projects/wayback/user_manual.html
> mentions a non-LocalBDBResourceIndex resource implementation that
> communicates with a remote wayback installation. The user manual does
> not cover the preparation of the index data. What are the steps for
> this setup, including index data preparation.

As mentioned in #4, I'll outline this configuration in the User Manual, but the basics: set up one webapp with a LocalBDBResourceIndex, making sure it has a QueryUI with the QueryXMLUI jsps set up. This will allow HTTP-XML queries of the index. Then you set up one or more webapps, using whatever replay modes you prefer, using the RemoteCDXIndex ResourceIndex implementation to connect to the HTTP-XML exported ResourceIndex.

> 6. Is there a limitation to the number of ARCs wayback will handle.

With the 0.8.0 features, we expect the WM to be able to scale to arbitrarily large numbers of ARC files. Generating indexes for larger installations will be handled offline, and will be a manual process until the 1.0.0 release.

Thanks for the feedback and questions. We're very interested in your experiences and in making this software as easy to use as possible.

Brad
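A minimal sketch of the sort-then-merge flow Brad describes; the pipeline-client option for flat-file output is left elided since its exact flag isn't given here, and all paths are illustrative:

  # on each node: pipeline-client <flat-file option elided> > part-<node>.cdx
  # then gather the parts and byte-order sort them into one sorted flat file
  export LC_ALL=C
  sort part-*.cdx > merged-sorted.cdx
  # feed it to the BDBJE by dropping it into the index host's toBeMerged dir
  mv merged-sorted.cdx /path/to/index/toBeMerged/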

From: Alex Wu <aw...@sd...> - 2006-09-26 19:32:49

Hi,

We have a project with about 48000 ARC files, and would like inputs on the best way to implement the wayback machine 0.6.0.

Our setup is Tomcat 5.5.17, JDK 1.5, 1GB memory for the JVM. We have only 6000 ARCs indexed at this point over a 1-week period. We would like to increase this rate significantly.

Some questions we have are:

1. Suggested environment setup for this number of ARC files and greater.
2. Parallel indexing option for the current version, or additional tools that will allow for this.
3. The index is tied to the machine name. How to avoid this.
4. Is it possible to have multiple wayback installations, each with its own JVM, use the same arc files and/or index.
5. The user manual at http://archive-access.sourceforge.net/projects/wayback/user_manual.html mentions a non-LocalBDBResourceIndex resource implementation that communicates with a remote wayback installation. The user manual does not cover the preparation of the index data. What are the steps for this setup, including index data preparation.
6. Is there a limitation to the number of ARCs wayback will handle.

Thank you for your input.

Alex Wu
858-534-5074

From: roger p <ro...@ho...> - 2006-08-27 23:16:20

Can anyone tell me how to overcome the following problem?

I use:
- Fedora Core 5
- Sun Java 1.5

and I have followed the instructions throughout the manual, but still I get an error retrieving the records in the arc files. The following error occurs:

  :WARN: /nutchwax/opensearch:
  java.lang.RuntimeException: java.lang.NullPointerException
          at org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:204)
          at org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:346)
          at org.archive.access.nutch.NutchwaxBean.getSummary(NutchwaxBean.java:53)
          at org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
          at org.archive.access.nutch.NutchwaxOpenSearchServlet.doGet(NutchwaxOpenSearchServlet.java:69)
          at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
          at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
          at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:442)
          at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:357)
          at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:226)
          at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:615)
          at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:150)
          at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:123)
          at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:141)
          at org.mortbay.jetty.Server.handle(Server.java:272)
          at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:404)
          at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:650)
          at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:488)
          at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:198)
          at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:319)
          at org.mortbay.jetty.nio.HttpChannelEndPoint.run(HttpChannelEndPoint.java:270)
          at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:475)

Before this it has found 14 hits for the searched word:

  2006-08-27 17:11:30,038 INFO NutchBean - found 14 raw hits
  2006-08-27 17:11:30,039 INFO NutchBean - total hits: 1829

So the problem does not seem to be searching in the indices. Maybe there is a problem with accessing the arcs to get the specific page. I put the correct path (searcher.dir) inside nutch-site.xml and hadoop-site.xml.

Has anyone any idea about how to solve this problem?

CK
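One thing worth checking for this kind of NullPointerException in getSummary: the searcher reads summaries from the segments, so searcher.dir has to contain them alongside the index. A quick layout check, with directory names following the NutchWAX discussion elsewhere in this thread (the diagnosis itself is an assumption):

  SEARCHER_DIR=/path/to/searcher.dir   # the value set in nutch-site.xml
  ls "$SEARCHER_DIR"
  # expect something like:  index/ (or indexes/)  segments/  crawldb/  linkdb/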

From: Maximilian S. <sch...@ci...> - 2006-08-01 08:28:38

Thanks for the fast and informative answer!

> It's not so much that it's broken.

Good to hear. Btw, I'm now using a 0.7 version from the integration server. Don't know if that changes anything.

> + If no 'index' sub-directory in ${searcher.dir}, then the nutch
> searcher NutchBean in the webapp opens all indices in the 'indexes'
> subdir. Usually, under 'indexes', there are subdirectories holding an
> index per segment. I've tested mixing in ${searcher.dir}/indexes the
> indexes of merged segments and individual segment indices. This works
> as long as the indices under searcher.dir/indexes have an (empty)
> 'index.done' file added (merged indexes don't have this file present --
> you may have to add it manually). So, you can ingest new ARCs, then add the
> new segment to the crawldb, do a new link inversion (the new segment's
> links will be added to the old linkdb, as for the crawldb), and then
> index your new segment. When done, add the new index to
> ${searcher.dir}/indexes and perhaps move the old, big merged index here
> (adding in the index.done file) and restart your webapp. You should be
> able to search the old and new.
> + But you may find that you might have to merge your new incremental
> segment indices into the large index and then 'sort' the merged index to
> get good, 'balanced' results. Sorting is a recent feature added to nutch
> that allows sorting the index by rank so the highest-ranked pages are
> returned first. Generally you sort to get the best results returned
> faster than you would from the unsorted index. But, I've observed that,
> querying across multiple indices, one index may be favored. To fix, I
> found I had to merge and sort all indices (to sort, do
> '$HADOOP_HOME/bin/hadoop jar nutchwax.jar class
> org.apache.nutch.indexer.IndexSorter').

Seems complicated, but I will give it a try as soon as I have crawled a few smaller arcs which don't take too long to index.

But as the URLs of the site I'm crawling don't change too often, and searching across multiple versions doesn't work right now, having a merged index won't mean much to me.

Would it make sense to just name the collections after the crawl date, to be able to distinguish between different versions?

Regards,
Max

From: Michael S. <st...@ar...> - 2006-07-31 20:56:53

Maximilian Schoefmann wrote:
> Hi *,
>
> I want to do regular crawls of a bigger website. I've already crawled it
> successfully with heritrix, indexed the resulting arcs with nutchwax and
> also searched/browsed them with wera. Works pretty well!

I'm glad to hear.

> Now I wanted to do a second crawl, but I've read that incremental indexing
> is broken in nutchwax 0.6 (which I am using).

It's not so much that it's broken. It's more that I don't yet have a good story to tell on how to do incremental indexing in 0.6+ of nutchwax. Here is what I currently know (I've been kinda waiting on getting more practice under my belt before starting in on writing a recipe for others):

+ If no 'index' sub-directory in ${searcher.dir}, then the nutch searcher NutchBean in the webapp opens all indices in the 'indexes' subdir. Usually, under 'indexes', there are subdirectories holding an index per segment. I've tested mixing in ${searcher.dir}/indexes the indexes of merged segments and individual segment indices. This works as long as the indices under searcher.dir/indexes have an (empty) 'index.done' file added (merged indexes don't have this file present -- you may have to add it manually). So, you can ingest new ARCs, then add the new segment to the crawldb, do a new link inversion (the new segment's links will be added to the old linkdb, as for the crawldb), and then index your new segment. When done, add the new index to ${searcher.dir}/indexes and perhaps move the old, big merged index here (adding in the index.done file) and restart your webapp. You should be able to search the old and new.

+ But you may find that you might have to merge your new incremental segment indices into the large index and then 'sort' the merged index to get good, 'balanced' results. Sorting is a recent feature added to nutch that allows sorting the index by rank so the highest-ranked pages are returned first. Generally you sort to get the best results returned faster than you would from the unsorted index. But, I've observed that, querying across multiple indices, one index may be favored. To fix, I found I had to merge and sort all indices (to sort, do '$HADOOP_HOME/bin/hadoop jar nutchwax.jar class org.apache.nutch.indexer.IndexSorter').

> I guess I need incremental indexing if I want to be able to search across
> all versions of the site?

Yes. Sort of. The not-so-good news is that in new nutch(wax), the key it uses doing all of the mapreduce indexing steps is the URL (not URL+date, but URL only). What this means is that only the latest version of a page is searchable; unlike old nutch, you can't search a single URL across all page versions. This feature was lost when we moved on to new nutch. Recently I made nutchwax use URL + collection as the key, end-to-end in indexing and at query time. This makes it so I can have the same URL in the index multiple times, distinguished by collection. Next will be to key by URL + date (see '[ 1518431 ] [nutchwax] Search multiple versions of one URL broken', http://sourceforge.net/tracker/index.php?func=detail&aid=1518431&group_id=118427&atid=681137).

> Now I think I have three options:
> 1. wait until incremental indexing is fixed
> 2. use the 4.3 branch

The 4.3 branch is dead and no longer supported.

> 3. index only the newly crawled arcs and let the user select on which date
> she wants to search
>
> So my questions are:
> - Is it foreseeable when incremental indexing will be fixed - and if so -
> what performance can I expect compared to completely reindexing all arc
> files?

Soon (smile). It's about time we had a new NutchWAX release. Lots of changes of late (not the least of which is that there is an official 0.8 nutch release). I'm currently working on making incremental updates work for us internally. Once the internal client is satisfied, I'll document and release. I'd SWAG a month.

> - Will the 4.3 branch be maintained beside the 0.6 branch, and will it be
> possible to convert the webdb/indices later (doesn't seem to be the case
> right now)?

No to both questions.

St.Ack
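A condensed sketch of the mix-in steps above (all paths, including the hadoop part-00000 output name, are illustrative; the IndexSorter invocation is quoted from the text, with any further arguments left elided):

  SEARCHER_DIR=/path/to/searcher.dir
  # drop the freshly built segment index in beside the big merged one
  cp -r outputs/indexes/part-00000 "$SEARCHER_DIR/indexes/new-segment-index"
  # merged indexes lack the (empty) index.done marker; add it by hand
  touch "$SEARCHER_DIR/indexes/new-segment-index/index.done"
  # if one index dominates results, merge and then sort by rank
  $HADOOP_HOME/bin/hadoop jar nutchwax.jar class org.apache.nutch.indexer.IndexSorter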

From: Maximilian S. <sch...@ci...> - 2006-07-31 12:22:39

Hi *,

I want to do regular crawls of a bigger website. I've already crawled it successfully with heritrix, indexed the resulting arcs with nutchwax, and also searched/browsed them with wera. Works pretty well!

Now I wanted to do a second crawl, but I've read that incremental indexing is broken in nutchwax 0.6 (which I am using). I guess I need incremental indexing if I want to be able to search across all versions of the site?

Now I think I have three options:
1. wait until incremental indexing is fixed
2. use the 4.3 branch
3. index only the newly crawled arcs and let the user select on which date she wants to search

So my questions are:
- Is it foreseeable when incremental indexing will be fixed - and if so - what performance can I expect compared to completely reindexing all arc files?
- Will the 4.3 branch be maintained beside the 0.6 branch, and will it be possible to convert the webdb/indices later (doesn't seem to be the case right now)?

What solution would you suggest?

Thanks & best regards,
Max

From: Michael S. <st...@ar...> - 2006-07-20 16:23:04

Natalia Torres wrote:
> Thanks Michael, I'll experiment with the indexing job this way.
>
> About the indexing process...
>
> I'm testing how it works (Heritrix+Hadoop+NutchWax+Wera) with our web,
> and I'm running it in standalone mode with one crawled job (about 7 arcs,
> 700MB).

How long is it taking you to index your 7 ARCs?

> I want to start a hadoop cluster but I don't know how many slaves to use
> or the hardware requirements for it. I'm looking for information about
> benchmarks, indexing performance, etc. to learn more about the hardware
> needed, but I can't find anything.

When the software settles more -- hadoop, nutch, and nutchwax -- I'll put up some figures on our experience here at the Archive. Meantime, here are a few coarse stats:

+ A cluster should have at least 3, probably 4, machines to make distribution worth the bother.
+ Here at the Archive, we have a rack of between 16 and 30 machines that we've been running/debugging indexing jobs on over the last bunch of months (the number of slaves participating varies because the hardware we use is not of the best quality, and these indexing jobs, lasting days and checksumming everything read and written, are a good way of finding flaky RAM sticks and erroring motherboards). We find on this rack that total processing of an ARC, from ingest through indexing, takes about 3 minutes (machines are 4GB, 2GHz dual-core Athlons with 4x400 SATA disks).

Other things to consider:

+ Make all slave nodes exactly the same -- same RAM and disk configuration. It'll save you headache down the road.
+ Set up rsync so you can pull ARCs into your cluster with it. Once done, you can then feed nutchwax lists of ARCs using rsync URLs. This way, you can leave your ARCs out on storage nodes and the indexing software will take care of making the ARCs local to the indexing cluster.
+ DFS cannot be trusted. It'll be fixed soon, but for now, as soon as an indexing job is completed, make a backup of the produced data -- segments and indices -- to local storage.

Yours,
St.Ack
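A sketch of the rsync-URL idea: the ARC listing fed to nutchwax names each ARC by rsync URL instead of a local path (the hostnames and paths below are made up for illustration):

  # /tmp/inputs/arcs.txt -- one ARC per line
  rsync://storage1.example.org/arcs/IAH-20060619172903-00000.arc.gz
  rsync://storage2.example.org/arcs/IAH-20061026194403-00000.arc.gz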

From: Natalia T. <nt...@ce...> - 2006-07-07 11:45:11

Thanks Michael, I'll experiment with the indexing job this way.

About the indexing process...

I'm testing how it works (Heritrix+Hadoop+NutchWax+Wera) with our web, and I'm running it in standalone mode with one crawled job (about 7 arcs, 700MB).

I want to start a hadoop cluster but I don't know how many slaves to use or the hardware requirements for it. I'm looking for information about benchmarks, indexing performance, etc. to learn more about the hardware needed, but I can't find anything.

Thanks,
Natalia

From: Michael S. <st...@ar...> - 2006-07-07 00:41:45

Natalia Torres wrote:
> Hello,
>
> I tried to add the new job, moving the indexes directory before starting
> the index process, and it works fine. Thanks!!
>
> So, every time I want to index a new job I need to move the indexes
> directory? If I move this directory, does the nutchwax search still work?

I presume you are using the 'all' command each time? It will complain if there are already indices in place from a previous run.

The 'all' command is a convenience. It assumes you want to do a single-pass indexing of a set of ARCs. Running the 'all' command to bring in a new set of ARCs will run through all steps and index all the new additions, as well as reindex all ARCs added previously.

It sounds like you want to do incremental updates to your index. Experiment by calling the jobs that comprise the 'all' command individually. For example, run the import, passing it a directory that contains a file that points to just the new ARCs you want to ingest. Then do 'update' and 'invert'. Next, run indexing on just the segments that were added by the ingest step (save aside the indexes made previously first). Run your deduplication. Finally, merge the new indices and the old.

I'm working currently on tools and documentation to better support incremental updates to indices. They'll form the core of the next release (coming soon -- a month or so).

> This process takes many hours...

Yes, it can. It depends on the number of ARCs you have. It sounds, too, like you are running in standalone mode. You might consider starting a small hadoop cluster. That should improve your throughput.

Yours,
St.Ack
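A hedged sketch of those individual steps as commands. The job names (import, update, invert, index) are the ones named above, but the exact arguments are assumptions and should be checked against nutchwax.jar's usage text:

  J="${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar"
  $J import /tmp/new-inputs /tmp/outputs mycoll   # ingest only the new ARCs
  $J update /tmp/outputs                          # add the new segment to the crawldb
  $J invert /tmp/outputs                          # invert links into the linkdb
  $J index  /tmp/outputs                          # index just the new segments
  # ...then deduplicate, and merge the new indices with the old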

From: Michael S. <st...@du...> - 2006-07-07 00:18:01

From: Natalia T. <nt...@ce...> - 2006-07-05 12:10:28

I have the same problem searching any document (gif, html, ...) as JCL, using these versions of Wera and Nutchwax, while Wayback works fine. I tried to change the arc path in documentDispatcher, but it doesn't work.

Natalia

arc...@li... wrote:
> Today's Topics:
>
> 1. Re: Nutchwax+Wera problems (Michael Stack)
>
> Message: 1
> Date: Mon, 03 Jul 2006 16:03:27 -0700
> From: Michael Stack <st...@ar...>
> Subject: Re: [Archive-access-discuss] Nutchwax+Wera problems
> To: João Cláudio Luzio <jl...@ex...>
>
> João Cláudio Luzio wrote:
>> Oops.. forgot to say that the arcs were in the
>> /var/local/webarchive/heritrix/jobs/bn_18_test-20060619172505727/
>> directory, but with .arc.gz instead of .arc.
>
> This should be fine.
>
>> João Cláudio Luzio wrote:
>>> Hi,
>>> I've been trying to get the pair up and running for a while now but
>>> had some problems..
>>> Using nutchwax 0.4.3 and wera (0.4.2RC1 & 0.5.0) I managed to
>>> get it running, but some of the related files (images) aren't displayed.
>>> Those get:
>>>
>>>   <retrievermessage>
>>>     <head>
>>>       <errorcode>4</errorcode>
>>>       <errormessage>Unable to parse Archive Identifier</errormessage>
>>>     </head>
>>>   </retrievermessage>
>>>
>>> Using wera debug I found that "[archiveidentifier] =>
>>> 2770/IAH-20060619172903-00000-webarchive1" for a specific search I made.
>>> (Starting tomcat from the nutchwax indexed data)
>
> So, it generally works but some of the images don't show sometimes?
>
>>> Using wayback I don't have the same problems (I don't use nutchwax with
>>> wayback..).
>>>
>>> I've tried to get nutchwax 0.6.1 and wera running, but the opensearch
>>> servlet for the rss from nutchwax gives an exception..
>
> Do you still have the exception?
>
>>> So I tried nutchwax 0.7.0 (with latest hadoop - standalone), but now
>>> the arcretriever gives an exception when trying to get the document.
>>> Using wera debug I found that "[archiveidentifier] =>
>>> 2234331/filedesc://IAH-20060619172903-00000-webarchive1.arc" for the
>>> same search I made.
>>> (Starting tomcat from anywhere)
>>>
>>> The exception:
>>>
>>>   7 Bad function argument Cause: java.io.FileNotFoundException:
>>>   /var/local/webarchive/heritrix/jobs/bn_18_test-20060619172505727/filedesc:/IAH-20060619172903-00000-webarchive1.arc
>>>   does not exist.
>>>   Stack trace:
>>>   org.archive.io.arc.ARCUtils.isReadable(ARCUtils.java:171)
>>>   org.archive.io.arc.ARCUtils.testCompressedARCFile(ARCUtils.java:94)
>>>   org.archive.io.arc.ARCReaderFactory.get(ARCReaderFactory.java:200)
>>>   org.archive.io.arc.ARCReaderFactory.get(ARCReaderFactory.java:194)
>>>   no.nb.nwa.retriever.ARCRetriever.getDocument(ARCRetriever.java:410)
>>>   no.nb.nwa.retriever.ARCRetriever.doGet(ARCRetriever.java:131)
>
> Looks like we shouldn't be putting the 'filedesc:' on the front of the ARC
> name? Does ARCRetriever work if you make a request with
> IAH-20060619172903-00000-webarchive1.arc instead of
> filedesc:/IAH-20060619172903-00000-webarchive1.arc?
>
> St.Ack

--
Natalia Torres
Dept. de Sistemes
CESCA - Centre de Supercomputació de Catalunya
Gran Capità, 2-4 (Edifici Nexus) • 08034 Barcelona
T. 93 205 6464 • F. 93 205 6979 • nt...@ce...