| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2005 | | | | | | | 1 | 4 | 5 | 17 | 30 | 3 |
| 2006 | 4 | 14 | 8 | 11 | 2 | 13 | 9 | 2 | 2 | 9 | 20 | 9 |
| 2007 | 6 | 4 | 6 | 7 | 6 | 6 | 4 | 3 | 9 | 26 | 23 | 2 |
| 2008 | 17 | 19 | 16 | 27 | 3 | 21 | 21 | 8 | 13 | 7 | 8 | 8 |
| 2009 | 18 | 14 | 27 | 14 | 10 | 14 | 18 | 30 | 18 | 12 | 5 | 26 |
| 2010 | 27 | 3 | 8 | 4 | 6 | 13 | 25 | 11 | 2 | 4 | 7 | 6 |
| 2011 | 25 | 17 | 25 | 23 | 15 | 12 | 8 | 13 | 4 | 17 | 7 | 6 |
| 2012 | 4 | 7 | 1 | 10 | 11 | 5 | 7 | 1 | 1 | 5 | 6 | 13 |
| 2013 | 9 | 7 | 3 | 1 | 3 | 19 | 3 | 3 | | 1 | 1 | 1 |
| 2014 | 11 | 1 | | 2 | 6 | | | 1 | | 1 | 1 | 1 |
| 2015 | | | | | | 1 | 4 | | | | | 1 |
| 2016 | 4 | 3 | | | | | 1 | | | 1 | | |
| 2018 | | | | 1 | 1 | | 2 | | 1 | | 1 | |
| 2019 | 2 | 1 | | | | 2 | | | 1 | 1 | | |

From: Brad T. <br...@ar...> - 2007-05-16 01:07:21
|
There are a couple of mechanisms, depending on which replay UI and which ResourceIndex you're using.

If you're using Archival URL replay mode, you can use a wildcard '*' for the datespec, plus a trailing '*', to list documents in the index prefixed with the given URL. Example:

http://wayback.yourhost.org:8080/wayback/*/example.com/*

Alternatively, you can change the query URL 'type' argument from 'urlquery' to 'urlprefixquery'. This is clearly a shortcoming of the current search form, and will hopefully be addressed in the next release.

If you want to do further processing on URLs in the index, there are two tools packaged with the wayback called bdb-client and bin-search. They are command-line tools for dumping URLs with a given prefix from either a BDB index or a sorted CDX index. Hopefully the online documentation for these tools is enough to get you started with them, but let me know if it falls short.

Brad

Ignacio Garcia wrote:
> is there any way to list all contents in my archived files using wayback?
>
> I have three different crawls, but one of them is not complete, so I don't
> know exactly what files I actually have archived and I would like to know
> what files is wayback serving.
>
> I don't know if this is possible, since the url field doesn't seem to be
> taking wildcards.
>
> Thank you.
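Putting the two suggestions above into concrete requests, as a hedged sketch: only the '*' wildcards and the urlquery/urlprefixquery 'type' values come from the message; the host/port, query path, and 'url' parameter name are assumptions, so adjust them against whatever query URLs your search form actually submits.

# Archival URL form: wildcard datespec plus trailing '*' lists every capture
# whose URL starts with the given prefix.
curl 'http://wayback.yourhost.org:8080/wayback/*/example.com/*'

# Query form: take the URL the search form submits and switch the 'type'
# argument from urlquery to urlprefixquery (the /query path and 'url'
# parameter shown here are assumed for illustration).
curl 'http://wayback.yourhost.org:8080/wayback/query?type=urlprefixquery&url=example.com/'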
|
From: Ignacio G. <igc...@gm...> - 2007-05-14 13:13:17
|
Hello,

I have the following issue with wayback. I am trying to add several ARCs to wayback, but for some reason some of them make the ArcIndexer procedure crash and I get the error:

'Exception in thread "AutoARCIndexThread" java.lang.OutOfMemoryError: Java heap space'

The collection I was trying to add consisted of 11 ARC files with a total size of ~4.5 GB. I also tried adding them one by one, and found out that only some of the ARC files caused wayback to fail.

My first thought was that maybe the ARCs were corrupted and wayback was not able to index them because of that, but if I use the 'index-client' application or Heritrix's 'arcreader' script, the contents of the ARCs in question are listed properly and no error messages appear. I have tried with version 0.8 and also with two 0.9 continuous releases, and I've had the same results in all cases...

Does anyone know what the issue may be, or if there is a reliable way to test whether an ARC file is corrupted and will crash the AutoARCIndexThread?

Thank you.
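One thing worth ruling out first is simply the heap available to the indexing JVM. A minimal sketch, assuming the AutoARCIndexThread runs inside Tomcat; the 1 GB figure is only an example, so size it to your ARC files:

# Give Tomcat's JVM a larger heap, then restart it.
export CATALINA_OPTS="-Xmx1024m"
$CATALINA_HOME/bin/shutdown.sh
$CATALINA_HOME/bin/startup.sh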
|
From: Ignacio G. <igc...@gm...> - 2007-05-02 13:52:35
|
Is there any way to list all contents in my archived files using wayback?

I have three different crawls, but one of them is not complete, so I don't know exactly what files I actually have archived, and I would like to know what files wayback is serving.

I don't know if this is possible, since the url field doesn't seem to be taking wildcards.

Thank you.
|
From: Brad T. <br...@ar...> - 2007-04-27 21:16:59
|
Hi Ignacio,
The Wayback documentation has fallen a bit behind new features. This
should be rectified after a check-in that will significantly change the
configuration system (Wayback will no longer be configured via web.xml),
and also will switch the build system to maven 2 and continuum. We're
hoping to have all this complete sometime in May.
The exclusion system that is (barely) documented, and to which you are
referring, has significant performance issues: Every record retrieved
from the ResourceIndex requires an HTTP request to an external
"exclusion service". We are not recommending use of this exclusion
system until these performance issues have been addressed.
Recent versions of the Wayback software include an alternate
"static-map" exclusion system, which monitors the contents of a text
file, and excludes URLs and URL prefixes placed in the file.
Until we switch standard distributions to maven 2 and continuum, you can
grab a "preview" .war which includes this "static-map" exclusion
component, but otherwise is compatible with the older web.xml
configuration system. This .war should be a drop-in replacement for what
you're working with now, but will allow you to add the following
configuration in place of whatever exclusion configuration you're using:
===============
<context-param>
<param-name>exclusion.factorytype</param-name>
<param-value>static-map</param-value>
</context-param>
<context-param>
<param-name>resourceindex.exclusionpath</param-name>
<param-value>/tmp/wb-excludes.txt</param-value>
</context-param>
===============
Here's where you can grab the .war:
http://builds.archive.org:8080/maven2/org/archive/wayback/wayback-webapp/0.9.0-SNAPSHOT/wayback-webapp-0.9.0-20070418.010333-23.war
Then the contents of /tmp/wb-excludes.txt might look something like:
==================
www.foo.com/private/
foo.com/private/
www.foo.com/extras/secure/
foo.com/extras/secure/
www.example.com/
example.com/
==================
Updates to the file should be noticed automatically and take effect
within 10 seconds.
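For example, adding another prefix is just a matter of appending a line to the file configured above (the prefix shown is arbitrary):

# Appended entries should be picked up within about 10 seconds.
echo 'www.example.org/private/' >> /tmp/wb-excludes.txt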
Please let me know how this works for you, and if you have other
suggestions for how this would be useful to you.
Brad
Ignacio Garcia wrote:
> Hello,
>
> I have a question regarding the Resource Index Exclusions,
>
> I want to create a manual list of URLs that should not be exposed by
> wayback. As far as I understand by reading the online user manual, I
> have to
> point the option "adminexclusion.dbpath" to the location where my
> exclusion
> list is.
> My question is: what format does the BDB exclusion file has and how can I
> create it.
>
> The command line tools included with wayback let you maintain BDB
> files or
> create CDX files, but nowhere it says anything about creating new BDB
> files
> based on a list of URLs.
> How would I create a exclusion list that will hold the following urls:
>
> http://www.foo.com/private/
> http://www.foo.com/extras/secure/
> http://www.example.com/
>
> In this case I want to hide all URLs from the domain example.com and all
> files URLs under the private and extras/secure directories in the
> foo.comdomain.
> Is that possible? Do I have to specify absolute URLs on the exclusion
> list?
>
> Thank you.
|
|
From: Michael S. <st...@du...> - 2007-04-27 17:21:56
|
See this old FAQ from heritrix: http://crawler.archive.org/faq.html#toomanyopenfiles

St.Ack

alexis artes wrote:
> Hi,
>
> We are encountering a "Too many files are open" error while doing an
> incremental indexing. We followed the procedure outlined in the FAQ
> and below are the commands we used.
> [...]
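The FAQ entry above essentially comes down to raising the per-process open-file limit for the user running the hadoop/nutchwax jobs; the values below are examples only:

ulimit -n          # show the current limit (often a low default such as 1024)
ulimit -n 65536    # raise it for this shell before launching the merge job
# To make the change permanent, add matching soft and hard "nofile" entries
# for that user in /etc/security/limits.conf.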
|
From: Ignacio G. <igc...@gm...> - 2007-04-27 12:48:18
|
Hello,

I have a question regarding the Resource Index Exclusions.

I want to create a manual list of URLs that should not be exposed by wayback. As far as I understand from reading the online user manual, I have to point the option "adminexclusion.dbpath" to the location of my exclusion list. My question is: what format does the BDB exclusion file have, and how can I create it?

The command-line tools included with wayback let you maintain BDB files or create CDX files, but nowhere do they say anything about creating new BDB files based on a list of URLs. How would I create an exclusion list that will hold the following URLs:

http://www.foo.com/private/
http://www.foo.com/extras/secure/
http://www.example.com/

In this case I want to hide all URLs from the domain example.com, and all URLs under the private and extras/secure directories in the foo.com domain. Is that possible? Do I have to specify absolute URLs on the exclusion list?

Thank you.
|
From: alexis a. <alx...@ya...> - 2007-04-27 10:43:43
|
Hi,
We are encountering a "Too many files are open" error while doing an incremental indexing. We followed the procedure outlined in the FAQ and below are the commands we used.
-----------------------------------------
Import
>bin/hadoop jar /opt/nutchwax-0.8.0/nutchwax-0.8.0.jar import inputs outputs test2
Update
>bin/hadoop jar /opt/nutchwax-0.8.0/nutchwax-0.8.0.jar update outputs outputs/segments/20070425125008-test2
Invert
>bin/hadoop jar /opt/nutchwax-0.8.0/nutchwax-0.8.0.jar invert outputs outputs/segments/20070425125008-test2
Dedup
->we did not run the dedup command.
Index
>bin/hadoop jar /opt/nutchwax-0.8.0/nutchwax-0.8.0.jar class org.archive.access.nutch.NutchwaxIndexer outputs/indexes2 outputs/crawldb outputs/linkdb outputs/segments/20070425125008-test2
Merge
>bin/hadoop jar /opt/nutchwax-0.8.0/nutchwax-0.8.0.jar class org.apache.nutch.indexer.IndexMerger outputs/index2 outputs/indexes outputs/indexes2
Our System Configuration:
Scientific Linux CERN
2.4.21-32.0.1.EL.cernsmp
JDK1.5
Hadoop0.5
Nutchwax0.8
We also tried running Nutchwax0.10 on Hadoop0.12.3 and Hadoop0.9.2, but still get the same kind of error as below.
---------------------------------------------
07/04/26 15:49:50 INFO conf.Configuration: parsing file:/opt/hadoop-0.9.2/conf/hadoop-default.xml
07/04/26 15:49:50 INFO conf.Configuration: parsing file:/tmp/hadoop-unjar23572/nutch-default.xml
07/04/26 15:49:50 INFO ipc.Client: org.apache.hadoop.io.ObjectWritableConnection culler maxidletime= 1000ms
07/04/26 15:49:50 INFO ipc.Client: org.apache.hadoop.io.ObjectWritable Connection Culler: starting
07/04/26 15:49:50 INFO indexer.IndexMerger: merging indexes to: outputs/index2
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00000
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00001
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00002
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00003
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00004
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00005
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00006
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00007
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00008
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00009
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00010
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00011
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00012
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00013
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00014
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00015
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00016
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00017
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00018
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00019
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00000
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00001
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00002
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00003
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00004
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00005
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00006
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00007
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00008
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00009
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00010
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00011
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00012
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00013
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00014
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00015
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00016
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00017
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00018
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00019
07/04/26 15:50:02 INFO fs.DFSClient: Could not obtain block from any node: java.io.IOException: No live nodes contain current block
07/04/26 15:50:05 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 1 time(s).
07/04/26 15:50:06 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 2 time(s).
07/04/26 15:50:07 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 3 time(s).
07/04/26 15:50:08 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 4 time(s).
07/04/26 15:50:09 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 5 time(s).
07/04/26 15:50:10 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 6 time(s).
07/04/26 15:50:11 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 7 time(s).
07/04/26 15:50:12 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 8 time(s).
07/04/26 15:50:13 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x4:9000. Already tried 9 time(s).
07/04/26 15:50:14 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 10 time(s).
07/04/26 15:50:15 WARN fs.DFSClient: DFS Read: java.net.SocketException: Too many open files
at java.net.Socket.createImpl(Socket.java:388)
at java.net.Socket.connect(Socket.java:514)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:145)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:525)
at org.apache.hadoop.ipc.Client.call(Client.java:452)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:164)
at org.apache.hadoop.dfs.$Proxy0.open(Unknown Source)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:512)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:732)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:577)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:686)
at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:91)
at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:189)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at java.io.DataInputStream.read(DataInputStream.java:134)
at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:183)
at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:64)
at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:33)
at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:41)
at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:507)
at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:406)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:90)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:681)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:658)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:517)
at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:553)
at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:98)
at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:150)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:113)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.archive.access.nutch.Nutchwax.doClass(Nutchwax.java:284)
at org.archive.access.nutch.Nutchwax.doJob(Nutchwax.java:394)
at org.archive.access.nutch.Nutchwax.main(Nutchwax.java:674)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.util.RunJar.main(RunJar.java:149)
07/04/26 15:50:15 INFO fs.DFSClient: Could not obtain block from any node: java.io.IOException: No live nodes contain current block
07/04/26 15:50:18 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 1 time(s).
07/04/26 15:50:19 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 2 time(s).
07/04/26 15:50:20 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 3 time(s).
07/04/26 15:50:21 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 4 time(s).
07/04/26 15:50:22 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 5 time(s).
07/04/26 15:50:23 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 6 time(s).
07/04/26 15:50:24 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 7 time(s).
07/04/26 15:50:25 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 8 time(s).
07/04/26 15:50:26 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 9 time(s).
07/04/26 15:50:27 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 10 time(s).
07/04/26 15:50:28 WARN fs.DFSClient: DFS Read: java.net.SocketException: Too many open files
at java.net.Socket.createImpl(Socket.java:388)
at java.net.Socket.connect(Socket.java:514)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:145)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:525)
at org.apache.hadoop.ipc.Client.call(Client.java:452)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:164)
at org.apache.hadoop.dfs.$Proxy0.open(Unknown Source)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:512)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:732)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:577)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:686)
at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:91)
at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:189)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at java.io.DataInputStream.read(DataInputStream.java:134)
at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:183)
at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:64)
at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:33)
at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:41)
at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:507)
at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:406)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:90)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:681)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:658)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:517)
at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:553)
at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:98)
at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:150)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:113)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.archive.access.nutch.Nutchwax.doClass(Nutchwax.java:284)
at org.archive.access.nutch.Nutchwax.doJob(Nutchwax.java:394)
at org.archive.access.nutch.Nutchwax.main(Nutchwax.java:674)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.util.RunJar.main(RunJar.java:149)
07/04/26 15:50:28 FATAL indexer.IndexMerger: IndexMerger: java.net.SocketException: Too many open files
at java.net.Socket.createImpl(Socket.java:388)
at java.net.Socket.connect(Socket.java:514)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:145)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:525)
at org.apache.hadoop.ipc.Client.call(Client.java:452)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:164)
at org.apache.hadoop.dfs.$Proxy0.open(Unknown Source)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:512)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:732)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:577)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:686)
at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:91)
at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:189)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at java.io.DataInputStream.read(DataInputStream.java:134)
at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:183)
at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:64)
at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:33)
at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:41)
at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:507)
at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:406)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:90)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:681)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:658)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:517)
at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:553)
at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:98)
at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:150)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:113)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.archive.access.nutch.Nutchwax.doClass(Nutchwax.java:284)
at org.archive.access.nutch.Nutchwax.doJob(Nutchwax.java:394)
at org.archive.access.nutch.Nutchwax.main(Nutchwax.java:674)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.util.RunJar.main(RunJar.java:149)
I hope you can help us solve the issue.
Best Regards,
Alexis
|
From: Brad T. <br...@ar...> - 2007-04-10 18:19:48
|
Hey Jimmy,
I just downloaded and installed wayback on a machine running Cygwin.
Two things I found:
1) need to set the JAVACMD env variable:
export JAVACMD=`which java`
2) I was only able to get things running when passing a relative argument
to the arc directory. There's some path resolution happening that I
haven't tracked down yet.
Also, just to make sure: your Cygwin /tmp should be exported over HTTP
(on port 8081, going by your example) on the node holding the ARC data at
/tmp/, so that a file in the Cygwin folder at /tmp/foo.arc.gz is accessible
at http://archostip:8081/arc/foo.arc.gz.
Let me know if you're still having problems,
Brad
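A hedged note on point 1: the "C:\Program: command not found" error below is bash word-splitting an unquoted Windows path that contains a space (C:\Program Files\...). Setting JAVACMD as above is the starting point; if it still resolves to a path with a space, a space-free form may behave better with the scripts (the JDK path shown is purely illustrative):

export JAVACMD=`which java`
# If this still points somewhere under "C:\Program Files", try a space-free
# equivalent, e.g. the DOS 8.3 form (illustrative path):
# export JAVACMD=/cygdrive/c/PROGRA~1/Java/jdk1.5.0/bin/java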
> Brad,
>
> I tried running index-client today. Ran it from the machine that is
> hosting the arc files. I received an error:
>
> bin/index-client: line 81: C:\Program: command not found
>
> This is the line that I used in cygwin:
>
> bin/index-client \tmp\ http://waybackip:8080/wayback/index-incoming/
> http://waybackip/arc-proxy \apache-tomcat-5.5.23\webapps\arc
> http://archostip:8081/arc
>
> I also tried:
>
> bin/index-client /tmp/ http://waybackip:8080/wayback/index-incoming/
> http://waybackip:8080/arc-proxy /apache-tomcat-5.5.23/webapps/arc
> http://archostip:8081/arc
>
> And I tried:
>
> bin/index-client C:\tmp\ http://waybackip:8080/wayback/index-incoming/
> http://waybackip/arc-proxy C:\apache-tomcat-5.5.23\webapps\arc
> http://archostip:8081/arc
>
> I received the same error each time. Any thought?
>
> Jimmy
>
>
> -----Original Message-----
> From: Brad Tofel [mailto:br...@ar...]
> Sent: Monday, April 09, 2007 5:05 PM
> To: Lin, Jimmy
> Cc: br...@ar...
> Subject: RE: [Archive-access-discuss] getting http resourcestore to work
> in wayback
>
> You're right, you may not need to use the location-client if you used
> the
> second usage of the index-client. The index-client scans through ARC
> files
> and outputs records for each document found in CDX format.
>
> In usage 1, the CDX output is sent to STDOUT, for later (manual) sorting
> and merging to generate an aggregated CDX file from many ARC input
> files.
> The location-client tool is primarily aimed for installations that are
> using this form to generate index files.
>
> In usage 2, the CDX data is sent directly to the ResourceIndex(can only
> be
> done with the BDB ResourceIndex implementation) via HTTP PUT. In this
> second usage, the index-client will also notify the ArcProxy's
> LocationDB
> of where that ARC can be found, which means you don't need to use the
> location-client tool at all.
>
> I haven't tested the codebase on Cygwin for a long time -- please send
> feedback on how it works for you.
>
> Automation of large scale indexing is the next key feature for the
> wayback
> project, so all this should get easier in the near term, but what's
> there
> now is hopefully enough to get smaller scale indexes built, or larger
> scale indexes built with a little shell scripting.
>
> We use these tools at the archive to maintain indexes for 10's of TB of
> ARC data, but we'd be happy to receive other feature suggestions that
> would make things simpler for you.
>
> Brad
>
>> Brad,
>>
>> Thanks. I did do that, however, I never followed through with
>> location-client. Its good to see that I was somewhat on the right
>> track. A couple follow up questions, can you run location-client
>> through cygwin(We are working on windows machines), and do I not need
> to
>> run index-client? The two shell scripts seem to be very similar.
>>
>> Jimmy
>>
>> -----Original Message-----
>> From: Brad Tofel [mailto:br...@ar...]
>> Sent: Monday, April 09, 2007 4:18 PM
>> To: Lin, Jimmy
>> Cc: arc...@li...
>> Subject: Re: [Archive-access-discuss] getting http resourcestore to
> work
>> in wayback
>>
>> I'll take a crack at improving the docs for other users later today,
> but
>> here are a couple quick tips:
>>
>> * the idea is to set up the ArcProxy to reverse proxy all HTTP 1.1
> range
>> requests to the actual storage node that holds the ARC files. If your
>> ArcProxy server is set up on arc-proxy.foo.org:8080/arc-proxy/
> (implies
>> you placed the wayback.war under the webapps dir on arc-proxy.foo.org,
>> with the name arc-proxy.war) then all ARCs can be accessed at:
>>
>> http://arc-proxy.foo.org:8080/arc-proxy/bar.arc.gz
>> http://arc-proxy.foo.org:8080/arc-proxy/baz.arc.gz
>>
>> even if bar.arc.gz and baz.arc.gz are on different nodes. To do this,
>> you
>> need to modify the arc-proxy web.xml, after it's been unpacked,
>> uncommenting the ArcProxy section of the configuration (and commenting
>> out
>> UI, ResourceStore, and ResourceIndex sections) and restart Tomcat.
>>
>> * the last step is to inform the ArcProxy where all the ARC files
> live,
>> so
>> it knows where to forward requests for the various ARCs stored on the
>> ARC
>> storage machines. This can be done with the location-client script.
>>
>> * I'm not sure which symbolic link you're referring to in the user
>> manual,
>> which version of the software are you using?
>>
>> Let me know if there's still missing info, and thanks for using the
>> tools!
>>
>> Brad
>>
>>
>>> Hello,
>>>
>>>
>>>
>>> I need some guidance in getting this up and running. The user manual
>>> states the following steps:
>>>
>>>
>>>
>>> 1. Set up a singleton ArcProxy webapp. This webapp maintains a BDB
>>> that maps ARC filenames to their actual absolute URL, and creates an
>>> indirection, so all ARC files are accessible within a single HTTP
>>> exported directory.
>>>
>>> How do I go about doing this? Do I rename the wayback.war file to
>>> arc-proxy and install it in tomcat?
>>>
>>> 2. Export your ARC files via HTTP 1.1, on all hosts that hold them,
>>> to the node running the ArcProxy webapp. Some examples of HTTP 1.1
>>> webservers you can use to export your ARC files are Apache, Tomcat,
>> and
>>> thttpd. Any other webserver that supports HTTP 1.1 will also work.
>>>
>>> This requires that I have one of the webservers listed above
> installed
>>> on each machine that holds arc files right? I have installed tomcat
>> on
>>> such a machine, how do I go about creating the "symbolic link" that
>> the
>>> User manual refers to.
>>>
>>> 3. Populate the ArcProxy BDB with the locations of all ARC files in
>>> your repository. See instructions for the using location-client
>>> command-line tool, within this document, to populate the ArcProxy
> BDB.
>>>
>>> Thanks in advance,
>>>
>>> Jimmy Lin
>>>
>>>
>>>
>>>
>>
>>
>
|
|
From: Brad T. <br...@ar...> - 2007-04-09 20:18:02
|
I'll take a crack at improving the docs for other users later today, but here are a couple of quick tips:

* The idea is to set up the ArcProxy to reverse proxy all HTTP 1.1 range requests to the actual storage node that holds the ARC files. If your ArcProxy server is set up on arc-proxy.foo.org:8080/arc-proxy/ (which implies you placed the wayback.war under the webapps dir on arc-proxy.foo.org, with the name arc-proxy.war), then all ARCs can be accessed at:

http://arc-proxy.foo.org:8080/arc-proxy/bar.arc.gz
http://arc-proxy.foo.org:8080/arc-proxy/baz.arc.gz

even if bar.arc.gz and baz.arc.gz are on different nodes. To do this, you need to modify the arc-proxy web.xml after it has been unpacked, uncommenting the ArcProxy section of the configuration (and commenting out the UI, ResourceStore, and ResourceIndex sections), and restart Tomcat.

* The last step is to inform the ArcProxy where all the ARC files live, so it knows where to forward requests for the various ARCs stored on the ARC storage machines. This can be done with the location-client script.

* I'm not sure which symbolic link you're referring to in the user manual; which version of the software are you using?

Let me know if there's still missing info, and thanks for using the tools!

Brad

> Hello,
>
> I need some guidance in getting this up and running. The user manual
> states the following steps:
>
> 1. Set up a singleton ArcProxy webapp. This webapp maintains a BDB
> that maps ARC filenames to their actual absolute URL, and creates an
> indirection, so all ARC files are accessible within a single HTTP
> exported directory.
>
> How do I go about doing this? Do I rename the wayback.war file to
> arc-proxy and install it in tomcat?
>
> 2. Export your ARC files via HTTP 1.1, on all hosts that hold them,
> to the node running the ArcProxy webapp. Some examples of HTTP 1.1
> webservers you can use to export your ARC files are Apache, Tomcat, and
> thttpd. Any other webserver that supports HTTP 1.1 will also work.
>
> This requires that I have one of the webservers listed above installed
> on each machine that holds arc files right? I have installed tomcat on
> such a machine, how do I go about creating the "symbolic link" that the
> User manual refers to.
>
> 3. Populate the ArcProxy BDB with the locations of all ARC files in
> your repository. See instructions for the using location-client
> command-line tool, within this document, to populate the ArcProxy BDB.
>
> Thanks in advance,
>
> Jimmy Lin
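A quick way to sanity-check the reverse-proxy setup described above, assuming curl is available: request the first 100 bytes of an ARC through the proxy; a working setup should answer 206 Partial Content.

# The host and ARC name are the examples used above.
curl -s -r 0-99 -o /dev/null -w '%{http_code}\n' \
    http://arc-proxy.foo.org:8080/arc-proxy/bar.arc.gz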
|
From: Lin, J. <lin...@ba...> - 2007-04-09 15:31:31
|
Hello,

I need some guidance in getting this up and running. The user manual states the following steps:

1. Set up a singleton ArcProxy webapp. This webapp maintains a BDB that maps ARC filenames to their actual absolute URL, and creates an indirection, so all ARC files are accessible within a single HTTP exported directory.

How do I go about doing this? Do I rename the wayback.war file to arc-proxy and install it in Tomcat?

2. Export your ARC files via HTTP 1.1, on all hosts that hold them, to the node running the ArcProxy webapp. Some examples of HTTP 1.1 webservers you can use to export your ARC files are Apache, Tomcat, and thttpd. Any other webserver that supports HTTP 1.1 will also work.

This requires that I have one of the webservers listed above installed on each machine that holds ARC files, right? I have installed Tomcat on such a machine; how do I go about creating the "symbolic link" that the user manual refers to?

3. Populate the ArcProxy BDB with the locations of all ARC files in your repository. See the instructions for using the location-client command-line tool, within this document, to populate the ArcProxy BDB.

Thanks in advance,

Jimmy Lin
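On step 2 above, one concrete way (out of the servers named there) to export an ARC directory over HTTP 1.1 with byte-range support is thttpd; the directory and port below are assumptions, not taken from the manual:

# Serve the ARC directory /data/arcs on port 8081.
thttpd -p 8081 -d /data/arcs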
|
From: <sch...@ci...> - 2007-03-30 06:22:11
|
>> Has anyone been able to get the open source version of wayback to work
>> on windows? I am attempting to do this.

I'm running it with Tomcat 5.5 on both WinXP Pro and Win2003. It works
without problems.

>> Also, when using a Local Resource Store, can the arcpath folder be on
>> a mapped drive from another machine?

I haven't done that, but I don't think this should be a problem.

Max
|
|
From: Gordon M. <go...@ar...> - 2007-03-29 17:50:43
|
jimmylinsemail wrote:
> Has anyone been able to get the open source version of wayback to work
> on windows? I am attempting to do this.
>
> Also, when using a Local Resource Store, can the arcpath folder be on
> a mapped drive from another machine?
>
> Thanks in advance.

It's best to ask this and other Wayback questions on the archive-access
discussion list, <arc...@li...>. I've CC'd it there.

- Gordon @ IA
|
|
From: Michael S. <st...@du...> - 2007-03-23 05:21:10
|
I just came across this comment of Sverre's in issue "[ 1535953 ] WERA
development still continuing?":
http://sourceforge.net/tracker/index.php?func=detail&aid=1535953&group_id=118427&atid=681140.
In case others have not seen it, I am posting it below to the list.

St.Ack

Date: 2006-09-29 00:44

Sorry for the late answer. I'll try to give my view on what will happen
to Wera.

I've been involved in Wera development since the start in 2000.
Currently I'm the only developer working on Wera from time to time. I'm
an employee of the National Library of Norway (NB), which uses Wera for
access to their Web Archive, so naturally the effort put into Wera is
driven by the needs of NB.

During the latest months the focus has been elsewhere in NB, and only a
fraction of the Web Archive is available through Wera. The activities on
indexing and access will be taken up again early next year. When that
happens we will maintain Wera to fit the needs of NB. However, we will
also be monitoring the development of the Open Wayback. If/when the Open
Wayback is able to fulfill our needs as an access tool, we will have to
decide whether to switch to the Open Wayback or stay with Wera. We may
also find that some parts of Wera can be replaced by parts of the Open
Wayback. E.g. the Document Retriever of Wera is a weak link in Wera
since it only handles the files within one directory (and not
subdirectories).

As long as someone sees the need for, and is willing to take on, the
further development of Wera, it will live. Every time I stumble into
projects or people that use Wera for access I get a bit bewildered. Why
is it that Wera gets no new developer volunteers, and few feature
requests / bug reports are submitted to SourceForge, when there are so
many users? Must be a pretty good system, huh? Or is it that these users
do not believe Wera will survive for very long and they don't want to
invest any time or effort in a dying system? Has it something to do with
Wera not being pure Java? I really don't know. All I know is that I
would like you (the users of Wera) to share your opinion on these
matters.

Thanks,
Sverre
|
|
From: Graeme S. <li...@gr...> - 2007-03-21 20:37:56
|
Hi, Has anyone looked at using Nutchwax with the output from the HDFS Writer Processor (or am I missing an obvious point and nothing needs to be done - haven't had a chance to play yet)? Regards, Graeme |
|
From: Natalia T. <nt...@ce...> - 2007-03-12 12:41:29
|
Hello,

I'm using WERA and it works OK, but I have some questions about the
ARCRetriever: can WERA work with more than one arcdir (where you write
the path to the directory of ARC files)? Must all ARC files be directly
in the arcdir, or can I create subfolders in the arcdir and spread the
ARCs over these subfolders?

Thanks,
Natalia
|
|
From: Eran C. <era...@gm...> - 2007-03-07 03:09:06
|
Hi,

I am a graduate student at Indiana University, expecting to participate
in Google SoC. I browsed the project ideas page [1] and found the PDF
processing project interesting. I understand that the project page is
for 2006, so I'd like to know whether that need still exists so I can
work on it for Google SoC. I took a course on web mining and have very
good knowledge of that area. In addition, I'm an active contributor to
the Apache web services project, but I now want to work with another
open source organization. If the PDF processing project is still wanted,
I'd like to work on it. If there are more ideas, please let me know so I
can consider them as well.

Thanks,
Eran Chinthaka

[1] : http://webteam.archive.org/confluence/display/SOC06/Ideas
|
|
From: Ignacio G. <igc...@gm...> - 2007-02-16 17:58:12
|
Hello,

I just found an error when a page loads images through a cascading
style sheet (CSS). The wm.js script changes the links for all sources on
the webpage that is being displayed. However, the images that are linked
through the CSS are not modified and consequently are not displayed.

I think the wm.js script should be modified, or an extra script should
be added, to rewrite that content in the CSS as well. The modification
should not be too difficult, since CSS uses the url(xxxx) notation to
specify content that has to be linked. Changing the xxxx to the
appropriate wayback path should correct this problem.

Thank you.
|
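As a rough illustration of the fix Ignacio suggests, here is a small Java sketch that rewrites url(...) references inside a stylesheet so they point back into the archive. It is not the actual wm.js or Wayback rewriting code; the regex and the replayPrefix parameter are assumptions made for the example, and a real fix would also have to handle relative paths and @import rules.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * Illustrative sketch only: rewrite url(...) references inside CSS so
     * that images load through the archive instead of the live web.
     */
    public class CssUrlRewriteSketch {

        // Matches url(foo.gif), url('foo.gif') and url("foo.gif").
        private static final Pattern CSS_URL =
            Pattern.compile("url\\(\\s*['\"]?([^'\")]+)['\"]?\\s*\\)");

        /**
         * @param css          the raw stylesheet text
         * @param replayPrefix hypothetical replay prefix, e.g.
         *                     "/wayback/20070216000000/http://example.com/"
         */
        public static String rewrite(String css, String replayPrefix) {
            Matcher m = CSS_URL.matcher(css);
            StringBuffer out = new StringBuffer();
            while (m.find()) {
                String original = m.group(1);
                // Leave data: URIs alone; prefix everything else.
                String rewritten = original.startsWith("data:")
                    ? original
                    : replayPrefix + original;
                m.appendReplacement(out,
                    Matcher.quoteReplacement("url(" + rewritten + ")"));
            }
            m.appendTail(out);
            return out.toString();
        }
    }

Whether this kind of rewrite belongs on the server side or in wm.js in the browser is a design choice for the Wayback developers; the sketch only shows that the text substitution itself is straightforward.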
|
From: Michael S. <st...@ar...> - 2007-02-14 18:43:02
|
Wayback, like all of the other projects in archive-access, is still
stuck on maven 1.0.2, a very different animal from maven 2.0. You can
find it on the maven site if you are persistent. We're trying to migrate
from the old to the new maven, but it likely won't happen any time soon.

St.Ack

Ignacio Garcia wrote:
> Hello, I was trying to install wayback on my system using the source
> distribution so I can make changes if needed.
>
> I've had no problems installing the binary distribution, but when I
> try to build the source using maven, I get the following error:
>
> [INFO] Scanning for projects...
> [INFO] ----------------------------------------------------------------------------
> [INFO] Building Maven Default Project
> [INFO]    task-segment: [install]
> [INFO] ----------------------------------------------------------------------------
> [INFO] ------------------------------------------------------------------------
> [ERROR] BUILD ERROR
> [INFO] ------------------------------------------------------------------------
> [INFO] Cannot execute mojo: resources. It requires a project with an
> existing pom.xml, but the build is not using one.
> [INFO] ------------------------------------------------------------------------
>
> Apparently Maven is asking for a pom.xml file in order to execute the
> mojo:resources... I have never used Maven, so I don't know what the
> problem may be, and when I looked for information on the project
> website I could not find anything regarding building/installing from
> the source distribution.
>
> The maven command used was: mvn install
> Using mvn clean install did not work either.
>
> Does anyone know what the problem is?
>
> Thank you.
|
|
From: Brad T. <br...@ar...> - 2007-02-14 18:42:34
|
wayback currently builds using maven 1.0.2:

http://archive-access.sourceforge.net/projects/wayback/requirements.html

Which version are you using?

Brad

> Hello, I was trying to install wayback on my system using the source
> distribution so I can make changes if needed.
>
> I've had no problems installing the binary distribution, but when I
> try to build the source using maven, I get the following error:
>
> [INFO] Scanning for projects...
> [INFO] ----------------------------------------------------------------------------
> [INFO] Building Maven Default Project
> [INFO]    task-segment: [install]
> [INFO] ----------------------------------------------------------------------------
> [INFO] ------------------------------------------------------------------------
> [ERROR] BUILD ERROR
> [INFO] ------------------------------------------------------------------------
> [INFO] Cannot execute mojo: resources. It requires a project with an
> existing pom.xml, but the build is not using one.
> [INFO] ------------------------------------------------------------------------
>
> Apparently Maven is asking for a pom.xml file in order to execute the
> mojo:resources... I have never used Maven, so I don't know what the
> problem may be, and when I looked for information on the project
> website I could not find anything regarding building/installing from
> the source distribution.
>
> The maven command used was: mvn install
> Using mvn clean install did not work either.
>
> Does anyone know what the problem is?
>
> Thank you.
|
|
From: Ignacio G. <igc...@gm...> - 2007-02-14 18:02:04
|
Hello, I was trying to install wayback on my system using the source
distribution so I can make changes if needed.

I've had no problems installing the binary distribution, but when I try
to build the source using maven, I get the following error:

[INFO] Scanning for projects...
[INFO] ----------------------------------------------------------------------------
[INFO] Building Maven Default Project
[INFO]    task-segment: [install]
[INFO] ----------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] Cannot execute mojo: resources. It requires a project with an
existing pom.xml, but the build is not using one.
[INFO] ------------------------------------------------------------------------

Apparently Maven is asking for a pom.xml file in order to execute the
mojo:resources... I have never used Maven, so I don't know what the
problem may be, and when I looked for information on the project website
I could not find anything regarding building/installing from the source
distribution.

The maven command used was: mvn install
Using mvn clean install did not work either.

Does anyone know what the problem is?

Thank you.
|
|
From: Bing Z. <bz...@sd...> - 2007-01-25 01:51:03
|
Hi Brad,

Here are 3 problems I found in the new Wayback 0.8.0.

1. Error when indexing an arc file.
===================================

After I installed the new Wayback 0.8.0 and placed an arc file for
testing, I received the following error message in Tomcat's log file.
This error message is repeatable when installing a new wayback instance.

org.apache.commons.httpclient.URIException: invalid port number
        at org.apache.commons.httpclient.URI.parseAuthority(URI.java:2226)
        at org.archive.net.LaxURI.parseAuthority(LaxURI.java:183)
        at org.archive.net.LaxURI.parseUriReference(LaxURI.java:348)
        at org.apache.commons.httpclient.URI.<init>(URI.java:145)
        at org.archive.net.LaxURI.<init>(LaxURI.java:73)
        at org.archive.net.UURI.<init>(UURI.java:124)
        at org.archive.net.UURIFactory.create(UURIFactory.java:320)
        at org.archive.net.UURIFactory.create(UURIFactory.java:310)
        at org.archive.net.UURIFactory.getInstance(UURIFactory.java:263)
        at org.archive.wayback.resourceindex.cdx.CDXLineToSearchResultAdapter.adapt(CDXLineToSearchResultAdapter.java:66)
        at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
        at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
        at org.archive.wayback.bdb.BDBRecordSet.insertRecords(BDBRecordSet.java:177)
        at org.archive.wayback.resourceindex.bdb.BDBIndexUpdater.mergeFile(BDBIndexUpdater.java:152)
        at org.archive.wayback.resourceindex.bdb.BDBIndexUpdater.mergeAll(BDBIndexUpdater.java:219)
        at org.archive.wayback.resourceindex.bdb.BDBIndexUpdater$BDBIndexUpdaterThread.run(BDBIndexUpdater.java:260)

2. Low page replay quality compared with previous release.
===========================================================

Although I had the above error, I was able to use IE to query a URL and
got a link back from the query result. After I clicked the link to
replay the archived page, a lot of images were missing. The page replay
quality of Wayback 0.8.0 is not as good as that of Wayback 0.6.0 in this
test case.

3. Backward incompatibility for db and files generated by Wayback 0.6.0.
=========================================================================

When I copied the following files from Wayback 0.6.0 to Wayback 0.8.0,
Wayback 0.8.0 displayed an error in IE when querying for an archived web
site (a valid website).

        db files from /0.6.0/index/ to /0.0.8/index
        files from /0.6.0/pipeline/queued to /0.8.0/arc-indexer/queued
        files from /0.6.0/pipeline/merged to /0.8.0//index-data/merged

Here is the error message in the IE browser.

java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        java.lang.String.substring(String.java:1768)
        org.archive.wayback.resourceindex.bdb.BDBRecordToSearchResultAdapter.adapt(BDBRecordToSearchResultAdapter.java:63)
        org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
        org.archive.wayback.resourceindex.LocalResourceIndex.filterRecords(LocalResourceIndex.java:120)
        org.archive.wayback.resourceindex.LocalResourceIndex.query(LocalResourceIndex.java:296)
        org.archive.wayback.query.QueryServlet.doGet(QueryServlet.java:95)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        org.archive.wayback.core.RequestFilter.doFilter(RequestFilter.java:117)
        org.archive.wayback.core.RequestFilter.doFilter(RequestFilter.java:117)

Note: when I was just using Wayback 0.6.0, I copied the above mentioned
files into new places and ran another Wayback 0.6.0 instance on top of
the moved db and other files. That Wayback 0.6.0 worked fine.

Here is some info on the system I used for the above test.
        Java: 1.5.0_09
        Tomcat: 5.5.17
        OS: Linux 2.4.20-28.7

Sincerely,
Bing Zhu
San Diego Supercomputer Center
email: bz...@sd...
|
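For anyone chasing problem 1 above: the "invalid port number" comes out of the URI parsing that the indexer applies to every URL field in the CDX data, so a single record with a malformed port can surface this exception during a merge. Below is a small, hypothetical Java sketch of the kind of input that can trigger it; the URL is made up, and the exact exception message may differ from the one in the log.

    import org.apache.commons.httpclient.URIException;
    import org.archive.net.UURI;
    import org.archive.net.UURIFactory;

    /**
     * Hypothetical example: a URL with a malformed port, as it might appear
     * in a CDX line, typically fails UURI parsing with a URIException like
     * the one seen in the Tomcat log above.
     */
    public class BadPortSketch {
        public static void main(String[] args) {
            String badUrl = "http://example.com:80x/page.html"; // made-up URL
            try {
                UURI uuri = UURIFactory.getInstance(badUrl);
                System.out.println("parsed: " + uuri);
            } catch (URIException e) {
                // The indexer hits this kind of exception while adapting
                // CDX lines into search results.
                System.out.println("rejected: " + e.getMessage());
            }
        }
    }

Scanning the CDX lines for odd URL fields is one way to narrow down which record the indexer is choking on.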
|
From: Michael S. <st...@ar...> - 2007-01-22 20:10:43
|
The archive-access project source -- which includes the open source
wayback, nutchwax, etc. -- has been migrated from CVS to a subversion
repository up on sourceforge. See
http://sourceforge.net/svn/?group_id=147570 for instructions on how to
access the new location. The CVS repository remains available, so it's
possible to run "cvs diff" on existing working copies and move changes
to subversion, or to check out older versions of the software from
before the migration.

St.Ack
|
|
From: Michael S. <st...@ar...> - 2007-01-22 16:50:59
|
Hey Natalia:

You might try the open source wayback. In some regards it does a better
job of rendering than WERA. See
http://archive-access.sourceforge.net/projects/nutch/wayback.html.

St.Ack

Natalia Torres wrote:
> Hello,
> I'm using WERA + NutchWAX to view data crawled with Heritrix.
>
> I have problems viewing indexed data on some pages. I have noticed that
> it depends on whether the web page uses absolute or relative paths.
>
> Search results seem to be OK and I can surf over the crawled data. If I
> use Internet Explorer or Mozilla Firefox, some result pages are
> different.
>
> I see these problems:
>
> 1) For pages using absolute paths (http://www.mypage.com/menu.html),
> WERA shows the current live site and not the data stored in the ARC
> files.
>
> 2) Pages using relative paths may or may not display correctly,
> depending on how the links and image src attributes are written:
> menu.html or /menu.html.
>
> 3) On some pages, images disappear after a while, or when I close the
> message "WERA - External links, forms and search boxes may not function
> within this collection ...".
>
> Is there any reason for this? Can I solve it?
>
> Thanks,
>
> Natalia
|
|
From: Natalia T. <nt...@ce...> - 2007-01-22 12:04:46
|
Hello,

I'm using WERA + NutchWAX to view data crawled with Heritrix.

I have problems viewing indexed data on some pages. I have noticed that
it depends on whether the web page uses absolute or relative paths.

Search results seem to be OK and I can surf over the crawled data. If I
use Internet Explorer or Mozilla Firefox, some result pages are
different.

I see these problems:

1) For pages using absolute paths (http://www.mypage.com/menu.html),
WERA shows the current live site and not the data stored in the ARC
files.

2) Pages using relative paths may or may not display correctly,
depending on how the links and image src attributes are written:
menu.html or /menu.html.

3) On some pages, images disappear after a while, or when I close the
message "WERA - External links, forms and search boxes may not function
within this collection ...".

Is there any reason for this? Can I solve it?

Thanks,

Natalia
|
|
From: Michael S. <st...@ar...> - 2007-01-18 05:19:49
|
Release 0.10.0 of NutchWAX is now available for download from sourceforge (See http://sourceforge.net/project/showfiles.php?group_id=118427). See release notes, http://archive-access.sourceforge.net/projects/nutch/articles/releasenotes.html#0_10_0, for the list of changes. Yours, The Internet Archive Webteam |