| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2005 | | | | | | | 1 | 4 | 5 | 17 | 30 | 3 |
| 2006 | 4 | 14 | 8 | 11 | 2 | 13 | 9 | 2 | 2 | 9 | 20 | 9 |
| 2007 | 6 | 4 | 6 | 7 | 6 | 6 | 4 | 3 | 9 | 26 | 23 | 2 |
| 2008 | 17 | 19 | 16 | 27 | 3 | 21 | 21 | 8 | 13 | 7 | 8 | 8 |
| 2009 | 18 | 14 | 27 | 14 | 10 | 14 | 18 | 30 | 18 | 12 | 5 | 26 |
| 2010 | 27 | 3 | 8 | 4 | 6 | 13 | 25 | 11 | 2 | 4 | 7 | 6 |
| 2011 | 25 | 17 | 25 | 23 | 15 | 12 | 8 | 13 | 4 | 17 | 7 | 6 |
| 2012 | 4 | 7 | 1 | 10 | 11 | 5 | 7 | 1 | 1 | 5 | 6 | 13 |
| 2013 | 9 | 7 | 3 | 1 | 3 | 19 | 3 | 3 | | 1 | 1 | 1 |
| 2014 | 11 | 1 | | 2 | 6 | | | 1 | | 1 | 1 | 1 |
| 2015 | | | | | | 1 | 4 | | | | | 1 |
| 2016 | 4 | 3 | | | | | 1 | | | 1 | | |
| 2018 | | | | 1 | 1 | | 2 | | 1 | | 1 | |
| 2019 | 2 | 1 | | | | 2 | | | 1 | 1 | | |

From: Brad T. <br...@ar...> - 2007-05-16 01:07:21
|
There are a couple of mechanisms, depending on which replay UI and which ResourceIndex you're using.

If you're using Archival URL replay mode, you can use a wildcard '*' for the datespec, plus a trailing '*', to list documents in the index prefixed with the given URL. Example:

http://wayback.yourhost.org:8080/wayback/*/example.com/*

Alternatively, you can change the query URL 'type' argument from 'urlquery' to 'urlprefixquery'. This is clearly a shortcoming of the current search form, and will hopefully be addressed in the next release.

If you want to do further processing on URLs in the index, there are two tools packaged with the wayback called bdb-client and bin-search. They are command-line tools for dumping URLs with a given prefix from either a BDB index or a sorted CDX index. Hopefully the online documentation for these tools is enough to get you started with them, but let me know if it falls short.

Brad

Ignacio Garcia wrote:
> is there any way to list all contents in my archived files using wayback?
>
> I have three different crawls, but one of them is not complete, so I don't
> know exactly what files I actually have archived and I would like to know
> what files is wayback serving.
>
> I don't know if this is possible, since the url field doesn't seem to be
> taking wildcards.
>
> Thank you.
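Putting the two suggestions above into concrete requests, as a hedged sketch: only the '*' wildcards and the urlquery/urlprefixquery 'type' values come from the message; the host/port, query path, and 'url' parameter name are assumptions, so adjust them against whatever query URLs your search form actually submits.

# Archival URL form: wildcard datespec plus trailing '*' lists every capture
# whose URL starts with the given prefix.
curl 'http://wayback.yourhost.org:8080/wayback/*/example.com/*'

# Query form: take the URL the search form submits and switch the 'type'
# argument from urlquery to urlprefixquery (the /query path and 'url'
# parameter shown here are assumed for illustration).
curl 'http://wayback.yourhost.org:8080/wayback/query?type=urlprefixquery&url=example.com/'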
|
From: Ignacio G. <igc...@gm...> - 2007-05-14 13:13:17
|
Hello,

I have the following issue with wayback. I am trying to add several ARCs to wayback, but for some reason some of them make the ArcIndexer procedure crash and I get the error:

'Exception in thread "AutoARCIndexThread" java.lang.OutOfMemoryError: Java heap space'

The collection I was trying to add consisted of 11 ARC files with a total size of ~4.5 GB. I also tried adding them one by one, and found out that only some of the ARC files caused wayback to fail.

My first thought was that maybe the ARCs were corrupted and wayback was not able to index them because of that, but if I use the 'index-client' application or Heritrix's 'arcreader' script, the contents of the ARCs in question are listed properly and no error messages appear. I have tried with version 0.8 and also with two 0.9 continuous releases, and I've had the same results in all cases...

Does anyone know what the issue may be, or if there is a reliable way to test whether an ARC file is corrupted and will crash the AutoARCIndexThread?

Thank you.
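One thing worth ruling out first is simply the heap available to the indexing JVM. A minimal sketch, assuming the AutoARCIndexThread runs inside Tomcat; the 1 GB figure is only an example, so size it to your ARC files:

# Give Tomcat's JVM a larger heap, then restart it.
export CATALINA_OPTS="-Xmx1024m"
$CATALINA_HOME/bin/shutdown.sh
$CATALINA_HOME/bin/startup.sh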
|
From: Ignacio G. <igc...@gm...> - 2007-05-02 13:52:35
|
Is there any way to list all contents in my archived files using wayback?

I have three different crawls, but one of them is not complete, so I don't know exactly what files I actually have archived, and I would like to know what files wayback is serving.

I don't know if this is possible, since the url field doesn't seem to be taking wildcards.

Thank you.
|
From: Brad T. <br...@ar...> - 2007-04-27 21:16:59
|
Hi Ignacio,
The Wayback documentation has fallen a bit behind new features. This
should be rectified after a check-in that will significantly change the
configuration system (Wayback will no longer be configured via web.xml),
and also will switch the build system to maven 2 and continuum. We're
hoping to have all this complete sometime in May.
The exclusion system that is (barely) documented, and to which you are
referring, has significant performance issues: Every record retrieved
from the ResourceIndex requires an HTTP request to an external
"exclusion service". We are not recommending use of this exclusion
system until these performance issues have been addressed.
Recent versions of the Wayback software include an alternate
"static-map" exclusion system, which monitors the contents of a text
file, and excludes URLs and URL prefixes placed in the file.
Until we switch standard distributions to maven 2 and continuum, you can
grab a "preview" .war which includes this "static-map" exclusion
component, but otherwise is compatible with the older web.xml
configuration system. This .war should be a drop-in replacement for what
you're working with now, but will allow you to add the following
configuration in place of whatever exclusion configuration you're using:
===============
<context-param>
<param-name>exclusion.factorytype</param-name>
<param-value>static-map</param-value>
</context-param>
<context-param>
<param-name>resourceindex.exclusionpath</param-name>
<param-value>/tmp/wb-excludes.txt</param-value>
</context-param>
===============
Here's where you can grab the .war:
http://builds.archive.org:8080/maven2/org/archive/wayback/wayback-webapp/0.9.0-SNAPSHOT/wayback-webapp-0.9.0-20070418.010333-23.war
Then the contents of /tmp/wb-excludes.txt might look something like:
==================
www.foo.com/private/
foo.com/private/
www.foo.com/extras/secure/
foo.com/extras/secure/
www.example.com/
example.com/
==================
Updates to the file should be noticed automatically and take effect
within 10 seconds.
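For example, adding another prefix is just a matter of appending a line to the file configured above (the prefix shown is arbitrary):

# Appended entries should be picked up within about 10 seconds.
echo 'www.example.org/private/' >> /tmp/wb-excludes.txt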
Please let me know how this works for you, and if you have other
suggestions for how this would be useful to you.
Brad
Ignacio Garcia wrote:
> Hello,
>
> I have a question regarding the Resource Index Exclusions,
>
> I want to create a manual list of URLs that should not be exposed by
> wayback. As far as I understand by reading the online user manual, I
> have to
> point the option "adminexclusion.dbpath" to the location where my
> exclusion
> list is.
> My question is: what format does the BDB exclusion file has and how can I
> create it.
>
> The command line tools included with wayback let you maintain BDB
> files or
> create CDX files, but nowhere it says anything about creating new BDB
> files
> based on a list of URLs.
> How would I create a exclusion list that will hold the following urls:
>
> http://www.foo.com/private/
> http://www.foo.com/extras/secure/
> http://www.example.com/
>
> In this case I want to hide all URLs from the domain example.com and all
> files URLs under the private and extras/secure directories in the
> foo.comdomain.
> Is that possible? Do I have to specify absolute URLs on the exclusion
> list?
>
> Thank you.
|
|
From: Michael S. <st...@du...> - 2007-04-27 17:21:56
|
See this old FAQ from heritrix: http://crawler.archive.org/faq.html#toomanyopenfiles

St.Ack

alexis artes wrote:
> Hi,
>
> We are encountering a "Too many files are open" error while doing an
> incremental indexing. We followed the procedure outlined in the FAQ
> and below are the commands we used.
> [...]
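The FAQ entry above essentially comes down to raising the per-process open-file limit for the user running the hadoop/nutchwax jobs; the values below are examples only:

ulimit -n          # show the current limit (often a low default such as 1024)
ulimit -n 65536    # raise it for this shell before launching the merge job
# To make the change permanent, add matching soft and hard "nofile" entries
# for that user in /etc/security/limits.conf.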
|
From: Ignacio G. <igc...@gm...> - 2007-04-27 12:48:18
|
Hello,

I have a question regarding the Resource Index Exclusions.

I want to create a manual list of URLs that should not be exposed by wayback. As far as I understand from reading the online user manual, I have to point the option "adminexclusion.dbpath" to the location of my exclusion list. My question is: what format does the BDB exclusion file have, and how can I create it?

The command-line tools included with wayback let you maintain BDB files or create CDX files, but nowhere do they say anything about creating new BDB files based on a list of URLs. How would I create an exclusion list that will hold the following URLs:

http://www.foo.com/private/
http://www.foo.com/extras/secure/
http://www.example.com/

In this case I want to hide all URLs from the domain example.com, and all URLs under the private and extras/secure directories in the foo.com domain. Is that possible? Do I have to specify absolute URLs on the exclusion list?

Thank you.
|
From: alexis a. <alx...@ya...> - 2007-04-27 10:43:43
|
Hi,
We are encountering a "Too many files are open" error while doing an incremental indexing. We followed the procedure outlined in the FAQ and below are the commands we used.
-----------------------------------------
Import
>bin/hadoop jar /opt/nutchwax-0.8.0/nutchwax-0.8.0.jar import inputs outputs test2
Update
>bin/hadoop jar /opt/nutchwax-0.8.0/nutchwax-0.8.0.jar update outputs outputs/segments/20070425125008-test2
Invert
>bin/hadoop jar /opt/nutchwax-0.8.0/nutchwax-0.8.0.jar invert outputs outputs/segments/20070425125008-test2
Dedup
->we did not run the dedup command.
Index
>bin/hadoop jar /opt/nutchwax-0.8.0/nutchwax-0.8.0.jar class org.archive.access.nutch.NutchwaxIndexer outputs/indexes2 outputs/crawldb outputs/linkdb outputs/segments/20070425125008-test2
Merge
>bin/hadoop jar /opt/nutchwax-0.8.0/nutchwax-0.8.0.jar class org.apache.nutch.indexer.IndexMerger outputs/index2 outputs/indexes outputs/indexes2
Our System Configuration:
Scientific Linux CERN
2.4.21-32.0.1.EL.cernsmp
JDK1.5
Hadoop0.5
Nutchwax0.8
We also tried running Nutchwax0.10 on Hadoop0.12.3 and Hadoop0.9.2, but still get the same kind of error as below.
---------------------------------------------
07/04/26 15:49:50 INFO conf.Configuration: parsing file:/opt/hadoop-0.9.2/conf/hadoop-default.xml
07/04/26 15:49:50 INFO conf.Configuration: parsing file:/tmp/hadoop-unjar23572/nutch-default.xml
07/04/26 15:49:50 INFO ipc.Client: org.apache.hadoop.io.ObjectWritableConnection culler maxidletime= 1000ms
07/04/26 15:49:50 INFO ipc.Client: org.apache.hadoop.io.ObjectWritable Connection Culler: starting
07/04/26 15:49:50 INFO indexer.IndexMerger: merging indexes to: outputs/index2
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00000
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00001
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00002
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00003
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00004
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00005
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00006
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00007
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00008
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00009
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00010
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00011
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00012
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00013
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00014
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00015
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00016
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00017
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00018
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes/part-00019
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00000
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00001
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00002
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00003
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00004
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00005
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00006
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00007
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00008
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00009
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00010
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00011
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00012
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00013
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00014
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00015
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00016
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00017
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00018
07/04/26 15:49:50 INFO indexer.IndexMerger: Adding /user/root/outputs/indexes2/part-00019
07/04/26 15:50:02 INFO fs.DFSClient: Could not obtain block from any node: java.io.IOException: No live nodes contain current block
07/04/26 15:50:05 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 1 time(s).
07/04/26 15:50:06 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 2 time(s).
07/04/26 15:50:07 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 3 time(s).
07/04/26 15:50:08 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 4 time(s).
07/04/26 15:50:09 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 5 time(s).
07/04/26 15:50:10 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 6 time(s).
07/04/26 15:50:11 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 7 time(s).
07/04/26 15:50:12 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 8 time(s).
07/04/26 15:50:13 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x4:9000. Already tried 9 time(s).
07/04/26 15:50:14 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 10 time(s).
07/04/26 15:50:15 WARN fs.DFSClient: DFS Read: java.net.SocketException: Too many open files
at java.net.Socket.createImpl(Socket.java:388)
at java.net.Socket.connect(Socket.java:514)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:145)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:525)
at org.apache.hadoop.ipc.Client.call(Client.java:452)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:164)
at org.apache.hadoop.dfs.$Proxy0.open(Unknown Source)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:512)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:732)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:577)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:686)
at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:91)
at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:189)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at java.io.DataInputStream.read(DataInputStream.java:134)
at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:183)
at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:64)
at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:33)
at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:41)
at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:507)
at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:406)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:90)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:681)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:658)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:517)
at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:553)
at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:98)
at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:150)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:113)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.archive.access.nutch.Nutchwax.doClass(Nutchwax.java:284)
at org.archive.access.nutch.Nutchwax.doJob(Nutchwax.java:394)
at org.archive.access.nutch.Nutchwax.main(Nutchwax.java:674)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.util.RunJar.main(RunJar.java:149)
07/04/26 15:50:15 INFO fs.DFSClient: Could not obtain block from any node: java.io.IOException: No live nodes contain current block
07/04/26 15:50:18 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 1 time(s).
07/04/26 15:50:19 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 2 time(s).
07/04/26 15:50:20 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 3 time(s).
07/04/26 15:50:21 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 4 time(s).
07/04/26 15:50:22 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 5 time(s).
07/04/26 15:50:23 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 6 time(s).
07/04/26 15:50:24 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 7 time(s).
07/04/26 15:50:25 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 8 time(s).
07/04/26 15:50:26 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 9 time(s).
07/04/26 15:50:27 INFO ipc.Client: Retrying connect to server: mon034/x.x.x.x:9000. Already tried 10 time(s).
07/04/26 15:50:28 WARN fs.DFSClient: DFS Read: java.net.SocketException: Too many open files
at java.net.Socket.createImpl(Socket.java:388)
at java.net.Socket.connect(Socket.java:514)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:145)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:525)
at org.apache.hadoop.ipc.Client.call(Client.java:452)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:164)
at org.apache.hadoop.dfs.$Proxy0.open(Unknown Source)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:512)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:732)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:577)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:686)
at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:91)
at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:189)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at java.io.DataInputStream.read(DataInputStream.java:134)
at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:183)
at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:64)
at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:33)
at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:41)
at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:507)
at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:406)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:90)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:681)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:658)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:517)
at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:553)
at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:98)
at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:150)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:113)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.archive.access.nutch.Nutchwax.doClass(Nutchwax.java:284)
at org.archive.access.nutch.Nutchwax.doJob(Nutchwax.java:394)
at org.archive.access.nutch.Nutchwax.main(Nutchwax.java:674)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.util.RunJar.main(RunJar.java:149)
07/04/26 15:50:28 FATAL indexer.IndexMerger: IndexMerger: java.net.SocketException: Too many open files
at java.net.Socket.createImpl(Socket.java:388)
at java.net.Socket.connect(Socket.java:514)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:145)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:525)
at org.apache.hadoop.ipc.Client.call(Client.java:452)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:164)
at org.apache.hadoop.dfs.$Proxy0.open(Unknown Source)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:512)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:732)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:577)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:686)
at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:91)
at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:189)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at java.io.DataInputStream.read(DataInputStream.java:134)
at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:183)
at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:64)
at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:33)
at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:41)
at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:507)
at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:406)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:90)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:681)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:658)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:517)
at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:553)
at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:98)
at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:150)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:113)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.archive.access.nutch.Nutchwax.doClass(Nutchwax.java:284)
at org.archive.access.nutch.Nutchwax.doJob(Nutchwax.java:394)
at org.archive.access.nutch.Nutchwax.main(Nutchwax.java:674)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.util.RunJar.main(RunJar.java:149)
I hope you can help us solve the issue.
Best Regards,
Alexis
|
From: Brad T. <br...@ar...> - 2007-04-10 18:19:48
|
Hey Jimmy,
I just downloaded and installed wayback on a machine running Cygwin.
Two things I found:
1) need to set the JAVACMD env variable:
export JAVACMD=`which java`
2) I was only able to get things running when passing a relative argument
to the arc directory. There's some path resolution happening that I
haven't tracked down yet.
Also, just to make sure: your Cygwin /tmp should be exported over HTTP
(on port 8081, going by your example) on the node holding the ARC data at
/tmp/, so that a file in the Cygwin folder at /tmp/foo.arc.gz is accessible
at http://archostip:8081/arc/foo.arc.gz.
Let me know if you're still having problems,
Brad
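A hedged note on point 1: the "C:\Program: command not found" error below is bash word-splitting an unquoted Windows path that contains a space (C:\Program Files\...). Setting JAVACMD as above is the starting point; if it still resolves to a path with a space, a space-free form may behave better with the scripts (the JDK path shown is purely illustrative):

export JAVACMD=`which java`
# If this still points somewhere under "C:\Program Files", try a space-free
# equivalent, e.g. the DOS 8.3 form (illustrative path):
# export JAVACMD=/cygdrive/c/PROGRA~1/Java/jdk1.5.0/bin/java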
> Brad,
>
> I tried running index-client today. Ran it from the machine that is
> hosting the arc files. I received an error:
>
> bin/index-client: line 81: C:\Program: command not found
>
> This is the line that I used in cygwin:
>
> bin/index-client \tmp\ http://waybackip:8080/wayback/index-incoming/
> http://waybackip/arc-proxy \apache-tomcat-5.5.23\webapps\arc
> http://archostip:8081/arc
>
> I also tried:
>
> bin/index-client /tmp/ http://waybackip:8080/wayback/index-incoming/
> http://waybackip:8080/arc-proxy /apache-tomcat-5.5.23/webapps/arc
> http://archostip:8081/arc
>
> And I tried:
>
> bin/index-client C:\tmp\ http://waybackip:8080/wayback/index-incoming/
> http://waybackip/arc-proxy C:\apache-tomcat-5.5.23\webapps\arc
> http://archostip:8081/arc
>
> I received the same error each time. Any thought?
>
> Jimmy
>
>
> -----Original Message-----
> From: Brad Tofel [mailto:br...@ar...]
> Sent: Monday, April 09, 2007 5:05 PM
> To: Lin, Jimmy
> Cc: br...@ar...
> Subject: RE: [Archive-access-discuss] getting http resourcestore to work
> in wayback
>
> You're right, you may not need to use the location-client if you used
> the
> second usage of the index-client. The index-client scans through ARC
> files
> and outputs records for each document found in CDX format.
>
> In usage 1, the CDX output is sent to STDOUT, for later (manual) sorting
> and merging to generate an aggregated CDX file from many ARC input
> files.
> The location-client tool is primarily aimed for installations that are
> using this form to generate index files.
>
> In usage 2, the CDX data is sent directly to the ResourceIndex(can only
> be
> done with the BDB ResourceIndex implementation) via HTTP PUT. In this
> second usage, the index-client will also notify the ArcProxy's
> LocationDB
> of where that ARC can be found, which means you don't need to use the
> location-client tool at all.
>
> I haven't tested the codebase on Cygwin for a long time -- please send
> feedback on how it works for you.
>
> Automation of large scale indexing is the next key feature for the
> wayback
> project, so all this should get easier in the near term, but what's
> there
> now is hopefully enough to get smaller scale indexes built, or larger
> scale indexes built with a little shell scripting.
>
> We use these tools at the archive to maintain indexes for 10's of TB of
> ARC data, but we'd be happy to receive other feature suggestions that
> would make things simpler for you.
>
> Brad
>
>> Brad,
>>
>> Thanks. I did do that, however, I never followed through with
>> location-client. Its good to see that I was somewhat on the right
>> track. A couple follow up questions, can you run location-client
>> through cygwin(We are working on windows machines), and do I not need
> to
>> run index-client? The two shell scripts seem to be very similar.
>>
>> Jimmy
>>
>> -----Original Message-----
>> From: Brad Tofel [mailto:br...@ar...]
>> Sent: Monday, April 09, 2007 4:18 PM
>> To: Lin, Jimmy
>> Cc: arc...@li...
>> Subject: Re: [Archive-access-discuss] getting http resourcestore to
> work
>> in wayback
>>
>> I'll take a crack at improving the docs for other users later today,
> but
>> here are a couple quick tips:
>>
>> * the idea is to set up the ArcProxy to reverse proxy all HTTP 1.1
> range
>> requests to the actual storage node that holds the ARC files. If your
>> ArcProxy server is set up on arc-proxy.foo.org:8080/arc-proxy/
> (implies
>> you placed the wayback.war under the webapps dir on arc-proxy.foo.org,
>> with the name arc-proxy.war) then all ARCs can be accessed at:
>>
>> http://arc-proxy.foo.org:8080/arc-proxy/bar.arc.gz
>> http://arc-proxy.foo.org:8080/arc-proxy/baz.arc.gz
>>
>> even if bar.arc.gz and baz.arc.gz are on different nodes. To do this,
>> you
>> need to modify the arc-proxy web.xml, after it's been unpacked,
>> uncommenting the ArcProxy section of the configuration (and commenting
>> out
>> UI, ResourceStore, and ResourceIndex sections) and restart Tomcat.
>>
>> * the last step is to inform the ArcProxy where all the ARC files
> live,
>> so
>> it knows where to forward requests for the various ARCs stored on the
>> ARC
>> storage machines. This can be done with the location-client script.
>>
>> * I'm not sure which symbolic link you're referring to in the user
>> manual,
>> which version of the software are you using?
>>
>> Let me know if there's still missing info, and thanks for using the
>> tools!
>>
>> Brad
>>
>>
>>> Hello,
>>>
>>>
>>>
>>> I need some guidance in getting this up and running. The user manual
>>> states the following steps:
>>>
>>>
>>>
>>> 1. Set up a singleton ArcProxy webapp. This webapp maintains a BDB
>>> that maps ARC filenames to their actual absolute URL, and creates an
>>> indirection, so all ARC files are accessible within a single HTTP
>>> exported directory.
>>>
>>> How do I go about doing this? Do I rename the wayback.war file to
>>> arc-proxy and install it in tomcat?
>>>
>>> 2. Export your ARC files via HTTP 1.1, on all hosts that hold them,
>>> to the node running the ArcProxy webapp. Some examples of HTTP 1.1
>>> webservers you can use to export your ARC files are Apache, Tomcat,
>> and
>>> thttpd. Any other webserver that supports HTTP 1.1 will also work.
>>>
>>> This requires that I have one of the webservers listed above
> installed
>>> on each machine that holds arc files right? I have installed tomcat
>> on
>>> such a machine, how do I go about creating the "symbolic link" that
>> the
>>> User manual refers to.
>>>
>>> 3. Populate the ArcProxy BDB with the locations of all ARC files in
>>> your repository. See instructions for the using location-client
>>> command-line tool, within this document, to populate the ArcProxy
> BDB.
>>>
>>> Thanks in advance,
>>>
>>> Jimmy Lin
>>>
>>>
>>>
>>>
>>
>>
>
|
|
From: Brad T. <br...@ar...> - 2007-04-09 20:18:02
|
I'll take a crack at improving the docs for other users later today, but here are a couple of quick tips:

* The idea is to set up the ArcProxy to reverse proxy all HTTP 1.1 range requests to the actual storage node that holds the ARC files. If your ArcProxy server is set up on arc-proxy.foo.org:8080/arc-proxy/ (which implies you placed the wayback.war under the webapps dir on arc-proxy.foo.org, with the name arc-proxy.war), then all ARCs can be accessed at:

http://arc-proxy.foo.org:8080/arc-proxy/bar.arc.gz
http://arc-proxy.foo.org:8080/arc-proxy/baz.arc.gz

even if bar.arc.gz and baz.arc.gz are on different nodes. To do this, you need to modify the arc-proxy web.xml after it has been unpacked, uncommenting the ArcProxy section of the configuration (and commenting out the UI, ResourceStore, and ResourceIndex sections), and restart Tomcat.

* The last step is to inform the ArcProxy where all the ARC files live, so it knows where to forward requests for the various ARCs stored on the ARC storage machines. This can be done with the location-client script.

* I'm not sure which symbolic link you're referring to in the user manual; which version of the software are you using?

Let me know if there's still missing info, and thanks for using the tools!

Brad

> Hello,
>
> I need some guidance in getting this up and running. The user manual
> states the following steps:
>
> 1. Set up a singleton ArcProxy webapp. This webapp maintains a BDB
> that maps ARC filenames to their actual absolute URL, and creates an
> indirection, so all ARC files are accessible within a single HTTP
> exported directory.
>
> How do I go about doing this? Do I rename the wayback.war file to
> arc-proxy and install it in tomcat?
>
> 2. Export your ARC files via HTTP 1.1, on all hosts that hold them,
> to the node running the ArcProxy webapp. Some examples of HTTP 1.1
> webservers you can use to export your ARC files are Apache, Tomcat, and
> thttpd. Any other webserver that supports HTTP 1.1 will also work.
>
> This requires that I have one of the webservers listed above installed
> on each machine that holds arc files right? I have installed tomcat on
> such a machine, how do I go about creating the "symbolic link" that the
> User manual refers to.
>
> 3. Populate the ArcProxy BDB with the locations of all ARC files in
> your repository. See instructions for the using location-client
> command-line tool, within this document, to populate the ArcProxy BDB.
>
> Thanks in advance,
>
> Jimmy Lin
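A quick way to sanity-check the reverse-proxy setup described above, assuming curl is available: request the first 100 bytes of an ARC through the proxy; a working setup should answer 206 Partial Content.

# The host and ARC name are the examples used above.
curl -s -r 0-99 -o /dev/null -w '%{http_code}\n' \
    http://arc-proxy.foo.org:8080/arc-proxy/bar.arc.gz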
|
From: Lin, J. <lin...@ba...> - 2007-04-09 15:31:31
|
Hello,

I need some guidance in getting this up and running. The user manual states the following steps:

1. Set up a singleton ArcProxy webapp. This webapp maintains a BDB that maps ARC filenames to their actual absolute URL, and creates an indirection, so all ARC files are accessible within a single HTTP exported directory.

How do I go about doing this? Do I rename the wayback.war file to arc-proxy and install it in Tomcat?

2. Export your ARC files via HTTP 1.1, on all hosts that hold them, to the node running the ArcProxy webapp. Some examples of HTTP 1.1 webservers you can use to export your ARC files are Apache, Tomcat, and thttpd. Any other webserver that supports HTTP 1.1 will also work.

This requires that I have one of the webservers listed above installed on each machine that holds ARC files, right? I have installed Tomcat on such a machine; how do I go about creating the "symbolic link" that the user manual refers to?

3. Populate the ArcProxy BDB with the locations of all ARC files in your repository. See the instructions for using the location-client command-line tool, within this document, to populate the ArcProxy BDB.

Thanks in advance,

Jimmy Lin
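On step 2 above, one concrete way (out of the servers named there) to export an ARC directory over HTTP 1.1 with byte-range support is thttpd; the directory and port below are assumptions, not taken from the manual:

# Serve the ARC directory /data/arcs on port 8081.
thttpd -p 8081 -d /data/arcs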
|
From: <sch...@ci...> - 2007-03-30 06:22:11
|
>> Has anyone been able to get the open source version of wayback to work
>> on windows? I am attempting to do this.

I'm running it with Tomcat 5.5 on both WinXP Pro and Win2003. It works
without problems.

>> Also, when using a Local Resource Store, can the arcpath folder be on
>> a mapped drive from another machine?

I haven't done that, but I don't think this should be a problem.

Max
|
|
From: Gordon M. <go...@ar...> - 2007-03-29 17:50:43
|
jimmylinsemail wrote:
> Has anyone been able to get the open source version of wayback to work
> on windows? I am attempting to do this.
>
> Also, when using a Local Resource Store, can the arcpath folder be on
> a mapped drive from another machine?
>
> Thanks in advance.

It's best to ask this and other Wayback questions on the archive-access
discussion list, <arc...@li...>. I've CC'd it there.

- Gordon @ IA
|
|
From: Michael S. <st...@du...> - 2007-03-23 05:21:10
|
I just came across this comment of Sverre's in issue "[ 1535953 ] WERA
development still continuing?":
http://sourceforge.net/tracker/index.php?func=detail&aid=1535953&group_id=118427&atid=681140.
In case others have not seen it, I am posting it below to the list.

St.Ack

Date: 2006-09-29 00:44

Sorry for the late answer. I'll try to give my view on what will happen
to Wera.

I've been involved in Wera development since the start in 2000.
Currently I'm the only developer working on Wera from time to time. I'm
an employee of the National Library of Norway (NB), which uses Wera for
access to their Web Archive, so naturally the effort put into Wera is
driven by the needs of NB.

During the latest months the focus has been elsewhere in NB, and only a
fraction of the Web Archive is available through Wera. The activities on
indexing and access will be taken up again early next year. When that
happens we will maintain Wera to fit the needs of NB. However, we will
also be monitoring the development of the Open Wayback. If/when the Open
Wayback is able to fulfill our needs as an access tool, we will have to
decide whether to switch to the Open Wayback or stay with Wera. We may
also find that some parts of Wera can be replaced by parts of the Open
Wayback. E.g. the Document Retriever of Wera is a weak link in Wera
since it only handles the files within one directory (and not
subdirectories).

As long as someone sees the need for, and is willing to take on, the
further development of Wera, it will live. Every time I stumble into
projects or people that use Wera for access I get a bit bewildered. Why
is it that Wera gets no new developer volunteers, and few feature
requests / bug reports are submitted to SourceForge, when there are so
many users? Must be a pretty good system, huh? Or is it that these users
do not believe Wera will survive for very long and they don't want to
invest any time or effort in a dying system? Has it something to do with
Wera not being pure Java? I really don't know. All I know is that I
would like you (the users of Wera) to share your opinion on these
matters.

Thanks,
Sverre
|
|
From: Graeme S. <li...@gr...> - 2007-03-21 20:37:56
|
Hi, Has anyone looked at using Nutchwax with the output from the HDFS Writer Processor (or am I missing an obvious point and nothing needs to be done - haven't had a chance to play yet)? Regards, Graeme |
|
From: Natalia T. <nt...@ce...> - 2007-03-12 12:41:29
|
Hello,

I'm using WERA and it works OK, but I have some questions about the
ARCRetriever: can WERA work with more than one arcdir (where you write
the path to the directory of ARC files)? Must all ARC files be directly
in the arcdir, or can I create subfolders in the arcdir and spread the
ARCs over these subfolders?

Thanks,
Natalia
|
|
From: Eran C. <era...@gm...> - 2007-03-07 03:09:06
|
Hi,

I am a graduate student at Indiana University, expecting to participate
in Google SoC. I browsed the project ideas page [1] and found the PDF
processing project interesting. I understand that the project page is
for 2006, so I'd like to know whether that need still exists so I can
work on it for Google SoC. I took a course on web mining and have very
good knowledge of that area. In addition, I'm an active contributor to
the Apache web services project, but I now want to work with another
open source organization. If the PDF processing project is still wanted,
I'd like to work on it. If there are more ideas, please let me know so I
can consider them as well.

Thanks,
Eran Chinthaka

[1] : http://webteam.archive.org/confluence/display/SOC06/Ideas
|
|
From: Ignacio G. <igc...@gm...> - 2007-02-16 17:58:12
|
Hello,

I just found an error when a page loads images through a cascading
style sheet (CSS). The wm.js script changes the links for all sources on
the webpage that is being displayed. However, the images that are linked
through the CSS are not modified and consequently are not displayed.

I think the wm.js script should be modified, or an extra script should
be added, to rewrite that content in the CSS as well. The modification
should not be too difficult, since CSS uses the url(xxxx) notation to
specify content that has to be linked. Changing the xxxx to the
appropriate wayback path should correct this problem.

Thank you.
|
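As a rough illustration of the fix Ignacio suggests, here is a small Java sketch that rewrites url(...) references inside a stylesheet so they point back into the archive. It is not the actual wm.js or Wayback rewriting code; the regex and the replayPrefix parameter are assumptions made for the example, and a real fix would also have to handle relative paths and @import rules.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * Illustrative sketch only: rewrite url(...) references inside CSS so
     * that images load through the archive instead of the live web.
     */
    public class CssUrlRewriteSketch {

        // Matches url(foo.gif), url('foo.gif') and url("foo.gif").
        private static final Pattern CSS_URL =
            Pattern.compile("url\\(\\s*['\"]?([^'\")]+)['\"]?\\s*\\)");

        /**
         * @param css          the raw stylesheet text
         * @param replayPrefix hypothetical replay prefix, e.g.
         *                     "/wayback/20070216000000/http://example.com/"
         */
        public static String rewrite(String css, String replayPrefix) {
            Matcher m = CSS_URL.matcher(css);
            StringBuffer out = new StringBuffer();
            while (m.find()) {
                String original = m.group(1);
                // Leave data: URIs alone; prefix everything else.
                String rewritten = original.startsWith("data:")
                    ? original
                    : replayPrefix + original;
                m.appendReplacement(out,
                    Matcher.quoteReplacement("url(" + rewritten + ")"));
            }
            m.appendTail(out);
            return out.toString();
        }
    }

Whether this kind of rewrite belongs on the server side or in wm.js in the browser is a design choice for the Wayback developers; the sketch only shows that the text substitution itself is straightforward.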
|
From: Michael S. <st...@ar...> - 2007-02-14 18:43:02
|
Wayback, like all of the other projects in archive-access, is still
stuck on maven 1.0.2, a very different animal from maven 2.0. You can
find it on the maven site if you are persistent. We're trying to migrate
from the old to the new maven, but it likely won't happen any time soon.

St.Ack

Ignacio Garcia wrote:
> Hello, I was trying to install wayback on my system using the source
> distribution so I can make changes if needed.
>
> I've had no problems installing the binary distribution, but when I
> try to build the source using maven, I get the following error:
>
> [INFO] Scanning for projects...
> [INFO] ----------------------------------------------------------------------------
> [INFO] Building Maven Default Project
> [INFO]    task-segment: [install]
> [INFO] ----------------------------------------------------------------------------
> [INFO] ------------------------------------------------------------------------
> [ERROR] BUILD ERROR
> [INFO] ------------------------------------------------------------------------
> [INFO] Cannot execute mojo: resources. It requires a project with an
> existing pom.xml, but the build is not using one.
> [INFO] ------------------------------------------------------------------------
>
> Apparently Maven is asking for a pom.xml file in order to execute the
> mojo:resources... I have never used Maven, so I don't know what the
> problem may be, and when I looked for information on the project
> website I could not find anything regarding building/installing from
> the source distribution.
>
> The maven command used was: mvn install
> Using mvn clean install did not work either.
>
> Does anyone know what the problem is?
>
> Thank you.
|
|
From: Brad T. <br...@ar...> - 2007-02-14 18:42:34
|
wayback currently builds using maven 1.0.2:

http://archive-access.sourceforge.net/projects/wayback/requirements.html

Which version are you using?

Brad

> Hello, I was trying to install wayback on my system using the source
> distribution so I can make changes if needed.
>
> I've had no problems installing the binary distribution, but when I
> try to build the source using maven, I get the following error:
>
> [INFO] Scanning for projects...
> [INFO] ----------------------------------------------------------------------------
> [INFO] Building Maven Default Project
> [INFO]    task-segment: [install]
> [INFO] ----------------------------------------------------------------------------
> [INFO] ------------------------------------------------------------------------
> [ERROR] BUILD ERROR
> [INFO] ------------------------------------------------------------------------
> [INFO] Cannot execute mojo: resources. It requires a project with an
> existing pom.xml, but the build is not using one.
> [INFO] ------------------------------------------------------------------------
>
> Apparently Maven is asking for a pom.xml file in order to execute the
> mojo:resources... I have never used Maven, so I don't know what the
> problem may be, and when I looked for information on the project
> website I could not find anything regarding building/installing from
> the source distribution.
>
> The maven command used was: mvn install
> Using mvn clean install did not work either.
>
> Does anyone know what the problem is?
>
> Thank you.
|
|
From: Ignacio G. <igc...@gm...> - 2007-02-14 18:02:04
|
Hello, I was trying to install wayback on my system using the source
distribution so I can make changes if needed.

I've had no problems installing the binary distribution, but when I try
to build the source using maven, I get the following error:

[INFO] Scanning for projects...
[INFO] ----------------------------------------------------------------------------
[INFO] Building Maven Default Project
[INFO]    task-segment: [install]
[INFO] ----------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] Cannot execute mojo: resources. It requires a project with an
existing pom.xml, but the build is not using one.
[INFO] ------------------------------------------------------------------------

Apparently Maven is asking for a pom.xml file in order to execute the
mojo:resources... I have never used Maven, so I don't know what the
problem may be, and when I looked for information on the project website
I could not find anything regarding building/installing from the source
distribution.

The maven command used was: mvn install
Using mvn clean install did not work either.

Does anyone know what the problem is?

Thank you.
|
|
From: Bing Z. <bz...@sd...> - 2007-01-25 01:51:03
|
Hi Brad,

Here are 3 problems I found in the new Wayback 0.8.0.

1. Error when indexing an arc file.
===================================

After I installed the new Wayback 0.8.0 and placed an arc file for
testing, I received the following error message in Tomcat's log file.
This error message is repeatable when installing a new wayback instance.

org.apache.commons.httpclient.URIException: invalid port number
        at org.apache.commons.httpclient.URI.parseAuthority(URI.java:2226)
        at org.archive.net.LaxURI.parseAuthority(LaxURI.java:183)
        at org.archive.net.LaxURI.parseUriReference(LaxURI.java:348)
        at org.apache.commons.httpclient.URI.<init>(URI.java:145)
        at org.archive.net.LaxURI.<init>(LaxURI.java:73)
        at org.archive.net.UURI.<init>(UURI.java:124)
        at org.archive.net.UURIFactory.create(UURIFactory.java:320)
        at org.archive.net.UURIFactory.create(UURIFactory.java:310)
        at org.archive.net.UURIFactory.getInstance(UURIFactory.java:263)
        at org.archive.wayback.resourceindex.cdx.CDXLineToSearchResultAdapter.adapt(CDXLineToSearchResultAdapter.java:66)
        at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
        at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
        at org.archive.wayback.bdb.BDBRecordSet.insertRecords(BDBRecordSet.java:177)
        at org.archive.wayback.resourceindex.bdb.BDBIndexUpdater.mergeFile(BDBIndexUpdater.java:152)
        at org.archive.wayback.resourceindex.bdb.BDBIndexUpdater.mergeAll(BDBIndexUpdater.java:219)
        at org.archive.wayback.resourceindex.bdb.BDBIndexUpdater$BDBIndexUpdaterThread.run(BDBIndexUpdater.java:260)

2. Low page replay quality compared with previous release.
===========================================================

Although I had the above error, I was able to use IE to query a URL and
got a link back from the query result. After I clicked the link to
replay the archived page, a lot of images were missing. The page replay
quality of Wayback 0.8.0 is not as good as that of Wayback 0.6.0 in this
test case.

3. Backward incompatibility for db and files generated by Wayback 0.6.0.
=========================================================================

When I copied the following files from Wayback 0.6.0 to Wayback 0.8.0,
Wayback 0.8.0 displayed an error in IE when querying for an archived web
site (a valid website).

        db files from /0.6.0/index/ to /0.0.8/index
        files from /0.6.0/pipeline/queued to /0.8.0/arc-indexer/queued
        files from /0.6.0/pipeline/merged to /0.8.0//index-data/merged

Here is the error message in the IE browser.

java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        java.lang.String.substring(String.java:1768)
        org.archive.wayback.resourceindex.bdb.BDBRecordToSearchResultAdapter.adapt(BDBRecordToSearchResultAdapter.java:63)
        org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
        org.archive.wayback.resourceindex.LocalResourceIndex.filterRecords(LocalResourceIndex.java:120)
        org.archive.wayback.resourceindex.LocalResourceIndex.query(LocalResourceIndex.java:296)
        org.archive.wayback.query.QueryServlet.doGet(QueryServlet.java:95)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        org.archive.wayback.core.RequestFilter.doFilter(RequestFilter.java:117)
        org.archive.wayback.core.RequestFilter.doFilter(RequestFilter.java:117)

Note: when I was just using Wayback 0.6.0, I copied the above mentioned
files into new places and ran another Wayback 0.6.0 instance on top of
the moved db and other files. That Wayback 0.6.0 worked fine.

Here is some info on the system I used for the above test.
        Java: 1.5.0_09
        Tomcat: 5.5.17
        OS: Linux 2.4.20-28.7

Sincerely,
Bing Zhu
San Diego Supercomputer Center
email: bz...@sd...
|
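For anyone chasing problem 1 above: the "invalid port number" comes out of the URI parsing that the indexer applies to every URL field in the CDX data, so a single record with a malformed port can surface this exception during a merge. Below is a small, hypothetical Java sketch of the kind of input that can trigger it; the URL is made up, and the exact exception message may differ from the one in the log.

    import org.apache.commons.httpclient.URIException;
    import org.archive.net.UURI;
    import org.archive.net.UURIFactory;

    /**
     * Hypothetical example: a URL with a malformed port, as it might appear
     * in a CDX line, typically fails UURI parsing with a URIException like
     * the one seen in the Tomcat log above.
     */
    public class BadPortSketch {
        public static void main(String[] args) {
            String badUrl = "http://example.com:80x/page.html"; // made-up URL
            try {
                UURI uuri = UURIFactory.getInstance(badUrl);
                System.out.println("parsed: " + uuri);
            } catch (URIException e) {
                // The indexer hits this kind of exception while adapting
                // CDX lines into search results.
                System.out.println("rejected: " + e.getMessage());
            }
        }
    }

Scanning the CDX lines for odd URL fields is one way to narrow down which record the indexer is choking on.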
|
From: Michael S. <st...@ar...> - 2007-01-22 20:10:43
|
The archive-access project source -- which includes the open source
wayback, nutchwax, etc. -- has been migrated from CVS to a subversion
repository up on sourceforge. See
http://sourceforge.net/svn/?group_id=147570 for instructions on how to
access the new location. The CVS repository remains available, so it's
possible to run "cvs diff" on existing working copies and move changes
to subversion, or to check out older versions of the software from
before the migration.

St.Ack
|
|
From: Michael S. <st...@ar...> - 2007-01-22 16:50:59
|
Hey Natalia:

You might try the open source wayback. In some regards it does a better
job of rendering than WERA. See
http://archive-access.sourceforge.net/projects/nutch/wayback.html.

St.Ack

Natalia Torres wrote:
> Hello,
> I'm using WERA + NutchWAX to view data crawled with Heritrix.
>
> I have problems viewing indexed data on some pages. I have noticed that
> it depends on whether the web page uses absolute or relative paths.
>
> Search results seem to be OK and I can surf over the crawled data. If I
> use Internet Explorer or Mozilla Firefox, some result pages are
> different.
>
> I see these problems:
>
> 1) For pages using absolute paths (http://www.mypage.com/menu.html),
> WERA shows the current live site and not the data stored in the ARC
> files.
>
> 2) Pages using relative paths may or may not display correctly,
> depending on how the links and image src attributes are written:
> menu.html or /menu.html.
>
> 3) On some pages, images disappear after a while, or when I close the
> message "WERA - External links, forms and search boxes may not function
> within this collection ...".
>
> Is there any reason for this? Can I solve it?
>
> Thanks,
>
> Natalia
|
|
From: Natalia T. <nt...@ce...> - 2007-01-22 12:04:46
|
Hello,

I'm using WERA + NutchWAX to view data crawled with Heritrix.

I have problems viewing indexed data on some pages. I have noticed that
it depends on whether the web page uses absolute or relative paths.

Search results seem to be OK and I can surf over the crawled data. If I
use Internet Explorer or Mozilla Firefox, some result pages are
different.

I see these problems:

1) For pages using absolute paths (http://www.mypage.com/menu.html),
WERA shows the current live site and not the data stored in the ARC
files.

2) Pages using relative paths may or may not display correctly,
depending on how the links and image src attributes are written:
menu.html or /menu.html.

3) On some pages, images disappear after a while, or when I close the
message "WERA - External links, forms and search boxes may not function
within this collection ...".

Is there any reason for this? Can I solve it?

Thanks,

Natalia
|
|
From: Michael S. <st...@ar...> - 2007-01-18 05:19:49
|
Release 0.10.0 of NutchWAX is now available for download from sourceforge (See http://sourceforge.net/project/showfiles.php?group_id=118427). See release notes, http://archive-access.sourceforge.net/projects/nutch/articles/releasenotes.html#0_10_0, for the list of changes. Yours, The Internet Archive Webteam |